A Relational Model of Data for Large Shared Data Banks (1970) [pdf]

JunkDNA · on Sept 28, 2018

If you’ve never read this paper, you owe it to yourself to give it a thorough read. It truly gives you a sense for the problems relational databases were attempting to solve with the tech of the day. Also note the strong mathematical underpinnings of many of the core ideas. It’s no wonder these ideas have withstood the test of time.

mathattack · on Sept 28, 2018

I recall the first training class I had where I connected relational tables with the theory I learned in school. Life never was the same after.

On Codd we trust.

gnufx · on Sept 28, 2018

I've found it's generally worthwhile to read the original literature; especially so when it was all worked out, but the technology didn't support doing it properly and people have persisted with approximations and fudges needlessly (like some areas of science). The standard of writing may be better in the '60s and '70s literature than in recent stuff anyway.

EngineerBetter · on Sept 27, 2018

The key, the whole key, and nothing but the key.

mtVessel · on Sept 28, 2018

So help me, Codd.

DanielBMarkham · on Sept 27, 2018

This changed the way people coded, stored data, and thought about data in general. In my mind, the implications, not necessarily the details, were some of the most profound in the history of computational papers.

For the last few years, I've been working with the premise that there is an analogous information-minimization theory for all project information, not just data storage. So far I'm optimistic that this is true --- it's just not the same minimal format for different teams working on the same problem. It's people-dependent, not data-dependent. (Obligatory link: https://leanpub.com/info-ops ) If true, and it looks like it is, it means that teams can have "analysis compilers" that work inside the build pipeline alongside all the other tools. This leads to "continuous information deployment"

kthejoker2 · on Sept 28, 2018

As someone who serves as a product owner for analytics use cases, I can attest that the hardest problem in Agile is accurate qualitative and quantitative measurement of progress, value, and delivery.

Definitely going to check out the book.

DanielBMarkham · on Sept 28, 2018

To make it work, I had to propose a new working definition of what analysis is. The rough definition for most people is that analysis is all the "work around the work" If you're not coding you're doing analysis. I needed something more concrete than that, yet flexible enough to allow tractability on all the kinds of things teams do.

Once I got that definition in place, the rest of the process of information minimization proceeded quite naturally. At that point you know what it is you're trying to minimize, ie, you have a "key" in Codd's terms. Then the trick is just minimize information and maximizing flow using key-based storage and transmission. (This also lets you look at whether your processes are working or not from a new, rational, and objective standpoint, which is a nice side-effect)

redwood · on Sept 28, 2018

Amazing that something optimized for when storage was the bottleneck and shockingly expensive compared to people... is still the most popular model today (when people are the expensive and rare resource, with storage abundant and cheap).

lifepillar · on Sept 28, 2018

It’s a shame that no one in academia has bothered so far to curate a critical, annotated, edition of Codd’s collected works. His impact was immense (for good and for bad, e.g., nulls in SQL) and he foresaw many of the important issues of data management in the years to come (up to nowadays), including the relevance of distributed systems and the need for different ways of looking at data, as sets, as tables, as predicates, as graphs.

Unfortunately, people who knew or worked with him have died or are getting old, so I’m losing hope…

commandlinefan · on Sept 28, 2018

It's nearly 50 years old and still the best model for storing and querying data. YesSQL FTW.

ZenPsycho · on Sept 27, 2018

relational databases are secretly supposed to be logic programming languages like prolog, each “table”, (relation) representing a predicate function. each row (tuple) representing a fact about the world that satisfies the predicate. with proper structure and labelling you can ask very sophisticated questions. somewhere along the line people forgot this and typically just dump spreadsheets in, to satisfy the fields in a form or an app screen, then think they need a graph database to represent relationships. it is unfortunate that SQL based systems became accidentally synonymous with the “relational model” as it actually mostly gets in the way of implementing the relational model. oh well.

zimablue · on Sept 28, 2018

Datomic can be run on top of SQL, which is kind of ironic I guess. relational algebra => implemented incompletely in a SQL RDBMS => with an /actual/ logic programming language thrown on top of THAT using a couple of tables as KV (I think that's how it works).

paulddraper · on Sept 28, 2018

> SQL based systems became accidentally synonymous with the “relational model” as it actually mostly gets in the way of implementing the relational model

Please explain?

What makes SQL bad a relational databases, and what is better?

grzm · on Sept 28, 2018

There are at least two aspects of this:

1) SQL is not entirely faithful to the underlying logical principles of the relational model. Chris Date's "Database in Depth: Relational Theory for Practitioners"[0] or "SQL and Relational Theory: How to Write Accurate SQL Code"[1] are good texts that go into this in detail. There's plenty of stuff online if you search for it. For a very brief example of criticisms, you can take a look at Wikipedia: (https://en.wikipedia.org/wiki/SQL#Criticisms).

[0]: https://www.goodreads.com/book/show/573821.Database_in_Depth

[1]: https://www.goodreads.com/book/show/6219481-sql-and-relation...

2) Most SQL systems provide a 1-to-1 mapping of tables to underlying storage structure which places limits on performance, while relations are logical and independent of the implementation. Other than indexing strategies, there hasn't been a lot of work in production systems to truly abstract storage from logical representation. They've been kind of stuck in their thinking. This second point isn't as important as the first, however, and more off-topic with respect to your question.

paulddraper · on Sept 28, 2018

On the first point, I don't have either of those books, but IMO the Wikipedia complaints aren't compelling.

On the second point, many SQL databases offer indexing [1] (as you said), clustering [2] (arranging a table according an index), and partitioning [3].

I think database designers are very conscious of the difference between the logical relational model and the implementation details. For example, this [4] conversation on the PostgreSQL mailing list about unique indexes vs. unique constraints. I also think that in practice there just aren't many other ways to abstract the logical model from the physical storage that gives you something substantial.

IMO, this is one of the successes of the relational model: the logical model and the physical implementations don't need to differ for you to have a really good situation all around.

[1] https://www.postgresql.org/docs/10/static/indexes-intro.html

[2] https://www.postgresql.org/docs/10/static/sql-cluster.html

[3] https://www.postgresql.org/docs/10/static/ddl-partitioning.h...

[4] https://www.postgresql.org/message-id/CAO8h7BJMX5V1TqzScTx2N...

ZenPsycho · on Sept 28, 2018

the point is though, it's a little bit like a broken calculator. "The Relational Model" represents a logical model as a kind of pure piece of mathematics, and presents a system that isn't just dump disk persistence but actually a logic induction engine. However, this is not how SQL is used in practice. It is possible to implement the relational model in SQL, however if you assume SQL databases are supposed to be implementations of it in and of themselves, they violate the commutative, associative and distributive laws of the relational algebra- it's a bit like a calculator that sometimes gives you 2+2=5 (because it's faster to give wrong results sometimes). In addition, the ergonomics of a typical SQL database system do not make mathematical/logical soundness the easiest thing to achieve. Most developers don't know or care about logical induction or mathematical soundness, they just want a place to dump their objects.

That said, out of the open source SQL database systems, postgresql actually does try to implement the relational model properly. But I don't have enough direct experience with it to say definitively how well it does.

How many millions of dollars hss NULL != NULL cost the world?

mamcx · on Sept 28, 2018

I agree with all you said, except this one:

> they violate the commutative, associative and distributive laws of the relational algebra

I follow the idea of relational set, but implementing a relational language (http://tablam.org, alpha btw) I see why bags surely are better. Is not a mistake rdbms take it. Make everything sets demand more memory and runtime costs. You need to couple a index or pay O(N) for each mutation and get only a few upsides.

If exist a way around this, truly I wanna know.

However, exist another reason for what sets are wrong (as being default) them fail to cover the most simple and yet more powerful capability of the relational model: Model data.

Is posible to argue than [a, 1; b, 1] must be a set and order not mater. But [a, 1] & [b, 1] are both relations, right? also are [a, b] & [1, 1]. Is wrong to make [1, 1] to [1] just to get a math purity.[1, 1] is valid. Say that duplicates are wrong is non-sense. Also say that "order not matter" is non-sense too.

Ironically, the solution for this? add another column just to keep thins as set.

paulddraper · on Sept 28, 2018

> what is better?

ZenPsycho · on Sept 28, 2018

well, obviously "worse is better". historically, SQL has been "Good enough" and was sponsored by IBM/Oracle that nobody has bothered to try and implement the relational model correctly, except for perhaps postgres (which used to have its own query language before bolting SQL onto its core engine)

what I hope people get out of my comments is not about selecting the right system for today- A better thing does not necessarily exist yet-- But at least understand the relational model well enough to see what its utility is. Functional programming and Logic programming are coming back in fashion, and maybe someone here will build the better thing.

I don't think "nosql" databases or graph databases are necessarily the better thing. (but they might be for certain usecases). Though I think relational db systems could learn something about developer ergonomics from them. Interfacing with a DB using a fortran inspired query language is archaic. Constructing the queries using string concatenation has been a security nightmare. You could build something waaayyy better off of prolog or mercury, full programming languages with the inductive logic engine at the core of their paradigm, with a database system integrated in, and expose a graphql api for your isomorphic react web app cos that's how we do in 2018.

Even microsoft's linq is a massive improvement over using SQL. Language integration, strong typing. These are things we like, right?

paulddraper · on Sept 28, 2018

> well, obviously "worse is better".

> A better thing does not necessarily exist yet

Ah, of course.

> Interfacing with a DB using a fortran inspired query language is archaic. Constructing the queries using string concatenation has been a security nightmare.

Perhaps, but this this often for very ubiquitous technologies. See almost every HTML authoring tool.

You can write full application servers in Oracle PL/SQL now. [1] And with PostgreSQL's row-level security, you can easily have a GraphQL server with PostGraphile [2]. But those aren't popular solutions.

> ou could build something waaayyy better off of prolog or mercury, full programming languages with the inductive logic engine at the core of their paradigm

That doesn't really change the equation. You still either have to (1) choose your DBMS's query language to write your application server (like today) or (2) use one language to construct queries in another (like today).

I won't argue that a Prolog DBMS wouldn't be nice. I will argue that in terms of the hard parts of a DBMS, it's not going to fundamentally change things.

[1] https://docs.oracle.com/cd/B28359_01/appdev.111/b28424/adfns...

[2] https://www.graphile.org/postgraphile/

ZenPsycho · on Sept 28, 2018

first off, thanks for the links. Secondly,

> That doesn't really change the equation. You still either have to (1) choose your DBMS's query language to write your application server (like today) or (2) use one language to construct queries in another (like today).

I don't know if you're misunderstanding me, or I'm misunderstanding you. prolog is a query language. The logical programming paradigm is that a program is a "list of facts" and you run the program by "querying" the state of the world. The only piece missing is making the fact storage engine scalable/robust/production ready. Or.. alternatively, there's the Eve[1] project which starts off with prolog to make a hypercard like environment with a built in data storage layer. Its aim is less production ready apps and more developer ergonomics.

[1]: ( http://witheve.com )

paulddraper · on Sept 28, 2018

I don't think I misunderstood.

You yearn for the day when a DBMS uses Prolog for its query language.

Users of that DBMS have the choice to either (1) query seamlessly because they use prolog for the entire application (2) query by "gluing" Prolog to their application language.

mamcx · on Sept 28, 2018

The main problem that I see is that relational model / sql is ALWAYS coupled with RDBMS, even with the proponents of the relational model!

The thing is, sql is much less capable then even forth. You can't do:

    query = where a = ?1
    columns = [name, id ]
    console.write(columns ?rename name as a |> query)

not even

    console.write

not even

    Window.open

neither store table in a field (not serializate it, store) create your btree, make open gl game, make a osx app, create a rdbms and so on. (yes, i'm aware some rdbms could, hacking around, kind of do it witth pain)

In short, the relational model is not see as something that can be made for a full language WITHOUT RDBMS.

That is what I wish to change:

http://tablam.org/

because I work before as foxpro developer and see how easy and logical was to work with tables without sql.

tabtab · on Sept 28, 2018

I agree that the integration of domain programming and database programming/querying in FoxPro/xBase created a smoother CRUD-app shorthand that current tools can't match. Less converting of sh8t between the layers (and less need for layers). The newest generation must like to type and read a lot of code, I guess. Or, it's a conspiracy from Carpel Tunnel and eye doctors to get more revenue.

paulddraper · on Sept 28, 2018

I agree the relational model can be very useful.

I will often keep a lot of queries in the DMBS, or even create a :memory: database with sqlite when I want to use relational logic.

Thanks for the link.

gnufx · on Sept 28, 2018

> Interfacing with a DB using a fortran inspired query language is archaic.

Possibly amusing aside: Logica's RAPPORT, one of the earliest RDBs, was written in Fortran and was queried (or could be) via pre-processed Fortran. I don't know whether that was more faithful to the model than SQL.

Jaffer's Scheme RDB (https://people.csail.mit.edu/jaffer/slib/Relational-Database...) aimed to implement the model, possibly uniquely so at the time, he thought.

tabtab · on Oct 1, 2018

Actually, SQL seems more COBOL-inspired than FORTRAN-inspired to me.

As far as a "generic" query language not tied to relational databases, that's a rather tall order. Logic-based query languages are difficult to troubleshoot when you don't get the results you expected. It's kind of like a formalized version of Google: you get results, but you don't know if they are the best results, you don't know what you are missing, and you often don't know why you got what you did.

You have to have some kind of discipline and structure to the data in order to understand and troubleshoot results (at least for mortals and non-Sheldons). Some say relational is "too stiff" for some uses, but there is the Dynamic Relational proposal that gives it flexibility without abandoning relational and SQL. (Some argue it's not technically relational, which I don't agree with, but that issue doesn't change the similarity to existing RDBMS & SQL conventions.)

As far as relational-based alternatives to SQL, consider these discussions:

http://wiki.c2.com/?HowOtherQueryLanguagesAddressSqlFlaws

http://wiki.c2.com/?QueryLanguageComparison

http://wiki.c2.com/?ExperimentalQueryLanguageComparison

olavk · on Sept 28, 2018

> Most SQL systems provide a 1-to-1 mapping of tables

That is a bit weird criticism, since the whole point of relational databases is that the physical storage structure is abstracted away. SQL does not prevent databases from using different storage structures.

MaxBarraclough · on Sept 28, 2018

> Most SQL systems provide a 1-to-1 mapping of tables to underlying storage structure which places limits on performance

Is that the fault of the SQL language?

Also, serious DBMSs are generally engineered by pretty serious people, no? Are they really implementing such a mapping for no good reason?

mamcx · on Sept 28, 2018

> What makes SQL bad a relational databases, and what is better?

SQL is just a glorified DSL that is constrained, on purpose, to just make a SINGLE query at time.

Not for build A COMPLETE app. That is all what is "wrong" with it. SQL is assumed to be used by end-users, to make ad-hoc queries.

Is like google search. It provide a minimal language but not a why to make a game engine with it. Now imagine, that search box is ALL is provided to developers to build on. And then, all the others full-text indexing libraries do the same. Surely some will extend it with some more sugar, but the fact is, "search: google OR bing" is not enough to be a useful API.

That is why all developers in the world dance around SQL and kind of love-hate it. Them are not en-users and need more power, but that is only partially provided by some RDBMS as extended SQL yet the fundamental limitations hold. The only way around is using FFI with other languages because SQL is not capable enough.

This is partially, and correctly, because a RDBMS is mean to be like a REST server with restricted capabilities that could lend to problems. And because RDBMS are used by non-professionals that have not clue what them are doing... and not only I'm thinking of some end-users. So make sense to restrict their damaging capabilities (so, hard to make rm -rf "/")

The problem is that is ALWAYS assumed this way, even by something like sqlite. You just can't do a full app with sql, so you need to dance around with orms, stringly apis, non-type safety across the logic/tables, not why to make your own indexing strategies, nothing, or make it a REST server directly.

paulddraper · on Sept 28, 2018

PostgreSQL has row-level security, plus embedded languages (e.g. Python and V8). So you can actually give clients access to write raw SQL queries.

Or there is PostGraphile for automatically turning that into a GraphQL server.

[1] https://www.graphile.org/postgraphile/

avodonosov · on Sept 27, 2018

Not secretly

ZenPsycho · on Sept 28, 2018

well, it's not a secret, maybe just a truth nobody seems to want to know or care about.

swingline-747 · on Sept 27, 2018

No, that's an misinformed, all-or-nothing, gross-mischaracterization.

Data management systems are a toolbox.

The core tenent, distilling ACID, of RDBMSes is referential integrity: that what a foreign key points to can't simply disappear and leave a dangling reference. If you have a database of customers and their orders, you want orders destroyed with that customer, and the referring customer can't simply vanish. There are several delete dependent policies, depending on the use-case.

Key-Value stores (NoSQL) typically make no such guarantees. KVSes and Document store are fantasic for less critical and more nondeterministic data like user profiles, likes and tags... things that don't matter as much, as say, bank accounts transactions. Only a fool would use eventual consistency where it could be gamed by making maximum simulateneous withdrawls of say a $1000 at multiple locations. This is where atomicity and consistency are very important.

Use the right underlying data guarantees for the task at hand... not like a consulting company whom tried to turn a DBMS into a NoSQL by having a single database table store everything. IIRC the columns looked like this:

     Key | Value | Type | Notes

Good luck indexing that pile of slow, and say hello to full-table scans for nearly every fetch.

ZenPsycho · on Sept 28, 2018

I am not entirely sure what this is in response to. did you hit reply on the wrong comment perchance?

michaelmior · on Sept 28, 2018

> more nondeterministic data

What does that mean?