path: root/_articles/2021-04-29-a-relational-model-of-data-for-large-shared-data-banks-article-review.md
blob: e15b4781c6adebc200181d188c71c332610fa6e4 (plain) (tree)


title: A Relational Model of Data for Large Shared Data Banks - article-review

date: 2021-04-29

layout: post

lang: en

ref: a-relational-model-of-data-for-large-shared-data-banks-article-review


This is a review of the article "[A Relational Model of Data for Large Shared Data Banks][codd-article]", by E. F. Codd.

[codd-article]: https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

## Data Independence

Codd brings the idea of *data independence* as a better approach to use on databases.
This is contrast with the existing approaches, namely hierarquical (tree-based) and network-based.

His main argument is that queries in applications shouldn't depende and be coupled with how the data is represented internally by the database system.
This key idea is very powerful, and something that we strive for in many other places: decoupling the interface from the implementation.

If the database system has this separation, it can kep the querying interface stable, while having the freedom to change its internal representation at will, for better performance, less storage, etc.

This is true for most modern database systems.
They can change from B-Trees with leafs containing pointers to data, to B-Trees with leafs containing the raw data , to hash tables.
All that without changing the query interface, only its performance.

Codd mentions that, from an information representation standpoint, any index is a duplication, but useful for perfomance.

This data independence also impacts ordering (a *relation* doesn't rely on the insertion order).

## Duplicates

His definition of relational data is a bit differente from most modern database systems, namely **no duplicate rows**.

I couldn't find a reason behind this restriction, though.
For practical purposes, I find it useful to have it.

## Relational Data

In the article, Codd doesn't try to define a language, and today's most popular one is SQL.

However, there is no restriction that says that "SQL database" and "relational database" are synonyms.
One could have a relational database without using SQL at all, and it would still be a relational one.

The main one that I have in mind, and the reason that led me to reading this paper in the first place, is Datomic.

Is uses an [edn]-based representation for datalog queries[^edn-queries], and a particular schema used to represent data.

Even though it looks very weird when coming from SQL, I'd argue that it ticks all the boxes (except for "no duplicates") that defines a relational database, since building relations and applying operations on them is possible.

Compare and contrast a contrived example of possible representations of SQL and datalog of the same data:

-- create schema
  manager_id UUID,
  FOREIGN KEY (manager_id) REFERENCES people (id)

-- insert data
INSERT INTO people (id, name, manager_id) VALUES
  ("d3f29960-ccf0-44e4-be66-1a1544677441", "Foo", "076356f4-1a0e-451c-b9c6-a6f56feec941"),
  ("076356f4-1a0e-451c-b9c6-a6f56feec941", "Bar");

-- query data, make a relation

SELECT employees.name AS 'employee-name',
       managers.name AS 'manager-name'
FROM people employees
INNER JOIN people managers ON employees.manager_id = managers.id;

{% raw %}
;; create schema
#{ {:db/ident       :person/id
   :db/valueType   :db.type/uuid
   :db/cardinality :db.cardinality/one
   :db/unique      :db.unique/value}
  {:db/ident       :person/name
   :db/valueType   :db.type/string
   :db/cardinality :db.cardinality/one}
  {:db/ident       :person/manager
   :db/valueType   :db.type/ref
   :db/cardinality :db.cardinality/one}}

;; insert data
#{ {:person/id      #uuid "d3f29960-ccf0-44e4-be66-1a1544677441"
   :person/name    "Foo"
   :person/manager [:person/id #uuid "076356f4-1a0e-451c-b9c6-a6f56feec941"]}
  {:person/id      #uuid "076356f4-1a0e-451c-b9c6-a6f56feec941"
   :person/name    "Bar"}}

;; query data, make a relation
{:find [?employee-name ?manager-name]
 :where [[?person  :person/name    ?employee-name]
         [?person  :person/manager ?manager]
         [?manager :person/name    ?manager-name]]}
{% endraw %}

(forgive any errors on the above SQL and datalog code, I didn't run them to check. Patches welcome!)

This employee example comes from the paper, and both SQL and datalog representations match the paper definition of "relational".

Both "Foo" and "Bar" are employees, and the data is normalized.
SQL represents data as tables, and Datomic as datoms, but relations could be derived from both, which we could view as:

employee_name | manager_name
"Foo"         | "Bar"

[^edn-queries]: You can think of it as JSON, but with a Clojure taste.
[edn]: https://github.com/edn-format/edn

## Conclusion

The article also talks about operators, consistency and normalization, which are now so widespread and well-known that it feels a bit weird seeing someone advocating for it.

I also stablish that `relational != SQL`, and other databases such as Datomic are also relational, following Codd's original definition.