AM003 – A Worldview of Facts
In this post I'll show how building a worldview granularly, out of small facts, can give you a lot of flexibility where a traditional closed-world, relational modeling approach will not.
Every representation of reality is built of facts. That’s a fact. The fact that that is a fact, is a fact too.
When you code, you’re stating facts about different things. You’re saying “this function recurses over itself”, “this datatype exists”, “this name is bound to this value”. Programming facts.
But you’re also stating facts about information. You’re saying “a user exists”, and more specifically “this user exists”. And when the user provides an email, you’re stating that “this user has this email”.
You build up your worldview by stating these facts.
Let's dig deeper into this, starting from how it typically is approached. We’ll then talk about building worldviews more granularly and what the tradeoffs are there.
Traditional relational database modeling asks us to define upfront what structure and constraints our data will have, and (mostly) to provide upfront all of the information required to insert or update data. Relational databases deal with entire entities as their choice of granularity.
That is, if we already had a user, the fact “this user has this email” would look like an update to the existing email column. So far so good.
But what happens when we don’t actually have a user yet? We typically can’t say anything about records that are not in the database yet. We’d have to create a user instead. And because we only have the email, and we’re forced to say other things about the user, we’ll have to lie about them (use default values, or allow them to be unset, which is typically a no-no in relational databases).
So traditional relational database modeling forces our hand to deal with entries in an atomic fashion. Either you create an entire user, or you don’t create one at all. And if you want to say something more, then you have to alter this agreement you originally had: the database schema.
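To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table and its `NOT NULL` columns are a hypothetical schema I made up for illustration, not one taken from a real system. It shows the database rejecting an email-only insert, and then accepting the same email once we lie about the name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: every user must have both a name and an email.
conn.execute(
    "CREATE TABLE users ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " email TEXT NOT NULL)"
)


def insert_email_only(email):
    """Try to record only the email; return the error message if rejected."""
    try:
        conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


# We only know the email, so the schema refuses the whole row.
error = insert_email_only("alexander@hamilton.com")
print(error)

# The usual workaround: "lie" with a placeholder name so the row is accepted.
conn.execute(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    ("", "alexander@hamilton.com"),
)
rows = conn.execute("SELECT name, email FROM users").fetchall()
print(rows)
```

The second insert succeeds, but the row now claims the user's name is the empty string, which nobody ever actually said.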
For the most part, our systems will be okay with this. There are however a number of use-cases where these constraints cripple our ability to extract, transform, and load data across systems and domains, and to turn it from bits to data, from data to information, and from information to knowledge.
For these use-cases, it can be much harder to collect everything we need to say about a User and send it all at once. We are then forced to lie: say that their name is “” (empty string) for now, and trust that some other system will update it appropriately.
But then we no longer know which system updated which field to which value, or when. So many questions. Audits become a lot harder, if not plain impossible.
And if you’re exploring a domain, or joining several, your database schema will go through a Kafka-esque migration process, the outcome of which tends to be starting from scratch. Sometimes this simply isn’t an option.
Let's look at how things can look when we build up our worldviews in a much more fine-grained manner, through tiny facts about things.
A fact can be thought of as a sentence with a subject, a predicate, and an object. Sometimes this is called a “semantic triple”, or just triple. For example: “Alexander Hamilton has email alexander@hamilton.com”.
The structure here is:
- Subject: Alexander Hamilton
- Predicate: has email
- Object: alexander@hamilton.com
We can state this fact without ever checking if anything has been said before about the subject (Alexander Hamilton). So in a way, stating this fact is both updating and creating the subject. We can even state facts about predicates and objects that we have never seen before and usher them into existence.
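As a rough sketch of that idea, assuming an in-memory list as the store: the names `Triple`, `state`, and `about` are mine, picked for illustration. Stating a fact is just an append, with no lookup to see whether the subject already exists.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


# Hypothetical fact store: an append-only list of triples.
facts = []


def state(subject, predicate, obj):
    """Record a fact without checking whether the subject is already known."""
    fact = Triple(subject, predicate, obj)
    facts.append(fact)
    return fact


def about(subject):
    """Return every fact stated so far about a subject."""
    return [f for f in facts if f.subject == subject]


# The first statement "creates" the subject, the second "updates" it,
# yet both are the same operation.
state("Alexander Hamilton", "has email", "alexander@hamilton.com")
state("Alexander Hamilton", "has name", "Alexander")

print(about("Alexander Hamilton"))
print(about("Aaron Burr"))  # nothing has been said about this subject yet
```

A subject that nobody has mentioned simply has no facts, so there is no "missing row" to worry about.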
This seems to support the examples we’ve seen so far well, but it requires an additional level of indirection to support the notions of provenance (who said what) and temporal causality (when something happened).
If we wanted to say that Linn Manuel Miranda said on January 20th, 2015 that Alexander Hamilton has email alexander.hamilton@us.gov, that would look like this:
f0 = "Alexander Hamilton has email alexander.hamilton@us.gov"
f1 = "Linn Manuel Miranda said {f0}"
f2 = "{f1} happened on January 20th, 2015"
This nesting is conceptually super clean. Facts about facts: it just makes sense to allow it. It does, however, have a fundamental problem. Can a user have many emails? If they can, what does “{user} has email {email}” mean? That they only have that one, or that they also have that one?
And if we state that same fact 3 times, how many emails does the user have?
Lucky for us, in this example, a user has exactly one email.
Unlucky for us, Linn Manuel Miranda has stated more things than this. This means that the fact “Linn Manuel Miranda said {X}” above would be stated many times with different X values.
If we see a fact as establishing a relation between two entities, this introduces the notion of cardinality of that relation. Can there be zero or more? One? Exactly 5?
It also introduces the notion of symmetric, reflexive, and transitive relations. If I state a fact, does the fact state me (symmetry)? If the fact is then used to state another fact, does that mean I stated that fact too (transitivity)? And does a fact state itself (reflexivity)?
We can leave these topics aside for now; we’ll certainly come back to them in future writings.
To address provenance and time without nesting facts inside facts, I like to extend the notion of a fact from a triple into a quintuple that also requires a source and a time:
- A fact has a subject, referring to whom it affects
- A fact has a predicate, describing how the subject and an object relate
- A fact has an object, upon which the predicate works
- A fact has a source, referring to who or what stated it
- A fact happened at a specific time
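The five parts listed above map naturally onto a small record type. Here is a minimal sketch, assuming a frozen dataclass and a calendar date for the time element. The class name `Fact`, the `render` helper, and the sample values are my own placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a fact is never edited after it is stated
class Fact:
    source: str     # who or what stated it
    subject: str    # whom it affects
    predicate: str  # how subject and object relate
    object: str     # what the predicate works upon
    time: date      # when it happened

    def render(self):
        """Spell the fact out as a sentence."""
        return (
            f"{self.source} said that {self.subject} "
            f"{self.predicate} {self.object} at {self.time.isoformat()}"
        )


# Placeholder values, for illustration only.
fact = Fact(
    source="System1",
    subject="User1",
    predicate="has email",
    object="user1@company.com",
    time=date(2015, 1, 20),
)
print(fact.render())
```

Making the record immutable is a design choice on my part: if facts are only ever appended, "changing" something means stating a newer fact rather than rewriting an old one.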
With these 5 things, our above example would read more like:
source = "Linn Manuel Miranda"
subject = "Alexander Hamilton"
predicate = "has email"
object = "alexander.hamilton@us.gov"
time = "January 20th, 2015"
"{source} said that {subject} {predicate} {object} at {time}"
This is now enough information to keep track of what updated what field to what value, and when, at all times. If you never delete facts, you essentially have an audit log:
- System1 said that User0 is an Admin at T0
- System2 said that User1 is an Admin at T1
- System1 said that User1 has email “user1@company.com” at T2
- System1 said that User3 has email “user2@company.com” at T3
- System2 said that User2 has email “user2@company.com” at T4
- System2 said that User3 has email “user3@company.com” at T5
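Here is a sketch of how the log above could answer "who set this field, to what, and when". I'm assuming the timestamps T0 to T5 can be stood in for by the integers 0 to 5, and the `history` helper is a name I made up.

```python
from collections import namedtuple

Fact = namedtuple("Fact", "source subject predicate object time")

# The audit log from the list above, with T0..T5 written as integers 0..5.
log = [
    Fact("System1", "User0", "is", "Admin", 0),
    Fact("System2", "User1", "is", "Admin", 1),
    Fact("System1", "User1", "has email", "user1@company.com", 2),
    Fact("System1", "User3", "has email", "user2@company.com", 3),
    Fact("System2", "User2", "has email", "user2@company.com", 4),
    Fact("System2", "User3", "has email", "user3@company.com", 5),
]


def history(log, subject, predicate):
    """Every fact stated for (subject, predicate), oldest first."""
    matches = [
        f for f in log
        if f.subject == subject and f.predicate == predicate
    ]
    return sorted(matches, key=lambda f: f.time)


for f in history(log, "User3", "has email"):
    print(f"T{f.time}: {f.source} set {f.object}")
```

For User3 this yields two entries: System1 at T3, then System2 at T5. Both statements remain in the log, so the earlier value is never lost.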
Unfixed Representation
Something that changes drastically once we start thinking in facts is that you no longer have a fixed representation of what a User is, like we had in the traditional relational modeling worldview. You can’t pop the hood open, look at the table columns, and know what fields are what, because you could potentially have new fields all the time, and many fields may never have been stated before, so the database doesn’t know about them!
Instead, we have collections of facts stated about a specific User, and we need to make sense of them. Here is where this process really shines.
Let's call the process of smushing facts for a subject into a single object a consolidation strategy. This could be “last fact wins” or “depends who said it”.
If you build systems that work with streams of facts, you can use the information of any of the facts to decide how to consolidate them, depending on your use case:
- A lookup service may just smush all the facts together into a key-value store, where multiple facts about the same subject end up reduced into a single object about them.
- A search service can build appropriate reverse lookup tables to provide fast results for specific subjects, relations, objects, or even timestamps or sources. This essentially avoids consolidation and “inverts facts” instead.
- A content moderation service can take into account the provenance of a fact to decide who has the last word. You can imagine a user trying to mark some content as deleted, and an admin marking the same content as visible. You’d expect most of the time for the admin to overrule the user, but perhaps if you’re building
And the list goes on and on. In a traditional database, you’d be stuck with whatever representation the owner exposed, typically via an API.
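Two of these consolidation strategies can be sketched in a few lines. This is only an illustration under my own assumptions: all facts passed in are about the same subject, times are comparable integers, and the `rank` table of sources is hypothetical.

```python
from collections import namedtuple

Fact = namedtuple("Fact", "source subject predicate object time")


def last_fact_wins(facts):
    """Fold facts into {predicate: object}; the most recent statement wins."""
    state = {}
    for fact in sorted(facts, key=lambda f: f.time):
        state[fact.predicate] = fact.object
    return state


def source_priority(facts, rank):
    """Highest-ranked source wins; within a rank, the most recent fact wins."""
    state = {}
    # Sort so that the facts that should win are applied last.
    ordered = sorted(facts, key=lambda f: (rank.get(f.source, 0), f.time))
    for fact in ordered:
        state[fact.predicate] = fact.object
    return state


# Hypothetical moderation scenario: the admin speaks first, the user later.
facts = [
    Fact("admin", "Post1", "visibility", "visible", 1),
    Fact("user", "Post1", "visibility", "deleted", 2),
    Fact("user", "Post1", "title", "Hello", 3),
]

by_time = last_fact_wins(facts)
by_rank = source_priority(facts, rank={"user": 1, "admin": 2})
print(by_time)
print(by_rank)
```

The same three facts produce two different views: by time the post is deleted, by rank the admin's "visible" stands. Neither view touches the stored facts, so switching strategy is just a matter of folding them again.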
This does not come without tradeoffs, as storing facts can become a challenge (napkin math: entities of a kind * fields = facts for those entities), and processing them can be time-consuming (imagine processing 2,000,000 facts for a single entity to essentially “replay” the history of it using a new strategy to arrive at the current state). I’ll leave the engineering details of how to approach these issues for a different time.
This was an overview showing that granularly building your worldview can give you a lot of flexibility where a traditional modeling approach will not.
That’s a fact.
References
- Semantic triple [wiki]
- RDF 1.1 Concepts and Abstract Syntax [w3 spec]
- Introducing Time into RDF [whitepaper]