AM001 – Data Languages
Every software project needs to model information. What is the mismatch between what we are modeling and the language we are modeling it in?
Whenever I start, join, or pick up a software development project, I’m met with the question of what information it will process.
Sometimes this information is very small, such as a command-line flag “-v” telling the program that it should be a tad more verbose than usual. Other times it is so large that I must assume there’s more information than I can model.
From a single-bit flag to an entire open-world knowledge base, we need to represent information as data using some language. And whichever data language we pick, it will make assumptions that suit the uses it expects of our data, which may or may not be the uses we have in mind.
When is information validated?
Many programming languages and frameworks have designed and evolved smaller languages specifically to address the modeling of data. Some of them check data at runtime, like validation frameworks; others check it at compile time, like type systems.
For example, Elixir’s Ecto lets us define “schemas” from which to derive “record structs”, on which we operate using “changesets”. These changesets carry runtime validation rules, and they prevent us from creating values that claim to represent valid information unless every validation has passed.
Another example is Rust’s Clap framework for building command-line tools. It asks you to define a data structure representing your program’s commands, options, and inputs. Around that structure sits a set of metaprogramming constructs that annotate it and generate, at compile time, the code that turns raw arguments into a well-typed value, so the rest of the program only ever sees information of the expected shape.
There is a spectrum here.
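To make the two ends of that spectrum concrete, here is a minimal sketch in plain Rust. It uses neither Ecto nor Clap; the names `Verbosity`, `Email`, `ValidationError`, and `describe` are made up for illustration. The enum is checked by the compiler, while the `Email` wrapper can only be obtained by passing a runtime check.

```rust
// Compile-time modeling: only these three values can ever exist,
// and the compiler knows all of them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verbosity {
    Quiet,
    Normal,
    Verbose,
}

fn describe(level: Verbosity) -> &'static str {
    // The compiler rejects this match if a variant is left out.
    match level {
        Verbosity::Quiet => "errors only",
        Verbosity::Normal => "warnings and errors",
        Verbosity::Verbose => "everything",
    }
}

// Runtime modeling: any string can be handed to us, so we validate it
// before wrapping it in a type that stands for "valid information".
// The rules below are deliberately simplistic (hypothetical, not a real
// email grammar).
#[derive(Debug, PartialEq, Eq)]
struct Email(String);

#[derive(Debug, PartialEq, Eq)]
enum ValidationError {
    Empty,
    MissingAt,
}

impl Email {
    // Outside this module the inner field is private, so `parse` is the
    // only way to obtain an `Email`.
    fn parse(raw: &str) -> Result<Email, ValidationError> {
        let trimmed = raw.trim();
        if trimmed.is_empty() {
            return Err(ValidationError::Empty);
        }
        if !trimmed.contains('@') {
            return Err(ValidationError::MissingAt);
        }
        Ok(Email(trimmed.to_string()))
    }
}
```

Calling `Email::parse("nope")` yields `Err(ValidationError::MissingAt)` while the program is running, whereas a `Verbosity` outside the three variants cannot even be written down. The first is closer in spirit to a changeset, the second to a type-level guarantee.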
Impedance Mismatch between Domain and Host
In addition, the degree to which the data is “reified” in the language, that is, the amount of indirection with which it is represented, also varies. Some programs let you speak of a piece of information directly, whereas others use a data structure that represents the information. The meaning of the first is, hopefully, encoded in the structure itself; the meaning of the second is encoded in how that representation is interpreted. Some languages are better equipped to deal with this than others: e.g., some provide constructs for representing data directly that are flexible enough to model your domain, while others force workarounds.
For example, a language like Reason or OCaml lets you model inductive data, such as natural numbers, with total elegance by using algebraic data types. Just by looking at a value in its syntactic representation (source code), we can already tell exactly what kind of data it is, and what we can and cannot do with it.
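As a rough illustration, here are Peano-style natural numbers written as a Rust `enum`, which plays much the same role as a variant type in OCaml or Reason. This is a sketch rather than a practical number type, and the helper names `add`, `to_u64`, and `from_u64` are chosen for this example.

```rust
// A natural number is either zero or the successor of another natural.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Nat {
    Zero,
    Succ(Box<Nat>),
}

// Addition by structural recursion on the first argument:
// 0 + b = b, and (1 + a) + b = 1 + (a + b).
fn add(a: &Nat, b: &Nat) -> Nat {
    match a {
        Nat::Zero => b.clone(),
        Nat::Succ(pred) => Nat::Succ(Box::new(add(pred, b))),
    }
}

// Interpret the structure as a machine integer by counting successors.
fn to_u64(n: &Nat) -> u64 {
    match n {
        Nat::Zero => 0,
        Nat::Succ(pred) => 1 + to_u64(pred),
    }
}

// Build a Nat by wrapping Zero in `n` successors.
fn from_u64(n: u64) -> Nat {
    (0..n).fold(Nat::Zero, |acc, _| Nat::Succ(Box::new(acc)))
}
```

The value `Nat::Succ(Box::new(Nat::Zero))` reads directly as "the successor of zero". Nothing has to be looked up or interpreted to know what it is, and the two `match` arms are the only cases any function over `Nat` has to handle.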
How much more information are we willing to process than we are aware of?
Lastly, there is the idea that whatever representation of knowledge we have about the world can be either complete or incomplete. Either we know that this is all there is, and that if it is not said, it is false; or we know what we know, and what we don’t know is simply outside of our knowledge. Understanding the key philosophical differences between these two positions helps us understand when some knowledge should be thought of as exhaustive, when it shouldn’t, and how to model it in either case.
Asserting that everything that there is to be known is already known and is modeled within our data language lets us claim that many things are simply false.
The opposite makes us consider that there will be more information flowing through our system than there currently is, and thus the system will in itself be more flexible.
For example, some languages, like SQL, force us to define up front exactly what things will be part of our domain model, and encourage a “closed world” assumption, where if the data is not part of the database, then the propositions involving it must be false. Sure, we can learn more about our domain and try to fit data into it, but altering a table to include new columns simply represents a change in our absolute knowledge. No less data than we now require can be put in, but neither can more data find a way into that table. Relational algebras seem to be bound to exhaustivity.
On the other side we have languages like OWL or RDF, which base themselves on the “open world” assumption: if something isn’t known, it just isn’t known. Normally RDF modeling is a much more involved process than stitching together a few SQL tables with foreign keys, and it helps involve domain experts too. It allows us to declare that there are some kinds of data, and to allow interpretation of data from a variety of projections. Two individuals or systems may see or ignore entirely disjoint sets of information from the same data. Adding more data does not mean that the absolute truth has changed, but that there is more to be learned. It is a naturally evolving medium.
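The difference between the two assumptions can be sketched with a toy fact store in Rust. This is not how SQL, RDF, or OWL are implemented; `Facts`, `Truth`, and the explicit `denied` set are simplifications invented for this example. The same stored data answers a query differently depending on which assumption we read it under.

```rust
use std::collections::HashSet;

type Triple = (String, String, String);

fn triple(s: &str, p: &str, o: &str) -> Triple {
    (s.to_string(), p.to_string(), o.to_string())
}

// A tiny store of (subject, predicate, object) statements.
struct Facts {
    asserted: HashSet<Triple>, // statements known to be true
    denied: HashSet<Triple>,   // statements known to be false
}

#[derive(Debug, PartialEq, Eq)]
enum Truth {
    True,
    False,
    Unknown,
}

impl Facts {
    fn new() -> Self {
        Facts { asserted: HashSet::new(), denied: HashSet::new() }
    }

    fn assert_fact(&mut self, s: &str, p: &str, o: &str) {
        self.asserted.insert(triple(s, p, o));
    }

    fn deny_fact(&mut self, s: &str, p: &str, o: &str) {
        self.denied.insert(triple(s, p, o));
    }

    // Closed world: whatever has not been asserted is false.
    fn holds_closed(&self, s: &str, p: &str, o: &str) -> bool {
        self.asserted.contains(&triple(s, p, o))
    }

    // Open world: absence only means "not known"; a statement is false
    // only if it has been explicitly denied.
    fn holds_open(&self, s: &str, p: &str, o: &str) -> Truth {
        let key = triple(s, p, o);
        if self.asserted.contains(&key) {
            Truth::True
        } else if self.denied.contains(&key) {
            Truth::False
        } else {
            Truth::Unknown
        }
    }
}
```

With only "alice knows bob" asserted, asking whether alice knows dave gives `false` under `holds_closed` and `Truth::Unknown` under `holds_open`. Adding a fact later can only turn an open-world `Unknown` into an answer, while under the closed reading it flips an answer that was previously `false`.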
There is a spectrum here, too.
So far I’ve found that these three qualities of data languages (when information is validated, how directly it is represented, and whether the world is assumed closed or open) make them more or less suitable for different modeling tasks. A small command-line interface may benefit from a language with a priori verification of close-to-source, closed-world information, such as boolean flags. A system processing information from several news sources may need a much larger notion of validity, perhaps even several notions that are closer to whoever consumes the information, and thus several projections of the same information may come into play.
I am inclined to think that the extremes in this space are more useful than any middle ground, but I’m entertaining the idea that a statically verified language with first-class constructs for modeling information under an open-world assumption could be very useful in the industry today, and tomorrow.
- A Semantic Web Primer [pdf]
- An Introduction to Description Logic [book]
- Building Ontologies with Basic Formal Ontology [book]
- Data and Reality [book]