By Leandro Ostera in ontologies — Jun 18, 2020

AM004 – Making Ontologies Practical through Metaprogramming

Let's revisit how meta programming and code generation can make working with ontologies and semantic data much more practical.

It is clear that ontologies are useful for knowledge systems. They help us define, refine, and understand the domain we work with, the entities that compose it, their relationships and constraints in sort of cumbersome ways.

The syntaxes for them are dreadful at best (XML) and completely foreign to most engineers out there at worst (Turtle). Independently of whether you’re using RDF, RDF/Schema, OWL 1.0, OWL 2, or any of their profiles.

Then you also have to consider the profiles. The subsets of them defined with certain computational properties. Some are good for reasoning, some are not. Which one to pick depends on your needs, and sometimes you don’t really know that.

If you are looking to get started, you’ll find that most literature out there is fairly dense. It is either a PhD thesis published by Springer, a specification of the semantics of an extension of RDF, or a few people that still dream of a Semantic Web. It is not a well trodden path, or at least it doesn’t really have a good on-boarding experience. Either you’re okay with the massive influx of information, or you hit the wall after downloading Protégé.

During my experience at Spotify I got acquainted with OWL 2.0 and Turtle in a less than ideal but somehow much more manageable way. An ontology for the music industry was started there. Later on I took my learnings over to Doorling, where I started an ontology for Real Estate, and I saw success applying some of these concepts and ideas. I have been using them on several projects ever since.

Today I want to focus on a couple of points that helped me take an Ontology to production successfully. The know-how needed to write an ontology from scratch is naturally different, and I’m bound to make an introduction to the subject at some point.

Nonetheless here’s a rough primer:

Ontologies are descriptions of data
There are Classes of Entities
There are Relations between Entities
It is written out in Triplets with a Subject, Predicate, and an Object

For example, to be able to make sense of the following sentence: “Alexander Hamilton has email alexander.hamilton@us.gov”, we would need to model it somehow. Below here is the definition of the schema that allows us to talk about a Person as a Class, and of the Email field that every Person has.

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix hm: <https://hamilton.com/ontology/>.

hm:Person a rdfs:Class.

hm:Email a rdfs:DatatypeProperty.
hm:Email rdfs:range :Person.
hm:Email rdfs:domain xsd:string.

The @prefix notation works exactly like an alias, so you can replace hm with https://hamilton.com/ontology/ as you would expect. The middle bits “a”, “rdfs:range”, “rdfs:domain”, are all just pointers to other entities that we have agreed mean something. So yes, this means we can literally come up with new relations and entities on the spot. Fathom them into existence from the void with a keystroke, and usher them away by removing all mentions to them.

NOTE: When it comes to defining actual data in this formats, I haven’t done much more than trying out sample datasets for well known ontologies. I’ve found the tools for it got in the way of me getting my work done. Nonetheless, you can.

The next question here is how to use this definition in your own code, because otherwise it is just a funky looking file in your repository doing nothing but haunt your colleagues. How can we use this ontology to constraint and guide the development?

I’ve taken 2 paths, and to this day I’m happy at Doorling I took them in the order I’m presenting them here.

The first path is to write out code yourself that will deal with this data. Code that will encode all of the information represented in that schema and will give you all the guarantees that you can have from it.

For example, if I was building a codebase in statically typed language, I would likely want to have types corresponding to these entities, properties, and perhaps even relationships. It wouldn’t be too far off to think that every “class” corresponds to an actual Class in an OOP language, or one or more types in a Functional language.

Let’s take this example and see what we could write for it in Reason, a newish syntax for OCaml:

module Person = {
  type t = {
    email: string
  };
}

This looks about right. A piece of code generated from the above definition. Now we could write out

let alexander_hamilton : Person.t = Person.({ email: "alexander@hamilton.com" });

You can write this out however you feel like, in whichever language you are working with, and you can make any interpretation of these rules you so wish. Ideally you’d limit yourself to just defining these tiny representations of information that you can work with. Whether they are statically typed or not is really not important. Here’s the same representation in unidiomatic functional Javascript:

// file Person.js
const FIELDS = Symbol("Person/Fields");
export const make = ({ email }) => { [FIELDS]: { email: email }};

// other file.js
let alexander_hamilton = Person.make({email: "alexander@hamilton.com"});

And one last time in Erlang:

-module(ontology_person).

-type t() :: #{ <<"https://hamilton.com/ontology/email">> := binary() }.

-export_type([t/0]).

This process will allow you to find the spots where the idiosyncrasies of your domain knowledge clash with the programming constructs your language offers. Repeat it for a couple of files, find a few patterns that you like.

In Erlang, for example, I liked using the fully qualified name of the property because it provided a good way to give extensibility to the main type specifications. This works particularly well when integrating data from different sources.

Javascript does have support for first-class maps with string keys, so it could’ve been done there too.

And then put these things to use. Use them for a few weeks, modify them, grow them, share them, scrap them, refactor them. Take notes of the changes you make and why. This code will be useful from day 1, since anyone relatively familiar with a programming language will be able to put together some representation of that data, but it will help you learn more about how to idiomatically handle this data in it too.

Once you’re comfortable and you see your patterns have been established, consider taking the second path: metaprogramming.

By the time you already know what your generated code should look you can take a step up the meta, and write a program that will write this code for you. Sounds a lot easier than it is, but it is also not particularly hard. Many language ecosystems will already have libraries to read up files and parse them to Turtle or RDF(S) data structures.

The rest will really just be a relatively small command line interface that will read those Turtle files and spit out the code for your favorite language to interpret or compile would be enough for a while. It doesn’t have to be very smart either, it can just piece together string templates of your language and write them to files on disk.

ontology files => code generation => code

This will give you all the benefits of the language you already know, with all the preserved semantics of your ontology, provided you make some trade-offs on how you want to generate code.

For example, the OCaml RDF library provides a way of reading strings into an RDF graph, of merging graphs together, and you can traverse that graph to create data structures representing your own code. In other words, it can turn the first schema we saw into something like:

{
  prefixes: [ { short: "hm", long: "https://hamilton.com/ontology/" } ],
  classes: [ { name: "hm:Person" } ],
  properties: [ { name: "hm:Email", domain: "string", range: "hm:Person" } ],
}

Which should be considerably clearer to transform into a data structure that represents a Javascript program.

That’s it. That is really it. Just make sure you generate the code at compile time. There’s enough uncertainty in software already to generate code on the fly that you can’t easily inspect.

Subscribe to Abstract Machines