Semantic Web Technologies on an Example of Family Trees

Tilde
Materials Informatics Lab
Nov 27, 2015


Software capable of logical reasoning within a knowledge domain may seem like a technical marvel. However, as shown below, writing such software in Python is not difficult if one makes use of semantic web technologies. Here we focus on a demonstrative use case of ontologies, namely family trees, and show how to infer kinship of arbitrary complexity for any family member (in practice, the complexity is limited by the available computational resources).

For example, the family tree of the Russian Tsars below includes Nicholas I of Russia, who is a first cousin twice removed of Peter II of Russia.

Family tree of the Russian Tsars (House of Romanov)

Familiarity with triples, RDF and ontologies is assumed. To describe kinships in a family tree we employ the OWL 2 ontology Family History Knowledge Base (FHKB). Note that while the authors of FHKB consider their brainchild a very good training case, they do not recommend OWL 2 for real genealogical applications because of its computational complexity for today’s reasoners. Our purpose is purely educational, though, so we limit ourselves to family trees of about one hundred members.

Genealogical data is typically available in the plain-text format GEDCOM (.ged). Most desktop genealogy software and web portals allow downloading family trees in this format. We read GEDCOM using the Python library of the same name and then generate triples describing the individuals (the so-called ABox) for the FHKB ontology. The logic for reasoning (the so-called TBox) is already present in the FHKB ontology, so all we need to do is define the data this logic will be applied to.

Suppose we have the following data about three individuals (stated abstractly), taken from the aforementioned family of the Russian Tsars:

Alexander I is-brother-of Nicholas I.
Nicholas I is-father-of Alexander II.

and logic:

Property is-uncle-of is chain-of-properties is-brother-of and is-father-of.

then the reasoner will be able to infer the following fact:

Alexander I is-uncle-of Alexander II.
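The chain rule above can be illustrated with a minimal sketch in plain Python. It is not how a real OWL reasoner is implemented; the function name apply_chain is a hypothetical helper introduced here, and the facts are simply the two statements from the example stored as (subject, property, object) tuples.

```python
# Facts as (subject, property, object) triples, taken from the example above.
facts = {
    ("Alexander I", "is-brother-of", "Nicholas I"),
    ("Nicholas I", "is-father-of", "Alexander II"),
}


def apply_chain(triples, first, second, result):
    """Return the triples implied by the chain: first, then second => result.

    apply_chain is a hypothetical helper for illustration only.
    """
    inferred = set()
    for s, p, o in triples:
        if p != first:
            continue
        # Look for a triple that starts where the first one ends.
        for s2, p2, o2 in triples:
            if p2 == second and s2 == o:
                inferred.add((s, result, o2))
    return inferred


uncles = apply_chain(facts, "is-brother-of", "is-father-of", "is-uncle-of")
print(uncles)
```

The single inferred triple states that Alexander I is the uncle of Alexander II, which is exactly the fact the reasoner derives in the example.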

The same information in the RDF dialect Turtle is shown below. It is relatively compact and easy to read:

fhkb:i1 a owl:NamedIndividual ;
    fhkb:isBrotherOf fhkb:i2 ;
    rdfs:label "Alexander I" .

fhkb:i2 a owl:NamedIndividual ;
    fhkb:isFatherOf fhkb:i3 ;
    rdfs:label "Nicholas I" .

fhkb:i3 a owl:NamedIndividual ;
    rdfs:label "Alexander II" .

fhkb:isFatherOf a owl:ObjectProperty ;
    rdfs:label "is-father-of" .

fhkb:isBrotherOf a owl:ObjectProperty ;
    rdfs:label "is-brother-of" .

fhkb:isUncleOf a owl:ObjectProperty ;
    owl:propertyChainAxiom ( fhkb:isBrotherOf fhkb:isFatherOf ) ;
    rdfs:label "is-uncle-of" .

(Note: some details have been omitted here for clarity. In the original FHKB ontology the properties isFatherOf, isBrotherOf and isUncleOf are defined in a slightly different manner to optimize the logical reasoning.)

Thus we have defined the individuals i1, i2 and i3 and the properties isFatherOf and isBrotherOf, assigned these properties to the individuals, and introduced the new property isUncleOf. Pay attention to the prefixes rdfs:, owl: and fhkb:, each of which refers to a particular vocabulary. The prefix rdfs: refers to the standard RDF Schema (in the example above, the label property). The prefix owl: indicates standard ontology terms (individual, property, property chain, etc.). The prefix fhkb: is used by our FHKB ontology, which contains the logic of kinships (such as isFatherOf, isBrotherOf and isUncleOf, as well as other terms like isGrandfatherOf, isFirstCousinOf, etc.).

At the end of the day, it is sufficient to extract only a minimum of information from GEDCOM for each individual: fatherhood or motherhood, brothers, sisters, and marriages. In fact, GEDCOM doesn’t contain much more than that. All other kinships will be inferred for us by a reasoner, using the logic provided in the FHKB ontology.
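To make the extraction step more concrete, here is a simplified sketch that reads parenthood from GEDCOM text using only the standard library. It is not the gedcom2ttl.py script and does not use the GEDCOM Python library mentioned above; the function gedcom_to_triples and the hand-written SAMPLE record are assumptions for illustration, and only the HUSB, WIFE and CHIL tags of family (FAM) records are handled.

```python
def gedcom_to_triples(text):
    """Extract parenthood triples from GEDCOM text (simplified sketch)."""
    families = []       # one dict per FAM record
    current_fam = None  # the FAM record being read, if any
    for raw in text.splitlines():
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == "0":
            # Level-0 lines look like "0 @F1@ FAM": the id precedes the tag.
            current_fam = None
            if value == "FAM":
                current_fam = {"HUSB": None, "WIFE": None, "CHIL": []}
                families.append(current_fam)
        elif level == "1" and current_fam is not None:
            if tag in ("HUSB", "WIFE"):
                current_fam[tag] = value.strip("@")
            elif tag == "CHIL":
                current_fam["CHIL"].append(value.strip("@"))
    triples = []
    for fam in families:
        for child in fam["CHIL"]:
            if fam["HUSB"]:
                triples.append((fam["HUSB"], "isFatherOf", child))
            if fam["WIFE"]:
                triples.append((fam["WIFE"], "isMotherOf", child))
    return triples


# Hand-written sample for illustration (not taken from tsars.ged).
SAMPLE = """\
0 HEAD
0 @I1@ INDI
1 NAME Paul I
1 SEX M
0 @I2@ INDI
1 NAME Maria Feodorovna
1 SEX F
0 @I3@ INDI
1 NAME Nicholas I
1 SEX M
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 CHIL @I3@
0 TRLR
"""

triples = gedcom_to_triples(SAMPLE)
for s, p, o in triples:
    # Print in a Turtle-like form with the fhkb: prefix used in this article.
    print("fhkb:%s fhkb:%s fhkb:%s ." % (s.lower(), p, o.lower()))
```

A real converter would also emit labels, siblings and marriages, and would have to cope with the many optional GEDCOM tags that this sketch ignores.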

Kinships (from Wikipedia)

The logical basis (TBox) is available as the Turtle file header.ttl in the repository for this article. The family tree of the Russian Tsars in GEDCOM format is also included, but readers are encouraged to try their own. The script gedcom2ttl.py generates individuals for the FHKB ontology from a GEDCOM file. After cloning the repository, install the Python dependencies with pip install -r requirements.txt. Then copy the FHKB logic from header.ttl to a new file and append the output of the script:

cp data/header.ttl tsar_family.ttl
./gedcom2ttl.py data/tsars.ged >> tsar_family.ttl

As a result, we have obtained the complete ontology (TBox + ABox) in Turtle format, which can be opened in common external ontology editors (for example, Protégé). If necessary, Turtle can be converted to XML-based OWL using ttl2owl.py.

Now inferring new facts is rather easy. I am aware of three modern open-source reasoners usable from Python: RDFClosure, FuXi and Fact++ (wrapped with owlcpp). In fact, there are more if we link Python with the Java virtual machine, because Java has historically been the leader in semantic web technologies and provides more tools. The three reasoners above are listed in order of increasing sophistication and performance. The first takes the naive approach, generating all possible triples by brute force. The second (FuXi) is based on an infix OWL notation for Python and the Rete algorithm. The third (Fact++) is an optimized low-level implementation of the tableaux algorithm and, in general, one of the most efficient open-source reasoners today.

Nevertheless, for our purposes the first system (RDFClosure) suffices, since it is written in pure Python and is installed with a trivial pip install. Reasoning on the family tree of the Russian Tsars in tsars.ged (41 family members) takes RDFClosure about ten seconds on a laptop with an Intel Core i7 at 1.70 GHz.

As mentioned before, the drawback of OWL 2 with respect to family trees is its computational complexity. I omitted some of the kinships shown in the picture above and reduced the family tree of the Russian Tsars to the royals and their immediate family members. This way the demo reasoning won’t load your computer too much. If you consider all the kinships from the illustration above and extend the family tree to at least several hundred members, RDFClosure becomes useless (Fact++, however, remains operational).

Run the reasoner for the ontology obtained above:

./infer.py tsar_family.ttl

While the reasoning is running, let me explain the script infer.py, which fits into six lines.

The first two lines import the RDFClosure reasoner and the RDFLib library for handling ontologies. The third and fourth lines define the graph and fill it with the contents of the ontology tsar_family.ttl. The fifth line launches the reasoning; in this case it is nothing more than the cyclical expansion of the input graph with new triples according to the OWL 2 rules. The sixth line prints the resulting graph in the same Turtle format.
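The "cyclical expansion" mentioned above can be sketched as a fixed-point loop in plain Python. This is only an illustration of the idea, not the RDFClosure implementation and not the OWL 2 rule set, which is far larger. The rule table CHAIN_RULES and the function expand are assumptions made for this sketch, and isGreatUncleOf is a hypothetical property name chosen to show a rule that depends on a previously inferred fact.

```python
# Property-chain rules: (first, second) -> implied property.
# isGreatUncleOf is a hypothetical name used only for this illustration.
CHAIN_RULES = {
    ("isBrotherOf", "isFatherOf"): "isUncleOf",
    ("isFatherOf", "isFatherOf"): "isGrandfatherOf",
    ("isUncleOf", "isFatherOf"): "isGreatUncleOf",
}


def expand(triples):
    """Add implied triples until a full pass produces nothing new."""
    graph = set(triples)
    passes = 0
    while True:
        passes += 1
        new = set()
        # Brute force: try every ordered pair of triples against the rules.
        for s, p, o in graph:
            for s2, p2, o2 in graph:
                implied = CHAIN_RULES.get((p, p2))
                if implied and o == s2:
                    new.add((s, implied, o2))
        if new <= graph:
            # Nothing new was derived: the fixed point is reached.
            return graph, passes
        graph |= new


facts = {
    ("Alexander I", "isBrotherOf", "Nicholas I"),
    ("Nicholas I", "isFatherOf", "Alexander II"),
    ("Alexander II", "isFatherOf", "Alexander III"),
}
closure, passes = expand(facts)
for triple in sorted(closure - facts):
    print(triple)
```

The great-uncle fact can only be derived after the uncle fact exists, so the loop needs more than one pass. The pairwise comparison of all triples also hints at why this brute-force style becomes slow on larger family trees.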

We now have the result, tsar_family.ttl.inferred (it is several times larger than the input file). Let’s visualize it. I wrote a simple web app (index.html) that displays a graph of the inferred kinships in the browser with the aid of the JavaScript library D3.js. It is available online right away via the repository for this article. The edges correspond to the data extracted from the GEDCOM file (marriages, isFatherOf, isMotherOf). When a family member is selected, the inferred kinships are highlighted in different colors. The selection is made with the mouse (or by tapping on a touch screen). The graph structure is given by a JSON document with a very simple layout: a list of edges, each naming its vertices (i.e. individuals) and the type of connection (i.e. kinship) between them. This JSON document is obtained from the ontology of the previous step using the script ttl2json.py:

./ttl2json.py tsar_family.ttl.inferred > tsar_family.json

This newly generated JSON document can be loaded into the browser simply by pressing a button on the webpage (the File API is used, so no server is required and the visualization works offline).
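As a rough idea of what such an edge list might look like, the sketch below turns a few triples into JSON with the standard library. The key names source, target and type, as well as the function triples_to_edges, are assumptions for illustration; the actual layout produced by ttl2json.py may differ.

```python
import json


def triples_to_edges(triples, kinships):
    """Keep only the listed kinship properties and build a list of edges.

    The keys "source", "target" and "type" are hypothetical names.
    """
    edges = []
    for s, p, o in sorted(triples):
        if p in kinships:
            edges.append({"source": s, "target": o, "type": p})
    return edges


inferred = {
    ("Alexander I", "isBrotherOf", "Nicholas I"),
    ("Nicholas I", "isFatherOf", "Alexander II"),
    ("Alexander I", "isUncleOf", "Alexander II"),
}
# Only the kinships the visualization knows about are exported.
edges = triples_to_edges(inferred, {"isFatherOf", "isUncleOf"})
print(json.dumps(edges, indent=2))
```

Filtering by a set of known kinship identifiers mirrors the step described in this article where each new kinship has to be registered in the converter before it shows up in the visualization.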

All the commands above are collected in the shell script gedcom2json.sh. It converts GEDCOM directly to JSON with the inferred kinships for visualization in the web app. Adding support for other kinships is relatively simple. First, the appropriate logic is added to the TBox of the FHKB. Second, an identifier of the new kinship is added to the Turtle-to-JSON converter ttl2json.py. Third, a color, identifier and title of the new kinship are added to the code of the web app (the rel2col object and the mmreasoner_legend HTML div, to be precise). Of course, each added kinship increases the GEDCOM-to-JSON conversion time.

In addition, any ontology (not just a genealogical one) can in principle be drawn as a mind map. Of course, strict rules must be followed while drawing, so that the mind map can be converted to an ontology ABox (using e.g. the Python XMind SDK). For instance, this is how I ran the logical reasoning on my own family tree, which for historical reasons had been made as a mind map.

To summarize: based on the nearest kinships (brothers and sisters, marriages, parenthood) and the logic of other kinships, we were able to infer the remaining kinships in the family tree using semantic web technologies. We have thus tried out the powerful technology that drives products such as Wolfram Alpha, Google Knowledge Graph, and IBM Watson. Today ontologies and reasoners are mature and widely adopted, but the entry barrier in this field is not low.


Intelligent software for computational materials science and cheminformatics. Free and open-source. Inspired by #BlueObelisk Web: https://tilde.pro