Building DPRR’s RDF data and server

DPRR’s data is served from an RDF repository built with rdf4j, but it was created in a relational database. Getting the data from DPRR’s relational database into a set of triples therefore required some software, some scripts, and a process that linked them together. Here I will briefly describe how this process was carried out.

D2RQ

The package D2RQ (http://d2rq.org/) was the main tool used to take the data from the relational database into a corresponding set of RDF triples. This is precisely D2RQ's job: as its website says, D2RQ “exposes the contents of relational databases as RDF”. It has been used directly (for several life sciences projects) to publish dynamic data that is created and maintained in relational databases so that it can also be accessed as open RDF data.

Thus, D2RQ is interesting because it can do two things that might have been directly useful for the expression of DPRR as RDF. First, it can act as an RDF server itself, displaying data that is actually in a relational database but presented as a collection of RDF triples. When run this way, one can see the RDF data it is presenting by connecting to it through a browser:

Then, looking at one of the RDF entities shown here, say, “Office”, one is shown the URIs that are associated with the “Office” class:

Choosing one of the Offices (say, dictator), D2RQ will display the triples associated with that office:

The D2RQ architecture figure provided in the D2RQ documentation shows how this works. We can see the HTML browser shown above as the third box on the left, and can see how it connects to the Non-RDF Database (for DPRR, this is a Postgres database) on the right:

D2RQ Platform architecture diagram

(From http://d2rq.org/images/architecture.png)

Although D2RQ’s server can represent relational data as RDF triples, DPRR, for reasons given below, does not use the D2R Server directly to simply serve data from DPRR’s relational database as RDF. Instead, it uses the other service that D2RQ can offer: the RDF dump tool that generates pure RDF in a file that can be subsequently loaded into a separate triple store. In the diagram the path to DPRR’s triple store comes from the bottom left box labelled “Triple Store”.

Preparing the D2RQ mapping file

A key element in the use of D2RQ is the D2RQ Mapping File (shown in the diagram above near its centre) which provides the D2RQ engine with the instructions (expressed as a sequence of RDF statements) about how to translate the entities, attributes and relationships of the relational (“non-RDF”) database into triples. In the DPRR project work, this file is called dprr-mapping.ttl.

The information conveyed in this file is quite rich, and it is beyond the scope of this document to explain it in detail. You can get the details for the “Mapping Language” that it expresses from D2RQ’s documentation at http://d2rq.org/d2rq-language. However, a couple of short examples are still useful for us here.

First, let us look at a fragment that expresses the relationship between the database table for offices and the RDF expression of this data that D2RQ generates. The mapping from a database table to an RDF class is provided by a d2rq:ClassMap instance (shown here in RDF’s Turtle notation):

map:Office a d2rq:ClassMap;
  d2rq:dataStorage map:database;
  d2rq:uriPattern "Office/@@promrep_office.id@@";
  d2rq:class vocab:Office;
  d2rq:classDefinitionLabel "Office";
.

The d2rq:class statement expresses the connection to our RDF “Office” class. Each instance of the RDF Office class will have an associated URI, and the d2rq:uriPattern statement provides a template for it, in this case drawing on the id field of the database’s promrep_office table.

The connection between attributes in a relational database’s table, and the properties of an RDF class is provided by a d2rq:PropertyBridge. Here is a simple example:

map:officeName a d2rq:PropertyBridge;
  d2rq:belongsToClassMap map:Office;
  d2rq:property vocab:hasName;
  d2rq:propertyDefinitionLabel "Office name";
  d2rq:column "promrep_office.name";
.

The property name for the RDF representation is given by the d2rq:property statement. The connection of this property to the Office class and table comes from the d2rq:belongsToClassMap statement, and the association to the particular table’s attribute is provided by the d2rq:column statement.
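As a rough illustration of what the ClassMap and PropertyBridge shown above ask D2RQ to do, the following Python sketch expands the uriPattern for one table row and produces the two corresponding triples. It is not D2RQ's implementation: the function names and the sample row (including the id value 3) are invented for illustration, and details such as base URI resolution and the encoding of values in URIs are ignored.

```python
import re

# Values copied from the map:Office ClassMap and map:officeName PropertyBridge.
URI_PATTERN = "Office/@@promrep_office.id@@"
CLASS_NAME = "vocab:Office"
PROPERTY_NAME = "vocab:hasName"
COLUMN = "promrep_office.name"


def expand_uri_pattern(pattern, row):
    """Replace each @@table.column@@ placeholder with the row's value."""
    return re.sub(r"@@(.+?)@@", lambda m: str(row[m.group(1)]), pattern)


def row_to_triples(row):
    """Return the triples this simplified mapping yields for one table row."""
    subject = expand_uri_pattern(URI_PATTERN, row)
    return [
        (subject, "rdf:type", CLASS_NAME),       # from d2rq:class
        (subject, PROPERTY_NAME, row[COLUMN]),   # from d2rq:property / d2rq:column
    ]


# Hypothetical row: the id value is made up for this example.
example_row = {"promrep_office.id": 3, "promrep_office.name": "dictator"}
triples = row_to_triples(example_row)
```

Each row of the promrep_office table would give one such pair of triples, with the subject URI derived from the row's id.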

This example of a PropertyBridge defines a property that in OWL would be considered an owl:DatatypeProperty, but of course not all properties are simple datatype properties. The definition for an owl:ObjectProperty must be rather different, since it involves a relational join rather than a simple reference to an attribute of a table. Furthermore, there are other expressions for other kinds of data from the database that need to be mapped to OWL properties. We do not provide examples here, although full examples are available in D2RQ's documentation for its mapping language.

The entire mapping file for DPRR consists of about 1800 lines/RDF statements. That is a lot to create accurately by hand. Fortunately, D2RQ provides a utility, called generate-mapping, that reads the structure of the relational database and creates a preliminary mapping file containing instructions that will cause D2RQ to present all the data in the database.

Although this tool provided a starting form for DPRR’s mapping file, it was clear to me that it could not be used without being edited significantly. Here are the bits of RDF that correspond to Office and officeName as they appeared in the originally generated file:

map:promrep_office a d2rq:ClassMap;
  d2rq:dataStorage map:database;
  d2rq:uriPattern "promrep_office/@@promrep_office.id@@";
  d2rq:class vocab:promrep_office;
  d2rq:classDefinitionLabel "promrep_office";
.

map:promrep_office_name a d2rq:PropertyBridge;
  d2rq:belongsToClassMap map:promrep_office;
  d2rq:property vocab:promrep_office_name;
  d2rq:propertyDefinitionLabel "promrep_office name";
  d2rq:column "promrep_office.name";
.

Note that the names for the class and attributes are not as “user friendly” as those we have used in our mapping file. This is because all generate-mapping had to work with was the table and attribute names in the database. The DPRR database is actually created and managed by Django, so the naming conventions reflect Django’s needs and are not particularly convenient to use as public names in RDF.

It turns out that many changes had to be made to the file's text to take the raw mapping file generated by generate-mapping into the form used to actually generate the RDF data used by DPRR. Among the various issues that needed to be dealt with were:

many extra tables were generated by Django for its own uses, and did not contain data from DPRR that needed to be public. All the material about them needed to be removed.
fields were inserted by Django for various administrative purposes such as “created by” that did not need to be published. The references to all these fields needed to be removed so that the data they contained did not appear in the RDF expression of DPRR’s data.
Django named most DPRR data tables with the “promrep_” prefix, and this needed to be removed.
Django used lowercase names only for tables and attributes, and these needed to be converted to camelcase.
property names had to be changed to provide the “has”, “is” or “for” prefix.
pure intersection set tables that provided many-to-many mappings in the conventional relational paradigm way needed to be removed as classes, and the many-to-many relationship that this intersection set table supported instead expressed as an owl:ObjectProperty that could occur multiple times if necessary.

Furthermore, there were several more global changes, not listed here, that had to be made to make the mapping for DPRR work more naturally as RDF data.

Doing all of this by hand would have been tedious and error-prone. Thus, the process of fixing up the mapping file was expressed as a sequence of individual steps. Those that could be automated were carried out by small Python scripts, each of which took one version of the file in, performed a set of automated changes, and wrote a new version out. Not all the steps could be automated. In the end, there were 14 automated and by-hand transformation steps applied to the mapping file.
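To give a flavour of what one automated step might look like, here is a minimal Python sketch that removes the “promrep_” prefix from class names and converts them to camel case. It is not one of the actual DPRR scripts (which are not reproduced here): the function names are hypothetical, and the sketch only handles class names, leaving property names for a later step.

```python
import re


def to_public_class_name(table_name, prefix="promrep_"):
    """Turn a Django table name such as 'promrep_office' into 'Office'."""
    bare = table_name[len(prefix):] if table_name.startswith(prefix) else table_name
    return "".join(part.capitalize() for part in bare.split("_"))


def rename_classes(mapping_text, table_names):
    """Replace vocab:<table> with vocab:<ClassName> for each listed table."""
    for table in table_names:
        # The negative lookahead leaves longer names alone, for example
        # the property vocab:promrep_office_name.
        pattern = r"vocab:" + re.escape(table) + r"(?![A-Za-z0-9_])"
        mapping_text = re.sub(
            pattern, "vocab:" + to_public_class_name(table), mapping_text
        )
    return mapping_text


# Two lines in the style of the file produced by generate-mapping.
generated = (
    "  d2rq:class vocab:promrep_office;\n"
    "  d2rq:property vocab:promrep_office_name;"
)
renamed = rename_classes(generated, ["promrep_office"])
```

Keeping each step this small makes it easy to compare the file before and after the step and to rerun the whole chain if the database structure changes.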

In the end, however, we had a d2rq mapping file that allowed the d2rq server to present a pretty good (but not completely satisfactory, see below) representation of the RDF statements that were to be served by DPRR’s RDF server. The next step, then, was to develop a process that used d2rq’s tool (called dump-rdf) to generate a set of RDF statements from the database using its mapping file as a guide.

Generating RDF statements with d2rq

The dump-rdf tool generated a largely good set of directly usable RDF statements (about a third of a million). However, there were still a few refinements that I wanted to make to this automatically generated RDF data.

The most important change was that DPRR had developed a standard form for an historical person’s name, related closely to conventions already used by Roman Republic scholars to identify individuals. This "standard naming convention" for DPRR persons required putting together data mainly from fields in DPRR's person table, but also from several other tables. The d2rq mapping language is quite flexible, so perhaps this could have been expressed there so that d2rq would have been able to create this name directly, but the process of building the name from DPRR data was complex enough that, with my rather basic understanding of d2rq’s mapping language, I couldn’t see how to get it to work. Nonetheless, it seemed important that DPRR's names for persons be represented in the DPRR RDF, even if I couldn't see how to create them with the d2rq mapping language.
Many fields in the DPRR database are empty. D2RQ claims that it provides a mechanism to have empty fields suppressed as RDF data, but I couldn’t make it work. I wanted all statements that had null or empty string objects to be suppressed.
Several fields in DPRR were Boolean and reflected particular characteristics of the things to which they were attached. I wanted all statements where the value of these properties was false to be suppressed. (This may turn out to be a controversial decision.)
The situation where DPRR had two praenomens to associate with a person was handled in a way that did not really suit RDF principles, and had to be expressed differently.

All these refinements to the data dumped from d2rq could perhaps have been handled directly within d2rq, resulting in a file that could simply be loaded directly into the DPRR RDF server. However, I chose instead to write a Python script (called processttl.py) that took the version of the RDF data that dump-rdf had generated and made all the changes outlined above, creating a new Turtle file called dppr_clean.ttl. It was this file that was to be loaded into the DPRR RDF server.
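Two of the refinements listed above, suppressing statements with empty objects and suppressing false Boolean values, can be sketched in a few lines of Python. This is only an illustration of the filtering idea and not the contents of processttl.py: it assumes the statements have already been parsed into (subject, predicate, object) tuples, and apart from vocab:hasName the property names are invented.

```python
def keep_statement(triple):
    """Return False for statements whose object is empty or Boolean false."""
    _subject, _predicate, obj = triple
    if obj is None:
        return False                      # null value in the database
    if isinstance(obj, str) and obj.strip() == "":
        return False                      # empty string
    if obj is False:
        return False                      # Boolean property that is false
    return True


def clean(triples):
    """Keep only the statements worth publishing, preserving their order."""
    return [t for t in triples if keep_statement(t)]


# Hypothetical statements; only vocab:hasName appears in the mapping examples.
statements = [
    ("Person/1", "vocab:hasName", "Example name"),
    ("Person/1", "vocab:hasExampleText", ""),
    ("Person/1", "vocab:isExampleFlag", False),
    ("Person/1", "vocab:isOtherFlag", True),
]
cleaned = clean(statements)
```

Only the first and last statements survive the filter; the empty string and the false flag are dropped.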

Building the DPRR RDF server

Now that we had a way to generate the RDF triples from the DPRR relational database, we needed a way to serve them.

A number of different software development projects worldwide have created toolkits for RDF servers. I chose rdf4j as the platform for building DPRR’s RDF server, since rdf4j is a Java-based platform that includes many components I knew we were going to need to make DPRR’s server work in the way I wanted. In particular, I liked the part of rdf4j called the workbench, which operates as a Tomcat web application and which, of the various platforms I looked at, seemed to me to best provide access to the RDF data in ways that help the user grasp how the data is structured.

There was, however, an issue with the rdf4j workbench as it came "out of the box": it provided a full set of tools not only to interact with existing RDF data but also to change it, including allowing its user to replace or remove some or all of the RDF triples in its repository. Furthermore, the administrative system for managing permissions, which would have let the DPRR data administrator update the data while preventing ordinary users from doing so, was not laid out clearly enough for me to be confident that I could readily control this.

So, I decided to use the rdf4j workbench code as the basis for DPRR’s RDF server but to make several changes to the way that it operates “out of the box”. As a result:

All features of the rdf4j workbench that went beyond data exploring have been removed, including the Java classes that implement these features.
The rdf4j workbench in the DPRR server has been directed to work directly with RDF data in the filesystem, rather than through an intermediate rdf4j server (which, by operating via http:, presented similar issues of data access and control that the raw workbench did).
Minor changes to the look and feel, beyond the removal of unwanted features, have been made. For example, the web pages’ “livery” is now related to DPRR’s livery.
A few classes have been developed by me to provide the data-oriented support for URI entities, and for a direct data-oriented (rather than HTML mediated) SPARQL endpoint. These classes, which go beyond the functionality provided by the rdf4j workbench, are in the Java package uk.ac.kcl.ddh.jb.rdf.server.

All this work was done by setting up the rdf4j workbench code in Eclipse via an Eclipse “web” project, and working on it there. The changes were relatively minor. The base Java code that came from rdf4j for the workbench classes was altered in a few trivial ways. In addition, to provide the data-oriented services (support for entity URIs, and the basic SPARQL endpoint: the features mentioned in the last point above) I had to add a small number of my own classes. Of course, the various “web content pages” had to be altered too to bring in elements of DPRR’s livery, and to remove references to services that the out-of-the-box rdf4j workbench offered but that were not to be made available in the DPRR’s RDF server.

This work on the web content pages revealed something that was, to me, interesting: the rdf4j workbench operates in an unusual way as a server of pages for a web browser, in that it relies on browsers that support XML and XSLT. It sends the XML that it generates as the data, together with a reference to an XSLT stylesheet, and the browser then transforms the XML into HTML for proper display. Thus, one of the jobs I needed to do was alter the stylesheets (particularly the “template.xsl”, but others too) to remove all HTML links to services that the DPRR RDF server was not going to offer. In addition, unwanted functionality was removed, and new functionality added, in the WEB-INF/web.xml file.
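The general technique of sending XML with a stylesheet reference can be illustrated with a short Python sketch that builds such a document using the standard library. The workbench itself is written in Java and produces its own XML; the element names “results” and “result” below are hypothetical, and only the stylesheet name template.xsl comes from the description above.

```python
from xml.dom import minidom


def build_page(stylesheet_href, labels):
    """Build an XML document that asks the browser to apply an XSLT stylesheet."""
    doc = minidom.Document()
    # The xml-stylesheet processing instruction tells the browser which
    # XSLT file to use when turning this XML into HTML.
    instruction = doc.createProcessingInstruction(
        "xml-stylesheet", 'type="text/xsl" href="%s"' % stylesheet_href
    )
    doc.appendChild(instruction)
    root = doc.createElement("results")      # hypothetical element name
    doc.appendChild(root)
    for label in labels:
        item = doc.createElement("result")   # hypothetical element name
        item.appendChild(doc.createTextNode(label))
        root.appendChild(item)
    return doc.toxml()


page = build_page("template.xsl", ["dictator"])
```

Because the HTML is produced in the browser by the stylesheet, removing a link or a menu entry from the displayed pages means editing the XSLT rather than the Java code that produces the data.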

The entire Eclipse project in which DPRR's RDF server was developed is licensed as open source: the Java code under the EPL license, and the non-Java elements under Creative Commons. It is available for anyone to use through a GitHub repository at johnBradley501/uk.ac.kcl.cch.dprr.rdf.server

Having now created mechanisms to generate the RDF data, and having built an RDF server based on the stripped down (and somewhat extended) rdf4j workbench, I was ready to turn my attention to an ontology for DPRR.

Building the DPRR ontology

I chose to use rdf4j’s memory repository with support for RDF Schema inferencing. According to the rdf4j documentation (http://docs.rdf4j.org/javadoc/2.0/org/eclipse/rdf4j/sail/inferencer/fc/ForwardChainingRDFSInferencer.html), it follows the rules from the RDF Semantics Recommendation (https://www.w3.org/TR/2004/REC-rdf-mt-20040210/#rdfs_interp). Furthermore, in my view, information about data expressed in terms of a class-subclass structure is easier for a human to grasp. These two reasons meant that it was worthwhile constructing the DPRR ontology in ways that made use of class-subclass structuring mechanisms. D2RQ, which generates DPRR's RDF statements, does not provide direct support for a subclassing model, in part because the ideas in subclassing are available, if at all, only implicitly in relational database semantics. Thus, the class-subclass structure that is presented in DPRR’s ontology had to be constructed by me entirely “by hand”. D2RQ is not aware of this class hierarchy at all.
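To show why a class-subclass structure pays off when the repository does RDFS inferencing, here is a small Python sketch of two of the entailment rules from the RDF Semantics Recommendation: rdfs11 (rdfs:subClassOf is transitive) and rdfs9 (an instance of a subclass is also an instance of its superclass). It is a toy forward-chaining loop over tuples, not rdf4j's inferencer, and the ex: names are placeholders rather than classes from the DPRR ontology.

```python
def infer(triples):
    """Apply rules rdfs11 and rdfs9 repeatedly until nothing new is derived."""
    inferred = set(triples)
    while True:
        subclass_pairs = [
            (s, o) for s, p, o in inferred if p == "rdfs:subClassOf"
        ]
        new = set()
        for s, p, o in inferred:
            for sub, sup in subclass_pairs:
                if o != sub:
                    continue
                if p == "rdfs:subClassOf":
                    new.add((s, "rdfs:subClassOf", sup))   # rule rdfs11
                elif p == "rdf:type":
                    new.add((s, "rdf:type", sup))          # rule rdfs9
        if new <= inferred:
            return inferred
        inferred |= new


# Placeholder data: ex:C is a subclass of ex:B, which is a subclass of ex:A.
asserted = {
    ("ex:C", "rdfs:subClassOf", "ex:B"),
    ("ex:B", "rdfs:subClassOf", "ex:A"),
    ("ex:item1", "rdf:type", "ex:C"),
}
closure = infer(asserted)
```

With inferencing of this kind in place, a query for all instances of a superclass also returns the instances that were only asserted to belong to its subclasses.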

I chose to express the ontology in terms of the OWL language (https://www.w3.org/OWL/), although I only used a few basic constructs of OWL. You can find the DPRR ontology at:

http://romanrepublic.ac.uk/rdf/ontology#

A certain amount of the material (base classes and properties) in the DPRR ontology could be generated automatically from the information D2RQ has in its mapping file. Thus, I loaded the DPRR mapping file, which is in RDF, into a temporary RDF store using rdf4j’s workbench, and ran SPARQL construct queries against it to create preliminary components of what was going to become the DPRR ontology. The following query, for example, created class definitions for all of d2rq’s ClassMap constructs:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#>
PREFIX vocab: <http://romanrepublic.ac.uk/rdf/vocab/>

construct {
   ?name a owl:Class;
      rdfs:label ?label.
}
where {
   ?a a d2rq:ClassMap;
      d2rq:class ?name;
      d2rq:classDefinitionLabel ?label.
}

This query created an rdf:Property assertion and domain mapping for all d2rq:PropertyBridge assertions:

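PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#>

construct {
   ?name a rdf:Property;
      rdfs:domain ?domainName.
}
where {
   ?a a d2rq:PropertyBridge;
      d2rq:belongsToClassMap ?domainMap;
      d2rq:property ?name.
   ?domainMap d2rq:class ?domainName.
   filter (?name != rdfs:label)
}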
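For readers less familiar with SPARQL, the join performed by the PropertyBridge query above can be mimicked in plain Python over a list of triples. This is only a sketch of the query's logic, not how rdf4j evaluates it; the function name is hypothetical, and the input triples are taken from the map:Office and map:officeName fragments shown earlier.

```python
def construct_property_domains(triples):
    """Mimic the CONSTRUCT query: one rdf:Property plus rdfs:domain per bridge."""
    result = set()
    bridges = {
        s for s, p, o in triples
        if p == "rdf:type" and o == "d2rq:PropertyBridge"
    }
    for bridge in bridges:
        names = [o for s, p, o in triples
                 if s == bridge and p == "d2rq:property"]
        domain_maps = [o for s, p, o in triples
                       if s == bridge and p == "d2rq:belongsToClassMap"]
        for domain_map in domain_maps:
            # Follow the ClassMap to the RDF class it generates.
            domain_names = [o for s, p, o in triples
                            if s == domain_map and p == "d2rq:class"]
            for name in names:
                if name == "rdfs:label":
                    continue  # mirrors the FILTER in the query
                for domain_name in domain_names:
                    result.add((name, "rdf:type", "rdf:Property"))
                    result.add((name, "rdfs:domain", domain_name))
    return result


mapping = [
    ("map:Office", "rdf:type", "d2rq:ClassMap"),
    ("map:Office", "d2rq:class", "vocab:Office"),
    ("map:officeName", "rdf:type", "d2rq:PropertyBridge"),
    ("map:officeName", "d2rq:belongsToClassMap", "map:Office"),
    ("map:officeName", "d2rq:property", "vocab:hasName"),
]
ontology_fragment = construct_property_domains(mapping)
```

For this input the result states that vocab:hasName is an rdf:Property whose domain is vocab:Office.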

A final example (there were several more similar to this), created rdfs:range assertions for properties that were to become OWL ObjectProperties:

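PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#>

construct {
   ?name rdfs:range ?rangeName.
}
where {
   ?a a d2rq:PropertyBridge;
      d2rq:property ?name;
      d2rq:refersToClassMap ?rangeMap.
   ?rangeMap d2rq:class ?rangeName.
}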

After running my 5 construct queries, I combined all the triple results they generated into one large file by hand, added a basic owl:Ontology definition, and opened the resulting file in Protégé (http://protege.stanford.edu/).

There was then a fair amount of work by hand in Protégé to further enrich the ontology to the state that you see it now — in particular adding what appeared to me to be an appropriate class-subclass structure, and adding at least a brief rdfs:comment field to each class and property.

Once I had an OWL file containing the DPRR ontology, a reasonably readable set of documentation files for the ontology could be generated using Protégé’s OWLDoc plugin, and then modified slightly with a small Python script to fix up a few awkward bits in OWLDoc’s generated HTML. You can see the result at:

http://romanrepublic.ac.uk/rdf/ontology/doc
