SHACL – “Shackles” for better data quality

8 December 2017

Linked Data technologies were originally designed to publish semantically enriched data on the web, to connect heterogeneous data sets and to enable better information retrieval. The focus of Linked Data ontologies and vocabularies is therefore interoperability, not the restriction of data. Ontologies help to draw conclusions and to see information in a broader context, but they are not intended to constrain data.

But what if Linked Data is used not only to publish information, but also to create and store it in the first place? amsl is a native Linked Data application. Every piece of information is entered through a Linked Data editor and saved into a triplestore. How can the consistency and quality of the stored data be ensured? Over the last years of using amsl, we have found that this is indeed becoming a problem. When querying our data, we often find problematic triples, ranging from incorrectly typed objects to contextually wrong information such as two separate start dates for a single subscription. Complex and in many cases very time-consuming SPARQL queries can help to sort out some of the errors, but others need to be corrected manually. But how can they be found, or better, how can they be prevented in the first place?

The Shapes Constraint Language (SHACL, hinting at the “shackles” put on the data), a W3C Recommendation since summer 2017, proposes an elegant solution. In addition to the data graph, a so-called shapes graph is defined, which describes how the data should be formed. SHACL shapes can, for example, target all nodes of a certain class and constrain both the properties that may be used on them and the form their objects should have. These constraints can become very complex: they allow for the validation of datatypes, string patterns, cardinality and class membership, for logical combinations of constraints, and for many other rules and restrictions. A SHACL processor then validates the data graph against the shapes graph and returns yet another graph, the validation report. This graph holds information on every violation of the shape rules that was found.
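To make the interplay of data graph, shapes and validation report more concrete, here is a minimal sketch in plain Python. It is a toy illustration of the idea only, not a SHACL processor and not amsl's implementation: real shapes are themselves written in RDF, using terms such as sh:NodeShape, sh:targetClass, sh:path, sh:minCount, sh:maxCount and sh:datatype, and the report is returned as a graph rather than a list. The names ex:Subscription and ex:startDate, the sample triples and the helper classes are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional

RDF_TYPE = "rdf:type"


@dataclass(frozen=True)
class Literal:
    """A literal value together with its datatype (hypothetical helper)."""
    value: str
    datatype: str


@dataclass(frozen=True)
class PropertyShape:
    """Loosely mirrors sh:path, sh:minCount, sh:maxCount and sh:datatype."""
    path: str
    min_count: int = 0
    max_count: Optional[int] = None
    datatype: Optional[str] = None


@dataclass(frozen=True)
class NodeShape:
    """Loosely mirrors a node shape with sh:targetClass and sh:property."""
    target_class: str
    properties: tuple


def validate(data_graph, shape):
    """Return a list of (focus node, path, violated constraint) tuples."""
    # Focus nodes are all subjects typed with the shape's target class.
    focus_nodes = sorted(
        s for s, p, o in data_graph
        if p == RDF_TYPE and o == shape.target_class
    )
    report = []
    for node in focus_nodes:
        for prop in shape.properties:
            values = [o for s, p, o in data_graph
                      if s == node and p == prop.path]
            if len(values) < prop.min_count:
                report.append((node, prop.path, "minCount"))
            if prop.max_count is not None and len(values) > prop.max_count:
                report.append((node, prop.path, "maxCount"))
            if prop.datatype is not None:
                for value in values:
                    is_typed = isinstance(value, Literal)
                    if not (is_typed and value.datatype == prop.datatype):
                        report.append((node, prop.path, "datatype"))
    return report


# Made-up sample data: sub1 has two start dates, sub2 a wrongly typed one.
data_graph = {
    ("ex:sub1", RDF_TYPE, "ex:Subscription"),
    ("ex:sub1", "ex:startDate", Literal("2016-01-01", "xsd:date")),
    ("ex:sub1", "ex:startDate", Literal("2017-01-01", "xsd:date")),
    ("ex:sub2", RDF_TYPE, "ex:Subscription"),
    ("ex:sub2", "ex:startDate", Literal("January 2017", "xsd:string")),
    ("ex:sub3", RDF_TYPE, "ex:Subscription"),
    ("ex:sub3", "ex:startDate", Literal("2017-06-01", "xsd:date")),
}

# Every subscription needs exactly one start date of type xsd:date.
subscription_shape = NodeShape(
    target_class="ex:Subscription",
    properties=(
        PropertyShape("ex:startDate", min_count=1, max_count=1,
                      datatype="xsd:date"),
    ),
)

report = validate(data_graph, subscription_shape)
for focus_node, path, constraint in report:
    print(f"{focus_node}: {path} violates {constraint}")
```

In this sketch the subscription with two start dates is reported for exceeding the maximum count and the one with a string value for its datatype, while the third subscription conforms and does not appear in the report.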

The validation report can be used to correct the faulty data manually or automatically. Short validation cycles keep the workload for data correction reasonable and the data in the triplestore relatively consistent. Even better than validating the stored data is validating the content of a form before it is saved. In the long run, amsl will implement form validation based on SHACL files. SHACL also offers several form-building features that amsl will take advantage of in the future, simplifying the current amsl templates.

As briefly explained in a lightning talk at SWIB 2017 in Hamburg, amsl’s first SHACL validation features are currently being implemented. Some important constraints can already be validated; we are now working on the definition of further, more complex shapes.