Shape Expressions Mapping Language (ShExML)

This document describes the Shape Expressions Mapping Language (hereinafter referred to as ShExML), a heterogeneous data mapping language based on Shape Expressions (ShEx). ShExML aims to let users map and merge heterogeneous data sources into a single RDF representation. Because it is based on ShEx, its syntax is similar to that of ShEx and the gap between the two is small. The rest of the document describes the syntax of ShExML and how each element can be used.

Declarations

Prefix

A prefix declares a short name that is replaced by the corresponding URI wherever it is used. Prefixes serve as abbreviations and avoid repeating long URIs throughout the document. This is the same notion as in Turtle and SPARQL. In ShExML, they are written as shown in the following example.

PREFIX : <http://example.com/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX schema: <http://schema.org/>

Source

A source declaration defines a data source from a given URL. The URL points to a file in one of the supported formats; for databases, the JDBC URL is used instead. Each declaration is bound to a variable that is used to refer to it later. The following example declares three files: one from the internet and two local ones (given as an absolute and a relative path). In addition, a database connection is declared under the database variable.

SOURCE xml_file <https://example.com/file.xml>
SOURCE json_file <file:///example/path/to/file/file.json>
SOURCE json_relative_path <file.json>
SOURCE database <jdbc:mysql://localhost:3306/mydb>

Types of allowed sources

ShExML is designed to process many different formats. Several are already supported and more are planned. Currently, the following ones are fully supported:

JSON
XML
CSV
Relational Databases (using JDBC URL)
RDF via SPARQL queries

Wildcards

To convert multiple files at once, the "*" wildcard can be used as shown in the following example, where example*.xml matches all XML files whose names begin with the word "example".

    SOURCE xml_file </path/to/example*.xml>
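
As a rough illustration of the matching rule described above (not the engine's implementation), the following Python sketch selects file names with the standard fnmatch module. The function name and the candidate file names are made up for the example, and fnmatch also gives special meaning to "?" and "[...]", which this specification does not mention.

```python
from fnmatch import fnmatchcase

def matching_files(pattern, filenames):
    """Return the file names selected by a '*' wildcard pattern."""
    return [name for name in filenames if fnmatchcase(name, pattern)]

# Hypothetical directory listing, used only for illustration.
candidates = ["example.xml", "example_2020.xml", "other.xml", "example.json"]
selected = matching_files("example*.xml", candidates)
print(selected)
```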

Query

A query declaration is used whenever it is convenient to separate a query from the iterator definition (e.g., long or multi-line queries). The query can be defined inline or externalised in a file. When it is defined inline, the same syntax as in iterators is used (see the first line of the example). When it is defined in an external file, a URL is expected, as in the Source declaration; in this case no iterator keyword is needed because it is inferred from the file extension.

QUERY inline_query <sql: SELECT * FROM example;>
QUERY external_query </path/to/example_query.sparql>

Iterator

An iterator is used whenever a data structure is repeated throughout the file. A query specifies what the iterator must iterate over. The iterator accepts XPath, JSONPath and SQL queries, which must be identified by placing the 'xpath:', 'jsonpath:' or 'sql:' keyword before the query. The example below defines one iterator for XPath and one for JSONPath. The following subsections describe the elements that can be nested inside an iterator. The '{' and '}' symbols open and close the block of nested content, respectively.

ITERATOR example <xpath: /path/to/entity> { }
ITERATOR example <jsonpath: $.path.to.entity[*]> { }

Iterator types

The following iterators and their keywords are currently available in ShExML, one for each supported format type.

xpath: (XPath query for XML files)
jsonpath: (JSONPath query for JSON files)
csvperrow (CSV iterator row by row)
sql: (SQL query with a table of results, to be iterated row by row)
sparql: (SPARQL query with a table of results, to be iterated row by row)
External query: It is possible to define queries in an external clause or file using their variable name (previously defined with QUERY declaration)

Field

A field is a partial query defined relative to the iterator query. For each result of the iterator query, the partial query is applied to obtain the field value or values. Note that a field may return more than one result: if the query yields a list of results, the list is reflected in the output. The partial query is written as a normal XPath or JSONPath query but omitting the leading '/' or '.' navigational symbol, which is attached by the engine. For CSV files and relational database results, fields are the names of the columns to be retrieved. An example iterator with fields is presented below.

ITERATOR example <xpath: /path/to/entity> {
    FIELD field1 <@attribute>
    FIELD field2 <field2>
    FIELD field3 <path/to/field3>
}
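
The way a field's partial query relates to its iterator query can be pictured with a small Python sketch. It only illustrates the idea that the leading navigational symbol is supplied by the engine; the helper name and the separator table are assumptions made for this example and do not describe the engine's internals.

```python
# Hypothetical helper: joins an iterator query and a field's partial query
# with the navigational symbol that the field definition leaves out.
SEPARATORS = {"xpath": "/", "jsonpath": "."}

def full_query(kind, iterator_query, field_query):
    return iterator_query + SEPARATORS[kind] + field_query

xpath_query = full_query("xpath", "/path/to/entity", "path/to/field3")
jsonpath_query = full_query("jsonpath", "$.path.to.entity[*]", "field1")
print(xpath_query)
print(jsonpath_query)
```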

Pushed and popped fields

Pushed and popped fields can be used to push values down into a hierarchical iteration model. For example, with JSONPath [[Jsonpath]] it is not possible to query parent nodes to get their values, as can be done with XPath. ShExML therefore offers the PUSHED_FIELD keyword, which tells the engine to save a value when going deeper into the iteration hierarchy. The value can then be retrieved with the POPPED_FIELD keyword, using the pushed field's variable name as the query.

ITERATOR example <jsonpath: $> {
    PUSHED_FIELD field1 <id>
    ITERATOR nestedIterator <nestedElements[*]> {
        POPPED_FIELD field2 <field1>
        FIELD field3 <field3>
    }
}
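
The effect of the example above can be approximated in Python as follows. This is a sketch under assumptions: the sample JSON document is invented, and the function only mimics how a value saved at the outer level becomes visible in each nested result.

```python
# Invented input shaped like the JSON the example iterators would traverse.
document = {
    "id": "parent-1",
    "nestedElements": [{"field3": "a"}, {"field3": "b"}],
}

def nested_results(doc):
    """Carry the outer 'id' down into every nested result."""
    pushed = {"field1": doc["id"]}          # PUSHED_FIELD field1 <id>
    results = []
    for element in doc["nestedElements"]:   # ITERATOR nestedIterator
        results.append({
            "field2": pushed["field1"],     # POPPED_FIELD field2 <field1>
            "field3": element["field3"],    # FIELD field3 <field3>
        })
    return results

rows = nested_results(document)
print(rows)
```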

Nested Iterator

Iterators can also be nested inside other iterators, which makes it possible to cover more complicated structures containing nested entities. A nested iterator is defined like a normal iterator, i.e., it has the same syntax but is placed inside the main one. The engine then iterates over the results of the first iterator. The Field example, expanded with a nested iterator, is presented below.

ITERATOR example <xpath: /path/to/entity> {
    FIELD field1 <@attribute>
    FIELD field2 <field2>
    FIELD field3 <path/to/field3>
    ITERATOR nested <path/to/sub/entity> { }
}

Expression

Expressions apply queries to different files by means of the previously defined declarations. They also allow some operations to be performed on the results. These operations can be applied at the iterator level or at the field level.

Basic expression

This is the most basic expression that can be defined and it is used when only one source is defined. It is composed of a file variable and the route to the iterators and fields to apply.

Basic expression over iterators

A basic expression can be used with an iterator without naming the fields to be accessed. The expression then produces a set of values that can be accessed later in the shapes through the names of the iterator's fields. In the example, the expression produces a set containing the values of field1 and field2.

ITERATOR example <xpath: /path/to/entity> {
    FIELD field1 <@attribute>
    FIELD field2 <field2>
}
EXPRESSION exp <file.example>

Basic expression over fields

A basic expression can also point to a single iterator field, in which case it produces only the value of that field's query. This kind of expression can be combined with further operations to define the output with finer granularity. In the example, only the value of field1 is taken.

ITERATOR example <xpath: /path/to/entity> {
    FIELD field1 <@attribute>
    FIELD field2 <field2>
}
EXPRESSION exp <file.example.field1>

Union

Unions merge the results of several basic expressions. With this operation it is possible to combine different sources to produce a new RDF graph.

Union over iterators

Unions can be used over iterators when the fields of different iterators are to be merged. Fields are merged only if they have the same name. In the example below, field1 is the merge of the field1 values of both iterators, whereas field2 and field3 keep only their respective values.

ITERATOR it1 <xpath: /path/to/entity> {
    FIELD field1 <@attribute1>
    FIELD field2 <field2>
}
ITERATOR it2 <jsonpath: $.path.to.entity> {
    FIELD field1 <field1>
    FIELD field3 <field3>
}
EXPRESSION exp <file.it1 UNION file.it2>
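
The name-based merging rule can be sketched in Python. The model below is a simplification made for illustration: each iterator result is represented as a dictionary from field name to a list of values, and the sample values are invented; how the engine aligns individual entities is not covered.

```python
def union_iterators(left, right):
    """Pool the values of fields that share a name; keep the others as they are."""
    merged = {}
    for result in (left, right):
        for name, values in result.items():
            merged.setdefault(name, []).extend(values)
    return merged

# Invented values standing in for the results of it1 and it2.
it1 = {"field1": ["a"], "field2": ["b"]}
it2 = {"field1": ["c"], "field3": ["d"]}
exp = union_iterators(it1, it2)
print(exp)
```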

Union over fields

Unions can also merge individual fields without taking the entire iterators. As with basic expressions, this offers more flexibility and granularity when needed. In the example below, field2 and field3 are merged.

ITERATOR it1 <xpath: /path/to/entity> {
    FIELD field1 <@attribute1>
    FIELD field2 <field2>
}
ITERATOR it2 <jsonpath: $.path.to.entity> {
    FIELD field1 <field1>
    FIELD field3 <field3>
}
EXPRESSION exp <file.it1.field2 UNION file.it2.field3>

Join

Joins merge the results of two iterators based on a common field: if the field values are equal, the two iterator results are merged into a single one. If a field exists in both iterators, the merged result contains a list of all the combined values; if a field exists in only one of the iterators, the merged result includes only the values from the side (left or right) that contains it.

ITERATOR it1 <xpath: /path/to/entity> {
    FIELD field1 <@attribute1>
    FIELD field2 <field2>
    FIELD field3 <field3>
}
ITERATOR it2 <jsonpath: $.path.to.entity> {
    FIELD field1 <field1>
    FIELD field3 <field3>
}
EXPRESSION exp <file1.it1 JOIN file2.it2 ON file1.it1.field1 = file2.it2.field1>

Substitution

Substitutions make it possible to take identifier information from another file when a common attribute is present. For example, if file A contains an id and a name but file B contains only a name, the name from B can be replaced by the corresponding id from A whenever the names coincide.

Substitution over iterators

A substitution can be used over iterators when the operation should be applied to all the fields of the iterator. In this case, the operation works as it does with fields, but fields are paired by equal names.

ITERATOR it1 <xpath: /path/to/entity> {
    FIELD field1 <@attribute1>
    FIELD field2 <field2>
}
ITERATOR it2 <jsonpath: $.path.to.entity> {
    FIELD field1 <field1>
    FIELD field3 <field3>
}
ITERATOR it3 <jsonpath: $.path.to.entity> {
    FIELD field1 <field1>
    FIELD field4 <field3>
}
EXPRESSION exp <file.it1 UNION file.it2 SUBSTITUTING file.it3>

Substitution over fields

Substitutions can be applied over fields like the other expression operators. Naming the three results A, B and C, the syntax is A UNION B SUBSTITUTING C, where the results of B are replaced by the results of A when C is equal to B. In the example, the names of B are replaced by the ids of A whenever this is possible.

ITERATOR it1 <xpath: /path/to/entity> {
    FIELD id <@id>
    FIELD name <name>
}
ITERATOR it2 <jsonpath: $.path.to.entity> {
    FIELD name <name>
}
EXPRESSION exp <file.it1.id UNION file.it2.name SUBSTITUTING file.it1.name>
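
One possible reading of A UNION B SUBSTITUTING C is sketched below in Python. It assumes that A and C are parallel lists coming from the same iterator (ids and names of the same entities), that unmatched values of B are kept unchanged, and that duplicates are dropped from the final union; none of these details is fixed by the text above, and the sample values are invented.

```python
def union_substituting(a_ids, b_names, c_names):
    """Replace each value of B by the A value whose C value equals it, then union with A."""
    id_by_name = dict(zip(c_names, a_ids))
    substituted = [id_by_name.get(name, name) for name in b_names]
    # dict.fromkeys drops duplicates while preserving order.
    return list(dict.fromkeys(a_ids + substituted))

# Invented data: it1 knows ids and names, it2 only knows names.
result = union_substituting(["1", "2"], ["Bob", "Eve"], ["Ann", "Bob"])
print(result)
```

Here "Bob" from the second source is replaced by the id "2", while "Eve" has no counterpart and stays as it is.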

String operation

String operations concatenate results as strings, which makes it possible to transform the results by means of string concatenation.

String operation over iterators

String operations can be used with iterators to combine the fields with the same name through a string concatenation. As with other expressions for iterators, the requirement is the equality of the field name. In the example below, field1 of it1 is concatenated with field1 of it2 using a dash to join them. However, field2 and field3 are not concatenated as they do not have the same name.

ITERATOR it1 <xpath: /path/to/entity> {
    FIELD field1 <@attribute1>
    FIELD field2 <field2>
}
ITERATOR it2 <jsonpath: $.path.to.entity> {
    FIELD field1 <field1>
    FIELD field3 <field3>
}
EXPRESSION exp <file.it1 + "-" + file.it2>

String operation over fields

String operations can also be used with fields, like the other expressions covered in this specification. In this case, only the selected fields are concatenated. This gives more flexibility for transformations of this kind, which may be less common. In the example, field2 and field3 are concatenated using a dash.

ITERATOR it1 <xpath: /path/to/entity> {
    FIELD field1 <@attribute1>
    FIELD field2 <field2>
}
ITERATOR it2 <jsonpath: $.path.to.entity> {
    FIELD field1 <field1>
    FIELD field3 <field3>
}
EXPRESSION exp <file.it1.field2 + "-" + file.it2.field3>

Matcher

Matchers replace a result with another one. They are designed to change some of the results into a string that matches an existing URI in the LOD cloud, hence their name. In this example, a matcher is defined for the region of Asturias, so that all its possible occurrences in different languages are mapped to Asturias as it appears in the URI http://dbpedia.org/resource/Asturias. An example of use is shown in the Shapes section.

MATCHER ast <Principality of Asturias, Principado de Asturias, Principáu d'Asturies, Asturies AS Asturias>

Multiple entry Matcher

To avoid defining multiple matchers for different terms, several entries can be grouped in a single matcher using the "&" operator. The previous example, extended with another entry, looks like this:

    MATCHER regions <Principality of Asturias, Principado de Asturias, Principáu d'Asturies, Asturies AS Asturias &
                    Spain, España, Espagne AS Spain>
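
The replacement table that such a matcher stands for can be pictured with the Python sketch below. It is a simplified reading of the matcher body, not the ShExML grammar: entries are split on "&", variants on ",", and the target on the last " AS ". Leaving unmatched values unchanged is an assumption of this sketch, and the function names are hypothetical.

```python
def parse_matcher(body):
    """Build a {variant: replacement} table from a matcher body."""
    table = {}
    for entry in body.split("&"):
        variants, target = entry.rsplit(" AS ", 1)
        for variant in variants.split(","):
            table[variant.strip()] = target.strip()
    return table

def apply_matcher(table, value):
    """Return the replacement for value, or value itself when no entry matches."""
    return table.get(value, value)

regions = parse_matcher(
    "Principality of Asturias, Principado de Asturias, "
    "Principáu d'Asturies, Asturies AS Asturias & "
    "Spain, España, Espagne AS Spain"
)
print(apply_matcher(regions, "Asturies"))
print(apply_matcher(regions, "Espagne"))
```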

Autoincrement ids

When the content has no natural ids, it can be useful to build your own, like auto-increment ids in databases. In ShExML, the AUTOINCREMENT keyword defines an auto-incrementing id to be used for the subjects of triples. It is defined as the concatenation of a beginning string (optional), a range definition (mandatory) and an ending string (optional). The range definition consists of a beginning integer (mandatory), an ending integer (introduced by the to keyword; optional, defaults to infinite) and a step increment (introduced by the by keyword; optional, defaults to 1). The following example defines an id that generates the values my0Id, my2Id, my4Id, my6Id, my8Id and my10Id.

AUTOINCREMENT myId <"my" + 0 to 10 by 2 + "Id">
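
The sequence produced by a declaration like the one above can be reproduced with a short Python generator. This is only a model of the described behaviour: the function and parameter names are chosen for the example, and the end of the range is treated as inclusive because the listed ids include my10Id.

```python
from itertools import count, islice

def autoincrement(prefix="", start=0, end=None, step=1, suffix=""):
    """Yield ids built as prefix + number + suffix; unbounded when end is None."""
    numbers = count(start, step) if end is None else range(start, end + 1, step)
    for number in numbers:
        yield f"{prefix}{number}{suffix}"

# Equivalent of: "my" + 0 to 10 by 2 + "Id"
my_ids = list(autoincrement("my", 0, 10, 2, "Id"))
print(my_ids)

# Without an end the sequence is unbounded, so only a slice is taken here.
first_three = list(islice(autoincrement("id", 1), 3))
print(first_three)
```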

Functions

Functions extend the functionality of ShExML so that the obtained values can be cleaned, normalised and transformed before they are used in the generation of triples. For this purpose, ShExML can invoke functions residing in an external Scala class. The class is loaded with the FUNCTIONS directive, which assigns a variable name for later use and gives the URL where the file resides, in the same way as data sources are declared.

class Helper {

    def allCapitals(input: String): String = {
        input.toUpperCase
    }
    
    def addOne(number: Int): Int = {
        number + 1
    }
}

FUNCTIONS helper <scala: https://raw.githubusercontent.com/herminiogg/ShExML/enhancement-%23121/src/test/resources/functions.scala>

Generators

Shapes

Shapes define the form of the output data. The shape concept is taken from ShEx, and this specification explains only the basic notions of shapes needed to understand ShExML; readers who want further detail are encouraged to consult the ShEx specification [[SHEX]].

Shapes in ShExML

Shapes in ShExML are similar to those in ShEx but with some modifications. As shown in the example, a shape consists of: the shape name, which can take a prefix to define its namespace; the expression that generates the subjects of the triples, which is enclosed in square brackets and preceded by a prefix; and a set of predicate-object tuples. Each tuple contains a predicate in the form of prefix plus terminal (as in Turtle) and an object, which is an expression enclosed in square brackets with an optional prefix. If the prefix is present a URI is generated; otherwise a literal is generated.

    ITERATOR film_xml <xpath: //film> {
        FIELD id <@id>
        FIELD name <name>
        FIELD year <year>
        FIELD country <country>
        FIELD directors <directors/director>
    }
    ITERATOR film_json <jsonpath: $.films[*]> {
        FIELD id <id>
        FIELD name <name>
        FIELD year <year>
        FIELD country <country>
        FIELD directors <director>
    }
    EXPRESSION films <films_xml_file.film_xml UNION films_json_file.film_json>

    :Films :[films.id] {
        :name [films.name] ;
        :year :[films.year] ;
        :country [films.country] ;
        :director [films.directors] ;
    }

In the example above, which is extracted from the first example of this specification, two iterators are defined and then merged using the union expression. A shape called ':Films' is then defined, whose subjects are taken from the id attribute of the films union. As explained earlier, the id attribute is the union of the id fields of both iterators. Four predicate-object tuples follow. These tuples generate the different triples, as they complete the basic subject-predicate-object structure of RDF triples. Notice that when a ':' precedes an expression, as with films.year, a URI is generated using the given prefix. In the example below, the same transformation is done using field expressions instead of iterator expressions.

    ITERATOR film_xml <xpath: //film> {
        FIELD id <@id>
        FIELD name <name>
        FIELD year <year>
        FIELD country <country>
        FIELD directors <directors/director>
    }
    ITERATOR film_json <jsonpath: $.films[*]> {
        FIELD id <id>
        FIELD name <name>
        FIELD year <year>
        FIELD country <country>
        FIELD directors <director>
    }
    EXPRESSION films_ids <films_xml_file.film_xml.id UNION films_json_file.film_json.id>
    EXPRESSION films_names <films_xml_file.film_xml.name UNION films_json_file.film_json.name>
    EXPRESSION films_years <films_xml_file.film_xml.year UNION films_json_file.film_json.year>
    EXPRESSION films_countries <films_xml_file.film_xml.country UNION films_json_file.film_json.country>
    EXPRESSION films_directors <films_xml_file.film_xml.directors UNION films_json_file.film_json.directors>

    :Films :[films_ids] {
        :name [films_names] ;
        :year :[films_years] ;
        :country [films_countries] ;
        :director [films_directors] ;
    }
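
The prefix rule described in this section (a prefix before the square brackets yields a URI, no prefix yields a literal) can be sketched in Python as follows. The film record is invented sample data, the helper names are hypothetical, and the output is written in an N-Triples-like form without any escaping, so this is an illustration rather than a description of the engine.

```python
# The ':' prefix as declared in the Prefix section of this document.
PREFIXES = {":": "http://example.com/"}

def node(value, prefix=None):
    """Return a URI when a prefix is given and a plain literal otherwise."""
    if prefix is None:
        return '"%s"' % value
    return "<%s%s>" % (PREFIXES[prefix], value)

def film_triples(film):
    subject = node(film["id"], ":")                   # :[films.id]
    pairs = [
        (node("name", ":"), node(film["name"])),        # :name [films.name]
        (node("year", ":"), node(film["year"], ":")),   # :year :[films.year]
    ]
    return ["%s %s %s ." % (subject, predicate, obj) for predicate, obj in pairs]

# Invented record standing in for one result of the films expression.
triples = film_triples({"id": "1", "name": "Dunkirk", "year": "2017"})
for triple in triples:
    print(triple)
```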

Linking shapes

Sometimes a type has nested subtypes. In ShEx, shapes can be linked to state that an object has to comply with a certain shape. In ShExML, linking triggers the generation of the new entity, and the two entities are linked in RDF using the subject of the nested entity. In the example below, an :Actor shape is linked to the :Films shape.

    :Films :[films.id] {
        :name [films.name] ;
        :year :[films.year] ;
        :country [films.country] ;
        :director [films.directors] ;
        :cast @:Actor ;
    }

    :Actor :[films.actors.id] {
        :name [films.actors.name] ;
    }

Matcher

As previously described, a matcher can be used to substitute a result with another string in order to match an existing URI in the LOD cloud. It is first declared in the declarations part and then used inside the shape, in the expression whose results are to be replaced. In the example below, mentions of Spain in Spanish, French, German and Portuguese are converted to the English version to match the URI http://dbpedia.org/resource/Spain.

    MATCHER spain <España, Espagne, Spanien, Espanha AS Spain>

    :Films :[films.id] {
        :name [films.name] ;
        :year :[films.year] ;
        :country dbr:[films.country MATCHING spain] ;
        :director [films.directors] ;
    }

Data types (static version)

Object generation clauses can be tagged with an XML Schema data type when a literal is generated (i.e., without a prefix). To output a specific type, declare it after the object generation clause (see the example below).

    :Films :[films.id] {
        :name [films.name] xsd:string ;
    }

Data types (dynamic version)

Data type values can also be retrieved from the input sources. To do so, a generation clause is placed next to the object generation clause (with or without a prefix, depending on the input value).

    :Films :[films.id] {
        :name [films.name] xsd:[films.datatype] ;
        :year [films.year] [films.yearDatatype] ;
    }

Lang tags (static version)

Object generation clauses can be tagged with a lang tag (conformant to BCP 47 [[BCP47]]) indicating the language of the output string. To output a specific lang tag, declare it after the object generation clause (see the example below).

    :Films :[films.id] {
        :name [films.name] @en ;
    }

Lang tags (dynamic version)

Lang tag values can also be retrieved from the input sources. To do so, a generation clause is placed next to the object generation clause.

    :Films :[films.id] {
        :name [films.name] @[films.lang] ;
    }
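
For readers less familiar with RDF serialisations, the Python sketch below shows how a typed or language-tagged literal is written in Turtle, which is what the data type and lang tag clauses attach to a generated literal. The function name and sample values are invented, escaping is ignored, and the actual output depends on the serialisation the engine is asked to produce.

```python
def literal(value, datatype=None, lang=None):
    """Format a literal in Turtle style with an optional data type or lang tag."""
    if datatype is not None and lang is not None:
        raise ValueError("a literal carries either a data type or a lang tag, not both")
    if lang is not None:
        return '"%s"@%s' % (value, lang)
    if datatype is not None:
        return '"%s"^^%s' % (value, datatype)
    return '"%s"' % value

# Static tags are written in the mapping; dynamic ones would come from the input.
print(literal("Dunkirk", lang="en"))
print(literal("2017", datatype="xsd:integer"))
print(literal("Dunkirk"))
```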

Invoking functions

To invoke a function from a previously declared Scala class, the function call is placed in the generation expression and the value extraction expressions are passed as its arguments. The syntax follows the conventions used in object-oriented languages like Java and Scala (see the example below).

    :Films :[films.id] {
        :name [helper.allCapitals(films.name)] ;
        :year [helper.addOne(films.year)] ;
    }

Conditional statement generation

A statement can be generated conditionally. Conditions are expressed as value extraction expressions or function calls, and in both cases must return a boolean value (true or false). A condition can appear in a subject expression (conditioning the generation of all the triples with that subject) or in an object expression (conditioning the generation of the triple containing that object). The example below shows how a function can be used in both the subject and the object expressions.

    :Films :[films.id IF helper.isBefore2010(films.year)] {
        :name [films.name] ;
        :year [films.year] ;
        :countryOfOrigin [films.country IF helper.outsideUSA(films.country)] ;
    }
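
The gating behaviour of subject and object conditions can be modelled in Python as shown below. The two predicates mirror the names used in the example, but their bodies are guesses made for this sketch (the specification does not define them), and the film records are invented.

```python
def is_before_2010(year):
    # Assumed meaning of helper.isBefore2010; not defined in this document.
    return int(year) < 2010

def outside_usa(country):
    # Assumed meaning of helper.outsideUSA; not defined in this document.
    return country != "USA"

def generate(film):
    """Return (subject, predicate, object) tuples for one film."""
    if not is_before_2010(film["year"]):
        return []  # a false subject condition suppresses every triple of the subject
    triples = [
        (film["id"], "name", film["name"]),
        (film["id"], "year", film["year"]),
    ]
    if outside_usa(film["country"]):  # an object condition gates only its own triple
        triples.append((film["id"], "countryOfOrigin", film["country"]))
    return triples

old_foreign = generate({"id": "1", "name": "A", "year": "2005", "country": "Spain"})
old_domestic = generate({"id": "2", "name": "B", "year": "2005", "country": "USA"})
recent = generate({"id": "3", "name": "C", "year": "2015", "country": "Spain"})
print(len(old_foreign), len(old_domestic), len(recent))
```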

Graphs

Graphs are the mechanism for generating named graphs in ShExML, and they follow the syntax proposed for ShEx. A graph can enclose several shapes, and the triples they generate are placed under the indicated named graph. Shapes outside any graph definition are bound to the default graph, as described in the RDF Datasets specification [[RDFDatasets]]. In the example below, all triples generated by the :Films shape are placed under the :MyFilms graph.

:MyFilms [[
    :Films :[films.id] {
        :name [films.name] ;
        :year :[films.year] ;
        :country [films.country] ;
        :director [films.directors] ;
    }
]]
