This document describes the Shape Expressions Mapping Language (hereinafter referred to as ShExML), a heterogeneous data mapping language based on Shape Expressions (ShEx). ShExML aims to let users map and merge their heterogeneous data sources into a single RDF representation. Because it is based on ShEx, its syntax is similar to that of ShEx and the gap between the two languages is small. The rest of the document describes the syntax of ShExML and how each element can be used.

Introduction

A ShExML script can be divided into two main blocks: declarations and shapes. The declarations let users define variables that are used later in other declarations and in the shapes. The shapes, in turn, define the pattern of the output.

The declarations comprise prefixes, sources, iterators, fields and expressions. Prefixes define the prefixes used in the RDF output; this is the same concept as in Turtle and SPARQL. Sources define the files from which the user wants to take the information to process. Iterators define which part of a document contains multiple entities that the engine must iterate over. Fields define the queries that obtain the different values of an entity. Expressions define how to merge or transform the obtained values.

A shape is composed of a shape name (a variable used to refer to it), a subject expression that generates the different subjects, and a set of terminal and object expressions that generate the triples for the various entities. For more information about how shapes are conceived, please refer to the ShEx specification [[SHEX]].

ShExML at a glance

This section shows a simple example to aid understanding of the language and of the rest of the document.

As previously stated, a ShExML script can be divided into two main parts: declarations and shapes.

In the declarations part, a prefix is defined: the ':' prefix for the example URI. This prefix is used later in the shape construction. Then two sources are defined, one for the XML file and one for the JSON file. Two iterators are defined, one for each source. Each iterator has a base query and several partial queries, one per field. Finally, a union of the two iterators is defined, which combines the results of their fields.

In the shapes part, a shape is defined to specify the form of the output data. The :Films shape obtains its subjects from the films.id expression (this is possible because the union of the two iterators has been defined, so its values can be extracted with the '.' accessor). Then four predicate-object tuples are defined. Each tuple defines a predicate (in the form prefix + name) and an object, which is normally an expression. Triples are therefore created in the form subject, predicate, object.

Declarations

Prefix

A prefix declares a variable that is replaced by the corresponding URI wherever it is used. Prefixes normally serve as a shorter form of the URI and avoid repeating long URIs throughout the document. This is the same notion as in Turtle and SPARQL. In ShExML, they are written as shown in the following example.
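Separately from the ShExML syntax, the substitution that a prefix performs can be sketched in Python. This is only an illustration of the concept: the prefix table, its URIs and the `expand` helper are assumptions made for this sketch and are not part of ShExML.

```python
# Hypothetical prefix table; names and URIs are illustrative assumptions.
PREFIXES = {
    "": "http://example.com/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Replace the prefix part of 'prefix:local' with its declared URI."""
    prefix, separator, local = prefixed_name.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return prefixes[prefix] + local


films_uri = expand(":Films")
string_uri = expand("xsd:string")
```

The empty key models the ':' prefix, so `:Films` and `xsd:string` are both resolved through the same lookup.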

Source

A source declares a data source located at a given URL. The URL links to a file in one of the supported formats. A variable is defined to refer to this declaration. The following example shows how to declare two files, one from the internet and the other from local storage.

Iterator

An iterator is used whenever a data structure can be repeated throughout the file. A query specifies where the iterator must iterate. The iterator accepts XPath and JSONPath queries, which must be identified with the 'xpath:' or 'jsonpath:' keyword before the query. The example below defines one iterator for XPath and one for JSONPath. The following subsections describe the elements that can be nested inside iterators. The '{' and '}' symbols open and close the block of nested content, respectively.
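As a rough illustration of what iterating over repeated entities means, and not of ShExML syntax or of how any ShExML engine is implemented, the following Python sketch selects the repeated entities in a small XML document and a small JSON document. The data and the queries are hypothetical. The Python standard library supports only a limited XPath subset and has no JSONPath implementation, so plain indexing stands in for a JSONPath query here.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical input documents.
XML_DATA = "<films><film id='1'/><film id='2'/></films>"
JSON_DATA = '{"films": [{"id": 3}, {"id": 4}]}'

# XPath-style iteration: every <film> element is one entity.
xml_entities = ET.fromstring(XML_DATA).findall(".//film")

# Stand-in for a JSONPath query such as $.films[*] (assumed query).
json_entities = json.loads(JSON_DATA)["films"]

xml_ids = [entity.get("id") for entity in xml_entities]
json_ids = [entity["id"] for entity in json_entities]
```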

Field

A field is a partial query defined relative to the iterator query. For each result of the iterator query, the partial query is applied to obtain the field value or values. Note that a field can return more than one result: if the query yields a list of results, that list is reflected in the output. The partial query is written as a normal XPath or JSONPath query, but omitting the leading '/' or '.' navigational symbol, because the engine adds it. An example iterator with fields is presented below.
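The relationship between the iterator query and the partial queries can also be sketched in Python. This is a minimal sketch under assumptions: the XML data, the field names and the `field` helper are hypothetical, and the helper covers only attribute and child-path queries.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document.
XML_DATA = """<films>
  <film id="1"><name>Film A</name>
    <directors><director>Ann</director><director>Bob</director></directors>
  </film>
  <film id="2"><name>Film B</name>
    <directors><director>Cy</director></directors>
  </film>
</films>"""


def field(node, partial_query):
    """Apply a partial query relative to one iterated node; returns a list."""
    if partial_query.startswith("@"):
        value = node.get(partial_query[1:])
        return [] if value is None else [value]
    return [child.text for child in node.findall(partial_query)]


rows = [
    {
        "id": field(film, "@id"),
        "name": field(film, "name"),
        "directors": field(film, "directors/director"),
    }
    for film in ET.fromstring(XML_DATA).findall(".//film")
]
```

Each field is kept as a list, so a query that matches several nodes (the directors of the first film) yields several values for the same entity.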

Nested Iterator

Iterators can also be nested inside other iterators. This makes it possible to cover more complicated structures that contain nested entities. A nested iterator is defined like a normal iterator, i.e., it has the same syntax but is placed inside the main one. The engine then iterates over the results of the first iterator. The Field example, expanded with a nested iterator, is presented below.
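The nesting can be pictured as one loop inside another. The Python sketch below uses hypothetical data and field names; it illustrates the idea only and does not reflect ShExML syntax.

```python
# Hypothetical nested data: each film contains a list of actors.
FILMS = [
    {"id": 1, "name": "Film A",
     "actors": [{"name": "Ann", "role": "lead"}, {"name": "Bob", "role": "support"}]},
    {"id": 2, "name": "Film B",
     "actors": [{"name": "Cy", "role": "lead"}]},
]


def iterate_films(films):
    """Outer iterator over films; the nested iterator runs once per film."""
    for film in films:
        yield {
            "id": film["id"],
            "name": film["name"],
            # Nested iterator: evaluated over the current film only.
            "actors": [
                {"name": actor["name"], "role": actor["role"]}
                for actor in film["actors"]
            ],
        }


result = list(iterate_films(FILMS))
```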

Expression

Expressions apply queries over different files by using the previously defined declarations. They also allow operations to be performed on the results. These operations can be applied at the iterator level or at the field level.

Basic expression

This is the most basic expression that can be defined, and it is used when only one source is involved. It is composed of a file variable and the route to the iterators and fields to apply.

Basic expression over iterators

A basic expression can refer to an iterator without specifying the fields to be accessed. The expression then produces a set of values that can be accessed later in the shapes by the names of the iterator's fields. In the example expression, a set containing the values of field1 and field2 is produced.

Basic expression over fields

A basic expression can also refer to a field of an iterator. The expression then produces only the value of that query. This kind of expression can be combined with further operations to define the output with finer granularity. In the example, only the value of field1 is taken.

Union

Unions merge the results of several basic expressions. With this operation it is possible to combine different sources to produce a new RDF graph.

Union over iterators

Unions can be used over iterators when the fields of different iterators are to be merged. The requirement is that the fields to be merged have the same name. In the example below, field1 is the merge of the field1 values of the two iterators, whereas field2 and field3 keep only their respective values.
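The name-based merge can be sketched in Python as follows. Each iterator result is modelled as a mapping from field name to a list of values; this representation, the sample values and the ordering of the merged values are assumptions of the sketch, not behaviour stated by this specification.

```python
def union_iterators(left, right):
    """Merge two iterator results; fields with the same name are combined."""
    merged = {name: list(values) for name, values in left.items()}
    for name, values in right.items():
        merged.setdefault(name, []).extend(values)
    return merged


# Hypothetical results of two iterators.
it1 = {"field1": ["a", "b"], "field2": ["x"]}
it2 = {"field1": ["c"], "field3": ["y"]}

merged = union_iterators(it1, it2)
```

Here `field1` ends up with the values of both iterators, while `field2` and `field3` keep only their own.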

Union over fields

Unions can also merge individual fields without taking the entire iterators. As with basic expressions, this offers more flexibility and granularity when needed. In the example below, field2 and field3 are merged.

Join

Joins extract identifier information from another file when a common attribute is present. For example, if file A contains an id and a name but file B contains only a name, the name in the result can be replaced by the corresponding id whenever the names coincide.

Join over iterators

A join can be used over iterators when the operation is to be applied to all the fields of the iterator. In this case, the operation works as with fields, but takes the equality of field names into account.

Join over fields

Joins can be applied over fields like the other expression operators. Calling the three results A, B and C, the syntax is A UNION B JOIN C, where the results of B are replaced by the results of A when C is equal to B. In the example, this substitutes the names of B with the ids of A wherever possible.
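The substitution described for A UNION B JOIN C can be approximated in Python. This sketch assumes that A and C come from the same rows of one file, so that the i-th value of C corresponds to the i-th value of A; the sample values and the plain concatenation used for the union are also assumptions, and the sketch does not say how an engine orders or de-duplicates results.

```python
def join(a, b, c):
    """Replace each value of b by the a-value whose aligned c-value equals it."""
    lookup = dict(zip(c, a))  # c[i] -> a[i]; assumes a and c are row-aligned
    return [lookup.get(value, value) for value in b]


# Hypothetical data: file A has ids and names, file B has names only.
ids_a = ["1", "2"]
names_a = ["Film A", "Film B"]
names_b = ["Film B", "Film C"]

joined = join(ids_a, names_b, names_a)
result = ids_a + joined  # A UNION (B JOIN C), modelled as concatenation
```

"Film B" is replaced by the id "2" because the same name exists in file A, whereas "Film C" has no counterpart and is left unchanged.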

String operation

A string operation concatenates results as strings, making it possible to transform the results through string concatenation.

String operation over iterators

String operations can be used with iterators to combine fields that have the same name through string concatenation. As with other expressions over iterators, the field names must be equal. In the example below, field1 of it1 is concatenated with field1 of it2 using a dash to join them. However, field2 and field3 are not concatenated, as they do not share a name.
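The same-name rule can be sketched in Python for a single row from each iterator. The row representation and the sample values are hypothetical, and keeping the unmatched fields unchanged in the result is an assumption of this sketch.

```python
def concat_iterators(left, right, separator):
    """Concatenate same-named fields of two rows; other fields are kept as-is."""
    result = {}
    for name in left.keys() | right.keys():
        if name in left and name in right:
            result[name] = left[name] + separator + right[name]
        else:
            result[name] = left[name] if name in left else right[name]
    return result


# Hypothetical rows produced by two iterators, it1 and it2.
it1_row = {"field1": "a", "field2": "x"}
it2_row = {"field1": "b", "field3": "y"}

combined = concat_iterators(it1_row, it2_row, "-")
```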

String operation over fields

String operations can also be used with fields, like the other expressions covered in this specification. In this case, only the selected fields are concatenated. This gives more flexibility for transformations that are less common. In the example, field2 and field3 are concatenated using a dash.

Matcher

Matchers replace one result with another. They are designed to change some of the results into a string that matches an existing URI in the LOD cloud, hence the name. In this example, a matcher is defined for the region of Asturias: all the possible occurrences in different languages are matched to Asturias, as it appears in the URI http://dbpedia.org/resource/Asturias. An example of use is shown in the Shapes section.
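The replacement performed by a matcher can be illustrated with a small Python sketch. The list of variants and the `make_matcher` helper are assumptions chosen for illustration; only the target string Asturias and the DBpedia URI come from the text above.

```python
def make_matcher(variants, replacement):
    """Return a function that maps any listed variant to the replacement."""
    variants = set(variants)

    def match(value):
        return replacement if value in variants else value

    return match


# Illustrative variants of the region name in different languages (assumed).
to_asturias = make_matcher(
    ["Principado de Asturias", "Principáu d'Asturies",
     "Principality of Asturias", "Asturies"],
    "Asturias",
)

uri = "http://dbpedia.org/resource/" + to_asturias("Asturies")
unchanged = to_asturias("Madrid")
```

Values that are not listed pass through unchanged, so the matcher only affects the results it was declared for.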

Shapes

Shapes define the form of the output data. The shape concept is taken from ShEx, and this specification explains only the basic notions of shapes that are needed to understand ShExML. Readers who want further detail are encouraged to consult the ShEx specification [[SHEX]].

Shapes in ShExML

Shapes in ShExML are similar to those in ShEx, with some modifications. As shown in the example, a shape is composed of: the shape name, which can take a prefix to define its namespace; the expression that generates the subjects of the triples, which is enclosed in square brackets and preceded by the prefix name; and a set of predicate-object tuples. Each tuple contains a predicate in the form prefix plus terminal (as in Turtle) and an object, which is an expression enclosed in square brackets with an optional prefix. If the prefix is absent, a literal is generated; if it is present, a URI is generated.

In the example above, which is taken from the first example of this specification, two iterators are defined and then merged using the union expression. Then a shape called ':Films' is defined, whose subjects are extracted from the union films by taking the id attribute. The id attribute is the union of the id fields of both iterators, as explained earlier. Then four predicate-object tuples are defined. These tuples generate the different triples, as they constitute the basic subject-predicate-object structure of RDF triples. Notice that when a ':' is used before an expression, as in films.year, a URI is generated using the given prefix. In the example below, the same transformation is performed using field expressions instead of iterator expressions.
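Independently of the ShExML listings, the rule that a prefixed object yields a URI and an unprefixed one yields a literal can be sketched in Python. The prefix URI, the sample row and the `generate_triples` helper are assumptions of this sketch; it imitates the described behaviour and is not an engine implementation.

```python
PREFIX = "http://example.com/"  # assumed URI for the ':' prefix


def generate_triples(rows, subject_field, predicate_objects):
    """Build triples; predicate_objects holds (predicate, field, has_prefix)."""
    triples = []
    for row in rows:
        subject = f"<{PREFIX}{row[subject_field]}>"
        for predicate, field_name, has_prefix in predicate_objects:
            value = row[field_name]
            # A prefixed object becomes a URI, an unprefixed one a literal.
            obj = f"<{PREFIX}{value}>" if has_prefix else f'"{value}"'
            triples.append((subject, f"<{PREFIX}{predicate}>", obj))
    return triples


# Hypothetical row, as if produced by the films expression.
rows = [{"id": "1", "name": "Film A", "year": "2017"}]
triples = generate_triples(
    rows, "id", [("name", "name", False), ("year", "year", True)]
)
```

The name is emitted as a literal, whereas the year, which carries the prefix, is emitted as a URI in the example namespace.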

Linking shapes

Sometimes a type has nested subtypes. In ShEx, shapes can be linked to state that an object has to comply with a certain shape. In ShExML, linking triggers the generation of the new entity, and the two entities are linked in RDF through the subject of the nested entity. In the example below, an :Actor shape is linked to the :Films shape.
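The effect of linking can be sketched in Python: generating a film also generates its actors, and the film points to each actor through the actor's subject. The prefix URI, the data, the predicate names and the way the actor subject is built are all assumptions of this sketch.

```python
PREFIX = "http://example.com/"  # assumed URI for the ':' prefix


def film_triples(film):
    """Generate triples for a film and for the actor entities linked to it."""
    film_subject = PREFIX + str(film["id"])
    triples = [(film_subject, PREFIX + "name", film["name"])]
    for actor in film["actors"]:
        actor_subject = PREFIX + actor["name"]  # subject of the nested entity
        # The link: the film's object is the subject of the actor entity.
        triples.append((film_subject, PREFIX + "actor", actor_subject))
        triples.append((actor_subject, PREFIX + "name", actor["name"]))
    return triples


# Hypothetical film with one nested actor.
film = {"id": 1, "name": "Film A", "actors": [{"name": "Ann"}]}
linked = film_triples(film)
```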

Matcher

As previously described, a matcher can be used to replace a result with another string in order to match an existing URI in the LOD cloud. It is first declared in the declarations part and then used inside the shape, in the expression whose results are to be replaced. In the example below, mentions of Spain in Spanish, French, German and Portuguese are converted to the English version to match the URI http://dbpedia.org/resource/Spain.