This document describes the Shape Expressions Mapping Language (hereinafter ShExML), a heterogeneous data mapping language based on Shape Expressions (ShEx). ShExML aims to let users map and merge heterogeneous data sources into a single RDF representation. Because it is based on ShEx, its syntax is similar to that of ShEx and the gap between the two is small. The rest of the document describes the syntax of ShExML and how each element can be used.

Introduction

A ShExML script can be divided into two main blocks: declarations and generators. The declarations part lets users define variables that will be used later in other declarations and in generators. The generators, in turn, define the pattern of the output.

Declarations comprise prefixes, sources, iterators, fields and expressions. Prefixes define the prefixes that will be used in the RDF output; this is the same concept as in Turtle and SPARQL. Sources define the files from which the user wants to take the information to process. Iterators identify the parts of a document that contain multiple entities, over which the engine must therefore iterate. Fields define the queries that retrieve the different values of an entity. Expressions define how to merge or transform the obtained values.

Generators are divided into graphs and shapes (generating quads and triples respectively). A shape consists of a shape name (a variable used to refer to it), a subject expression that generates the different subjects, and a set of terminal and object expressions that generate the triples for the various entities. For more information about how shapes are conceived, please refer to the ShEx specification [[SHEX]]. A graph consists of a graph name and various shapes. The graph name is used to name the generated named graph.

ShExML at a glance

This section presents a simple example to aid understanding of the language and of the rest of the document.

As previously stated, a ShExML script can be divided into two main parts: declarations and generators.

In the declarations part, a prefix is defined: the ':' prefix for the example URI. This prefix is used later in the shape construction. Then two sources are defined, one for the XML file and one for the JSON file. Two iterators are defined, one for each of the sources. Each iterator has a base query and several partial queries, one for each field. Finally, a union of the two iterators is defined, which combines the results of their fields.

In the generators part, a shape is defined to specify the form of the output data. The :Films shape obtains its subjects from the films.ids expression (this is possible because the union of the two iterators has been defined, so its values can be extracted with the '.' accessor). Then four tuples of predicate and object are defined. Each tuple defines a predicate (in the form of prefix + name) and an object, which will normally be an expression. Triples are thus created in the form of subject, predicate and object.

Declarations

Prefix

A prefix declares a variable that is replaced by the corresponding URI wherever it is used. Prefixes normally serve as a shorter version of the URI and avoid repeating long URIs throughout the document. This is the same notion as in Turtle and SPARQL. In ShExML, they are written as shown in the following example.
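
The substitution itself can be modelled in a few lines of Python. This is only an illustrative sketch, not ShExML syntax and not the engine's implementation; the namespace bound to ':' and the `expand` helper are hypothetical.

```python
# Illustrative model of prefix substitution; not the ShExML engine's code.
PREFIXES = {
    "": "http://example.com/",                  # hypothetical namespace for ':'
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

def expand(prefixed_name, prefixes=PREFIXES):
    """Replace the prefix of a name such as ':Films' with its URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    return prefixes[prefix] + local_name

print(expand(":Films"))
print(expand("xsd:integer"))
```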

Source

A source declaration defines a data source from a given URL. The URL points to a file in one of the supported formats; for databases, the JDBC URL is used instead. A variable is defined to refer to this declaration. The following example shows how to declare three files, one from the internet and two local files (given by absolute and relative paths). In addition, a database connection is defined under the database variable.

Types of allowed sources

ShExML is designed to process many different formats. Many formats are therefore allowed, and more are planned for the future. Currently, the following are fully supported:

  • JSON
  • XML
  • CSV
  • Relational Databases (using JDBC URL)
  • RDF via SPARQL queries

Wildcards

To allow the conversion of multiple files, the "*" wildcard can be used as shown in the following example. Thus, example*.xml will match all XML files whose names begin with "example".
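
The matching behaviour described here resembles shell-style globbing. The Python sketch below illustrates it with the standard `fnmatch` module, assuming (this is an assumption, not something the specification states) that "*" stands for any run of characters; the file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing.
file_names = ["example1.xml", "example_films.xml", "other.xml", "example.json"]

# Keep only the names matching the wildcard pattern from the text.
matched = [name for name in file_names if fnmatchcase(name, "example*.xml")]
print(matched)
```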

Query

A query declaration is used whenever a query is to be separated from the iterator definition (e.g., long or multi-line queries). The query can be defined inline or externalised in a file. When it is defined inline, the same syntax as in iterators is used (see the first example). For external files, a URL is expected, as in the Source declaration. In that case there is no need to use any iterator keyword, as it is inferred from the file extension.

Iterator

An iterator is used whenever a data structure is repeated within the file. A query specifies where the iterator must iterate. The iterator accepts XPath, JSONPath and SQL queries, which must be identified with the 'xpath:', 'jsonpath:' or 'sql:' keyword before the query. The example below defines one iterator for XPath and one for JSONPath. The following subsections describe the elements that can be nested inside iterators. The '{' and '}' symbols open and close the block of nested content respectively.

Iterator types

The following iterators and their keywords are currently available in ShExML, one for each available format type.

  • xpath: (XPath query for XML files)
  • jsonpath: (JSONPath query for JSON files)
  • csvperrow (CSV iterator row by row)
  • sql: (SQL query with a table of results, to be iterated row by row)
  • sparql: (SPARQL query with a table of results, to be iterated row by row)
  • External query: queries can be defined in a separate declaration or file and referenced by their variable name (previously defined with a QUERY declaration)

Field

A field is a partial query defined relative to the iterator query. As the results of the iterator query are iterated, the partial query is applied to each one to obtain the field value or values. Note that a field may return more than one result: if the result is a list, that will be reflected in the output. The partial query is written as a normal XPath or JSONPath query but omitting the first '/' or '.' navigational symbol, because the engine attaches it. For CSV or relational database results, fields are the names of the columns to be retrieved. An example iterator with fields is presented below.
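
How the omitted navigational symbol is attached can be pictured with the Python sketch below. It is a simplified model under the assumption that the engine simply joins the iterator query and the partial query with '/' or '.'; the `full_query` helper and the sample queries are hypothetical.

```python
# Simplified model: the iterator query and the field's partial query are
# joined with the navigational symbol that the field omits.
SEPARATORS = {"xpath": "/", "jsonpath": "."}

def full_query(iterator_query, field_query, language):
    """Build the complete query for a field from its iterator's base query."""
    return iterator_query + SEPARATORS[language] + field_query

print(full_query("//film", "name", "xpath"))
print(full_query("$.films[*]", "name", "jsonpath"))
```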

Pushed and popped fields

Pushed and popped fields can be used to push values down into a hierarchical iteration model. For example, with JSONPath [[Jsonpath]] it is not possible to query parent nodes to get their values, as can be done with XPath. ShExML therefore offers this possibility through the PUSHED_FIELD keyword, which tells the engine to save the value when going deeper in the iteration hierarchy. The value can then be retrieved with the POPPED_FIELD keyword, using the pushed field's variable name as the query.
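
The idea of saving a parent value and reading it back at a deeper level can be sketched in Python as follows. This models the concept only, not the engine's implementation; the document structure and the field names (`film_id`, `name`) are hypothetical.

```python
# Hypothetical JSON-like input: films with nested actors.
document = {
    "films": [
        {"id": 1, "actors": [{"name": "A"}, {"name": "B"}]},
        {"id": 2, "actors": [{"name": "C"}]},
    ]
}

def iterate_actors(doc):
    results = []
    for film in doc["films"]:                 # outer iteration level
        pushed = {"film_id": film["id"]}      # like PUSHED_FIELD: save the parent value
        for actor in film["actors"]:          # deeper iteration level
            results.append({
                "name": actor["name"],
                "film_id": pushed["film_id"],  # like POPPED_FIELD: read it back
            })
    return results

actors = iterate_actors(document)
print(actors)
```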

Nested Iterator

Iterators can also be nested inside other iterators. This makes it possible to cover more complicated structures that contain nested entities. A nested iterator is defined like a normal iterator, i.e., it has the same syntax but is placed inside the main one. The engine then iterates over the results of the outer iterator and evaluates the nested one within each of them. The Field example, expanded with nested iterators, is presented below.

Expression

Expressions apply queries over different files using the previously defined declarations. They also make it possible to perform operations over the results. These operations can be applied at the iterator level or at the field level.

Basic expression

This is the most basic expression that can be defined, and it is used when only one source is involved. It is composed of a file variable and the path to the iterators and fields to apply.

Basic expression over iterators

A basic expression can be used with an iterator without specifying the fields to be accessed. The expression then produces a set of values that can be accessed later in the shapes by the names of the iterator's fields. In the example expression, a set containing the values of field1 and field2 is produced.

Basic expression over fields

A basic expression can also be used with an iterator field, in which case the expression produces only the value of that query. This kind of expression can be combined with further operations to define the output with finer granularity. In the example, only the value of field1 is taken.

Union

Unions merge the results of several basic expressions. With this operation it is possible to combine different sources to produce a new RDF graph.

Union over iterators

Unions can be used over iterators when the fields of different iterators are to be merged. For fields to be merged, they must have the same name. In the example below, field1 will be the merge of the two iterators' fields, but field2 and field3 will only have their respective values.
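
The merge-by-name rule can be illustrated with a small Python sketch. It is a deliberately flattened model (it ignores how values are grouped per entity) and not the engine's implementation; the `union` helper and the sample values are hypothetical.

```python
def union(left, right):
    """Merge two iterator results: fields with the same name are combined,
    the others keep only their own values."""
    merged = {}
    for result in (left, right):
        for field, values in result.items():
            merged.setdefault(field, []).extend(values)
    return merged

it1 = {"field1": ["a", "b"], "field2": ["x"]}
it2 = {"field1": ["c"], "field3": ["y"]}
combined = union(it1, it2)
print(combined)
```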

Union over fields

Unions can also handle the union of fields without taking the entire iterators. As with the basic expression, this offers more flexibility and granularity when needed. In the example below, field2 and field3 are merged.

Join

Joins make it possible to extract identifier information from another file when a common attribute is present. For example, if file A contains an id and a name but file B contains only a name, each name in the results can be replaced by the corresponding id wherever the names coincide.

Join over iterators

A join can be used over iterators when the operation is to be applied to all the fields of the iterator. In this case, the operation works as with fields, but matching fields by name.

Join over fields

Joins can be applied over fields like other expression operators. If we call the three results A, B and C, the syntax is A UNION B JOIN C, where the results of B are replaced by the results of A wherever C is equal to B. In the example, this is used to substitute the names of B with the ids of A where possible.
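
The substitution step of A UNION B JOIN C can be sketched in Python as below. This models only the replacement of B's values, not the union with A, and rests on two assumptions that the text does not confirm: A and C are aligned position by position (they come from the same entities), and values of B without a match are left unchanged. The `join` helper and the sample data are hypothetical.

```python
def join(a_ids, b_names, c_names):
    """Replace each value of B by the value of A whose C counterpart equals it."""
    id_by_name = dict(zip(c_names, a_ids))   # C value -> A value (assumed aligned)
    # Assumption: names without a matching C value are kept as they are.
    return [id_by_name.get(name, name) for name in b_names]

a_ids = ["1", "2"]             # ids from file A
c_names = ["alpha", "beta"]    # names from file A
b_names = ["beta", "gamma"]    # names from file B
joined = join(a_ids, b_names, c_names)
print(joined)
```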

String operation

The string operation concatenates results as a string, making it possible to transform results through string concatenation.

String operation over iterators

String operations can be used with iterators to combine fields with the same name through string concatenation. As with other expressions over iterators, the field names must be equal. In the example below, field1 of it1 is concatenated with field1 of it2, using a dash to join them. However, field2 and field3 are not concatenated, as they do not have the same name.

String operation over fields

String operations can also be used with fields, like the other expressions described in this specification. In this case, only the selected fields are concatenated. This gives more flexibility for transformations that may be less common. In the example, field2 and field3 are concatenated using a dash.
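
A rough Python model of concatenating two fields with a dash is given below. It assumes that values are paired position by position, which the text does not specify; the `concatenate` helper and the sample values are hypothetical.

```python
def concatenate(left_values, separator, right_values):
    """Join the values of two fields pairwise with a separator string."""
    # Assumption: the i-th value of one field is paired with the i-th of the other.
    return [left + separator + right
            for left, right in zip(left_values, right_values)]

field2 = ["a", "b"]
field3 = ["x", "y"]
dashed = concatenate(field2, "-", field3)
print(dashed)
```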

Matcher

Matchers can be used to replace one result with another. They are designed to change some of the results into a string that matches an existing URI in the LOD cloud, hence their name. In this example, a matcher is defined for the region of Asturias, where all the possible occurrences in different languages are matched to Asturias as it appears in the URI http://dbpedia.org/resource/Asturias. An example of use is shown in the shapes section.

Multiple entry Matcher

To avoid defining multiple matchers for different terms, they can be grouped into a single matcher using the "&" operator. The previous example, extended with another entry, looks like this:
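
Conceptually, a matcher with several entries behaves like a lookup table from variants to a canonical string. The Python sketch below illustrates this; it is not ShExML syntax, the variants listed are only illustrative, and passing unmatched values through unchanged is an assumption.

```python
# Two entries grouped in one table, as a multiple entry matcher would group them.
MATCHER = {
    "Principado de Asturias": "Asturias",
    "Principáu d'Asturies": "Asturias",
    "España": "Spain",
    "Espagne": "Spain",
}

def match(value, matcher=MATCHER):
    """Return the canonical string for a value (unchanged if no entry applies)."""
    return matcher.get(value, value)

uri = "http://dbpedia.org/resource/" + match("Principáu d'Asturies")
print(uri)
```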

Autoincrement ids

When the content has no natural ids, it can be useful to build your own, like autoincrement ids in databases. In ShExML, the AUTOINCREMENT keyword defines an autoincremental id to be used as the subject of triples. It is defined as the concatenation of a beginning string (optional), a range definition (mandatory) and an ending string (optional). The range definition consists of a beginning integer (mandatory), an ending integer (optional, introduced by the to keyword, defaulting to infinite) and a step increment (optional, introduced by the by keyword, defaulting to 1). In the following example we define an id which will generate the ids my0Id, my2Id, my4Id, my6Id, my8Id and my10Id.
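
The sequence listed above can be reproduced with a short Python generator. This is a sketch of the described behaviour rather than the engine's code; the `autoincrement` helper and its parameter names are hypothetical, and the end of the range is treated as inclusive because my10Id appears in the list.

```python
from itertools import count, islice

def autoincrement(prefix="", start=0, end=None, step=1, suffix=""):
    """Yield ids built as prefix + number + suffix.

    With no end the sequence is infinite; otherwise the end is inclusive.
    """
    numbers = count(start, step) if end is None else range(start, end + 1, step)
    for number in numbers:
        yield f"{prefix}{number}{suffix}"

ids = list(autoincrement("my", 0, 10, 2, "Id"))
print(ids)

# Without an end the caller decides how many ids to take.
first_three = list(islice(autoincrement("id", 1), 3))
print(first_three)
```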

Functions

Functions extend the functionality of ShExML so that the obtained values can be cleaned, normalised and transformed before they are used in triple generation. For this purpose, ShExML can invoke functions residing in an external Scala class. The class is loaded using the FUNCTIONS directive, which assigns a variable name for later use and gives the URL where the file resides, in the same manner as for data sources.

Generators

Shapes

Shapes define the form of the output data. The shape concept is taken from ShEx, and this specification explains only the basic notions of shapes needed to understand ShExML; readers who want to know more are encouraged to consult the ShEx specification [[SHEX]].

Shapes in ShExML

Shapes in ShExML are similar to those in ShEx, with some modifications. As shown in the example, a shape consists of: the shape name, which can take a prefix to define its namespace; the expression for generating the subjects of the triples, which is enclosed in square brackets and preceded by the prefix name; and a set of predicate and object tuples. Each of these tuples contains a predicate in the form of prefix plus terminal (as in Turtle) and an object, which is an expression enclosed in square brackets with an optional prefix. If the prefix is absent a literal is generated; if it is present, a URI.

In the example above, which is taken from the first example of this specification, we define two iterators that are then merged using the union expression. We then define a shape called ':Films' whose subjects are extracted from the union films by taking the id attribute. The id attribute is the union of the id fields of both iterators, as explained earlier. Four tuples of predicate and object are then defined. These tuples generate the different triples, as they constitute the basic subject-predicate-object structure of RDF triples. Notice that when a ':' is used before an expression, as in films.year, a URI is generated taking into account the given prefix. In the example below, the same transformation is done using field expressions instead of iterator expressions.
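
Independently of the ShExML examples referred to in this section, the rule that a prefixed object expression yields a URI while an unprefixed one yields a literal can be modelled in Python. The sketch below is a simplification, not the engine's implementation; the namespace, the `generate_triples` helper and the sample film are hypothetical.

```python
BASE = "http://example.com/"  # hypothetical namespace bound to ':'

def generate_triples(entities, subject_field, predicate_objects):
    """Build (subject, predicate, object) tuples for each entity.

    predicate_objects holds (predicate, field, has_prefix) entries; the object
    is rendered as a URI when has_prefix is true and as a literal otherwise.
    """
    triples = []
    for entity in entities:
        subject = f"<{BASE}{entity[subject_field]}>"
        for predicate, field, has_prefix in predicate_objects:
            value = entity[field]
            obj = f"<{BASE}{value}>" if has_prefix else f'"{value}"'
            triples.append((subject, f"<{BASE}{predicate}>", obj))
    return triples

films = [{"id": "1", "name": "Dunkirk", "year": "2017"}]
triples = generate_triples(
    films, "id", [("name", "name", False), ("year", "year", True)]
)
for triple in triples:
    print(" ".join(triple), ".")
```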

Linking shapes

Sometimes a type has nested subtypes. In ShEx, shapes can be linked to state that an object has to comply with a certain shape. In ShExML, this triggers the generation of the new entity, and the two are linked in RDF using the subject of the nested entity. In the example below, an :Actor shape is linked to the :Films shape.

Matcher

As previously described, a matcher can be used to substitute a result with another string in order to match an existing URI in the LOD cloud. It is first declared in the declarations part and then used inside the shape, in the expression whose results are to be replaced. In the example below, mentions of Spain in Spanish, French, German and Portuguese are converted to the English version to match the URI http://dbpedia.org/resource/Spain.

Data types (static version)

Object generation clauses can be tagged with an XML Schema data type if a literal is generated (i.e., no prefix is given). To output a specific type, declare it after the object generation clause (see the example below).

Data types (dynamic version)

Data type values can also be retrieved from input sources. To do so, a generation clause is placed next to the object generation clause (using a prefix that depends on the input value).

Lang tags (static version)

Object generation clauses can be tagged with a lang tag (conformant to BCP 47 [[BCP47]]) indicating the language of the output string. To output a specific lang tag, declare it after the object generation clause (see the example below).

Lang tags (dynamic version)

Lang tag values can also be retrieved from input sources. To do so, a generation clause is placed next to the object generation clause.

Invoking functions

To invoke a function from a previously declared Scala class, the function call is written in the generation expression and the value extraction expressions are passed as its arguments. The syntax follows the conventions of object-oriented languages such as Java and Scala (see the example below).

Conditional statement generation

A statement can be generated or not depending on a condition. Conditions can be expressed as value extraction expressions or function calls, which in both cases must return a boolean value (true or false). They can appear either in subject expressions (conditioning the generation of all the triples with the subject concerned) or in object expressions (conditioning the generation of the triple with the object concerned). The example below shows how a function can be used in the subject and in the object expressions.
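
The difference between the two placements of a condition can be pictured with the following Python sketch. It models the described effect only and is not ShExML syntax; the `generate` helper, the two conditions and the sample films are hypothetical.

```python
def generate(films, subject_condition, object_condition):
    triples = []
    for film in films:
        # Condition on the subject: when false, no triple for this subject.
        if not subject_condition(film):
            continue
        triples.append((film["id"], "name", film["name"]))
        # Condition on the object: when false, only this triple is skipped.
        if object_condition(film):
            triples.append((film["id"], "year", film["year"]))
    return triples

films = [
    {"id": "1", "name": "A", "year": 2005},
    {"id": "2", "name": "", "year": 1999},
    {"id": "3", "name": "C", "year": 2015},
]
result = generate(
    films,
    subject_condition=lambda film: film["name"] != "",
    object_condition=lambda film: film["year"] < 2010,
)
print(result)
```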

Graphs

Graphs are the mechanism for generating named graphs in ShExML, following the syntax proposed for ShEx. A graph can contain various shapes, whose generated triples will be placed under the indicated named graph. Shapes outside any graph definition are bound to the default graph, as described in the RDF Datasets specification [[RDFDatasets]]. In the example below, all triples generated with the :Films shape will be under the :MyFilms graph.
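
The step from triples to quads can be illustrated in Python as below. This is a conceptual sketch, not the engine's implementation; the `to_quads` helper and the sample triples are hypothetical, and `None` is used here to stand for the default graph.

```python
def to_quads(triples, graph=None):
    """Attach a graph name to each triple; None stands for the default graph."""
    return [(s, p, o, graph) for (s, p, o) in triples]

film_triples = [(":1", ":name", "Dunkirk")]   # generated by a shape inside a graph
other_triples = [(":a", ":name", "Other")]    # generated by a shape outside any graph

quads = to_quads(film_triples, ":MyFilms") + to_quads(other_triples)
print(quads)
```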