Goethe Johann Wolfgang Goethe Universitaet ICS-FORTH



RDF Source related Storage System (RDF-S3) and easy RDF Query Language (eRQL)
Installation & Usage Instructions (Version 1.8)



Download RDFS3     


Installation

  1. Download the latest version of RDFS3
  2. Unzip the package. This will create the directory RDFS3, which should contain the following:
    • classes/ (directory) - Containing the java classes of RDF-S3.
    • Doc/ (directory) - Containing the JavaDoc documentation.
    • src/ (directory) - Containing all java source files of RDFS3.
    • jars/ (directory) - Containing the required jar files: xsdlib.jar and relaxngDatatype.jar of the Sun XML Datatypes Library, java_cup.jar and JFlex.jar of CUP (0.10j) and JFlex (1.3.5), and vrp3.0.jar for the current version of VRP.
    • Manifest/ (directory) - Containing the manifest file Manifest.mf that is included in rdfs3.jar under the jars folder.
    • GIF/ (directory) - Containing pictures used on this HTML page.
    • pdf/ (directory) - Containing some PDF files with further information on the tool.
    • RDF_examples/ (directory) - Containing some valid RDF examples.
    • HowToUse.htm - Contains the installation & usage information.
    • RDFS3.rdf - The schema used for storing results of eRQL or processing times for RDF-S3 in RDF/XML encoding.
    • runRDFS3.bat - Execute this script (i.e., double click on it) to run RDFS3 in a Windows environment. First, you should set the variables JAVA_HOME and RDFS3_HOME found inside the file.
    • runRDFS3 - Execute this script to run RDFS3 in a Unix environment, i.e., type the command source ./runRDFS3. Before typing the command, you should set the variables JAVA_HOME and RDFS3_HOME found inside the file.
    • Changes.txt - Contains information about the changes from one version of RDFS3 to the next.
    • .project and .classpath - Can be used to import the RDFS3 project into Eclipse (open source) or WebSphere Studio from IBM.
  3. To run RDF-S3 (for storing RDF data inside the database):
    • In Windows environment
      • Set the variables JAVA_HOME, JDBC_Driver and RDFS3_HOME found inside the script paths.bat
      • Execute runRDFS3.bat script e.g., double click on it
    • In Unix environment
      • Set the variables JAVA_HOME, JDBC and RDFS3_HOME found inside the runRDFS3 script
      • Execute runRDFS3 script i.e., type source ./runRDFS3
  4. To run easy RQL (eRQL) (to retrieve stored data from the database):
    • In Windows environment
      • Set the variables JAVA_HOME, JDBC_Driver and RDFS3_HOME found inside the script paths.bat
      • Execute runeRQL.bat script e.g., double click on it
    • In Unix environment
      • Set the variables JAVA_HOME, JDBC and RDFS3_HOME found inside the runeRQL script
      • Execute runeRQL script i.e., type source ./runeRQL

Note1: To run RDFS3 you need Java v1.4
Note2: If your RDF files are large, you should use the -mx parameter to increase the main memory Java can allocate. See the runRDFS3 scripts for where it is set.
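To check that a memory setting actually took effect, a small program started with the same flag can report the heap limit of the JVM. This is only a minimal sketch; the class name HeapCheck is made up for illustration and is not part of RDFS3.

```java
// Hypothetical helper, not part of RDFS3: reports the JVM heap limit.
class HeapCheck {
    // Maximum amount of memory the JVM will try to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it with the same memory parameter as in the runRDFS3 script should print a correspondingly larger value.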



RDF-S3     


Introduction to RDF-S3

The RDF-Source related Storage System (RDF-S3) is an application that:

  • keeps track of the source of each stored triple, without affecting the RDF model,
  • allows update and deletion of inserted sources,
  • tries to overcome the query performance problems of the GenRepr and SpecRepr approaches by merging these approaches,
  • can be used with any SQL3 conform RDBMS via JDBC (tested with IBM DB2 Universal Database and MySQL),
  • provides graphical user interfaces for easy handling,
  • provides an API to handle the additional source information of each triple.

The internal structure of RDF-S3 is shown in Figure 1. It consists of a loader that stores the data in the repository and an API to access the RDF data together with their source information. Figure 1 also shows that the loader is built on top of the ICS-Validating RDF Parser (VRP). This is similar to RSSDB, the database loader of the ICS-RDF-Suite, although RSSDB does not support source tracking. Using VRP lets the user validate the data on a semantic level against RDF constraints, e.g., domain and range constraints, before entering them into the database. It is important to note that validating each single source does not guarantee that the combination of different sources will be valid. Nevertheless, work in this area showed that the semantic validation of VRP results in a clear improvement of data quality.

To enable this validation, an internal in-memory model needs to be created. For very large files this might not be practical, since memory is limited. RDF-S3 therefore additionally allows a stream-based insertion of triples without semantic validation, which does not need such an in-memory model. Note that stream-based insertion takes more time, since RDF-S3 has to enter each triple separately and cannot benefit from the sorting and knowledge provided by the internal VRP model. Another way to handle big RDF files is to split them and load the parts separately. Splitting a file at the wrong places can cause errors, too.

Figure 1. Overview of the internal structure and workflow of RDF-S3.

The RDF-S3 API extends classes that are used for the VRP internal RDF model. Depending on the Java type casting, the Java objects returned by the RDF-S3 API can either be viewed as representing pure RDF, or as representing RDF enriched with source information. This means that tools just dealing with the RDF model (without dealing with source information) can be easily linked up with RDF-S3.

To load RDF data into the database the graphical user interface (GUI) can be used. The GUI provides the ability to enter the information for the database connection and authentication and the settings for the underlying VRP parser.

Before the data can be loaded into the database some initial tables need to be created. Table I gives an overview and a short description of these initial tables. This initialization can be started under the menu item Database of the GUI. The namespaces of RDF and RDFS belong to each RDF graph. Therefore, they will be stored into the repository directly after creating the initial tables. To do so, RDF-S3 uses the RDF/XML serialization as it is stored within VRP's vocabulary package, but they can be updated later on as any other source.

TABLE I: The tables that will be generated during the initialization phase of the database.

Database Table Name       Description
Resources                 For storing all resources and literals and mapping them to their internal integer IDs.
Sources                   For keeping the information about the already inserted sources.
Namespaces                Used to abbreviate the URIs of classes, properties and other constructs.
Literals                  Contains literals including their type and language specification.
POI                       Point-Of-Interest table, the monolith table for the GenRepr.
Classes                   For keeping all stored classes.
Classes_Source_Usage      Stores the source relationship of the classes.
Subclasses                An extra table for the class hierarchy.
Properties                For keeping all stored properties.
Properties_Source_Usage   Stores the source relationship of the properties.
Subproperties             An extra table for the property hierarchy.
Containers                For keeping all stored containers.
Containers_Source_Usage   Stores the source relationship of the containers.
Statements                For keeping all stored reified statements.

As a first benefit, RDF-S3 allows the deletion and update (delete and re-enter) of single sources. In addition to the source URL, the system records when the last update or insertion of the specific resources took place. A separate GUI (as described in The RDF-S3 GUI) makes it easy to delete and update already inserted sources. The RDF and RDFS namespaces you see in Figure 2 with the IDs 1 and 2 are handled as normal sources and can be updated in the same way.

Figure 2. Graphical user interface to update or delete single sources from the RDF-S3 storage.
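The idea of deleting a single source can be illustrated with a small in-memory sketch. It is not the RDF-S3 implementation, which keeps this information in relational tables linked by foreign keys; the class SourceTrackingStore and its methods are hypothetical, and the code is written for a current JDK rather than Java 1.4. Each triple remembers the set of sources that state it, and a triple disappears only when its last source is deleted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory illustration of source tracking, not the RDF-S3 schema.
class SourceTrackingStore {
    // Maps a triple key to the set of source URLs that state this triple.
    private final Map<String, Set<String>> sourcesByTriple = new HashMap<>();

    // Records that the given source states the triple (s, p, o).
    void add(String s, String p, String o, String sourceUrl) {
        sourcesByTriple.computeIfAbsent(key(s, p, o), k -> new HashSet<>()).add(sourceUrl);
    }

    // Deletes a source; triples stated by no other source are removed as well.
    void deleteSource(String sourceUrl) {
        Iterator<Set<String>> it = sourcesByTriple.values().iterator();
        while (it.hasNext()) {
            Set<String> sources = it.next();
            sources.remove(sourceUrl);
            if (sources.isEmpty()) {
                it.remove();
            }
        }
    }

    boolean contains(String s, String p, String o) {
        return sourcesByTriple.containsKey(key(s, p, o));
    }

    int size() {
        return sourcesByTriple.size();
    }

    // Joins the three parts with a separator that cannot occur in a URI.
    private static String key(String s, String p, String o) {
        return s + "\u0000" + p + "\u0000" + o;
    }
}
```

In this sketch an update of a source corresponds to calling deleteSource and then adding the triples of the new version again.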

The internal storage structure of RDF-S3 combines the two storage approaches (GenRepr & SpecRepr) by redundant storage. This way both query types (schema queries & data queries) can run on the representation that is more effective for them. Naturally, redundant storage needs extra storage space, and the DBMS has to do some extra work to keep the information synchronized. The synchronization is ensured by extensive use of foreign keys. However, since this extra work is only needed during insertion and deletion, it does not affect the query performance. To minimize the storage volume (especially because of the redundant storage), we decided to encode the resource URIs as integers. Of course, this results in some extra work when the resource URIs are retrieved.
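The integer encoding of resource URIs can be pictured as a simple dictionary. The following is a minimal in-memory sketch under that assumption; the class ResourceDictionary is hypothetical, whereas RDF-S3 itself keeps the mapping in the Resources table of the database. The code targets a current JDK rather than Java 1.4.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of mapping URIs to compact integer IDs.
class ResourceDictionary {
    private final Map<String, Integer> idByUri = new HashMap<>();
    private final List<String> uriById = new ArrayList<>();

    // Returns the ID of a known URI or assigns the next free ID (IDs start at 1).
    int encode(String uri) {
        Integer id = idByUri.get(uri);
        if (id == null) {
            uriById.add(uri);
            id = uriById.size();
            idByUri.put(uri, id);
        }
        return id;
    }

    // The extra lookup step needed whenever a URI has to be shown again.
    String decode(int id) {
        if (id < 1 || id > uriById.size()) {
            throw new IllegalArgumentException("Unknown resource ID: " + id);
        }
        return uriById.get(id - 1);
    }
}
```

Every stored triple then needs only three integers, at the price of one decode step per resource when results are presented.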

A first version of RDF-S3 was published in October 2003. It is a 100% pure Java™ open source application. The system was implemented and tested on IBM DB2 Universal Database v8.1 (ESE). In addition, it was tested with MySQL 4.0, where loading worked fine. However, when deleting single sources RDF-S3 makes use of nested queries, which are not supported by MySQL 4.0. Nested queries will be supported by MySQL 4.1, but unfortunately this version is not fully stable yet. Furthermore, it is important to mention that RDF-S3 uses foreign keys. Currently only the InnoDB storage engine of MySQL supports foreign keys. Without foreign key support, deletion and update cannot be performed properly.

There certainly is much more we could keep in addition to the source information: the time of loading (which is already kept by RDF-S3), the person who loaded the source into the storage, trust ratings about the source (e.g., by an external rating service) or even provenance information (information about all changes that have been made). Yet there are already many ways to benefit from just keeping the source information for each triple. The following list contains some of our ideas:

  • It provides the ability to delete complete sources and therefore also to update them. It is important to point out here that this update does not include schema evolutions. This is an extra part that needs to be explored separately. Anyway, keeping the source information for each triple can be useful for a schema evolution within an RDF storage. It might be that the schema evolution should not affect all graphs from every source.
  • Together with the time of loading and updating, it could be used as a starting point for a versioning system inside the RDF storage.
  • It offers the possibility to refer to the source and therefore to retrieve the source itself to look for further information. This could be interesting for frequently changing data or for just viewing a human-readable version, e.g. in HTML.
  • The source information can be used for a first basic trust mechanism. There is a higher confidence to the data in case you know where they are coming from. Moreover, by rating the different sources you could judge how much they can be trusted.
  • It helps to explore contradictions between different sources. For example, two classes are stated to have no intersection (e.g. by owl:disjointWith), but there exists a resource that is an instance of both classes. The system has no way to resolve the situation, but by providing the user with the URLs where the information was found, it enables him to decide which one he trusts. An advanced system could further help the user by providing statistics, like the number of sources that state one or the other point of view, or feedback from other users who explored the contradiction before.

The RDF-S3 GUI

Below you see the graphical user interface of RDFS3 for storing RDF data into a database. A short description of the fields, checkboxes and buttons can be found underneath.

RDFS3GUI


Menu   Input/Output File   DB Connection   Validation Options (VRP)   Output Options   Output Area   Buttons


Menu

Database
To initialise the database. The database itself must already exist. The initialisation generates the basic tables needed to store the RDF data. Before it starts, you will be asked whether the database you use supports 'GENERATED ... AS IDENTITY'. Your answer determines whether the Simple SQL checkbox should be checked.

Input/Output File

Select Input File or URL
Select the file or enter the URL you want to store (e.g., http://139.91.183.30:9090/RDF/VRP/Examples/cweb.rdf). Example RDF files can be found in the directory 'RDF_examples' of RDFS3. More examples are available online.
Select Output File
Select a file or enter a new file name for saving statistics on RDF-S3 actions. The statistics describe what kind of RDF data has been stored, e.g. how many properties, plus the processing time. Depending on the settings, the statistics are encoded as a delimited set of integers or as RDF/XML.

DB Connection

Schema Name
To enter the schema name used in the database.
Database URL
To enter the URL of the database, e.g. jdbc:db2:RDFTEST for a JDBC connection to an IBM DB2 instance called db2 containing a database called RDFTEST.
JDBC Driver Name
The class name of the JDBC driver for the database you use. Please make sure that the driver is on the classpath before starting RDFS3. For IBM DB2 the JDBC driver is contained in the zip file db2java.zip, which is normally located at <db2-installation-path>\SQLLIB\java\.
User Name
Username to authenticate at the database.
Password
The password to authenticate at the database.
Simple SQL
In case the database was initialised (see menu above) without 'GENERATED ... AS IDENTITY' support, it should be checked, otherwise not!
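
How the Database URL, JDBC Driver Name, User Name and Password fields fit together can be shown with plain JDBC. This is only a sketch of the standard JDBC calls, not the code RDF-S3 uses internally; the class DbConnectionSketch is hypothetical, and the try-with-resources statement requires a newer JDK than Java 1.4.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper showing how the four connection settings are used with JDBC.
class DbConnectionSketch {
    // Loads the driver class (it must be on the classpath) and opens a connection.
    static Connection open(String driverClass, String url, String user, String password)
            throws ClassNotFoundException, SQLException {
        Class.forName(driverClass);
        return DriverManager.getConnection(url, user, password);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("usage: DbConnectionSketch <driver class> <database URL> <user> <password>");
            return;
        }
        try (Connection connection = open(args[0], args[1], args[2], args[3])) {
            System.out.println("Connected to " + connection.getMetaData().getDatabaseProductName());
        }
    }
}
```

A ClassNotFoundException at this point usually means that the driver archive is missing from the classpath.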

Validation Options (VRP)

Types of Resources
The resources used as RDF Classes, Properties, Containers and Statements should have been assigned the respective RDF type. The XML Datatypes values should belong to the lexical space of the respective data type.
Source/Target Resources of properties
The source/target values of a property should be instances of the domain/range classes of the property.

Output Options

Verbose
Messages about the actions performed by the RDFS3 will be reported.
Triples
The triples produced by the parser for the input RDF descriptions will be printed.
Statements
The triples included in the model produced by VRP will be printed. This set of triples may differ from the set of triples created by the parser (see the Triples option), because the model produced by VRP does not contain duplicate statements, contains additional statements from external namespaces (if the option 'External Namespaces' is enabled), and contains additional statements inferred from the domain/range of subproperties.
Graph
Prints a textual representation of the created RDF graph containing: the topological order (top down) of the properties according to the rdfs:subPropertyOf statements; the topological order (top down) of the classes according to the rdfs:subClassOf statements; and the RDF classes, properties, statements, containers, and resources.

Output Area

Here the generated results and messages are shown.

Buttons

Start
Will start the storage of the given input file into the database.
Clear Output
Will clear the two output areas and will reset the progress bar.
Exit
Will exit the program and store the current settings, which will be reused when RDFS3 is started the next time.

easy RDF Query Language (eRQL)    

Introduction to eRQL

The easy RDF Query Language (eRQL) was originally constructed as a wrapper for RQL, which uses an SQL-like syntax and therefore cannot serve as an end-user query language for information portals etc. This changed with the evolution of RDF-S3 and eRQL: eRQL now works directly on top of RDF-S3.

The main goal of eRQL is to be simple enough to be used without any knowledge of the underlying ontology used to describe the data. The query syntax itself should also be intuitive. This goal is reached by staying close to the syntax of Google: queries are performed by simply entering keywords, which can be combined with AND or OR. Such a query returns those triples that contain the given word as subject, predicate or object. Putting a keyword into quotation marks restricts the request to literals. The matching triples are called direct hits. One peculiarity of eRQL is that it returns not only the triples matching the request but also those surrounding them (called PointOfInterest Mode, or POI Mode). This way eRQL adds internal context information to the result, which makes the result easier to understand. How much of the surrounding graph is returned, i.e. the distance, can be defined by the number of leading ~ signs. The default distance is already one: if you enter the query "bridges", the system searches for all direct hits and then, for each of them, adds the triples connected to it to the result. For a query with a distance of zero, the query needs to be enclosed in square brackets "[...]". The result then contains only the direct hits (called Statement Mode).
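
The surface rules described above (leading ~ signs, square brackets, quotation marks) can be sketched as a tiny parser. This is an illustration only and not the eRQL parser, which is generated with CUP and JFlex; the class ErqlQuerySketch is hypothetical, AND/OR combinations are ignored, and the sketch assumes that each leading ~ raises the default distance of one by one, which the text does not state explicitly.

```java
// Hypothetical sketch of the single-keyword eRQL forms; not the real eRQL parser.
class ErqlQuerySketch {
    final int distance;        // 0 = Statement Mode, 1 or more = POI Mode
    final boolean literalOnly; // true if the keyword was given in quotation marks
    final String keyword;

    private ErqlQuerySketch(int distance, boolean literalOnly, String keyword) {
        this.distance = distance;
        this.literalOnly = literalOnly;
        this.keyword = keyword;
    }

    static ErqlQuerySketch parse(String input) {
        String q = input.trim();
        int distance;
        if (q.length() >= 2 && q.startsWith("[") && q.endsWith("]")) {
            // Statement Mode: only the direct hits, no surrounding graph.
            distance = 0;
            q = q.substring(1, q.length() - 1).trim();
        } else {
            // ASSUMPTION: default distance 1, plus one per leading '~'.
            int tildes = 0;
            while (tildes < q.length() && q.charAt(tildes) == '~') {
                tildes++;
            }
            distance = 1 + tildes;
            q = q.substring(tildes).trim();
        }
        boolean literalOnly = q.length() >= 2 && q.startsWith("\"") && q.endsWith("\"");
        if (literalOnly) {
            q = q.substring(1, q.length() - 1);
        }
        return new ErqlQuerySketch(distance, literalOnly, q);
    }
}
```

Under these assumptions, parse("[bridges]") yields a distance of zero, while parse("bridges") yields the default distance of one.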

Since RDF-S3 also stores the source information of the triples, this knowledge can be used too. With the so-called Document Mode of eRQL, a single source or a group of sources can either be left out, or a query can be restricted to them. The syntax for the Document Mode is: "<query; source_list; restrict>". The query can be any valid eRQL query, the source_list is a list of source URLs separated by commas, and restrict is either 0 to leave out the given sources or 1 to restrict the query to them. As an abbreviation, the sources in the source_list can also be identified by their internal IDs, which must be retrieved from the database beforehand. Therefore, a valid query could look like: "<bridges; 5,6; 0>". This would execute the query "bridges" on all stored information except that which comes from the sources internally identified by 5 and 6.
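
The three parts of a Document Mode query of the form <query; source_list; restrict> can be separated as in the following sketch. It is an illustration and not the eRQL parser; the class DocumentModeSketch is hypothetical, and the sketch assumes that source URLs contain neither ';' nor ','. It is written for a current JDK rather than Java 1.4.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch that splits a Document Mode query into its three parts.
class DocumentModeSketch {
    final String query;
    final List<String> sources;      // source URLs or internal IDs, as entered
    final boolean restrictToSources; // true for restrict = 1, false for restrict = 0

    private DocumentModeSketch(String query, List<String> sources, boolean restrictToSources) {
        this.query = query;
        this.sources = sources;
        this.restrictToSources = restrictToSources;
    }

    static DocumentModeSketch parse(String input) {
        String q = input.trim();
        if (q.length() < 2 || !q.startsWith("<") || !q.endsWith(">")) {
            throw new IllegalArgumentException("Document Mode queries are enclosed in < and >");
        }
        String body = q.substring(1, q.length() - 1);
        // Split from the right so that the inner query may itself contain ';'.
        int last = body.lastIndexOf(';');
        int previous = last < 0 ? -1 : body.lastIndexOf(';', last - 1);
        if (previous < 0) {
            throw new IllegalArgumentException("Expected <query; source_list; restrict>");
        }
        String restrict = body.substring(last + 1).trim();
        if (!restrict.equals("0") && !restrict.equals("1")) {
            throw new IllegalArgumentException("restrict must be 0 or 1");
        }
        List<String> sources = new ArrayList<>();
        for (String source : body.substring(previous + 1, last).split(",")) {
            if (!source.trim().isEmpty()) {
                sources.add(source.trim());
            }
        }
        return new DocumentModeSketch(body.substring(0, previous).trim(), sources, restrict.equals("1"));
    }
}
```

For an input such as "<bridges; 5, 6; 0>" the sketch yields the query bridges, the sources 5 and 6, and restrictToSources set to false, meaning these two sources are left out.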

In addition to this functionality, eRQL also supports most RQL schema functions to give a quick overview of the underlying ontology. A complete list of functions and further possibilities supported by eRQL can be found below. These functions are split into general schema functions, which do not need any input, and schema functions on resources, which need an eRQL query as input parameter, like "domain(query)", which returns the domain definitions of the properties fitting the given query.

Like RDF-S3, the implementation of eRQL comes with a graphical user interface, shown in the section The eRQL GUI. It includes the settings for the database connection and also the mapping from source URLs to the internally used IDs, which abbreviate them in the Document Mode.

The structure of the result that will be returned depends on the query. General schema functions return lists of the resources found. Results for schema functions on resources list the resources together with the resource they belong to, e.g., the query "domain(father)" lists all properties fitting the query "father" together with their defined rdfs:domain classes. If triples are returned, they are grouped by the POIs they belong to. Triples are always returned together with their source information. For schema functions the source information is only included when the Document Mode was activated.

To work with the returned eRQL results there is either the possibility to reuse the internal Java classes or to store the results in an RDF format using a vocabulary given by the application. As in RDF-S3 the result also includes the processing time for the query.

Short description of the current eRQL syntax

  • POI Query or POI Mode: one-word-query - returns all stored triples in which the given query word is contained in any part of the triple. The surrounding graph with a distance of 1 is included in the result. Repeated '~' signs can be used to increase the distance. The wildcards "*" and "?" can be used, where "*" stands for any string and "?" for exactly one character.
  • Border functions: The query can be restricted to operate on the subject, predicate or object part of the RDF model. For this purpose, the functions subj(one-word-query), pred(one-word-query) and obj(one-word-query) can be used. Additionally, the function res(one-word-query) restricts the query to resources having a URI reference.
  • Literal Query : "query_string" - as the POI Query but concentrates on literals in the object part only. The query_string can contain spaces.
  • Statement Mode : [query] - concentrates on the direct hits, there is no surrounding graph included.
  • Document Mode: <query; source_list; restrict> - restrict can be either 0 or 1, in case of 1 the query is only executed on the given sources, in case of 0 these sources are left out.
  • Schema Functions: As of version 1.6 of RDF-S3 and eRQL, schema functions for queries on the schema level are also supported. The schema functions can be separated into general functions, which do not need any input, and functions on resources, which need a query as input to find fitting resources. Whereas the POI Mode does not work for schema functions, meaning that no surrounding graph is returned, the Document Mode is compatible with schema functions. A list and short description of all currently supported schema functions can be found in Table II and Table III.
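
The wildcards of the one-word queries listed above can be illustrated by translating them into a regular expression. This is a sketch and not the way eRQL evaluates queries against the database; the class WildcardSketch is hypothetical, and the sketch assumes that the pattern has to match the whole value.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: "*" matches any string, "?" matches exactly one character.
class WildcardSketch {
    static Pattern toPattern(String word, boolean caseSensitive) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote everything else so that characters like '.' stay literal.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        int flags = Pattern.DOTALL | (caseSensitive ? 0 : Pattern.CASE_INSENSITIVE);
        return Pattern.compile(regex.toString(), flags);
    }

    // ASSUMPTION: the whole value has to match the query word.
    static boolean matches(String word, String value, boolean caseSensitive) {
        return toPattern(word, caseSensitive).matcher(value).matches();
    }
}
```

The caseSensitive parameter mirrors the Case Sensitive checkbox of the eRQL GUI.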

TABLE II: Overview of schema functions on resources supported by eRQL.

Syntax                                    Description
directInstancesOf(q), di(q), dI(q)        Will return all direct instances for the classes fitting the query q.
domain(q), d(q), D(q)                     Will return the defined domain classes for the properties fitting the query q.
instancesOf(q), i(q), I(q)                Will return all instances (including the instances of subclasses) for the classes fitting the query q.
range(q), r(q), R(q)                      Will return the defined range classes for the properties fitting the query q.
subClassOf(q), subc(q), subC(q)           Will return all sub classes of the classes fitting the query q.
subPropertyOf(q), subp(q), subP(q)        Will return all sub properties of the properties fitting the query q.
superClassOf(q), superc(q), superC(q)     Will return all super classes of the classes fitting the query q.
superPropertyOf(q), superp(q), superP(q)  Will return all super properties of the properties fitting the query q.

TABLE III: Overview of general schema functions supported by eRQL.

Syntax                           Description
classes(), c(), C()              Will return all defined classes.
container(), con(), CON()        Will return all defined containers.
literals(), l(), L()             Will return all used literals.
properties(), p(), P()           Will return all defined properties.
reifiedStatements(), rs(), RS()  Will return all defined reified statements.
triples(), t(), T()              Will return all triples.

The eRQL GUI

Below you see the graphical user interface of eRQL for RDF-S3. It can be used to retrieve the data stored by RDF-S3. A short description of the fields, checkboxes and buttons can be found underneath.

eRQLGUI


DB Connection (equal to RDF-S3 GUI)     eRQL     Query Result      Buttons

eRQL

Enter your query here
The eRQL query can be entered here. To find more information about how to write eRQL queries and their syntax please see eRQLModi (pdf).
The stored sources and their IDs
For eRQL document queries the internal IDs of the sources are used to abbreviate them. The IDs can be looked up in the table shown, which can be updated by pressing the refresh button below it.
Case Sensitive
This influences the eRQL query to distinguish between upper and lower case. If selected the queries "picasso" and "Picasso" will return different results.
Follow double property usage
Important for the POI mode of eRQL queries. Suppose a resource r is an instance of a class c (r rdf:type c) and the distance defined for the POI query extends beyond this triple. If the checkbox is selected, all instances i of the class c (i rdf:type c) are included in the POI result. If the checkbox is not selected, such double usages of a property (in this case rdf:type) are not included in the result set, which results in a smaller number of surrounding triples per hit.
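
One possible reading of the POI expansion and of the "Follow double property usage" option is sketched below. It is not the eRQL implementation, which works on the database; the class PoiSketch and its method expand are hypothetical, and the sketch assumes that two triples are connected when they share a subject or object, and that a "double usage" means following a triple with the same predicate as the triple just left. It is written for a current JDK rather than Java 1.4.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch of collecting the surrounding triples of a direct hit.
class PoiSketch {
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        @Override public boolean equals(Object other) {
            if (!(other instanceof Triple)) return false;
            Triple t = (Triple) other;
            return s.equals(t.s) && p.equals(t.p) && o.equals(t.o);
        }
        @Override public int hashCode() { return Objects.hash(s, p, o); }
    }

    // ASSUMPTION: triples are connected if they share a subject or object value.
    static boolean connected(Triple a, Triple b) {
        return a.s.equals(b.s) || a.s.equals(b.o) || a.o.equals(b.s) || a.o.equals(b.o);
    }

    // Breadth-first expansion around the hit, one level per unit of distance.
    static Set<Triple> expand(List<Triple> all, Triple hit, int distance, boolean followDouble) {
        Set<Triple> result = new LinkedHashSet<>();
        result.add(hit); // the first triple is always the hit itself
        List<Triple> frontier = new ArrayList<>();
        frontier.add(hit);
        for (int step = 0; step < distance; step++) {
            List<Triple> next = new ArrayList<>();
            for (Triple from : frontier) {
                for (Triple candidate : all) {
                    boolean doubleUsage = candidate.p.equals(from.p);
                    if (connected(from, candidate) && (followDouble || !doubleUsage)
                            && result.add(candidate)) {
                        next.add(candidate);
                    }
                }
            }
            frontier = next;
        }
        return result;
    }
}
```

With the triples (r name "R"), (r rdf:type c) and (i rdf:type c), a hit on the first triple and a distance of two, this sketch returns all three triples when the option is enabled and only the first two when it is disabled.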

Query Result

Query Result
The query result starts with the time needed to execute the query, followed by the number of direct hits. Depending on the distance that was specified with the query, each hit is shown separately with its surrounding triples. Each of these RDF graphs starts with its size (number of triples), followed by the triples themselves. The first triple is always the hit itself. The triples are written in the form subject, predicate, object, followed by the source of the triple. To keep the output short, namespaces and source URLs are abbreviated; the long versions can be found at the end of the query result.

Buttons

Execute! (ALT-ENTER)
By pressing this button the entered eRQL query will be executed.
Close
Closes the program. The last used eRQL query and the DB Connection settings will be stored in the preferences. The next time you start the program these settings will be restored.