Structural Properties as Proxy for Semantic Relevance in RDF Graph Sampling

Paper Abstract

The Linked Data cloud has grown to become the largest knowledge base ever constructed. Its size is now turning into a major bottleneck in many applications. In order to facilitate access to this enormous amount of structured information, this paper proposes an automatic sampling method targeted at maximizing answer coverage in applications using SPARQL querying. We empirically show that the relevance of triples for sampling (a semantic notion) is influenced by the topology of the graph (purely structural), and can be determined without prior knowledge of the queries. The use of state-of-the-art high-performance methods allowed this analysis to be performed at an unprecedented scale. Experiments show a significantly higher recall for topology-based sampling methods than for random and naive baseline approaches.

Plots

Sample Quality

We measure the quality of a sample by calculating the recall for each query in a given set and averaging these recall values. We present our results using the D3.js visualization library. The initial plots show the best-performing sampling method for each of our datasets. You can select the combination of dataset and sampling method you would like to visualize. The six datasets used in our experiments are: DBpedia, Linked Geo Data, MetaLex, Open-BioMed, Bio2RDF (more specifically, the KEGG dataset), and Semantic Web Dog Food.
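The averaging described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in our experiments: the function names are hypothetical, and we assume here that the recall of a query is the fraction of its full-dataset answers that the sample also returns, and that queries with no answers on the full dataset are skipped.

```python
def query_recall(full_answers, sample_answers):
    """Fraction of the full-dataset answers that the sample also returns."""
    full = set(full_answers)
    if not full:
        # Assumption: recall is undefined when the full dataset yields no answers.
        return None
    return len(full & set(sample_answers)) / len(full)


def average_recall(results):
    """Mean recall over (full_answers, sample_answers) pairs.

    Queries whose recall is undefined are left out of the average.
    """
    recalls = [query_recall(full, sample) for full, sample in results]
    recalls = [r for r in recalls if r is not None]
    return sum(recalls) / len(recalls) if recalls else 0.0
```

For example, a sample that returns one of two answers for the first query and the only answer for the second query has recall values 0.5 and 1.0, so its quality score under this sketch is 0.75.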

As the recall can vary considerably between queries, we also provide standard deviation plots (pdf) per sample. Finally, we provide plots that use the median recall value (pdf) of these queries instead of the average.
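To illustrate why the spread and the median are worth reporting next to the average, the following sketch summarizes a list of per-query recall values with Python's standard `statistics` module. The function name is hypothetical, and the choice of the population standard deviation (`pstdev`) is an assumption made for this example only.

```python
import statistics


def summarize_recall(recalls):
    """Return the mean, standard deviation and median of per-query recall values."""
    return {
        "mean": statistics.mean(recalls),
        # Assumption: population standard deviation; the sample variant would be statistics.stdev.
        "stdev": statistics.pstdev(recalls),
        "median": statistics.median(recalls),
    }


# One query with no answers in the sample pulls the mean down more than the median.
summary = summarize_recall([0.0, 0.5, 1.0, 1.0])
```

In this example the mean is 0.625 while the median is 0.75, which shows how a few poorly covered queries affect the two measures differently.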

Dataset Degree Distribution

As the structure of a dataset may influence the performance of a particular sampling method, we present the degree distribution for each dataset as well.
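As a rough sketch of how such a distribution can be derived from a set of triples, the snippet below counts, for every node, the number of triples in which it appears as subject or object, and then counts how many nodes share each degree. The function name and the choice to combine in-degree and out-degree are assumptions made for illustration; they do not necessarily match the exact procedure behind the plots.

```python
from collections import Counter


def degree_distribution(triples):
    """Map each degree to the number of nodes that have that degree.

    Assumption: a node's degree is the number of times it occurs as the
    subject or object of a triple; predicates are not counted as nodes.
    """
    degree = Counter()
    for subject, _predicate, obj in triples:
        degree[subject] += 1
        degree[obj] += 1
    return Counter(degree.values())


# Hypothetical toy graph: "a" links to "b" and "c".
distribution = degree_distribution([("a", "p", "b"), ("a", "p", "c")])
```

Here node "a" has degree 2 and nodes "b" and "c" have degree 1, so the resulting distribution contains two nodes of degree 1 and one node of degree 2.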

Interactive Tables

In addition to the aggregated sample results, we provide more detailed data in the form of interactive HTML tables. We provide one HTML table per dataset and sample size, where each table shows the recall of each query on all the created samples. As the queries we use are not publicly available, we cannot share the exact queries; we provide a set of features per query instead.
