WikiJournal Preprints/Generic Tuple Store

Introduction
While there is prior art regarding the versioning of triples and quads, the review of existing software in "Collaborative Open Data versioning: a pragmatic approach using Linked Data" concludes that there is still much work to do in this field to make it practical and usable. The most important indicator is Wikidata: if there were a readily available free-software implementation of a versioned triple store that could scale to 9 billion triples and 1.7 TB, Wikidata would be running on it. In spite of a lot of effort, Wikidata, and in particular the Wikidata Query Service (WDQS), does not scale. Guillaume Lederrey, Engineering Manager, Search Platform at the Wikimedia Foundation, stated on Wikidata a few months back:

Wikidata was founded 7 years ago. Since then, tools and practices have evolved. In particular, WiredTiger was acquired by MongoDB and, as such, has seen much more production workload and bug fixes. More recently, in 2015, Apple acquired FoundationDB, which was created in 2009 to tackle the difficult problem of distributed, scalable databases with strong guarantees, and later released it as open source. These two relatively new tools made it possible to build the Generic Tuple Store (nstore), standardized as a Scheme Request for Implementation, SRFI-168.

The nstore presented in this document is the basic building block used to implement the Versioned Generic Tuple Store (vnstore), which will be presented in another document. The use-cases section introduces the vnstore implementation in order to justify the nstore.

The first part of this article presents the main topics discussed in the paper and how they relate to the current work. Several use-cases for the nstore are then presented. The document ends with a conclusion and future work.

Resource Description Framework
RDF is a World Wide Web Consortium (W3C) set of standards that aims to facilitate cooperation around data by specifying several tools. Among other things, it specifies means to exchange and query data and, to some extent, how to store it.

The following sections dive into several parts of the RDF framework and explain how they relate to nstore.

SPARQL
SPARQL is the query language of RDF: it specifies the language that RDF databases must use to store and query data. It also provides ways to do federated queries, that is, queries across several databases. As explained in the literature, SPARQL can be difficult to implement at times. Instead of aiming for direct interoperability, nstore takes the stance of primarily delivering its main features:


 * tuples of n items
 * good performance
 * horizontal scalability
 * easy to set up

The nstore internal query language is very similar in principle to SPARQL, even if it does not share the same syntax.
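The similarity in principle can be illustrated with a minimal sketch of pattern matching over tuples, where variables play the role of SPARQL's `?name` placeholders and several patterns are joined on shared variables. This is written in Python for illustration only and is not the SRFI-168 API: the names `Var`, `match`, `query` and the sample data are all hypothetical, and the store is a plain in-memory list rather than an ordered key-value store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A named placeholder in a pattern, comparable to ?name in SPARQL."""
    name: str


def match(pattern, tuples, binding=None):
    """Yield every extension of `binding` that makes `pattern` equal a stored tuple."""
    binding = binding or {}
    for item in tuples:
        if len(item) != len(pattern):
            continue
        candidate = dict(binding)
        for expected, value in zip(pattern, item):
            if isinstance(expected, Var):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(expected.name, value) != value:
                    break
            elif expected != value:
                break
        else:
            yield candidate


def query(tuples, *patterns):
    """Apply patterns one after another, joining them on shared variables."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, tuples, b)]
    return bindings


# Made-up sample data: a blog and one post that belongs to it.
TUPLES = [
    ("P4X432", "blog/title", "example blog"),
    ("123456", "post/blog", "P4X432"),
    ("123456", "post/title", "hello world"),
]

results = query(
    TUPLES,
    (Var("blog"), "blog/title", "example blog"),
    (Var("post"), "post/blog", Var("blog")),
    (Var("post"), "post/title", Var("title")),
)
print(results)
```

The three patterns correspond roughly to a SPARQL basic graph pattern with three triple patterns. The same functions work unchanged for tuples of any length, which is the point of storing n items per tuple.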

SPARQL specifies various data types based on the XML specification. nstore does not conform to that specification. Instead, nstore can store anything that has a Scheme representation, which is a superset of the RDF base data types. At query time, the user or client must convert the Scheme representation into the intended representation, possibly relying on tuples added to the nstore at import time.

Vocabularies, Ontologies and Linked Data
With RDF comes a specification to describe the content of a database. For instance, the INSPIRE initiative is an interesting project that aims at standardizing, across organizations, a vocabulary to exchange spatial data. There are many competing and cooperating vocabularies.

Ordered Key-Value Store
Key-value stores offer a rather high-level primitive for building high-performance, multi-model and domain-specific databases. The common denominator of ordered key-value stores is that they are mappings from bytes to bytes where keys are always sorted in lexicographic order. Even if they do not all expose a cursor interface, they all allow movement inside the mapping or range queries (also known as slices).
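The primitive described above can be sketched with a toy in-memory store. This is an illustration in Python, not the interface of WiredTiger, FoundationDB or SRFI-168: the class `OrderedKV`, its methods and the helper `strinc` are hypothetical names, and a sorted list stands in for the on-disk structure a real engine would use.

```python
import bisect


def strinc(prefix: bytes) -> bytes:
    """Return the smallest byte string greater than every key starting with `prefix`."""
    stripped = prefix.rstrip(b"\xff")
    if not stripped:
        raise ValueError("no upper bound exists for this prefix")
    return stripped[:-1] + bytes([stripped[-1] + 1])


class OrderedKV:
    """Toy ordered key-value store: bytes keys kept in lexicographic order."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> value

    def set(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes, default=None):
        return self._values.get(key, default)

    def range(self, start: bytes, end: bytes):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

    def prefix(self, prefix: bytes):
        """Yield every pair whose key starts with `prefix` (a range query)."""
        return self.range(prefix, strinc(prefix))


db = OrderedKV()
db.set(b"user/2", b"bob")
db.set(b"user/1", b"alice")
db.set(b"post/1", b"hello")
db.set(b"user/10", b"carol")

# Keys come back in byte order, so b"user/10" sorts before b"user/2".
print(list(db.prefix(b"user/")))
```

Because keys are compared byte by byte, the way tuple items are packed into keys decides which queries become cheap range scans, which is why the ordering matters for a tuple store built on top.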

Nowadays there are numerous libraries offering a compatible interface. FoundationDB is one. Similar software includes Tokyo Cabinet, Kyoto Cabinet, LMDB, LevelDB and RocksDB. They offer different trade-offs and features. nstore uses WiredTiger because it is not a bad choice: it performs well on some benchmarks, and it takes charge of the difficult matter of delivering an Atomic, Consistent, Isolated and Durable (ACID) storage engine. WiredTiger also handles in-memory caching.

The prototype only uses WiredTiger, but thanks to SRFI-168 it is possible to use FoundationDB, provided the appropriate code is contributed.

Scheme Programming Language
According to Wikipedia: High-level languages like Scheme are not, so far, the preferred tools for building database abstractions. That said, some have had success with Java, Go and Clojure. With the advent of ordered key-value stores (OKVS) and the massive improvements of Scheme compilers, the situation is different. Key-value stores solve the performance problems, while Scheme makes it possible to quickly express high-level abstractions that fit the domain problem exactly.

Chez Scheme is probably the fastest Scheme implementation and, in particular, it is faster than Racket, which makes Chez probably the fastest dynamically typed language in the known Universe. The Scheme community has a good science culture. It has been the inspiration that allowed nstore to take its current form.

Use-cases
There are some situations where three or four items in a tuple are not enough. This includes, but is not limited to, tracking provenance, lineage, license or other metadata. There are several ways to reify a tuple in a triple or quad store as presented in. The approach taken by nstore, which bundles metadata with the rest of the tuple, is even more interesting from a performance point of view when the reification of a tuple is systematic, that is, when metadata requirements are known beforehand and apply to every tuple. This is the case, for instance, in the versioned generic tuple store (vnstore), where every tuple is associated with its history significance and a Boolean denoting whether it is alive or dead. Another example is when the user wants to know the provenance of every single tuple. While there are no rigorous benchmarks comparing the different reification approaches with the nstore approach, the intuition is that the nstore is faster and consumes less disk space than any kind of reification.
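The difference between the two approaches can be made concrete by counting what has to be stored for a single fact. The sketch below, in Python, compares bundling metadata into one n-tuple with standard RDF reification, which describes a triple through a statement node carrying `rdf:type`, `rdf:subject`, `rdf:predicate` and `rdf:object`. The function names, the sample fact and the metadata keys `alive` and `change` are hypothetical and chosen only to echo the vnstore example. The counts show storage shape only and are not a benchmark.

```python
def bundle(triple, *metadata):
    """nstore style: append the metadata items to the tuple itself."""
    return tuple(triple) + tuple(metadata)


def reify(triple, statement, **metadata):
    """Standard RDF reification: describe the triple through a statement node."""
    subject, predicate, obj = triple
    triples = [
        (statement, "rdf:type", "rdf:Statement"),
        (statement, "rdf:subject", subject),
        (statement, "rdf:predicate", predicate),
        (statement, "rdf:object", obj),
    ]
    # One extra triple per piece of metadata attached to the statement.
    triples.extend((statement, key, value) for key, value in metadata.items())
    return triples


fact = ("alice", "knows", "bob")  # made-up fact

bundled = bundle(fact, True, "change-1")
reified = reify(fact, "statement-1", alive=True, change="change-1")

print(bundled)       # a single tuple of five items
print(len(reified))  # number of triples needed by reification
```

With two metadata items, the bundled form is one tuple of five items, while the reified form needs six triples and an extra join to get back from the metadata to the original fact.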

Conclusion and future work
Preliminary micro-benchmarks show that chez-nomunofu is faster than the competition, both in the time taken to import data and in query time. Three aspects remain to be explored: a) support for FoundationDB; b) an implementation of the SPARQL middleware; c) more rigorous benchmarks based on WDQS SPARQL logs.

Acknowledgements
StackOverflow user zhoraster helped pin down the mathematics behind the implementation of the generic tuple store, and Mike Earnest provided an algorithm to compute the minimal set of tuple item permutations that allows efficient querying.

Grant
This work is part of a Wikimedia grant request.