User:Iamamz3/Seyfu: Versioned Structured Data

= Seyfu: Versioned Structured Data =

Abstract
Using a version control system to store open data is good practice because it draws a clear path toward reproducible science. However, no existing system meets all expectations. Seyfu aims to replace the use of git and to make cooperation practical around the creation, maintenance, publication, storage and re-use of knowledge bases that are possibly bigger than memory. To achieve that, we propose a novel approach to tackle the challenge of querying versioned triples in a Directed Acyclic Graph (DAG). Seyfu only stores changes between versions. We introduce a Scheme implementation of Seyfu and compare it to existing solutions.

Keywords:


 * Databases
 * Version Control Systems
 * Knowledge Bases
 * Reproducible Science
 * Scheme Programming Language

Introduction
The Resource Description Framework (RDF) offers a good canvas for cooperation around open data, but no existing solution is good enough. ''TODO: explain what are those features required for cooperation and why those features are important. Maybe cite an article about the importance of git and git hosting solution in the context of software development''. Seyfu uses a novel approach to query versioned data, based on a topological ordering of the graph of changes. Unlike Cassidy et al., Seyfu does not rely on the Theory of Patches introduced by Darcs. Seyfu uses the WiredTiger database storage engine, an ordered key-value store, to deliver a pragmatic ACID-compliant versioned quad store. The first part describes the implementation of Seyfu. The second part presents benchmarks.

Implementation

 * Space and Time complexity for each algorithm
 * In the spirit of the Scheme programming language, Seyfu tries to solve the problem using a minimalist core of powerful primitives upon which one can build abstractions to solve bigger problems. The  is such an abstraction that allows taking advantage of the features and performance of the WiredTiger key-value store without sacrificing too much expressiveness.
 * This in turn allows using the same query language to ask the metadata store about the DAG history of the versioned quad store and to query that quad store. The query language is based on miniKanren, a logic language embedded in Scheme.
 * TODO: implement optional and union, see https://github.com/Swirrl/matcha/commit/c8d21c1ec54020ec1f0cc002d4ccaefca1106cbf
 * Merge commits will resolve conflicts in a way that makes it possible to define a history significance measure, which allows linearizing the Directed Acyclic Graph of changes with a topological sort. Without that, querying versions at any point is not practical; see the second part.
 * Cost based query engine was dropped for the time being.
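
The linearization of the DAG of changes mentioned in the list above can be illustrated with a minimal sketch. Python is used here only for illustration; Seyfu itself is written in Scheme. The sketch assumes that each change lists its parents and carries a numeric `significance` value. The names `linearize`, `parents` and `significance`, and the rule "smallest value first", are hypothetical and do not describe Seyfu's actual measure.

```python
import heapq

def linearize(parents, significance):
    """Return the changes in a topological order (parents first).

    parents: dict mapping each change id to the list of its parent ids.
    significance: dict mapping each change id to a number; among the changes
    that are ready, the smallest value goes first (hypothetical rule).
    """
    children = {change: [] for change in parents}
    pending = {}  # number of parents not yet emitted, per change
    for change, ps in parents.items():
        pending[change] = len(ps)
        for p in ps:
            children[p].append(change)

    # Changes without parents are ready immediately.
    ready = [(significance[c], c) for c, count in pending.items() if count == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, change = heapq.heappop(ready)
        order.append(change)
        for child in children[change]:
            pending[child] -= 1
            if pending[child] == 0:
                heapq.heappush(ready, (significance[child], child))

    if len(order) != len(parents):
        raise ValueError("the history contains a cycle")
    return order

# A root, two branches, then a merge commit.
history = {"root": [], "a": ["root"], "b": ["root"], "merge": ["a", "b"]}
weights = {"root": 0, "a": 2, "b": 1, "merge": 3}
print(linearize(history, weights))
```

Because ties between concurrent branches are broken by a deterministic value, every replica that knows the same changes computes the same linear order.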

Indices
Definition: values

Let $$X$$ be a set of values for which there is a binary relation $$\leq$$ that is a total order on $$X$$. Then the following statements hold for all $$a, b$$ and $$c$$ in $$X$$:


 * Antisymmetry: If $$a \leq b$$ and $$b \leq a$$ then $$a = b$$;
 * Transitivity: If $$a \leq b$$ and $$b \leq c$$ then $$a \leq c$$;
 * Connexity: $$a \leq b$$ or $$b \leq a$$.

Definition: n-tuples

n-tuples are the elements of the ordered set $$X^n$$ where $$n \geq 1$$, i.e. $$X^n = \{(\alpha_1, \ldots, \alpha_n) \mid \forall i \in [1..n],\ \alpha_i \in X\}$$

Definition: knowledge base

Let $$K \subseteq X^n$$ be the subset of n-tuples that are stored in the knowledge base.

Definition: subsets

The set of all subsets of $$X$$ is denoted $$\phi(X)$$. TODO define all subsets of X

Definition: variable

Let $$V$$ be a set disjoint from $$X$$, i.e. $$V \cap X = \emptyset$$. If $$v \in V$$ then $$v$$ is a variable.

Definition: binding

A binding is a function $$b: V \longrightarrow X \cup \{\varnothing\} $$. The set of all bindings is called $$B $$.

Definition: pattern

A pattern is defined as the association of a template and a function as: $$p = \left({t\atop f}\right) $$


 * $$t = ( \pi_1, \pi_2, \ldots, \pi_n ) $$ and $$t[i] = \pi_i \in V \cup X $$, the number of variables in $$t $$ is noted $$|t| = m$$.


 * $$f : B \longrightarrow (X \cup \{\varnothing\})^n $$
 * $$f(b) = (\alpha_1, \ldots, \alpha_n) $$
 * $$f(b)[i] = \alpha_i $$
 * $$f(b)[i] = \begin{cases} b(\pi_i), & \text{if }\pi_i \in V \\ \pi_i, & \text{if }\pi_i \in X \end{cases} $$
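
The function $$f$$ defined above can be sketched in a few lines of Python. This is an illustration under assumptions, not Seyfu's implementation: the class `Var` and the function `apply_pattern` are hypothetical names, a binding is modelled as a dictionary, and `None` stands in for the empty value $$\varnothing$$.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    """A variable, i.e. an element of V, kept distinct from the values of X."""
    name: str

def apply_pattern(template, binding):
    """Compute f(b): replace each variable of the template by its bound value.

    A variable missing from the binding yields None (the empty value).
    """
    return tuple(
        binding.get(item) if isinstance(item, Var) else item
        for item in template
    )

who = Var("who")
template = ("alice", "knows", who)
print(apply_pattern(template, {who: "bob"}))
print(apply_pattern(template, {}))
```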

Definition: pattern binding

Binding $$b $$ is called a pattern binding of $$p = \left({t\atop f}\right) $$ when $$f(b) \in K $$

Definition: pattern bindings

The pattern bindings $$S $$ for $$p = \left({t\atop f}\right) $$ are the largest subset of $$B $$ such that $$\forall b \in S,\ f(b) \in K $$, i.e. $$S = \{b \in B \mid f(b) \in K\} $$

Definition: pattern image

Given the pattern bindings $$S $$ for $$p = \left({t\atop f}\right) $$, the pattern image is defined as $$I = \{f(b) \mid b \in S\} $$

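Property: a pattern image is a subset of $$K $$, that is, an element of $$\phi(K) $$

Proof: every element of the pattern image is $$f(b) $$ for some pattern binding $$b $$, and $$f(b) \in K $$ by definition of a pattern binding.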
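
Pattern bindings and the pattern image can be computed by brute force over a small knowledge base, as in the following sketch. It rests on assumptions: bindings are restricted to the variables that occur in the template, the knowledge base is a Python set of equal-length tuples, and the names `Var`, `match` and `pattern_bindings` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    """A variable, i.e. an element of V."""
    name: str

def match(template, item):
    """Return the binding that maps template onto item, or None if none exists."""
    binding = {}
    for pattern_item, value in zip(template, item):
        if isinstance(pattern_item, Var):
            # A variable used twice must take the same value both times.
            if binding.setdefault(pattern_item, value) != value:
                return None
        elif pattern_item != value:
            return None
    return binding

def pattern_bindings(template, knowledge_base):
    """Return (binding, matched tuple) pairs for every tuple of K that matches."""
    pairs = []
    for item in sorted(knowledge_base):
        binding = match(template, item)
        if binding is not None:
            pairs.append((binding, item))
    return pairs

K = {
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "knows", "alice"),
}
who = Var("who")
pairs = pattern_bindings(("alice", "knows", who), K)
bindings = [binding for binding, _ in pairs]
image = {item for _, item in pairs}
print(bindings)
print(image <= K)  # the image only contains tuples of K
```

A real store would not scan all of $$K$$; the scan only makes the definitions concrete.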

Definition: permutation

A permutation of a set $$S$$ is a function $$f$$ from $$S$$ to $$S$$ for which every element occurs exactly once as an image value. It corresponds to a rearrangement of the elements of $$S$$ in which each element $$s$$ is replaced by the corresponding $$f(s)$$. For example, the permutation (3,1,2) is described by the function $$\alpha$$ defined as:


 * $$\alpha(1) = 3, \quad \alpha(2) = 1, \quad \alpha(3) = 2$$.
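
As a small illustration, the permutation (3,1,2) can be applied to a 3-tuple and undone again. The sketch assumes one particular convention, namely that position $$k$$ of the result receives the element found at position $$\alpha(k)$$ of the input; the function names `permute` and `inverse` are hypothetical.

```python
def permute(permutation, item):
    """Rearrange item so that position k receives item[permutation[k] - 1]."""
    return tuple(item[position - 1] for position in permutation)

def inverse(permutation):
    """Return the permutation that undoes `permutation`."""
    result = [0] * len(permutation)
    for k, position in enumerate(permutation, start=1):
        result[position - 1] = k
    return tuple(result)

alpha = (3, 1, 2)
triple = ("s", "p", "o")
shuffled = permute(alpha, triple)
print(shuffled)
print(inverse(alpha))
print(permute(inverse(alpha), shuffled))
```

Since a permutation is a bijection, an inverse always exists, so a tuple stored in permuted order can always be restored to its original order.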

Definition: index

An index $$I_{(\epsilon_1, \ldots, \epsilon_n)}$$ that covers $$K$$ is associated with the permutation $$i = (\epsilon_1, \ldots,  \epsilon_n)$$ so that $$\forall x \in K,\ i(x) \in I_{(\epsilon_1, \ldots, \epsilon_n)}$$
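
A minimal sketch of such an index follows, assuming that a sorted Python list stands in for the ordered key-value store and that positions are counted from 0. The function name `build_index` and the sample data are hypothetical.

```python
def build_index(knowledge_base, permutation):
    """Return every tuple of K, reordered by `permutation`, in sorted order.

    `permutation` uses 0-based positions: for triples written as
    (subject, predicate, object), (1, 2, 0) stores (predicate, object, subject).
    """
    return sorted(tuple(item[p] for p in permutation) for item in knowledge_base)

K = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "likes", "carol"),
}
pos = build_index(K, (1, 2, 0))
for entry in pos:
    print(entry)
```

Every tuple of $$K$$ appears exactly once in the index, which is what makes it usable as an alternative physical ordering of the same data.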

Definition: prefix range

Let prefix range be a function $$f_I: P_n \longrightarrow \phi(K)$$
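
One way to read the prefix range is as a scan over a sorted index that returns every entry starting with a given prefix. The sketch below makes that reading concrete with the standard `bisect` module; the function name `prefix_range` and the sample index are hypothetical, and the sorted list again stands in for an ordered key-value store.

```python
from bisect import bisect_left

def prefix_range(index, prefix):
    """Yield every tuple of a sorted index that starts with `prefix`."""
    prefix = tuple(prefix)
    # A tuple sorts before any longer tuple it is a prefix of, so this is
    # the position of the first candidate entry.
    start = bisect_left(index, prefix)
    for item in index[start:]:
        if item[:len(prefix)] != prefix:
            break  # entries sharing a prefix are contiguous
        yield item

index = sorted([
    ("knows", "bob", "alice"),
    ("knows", "carol", "bob"),
    ("likes", "carol", "alice"),
])
print(list(prefix_range(index, ("knows",))))
print(list(prefix_range(index, ("likes", "carol"))))
print(list(prefix_range(index, ("owns",))))
```

The cost is one binary search plus one step per returned entry, which is why a pattern is cheap to answer when its bound positions form a prefix of some index.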

TODO: define what querying means

TODO: define what index cover pattern means

Property: When $$n \geq 2$$, a strict subset of all indices covers all patterns

Property: the rotations of the base index can be part of a minimal set of covering indices
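
Since covering is not yet defined in this draft, the following sketch adopts a working assumption: a set of indices covers all patterns when, for every set of bound positions, some index has exactly those positions as a prefix, in any order. Under that assumption the property can be checked by enumeration. The names `covers` and `rotations` are hypothetical, and positions are counted from 0.

```python
from itertools import combinations

def covers(indices, n):
    """Check that every set of bound positions is a prefix of some index.

    indices: list of permutations of range(n), given as tuples.
    """
    prefix_sets = {frozenset(index[:k]) for index in indices for k in range(n + 1)}
    return all(
        frozenset(bound) in prefix_sets
        for k in range(n + 1)
        for bound in combinations(range(n), k)
    )

def rotations(n):
    """Return the n rotations of the base index (0, 1, ..., n-1)."""
    base = tuple(range(n))
    return [base[k:] + base[:k] for k in range(n)]

print(rotations(3))
print(covers(rotations(3), 3))
print(covers(rotations(4), 4))
```

Under this assumption, the three rotations are enough for 3-tuples, while the four rotations alone are not enough for 4-tuples.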

What is the goal? What are the axioms? What are the properties? What are if ... then ...?!

 * Let's consider a permutation $$s: K \longrightarrow K_s $$
 * TODO: how to define a permutation? Is a permutation always reversible?
 * Similarly, there is a permutation $$\sigma(p) = \sigma(\langle \pi_1, \pi_2, \ldots, \pi_n \rangle) = \langle \omega_1, \omega_2, \ldots, \omega_n \rangle = h $$ where $$\omega_i \in K_s $$ or is a variable
 * When $$p(\alpha_1, \ldots, \alpha_m) = y $$ and $$h(\beta_1, \ldots, \beta_m) = z $$ we can do $$s(z) = y $$
 * We can choose the permutation of $$p $$ that is the least complex, that is, permute the pattern so that the operation is less complex
 * Where is the n-permutations of [1..n]
 * Where is the product of Bool^n with [1..n]

Examples
All rotations of the 3-tuple  are solutions:


 * aka.  ie. rotation 0 of the 3-tuple
 * aka.  ie. rotation 1 of the 3-tuple
 * aka.  ie. rotation 2 of the 3-tuple

The remaining permutations are. Each of their prefixes has a permutation that is a prefix of one of the indices:


 * prefix of  is also prefix of the index
 * prefix of  is also prefix of the index
 * prefix of  is also prefix of the index
 * prefix of  has   as permutation which is a prefix of the index
 * prefix of  has   as permutation which is a prefix of the index
 * prefix of  has   as permutation which is a prefix of the index

All rotations of the 4-tuple (graph, subject, predicate, object), plus two permutations whose prefix is not a permutation of any prefix of those rotations:


 * aka.  ie. rotation 0 of the 4-tuple
 * aka.  ie. rotation 1 of the 4-tuple
 * aka.  ie. rotation 2 of the 4-tuple
 * aka.  ie. rotation 3 of the 4-tuple
 * aka.
 * aka.
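
The need for two extra indices in the 4-tuple case can be checked by enumeration. The sketch assumes that a set of bound positions is served by an index when it equals a prefix of that index, in any order. The function `uncovered` is a hypothetical name, and the two extra permutations chosen below are only one possible choice, not necessarily the ones intended above.

```python
from itertools import combinations

def uncovered(indices, n):
    """Return the sets of bound positions that are a prefix of no index."""
    prefix_sets = {frozenset(index[:k]) for index in indices for k in range(n + 1)}
    return [
        bound
        for k in range(n + 1)
        for bound in combinations(range(n), k)
        if frozenset(bound) not in prefix_sets
    ]

names = ("graph", "subject", "predicate", "object")
base = tuple(range(4))
rotations = [base[k:] + base[:k] for k in range(4)]

# Which combinations of bound positions do the four rotations miss?
missing = uncovered(rotations, 4)
print([tuple(names[p] for p in bound) for bound in missing])

# One possible pair of extra indices, each starting with a missing combination.
extras = [(0, 2, 1, 3), (1, 3, 0, 2)]
print(uncovered(rotations + extras, 4))
```

The rotations miss exactly two combinations, (graph, predicate) and (subject, object), which is consistent with needing six indices in total for 4-tuples.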

Problem
TODO: what is the (formal?) problem

Benchmarks
There are three existing solutions: R&WBase, R43ples and Quit Store. They are the only solutions with an implementation that is:


 * distributed in the sense of version control systems, that is, it allows the cooperation enabled by branching and merging,
 * semantically compliant in the sense that it implements versioning in a way that makes a commit understandable by machines through a query.

The benchmark will quantitatively compare these systems on several aspects:


 * Memory storage performance
 * Disk storage performance
 * Time required to access different versions

These results are compared to two baselines, namely git and Virtuoso.

Task 1: Wikidata
The benchmark baseline is Virtuoso X.Y. https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples

Task 2: Linux
The Linux kernel git repository is exported as quadruples and loaded into each solution. The benchmark baseline is the performance of git, because it is the industry standard for versioning source code. Git is also the most common solution used to track changes in open data. It provides a baseline for assessing the performance and usability of Seyfu regarding the cooperation features.

Conclusion
hello, world!