Legality of Data Extraction in the World

This research project aims to classify the legal status of data extraction in each country of the world, taking into account a variety of situations.

Context
This project aims to clarify the legal implications of massive extraction from, and importation into, a libre-licensed data bank. It focuses especially on extractions conducted under the "data about facts" flag. Some people think this kind of practice raises no legal problem, because:
 * it only includes what they consider to be facts,
 * facts are not copyrightable.

However, this argument seems shaky, especially at an international level. Copyright is not the only informational monopoly out there. For example, the French Wikiquote has already been forced to blank its whole database and restart from scratch because of sui generis database right issues.

Scope
This research aims only at a rather high-level overview of the worldwide legal status of data extraction. In particular, a detailed consideration of this status for each country is out of scope.

It also focuses only on bare legal requirements, without taking into account license terms that might modulate them. That said, a license can only reduce the legal obligations of a third party regarding a data bank; it cannot legally impose requirements more constraining than what the law provides for.

Finally, this research leaves the duration of each monopoly out of scope. For example, copyright duration might vary from country to country depending on the date of first publication, the date of death of the last surviving author, or even the way an author died.

Goals
More specifically, this project aims to evaluate the legal risks surrounding a data bank that imports massive extractions from other sources on a large scale, and the consequences for:
 * the sustainability of the receiving project, with regard to a possible legally enforced shutdown
 * the people feeding the database by means of massive extraction and import from incompatibly licensed sources
 * the people reusing this data downstream without following the terms of use of the original source
 * the hosting provider
 * the publisher

Methodology
To reach the previous goals within the given scope, the project proposes to establish a legality matrix per country, using a value grid based on the typology proposed in the following sections.

Type of extraction

 * formal extraction: Any direct extraction of data quoted as they appear in the original source. For example, a full verbatim copy of Proposed Roads to Freedom: Socialism, Anarchism, and Syndicalism falls under this category. So does a single-sentence quote such as "The world that we must seek is a world in which the creative spirit is alive, in which life is an adventure full of joy and hope, based rather upon the impulse to construct than upon the desire to retain what we possess or to seize what is possessed by others." The same goes for ever smaller extracts such as "life is an adventure full of joy and hope", "life is an adventure", "life is", "life", or even "l" alone.


 * analytic extraction: Any piece of information inferred from a formal extraction. For example, from the previous quote one might infer that Bertrand Russell thought that fostering creativity is an advisable goal for human society, or that Bertrand Russell was an anarchist. Surely this kind of statement would not be deemed false under most common interpretations of the previous text.

Because analytic statements are inferred rather than directly stated in the original text, they might be considered creative works, and creativity is a common criterion for copyright eligibility. This is most likely the ground on which the claim "data are not copyrightable" rests, a claim that some people defend with no room for further nuance. However, depending on the starting and ending statements, the difficulty of analytic extraction might vary widely. It is far easier, at least by automated means, to extract data from a source organised according to a definite schema than from a free prose work, and it is far easier to produce a report consisting solely of tables or simple single predicative statements than to produce free prose.

A sentence such as "Russell played a significant part in the Leeds Convention in June 1917, a historic event which saw well over a thousand "anti-war socialists" gather; many being delegates from the Independent Labour Party and the Socialist Party, united in their pacifist beliefs and advocating a peace settlement." would be far more difficult to both analyse and synthesise by automated means than "Russell was a prominent anti-war activist and he championed anti-imperialism."

Quantity extracted
Depending on how many statements the resulting data bank includes, the ease of synthesising a resource that is close or even identical to the original source will also vary greatly. This is why the type of extraction is not the sole criterion that the law takes into consideration: the quantity extracted also matters.

For example, Wikiquote can exist thanks to the legal right to include limited extracts of any original work in another. But as experience has shown with Wikiquote, this does not mean that any set of quotes published as an original collection can be duplicated as is in another data bank.

Even data that most people would certainly qualify as facts can be gathered into an original published collection. Take, for example, the physical characteristics of water. Probably no one is going to claim that "Water boils at 100°." is a creative monostich. On the other hand, it is not difficult to find resources that gather miscellaneous physical characteristics of water into a single table, and claim "Any recopy of this table on another website or in another form of publication is completely forbidden. […] Copyright © […] All Rights Reserved".


 * single extract
 * few extracts
 * substantial extract
 * whole extract

Penalty

 * none: people don't risk anything, most likely because the practice is legal
 * fine: people might have to pay a fine; the maximum amount should be given in brackets
 * jail: people might be imprisoned; the maximum duration should be given in brackets
 * death: people might be sentenced to death
 * This list is minimal; the typology might be extended to reflect the diversity of actual relevant penalties

Concerned people

 * extractor
 * hosting provider
 * publisher
 * consulter
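
As an illustration only, the legality matrix described under Methodology could be sketched as a small data structure. The sketch below is an assumption about one possible layout, not a format defined by this project: the class names, the empty_grid helper, and the placeholder country code "XX" are all hypothetical. Every cell starts as unknown (None) until legal research fills it in, and the single example value is a placeholder, not a legal finding.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import Dict, Optional, Tuple


# The four axes of the typology, named after the lists above.
class ExtractionType(Enum):
    FORMAL = "formal"
    ANALYTIC = "analytic"


class Quantity(Enum):
    SINGLE = "single"
    FEW = "few"
    SUBSTANTIAL = "substantial"
    WHOLE = "whole"


class Role(Enum):
    EXTRACTOR = "extractor"
    HOSTING_PROVIDER = "hosting provider"
    PUBLISHER = "publisher"
    CONSULTER = "consulter"


class PenaltyKind(Enum):
    NONE = "none"
    FINE = "fine"
    JAIL = "jail"
    DEATH = "death"


@dataclass(frozen=True)
class Penalty:
    kind: PenaltyKind
    maximum: Optional[str] = None  # maximum amount or duration, shown in brackets

    def __str__(self) -> str:
        if self.maximum:
            return f"{self.kind.value} [{self.maximum}]"
        return self.kind.value


Cell = Tuple[ExtractionType, Quantity, Role]
Grid = Dict[Cell, Optional[Penalty]]


def empty_grid() -> Grid:
    """Return one country's grid with every cell still unknown (None)."""
    return {cell: None for cell in product(ExtractionType, Quantity, Role)}


# Hypothetical usage: "XX" is a placeholder country code, and the value
# below is a placeholder, not an actual legal assessment.
matrix: Dict[str, Grid] = {"XX": empty_grid()}
matrix["XX"][(ExtractionType.FORMAL, Quantity.WHOLE, Role.EXTRACTOR)] = Penalty(
    PenaltyKind.FINE, "placeholder amount"
)
```

Keeping unknown cells as None rather than defaulting them to "none" avoids confusing "not yet researched" with "no penalty". If the penalty typology is extended, only the PenaltyKind enumeration would need new members.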