User:OpenScientist/Open grant writing/Wissenswert 2011/Documentation/Crawler

Purpose
This tool shall regularly crawl open access repositories for articles containing supplementary materials that have not yet been uploaded to Wikimedia Commons.

Example workflow
''The following takes the Open Access Subset of PubMed Central as an example.''

Repeat at regular intervals

 * Fetch the file list (CSV; also available as a plaintext version)
 * Find articles that have changed since the last update
 * Process each new article
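
The update step above could be sketched as follows. This is only a minimal illustration: the column names ("File", "Last Updated") and the timestamp format are assumptions about the file list, not taken from its actual layout, and the function name changed_since is hypothetical.

```python
import csv
import io
from datetime import datetime


def changed_since(file_list_csv, last_update,
                  path_column="File", date_column="Last Updated"):
    """Return the paths of all entries updated after `last_update`.

    `file_list_csv` is the text of the file list. The column names and
    the timestamp format are assumptions made for this sketch.
    """
    reader = csv.DictReader(io.StringIO(file_list_csv))
    changed = []
    for row in reader:
        updated = datetime.strptime(row[date_column], "%Y-%m-%d %H:%M:%S")
        if updated > last_update:
            changed.append(row[path_column])
    return changed
```

The crawler would store the time of each run and pass it as last_update on the next one, so that only new or changed articles are processed.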

XML files
Each archive is about 1 GB in size. Download them only for the initial run, not at every update.


 * ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.A-B.tar.gz
 * ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.C-H.tar.gz
 * ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.I-N.tar.gz
 * ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.O-Z.tar.gz
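
Because the archives are large, it may help to read them sequentially instead of unpacking them to disk. The following is a sketch using Python's standard tarfile module; the ".nxml" suffix for article files and the function name iter_article_xml are assumptions made for illustration.

```python
import tarfile


def iter_article_xml(archive, suffix=".nxml"):
    """Yield (member name, raw bytes) for each article XML file.

    `archive` is a binary file object for one of the .tar.gz archives.
    The stream mode "r|gz" reads the archive front to back without
    seeking, so nothing has to be extracted to disk first.
    The ".nxml" suffix is an assumption made for this sketch.
    """
    with tarfile.open(fileobj=archive, mode="r|gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                yield member.name, tar.extractfile(member).read()
```

A caller would open a downloaded archive in binary mode, e.g. open("articles.A-B.tar.gz", "rb"), and pass the file object to the function.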

ID list
Not strictly necessary, but could be used for checking in case of problems or for updates.
 * ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.txt

Find articles with supplementary materials
Example: attributes and captions of the two supplementary-material entries of 10.1371/journal.pone.0000133.

Movie S1 (xlink:href="info:doi/10.1371/journal.pone.0000133.s002" position="float" xlink:type="simple"): Movie showing the black smoker acoustic recording system deployed at the Sully vent in September 2004 with audio from the same deployment. The audio and video are not contemporaneous because the remotely-operated vehicle carrying the video camera generated excessive noise. The video is included to provide context for the audio. The audio has been upsampled to 8 kHz, and high-pass filtered at 10 Hz using a 4-pole Butterworth filter. It is played in real-time (i.e. without time stretching or pitch shifting). Because much of the acoustic energy falls below ∼100 Hz, speakers with good bass response are required to properly reproduce the sound. Most laptop speakers will not produce sound. (9.58 MB MOV)

Audio S1 (xlink:href="info:doi/10.1371/journal.pone.0000133.s001" position="float" xlink:type="simple"): Audio file containing a short section of sound collected with the black smoker acoustic recording system at Puffer in September 2005. The audio has been upsampled to 8 kHz, and high-pass filtered at 10 Hz using a 4-pole Butterworth filter. It is played in real-time (i.e. without time stretching or pitch shifting). Because much of the acoustic energy falls below ∼100 Hz, speakers with good bass response are required to properly reproduce the sound. Most laptop speakers will not produce sound. (0.96 MB WAV)

 * Check the XML files for articles with supplementary files (e.g. Movie S1 and Audio S1).
 * If such files exist, check the supplementary materials for video (e.g. PMC1762412).
 * Prefix the relative URL given in xlink:href with http://www.ncbi.nlm.nih.gov/pmc/articles/PMC<PMCID>/bin/ (where <PMCID> is the article's PMC ID) to download the files under their "supplementary-material id", for example:
 * http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1762412/bin/pone.0000133.s002.mov (suffixes are added according to the MIME type or the final line of the caption)
 * and http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1762412/bin/pone.0000133.s001.wav.
 * Store the metadata for further processing.
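
The steps above could look roughly like this in Python. It is a sketch under assumptions, not the crawler's actual code: it assumes that the id attribute of each supplementary-material element holds the file stem (as in pone.0000133.s002) and that the caption ends in a size note such as "(9.58 MB MOV)", from which the suffix is taken. All function and constant names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
BIN_URL = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmcid}/bin/{name}"
# Matches a trailing size note such as "(9.58 MB MOV)" or "(0.96 MB WAV)".
SIZE_NOTE = re.compile(r"\(\s*[\d.]+\s*[KMG]B\s+([A-Za-z0-9]+)\s*\)\s*$")


def suffix_from_caption(caption):
    """Return the lower-cased format from the caption's final size note."""
    match = SIZE_NOTE.search(caption.strip())
    return match.group(1).lower() if match else None


def supplementary_materials(article_xml, pmcid):
    """Collect id, href, caption and download URL of each supplement.

    Assumption: the element's `id` attribute is the file stem. Entries
    without an id or without a recognisable size note are skipped.
    """
    root = ET.fromstring(article_xml)
    results = []
    for element in root.iter("supplementary-material"):
        material_id = element.get("id")
        # Join all text below the element and collapse whitespace.
        caption = " ".join(" ".join(element.itertext()).split())
        suffix = suffix_from_caption(caption)
        if material_id is None or suffix is None:
            continue
        name = f"{material_id}.{suffix}"
        results.append({
            "id": material_id,
            "href": element.get(XLINK_HREF),
            "caption": caption,
            "url": BIN_URL.format(pmcid=pmcid, name=name),
        })
    return results
```

The returned dictionaries are one possible shape for the metadata that is stored for further processing. Deriving the suffix from the MIME type instead of the caption would be an alternative where the caption carries no size note.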

Code
The code for the crawler is being developed in the GitHub project https://github.com/erlehmann/open-access-media-importer.

Many files

 * 10.1371/journal.pone.0000133 - audio and video in supplement
 * 10.1371/journal.pone.0005929 - audio and video in supplement
 * 10.1371/journal.pone.0010346 - audio and video in supplement


 * 10.1371/journal.pone.0000794 - multiple videos in supplement
 * 10.1371/journal.pone.0008793 - multiple videos in supplement
 * 10.1371/journal.pone.0010848 - multiple videos in supplement
 * 10.1371/journal.pone.0018243 - multiple videos in supplement
 * 10.1371/journal.pone.0020395 - multiple videos in supplement
 * 10.1371/journal.pone.0025109 - multiple videos in supplement
 * 10.1371/journal.pone.0027227 - multiple videos in supplement
 * 10.1186/1472-6785-10-9 - multiple audio files in supplement

Videos with sound

 * 10.1371/journal.pone.0016128, Movie S1; declared as "text" in XML
 * 10.1371/journal.pone.0032931, Video S1
 * 10.1371/journal.pone.0011385, Video S1

Audio

 * WAV, e.g. 10.1371/journal.pone.0001580, 10.1371/journal.pone.0005915, 10.1371/journal.pone.0007808
 * MP3, e.g. 10.1371/journal.pone.0004065
 * OGG: none in PLoS ONE or PLoS Biology so far

Video

 * MP4, e.g. 10.1371/journal.pone.0011385
 * M4V, e.g. 10.1371/journal.pone.0013812
 * AVI, e.g. 10.1371/journal.pone.0003826
 * MOV, e.g. 10.1371/journal.pone.0004497
 * OGV: none in PLoS ONE or PLoS Biology so far

Sources of errors

 * 10.1371/journal.pone.0005977 has multiple movies labeled as being in MP3 format
 * 10.1371/journal.pone.0002804 had Figs. 8 and 9 mixed up. The same could happen to supplementary files or their legends. Sometimes (as in this case), a formal correction is published.
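
Some of these inconsistencies might be flagged automatically by comparing the label of a supplementary file with its declared format. The sketch below is an assumption about how such a check could look. The suffix sets follow the formats listed above, and the label keywords ("movie", "video", "audio") are taken from the example labels on this page. All names are hypothetical.

```python
AUDIO_SUFFIXES = {"wav", "mp3", "ogg"}
VIDEO_SUFFIXES = {"mp4", "m4v", "avi", "mov", "ogv"}


def media_kind(suffix):
    """Classify a file suffix as "audio", "video" or None (unknown)."""
    suffix = suffix.lower().lstrip(".")
    if suffix in AUDIO_SUFFIXES:
        return "audio"
    if suffix in VIDEO_SUFFIXES:
        return "video"
    return None


def label_kind(label):
    """Guess the media kind from a label such as "Movie S1"."""
    text = label.lower()
    if "movie" in text or "video" in text:
        return "video"
    if "audio" in text:
        return "audio"
    return None


def is_suspicious(label, suffix):
    """True if the label names a media kind that the suffix contradicts."""
    expected = label_kind(label)
    return expected is not None and expected != media_kind(suffix)
```

For example, is_suspicious("Movie S1", "mp3") is true, so a case like 10.1371/journal.pone.0005977 would be queued for manual review instead of being processed automatically. Mixed-up legends cannot be detected this way.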

Potential further targets for crawling

 * The World Bank's Open Knowledge Repository (CC BY, with XML version)
 * Hindawi XML Corpus Download (mostly CC BY, some CC0)

Blog posts

 * http://wir.okfn.org/category/open-access-media-importer/