User:OpenScientist/Open grant writing/Wissenswert 2011/Documentation

''This page hosts information about the project in its initial stages. We have since moved the codebase to GitHub entirely, while the bot now runs on Wikimedia Commons.''

Architecture
Translated from the German project proposal.

The Open Access Media Importer for Wikimedia Commons has a modular design, which makes it easier to add new media types, resources or output formats. In general, every component takes an item out of a queue, processes it and puts the resulting data into the queue of the next component. We envisage all components running on the same server.
 * The Crawler/Scraper scans a list of Open Access resources for new articles with multimedia files (particularly video but also audio). This can be achieved via a search API (if available, e.g. at PLoS), a local search (PLoS example) or Google (example: PLoS ONE). For each matching article, the URLs and metadata (creator, description, licensing, original article etc.) of the media files are extracted and stored locally.
 * The Downloader downloads these media files and saves them locally.
 * The Transcoder converts the media files into Commons-compatible open formats (mainly Ogg Theora, Ogg Vorbis) and embeds the metadata in the resulting files.
 * The Categorizer analyses the files and their metadata and suggests Commons categories that may be suitable.
 * The Review Tool allows the user to check image and sound quality, licensing, metadata and categories and to fix any errors before the upload is approved.
 * The Uploader uploads the files along with metadata and categories.
 * The components are configured via a protected wiki page, so that non-programmers can also add new resources for the tool to work on.
 * Before processing a file, all components check whether it has already been processed or even uploaded to Wikimedia Commons. In such cases, work on the file is skipped by default.
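
The queue-based hand-over between components described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names (run_pipeline, download, transcode, categorize), the dictionary fields and the example URLs are all assumptions made for this sketch, and the stages do no real work.

```python
from collections import deque


def run_pipeline(items, stages, already_done=frozenset()):
    """Pass each item through the stages in order, using one queue per stage."""
    # One input queue per stage, plus a final queue that collects the results.
    queues = [deque() for _ in range(len(stages) + 1)]
    queues[0].extend(items)
    for index, stage in enumerate(stages):
        while queues[index]:
            item = queues[index].popleft()
            # Skip items that were already processed or uploaded.
            if item["url"] in already_done:
                continue
            queues[index + 1].append(stage(item))
    return list(queues[-1])


# Placeholder stages (hypothetical): each returns a copy with one extra field.
def download(item):
    return {**item, "local_path": item["url"].rsplit("/", 1)[-1]}


def transcode(item):
    return {**item, "converted_path": item["local_path"] + ".ogv"}


def categorize(item):
    return {**item, "categories": ["Example category"]}


if __name__ == "__main__":
    found = [{"url": "http://example.org/a.mov"},
             {"url": "http://example.org/b.mov"}]
    done = {"http://example.org/b.mov"}
    for result in run_pipeline(found, [download, transcode, categorize], done):
        print(result)
```

In this sketch the "already processed" check is a simple set lookup on the media URL; a real component would presumably consult local state or query Wikimedia Commons instead.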

To Do
''This list is to be updated as we move forward. Before making changes here, please check the most recent uploads to see whether the proposed change has indeed been implemented.''

Wiki

 * update documentation
 * add milestones
 * PMCID does not display in commons:Cite journal. ✅
 * Cite doi template formatting from enwp does not work on Commons

Code

 * Bug tracking moved to https://github.com/erlehmann/open-access-media-importer/issues

Discuss

 * RW: Should we add the license and source to each video's metadata? ✅
 * I'd say yes. --Daniel Mietchen 21:55, 20 February 2012 (UTC)


 * RW: Should we inform the corresponding author about the import to WM Commons?
 * I don't think so, though I have informed authors in the past about actual use of "their" files. --Daniel Mietchen 21:55, 20 February 2012 (UTC)
 * A single e-mail, with an opt-in for further notifications, and a link to an author's listed Commons contributions, seems minimally intrusive. It might also be useful, under "community outreach" or "broader impact" in a grant or tenure application. We'd rely on the corresponding author to forward the e-mail to interested co-authors, but we could also offer opt-in notifications (#icanhazwikicommonsnotifications?). We'd have to code each author with their e-mail address, has-been-notified-once and opted-in, so this would take some work, and may not be worth it. But I doubt anyone will object to one e-mail per e-mail address, ever, except on request. HLHJ (discuss • contribs) 16:11, 15 June 2019 (UTC)


 * I don't understand the purpose of having three different commands (oa-get, oa-put, and oa-cache), each of which takes as its first argument a longish subcommand (download-metadata, etc.). I would vote either to combine all of these into one command (maybe, "oami"), or to abbreviate the subcommands, or maybe even both. Klortho (talk) 23:14, 3 June 2012 (UTC)
 * We do not vote here. If you are willing to submit a patch to overhaul the option parsing system (maybe using python-opster?), feel free to do so.
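
For illustration, the single-command layout suggested in this thread could look roughly like the sketch below. It uses argparse from the Python standard library rather than python-opster, and it is not a patch against the existing code: only the name "oami" and the subcommand "download-metadata" come from the discussion above, while "upload-media", the short aliases and the "source" argument are assumptions.

```python
import argparse

# Only "download-metadata" is named in the discussion; the second entry
# and all short aliases are hypothetical and serve as illustration only.
SUBCOMMANDS = {
    "download-metadata": "dm",
    "upload-media": "um",
}


def build_parser():
    """Build one 'oami' command with long subcommands and short aliases."""
    parser = argparse.ArgumentParser(prog="oami")
    subparsers = parser.add_subparsers(dest="invoked_as", required=True)
    for name, alias in SUBCOMMANDS.items():
        sub = subparsers.add_parser(name, aliases=[alias])
        # Assumed positional argument naming the data source to work on.
        sub.add_argument("source", help="name of the data source")
        # Record the canonical subcommand regardless of the alias typed.
        sub.set_defaults(action=name)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.action, args.source)
```

With this layout, the long form and its alias resolve to the same action, so abbreviating the subcommands and merging the three commands are not mutually exclusive.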

Source Code

 * Source code is hosted on GitHub.
 * All source code is licensed under the terms of the GPL v3.

About

 * Initial announcement; First progress report
 * Project proposal (in German)

OA Repositories

 * Directory of Open Access Journals
 * PubMedCentral: Open Access Subset documentation
 * PubMedCentral: Search results for supplementary videos in OA articles
 * Example of an article with multiple videos

Tools

 * BeautifulSoup (HTML/XML parsing) documentation

Blog posts
Blog posts are published on the Wikimedian in Residence blog, where they are also listed by category. Some specific posts are:


 * December 15, 2011: Announcement and overview
 * January 18, 2012: Roadmap and crawler
 * March 10, 2012: Frontend
 * March 29, 2012: Plugging in your own data source
 * April 30, 2012: Encoding
 * September 30, 2012: Open Access Media Importer: Presentation at WMDE, Collaborative Coding