Open Global Health/ContentMine Fellowship Application

This is an application by Ale Abdo to the ContentMine Fellowships.

It was successful and work continues at Open Global Health/ContentMine.

Proposal
My research idea involves extracting facts from a portion of the global health literature in order to:
 * make them useful to researchers, health practitioners, policy makers and affected people in critical regions
 * publish research on the aggregate evolution of global health research, identifying new trends

Solving global health issues, such as those involved in epidemics, nutrition, reproduction and, more broadly, violence and chronic diseases and habits, is as much a matter of large-scale coordination of society, engaging all levels of health care, as it is a matter of having correct knowledge about the current situation and about the factors that influence, or could influence, the evolution of these diseases and conditions.

Surprisingly, though, global health information often comes, and sometimes remains, either in the form of closed access research, such as conference abstracts, or in formulations inaccessible to most people and even practitioners, such as technical papers without a uniform framework for exchange and employing a variety of tools often of the "black box" type. The situation with research data is just as critical, although beyond our present scope.

My plan is to work on two fronts: access and usability. First, to profit from work I'm already engaged in as part of my daily job, which includes automating the extraction of the totality of conference abstracts for major cancer research conferences, by reusing those tools to process global health abstracts and feed them to ContentMine. Second, to use ContentMine to expose new facts from research published both in journals and in conference abstracts, looking for the emergence of topics and innovations that might guide practice and policy, but also comparing the two types of publications to spot knowledge that might be getting lost by never getting published, such as negative results or exploratory research.

As an extended goal, I'd also like to look for signatures of knowledge that is available in the conference literature and that does end up getting published in journals later, in an attempt to predict which facts from conferences we can trust in those cases where decisions cannot wait until they reach final publication, such as epidemic outbreaks.

This proposal belongs in the context of a group of people working on related issues, the Open Global Health group, formed at OpenCon 2015 during an unsession led by Neo Christopher Chung.

This proposal will also likely profit from a series of formative methodological workshops being organized by Célya Gruson-Daniel and Peter Grabitz, of the Centre Virchow-Villermé for Public Health Paris-Berlin. Being part of the organization myself, I would present ContentMining and the perspective of fact extraction, and train interested researchers in the possibilities they open.

Furthermore, the laboratory I work at, LISIS, coordinates the development of a literature analysis platform called CorText Manager, crafted for Science and Technology Studies (STS) researchers yet used by other fields as well, whose coordinators have expressed interest in pursuing collaboration with ContentMine to expand each other's capabilities.

Exercise notes
The fellowship process proposed an exercise to applicants. These are my notes on the exercise.

See also my repository for the exercise at GitLab.

getpapers
I never install anything from npm/pip/similar as root, because I don't trust these packaging systems. So I'm trying to install locally for a user I created specifically for this exercise.

The only abnormal thing during the installation was a warning: npm WARN engine xmlbuilder@8.2.2: wanted: {"node":">=4.0"} (current: {"node":"v0.10.42","npm":"1.3.6"})

norma
I don't run Debian on my personal computer, though I use it for most of the servers I run, so I chose to download the file below. It is kind of weird that the release number and the file name don't match.

 

Downloading and extraction went fine.

ami
Same as for norma, with the file below. It is kind of weird that the release number and the file name don't match.

 

Downloading and extraction went fine.

quickscrape
Same observations as for getpapers and npm. Everything went fine on installation.

getpapers
I ran getpapers, adapting the instructions to my local install.

It downloaded the files as expected.

 info: Searching using eupmc API
 info: Found 131 open access results
 Retrieving results [==============================] 100% (eta 0.0s)
 info: Done collecting results
 info: Saving result metadata
 info: Full EUPMC result metadata written to eupmc_results.json
 info: Individual EUPMC result metadata records written
 info: Extracting fulltext HTML URL list (may not be available for all articles)
 info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
 warn: Article with pmid "27227226 did not have a PMCID (therefore no XML)
 warn: Article with pmid "27227218 did not have a PMCID (therefore no XML)
 warn: Article with pmcid "PMC4586171" was not Open Access (therefore no XML)
 warn: Article with pmcid "PMC4212306" was not Open Access (therefore no XML)
 info: Got XML URLs for 127 out of 131 results
 info: Downloading fulltext XML files
 Downloading files [============================--] 92% (117/127) [14.4s elapsed, eta 1.2]
 info: All downloads succeeded!

Glancing at the data, results contain comparative tests of drugs, statistical analysis of large surveys, and interventions, related to diarrhea. They're centered around the region of interest as expressed in the query, but not strictly. One can see the different institutions, in Africa and elsewhere, that are concerned with the topic. Within Africa, it becomes evident that there is frequent collaboration between Angola and either Congo or Sudan. It is also noted that there is a disease which has the word "angola" in its name, which would have to be filtered at some point to improve the dataset. As expected, the dataset covers diverse diseases for which diarrhea is a symptom.
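
The filtering of the disease name mentioned above could be approached along the following lines. This is only a minimal sketch: the function name `mentions_place` is made up for illustration, and "Placeholder Angola virus" is an invented stand-in, since these notes do not record the actual disease name. The idea is to blank out the known disease phrases first, and only then check whether the place name still occurs on its own.

```python
import re


def mentions_place(text, place, excluded_phrases):
    """Return True if `place` occurs in `text` outside of any excluded phrase."""
    lowered = text.lower()
    # Blank out each excluded phrase so the place name embedded in it is not counted.
    for phrase in excluded_phrases:
        lowered = re.sub(re.escape(phrase.lower()), " ", lowered)
    # Look for the place as a whole word in whatever text remains.
    pattern = r"\b" + re.escape(place.lower()) + r"\b"
    return re.search(pattern, lowered) is not None


# Hypothetical disease name, used only as a placeholder for this sketch.
DISEASE_NAMES = ["Placeholder Angola virus"]

print(mentions_place("A survey of diarrhea in Angola", "Angola", DISEASE_NAMES))
print(mentions_place("Placeholder Angola virus cases in Europe", "Angola", DISEASE_NAMES))
```

A paper that mentions both the disease and the country is still kept, because the country name survives once the disease phrase has been blanked out.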

norma
Norma worked fine, printing a few "!!!", some "." and some "UNKOWN: ..." messages. It created the 'scholarly.html' file for 117 out of the 131 downloaded papers.
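
A count like the one above can be checked with a few lines of Python. This is a sketch under an assumption about the layout: that each paper sits in its own subdirectory of the getpapers output folder and that norma writes 'scholarly.html' inside it. The function name `norma_coverage` is made up for illustration.

```python
from pathlib import Path


def norma_coverage(project_dir, output_name="scholarly.html"):
    """Return (converted, total) over the immediate subdirectories of project_dir."""
    # Assumption: one subdirectory per paper; loose files such as
    # eupmc_results.json at the top level are ignored.
    paper_dirs = [p for p in Path(project_dir).iterdir() if p.is_dir()]
    converted = [p for p in paper_dirs if (p / output_name).is_file()]
    return len(converted), len(paper_dirs)
```

Calling `norma_coverage` on the output folder returns a pair such as `(converted, total)`, which makes it easy to list afterwards which papers were skipped.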

ami
As it was suggested that EUPMC XML doesn't need norma, I tried running ami both after norma and directly.

At first I ran into an error because 'cmine' seems to mistake the space in the folder name for multiple arguments, even though it is properly escaped, and complains.

After renaming my folder to something without spaces, it worked.
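
I did not look into why this happens, but one common cause of this kind of error is a wrapper that joins its arguments into a single string and splits them again on whitespace. The sketch below illustrates that general mechanism only; it is an assumption, not a description of what 'cmine' actually does, and the argument values are made up.

```python
import shlex


def naive_forward(args):
    """Join the arguments into one string and split on whitespace again."""
    return " ".join(args).split()


def safe_forward(args):
    """Quote each argument before joining, so a shell-style split restores it."""
    return shlex.split(" ".join(shlex.quote(a) for a in args))


args = ["run", "my folder"]  # hypothetical arguments, one of them with a space
print(naive_forward(args))  # ['run', 'my', 'folder']
print(safe_forward(args))   # ['run', 'my folder']
```

In the naive version the folder name arrives as two separate arguments, no matter how carefully it was escaped on the original command line.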

Results indeed look the same with and without going through norma. In terms of content, they seem nicer than those of generic language processing and counting tools, in that ami identifies names of species and genes with good precision, though it does confuse genes and enzymes somewhat. I imagine there are options, or the possibility, to extend this to diseases and drugs, and perhaps even further into research techniques and treatment names, which would be great for my proposal. The links to Wikipedia are a nice touch.

quickscrape
The term 'CProject' does not appear in the repository, and I can't figure out what it is. I assume 'suppinfo' means 'Supporting Information'.

Passing a relative path to the scraper definition does not work as intended: the path seems to be resolved from the output dir.
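
A possible workaround is to turn the relative path into an absolute one before handing it to the tool, so that it no longer matters which directory the path gets resolved from. The sketch below assumes that this is indeed the cause; the helper name `absolute_scraper_path` and the example file name are made up for illustration.

```python
import os


def absolute_scraper_path(scraper_path, base_dir=None):
    """Resolve scraper_path against base_dir (default: the current directory)."""
    base = base_dir if base_dir is not None else os.getcwd()
    # os.path.join keeps scraper_path unchanged if it is already absolute.
    return os.path.normpath(os.path.join(base, os.path.expanduser(scraper_path)))


# Hypothetical scraper definition file, relative to the current directory.
print(absolute_scraper_path("scrapers/example.json"))
```

The printed path can then be passed on the command line in place of the relative one.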

Requesting the URL for a figure supplement from eLife identifies the figures but does not download them.

The number of elements captured is always smaller than the total available, but that might be expected.

Getting suppinfo files from PLOS that are hosted on Figshare doesn't seem to work.