The usefulness of the free corpus analysis software TXM in anthropology and digital ethnography

Acknowledgement

I would like to thank Psychoslave for informing me about the existence of the Wikimedia-l mailing list, the team of the Textométrie project that develops and supports the TXM software, the instructors of the course Méthodologie de l'analyse de corpus en linguistique taught at UCLouvain during the years 2017-2018 and 2018-2019, and finally the company DeepL GmbH for its freely accessible online translation service, which I used to write this text from my native French.

Introduction and initial question
With the rise of ICTs, anthropologists, like other researchers in the social sciences and humanities, are increasingly confronted with the use and observation of means of communication deployed in various digital spaces. Social networks, forums, instant messaging groups, mailing lists and collaborative sites have become places of communication and life for many human communities. As a result of this evolution, the sciences, including anthropology, have turned them into new fields of research, such as digital anthropology.

This research is therefore situated in anthropology as a science and in ethnography as a methodology. We will not discuss linguistic anthropology here, nor the analysis of anthropological literature, nor even online ethnography as a complement to the linguistic analysis of log data, but rather corpus analysis software as a research tool in the anthropological ethnographic field. Digital communication spaces offer many possible corpora, and many corpus analysis tools exist, but this case study is limited to the archives of a mailing list and to the TXM software.

Thus, the initial question of this research can be summarized as follows:

"How can the corpus analysis software TXM help a digital anthropologist in his or her ethnographic fieldwork?"

Why the Wikimedia mailing list?
One reason for choosing the Wikimedia mailing list as a corpus is that the Wikimedia movement is the focus of my doctoral thesis. Another reason was the ease of creating the corpus: the content of the mailing list archive was copied and pasted, month by month, into separate files in .txt format that are directly usable by the software.

The content of this mailing list was of interest to me, but difficult to exploit by simple reading. The idea of using TXM as a sophisticated search engine to explore these thousands of messages came to mind and is at the origin of the present research.

As a last argument, the archives of the mailing list are directly copied and pasted, month by month, from a web page displaying the CC-BY 3.0 license, which makes their use very easy.

Mailing list description
The Wikimedia community mailing list, labeled "Wikimedia-l", is a discussion list for the Wikimedia community and the larger network of organizations (Wikimedia Foundation, chapter organizations, affiliates, partners) supporting its work. This mailing list can, for example, be used for:


 * The initial planning phase of potential new Wikimedia projects and initiatives
 * Organizational issues of the Wikimedia Foundation, chapter organizations, others
 * Discussing the set-up of local Wikimedia chapters
 * Developing and evaluating grant-making programs
 * Planning elections, polls and votes
 * Discussion of projects that don't already have a mailing list
 * Finding ways to raise funds
 * Other Wikimedia-related issues

Corpus description
The corpus initially consisted of a folder containing X files (one file per month from April 2004 to April 2018) for a total of X MB. The whole text amounts to X words.
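The figures above (number of files, total size, number of words) can be estimated before the import into TXM. The following is only a minimal sketch in Python, assuming the monthly archives are UTF-8 .txt files stored in a single folder. The function name describe_corpus is hypothetical, and the whitespace-based word count is an assumption that will differ somewhat from the token count TXM reports after its own tokenization.

```python
from pathlib import Path


def describe_corpus(folder):
    """Count the .txt files, their total size and an approximate number of words.

    Assumption: words are approximated by splitting on whitespace, which is
    not the tokenization performed by TXM during import.
    """
    files = sorted(Path(folder).glob("*.txt"))
    total_bytes = sum(f.stat().st_size for f in files)
    total_words = sum(
        len(f.read_text(encoding="utf-8", errors="replace").split())
        for f in files
    )
    return {
        "files": len(files),
        "bytes": total_bytes,
        "megabytes": total_bytes / 1_000_000,
        "words": total_words,
    }
```

Calling describe_corpus on the corpus folder returns a dictionary whose values can be used to fill in the description of the corpus.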

Why TXM
Some people claim to be vegetarians and do not eat meat. For my part, I claim to be a "libriste" (a free software advocate) and do not "eat" proprietary software as defined by Richard Stallman. TXM 0.7.9 meets my expectations in this respect. In addition, TXM is developed by a team of French researchers, and good documentation in French is available on the project's website in the form of a manual and video tutorials. Finally, the project has a mailing list and a wiki that gave me the opportunity to receive support from community members in French.

TXM description
TXM is a free, open-source, Unicode, XML and TEI compatible text/corpus analysis environment and graphical client based on CQP and R. It is available for Microsoft Windows, Linux, Mac OS X and as a J2EE web portal. It provides:

Qualitative analysis

 * Concordances of lexical patterns based on the efficient CQP full text search engine and its CQL query language
 * CQL pattern frequency lists for any word property (type, lemma, pos...) thanks to the integration of TreeTagger for lemmatization and POS tagging
 * CQL pattern occurrence graphics
 * lexical patterns are expressed in the CQL query language, based on word and structure level properties, for example:
 * "aiming" to simply search for the word "aiming"
 * ".*ing" to search for words ending in "ing" (mainly verb forms)
 * [pos="VERB" & word=".*ing"] to search for verb forms ending in "ing" (where part-of-speech annotation is present)
 * [lemma="group"] []{0,3} [pos="VERB" & word=".*ing"] to search for the lemma "group" followed by a verb form ending in "ing", with at most 3 words in between
 * rich HTML-based text edition navigation with links from all other tools
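To make the logic of the CQL examples above more concrete, here is a minimal sketch in Python that emulates them on a tiny hand-tagged sentence. It is not CQP and does not use any TXM API: the sample tokens, their tags and the helper names (match_token, find, find_with_gap) are all hypothetical and serve only as an illustration. As in CQP, each regular expression must match the whole value of the property.

```python
import re

# Hypothetical toy sample: (word, lemma, pos) triples, tagged by hand.
TOKENS = [
    ("The", "the", "DET"),
    ("group", "group", "NOUN"),
    ("is", "be", "VERB"),
    ("aiming", "aim", "VERB"),
    ("at", "at", "ADP"),
    ("something", "something", "PRON"),
    ("new", "new", "ADJ"),
    (".", ".", "PUNCT"),
]


def match_token(token, **constraints):
    """True if every regex fully matches the corresponding token property."""
    props = dict(zip(("word", "lemma", "pos"), token))
    return all(re.fullmatch(rx, props[name]) for name, rx in constraints.items())


def find(tokens, **constraints):
    """Emulate a single-token query such as [pos="VERB" & word=".*ing"]."""
    return [t[0] for t in tokens if match_token(t, **constraints)]


def find_with_gap(tokens, first, second, max_gap):
    """Emulate [first] []{0,max_gap} [second]: at most max_gap tokens in between."""
    hits = []
    for i, tok in enumerate(tokens):
        if not match_token(tok, **first):
            continue
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if match_token(tokens[j], **second):
                hits.append([t[0] for t in tokens[i:j + 1]])
    return hits


print(find(TOKENS, word="aiming"))              # ['aiming']
print(find(TOKENS, word=".*ing"))               # ['aiming', 'something']
print(find(TOKENS, pos="VERB", word=".*ing"))   # ['aiming']
print(find_with_gap(TOKENS, {"lemma": "group"},
                    {"pos": "VERB", "word": ".*ing"}, 3))
# [['group', 'is', 'aiming']]
```

The second query also returns "something", which shows why adding the part-of-speech constraint is useful when only verb forms are wanted.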

Quantitative analysis

 * factorial correspondence analysis
 * contrastive word specificities
 * hierarchical classification
 * analysis of cooccurring words or lexical patterns

Corpus Data Model

 * Indexes words and their properties as well as hierarchical structure of texts
 * Indexes external or internal metadata of texts or speakers
 * Allows construction of various subcorpora and partitions (for contrastive analysis between text structures or groups of words)

Personal feedback on installation, import and use of features
Before TXM, I had used very little textometric software, and only occasionally. Getting to grips with this software did not seem excessively difficult to me, but it certainly would have been if I had not previously acquired some knowledge of corpus analysis in linguistics. Without this previous training, I would have had to assimilate, while discovering the software, a whole set of concepts such as occurrence, lemma, token, etc.

Honestly, it seems to me that someone who has enough time can successfully install the software and use it with the manual alone. I used the French version myself, but an English version also exists in beta.

In the end, the only problems I encountered in this experiment concerned the installation and use of the TreeTagger tagging software, which, unlike the R statistical software, is not pre-installed within TXM. These problems were due to configuration errors on my part and, in one case, probably to a corrupted downloaded file.

It should also be noted that importing my corpus, a process which creates an XML file containing the categorization and lemmatization information, took more than three hours on a desktop computer. At the end of the process, my RAM was overloaded (8 GB), forcing the computer to use the swap space on the hard disk. Finally, the folder containing the corpus in binary format, produced after more than one hour of calculation, was 6.5 GB in size and could not be loaded on my laptop for lack of disk space, although more than 15 GB were available.

It therefore seems important to me to point out that before embarking on the analysis of a corpus with TXM, it is necessary to make sure that the hardware is powerful enough for the size of the text. As another example, after I created two partitions (12 months and 14 years), the software's start-up time increased from a few seconds to nearly five minutes.

The software seemed relatively stable to me, provided that a new calculation is not launched before the previous one has finished. Given the size of the corpus and the power of my desktop computer, some processes can reach long or even excessive execution times. When the software freezes and has to be shut down via the computer's operating system, some of the work done before the shutdown may be lost. It is therefore advisable to restart the application after completing an important task.

Useful information for the ethnographer provided by TXM features
One by one, we will discuss here the functionalities offered by the TXM software, and their ability to provide useful information for the ethnographer. For each useful feature, we will give an example applied to the analysis of the archives of the Wikimedia-l mailing list.

Edition
The Edition function lets you browse the entire corpus in HTML, with an information bubble on each word indicating its lexical category.

Navigation is done file by file, with the name of the file as the header of the tab, and a right-click contextual menu sends a word to the concordancer.

Application: "Without leaving the TXM software, this function gave me the opportunity to briefly browse the corpus in full text, to get a sense of its structure and to launch concordance searches from the name or pseudonym of any known person. It is an easy way to browse all the contributions of an actor whom you would like to follow within the mailing list. We will come back to the concordancer later."
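To give an idea of what sending a word to the concordancer amounts to, here is a minimal keyword-in-context sketch in Python. It is an assumption-laden illustration working on raw text, not the TXM concordancer, which relies on the CQP index: the function name kwic and the width parameter are hypothetical.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left context, match, right context) for each occurrence of keyword.

    Assumption: the keyword is matched as a whole word, ignoring case, and the
    context is a fixed number of characters rather than a number of words.
    """
    rows = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


sample = "Alice proposed a new chapter. Bob replied that Alice was right."
for left, hit, right in kwic(sample, "alice", width=15):
    # Align the keyword in a central column, as concordancers usually do.
    print(f"{left:>15} [{hit}] {right}")
```

Searching for a name or pseudonym in this way lists every place where the person is mentioned, together with the surrounding text.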

Lexicon
A Lexicon analysis (a list of words ordered by frequency) already gives the ethnographer useful information. By looking at which words are most often used by the community on the mailing list, a researcher can obtain:


 * information on the main topics of discussion in the community, useful for guiding individual semi-structured interviews;
 * information on the most active members of the mailing list, useful for selecting interviewees;
 * information about the most used email providers, useful for knowing which communication channel will allow the maximum number of people to be contacted.

Example from the corpus:


 * In this corpus, made up of a mailing list archive, the lexicon shows 863676 occurrences of "@", which suggests a comparable number of messages posted on the mailing list.
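The principle of such a frequency list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name lexicon and the tokenization rule (runs of letters, digits and apostrophes, or single symbols such as "@") are hypothetical and do not reproduce the tokenizer used by TXM, so counts obtained this way will not match TXM's exactly.

```python
import re
from collections import Counter


def lexicon(text, top=None):
    """Return (token, frequency) pairs sorted by decreasing frequency.

    Assumption: tokens are lowercased runs of word characters and apostrophes,
    or single non-space symbols such as "@".
    """
    tokens = re.findall(r"[\w']+|[^\w\s]", text.lower())
    return Counter(tokens).most_common(top)


sample = (
    "alice@example.org wrote: hello\n"
    "bob@example.org wrote: hello hello"
)
print(lexicon(sample, top=1))      # [('hello', 3)]
print(dict(lexicon(sample))["@"])  # 2
```

In this toy sample, each "@" corresponds to one sender address, which is the reasoning behind reading the "@" count as an indication of the number of messages.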

Conclusion
Other types of corpus are possible.

Theoretical resources

 * Strategic Interaction and Knowledge Sharing in the KDE Developer Mailing List.
 * What Can OSS Mailing Lists Tell Us? A Preliminary Psychometric Text Analysis of the Apache Developer Mailing List.
 * Analyse de complexité des textes coutumostratiques
 * Outline of natural language processing
 * TXM French user manual

Papers to explore

 * Explore, play, analyse your corpus with TXM

External resources

 * https://www.ortolang.fr/market/corpora/orthocorpus/v1.1#_blank

Note and sources
Outils numériques pour anthropologues/Utilité du logiciel de textométrie TXM dans le cadre d'une recherche ethnographique