Thesaurus (information retrieval)

This article is about thesauri in the context of information retrieval. Such a thesaurus is a controlled vocabulary expanded with relations of broader, narrower and related terms, serving subject indexing and vocabulary control. A controlled vocabulary is a collection of sets of words and phrases where one item in each set is marked as a preferred term and the other items, usually synonyms, are marked as non-preferred terms.

A thesaurus is sometimes classified as one type of knowledge organization system (KOS); other types include ontologies and classifications.

One good way of learning from this article is following the links to example entries in various thesauri below, thereby getting an impression of the actual practice exemplified. For many learners, this is much better than learning from abstract descriptions.

Thesaurus entry features
Features of an entry of a thesaurus for information retrieval, where an entry corresponds to a concept, entity or subject:
 * The preferred term denoting the concept or entity. Also called a descriptor.
 * A set of non-preferred terms denoting the concept or entity. Also called non-descriptors. These are often labeled "use for" or "used for" (UF).
 * A broader term (BT). Sometimes called broader concept. This relationship covers three distinct relationships: generic (subclass-of), instance-of, and part-whole, as per ANSI/NISO Z39.19-2005. Some thesauri take a more relaxed or indefinite approach, e.g. connecting price to price policy via BT, which fits none of the three relationships. The UNESCO guidelines from 1971 envision the abbreviations BTG (generic) and BTP (partitive); ANSI/NISO Z39.19-2005 adds BTI (instance).
 * A set of narrower terms (NT). Sometimes called narrower concepts. This is the inverse of BT, and therefore also covers three distinct relationships. The UNESCO guidelines from 1971 envision the abbreviations NTG (generic) and NTP (partitive); ANSI/NISO Z39.19-2005 adds NTI (instance).
 * A set of related terms (RT). Sometimes called related concepts. This relationship is described in more detail in chapter 8.4 Associative Relationships of ANSI/NISO Z39.19-2005.
 * Note or scope note, defining the term or restricting its scope.
 * Definition. This is less often used; usually, a definition is placed into a scope note. Used e.g. in NCI Thesaurus, NCI Metathesaurus and PACTOLS thesaurus.
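
The entry features listed above can be pictured as a small record type. The following is a minimal Python sketch, not the data model of any particular thesaurus: the class name ThesaurusEntry, its field names, the resolve helper and the sample data are all assumptions made for illustration. Broader terms are kept in a list on the assumption that an entry may have more than one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThesaurusEntry:
    """One thesaurus entry, standing for a concept, entity or subject."""
    preferred_term: str                                            # descriptor
    non_preferred_terms: List[str] = field(default_factory=list)   # UF
    broader_terms: List[str] = field(default_factory=list)         # BT
    narrower_terms: List[str] = field(default_factory=list)        # NT
    related_terms: List[str] = field(default_factory=list)         # RT
    scope_note: str = ""
    definition: str = ""


def resolve(term: str, entries: List[ThesaurusEntry]) -> Optional[str]:
    """Map a preferred or non-preferred term to its preferred term."""
    wanted = term.casefold()
    for entry in entries:
        labels = [entry.preferred_term, *entry.non_preferred_terms]
        if any(label.casefold() == wanted for label in labels):
            return entry.preferred_term
    return None  # the term is not in the controlled vocabulary


# Hypothetical sample data, not copied from a real thesaurus.
cats = ThesaurusEntry(
    preferred_term="cats",
    non_preferred_terms=["house cats", "domestic cats"],
    broader_terms=["mammals"],
    related_terms=["pets"],
    scope_note="Illustrative note: domesticated cats only.",
)

print(resolve("Domestic cats", [cats]))
```

Looking up a non-preferred term returns the descriptor under which documents would be indexed, which is the vocabulary-control role described at the start of this article.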

Coverage of concepts or individual objects
A thesaurus may only cover terms referring to concepts, as does e.g. Art & Architecture Thesaurus. Alternatively, it may cover terms referring to individual objects, that is, proper names, as does e.g. Getty Thesaurus of Geographic Names. It may also cover both.

Names of individual objects can be connected by BT and NT relations as well, via part-whole relationship. Thus, e.g. Barcelona has BT Catalonia.
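
Because BT links chain, following them upward from a place name yields the places that contain it. The sketch below assumes each term has at most one broader term and stores the links in a plain dictionary; the function name broader_chain is hypothetical, and the link from Catalonia to Spain is an added illustrative assumption rather than something stated above.

```python
from typing import Dict, List


def broader_chain(term: str, broader: Dict[str, str]) -> List[str]:
    """Follow BT links upward from `term`, stopping at the top or on a cycle."""
    chain: List[str] = []
    seen = {term}
    current = broader.get(term)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = broader.get(current)
    return chain


# Part-whole BT links; only Barcelona -> Catalonia comes from the text above.
BROADER = {
    "Barcelona": "Catalonia",
    "Catalonia": "Spain",  # assumption for illustration
}

print(broader_chain("Barcelona", BROADER))
```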

Relation to authority files
Many authority files or things named authority files are in fact thesauri in so far as 1) they provide controlled vocabulary, and 2) they provide BT, NT and RT relationships. The fact that they are not named thesauri does not diminish their being thesauri by definition.

Form of the preferred term
Many thesauri use a plural form for countable nouns, e.g. "cats" instead of "cat", consistent with 6.5.1 Count Nouns in ANSI/NISO Z39.19-2005. By contrast, Wikidata tends to use singulars and so does WordNet.

The preferred terms should be nouns or noun phrases, as per 6.4 Grammatical Forms of Terms in ANSI/NISO Z39.19-2005. Thus, there should be e.g. an entry for "invention" rather than "invent" or "swimming" rather than "swim". By contrast, WordNet has separate entries for nouns, adjectives and verbs, e.g. swimming, colorful, and swim.

Example English thesauri

 * AGROVOC: For food, nutrition, agriculture, fisheries, forestry and the environment; in general, for areas of interest of the . Has nearly 40,000 entries. Lacks scope notes. Covers up to 42 languages. An example entry: lithium. Entry in BARTOC.
 * Art & Architecture Thesaurus (AAT). For items of art, architecture, and material culture. Example entries: cathedrals (works by context), Canis (genus), lithium. The Note field provides a definition or scoping information. It has over 59,400 records. Secondary languages: zh, nl, fr, de, it, es. A project of the Getty Research Institute. Entry in BARTOC.
 * Cultural Objects Name Authority (CONA). For cultural works, including architecture and movable works such as paintings, sculpture, prints, drawings, manuscripts, photographs, textiles, ceramics, furniture, other visual media such as frescoes and architectural sculpture, performance art, archaeological artifacts, and various functional objects that are from the realm of material culture and of the type collected by museums. A project of the Getty Research Institute. An example entry: Eiffel Tower. Entry in BARTOC.
 * European Thesaurus on International Relations and Area Studies: A multilingual, interdisciplinary thesaurus covering International Relations and Area Studies. Has about 8,200 records. Online public access unclear. Entry in BARTOC.
 * EuroVoc: A multilingual thesaurus maintained by the Publications Office of the European Union. Example entries: red wine, philosophy. Downloadable. Over 7200 concepts/entities. Lacks scope notes. Entry in BARTOC.
 * Faceted Application of Subject Terminology (FAST). "An enumerative, faceted subject heading schema derived from the Library of Congress Subject Headings (LCSH)." Example entry: cosmology. Lacks scope notes. Entry in BARTOC.
 * Getty Thesaurus of Geographic Names (TGN). For names and associated information about places, covering administrative political entities (e.g., cities, nations), physical features (e.g., mountains, rivers), as well as current and historical places. Has nearly 3,000,000 records. An example entry: Kattowitz. Has a note (not much of a scope one since the items covered are mostly specific entities, not categories or concepts). Entry in BARTOC.
 * Health Sciences Descriptors (DeCS). Bears some relation to Medical Subject Headings (MeSH). Over 30,000 records. Example entries: zinc, Uganda, hepatitis B. Has scope notes. Entry in BARTOC.
 * Library of Congress Subject Headings (LCSH). Example entries: Saturn (Planet)--Ring system, Galilei, Galileo, 1564-1642, Clocks and watches. Lacks scope notes. Entry in BARTOC.
 * Medical Subject Headings (MeSH). Example entries: Language, Physics. Has scope notes. Entry in BARTOC.
 * National Agricultural Library Thesaurus and Glossary (NALT). For terms related to agricultural, biological, physical and social sciences. Example entries: iridium, cooking fats and oils. Usually lacks scope note. Entry in BARTOC.
 * National Cancer Institute Thesaurus (NCI Thesaurus, NCIt). Example entry: company. Has definitions acting as scope notes. Entry in BARTOC.
 * NCI Metathesaurus. Example entry: Soybean. Entry in BARTOC.
 * OmegaWiki. Thesaurus-like in that this multilingual dictionary has a defined meaning as a key entity shared between languages and synonyms, and it has the hypernym property linking defined meanings. Thesaurus-unlike in that no synonym is indicated as the preferred one. The definitions serve as scope notes. Example entry: dog. Lacks BARTOC entry.
 * Union List of Artist Names. Controlled vocabulary from Getty, for specific entities, including artists and certain organizations. Example entries: Obama, Barack, British Museum. Has a note. Entry in BARTOC.
 * US Geological Survey Thesaurus (USGS Thesaurus). Example entry: vertebrates. Has scope notes. Entry in BARTOC.
 * WordNet. Thesaurus-like database, although not strictly a controlled vocabulary since no term in a WordNet synset is set as the preferred one. In contrast to thesauri, there is meronymy and holonymy for part-of relationships. WordNet's hypernym and hyponym relations provide the hierarchical aspect of a thesaurus (broader term, narrower terms) and WordNet's definitions act as scope notes. Covers top ontology, including entity. Example entry: planet. Over 175,000 synsets according to Wikipedia. Entry in BARTOC.
 * International Classification of Diseases, which includes ICD-9, ICD-10 and ICD-11. Thesaurus-like database that features taxonomization and classification of diseases and their episodes. Each disease, disease group or episode is named by a preferred name and defined or characterized, which serves as a scope note. The hierarchical aspect of a thesaurus (broader term, narrower terms) is present by diseases and their episodes being placed into a hierarchy. Example entry for ICD-9: Influenza; for ICD-10: Influenza due to identified seasonal influenza virus; for ICD-11: Influenza. Entry in BARTOC.
 * UNESCO Thesaurus. Example entries: Planets, Subject headings. Sometimes has scope notes. Covers over 4400 concepts. Entry in BARTOC.
 * Wikidata: knowledge graph that de facto provides thesaurus features. One label for an entity/concept is designated as the primary one, acting as the preferred term. Other labels for an entity/concept are non-primary, acting as non-preferred terms; these are called "also known as" or "aliases". Instead of the relations of broader term and narrower terms, there are relations acting in similar if more stringent roles, including subclass-of, instance-of and part-of. There is no generic relation corresponding to the thesaurus related term relation (generic association), but there is a host of more specific relations (properties) de facto providing this service. The description field can serve as a scope note, although part of that role is delegated to properties. Example entries: house cat, planet, philosophy, Giverny. Entry in BARTOC.
 * English Wiktionary thesaurus. While actually a language thesaurus rather than one for information retrieval, has BT and NT via WordNet-inspired hypernyms/hyponyms and holonyms/meronyms, and has RT via "Various". Like WordNet, does not select one term in a synonym set as preferred in the synonym list, but it does choose one term as the headword under which the entry is tracked. Example entries: Wiktionary:Thesaurus:bird, Wiktionary:Thesaurus:value. Is vastly incomplete, being a work in progress.

Example non-English thesauri

 * Nuovo soggettario. Also known as BNCF thesaurus. Example entries: Mari, Aeroplani, Edonismo. Italian. Sometimes has scope notes but often lacks them. Entry in BARTOC.
 * Répertoire de vedettes-matière de l'Université Laval. French. Online public access unclear. Entry in BARTOC.
 * Schlagwortnormdatei. German. Was integrated into Integrated Authority File/Gemeinsame Normdatei/GND, and thereby effectively superseded. Lacks a BARTOC entry.
 * Integrated Authority File. Gemeinsame Normdatei. GND. German. Example entry: Planet. Has top ontology, e.g. Entität. Lacks scope notes.
 * Polythematic Structured Subject Heading System (Polytematický strukturovaný heslář; PSH). Czech. Example entry: planety. Entry in BARTOC.
 * Czech National Authority Database (NL CR AUT). Czech. Example entry: planety (astronomie). Lacks top ontology, e.g. entity. Lacks scope notes. Entry in BARTOC.
 * General Finnish Ontology; Yleinen suomalainen ontologia (YSO). Finnish. Example entry: planeetat. Sometimes has scope notes. Has over 37,000 records. Entry in BARTOC.
 * National Library of Israel Names and Subjects Authority File. English and Hebrew. Example entries: Planets כוכבי-לכת, Dogs כלבים. Sometimes has a scope note as part of "Source Data Found" field. Lacks a BARTOC entry.
 * Web NDL Authorities. Japanese. Example entries: planet, Einstein, Albert, 1879-1955. Lacks scope notes. Entry in BARTOC.
 * PACTOLS thesaurus. French. Example entry: savane. Has definitions. Entry in BARTOC.
 * RKD thesaurus. Dutch. By Netherlands Institute for Art History. Example entry: thesaurus. Has scope notes. Lacks BARTOC entry.
 * Biblioteca Nacional de España authority file (BNE authority file). Spanish. Example entries: Gatos, Informática. Linked in VIAF. Lacks BARTOC entry.
 * Bibliothèque nationale de France authority file (BNF authorities). French. Example entry: Chats. Lacks BARTOC entry.

Thesauri with scope notes or definitions

 * Art & Architecture Thesaurus
 * Health Sciences Descriptors (DeCS)
 * International Classification of Diseases
 * Medical Subject Headings (MeSH)
 * NCI Thesaurus (definitions)
 * NCI Metathesaurus (definitions)
 * PACTOLS thesaurus (definitions)
 * RKD thesaurus
 * USGS Thesaurus
 * WordNet
 * Wikidata

Standards for thesaurus development
Standards for thesaurus development are covered in the History section of the Wikipedia article Thesaurus (information retrieval). The latest standard listed is ISO 25964, published in two parts in 2011 and 2013.

Two standards are available in full text online, and linked from Further reading:
 * UNESCO Guidelines for the establishment and development of monolingual thesauri. 1971
 * ANSI/NISO Z39.19 2005 Guidelines for the construction, format, and management of monolingual controlled vocabularies.

Further reading:
 * ISO 25964

BARTOC
BARTOC stands for Basic Register of Thesauri, Ontologies & Classifications. It is a database of such items, that is, knowledge organization systems (KOS). For each registered thesaurus, it contains a description. Sometimes it reports the number of items: for the African Studies Thesaurus it reports 5,200 English descriptors; for the Getty Art and Architecture Thesaurus, around 131,000 terms.

BARTOC is used by Wikidata for authority control of entities representing knowledge organization systems.

A BARTOC record features the following items:
 * Description; this may indicate number of items.
 * Titles; e.g. Getty Art and Architecture Thesaurus, The Art and Architecture Thesaurus and Art & Architecture Thesaurus
 * Abbreviation; e.g. AAT
 * KOS Type; e.g. Thesaurus
 * Subject; e.g. Arts & recreation (7) Architecture (720) http://www.iskoi.org/ilc/2/class/w (w) http://www.iskoi.org/ilc/2/class/x (x)
 * Languages; e.g. zh | nl | en | fr | de | it | es
 * Access; e.g. freely available
 * License; e.g. http://www.opendatacommons.org/licenses/by/1.0/
 * Format; e.g. Online Printed
 * Publisher; e.g. J. Paul Getty Trust
 * Etc.

Further reading:
 * bartoc.org
 * BARTOC

Non-English terms for thesaurus relations
What follows are non-English terms sometimes used for thesaurus relations such as BT, NT, RT, scope note, etc.


 * Czech
    * nadřazený termín: broader term
    * podřazený termín: narrower term
    * asociovaný termín: related term
 * Dutch
    * Gebruikt voor: used for
    * Ruimere term: broader term
    * Nauwere term: narrower term
 * Finnish
    * Yläkäsite: broader term
    * Alakäsitteet: narrower terms
    * Assosiatiiviset käsitteet: related terms
    * Huomautus: note
 * French
    * Employé pour: use for
    * Concept générique: broader concept
    * Concept spécifique: narrower concept
    * Concept associé: related concept
    * Terme(s) générique(s): broader term(s)
    * Terme(s) spécifique(s): narrower term(s)
    * Terme(s) associé(s): related term(s)
    * Définition: definition
 * German
    * Oberbegriffe: broader terms
    * Untergeordnet: narrower terms
    * Verwandter Begriff: related term
 * Italian
    * Usato per: use for
    * Termine apicale: top term
    * Termine più generale: broader terms
    * Termine più specifico: narrower terms
    * Termine associato: related terms
    * Nota d'ambito: scope note
    * Definizione: definition
 * Spanish
    * Usado por: use for
    * Término genérico: broader term
    * Término específico: narrower term
    * Término relacionado: related term
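
When reading entries from thesauri in several languages, the labels above can be mapped to the common abbreviations UF, BT, NT and RT. The following Python sketch does this with a lookup table. It covers only a subset of the labels listed, and the names RELATION_LABELS and relation_code are hypothetical; real thesaurus pages may spell or punctuate labels differently.

```python
import unicodedata
from typing import Optional

# Subset of the labels listed above, lowercased, mapped to relation codes.
RELATION_LABELS = {
    "nadřazený termín": "BT",
    "podřazený termín": "NT",
    "asociovaný termín": "RT",
    "gebruikt voor": "UF",
    "ruimere term": "BT",
    "nauwere term": "NT",
    "yläkäsite": "BT",
    "alakäsitteet": "NT",
    "assosiatiiviset käsitteet": "RT",
    "employé pour": "UF",
    "concept générique": "BT",
    "concept spécifique": "NT",
    "concept associé": "RT",
    "oberbegriffe": "BT",
    "verwandter begriff": "RT",
    "usato per": "UF",
    "termine più generale": "BT",
    "termine più specifico": "NT",
    "termine associato": "RT",
    "usado por": "UF",
    "término genérico": "BT",
    "término específico": "NT",
    "término relacionado": "RT",
}


def relation_code(label: str) -> Optional[str]:
    """Return UF, BT, NT or RT for a known label, else None."""
    # Normalize to NFC so accented letters compare equal however they are encoded.
    cleaned = unicodedata.normalize("NFC", label).strip().rstrip(":").strip()
    return RELATION_LABELS.get(cleaned.casefold())


print(relation_code("Término genérico:"))
print(relation_code("Gebruikt voor"))
```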

Wikidata items for these relations:
 * preferred term
 * non-preferred term
 * broader term
 * narrower term
 * related term
 * top term
 * scope note

Data interchange formats
One data interchange format useful for thesaurus data is Simple Knowledge Organization System (SKOS). The format has facilities for preferred and non-preferred ("alternative") labels of concepts, hierarchical and associative relations between concepts, textual notes and definitions, and more.
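
As an illustration of how these facilities line up with thesaurus entry features, the following Python sketch writes one concept in the Turtle syntax using the SKOS properties skos:prefLabel, skos:altLabel, skos:broader, skos:related and skos:scopeNote. The helper names, the example.org URIs and the sample entry are assumptions made for illustration; a real project would more likely use an RDF library than build the text by hand.

```python
from typing import Iterable, Optional

SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def literal(text: str, lang: str = "en") -> str:
    """Format a language-tagged Turtle string literal with basic escaping."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"@{lang}'


def concept_to_turtle(
    uri: str,
    pref_label: str,
    alt_labels: Iterable[str] = (),
    broader: Iterable[str] = (),
    related: Iterable[str] = (),
    scope_note: Optional[str] = None,
    lang: str = "en",
) -> str:
    """Serialize one thesaurus entry as a skos:Concept in Turtle."""
    parts = ["a skos:Concept", f"skos:prefLabel {literal(pref_label, lang)}"]
    parts += [f"skos:altLabel {literal(a, lang)}" for a in alt_labels]  # UF
    parts += [f"skos:broader <{b}>" for b in broader]                   # BT
    parts += [f"skos:related <{r}>" for r in related]                   # RT
    if scope_note:
        parts.append(f"skos:scopeNote {literal(scope_note, lang)}")
    return f"<{uri}>\n    " + " ;\n    ".join(parts) + " ."


# Hypothetical entry under a placeholder namespace.
BASE = "http://example.org/thesaurus/"
turtle = concept_to_turtle(
    BASE + "cats",
    "cats",
    alt_labels=["house cats"],
    broader=[BASE + "mammals"],
    related=[BASE + "pets"],
    scope_note="Illustrative note: domesticated cats only.",
)
print(SKOS_PREFIX)
print(turtle)
```

Narrower terms are left out of the sketch on the assumption that they are stated on the narrower concept's own record via skos:broader.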

Further reading:
 * Simple Knowledge Organization System
 * SKOS Simple Knowledge Organization System, w3.org
 * SKOS Simple Knowledge Organization System Primer, w3.org