Digital Libraries/Indexing and searching

Module name
Indexing and Searching

Scope
This module covers the general principles of information retrieval (IR) models, indexing methods, and the working of search engines applicable to digital libraries.

Learning objectives:
By the end of this module, the student will be able to:
 * a. Understand the basic working of search engines, and identify and explain the indexing and retrieval stages.
 * b. Understand how a crawler crawls sites, and collects documents for indexing.
 * c. Learn about indexing methods and use Lucene to index a given dataset.
 * d. Learn about the vector space and other information retrieval models and discuss and compare which models to deploy for specific datasets.

5S characteristics of the module:

 * a. Space: The concept of space for this module is physical space and virtual web space. The web pages, indexes and posting lists are stored in a physical space on servers, and the crawler crawls the virtual web space to find new content. Many IR models use spaces, e.g., vector or probability.
 * b. Stream: The web crawler finds URLs in a web page, which can be considered a stream of data, and follows them recursively to find more streams.
 * c. Structure: The web crawler, inverted index and postings list all make use of particular data structures.

Level of effort required:

 * a. Class time: 3 hours (1.5 + 1.5)
 * b. Student time outside class: 6 hours
 * i. Reading before each class starts (2+2)=4 hours
 * ii. Homework assignment: 2 hours

Relationships with other modules:

 * a. 3-c Harvesting: Harvesting applications in digital libraries crawl a set of archives or document collections to find information specific to a particular topic. This is similar to the search engines discussed in this module, which use a crawler to collect information from the web or a document collection.
 * b. 6-b: Online info seeking behavior and search strategy: The 6-b module takes a user oriented view of the IR system and search strategies, while the Indexing and Searching module takes a more systems oriented view of the IR system, its structure and its functions.
 * c. 6-a: Info needs, relevance: The Indexing and Searching module addresses the user's information needs by collecting and presenting hopefully relevant data based on the user's query (needs).

Prerequisite knowledge required:

 * a. None

Introductory remedial instruction:

 * a. None

Topic 1: Search Engines

 * a. History of search engines in digital libraries
 * The need for Search Engines (SE)
 * Motivated and contextualized by reference to SEs that crawl the web


 * b. The need for search engines in digital libraries
 * Are libraries aware of the incredible volume of academic content that is available on the web?
 * It is estimated that about one billion individual documents are in the "visible" web and nearly 550 billion documents on 200,000 web sites are in the "deep" web, which cannot be crawled with standard methodology because its content is dynamically generated.
 * For the research and teaching community the "invisible" web is of specific interest as it includes a major proportion of (high) quality content in free or licensed databases.


 * c. Shortcomings of commercial search engines for digital libraries.
 * Commercial search engines cannot search the "invisible" part of the digital library.
 * A search engine's primary business is to obtain revenue through advertising; therefore the contents of the digital library could be used to boost sales of certain products, which is profitable for the search engines.
 * There is no guarantee of the long-term sustainability of an index; everything is left to the discretion of the search engine.
 * Search engines aim at content that can be automatically indexed. They cannot accommodate manual conversion of data formats, a limitation that does not suit digital libraries.


 * d. The way search engines work in digital libraries is illustrated in the following figure. The first step in the figure, Web crawling, is explained in the following paragraphs. Each of the other steps is explained in detail in Topics 2, 3, and 4.


 * e. Web Crawling:
 * A web crawler (also known as a web spider) is a program or automated script that browses the World Wide Web, a specific part of the web, or a private database in a methodical and automated manner.
 * Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.
 * A crawler generally starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
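The seed list and crawl frontier described above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the `TOY_WEB` dictionary and the `toy_fetch_links` function are hypothetical stand-ins for downloading a page and extracting its hyperlinks, and the only policy applied is breadth-first order with a page limit.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}

def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return TOY_WEB.get(url, [])

def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)   # URLs waiting to be visited (the crawl frontier)
    seen = set(seeds)         # URLs already queued, so no page is queued twice
    visited = []              # order in which pages were fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["http://example.org/"], toy_fetch_links)
print(order)
```

Replacing `popleft()` with `pop()` would turn the frontier into a stack and give a depth-first crawl instead.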


 * f. Features a web crawler should provide include:
 * Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.
 * Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
 * Performance and efficiency: The crawl system should make efficient use of various system resources including processor, storage, and network bandwidth.
 * Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased toward fetching "useful" pages first.
 * Freshness: In many applications, the crawler should operate in continuous mode: It should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine's index contains a fairly current representation of each indexed web page. For such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates the rate of change of that page.
 * Extensible: Crawlers should be designed to be extensible in many ways - to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.

Topic 2: Document Analysis

 * 1. Document Analysis consists of some of the following processes
 * a. Tokenization
 * Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens.
 * For example "We shall rejoice" is broken down into 3 tokens 'We', 'shall' and 'rejoice'.
 * Tokenization is language and domain specific. For example, splitting on white space breaks "San Francisco" into 2 tokens, 'San' and 'Francisco', although it should be treated as a single token, 'San Francisco'.
 * The process of tokenization should not chop up words just by detecting white spaces. Items that can pose problems include phone numbers (800-444-5555), proper nouns, punctuation characters, and apostrophes.
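As a rough sketch of the points above, the following Python tokenizer uses a regular expression rather than white space alone. The pattern is an assumption chosen for illustration: it keeps phone numbers of the form 800-444-5555 and words with an internal apostrophe as single tokens, and it does not attempt to join multiword names such as "San Francisco".

```python
import re

# Assumed pattern: phone numbers first, then words (optionally with an
# internal apostrophe), then plain runs of digits.
TOKEN_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}|[A-Za-z]+(?:'[A-Za-z]+)?|\d+")

def tokenize(text):
    """Return the list of tokens found in text, dropping punctuation."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("We shall rejoice"))
print(tokenize("Call 800-444-5555, don't wait."))
```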


 * b. Stop Words
 * Some extremely common words which appear to be of little value in helping select documents matching a user's need often are excluded from the vocabulary entirely. Such words are called 'stop words'.
 * Examples include a, an, and, are, be, it, by, has, from, for, were, will.
 * Removing stop words drops some tokens and thus reduces the size of the posting lists (described in Topic 3: Indexing).
 * Some search engines do not use stop word filtering, or limit the size of their stop word lists (stop lists), because stop lists can be unfair to certain queries, e.g., 'To be or not to be'.
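A minimal Python sketch of stop word filtering follows. The stop list is the set of example words given above; comparing in lowercase is an assumption made for the illustration.

```python
# Stop list taken from the examples above.
STOP_WORDS = {"a", "an", "and", "are", "be", "it", "by", "has",
              "from", "for", "were", "will"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["Oriental", "pots", "are", "made", "of", "clay"]))
print(remove_stop_words(["To", "be", "or", "not", "to", "be"]))
```

With this short list the second query already loses both occurrences of 'be'; a longer stop list could remove the query entirely, which is the unfairness mentioned above.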


 * c. Stemming
 * Documents tend to use different forms of the same word, e.g., differ, differed, differentiate, different.
 * These could result in separate tokens, even though they stem from the same word 'differ'.
 * When a user queries for 'different', they might be interested in results that contain other forms of the word.
 * Stemming reduces the size of posting lists and has been shown to improve document rankings. It enhances recall, but may reduce precision.
 * Lemmatization, a more linguistically accurate form of processing, typically has a similar effect but is usually more expensive.
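To make the idea concrete, here is a deliberately naive suffix-stripping stemmer in Python. It is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length are assumptions picked so that the example words above collapse to 'differ'.

```python
# Toy suffix list, checked in order (longest first). Assumed for illustration.
SUFFIXES = ("entiate", "ent", "ing", "ed", "s")

def naive_stem(token, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["differ", "differed", "differentiate", "different", "student"]:
    print(w, "->", naive_stem(w))
```

The last word shows the cost: 'student' is reduced to 'stud', so a query for one would match documents about the other. This kind of over-stemming is one way recall improves while precision drops.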


 * d. Normalization
 * Normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
 * The above mentioned processing should be done in such a way that consistent text handling occurs for queries as well as documents. This is especially important if documents can be queries.
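One possible normalization routine is sketched below in Python. The specific rules (case folding, removing accents, and deleting periods) are assumptions chosen for illustration; the important point is that the same function is applied to both document tokens and query tokens.

```python
import unicodedata

def normalize(token):
    """Map superficially different spellings of a token to one canonical form."""
    decomposed = unicodedata.normalize("NFKD", token)
    # Drop combining marks (accents) left over after decomposition.
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Fold case and remove periods, so 'U.S.A.' and 'usa' match.
    return without_marks.casefold().replace(".", "")

document_token = normalize("U.S.A.")
query_token = normalize("usa")
print(document_token == query_token)
print(normalize("Naïve"))
```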

Topic 3: Indexing

 * 1. An index is used to quickly find terms in a document collection. As a Digital Library grows, an efficient method to do a full-text search is required. An inverted index is typically used to achieve this objective.
 * 2. Text is preprocessed and all unique terms are identified. For each term, a list of document IDs containing the term also is stored. This list is referred to as a posting list.
 * 3. Indexing incurs additional overhead in I/O and in the space needed to store the index on secondary storage.
 * 4. Indexing dramatically reduces the amount of I/O needed to satisfy an ad hoc query. Upon receiving a query, the index is consulted for each query term, the corresponding posting list is referenced, and documents are ranked based on estimated query relevance.


 * 5. The indexing phase can be divided into the following sub-phases
 * a. Building an Inverted Index
 * i. An inverted index consists of two components as shown in Figure 3
 * 1. Term list: A list of unique terms
 * 2. Posting List (for each term): List of document IDs containing that term.
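The two components can be represented in Python as a dictionary whose keys form the term list and whose values are the posting lists. This is a minimal in-memory sketch; the sample documents are made up, and the terms are assumed to have already passed through document analysis.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps a document ID to its list of already-analyzed terms."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    # Term list in sorted order; each posting list sorted by document ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Hypothetical sample collection.
docs = {
    1: ["digital", "library", "search"],
    2: ["search", "engine"],
    3: ["digital", "search"],
}
index = build_inverted_index(docs)
for term, posting_list in index.items():
    print(term, posting_list)
```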


 * b. Compressing an inverted index
 * i. Large indexes and posting lists can pose I/O bandwidth and storage problems.
 * ii. Compression can help significantly, e.g., see the 'Fixed Length Index Compression' algorithm.


 * c. Variable Length Index Compression
 * i. Differences between consecutive entries in a posting list are used to compress the index.
 * ii. Frequency distribution and offsets are used for compression.
 * iii. See: 'Gamma Codes' and 'Varying Compression Based on Posting List size'
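The sketch below shows, in Python, how differences between consecutive entries (gaps) and gamma codes fit together. It assumes document IDs are positive integers stored in strictly increasing order, and it represents the compressed output as a string of '0' and '1' characters for readability rather than as packed bits.

```python
from itertools import accumulate

def to_gaps(postings):
    """Keep the first document ID, then store differences between neighbors."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Rebuild the document IDs by summing the gaps."""
    return list(accumulate(gaps))

def gamma_encode(n):
    """Gamma code of a positive integer: unary length, then offset."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    offset = bin(n)[3:]                  # binary form without its leading 1
    length = "1" * len(offset) + "0"     # length of the offset in unary
    return length + offset

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into integers."""
    numbers, i = [], 0
    while i < len(bits):
        n_offset = 0
        while bits[i] == "1":            # read the unary length
            n_offset += 1
            i += 1
        i += 1                           # skip the terminating 0
        numbers.append(int("1" + bits[i:i + n_offset], 2))
        i += n_offset
    return numbers

postings = [3, 8, 21]
encoded = "".join(gamma_encode(g) for g in to_gaps(postings))
print(encoded)
print(from_gaps(gamma_decode(encoded)))
```

Small gaps get short codes, so posting lists for frequent terms (whose document IDs lie close together) compress best.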


 * d. Index Pruning
 * i. A 'lossy' approach in which certain posting list entries could be removed or pruned without significantly degrading performance.


 * e. Reordering documents prior to indexing
 * i. Index compression efficiency can be improved by algorithms that reorder documents before inverted index compression.
 * ii. Two types of algorithms can be used: Top-Down and Bottom-Up.

Topic 4: Searching

 * 1. Information retrieval includes searching for information within documents. It typically uses an index, and retrieval strategies to achieve this objective.
 * 2. A Digital Library can employ different retrieval strategies depending upon the type of documents stored in it.
 * 3. For a given retrieval strategy, many different utilities can be employed to improve the results of the retrieval strategy. Retrieval utilities should be viewed as Plug-n-Play utilities which can be coupled with any retrieval strategy.


 * 4. Retrieval Strategies:


 * a. Vector Space Model:
 * i. It is based on the idea that the meaning of a document is conveyed by the words used in it. If one can represent the words in a document or a query by a vector, it is possible to compare documents with queries to determine how similar their content is.
 * ii. The model involves constructing a vector that represents the terms in the document and another vector that represents the terms in the query.
 * iii. The closeness of the two vectors is measured, for example by their inner product, which is related to the angle between them. The resulting value is referred to as the Similarity Coefficient (SC).
 * iv. Some of the terminologies used in the Vector Space Model are:
 * $t$ = number of distinct terms in the document collection
 * $tf_{ij}$ = number of occurrences of term $t_j$ in document $D_i$. This is referred to as the term frequency.
 * $df_j$ = number of documents that contain $t_j$. This is the document frequency.
 * $idf_j = \log(d / df_j)$, where $d$ is the total number of documents. This is the inverse document frequency.
 * $d_{ij} = tf_{ij} \times idf_j$: weighting factor for term $t_j$ in document $D_i$
 * Similarity coefficient: $SC(Q, D_i) = \sum_{j=1}^{t} w_{qj} \times d_{ij}$, where $(w_{q1}, w_{q2}, \ldots, w_{qt})$ is the query vector and $(d_{i1}, d_{i2}, \ldots, d_{it})$ is the document vector.
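The definitions above translate fairly directly into Python. The sketch below assumes a base-10 logarithm (the base is not specified above), weights query terms the same way as document terms, and uses a small made-up collection of three tokenized documents.

```python
import math

def tf_idf_vectors(docs):
    """docs is a list of token lists. Returns (vocabulary, idf, document vectors)."""
    vocab = sorted({term for doc in docs for term in doc})
    d = len(docs)                                         # total number of documents
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    idf = {term: math.log10(d / df[term]) for term in vocab}
    vectors = [[doc.count(term) * idf[term] for term in vocab] for doc in docs]
    return vocab, idf, vectors

def similarity(query_tokens, doc_vector, vocab, idf):
    """Inner product of the query vector and one document vector."""
    query_vector = [query_tokens.count(term) * idf[term] for term in vocab]
    return sum(w * dw for w, dw in zip(query_vector, doc_vector))

# Hypothetical collection, already tokenized and lowercased.
docs = [
    "shipment of gold damaged in a fire".split(),
    "delivery of silver arrived in a silver truck".split(),
    "shipment of gold arrived in a truck".split(),
]
query = "gold silver truck".split()

vocab, idf, vectors = tf_idf_vectors(docs)
scores = [similarity(query, v, vocab, idf) for v in vectors]
for i, score in enumerate(scores, start=1):
    print(f"SC(Q, D{i}) = {score:.3f}")
```

The second document ranks first because 'silver' occurs in only one document (high idf) and appears there twice (high tf). Terms found in every document, such as 'of', get an idf of zero and contribute nothing.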


 * b. Probabilistic Retrieval
 * i. A probability based on the likelihood that a term will appear in a relevant document is computed for each term in the collection.
 * ii. For terms that match between a query and a document, the similarity measure is computed as the combination of the probabilities of each of the matching terms. Sometimes other probabilities are considered, too.


 * c. Neural Networks
 * i. A network of 'neurons' or nodes that fire when activated by a query, triggering links to documents.
 * ii. The strength of each link in the network is transmitted to the document and collected to form a similarity coefficient between the query and the document.
 * iii. Networks are trained by adjusting the weights on links in response to predetermined relevant and non-relevant documents.


 * d. Genetic Algorithms
 * i. An optimal query to find relevant documents can be generated by evolution.
 * ii. An initial query is used with either random or estimated term weights. New queries are generated by modifying the weights in queries from the prior generation.
 * iii. A new query survives by being close to known relevant documents, while queries with less 'fitness' are removed from subsequent generations.


 * e. Boolean Algorithm
 * i. A classical retrieval model that has been widely adopted in commercial systems.
 * ii. It is based on Boolean logic and classical set theory. Documents and queries are represented as sets of terms. Retrieval is based on whether the documents contain the query terms or not.
 * iii. From a mathematical point of view the algorithm is straightforward, but from a practical point of view there are problems to be solved, e.g., stemming, choice of terms, etc.
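Because documents and queries are treated as sets, Boolean retrieval maps naturally onto set operations over posting lists. The following Python sketch uses a small hypothetical index; the `postings` helper is introduced only for this illustration.

```python
# Hypothetical index: each term maps to the set of document IDs containing it.
index = {
    "digital": {1, 3},
    "library": {1},
    "search": {1, 2, 3},
    "engine": {2},
}

def postings(term):
    """Return the set of documents containing term (empty if the term is unknown)."""
    return index.get(term, set())

digital_and_search = postings("digital") & postings("search")   # AND: intersection
library_or_engine = postings("library") | postings("engine")    # OR: union
search_not_digital = postings("search") - postings("digital")   # AND NOT: difference

print(sorted(digital_and_search))
print(sorted(library_or_engine))
print(sorted(search_not_digital))
```

Every document in a result set satisfies the query equally, so the output is unranked.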


 * f. Extended Boolean Algorithm
 * i. In conventional Boolean retrieval, relevance ranking cannot be estimated because documents either satisfy or do not satisfy the Boolean request.
 * ii. To incorporate relevance ranking, Extended Boolean was proposed.
 * iii. The idea is to assign term weights (ranging from 0 to 1) to the terms in both the query and the documents, and to incorporate those weights in the relevance ranking.
 * iv. The similarity coefficient (SC) is calculated using the Euclidean distance from the origin to the term weights and normalized.
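The exact formula is not given above. One common formulation for a two-term query, assumed here for illustration, measures a normalized Euclidean distance from the origin for OR and from the ideal point (1, 1) for AND. In the Python sketch below, `w1` and `w2` are the weights of the two query terms in a document.

```python
import math

def sim_or(w1, w2):
    """OR: normalized distance of (w1, w2) from the origin (0, 0)."""
    return math.sqrt((w1 ** 2 + w2 ** 2) / 2)

def sim_and(w1, w2):
    """AND: one minus the normalized distance of (w1, w2) from (1, 1)."""
    return 1 - math.sqrt(((1 - w1) ** 2 + (1 - w2) ** 2) / 2)

# A document containing only one of the two terms is neither a full match
# nor a complete miss, so documents can be ranked.
print(round(sim_or(1.0, 0.0), 3))
print(round(sim_and(1.0, 0.0), 3))
print(sim_and(1.0, 1.0), sim_or(0.0, 0.0))
```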


 * g. Fuzzy Set Retrieval
 * i. A document is mapped to a fuzzy set (a set that contains not only the elements but a number associated with each element that indicates the strength of membership).
 * ii. Boolean queries are mapped into fuzzy set intersection, union and complement operations that result in a strength of membership associated with each document that is estimated to be relevant to the query. This strength is used as a Similarity Coefficient.
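A small Python sketch of this mapping is given below. It assumes the standard fuzzy set operators (minimum for intersection, maximum for union, and one minus the membership for complement); the membership values are made up for the example.

```python
# Hypothetical membership strengths: term -> {document ID: strength in [0, 1]}.
membership = {
    "search": {1: 0.75, 2: 0.25},
    "index": {1: 0.5, 2: 1.0},
}
ALL_DOCS = [1, 2]

def fuzzy_and(a, b):
    """Intersection: the weaker of the two memberships."""
    return {d: min(a.get(d, 0.0), b.get(d, 0.0)) for d in ALL_DOCS}

def fuzzy_or(a, b):
    """Union: the stronger of the two memberships."""
    return {d: max(a.get(d, 0.0), b.get(d, 0.0)) for d in ALL_DOCS}

def fuzzy_not(a):
    """Complement: one minus the membership."""
    return {d: 1.0 - a.get(d, 0.0) for d in ALL_DOCS}

print(fuzzy_and(membership["search"], membership["index"]))
print(fuzzy_or(membership["search"], membership["index"]))
print(fuzzy_not(membership["search"]))
```

The resulting strength for each document serves as its Similarity Coefficient for the query.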


 * 5. Performance Metrics
 * a. Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved. Precision provides one indication of the quality of the answer set, reflecting ability to avoid noise.


 * b. Recall: the ratio of the number of relevant documents retrieved to the total number of documents in the collection that are believed to be relevant. This indication of quality reflects how comprehensive the answer set is.
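Both metrics can be computed from two sets of document IDs, as in the Python sketch below. The document IDs are made up, and returning 0.0 when a set is empty is an assumption made to avoid division by zero.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document IDs."""
    hits = len(retrieved & relevant)          # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```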

Required reading for students

 * i. Search Engine Technology and Digital Libraries, "Libraries need to discover the academic content", Norbert Lossau, D-Lib Magazine, June 2004, Volume 10 Number 6 http://www.dlib.org/dlib/june04/lossau/06lossau.html
 * ii. Grossman, D. A. and Frieder, O. (2004). Chapter 2: Retrieval Strategies. In Information Retrieval: Algorithms and Heuristics, Second Edition, Springer. pg 9-20
 * iii. Grossman, D. A. and Frieder, O. (2004). Chapter 5: Efficiency, Inverted Index. In Information Retrieval: Algorithms and Heuristics, Second Edition, Springer. pg 182-195

Recommended reading for students

 * i. Manning, C. D. et al. (2008). Chapter 2. The term vocabulary and posting list. In Introduction to Information Retrieval. Cambridge University Press. Online version retrieved on Sep. 14, 2009 from http://nlp.stanford.edu/IR-book/pdf/02voc.pdf
 * ii. Grossman, D. A. and Frieder, O. (2004). Chapter 1. Introduction. In Information Retrieval: Algorithms and Heuristics, Second Edition, Springer. pg 3-5.
 * iii. Manning, C. D., Raghavan, P., and Schütze, H. (2008). Chapter 4. Index Construction. In Introduction to Information Retrieval. Cambridge, England: Cambridge University Press. http://nlp.stanford.edu/IR-book/pdf/chapter04-construction.pdf
 * iv. Manning, C. D. et al. (2008). Chapter 4. Index Construction. In Introduction to Information Retrieval. Cambridge University Press. Online version retrieved on Sep. 14, 2009 from http://nlp.stanford.edu/IR-book/pdf/04const.pdf

Reading for instructors

 * i. Lawrence, S. et al. (1999). Indexing and Retrieval of Scientific Literature. Proceedings of the eighth international conference on Information and knowledge management, Kansas City, Missouri, United States, Pages: 139 - 146. http://portal.acm.org/citation.cfm?id=319970&coll=GUIDE&dl=GUIDE&CFID=20419353&CFTOKEN=20765028&ret=1#Fulltext
 * ii. Summann, F. (2004). Search Engine Technology and Digital Libraries, "Moving from Theory to Practice". D-Lib Magazine, September 2004, Volume 10 Number 9. Article at http://www.dlib.org/dlib/september04/lossau/09lossau.html
 * iii. DeRidder, J. (2008) Googlizing a Digital Library, The Code4Lib Journal, Issue 2. http://journal.code4lib.org/articles/43
 * iv. Pant, G., Tsioutsiouliklis, K., Johnson, J., and Giles, C. L. 2004. Panorama: extending digital libraries with topical crawlers. In Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries (Tucson, AZ, USA, June 07 - 11, 2004). JCDL '04. ACM, New York, NY, 142-150. DOI= http://doi.acm.org/10.1145/996350.996384

IR-related resources from Wikipedia

 * Annotation
 * Bayesian Network
 * Binary Classification
 * Digital Library
 * Content-based Image Retrieval
 * Edit distance
 * Federated Search
 * Hubs and Authorities
 * Human-computer Information Retrieval
 * Hypertext
 * Image Retrieval
 * Information
 * Information Filtering
 * Information Retrieval
 * Language Model
 * Latent Semantic Analysis
 * Latent Semantic Indexing
 * Machine Learning
 * Metadata
 * Metasearch
 * N-gram
 * Naive Bayes
 * Okapi BM25
 * Open Archives Initiative
 * Precision (Information Retrieval)
 * Query Expansion
 * Relevance Feedback
 * Skip List
 * Standard Boolean Model
 * Stemming
 * Term Discrimination
 * Tf-idf
 * Vector Space Model
 * Web Crawler
 * Zipf's Law

Exercise A: Web Sphinx

 * 1. Download the crawler application Web Sphinx from http://www.cs.cmu.edu/~rcm/websphinx/#download. Download the websphinx.jar file and save it to a directory on your computer. Using the command prompt, navigate to the directory where the jar file has been saved, and type "java -jar websphinx.jar". Note: You should have the latest version of the Java Runtime Environment (JRE) downloaded and installed on your computer.
 * 2. Once the application has been started, read the user manual at http://www.cs.cmu.edu/~rcm/websphinx/#examples to get acquainted with the tool.
 * 3. Use Depth-first/Breadth-first crawling, limit number of hops, crawl the sub-tree and observe changes in the visualization. Observe how certain groups are formed and certain pages try to stick together as they are under the same sub-root. Observe the outline as to how the pages are arranged in the server.
 * 4. Some URLs you can try this tool on are:
 * a. http://www.cs.vt.edu
 * b. http://www.dlib.vt.edu/

Exercise B: Building a simple index

 * 1. Create a simple inverted index with a posting list of the following documents
 * a. D1: John sells oriental pots for a dollar.
 * b. D2: Oriental pots are made of clay.


 * 2. Step 1: Tokenization. Separate each word in documents D1 and D2.
 * 3. Step 2: Remove stop words. Stop words are removed by comparing the tokens against a list of predefined stop words. In this case, three stop words ('for', 'a', and 'are') are removed from the table above. The result is shown below.
 * 4. Step 3: Stemming. Stem the tokens to their roots. In this case, 'sells' becomes 'sell' and 'pots' becomes 'pot'.
 * 5. Step 4: Remove duplicate terms, keeping the document IDs of the removed duplicates. Now a simple inverted index with a posting list has been created.
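The four steps of this exercise can be checked with a short Python script. It is a sketch under stated assumptions: tokens are lowercased so that 'Oriental' and 'oriental' merge, the stop list contains only the three stop words named in Step 2, and stemming is a lookup table holding only the two results named in Step 3.

```python
import re
from collections import defaultdict

DOCS = {
    "D1": "John sells oriental pots for a dollar.",
    "D2": "Oriental pots are made of clay.",
}
STOP_WORDS = {"for", "a", "are"}            # the three stop words from Step 2
STEMS = {"sells": "sell", "pots": "pot"}    # the two stemming results from Step 3

def analyze(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # Step 1: tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # Step 2: stop words
    return [STEMS.get(t, t) for t in tokens]              # Step 3: stemming

index = defaultdict(list)
for doc_id, text in DOCS.items():
    for term in analyze(text):
        if doc_id not in index[term]:       # Step 4: one entry per term and document
            index[term].append(doc_id)
index = dict(sorted(index.items()))

for term, posting_list in index.items():
    print(f"{term}: {posting_list}")
```

Note that 'of' stays in the index because it is not among the three stop words used in this exercise.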

Exercise C: Evaluation

 * 1. There are a total of 1000 documents in a document collection. For a given search query, the total number of relevant documents is 400. The query is executed on the IR system, and it retrieves 600 documents. Out of the 600 documents, only 300 are relevant to the query. What are the precision and recall of the IR system?


 * 2. Explanation:
 * a. Total number of documents in the collection = 1000
 * b. Relevant documents = 400
 * c. Retrieved documents = 600
 * d. Relevant documents retrieved = 300
 * e. Precision = Relevant Retrieved / Retrieved = 300/600 = 50%
 * f. Recall = Relevant Retrieved / Relevant = 300/400 = 75%

Additional useful links

 * a. Web Sphinx : http://www.cs.cmu.edu/~rcm/websphinx/

Contributors

 * a. Initial author:
 * i. Aniket Arondekar
 * b. Guidance:
 * i. Dr. Edward A. Fox
 * ii. Seungwon Yang
 * c. Evaluation:
 * i. UNC-CH project team