
= Search engines history and architecture =

Overview of the Search engines history and architecture chapter
This chapter is about search engines: their history, their architecture, and related questions such as the ranking of search results and its possible biases.

After the overview and an introduction to the topic, the definition of search engines and their types are given in the third part, as well as the motivation for developing search engines. In the fourth part a brief history of some popular or historically important search engines is presented. The fifth part explains the basics of search engine architecture and its main building blocks, such as the web crawler, the search index and inverted index, the query engine and the ranking module. In the sixth part, ranking techniques based on term counting or on network structure, together with examples of their calculation, are explained in more detail. Finally, a brief overview of search relevance and trust in the results, their biases and the question of filtering is given; these topics are considered in more detail in the second chapter.

Introduction
In the information age, few people can imagine life without a good and simple way of finding information. Most people are interested in reliable and direct ways to access, search and exchange information. Moreover, the supplied information should always be up to date and relevant to people's needs.

It is possible to access a vast amount of information via the Internet today. But in the early days, when information had just started to be digitised, people recognised the crucial need for finding, sorting and ranking it. That was the time when search engines emerged.

The success of any search engine is to a large extent determined by whether a user can find a good answer to his search query or not. That is why the most important aim of every search engine is to continuously improve its search results. A lot of different techniques, architectures, algorithms and models have been invented and implemented in order to provide search results which users consider relevant and interesting. Let us consider some of them in this chapter.

Learning Objectives
The learning objectives of this section are to understand:
 * What a search engine is,
 * The categorization of existing search engines,
 * What a Web search engine is and which search engines are the most popular nowadays,
 * The motivation for developing search engines.

What is a search engine?
A search engine is a tool that helps people find information of different types on the Internet or on the user's PC. There exist many different types of search engines operating in various ways, some of which are:
 * Web search and meta search engines, for keyword search in Web documents;
 * Desktop search tools, which provide file search on the user's PC;
 * Audio search engines, which are services for finding the title or artist name of a song using a short fragment of the song recorded e.g. on a mobile phone. Examples are Shazam or SoundHound;
 * Image search engines, which allow a user to upload a picture which is then used to search for similar pictures or higher resolutions of this picture, or even to understand what this picture is about;
 * Vertical search engines, which are oriented towards a deep search in some specific topic. For example, Amazon helps people to find products, PeopleFinder helps to find people, and Indeed.com is an employment-related search engine (it is also classified as a meta search engine).

Web Search Engine
Web search engines return information located on the World Wide Web. Keyword search can be considered from at least 2 directions: finding the documents matching a keyword entered by the user (or a phrase consisting of several keywords), or finding documents which are about the entered keyword. For the end user a search engine is a website with a text form, where he can type in his search keywords and receive a list of links to relevant pages or documents as a result. The result page also contains descriptive textual information for each link and usually some relevant advertisements. Nowadays, search engines are often sophisticated software and hardware systems with a closed and commercially protected inner structure. In short, each search engine includes robots (also known as crawlers or spiders) which travel through the Web following each link and mine documents into its database. Then data is extracted from the documents and processed using special algorithms. Each document and keyword is sorted and obtains some rating. In the end, a user optimally receives a list of web pages relevant to his query and sorted by relevance. According to NetApplications, the Web search engine market share in November 2013 was divided as follows:
 * Google — 70.91%;
 * Baidu — 16.51%;
 * Yahoo! — 5.95%;
 * Bing — 5.48%;
 * AOL — 0.27%;
 * Ask — 0.23%.

Despite the dominant popularity of Google, locally popular search engines exist in some countries, for example:
 * Baidu is the leading search engine in China and holds 5th place in the Alexa top 500 sites ranking.
 * Yandex is the most popular search engine in Russia and holds more than 60% of the market share in that country.

Motivation for creating Web search engines
At the beginning of the World Wide Web there were no search engines, but a web (or link) directory managed by Tim Berners-Lee and hosted on the CERN webserver. With time, further directories came up, like Yahoo! and the Open Directory Project. They were websites with lists of links to other sites. A key characteristic of web directories is that their content is not mined automatically (as opposed to search engines), but manually by humans. The author (or editor) of a web directory needs to somehow discover the existence of the website that should be added to the web directory. In the early days of the Web there were mainly 2 options for learning about the emergence of a new website:
 * 1) To receive an email from a site owner with a request to add a link to his resource to the web directory;
 * 2) Accidental discovery of a new website while surfing the Web, when another website had already linked to it.

The second important characteristic of web directories is their limited categorization possibilities: usually each website could only be associated with one (sometimes 2-3) specific categories, depending on the topic of the whole website rather than on the topic of a specific web page. As time went by, some Web directories evolved into well-structured and categorised directories with the possibility of sorting links in different ways, e.g. by popularity or by time of addition.

Learning Objectives

 * To get an overview of some important developments in the history of web search engines
 * To learn briefly about the development of search engines like Yahoo!, AltaVista, Google, etc. and their predecessors like Archie and Gopher.

The structure of this section is the following: all case studies are sorted according to their creation date. Each case may include a more detailed description and important dates in its development. At the end, some alternative search engines are considered.

Archie
Archie, developed in 1990, was very likely the first search engine on the Internet. It was created by Alan Emtage to provide file search on FTP servers. Although the World Wide Web had not yet been developed in 1990, the Internet already existed and a vast number of public FTP servers were already operating, providing file exchange. The Archie database contained the contents of over 800 FTP servers and performed search by file name using regular expressions. It served users with the results via telnet sessions or email.

Gopher
Gopher, launched in 1991, was an application layer protocol accessible through TCP port 70. It provided functionality similar to the World Wide Web. In contrast to the Web, Gopher had a text-based look with menus, which is why a lot of network administrators preferred Gopher as the system consuming fewer resources. Gopher had its own search engines, like Veronica, which provided search over Gopher menus.

Yahoo!
Yahoo! was founded by Jerry Yang and David Filo in January 1994. It is an example of a Web directory organised in a hierarchy. In addition to the Web directory, a Web portal was added. By the end of 1994, Yahoo! had become very popular and had received more than 1 million hits overall. In 1998 Yahoo! was the most popular starting page and had about 30 million unique visitors per month. Originally Yahoo! had no search crawlers of its own, but in 2000 it started to use Google search, which was later replaced several times by other engines.

AltaVista search
AltaVista, as a keyword-based Web search engine, introduced several important innovations in 1995. Firstly, it had a very fast multi-threaded crawler (called Scooter) which could reach most of the Web sites existing back then. Also, AltaVista used very advanced hardware for its back-end. The third important innovation was AltaVista's minimalistic web site interface. This allowed it to reach high popularity in the very first days of the search engine's existence. Data collected by AltaVista's crawler was used in the very first analysis of the strength of connections in the World Wide Web represented as a graph structure. With the rise of Google Web Search, AltaVista soon lost its popularity. In 2003 Yahoo! bought it, and in 2013 the operation of AltaVista was shut down.

Google search
Google Inc. was founded by Larry Page and Sergey Brin in 1998. They had started to work on the project in 1996 as PhD students. The revolutionary idea of Google Web Search was the PageRank algorithm for ranking web sites, which will be described in more detail in a later section. PageRank is a special quality ranking of documents, which allowed Google to surpass all existing search engines in terms of the quality of search results. Nowadays Google is the most popular search engine in the world (Figure 1).

Alternatives

 * Ask.com is a question-answering based search engine. It was founded in 1996 by Garrett Gruener and David Warthen with the idea of helping people to find answers to their questions using everyday natural language in addition to the usual keyword search.
 * Wolfram Alpha is a computational knowledge engine and not a search engine. It uses natural language processing techniques to convert a user request into a machine-understandable format and then computes an answer using its internal knowledge base. Wolfram Alpha does not return links to external websites.
 * Meta search engines are search engines that combine search results from several other search engines, newsgroups, or other sources following their own algorithms. The results of a search are then shown in a single list or in separate lists split by their sources. So, meta search engines usually do not have their own database (or search index) nor a mechanism for mining it (crawlers). Some examples of meta search engines are Dogpile and Metacrawler, which combine Google, Yahoo and Yandex.

Learning Objectives
In general, a search engine implements a lot of different tasks. Therefore, the learning objectives of this section are to understand how the most important tasks are performed (Figure 2):
 * Crawl the Web and mine data
 * Store data in search indexes
 * Calculate special parameters for ranking web pages and for further crawling optimisation (like PageRank)
 * Search among the stored information considering ranking and further parameters (like the user's interests, search history, etc.)

Web crawler
Web crawlers are an essential part of almost every search engine. A Web crawler is a computer program that collects data from the Web. Occasionally, crawlers are also called Web spiders, ants or automatic indexers.

The process of crawling starts from a primary list (or queue) of URLs. For the very first run this list can be created by humans. The Web crawler visits all URLs in the queue and extracts new links for further crawling. Additionally, the visited webpages are downloaded and saved to a preliminary storage called the page repository for later indexing.
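This crawling loop can be illustrated with a minimal sketch in Java (the language used later for the random-walk example). The class and method names here (SimpleCrawler, Fetcher, fetch, extractLinks, storeInPageRepository) are hypothetical and only illustrate the loop, not the implementation of any particular search engine:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {

    // Hypothetical helpers: fetch(url) downloads a page, extractLinks(html) parses its anchors.
    interface Fetcher {
        String fetch(String url);
        List<String> extractLinks(String html);
    }

    public static void crawl(List<String> seedUrls, Fetcher fetcher, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>(seedUrls); // primary list of URLs (the queue)
        Set<String> visited = new HashSet<>();               // protects against visiting a URL twice

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already crawled
            }
            String html = fetcher.fetch(url);                 // download the page
            storeInPageRepository(url, html);                 // save it for later indexing
            for (String link : fetcher.extractLinks(html)) {  // extract new links for further crawling
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
    }

    static void storeInPageRepository(String url, String html) {
        // In a real system the page repository would be persistent storage; here it is a stub.
    }
}
```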

Crawling the Web is a challenging task, because various problematic situations can occur. For example, an infinite variation of parameters in a URL is possible, which may cause endless loops while crawling a web site, even though the amount of unique content is limited. A huge number of web sites were created without an engineering approach and hence without following any standards. Furthermore, spelling or nesting errors in HTML tags are common, and the DOM (Document Object Model) is often violated. These and many other reasons may cause naively implemented Web crawlers to crash or get stuck in loops. Secondly, it is very important to have up-to-date information in a search engine's results. To achieve this perfectly, a search engine would have to download all the documents of the entire World Wide Web every hour, minute and second. This is obviously impossible. Moreover, it is not necessary, since not all web sites are updated frequently. For example, a private person's homepage might be updated less frequently than the website of a popular news agency. Another big issue is performance optimisation, which depends closely on supplementary systems like DNS, data storage, etc. In addition, the load generated by crawlers on the crawled Web servers must be kept low. Otherwise, the owners of websites would try to ban the search engine's crawlers.

Summarising the above, and according to Castillo, the main questions which should be dealt with when designing a web crawler are:
 * 1) Which pages the crawler should download and in what order, or selection policy.
 * 2) How often the crawler should refresh each page, or re-visit policy.
 * 3) How the load on visiting web-sites should be decreased, or politeness policy.
 * 4) How the load on crawler's servers should be decreased, or parallelisation policy.

In the article “Searching the Web” the authors made a small comparison of different metrics that can be used for ordering URLs, based on the structure of the Stanford University website. The idea was to find the best metric for downloading popular pages first, while traversing as little of the entire collection of documents as possible. Web pages that had more than 100 backlinks were considered popular. The results showed that using the PageRank metric (described in more detail in a later section) it is possible to reach more than 65% of the popular pages while processing only 20% of the entire collection. In comparison, the backlink-count, breadth (breadth-first search) and random metrics achieved results of ~50% / 20%, 30% / 20% and 20% / 20% respectively. Breadth-first search was chosen instead of depth-first search in order to cover the highest number of web sites first and thereby reduce the risks associated with the occurrence of infinite loops inside one web site.

Crawlers are designed not to download every possible document that exists on the Web, but rather to download only supported types of documents. Hence, crawlers might consider only HTML pages, XML documents, PDF documents and other popular document types, and avoid all other MIME types.

In order to avoid infinite loops while crawling and to protect against visiting one document twice, URL normalisation is used. This is the process of modifying URLs into a standard form and clearing ambiguous parameters.
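As a rough illustration (not an exhaustive normaliser, and not the rules of any particular crawler), the following sketch applies a few typical steps: lower-casing the scheme and host, removing the default port, dropping the fragment and stripping a trailing slash. It assumes an absolute http(s) URL:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlNormalizer {

    // Normalises a URL to a canonical form so that equivalent URLs compare equal.
    // Assumes an absolute http(s) URL with a host part.
    public static String normalize(String url) throws URISyntaxException {
        URI uri = new URI(url).normalize();                       // resolves "." and ".." path segments
        String scheme = uri.getScheme().toLowerCase();            // "HTTP" -> "http"
        String host = uri.getHost().toLowerCase();                // "Example.COM" -> "example.com"
        int port = uri.getPort();
        boolean defaultPort = (port == -1)
                || ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);
        String path = uri.getPath().isEmpty() ? "/" : uri.getPath();
        if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);          // drop trailing slash
        }
        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (!defaultPort) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (uri.getQuery() != null) {
            sb.append('?').append(uri.getQuery());                // the fragment (#...) is dropped
        }
        return sb.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(normalize("HTTP://Example.COM:80/a/./b/../index.html#section"));
        // prints: http://example.com/a/index.html
    }
}
```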

Search index and inverted index
Search (or document) indexing modules parse and store data about each document downloaded by the crawler. Search indexes are needed to increase the speed of search and to eliminate the need for a slow scan of each document in the database whenever a search query is received. The existence of the index increases the amount of necessary storage and the consumption of computing resources (to fill in and update the index each time a new version of a document arrives), but this is a reasonable price for the possibility of rapid search.

When a crawler has supplied its search engine with a number of documents, the next task is to find out which document contains which words. But before creating the structure of the index, it is useful to look at the search process again.

When a user visits a search engine website, he (or she) usually wants to receive results about a given keyword or a group of keywords. At this moment the main task of any search engine is to find all documents which contain the necessary keyword(s). Also, users want this task to be performed as fast as possible. To solve this problem, a search index is designed in the following way: instead of creating a collection of documents with the terms contained in each document (a forward index), it is better in terms of search speed to create a collection of terms with links to the corresponding documents in which each term appears. This data structure is called an inverted index. Document IDs in the inverted index may be sorted according to the tf-idf weight of the term. A small example of an inverted index is shown in table 1.
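A minimal in-memory sketch of such an inverted index (a map from term to a sorted list of document IDs) might look as follows; the document IDs and texts are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndex {

    // term -> sorted list of document IDs in which the term occurs (the posting list)
    private final Map<String, List<Integer>> index = new TreeMap<>();

    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
            // document IDs arrive in increasing order, so the posting list stays sorted
            if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                postings.add(docId);
            }
        }
    }

    public List<Integer> postings(String term) {
        return index.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(5, "the sun rises in the east");
        idx.addDocument(8, "the sun rises and the sun sets");
        idx.addDocument(12, "the sun is a star");
        System.out.println(idx.postings("sun"));   // [5, 8, 12]
        System.out.println(idx.postings("rises")); // [5, 8]
    }
}
```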

If a user issues the query “the sun rises”, a quick look at the inverted index tells us that the term “the” can be found in the documents with IDs 1, 2, 3, 4, 5, 7, 8, 9, 12, 13, 18, 45, 989, 1005; the term “sun” can be found in the documents with IDs 2, 5, 8, 12, 13, 45; and the term “rises” can be found in the documents with IDs 1, 5, 8, 13. The result of the search query will be the intersection (or sometimes the union) of these 3 sets of IDs. So in the end, the user will see a result page consisting of the documents with IDs 5, 8, 13. In real life, of course, the number of search results for a keyword (or a phrase) may exceed several millions (or even billions). Modern search engines may also have more than one index, for example for storing HTML tags, URLs, the position of a term in a paragraph, etc.
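The intersection of sorted posting lists can be computed with a simple merge, sketched below using the ID lists from the example above:

```java
import java.util.ArrayList;
import java.util.List;

public class PostingListIntersection {

    // Intersects two posting lists that are sorted in ascending order of document ID.
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) {
                result.add(a.get(i)); // this document contains both terms
                i++;
                j++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> the = List.of(1, 2, 3, 4, 5, 7, 8, 9, 12, 13, 18, 45, 989, 1005);
        List<Integer> sun = List.of(2, 5, 8, 12, 13, 45);
        List<Integer> rises = List.of(1, 5, 8, 13);
        // "the sun rises" -> intersect all three posting lists
        System.out.println(intersect(intersect(the, sun), rises)); // [5, 8, 13]
    }
}
```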

Query engine and ranking module
The query engine module is responsible for the interaction with the end user: it receives a search query from a user, processes it by addressing the search index and gives back the results. Modern search engines have a helper function called “autocomplete suggestions”: when a user starts to enter a search query in the query form, the search engine immediately extends the already written word, phrase, or even part of a word with relevant and frequent search queries, called “suggestions”. That can dramatically increase overall usability and the speed of human-computer interaction. Another useful functionality is special operators which help to clarify a search query. For example, in the Google search engine such operators are:
 * “-keyword” – excludes from the search the documents containing the word “keyword”;
 * “site:” – performs the search on a given URL or specific domain zone, for example “site:.gov” will search only the .gov domain zone;
 * “filetype:” – searches only documents of the given file type;
 * binary operators, such as OR;
 * “link:” – searches all websites that link to a given URL.

Ordering search results according to relevance is another crucial task for every search engine. A huge variety of different factors can be considered for this purpose. All these metrics are stored in the search index; some of them are:
 * the number of occurrences of each word in a document (called hits or term frequency);
 * the position of a particular word in a document, its font size and its surrounding tags (certain surrounding tags carry more weight than others);
 * links to other pages and important information about these links (like the text of the link) – this is very important for creating a model (or a map) of the World Wide Web and for knowing precisely which page links to which. Based on this information, PageRank is calculated.

The ranking module sorts the results in such a way that the most relevant results (or links) are placed higher on the search engine's result page. Ranking mechanisms are considered in the next part of the article.

Learning Objectives
The learning objectives are to understand and to be able to perform:
 * Term weighting using tf-idf importance metric
 * Link analysis using PageRank algorithm

Introduction to ranking and importance
Usually, there are thousands or millions of pages which contain the keywords of any user query. Some of these pages will be more relevant and interesting for a user, while others will be relatively useless. In general, no one wants to visit hundreds of pages before finding relevant documents. Instead, users wish to find the information they are looking for as quickly as possible. This means that, at best, the user will find all relevant pages on the very first pages of the search results.

Modern search engines use more complex algorithms for ranking than simply finding and counting keywords in documents. These algorithms allow search engines to return intelligent and accurate results to the user.

From the point of view of search engine creators, two aspects can be considered as ranking or importance measures of a Web page:
 * the content of the web page, considered in terms of term weighting, as cognitive social capital,
 * link analysis, considered as structural social capital, describing how important a page is in the Web.

Term weighting
Term frequency–inverse document frequency (short: tf-idf) is a statistical measure for estimating the weight (or importance) of a given term in a document within a collection of documents (or corpus). It can be considered as the cognitive aspect of the 3 dimensions of social capital.

tf-idf, motivation and definition
tf-idf is often used in text analysis and information retrieval as one of the criteria which determine the relevance of a document for a given search query. Applying tf-idf is also helpful for determining the topic of a particular document or even of a single paragraph, in other words for finding the keywords that best describe the document (keyword summarization). The technique is to calculate tf-idf for each keyword in the document and sort the words by the obtained tf-idf values in descending order. The highest-ranked words then give an impression of the content of the particular document.

Each document can be represented as a vector of terms. But instead of the terms themselves it is more suitable to use, as vector components, numbers which represent the weight of each term. To obtain these numbers, the frequency of use of each term can be calculated. However, considering only $$tf$$ (short for term frequency) for estimating the importance of a term is not enough, since the most frequent words in the English language (according to the Oxford English Corpus) are:
 * the, be, to, of, and, a, in, that, have, I

These words do not describe the content of a document. To solve that problem, $$idf$$ (inverse document frequency) was introduced: with $$idf$$ the weight of commonly used words can be reduced. So, the combination $$tf-idf$$ not only reflects the frequency of a term in a particular document, but also takes into account how often this term appears in other documents. If a term rarely appears in the collection, the documents containing it are ranked higher, since they describe more unique content.

The weight of a term is proportional to the number of times the term is used in the document, or term frequency, multiplied by the inverse document frequency, which shows whether the term is common or not across all documents in a collection:

$$ \text{tfidf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D) $$

with

$$ \text{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|} $$

 * $$ \text{tf}(t,d)$$ – term frequency, which describes how many times term $$t$$ appears in document $$d$$,
 * $$ \text{idf}(t, D)$$ – inverse document frequency, the logarithm of the total number of documents in the collection $$D$$ divided by the number of documents which contain the term $$t$$,
 * $$ |D| $$ – the total number of documents in the collection (corpus).
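A minimal sketch of this computation, assuming raw term counts for tf and the logarithmic idf defined above (real systems use various weighting variants):

```java
import java.util.List;
import java.util.Locale;

public class TfIdf {

    // How many times term t appears in document d (d given as a list of lower-cased terms).
    static long tf(String t, List<String> d) {
        return d.stream().filter(t::equals).count();
    }

    // log( |D| / number of documents containing t ); assumes t occurs in at least one document.
    static double idf(String t, List<List<String>> corpus) {
        long docsWithTerm = corpus.stream().filter(d -> d.contains(t)).count();
        return Math.log((double) corpus.size() / docsWithTerm);
    }

    static double tfIdf(String t, List<String> d, List<List<String>> corpus) {
        return tf(t, d) * idf(t, corpus);
    }

    public static void main(String[] args) {
        // A tiny hypothetical corpus of three documents.
        List<List<String>> corpus = List.of(
                List.of("the", "web", "evolves"),
                List.of("the", "sun", "rises"),
                List.of("the", "web", "science"));
        List<String> doc = corpus.get(0);
        // "the" occurs in every document, so its idf (and tf-idf) is 0
        System.out.println(String.format(Locale.ROOT, "the: %.3f", tfIdf("the", doc, corpus)));
        // "evolves" occurs in only one of three documents, so it gets a higher weight
        System.out.println(String.format(Locale.ROOT, "evolves: %.3f", tfIdf("evolves", doc, corpus)));
    }
}
```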

Example of tf-idf application
Let us consider an example of the use and calculation of tf-idf for determining the most important terms in a document. There is a document containing the following paragraph: The objective of Web Science is to understand how the Web evolves and to understand how we can make the Web a better tool for our individual needs.

Table 2 contains almost all the terms from the above sentence. We exclude the terms [of, is, to, we, can, our] because they are quite commonly used in the English language. But to illustrate the effect these words have on tf-idf, the tf-idf value for “the” is still given in this example. As the total number of documents $$|D|$$, the estimated number of documents indexed by Google in December 2013 is used, so we take $$|D| = 15 \times 10^{9}$$.

As it can be seen from table 2, the terms with the highest tf-idf value – and hence, the words best describing this sentence – are:
 * Web Science, understand, evolves, Web.

Cosine similarity
After calculating $$tf-idf$$ for each term in a document, a search engine needs to find:
 * 1) whether a user query matches a particular document;
 * 2) after finding several matching documents, which documents are more similar to the query.

To do that, the cosine similarity between 2 vectors is used. The basic steps are:
 * 1) Represent the set of terms of a document as a vector $$d$$. Here, unique terms are the axes and $$tf-idf$$ values are the coordinates.
 * 2) Calculate $$tf-idf$$ for the user query. Then, represent it as a vector $$q$$ modelled like the one introduced in the previous step.
 * 3) Calculate the cosine similarity between the two vectors $$d$$ and $$q$$ using the following formula:

$$ \text{sim}(d, q) = \cos\theta = \frac{d \cdot q}{\|d\| \, \|q\|} = \frac{\sum_{i=1}^{n} d_i q_i}{\sqrt{\sum_{i=1}^{n} d_i^{2}} \sqrt{\sum_{i=1}^{n} q_i^{2}}} $$

where

 * $$n$$ represents the number of dimensions of the space.

Important notes:
 * Before the similarity between vectors can be calculated, they must be placed in the same n-dimensional space. Due to the number of documents crawled by today's search engines, the value of $$n$$ can be several thousand. If a document $$a$$ includes the term $$t$$, a $$tf-idf$$ value is calculated for it; in a document $$b$$ which does not contain the term $$t$$, the corresponding component receives the value $$0$$.
 * The cosine similarity of two documents ranges from $$0$$ to $$1$$, since $$tf-idf$$ values cannot be negative.
 * A result of $$1$$ means the highest similarity and $$0$$ means no similarity.

Calculating the similarity for the user query vector and all document vectors constructed from the search engine's collection of documents will result in a set of similarity values. Higher numbers in the set represent higher similarity between the query and the corresponding document.
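A small sketch of the cosine similarity of two tf-idf vectors placed in the same n-dimensional space; the vectors and axis labels are made up for illustration:

```java
public class CosineSimilarity {

    // Cosine similarity of two vectors of equal length n; returns a value between 0 and 1
    // when all components are non-negative (as is the case for tf-idf weights).
    public static double cosine(double[] d, double[] q) {
        double dot = 0.0, normD = 0.0, normQ = 0.0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        if (normD == 0.0 || normQ == 0.0) {
            return 0.0; // an all-zero vector has no direction, treat as "no similarity"
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        // Axes: [web, science, sun]; the values are illustrative tf-idf weights.
        double[] document = {2.1, 1.3, 0.0};
        double[] query = {1.0, 1.0, 0.0};
        System.out.println(cosine(document, query)); // close to 1: the query matches the document well
    }
}
```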

In conclusion, $$tf-idf$$ has proven itself to be an extremely robust measure for ranking documents and was used by many search engines.

PageRank Definition


PageRank is a quality ranking value that is calculated for every indexed document, based on the link graph structure of the Web. The idea is that the importance of a particular web page can be measured by calculating the number of other pages that link to it and their importance. PageRank can be considered as the structural aspect of the 3 dimensions of social capital.

An easy explanation of the PageRank logic can be found in the scientific world, where popular articles tend to be cited more often than less popular ones. Similar relations can be traced when a professor writes a recommendation (gives an authoritative opinion) for a student, who later on has better chances than students without recommendations. Another example are celebrities or other prominent figures in human society: the more often people talk about a person (in other words, link to that person), the more popular he or she is. So, the more pages link to a page, the more relevant it should be. Indeed, there is no sense in linking to a page which provides no interesting information.

However, ranking based on simply counting links does not work well in real life and can easily be manipulated by spammers. Relatively little effort is needed to create a large number of fake pages and link them to the page which should be promoted. The PageRank algorithm extends the idea of citation-count popularity by taking into consideration another important factor, the weight of each page. It turns out that not only the number of back-links is important, but also how important each back-link is (a back-link is a link pointing to a given page).

Assume we create a page $$i$$ and make a hyperlink to page $$j$$. This means that we consider page $$j$$ to be related or relevant to the topic of our page $$i$$. If a lot of other web pages start linking to page $$j$$, then inside the Web there is a shared belief that page $$j$$ has a high level of importance. In a similar way, if a web page $$j$$ has only a few backlinks, but those links come from very influential and popular websites (for example www.reuters.com or www.usa.gov), it also means that page $$j$$ is important. Examples of PageRank values for a simple network are shown in Figure 3. A simplified formula of the PageRank algorithm, for a page $$A$$ that is linked to by pages $$B$$, $$C$$ and $$D$$, is the following:

$$ PR(A) = \frac{PR(B)}{Out(B)} + \frac{PR(C)}{Out(C)} + \frac{PR(D)}{Out(D)} $$

where

 * $$PR(A), PR(B), PR(C), PR(D)$$ – PageRank of pages $$A, B, C, D$$ respectively,
 * $$Out(B), Out(C), Out(D)$$ – the number of outgoing links from pages $$B, C, D$$ respectively.

Generally, the PageRank value for a page $$u$$ can be expressed as follows:

$$ PR(u) = \sum_{v \in B_u} \frac{PR(v)}{Out(v)} $$

where


 * $$PR(v)$$ - PageRank of page $$v$$,
 * $$Out(v)$$ - the number of outgoing links from page $$v$$, or outdegree of page $$v$$,
 * $$B_u$$ is a set of all pages, which link to page $$u$$.

The formula is recursive, but the computation will converge at some point, regardless of the initial rank assigned to each page. To make it converge, all dangling links (links that point to pages with no outgoing links) are removed beforehand; after the PageRank calculation they can be added back to the system without a severe effect.

PageRank is based on the probability that a user will reach a particular page while randomly walking and clicking successive links he finds on the Web. This model is named random surfer and will be introduced in the following section.

Explanation of PageRank - Random surfer model
Consider the graph depicted in Figure 4. It represents a small network of pages on the Web connected (linking) to each other. Nodes represent web pages and edges represent connections (links) between them. Each node has an equal probability of being chosen as a starting point.

If a user is on page A, he has two choices of where to go next: B or D. Both choices have probability 1/2 each, as the user decides randomly where to go. For this example, suppose he goes to D. Then he may choose between 3 different pages: A, C and E. Each decision has probability 1/3. If he goes to C, there is only one way to continue from it, so the probability of going from C to A is 1. Then the process starts from the beginning, but now the user may follow a completely different path due to the randomness of his decisions.

This is an example of a Markov chain, where the next step, or transition, depends only on the current state (and not on past states). A Markov chain can be represented as a transition probability (or stochastic) matrix: the elements of each column of a stochastic matrix are non-negative and their sum is equal to 1. The matrix describes the probabilities of where a random surfer can go next from any given node. Let us denote the matrix by $$M$$. If there are $$n$$ pages in the Web, then the matrix has $$n$$ rows and $$n$$ columns. The element $$M_{ij}$$ (in row $$i$$ and column $$j$$) represents the transition probability of going from page $$j$$ to page $$i$$: it has the value $$1/Out(j)$$ if page $$j$$ has $$Out(j)$$ outgoing links, one of which points to page $$i$$, and the value 0 if page $$j$$ does not link to page $$i$$.

So we receive the following matrix $$M$$ representing the graph from Figure 4 (rows and columns in the order A, B, C, D, E):

$$ M = \begin{pmatrix} 0 & \frac{1}{3} & 1 & \frac{1}{3} & 0 \\ \frac{1}{2} & 0 & 0 & 0 & 0 \\ 0 & \frac{1}{3} & 0 & \frac{1}{3} & 1 \\ \frac{1}{2} & 0 & 0 & 0 & 0 \\ 0 & \frac{1}{3} & 0 & \frac{1}{3} & 0 \end{pmatrix} $$
The matrix $$M$$ is also related to the PageRank formula above: the element $$M_{ij}$$ (in row $$i$$ and column $$j$$) represents the score page $$i$$ receives from each page $$j$$ linking to $$i$$. Let us denote by $$x_A$$, $$x_B$$, $$x_C$$, $$x_D$$ and $$x_E$$ the weight (or importance) of each of the 5 pages from Figure 4. The following system of equations can be obtained from matrix $$M$$:

$$ \begin{cases} x_A = \frac{1}{3} x_B + x_C + \frac{1}{3} x_D \\ x_B = \frac{1}{2} x_A \\ x_C = \frac{1}{3} x_B + \frac{1}{3} x_D + x_E \\ x_D = \frac{1}{2} x_A \\ x_E = \frac{1}{3} x_B + \frac{1}{3} x_D \end{cases} $$
From this system of equations it can be seen that, for example, page A has an importance $$ x_A = \frac{1}{3} x_B + x_C + \frac{1}{3} x_D $$, since pages B, C and D link to page A, and page B has 3 outgoing links, page C has only 1 outgoing link, and page D has 3 outgoing links.

Before we go on, let us emphasise the meaning of the matrix $$M$$ and its elements:
 * 1) The elements $$M_{ij}$$ in each column $$j$$ of matrix $$M$$ represent the probabilities of going from page $$j$$ to page $$i$$. The sum of the probabilities in each column is equal to 1.
 * 2) The elements $$M_{ij}$$ in each row $$i$$ of matrix $$M$$ represent the importance page $$i$$ receives from each page $$j$$ linking to $$i$$, as described by the system of equations above.

The system of equations above can be rewritten as

$$ M x = x $$

where


 * $$ x = [ x_A, x_B, x_C, x_D, x_E ]^{T}$$ is the importance vector, or PageRank vector of the system from Figure 4.

Now it can be seen that the equation above is almost the standard equation which ties a matrix, its eigenvector and its eigenvalue together:

$$ M x = z x $$

with

 * $$z$$ – an eigenvalue of matrix $$M$$,
 * $$x$$ – an eigenvector corresponding to the eigenvalue $$z$$.

As is known, stochastic matrices have their largest eigenvalue $$z$$ equal to 1 (by the Perron-Frobenius theorem). Therefore the two equations above become equivalent, and we only need to find the eigenvector $$x$$ corresponding to the eigenvalue $$z = 1$$. Substituting matrix $$M$$, eigenvector $$x$$ and eigenvalue $$z = 1$$ into this equation, we have:

$$ \begin{pmatrix} 0 & \frac{1}{3} & 1 & \frac{1}{3} & 0 \\ \frac{1}{2} & 0 & 0 & 0 & 0 \\ 0 & \frac{1}{3} & 0 & \frac{1}{3} & 1 \\ \frac{1}{2} & 0 & 0 & 0 & 0 \\ 0 & \frac{1}{3} & 0 & \frac{1}{3} & 0 \end{pmatrix} \begin{pmatrix} x_A \\ x_B \\ x_C \\ x_D \\ x_E \end{pmatrix} = \begin{pmatrix} x_A \\ x_B \\ x_C \\ x_D \\ x_E \end{pmatrix} $$

From this, the original system of equations can be obtained again. Substituting $$x_B$$ and $$x_D$$ into the $$x_E$$ equation, we obtain $$x_E = \frac{1}{3} x_A $$. Then, substituting $$x_B$$, $$x_D$$ and $$x_E$$ into the $$x_C$$ equation, we have:

$$ x_C = \frac{1}{3} x_B + \frac{1}{3} x_D + x_E = \frac{1}{6} x_A + \frac{1}{6} x_A + \frac{1}{3} x_A = \frac{2}{3} x_A $$

Now, the resulting vector $$x$$ can be obtained:

$$ x = \begin{pmatrix} x_A \\ \frac{1}{2} x_A \\ \frac{2}{3} x_A \\ \frac{1}{2} x_A \\ \frac{1}{3} x_A \end{pmatrix} = c \begin{pmatrix} 6 \\ 3 \\ 4 \\ 3 \\ 2 \end{pmatrix} $$

So, the vector $$x$$ above is the PageRank vector, which represents the importance of each web page in the considered system. $$\frac{x_A}{6} $$ was replaced by $$c$$ above, since PageRank shows only the relative importance of web pages, and $$c$$ can take any value. If we want to see the probabilities of visiting each page explicitly, we can take $$c = \frac{1}{18}$$ (the common divisor that makes all values of the vector sum to 1):

$$ x = \begin{pmatrix} \frac{6}{18} \\ \frac{3}{18} \\ \frac{4}{18} \\ \frac{3}{18} \\ \frac{2}{18} \end{pmatrix} \approx \begin{pmatrix} 0.33 \\ 0.17 \\ 0.22 \\ 0.17 \\ 0.11 \end{pmatrix} $$
That means that the most important page is A, which will be visited approximately 33% of the time. B and D have roughly the same popularity of 17% each. The popularity of C and E is about 22% and 11% respectively. The sum of all values inside the vector is 1.

Power Method for PageRank calculation
Since the size of the Web is several billion pages, finding eigenvectors by solving systems of equations (as was done above) is a rather complicated and resource-consuming task. The same PageRank vector can, however, be computed in an alternative, iterative way.

Consider the graph from Figure 4. At the very beginning each node has an equal probability of being chosen as the starting point. Each of them has the value 1/5, or 0.2, which can be written as an initial vector $$u$$:

$$ u = \begin{pmatrix} 0.2 \\ 0.2 \\ 0.2 \\ 0.2 \\ 0.2 \end{pmatrix} $$
If we multiply the matrix $$M$$ given above by the vector $$u$$, we receive a vector whose elements represent the probability of each page being visited in the step following the initial state $$u$$:

$$ M u \approx \begin{pmatrix} 0.33 \\ 0.10 \\ 0.33 \\ 0.10 \\ 0.13 \end{pmatrix} $$
The probability of each page being chosen in the second step is described by the equation:

$$ M (M u) = M^{2} u \approx \begin{pmatrix} 0.40 \\ 0.17 \\ 0.20 \\ 0.17 \\ 0.07 \end{pmatrix} $$
If we keep multiplying the result of the previous step by $$M$$, we receive the probability vector for step 3, and in general for step $$k$$:

$$ M^{k} u = M (M^{k-1} u) $$
As can be seen, starting from step 31, or $$M^{31} u$$, the values of the resulting vector stop changing. The system has found its equilibrium vector:

$$ x^{*} = \begin{pmatrix} 0.33 \\ 0.17 \\ 0.22 \\ 0.17 \\ 0.11 \end{pmatrix} $$
We also see that $$x^{*} = x$$, where $$x$$ is the PageRank vector computed earlier. In fact, only the first ~50 steps need to be computed in order to obtain a good approximation of the PageRank vector.
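A minimal sketch of the power method for the five-page graph above, using the column-stochastic matrix $$M$$ derived earlier (the iteration count of 50 follows the remark above):

```java
public class PageRankPowerMethod {

    public static void main(String[] args) {
        // Column-stochastic transition matrix M for the graph of Figure 4
        // (rows and columns in the order A, B, C, D, E).
        double[][] m = {
                {0.0,     1.0 / 3, 1.0, 1.0 / 3, 0.0},
                {1.0 / 2, 0.0,     0.0, 0.0,     0.0},
                {0.0,     1.0 / 3, 0.0, 1.0 / 3, 1.0},
                {1.0 / 2, 0.0,     0.0, 0.0,     0.0},
                {0.0,     1.0 / 3, 0.0, 1.0 / 3, 0.0}
        };
        double[] x = {0.2, 0.2, 0.2, 0.2, 0.2}; // initial vector u: every page equally likely

        for (int step = 0; step < 50; step++) {  // ~50 iterations give a good approximation
            x = multiply(m, x);
        }
        // Prints approximately 0.333, 0.167, 0.222, 0.167, 0.111 for pages A..E
        for (double v : x) {
            System.out.printf("%.3f%n", v);
        }
    }

    // Matrix-vector product: one step of the power iteration.
    static double[] multiply(double[][] m, double[] x) {
        double[] result = new double[x.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < x.length; j++) {
                result[i] += m[i][j] * x[j];
            }
        }
        return result;
    }
}
```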

Random walks modelling of PageRank
Behaviour of a random surfer can also be modelled as random walks, using a small computer program (the code is written in Java; a sketch of such a program is given after the results). Running such a simulation for 100,000 steps gives a result very similar to the following:
 * The node A was visited: 33345 times
 * The node B was visited: 16704 times
 * The node C was visited: 22194 times
 * The node D was visited: 16641 times
 * The node E was visited: 11116 times

From this we can again see that the most popular node is A (33%), followed by C (22%), B and D with roughly the same popularity (17% each), and the least visited node E (11%). These results are roughly the same as those computed earlier with the eigenvector and power methods.
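A minimal sketch of such a random-walk simulation, assuming the graph of Figure 4 as derived earlier (A→B,D; B→A,C,E; C→A; D→A,C,E; E→C) and 100,000 steps, could look like this:

```java
import java.util.Random;

public class RandomSurfer {

    public static void main(String[] args) {
        String[] names = {"A", "B", "C", "D", "E"};
        // Outgoing links of each node (indices into names): the graph from Figure 4.
        int[][] links = {
                {1, 3},       // A -> B, D
                {0, 2, 4},    // B -> A, C, E
                {0},          // C -> A
                {0, 2, 4},    // D -> A, C, E
                {2}           // E -> C
        };
        int steps = 100_000;
        int[] visits = new int[names.length];
        Random random = new Random();

        int current = random.nextInt(names.length);   // each node is an equally likely start
        for (int i = 0; i < steps; i++) {
            visits[current]++;
            int[] out = links[current];
            current = out[random.nextInt(out.length)]; // follow a randomly chosen outgoing link
        }

        for (int i = 0; i < names.length; i++) {
            System.out.println("The node " + names[i] + " was visited: " + visits[i] + " times");
        }
    }
}
```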

PageRank and the real Web
The small graph example can hardly be compared to the real Web. It represents only a very small part of the Web in a quite idealised fashion. In real life, a lot of obstacles occur. For example, some nodes may have no outgoing edges; a random surfer would get stuck there, and the PageRank of such a system would be 0. Another example is disconnected components: a random surfer visiting one group of pages will never reach another group of pages not connected to the first one. In order to overcome these problems, a damping factor $$p$$ (usually in the range 0.8 to 0.9) is added, and the PageRank vector $$x_{new}$$ is defined in the following way:

$$ x_{new} = p M x + \frac{1-p}{n} e $$

where


 * $$x$$ – the PageRank vector computed from the link structure, as above,
 * $$e$$ – vector of all 1's of appropriate size,
 * $$n$$ – number of nodes in a graph (pages in the Web).

This damped model describes the random surfer as follows. Most of the time, he visits a page he can reach through the outgoing links of the Web page he is currently at. At other times, he jumps to a randomly chosen page, which does not need to be connected through an outgoing link with the page the random surfer is currently at. These jumps have a lower probability $$1-p$$ of occurring than the regular traversal through outgoing links, which happens with probability $$p$$. When jumping, every Web page has the probability $$\frac{1}{n}$$ of being chosen.
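In code, one iteration of this damped update could be sketched as follows (a sketch only; p = 0.85 is an assumed value inside the usual 0.8 to 0.9 range, and x is assumed to sum to 1):

```java
public class DampedPageRankStep {

    // One damped PageRank iteration: x_new = p * M * x + (1 - p) / n * e,
    // where e is the vector of all 1's and n is the number of pages.
    public static double[] dampedStep(double[][] m, double[] x, double p) {
        int n = x.length;
        double[] next = new double[n];
        for (int i = 0; i < n; i++) {
            double linkPart = 0.0;
            for (int j = 0; j < n; j++) {
                linkPart += m[i][j] * x[j];       // probability of arriving via an outgoing link
            }
            next[i] = p * linkPart + (1 - p) / n; // plus the probability of a random jump
        }
        return next;
    }
}
```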

Learning Objectives
The learning objectives are to consider:
 * Biases in search results, their types and consequences
 * Personalized search, the notion of adaptive hypermedia, and their pros and cons
 * The filter bubble question

Biased search results
Search engines are programmed to use ranking algorithms and sort the results according to their popularity. But sometimes these rules are bent and biases appear, as some studies reveal. Biases and their revelation, in turn, lead to a loss of user trust in a search engine. There exist various factors which influence search results; basically they can be categorized into social, economic and political interests.

“Google bombing” is the name of the practice in which a lot of web sites create links to some web page, or image, using off-topic keywords in its description. This results in the appearance of biased results, often for satirical purposes. A similar strategy is used in search engine optimization, or SEO, which improves web sites' appearance in search results for particular search queries. Optimization strategies may involve editing and structuring the content in a special way, based on an understanding of search engines, user behaviour and the keywords users enter, in the expectation of receiving the desired result, as well as a variety of other techniques. Spamdexing, or black SEO, is closely related to Google bombing and is used for the artificial promotion of web sites. It can be classified into 2 common techniques: content spam and link spam. Black SEO is in a permanent struggle with changing search algorithms, which in turn try to eliminate such falsifications. SEO and search spam are good examples of the dominance of commercial interests, which severely undermined the authority and usefulness of search engines in the mid-nineties, causing a dramatic rise of irrelevant results.

Political interest usually concerns questions of national security and anti-terrorism measures. But sometimes it turns out to influence freedom of speech negatively, tending towards censorship. Information filtering is common in China, North Korea and some Middle Eastern countries. Finally, even parents may use “safe search” functionality in order to protect their children from undesirable content.

Personalized search and filter bubbles
Personalized search is a term describing a strategy of search engines to determine the search results for the end user based not only on ranking algorithms and relevance, but also on previous user experience and many other unique user parameters. Personalized search is linked to the notion of adaptive hypermedia. Adaptive hypermedia builds models of the individual users of a system, considering their knowledge, experience and goals. Later on, these models are used for adapting the system and its interactions exclusively to each user. Adaptive hypermedia tends to solve the problem known as “lost in hyperspace”, when a user is disoriented by the huge amount of hyperlinks to choose from and has too little knowledge and experience to decide which of them are more appropriate for him.

Popular search engines and websites, like Facebook or Youtube, are able to track and collect users' data. The whole search history, visited pages, user preferences and characteristics like sex, age, geolocation and many more are stored and used during search. This tactic leads to a dramatic increase in the relevance of the results. For example, for the search query “restaurant”, two users in different cities will see different lists of restaurants, because the queries were initiated from two different geolocations. Moreover, a lot of other factors are usually considered, especially the user's query history; so if his history contains queries about Chinese restaurants, it is very likely that a search engine will place similar types of restaurants in the first positions for this particular user. A scientist and a housewife will obviously see completely different, but more relevant, results.

On the other hand, personalized search has drawbacks, one of which is known as the “filter bubble”. A filter bubble is a phenomenon in which a search engine using personalized search continuously guesses which information a user would like to see in the search results, thereby separating the user from results which are not coherent with the preferences in his history. This situation leads to complete isolation of the user from all the information he disagrees with and creates an ideological bubble around him. Thus, the user will not find new information, and biased search results again take place.

The feature also had a dramatic influence on the entire SEO industry, since search results no longer stay similar for each user and depend more on the end-user model.

Conclusion
Search engines are an integral part of the information society. In this chapter we enumerated different types of search engines, such as Web and meta search engines, as well as audio search engines and vertical search engines specialised in certain topics. This chapter gave an overview of the history of search engine development: it started in the 1990s with the simplest techniques like text matching and led to enterprise corporations which earn billions of dollars with their sophisticated algorithms. A typical architecture of a Web search engine consists of various modules, the main ones being:


 * Web crawler,
 * Search index,
 * Ranking module, query engine and Web interface.

Ranking of search results is the main task which determines the entire success and popularity of any search engine. One basic algorithm is tf-idf, which is usually used for text analysis and information retrieval. Another algorithm is PageRank, which is at the heart of the Google search engine and is based on the random surfer model. We provided example computations of PageRank using three different approaches. Further, we discussed some problems which can occur when applying PageRank to a real-world graph and gave a solution to them.

Relevance, trust and biases were the final subject of the chapter. Biases can be caused by social, economic and political interests and result in the Google bombing phenomenon, black SEO and censorship respectively. Personalization of search results, apart from increasing search relevance, can also cause biases in results. Personalization will be considered more deeply in the next chapter.

In the next chapter, apart from the personalization of search results, the notion of a multi-stakeholder system will be considered, together with the economics of search engines, which mainly consists of advertising, its relevance and auctioning mechanisms.

Quiz questions
{Continue the phrase “A meta search engine is ...”}
- a search engine which uses the information marked up by HTML meta-tags as the main source of information about the document.
- a search engine which performs a search in some specific topics, for example Amazon.
+ a search engine which combines search results of several other search engines.

{What does the selection policy of Web crawlers stand for?}
- How often the crawler should refresh each page.
- Which web pages to select to display on the first page of search results.
+ Which pages the crawler should download and in what order.
- How the load should be decreased on the visited web-sites.
- How the load should be decreased on the crawler's servers.

{In the tf-idf formula two measures are combined. Is idf really needed?}
- We can use tf alone, but the result will not be precise.
- We use idf to calculate the frequency of the word in a particular document.
+ We use idf to eliminate the impact of frequent words like “the”, “be” etc.
+ It describes the social importance of the term.

{What method is mainly used for computing PageRank of the real Web?}
- solving a system of equations
- the tf-idf formula
+ the power method
- the random walks method

{What key ideas can describe the “Adaptive hypermedia” concept?}
+ personalized search
+ building user models considering the users' knowledge, experience and goals
- disorientation of the user caused by a large amount of links and text around him