
= Chapter. Web Structure. Part 1 =

Section 1: Overview
This chapter is organized as follows. It begins with a short introduction and motivation for studying the structure of the Web and its development. Section 3 then looks at the technical and human aspects of link creation between documents on the Web. The succeeding section reviews regularities in human behavior and different types of models. Section 5 introduces the first model of link creation. Section 6 explains a technique for evaluating the accuracy of a model and critically assesses the first proposed model. The next section describes power laws and their applications. Finally, section 8 examines a refined model for link creation on the Web. The chapter closes with a short conclusion and a small quiz to check the overall understanding of the chapter.

Section 2: Introduction
In the first part of our Web Science MOOC we learned about the technologies that make the Web function, but the Web is not just protocols for data transfer or formats for document content and links. Knowing the content does not tell us how this content is produced. Therefore the people interacting with the Web, who produce documents, communicate, search and advertise, are also a huge field for studying the Web and its properties.

In this chapter we begin the study of human behavior on the Web and, in particular, its influence on the actual structure of the graph of interlinked documents. The study of human interaction on the Web continues in the succeeding chapters, with the overall goal of identifying and describing different emergent properties of such a huge structure as the Web.

Learning Objectives of this Section

 * 1) What is the link creation process on the Web?
 * 2) Understand difficulties in studying link creation on the Web.

Subsection 3.1: Technical Aspects
As we have already learned from the technological part of our web science course, the World Wide Web is based on the URL (Uniform Resource Locator) and HTTP (Hypertext Transfer Protocol) and uses HTML (Hypertext Markup Language) as a standard for creating web documents. One of the main design principles Tim Berners-Lee had in mind when creating the Web standards was that people should be able to work on and contribute to the Web independently, while still being able to connect related documents, usually produced by different authors. To realize this principle the idea of hyperlinks was invented.

A hyperlink is simply a reference to a document, to one of its parts or to some other resource on the Web. It can be followed by clicking on it in a browser, activated by hovering, or followed automatically. The syntax for creating hyperlinks is described in more detail in the “Web content” chapter of this course. The main point is that once a hyperlink is inserted into some document on the Web, it creates a directed link from the document containing the hyperlink to the document it points to.

Technically we can say that once a hyperlink is inserted into a document and this document is published on the Web, the link between this document and the linked document is established. This process of inserting a hyperlink and publishing the document containing it is therefore referred to as the link creation process on the Web.

There are several difficulties in studying link creation on the Web. As discussed in our classroom session, among them are:
 * 1) In most cases there is no way to know who actually created a web page and the hyperlinks included in it. Some web servers log information about user activities, but it is still very difficult to clearly identify the author.
 * 2) Information about the time of publication, modification or last visit of a document on the Web is usually unavailable as well. Unlike in the previous case, this information can be logged by web servers, but in most cases these logs are not stored for a long time and are not available to the public.
 * 3) Bergman in his work described the “Deep Web” – pages that are not accessible to search engines and their crawlers. These pages really exist on the Web, but nobody or only a limited number of people know about them. This segment of the Web is much larger than the part of the Web that we can explore.
However, technical difficulties are highly outnumbered by the human aspects of link creation, which are discussed in the next subsection.

Subsection 3.2: Human Factor
We already learned in previous lessons that the Web is created by people. By browsing the Web and interacting with it, people produce lots of content every day, including links between documents. This can be a researcher writing a scientific work and citing the works of others, a reporter writing an article related to other articles, or many other people. As we see, incentives for document and link creation differ from person to person.

Another aspect is the level of informedness. (Informedness here means that different people are informed to different degrees with respect to some proposition: a person is more informed with respect to a proposition if she has better knowledge of the facts relevant to it and is also in a better position to defend it in the face of criticism.) The volume of information available on the Web increases constantly and no one can possess all of it. The fact that different people know different things greatly influences the content they produce.

We should also not forget the basic characteristics of human behavior, which are defined by:
 * Genetics
 * Social norms
 * Core faith and culture
 * Attitude
It is also very common for a human to act with a certain degree of randomness, which may be explained by psychological properties of a person or by some external effects.

Overall, we can see that there are a lot of factors influencing human activities on the Web, which makes modeling link creation quite difficult in practice.

Learning Objectives of this Section

 * 1) Understand regularities in human behavior on the Web.
 * 2) Distinguish descriptive and predictive models.
 * 3) What is a probabilistic model?

Subsection 4.1: Regularities in Random Behavior
If we look at human behavior on the Web from the outside, it may seem completely unpredictable or random. There are therefore approaches that model humans as “black boxes” acting absolutely at random. However, numerous sociological experiments have shown that there are regularities in seemingly random human behavior, and there is further research in the sociological field aimed at finding such patterns. As we will see, people express similar behavior in some particular situations. A good model should take these regularities into consideration.
 * The first example was given by Milgram and Toch back in 1969. They performed an experiment on the streets of New York and noticed that if some person seems to look at something interesting in a particular direction, people around him also tend to look in that direction. About 40% of naïve passers-by actually followed the gaze of a single person, and this percentage increased up to 80% when 5 persons were looking in the same direction. This experiment showed that people tend to do what other people do, and the more people are involved in the process, the bigger the chance that others will also join it.
 * The second example is based on the work of Helbing et al. They studied the phenomenon of segregation of a bidirectional flow of pedestrians into lanes of people walking in the same direction. They came up with the idea that people tend to optimize their own walking route in the crowd, which makes the crowd self-organize and effectively deal with obstacles and bottlenecks on the way.
 * The last example is given by Helbing et al. in their study of the formation of human trails on the university campus of Stuttgart-Vaihingen. When students walk between buildings on a university campus they tend to use shortcuts over the grass, which leaves trails. The interesting observation here is that every new student tends to use the trails that other students made before. At first a trail is barely visible on the grass, but the more people use it, the more marked it becomes, which increases the probability that a new person will notice and use it. This example is similar to the first one, but here the information is exchanged indirectly: a person does not know who walked on the trail before her, but she can estimate the number of people who used this way. This situation is very close to what we have on the Web, where we do not know who created a web page or hyperlink, but one can observe the popularity of a web page, identified by the number of links pointing to it.

Getting back to human behavior on the Web, we can consider several regularities while modeling link creation:
 * People tend to rely on opinions or documents supported by others. The more people link to some document on the Web, the higher the chance that a new document created by someone will also link to this document. This property is applied in the preferential attachment model, which will be discussed later in this chapter.
 * Human actors usually try to optimize their interactions with the Web. For example, when a person uses a search engine she usually needs only the first search result page. If no relevant link is found on this page, she will most likely refine her query rather than browse the current search results page by page.
 * The way people read documents also matters. People usually read a document from top to bottom and, in Western culture, from left to right. This implies that a person will more likely use a link in the top left corner of the page than one in the bottom right corner.
 * Search engines are also an influencing factor. While surfing the Web, in most cases we use a search engine to find a document, and the order in which documents appear on the search result page may influence the choice of the link we use.
Despite the fact that it is often impossible to predict the behavior of people on the Web, incorporating these discovered regularities may lead to better models.

Subsection 4.2: Descriptive and Predictive Models
Before modeling the problem it is important to distinguish different types of models. In our classroom session we discussed two types of models: descriptive and predictive. A predictive model tries to best predict the probability of an outcome given a set of input data. One example could be a model of heating water: we can predict that when the water temperature reaches 100 degrees Celsius, it will begin to boil. Another example, which we had in our classroom session, is stock market prediction. Traders analyze current figures and build trend curves, which are also a type of predictive modeling.

Another characteristic of predictive models is that they only give predictions and often do not give us the reasons why the predicted events will happen. Because of this we should be careful when trying to infer dependencies from observed regularities. For example, Höfer et al. showed that there was a significant correlation between the increase in the stork population around Berlin and the increase of deliveries outside city hospitals. Of course there is no real dependency between these two facts; the statistics showed a correlation between them purely by coincidence.

Descriptive models describe events that already happened in the past, identifying relationships between the elements that create them. The example here was also mentioned in our classroom session: the double pendulum. We can find exact mathematical formulas that explain the mechanical and physical laws defining its movement. However, using these formulas it is almost impossible to predict the position of the double pendulum 10 seconds from now.

We can also think of a model that is descriptive but also gives some predictions. An example is a model of the planets revolving around the Sun: with it we can both describe the law of gravity and predict future positions of the planets.

Further, in the case of humans acting on the Web, every person has her own set of actions she can perform. Analyzing the statistics we may find that some actions are performed more often than others. To describe such inequality we may need to assign probabilities to all types of actions that each particular person may perform. In this case our model is called a probabilistic model. An example of a probabilistic model is exactly our problem of link creation on the Web: as we already learned, we cannot know for sure which link will be added next, therefore we assign probabilities to different links appearing. This leads us to the next section, where a very simple model for link creation will be introduced. It is important to note that all models for link creation on the Web presented in this chapter are descriptive probabilistic models.

Learning Objectives of this Section

 * 1) Understand the Erdős–Rényi model.
 * 2) Understand graph representation for the links between documents on the Web.

Subsection 5.1: Erdős–Rényi Model
If we think about people creating hyperlinks between webpages, the simplest way to model their behavior is to assume that they act absolutely at random and independently of one another. Assuming that humans act independently is a naïve approach, but it may give some ideas about the link creation process.

The idea of such a model comes from graph theory and was proposed by two Hungarian mathematicians, Paul Erdős and Alfréd Rényi, in 1959. The idea of the model is that in a random graph with a predefined number of nodes $$N$$, links between nodes are created independently and with equal probability. There are two variants of the model. In the first variant the number of links to be added $$L$$ is given, and therefore each of the possible links appears in the graph with the same probability $$p = L/(N(N-1))$$ (the number of links divided by the number of possible directed links). In the second variant (introduced by Edgar Gilbert) the probability $$p$$ is given and the number of links in the resulting graph is unknown. Since both variants produce similar results, without any loss of generality only the first variant will be used in the following.

Erdős–Rényi is a descriptive probabilistic model and, as with every model, we will have to test the results it produces against real-world observations. That is, we need to check that the model adequately describes the process of link creation on the Web. Evaluation of this model will be discussed in the succeeding sections.

Subsection 5.2: Data Representation
As has already been mentioned above, the Erdős–Rényi model comes from graph theory and constructs a random graph as output. If we think of documents on the Web as nodes and hyperlinks between them as directed links, we can view the interlinked document structure of the Web as a directed graph.

In our model we have the number of nodes $$N$$ and the number of links $$L$$ as input. The graph itself will be represented by its adjacency matrix $$A$$. In this matrix of size $$N \times N$$ each element $$a_{i,j}$$ represents a link from node $$i$$ to node $$j$$: if this link exists then $$a_{i,j} = 1$$, otherwise $$a_{i,j} = 0$$.

Subsection 5.3: The Algorithm
Pseudo code for the Erdős–Rényi model was presented in our classroom session. An example of the resulting graph, without considering the direction of links, is presented in Figure 1. However it is hard to say from this graphical representation whether the model produces networks close to the real Web or not. To evaluate the adequacy of the model, we will need to compare some parameters of generated and real networks. These parameters are discussed in the next section.
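A minimal Python sketch of the first variant of the algorithm (a fixed number of links $$L$$), using the adjacency matrix representation from the previous subsection; function and variable names are illustrative:

```python
import random

def erdos_renyi(N, L, seed=None):
    """Generate a random directed graph with N nodes and L links,
    each possible link being equally likely (Erdős–Rényi model)."""
    rng = random.Random(seed)
    A = [[0] * N for _ in range(N)]   # adjacency matrix: A[i][j] = 1 if i links to j
    links = 0
    while links < L:
        i, j = rng.randrange(N), rng.randrange(N)
        if i != j and A[i][j] == 0:   # no self-links, no duplicate links
            A[i][j] = 1
            links += 1
    return A

A = erdos_renyi(10, 20, seed=42)
print(sum(map(sum, A)))   # total number of links: 20
```

Each candidate link is drawn uniformly and independently, so every possible link has the same chance of appearing, which is exactly the assumption of the model.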

Learning Objectives of this Section

 * 1) Understand statistical hypothesis testing for model accuracy verification
 * 2) Understand in-degree distribution
 * 3) Understand log-log plot for visualizing skewed distributions
 * 4) Accuracy of Erdős–Rényi model for link generation

Subsection 6.1: Observed Behavior vs. Probabilistic Model
Our goal now is to verify if Erdős–Rényi is a good model for link generation on the Web.

In the case of predictive models, we could have calculated performance and correctness measures used in information retrieval, such as precision, recall, fall-out, F-measure and so on. However, Erdős–Rényi does not predict which links will be created; it just gives an equal probability for each link to appear. Another point is that it is almost impossible to predict precisely what a human will do on the Web. Therefore it is very unlikely that a predictive model will be developed for this problem that accurately determines which link will be created next.

It is also hard to validate the model empirically by comparing the corresponding graph drawings. Figures 2 and 3 show two graph drawings: one generated by the Erdős–Rényi random graph model, and the other by a preferential attachment model, which will be described later in this chapter. Even on these two pictures of graphs with only 30 vertices it is quite hard to spot the difference. Moreover, if we try to draw graphs with millions of nodes, it becomes almost impossible to visually distinguish one model from another. Therefore we need some other method to test the accuracy of our model.

To validate a probabilistic descriptive model like Erdős–Rényi, a method called statistical hypothesis testing can be used. The core idea of the method, also discussed in our classroom session and presented in overview in Figure 4, is:
 * 1) Choose some statistic for testing (like the degree distribution, which will be explained later in this section);
 * 2) Collect this statistic from both empirically observed data (the real Web graph or one of its parts) and the tested probabilistic model (Erdős–Rényi);
 * 3) Compare these two statistics in some way.

The first question arising at this point is which statistic we can take for testing. In the case of graphs the most commonly used statistic is the degree distribution. In our case we have a directed graph (each link has a beginning, the origin document, and an end, the document it points to), which has two types of degree distributions: in-degree and out-degree. For the evaluation of the Erdős–Rényi model we will take the in-degree distribution. The reasons for this choice will be discussed in the next subsection.



The in-degree of a node is just the number of incoming links this node has. In the Web graph it corresponds to the number of hyperlinks referring to the particular page. The in-degree distribution $$P(k)$$ of a network is then defined as the fraction of nodes in the network with in-degree $$k$$. That is, for each possible in-degree we compute how many nodes have exactly this number of incoming links. The result can be represented in a table; another possible representation is the histogram, an example of which is shown in Figure 5.
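Computed from the adjacency matrix representation introduced earlier, the in-degree distribution can be obtained as follows (a minimal sketch; names are illustrative):

```python
from collections import Counter

def in_degree_distribution(A):
    """Return {k: fraction of nodes with in-degree k} for adjacency matrix A."""
    N = len(A)
    # the in-degree of node j is the number of 1s in column j
    in_degrees = [sum(A[i][j] for i in range(N)) for j in range(N)]
    counts = Counter(in_degrees)
    return {k: c / N for k, c in counts.items()}

A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
print(in_degree_distribution(A))   # each in-degree 0, 1 and 2 occurs for one third of the nodes
```
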

Subsection 6.2: In-degree or Out-degree Distribution?
So why is it better to take the in-degree distribution rather than the out-degree distribution as a statistic for comparing model-created results and empirically observed data? This question was also discussed in our flipped classroom session.

The main idea here is that people are limited in their cognitive capabilities. We can easily imagine a webpage with hundreds of thousands of in-links: it can be a search engine page like Google or a social network like Facebook. However, a webpage with such a big number of hyperlinks on it (i.e. out-links) is barely imaginable, because no human will ever read it all. Such pages can only be created by spam bots or by sites trying to manipulate search engine rankings like PageRank, which will be discussed in subsequent chapters.

Hence we see that pages with high out-degree are most likely created artificially and may not follow general tendencies of human behavior. Indeed, Broder et al., studying the web graph and its properties, found that the out-degree distribution only partially follows a power law distribution (which will be discussed in the next section), while the in-degree distribution conforms to this well-known law of human societies much better.

Subsection 6.3: Visualizing Statistical Distributions
To compare the in-degree distributions of the Erdős–Rényi model and the real Web graph we first need to introduce one special type of plot, namely the log-log plot. The reason is that if we try to depict the in-degree distribution of a graph with millions of edges on a histogram like the one shown in Figure 5, all the bars will just blur together and we will not be able to make any comparison.

Instead of a histogram it is more appropriate to use a scatter plot to represent the in-degree distribution. On the plot, every pair (in-degree, number of nodes with this in-degree) is represented as a single point. However, since in a real-world network with millions of edges we will have very big and very small numbers on the plot at the same time (corresponding to nodes with low and high in-degree respectively), the axes of the plot usually use logarithmic scales. This means that instead of drawing a point $$(x, y)$$, the point $$(\log x, \log y)$$ is drawn. This type of plot with logarithmic scales on both axes is called a log-log plot.

One interesting property of a log-log plot is that all monomial functions (single-term polynomials) appear as straight lines on it. Figure 6 shows linear, quadratic and cubic functions on a log-log plot; the only difference between them is the slope. This can be explained as follows.

For a monomial function $$y=ax^k$$, taking the logarithm (with any base) of both sides gives the following equation:
 * $$(\log y)=k (\log x) + \log a.$$

If we then rename the coordinates $$X = \log x$$ and $$Y = \log y$$, which corresponds to a log-log plot, we will get a linear function:
 * $$Y = kX + b$$

where $$k$$ is the slope and $$b = \log a$$ is the intercept of this line with the $$Y$$ axis.

Note that this property also holds for inverse monomials like $$y=ax^{-k}$$. This property is important for understanding further sections of this chapter.
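The linearization above can be checked numerically for both a monomial and an inverse monomial; a quick sketch:

```python
import math

a = 5.0
for k in (3, -2):                       # monomial and inverse-monomial cases
    for x in (2.0, 10.0, 100.0):
        y = a * x ** k
        X, Y = math.log(x), math.log(y)
        # on a log-log plot the point (X, Y) lies on the line Y = k*X + log(a)
        assert abs(Y - (k * X + math.log(a))) < 1e-9
print("slope equals the exponent k in both cases")
```
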

Subsection 6.4: Evaluation of Erdős–Rényi Model for Link Generation
Now that we know about statistical hypothesis testing, the in-degree distribution and the log-log plot for representing it, we can test the accuracy of the Erdős–Rényi model of link creation on the Web.

For the comparison we will need a big random graph consisting of at least 2 million nodes. It has been proven that for a big number of nodes the in-degree distribution of an Erdős–Rényi generated graph follows a Poisson distribution. Therefore the in-degree distribution of our random graph may look like Figure 7, or like Figure 8 if we plot it on a log-log scale.
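The Poisson shape can already be observed on a modest simulation; a simplified sketch in which each link independently picks its target node uniformly at random (an approximation of the fixed-$$L$$ Erdős–Rényi construction; the numbers are illustrative, not a 2-million-node graph):

```python
import math
import random
from collections import Counter

rng = random.Random(0)
N, L = 2000, 10000                               # mean in-degree lambda = L/N = 5
targets = [rng.randrange(N) for _ in range(L)]   # each link picks its end node uniformly
in_deg = Counter(targets)
dist = Counter(in_deg[v] for v in range(N))      # in-degree -> number of nodes

lam = L / N
for k in (3, 5, 7):
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    empirical = dist[k] / N
    # the empirical fraction should be close to the Poisson probability
    assert abs(empirical - poisson) < 0.05
print("in-degree distribution is approximately Poisson with mean 5")
```
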

Now we need to compare this figure with the in-degree distribution of some real-world network. For example, we can take the article network of the English Wikipedia. If we plot the in-degree distribution of this network on a log-log scale we get the picture shown in Figure 9.

The difference between these two figures (8 and 9) is clear enough. While the Erdős–Rényi generated graph follows a Poisson distribution, the Wikipedia article network generally follows a power law, which will be described in more detail in the next section of this chapter. This means that people creating links on the Web do not act just randomly, as in the Erdős–Rényi model. Instead they create some kind of structure, and attempts to study it are the main topic of this chapter.

It is also important to mention that the method of overlaying two figures and judging whether the two distributions look alike cannot provide a mathematically sound conclusion. Instead, for comparing distributions there exist several statistical tools such as the Kullback–Leibler divergence, the Jensen–Shannon divergence, the Kolmogorov–Smirnov test and others. However, these statistical tools are far beyond the scope of this chapter.

Learning Objectives of this Section

 * 1) What is a power law distribution?
 * 2) Understand visual representation of power laws, long tail
 * 3) Learn examples of power laws in human behavior

Subsection 7.1: Formalization
Power laws have already been mentioned several times in this chapter; now we can introduce what they are and how they appear on the Web.

As we can all notice, popularity on the Web is extremely imbalanced. Some people like Lady Gaga or Barack Obama are nowadays extremely popular on Twitter, but most others are much less popular or not popular at all. The same is true not only for people, but also for books, songs, movies and web pages. The popularity of a web page here, as previously, is identified by the number of in-links to it.

One explanation for such inequality can be the tendency of people to make decisions based on the information conveyed from other people’s choices. This leads to the so-called “rich-get-richer” process, in which the more someone has of something, the more likely they are to get more of it. For example, the more friends you have, the easier it is to make more, or the more business a firm has, the easier it is to win more. This process greatly amplifies inequality.

Studying degree distributions of these imbalanced networks, scientists found that in most of the cases they at least asymptotically follow one rule:
 * $$P(n) \propto n^{-\gamma}$$,

where $$P(n)$$ is the fraction of elements having degree $$n$$. The parameter $$\gamma$$ in this formula turned out to be between 2 and 3 for most of the studied networks. Networks whose degree distribution is close to this formula are often referred to as power law networks.

In a broad statistical sense, a power law can be defined as “a functional relationship between two quantities, where one quantity varies as a power of another.” This type of law has been found in many biological and physical phenomena. However, in this section we will concentrate on applications of this law to different networks and the Web graph.

Subsection 7.2: Visualization
If we plot the distribution of a network following a power law we get a picture like the one shown in Figure 10. As we can see, the resulting figure can be divided into two parts. This relates to the well-known Pareto principle or 80/20 rule: in this example it means that about 20% of all nodes in the network have about 80% of all in-links. This principle can also be found in economics (Pareto found that 80% of Italy's land was owned by 20% of the population) and in business (e.g. 80% of a company's profits come from 20% of its customers). The remaining 80% of nodes, which form the right side of Figure 10, are called the long tail, because there are many more nodes there with much lower in-degree than the top popular nodes. Therefore a power law distribution is also often called a long tail distribution.

Drawing a power law distribution on a log-log scale, we will typically get a result similar to what we saw in Figure 9 for the article network of the English Wikipedia. As already mentioned in the previous section, all monomial functions, including inverse monomials, become straight lines on a log-log plot. The parameter $$\gamma$$ of the power law in this case is just the slope of this straight line. It is also notable that the real distribution follows this straight line only approximately and over a limited range: the bottom right part of the figure, which represents the long tail of the distribution, deviates strongly from the straight line. Nevertheless, the in-degree distribution of the article network of the English Wikipedia clearly follows a power law distribution with $$\gamma=2.05$$.
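The slope, and hence $$\gamma$$, can be estimated with a least-squares fit to the log-log points; a sketch on exact synthetic data (a real measurement would substitute the observed counts; the numbers here are illustrative):

```python
import math

# synthetic power-law distribution: P(n) proportional to n^-2.05, n = 1..1000
gamma = 2.05
counts = {n: n ** -gamma for n in range(1, 1001)}

xs = [math.log(n) for n in counts]
ys = [math.log(p) for p in counts.values()]

# least-squares slope of Y against X
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"estimated gamma: {-slope:.2f}")   # recovers 2.05 on exact data
```

On real data the fit is usually restricted to the range where the distribution actually follows the straight line, since the long tail deviates from it.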

Subsection 7.3: Examples of Power Law
Distributions that follow a power law have been found in many different aspects of human interaction. Apart from the already mentioned examples of the Pareto principle, power laws can also be found:
 * In social networks like Facebook or Google+. A large collection of degree distributions of different networks that exhibit a power law was compiled at the University of Koblenz and Landau.
 * In word frequencies in human-written text, the so-called Zipf’s law, which states that the frequency of any word is inversely proportional to its rank in the frequency table.
 * In scientific publications. Lotka’s law states that the number of authors in a given subject is inversely proportional to the number of publications they produced.
 * Finally, the presence of a power law in worldwide city populations is quite notable. As in Zipf’s law, the number of cities with some number of citizens is inversely proportional to this number.
There are many more examples of power laws in human societies. This gives us an incentive to include this phenomenon of human behavior in our model of link creation on the Web.

Learning Objectives of this Section

 * 1) Understand preferential attachment model and rich-get-richer dynamics
 * 2) Understand difference to Erdős–Rényi model
 * 3) Learn the aspects not included in the model

Subsection 8.1: The Model
The idea of preferential attachment for modeling power law networks was proposed by Barabási and Albert in 1999. We already learned that “rich-get-richer” dynamics in link creation directly leads to a power law in the in-degree distribution. Following this principle, they proposed that instead of appearing absolutely at random, links should appear with probability proportional to the number of in-links of nodes. That is,
 * $$P(n) \sim \text{in-degree}(n)$$

where $$P(n)$$ is the probability for a new link to point to node $$n$$. However, previously unlinked nodes should also have some probability to receive an in-link; therefore we add some value to the probability of each node so that it never becomes equal to zero:
 * $$P(n) \sim \text{in-degree}(n)+1$$

And since the probabilities over all $$N$$ nodes should sum up to 1, we need a normalization factor $$Z$$:
 * $$Z = N + \sum_{n=1}^{N}\text{in-degree}(n)$$

Finally, the probability for the new link to point to node $$n$$ will look like the following:
 * $$P(n) = (\text{in-degree}(n)+1)/Z$$
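A small sketch of this probability computation; note that $$Z$$ simplifies to $$N + L$$, since the in-degrees sum to the total number of links $$L$$:

```python
def attachment_probabilities(in_degrees):
    """P(n) = (in_degree(n) + 1) / Z with Z = N + sum of in-degrees = N + L."""
    Z = len(in_degrees) + sum(in_degrees)
    return [(d + 1) / Z for d in in_degrees]

# 4 nodes carrying 4 links in total, so Z = 4 + 4 = 8
probs = attachment_probabilities([3, 1, 0, 0])
print(probs)            # [0.5, 0.25, 0.125, 0.125]
assert abs(sum(probs) - 1.0) < 1e-12
```

Even the nodes with in-degree 0 keep a nonzero probability of receiving a link, as required by the model.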

Subsection 8.2: The Algorithm
The original model of preferential attachment considers only new nodes being added to the existing network; for each new node some number of out-links to existing nodes is created, with the above-described probability. For the purpose of comparing this algorithm with the corresponding Erdős–Rényi code, we will use a generalized version of preferential attachment, where the number of nodes is fixed and new links are added by choosing the starting node at random and the ending node by the function Pick, which selects a node using the described preferential attachment probability formula.

In our classroom session an algorithm that picks both the starting and the ending node of a link with the preferential attachment probability was presented; however, these two algorithms produce networks with the same in-degree distribution. Therefore the way of choosing the starting node for a new link can be varied without influencing the in-degree distribution.
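A sketch of the generalized variant in Python, with the end node chosen by a function implementing the Pick selection described above (allowing duplicate links is a simplifying assumption here; names are illustrative):

```python
import random

def pick(in_degrees, rng):
    """Choose a node with probability proportional to its in-degree + 1."""
    weights = [d + 1 for d in in_degrees]
    return rng.choices(range(len(in_degrees)), weights=weights, k=1)[0]

def preferential_attachment(N, L, seed=None):
    """Fixed N nodes; add L links, start node uniform, end node preferential."""
    rng = random.Random(seed)
    in_degrees = [0] * N
    links = []
    while len(links) < L:
        i = rng.randrange(N)              # start node chosen uniformly
        j = pick(in_degrees, rng)         # end node chosen preferentially
        if i != j:                        # no self-links
            links.append((i, j))
            in_degrees[j] += 1
    return links, in_degrees

links, in_degrees = preferential_attachment(1000, 5000, seed=1)
print(max(in_degrees))   # the most popular nodes collect many in-links
```

Because each new link raises a node's weight for future picks, the rich-get-richer dynamics emerges directly from the selection rule.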

Figure 11 shows the in-degree distribution of a preferential attachment-generated network. It is easy to see from this plot that, unlike for the Erdős–Rényi model, the resulting distribution follows a power law. This means that the preferential attachment model produces results closer to real networks than Erdős–Rényi. This result also suggests that in practice new links on the Web are added depending on the links already existing in the network.

Subsection 8.3: Further directions for improving the model
So far two models of link creation on the Web have been discussed: Erdős–Rényi and preferential attachment. The latter seems to describe the process better; however, there are still some aspects that preferential attachment cannot capture.

First of all, the preferential attachment model only considers link addition: once a link is added, it is never deleted. In practice, however, links are removed from the Web quite often. For example, Spinellis, studying URL references in papers published between 1995 and 1999 and available in the ACM and IEEE Computer Society digital libraries, found that about 28% of the hyperlinks mentioned in the articles had already become inaccessible by 2000. Moreover, after four years, 40–50% of the links were no longer accessible. This shows that further models should include link removal as well as link creation.

The second issue is the stability of the model. Akkermans studied sub-linear preferential attachment, i.e. an algorithm where the probability of connecting to a node depends not linearly on the node's in-degree but on the in-degree raised to a power lower than 1 (around 0.8, for instance). He found that sub-linear preferential attachment no longer leads to a power law distribution. Therefore improving the stability of the model is also a direction for further research.

Another concept not taken into account by the preferential attachment model is time. On the Web, links are not only created following preferential attachment; they are also created at different rates. Including a time variable in the model is another direction for improvement.

There are other possible directions for improving the model of link creation on the Web, and some improvements have already been proposed in the scientific literature. It is impossible to cover them all in this chapter, so they are left for individual study.

Section 9: Conclusion
In this chapter we learned that general human behavior on the Web is almost impossible to predict precisely. However, there are some regularities that still allow us to model the creation of hyperlinks between documents on the Web. We first introduced the Erdős–Rényi model, which assumes that all links between documents are created independently and at random. However, using statistical hypothesis testing to evaluate the accuracy of the model and comparing its in-degree distribution with empirical data observed in the English Wikipedia, we showed that the Erdős–Rényi model produces results that do not correspond to reality. We therefore concluded that people do not act completely at random but follow certain regularities. We then learned about power laws and saw that they are observed in human societies very often. To model power laws, the preferential attachment model of link creation on the Web was presented. Analyzing this model, we found that it describes real networks better than the Erdős–Rényi model; however, several aspects are not included in the model, and further improvements are possible.

It is worth mentioning that there are many other models, such as the Watts–Strogatz model, that generate network structures very similar to those observed on the Web. However, it is not possible to cover them all in this chapter, so they are left for individual learning.

This chapter is followed by the second part of the web structure topic, where the Web is considered in sociological terms – as a medium for social capital. We will then continue with the search engine ecosystem, ranking, memes, collective intelligence and other aspects of human behavior on the Web.

Quizzes
{Human behavior on the Web..} - is completely random + follows some regularities - can be predicted precisely - is independent of other people's behavior + depends on search engine recommendations

{Which statements are true for which of the models?} -+ The probability for a new edge to appear depends on previously added edges +- Edges are added independently of one another +- Edges are added with equal probability -+ Probabilities for different edges to appear may differ +- The degree distribution follows a Poisson distribution -+ The degree distribution follows a power law
 * Erdős–Rényi | Preferential Attachment

{How can we identify whether our descriptive model reflects characteristics of real networks or not?} - Compare two pictures of real and modeled networks - Calculate precision, recall, F-measure + Perform statistical hypothesis testing

{The in-degree distribution shows the dependence between..} - in-degree and out-degree + in-degree and the number of nodes with this in-degree - in-degree and the number of adjacent links - in-degree and betweenness - in-degree and centrality

{On a log-log plot a power law function has the form of..} - an exponential curve - a logarithmic curve - a polynomial curve + a straight line