User:SteffenStaab/temp

= December 5th =

Let us summarize what we have achieved so far in this part:
 * 1) We have noted that part 1 covered the technical aspects of the Web, but not the creation process of Web content
 * 2) We have noted that we need to include human behavior in our consideration, because human behavior determines what is created

In fact, we can draw a picture like the following:



The picture elaborates on what we have already used implicitly:
 * 1) A human produces and consumes knowledge. Thereby his/her behavior is constrained by social norms, by his/her cognition, by his/her emotions, by his/her knowledge, etc.
 * 2) The same is true for all humans, but humans also observe what other people do in the Web
 * 3) What can be done - and what can be observed - in the Web is constrained:
 ** by protocols
 ** by applications
 ** by available data and information and, last but not least,
 ** by Web governance and laws

More specifically, in our Web science model we have now abstracted quite a lot, but we have also come up with ideas for how to build and validate models:
 * 1) We have found that modeling human behavior in the Web requires modeling some form of randomness that reflects what we cannot observe, namely the human way of making a decision.
 * 2) We have taken as a running example the creation of links in the Web and we have reduced it to the bare minimum, namely the introduction of an edge between two nodes - abstracting from almost everything in the Web world (of course keeping what is core to the Web, i.e. the links).
 * 3) For this extremely simple world, we have developed a model to mimic the creation of links in the Web, i.e. the Erdős–Rényi model:
 ** It is a probabilistic model that creates a new link in completely random fashion between a pair of nodes.
 * 4) Then we have considered how to evaluate the quality of the model. We arrived at the idea of statistical hypothesis testing and introduced a very simplistic method, i.e. the overlaying of plots.
 ** For this purpose we learned how to identify different types of distributions by modifying the coordinate axes of a plot (log plots and log-log plots).
 ** We found that what we observe for the Web (i.e. at least Wikipedia) does not easily match what is produced by the Erdős–Rényi model with regard to a very simple distribution, namely the in-degree distribution (cf. the sketch after this list).
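To make this concrete, here is a minimal Python sketch (node and link counts and the use of matplotlib are illustrative assumptions, not part of the lecture) that creates links between completely randomly chosen pairs of nodes in the spirit of the Erdős–Rényi model and plots the resulting in-degree distribution on log-log axes:

<syntaxhighlight lang="python">
import collections
import random
import matplotlib.pyplot as plt

def erdos_renyi_links(num_nodes, num_links):
    """Create directed links between completely randomly chosen pairs of nodes."""
    links = set()
    while len(links) < num_links:
        n, m = random.randrange(num_nodes), random.randrange(num_nodes)
        if n != m:
            links.add((n, m))
    return links

# Plot the model's in-degree distribution on log-log axes; the observed
# distribution (e.g. for Wikipedia) would be overlaid for comparison.
links = erdos_renyi_links(num_nodes=10000, num_links=50000)
in_degree = collections.Counter(target for _, target in links)
frequency = collections.Counter(in_degree.values())
degrees, counts = zip(*sorted(frequency.items()))
plt.loglog(degrees, counts, marker="o", linestyle="none")
plt.xlabel("in-degree")
plt.ylabel("number of nodes")
plt.show()
</syntaxhighlight>

A straight line in such a log-log plot would indicate a power law; the Erdős–Rényi model instead produces a Poisson-like in-degree distribution, which is exactly the mismatch noted above.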

What we have not seen in our models yet is the mimicking of the behavior of others, as this was not part of the Erdős–Rényi model. In fact, we will see next that this is exactly what has been missing from the Erdős–Rényi model.

Core Propositions of this Lesson

 * Know the Model for Preferential Attachment
 * Understand why and how it is a better model for link creation than Erdős–Rényi

Lesson on Preferential Attachment as an Improved Model for Link Generation
The core idea of preferential attachment is that popular nodes attract more links than less popular nodes, i.e.:


 * Popular people will attract more additional people who get to know of them than less popular people (cf. the number of followers that people like Lady Gaga or Barack Obama have on Twitter vs. the number of followers of Steffen Staab)
 * Popular movies will have more new viewers than less popular movies
 * Popular Web pages will be linked more often in the future than less popular Web pages

Let us consider the issue of Web pages. Given a Web page, i.e. a node n, we postulate that the probability that the next link to be created points at this node n is proportional to the number of links that already point to n, i.e. in-degree(n):


 * $$P(n) \sim \text{in-degree}(n)$$

The original model of preferential attachment by Barabási and Albert only considers new nodes that join the network. To make it more easily comparable to the Erdős–Rényi model, we adapt their idea (like others have done) and say that the probability that a link is created from a node n to a node m, i.e. P(n,m), is proportional to the probability of each of them being selected:


 * $$P(n,m) \sim P(n) * P(m)$$

We need a slight adaptation, because the probability of an unlinked node being linked should not be zero; hence let us assume that the revised probability for selecting one node is:


 * $$P(n) \sim \text{in-degree}(n)+1$$

Considering that P(n) must be a probability distribution, we need a normalization factor Z (where k is the number of nodes):


 * $$Z = k + \sum_{n=1}^{k}\text{in-degree}(n)$$

such that


 * $$P(n,m) = \frac{(\text{in-degree}(n)+1) \cdot (\text{in-degree}(m)+1)}{Z^2}$$

Then pseudo code for (our slightly modified) preferential attachment may look like:
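For instance, a minimal Python sketch in this spirit (the function name, parameters, and the use of random.choices for the weighted draw are illustrative assumptions rather than the lecture's own pseudocode):

<syntaxhighlight lang="python">
import random

def preferential_attachment_links(num_nodes, num_links):
    """Create directed links where both endpoints of each new link are drawn
    with probability proportional to in-degree(node) + 1 (the modified model)."""
    in_degree = [0] * num_nodes
    links = set()
    while len(links) < num_links:
        weights = [d + 1 for d in in_degree]               # P(n) ~ in-degree(n) + 1
        n, m = random.choices(range(num_nodes), weights=weights, k=2)
        if n != m and (n, m) not in links:
            links.add((n, m))                               # new link from n to m
            in_degree[m] += 1                               # only the target gains in-degree
    return links
</syntaxhighlight>

Because nodes with a high in-degree are more likely to be selected again, links accumulate on already popular nodes - the rich-get-richer effect that the Erdős–Rényi model lacks.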

Quizzes
{Which of these statements are true?
|type="[]"}
- probabilities represent how often events have been observed in an experiment
+ a histogram represents how often events have been observed in an experiment
+ the values of a probability distribution add up to 1
- the values of a histogram add up to 1

Learning Objectives

 * 1) Further criteria for quality of a model: stability, coverage, generalizability, missing aspects

Lesson on Critical Assessment of Preferential Attachment
We have now seen two models for explaining link creation in the Web: Erdős–Rényi and preferential attachment. While the latter seems to be a better explanation of what is going on, questions arise that concern both the mathematical background and the emergence of power laws in many link creation scenarios, not only in the Web, but e.g. in citation networks (scientific papers cite older scientific papers; nodes, i.e. new papers, are only added but never destroyed; links, i.e. citation edges, are only added but never destroyed):
 * 1) susceptibility of the coefficients (e.g. sub-linear attachment rates do not lead to power laws)
 * 2) Web pages and links are not only created, but also destroyed, yet power laws of in-degrees still emerge
 * 3) if power laws emerge in similar but different scenarios, their creation in these other scenarios is likely to follow a similar mechanism
 * 4) timing (busy nodes do not only get more links, they also get more links more often)

Hence, what we can observe here is a fundamental critique of the model, concerning aspects such as
 * 1) stability of the achieved model under slightly changing assumptions (slightly different assumptions may lead to qualitatively different results)
 * 2) coverage of the phenomena modelled (destruction remains unhandled)
 * 3) generalizability (part of the beauty test of a model)
 * 4) time-dependency of the model remains unmodelled

Some of that critique has been nicely formulated by Hans Akkermans, leading to further research and insights - more than we can handle in this lecture. Beyond these, further criteria for critique may be raised (also by Akkermans).

Learning Objectives

 * Learn to generalize
 * From preferential attachment to (different) urn models
 * From one model (preferential attachment) to another model (word production)

Lesson on Word/Tag production models

 * Models of word occurrences
 * Reason to use tf-idf to describe word relevance of a page
 * Models of folksonomies
 * Zipf distributions
 * models of word/tag distributions
 * models of word/tag co-occurrences

Preferential Attachment OR Word Production (Polya Urn Model)
 * There are n balls with different colors in an urn.
 * In each step:
 ** Randomly draw a ball.
 ** Put it back together with a second ball of the same color.
 * Fixed number of colors.
 * Colors are distributed according to a power law (see the sketch below).
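A minimal Python sketch of this process (starting with one ball per color is a simplifying assumption):

<syntaxhighlight lang="python">
import random
from collections import Counter

def polya_urn(num_colors, num_steps):
    """Polya urn: in each step draw a ball uniformly at random and put it
    back together with a second ball of the same color."""
    urn = list(range(num_colors))        # one ball of each color to start with
    for _ in range(num_steps):
        ball = random.choice(urn)        # randomly draw a ball
        urn.append(ball)                 # put it back with a copy of it
    return Counter(urn)                  # number of balls per color
</syntaxhighlight>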

Linear Preferential Attachment OR Word Production with Growing Vocabulary (Simon Model)
 * Like the Polya Urn Model.
 * Additionally, in each step: instead of drawing a ball, insert with low probability p a ball with a new color.
 * Linearly increasing number of colors.
 * Colors are distributed according to a power law (see the sketch below).
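Again a minimal Python sketch (the single starting ball and p = 0.05 are illustrative assumptions):

<syntaxhighlight lang="python">
import random
from collections import Counter

def simon_model(num_steps, p=0.05):
    """Simon model: with low probability p insert a ball with a brand-new color,
    otherwise draw a ball at random and put it back together with a copy."""
    urn = [0]                             # start with a single ball of color 0
    next_color = 1
    for _ in range(num_steps):
        if random.random() < p:           # innovation: ball with a new color
            urn.append(next_color)
            next_color += 1
        else:                             # imitation: copy a randomly drawn ball
            urn.append(random.choice(urn))
    return Counter(urn)                   # color frequencies (heavy-tailed)
</syntaxhighlight>

Whether the colors are interpreted as words, tags, or link targets, the same copying mechanism yields the heavy-tailed distributions discussed above.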

Applying the Theory of Social Capital to the Web
We now have (extremely simplistic) models of the Web that describe how content is formed and how links are created - and this is the core of Web content in the Web 1.0: Text and hyperlinks. While we can think about how to extend our models to more sophisticated content (media documents, dynamic content), and to some very, very small extent this is done in the literature (much more should be possible), we should sometimes also take a high-level view and consider how people decide to interact with a particular piece of Web content.

While different disciplines have thought about how social interaction and content creation go hand in hand, especially the fields of artificial life (language games, Luc Steels et al.) and philosophy (language games, Wittgenstein), the discipline that seems to offer the best fit (to my knowledge) is actually management. Management has explored the notion of social capital - albeit at a very high level of abstraction.

In Wikipedia it is stated that "In sociology, social capital is the expected collective or economic benefits derived from the preferential treatment and cooperation between individuals and groups." In the Web we have a knowledge system consisting of documents and links (I abstract here from important dimensions like transactions) and in this knowledge system, actors navigate, interact, grow the knowledge network, and parts of the knowledge network are more successful than others.

Social capital is distinguished by Nahapiet and Ghoshal (1998; http://www.jstor.org/stable/259373) into
 * cognitive social capital, e.g. knowledge of vocabulary, narratives, etc.
 * structural social capital, e.g. centrality, prestige, etc.
 * relational social capital, e.g. trust, social norms, reciprocity

This can be easily carried over to the Web, where a Web page
 * has vocabulary and styles that it shares with other (sets of) Web pages; depending on the vocabulary used, the Web page may be more or less influential in how people receive it and allow for more knowledge creation than others
 * is structurally linked to other Web pages; depending on its embeddedness, the Web page may be more or less influential in how people receive it and lead to more (or less) knowledge creation than others
 * is trusted more or less and triggers more actions by others (reciprocity) and hence is more or less influential.

Effectively, it is of course the person (or the institution) that has such higher social capital, but the Web page (or Web site) is the medium that carries the content and that leads to the increased benefit for the social actor.

We can now try to determine:
 * 1) What vocabulary makes a Web page important? Or - a slightly reworded, but easier question - when does one determine that a word represents a Web page well?
 * 2) Which kind of links make a Web page important - this we will look at in the Random Surfer model in the Part about Search Engines
 * 3) What determines the relational value of a Web page (or a smaller entity, e.g. a Web message), i.e. triggering reciprocity or other kinds of actions, having reputation etc.

Cognitive Capital: Word Importance
When is a word important or relevant? Or: How important is a word for representing a document?

First try: Words that occur most often are most important:

$$tf(w,d)=$$ the frequency of term (word) $$w$$ in document $$d$$

Problem: What are the most important words now?


 * a
 * the
 * and
 * etc.

Are these really the words that best represent a document? Obviously not: such words occur frequently in every document. This motivates weighting the term frequency by how rare a word is across the collection, which is the reason to use tf-idf to describe the word relevance of a page.
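One common way to write this weighting (the standard tf-idf form; here $$N$$ is the number of documents in the collection and $$df(w)$$ the number of documents containing $$w$$) is:

 * $$tfidf(w,d) = tf(w,d) \cdot \log\frac{N}{df(w)}$$

Words like "a" or "the" occur in (almost) every document, so their idf factor is (close to) zero, whereas words that are frequent in $$d$$ but rare in the rest of the collection receive a high weight.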

Relational Capital: The Example of Meme Spreading
What is a meme?

Let's look back at what is a gene and what is its basic mechanism:
 * A gene is information (e.g. DNA or RNA)
 * And to be successful, genes need a replication mechanism. In biological life, the gene information is contained in a cell and the cell either actively replicates the gene information generating a new cell (e.g. an egg cell) or it is contained in a cell that makes sure to replicate it by another way (e.g. a virus). Further replication mechanisms exist (e.g. for prions).

A meme is an idea, behavior, or style that spreads from person to person within a culture (cf. Meme in Wikipedia). In order for an idea to spread it needs some replication mechanism. People spread ideas in a culture in different ways, e.g. through gossip, or because people want to warn each other. The individual reason for replication may be selfish, it may be altruistic, etc. As with genes, replication need not be perfect, and in fact copying mistakes occur.

Hoaxes are memes that spread via email or in social networks, because people believe, e.g., the warning that is in a hoax and want to warn their friends, too; by being sent on to everyone, the hoax frequently stays alive over many years - or it reappears in a slightly different shape after some years. The investigation of hoaxes on the Web is an interesting topic (I cannot find the paper about it anymore --SteffenStaab (discuss • contribs) 16:28, 4 December 2013 (UTC))

On Twitter, hashtags have been invented by users in order to mark specific and often complicated keywords in their messages. Quite frequently, hashtags are now considered to be simple memes, and people investigate models that explain how such memes spread in a network like Twitter (people imitating other people's behavior).
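As a toy illustration of such imitation-based spreading (the follower structure, the adoption probability, and all names are illustrative assumptions, not a specific model from the literature):

<syntaxhighlight lang="python">
import random

def spread_meme(followers, seed_users, p_imitate=0.1, steps=20):
    """Toy imitation model of meme (e.g. hashtag) spreading.

    followers[u] is the list of users who follow user u; in each step, every
    user who follows an adopter imitates the meme with probability p_imitate."""
    adopters = set(seed_users)
    for _ in range(steps):
        newly_adopted = set()
        for user in adopters:
            for follower in followers.get(user, []):
                if follower not in adopters and random.random() < p_imitate:
                    newly_adopted.add(follower)
        adopters |= newly_adopted
    return adopters
</syntaxhighlight>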

Learning Objectives

 * 1) What makes a Web page important?
 * 2) How is such importance evoked by the Web structure itself?
 * 3) How to model this with a random surfer

Lesson on Random Surfer
When we consider the behavior of a person surfing the Web, the person has a number of choices to make, namely he or she may (and maybe you can even come up with some more choices):
 * 1) Start on an arbitrary Web page
 * 2) Read the Web page
 * 3) Interact with the Web page (e.g. fill a form)
 * 4) Use the BACK-button
 * 5) Follow a link from that page OR Jump to another page by whatever mechanism (typing a URL, using a search engine, using a bookmark, etc. etc.)

This model has been crafted and exploited by Sergey Brin and Larry Page, the Google founders, in their PageRank algorithm in 1998, cf. http://infolab.stanford.edu/pub/papers/google.pdf. There are many modifications of this model, e.g. an intentional surfer model rather than a random surfer, leading to improved rankings in certain situations, cf. http://pr.efactory.de/e-pagerank-themes.shtml.

Learning Objectives

 * 1) Randomized models may be accessible for analytical solutions
 * 2) Vectors are good for representing the state of complex systems
 * 3) Matrices are good for
 ** representing networks
 ** representing transition probabilities

Lesson on Random Surfer - The Analytical Model
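In the standard PageRank formulation the choices above are reduced to two - following a random outgoing link of the current page, or jumping to an arbitrary page - and the surfer's whereabouts are represented as a probability vector that is repeatedly multiplied with a transition matrix. A minimal Python/NumPy sketch of this idea (the damping factor d = 0.85 and all names are illustrative; see the original paper for the exact formulation):

<syntaxhighlight lang="python">
import numpy as np

def pagerank(adjacency, d=0.85, iterations=50):
    """Random surfer as power iteration.

    adjacency[i][j] = 1 if page i links to page j.  With probability d the
    surfer follows a random outgoing link of the current page, with
    probability 1 - d he/she jumps to a page chosen uniformly at random."""
    A = np.array(adjacency, dtype=float)
    n = A.shape[0]
    A[A.sum(axis=1) == 0] = 1.0              # dangling pages: jump anywhere
    T = A / A.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    p = np.full(n, 1.0 / n)                  # state vector: start on a random page
    for _ in range(iterations):
        p = d * (p @ T) + (1 - d) / n        # follow a link or jump at random
    return p                                 # stationary visit probabilities

# Tiny example: page 0 links to 1, page 1 links to 2, page 2 links to 0 and 1
print(pagerank([[0, 1, 0], [0, 0, 1], [1, 1, 0]]))
</syntaxhighlight>

The vector p represents the state of the whole system (the probability of the surfer being on each page) and T the transition probabilities of the link network, which is exactly what the learning objectives above refer to.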

Further Thoughts on Research Needed
Most research about Web structure focuses on static HTML pages and folksonomies (sometimes with time stamps of data). There is very little research on Web structure analysis looking at dynamic content. The reason is that there are very few people who understand both dynamic Web content and Web structure analysis and models. If you can think of a nice hypothesis here, this could lead to intriguing research.

The structures described up to here have started to arise in the very early Web; very many structures, however, came into existence in order to be found. Thus, understanding the search engine ecosystem is important to understand the Web as it is now.

The following video of the flipped classroom associated with this topic is available:
