Web Science/Part2: Emerging Web Properties/MoocIndex

User:Sebschlicht/moocIndex.js

--MOOC-Index =lesson|How big is the World Wide Web=
 * learningGoals=
 * 1) Scientific: Some questions are hard (impractical or even impossible) to answer.
 * 2) understand the limitations and benefits of scientific models.
 * 3) be able to name 3 different points of view (or model choices) we could have when we want to study the World Wide Web (Software system, Text Corpus, Graph).
 * 4) be aware of 3 different Modelling techniques (Descriptive, Generative and Predictive Models).
 * 5) The best way to understand the web is to be able to describe it properly via various models.

unit|Problems with the question about the size of the Web

 * furtherReading=
 * 1) http://www.worldwidewebsize.com/
 * learningGoals=
 * 1) The question will remain unanswered during the lesson and the entire course.
 * 2) question of size is underspecified because a measure is needed.
 * 3) measure depends heavily on the choice of how we model the web.
 * 4) We have not yet defined what we mean when we say World Wide Web.
 * video=File:Problems_with_the_Question_of_%22How_big_is_the_World_Wide_Web%22.webm

unit|3 ways to study the Web

 * furtherReading=# World Wide Web
 * learningGoals=
 * 1) The web as a software system.
 * 2) The web as a collection of text documents.
 * 3) The web as a graph of interlinked documents.
 * 4) Even when choosing 1 point of view we have fundamentally different ways of modelling.
 * numThreads=3
 * numThreadsOpen=3
 * video=File:3_ways_to_look_at_the_World_Wide_Web.webm

unit|A simplistic descriptive model

 * furtherReading=# Descriptive statistics
 * learningGoals=
 * 1) understand that only the model is described.
 * 2) description of the model can be used for interpretation.
 * 3) within the descriptive model one chooses measures to describe the object of study.
 * 4) understand the notion of a modelling choice
 * 5) be able to criticise a descriptive model and the modelling choices
 * video=File:Introduction_to_descriptive_modelling_of_the_World_Wide_Web.webm

unit|An unrealistic, simplistic generative model

 * furtherReading=# Scientific modelling
 * learningGoals=
 * 1) Can be used to try to give a reason why something works.
 * 2) need to be run more than once!
 * 3) understand the notion of a modeling parameter
 * 4) will be compared to the descriptive model of our object of study.
 * numThreads=3
 * numThreadsOpen=2
 * video=File:Introduction_to_Generative_Models_of_the_World_Wide_Web.webm

unit|Summary, further reading, homework

 * furtherReading=
 * a
 * b
 * learningGoals =
 * 1) understand why it is important to have web models. (e.g. for information retrieval, spam detection, understanding epidemic flow of information on the web)
 * 2) be aware that models are not the reality.
 * 3) even if a generative model yields the same statistics and distributions as a predictive model, it might fail badly
 * video=File:Under_construction_icon-blue.svg

=lesson|Simple statistical descriptive Models for the Web=
 * learningGoals=
 * 1) Formulating a research hypothesis and testing it by means of simple descriptive statistics
 * 2) Reading diagrams

unit|Counting Words And Documents

 * furtherReading=
 * 1) tba
 * learningGoals=
 * 1) Understand why we selected simple English Wikipedia as a toy example for modeling the web
 * 2) Understand that a task already as simple as counting words includes modeling choices
 * 3) Be familiar with the term “unique word token”
 * 4) Know some basic tools to count words and documents
 * video=File:Counting-Words-And-Documents.webm
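
The point that even counting words involves modelling choices can be illustrated with a minimal sketch. The tokenisation rule and the function name `count_words` are assumptions made for this example, not the tools used in the video.

```python
import re
from collections import Counter

def count_words(text, lowercase=True):
    """Return (number of word tokens, number of unique word tokens)."""
    if lowercase:
        # Modelling choice: "Web" and "web" count as the same word.
        text = text.lower()
    # Modelling choice: a token is a run of word characters or apostrophes.
    tokens = re.findall(r"[\w']+", text)
    return len(tokens), len(Counter(tokens))

sample = "The web is big. The Web is a graph."
print(count_words(sample))                   # (9, 6)
print(count_words(sample, lowercase=False))  # (9, 7)
```

Changing a single choice (case folding) already changes the number of unique word tokens, while the total token count stays the same.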

unit|Typical length of a document

 * furtherReading=# tba
 * learningGoals=
 * 1) Be familiar with some basic statistical objects like Median, Mean, and Histograms
 * 2) Should be able to relate a histogram to its cumulative distribution function
 * numThreads=3
 * numThreadsOpen=2
 * video=File:Typical-length-of-a-document-Histograms.webm
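
The relation between a histogram and its cumulative distribution function can be sketched as follows. The document lengths are made-up numbers and the helper names are assumptions for illustration only.

```python
from collections import Counter
from statistics import mean, median

def histogram(lengths):
    """Map each document length to the number of documents of that length."""
    return dict(sorted(Counter(lengths).items()))

def cumulative_distribution(hist):
    """Turn a histogram into a CDF: fraction of documents with length <= x."""
    total = sum(hist.values())
    running, cdf = 0, {}
    for length in sorted(hist):
        running += hist[length]
        cdf[length] = running / total
    return cdf

doc_lengths = [2, 3, 3, 5, 12]  # made-up document lengths in words
hist = histogram(doc_lengths)
cdf = cumulative_distribution(hist)
print(mean(doc_lengths), median(doc_lengths))  # 5 3
print(cdf)
```

The single long document pulls the mean above the median, which is one reason to look at the whole distribution rather than a single number.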

unit|How to formulate a research hypothesis

 * furtherReading=# tba
 * learningGoals=
 * 1) Understand the ongoing, cyclic process of research
 * 2) Know what falsifiable means and why every research hypothesis needs to be falsifiable
 * 3) Be able to formulate your own research hypothesis
 * numThreads=1
 * numThreadsOpen=1
 * video=File:How-to-formulate-a-research-hypothesis.webm

unit|Number of words needed to understand most of Wikipedia

 * furtherReading=
 * 1) http://www.courses.vcu.edu/PHY-rhg/astron/html/mod/006/index.html
 * 2) Falsifiability
 * learningGoals=
 * 1) Understand what a log-log plot is
 * 2) Improve your skills in reading and interpreting diagrams
 * 3) Know about the word rank / frequency plot
 * 4) Should be able to transform a histogram or curve into a cumulative distribution function
 * numThreads=2
 * numThreadsOpen=1
 * video=File:Number-of-words-needed-to-understand-most-of-Wikipedia.webm
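
A word rank / frequency list and the number of words needed to cover a given share of a text can be sketched like this. The token list is a toy example and the function names are assumptions, not taken from the course material.

```python
from collections import Counter

def rank_frequency(tokens):
    """Frequencies in descending order; list index 0 corresponds to rank 1."""
    return sorted(Counter(tokens).values(), reverse=True)

def words_needed(tokens, coverage):
    """Smallest number of top-ranked words covering `coverage` of all tokens."""
    freqs = rank_frequency(tokens)
    total = sum(freqs)
    covered = 0
    for rank, freq in enumerate(freqs, start=1):
        covered += freq  # running sum = (unnormalised) cumulative distribution
        if covered >= coverage * total:
            return rank
    return len(freqs)

tokens = "a a a a b b b c c d".split()
print(rank_frequency(tokens))     # [4, 3, 2, 1]
print(words_needed(tokens, 0.5))  # 2
```

Plotting rank against frequency with both axes on a logarithmic scale gives the log-log rank / frequency plot named in the learning goals.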

unit|Linguists way of checking simplicity of text

 * furtherReading=
 * 1) tba
 * learningGoals=
 * 1) Get a feeling for interdisciplinary research
 * 2) Know the Automated Readability Index
 * 3) Have a strong sense of support for our research hypothesis
 * 4) Be able to critically discuss the limits of our models
 * video=File:Linguists-way-of-checking-simplicity-of-text.webm
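
The Automated Readability Index is commonly given as 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43. The sketch below applies that formula; the way sentences, words, and characters are detected here is a simplifying assumption and may differ from the counting used in the video.

```python
import re

def automated_readability_index(text):
    """ARI of a non-empty text; higher values indicate harder text."""
    # Assumption: sentences end with '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: words are runs of letters, digits or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Only letters and digits count as characters.
    characters = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)

simple = "The cat sat. The dog ran."
hard = "Interdisciplinary investigations necessitate sophisticated methodologies."
print(automated_readability_index(simple) < automated_readability_index(hard))  # True
```

Because the index only looks at word and sentence lengths, it says nothing about meaning, which is one of the model limits worth discussing.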

=lesson|Advanced statistical descriptive models for the Web=
 * learningGoals=
 * 1) fitting a curve
 * 2) work with logarithmic plots
 * 3) Zipf's law
 * 4) power law

unit|The Zipf law for text

 * furtherReading=
 * 1) tba
 * learningGoals=
 * 1) Be able to name some fundamental properties about how frequencies of words in texts are distributed
 * 2) Be a little bit more cautious about visual impressions when looking at log-log plots
 * 3) Know both formulations of Zipf’s law
 * video=File:Questioning-the-Zipf-law.webm

unit|Visually straight lines on log log plots

 * furtherReading=
 * 1) tba
 * learningGoals=
 * 1) Be able to do a coordinate transformation to change the scales of your plots
 * 2) Understand in which scenario power functions appear as straight lines
 * 3) Know in which scenarios exponential functions appear as straight lines
 * 4) Be even more cautious about your visual impressions
 * video=File:Visually-straight-lines-on-log-log-plots.webm
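
The coordinate transformation behind these goals can be checked numerically. A power function y = c * x^a becomes log y = log c + a * log x, a straight line with slope a on a log-log plot; an exponential function y = c * b^x becomes a straight line only when just the y-axis is logarithmic. The constants below are arbitrary example values.

```python
import math

def slopes(points):
    """Slopes between consecutive points; equal slopes mean a straight line."""
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]

xs = [1, 2, 4, 8, 16]
power = [3 * x ** -2 for x in xs]       # y = 3 * x^(-2)
exponential = [3 * 2 ** x for x in xs]  # y = 3 * 2^x

# Log-log coordinates straighten the power function (slope = exponent -2).
log_log = [(math.log10(x), math.log10(y)) for x, y in zip(xs, power)]
# Lin-log coordinates straighten the exponential function (slope = log2(2) = 1).
lin_log = [(x, math.log2(y)) for x, y in zip(xs, exponential)]

print(slopes(log_log))
print(slopes(lin_log))
```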

unit|Fitting a curve on a log log plot

 * furtherReading=# tba
 * learningGoals=
 * 1) Know the axioms for a distance measure and how they relate to norms.
 * 2) Know at least two distance measures on function spaces.
 * 3) Understand why changing to the CDF makes sense when looking at distance between functions.
 * 4) Understand the principle of the Kolmogorov-Smirnov test for fitting curves
 * numThreads=4
 * numThreadsOpen=1
 * video=File:Fitting-a-curve-on-a-log-log-plot.webm
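
The distance used in the Kolmogorov-Smirnov test is the largest absolute gap between two cumulative distribution functions. The following is a minimal sketch of that distance only, not of the full statistical test; the sample data, the uniform model, and the function names are assumptions for illustration.

```python
def empirical_cdf(samples):
    """Return a function x -> fraction of samples that are <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        return sum(1 for value in ordered if value <= x) / n

    return cdf

def ks_distance(cdf_a, cdf_b, points):
    """Largest absolute gap between two CDFs over the evaluation points."""
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)

def uniform_cdf(x):
    """Hypothetical model: values uniformly distributed on 1..5."""
    return min(max(x, 0), 5) / 5

observed = empirical_cdf([1, 2, 2, 3, 5])
print(ks_distance(observed, uniform_cdf, range(1, 6)))  # about 0.2
```

Evaluating the gap only at a finite list of points is an approximation; here both functions only jump at the integers 1 to 5, so those points suffice.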

unit|Zipf law powerlaw or pareto law.webm

 * furtherReading=
 * 1) tba
 * learningGoals=
 * 1) Know how to transform a rank frequency diagram to a power law plot.
 * 2) Understand how power law and Pareto plots relate to each other.
 * 3) Be able to explain why a Pareto plot is just an inverted rank frequency diagram
 * 4) Be able to transform the Zipf coefficient to the power law and Pareto coefficient and vice versa.
 * 5) Understand that building the CDF is basically like building the integral.
 * video=File:Zipf-law-powerlaw-or-pareto-law.webm
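
The coefficient conversions can be sketched under one common convention, which is an assumption here and may differ from the notation in the video: Zipf's law as frequency proportional to rank^(-b), the Pareto law as P(X >= x) proportional to x^(-k), and the power law as a density proportional to x^(-a). Inverting the rank / frequency axes gives k = 1/b, and differentiating the cumulative form gives a = 1 + k.

```python
def zipf_to_pareto(b):
    """Pareto exponent k from Zipf exponent b (axes of the plot are swapped)."""
    return 1.0 / b

def pareto_to_power_law(k):
    """Power law exponent a from Pareto exponent k (density vs. cumulative)."""
    return 1.0 + k

def zipf_to_power_law(b):
    """Power law exponent a = 1 + 1/b from Zipf exponent b."""
    return pareto_to_power_law(zipf_to_pareto(b))

def power_law_to_zipf(a):
    """Inverse direction: Zipf exponent b = 1 / (a - 1), valid for a > 1."""
    return 1.0 / (a - 1.0)

print(zipf_to_pareto(1.0), zipf_to_power_law(1.0))  # 1.0 2.0
```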

=lesson|Modelling Similarity of Text=
 * learningGoals=

unit|Similarity Measures and their Applications

 * furtherReading=# tba
 * learningGoals=
 * 1) Know the properties of a similarity measure
 * 2) Be able to relate similarity and distance measures
 * 3) Know of two applications for modelling similarity


 * numThreads=1
 * numThreadsOpen=1
 * video=File:Similarity-Measures-and-their-Applications.webm

unit|Jaccard Similarity for Sets

 * furtherReading=# tba
 * learningGoals=
 * 1) Understand how text documents can be modeled as sets
 * 2) Know the Jaccard coefficient as a similarity measure on sets
 * 3) Know a trick for remembering the formula
 * 4) Be aware of the possible outcomes of the Jaccard index
 * 5) As always be able to criticize your model


 * video=File:Jaccard-Similarity-for-Sets.webm
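
The Jaccard coefficient is the size of the intersection divided by the size of the union. A minimal sketch, assuming documents are modelled as sets of whitespace-separated words; returning 1.0 for two empty sets is a convention chosen for this example.

```python
def jaccard(doc_a, doc_b):
    """Jaccard coefficient of two documents modelled as sets of words."""
    a, b = set(doc_a), set(doc_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty documents are identical
    return len(a & b) / len(a | b)

d1 = "the web is a graph".split()
d2 = "the web is big".split()
print(jaccard(d1, d2))  # 0.5
```

The result always lies between 0 (no shared words) and 1 (identical word sets). Word order and word frequency are ignored, which is one point on which the model can be criticised.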

unit|Cosine Similarity For Vectorspaces

 * furtherReading=# tba
 * learningGoals=
 * 1) Be familiar with the vector space model for text documents
 * 2) Be aware of term frequency and (inverse) document frequency
 * 3) Have reviewed the definitions of basis and dimension
 * 4) Realize that the angle between two vectors can be seen as a similarity measure


 * numThreads=2
 * numThreadsOpen=2
 * video=File:Cosine-Similarity-For-Vectorspaces.webm
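
In the vector space model each distinct word is one dimension, and the cosine of the angle between two document vectors serves as a similarity measure. The sketch below uses raw term frequencies only; weighting by inverse document frequency is left out to keep the example short, which is a simplification made here.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two term-frequency vectors."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    # Only words occurring in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_a * norm_b)

print(cosine_similarity(["web", "web", "graph"], ["web", "graph", "graph"]))  # about 0.8
```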

unit|Probabilistic Similarity Measures Kullback Leibler Divergence

 * furtherReading=# tba
 * learningGoals=
 * 1) Be aware of a unigram Language Model
 * 2) Know Laplacian (aka +1) smoothing
 * 3) Know the query likelihood model
 * 4) Know the Kullback Leibler Divergence
 * 5) See how a similarity measure can be derived from Kullback Leibler Divergence


 * video=File:Probabilistic-Similarity-Measures-Kullback-Leibler-Divergence.webm
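
A unigram language model with Laplacian (+1) smoothing and the Kullback Leibler divergence between two such models can be sketched as follows. The toy documents, the shared vocabulary, and the use of the base-2 logarithm are assumptions for this example.

```python
import math
from collections import Counter

def unigram_model(tokens, vocabulary):
    """Word probabilities with +1 smoothing, so no word has probability 0."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)
    return {word: (counts[word] + 1) / total for word in vocabulary}

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; both models share one vocabulary."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

doc_a = "web web graph".split()
doc_b = "web text text".split()
vocabulary = set(doc_a) | set(doc_b)
p = unigram_model(doc_a, vocabulary)
q = unigram_model(doc_b, vocabulary)
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ: the divergence is not symmetric, so it is not a distance measure by itself. Smoothing matters here because a zero probability in `q` would make the logarithm undefined.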

unit|Comparing Results of Similarity Merasures

 * furtherReading=# tba
 * learningGoals=
 * 1) Understand that different modeling choices can produce very different results.
 * 2) Have a feeling for how you could statistically compare the differences of the models.
 * 3) Know how you could extract keywords from documents with the tf-idf approach.
 * 4) Try to argue which model you like best in a certain scenario.


 * video=File:Comparing-Results-of-Similarity-Merasures.webm

=lesson|Generative Models for the Web=
 * learningGoals=

unit|Introduction to generative modelling.webm

 * furtherReading=# tba
 * learningGoals=
 * 1) Understand the principal methodology for building generative models
 * 2) Remember why people are interested in generative models
 * 3) Know why descriptive models are needed when evaluating a generative model
 * 4) Be aware of one way to create a model for text generation


 * numThreads=1
 * numThreadsOpen=1
 * video=File:Introduction-to-generative-modelling.webm

unit|Sampling from a probability distribution

 * furtherReading=# tba
 * learningGoals=
 * 1) Understand how to sample values from an arbitrary probability distribution
 * 2) Have seen yet another application of the cumulative distribution function
 * 3) Understand that sampling from a distribution is just a coordinate transformation of the uniform distribution


 * numThreads=1
 * numThreadsOpen=1
 * video=File:Sampling-from-a-probability-distribution.webm
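
Sampling from an arbitrary discrete distribution can be done by drawing a uniform random number and looking it up in the cumulative distribution function. The sketch below shows this idea for a small made-up word distribution; `make_sampler` and its parameters are names chosen for illustration.

```python
import bisect
import itertools
import random

def make_sampler(distribution, rng=random):
    """Return a function that samples outcomes with the given probabilities."""
    outcomes = list(distribution)
    # Cumulative sums: the CDF evaluated at each outcome.
    cdf = list(itertools.accumulate(distribution[o] for o in outcomes))

    def sample():
        u = rng.random() * cdf[-1]  # uniform value in [0, total mass)
        # First position whose cumulative value exceeds u.
        index = bisect.bisect_right(cdf, u)
        return outcomes[min(index, len(outcomes) - 1)]

    return sample

word_probabilities = {"web": 0.5, "graph": 0.3, "text": 0.2}  # made-up values
sample_word = make_sampler(word_probabilities, random.Random(0))
print([sample_word() for _ in range(5)])
```

The uniform value `u` is mapped through the inverse of the CDF, which is the coordinate transformation mentioned in the learning goals.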

unit|Evaluating a generative model

 * furtherReading=# tba
 * learningGoals=
 * 1) See that it makes sense to compare statistics
 * 2) Understand that comparing statistics is not a well defined task
 * 3) Be aware of the fact that very different models could lead to the same statistics


 * numThreads=1
 * numThreadsOpen=0
 * video=File:Evaluating-a-generative-model.webm

unit|Pittfalls when increasing the number of model parameters

 * furtherReading=# tba
 * learningGoals=
 * 1) See that one can always increase the number of model parameters
 * 2) Know that increasing the number of model parameters often yields a more accurate model
 * 3) Be aware of the bigram and mixed models as examples of our generative processes


 * video=File:Pittfalls-when-increasing-the-number-of-model-parameters.webm

=lesson|Modeling the Web as a graph=
 * learningGoals=

unit|Reviewing terms from graph theory

 * furtherReading=#
 * learningGoals=
 * 1) Be familiar with a set theoretic way of denoting a graph
 * 2) Know at least 4 different types of graphs
 * 3) Have practiced your abilities in reading and writing mathematical formulas


 * video=File:Reviewing terms from graph theory.webm

unit|The standard web graph model

 * furtherReading=# tba
 * learningGoals=
 * 1) Be able to model web pages as a graph
 * 2) Know that the authorship graph is bipartite
 * 3) Know what kind of graph the graph of web pages is
 * 4) (as always) be aware of the fact that modeling is done by making choices


 * video=File:The standard web graph model.webm

unit|Descriptive statistics of the web graph

 * furtherReading=# tba
 * learningGoals=
 * 1) Know terms like Size and (unique) volume
 * 2) Be able to count the in and out degree of web pages
 * 3) Have an idea what kind of law (in & out) degree distributions follow
 * 4) Know that degree is not distributed in a fair way
 * 5) Know that the Gini coefficient can be used to measure fairness


 * video=File:Descriptive statistics of the web graph.webm
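
Counting in and out degrees and measuring how unevenly they are distributed can be sketched as below. The edge list is a toy web graph, and the Gini coefficient uses one standard formula for sorted values; both are assumptions for illustration.

```python
def degrees(edges):
    """Return (in-degree, out-degree) dictionaries for a list of (source, target) links."""
    in_deg, out_deg = {}, {}
    for source, target in edges:
        for node in (source, target):
            in_deg.setdefault(node, 0)
            out_deg.setdefault(node, 0)
        out_deg[source] += 1
        in_deg[target] += 1
    return in_deg, out_deg

def gini(values):
    """Gini coefficient: 0 means perfectly even, values near 1 very uneven."""
    ordered = sorted(values)
    n, total = len(ordered), sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * v for i, v in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

links = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]  # toy web graph
in_deg, out_deg = degrees(links)
print(in_deg)                   # {'a': 1, 'c': 3, 'b': 0, 'd': 0}
print(gini(in_deg.values()))    # 0.625
print(gini(out_deg.values()))   # 0.0
```

In this toy graph every page has exactly one outgoing link, while most incoming links point to a single page, so only the in-degree distribution is unfair.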

unit|Topology of the web graph

 * furtherReading=# tba
 * learningGoals=
 * 1) Understand the notion of a path in a (directed) graph
 * 2) Know that shortest paths between nodes need not be unique
 * 3) Understand the notion of a strongly connected component
 * 4) Know about the diameter of a graph
 * 5) Be aware of the bow tie structure of the Web


 * numThreads=1
 * numThreadsOpen=1
 * video=File:Topology of the web graph.webm
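
The strongly connected component of a page consists of all pages that can be reached from it and that can also reach it. The sketch below computes this with two breadth-first searches and uses it to split a toy graph into the core, IN, and OUT parts of a bow tie; the graph and the helper names are assumptions for illustration.

```python
from collections import deque

def reachable(graph, start):
    """All nodes reachable from `start` along directed paths (including itself)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen

def reverse(graph):
    """Graph with every edge direction flipped."""
    reversed_graph = {node: set() for node in graph}
    for node, successors in graph.items():
        for successor in successors:
            reversed_graph.setdefault(successor, set()).add(node)
    return reversed_graph

def strongly_connected_component(graph, node):
    """Nodes that `node` can reach and that can reach `node`."""
    return reachable(graph, node) & reachable(reverse(graph), node)

toy_web = {"in": ["a"], "a": ["b"], "b": ["c"], "c": ["a", "out"], "out": []}
core = strongly_connected_component(toy_web, "a")
out_part = reachable(toy_web, "a") - core
in_part = reachable(reverse(toy_web), "a") - core
print(sorted(core), sorted(in_part), sorted(out_part))
# ['a', 'b', 'c'] ['in'] ['out']
```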

unit|Modelling-graphs-with-linear-algebra

 * furtherReading=# tba
 * learningGoals=
 * 1) Be able to read and build an adjacency matrix of a graph
 * 2) Know some basic matrix vector multiplications to generate some statistics out of the adjacency matrix
 * 3) Understand what is encoded in the components of the k-th power of the Adjacency matrix of a graph
 * numThreads=1
 * numThreadsOpen=1
 * video=File:Modelling-graphs-with-linear-algebra.pdf.webm
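
The adjacency matrix and the statistics obtained from it can be sketched in plain Python. Multiplying the matrix by a vector of ones gives the out-degrees, and entry (i, j) of the k-th power counts the walks of length k from node i to node j. The small graph and the helper names are assumptions for illustration.

```python
def adjacency_matrix(nodes, edges):
    """Entry [i][j] is 1 if there is a link from nodes[i] to nodes[j]."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix

def mat_vec(matrix, vector):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, vector)) for row in matrix]

def mat_mul(a, b):
    """Matrix-matrix product of two square matrices of equal size."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_power(matrix, k):
    """k-th power of a square matrix, starting from the identity matrix."""
    size = len(matrix)
    result = [[int(i == j) for j in range(size)] for i in range(size)]
    for _ in range(k):
        result = mat_mul(result, matrix)
    return result

nodes = ["a", "b", "c"]
A = adjacency_matrix(nodes, [("a", "b"), ("b", "c"), ("a", "c"), ("c", "a")])
transposed = [list(column) for column in zip(*A)]
print(mat_vec(A, [1, 1, 1]))           # out-degrees: [2, 1, 1]
print(mat_vec(transposed, [1, 1, 1]))  # in-degrees:  [1, 1, 2]
print(mat_power(A, 2))                 # [[1, 0, 1], [1, 0, 0], [0, 1, 1]]
```

For example, the entry for (a, c) in the second power is 1 because a → b → c is the only walk of length 2 from a to c.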