User:Gluteus/iws-u-ha

= Chapter on Web Content =

Problem Setting for Web Content Formats
Before we start talking about web content formats, it is important to have a clear understanding of the term "web content". On the web, users want to share ideas and information. In order not to constrain users too much in expressing their abstract ideas, the language used for transporting that information should be as flexible as possible.

At the same time, to convey information consistently from sender to receiver, a structured representation is needed. Note that the number of senders and receivers is not fixed: the transfer need not be one to one, but can involve groups of authors and audiences.

How does such a structured representation evolve? In data transfer, structure comes by agreement, i.e. through arbitrary choices that all parties accept. These choices need to be made separately for each content type, so that naturally different formats can each adhere to the principles of flexibility and openness.

In general, content on websites can consist of:


 * Text
 * Audio
 * Images
 * Animation
 * Video

Web content formats have three main goals:


 * Adapt the way content is displayed depending on different kinds of software and devices such as computers, smartphones, TV sets or tablets with different screen sizes, resolutions, and modes of interaction


 * Structure content, such as text, into its different parts, e.g. headings, paragraphs and lists


 * Arrange different types of content in a readable manner, e.g. putting the cover of a book next to the text describing its content.

Approaches
The standardizing body at W3C had different possibilities to choose representations that satisfy the goals. What were possible ways to achieve these?


 * Use a programming language
 * Employ a markup language like the Wiki syntax.

The first solution would be very flexible, because programming languages offer a wide range of possibilities, but that generality makes content harder to author. Wiki syntax, by contrast, is very easy to write. It would still be a poor choice, because the wiki approach cannot easily be generalized to new challenges such as graphics, multimedia and complex layouts.

What would be a middle way between the generality of a programming language and the ease of wiki syntax?

Structured Content
When it comes to structuring content the publishing industry and other organisations have produced a multitude of standards over the years. Examples include:
 * PDF
 * PostScript
 * LaTeX
 * SGML

The most important one for the internet is SGML, the Standard Generalized Markup Language. As a document markup language, SGML was originally designed to enable the sharing of machine-readable large-project documents in government, law, and industry. Many such documents must remain readable for several decades, a long time in the information technology field. SGML was also applied extensively by the military and by the aerospace, technical reference, and industrial publishing industries. The inventor of the Web, Tim Berners-Lee, and the Web community at large found SGML a wonderful basis for describing content and its structure. Seeing it as a bit too complex for engineering purposes, however, the community produced a simpler version at the W3C: XML, the eXtensible Markup Language.

Another way to represent hierarchical information is JSON (JavaScript Object Notation), which has become a de facto industry standard for data exchange. Compared with XML, JSON has the advantage of a smaller footprint: fewer symbols are needed to represent the same content. It has not replaced XML, however. JSON is not a W3C recommendation (it is standardized as ECMA-404 and RFC 8259), and it is geared towards data records rather than marked-up documents.
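
The difference in footprint can be made concrete with a small sketch. It serializes one record both ways using Python's standard library; the record and its field names (`title`, `author`, `year`) are assumptions chosen for illustration, and the exact sizes depend on how the data is modelled.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, chosen only for illustration.
book = {"title": "Buddenbrooks", "author": "Thomas Mann", "year": "1901"}

# JSON serialization of the record, without optional whitespace.
as_json = json.dumps({"book": book}, separators=(",", ":"))

# XML serialization of the same record: one child element per field.
root = ET.Element("book")
for field, value in book.items():
    ET.SubElement(root, field).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
print(len(as_json), len(as_xml))  # the JSON string is the shorter one here
```

Both strings carry the same hierarchy; XML repeats each name in the closing tag, which accounts for most of the extra symbols.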

Learning Objectives of this Section

 * 1) Understand the Document Object Model and the DOM tree
 * 2) Understand that HTML, in its XML form (XHTML), is just a special dialect of XML
 * 3) Understand the relationship between HTML and XML

Overview
In the previous chapter we have discussed how information is transferred on the internet. We learned how a web client and web server communicate with each other using the HTTP protocol to exchange data and content.

What should be investigated next is what actually gets exchanged. Therefore, this chapter focuses on content on the web. We will start by defining what content actually is and why web content formats matter. XML will be presented as a fundamental technology of web content transfer. Later, HTML, the prevalent format on the World Wide Web, the most visible part of the Internet, will be introduced.

Core Ideas of XML
The core idea of XML is to transfer structured information that is readable for both humans and machines via serial media such as data networks. Serial representation bears the constraint that the data has to be sent piece by piece in sequential order. But usually, data is not flat like the plain text of a classic novel. Content often has structure and comprises more complex relations between its sections. Therefore, the need arises to represent hierarchical information alongside the data.

XML annotates the raw data, which itself could be textual or binary (encoded as text), with markup. This annotation works in the form of containers for data fields. The fields are semantic areas which can in turn contain other nested fields. This framing concept is truly flexible, because anyone can define new containers, and this is how the notation earned the attribute 'eXtensible'.

XML achieves these goals through syntactic properties that:

 * Mark up pieces of content with understandable "parentheses" that annotate the enclosed field, e.g. "<author>Thomas Mann</author>". Here, "<author>" and "</author>" are markup, and "Thomas Mann" is content (see problem 2).
 * Allow new applications to introduce new markup, e.g. multimedia display instead of text display (see problem 3)
 * Allow for nesting of markup such that larger pieces of content may be treated in a meaningful way, e.g. "<book><author>Thomas Mann</author> ... </book>" (see problem 2)
 * Simplify tools by stating a small set of simple rules to refer to (see problem 4):
   * each opening tag like "<author>" requires a corresponding closing tag "</author>"
   * proper nesting of markup: "<book><author>Thomas Mann</author> ... </book>" is allowed; "<book><author>Thomas Mann</book> ... </author>" is disallowed
 * An XML document typically starts with a preamble specifying the character encoding and XML version used, e.g. '<?xml version="1.0" encoding="UTF-8"?>', allowing for internationalization (see problem 5)
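
The two rules on closing tags and proper nesting can be checked mechanically. The following is a minimal sketch using Python's standard XML parser; the element names `book` and `author` and the helper `is_well_formed` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

properly_nested = "<book><author>Thomas Mann</author></book>"
badly_nested = "<book><author>Thomas Mann</book></author>"  # tags overlap
unclosed = "<book><author>Thomas Mann</author>"             # </book> missing

print(is_well_formed(properly_nested))  # True
print(is_well_formed(badly_nested))     # False
print(is_well_formed(unclosed))         # False
```

Because the rules are this strict, a parser can reject a broken document outright instead of guessing what the author meant.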

Implications that arise from applying these rules are:
 * Properly nested markup gives us a data structure: a DOM tree
 * The DOM tree can be navigated recursively by different applications and repurposed for different types of devices and device sizes (see problem 1)
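
Recursive navigation can be sketched in a few lines. The example below uses Python's `xml.etree.ElementTree`, a tree API in the spirit of the DOM rather than the W3C DOM interface itself; the document and the function name `outline` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return one indented line per element, visiting children recursively."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

doc = ET.fromstring(
    "<book><title>Buddenbrooks</title><author>Thomas Mann</author></book>"
)
print("\n".join(outline(doc)))
# book
#   title: Buddenbrooks
#   author: Thomas Mann
```

The same traversal could just as well emit a compact view for a small screen or a detailed one for a desktop, since the function only depends on the tree structure.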

Specification & Usage
The markup language XML was defined in a specification published by the W3C in 1998, which is referred to as version 1.0 and is open to the public. Later on, version 1.1 was introduced (2004). Both versions coexist, with version 1.0 remaining by far the more widely used.

XML as a markup language has various advantages. First, it is open, so everyone is able to read and build on its standard. Second, it is simple, allowing data to be processed by humans and machines alike. Third, it is extensible: any necessary set of new tags can be introduced as the need arises. Additionally, there is no restriction on what type of data to embed in XML; a document can even contain multiple data types alongside each other. Moreover, it supports a strict separation of content (XML is all about meaning) from presentation (controlled via companion technologies like XSL stylesheets). And lastly, it makes the nesting of components more visible than a foreign-key pattern would.

Use cases include the transfer of banking information, the storage of organizational hierarchies or, for instance, stock-keeping in a library. We will explain the constituents of an XML document by means of an example use case in which data structures have not yet been defined. This allows us to motivate each structural component from the underlying need.

Document structure
For our example, think of a conceptual library with a stock of thousands of books. In order to map the physical inventory into the librarian's system, each book on a shelf is represented by an XML document.

For example an empty book entry could start like so:
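
A minimal sketch of such an entry follows, wrapped in Python so that it can be parsed and inspected. The root element name `book` and the UTF-8 encoding are assumptions made for illustration.

```python
from xml.dom import minidom

# Hypothetical empty book entry: XML declaration followed by an empty element.
EMPTY_BOOK = b'<?xml version="1.0" encoding="UTF-8"?>\n<book></book>'

doc = minidom.parseString(EMPTY_BOOK)

print(doc.version)                          # version given in the declaration
print(doc.encoding)                         # encoding given in the declaration
print(doc.documentElement.tagName)          # name of the root element
print(doc.documentElement.hasChildNodes())  # False: the entry is still empty
```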

As the example shows, each document begins with a specification of the context in which the whole document is to be interpreted. This part is known as the "declaration". It specifies the version number and the encoding, and an optional schema reference can follow it. The version number denotes the specification that is used, the encoding defines how content is represented by symbols of an agreed standard, and the optional schema definition refers to a set of rules which the document at hand adheres to.

What follows is the empty pair of book tags in angle brackets. When filling the book entry with relevant details, further statements are added between the opening ("<book>") and the closing ("</book>") tag. The tags semantically enclose the area in between, which can be understood as follows.

Each statement inside is valid in the context of its surrounding tags, and this encapsulation of context can have multiple levels. A book contains chapters, which could be placed inside our XML document if the library needs data at that granularity. Inside the book tags, the chapters are then listed as inner tags. This has the effect of nesting. Intuitively, each chapter entry is only meaningful in the context of the book that holds the chapter. Therefore, to evaluate any sub-tag in an XML document, it is essential to know its context scope, i.e. its surrounding tags. Relative to one another, the outer tags are called ancestors, whereas the inner ones are referred to as descendants.

Containers & Value
Besides nesting the same tags inside one another, different keyword tags can be used to denote various semantic areas. Each keyword makes up a class of its own once it is introduced into the vocabulary of an XML document. One is fairly free in creating such keywords. In the case of our library, a tag for the title and another one for the author are introduced in the following example:
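
A minimal sketch of such a book entry, again embedded in Python so that the containers can be read back. The tag names `title` and `author` follow the text above; the concrete book is taken from the running example.

```python
import xml.etree.ElementTree as ET

# Book entry with one container per piece of information.
BOOK = """
<book>
  <title>Buddenbrooks</title>
  <author>Thomas Mann</author>
</book>
"""

book = ET.fromstring(BOOK)
containers = [child.tag for child in book]

print(containers)                # ['title', 'author']
print(book.findtext("title"))    # Buddenbrooks
print(book.findtext("author"))   # Thomas Mann
```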

A book entry can get very complex if one considers all its annotations and dependencies. Often, special remarks are only applicable to a specific tag. In those cases it is useful not to allocate a whole sub-tag, but to apply attributes to existing tags.

Tags can themselves be annotated with attributes to denote parameters that are specific to one tag instance only. For example, the author tag could record the person's nationality. The resulting markup may look like this:
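
One possible shape of that markup is sketched below. The attribute name `nationality` is an assumption made for illustration; any other name would work the same way.

```python
import xml.etree.ElementTree as ET

# The author tag carries a (hypothetical) nationality attribute.
BOOK = """
<book>
  <title>Buddenbrooks</title>
  <author nationality="German">Thomas Mann</author>
</book>
"""

author = ET.fromstring(BOOK).find("author")

print(author.text)                # Thomas Mann
print(author.get("nationality"))  # German
print(author.attrib)              # {'nationality': 'German'}
```

The attribute travels with this one `author` element only; other elements in the document are unaffected.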

Remember where you have seen this before? The XML declaration at the very top of the document stores its information in attributes of exactly this form.

Attributes are a way to annotate tags with information that is only of local relevance. Conceptually, a value that is placed inside an attribute could just as well be nested in an additional inner tag. The inner tag even allows for data with a more complex structure, whereas an attribute can only store a single serialized value.

This example illustrates that providing markup is easy, both for the content of a database and for pieces of running text. Next, we will show that it is just as easy to produce views of the given content, as well as layout instructions, from such a file. A few sections further on, we will also show how multimedia comes into play.

In the previous section we have seen several scenarios that call for a separation of content proper and structural or layout information. In this part, we will show you how to work with the previous XML file such that it becomes easy to generate structure and/or layout.

DOM tree
A DOM tree is used to access the elements of an XML document in the same way one would access objects, their properties and even their methods. This makes it possible to "navigate" in a structured manner through the semantic scopes that are represented by the nesting of tags. With this convention, one can interact with documents for information retrieval, e.g. in request-response procedures. Let's look back at our running example.
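
Object-style access of this kind can be sketched with `xml.dom.minidom`, a small DOM implementation in Python's standard library. The document below is an assumed simplification of the running example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<book><title>Buddenbrooks</title><author>Thomas Mann</author></book>"
)

root = doc.documentElement                     # the <book> element
title = root.getElementsByTagName("title")[0]  # first <title> below the root

print(root.tagName)               # book
print(title.firstChild.data)      # Buddenbrooks (text node inside <title>)
print(title.parentNode.tagName)   # book (navigating up to the enclosing scope)
print(title.nextSibling.tagName)  # author (navigating sideways)
```

Elements are objects, their names and text are properties, and lookups such as `getElementsByTagName` are methods, which is exactly the access pattern described above.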

[upload.wikimedia.org/wikipedia/commons/archive/1/15/20131028155155%21DOM-Tree-Example-For-WebScience-MOOC.png]

Let us assume some simple rules that map our element names as follows

Text content of the running example:

 * This is John Doe's list of favorite books.
 * Buddenbrooks, Novel, 1901, by Thomas Mann
 * Doctor Faustus, Novel, 1947, by Thomas Mann

Mapping of Elements
Then executing a corresponding transformation gives us:
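
A minimal sketch of such a transformation is given here. Both the source element names (`booklist`, `book`, `title`, `author`) and the mapping rules in `RULES` are assumptions made for illustration, not necessarily the rules of the running example.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: source element name -> HTML element name.
RULES = {"booklist": "ul", "book": "li", "title": "em", "author": "b"}

def transform(element):
    """Copy the tree, renaming each element according to RULES."""
    target = ET.Element(RULES.get(element.tag, "span"))
    target.text = element.text   # text directly inside the element
    target.tail = element.tail   # text following the element
    for child in element:
        target.append(transform(child))
    return target

source = ET.fromstring(
    "<booklist>"
    "<book><title>Buddenbrooks</title> by <author>Thomas Mann</author></book>"
    "<book><title>Doctor Faustus</title> by <author>Thomas Mann</author></book>"
    "</booklist>"
)
result = ET.tostring(transform(source), encoding="unicode")
print(result)
```

The content is untouched by the transformation; only the element names change, which is what turns a record description into something a browser can lay out.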

Or using our shorthand description:

The shorthand notation is, in fact, an example of HTML, the Hypertext Markup Language as it is used on the Web. Realising this, you may also see that HTML, when written in its XML form (XHTML, introduced after HTML 4), is actually well-formed XML, just like the description of records that we started with. Hence, XML is not one language for structuring data and content; it is actually a meta-language that allows you to come up with infinitely many different languages for structuring data and content. In fact, you can even do it in scripts other than Latin, say Kanji or Arabic.

What you may have observed further is that the kind of mapping we have shown, which maps elements onto new elements, is awkward, as it requires the source file to observe a certain order of XML elements very strictly. However, you can program a better transformation tool, use a generic tool like AWK, or use existing tools for manipulating XML, such as XPath, XQuery, and XSL (eXtensible Stylesheet Language), which consists of XSLT (XSL Transformations) and XSL-FO (XSL Formatting Objects). We do not go into the details of all these tools, as this would require a whole course of its own. For you, it is important to know that such languages and tools exist, so that you can look up their descriptions whenever you have a problem in this direction and build your solution on standardized mechanisms.
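
To give a flavour of such tools, the sketch below selects content with path expressions. It relies on the limited XPath subset built into Python's `xml.etree.ElementTree`, not on a full XPath processor, and the element names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

books = ET.fromstring(
    "<booklist>"
    "<book><title>Buddenbrooks</title><year>1901</year></book>"
    "<book><title>Doctor Faustus</title><year>1947</year></book>"
    "</booklist>"
)

# All titles, wherever they occur below the root.
all_titles = [t.text for t in books.findall(".//title")]

# Titles of books whose <year> child has the text 1947.
titles_1947 = [t.text for t in books.findall("./book[year='1947']/title")]

print(all_titles)   # ['Buddenbrooks', 'Doctor Faustus']
print(titles_1947)  # ['Doctor Faustus']
```

Unlike the element-by-element mapping, such expressions do not depend on the order in which elements appear in the source file.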

The core lesson to be learned about XML here is that XML-based markup gives you a very flexible handle for:
 * selecting content
 * repurposing content
 * reformatting content

What have we not considered yet?
 * Conformance of a particular XML document to a schema prescription, e.g. DTD https://en.wikipedia.org/wiki/Document_type_definition or XML Schema https://en.wikipedia.org/wiki/XML_schema
 * Many other (infinitely many) XML applications
 * Interactive XML formats
 * "Infinitely" long streams, i.e. streaming of video, audio, other data...

What we will show you next are some details about HTML and about formatting HTML pages with HTML and Cascading Style Sheets before we then go towards multimedia.

Learning Objectives of this Section

 * 1) Understand that meta data is necessary to communicate the semantics of content
 * 2) See that using HTML meta elements for ranking in search results is a bad idea
 * 3) Get introduced to modern ways of publishing meta data

The term meta data refers to "data about data". Structural meta data is about the design and specification of data structures and is more properly called "data about the containers of data". Meta data describe digital data using meta data standards specific to a particular discipline. Describing the contents and context of data files greatly increases the usefulness of the original data and makes it easier for machines to read and process the content automatically. For example, a web page may include meta data specifying what language it is written in, what tools were used to create it, and where to go for more on the subject, allowing browsers to automatically improve the experience of users and also allowing search engines to better read the content of a web document.
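
How a machine might read such meta data can be sketched with Python's standard HTML parser. The page below, including the generator name "ExampleEditor 1.0", is invented for illustration, and the class name `MetaCollector` is likewise an assumption.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect the page language and all <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.language = None
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html":
            self.language = attributes.get("lang")
        elif tag == "meta" and "name" in attributes:
            self.meta[attributes["name"]] = attributes.get("content")

# Hypothetical page; only the head matters for this sketch.
PAGE = """
<html lang="en">
  <head>
    <meta name="generator" content="ExampleEditor 1.0">
    <meta name="description" content="A list of favorite books">
  </head>
  <body></body>
</html>
"""

collector = MetaCollector()
collector.feed(PAGE)
print(collector.language)  # en
print(collector.meta)
```

None of this information is shown to the reader of the page; it exists only so that software can interpret the content.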

RDFa
A rather generic meta data format is RDFa (Resource Description Framework in Attributes). It adds a set of attribute-level extensions to HTML and various XML-based document types to embed meta data within Web documents.

The following example, similar to our hCard example, adds meta data about a person using RDFa.
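
A minimal sketch of what such markup could look like, together with a deliberately simplified reader. The snippet uses the RDFa attributes `vocab`, `typeof` and `property` with the schema.org vocabulary; the person, the URL and the class `PropertyCollector` are assumptions for illustration, and the reader is not a complete RDFa processor.

```python
from html.parser import HTMLParser

# Hypothetical snippet: a person described with the schema.org vocabulary.
SNIPPET = """
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">John Doe</span>
  <a property="url" href="http://example.org/johndoe">homepage</a>
</div>
"""

class PropertyCollector(HTMLParser):
    """Collect (property, value) pairs from RDFa 'property' attributes."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending = None  # property whose value is the upcoming text

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("property")
        if prop is None:
            return
        if "href" in attributes:
            # For links, the link target is taken as the value.
            self.pairs.append((prop, attributes["href"]))
        else:
            self._pending = prop

    def handle_data(self, data):
        if self._pending and data.strip():
            self.pairs.append((self._pending, data.strip()))
            self._pending = None

collector = PropertyCollector()
collector.feed(SNIPPET)
print(collector.pairs)
# [('name', 'John Doe'), ('url', 'http://example.org/johndoe')]
```

The visible text stays the same for human readers, while the attributes tell software that "John Doe" is the name of a person.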

RDFa is commonly used to enhance HTML and XML documents with additional attributes. These attributes embed anchors for linked data that can be referred to later in expressions. In general, they carry meta data about the content.

Conclusion
Summing up, several obstacles had to be overcome in order to represent web content. Conventions and standards like XML and HTML agree on a serialized projection of hierarchical content that can be parsed automatically while remaining readable for humans and adaptable to various devices.

From 1998, when XML was specified, and even before that standardization, up to today, the same principle has held true: a serialized, document-oriented approach to representing web content serves the purpose best. Essentially, the first XML version is still valid today.

Next, we will take a closer look at HTML, which takes the constituents of XML and puts them to a directly tangible use case, i.e. the World Wide Web. In fact, HTML is only one use case for this kind of markup. A multitude of XML uses exists. For the most part, they stay under the surface and work in the background, e.g. in banking operations. Additionally, convenience has brought a growing set of markup specifications in HTML and interfaces like the DOM, which rely on the above principles. They are therefore yet another form of representing content, or an aid in retrieving it.

For the second part of the chapter on Web Content, we will look at the use of XML for static Web content. That is, content which lacks the ability to respond to change. Following in the chapter thereafter, the authors will take a look at ways to generate and represent dynamic web content.

Quizzes
{What does this line represent: }
- DOM tree
+ XML header
- ending of page
- tag for purchase order

{What is the outcome of the following html code?
|type="[]"}
- nested list <ul><li>ding</li><li>dong</li></ul>
- nested list em ding b dong
+ nested list <ul><li> ding </li><li>dong</li></ul>
- nested list <ul><li>ding</li><li> dong </li></ul>

{How would you structure a linking pair directed at top resp. bottom of the page?
|type="[]"}
- 2</a><a href="#bottom">bottom</a> 1</a><a href="#top">top</a>
- 1</a><a href="#top">top</a> 2</a><a href="#bottom">bottom</a>
- 2</a><a href="#bottom">top</a> 1</a><a href="#top">bottom</a>
+ 1</a><a href="#bottom">top</a> 2</a><a href="#top">top</a>