Web technologies/2014-2015/Laboratory 5

Parsing XML documents
Parsing documents is a central task when working with XML. There are several types of parsers, including:


 * DOM parsers:
    * allow the XML document to be navigated as if it were a tree.
    * the main drawback is that the document must be completely loaded into memory before it can be processed.
    * DOM documents can be created either by parsing an XML file or by users who want to build an XML file programmatically.
 * SAX parsers:
    * an event-driven API in which the XML document is read sequentially, using callbacks that are triggered when different element types are encountered.
    * overcome DOM's memory problem, and are fast and efficient at reading files sequentially.
    * their drawback is that random access to information inside an XML file is quite difficult.
 * FlexML parsers:
    * follow the SAX approach and rely on events during the parsing process.
    * FlexML is not a parsing library by itself; instead, it converts the DTD file into a parser specification usable with the classical Flex parser generator.
 * Pull parsers:
    * use the iterator design pattern to sequentially read XML items such as elements, attributes or data.
    * this method allows the programmer to write recursive-descent parsers, that is, applications in which the structure of the parsing code mirrors the structure of the XML being processed.
    * examples of parsers from this category include StAX and the .NET System.Xml.XmlReader.
 * Non-extractive parsers:
    * a newer technology in which the object-oriented modeling of the XML is replaced with 64-bit Virtual Token Descriptors.
    * one of the best-known parsers in this category is VTD-XML.
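
To make the pull style described above more concrete, here is a minimal sketch using Python's standard xml.dom.pulldom module. The sample document and the book_titles helper are hypothetical names chosen only for this illustration; the point is that the application asks for the next event in a loop, instead of registering callbacks as in SAX.

```python
from xml.dom import pulldom

# Hypothetical sample document used only for this illustration.
SAMPLE = "<library><book id='1'>Dune</book><book id='2'>Emma</book></library>"


def book_titles(xml_text):
    """Pull events one by one and collect (id, title) pairs for <book> elements."""
    titles = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "book":
            # Load only this element's subtree into a small DOM fragment.
            events.expandNode(node)
            node.normalize()  # merge adjacent text nodes, if any
            titles.append((node.getAttribute("id"), node.firstChild.data))
    return titles


if __name__ == "__main__":
    print(book_titles(SAMPLE))
```

Because the loop is driven by the application, it can stop early or skip whole subtrees, which is harder to express with SAX callbacks.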

Note: We will use the Apache Xerces library for these exercises. You can find more information here.

SAX
SAX (Simple API for XML) is a serial-access, event-driven XML parsing API. A SAX parser is included in the Xerces library, found here.

The following fragment of code shows how we could use SAX to parse an XML document:

Using Python3
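
A minimal sketch of SAX parsing with the standard xml.sax module. The sample document and the ElementCounter handler are assumptions made for this illustration and are not part of the lab material; the handler only counts element names and prints them with their attributes as the parser encounters them.

```python
import xml.sax

# Hypothetical sample document used only for this illustration.
SAMPLE = "<library><book id='1'>Dune</book><book id='2'>Emma</book></library>"


class ElementCounter(xml.sax.ContentHandler):
    """Counts how many times each element name is opened."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.depth = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; attrs behaves like a read-only mapping.
        self.counts[name] = self.counts.get(name, 0) + 1
        print("  " * self.depth + name, dict(attrs.items()))
        self.depth += 1

    def endElement(self, name):
        # Called once per closing tag.
        self.depth -= 1


def count_elements(xml_text):
    """Parse the given XML string and return the element counts."""
    handler = ElementCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.counts


if __name__ == "__main__":
    print(count_elements(SAMPLE))
```

To parse a file instead of a string, xml.sax.parse(path, handler) can be used with the same handler.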
Links:

SAX Tutorial

Python3 SAX documentation

DOM
DOM (Document Object Model) is a convention for representing XML documents as a tree of nodes. A DOM parser is included in the Xerces library, found here.

DOM treats an XML file as being made up of the following types of nodes:


 * Document node
 * Element nodes
 * Attribute nodes
 * Leaf nodes:
    * Text nodes
    * Comment nodes
    * Processing instruction nodes
    * CDATA nodes
    * Entity reference nodes
    * Document type nodes
 * Non-tree nodes

Using Xerces and Java:
The following fragment of code shows how we could use DOM to traverse an XML tree:

DOM also allows users to create a new XML document or change the structure of an already existing one:

Using Python3:
To programmatically create an XML document, use the xml.dom module.

Links:
 * DOM tutorial
 * W3C DOM tutorial
 * Java XML Parsing Tutorials
 * Python3 DOM documentation
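
The Python3 DOM usage mentioned above can be sketched with xml.dom.minidom, the minimal DOM implementation from the standard library. The element names, the sample data and the build_document and walk helpers are hypothetical and serve only as an illustration: the first function creates a document programmatically, the second traverses the resulting tree recursively.

```python
from xml.dom import minidom


def build_document():
    """Create a small XML document node by node."""
    doc = minidom.Document()
    root = doc.createElement("library")
    doc.appendChild(root)
    for book_id, title in (("1", "Dune"), ("2", "Emma")):
        book = doc.createElement("book")
        book.setAttribute("id", book_id)
        book.appendChild(doc.createTextNode(title))
        root.appendChild(book)
    return doc


def walk(node, depth=0, lines=None):
    """Collect one indented line per element and per non-empty text node."""
    if lines is None:
        lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data.strip()))
    for child in node.childNodes:
        walk(child, depth + 1, lines)
    return lines


if __name__ == "__main__":
    document = build_document()
    print(document.documentElement.toxml())
    print("\n".join(walk(document.documentElement)))
```

An existing file can be loaded into the same kind of tree with minidom.parse(path) and traversed with the same walk function.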

Exercises

 * Parse the XML created in your assignment from ../Laboratory 3/ using both SAX and DOM. Print out the parsing time of each method (hint: use System.currentTimeMillis to get the start and end time).
 * Create the XML from your assignment in ../Laboratory 3/ using DOM. Print the result to an XML file.
 * To save the resulting XML to a file, you can use:

Alternatively, if you are using Python, you can use the writexml method of minidom, the minimal DOM implementation.
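
As a minimal sketch of the writexml route, assuming the document was built or parsed with xml.dom.minidom: the save_document helper, the sample document and the output file name are hypothetical and chosen only for this illustration.

```python
import os
import tempfile
from xml.dom import minidom


def save_document(doc, path):
    """Write a minidom Document to the given path as indented UTF-8 XML."""
    with open(path, "w", encoding="utf-8") as out:
        # addindent and newl control pretty-printing; encoding sets the
        # value written in the XML declaration.
        doc.writexml(out, addindent="  ", newl="\n", encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical sample document and output location.
    sample = minidom.parseString("<library><book id='1'>Dune</book></library>")
    target = os.path.join(tempfile.mkdtemp(), "library.xml")
    save_document(sample, target)
    with open(target, encoding="utf-8") as written:
        print(written.read())
```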

Gabriel Iuhasz, 2019-10-30, iuhasz.gabriel@e-uvt.ro