Python Programming/Internet Data

This lesson introduces Python Internet-based data processing, including web pages and email (HTML, XML, JSON, and SMTP).

Objectives and Skills
Objectives and skills for this lesson include:
 * Standard Library
 * urllib and json modules

Readings

 * 1) HTML
 * 2) XML
 * 3) JSON
 * 4) Simple Mail Transfer Protocol
 * 5) Python for Everyone: Networked programs
 * 6) Python for Everyone: Using Web Services

Multimedia

 * 1) YouTube: Python for Informatics - Chapter 12 - Networked Programs
 * 2) YouTube: Python for Informatics Chapter 13 - Web Services (Part 1/3)
 * 3) YouTube: Python for Informatics Chapter 13 - Web Services (Part 2/3)
 * 4) YouTube: Python for Informatics Chapter 13 - Web Services (Part 3/3)
 * 5) YouTube: Python - Downloading Files from the Web

The urllib.request.urlopen Method
The urllib.request.urlopen method opens the given URL. For HTTP and HTTPS URLs, it returns an http.client.HTTPResponse object that may be read like a file object.
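A minimal sketch of reading a URL with urlopen follows. The helper name fetch_text and the UTF-8 fallback are assumptions made for illustration. So that the sketch runs without a network connection, the demonstration uses a data: URL, which urlopen also accepts; an http:// or https:// URL can be passed to the same function.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Open the URL and return its body decoded as text (hypothetical helper)."""
    # The response object is a context manager and can be read like a file.
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the headers; fall back to UTF-8 (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Offline demonstration: a data: URL carrying a small percent-encoded HTML fragment.
page = fetch_text("data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E")
print(page)
```

For a real web page, a call such as fetch_text("https://example.com/") would be used instead, with urllib.error.URLError handled for unreachable hosts.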

The xml.etree.ElementTree.fromstring Method
The xml.etree.ElementTree.fromstring method parses XML from a string directly into an XML Element, which is the root element of the parsed tree.
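As a small illustration, the sketch below parses a made-up XML string (the menu data is invented for this example) and inspects the root element that fromstring returns.

```python
import xml.etree.ElementTree as ET

# Sample XML invented for this example.
XML_TEXT = """<menu>
  <food><name>Waffles</name><price>5.95</price></food>
  <food><name>Toast</name><price>4.50</price></food>
</menu>"""

# fromstring returns the root Element of the parsed tree.
root = ET.fromstring(XML_TEXT)

print(root.tag)   # tag name of the root element
print(len(root))  # number of direct children of the root

# Child elements can be reached by indexing, and text with find().
first_name = root[0].find("name").text
print(first_name)
```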

The xml.etree.ElementTree.ElementTree Class
The xml.etree.ElementTree.ElementTree class wraps an element hierarchy and represents a whole XML document. Calling it with an element returns an ElementTree whose root is that element.
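One way to see what ElementTree adds is to wrap a parsed root element and serialize the whole document. The sketch below does this with an invented one-book catalog and writes to an in-memory buffer rather than a file.

```python
import io
import xml.etree.ElementTree as ET

# Sample XML invented for this example.
root = ET.fromstring("<catalog><book id='1'>Dune</book></catalog>")

# Wrap the root element in an ElementTree that represents the whole document.
tree = ET.ElementTree(root)
same_root = tree.getroot() is root

# An ElementTree can serialize itself, here into an in-memory bytes buffer.
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8", xml_declaration=True)
xml_bytes = buffer.getvalue()
print(xml_bytes.decode("utf-8"))
```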

The xml.etree.ElementTree.iter Method
The xml.etree.ElementTree.iter method returns an iterator that loops over all elements in the tree in document (depth-first) order, so each element is visited before its children.
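The traversal order is easiest to see on a tiny tree. The sketch below uses an invented four-element document and collects the tag names as iter visits them; passing a tag name to iter restricts the result to matching elements.

```python
import xml.etree.ElementTree as ET

# Sample XML invented for this example: a contains b and d, and b contains c.
root = ET.fromstring("<a><b><c/></b><d/></a>")
tree = ET.ElementTree(root)

# iter() visits the root first, then descends into each child in turn.
tags = [element.tag for element in tree.iter()]
print(tags)

# An optional tag argument filters the elements that are returned.
only_c = [element.tag for element in tree.iter("c")]
print(only_c)
```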

The xml.etree.ElementTree.findall Method
The xml.etree.ElementTree.findall method returns a list of the elements with a specific tag that are direct children of the current element.
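The sketch below contrasts a plain tag search, which matches direct children only, with a ".//" path search, which matches at any depth. The library data is invented for this example.

```python
import xml.etree.ElementTree as ET

# Sample XML invented for this example: one book is nested inside a shelf.
root = ET.fromstring(
    "<library>"
    "<book>A</book>"
    "<shelf><book>B</book></shelf>"
    "<book>C</book>"
    "</library>"
)

# A plain tag name matches only direct children of the current element.
direct = [book.text for book in root.findall("book")]
print(direct)

# The ".//" prefix searches all descendants instead.
anywhere = [book.text for book in root.findall(".//book")]
print(anywhere)
```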

The json.loads Method
The json.loads method converts a given JSON string to a corresponding Python object (dict, list, string, etc.). The Wikimedia Pageview API is documented at https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI.
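The sketch below shows how JSON values map to Python objects. The JSON string is made up for this example; its field names are illustrative and are not taken from the Pageview API documentation.

```python
import json

# JSON text invented for this example.
JSON_TEXT = (
    '{"items": [{"article": "Python", "views": 120},'
    ' {"article": "JSON", "views": 45}],'
    ' "ok": true, "note": null}'
)

# Objects become dicts, arrays become lists, true/false become bool, null becomes None.
data = json.loads(JSON_TEXT)
total_views = sum(item["views"] for item in data["items"])
print(total_views)

# Invalid JSON raises json.JSONDecodeError, a subclass of ValueError.
try:
    json.loads("{not valid json}")
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True
print(parse_failed)
```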

The smtplib Module
The smtplib module defines an SMTP client session object that can be used to send mail to any Internet machine with an SMTP or ESMTP listener daemon.
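A minimal sketch of sending mail with smtplib follows. The helper names build_message and send_message, the example.com addresses, and the choice of port 587 with STARTTLS are assumptions for illustration; a real program needs the host, port, and credentials of an actual mail provider. The sketch builds and prints the message but does not connect to a server on its own.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Create a plain-text email message (hypothetical helper)."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_message(message, host, port=587, username=None, password=None):
    """Send the message through an SMTP server (hypothetical helper)."""
    with smtplib.SMTP(host, port, timeout=10) as server:
        # Upgrade the connection to TLS before sending credentials or mail.
        server.starttls(context=ssl.create_default_context())
        if username is not None:
            server.login(username, password)
        server.send_message(message)


# Placeholder addresses; example.com is reserved for documentation.
message = build_message(
    "me@example.com", "you@example.com", "Test", "Hello from Python."
)
print(message)
```

Calling send_message(message, "smtp.example.com") with a real server name would deliver the message, after which the program could print a confirmation.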

Output: Sent message.

Tutorials

 * 1) Complete one or more of the following tutorials:
   * TutorialsPoint
     * Python Network Programming
     * XML Processing
     * JSON with Python
     * Parsing XML in Python

Practice

 * 1) Create a Python program that asks the user for a URL whose content contains HTML tags. Check for a URL parameter passed from the command line. If there is no parameter, ask the user to input a URL for processing. Verify that the URL exists, and then use RegEx methods to search for and remove all HTML tags from the text, saving each removed tag in a dictionary. Print the untagged text, and then use a function to display the list of removed tags sorted in alphabetical order and a histogram showing how many times each tag was used. Include error handling in case an HTML tag isn't entered correctly (an unmatched < or >). Use a user-defined function for the actual string processing, separate from input and output.
 * 2) Create a Python program that reads XML data from http://www.w3schools.com/xml/simple.xml and builds a list of menu items, with each list entry containing a dictionary with fields for the item's name, price, description, and calories. After parsing the XML data, display the menu items in decreasing order by price.
 * 3) Create a Python program that asks the user for a location (city and state, province, or country). Use Google's Geocoding API to look up and display the given location's latitude and longitude. Also display a URL that could be used to pinpoint the given location on a map.
 * 4) Create a Python program that asks the user for a Wikiversity page title and the user's email address. Check for URL and email address parameters passed from the command line. If there are no parameters, ask the user to input a page title and email address for processing. Verify that the Wikiversity page exists, and then check the page to see when it was last modified. If it was modified within the last 24 hours, send the user an email message letting them know that the page was modified recently. Include a link to the Wikiversity page in the email message.

Internet Data Concepts

 * HyperText Markup Language (HTML) is the standard markup language for creating web pages and web applications.
 * HTML describes the structure of a web page semantically and originally included cues for the appearance (layout) of the document.
 * HTML elements are the building blocks of HTML pages.
 * HTML elements are delineated by tags, written using angle brackets.
 * Tags are typically written using the <tag>content</tag> syntax.
 * Some tags are written using the <tag /> syntax.
 * Browsers do not display the HTML tags, but use them to interpret the content of the page.
 * Cascading Style Sheets (CSS) define the look and layout of content.
 * The HTML start tag may also include attributes within the tag, using the <tag attribute="value">content</tag> syntax.
 * The style attribute may be used to embed CSS style inside HTML tags using the <tag style="property:value;">content</tag> syntax.
 * HTML comments are written using the <!-- comment --> syntax.
 * Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
 * Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services.
 * Well-formed XML follows a syntax similar to HTML, using nested tags to represent data structure and values.
 * JSON (JavaScript Object Notation) is an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs.
 * JSON is the most common data format used for asynchronous browser/server communication, largely replacing XML.
 * Simple Mail Transfer Protocol (SMTP) is an Internet standard for electronic mail (email) transmission.
 * Although electronic mail servers and other mail transfer agents use SMTP to send and receive mail messages, user-level client mail applications typically use SMTP only for sending messages to a mail server for relaying. For retrieving messages, client applications usually use either IMAP or POP3.
 * SMTP communication between mail servers by default uses TCP port 25. Mail clients often use port 587 to submit emails to the mail service. Port 465, although it was deprecated for a time, is commonly used by mail providers for submission over implicit TLS.
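 * SMTP connections can be secured with TLS (the successor to SSL) in two ways: SMTPS opens an encrypted connection from the start, while STARTTLS upgrades an existing plaintext connection to an encrypted one.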
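The HTML concepts in the list above (start tags, attributes, self-closing tags, and comments) can be observed with the standard library's html.parser module. The sketch below is one possible illustration; the class name TagCollector and the sample HTML are invented for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags, comments, and text found in HTML (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.comments = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the start tag.
        self.start_tags.append((tag, dict(attrs)))

    def handle_comment(self, data):
        self.comments.append(data.strip())

    def handle_data(self, data):
        # Keep only text that is not pure whitespace.
        if data.strip():
            self.text.append(data.strip())


# Sample HTML invented for this example.
HTML_TEXT = '<p style="color:red;">Hello<br />world</p><!-- a comment -->'

collector = TagCollector()
collector.feed(HTML_TEXT)
collector.close()

print(collector.start_tags)
print(collector.comments)
print(collector.text)
```

By default the parser reports a self-closing tag such as <br /> as a start tag followed immediately by an end tag, which is why it appears in start_tags.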

Python Internet Data

 * The urllib.request.urlopen method opens the given URL. For HTTP and HTTPS URLs, it returns an http.client.HTTPResponse object that may be read like a file object.
 * The xml.etree.ElementTree.fromstring method parses XML from a string directly into an XML Element, which is the root element of the parsed tree.
 * The xml.etree.ElementTree.iter method returns an iterator that loops over all elements in the tree in document (depth-first) order.
 * The json.loads method converts a given JSON string to a corresponding Python object (dict, list, string, etc.).
 * The smtplib module defines an SMTP client session object that can be used to send mail to any Internet machine with an SMTP or ESMTP listener daemon.

Key Terms

 * API
 * Application Program Interface - A contract between applications that defines the patterns of interaction between two application components.


 * BeautifulSoup
 * A Python library for parsing HTML documents and extracting data from HTML documents that compensates for most of the imperfections in the HTML that browsers generally ignore. You can download the BeautifulSoup code from www.crummy.com.


 * ElementTree
 * A built-in Python library used to parse XML data.


 * JSON
 * JavaScript Object Notation. A format that allows for the markup of structured data based on the syntax of JavaScript Objects.


 * port
 * A number that generally indicates which application you are contacting when you make a socket connection to a server. As an example, web traffic usually uses port 80 while email traffic uses port 25.


 * scrape
 * When a program pretends to be a web browser and retrieves a web page, then looks at the web page content. Often programs are following the links in one page to find the next page so they can traverse a network of pages or a social network.


 * SOA
 * Service-Oriented Architecture. When an application is made of components connected across a network.


 * socket
 * A network connection between two applications where the applications can send and receive data in either direction.


 * spider
 * The act of a web search engine retrieving a page, then all the pages linked from that page, and so on, until it has nearly all of the pages on the Internet, which it uses to build its search index.


 * XML
 * eXtensible Markup Language. A format that allows for the markup of structured data.

Assessments

 * Flashcards: Quizlet: Networked Programs
 * Flashcards: Quizlet: Using Web Services
 * Quiz: Quizlet: Networked Programs
 * Quiz: Quizlet: Using Web Services
 * Python JSON Exercise: Python JSON Exercise