The internet makes a wealth of information available to your script. The difficult part is extracting the information you need from a website. Suppose that you want your script to look up the current temperature in some location. There are many websites you could use to find this information; one of them is local.msn.com. For example, if you point your browser at the URL http://local.msn.com/weather.aspx?q=redmond-wa&zip=98052, it will display a webpage containing a great deal of information about current weather conditions in Redmond, of which only a tiny part shows the temperature. A snapshot showing just a small part of this webpage is reproduced in Figure 4-5.
We can, in principle, use the API call web→download to fetch the HTML code for the webpage as one very long string of characters. Then we can write some statements which search the HTML code for the little snippet of information that we need. In this example, we need to search the code for a sequence of characters with a structure like the following.
<span class="curtemp">53°F</span>
Here the two characters ‘53’ are the data we want to extract and convert to the number 53. You have to study the HTML code for the website to work out which sequence of characters to search for, and no two websites are going to be the same. The script may also need to decode characters which have been replaced by HTML escape sequences. For example, an ampersand character displayed on a webpage appears in the HTML code as the five characters “&amp;”. The API provides two methods for converting between special characters and their HTML escape sequences. These are web→html decode and web→html encode.
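The search-and-convert step described above can also be illustrated outside TouchDevelop. The following Python fragment is only a minimal sketch under stated assumptions: the helper name extract_temperature and the sample string are hypothetical, the page is assumed to contain a span with the class "curtemp" as in the snippet above, and no real webpage is downloaded.

```python
import html
import re

# Hypothetical fragment of page source; a real page would be far longer.
SAMPLE_HTML = '<div><span class="curtemp">53&deg;F</span> Wind &amp; rain</div>'

def extract_temperature(page):
    """Return the number inside the 'curtemp' span as an int, or None."""
    # Capture the digits that immediately follow the opening span tag.
    match = re.search(r'<span class="curtemp">(-?\d+)', page)
    if match is None:
        return None
    return int(match.group(1))

print(extract_temperature(SAMPLE_HTML))   # 53
# html.unescape plays the role of an HTML decode step.
print(html.unescape("Wind &amp; rain"))   # Wind & rain
```

The pattern is tied to one particular page layout, so it returns None as soon as the markup changes, which is exactly the fragility discussed in this section.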
The kind of programming which analyzes webpages to extract information is known as web scraping (or web harvesting). You should write code like this only if there is no alternative and, even then, think twice. This is a job best left for professionals who have access to special software, and it is a job which has to be repeated whenever the web designers choose to change the layout of the website being accessed.
What can we do instead? The best answer is to find an internet site which serves up the information you need in a more easily digestible format. Two formats, widely used for delivering information in a systematic and simple manner, are XML and JSON.
Both of these formats are supported by TouchDevelop, and both will be explained with simple examples in the following sections of this chapter. In the particular case where we need to find the current weather in some location, there are several suitable websites. One of them is ‘The Weather Channel’ but, unfortunately, access to this service requires a monthly subscription. A free alternative is weather2, which supplies both JSON and XML:
http://www.myweather2.com/developer/
4.3.1 Downloading information in JSON format
JSON is short for JavaScript Object Notation. It is a text format which borrows notations and data structuring ideas from the JavaScript scripting language. It has been designed to be easy for computer software (and therefore TouchDevelop scripts) to process, while remaining readable by humans.
An example of some data expressed in JSON format appears in Figure 4-6. It is weather data obtained from the weather2 service.
There are only a few simple rules for what constitutes a valid JSON representation of information. A file in JSON format contains the following elements.
- Numbers and Strings
- Boolean values (true or false)
- Arrays – written as a sequence of array elements separated by commas, with the whole sequence enclosed in square brackets
- Objects – written as an unordered collection of key-value pairs where a colon separates each key from the value, each pair is separated from the next by a comma, and the whole collection is enclosed in curly braces; the keys must be written as strings and they must be distinct from each other
- The special value null, meaning empty
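To make the list of elements concrete, here is a small Python sketch (Python is used only for illustration; it is not TouchDevelop code). The JSON text is invented for this example so that it contains one value of each kind, and the program reports which Python type each value is parsed into.

```python
import json

# Hypothetical document, made up to exercise every kind of JSON value.
text = '''
{
  "name": "London",
  "temp": 8.5,
  "raining": true,
  "wind": null,
  "hours": [9, 12, 15],
  "station": {"id": 42}
}
'''

data = json.loads(text)

# Map each key to the name of the Python type its value was parsed into.
kinds = {key: type(value).__name__ for key, value in data.items()}
print(kinds)
```

In Python, a JSON object becomes a dict, an array becomes a list, and null becomes None; other languages make comparable choices.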
Referring back to Figure 4-6, we can see that the figure shows an object with just one key-value pair, where the key is “weather” and the associated value is another object. That object contains two key-value pairs; one key is “curren_weather” and the other is “forecast”. The value associated with “curren_weather” is an array that contains just one element, which is an object. The value associated with “forecast” is an array containing two elements, and the two elements are objects with identical structures. (The elements do not need to have the same structure, or even have the same types, but processing the JSON file is easier if they do.)
var jobj := web→download json("http://www.myweather2.com/developer/forecast.ashx?uac=X&output=json&query=SW1")
The statement above downloads JSON data similar to that shown in Figure 4-6. (The ‘X’ shown after ‘uac=’ in the URL must be replaced by a user access code, which is given to you only if you register with the weather2 website.)
The value retrieved by this API call has the data type Json Object. The data type provides many methods for accessing information from inside a JSON object. These methods are listed in Appendix C. Using these methods, here is how we could obtain today’s temperature from the JSON object shown in Figure 4-6. The code is shown as a series of very simple steps.
// assume jobj has been read using the call previously shown
if jobj → is invalid then
    "unable to download JSON data" → post to wall
else
    var w := jobj → field("weather")
    var cw := w → field("curren_weather")
    // get first element of array
    var cw0 := cw → at(0)
    // get temperature as a Number
    var temp := cw0 → string("temp") → to number
    // get temperature units as a String
    var units := cw0 → string("temp_unit")
    ("Today’s temperature is " || temp || units) → post to wall
All we had to do was look at one example of the JSON data produced by our weather query. From that example, it was easy to figure out how to extract the information we needed. (Of course, we could have also read the documentation provided by the service provider.)
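For comparison, the same sequence of steps (take the "weather" object, then the "curren_weather" array, then its first element, then the temperature fields) might look as follows in Python. This is a hedged sketch, not part of the TouchDevelop API: the function name read_temperature is hypothetical, the sample values are invented, and only the current-weather part of the structure is reproduced.

```python
import json

# Invented sample with the same nesting as the current-weather data;
# the values are made up and the forecast part is left out.
SAMPLE = '''
{"weather": {
    "curren_weather": [{"temp": "8", "temp_unit": "c"}]
}}
'''

def read_temperature(text):
    """Return (temperature, unit), or None if the data is unusable."""
    try:
        doc = json.loads(text)
        # First (and only) element of the "curren_weather" array.
        current = doc["weather"]["curren_weather"][0]
        # The temperature is assumed to arrive as a string.
        return float(current["temp"]), current["temp_unit"]
    except (ValueError, KeyError, IndexError, TypeError):
        return None

result = read_temperature(SAMPLE)
if result is None:
    print("unable to read JSON data")
else:
    temp, unit = result
    print("Today's temperature is", temp, unit)
```

Returning None for malformed or incomplete data corresponds loosely to the "is invalid" check in the script above.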
Two popular services which provide results in the JSON format are Flickr and Twitter. Two scripts in the TouchDevelop Samples collections implement libraries for using these services. A trivial script which searches for tweets containing a particular keyword (or #tag) is shown in Figure 4-7.
The code for the library can be found under the name twitter search (/stlm). It extracts enough information from each tweet to format it as a message with an author name, a picture of the author, the date when the tweet was posted, plus the message itself.
4.3.2 Downloading information in XML format
XML is short for Extensible Markup Language. It is a notation for adding markup to a text document so as to show its structure. It provides an alternative to JSON for delivering results from web services in a format which can easily be processed by software and which is moderately easy for a human to read.
An incomplete example of the XML produced by the weather2 service is shown in Figure 4-8. The information is the same as shown in Figure 4-6 but, because it is rather more voluminous, only the first 25 lines are displayed. As seen in the example, the start of a component (a logical unit) in the document is flagged by an opening tag such as <weather>. The end of that component is flagged by a matching close tag such as </weather>. The components can be nested, as seen in the figure.
An opening tag can include attributes, such as this one: <font name="Courier" size="12">, though this possibility does not occur in the weather data.
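As a brief illustration of tags and attributes, the following Python sketch parses the font tag mentioned above using the standard xml.etree.ElementTree module. The enclosed text and the closing tag are added here only so that the fragment forms a complete document; they are assumptions, not part of the weather data.

```python
import xml.etree.ElementTree as ET

# The tag from the text, given content and a close tag so it parses.
element = ET.fromstring('<font name="Courier" size="12">sample text</font>')

print(element.tag)                  # font
print(element.attrib["name"])       # Courier
print(int(element.attrib["size"]))  # 12
print(element.text)                 # sample text
```

Attribute values are always delivered as strings, so a numeric attribute such as size has to be converted explicitly.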
Downloading XML data requires a call to web→download to fetch the data as a string, and then a call to web→xml to parse the string as XML, as in the following example.
var xobj := web→xml(web→download("http://www.myweather2.com/developer/forecast.ashx?uac=X&output=xml&query=SW1"))
The result of the code is a value with the datatype Xml Object. This datatype provides methods for traversing the XML object and extracting various components. The methods are listed in Appendix C.
Extracting the current temperature from the XML shown in Figure 4-8 can be programmed as follows.
// assume xobj has been read using the call previously shown
if xobj → is invalid then
    "unable to download XML data" → post to wall
else
    var cw := xobj → child("curren_weather")
    // get temperature
    var temp := cw → child("temp") → to string
    // get temperature units
    var units := cw → child("temp_unit") → to string
    ("Today’s temperature is " || temp || units) → post to wall
As with JSON, it is fairly easy to figure out how to extract the desired information just by looking at an example of the XML data. However, the structure of the XML is almost always rigidly defined by a DTD (Document Type Definition) which specifies the tag names to use and how they are allowed to nest inside other tagged sections. It is preferable to consult the DTD when developing scripts for processing XML.
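The XML extraction steps can be sketched in Python as well, again purely as an illustration and not as TouchDevelop code. The function name read_temperature_xml is hypothetical, and the sample document is invented: it copies only the tag names used in this section, with made-up values.

```python
import xml.etree.ElementTree as ET

# Invented sample using the tag names from this section.
SAMPLE_XML = """
<weather>
  <curren_weather>
    <temp>8</temp>
    <temp_unit>c</temp_unit>
  </curren_weather>
</weather>
"""

def read_temperature_xml(text):
    """Return (temperature, unit) as strings, or None on any problem."""
    try:
        root = ET.fromstring(text)  # root is the <weather> element
    except ET.ParseError:
        return None
    current = root.find("curren_weather")
    if current is None:
        return None
    temp = current.findtext("temp")
    unit = current.findtext("temp_unit")
    if temp is None or unit is None:
        return None
    return temp.strip(), unit.strip()

print(read_temperature_xml(SAMPLE_XML))  # ('8', 'c')
```

Each lookup is checked before it is used, because a document that does not follow the expected structure would otherwise cause an error partway through the extraction.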