Tutorial: Text geotagging with OpenCalais
This tutorial will show how to use version 3.1 of the Calais API to identify geographic references in a text and display them on an OpenLayers map. I am going to use the Calais API with JSON output (which is a new feature of version 3.1), so that all the processing can be done easily in JavaScript entirely inside a web browser.
A running version of the demo application can be found here. The code can be downloaded here.
What is the Calais API?
From the documentation:
The Calais Web Service allows you to automatically annotate your content with rich semantic metadata, including entities such as people and companies and events and facts such as acquisitions and management changes.
Version 3.1, currently in beta, adds latitude/longitude information for a few types of entities: cities, provinces/states and countries. This makes it possible to plot those entities on a map, in a way similar to what can be done with the Metacarta GeoTagger API.
The application page
The application is a single page that communicates with the Calais Web Service through AJAX. Here is the HTML code for the page:
Some observations:
- The Google Maps API is included so the satellite imagery can be used inside OpenLayers. A key can be obtained here.
- There are 3 areas: the text area where the text to geotag is entered; the result area, which will contain the same text annotated with links to markers on the map; and the OpenLayers map, with markers at the places identified in the text by OpenCalais.
- A suitable CSS can be found in the code archive.
Initialization of the map
When the page is loaded, the OpenLayers map gets initialized in the initMap function:
This is straightforward OpenLayers initialization code. 2 layers are created: The Google Hybrid Map Layer will serve as the background layer and the Marker Layer will be used for the markers created at the locations identified in the text by OpenCalais.
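As a rough illustration of the two-layer setup described above, here is a minimal sketch in the style of the OpenLayers 2.x API. It is not the code from the archive. The element id "map", the layer names and the returned object are assumptions, and the OpenLayers namespace and the Google Maps hybrid map type constant are passed in as parameters so that the function does not depend on globals.

```javascript
// Sketch of an OpenLayers 2.x style initialization.
// OL stands in for the global OpenLayers namespace and googleHybridType for
// the Google Maps hybrid map type constant (assumed to be G_HYBRID_MAP).
// The element id "map" and the layer names are assumptions.
function initMap(OL, googleHybridType) {
  var map = new OL.Map("map");

  // Background layer: Google hybrid imagery (satellite plus labels).
  var background = new OL.Layer.Google("Google Hybrid", { type: googleHybridType });

  // Overlay layer that will receive one marker per geotagged place.
  var markers = new OL.Layer.Markers("Places");

  map.addLayers([background, markers]);
  map.zoomToMaxExtent(); // start with the whole earth visible

  return { map: map, markers: markers };
}
```

In the page, this would be called as initMap(OpenLayers, G_HYBRID_MAP) once the page has loaded, assuming both scripts are included.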
Proxy
Due to the cross-domain restrictions in browsers, in order to use the Calais Web Service, I will need to proxy my requests through the web server that serves the application. It will transparently forward the queries to the web service and the responses back to the browser. Since Google App Engine is easily installed and makes it possible to host the application for free, I decided to go with the URL Fetch API of the GAE to do that, but there are countless other ways to achieve the same thing.
An App Engine project needs to be created with a single request handler acting as a proxy (for brevity, the configuration plumbing is omitted, but it can be looked up in the code archive):
So the behavior is as follows:
- All POST queries are forwarded to http://beta.opencalais.com/enlighten/rest/, which is the endpoint for the prerelease 3.1 version of the Calais non-SOAP API
- All the arguments of the original browser queries are forwarded as is
- The license key is added in the form of a licenseID argument. You need to register in order to get a valid key. Even if you have downloaded the source package, you will still need to replace the key with your own in the app.py file.
- This handler is registered for the /opencalais-geo/ocproxy route
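The forwarding rule listed above can be illustrated independently of App Engine. The sketch below is written in JavaScript to stay consistent with the rest of the tutorial; it is not the Python handler from app.py, and the function name buildUpstreamRequest and its return shape are hypothetical. It only shows which request the proxy would send upstream for a given browser request.

```javascript
var CALAIS_ENDPOINT = "http://beta.opencalais.com/enlighten/rest/";

// Returns a description of the request to send to the Calais endpoint,
// or null when the browser request is not a POST (only POST is forwarded).
// params is a plain object holding the arguments received from the browser.
function buildUpstreamRequest(method, params, licenseID) {
  if (method !== "POST") {
    return null;
  }
  var body = new URLSearchParams();
  Object.keys(params).forEach(function (name) {
    body.append(name, params[name]); // browser arguments are forwarded as is
  });
  body.append("licenseID", licenseID); // the key never leaves the server side
  return { url: CALAIS_ENDPOINT, method: "POST", body: body.toString() };
}
```

Keeping the license key on the server side means it does not appear in the page source or in the browser requests.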
Querying the Calais API
The next step is to actually query the Calais service with a text entered by the user. When the user clicks on the “Submit” button, the submiToOC function gets called:
The Calais web service can be queried with 3 arguments:
- licenseID: This argument is added by the proxy, so it is not sent from the browser
- content: This is the text to be analyzed
- paramsXML: The configuration parameters for the analysis, in an XML format. For the purpose of this demo application, I am particularly interested in the contentType and outputFormat parameters, set respectively to text/raw (to perform the analysis on the text without any cleaning) and application/json (to receive the response in JSON format, instead of the standard RDF)
On completion of the query, the processOCResponse function will be called with the response from the Calais service.
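To make the arguments above concrete, here is a minimal sketch of how the POST body sent to the proxy could be assembled. The function names are hypothetical, and the element and namespace names in paramsXML are an assumption based on the Calais documentation, so they should be checked against it before use.

```javascript
// Builds the paramsXML argument with the two processing directives used in
// this tutorial. Element and namespace names are assumptions to verify
// against the Calais documentation. Values are not XML-escaped, which is
// acceptable only because they are fixed MIME types here.
function buildParamsXML(contentType, outputFormat) {
  return '<c:params xmlns:c="http://s.opencalais.com/1/pred/">' +
    '<c:processingDirectives c:contentType="' + contentType + '"' +
    ' c:outputFormat="' + outputFormat + '"></c:processingDirectives>' +
    '</c:params>';
}

// Builds the form-encoded body sent to the proxy.
// licenseID is deliberately absent: the proxy adds it.
function buildQueryBody(text) {
  var paramsXML = buildParamsXML("text/raw", "application/json");
  return "content=" + encodeURIComponent(text) +
    "&paramsXML=" + encodeURIComponent(paramsXML);
}
```

The resulting string would be sent as the body of a POST to /opencalais-geo/ocproxy with the application/x-www-form-urlencoded content type.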
Processing the response
Since the response is in JSON format, it can directly be evaled and put into a local variable. It is then an object with 2 kinds of attributes:
- One attribute is doc, the value of which is an object that contains some meta information about the text. It also contains the text in the form analyzed by the service, in case it has been cleaned up. In my case, since contentType processing directive was set to text/raw when submitting the text, the analyzed text should be identical to the one that was sent.
- The other attributes contain information detected in the text and are in the form of http://d.opencalais.com/genericHasher-1/1d1529b7-da5f-3884-8de0-c765b3b7d3a3, that is, a unique ID for an entity or relation that is then described in the value object for the attribute. The full list of possible information can be found here. When a piece of information is about a relation between entities, the value does not contain the full data about the entities but instead refers to their IDs.
The first step after evaluating the response is therefore to resolve the references to IDs. Since I am only concerned with entities (and only the geographic ones), this is not strictly necessary for this tutorial, but the Calais documentation details it and it can be useful if I wish to do something with the rest of the information. This is the purpose of the resolveReferences function. In addition, items are grouped by kind of information (entities or relations) and then by the specific type of entity or relation (e.g. City, Person, PersonPolitical), using the _typeGroup and _type attributes respectively. This is done so that jsonObject.entities.City refers to an index of all city entities found in the text. This step is performed by the createHierarchy function.
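The two steps can be sketched as follows. This is a minimal illustration under the assumptions stated in the comments, not the code from the archive: the sample response is made up, and the relation type ExampleRelation and its city attribute are hypothetical.

```javascript
// Replaces every string value that is itself a key of the response with the
// object stored under that key, so that relations point at full entities.
function resolveReferences(response) {
  Object.keys(response).forEach(function (id) {
    var item = response[id];
    if (item === null || typeof item !== "object") {
      return;
    }
    Object.keys(item).forEach(function (attr) {
      var value = item[attr];
      if (typeof value === "string" && value !== id &&
          Object.prototype.hasOwnProperty.call(response, value)) {
        item[attr] = response[value];
      }
    });
  });
  return response;
}

// Groups items as result[_typeGroup][_type][id], e.g. result.entities.City.
// Items without _typeGroup and _type (such as doc) stay at the top level.
function createHierarchy(response) {
  var result = {};
  Object.keys(response).forEach(function (id) {
    var item = response[id];
    if (item && item._typeGroup && item._type) {
      var group = result[item._typeGroup] || (result[item._typeGroup] = {});
      var type = group[item._type] || (group[item._type] = {});
      type[id] = item;
    } else {
      result[id] = item;
    }
  });
  return result;
}

// Made-up response for illustration only (doc contents omitted).
var sample = {
  doc: {},
  "http://d.opencalais.com/genericHasher-1/city-1":
    { _typeGroup: "entities", _type: "City", name: "Seoul" },
  "http://d.opencalais.com/genericHasher-1/rel-1":
    { _typeGroup: "relations", _type: "ExampleRelation",
      city: "http://d.opencalais.com/genericHasher-1/city-1" }
};
var tree = createHierarchy(resolveReferences(sample));
// tree.entities.City indexes the Seoul entity by its ID, and the relation's
// city attribute now holds the Seoul object instead of its ID.
```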
When the hierarchy is created, the result is passed to the processGeoReferences function, which does the work of annotating the text and putting pushpins on the map.
Preparing the data
For the purpose of this tutorial, I am only interested in entities which can potentially hold information about latitude and longitude. According to the documentation, these are cities, countries and provinces/states (but it is still possible that such an entity does not hold such information, in which case I won’t represent it on the map or annotate its location in the text). Entities of these types are merged together in the mergeObjects function.
Each entity can appear multiple times in the text, possibly in different forms (e.g. US and United States). This is why there is an instances array attached to each entity returned in the result: each instance indicates where in the text the entity is present, by giving its offset from the start of the text and its length. I will use this information to annotate the text with links that, when clicked, center the map on the location of the entity. Since I need to modify the text to do that, the offsets of the instances may become invalid during the processing. To prevent that, I perform the annotation in order of decreasing offset: inserting a link only shifts the text that comes after it, so the offsets of the instances that are still to be processed stay valid. This is why instances are sorted in the sortInstances function.
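The merging and sorting steps can be sketched as follows. The signatures are assumptions made for illustration (the functions in the archive may take different arguments); the only fields relied upon are the instances array and the offset and length of each instance.

```javascript
// Merges several ID-indexed entity maps (cities, countries,
// provinces/states) into a single map. Missing maps are skipped, since a
// text may contain no entity of a given type.
function mergeObjects(maps) {
  var merged = {};
  maps.forEach(function (map) {
    Object.keys(map || {}).forEach(function (id) {
      merged[id] = map[id];
    });
  });
  return merged;
}

// Flattens the instances of all entities into one array sorted by
// decreasing offset. Each item keeps a link back to its entity.
function sortInstances(entities) {
  var all = [];
  Object.keys(entities).forEach(function (id) {
    (entities[id].instances || []).forEach(function (instance) {
      all.push({
        entity: entities[id],
        offset: instance.offset,
        length: instance.length
      });
    });
  });
  all.sort(function (a, b) { return b.offset - a.offset; });
  return all;
}
```

With the hierarchy from the previous step, a call could look like sortInstances(mergeObjects([entities.City, entities.Country, entities.ProvinceOrState])), where the exact type names are an assumption to check against the response.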
We then get a list of sorted instances, ready to be used for annotating the text and displaying on the map in the createTextAndMarkers function.
Using the data
The next step, inside the createTextAndMarkers function, is to go through all the instances (now sorted by decreasing offset) and check whether each one has geographic coordinates. If it does not, the instance is simply ignored. Otherwise I create a marker and annotate the text with a link that, when clicked, centers the map on the marker and pops up the complete name of the identified place. The marker is set up so that a click on it also opens this popup. To make this easy, in the createFeatureWithMarker function, an OpenLayers.Feature is used instead of a plain OpenLayers.Marker. This is basically a marker with data and instructions on how to present this data in a popup. Different icons are used for each type of place (city, province/state, country). Finally, the text is presented to the user and the extent of the map is changed in the updateZoom function to the smallest extent that makes all markers visible at once. If there is only one marker, since the Calais Web Service does not return an indication of scale, the zoom is reset to the whole earth.
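Leaving the OpenLayers part aside, the text annotation and the extent computation can be sketched as plain functions. This is a minimal illustration, not the code from the archive. Where the coordinates live in the response is an assumption here (entity.resolutions[0].latitude and longitude), as are the function names and the data-place attribute used to tie a link to its place. HTML escaping of the text and overlapping instances are not handled.

```javascript
// Hypothetical accessor: the location of the coordinates in the entity is
// an assumption. Returns null when the entity has no coordinates.
function getCoordinates(entity) {
  var r = entity.resolutions && entity.resolutions[0];
  if (!r || r.latitude === undefined || r.longitude === undefined) {
    return null;
  }
  return { lat: parseFloat(r.latitude), lon: parseFloat(r.longitude) };
}

// Wraps each located instance in a link. Instances must be sorted by
// decreasing offset, so that each insertion leaves the offsets of the
// remaining (earlier) instances untouched.
function annotateText(text, sortedInstances) {
  var places = [];
  sortedInstances.forEach(function (instance) {
    var coords = getCoordinates(instance.entity);
    if (!coords) {
      return; // no coordinates: neither a link nor a marker
    }
    var start = instance.offset;
    var end = start + instance.length;
    var link = '<a href="#" data-place="' + places.length + '">' +
      text.slice(start, end) + '</a>';
    text = text.slice(0, start) + link + text.slice(end);
    places.push(coords);
  });
  return { html: text, places: places };
}

// Smallest lat/lon box containing all places, or null when there are fewer
// than two, in which case the map falls back to the whole earth.
// Boxes crossing the antimeridian are not treated specially.
function computeExtent(places) {
  if (places.length < 2) {
    return null;
  }
  var lats = places.map(function (p) { return p.lat; });
  var lons = places.map(function (p) { return p.lon; });
  return {
    left: Math.min.apply(null, lons),
    bottom: Math.min.apply(null, lats),
    right: Math.max.apply(null, lons),
    top: Math.max.apply(null, lats)
  };
}
```

In the application, each entry of places would become an OpenLayers.Feature with a marker, and the computed box would be turned into map bounds before zooming.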
Conclusion
I tested the geotagging with a couple of articles, including Election fever fires up U.S. retirees in Mexico, and S.Korea joins global rescue, crisis summit from Reuters (who own OpenCalais and, incidentally, whose news maps are powered by Metacarta). These 2 articles are used in the screenshots. The result is good, I think. Still, there are a few instances where a place is recognized as a city (with the presence of a City entity in the response) but does not have geographic coordinates. This seems to happen mostly with minor cities. Maybe it could be useful to query an additional service like GeoNames to resolve them. One other thing that could be added is an indication of scale in addition to the lat/lon, to know what zoom level would be suitable to represent the entity in its entirety. Despite these minor quibbles, I am looking forward to the final 3.1 release of the API.
The final application can be tried here.