As part of my new venture in the education space, I've been working on a semantic database and techniques for loading the database with useful content.
Because so much content is available on the web, it's important to be able to parse it for particular pieces of information. However, since most of the content is only available as human-readable HTML, this task can be challenging and time-consuming.
To speed this process up, I've been working on various approaches for doing it automatically and have had fairly good success with some simple techniques. For example, the first version of my algorithm can extract lists of countries, states, professions, elements, and planets from the web without any a priori knowledge of specific page structure.
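To give a flavor of the general idea (this is a much-simplified sketch, not my actual code; the names extract_lists and the min_items threshold are purely illustrative), one structure-agnostic approach is to group page text by the path of tags enclosing it and keep the groups that repeat, since repeated tag paths usually correspond to visually repeated items like list entries or table cells:

```python
# Minimal sketch: structure-agnostic list extraction.
# Collect text grouped by enclosing tag path; paths that repeat
# many times usually mark a genuine list (rows, cells, bullets).
from html.parser import HTMLParser
from collections import defaultdict


class ListCandidateParser(HTMLParser):
    """Group text by the stack of tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.path = []                    # stack of currently open tags
        self.groups = defaultdict(list)   # tag path -> item texts

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.groups[tuple(self.path)].append(text)


def extract_lists(html, min_items=5):
    """Return groups of at least min_items texts sharing a tag path."""
    parser = ListCandidateParser()
    parser.feed(html)
    return [items for items in parser.groups.values()
            if len(items) >= min_items]


if __name__ == "__main__":
    sample = ("<ul><li>Mercury</li><li>Venus</li><li>Earth</li>"
              "<li>Mars</li><li>Jupiter</li></ul>")
    print(extract_lists(sample))
    # [['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter']]
```

A real version needs more than a repetition threshold (filtering navigation menus, checking that the items are semantically homogeneous, and so on), but the core trick is that you never have to know the page's layout in advance.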
The goal for the next version of the algorithm is to be able to learn properties of things and extract their values automatically, such as the capital city of a particular country, the atomic weight of a particular element, and the distance of a particular planet from the sun.
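One candidate technique for this, sketched below, is seed-pattern bootstrapping: start from a few known (entity, value) pairs, turn the sentences that connect them into reusable templates, then apply those templates to discover new pairs. Again, this is an illustration rather than my implementation; learn_patterns, apply_patterns, and the seed data are all hypothetical:

```python
# Sketch of seed-pattern bootstrapping for property extraction.
import re


def learn_patterns(text, seeds):
    """Turn each sentence mentioning a seed entity and its value
    into a regex template with named capture groups."""
    patterns = set()
    for entity, value in seeds:
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if entity in sentence and value in sentence:
                template = (re.escape(sentence)
                            .replace(re.escape(entity),
                                     r"(?P<entity>[A-Z][\w ]+?)")
                            .replace(re.escape(value),
                                     r"(?P<value>[A-Z][\w ]+?)"))
                patterns.add(template)
    return patterns


def apply_patterns(text, patterns):
    """Extract new (entity, value) pairs wherever a template matches."""
    found = set()
    for pattern in patterns:
        for m in re.finditer(pattern, text):
            found.add((m.group("entity"), m.group("value")))
    return found


if __name__ == "__main__":
    corpus = ("The capital of France is Paris. "
              "The capital of Japan is Tokyo. "
              "The capital of Peru is Lima.")
    pats = learn_patterns(corpus, [("France", "Paris")])
    print(apply_patterns(corpus, pats))
    # {('France', 'Paris'), ('Japan', 'Tokyo'), ('Peru', 'Lima')}
```

The appeal of this style of approach is that a single seed pair can pull in many new values, and each new value can in turn generate new templates on other pages.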
While information extraction is not the most important part of the software I'm writing, it's certainly one of the more interesting and challenging areas. It's also a great way to truly learn the ins and outs of semantic processing.
Graham,
Can you tell us how you extracted the list of countries and capitals automatically? Are you using REST?
Thanks,
Praveen
Posted by: Praveen | Aug 31, 2005 at 03:31 PM