Fall Leaves of Chirrapunji: Taxonomy

Wednesday, March 29, 2006

An extended abstract that I submitted for a workshop at my company (www.infosys.com).

Relevancy ranking is a very important part of any search engine. Web search engines rank data using factors like natural language processing, popularity, page rank, etc. Enterprise data, being generally structured or semi-structured, benefits immensely from using taxonomy categorization as a ranking factor.

Traditionally, data is categorized under various taxonomies by using meta-data. Meta-data is attached to the data when it is created, with some flexibility to change it during the lifetime of the data. This approach has some difficulties. Categorization has to be done by an elite group, people who are experts in categorization & taxonomy as well as in the subject matter, such as authors, editors and librarians. This process is costly, and it becomes a bottleneck when the amount of data is huge and semi-structured. Sometimes the pre-defined set of taxonomies is not sufficient, and adding or editing taxonomies, whether because a new kind of data has emerged or because of user feedback, is fraught with difficulties.

One approach to the above issues is to apply inversion of control to the categorization process. In the traditional process, authors, librarians and editors, i.e. producers, categorize data. The inversion is to let readers, reviewers and users, i.e. consumers, categorize data by tagging it with taxonomy terms. As the number of consumers of data is generally larger than the number of producers, this inversion should lead to better and more relevant taxonomical categorization, at minimal cost and with no apparent bottlenecks. Taxonomy tags can be fluid enough to accommodate user preferences, which makes it easy to add and update taxonomies and categorizations. Finally, human intelligence being the best form of intelligence, this process puts the collective intelligence of the consumers of the data to work on taxonomy categorization, a task that is very difficult to achieve using natural language processing and artificial intelligence techniques.

The above can be achieved by following an approach similar to that of social web-sites like Del.icio.us, Flickr, reddit, Slashdot etc. These sites let users tag data and use these tags as a factor in relevancy ranking. Applying this technique, an enterprise can build a taxonomy database. This database stores not only the taxonomy but also a bi-directional mapping between taxonomy & data. Search results use the taxonomy database as a factor in relevancy ranking. Directory services use the taxonomy database as a factor in browsing. Every time a relevant document is chosen from a search or directory browse result, the search engine gives the consumer a way to tag the data with relevant taxonomies, either existing or new. This information is added to the taxonomy database. During the lifetime of a document, its taxonomical categorization matures with each view and tagging combination. This helps push the rank of the document up or down in a search result or directory browse based on its tagged taxonomies. The whole process acts as an endless feedback loop.
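To make the loop above concrete, here is a minimal Python sketch of such a taxonomy database. It is only an illustration under stated assumptions, not a description of any existing product: the names TaxonomyDatabase, tag, tag_score and rerank are hypothetical, the base relevance score is assumed to come from the search engine as a number between 0 and 1, and the 50/50 blending weight is an arbitrary choice.

```python
from collections import Counter, defaultdict


class TaxonomyDatabase:
    """Hypothetical store for consumer-applied tags, mapped in both directions."""

    def __init__(self):
        # tag -> Counter of doc_id -> number of times consumers applied the tag
        self.docs_by_tag = defaultdict(Counter)
        # doc_id -> Counter of tag -> number of times consumers applied the tag
        self.tags_by_doc = defaultdict(Counter)

    def tag(self, doc_id, tag):
        """Record one tagging event; an unseen tag is created on first use."""
        tag = tag.strip().lower()
        self.docs_by_tag[tag][doc_id] += 1
        self.tags_by_doc[doc_id][tag] += 1

    def tag_score(self, doc_id, query_tags):
        """Fraction of the document's tagging events that match the query tags."""
        counts = self.tags_by_doc.get(doc_id)
        if not counts:
            return 0.0  # untagged documents get no boost
        wanted = {t.strip().lower() for t in query_tags}
        matched = sum(counts[t] for t in wanted)
        return matched / sum(counts.values())


def rerank(results, query_tags, db, weight=0.5):
    """Blend each base score with the tag score and sort best-first.

    results is a list of (doc_id, base_score) pairs; base_score is assumed
    to be in [0, 1]. weight is an assumed tuning knob, not a recommended value.
    """
    blended = [
        (doc_id, (1 - weight) * base + weight * db.tag_score(doc_id, query_tags))
        for doc_id, base in results
    ]
    return sorted(blended, key=lambda pair: pair[1], reverse=True)
```

In this sketch, each time a consumer opens a document from a result list and tags it, the application would call db.tag(doc_id, tag), and later searches would pass their results through rerank. Keeping counts rather than a simple set of tags lets a document's categorization mature as more consumers tag it, which is the feedback loop described above.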

The pros and cons, as well as the costs and benefits, of such an approach should be considered before implementation. The authors feel that this inversion of control in the process of taxonomy categorization will help enterprise search engines and prove beneficial in the long run.

Taxonomy - Inversion of Control
