Fall Leaves of Chirrapunji: March 2006

Wednesday, March 29, 2006

An extended abstract that I submitted for a workshop at my company (www.infosys.com)

Relevancy ranking is a very important part of any search engine. Web search engines rank data using factors such as natural language processing and popularity-based page ranks. Enterprise data, being generally structured or semi-structured, benefits immensely from using taxonomy categorization as a ranking factor.

Traditionally, categorization of data under various taxonomies is done using meta-data. Data is associated with meta-data when it is created, with some flexibility to change the meta-data during the lifetime of the data. This approach has some difficulties. Categorization has to be done by an elite group - people who are experts in categorization and taxonomy as well as subject matter experts - such as authors, editors and librarians. This process is costly, and it becomes a bottleneck when the amount of data is huge and semi-structured. Sometimes the pre-defined set of taxonomies is not sufficient. Adding or editing taxonomies, whether because a new kind of data has emerged or because of user feedback, is fraught with difficulties.

One approach to these issues is to apply inversion of control to the process. In the traditional process, authors, librarians and editors, i.e. producers, categorize data. An inversion of control is to let readers, reviewers and users, i.e. consumers, categorize data by tagging it with taxonomy terms. As the number of consumers of data is generally larger than the number of producers, this inversion should lead to better and more relevant taxonomical categorization, at minimal cost and with no apparent bottlenecks. Taxonomy tags can be fluid enough to accommodate user preferences; this makes it easy to add and update taxonomies and categorizations. Finally, human intelligence being the best form of intelligence, this process puts the collective intelligence of the consumers of the data to work on taxonomy categorization - a task that is very difficult to achieve using natural language processing and artificial intelligence techniques.

The above can be achieved by following an approach similar to that of social web sites like del.icio.us, Flickr, reddit and Slashdot. These sites let users tag data and use these tags as a factor in relevancy ranking. By applying this technique, an enterprise can build a taxonomy database. This database stores not only the taxonomy but also a bi-directional mapping between taxonomy and data. Search results use the taxonomy database as a factor in relevancy ranking. Directory services use the taxonomy database as a factor in browsing. Every time a relevant document is chosen from a search or directory browse result, the search engine gives the consumer a way to tag the data with relevant taxonomies, either existing or new. This information is added to the taxonomy database. During the lifetime of a document, its taxonomical categorization matures with each combination of view and tagging. This helps push the rank of the document up or down in a search result or directory browse, based on its tagged taxonomies. The whole process acts as an endless feedback loop.
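The taxonomy database and its feedback loop can be sketched in a few lines of Python. This is only a minimal illustration under assumptions of my own: the class name `TaxonomyDB`, the function `rerank`, the additive scoring formula and the `tag_weight` value are all hypothetical, and a real enterprise search engine would combine tag votes with its existing ranking in its own way.

```python
from collections import defaultdict


class TaxonomyDB:
    """Toy in-memory taxonomy store (hypothetical) with a bi-directional mapping."""

    def __init__(self):
        # tag -> {doc_id: number of times consumers applied this tag}
        self.docs_by_tag = defaultdict(lambda: defaultdict(int))
        # doc_id -> {tag: number of times consumers applied this tag}
        self.tags_by_doc = defaultdict(lambda: defaultdict(int))

    def tag(self, doc_id, tag):
        """Record one consumer tagging a document; new tags are created on the fly."""
        tag = tag.strip().lower()
        self.docs_by_tag[tag][doc_id] += 1
        self.tags_by_doc[doc_id][tag] += 1

    def tag_score(self, doc_id, query_terms):
        """Count how many tag votes on this document match the query terms."""
        tags = self.tags_by_doc.get(doc_id, {})
        return sum(tags.get(term.lower(), 0) for term in query_terms)


def rerank(db, query_terms, base_scores, tag_weight=0.5):
    """Order documents by base relevance plus weighted tag votes.

    base_scores maps doc_id -> score from the existing search engine.
    The additive formula and tag_weight are assumptions for illustration.
    """
    def combined(doc_id):
        return base_scores[doc_id] + tag_weight * db.tag_score(doc_id, query_terms)

    return sorted(base_scores, key=combined, reverse=True)


# Two consumers tag doc-b as "java"; it overtakes doc-a for that query.
db = TaxonomyDB()
db.tag("doc-b", "Java")
db.tag("doc-b", "java")
db.tag("doc-a", "finance")
print(rerank(db, ["java"], {"doc-a": 1.0, "doc-b": 0.8}))
```

Each call to `tag` corresponds to one pass through the loop described above: a consumer picks a document from the results, tags it, and the next search sees the updated votes.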

The pros and cons, as well as the costs and benefits, of such an approach should be considered before implementation. The authors feel that this inversion of control in the process of taxonomy categorization will help enterprise search engines and prove beneficial in the long run.

Thursday, March 23, 2006

Strong AI - The belief that a computational engine can show intelligence. A stronger form of this belief is that the human mind is also a computational engine. Mathematically, intelligence would then be equivalent to a Turing machine (in the sense of the Church-Turing thesis)!!!

Weak AI - The belief that intelligence can be studied with the aid of a computational engine. Computation can also help in simulating intelligence. It does not believe in a computational engine showing true/independent intelligence. The most famous theorem cited in support of this view is Gödel's incompleteness theorem.

Traditional web search - People were trying to build intelligence into the search engine. Search engines had bulky logic to rank search results. This logic had to be AI.

Google Search - Identified that search ranking cannot be solved by AI alone. Instead of using AI for search ranking, they used social behaviour. The social measure of something being popular is how many people know about it; on the web, that roughly means how many pages link to it. Google used this for their PageRank algorithm. Of course a lot of AI concepts were used, but not as the backbone. Inversion of Control!!
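To make the "popularity by links" idea concrete, here is a small sketch of the textbook power-iteration form of PageRank. It is a simplified illustration, not Google's actual implementation; the function name `pagerank`, the damping value of 0.85, the fixed iteration count and the three-page link graph are all assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook-style PageRank sketch.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank; ranks sum to roughly 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

No page content is inspected anywhere in this sketch: the ranking comes entirely from who links to whom, which is the inversion described above.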

Taxonomy - The earlier trend was to categorize information using automation. Build natural language processing to categorize knowledge.

Tags - The current trend in categorizing information. It does not use natural language processing, but uses social behaviour. You have a photo, you know about a web page - go ahead and tag it. Tags act like categories. Most people will use the same tags. Lo, you have your category. The most popular examples are Flickr and del.icio.us, where you tag photos and links.
Inversion of Control!!
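
Here is a rough sketch of how a category could fall out of many people's tags. The function name `emergent_categories`, the `min_share` threshold and the sample tags are made up for illustration; sites like Flickr or del.icio.us may well do this differently.

```python
from collections import Counter


def emergent_categories(user_tags, min_share=0.5):
    """Return the tags that at least min_share of users applied to one item.

    user_tags is a list of tag lists, one list per user, all for the same item.
    """
    votes = Counter()
    for tags in user_tags:
        # Normalize and de-duplicate so each user votes once per tag.
        votes.update({t.strip().lower() for t in tags})
    threshold = min_share * len(user_tags)
    return [tag for tag, count in votes.most_common() if count >= threshold]


# Four users tag the same photo; the tags most of them agree on become its categories.
photo_tags = [
    ["sunset", "beach"],
    ["Sunset", "goa"],
    ["sunset", "beach "],
    ["holiday"],
]
print(emergent_categories(photo_tags))
```

Nobody defined the categories up front; they are simply the tags that enough people happened to agree on.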

Music Player - All music players arrange music either by directory, artist or genre. What if I could tag songs based on my mood, my category? That would be great!!
