Integrating WordNet for Semantic-Similarity-Based Indexing

Currently, Beagle indexes lack semantic knowledge such as the interrelationships between words, the grammar of the language, and real-world ontologies. Because of this, Beagle sometimes returns too many search results or too few. If we introduce semantic relationships into the index, we gain a mechanism to narrow or widen Beagle search results. Furthermore, the user's current context can be incorporated more effectively with such an addition. I propose using WordNet as a semantic filter that indexes based on the semantic similarity of tokens, to allow narrowing and broadening of searches and to provide word-sense disambiguation. This filter can be used on top of the other filters currently used in Beagle.

I happen to have hundreds of pictures of cars in a local folder. They all turn up when I search using Beagle. One simple way to narrow down the search results is to return only recent pictures, that is, pictures recently indexed (or re-indexed) or used. This seems to work for the current user context, but in general such a strategy promotes results the user already has access to because of that context: for example, files currently open or recently closed. The true quality of Beagle will show when it can return results related to the current context that the user might not be aware of or has no easy access to.

For this ambition, two things have to be in place. The first is an index based on semantic relationships. Someone on the Wasabi mailing list talked about incorporating domain-specific ontologies suited to particular people.

The idea here is the same, except that instead of a domain-specific ontology serving a vertical need, WordNet is used to better classify objects, since most objects end up being text (emails, HTML pages, OpenOffice documents, chats, etc.) and so carry language features that can be harnessed. This can help create indexes based on semantic similarities between terms; e.g. ‘car’ and ‘bus’ are close to each other, so a search on ‘car’ can be broadened to include ‘bus’ as well. Word-sense disambiguation can also be utilized, so a search on, say, ‘amber’ in the context of ‘cars’ will return ‘traffic light’, and in the context of ‘museums’ will return ‘fossil resins’.
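
To make the two ideas concrete, here is a rough sketch in Python. It does not use the real WordNet database or any Beagle code: the `HYPERNYMS` and `SENSES` tables are tiny hand-written stand-ins I made up for illustration, and the function names are hypothetical. The similarity score is a simple path-based measure over "is-a" links, and the disambiguation step just counts word overlap with the context, in the spirit of the simplified Lesk approach.

```python
# Toy "is-a" links standing in for WordNet hypernyms (illustrative data only).
HYPERNYMS = {
    "car": "motor_vehicle",
    "bus": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "artifact",
    "fossil_resin": "natural_object",
}

# Toy sense inventory: each sense has a hand-written set of signature words.
SENSES = {
    "amber": {
        "traffic light": {"traffic", "signal", "road", "cars", "light"},
        "fossil resin": {"fossil", "resin", "museums", "jewellery", "tree"},
    }
}


def path_to_root(term):
    """Return [term, parent, grandparent, ...] following the is-a links."""
    path = [term]
    while path[-1] in HYPERNYMS:
        path.append(HYPERNYMS[path[-1]])
    return path


def path_similarity(a, b):
    """1 / (1 + edges between a and b via their nearest common ancestor).

    Returns 0.0 when the two terms share no ancestor in the toy taxonomy.
    """
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {node: depth for depth, node in enumerate(path_b)}
    for depth_a, node in enumerate(path_a):
        if node in depth_in_b:
            return 1.0 / (1 + depth_a + depth_in_b[node])
    return 0.0


def broaden(term, vocabulary, threshold=0.3):
    """Return indexed terms close enough to `term` to widen the search."""
    return sorted(
        t for t in vocabulary
        if t != term and path_similarity(term, t) >= threshold
    )


def disambiguate(word, context):
    """Pick the sense whose signature words overlap most with the context."""
    senses = SENSES[word]
    context_words = set(context)
    return max(senses, key=lambda sense: len(senses[sense] & context_words))


indexed_terms = ["car", "bus", "bicycle", "fossil_resin"]
print(broaden("car", indexed_terms))
print(disambiguate("amber", ["cars"]))
print(disambiguate("amber", ["museums"]))
```

Raising or lowering `threshold` is what narrows or widens the search: a higher value keeps only very close terms, a lower one lets more distant relatives in.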

Secondly, a desktop user has several contexts while using their machine. During the day they might be working on a project, in the evening surfing their favorite leisure topics, and at night writing code for Beagle. If this time-sensitive and, more importantly, mission-oriented desktop usage is combined with indexes based on semantics, better search results can be expected.
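
One simple way to picture this is re-ranking: the same search results get ordered differently depending on which context is active. The sketch below is only an assumption about how that might look. The `CONTEXTS` table, the result dictionaries, and the `rerank` function are all hypothetical and are not part of Beagle.

```python
# Hypothetical context profiles: the terms a user cares about in each mode.
CONTEXTS = {
    "day": {"project", "budget", "deadline"},
    "night": {"beagle", "filter", "code"},
}


def rerank(results, context_terms):
    """Order results by how many of their index terms match the context.

    The sort is stable, so results with equal scores keep their
    original relative order.
    """
    context = set(context_terms)
    return sorted(
        results,
        key=lambda result: len(context & set(result["terms"])),
        reverse=True,
    )


# Made-up search results, each carrying the terms it was indexed under.
results = [
    {"name": "holiday.jpg", "terms": ["beach", "car"]},
    {"name": "indexer.cs", "terms": ["beagle", "index", "filter"]},
    {"name": "report.odt", "terms": ["project", "budget"]},
]

for mode in ("day", "night"):
    ranked = rerank(results, CONTEXTS[mode])
    print(mode, [result["name"] for result in ranked])
```

With a semantic index underneath, the context terms themselves could be broadened or disambiguated before scoring, which is where the two pieces would meet.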

This functionality is crucial for Beagle to be truly scalable.
I haven’t yet figured out how these features will fit into the Beagle architecture. I think this will be tricky because, by default, Beagle filters are divided by MIME type, whereas these filters can be applied to many if not all MIME types and cannot be run independently. They have to be wrapped around the conventional Beagle filters. These things still have to be sorted out.
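
The wrapping idea could look something like the decorator-style sketch below. To be clear, this is not Beagle's filter API (Beagle is not written in Python, and I have not worked out the real integration). `PlainTextFilter`, `SemanticFilter`, `extract_tokens`, and the `related_terms` table are all names I invented to show the shape of the design.

```python
class PlainTextFilter:
    """Hypothetical stand-in for a conventional, MIME-type-specific filter."""

    mime_type = "text/plain"

    def extract_tokens(self, data):
        # A real filter would parse its format; here we just split on spaces.
        return data.lower().split()


class SemanticFilter:
    """Wraps any inner filter and adds related terms to its tokens.

    It never runs on its own: it needs an inner filter to produce tokens,
    and it reports the inner filter's MIME type as its own.
    """

    def __init__(self, inner, related_terms):
        self.inner = inner
        self.related_terms = related_terms  # e.g. {"car": ["bus"]}
        self.mime_type = inner.mime_type

    def extract_tokens(self, data):
        tokens = self.inner.extract_tokens(data)
        enriched = list(tokens)
        for token in tokens:
            for related in self.related_terms.get(token, ()):
                if related not in enriched:
                    enriched.append(related)
        return enriched


# The related-terms table is illustrative; in the proposal it would be
# derived from WordNet rather than written by hand.
semantic = SemanticFilter(PlainTextFilter(), {"car": ["bus"]})
print(semantic.mime_type, semantic.extract_tokens("My car"))
```

Because the wrapper only depends on the inner filter producing tokens, the same `SemanticFilter` could sit on top of filters for other MIME types without knowing anything about their formats.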

This project was proposed for Google Summer of Code 2007 and, shamefully for me, was rejected :( But I believe the idea is still reasonable, and I will be looking into it soon.
