Archive for the ‘Programming’ Category

ActiveSync Alternative for Linux

Windows Mobile based gadgets use the elusive ActiveSync protocol, which Microsoft hasn't yet released for public use. Getting my O2 XDA Cosmo (HTC Excalibur S620, T-Mobile Dash, etc.) to work with my Ubuntu Gutsy Gibbon was therefore not trivial: I had to choose between several promising but incomplete and buggy open source tools. Things were gloomy until I came across SynCE, another open source alternative.

Getting SynCE going wasn't trivial either, because of various library conflicts and my inability to configure my firewall properly to allow Remote NDIS based connectivity between the SynCE tools and the device. Thanks to this guide, the major issues were resolved, although I made a few minor substitutions when choosing the source files for my device.

For me, usb-rndis didn't work, but the newer usb-rndis-lite did.
For some reason, GnomeVFS 0.11 still does not work with my device, so I used GnomeVFS 0.10, as recommended on the SynCE website, which finally enabled Nautilus to recognize my device's contents as a remote site.

I am still unable to actually sync anything between Thunderbird and this device. It may be possible to hack MultiSync to synchronize the two, but that's for later….


Automatic Rule Induction for Gendarme

While writing code, developers tend to make mistakes, from spelling mistakes to semantic ones. At times they are left dumbstruck by obvious problems in code that fails to compile. It is also always good to know the best practices for a particular problem while coding, i.e. when developers are writing code, they should know the practices others have adopted to solve a similar problem.

Gendarme tries to provide these features but is manually driven. It relies on rules that are entered by hand and is thus far from being truly effective. If we had a system that automatically induced these rules by observing all versions of the source code, and an agent that identified the changes a developer makes to convert an unsuccessful build (one that contains errors/warnings) into a successful one, we could generate rules for Gendarme automatically.

Such rules would also stay up to date: since developers will continue to write code and display their behavior, new rules will continue to emerge and old bugs might die out.

This task is a typical problem of ‘data mining over structured data’. The semantics of the language will define the structure. A simple data mining approach like Association Rule Mining can be used to induce rules which can act as checklists, possible error reasons and some not so apparent rules.

The Gendarme Project is very interesting, but how about a system with an automatic rule induction mechanism built into it? That is, a system which continuously observes programmers' debugging patterns and comes up with immediately implementable rules through ‘data mining over structured data’, rather than a human figuring them out. Such rules can be used in many ways. Some rules will be just a checklist: for a null exception, for example, a suggestion like ‘Did you try initializing the object?’ in a balloon tip. Such suggestions are relatively easy to find through a small add-on to MonoDevelop which saves the sequence of changes between builds (both successful and unsuccessful). Naively, any change by the programmer which converts an unsuccessful build into a successful one is a solution to the problem just identified. The tricky part, however, is identifying the actual semantic changes taking place. This is where a lot of study of C# guidelines will come into play, along with language theory, to induce true semantics.
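As a rough illustration of the naive approach, here is a minimal Python sketch that walks a recorded build history and extracts the textual changes that turned a failed build into a successful one. The `Build` record and the `fixing_changes` function are hypothetical names made up for this sketch, not part of Gendarme or MonoDevelop, and a plain line diff only captures textual changes, not the semantic ones discussed above.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Build:
    """Hypothetical snapshot saved by the IDE add-on after each build attempt."""
    source: str      # full source text at the time of the build
    succeeded: bool  # True if the build finished without errors


def fixing_changes(history):
    """Return (removed_lines, added_lines) for every failed -> successful transition."""
    fixes = []
    for prev, curr in zip(history, history[1:]):
        if not prev.succeeded and curr.succeeded:
            diff = list(difflib.ndiff(prev.source.splitlines(),
                                      curr.source.splitlines()))
            # ndiff prefixes removed lines with "- " and added lines with "+ ";
            # "? " hint lines and unchanged lines are ignored here.
            removed = [line[2:] for line in diff if line.startswith("- ")]
            added = [line[2:] for line in diff if line.startswith("+ ")]
            fixes.append((removed, added))
    return fixes


# Assumed example: the developer forgot to initialize an object, then fixed it.
history = [
    Build("Foo f;\nf.Run();", succeeded=False),
    Build("Foo f = new Foo();\nf.Run();", succeeded=True),
]
fixes = fixing_changes(history)
```

Each pair in `fixes` is a candidate "problem and solution" record that a later mining step could generalize into a checklist-style rule.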

Other, more sophisticated rules can be found using ‘Association Rules’, and the trick is how we classify code entities or snippets as input to this data mining algorithm. Suppose we say a block of { code } is A; then we can induce rules through Association Rules Discovery in the format:
‘if A and B occur in C then Error 432 occurs.’
Or, if we say that A is a fixed keyword, like object, struct, define, etc., then
‘if A occurs => B also occurs’ is another rule induced!
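To make the rule format concrete, here is a small Python sketch of association rule discovery using the usual support and confidence measures. It is a brute-force toy limited to antecedents of one or two items, not a full Apriori implementation, and the feature names (`A`, `B`, `C`, `Error432`) and thresholds are assumptions chosen only to mirror the examples above.

```python
from itertools import combinations


def mine_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Find rules 'antecedent => consequent' with 1 or 2 items in the antecedent.

    Each transaction is the set of features observed in one build attempt.
    Returns tuples (antecedent, consequent, support, confidence).
    """
    sets = [frozenset(t) for t in transactions]
    n = len(sets)

    def support(items):
        # Fraction of build attempts that contain all the given items.
        return sum(1 for s in sets if items <= s) / n

    all_items = sorted({item for s in sets for item in s})
    rules = []
    for size in (1, 2):
        for antecedent in combinations(all_items, size):
            a = frozenset(antecedent)
            sup_a = support(a)
            if sup_a < min_support:
                continue
            for consequent in all_items:
                if consequent in a:
                    continue
                sup_rule = support(a | {consequent})
                if sup_rule < min_support:
                    continue
                confidence = sup_rule / sup_a
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, sup_rule, confidence))
    return rules


# Assumed toy data: features seen in five build attempts.
builds = [
    {"A", "B", "Error432"},
    {"A", "B", "Error432"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "Error432"},
]
rules = mine_rules(builds)
```

On this toy data the rule ‘if A and B occur then Error432 occurs’ is found, while ‘if A occurs then Error432 occurs’ is rejected because its confidence falls below the threshold.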

Several other rule induction techniques could be used instead of Association Rules, but I think the main challenge here is preparing the data for the miner. Code repositories, along with snapshots from the compiler agent (to be built) taken after every build attempt, have to be aggregated and restructured into a format which will bring out the best rules through the mining algorithm (Association Rules or any other). Another fundamental problem is to mimic the Microsoft Team System feature of IDE connectivity between peers so that we have ample data to mine. It is a crucial component to have, but it is not something I would like to work on, and it deviates from the task at hand. Maybe some sort of manual import/export can be developed instead.

The uses of such a system are many. Developers who tend to miss a common bug can become more systematic by going through a checklist of the obvious and the not so obvious. In a connected IDE environment, team coding practices and weaknesses can be exposed without singling out the bad developers in the team. This way, the management can not only develop software process cycles better suited to the behavior of the team but can also manage projects effectively. Developers can review their weekly coding behavior and see where they have been making mistakes, and a major part of the problem is solved right there: the programmer now knows what HIS problem actually is! And above all, we can induce rules which we humans, as pattern matching machines, fail to spot with the naked eye. There is just too much code!

Close but no cigar, a Google Summer of Code ’07 failed entry 😦

Integrating WordNet for Semantic Similarity based Indexing

Currently Beagle indexes lack semantic knowledge such as interrelationships between words, the grammar of the language and real world ontologies. Because of this, Beagle sometimes ends up with too many search results or too few. If we introduce semantic relationships into the index, we get a mechanism to narrow or widen Beagle search results. Further on, the current user context can also be incorporated more effectively through such an addition. I propose using WordNet as a semantic filter to carry out indexing based on the semantic similarity of tokens, to allow narrowing and broadening of searches and to provide word-sense disambiguation. This filter can be used on top of the other filters currently used in Beagle.

I happen to have hundreds of pictures of cars in a local folder. They all turn up when I search using Beagle. One simple way to narrow down the search results is to return only recent pictures, that is, pictures recently indexed (re-indexed) or used. This seems to work for the current user context, but in general such a strategy promotes results which the user already has access to because of his current context, for example files currently open or recently closed. The true quality of Beagle will come out when it can return results related to the current context which the user might not be aware of or has no easy access to.

For this ambition, two things have to be in place. The first is an index based on semantic relationships. Someone on the Wasabi mailing list talked about incorporating domain specific ontologies suited to particular people.

The idea is the same here, except that instead of a domain specific ontology serving a vertical need, WordNet is used to better classify objects, harnessing language features, since most objects (emails, HTML pages, OpenOffice documents, chats, etc.) end up being text. This can help create indexes based on semantic similarities between terms, e.g. car and bus are close to each other, so a search on ‘car’ can be broadened to include ‘bus’ as well. Word sense disambiguation can be utilized too. So a search on, say, ‘amber’ in the context of ‘cars’ will return ‘traffic light’, and in the context of ‘museums’ will return ‘fossil resins’.
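The two ideas, broadening by semantic closeness and sense disambiguation by context, can be sketched in a few lines of Python. This is only a toy: the `HYPERNYM` and `SENSES` tables below are hand-written stand-ins invented for this example, not real WordNet data, and the function names are hypothetical. A real implementation would query WordNet itself for hypernyms and glosses.

```python
# Toy stand-in for WordNet: each term points to its more general term.
HYPERNYM = {
    "car": "motor vehicle",
    "bus": "motor vehicle",
    "motor vehicle": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "artifact",
    "museum": "building",
    "building": "artifact",
}

# Toy sense inventory: each sense has a bag of words typical for that sense.
SENSES = {
    "amber": {
        "traffic light": {"signal", "road", "traffic", "car", "cars"},
        "fossil resin": {"fossil", "resin", "tree", "museum", "museums"},
    },
}


def ancestors(term):
    """Return the chain [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def distance(a, b):
    """Number of edges between a and b via their nearest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None  # no common ancestor in the toy table


def broaden(term, max_distance=2):
    """Terms close enough to 'term' to be added to a broadened search."""
    related = []
    for other in HYPERNYM:
        d = distance(term, other)
        if other != term and d is not None and d <= max_distance:
            related.append(other)
    return sorted(related)


def disambiguate(word, context_words):
    """Pick the sense whose typical words overlap most with the context."""
    context = set(context_words)
    senses = SENSES[word]
    return max(senses, key=lambda sense: len(senses[sense] & context))
```

With these tables, `broaden("car")` includes ‘bus’ but not ‘museum’, and `disambiguate("amber", ["cars"])` picks the traffic light sense while `disambiguate("amber", ["museums"])` picks the fossil resin sense.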

Secondly, a desktop user has several contexts while using his or her machine. During the day, the user might be working on a project, in the evening surfing favorite leisure topics, and at night writing code for Beagle. If this time sensitive and, more importantly, mission oriented desktop usage is combined with indexes based on semantics, better search results can be expected.

This functionality is crucial for Beagle to be truly scalable.
I haven’t yet figured out how these features will fit into the Beagle architecture. I think this will be tricky because, by default, Beagle filters are divided by MIME type, whereas these filters can be applied to many if not all MIME types and cannot be run independently. They have to be wrapped around the conventional Beagle filters. These things have to be sorted out.
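One possible shape for the wrapping is a decorator-style filter that delegates to a conventional filter and then enriches its output. The Python sketch below only illustrates that idea: `PlainTextFilter`, `SemanticFilter`, `extract_tokens` and the `related_terms` table are all hypothetical names, and none of this reflects Beagle's actual filter API.

```python
class PlainTextFilter:
    """Hypothetical stand-in for a conventional per-MIME-type filter."""
    mime_type = "text/plain"

    def extract_tokens(self, data):
        return data.lower().split()


class SemanticFilter:
    """Wraps any conventional filter and adds semantically related terms."""

    def __init__(self, inner, related_terms):
        self.inner = inner                  # the wrapped conventional filter
        self.related_terms = related_terms  # e.g. looked up from WordNet
        self.mime_type = inner.mime_type    # stays tied to the inner filter's type

    def extract_tokens(self, data):
        tokens = self.inner.extract_tokens(data)
        extra = []
        for token in tokens:
            for related in self.related_terms.get(token, ()):
                if related not in tokens and related not in extra:
                    extra.append(related)
        return tokens + extra


wrapped = SemanticFilter(PlainTextFilter(), {"car": ["bus"]})
indexed_tokens = wrapped.extract_tokens("My red car")
```

Because the wrapper exposes the same interface as the filter it wraps, the same semantic layer could sit on top of any MIME-specific filter without running on its own.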

This project was proposed for Google Summer of Code 2007 and, shamefully for me, was rejected :( But I believe the idea is still reasonable and will be looking into it soon.