Automatic Rule Induction for Gendarme
While writing code, developers tend to make mistakes, from spelling errors to semantic ones. At times they are dumbstruck to find that an obvious problem is what keeps their code from compiling. It also helps to know the best practices for a particular problem, i.e. when developers write code, they should know the practices others have adopted to solve similar problems.
Gendarme tries to provide these features but is manually driven. The system relies on rules which are entered by hand and is thus far from being truly effective. If we can have a system which automatically induces these rules by observing all source code versions, and an agent which identifies the changes a developer makes to convert an unsuccessful build (one that contains errors/warnings) into a successful one, we can generate rules for Gendarme automatically.
Such rules will also stay up to date: as developers continue to write code and display their behavior, new rules will continue to emerge and rules for old bugs may die out.
This task is a typical problem of ‘data mining over structured data’, where the semantics of the language define the structure. A simple data mining approach like Association Rule Mining can be used to induce rules which can act as checklists, possible reasons for errors, and some not so apparent rules.
AN AUTOMATIC RULE INDUCTION SYSTEM FOR GENDARME:
—————————————————————————
The Gendarme Project is very interesting, but how about a system with an automatic rule induction mechanism built into it? That is, a system which continuously observes programmers’ debugging patterns and comes up with immediately implementable rules through ‘data mining over structured data’, rather than a human figuring them out. Such rules can be used in many ways. Some will be just a checklist: for a null exception, for example, a suggestion like ‘Did you try initializing the object?’ shown in a balloon tip. Such suggestions are relatively easy to find through a small add-on to MonoDevelop which saves the sequence of changes between builds (both successful and unsuccessful). Naively, any change by the programmer which converts an unsuccessful build into a successful one is a solution to the problem just identified. The tricky part, however, is identifying the actual semantic changes taking place. This is where a lot of study of C# guidelines will come into play, along with language theory, to induce true semantics.
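The naive approach described above can be sketched roughly as follows. This is only an illustration in Python, not the C# add-on itself; `BuildSnapshot` and `fixing_changes` are hypothetical names, and the sketch assumes the add-on stores the full source text, the build outcome and the reported compiler error codes for every build attempt. It only collects textual line differences, so the harder problem of recovering semantic changes is left untouched.

```python
import difflib
from dataclasses import dataclass


@dataclass
class BuildSnapshot:
    """One saved build attempt (hypothetical record kept by the add-on)."""
    source: str          # full text of the file at build time
    succeeded: bool      # True if the build had no errors/warnings
    errors: tuple = ()   # compiler error codes reported for this build


def fixing_changes(snapshots):
    """Return (errors, removed_lines, added_lines) for every
    transition from an unsuccessful build to a successful one."""
    fixes = []
    for before, after in zip(snapshots, snapshots[1:]):
        if before.succeeded or not after.succeeded:
            continue  # only failed -> successful transitions are interesting
        removed, added = [], []
        diff = difflib.ndiff(before.source.splitlines(),
                             after.source.splitlines())
        for line in diff:
            if line.startswith("- "):
                removed.append(line[2:].strip())
            elif line.startswith("+ "):
                added.append(line[2:].strip())
        fixes.append((before.errors, removed, added))
    return fixes


# Example: an unassigned local variable (C# error CS0165) gets initialized.
history = [
    BuildSnapshot("Foo f;\nf.Run();", succeeded=False, errors=("CS0165",)),
    BuildSnapshot("Foo f = new Foo();\nf.Run();", succeeded=True),
]
fixes = fixing_changes(history)
```

Each resulting tuple pairs the errors of the failed build with the lines that were removed and added to fix it, which is the raw material a later mining step would work on.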
Other, more sophisticated rules can be found using ‘Association Rules’; the trick is how we classify code entities or snippets as input to the data mining algorithm. If we say a block of { code } is A, then we can induce rules through Association Rule Discovery in the format:
‘if A and B occur in C then Error 432 occurs.’
or, if we say that a fixed keyword is A, like object, struct, define, etc., then
‘if A occurs => B also occurs’ is another rule induced!
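To make the rule format concrete, here is a minimal sketch of association rule discovery over build records. It is a brute-force enumeration using the standard support and confidence measures, not an optimized algorithm such as Apriori, and the item labels (`A`, `B`, `Error432`), the thresholds and the function name `mine_rules` are all assumptions made for illustration.

```python
from itertools import combinations


def mine_rules(transactions, min_support=0.5, min_confidence=0.8,
               max_antecedent=2):
    """Return rules (antecedent, consequent, support, confidence).

    Each transaction is a set of items observed in one build, e.g. the
    code entities present plus any error code the compiler reported.
    """
    n = len(transactions)

    def support(items):
        # Fraction of transactions that contain every item in `items`.
        return sum(1 for t in transactions if items <= t) / n

    all_items = sorted(set().union(*transactions))
    rules = []
    for size in range(1, max_antecedent + 1):
        for antecedent in combinations(all_items, size):
            left = frozenset(antecedent)
            left_support = support(left)
            if left_support < min_support:
                continue
            for consequent in all_items:
                if consequent in left:
                    continue
                both = support(left | {consequent})
                if both < min_support:
                    continue
                confidence = both / left_support
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, both, confidence))
    return rules


# Four hypothetical builds: the error shows up only when A and B co-occur.
builds = [
    frozenset({"A", "B", "Error432"}),
    frozenset({"A", "B", "Error432"}),
    frozenset({"A"}),
    frozenset({"B"}),
]
rules = mine_rules(builds, min_support=0.5, min_confidence=1.0)
```

With these records the rule ‘if A and B occur then Error432 occurs’ is found with full confidence, while A alone does not predict the error. Real input would need far more builds, and the choice of what counts as an item remains the hard part.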
Several other rule induction techniques can also be used instead of Association Rules, but I think the main challenge here is preparing the data for the miner. Code repositories, along with snapshots from the compiler agent (to be built) taken after every build attempt, have to be aggregated and restructured into a format which will bring out the best rules through the mining algorithm (Association Rules or any other). Another fundamental problem is to mimic the Microsoft Team System feature of IDE connectivity between peers so that we have ample data to mine. However, this is not something I would like to do, and it also deviates from the task at hand, although it is a crucial component to have. Maybe some sort of manual import/export can be developed.
The uses of such a system are many. Developers who tend to miss a common bug can become more systematic by going through a checklist of the obvious and not so obvious. In a connected IDE environment, team coding practices and weaknesses can be exposed without singling out the bad developers in the team. This way, management can not only shape software process cycles better suited to the behavior of the team but also manage projects more effectively. Developers can review their weekly coding behavior and see where they have been making mistakes, and thus the major part of the problem is solved right there: programmers now know what THEIR problem actually is! And above all, we can induce rules which we humans, as pattern matching machines, fail to spot with the naked eye. There is just too much code!
Close but no cigar, a Google Summer of Code ‘07 failed entry
Integrating WordNet for Semantic Similarity based Indexing
Currently, Beagle indexes lack semantic knowledge such as interrelationships between words, grammar, and real-world ontologies. Due to this, Beagle sometimes ends up with too many search results or too few. If we introduce semantic relationships into the index, we gain a mechanism to narrow or widen Beagle search results. Furthermore, the current user context can be incorporated more effectively through such an addition. I propose using WordNet as a semantic filter to carry out indexing based on the semantic similarity of tokens, to allow narrowing and broadening of searches and to provide word-sense disambiguation. This filter can be used on top of the other filters currently used in Beagle.
I happen to have hundreds of pictures of cars in my local folder. They all turn up when I search using Beagle. One simple way to narrow down the search results is to return only recent pictures, that is, pictures recently indexed (re-indexed) or used. This seems to work for the current user context, but in general such a strategy promotes results which the user already has access to because of that context: for example, files currently open, files recently closed, etc. The true quality of Beagle will come out when it can return results related to the current context which the user might not be aware of or has no easy access to.
For this ambition, two things have to be in place. The first is an index based on semantic relationships. Someone on the Wasabi mailing list talked about incorporating domain specific ontologies suitable for particular people.
The idea is the same here, only that instead of a domain specific ontology serving a vertical need, WordNet is used to better classify objects, since most objects (emails, HTML pages, OpenOffice documents, chats, etc.) end up being text and so harness language features. This can help create indexes based on semantic similarities between terms, e.g. car and bus are close to each other, so a search on ‘car’ can be broadened to include ‘bus’ as well. Word sense disambiguation can also be utilized, so a search on, say, ‘amber’ under the context of ‘cars’ will return ‘traffic light’, and under the context of ‘museums’ will return ‘fossil resins’.
Secondly, a desktop user has several contexts while using his or her machine. During the day, the user might be working on a project, in the evening surfing favorite leisure topics, and at night writing code for Beagle. If this time-sensitive and, more importantly, mission-oriented desktop usage is combined with indexes based on semantics, better search results can be expected.
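The broadening step can be illustrated with a small sketch. To stay self-contained it does not call WordNet; instead it uses a tiny hand-written hypernym table as a stand-in, so every entry in `HYPERNYM`, the distance threshold and the function names are assumptions rather than real WordNet data or Beagle code. Similarity is measured as the number of hypernym edges between two terms through their closest shared ancestor.

```python
# Toy stand-in for WordNet: each term maps to its parent (hypernym).
HYPERNYM = {
    "car": "motor_vehicle",
    "bus": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "artifact",
    "traffic_light": "artifact",
    "fossil_resin": "substance",
}


def ancestors(term):
    """Return the chain [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def path_distance(a, b):
    """Hypernym edges between a and b via their closest shared ancestor,
    or None if the two terms share no ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None


def broaden(term, max_distance=2):
    """Return the terms within max_distance edges of the query term."""
    vocabulary = set(HYPERNYM) | set(HYPERNYM.values())
    related = []
    for other in sorted(vocabulary):
        if other == term:
            continue
        distance = path_distance(term, other)
        if distance is not None and distance <= max_distance:
            related.append(other)
    return related


expanded_query = ["car"] + broaden("car")
```

In this toy table a query for ‘car’ is widened to include ‘bus’ and the more general vehicle terms, while ‘bicycle’ and ‘traffic_light’ fall outside the threshold. Lowering `max_distance` narrows the search again.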
This functionality is crucial for Beagle to be truly scalable.
Shortcomings:
————–
I haven’t figured out yet how these features will fit into the Beagle architecture. I think this will be tricky because, by default, Beagle filters are divided by MIME type, whereas these filters can be applied to many if not all MIME types and cannot be run independently. They have to be wrapped around conventional Beagle filters. These things have to be sorted out.
This project was proposed for Google Summer of Code 2007 and, shamefully for me, was rejected.
But I believe the idea is still reasonable and will be looking into it soon.
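One possible shape for the wrapping idea is a decorator-style filter that delegates to a conventional filter and then enriches its output. The sketch below is purely illustrative: it is written in Python rather than Beagle's C#, and `PlainTextFilter`, `SemanticFilter`, `extract_tokens` and the `related_terms` table are hypothetical names, not part of the actual Beagle filter API.

```python
class PlainTextFilter:
    """Stand-in for a conventional filter tied to one MIME type."""
    mime_type = "text/plain"

    def extract_tokens(self, content):
        return content.lower().split()


class SemanticFilter:
    """Wraps any base filter and adds semantically related terms.

    It keeps the MIME type of the filter it wraps, so the same wrapper
    can sit on top of filters for many different MIME types.
    """

    def __init__(self, base_filter, related_terms):
        self.base_filter = base_filter
        self.related_terms = related_terms  # e.g. built from WordNet
        self.mime_type = base_filter.mime_type

    def extract_tokens(self, content):
        tokens = self.base_filter.extract_tokens(content)
        expanded = list(tokens)
        for token in tokens:
            for related in self.related_terms.get(token, ()):
                if related not in expanded:
                    expanded.append(related)
        return expanded


wrapped = SemanticFilter(PlainTextFilter(), {"car": ["bus"]})
tokens = wrapped.extract_tokens("My car")
```

Whether Beagle's filter registration allows this kind of composition is exactly the open question; the sketch only shows that the semantic step itself does not need to know the MIME type.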

