Automatic Rule Induction for Gendarme

Developers make mistakes while writing code, from spelling slips to semantic errors. At times they are dumbstruck to discover obvious problems in code that fails to compile. It also helps to know the best practices for a particular problem while coding, i.e., the approaches others have adopted to solve a similar problem.

Gendarme tries to provide these features, but it is manually driven. It relies on rules that are entered by hand, which limits how effective it can be. Suppose instead we had a system that induces these rules automatically by observing every version of the source code, together with an agent that identifies the changes a developer makes to turn an unsuccessful build (one that contains errors or warnings) into a successful one. Then we could generate rules for Gendarme automatically.

Such rules would also stay up to date: as developers keep writing code and displaying their behavior, new rules will continue to emerge and rules for old bugs may die out.

This task is a typical problem of ‘data mining over structured data’, where the semantics of the language define the structure. A simple data mining approach like Association Rule Mining can be used to induce rules that act as checklists, possible reasons for errors, and some not so apparent rules.

The Gendarme Project is very interesting, but how about a system with an automatic rule induction mechanism built into it? That is, a system which continuously observes programmers’ debugging patterns and comes up with immediately implementable rules through ‘data mining over structured data’, rather than a human figuring them out. Such rules can be used in many ways. Some will be just a checklist: for a null exception, for example, a balloon tip could suggest ‘Did you try initializing the object?’. Such suggestions are relatively easy to find through a small add-on to MonoDevelop which saves the sequence of changes between builds (both successful and unsuccessful). In a naive way, any change by the programmer which converts an unsuccessful build into a successful one is a solution to the problem just identified. The tricky part, however, is identifying the actual semantic changes taking place. This is where a lot of study of C# guidelines will come into play, along with language theory, to induce true semantics.
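The naive heuristic above can be sketched in a few lines. This is only an illustration in Python, not part of Gendarme or MonoDevelop: the `BuildSnapshot` type, the `fixing_changes` function and the sample C# lines are all hypothetical, and a line-level diff stands in for the real semantic analysis the proposal calls for.

```python
import difflib
from dataclasses import dataclass


@dataclass
class BuildSnapshot:
    """Hypothetical record saved by the add-on after each build attempt."""
    source: str       # full source text at build time
    succeeded: bool   # True if the build had no errors or warnings


def fixing_changes(history):
    """Collect the line-level changes that turned a failed build into a
    successful one. Only textual differences are found; recognising the
    semantic change behind them is the hard part and is not attempted."""
    fixes = []
    for before, after in zip(history, history[1:]):
        if before.succeeded or not after.succeeded:
            continue  # only failed -> succeeded transitions are of interest
        old_lines = before.source.splitlines()
        new_lines = after.source.splitlines()
        matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                fixes.append((tag, old_lines[i1:i2], new_lines[j1:j2]))
    return fixes


# Made-up history: an uninitialised object, then the fix.
history = [
    BuildSnapshot("Foo f;\nf.Bar();", succeeded=False),
    BuildSnapshot("Foo f = new Foo();\nf.Bar();", succeeded=True),
]
for change in fixing_changes(history):
    print(change)
```

Each collected change is a candidate "fix" that a later mining step could count and generalise.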

Other, more sophisticated rules can be found using ‘Association Rules’, and the trick is how we classify code entities or snippets as input to this data mining algorithm. Suppose we say that a block of { code } is A; then we can induce rules through Association Rules Discovery in the format:
‘if A and B occur in C then Error 432 occurs.’
Or, if we say that a fixed keyword such as object, struct or define is A, then
‘if A occurs => B also occurs’ is another rule induced!
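To make the rule format above concrete, here is a minimal brute-force sketch of association rule discovery in Python. It assumes each build attempt has already been reduced to a set of labels (code entities such as A and B, plus any error that occurred). The function name `mine_rules`, the thresholds and the sample data are all made up for illustration, and a real miner would use something like Apriori rather than enumerating every combination.

```python
from itertools import combinations


def mine_rules(transactions, min_support=0.5, min_confidence=0.8, max_antecedent=2):
    """Return rules (antecedent, consequent, support, confidence) where
    support    = fraction of transactions containing antecedent + consequent
    confidence = support(antecedent + consequent) / support(antecedent)."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)
    items = sorted(set().union(*sets))

    def support(itemset):
        return sum(1 for t in sets if itemset <= t) / n

    rules = []
    for size in range(1, max_antecedent + 1):
        for antecedent in combinations(items, size):
            lhs = frozenset(antecedent)
            lhs_support = support(lhs)
            if lhs_support < min_support:
                continue
            for rhs in items:
                if rhs in lhs:
                    continue
                both = support(lhs | {rhs})
                if both < min_support:
                    continue
                confidence = both / lhs_support
                if confidence >= min_confidence:
                    rules.append((antecedent, rhs, both, confidence))
    return rules


# Made-up data: A and B together always come with Error432.
transactions = [
    {"A", "B", "Error432"},
    {"A", "B", "Error432"},
    {"A"},
    {"B"},
]
for antecedent, consequent, sup, conf in mine_rules(transactions):
    print(f"if {' and '.join(antecedent)} => {consequent} "
          f"(support={sup:.2f}, confidence={conf:.2f})")
```

On this toy data the rule ‘if A and B => Error432’ is reported, while ‘if A => B’ is not, because A also appears without B.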

Several other rule induction techniques can be used instead of Association Rules, but I think the main challenge here is preparing the data for the miner. Code repositories, along with snapshots taken by the compiler agent (to be built) after every build attempt, have to be aggregated and restructured into a format which will bring out the best rules through the mining algorithm (Association Rules or any other). Another fundamental problem is mimicking the Microsoft Team System feature of IDE connectivity between peers, so that we have ample data to mine. This is a crucial component to have, but it is not what I would like to work on, and it deviates from the task at hand. Maybe some sort of manual import/export can be developed instead.
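As a rough idea of what "restructuring for the miner" could mean, the sketch below turns one build snapshot into a set of labels. Everything here is an assumption for illustration: the `KEYWORDS` list, the `kw:`/`err:` label prefixes and the `to_transaction` function are hypothetical, and matching raw identifiers with a regular expression is a stand-in for parsing the code properly.

```python
import re

# Hypothetical list of "fixed keywords" treated as items by the miner.
KEYWORDS = frozenset({"object", "struct", "define", "class", "new"})


def to_transaction(source, error_codes):
    """Reduce a snapshot to a set of items: the keywords it uses plus the
    error codes reported for that build attempt."""
    tokens = set(re.findall(r"[A-Za-z_]\w*", source))
    items = {f"kw:{token}" for token in tokens & KEYWORDS}
    items |= {f"err:{code}" for code in error_codes}
    return frozenset(items)


print(sorted(to_transaction("struct Point { object tag; }", [432])))
```

A list of such sets, one per build attempt, is the kind of input an association rule miner expects.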

The uses of such a system are many. Developers who tend to miss a common bug can become more systematic by going through a checklist of the obvious and the not so obvious. In a connected IDE environment, team coding practices and weaknesses can be exposed without singling out the bad developers in the team. This way, management can not only tune the software process to the behavior of the team but also manage projects more effectively. Developers can review their weekly coding behavior and see where they have been making mistakes. A major part of the problem is solved right there: the programmer now knows what HIS problem actually is! And above all, we can induce rules which we humans, as pattern matching machines, fail to spot with the naked eye. There is just too much code!

Close but no cigar, a Google Summer of Code ’07 failed entry 😦
