Data Requirement for Advanced Analytics
TDWI author Philip Russom has presented a fantastic checklist on the data requirements for advanced analytics.
First, it comes from a major BI/DW organization and pinpoints the need for different data architectures for reporting and for analytics (particularly advanced analytics).
Second, it serves as an important document for data warehousing and modeling experts, who usually don't consider advanced analytics usage when designing the data storage.
Third, it promotes the provisioning of separate analytical data stores that advanced analytics demand.
Fourth, it serves as a business case for in-memory databases.
Standard reporting and analytics (OLAP) are served well by multidimensional models (high-level, summarized data), while advanced analytics requires raw transactional data (low-level, detailed data) along with aggregated and derived data, usually in denormalized form. The exact nature of the design depends on the type of analysis to be carried out.
Data integration also differs between the data warehouses that serve reporting and analytics and the analytics databases that serve advanced analytics. The former mostly rely on ETL, while the latter are better served, both in practicality and in the nature of the analysis, by ELT.
Furthermore, data integration for data warehousing deals mostly with aggregating, consolidating and changing the schema type from relational to multidimensional. In an analytics database, by contrast, data integration is of a more advanced mathematical nature, where activities like discretization of continuous data, binning, reverse pivoting, data sampling and PCA are heavily employed.
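To make two of these preparation steps concrete, here is a minimal sketch of equal-width binning and reverse pivoting in plain Python. The function names, the sample amounts and the customer rows are all made up for illustration; a real analytics database would more likely do this in SQL or a dataframe library.

```python
def equal_width_bins(values, n_bins):
    """Assign each continuous value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        return [0 for _ in values]
    # Clamp so that the maximum value falls into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def reverse_pivot(rows, id_field, value_fields):
    """Turn one wide row per entity into one (id, attribute, value) row per cell."""
    return [
        (row[id_field], field, row[field])
        for row in rows
        for field in value_fields
    ]


# Hypothetical transaction amounts, discretized into 4 bins.
amounts = [5.0, 12.5, 20.0, 47.0, 100.0]
bins = equal_width_bins(amounts, 4)

# Hypothetical wide table: one column per month.
wide = [
    {"customer": "c1", "jan": 10, "feb": 15},
    {"customer": "c2", "jan": 7, "feb": 0},
]
long_rows = reverse_pivot(wide, "customer", ["jan", "feb"])
```

Equal-width binning is only one choice; equal-frequency bins are often preferred when the data is heavily skewed, as it is in this small sample.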
A similar discussion took place some time ago here.
This white paper makes a strong case.
Data Mining Definition
Here is my definition of Data Mining:
Data Mining is the process of extracting non-trivial patterns from massive datasets, which either provides descriptive insights into the data (not perceivable without this extraction) or provides actionable intelligence (in the form of reusable patterns which the process extracted). Here, actionable intelligence is a structure of explicitly representable patterns which can be used for decision making, either manually or computationally.
What do you think about it? What's missing? What's extra?
An overview of Statistical aspects of Fraud Detection
Here is a video presented by David Hand on the issues of automated fraud detection systems. Worth watching:
David Hand
Statistical techniques for fraud detection, prevention, and evaluation
Pairing this video with his paper makes a good combination.
Automatic Rule Induction for Gendarme
When developers write code, they tend to make mistakes, from spelling mistakes to semantic ones. At times, developers are dumbstruck when they finally spot an obvious problem in code that failed to compile. It is also always good to know the best practices for a particular problem while coding, i.e. when developers are writing code, they should know the best practices others have adopted to solve a similar problem.
Gendarme tries to provide these features but is manually driven. The system relies on rules which are entered by hand, and is thus far from being truly effective. If we had a system which automatically induces these rules by observing all source code versions, together with an agent which identifies the changes a developer made to convert an unsuccessful build (one that contains errors/warnings) into a successful build, we could generate rules for Gendarme automatically.
Such rules would also stay up to date: since developers continue to write code and display their behavior, new rules will continue to emerge and rules for old bugs might die out.
This task is a typical problem of ‘data mining over structured data’. The semantics of the language will determine the structure. A simple data mining approach like Association Rule Mining can be used to induce rules which can act as checklists, possible error causes and some not-so-apparent rules.
AN AUTOMATIC RULE INDUCTION SYSTEM FOR GENDARME:
------------------------------------------------
The Gendarme Project is very interesting, but how about a system with an automatic rule induction mechanism built into it? That is, a system which continuously observes the programmers’ patterns of debugging and comes up with immediately implementable rules through ‘data mining over structured data’, rather than a human figuring them out. Such rules can be used in many ways. Some rules will be just a checklist: for a null exception, for example, a suggestion like ‘Did you try initializing the object?’ in a balloon tip. Such suggestions are relatively easy to find through a small add-on to MonoDevelop which saves the sequence of changes between builds (both successful and unsuccessful). In a naive way, any change by the programmer which converts an unsuccessful build to a successful one means the change is a solution to a problem just identified. However, the tricky part is identifying the actual semantic changes taking place. This is where a lot of study of C# guidelines will come into play, along with language theory, to induce true semantics.
Other, more sophisticated rules can be found using ‘Association Rules’, and the trick is how we classify code entities or snippets as input to this data mining algorithm. Suppose we say a block of { code } is A; then we can induce rules through Association Rules Discovery in the format:
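As a rough illustration of the naive approach, the sketch below compares the source saved at a failed build with the source saved at the next successful build and keeps the lines that changed. It uses Python's standard `difflib`; the function name and the two C# snapshots are hypothetical, and a real add-on would have to work on the syntax tree rather than on raw lines to get at the semantics.

```python
import difflib


def changed_lines(failed_source, fixed_source):
    """Return (removed, added) lines between a failing and a passing snapshot."""
    diff = difflib.ndiff(failed_source.splitlines(), fixed_source.splitlines())
    removed, added = [], []
    for line in diff:
        if line.startswith("- "):
            removed.append(line[2:].strip())
        elif line.startswith("+ "):
            added.append(line[2:].strip())
        # Lines starting with "  " are unchanged and "? " lines are hints; skip both.
    return removed, added


# Hypothetical snapshots taken by the add-on at two consecutive build attempts.
failed = "Foo foo;\nfoo.Run();\n"
fixed = "Foo foo = new Foo();\nfoo.Run();\n"

removed, added = changed_lines(failed, fixed)
```

Each (removed, added) pair, stored together with the compiler error of the failed build, would be one candidate record for the miner.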
‘if A and B occur in C then Error 432 occurs.’
or, if we say that a fixed keyword is A, like object, struct, define, etc., then
‘if A occurs => B also occurs’ is another rule induced!
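A minimal sketch of how such rules could be induced, assuming each build attempt has already been reduced to a set of code entities plus any error it produced. The item names and the five builds are invented, and the miner only considers pairs of items, which is a simplification of full association rule algorithms such as Apriori.

```python
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)

    def support(items):
        # Fraction of transactions that contain every item in `items`.
        return sum(1 for t in transactions if items <= t) / n

    all_items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(all_items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        # Try the rule in both directions: a => b and b => a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical build attempts, each described by the entities observed in it.
builds = [
    {"uninitialized_object", "method_call", "error:null_reference"},
    {"uninitialized_object", "method_call", "error:null_reference"},
    {"uninitialized_object", "error:null_reference"},
    {"method_call"},
    {"method_call", "struct_keyword"},
]

rules = mine_rules(builds, min_support=0.5, min_confidence=0.9)
for lhs, rhs, sup, conf in rules:
    print(f"if {lhs} occurs => {rhs} also occurs (support={sup:.2f}, confidence={conf:.2f})")
```

On this toy data the only rules that survive link the uninitialized object to the null reference error, which is exactly the kind of checklist item described above.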
Several other rule induction techniques can also be used instead of Association Rules, but I think the main challenge here is preparing the data for the miner. Code repositories, along with snapshots from the compiler agent (to be built) taken after every build attempt, have to be aggregated and restructured into a format which will bring out the best rules through the mining algorithm (Association Rules or any other). Another fundamental problem is to mimic the Microsoft Team System feature of IDE connectivity between peers so that we have ample data to mine. However, this is not what I would like to do, and it also deviates from the task at hand, although it is a crucial component to have. Maybe some sort of manual import/export can be developed.
The uses of such a system are many. Developers who tend to miss a common bug can become more systematic by going through a checklist of the obvious and the not so obvious. In a connected IDE environment, team coding practices and weaknesses can be exposed without singling out the bad developers in the team. This way, management can not only shape software process cycles better suited to the behavior of the team but can also manage projects more effectively. Developers can review their weekly coding behavior and see where they have been making mistakes, and a major part of the problem is solved right there: the programmer now knows what HIS problem actually is! And above all, we can induce rules which we humans, as pattern matching machines, fail to spot with the naked eye. There is just too much code!
Close but no cigar, a Google Summer of Code ‘07 failed entry
Multi-Relational Learning
In many real-world domains, hidden information is present in the inter-relationships between different classes within the data. This information can be relational, hierarchical or conditional in nature. Most of the time, this information is implicitly codified while designing the data schemas for the problem at hand. For data mining, all such schemas are denormalized, since conventional data mining algorithms work on a single table structure at a time. Through denormalization, the implicit relationships present in the original schema are lost, and thus data mining starts off by losing valuable information.
To overcome this problem, data analysts denormalize data in an interactive fashion, using their background domain knowledge to preserve data inter-relationships. However, the challenge lies in fully automating this process, and that is where the emerging field of Relational Data Mining appears. For a comprehensive book, check out Relational Data Mining by Dzeroski and Lavrac.
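A small sketch of what gets lost, using an invented customers/orders schema. Once the two tables are joined into one flat table and the keys are dropped, a single-table learner can no longer tell which rows belong to the same customer, and the one-to-many relationship skews simple statistics.

```python
from collections import Counter

# Hypothetical normalized schema: one customer has many orders.
customers = {1: {"segment": "retail"}, 2: {"segment": "corporate"}}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 35.0},
    {"order_id": 12, "customer_id": 1, "amount": 5.0},
    {"order_id": 13, "customer_id": 2, "amount": 900.0},
]

# Denormalize: join each order with its customer and keep only the attributes,
# as is commonly done when preparing a single flat input table for a miner.
flat = [
    {"segment": customers[o["customer_id"]]["segment"], "amount": o["amount"]}
    for o in orders
]

# In the flat table "retail" looks three times as common as "corporate",
# although there is exactly one customer in each segment.
segment_counts_flat = Counter(row["segment"] for row in flat)
segment_counts_true = Counter(c["segment"] for c in customers.values())
```

Keeping the key column in the flat table avoids this particular loss, but a conventional learner would still treat each row as an independent example.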
Domains where RDM (Relational Data Mining) holds great potential include bioinformatics, social networking, viral marketing, natural language processing and text mining, to name a few. The inherent nature of all such domains is high data dimensionality, categorical data and data that can be represented as a graph structure.
In high-dimensional data mining, the main problem is the sparsity of the feature vectors constructed. The learned feature vectors also tend to be larger than the original data set if the data is categorical in nature. A naive approach in such an environment is to try to superimpose a normal distribution on the data, but this is not a robust strategy. In addition, in certain domains (like bioinformatics), it is quite often a problem for the data miner to fully understand the nature of the domain, and thus there is a tendency to miss important relationships while preparing the data for analysis.
The essence of RDM lies in an expressive language for patterns involving relational data, a language more expressive than conventional (propositional) data mining patterns and less complex than first-order logic. Inductive Logic Programming (ILP), in the context of KDD, provides this language, sometimes called relational logic, to express patterns containing data inter-relationships.
There is a relational counterpart for many data mining techniques: for CART there is S-CART, and for C4.5 there is a relational version called TILDE. Similarly, there are relational association rules and relational decision trees, which are built on the notion of a relational distance measure like RIBL.
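The growth and sparsity are easy to see with a one-hot encoding, which is one common way (assumed here, not prescribed by the text) of turning categorical fields into feature vectors. The records and their values are placeholders.

```python
def one_hot(records, fields):
    """Encode categorical fields as 0/1 vectors, one column per (field, value)."""
    vocab = sorted({(f, r[f]) for r in records for f in fields})
    index = {key: i for i, key in enumerate(vocab)}
    vectors = []
    for r in records:
        vec = [0] * len(vocab)
        for f in fields:
            vec[index[(f, r[f])]] = 1
        vectors.append(vec)
    return vocab, vectors


# Placeholder categorical records with three fields each.
records = [
    {"gene": "g1", "tissue": "t1", "organism": "o1"},
    {"gene": "g2", "tissue": "t2", "organism": "o1"},
    {"gene": "g3", "tissue": "t3", "organism": "o2"},
    {"gene": "g4", "tissue": "t2", "organism": "o2"},
]

vocab, vectors = one_hot(records, ["gene", "tissue", "organism"])

# Fraction of cells that are zero in the encoded matrix.
nonzero = sum(sum(v) for v in vectors)
sparsity = 1 - nonzero / (len(vectors) * len(vocab))
```

Three original fields become nine columns here, and two thirds of the encoded cells are zero; with thousands of distinct values per field the effect is far stronger.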
However, even though multi-relational learning holds promise, the field is still far from being able to generalize its methodologies to the whole spectrum of data mining problems. The field of Statistical Relational Learning, as it is sometimes called, holds onto the assumption that models built over the apparent data together with the relational data within it yield better results than models built over the apparent data alone. This, however, as pointed out by [2], is not the general case: for certain data sets, the intrinsic (apparent) data alone provides better models than the same data with relational data included.
Secondly, due to the inherent complexity of relational data, it has been observed that deterministic relational learners don’t produce results as good as those of probabilistic relational learners. Statistical relational learning accurately predicts structured data and is able to chalk out dependencies between data instances, which have largely been ignored in previous machine learning setups.
Besides the nature of the data sets, relational learning algorithms have also developed various approaches to solving the problem. Earlier relational learners concentrated more on the propositionalization of relational data into ‘flat’ data and then applying conventional learners to it. More recent tactics involve incorporating the relational data schemes into the learner’s framework directly [1]. Both approaches, however, continue to progress.
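Propositionalization can be sketched as summarizing each parent record's related rows into a few aggregate columns, so that a conventional single-table learner can consume the result. The aggregates chosen here (count, sum, max), the function name and the customers/orders data are illustrative assumptions; real systems generate such features far more systematically.

```python
from collections import defaultdict


def propositionalize(parents, children, key, value):
    """Flatten a one-to-many relation into one row per parent with aggregates."""
    grouped = defaultdict(list)
    for child in children:
        grouped[child[key]].append(child[value])

    flat = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        flat.append({
            **parent,
            "n_children": len(values),
            "sum_" + value: sum(values),
            "max_" + value: max(values) if values else None,
        })
    return flat


# Hypothetical relational data: customers and their orders.
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "corporate"},
    {"customer_id": 3, "segment": "retail"},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 2, "amount": 900.0},
]

flat_rows = propositionalize(customers, orders, "customer_id", "amount")
```

The flat table keeps one row per customer, so the examples stay independent, but anything the chosen aggregates do not capture (for instance the order in which purchases happened) is still lost.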
Thus, the field of relational learning is gaining wide acceptance and suitable methodologies for applications in general fields are being devised.

