Archive for the ‘ Data Mining ’ Category

The Pre-Sales Diary: Data Profiling before Proof of Concepts

The Raison D’Etre for many Pre-Sales Engineers is to carry out Proof of Concepts. Although for most of the potential leads, Proof of Concepts are to be avoided because they incur greater costs in the sales cycle, increase the sales closing time, increases chances of failure but there are certain cases where proof of concepts are really much more helpful for the Sales cycle then anything else.

Some of these cases include when there are competitors involved touting the same lingo/features/capabilities etc, others include a genuine customer scenario which needs addressing in a proof of concept either because the scenario is pretty unique, it is part of their due diligence, or your product hasn’t been tested on those waters before.

Pre-Sales folks are pretty comfortable on their technology which they like to showcase to such customers but they are totally new to the customer’s scenario. There are always chances of failure and there are many failures abound.

Before embarking on a scope for a proof of concept and promising deliverables, it is more than required, infact mandatory not just to analyze the customer organization, but also processes, metrics and ofcourse data.

The last part is where I find most proof of concepts depending on. Everything is set, you took extensive interviews with the stakeholders and know what needs to be ‘proved’, you scoped out a business process or two, figured out some metrics and one or two KPIs and they gave access to their data pertaining to it. Now the ball is in your court, but before you know it, your doomed!

The data is incomplete, inaccurate, and have tons of issues which data governance and MDM were meant to solve but didn’t, they don’t exist yet. In most likelihood, the customer is quite unaware of such issues, that is why you are offering them a Business Intelligence solution in the first place, to tap into their data assets. They have never done so before themselves or done so quite limited way to be able to uncover such obstacles. In other scenario when they are aware of these issues, they either are unable to tap it or it is a trick question for you, they want to check whether you cover this aspect or not.

You can either proof the ‘time’ challenge by jumping right into the proof of concept and ignoring all standard practices which are pretty standard during project implementations but then you ignore all of them (or most of them) simply because ‘its just a demo’!


I always carry out a small data survey activity before promising any value to be shown in the proof of concept to make sure what we have in store before we can do anything. Simple rule, GIGO – Garbage In, Garbage Out. If you want to have a good quality, successful demo, profile your data first, understand the strengths and weaknesses and above all let the customer know fully about the limitations, if possible, get enrichments in your data based on your profile to make your demo successful.

This one single step can lead to drastically different outcomes if it is performed or not.

Data Profiling:

Data Profiling is defined as the set of activities performed on datasets to identify the structure, content behavior and quality of data. The structure will guide you towards what links, what is missing, do you all have the required master data, do you have data with good domain representation (possible list of values), what granularity you can work with. Content Behavior guides you on what are the customer’s NORMS in terms of KPI and metric values. e.g. if the dataset contains age groups of 40+, then there is no need to showcase cross selling market basket targeted to toddlers. You can simply skim it out, or ask for data enrichment. if you dont data pertaining to more than one year, then you can’t have year’ as a grain level which for certain metrics and analysis might be critical. Data Quality assessment, albeit a general one, can save you many hours ahead. Most notable of quality issues are data formats, mixed units of measurements, spell checks. e.g. you have RIAD, RIYADH, RIYAD, RYAD all indicating the same city, mixed bilingual datasets like names and addresses etc.

There are many tools available out there which can aid in Data Profiling, including the ubiquitous SQL and Excel. However, Data Profiling, being a means to an end and not the end in itself does not warrant more time and energy than required, there fore a purpose built RAD enabled data Profiler is one of your most critical investments in your toolbox.

One which I have come across recently and which fits the bill very nicely is Talend OpenProfiler, a GPL-ed, Open Source and FREE software which is engineered with great capabilities and power. You can carry out structure analysis, content analysis, column or fields analysis, pattern based analysis on most source systems including many DBMS, flat files, excels etc with readily available results in both numerical and visual representations to make you get a better sense of your data.

I believe all Data Quality tools are (or should be) equipped with good data profiling capabilities, most ETL vendors have data profiling capabilities and some data analysis packages like QlikView can also be used albeit in limited ways to profile data in limited time.

The Data Profile can also be later shared with the customer as a value deliverable.


Happy Demoing!


Data Requirement for Advanced Analytics

TDWI author, Philip Russom has presented a fantastic checklist on the data requirements for advanced analytics.

First, it is a major BI/DW organization which pinpoints the need of different data architectures for reporting and analytics (particularly advanced analytics).

Second, it serves as an important document for data warehousing and modeling experts who usually dont consider the advanced analytics usage when designing the data storage.

Third, it promotes the provisioning of separate analytical data stores that advanced analytics demand.

Fourth, it serves a business case for in-Memory databases.

Standard reporting and analytics (OLAP) suffice well with multidimensional models (high level, summarized data) while advanced analytics require raw transactional data (low level, detail data) along with aggregated data and derived data usually in denormalized forms. The exact nature of the design is determined on the type of analysis to be carried out.

The data integration is also different for data warehousing serving reporting and analytics and for the analytics databases serving advanced analytics. The former mostly rely on ETL while the later is better served up both in practicality and the nature of analysis by ELT.

Secondly, the data integration for data warehousing deals mostly with aggregating, consolidating and changing the schema type from relational to multidimensional. Whereas in analytics database, the data integration is of an advanced mathematical nature where activities like discretization of continuous data, binning, reverse pivoting, data sampling and PCA are heavily employed.

A similar discussion had been carried out sometime ago here.

This white paper makes a strong case.

Data Mining Definition

Here is my definition of Data Mining:

Data Mining is a process of extraction of non-trivial patterns from massive datasets which either provides descriptive insights of the data (not perceived without this extraction) or provides actionable intelligence (in the form of reusable patterns which the process extracted). Where actionable intelligence is a structure of explicitly representable patterns which can be used for decision making either manually or computationally.

What do you think about it? Whats missing? Whats extra?

An overview of Statistical aspects of Fraud Detection

Here is a video presented by Mr. David Hand on the issues of automated Fraud Detection system. Worth watching:
David Hand

Statistical techniques for fraud detection, prevention, and evaluation

Complementing this video by his paper is a good combination.

Automatic Rule Induction for Gendarme

While developers write code, they tend to make mistakes, from spelling mistakes to semantic ones. And at times, developers are dumbstruck finding out obvious problems in the code they fail to compile successfully. Secondly, it is always good to know the best practices of a particular problem while coding, i.e when developers are writing code, they should know the best practices others have adopted to solve a similar problem.

Gendarme tries to provide these features but is manually driven. The system relies on rules which are manually entered and thus, is far from being truly effective. If we can have a system which automatically induces these rules by observing all source code versions and if we can have an agent which identifies the necessary changes made by a developer to convert an unsuccessful build (one that contains errors/warnings) to a successful build, we can generate rules for Gendarme automatically.

Such rules will also be updated since developers will continue to write code and display their behavior, new rules will continue to emerge and old bugs might die out.

This task is a typical problem of ‘data mining over structured data’. The semantics of language will devise the structure. A simple data mining approach like Association Rule Mining can be used to induce rules which can act as checklists, possible errors reasons and some not so apparent rules.

The Gendarme Project is very interesting but how about a system with an automatic rule induction mechanism built into it? That is, a system which continuously observes the programmers’ patterns of debugging and coming out with immediately implementable rules through ‘data mining over structured data’ rather than a human figuring them out. Such rules can be used in many ways, some rules will be just a checklist, like for a null exception, a suggestion like ‘Did u try initializing the object?’ etc in a balloon tip for example. Such suggestions are relatively easy to find through a small add-on to MonoDevelop which saves the sequence of changes between builds (both successful and unsuccessful). In a naive way, any change by the programmer which converts an unsuccessful build to a successful one means the change is a solution to a problem just identified. However its a tricky part to identify the actual semantic changes taking place. This is where a lot of C# guidelines study will come into play along with language theory to induce true semantics.

Other more sophisticated rules can be found using ‘Association Rules’ and the trick is how we classify code entities or snippets as input to this data mining algorithm. Suppose if we say a block of { code } is A, then we can induce rules through Association Rules Discovery in the format:
‘if A and B occur in C then Error 432 occurs.’
or if we say that a fixed keyword is A, like object, struct, define, etc then
‘if A occurs => B also occurs’ is another rule induced!

Several other rule induction techniques can also be used instead of Association Rules but I think the main challenge here is preparing the data for the miner. Code repositories along with snapshots from the compiler agent (to be built) taken after every build attempt have to be aggregated and restructured in a format which will bring out the best rules through the mining algorithm (Association Rules or any other). Another fundamental problem is to mimic the Microsoft Team System feature of IDE connectivity within peers so that we have ample data to mine on. However this is not what I like to do, and this also deviates from the task at hand. Although a crucial component to have. Maybe some sort of a manual import export can be developed.

The uses of such a system are many. Developers who tend to miss out a common bug can become more systematic by going through a checklist of the obvious and not so obvious. In a connected IDE environment, team coding practices and weaknesses can be exposed without singling out the bad developers in the team. This way, the management can not only improve better software process cycles suited to the behavior of the team but can also manage the projects effectively. Developers can review their weekly coding behavior and see where they have been making mistakes and thus the major part of the problem is solved there, the programmer now knows what HIS problem actually is! And above all, we can induce rules which we humans as pattern matching machines fail to do so with the naked eye. There is just too much code!

Close but no cigar, a Google Summer of Code ’07 failed entry 😦

Multi-Relational Learning

In many real world domains, hidden information is present in the inter-relationships between different classes within the data. This information can be relational, hierarchical or conditional in nature. Most of the times, this information is implicitly codified while designing the data schemas for the problem at hand. While data mining, all such schemas are denormalized since conventional data mining algorithms work on single table structures at a time. By denormalization, the implicit relationships present in the original schema are lost and thus, data mining starts off by losing valuable information.

To overcome this problem, data analysts denormalize data in an interactive fashion using their background domain knowledge to preserve data inter-relationships. However, challenge lies in fully automating this process and that is where the emerging field of Relational Data Mining appears. For a comprehensive book, check out. Relational Data Mining by Dzeroski and Lavrac.

Domains where RDM (Relational Data Mining) is holding great potential include bioinformatics, social networking, viral marketing, natural language processing and text mining to name a few. The inherent nature of all such domains is high data dimensionality, catagorical data and data that can be represented as a graph structure.

In high dimensional data mining, the main problem is the sparsity of feature vectors constructed. And the learned feature vectors tend to be larger than the orignal data set if the data is also catagorical in nature. A naive approach under such an environment is to try to superimpose data as a normal distribution but this is not a robust strategy. Adding on to this, in certain domains(like bioinformatics), it is quite often a problem for the data miner to fully understand the nature of the domain and thus there is a tendency to miss out important relationships while preparing the data for analysis.

The essence of RDMs is in an expressive language for patterns involving relational data, a language more expressive than conventional (propositional) data mining patterns and less complex than first order logic. Inductive Logic Programming (ILP) in context of KDD provides the language sometimes called relational logic to express patterns containing data inter-relationships.

There is a counterpart relational data structure for many data mining tasks, for CART, there is S – CART, for C4.5, there is a relational version called TILDE. Similary, there are relational association rules and relational decision trees which are build on the notion of a relational distance measure like RIBL.

However, even though Multi-relational learning holds promise, the field is still far from being able to generalize methodologies for the whole spectrum of data mining problems. The field of Statistical Relational Learning, as it is sometimes coined holds onto an assumption that models built over apparent data and relational data (within it) yields better results than models built over only apparent data. This however, as pointed out by [2] is not the general case and in certain data sets, only the intrinsic (apparent) data provides better models compared to those datasets containing relational data too.

Secondly, due to the inherent complexity of relational data, it has been observed that deterministic relational learners don’t produce as good results as probabilistic relational learners. Statistical relational learning accurately predicts structured data and is able to chalk out dependencies between data instances which have been ignored a lot in previous machine learning setups.

Besides the nature of the data sets, relational learning algorithms have also developed various approaches in solving the problem. Earlier relational learners concentrated more on propositionalization of relational data into ‘flat’ data and then applying conventional learners to it. However, recent tactics involve incorporating the relational data schemes in the learner’s framework directly. [1] However, both approaches continue to progress.

Thus, the field of relational learning is gaining wide acceptance and suitable methodologies for applications in general fields are being devised.