random thoughts

my spontaneous, unverified, premature thoughts...



"Interfaces" in data mining


In the OR/MS (operations research & management science) field, Interfaces is a must-read journal for anyone who seriously wants to "learn how to overcome the difficulties and issues encountered in applying operations research and management science to real-life situations". Most of the articles are authored by a team of industrial and academic people, and each article provides details of the completed application, along with the results and the impact on the organization. Big companies like GM, GE, P&G, J&J and the airlines are frequent owners of those applications.


To my knowledge, there is not yet a similar journal/magazine in the field of data mining. Close but not exact parallels are

In terms of the relationship between application and research, OR/MS and data mining are very similar: both need close interaction between the two groups of people. But why the difference? Is data mining still too young to have cultivated that culture, or does data mining rarely succeed (or, even when it succeeds, lack sufficient technical merit for publication)?


theory, algorithm, application


"Theory points at promising directions to new algorithms, while experiments validate them as practical and efficient algorithms ready for real world applications."  This is from the research statement of Guy Lebanon (a CMU alumnus). The logical relationship among theory, algorithm and application is clearly stated here. There are specialized groups of people focusing on different sections of the chain: those interested in the theory-algorithm part, those interested in the algorithm-application part and, of course, those who can tackle them all. Another insight is that algorithm creation / problem solving should not be a sporadic process; it is more efficient when guided by theory. The more general a theory is, the better it can guide us to algorithms that solve a wider range of real problems.


data miner's qualification


Qualification is driven by market demand, so it makes sense to let the facts speak for themselves. Job advertisements are probably the most straightforward source for this kind of information. I did an informal analysis of the job qualifications in up to 20 industrial job advertisements on kdnuggets.com from the 2 most recent months (July and August 2005). The job responsibilities vary significantly, from software development to consulting (solution provider) to functional analytical positions (such as marketing), but at least one part of each job is relevant to data mining. I briefly recorded each company's industry, the position title, the position's responsibilities, and the qualifications in terms of education, experience and skills. These are my general findings (if you're curious, you could find my scratch summary here)

I got the impression that "jack-of-all-trades" sort of people, who can do modelling, development and deployment, are highly sought after. Another observation is that most of the positions are newly created: either the business line is new or the company is new. As a result, a large part of those jobs is to "invent" -- invent new algorithms, new solutions, new methodologies. That probably justifies the preference for candidates with advanced degrees, and the even stronger preference for Ph.D.s, as they're usually expected to solve notoriously hard problems and, ultimately, to invent. Even when the employee is not expected to invent, because most of the standard algorithms already exist, there is still the problem of tailoring them to the real situation. Tasks such as parameter tuning, code hacking, modification and adaptation are inevitable in making a raw algorithm work, and they're better done by somebody who knows the "black box" inside out. Another hypothesis for the advanced-degree preference is that applying those algorithms appropriately requires a deep background in the area, a level beyond "know-how". Or the modelling process is simply so complicated that only people with advanced degrees qualify.


fame of neural nets and decision trees


Sifting through the news on business applications, neural nets and decision trees dominate. It seems almost a golden rule: for continuous-valued function approximation, go to neural nets; for classification/rule mining, go to decision trees.


People seem to be fairly happy with black box solutions like neural nets if speed is not an issue. For some of them, the black box doesn't bother them much because they care more about the output than the internal mechanism. Another guess is that they simply accept that the business world is too complex to be modeled, so they resort to a purely empirical approach which tells them nothing but the output. The question is: if they were aware of better alternatives, would neural nets still prevail? In a book written for financial professionals, I found this sentence defining neural networks as "universal approximator in the sense that they can fit any non-linear function with any degree of accuracy". Without further qualification, a statement like this easily misleads people into believing that the neural network is a panacea.


The popularity of the decision tree is not hard to understand: its tree representation and if ... then ... rule format map closely onto people's thought process. There are competitive classification algorithms, but it's hard to find another one that is as intuitive, as simple and as powerful as the decision tree.
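
To make the if ... then ... point concrete, here is a minimal sketch. The tree is written by hand, and its features, thresholds and labels (income, age, decline/review/approve) are made up purely for illustration; it is not learned from any data and does not come from any particular tool. It only shows how every leaf of a tree reads off as one plain rule.

```python
# A hand-written toy tree; features, thresholds and labels are hypothetical.
TREE = {
    "test": ("income", 50000),          # question asked: income <= 50000 ?
    "yes": {"label": "decline"},
    "no": {
        "test": ("age", 30),            # question asked: age <= 30 ?
        "yes": {"label": "review"},
        "no": {"label": "approve"},
    },
}


def extract_rules(node, conditions=()):
    """Walk the tree and return one 'if ... then ...' rule per leaf."""
    if "label" in node:
        premise = " and ".join(conditions) if conditions else "always"
        return [f"if {premise} then {node['label']}"]
    feature, threshold = node["test"]
    rules = extract_rules(node["yes"], conditions + (f"{feature} <= {threshold}",))
    rules += extract_rules(node["no"], conditions + (f"{feature} > {threshold}",))
    return rules


def classify(node, record):
    """Follow the questions from the root down to a leaf label."""
    while "label" not in node:
        feature, threshold = node["test"]
        node = node["yes"] if record[feature] <= threshold else node["no"]
    return node["label"]


for rule in extract_rules(TREE):
    print(rule)
```

Each printed line is a rule a business user can read without knowing how the tree was built, which is probably a large part of the appeal.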


natural data mining


Continuing the topic of decision trees... it leads me to the topic of making models more comprehensible and easier to use, or as I coined it, natural data mining. I can't help relating it to the Black-Scholes option pricing model, which is now widely used in derivative securities trading. The derivation of the model requires a decent level of knowledge of stochastic calculus, which itself hardly contains any intuition. But traders are happily using it. Why? There is a way to interpret the model in terms of replication (i.e. decomposing the option into several simple investment tools like a bond and a stock), which requires nothing but high school math. With that, the business world embraced it almost immediately, with trust.
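
A rough sketch of the replication idea follows. Note the assumptions: this is a one-period, two-outcome toy setting, not the Black-Scholes model itself, and all the numbers (stock at 100 moving to either 120 or 80, strike 100, zero interest) are made up for illustration. The function name and parameters are my own. The point is only that a call option can be rebuilt from some shares plus some borrowing, using nothing more than two linear equations.

```python
def replicate_call(s0, s_up, s_down, strike, rate):
    """Replicate a one-period call option with stock plus a bond position.

    Returns (shares, bond, price); a negative bond means borrowing.
    """
    # Option payoff in each of the two possible future states.
    c_up = max(s_up - strike, 0.0)
    c_down = max(s_down - strike, 0.0)

    # Choose shares so the portfolio moves exactly as much as the option.
    shares = (c_up - c_down) / (s_up - s_down)

    # Choose the bond so the portfolio matches the option in the up state
    # (it then matches in the down state as well), discounted to today.
    bond = (c_up - shares * s_up) / (1.0 + rate)

    # The option must cost what the replicating portfolio costs today.
    price = shares * s0 + bond
    return shares, bond, price


shares, bond, price = replicate_call(s0=100.0, s_up=120.0, s_down=80.0,
                                     strike=100.0, rate=0.0)
print(shares, bond, price)  # 0.5 -40.0 10.0
```

Holding half a share and borrowing 40 pays 20 if the stock goes up and 0 if it goes down, exactly like the option, so the option is worth 10 today. That is the kind of explanation a trader can check on paper.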


Back to data mining: except for a few algorithms like the decision tree, most of the others are less intuitive. It's hard to convince people of a model's value unless they can see it, feel it, touch it and play with it. This leads to two possible solutions. One is to develop intuitive model representations in such structures as trees, graphs and networks; model visualization is as important as data visualization. The other is to create a really user-friendly interface to the model parameters, ranging from basic requirements like meaningful descriptions to more advanced ones like intelligent aids for parameter tuning.


collaborative mining


Most of the current tools assume that people work alone. That makes sense when problems are simple and solution search spaces are small, but looking forward, as tasks grow more complex, people are more likely to work as a team than as individuals. This challenges the current practice of doing data mining in stand-alone mode. Indeed, data mining is a knowledge-intensive activity: it requires a lot of interaction among collaborators to get ideas across, to critique a current solution, to inspire a new idea. It also requires managing, sharing and visualizing data mining progress so that everybody can see where the project stands and where it's going. Version control is important, and so is resource planning and optimization, in particular because data mining is computationally intensive.


(I thought I was the first to coin "collaborative mining", but later I found that I am not the only person worrying about this issue: it's already mentioned in SAS's marketing literature for a most recent release of their data mining package.)


On demand mining


The idea is inspired by salesforce.com, a rising star among CRM solution providers with its on-demand CRM application, i.e. it lets customers use its web-based CRM application on a subscription basis without having to install a single piece of software on the customer's premises. This is a market niche for small and mid-sized companies that are short of resources to implement an in-house CRM application. Analytics is inevitably part of the CRM package. This is also true of salesforce.com, though their current analytics seems to go no farther than who is doing what when (so-called "real time analytics") and some canned reports under the heading of "sales metrics". Do we have a market for a general data mining application as well? I think this is justifiable at least for a certain group of customers who don't have expertise in modelling and don't want to maintain a team in house, but who still want on-going model maintenance. If done well, this service delivery method could reduce the per-customer modelling/maintenance cost by spreading the cost across multiple customers.


object oriented mining


data mining productivity


data mining project life cycle management