My spontaneous, unverified, premature thoughts...
"Interfaces" in data mining
In the OR/MS (operations research & management science) field,
Interfaces is a must-read journal for those who seriously want to
"learn how to overcome the difficulties and issues encountered in
applying operations research and management science to real-life
situations". Most of the articles are authored by a team of industrial
and academic people. Each article provides details of the completed
application, along with the results and impact on the organization.
Big companies like GM, GE, P&G, J&J, airlines, etc. are frequent
owners of those applications.
To my knowledge, there is not yet a similar journal/magazine in the
field of data mining. Close but not exact parallels are
In terms of the relationship between application and research, OR/MS
and data mining are very similar: both need close interaction between
the two groups of people. But why the difference? Is data mining
still too young to have cultivated the culture, or does data mining
rarely succeed (or, even when it succeeds, lack sufficient technical
merit for publication)?
theory, algorithm, application
"Theory points at promising directions to new algorithms, while
experiments validate them as practical and efficient algorithms ready
for real world applications." This is from the research statement of
Guy Lebanon (a CMU alumnus). The logical relationship among theory,
algorithm and application is clearly stated here. There are
specialized groups of people focusing on different sections of the
chain: those interested in the theory-algorithm part, those interested
in the algorithm-application part, and, of course, those who can
tackle them all. Another insight is that algorithm creation / problem
solving should not be a sporadic process: to be efficient, it is
better guided by theory. The more general a theory is, the better it
can guide us to algorithms that solve a more varied range of real
problems.
data miner's qualification
Qualification is driven by market demand, so it makes sense to let the
facts speak for themselves. Job advertisements are probably the most
straightforward source for this kind of information. I did an
informal analysis of the job qualifications in up to 20 industrial job
advertisements on kdnuggets.com from the 2 most recent months (July
and August 2005). The job responsibilities vary significantly, from
software development to consulting (solution provider) to functional
analytical positions (such as marketing), but at least one part of
each job is data mining relevant. I briefly recorded each company's
industry, the position title, the position responsibilities, and the
qualifications in terms of education, experience and skills. These are
my general findings (if you're curious, you could find my scratch
summary here).
I got the impression that "jack-of-all-trades" sort of people, who
could do modelling, development, deployment..., are highly sought
after. Another observation is that most of the positions are newly
created: the business line is new, or the company is new. As a
result, a large part of those jobs is to "invent": invent new
algorithms, new solutions, new methodology. That probably justifies
the preference for candidates with advanced degrees, and the even
higher preference for Ph.D.s, as they're usually expected to solve
notoriously hard problems and, ultimately, to invent. Even if the
employee is not expected to invent, since most of the standard
algorithms are already there, there is still the problem of tailoring
them to the real situation. Such tasks as parameter tuning, code
hacking, modification and adaptation are inevitable to make the raw
algorithm work, and they're better done by somebody who knows the
"black box" inside out. Another hypothesis for the advanced degree
preference is that the appropriate application of those algorithms
requires a deep background in the area, a level beyond "know-how".
Or, the modelling process is simply so complicated that only people
with advanced degrees could be qualified.
fame of neural nets and decision trees
Sifting through the news on business applications, neural nets and
decision trees dominate. It seems almost a golden rule: for
continuous-valued function approximation, go to neural nets; for
classification/rule mining, go to decision trees.
People seem to be fairly happy with black box solutions like neural
nets if speed is not an issue. For some of them, the black box
doesn't bother them much because they care more about the output than
the internal mechanism. Another guess is that they just accept that
the business world is too complex to be modeled, so they resort to a
purely empirical approach which tells them nothing but the output.
The question is: if they were aware of better alternatives, would
neural nets still prevail? In a book written for financial
professionals, I found this sentence defining neural networks as
"universal approximators in the sense that they can fit any non-linear
function with any degree of accuracy". Without further qualification,
a statement like this easily misleads people into believing that the
neural network is a panacea.
The popularity of decision trees is not hard to understand: the tree
representation and the if ... then ... rule format closely map
people's thought process. There are competitive classification
algorithms, but it's hard to find another one that is as intuitive, as
simple and as powerful as the decision tree.
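To make the "if ... then ..." point concrete, here is a minimal
sketch of how a tree turns into readable rules. The tree itself (its
feature names, thresholds and labels) is made up purely for
illustration, and the nested-dict layout is just one assumed way to
store a tree, not the format of any particular tool.

```python
# A hypothetical toy tree: every feature name, threshold and label
# below is invented for illustration only.
tree = {
    "feature": "income",
    "threshold": 50000,
    "left": {"label": "decline"},  # taken when income <= 50000
    "right": {                     # taken when income > 50000
        "feature": "years_as_customer",
        "threshold": 2,
        "left": {"label": "review"},
        "right": {"label": "approve"},
    },
}


def extract_rules(node, conditions=()):
    """Walk the tree and return one 'if ... then ...' string per leaf."""
    if "label" in node:
        premise = " and ".join(conditions) if conditions else "always"
        return [f"if {premise} then {node['label']}"]
    feature, threshold = node["feature"], node["threshold"]
    left_rules = extract_rules(
        node["left"], conditions + (f"{feature} <= {threshold}",))
    right_rules = extract_rules(
        node["right"], conditions + (f"{feature} > {threshold}",))
    return left_rules + right_rules


def predict(node, record):
    """Follow the branches for one record until a leaf is reached."""
    while "label" not in node:
        go_left = record[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Each root-to-leaf path becomes one rule, which is exactly why a
business user can read the model without knowing how it was trained.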
natural data mining
Continuing the topic of decision trees... it leads me to the topic of
making models more comprehensible and easier to use, or as I coined
it, natural data mining. I can't help relating it to the
Black-Scholes option pricing model, which is now widely used in
derivative securities trading. The derivation of the model requires a
decent level of knowledge in stochastic calculus, which itself hardly
offers any intuition. But the traders are happily using it. Why?
There is a way to interpret the model in terms of replication (i.e.
decomposing the option into several simple investment tools like bonds
and stocks), which requires nothing but high school math. With that,
the business world embraced it almost immediately, with trust.
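The replication idea can be sketched with nothing but arithmetic. Note
the assumptions: this is the one-period binomial setting (the stock
either goes up or down once), not the Black-Scholes formula itself,
and all the prices, the strike and the interest rate below are
hypothetical numbers I picked for illustration. The point is only
that a mix of stock and bond can be chosen to pay exactly what the
option pays, so the option must cost what that mix costs.

```python
def replicate_one_step(s0, s_up, s_down, payoff_up, payoff_down, rate):
    """Find the stock + bond portfolio that matches an option's payoff.

    Solves the two linear equations
        shares * s_up   + bond * (1 + rate) = payoff_up
        shares * s_down + bond * (1 + rate) = payoff_down
    and returns (shares, bond, price), where price is today's cost of
    the portfolio. A negative bond value means borrowing.
    """
    shares = (payoff_up - payoff_down) / (s_up - s_down)
    bond = (payoff_down - shares * s_down) / (1 + rate)
    price = shares * s0 + bond
    return shares, bond, price


# Hypothetical numbers: stock at 100 moves to 120 or 90, a call option
# with strike 100, and a 5% interest rate over the period.
strike = 100.0
s0, s_up, s_down, rate = 100.0, 120.0, 90.0, 0.05
payoff_up = max(s_up - strike, 0.0)      # call pays 20 if the stock rises
payoff_down = max(s_down - strike, 0.0)  # call pays 0 if the stock falls

shares, bond, price = replicate_one_step(
    s0, s_up, s_down, payoff_up, payoff_down, rate)
print(f"hold {shares:.4f} shares, bond position {bond:.4f}, "
      f"option price {price:.4f}")
```

With these numbers the portfolio holds two thirds of a share financed
partly by borrowing, and its cost today comes to about 9.52.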
Back to data mining: except for a few algorithms like decision trees,
most of the others are less intuitive. It's hard to convince people
of a model's value unless they can see it, feel it, touch it and play
with it. This leads to two possible solutions. One is to develop
intuitive model representations in such structures as trees, graphs
and networks; model visualization is as important as data
visualization. The other is to create a really user-friendly
interface to the model parameters, ranging from basic requirements
like meaningful descriptions to more advanced ones like intelligent
aids for parameter tuning.
collaborative mining
Most of the current tools assume that people work alone. That makes
sense when problems are simple and solution search spaces are small,
but looking forward, with the increasing complexity of the task,
people are more likely to work as a team than as individuals. This
challenges the current practice of doing data mining in stand-alone
mode. Indeed, data mining is a knowledge-intensive activity: it
requires a lot of interaction among collaborators to get an idea
across, to critique a current solution, to inspire a new idea. It
also requires managing, sharing and visualizing the progress of the
data mining work so that everybody can see where the project stands
and where it's going. Version control is important, and resource
planning and optimization are important too, in particular because
data mining is computationally intensive.
(I thought I had coined "collaborative mining" in the first place, but
later I found that I am not the only person worrying about this
issue: it's already mentioned in SAS's marketing literature for the
most recent release of their data mining package.)
on demand mining
The idea is inspired by salesforce.com, a rising star in the CRM
solution provider market with its on demand CRM application, i.e. it
allows customers to use its web-based CRM application on a
subscription basis without having to install a single piece of
software at the customer's site. This is a market niche for small to
mid-sized companies that are short of the resources to implement an
in-house CRM application. Analytics is inevitably part of a CRM
package. This is also true of salesforce.com, though their current
analytics seems not to go further than who is doing what when (so
called "real time analytics") and some canned reports under the label
of "sales metrics". Do we have a market for a general data mining
application as well? I think this is justifiable at least for a
certain group of customers who don't have expertise in modelling and
don't want to maintain a team in house, but still want on-going
modelling maintenance. If done well, this service delivery method
could reduce the per-customer modelling/maintenance cost by spreading
the cost across multiple customers.
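A back-of-the-envelope sketch of that cost-spreading argument,
assuming the cost splits into a shared part (the modelling team and
platform, paid once by the provider) and a part that has to be spent
per customer. Both dollar figures below are hypothetical
placeholders, not estimates of any real service.

```python
def cost_per_customer(shared_cost, per_customer_cost, n_customers):
    """Cost borne by each customer when the shared part is split evenly."""
    if n_customers < 1:
        raise ValueError("need at least one customer")
    return shared_cost / n_customers + per_customer_cost


# Hypothetical yearly figures, for illustration only.
SHARED_COST = 120_000.0       # team and platform, paid once by the provider
PER_CUSTOMER_COST = 5_000.0   # customer-specific modelling and upkeep

for n in (1, 10, 40):
    cost = cost_per_customer(SHARED_COST, PER_CUSTOMER_COST, n)
    print(f"{n:>3} customers: {cost:,.0f} per customer")
```

The single-customer case is the in-house situation; the per-customer
figure only drops as far as the customer-specific part allows, which
is why "if done well" matters.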
object oriented mining
data mining productivity
data mining project life cycle management