opinions

Listen to what veterans, experts, gurus have to say...



Usama Fayyad's 10 grand challenges in KDD
(full text: SIGKDD Explorations, December 2003 issue)


Technical

  1. How does the data grow? (Large databases grow over time from dynamic and changing sources, yet we have no tools for modelling this growth formally.)
  2. Complexity/understandability trade-off
  3. Interestingness
  4. Scalability (to large volumes, high dimensions and a variety of data types; we need a framework for graceful degradation between high dimension and high fidelity in reduced subspaces, an equivalent of SVD)
  5. A theory for what we do? (a fundamental theory)
Pragmatic
  1. Where is the data? (Instead of data overload, we start our projects by finding that the data are not there. We need to let our algorithms go where the data are, and we need to build solutions that contain data management infrastructure.)
  2. Embedding algorithms and solutions within operational systems (... for an algorithm to be truly interesting, it needs to solve a real problem and be robust to all the variety and inconsistency that comes with everyday life)
  3. Integrating domain knowledge (Common sense reasoning is difficult, but we can build highly focused and deeply integrated solutions with sufficient knowledge embedded to exhibit deep common sense reasoning.)
  4. Managing and maintaining models (What happens with historical models? How are they retired? When are they updated in the whole data mining project lifecycle?)
  5. Effectiveness measurement (metrics for measuring quality and fidelity, and for measuring success and return on investment; without measurement there is no management or understanding)

Usama Fayyad's comments on SQL and data mining tools
(full text: Drilling Down With A Data Mining Pioneer, June 2002)

(On the problem with SQL queries)

"... because the interface was designed to address problems where you know the target and you want the database to quickly retrieve the result. If you don't have an exact description of the target, you're lost with a database today. This is why data mining is seeing a lot of demand."


(On the problem with data mining tools)

"There are many tools available from companies such as SAS or IBM, but in order to use them properly, you had better be an expert, preferably a Ph.D. in the area of data mining or statistics. If you're not, you just bought a bunch of shelfware. ... For most users, data mining tools offer the wrong interface. You need data mining solutions. If you have a large staff of experts who know data mining very well, data mining tools will do the job. However, this department of experts is now acting as the interface between the tools and the ultimate user."

 

Prof. Padhraic Smyth (UC Irvine)
(full text: Breaking out of the black-box: research challenges in data mining)

This position paper takes the point of view of two groups of data mining users: business persons (BP) and scientific persons (SP). The author hypothesizes that the two groups differ in focus and style in the data mining process and should therefore be analyzed separately. BP focus on model evaluation and deployment, while SP focus on exploring, visualizing and modeling. BP are keen to use a model's predictions to guide business decisions, whereas SP seek to "look inside the box" to discover new knowledge. As a result, BP often use predictive models while SP prefer generative models.


General Challenge:


A general-purpose data mining software environment. The next generation of interactive, user-centered data exploration tools.

Challenges for BP:
Challenges for SP:

Dr. Diego Kuonen (founder of Statoo Consulting) comments on the data mining process
(full text: DATA MINING AND STATISTICS: WHAT IS THE CONNECTION?)

We think of data mining as the process of identifying valid, novel, potentially useful, and ultimately understandable patterns or models in data to make crucial business decisions. “Valid” means that the patterns hold in general, “novel” that we did not know the pattern beforehand, and “understandable” means that we can interpret and comprehend the patterns. Hence, like statistics, data mining is not only modelling and prediction, nor a product that can be bought, but a whole problem-solving cycle/process that must be mastered through team effort.



Heikki Mannila's comments on theoretical frameworks for data mining
(full text: Theoretical Frameworks for Data Mining)

In this article, the author tries to lay out possible theoretical frameworks for data mining. Simply adapting or borrowing a framework from statistics or machine learning is rejected, mostly because those frameworks do not apply directly to data mining practice. In particular, the author points out several special aspects of data mining: database integration, simplicity of use, and understandability of the results.

Four other approaches are proposed:

(1) probabilistic approach: however, it lacks a way to take the iterative and interactive nature of the data mining process into account

(2) data compression approach: views knowledge discovery as compression, i.e. finding a representation of the information that uses fewer bits

(3) microeconomic view: find the decision x that maximizes a utility function f(x)

(4) inductive database: "THERE IS NO SUCH THING AS KNOWLEDGE DISCOVERY, IT'S ALL IN THE POWER OF THE QUERY LANGUAGE"


The author's favorite approach is to combine (3) and (4).
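
As a rough illustration of the microeconomic view in (3), the Python sketch below picks the decision x that maximizes a utility function f(x) over a small set of candidates. Everything in it is an assumption made for illustration and does not come from Mannila's article: the customer figures are invented, the utility is taken to be simple revenue, and the helper name best_decision is hypothetical.

```python
from typing import Callable, Iterable, TypeVar

X = TypeVar("X")


def best_decision(candidates: Iterable[X], utility: Callable[[X], float]) -> X:
    """Return the candidate decision x with the largest utility f(x)."""
    return max(candidates, key=utility)


# Hypothetical data: the highest price each customer would accept.
willingness_to_pay = [5.0, 8.0, 8.0, 12.0, 20.0]


def revenue(price: float) -> float:
    """Assumed utility f(x): revenue if every customer who accepts the price buys once."""
    buyers = sum(1 for w in willingness_to_pay if w >= price)
    return price * buyers


# Candidate decisions: only prices that some customer is exactly willing to pay.
candidate_prices = sorted(set(willingness_to_pay))
best_price = best_decision(candidate_prices, revenue)

print(best_price, revenue(best_price))
```

In this toy setting the data (what customers will pay) only matter through their effect on f(x), which is the sense in which this view ties the value of mined patterns to the decisions they improve.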