Data Mining
EMV - CS537 Project Proposal
Classification in Data Mining
Eric Vitrano
Common Level
Class table
methods :
open(filename)
close(filename)
write tuple(char *) /* takes the whole record as an ASCII character
string */
read tuple(tuple number) /* reads the tuple and returns a tuple
instance */
get_scheme /* returns a character string that lists the scheme */
set_scheme /* takes a character string and sets the scheme of the
table to that scheme */
Class tuple
method:
get_attribute(attribute number, location to copy the
attribute value to )
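The interface above could be sketched in C++ roughly as follows. This is only an illustration under stated assumptions: records are held in memory instead of a file (so open and close are left out), a record is assumed to be one line of comma-separated ASCII values, attribute numbers are assumed to start at 0, and the names write_tuple, read_tuple and tuple_count are hypothetical spellings, not part of the proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the proposal's tuple class.
// Assumption: a record is one ASCII line with comma-separated attributes.
class Tuple {
public:
    explicit Tuple(const std::string& record) {
        std::stringstream in(record);
        std::string field;
        while (std::getline(in, field, ',')) fields_.push_back(field);
    }
    // Copies attribute number `index` (0-based) into `out`;
    // returns false when the index is out of range.
    bool get_attribute(std::size_t index, std::string& out) const {
        if (index >= fields_.size()) return false;
        out = fields_[index];
        return true;
    }
private:
    std::vector<std::string> fields_;
};

// Hypothetical table that keeps its records in memory; file-backed
// open/close are omitted to keep the sketch short.
class Table {
public:
    void set_scheme(const std::string& scheme) { scheme_ = scheme; }
    const std::string& get_scheme() const { return scheme_; }
    void write_tuple(const char* record) { records_.push_back(record); }
    Tuple read_tuple(std::size_t number) const {
        assert(number < records_.size());  // caller must pass a valid tuple number
        return Tuple(records_[number]);
    }
    std::size_t tuple_count() const { return records_.size(); }
private:
    std::string scheme_;
    std::vector<std::string> records_;
};
```

A caller would set the scheme once, append records with write_tuple, and then pull individual attributes out of the tuple returned by read_tuple.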
Data Mining Classifiers
Stage 1
Once the above groundwork is complete, I will implement a version of an elementary
data mining classification algorithm. This algorithm will be based on the ID-3
decision tree model, with limited pruning. A summary of the algorithm in pseudocode
form is as follows:
Tree Building
MakeTree (Training Data)
{
Partition (Training Data);
}
Partition (S)
{
If all (s in S) are in the same class, this branch of the tree is done.
Else for each attribute, find the best split (Split (S)), and partition S on it (Split_Partition (S)).
Partition (each partition from above).
}
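The recursion in Partition could look roughly like the sketch below. Several things here are assumptions made for illustration: the Row and Node types are hypothetical, attributes are categorical strings, each split creates one branch per attribute value (as in ID-3) where the pseudocode describes a two-way split, and attributes are simply tried in index order as a placeholder for the call to Split.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical row layout: attribute values plus a known class label.
struct Row {
    std::vector<std::string> attrs;
    std::string label;
};

struct Node {
    bool leaf = false;
    std::string label;          // majority class of the rows that reached this node
    std::size_t attribute = 0;  // attribute tested here when not a leaf
    std::map<std::string, std::unique_ptr<Node>> children;
};

// Most frequent class label in a non-empty set of rows.
std::string majority(const std::vector<Row>& rows) {
    assert(!rows.empty());
    std::map<std::string, int> counts;
    for (const Row& r : rows) ++counts[r.label];
    std::string best;
    int best_count = 0;
    for (const auto& kv : counts)
        if (kv.second > best_count) { best = kv.first; best_count = kv.second; }
    return best;
}

// Partition(S): stop when every row has the same class, otherwise split on
// an attribute and recurse into each partition. Placeholder assumption: the
// attributes are tried in index order (`next`) instead of calling Split(S).
std::unique_ptr<Node> partition(const std::vector<Row>& rows, std::size_t next = 0) {
    std::unique_ptr<Node> node(new Node());
    node->label = majority(rows);
    bool pure = true;
    for (const Row& r : rows)
        if (r.label != rows[0].label) pure = false;
    if (pure || next >= rows[0].attrs.size()) {
        node->leaf = true;
        return node;
    }
    node->attribute = next;
    std::map<std::string, std::vector<Row>> parts;
    for (const Row& r : rows) parts[r.attrs[next]].push_back(r);
    for (const auto& kv : parts)
        node->children[kv.first] = partition(kv.second, next + 1);
    return node;
}
```

Storing the majority class at every node, not only at leaves, is a design choice of this sketch; it gives a fallback answer when a tuple carries an attribute value that was not seen during training.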
Tree Pruning
RemoveNode (Node)
{
For each (child of Node)
If child has the same class value as its parent, remove child.
}
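One possible reading of RemoveNode is sketched below. It is an interpretation, not the proposal's definition: the tree is pruned bottom-up, and a child is removed only when it is a leaf whose class equals the class stored at its parent, so that a lookup which no longer finds a branch can fall back on the parent's class. The Node type and the add_child helper are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Node {
    std::string label;  // class predicted at this node (assumed majority class)
    std::map<std::string, std::unique_ptr<Node>> children;  // empty map means leaf
};

// Helper for building trees by hand: adds a child for attribute value `value`.
Node& add_child(Node& parent, const std::string& value, const std::string& label) {
    assert(parent.children.count(value) == 0);  // one child per attribute value
    std::unique_ptr<Node> child(new Node());
    child->label = label;
    Node& ref = *child;
    parent.children[value] = std::move(child);
    return ref;
}

// Removes every leaf whose class matches its parent's class.
// Returns the number of nodes removed.
int remove_node(Node& node) {
    int removed = 0;
    for (auto it = node.children.begin(); it != node.children.end();) {
        Node& child = *it->second;
        removed += remove_node(child);  // prune the subtree first
        if (child.children.empty() && child.label == node.label) {
            it = node.children.erase(it);  // redundant leaf
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```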
Split Evaluation
Split (Data)
{
For each attribute, calculate the goodness of splitting on that attribute.
Return the attribute with the highest goodness.
}
Split_Partition (Data)
{
Partition Data into two sets based on goodness from Split.
}
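The proposal does not say how goodness is measured. Since the algorithm is based on ID-3, the sketch below assumes information gain: the entropy of the class labels before the split minus the weighted entropy of the partitions after it. The Row type and the function names are hypothetical, and attributes are assumed to be categorical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical row layout: attribute values plus a known class label.
struct Row {
    std::vector<std::string> attrs;
    std::string label;
};

// Shannon entropy (in bits) of the class labels in `rows`.
double entropy(const std::vector<Row>& rows) {
    std::map<std::string, int> counts;
    for (const Row& r : rows) ++counts[r.label];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / rows.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Goodness of splitting on attribute `a`, assumed here to be information gain.
double goodness(const std::vector<Row>& rows, std::size_t a) {
    std::map<std::string, std::vector<Row>> parts;
    for (const Row& r : rows) parts[r.attrs[a]].push_back(r);
    double remainder = 0.0;
    for (const auto& kv : parts)
        remainder += static_cast<double>(kv.second.size()) / rows.size() * entropy(kv.second);
    return entropy(rows) - remainder;
}

// Split(S): index of the attribute with the highest goodness.
std::size_t split(const std::vector<Row>& rows) {
    assert(!rows.empty() && !rows[0].attrs.empty());
    std::size_t best = 0;
    double best_gain = goodness(rows, 0);
    for (std::size_t a = 1; a < rows[0].attrs.size(); ++a) {
        double g = goodness(rows, a);
        if (g > best_gain) { best = a; best_gain = g; }
    }
    return best;
}
```

For four rows whose second attribute determines the class exactly and whose first attribute is unrelated to it, goodness is 0 for the first attribute and 1 bit for the second, so split selects the second.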
The above algorithm will be implemented in Visual C++, with the intention of building a decision
tree that classifies tuples into defined classes. The tree must be trained on a training
set in which the class of each tuple is known, and then tested on data to see whether the
classes it returns are correct. The results can then be used for directing queries on
incoming data, as well as for classifying existing data.
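The testing step described above might be sketched as follows: walk a trained tree to classify each test tuple, then report the fraction classified correctly. The Row and Node types, the function names, and the fallback to a node's own class for an unseen attribute value are all assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical row layout: attribute values plus a known class label.
struct Row {
    std::vector<std::string> attrs;
    std::string label;
};

struct Node {
    std::string label;          // class predicted at this node
    std::size_t attribute = 0;  // attribute tested here when the node has children
    std::map<std::string, std::unique_ptr<Node>> children;  // empty map means leaf
};

// Follows the branches that match the tuple's attribute values and
// returns the class stored at the node where the walk stops.
const std::string& classify(const Node& root, const std::vector<std::string>& attrs) {
    const Node* node = &root;
    while (!node->children.empty()) {
        assert(node->attribute < attrs.size());
        auto it = node->children.find(attrs[node->attribute]);
        if (it == node->children.end()) break;  // unseen value: use this node's class
        node = it->second.get();
    }
    return node->label;
}

// Fraction of test rows whose predicted class equals the known class.
double accuracy(const Node& root, const std::vector<Row>& test) {
    if (test.empty()) return 0.0;
    std::size_t correct = 0;
    for (const Row& r : test)
        if (classify(root, r.attrs) == r.label) ++correct;
    return static_cast<double>(correct) / test.size();
}
```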
Stage 2
Once the above algorithm is implemented, I will implement a second algorithm. This next
algorithm will either be related to SLIQ, or will be derived from observing the
development and behavior of the general algorithm.
Possible areas of improvement would be pruning on the fly, limiting the searches of the data
and the amount of data needed to be kept in memory, and presorting/partial classification of
the data.
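As an illustration of the presorting idea only, and not of SLIQ itself, the sketch below sorts a numeric attribute once, keeping each value's row index, and then finds a split threshold in a single scan of the sorted list. The two-class labels (0 and 1), the use of the Gini index as the split score, and all names are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Entry {
    double value;     // attribute value
    std::size_t row;  // index of the tuple it came from
};

// Sorts one numeric attribute once, remembering where each value came from.
std::vector<Entry> presort(const std::vector<double>& values) {
    std::vector<Entry> list;
    for (std::size_t i = 0; i < values.size(); ++i) list.push_back(Entry{values[i], i});
    std::sort(list.begin(), list.end(),
              [](const Entry& a, const Entry& b) { return a.value < b.value; });
    return list;
}

// Gini index of a two-class set holding `ones` tuples of class 1 out of `n`.
double gini(std::size_t ones, std::size_t n) {
    double p = static_cast<double>(ones) / n;
    return 1.0 - p * p - (1.0 - p) * (1.0 - p);
}

// One scan over the presorted list; returns the midpoint threshold with the
// lowest weighted Gini index (tuples with value <= threshold go left).
double best_threshold(const std::vector<Entry>& list, const std::vector<int>& labels) {
    assert(!list.empty() && labels.size() == list.size());
    std::size_t n = list.size(), total_ones = 0, left_ones = 0;
    for (int l : labels) total_ones += (l == 1);
    double best = list[0].value, best_score = 2.0;  // any real score is below 2
    for (std::size_t i = 0; i + 1 < n; ++i) {
        left_ones += (labels[list[i].row] == 1);
        if (list[i].value == list[i + 1].value) continue;  // no cut between equal values
        std::size_t nl = i + 1, nr = n - nl;
        double score = (nl * gini(left_ones, nl) + nr * gini(total_ones - left_ones, nr)) / n;
        if (score < best_score) {
            best_score = score;
            best = (list[i].value + list[i + 1].value) / 2.0;
        }
    }
    return best;
}
```

Because the attribute is sorted only once, the class counts on each side of a candidate threshold can be updated incrementally during the scan instead of being recomputed for every candidate.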
Time Estimates
I expect progress to follow approximately this schedule:
October 21 - Completion of groundwork steps.
November 4 - Completion of the general algorithm.
November 25 - Completion of Stage 2 algorithm.
December 2 - Evaluation and further consideration of data mining classification.