AI sample inputs

Data Mining Assignment 3: Sample Input

Problem 2

Sample Files:

Coins
Books

Test files

Coins:

We need to determine which coins are valuable for collectors, and we view valuable coins as positive examples; the training set includes six examples:

Class	Rarity	Age	Wear
positive	rare	new	low
positive	rare	old	low
positive	common	old	low
negative	rare	old	high
negative	common	new	low
negative	common	new	high

We compute the information gain of each attribute:

Gain(Rarity) = I(3/6, 3/6) - (3/6) * I(2/3, 1/3) - (3/6) * I(1/3, 2/3) = 0.082

Gain(Age) = I(3/6, 3/6) - (3/6) * I(2/3, 1/3) - (3/6) * I(1/3, 2/3) = 0.082

Gain(Wear) = I(3/6, 3/6) - (4/6) * I(3/4, 1/4) - (2/6) * I(0,1) = 0.459

The "wear" attribute provides the greatest gain, and we use it for the root split:

All training examples with high wear are negative; since both test instances have high wear, they are also classified as negative:

Rarity Age Wear Class

rare new high negative

common old high negative

The coins with low wear require an additional split, but this split does not affect the classification of the given test instances.

Books:

We next learn to distinguish between cheap and expensive books, and we consider expensive books as positive examples. The training set includes eight books described by five attributes:

Class	Bind	Style	Pictures	Popularity	Length
positive	hardcover	novel	nocolor	popular	long
positive	softcover	textbook	nocolor	popular	long
negative	softcover	novel	nocolor	popular	short
positive	hardcover	textbook	color	popular	short
positive	hardcover	journal	color	unknown	short
negative	softcover	textbook	nocolor	unknown	short
positive	hardcover	journal	color	popular	long
negative	softcover	novel	color	unknown	short

We first determine the information gain of each attribute:

Gain(Bind) = I(3/8, 5/8) - (4/8) * I(1, 0) - (4/8) * I(3/4, 1/4) = 0.549

Gain(Style) = I(3/8, 5/8) - (3/8) * I(1/3, 2/3) - (3/8) * I(2/3, 1/3) - (2/8) * I(1, 0) = 0.266

Gain(Pictures) = I(3/8, 5/8) - (4/8) * I(2/4, 2/4) - (4/8) * I(3/4, 1/4) = 0.049

Gain(Popularity) = I(3/8, 5/8) - (5/8) * I(4/5, 1/5) - (3/8) * I(1/3, 2/3) = 0.159

Gain(Length) = I(3/8, 5/8) - (3/8) * I(1, 0) - (5/8) * I(2/5, 3/5) = 0.348

The "bind" attribute gives the greatest gain, and we use it for the root split. All training examples with "hardcover" are positive, whereas the "softcover" examples require an additional split:

We show the "softcover" examples in the following table.

Class Bind Style Pictures Popularity Length

positive softcover textbook nocolor popular long

negative softcover novel nocolor popular short

negative softcover textbook nocolor unknown short

negative softcover novel color unknown short

We next compute the information gains of the remaining attributes for the "softcover" examples:

Gain(Style) = I(1/4, 3/4) - (2/4) * I(1/2, 1/2) - (2/4) * I(0, 1) = 0.311

Gain(Pictures) = I(1/4, 3/4) - (3/4) * I(2/3, 1/3) - (1/4) * I(0, 1) = 0.123

Gain(Popularity) = I(1/4, 3/4) - (2/4) * I(1/2, 1/2) - (2/4) * I(0, 1) = 0.311

Gain(Length) = I(1/4, 3/4) - (1/4) * I(1, 0) - (3/4) * I(0, 1) = 0.811

The "length" attribute provides the greatest gain, and we use it for the next split. The resulting tree shows that hardcover books and long softcover books are expensive, whereas short softcover books are cheap:

This tree leads to the following classification of the test instances:

Bind Style Pictures Popularity Length Class

softcover journal color popular short negative

softcover novel nocolor unknown long positive

hardcover textbook nocolor unknown short positive

Days

The task is to distinguish Sunday from Monday; Sunday is positive, and Monday is negative.

Weather

The attributes are a state, month, and city, and the task is to determine whether a traveler needs a sweater when visiting this city.

Food

We need to decide whether to buy a hamburger; the attributes include the dollar cost of the hamburger, place (Checkers, Burger King, or McDonald's), and time.

Problem 3

Sample Files:

Test files

Multiple classes:

We consider the classification of books into three price categories: cheap, medium, and expensive.

Class	Bind	Style	Pictures	Popularity	Length
expensive	hardcover	novel	nocolor	popular	long
medium	softcover	textbook	nocolor	popular	long
cheap	softcover	novel	nocolor	popular	short
medium	hardcover	textbook	color	popular	short
medium	hardcover	journal	color	unknown	short
cheap	softcover	textbook	nocolor	unknown	short
expensive	hardcover	journal	color	popular	long
cheap	softcover	novel	color	unknown	short

The information gains of the attributes are as follows:

Gain(Bind) = I(2/8, 3/8, 3/8) - (4/8) * I(2/4, 2/4, 0) - (4/8) * I(0, 1/4, 3/4) = 0.656

Gain(Style) = I(2/8, 3/8, 3/8) - (3/8) * I(1/3, 0, 2/3) - (3/8) * I(0, 2/3, 1/3) - (2/8) * I(1/2, 1/2, 0) = 0.622

Gain(Pictures) = I(2/8, 3/8, 3/8) - (4/8) * I(1/4, 1/4, 2/4) - (4/8) * I(1/4, 2/4, 1/4) = 0.061

Gain(Popularity) = I(2/8, 3/8, 3/8) - (5/8) * I(2/5, 2/5, 1/5) - (3/8) * I(0, 1/3, 2/3) = 0.266

Gain(Length) = I(2/8, 3/8, 3/8) - (3/8) * I(2/3, 1/3, 0) - (5/8) * I(0, 2/5, 3/5) = 0.610

The "bind" attribute gives the greatest gain, which leads to the following split:

We next consider the hardcover books, which require an additional split; the set of training examples with "hardcover" is as follows:

Class	Bind	Style	Pictures	Popularity	Length
expensive	hardcover	novel	nocolor	popular	long
medium	hardcover	textbook	color	popular	short
medium	hardcover	journal	color	unknown	short
expensive	hardcover	journal	color	popular	long

The computation of information gains shows that "length" is the most informative attribute:

Gain(Style) = I(1/2, 1/2, 0) - (1/4) * I(1, 0, 0) - (1/4) * I(0, 1, 0) - (2/4) * I(1/2, 1/2, 0) = 0.500

Gain(Pictures) = I(1/2, 1/2, 0) - (1/4) * I(1, 0, 0) - (3/4) * I(1/3, 2/3, 0) = 0.311

Gain(Popularity) = I(1/2, 1/2, 0) - (3/4) * I(2/3, 1/3, 0) - (1/4) * I(0, 1, 0) = 0.311

Gain(Length) = I(1/2, 1/2, 0) - (2/4) * I(1, 0, 0) - (2/4) * I(0, 1, 0) = 1.000

We thus split hardcover books by "length":

We next consider the set of softcover books in the training set:

Class Bind Style Pictures Popularity Length

medium softcover textbook nocolor popular long

cheap softcover novel nocolor popular short

cheap softcover textbook nocolor unknown short

cheap softcover novel color unknown short

The most informative attribute of this set is also "length":

Gain(Style)	=	I(0, 1/4, 3/4) - (2/4) * I(0, 1/2, 1/2) - (2/4) * I(0, 0, 1)	=	0.311
Gain(Pictures)	=	I(0, 1/4, 3/4) - (3/4) * I(0, 1/3, 2/3) - (1/4) * I(0, 0, 1)	=	0.123
Gain(Popularity)	=	I(0, 1/4, 3/4) - (2/4) * I(0, 1/2, 1/2) - (2/4) * I(0, 0, 1)	=	0.311
Gain(Length)	=	I(0, 1/4, 3/4) - (1/4) * I(0, 1, 0) - (3/4) * I(0, 0, 1)	=	0.811

We split the softcover books by "length," which gives the final tree:

We now use this tree to classify the test instances:

Bind	Style	Pictures	Popularity	Length	Class
hardcover	novel	nocolor	unknown	long	expensive
hardcover	journal	color	popular	short	medium
softcover	textbook	color	popular	long	medium
softcover	novel	nocolor	unknown	short	cheap

Numeric Attributes:

We again consider the problem of identifying valuable coins, and we view valuable coins as positive examples. We now represent wear as a number between 0 and 100%. The training set includes six examples:

Class	Rarity	Age	Wear
positive	rare	new	5%
positive	rare	old	10%
positive	common	old	4%
negative	rare	old	20%
negative	common	new	2%
negative	common	new	25%

The candidate thresholds for the "wear" attribute are 3% and 15%; thus, we replace this attribute with two Boolean attributes: "Wear > 3" and "Wear > 15":

Class

Rarity

Age

Wear > 3
Wear > 15

positive

rare

new

yes
no

positive

rare

old

yes
no

positive

common

old

yes
no

negative

rare

old

yes
yes

negative

common

new

no
no

negative

common

new

yes
yes

We compute the information gains of these four attributes:

Gain(Rarity) = I(3/6, 3/6) - (3/6) * I(2/3, 1/3) - (3/6) * I(1/3, 2/3) = 0.082

Gain(Age) = I(3/6, 3/6) - (3/6) * I(1/3, 2/3) - (3/6) * I(2/3, 1/3) = 0.082

Gain(Wear > 3) = I(3/6, 3/6) - (5/6) * I(3/5, 2/5) - (1/6) * I(0, 1) = 0.191

Gain(Wear > 15) = I(3/6, 3/6) - (4/6) * I(3/4, 1/4) - (2/6) * I(0,1) = 0.459

The most informative attribute is "Wear > 15," which leads to the following split:

We next consider the training examples with "wear" below 15%:

Class

Rarity

Age

Wear > 3
Wear > 15

positive

rare

new

yes
no

positive

rare

old

yes
no

positive

common

old

yes
no

negative

common

new

no
no

The information gains for the remaining three attributes are as follows:

Gain(Rarity) = I(3/4, 1/4) - (2/4) * I(1, 0) - (2/4) * I(1/2, 1/2) = 0.311

Gain(Age) = I(3/4, 1/4) - (2/4) * I(1/2, 1/2) - (2/4) * I(1, 0) = 0.311

Gain(Wear > 3) = I(3/4, 1/4) - (3/4) * I(1, 0) - (1/4) * I(0, 1) = 0.811

The "wear" attribute is again the most informative, which leads to the following tree:

We use this tree to classify the test instances:

Rarity Age Wear Class

rare new 20 negative

common old 10 positive

common new 1 negative

Missing Values:

We consider the problem of identifying valuable books, and we use a training set with two missing values:

Class	Bind	Style	Pictures	Popularity	Length
positive	hardcover	novel	nocolor	popular	long
positive	softcover	textbook	nocolor	popular	long
negative	softcover	novel	nocolor	popular	MISSING
positive	hardcover	textbook	color	MISSING	short
positive	hardcover	journal	color	unknown	short
negative	softcover	textbook	nocolor	unknown	short
positive	hardcover	journal	color	popular	long
negative	softcover	novel	color	unknown	short

We observe that the majority of positive examples are "popular," and the majority of negative examples are "short," which leads to the following "repair" of the missing values:

Class Bind Style Pictures Popularity Length

positive hardcover novel nocolor popular long

positive softcover textbook nocolor popular long

negative softcover novel nocolor popular SHORT

positive hardcover textbook color POPULAR short

positive hardcover journal color unknown short

negative softcover textbook nocolor unknown short

positive hardcover journal color popular long

negative softcover novel color unknown short

The information gains for these repaired examples are as follows:

Gain(Bind) = I(3/8, 5/8) - (4/8) * I(1, 0) - (4/8) * I(3/4, 1/4) = 0.549

Gain(Style) = I(3/8, 5/8) - (3/8) * I(1/3, 2/3) - (3/8) * I(2/3, 1/3) - (2/8) * I(1, 0) = 0.266

Gain(Pictures) = I(3/8, 5/8) - (4/8) * I(2/4, 2/4) - (4/8) * I(3/4, 1/4) = 0.049

Gain(Popularity) = I(3/8, 5/8) - (5/8) * I(4/5, 1/5) - (3/8) * I(1/3, 2/3) = 0.159

Gain(Length) = I(3/8, 5/8) - (3/8) * I(1, 0) - (5/8) * I(2/5, 3/5) = 0.348

The "bind" attribute provides the greatest gain, which leads to the following split:

We next consider the softcover books, which require an additional split:

Class Bind Style Pictures Popularity Length

positive softcover textbook nocolor popular long

negative softcover novel nocolor popular MISSING

negative softcover textbook nocolor unknown short

negative softcover novel color unknown short

The majority of negative examples in this table are short books, which means that we again replace the missing value with "short":

Class Bind Style Pictures Popularity Length

positive softcover textbook nocolor popular long

negative softcover novel nocolor popular SHORT

negative softcover textbook nocolor unknown short

negative softcover novel color unknown short

We compute the information gain of the remaining four attributes:

Gain(Style) = I(1/4, 3/4) - (2/4) * I(1/2, 1/2) - (2/4) * I(0, 1) = 0.311

Gain(Pictures) = I(1/4, 3/4) - (3/4) * I(2/3, 1/3) - (1/4) * I(0, 1) = 0.123

Gain(Popularity) = I(1/4, 3/4) - (2/4) * I(1/2, 1/2) - (2/4) * I(0, 1) = 0.311

Gain(Length) = I(1/4, 3/4) - (1/4) * I(1, 0) - (3/4) * I(0, 1) = 0.811

The "length" attribute is the most informative, which leads to the following tree:

This tree is identical to the tree in Problem 2, and we get the same classification of the test instances:

Bind Style Pictures Popularity Length Class

softcover journal color popular short negative

hardcover novel nocolor unknown long positive

softcover textbook nocolor unknown long positive

Gasoline

The task is to select an appropriate gasoline type, which may depend on the vehicle type (sports, SUV, or economy), engine compression (low, medium, or high), and income of the driver (poor or rich).

Loans

We need to approve or disapprove a loan application; the attributes include the income of an applicant, graduate education ("yes" if she has a graduate degree), and home ownership ("yes" if she owns her home).

Medicine

We need to determine whether a patient needs a new medication based on four symptoms: fever, blood pressure (low, medium, or high), headache (yes or no), and allergy.

Back to the Data Mining home page

Gain(Rarity)	=	I(3/6, 3/6) - (3/6) * I(2/3, 1/3) - (3/6) * I(1/3, 2/3)	=	0.082
Gain(Age)	=	I(3/6, 3/6) - (3/6) * I(2/3, 1/3) - (3/6) * I(1/3, 2/3)	=	0.082
Gain(Wear)	=	I(3/6, 3/6) - (4/6) * I(3/4, 1/4) - (2/6) * I(0,1)	=	0.459

Gain(Bind)	=	I(3/8, 5/8) - (4/8) * I(1, 0) - (4/8) * I(3/4, 1/4)	=	0.549
Gain(Style)	=	I(3/8, 5/8) - (3/8) * I(1/3, 2/3) - (3/8) * I(2/3, 1/3) - (2/8) * I(1, 0)	=	0.266
Gain(Pictures)	=	I(3/8, 5/8) - (4/8) * I(2/4, 2/4) - (4/8) * I(3/4, 1/4)	=	0.049
Gain(Popularity)	=	I(3/8, 5/8) - (5/8) * I(4/5, 1/5) - (3/8) * I(1/3, 2/3)	=	0.159
Gain(Length)	=	I(3/8, 5/8) - (3/8) * I(1, 0) - (5/8) * I(2/5, 3/5)	=	0.348

Gain(Style)	=	I(1/4, 3/4) - (2/4) * I(1/2, 1/2) - (2/4) * I(0, 1)	=	0.311
Gain(Pictures)	=	I(1/4, 3/4) - (3/4) * I(2/3, 1/3) - (1/4) * I(0, 1)	=	0.123
Gain(Popularity)	=	I(1/4, 3/4) - (2/4) * I(1/2, 1/2) - (2/4) * I(0, 1)	=	0.311
Gain(Length)	=	I(1/4, 3/4) - (1/4) * I(1, 0) - (3/4) * I(0, 1)	=	0.811

Gain(Bind)	=	I(2/8, 3/8, 3/8) - (4/8) * I(2/4, 2/4, 0) - (4/8) * I(0, 1/4, 3/4)	=	0.656
Gain(Style)	=	I(2/8, 3/8, 3/8) - (3/8) * I(1/3, 0, 2/3) - (3/8) * I(0, 2/3, 1/3) - (2/8) * I(1/2, 1/2, 0)	=	0.622
Gain(Pictures)	=	I(2/8, 3/8, 3/8) - (4/8) * I(1/4, 1/4, 2/4) - (4/8) * I(1/4, 2/4, 1/4)	=	0.061
Gain(Popularity)	=	I(2/8, 3/8, 3/8) - (5/8) * I(2/5, 2/5, 1/5) - (3/8) * I(0, 1/3, 2/3)	=	0.266
Gain(Length)	=	I(2/8, 3/8, 3/8) - (3/8) * I(2/3, 1/3, 0) - (5/8) * I(0, 2/5, 3/5)	=	0.610

Gain(Style)	=	I(1/2, 1/2, 0) - (1/4) * I(1, 0, 0) - (1/4) * I(0, 1, 0) - (2/4) * I(1/2, 1/2, 0)	=	0.500
Gain(Pictures)	=	I(1/2, 1/2, 0) - (1/4) * I(1, 0, 0) - (3/4) * I(1/3, 2/3, 0)	=	0.311
Gain(Popularity)	=	I(1/2, 1/2, 0) - (3/4) * I(2/3, 1/3, 0) - (1/4) * I(0, 1, 0)	=	0.311
Gain(Length)	=	I(1/2, 1/2, 0) - (2/4) * I(1, 0, 0) - (2/4) * I(0, 1, 0)	=	1.000

Gain(Rarity)	=	I(3/4, 1/4) - (2/4) * I(1, 0) - (2/4) * I(1/2, 1/2)	=	0.311
Gain(Age)	=	I(3/4, 1/4) - (2/4) * I(1/2, 1/2) - (2/4) * I(1, 0)	=	0.311
Gain(Wear > 3)	=	I(3/4, 1/4) - (3/4) * I(1, 0) - (1/4) * I(0, 1)	=	0.811