Package mekano :: Module dataset :: Class Dataset

Class Dataset

A Dataset with 'documents' and 'labels'

Documents should be AtomVector-like, i.e. they should be iterable, yielding (a,v) pairs.

>>> ds = Dataset("reuters")
>>> ds.add(doc, labels)
>>> ds.digest()
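
The "AtomVector-like" requirement above only asks that a document be iterable and yield (a,v) pairs. A minimal sketch of such an object follows; the class name SparseDoc and its dict-backed storage are hypothetical and are not part of mekano.

```python
class SparseDoc:
    """Hypothetical stand-in for an AtomVector-like document."""

    def __init__(self, weights):
        # weights maps an integer atom id (a) to a numeric value (v)
        self._weights = dict(weights)

    def __iter__(self):
        # Yield (a, v) pairs, sorted by atom id so the order is reproducible
        return iter(sorted(self._weights.items()))


doc = SparseDoc({3: 1.0, 1: 2.5})
pairs = list(doc)  # the (a, v) pairs a consumer would see
```

Any object that behaves this way under iteration should be usable wherever a document is expected, assuming nothing beyond iteration is required.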

To Do: Fix semantics of labels.

Instance Methods

__init__(self, name='')

__iter__(self)
Iterate over (doc, labels) tuples.

add(self, doc, labels)
Add a (doc, labels) pair to the dataset.

digest(self, force=False)
Analyze the data and generate an internal list of labels.

isBinary(self)

getCategoryCounts(self)
Get a dictionary of labels and respective document counts.

__repr__(self)

toMultiClassSVM(self, fout)
Write a multi-class dataset to fout in SVM format.

toSVM(self, fout)
Write a binary dataset to fout in SVM format.

toSVMSubsequent(self, fout, positions)

toSMART(self, fout)

binarize(self)
Create and return binary datasets.

makeWeighted(self, cs=None)
Convert to a weighted (e.g. LTC) dataset.

subset(self, count)
Creates count subsets of the dataset.

kfold(self, count)
Create cross-validation folds.

__add__(self, other)
Add two datasets.

Static Methods

fromSMART(filename, linkto=None)

from_rainbow(filename, linkto=None)
Create a dataset from rainbow's output.

Method Details

add(self, doc, labels)

Add a (doc, labels) pair to the dataset

'labels' can be either a sequence (e.g. [1,2,5]) or a single value (e.g. True or False).
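
Because 'labels' may arrive in either form, code that consumes (doc, labels) pairs may want to normalize it first. The helper below is a hypothetical sketch of one way to do that; it is not taken from mekano, and how add() treats the two forms internally is not documented here.

```python
def normalize_labels(labels):
    """Return labels as a list, wrapping a single value in a one-item list."""
    # Assumption: a string is one label, not a sequence of characters
    if isinstance(labels, (str, bytes)):
        return [labels]
    try:
        return list(labels)
    except TypeError:
        # Not iterable, e.g. True, False, or a single integer
        return [labels]


as_list = normalize_labels([1, 2, 5])
single = normalize_labels(True)
```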

digest(self, force=False)

Analyze the data and generate an internal list of labels.

Useful for binarizing etc.

getCategoryCounts(self)

Get a dictionary of labels and respective document counts.

This is an O(n) operation!
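
The linear cost comes from visiting every (doc, labels) pair once. A minimal sketch of such a single pass, assuming each 'labels' value is a sequence; the function name and the sample pairs are illustrative only.

```python
from collections import Counter


def category_counts(pairs):
    """Count how many documents carry each label, in one pass."""
    counts = Counter()
    for _doc, labels in pairs:
        # Each label on a document contributes one to that label's count
        counts.update(labels)
    return dict(counts)


pairs = [("d1", ["earn", "acq"]), ("d2", ["earn"]), ("d3", [])]
counts = category_counts(pairs)
```

If the counts are needed repeatedly, caching the result on the caller's side avoids repeating the pass.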

toMultiClassSVM(self, fout)

Write a multi-class dataset to fout in SVM format.

This can be directly consumed by LIBSVM.
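
For reference, LIBSVM's sparse text format puts one example per line as a label followed by index:value pairs in ascending index order. The sketch below writes that format from (doc, label) pairs; the function name is hypothetical, and it assumes integer class labels and positive integer atom ids.

```python
import io


def write_libsvm(pairs, fout):
    """Write (doc, label) pairs as '<label> <index>:<value> ...' lines."""
    for doc, label in pairs:
        # Sort by atom id: LIBSVM expects ascending feature indices
        feats = " ".join("%d:%g" % (a, v) for a, v in sorted(doc))
        fout.write("%d %s\n" % (label, feats))


buf = io.StringIO()
write_libsvm([([(3, 1.0), (1, 0.5)], 2), ([(2, 4.0)], 0)], buf)
text = buf.getvalue()
```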

toSVM(self, fout)

Write a binary dataset to fout in SVM format.

Returns the byte positions of the labels, which can be used by toSVMSubsequent() to overwrite the labels with something else.
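
The idea of recording label positions and overwriting them later can be sketched as follows. This is an illustration of the technique, not mekano's implementation: the function names are hypothetical, and it assumes a binary-mode file object and fixed-width labels ("+1" and "-1") so that an overwrite never shifts the rest of the line.

```python
import io


def write_binary_svm(pairs, fout):
    """Write (doc, is_positive) pairs; return the byte offset of each label."""
    positions = []
    for doc, is_positive in pairs:
        positions.append(fout.tell())  # the label starts at this offset
        label = b"+1" if is_positive else b"-1"
        feats = " ".join("%d:%g" % (a, v) for a, v in sorted(doc))
        fout.write(label + b" " + feats.encode("ascii") + b"\n")
    return positions


def overwrite_labels(fout, positions, new_labels):
    """Replace each label in place; the feature text is left untouched."""
    for pos, is_positive in zip(positions, new_labels):
        fout.seek(pos)
        fout.write(b"+1" if is_positive else b"-1")
    fout.seek(0, io.SEEK_END)


buf = io.BytesIO()
docs = [[(1, 0.5)], [(2, 4.0)]]
positions = write_binary_svm(zip(docs, [True, False]), buf)
overwrite_labels(buf, positions, [False, True])
result = buf.getvalue()
```

Rewriting only the labels avoids serializing the feature vectors again when the same documents are reused with a different labeling, for example once per category.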

binarize(self)

Create and return binary datasets.

Returns:
A {k:v} dictionary where k is a category name, and v is a binary dataset.
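
One natural reading of this return value is a one-vs-rest split: for each category, every document is relabeled True if it carries that category and False otherwise. The sketch below illustrates that reading with plain lists standing in for datasets; it is an assumption about the semantics, not mekano's code.

```python
def binarize(pairs, categories):
    """Return {category: [(doc, bool), ...]} using a one-vs-rest relabeling."""
    return {
        cat: [(doc, cat in labels) for doc, labels in pairs]
        for cat in categories
    }


pairs = [("d1", ["earn", "acq"]), ("d2", ["earn"])]
binary = binarize(pairs, ["earn", "acq"])
```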

makeWeighted(self, cs=None)

Convert to a weighted (e.g. LTC) dataset

Parameters:
  • cs - An optional CorpusStats object; if omitted, one will be created and associated with the dataset.
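
Assuming "LTC" refers to the SMART weighting scheme (log term frequency, inverse document frequency, cosine normalization), the computation can be sketched as below. The function name and the dict-based documents are hypothetical; in this sketch the document frequencies computed in the first loop play the role that corpus statistics would play.

```python
import math


def ltc_weight(docs):
    """Apply SMART 'ltc' weighting to a list of {atom: term_frequency} dicts."""
    n_docs = len(docs)
    # Document frequency: in how many documents each atom occurs
    df = {}
    for doc in docs:
        for atom in doc:
            df[atom] = df.get(atom, 0) + 1

    weighted = []
    for doc in docs:
        # l: 1 + log(tf); t: log(N / df). Assumes every tf is positive.
        raw = {
            a: (1.0 + math.log(tf)) * math.log(n_docs / df[a])
            for a, tf in doc.items()
        }
        # c: divide by the Euclidean norm (cosine normalization)
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({a: (w / norm if norm else 0.0) for a, w in raw.items()})
    return weighted


docs = [{1: 2, 2: 1}, {2: 3}]
weighted = ltc_weight(docs)
```

Note that an atom occurring in every document gets an idf of zero, so its weight vanishes regardless of its term frequency.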

subset(self, count)

Creates count subsets of the dataset.

Subsetting is performed using round-robin.

Parameters:
  • count - Number of subsets to create
Returns:
A list of datasets
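
Round-robin subsetting deals items out in turn, so item i lands in subset i mod count. A minimal sketch using plain lists in place of datasets; the function name is illustrative.

```python
def round_robin_subsets(items, count):
    """Split items into `count` lists, dealing them out in turn."""
    subsets = [[] for _ in range(count)]
    for i, item in enumerate(items):
        subsets[i % count].append(item)
    return subsets


parts = round_robin_subsets(list(range(7)), 3)
```

With this scheme the subset sizes differ by at most one, and each subset samples the whole input order rather than one contiguous block.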

kfold(self, count)

Create cross-validation folds.

The dataset is broken into `count` pieces; each fold (i.e. train-test pair) is created by assigning 1 piece to `train` and `count-1` pieces to `test`.

Parameters:
  • count - Number of folds
Returns:
A list of [train,test] datasets
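
The sketch below follows the description above literally, using plain lists in place of datasets. It assumes the pieces are formed by round-robin, as in subset(); that detail, like the function body itself, is an assumption rather than mekano's code.

```python
def kfold(items, count):
    """Return `count` [train, test] pairs built from round-robin pieces."""
    # Piece i holds items i, i + count, i + 2*count, ...
    pieces = [items[i::count] for i in range(count)]
    folds = []
    for i in range(count):
        # As described above: one piece goes to train, the rest to test
        train = list(pieces[i])
        test = [x for j, piece in enumerate(pieces) if j != i for x in piece]
        folds.append([train, test])
    return folds


folds = kfold(list(range(6)), 3)
```

Swapping the two roles would give the more common arrangement, in which each fold trains on count-1 pieces and tests on the remaining one.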

__add__(self, other)
(Addition operator)

Add two datasets.

If both datasets are non-empty, then they must be 'compatible', i.e., share the same factories and corpus stats.

The resulting dataset combines the docs and labels, and inherits the factories and corpus stats of the non-empty parent dataset.

If both parents were digested, the resulting dataset is also digested.
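
These rules can be sketched with a small stand-in class. MiniDataset and its attributes are hypothetical, a single `factory` attribute stands in for the factories and corpus stats, and the choice of exception is an assumption.

```python
class MiniDataset:
    """Hypothetical stand-in used only to illustrate the addition rules."""

    def __init__(self, factory=None):
        self.factory = factory
        self.pairs = []
        self.digested = False

    def __add__(self, other):
        # Two non-empty datasets must share the same factory object
        if self.pairs and other.pairs and self.factory is not other.factory:
            raise ValueError("datasets do not share a factory")
        # Inherit the factory from the non-empty parent
        parent = self if self.pairs else other
        result = MiniDataset(parent.factory)
        result.pairs = self.pairs + other.pairs
        # Digested only if both parents were digested
        result.digested = self.digested and other.digested
        return result


shared = object()
a = MiniDataset(shared)
a.pairs = [("d1", [1])]
b = MiniDataset(shared)
b.pairs = [("d2", [2])]
a.digested = b.digested = True
c = a + b
```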

from_rainbow(filename, linkto=None)
Static Method

Create a dataset from rainbow's output.

$ rainbow -d model --index 20news/train/*
$ rainbow -d model --print-matrix=siw > train.txt

>>> ds = from_rainbow("train.txt")

ds.catfactory holds the AtomFactory for category names. ds.tokenfactory holds the AtomFactory for the tokens.

A test set should share its factories with a training set. Therefore, read it like so:

>>> ds2 = from_rainbow("testfile.txt", linkto = ds)
Parameters:
  • filename - File containing rainbow's output
  • linkto - Another dataset whose AtomFactory we should borrow.
Returns:
A brand new dataset.
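
Why a test set should borrow the training set's factories can be illustrated with a toy name-to-id mapping. NameFactory and read_tokens below are hypothetical stand-ins; they do not reproduce AtomFactory's interface or rainbow's file format.

```python
class NameFactory:
    """Hypothetical stand-in: maps each distinct name to a stable integer id."""

    def __init__(self):
        self._ids = {}

    def __call__(self, name):
        # Assign the next id (starting at 1) the first time a name is seen
        return self._ids.setdefault(name, len(self._ids) + 1)


def read_tokens(lines, factory=None):
    """Convert whitespace-separated tokens to ids, reusing `factory` if given."""
    factory = factory if factory is not None else NameFactory()
    docs = [[factory(tok) for tok in line.split()] for line in lines]
    return docs, factory


train_docs, factory = read_tokens(["oil price", "price rise"])
linked_docs, _ = read_tokens(["rise oil"], factory=factory)
unlinked_docs, _ = read_tokens(["rise oil"])
```

With the shared factory, "oil" maps to the same id in both sets; without it, the test set numbers its tokens from scratch and the ids no longer line up with the training data.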