Package mekano :: Module dataset :: Class Dataset

Class Dataset

A Dataset with 'documents' and 'labels'

Documents should be AtomVector-like, i.e., they should be iterable and yield (a,v) pairs.

>>> ds = Dataset("reuters")
>>> ds.add(doc, labels)
>>> ds.digest()
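As a rough illustration of the AtomVector-like contract described above, the sketch below defines a hypothetical stand-in class whose iteration yields (a,v) pairs. `SparseDoc` is not part of mekano, and the choice of integer atoms and float values is an assumption made only for this example.

```python
class SparseDoc:
    """Hypothetical stand-in for an AtomVector-like document (not mekano's class)."""

    def __init__(self, weights):
        # weights: mapping from atom (assumed here to be an int id) to value
        self._weights = dict(weights)

    def __iter__(self):
        # Yield (a, v) pairs, sorted by atom so the order is stable
        return iter(sorted(self._weights.items()))


doc = SparseDoc({3: 1.0, 1: 2.0})
pairs = list(doc)
```

Any object that iterates this way should satisfy the stated requirement, whatever its internal storage.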

To Do: Fix semantics of labels.

Instance Methods

__init__(self, name='')

__iter__(self)
Iterate over (doc, labels) tuples.

add(self, doc, labels)
Add a (doc, labels) pair to the dataset

digest(self, force=False)
Analyze the data and generate an internal list of labels.

isBinary(self)

getCategoryCounts(self)
Get a dictionary of labels and respective document counts.

__repr__(self)

toMultiClassSVM(self, fout)
Write a multi-class dataset to fout in SVM format.

toSVM(self, fout)
Write a binary dataset to fout in SVM format.

toSVMSubsequent(self, fout, positions)

toSMART(self, fout)

binarize(self)
Create and return binary datasets.

makeWeighted(self, cs=None)
Convert to a weighted (e.g. LTC) dataset.

subset(self, count)
Creates count subsets of the dataset.

kfold(self, count)
Create cross-validation folds.

__add__(self, other)
Add two datasets.

Static Methods

fromSMART(filename, linkto=None)

from_rainbow(filename, linkto=None)
Create a dataset from rainbow's output.

Method Details

add(self, doc, labels)

Add a (doc, labels) pair to the dataset

'labels' can be either a sequence (e.g. [1,2,5]) or a single value (e.g. True or False).

digest(self, force=False)

Analyze the data and generate an internal list of labels.

Useful for binarizing etc.

getCategoryCounts(self)

Get a dictionary of labels and respective document counts.

This is an O(n) operation!
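The linear cost presumably comes from visiting every document once. The sketch below shows that kind of single pass over plain (doc, labels) pairs; `category_counts` is a hypothetical helper, not mekano's implementation, and it assumes each document's labels are a sequence of category names.

```python
from collections import Counter


def category_counts(pairs):
    """Count documents per label in a single pass over (doc, labels) pairs."""
    counts = Counter()
    for _doc, labels in pairs:
        # Assumption: labels is a sequence; set() counts a document once per label
        counts.update(set(labels))
    return dict(counts)


data = [("d1", ["sports"]), ("d2", ["sports", "politics"]), ("d3", ["politics"])]
counts = category_counts(data)
```

Because the whole dataset is scanned on each call, it may be worth caching the result when it is needed repeatedly.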

toMultiClassSVM(self, fout)

Write a multi-class dataset to fout in SVM format.

This can be directly consumed by LIBSVM.
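For orientation, LIBSVM's text format puts one example per line as a label followed by `index:value` pairs in ascending index order. The sketch below writes one such line; `write_libsvm_line` is a hypothetical helper, and it assumes atoms are positive integer indices and labels are integers.

```python
import io


def write_libsvm_line(fout, label, doc):
    """Write one example as '<label> <index>:<value> ...' with ascending indices."""
    # doc is any iterable of (index, value) pairs; sort to get ascending indices
    features = " ".join(f"{a}:{v}" for a, v in sorted(doc))
    fout.write(f"{label} {features}\n")


buf = io.StringIO()
write_libsvm_line(buf, 2, [(7, 0.5), (3, 1.0)])
line = buf.getvalue()
```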

toSVM(self, fout)

Write a binary dataset to fout in SVM format.

Returns the byte positions of the labels, which can be used by toSVMSubsequent() to overwrite the labels with something else.
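The idea of recording label positions and overwriting them later can be sketched as follows. This is an illustration under stated assumptions, not mekano's code: the helper names are hypothetical, labels are assumed to be the fixed-width strings `+1` and `-1`, and the file is assumed to be opened in binary mode so that `tell()` returns byte offsets.

```python
import io


def write_binary_svm(fout, examples):
    """Write '+1'/'-1' labelled lines; return the byte offset of each label."""
    positions = []
    for is_positive, doc in examples:
        positions.append(fout.tell())  # offset where this line's label starts
        label = b"+1" if is_positive else b"-1"
        features = " ".join(f"{a}:{v}" for a, v in sorted(doc)).encode("ascii")
        fout.write(label + b" " + features + b"\n")
    return positions


def overwrite_labels(fout, positions, new_labels):
    """Overwrite each fixed-width label in place, leaving the features untouched."""
    for pos, is_positive in zip(positions, new_labels):
        fout.seek(pos)
        fout.write(b"+1" if is_positive else b"-1")


buf = io.BytesIO()
positions = write_binary_svm(buf, [(True, [(1, 0.5)]), (False, [(2, 1.0)])])
overwrite_labels(buf, positions, [False, True])
result = buf.getvalue()
```

Overwriting in place only works when the old and new labels occupy the same number of bytes, which is why the sketch uses two-character labels throughout.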

binarize(self)

Create and return binary datasets.

Returns:

A {k:v} dictionary where k is a category name, and v is a binary dataset.
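A one-vs-rest split of this kind can be sketched on plain (doc, labels) pairs. `binarize_pairs` is a hypothetical helper that returns lists instead of Dataset objects, and it assumes a document is positive for a category exactly when that category appears among its labels.

```python
def binarize_pairs(pairs):
    """Return {category: [(doc, is_member), ...]} using one-vs-rest labels."""
    # Collect every category that appears anywhere in the data
    categories = sorted({label for _doc, labels in pairs for label in labels})
    return {
        cat: [(doc, cat in labels) for doc, labels in pairs]
        for cat in categories
    }


data = [("d1", ["a"]), ("d2", ["a", "b"]), ("d3", ["b"])]
binary = binarize_pairs(data)
```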

makeWeighted(self, cs=None)

Convert to a weighted (e.g. LTC) dataset

Parameters:

cs - An optional CorpusStats object; if omitted, one will be created and associated with the dataset.
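Assuming "LTC" refers to the SMART `ltc` scheme (logarithmic term frequency, inverse document frequency, cosine normalization), the weighting of a single document can be sketched as below. `ltc_weights` is a hypothetical helper; the plain dictionary of document frequencies stands in for whatever a CorpusStats object provides.

```python
import math


def ltc_weights(doc_tf, doc_freq, num_docs):
    """Return {atom: weight} using (1 + log tf) * log(N / df), cosine-normalized."""
    raw = {
        atom: (1.0 + math.log(tf)) * math.log(num_docs / doc_freq[atom])
        for atom, tf in doc_tf.items()
        if tf > 0
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return raw  # every term has zero idf; nothing to normalize
    return {atom: w / norm for atom, w in raw.items()}


# Atom 1 occurs 3 times and is rare; atom 2 occurs once and is common
weights = ltc_weights({1: 3, 2: 1}, doc_freq={1: 2, 2: 5}, num_docs=10)
```

After normalization each document vector has unit length, so dot products between documents behave like cosine similarities.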

subset(self, count)

Creates count subsets of the dataset.

Subsetting is performed using round-robin.

Parameters:

count - Number of subsets to create

Returns:

A list of datasets
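Round-robin subsetting can be sketched on a plain list: item i goes to subset i modulo count. `round_robin_subsets` is a hypothetical helper that returns lists, not Dataset objects.

```python
def round_robin_subsets(items, count):
    """Deal items into `count` subsets: item i goes to subset i % count."""
    subsets = [[] for _ in range(count)]
    for i, item in enumerate(items):
        subsets[i % count].append(item)
    return subsets


parts = round_robin_subsets(["d0", "d1", "d2", "d3", "d4"], 2)
```

Dealing items out this way keeps subset sizes within one of each other.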

kfold(self, count)

Create cross-validation folds.

The dataset is broken into `count` pieces. Each fold (i.e. train-test pair) is created by assigning 1 piece to `train` and the remaining `count-1` pieces to `test`.

Parameters:

count - Number of folds

Returns:

A list of [train,test] datasets
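The sketch below follows the assignment stated above literally (one piece to train, `count-1` pieces to test) on a plain list. `kfold_pairs` is a hypothetical helper, and forming the pieces round-robin is an assumption. Swapping the two lists in each pair gives the more common arrangement of training on the larger part.

```python
def kfold_pairs(items, count):
    """Build `count` [train, test] pairs from round-robin pieces of `items`."""
    # Assumption: pieces are formed round-robin
    pieces = [items[i::count] for i in range(count)]
    folds = []
    for i in range(count):
        train = list(pieces[i])  # 1 piece
        # The remaining count-1 pieces, concatenated in order
        test = [x for j, piece in enumerate(pieces) if j != i for x in piece]
        folds.append([train, test])
    return folds


folds = kfold_pairs(list(range(6)), 3)
```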

__add__(self, other)
(Addition operator)

Add two datasets.

If both datasets are non-empty, then they must be 'compatible', i.e., share the same factories and corpus stats.

The resulting dataset combines the docs and labels, and inherits the factories and corpus stats of the non-empty parent dataset.

If both parents were digested, the resulting dataset is also digested.

from_rainbow(filename, linkto=None)
Static Method

Create a dataset from rainbow's output.

$ rainbow -d model --index 20news/train/*
$ rainbow -d model --print-matrix=siw > train.txt

>>> ds = Dataset.from_rainbow("train.txt")

ds.catfactory holds the AtomFactory for category names. ds.tokenfactory holds the AtomFactory for the tokens.

A test set should share its factories with a training set. Therefore, read it like so:

>>> ds2 = Dataset.from_rainbow("testfile.txt", linkto=ds)

Parameters:

filename - File containing rainbow's output
linkto - Another dataset whose AtomFactory we should borrow.

Returns:

A brand new dataset.
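Why the test set should borrow the training set's factories can be illustrated with a small stand-in. `TokenIds` and `index_tokens` are hypothetical and are not mekano's AtomFactory or its API; the sketch only assumes that a factory maps each distinct string to a stable integer id.

```python
class TokenIds:
    """Hypothetical stand-in for an AtomFactory: maps each string to a stable int id."""

    def __init__(self):
        self._ids = {}

    def __call__(self, token):
        # Assign the next unused id the first time a token is seen
        return self._ids.setdefault(token, len(self._ids) + 1)


def index_tokens(tokens, factory=None):
    """Convert tokens to ids, borrowing `factory` if one is supplied."""
    factory = factory if factory is not None else TokenIds()
    return [factory(t) for t in tokens], factory


train_ids, shared = index_tokens(["oil", "price", "oil"])
test_ids, _ = index_tokens(["price", "rise"], factory=shared)
```

With a shared factory, "price" receives the same id in both sets. With separate factories it would be assigned id 1 in the test set, which collides with "oil" in the training set.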
