A Dataset with 'documents' and 'labels'
Documents should be AtomVector-like, i.e., they should be iterable, yielding
(a, v) pairs.
>>> ds = Dataset("reuters")
>>> ds.add(doc, labels)
>>> ds.digest()
To Do: Fix semantics of labels.
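As a rough illustration of what "AtomVector-like" could mean, the sketch below defines a hypothetical `SimpleDoc` class (not part of this module) that is iterable and yields (a, v) pairs. Reading `a` as an atom (feature id) and `v` as its value is an assumption; the page itself only requires iteration over pairs.

```python
class SimpleDoc:
    """Hypothetical stand-in for an AtomVector-like document."""

    def __init__(self, pairs):
        # Keep one value per atom; later duplicates overwrite earlier ones.
        self._values = dict(pairs)

    def __iter__(self):
        # Yield (a, v) pairs in ascending atom order.
        return iter(sorted(self._values.items()))


doc = SimpleDoc([(3, 0.5), (1, 2.0)])
pairs = list(doc)  # iterating the document produces the (a, v) pairs
```

Any object with this iteration behavior, including a plain list of tuples, would presumably satisfy the same requirement.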
Add a (doc, labels) pair to the dataset. 'labels' can be either a sequence (e.g. [1,2,5]) or a single value (e.g. True or False).
Analyze the data and generate an internal list of labels. Useful for binarizing etc.
Get a dictionary of labels and respective document counts. This is an O(n) operation!
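The label-count dictionary can be pictured with a single pass over the stored labels, which is why the cost is linear in the number of documents. The sketch below is an assumption about the behavior, not the module's code; `label_counts` is a hypothetical helper that accepts both label sequences and single values, as described for adding documents.

```python
from collections import Counter


def label_counts(label_entries):
    """Count documents per label in one pass (hypothetical helper)."""
    counts = Counter()
    for labels in label_entries:
        if isinstance(labels, (list, tuple, set)):
            # A sequence of labels: the document counts once per distinct label.
            counts.update(set(labels))
        else:
            # A single label value.
            counts[labels] += 1
    return dict(counts)


counts = label_counts([[1, 2, 5], [2], [2, 5], 7])
```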
Write a multi-class dataset to fout in SVM format. This can be directly consumed by LIBSVM.
Write a binary dataset to fout in SVM format. Returns the byte positions of the labels, which can be used by toSVMSubsequent() to overwrite the labels with something else.
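To make the byte-position idea concrete, here is a minimal sketch under stated assumptions. LIBSVM's sparse text format puts one instance per line as `<label> <index>:<value> ...`. The helpers `write_svm_binary` and `overwrite_labels` are hypothetical names, not this module's API, and the fixed-width `+1`/`-1` labels are an assumption chosen so that a label can be rewritten in place without shifting the rest of the file.

```python
import io


def write_svm_binary(docs, labels, fout):
    """Write one LIBSVM-style line per document; return label byte offsets."""
    positions = []
    for doc, label in zip(docs, labels):
        # Remember where this line's label starts.
        positions.append(fout.tell())
        fout.write(b"+1" if label else b"-1")
        # Features are written in ascending index order.
        for a, v in sorted(doc):
            fout.write(f" {a}:{v:g}".encode("ascii"))
        fout.write(b"\n")
    return positions


def overwrite_labels(fout, positions, labels):
    """Replace the two-byte labels in place at the recorded offsets."""
    for pos, label in zip(positions, labels):
        fout.seek(pos)
        fout.write(b"+1" if label else b"-1")


docs = [[(1, 0.5), (3, 1.0)], [(2, 2.0)]]
buf = io.BytesIO()
positions = write_svm_binary(docs, [True, False], buf)
first = buf.getvalue()
overwrite_labels(buf, positions, [False, True])
second = buf.getvalue()
```

Rewriting only the labels avoids serializing the feature vectors again, which is the likely motivation when the same documents are reused for several binary problems.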
Create and return binary datasets.
Convert to a weighted (e.g. LTC) dataset.
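For readers unfamiliar with the abbreviation, LTC in SMART notation usually means a logarithmic term frequency (1 + log tf), an inverse document frequency factor (log(N/df)), and cosine normalization. The sketch below shows that textbook scheme on documents given as lists of (a, tf) pairs; `ltc_weight` is a hypothetical function, and the exact variant used by this module may differ.

```python
import math


def ltc_weight(docs):
    """Apply textbook LTC weighting to a list of (atom, tf) documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each atom
    # (assumes each atom appears at most once per document).
    df = {}
    for doc in docs:
        for a, _ in doc:
            df[a] = df.get(a, 0) + 1

    weighted = []
    for doc in docs:
        # l: 1 + log(tf); t: log(N / df)
        raw = {a: (1.0 + math.log(tf)) * math.log(n_docs / df[a])
               for a, tf in doc if tf > 0}
        # c: cosine (L2) normalization; all-zero vectors stay zero.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append(sorted((a, w / norm if norm else 0.0)
                               for a, w in raw.items()))
    return weighted


weighted = ltc_weight([[(1, 2), (2, 1)], [(2, 3)]])
```

In this toy input, atom 2 occurs in every document, so its idf factor is zero and it contributes no weight.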
Creates `count` subsets of the dataset. Subsetting is performed using round-robin.
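Round-robin subsetting can be sketched as dealing items out in turn, so that item i lands in subset i mod count. The function name `round_robin_subsets` is hypothetical, and a plain list stands in for the dataset.

```python
def round_robin_subsets(items, count):
    """Deal items into `count` subsets in turn (hypothetical helper)."""
    subsets = [[] for _ in range(count)]
    for i, item in enumerate(items):
        subsets[i % count].append(item)
    return subsets


subsets = round_robin_subsets(list(range(7)), 3)
# subsets is [[0, 3, 6], [1, 4], [2, 5]]
```

Subset sizes differ by at most one, and the original order is preserved within each subset.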
Create cross-validation folds. The dataset is broken into `count` pieces; each fold (i.e. train-test pair) is created by assigning 1 piece to `train` and `count-1` pieces to `test`.
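The fold construction can be sketched as follows, taking the description literally: one piece goes to `train` and the remaining `count-1` pieces go to `test`. Many libraries assign the roles the other way round, and which one this module implements is not confirmed here. `make_folds` is a hypothetical name, and splitting the pieces by round-robin is an assumption.

```python
def make_folds(items, count):
    """Build `count` (train, test) pairs as described above (hypothetical)."""
    # Break the data into `count` pieces (round-robin split assumed).
    pieces = [items[i::count] for i in range(count)]
    folds = []
    for i in range(count):
        # One piece becomes `train`, the other count-1 pieces become `test`.
        train = list(pieces[i])
        test = [x for j, piece in enumerate(pieces) if j != i for x in piece]
        folds.append((train, test))
    return folds


folds = make_folds(list(range(6)), 3)
```

Each item appears in exactly one of `train` or `test` within a fold, and every piece serves as `train` exactly once across the folds.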
Add two datasets. If both datasets are non-empty, then they must be 'compatible', i.e., share the same factories and corpus stats. The resulting dataset combines the docs and labels, and inherits the factories and corpus stats of the non-empty parent dataset. If both parents were digested, the resulting dataset is also digested.
Create a dataset from rainbow's output.

$ rainbow -d model --index 20news/train/*
$ rainbow -d model --print-matrix=siw > train.txt
>>> ds = from_rainbow("train.txt")

A test set should share its factories with a training set. Therefore, read it like so:

>>> ds2 = from_rainbow("testfile.txt", linkto=ds)
Generated by Epydoc 3.0.1 on Tue Oct 12 11:00:25 2010 (http://epydoc.sourceforge.net)