Documentation: Data Format and Reading

Data Format

PDNN currently supports two data formats: PFile and Python pickle.

Pickle

When reading a pickle file, PDNN assumes that the file serializes two numpy arrays, one for the feature matrix and the other for the label vector. Python can create pickle-formatted datasets easily. Check examples/mnist to see how to convert datasets into the pickle format.

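> import numpy, cPickle, gzip
> # feature matrix: one row per example
> feat = numpy.array([[0.2, 0.3, 0.5, 1.4], [1.3, 2.1, 0.3, 0.1], [0.3, 0.5, 0.5, 1.4]])
> # label vector: one class label per row of feat
> label = numpy.array([10, 5, 32])
> # the with-block closes the gzip file, so all data is flushed to disk
> with gzip.open('toy.pickle.gz', 'wb') as f:
>     cPickle.dump([feat, label], f, cPickle.HIGHEST_PROTOCOL)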
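Reading such a file back is a quick way to check it before training. The following is a minimal sketch, not PDNN's own loader: it assumes Python 3, where the `pickle` module takes the place of `cPickle`, and the helper names `save_dataset` and `load_dataset` are hypothetical. The row-count check reflects the one-label-per-example layout described above.

```python
import gzip
import os
import pickle
import tempfile

import numpy


def save_dataset(path, feat, label):
    """Write [feat, label] to a gzip-compressed pickle file (hypothetical helper)."""
    with gzip.open(path, 'wb') as f:
        pickle.dump([feat, label], f, pickle.HIGHEST_PROTOCOL)


def load_dataset(path):
    """Read the feature matrix and label vector back (hypothetical helper)."""
    with gzip.open(path, 'rb') as f:
        feat, label = pickle.load(f)
    # The format pairs each feature row with exactly one label.
    if feat.shape[0] != label.shape[0]:
        raise ValueError('feature rows and labels differ in count')
    return feat, label


# Round trip with the toy data from the example above, in a temporary directory.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy.pickle.gz')
save_dataset(
    toy_path,
    numpy.array([[0.2, 0.3, 0.5, 1.4], [1.3, 2.1, 0.3, 0.1], [0.3, 0.5, 0.5, 1.4]]),
    numpy.array([10, 5, 32]),
)
feat, label = load_dataset(toy_path)
print(feat.shape, label.tolist())
```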

PFile

The PFile is the ICSI feature file archive format. A standard PFile toolkit is pfile_utils-v0_51. This script installs it automatically if you are running Linux. Check examples/mnist to see how to convert datasets into the PFile format. The following table shows the individual fields of a PFile, together with three example rows.

Sentence Index	Example Index	Feature Vector	Class Label
0	0	[0.2, 0.3, 0.5, 1.4, 1.8, 2.5]	10
0	1	[1.3, 2.1, 0.3, 0.1, 1.4, 0.9]	179
1	0	[0.3, 0.5, 0.5, 1.4, 0.8, 1.4]	32

For speech processing, sentences and examples correspond to utterances and frames, respectively. Examples are indexed within each sentence. For other applications, you can use dummy values for Sentence Index and Example Index. For example, with N examples, you can set Sentence Index to 0 for all of them and Example Index to 0, 1, ..., N-1 in order.
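The indexing scheme can be sketched in a few lines of Python. This is only an illustration of the numbering; it does not write a PFile, and the function names `fake_indices` and `indices_from_lengths` are hypothetical.

```python
def fake_indices(num_examples):
    """Return (sentence_index, example_index) pairs for non-speech data.

    Every example is placed in sentence 0 and numbered in order from 0.
    """
    return [(0, i) for i in range(num_examples)]


def indices_from_lengths(sentence_lengths):
    """Return index pairs when examples are grouped into sentences.

    Example indices restart at 0 within each sentence.
    """
    pairs = []
    for sent, length in enumerate(sentence_lengths):
        pairs.extend((sent, ex) for ex in range(length))
    return pairs


# Three examples without sentence structure.
print(fake_indices(3))
# Two sentences with 2 and 1 examples, as in the table above.
print(indices_from_lengths([2, 1]))
```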

HTK users can convert HTK feature and label files into PFiles using this Python code. Refer to the comments at the top of the code for more information.

Data Reading and Specification

The training (or validation) data is specified by --train-data (or --valid-data). The value is a string in which the data reading arguments are separated by commas. The first field is always the path to the data file. Instead of a single data file, you can also specify a list of files using a regular expression, for example "--train-data train.pickle.*.gz" or "--train-data train.pickle.[1-10].gz". All the data files in the list are traversed during each training epoch. When processing one data file, PDNN uses one of two data reading modes.

Non-Stream. GPU memory is in general smaller than CPU memory. Therefore, PDNN first reads the entire file into CPU memory and splits it into separate data chunks. One data chunk is fed into GPU memory at a time. This mode applies to both pickle and PFile. Under this mode, each data file should be smaller than the CPU memory, since the whole file is held there.

Stream. Instead of loading the entire data file into CPU memory, PDNN reads one chunk of data from the data file at a time. This is especially useful when the data file is huge and cannot be loaded into CPU memory. However, this mode applies ONLY to PFile.
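The difference between the two modes can be illustrated with a short sketch. It is not PDNN's reader: the function names are hypothetical, rows stand in for examples, and the 4 bytes per value used to turn a chunk size into a row count is an assumption (32-bit floats).

```python
def rows_per_chunk(partition_bytes, feat_dim, bytes_per_value=4):
    """Number of feature rows that fit in one chunk (assumes 4-byte values)."""
    return max(1, partition_bytes // (feat_dim * bytes_per_value))


def non_stream_chunks(read_all, chunk_rows):
    """Non-stream style: load everything first, then hand out slices."""
    data = read_all()  # the entire file sits in CPU memory from here on
    for start in range(0, len(data), chunk_rows):
        yield data[start:start + chunk_rows]


def stream_chunks(row_iter, chunk_rows):
    """Stream style: hold at most one chunk in memory at a time."""
    chunk = []
    for row in row_iter:
        chunk.append(row)
        if len(chunk) == chunk_rows:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter chunk
        yield chunk


# Ten toy rows of dimension 4, split into chunks of 4 rows.
rows = [[float(i)] * 4 for i in range(10)]
chunk_rows = rows_per_chunk(64, feat_dim=4)
print([len(c) for c in non_stream_chunks(lambda: rows, chunk_rows)])
print([len(c) for c in stream_chunks(iter(rows), chunk_rows)])
```

Both generators produce the same chunks; they differ only in how much data is resident in memory while iterating.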

Field	Meaning
partition	the size of each data chunk, in megabytes, read into GPU memory; must be smaller than the GPU memory size
stream	whether data is read in "stream" mode
random	whether data points are shuffled when a data chunk is loaded into GPU memory

For example: --train-data "./train.pfile.gz,partition=1000m,random=true,stream=false"
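To make the layout of the specification string concrete, here is a minimal parsing sketch. It is not PDNN's own parser: `parse_data_spec` is a hypothetical name, and reading the `m` suffix as megabytes of 1024 * 1024 bytes is an assumption. No default values are filled in, since the text above does not state them.

```python
def parse_data_spec(spec):
    """Split a data specification into (path, options) (hypothetical helper)."""
    fields = [f.strip() for f in spec.split(',')]
    path, options = fields[0], {}  # the first field is always the path
    for field in fields[1:]:
        key, sep, value = field.partition('=')
        if not sep:
            raise ValueError('expected key=value, got %r' % field)
        if key in ('stream', 'random'):
            options[key] = value.lower() == 'true'
        elif key == 'partition':
            # '1000m' -> 1000 megabytes, stored as bytes (assumed meaning of 'm')
            options[key] = int(value.rstrip('mM')) * 1024 * 1024
        else:
            options[key] = value  # unknown fields are kept as plain strings
    return path, options


path, options = parse_data_spec(
    './train.pfile.gz,partition=1000m,random=true,stream=false')
print(path)
print(options)
```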