The AbstractDataSetConverter class is implemented to carry out the
task of converting text documents belonging to a test data set into
machine-understandable form.
Return the weight of a given term calculated by the "ltc" version (in the SMART notation) of TFIDF:
w(t, d) = { (1 + log_2 term frequency(t, d)) x log_2 (N / document frequency(t)) } / || d ||
where,
N: total number of documents in the data set
|| d ||: 2-norm of vector d
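The weighting above can be sketched as follows. This is a minimal illustration, not the original implementation: the class name LtcWeightSketch, its method names, and the map-based inputs are hypothetical, and it assumes that || d || is the 2-norm of the document's unnormalized weights (the usual reading of the "c" in "ltc").

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "ltc" TFIDF weighting; not the original class.
final class LtcWeightSketch {

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Unnormalized weight (1 + log2 tf) * log2(N / df); 0 if the term is absent. */
    static double rawWeight(int termFreq, int docFreq, int numDocs) {
        if (termFreq <= 0 || docFreq <= 0) {
            return 0.0;
        }
        return (1.0 + log2(termFreq)) * log2((double) numDocs / docFreq);
    }

    /**
     * Weights for every term of one document, divided by the 2-norm of the
     * unnormalized weight vector (assumption: this is what || d || denotes).
     */
    static Map<String, Double> weights(Map<String, Integer> termFreqs,
                                       Map<String, Integer> docFreqs,
                                       int numDocs) {
        Map<String, Double> raw = new HashMap<>();
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            double w = rawWeight(e.getValue(), df, numDocs);
            raw.put(e.getKey(), w);
            sumOfSquares += w * w;
        }
        double norm = Math.sqrt(sumOfSquares);
        Map<String, Double> normalized = new HashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            // Guard against an all-zero document vector.
            normalized.put(e.getKey(), norm == 0.0 ? 0.0 : e.getValue() / norm);
        }
        return normalized;
    }
}
```

Because of the normalization, the squared weights of a document sum to 1 whenever at least one term has a non-zero weight.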
The Classifier class is a wrapper class which carries out the task
of classification by invoking a class that implements a specific
classification algorithm.
The Classifiers class is a wrapper class that is responsible for
handling classification requests from other classes; each request
specifies the method and its parameters.
The Condenser class is responsible for condensing (financial) news articles into
a pseudo-article made up of the sentences that contain the specified company's name.
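One way to picture the condensing step is the sketch below. It is only an illustration under stated assumptions: the class name CondenserSketch and the method condense are hypothetical, sentences are split naively after '.', '!' or '?', and matching is a case-insensitive substring test. The original class may use different rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of condensing articles into a pseudo-article.
final class CondenserSketch {

    /** Keep only the sentences that mention companyName, joined by spaces. */
    static String condense(List<String> articles, String companyName) {
        String needle = companyName.toLowerCase(Locale.ROOT);
        List<String> kept = new ArrayList<>();
        for (String article : articles) {
            // Naive sentence split: break on whitespace that follows . ! or ?
            for (String sentence : article.split("(?<=[.!?])\\s+")) {
                if (sentence.toLowerCase(Locale.ROOT).contains(needle)) {
                    kept.add(sentence.trim());
                }
            }
        }
        return String.join(" ", kept);
    }
}
```

A production version would need a proper sentence splitter, since abbreviations and decimal numbers break the naive rule used here.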
The DictionaryGenerator class is an abstract class that encapsulates
the job of generating the unique term (word or phrase) dictionary for a given data set.
Return the Euclidean distance value between two vectors.
Euclidean distance is the special case (p = 2) of
the Minkowski distance, which is defined by:
D(x, y) = (sum_{i=1}^{d} |x_i - y_i|^p)^{1/p}
Note: the dimensions of the two vectors must be identical.
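The definition above translates directly into code. The following is a minimal sketch; the class name MinkowskiSketch and the use of double arrays for vectors are assumptions made for illustration, not the original signatures.

```java
// Hypothetical sketch of the Minkowski distance and its p = 2 special case.
final class MinkowskiSketch {

    /** D(x, y) = (sum_i |x_i - y_i|^p)^(1/p); both vectors must have the same length. */
    static double minkowski(double[] x, double[] y, double p) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have identical dimensions");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.pow(Math.abs(x[i] - y[i]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    /** Euclidean distance is the Minkowski distance with p = 2. */
    static double euclidean(double[] x, double[] y) {
        return minkowski(x, y, 2.0);
    }
}
```

For example, euclidean(new double[] {0, 0}, new double[] {3, 4}) evaluates to 5 up to floating-point rounding, while passing vectors of different lengths raises an IllegalArgumentException.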
The DistributionalModel class is an implementation of a language
model that is intended to generate a probability distribution model of
text data with the "Laplace smoothing" estimator.
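As a rough illustration of the estimator, the sketch below uses the add-one form of Laplace smoothing over a fixed vocabulary, P(w) = (count(w) + 1) / (total + |V|). The class name LaplaceSketch, the method estimate, and the choice to ignore out-of-vocabulary tokens are assumptions and may differ from the original class.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a unigram model with add-one (Laplace) smoothing.
final class LaplaceSketch {

    /** Returns P(w) = (count(w) + 1) / (total + |V|) for every w in the vocabulary. */
    static Map<String, Double> estimate(List<String> tokens, Set<String> vocabulary) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String token : tokens) {
            // Assumption: tokens outside the vocabulary are ignored.
            if (vocabulary.contains(token)) {
                counts.merge(token, 1, Integer::sum);
                total++;
            }
        }
        double denominator = total + vocabulary.size();
        Map<String, Double> probabilities = new HashMap<>();
        for (String word : vocabulary) {
            probabilities.put(word, (counts.getOrDefault(word, 0) + 1) / denominator);
        }
        return probabilities;
    }
}
```

Every vocabulary word receives a non-zero probability, including words that never occur in the text, and the probabilities sum to 1.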
The FinancialDataMaker class is implemented to generate a data
structure which contains an instance or a set of instances of the
"Financial" data set in a structural and machine-readable form.
The GAC class is the implementation of a variant of the hierarchical agglomerative clustering (HAC) algorithm
that was implemented in [Yang et al., 1999].
Return the TextDocument as a Vector containing all of its words.
The returned vector is used to represent a text document
as the collection of its words, each occurring once regardless of its term frequency.
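The idea of keeping each word once can be sketched as follows. The class name UniqueWordsSketch, whitespace tokenization, and lower-casing are assumptions made for illustration; the original TextDocument may tokenize differently.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.Vector;

// Hypothetical sketch: represent a document by its distinct words.
final class UniqueWordsSketch {

    /** Returns each distinct word once, in order of first occurrence. */
    static Vector<String> uniqueWords(String text) {
        Set<String> seen = new LinkedHashSet<>();
        // Assumption: words are separated by whitespace and compared case-insensitively.
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                seen.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return new Vector<>(seen);
    }
}
```

With this sketch, a word that appears ten times contributes a single entry to the vector, exactly like a word that appears once.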
The HACClusterSet class is intended to maintain a tree structure
which contains intermediate results produced by Hierarchical
Agglomerative Clustering.
The Indexer class is an abstract class that provides a set of
constants and common functions required for the task of generating
the index of a given data set.
Return Kullback-Leibler (KL) divergence between two given probability distributions.
KL divergence is a measure of how different two probability distributions
(over the same event space) are.
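A minimal sketch of the computation, using the standard definition D(P || Q) = sum_i p_i log(p_i / q_i), is given below. The class name KlSketch, the array representation, the natural logarithm, and the treatment of zero probabilities are assumptions; the original method may use a different log base or smoothing.

```java
// Hypothetical sketch of Kullback-Leibler divergence between two discrete distributions.
final class KlSketch {

    /** D(P || Q) = sum_i p_i * ln(p_i / q_i), with 0 * ln(0 / q) taken as 0. */
    static double klDivergence(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("distributions must share the same event space");
        }
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] == 0.0) {
                continue; // events with zero probability under P contribute nothing
            }
            if (q[i] == 0.0) {
                return Double.POSITIVE_INFINITY; // P puts mass where Q has none
            }
            sum += p[i] * Math.log(p[i] / q[i]);
        }
        return sum;
    }
}
```

Note that the measure is not symmetric: klDivergence(p, q) and klDivergence(q, p) generally differ, and the value is 0 when the two distributions are identical.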
The NewsgroupdataMaker class is implemented to generate a data structure
which contains an instance or a set of instances of the "20 Newsgroup"
data set in a structural and machine-readable form.
The ResourceManager class is intended to provide efficient
management of the computing resources, such as memory and disk space,
which are assigned to TextMiner.
The ReutersDataMaker class is primarily used to generate a data
structure for containing an instance (a document) or a set of
instances in a structural and machine-readable form of the
"Reuters-21578" data set.
The ReutersDataSetConverter class is implemented to convert text
documents belonging to the test set of the Reuters-21578 data set into
machine-understandable form.
The Similarity class provides a set of measures that are used to
estimate the similarity between two abstract objects, such as
(real-valued and multi-dimensional) vectors and probability
distributions.
number of dimensions
(it can be interpreted in various ways:
attributes of an instance,
number of terms in bag of words, and
number of components in a document vector)
The TDTdataMaker class is implemented to generate a data structure
which is capable of containing an instance or a set of instances in
a structural and machine-readable form of the "TDT pilot"
corpus.
The TextNoiseRemover class is a utility class for removing
noise from natural language text before any other
text learning techniques are applied.