How to use the SLIF Text Components
Invocation and basic options
The SLIF text components are distributed as a single large JAR file. To
run them you will need a Java runtime. A typical invocation would be
% java -cp slifTextComponents.jar -Xmx500M SlifTextComponent -labels DIR -saveAs FILE -use COMPONENT1,COMPONENT2,.... [OPTIONS]
where -Xmx500M sets the maximum size of the Java heap to 500 MB, and the remaining arguments are as follows:
- DIR is a directory containing some number of text files to
label. A directory of caption files is available as a compressed tarfile. These
captions were all taken from PubMed Central papers, e.g., the file
"p9770486-fig_4_2" is from Figure 4 of the paper with the PubMed
ID 9770486.
- FILE is where annotations should be placed. These are output in 'minorthird format',
which is explained below.
- COMPONENT1,... are the names of the 'text components' to use to
label the files. The components available are:
- CellLine: marks spans that are predicted to be the names
of cell lines with 'cellLine', using an entity-tagger trained using
the Genia corpus.
- CRFonYapex: marks spans that are predicted to be the
names of genes or proteins with 'protein', and also with
'proteinFromCRFonYapex', using a gene-tagger trained on the YAPEX
corpus with the CRF algorithm, as described by Kou,
Cohen and Murphy (2005).
- CRFonTexas, CRFonGenia: analogous to CRFonYapex, but
using gene-taggers trained on different corpora (as outlined in the
Kou et al. paper).
- SemiCRFOnYapex, SemiCRFOnTexas, SemiCRFOnGenia: analogous
to the CRFon* components, but trained with the SemiCRF algorithm.
- DictHMMOnYapex, DictHMMOnTexas, DictHMMOnGenia: analogous
to the CRFon* components, but trained with the DictHMM algorithm.
- Caption: marks spans according to the criteria described in Cohen,
Murphy, and Wang (2002):
- Spans marked as 'imagePointer' are predicted to be image
pointers. For a definition of image pointers, see Cohen,
Murphy, and Wang (2002).
- Spans marked as 'bulletStyle' and 'citationStyle' are
predicted to be bullet-style and citation-style image pointers,
respectively. Additionally, the span labels 'regional' and 'local' are synonyms
for bullet-style and citation-style, respectively.
- Spans marked as 'bulletScope' and 'localScope' are predicted
to be the scopes of bullet-style and citation-style
image pointers, respectively.
- Spans marked as 'globalScope' are text assumed to pertain
to the entire associated image.
- Spans marked as either 'bulletScope', 'localScope', or
'globalScope' are marked as 'scope'.
- Every 'scope' span is associated with a span
property called its 'semantics'. The 'semantics' of a span
is the concatenation of all the image pointers associated with
that span.
Briefly, to find out what parts of an image some span
S might refer to, you need to (1) find out which 'scope'
spans S is inside of and (2) find out what the 'semantics'
of these scope spans are. For instance, if the span 'RAS4' is
inside a scope T1 with semantics "A" and also inside a
scope T2 with semantics "BD", then 'RAS4' is probably
associated with the parts of the accompanying image labeled "A",
"B", and "D".
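The two-step lookup described above can be sketched in Python as follows. This is only an illustration, not part of the SLIF distribution: the Span class, the image_labels_for function, and the byte offsets are all hypothetical, and the sketch assumes that each character of a 'semantics' string names one image label, as in the "BD" example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A region of a file (hypothetical helper, not a SLIF class)."""
    start: int   # initial byte position
    length: int  # length in bytes

    @property
    def end(self):
        return self.start + self.length

    def contains(self, other):
        # True if `other` lies entirely inside this span.
        return self.start <= other.start and other.end <= self.end


def image_labels_for(span, scopes):
    """Collect the image labels of every scope that encloses `span`.

    `scopes` is a list of (Span, semantics) pairs; each character of the
    semantics string is assumed to be one image label.
    """
    labels = set()
    for scope, semantics in scopes:
        if scope.contains(span):      # step (1): which scopes is S inside of?
            labels.update(semantics)  # step (2): what are their semantics?
    return sorted(labels)


# Made-up offsets: 'RAS4' sits inside T1 ("A") and T2 ("BD") but not T3 ("C").
ras4 = Span(120, 4)
scopes = [(Span(100, 60), "A"), (Span(110, 200), "BD"), (Span(400, 50), "C")]
print(image_labels_for(ras4, scopes))  # ['A', 'B', 'D']
```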
The Minorthird format for stand-off annotation
The output format is the one used by Minorthird. Specifically, the
output (in the default format) is a series of lines in one of these
forms:
addToType FILE START LENGTH SPANTYPE
setSpanProp FILE START LENGTH semantics LETTERS
where
- FILE is the name of the file containing some span;
- START and LENGTH are the initial byte position of the span, and its length;
- SPANTYPE is the type of span (e.g., 'imagePointer',
'cellLine', 'protein', 'scope', etc.);
- LETTERS is (as noted above) the concatenation of all the
image pointers associated with that span.
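As a rough illustration of how such output might be read back, here is a small Python sketch. It is not part of the SLIF tools: parse_annotation_line is a hypothetical helper, the sample lines are made up, and the sketch assumes that fields are separated by whitespace and that file names contain no spaces.

```python
def parse_annotation_line(line):
    """Parse one 'addToType' or 'setSpanProp' line into a dict.

    Returns None for blank lines or lines in any other form.
    Assumes whitespace-separated fields (an assumption, see above).
    """
    fields = line.split()
    if len(fields) == 5 and fields[0] == "addToType":
        _, file, start, length, span_type = fields
        return {"op": "addToType", "file": file,
                "start": int(start), "length": int(length),
                "type": span_type}
    if len(fields) == 6 and fields[0] == "setSpanProp":
        _, file, start, length, prop, value = fields
        return {"op": "setSpanProp", "file": file,
                "start": int(start), "length": int(length),
                "prop": prop, "value": value}
    return None


# Made-up sample lines in the two forms described above.
sample = [
    "addToType p11029059-fig_4_1 1293 10 cellLine",
    "setSpanProp p11029059-fig_4_1 1200 150 semantics BD",
]
for line in sample:
    print(parse_annotation_line(line))
```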
The following [OPTIONS] are also available:
- Gives brief command line help.
- Pops up a window that allows you to interactively fill in the other arguments, monitor the execution of the annotation process, etc.
- Pops up a window that displays the set of documents being labeled.
(This is not recommended for a large document collection.)
- Pops up a window that displays the result of the annotation.
(Again, not recommended for a large document collection.)
- Outputs results as a
tab-separated table, instead of minorthird format. The first
column summarizes the type of the span, the file the span was
taken from, and the start and end byte positions, in a
colon-separated format (e.g.,
"cellLine:p11029059-fig_4_1:1293:1303"). The remaining column(s)
are the text that is contained in the span (e.g., "HeLa cells",
for the span above) almost exactly as it appears in the document; the
only change is that newlines are replaced with spaces.
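A row of the tab-separated table could be unpacked along the following lines. This is a sketch under stated assumptions rather than a reference parser: parse_table_row is a hypothetical name, and the code assumes the start and end positions are the last two colon-separated fields of the first column, so that a colon inside a file name would not confuse them.

```python
def parse_table_row(row):
    """Split one tab-separated output row into its parts (illustrative)."""
    key, *text_columns = row.rstrip("\n").split("\t")
    # Peel start and end off the right, then the span type off the left.
    type_and_file, start, end = key.rsplit(":", 2)
    span_type, file = type_and_file.split(":", 1)
    return {
        "type": span_type,
        "file": file,
        "start": int(start),
        "end": int(end),
        # "The remaining column(s)" hold the span text.
        "text": "\t".join(text_columns),
    }


# The example key and text quoted in the description above.
row = "cellLine:p11029059-fig_4_1:1293:1303\tHeLa cells"
parsed = parse_table_row(row)
print(parsed)
# The byte positions are consistent with the 10-character span text.
print(parsed["end"] - parsed["start"] == len(parsed["text"]))  # True
```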
A number of people have contributed to these tools, including William
Cohen, Zhenzhen Kou, Quinten Mercer, Robert Murphy, Richard Wang, and
other members of the SLIF team.
The initial development of these tools was supported by grant 017396
from the Commonwealth of Pennsylvania Tobacco Settlement Fund. Further
development is supported by National Institutes of Health grant R01