How to use the SLIF Text Components

Invocation and basic options

The SLIF text components are distributed as a single large JAR file. To run it, you will need a copy of Java. A typical invocation would be

% java -cp slifTextComponents.jar -Xmx500M SlifTextComponent -labels DIR -saveAs FILE -use COMPONENT1,COMPONENT2,.... [OPTIONS]

where -Xmx500M sets the maximum size of the Java heap to 500 MB, and the remaining arguments are as follows:

DIR is a directory containing some number of text files to annotate.
A sample directory of files is available as a compressed tarfile. These captions were all taken from PubMedCentral papers, e.g., the file "p9770486-fig_4_2" is from Figure 4 of the paper with the PubMed Id of 9770486.
FILE is where annotations should be placed. These are output in 'minorthird format', which is explained below.
COMPONENT1,... are the names of the 'text components' to use.
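
The invocation above can also be assembled from a script. The following is a minimal Python sketch, not part of the SLIF distribution: the helper name build_slif_command and the example paths "captions" and "captions.labels" are hypothetical, and only the flags documented above are used.

```python
def build_slif_command(labels_dir, save_as, components,
                       jar="slifTextComponents.jar", heap="500M",
                       extra_options=()):
    """Return the argument list for the invocation documented above.

    build_slif_command is a hypothetical helper; it only arranges the
    documented flags (-labels, -saveAs, -use) in the documented order.
    """
    if not components:
        raise ValueError("at least one component name is required")
    return [
        "java", "-cp", jar, f"-Xmx{heap}", "SlifTextComponent",
        "-labels", labels_dir,
        "-saveAs", save_as,
        "-use", ",".join(components),  # components are comma-separated
        *extra_options,
    ]


# Hypothetical directory and output file names, for illustration only.
cmd = build_slif_command("captions", "captions.labels", ["CellLine", "Caption"])
print(" ".join(cmd))

# To run it (requires Java and the JAR file to be present):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when directory names contain spaces.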

The components available are:

CellLine: marks spans that are predicted to be the names of cell lines with 'cellLine', using an entity-tagger trained using the Genia corpus.
CRFonYapex: marks spans that are predicted to be the names of genes or proteins with 'protein', and also 'proteinFromCRFonYapex', using a gene-tagger trained on the YAPEX corpus with the CRF algorithm, as described by Kou, Cohen and Murphy (2005).
CRFonTexas, CRFonGenia: analogous to CRFonYapex, but using gene-taggers trained on different corpora (as outlined in the Kou et al. paper).
SemiCRFOnYapex, SemiCRFOnTexas, SemiCRFOnGenia: analogous to the CRFon* components, but trained with the SemiCRF algorithm.
DictHMMOnYapex, DictHMMOnTexas, DictHMMOnGenia: analogous to the CRFon* components, but trained with the DictHMM algorithm.
Caption: marks spans according to the criteria described by Cohen, Wang and Murphy (2003):
- Spans marked as 'imagePointer' are predicted to be image pointers. For a definition of image pointers, see Cohen, Wang and Murphy (2003).
- Spans marked as 'bulletStyle' and 'citationStyle' are predicted to be bullet-style and citation-style image pointers, respectively.
- Spans marked as 'bulletScope' and 'localScope' are predicted to be the scopes of bullet-style and citation-style image pointers, respectively.
- Spans marked as 'globalScope' are text assumed to pertain to the entire associated image.
- Spans marked as any of 'bulletScope', 'localScope', or 'globalScope' are also marked as 'scope'.
- Every 'scope' span is associated with a span property called its 'semantics'. The 'semantics' of a span is the concatenation of all the image pointers associated with that span.
Additionally, the span labels 'regional' and 'local' are synonyms for bullet-style and citation-style, respectively.
Briefly, to find out which parts of an image a span S might refer to, you need to (1) find the 'scope' spans that S is inside of and (2) find the 'semantics' of those scope spans. For instance, if the span 'RAS4' is inside a scope T1 with semantics "A" and also inside a scope T2 with semantics "BD", then 'RAS4' is probably associated with the parts of the accompanying image labeled "A", "B", and "D".
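
The two-step lookup just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the SLIF implementation: spans are given as (start, length) pairs, each character of a 'semantics' string is taken to be one image label (as in the "BD" example), and the function name image_parts and all offsets are made up.

```python
def image_parts(span, scopes):
    """Return the image labels a span is probably associated with.

    span   -- (start, length) of the span S
    scopes -- iterable of (start, length, semantics) for 'scope' spans
    Assumes every character of a semantics string is a separate label.
    """
    s_start, s_length = span
    s_end = s_start + s_length
    labels = set()
    for start, length, semantics in scopes:
        # Step 1: keep only the scopes that fully contain S.
        if start <= s_start and s_end <= start + length:
            # Step 2: collect the labels in that scope's semantics.
            labels.update(semantics)
    return sorted(labels)


# Made-up offsets: 'RAS4' lies inside T1 and T2 but not T3.
ras4 = (120, 4)
scopes = [
    (100, 50, "A"),    # T1
    (90, 200, "BD"),   # T2
    (300, 40, "C"),    # T3, does not contain the span
]
print(image_parts(ras4, scopes))  # ['A', 'B', 'D']
```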

The Minorthird format for stand-off annotation

The format for output is the one used by Minorthird. Specifically, the output (in the default format) is a series of lines in one of these formats:

addToType FILE START LENGTH SPANTYPE
setSpanProp FILE START LENGTH semantics LETTERS

where

FILE is the name of the file containing some span;
START and LENGTH are the initial byte position of the span, and its length;
SPANTYPE is the type of span (e.g., 'imagePointer', 'cellLine', 'protein', 'scope', etc.);
LETTERS is (as noted above) the concatenation of all the image pointers associated with that span.
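
As an illustration, output in this format could be read back with a short Python sketch like the one below. It assumes that fields are separated by whitespace and that file names contain no whitespace; the function name parse_minorthird and the sample lines are invented for the example, and lines in any other form are skipped.

```python
from collections import namedtuple

SpanType = namedtuple("SpanType", "file start length span_type")
SpanProp = namedtuple("SpanProp", "file start length prop value")


def parse_minorthird(lines):
    """Split annotation lines into addToType and setSpanProp records.

    Assumes whitespace-separated fields; unrecognized lines are ignored.
    """
    types, props = [], []
    for line in lines:
        fields = line.split()
        if len(fields) == 5 and fields[0] == "addToType":
            _, name, start, length, span_type = fields
            types.append(SpanType(name, int(start), int(length), span_type))
        elif len(fields) == 6 and fields[0] == "setSpanProp":
            _, name, start, length, prop, value = fields
            props.append(SpanProp(name, int(start), int(length), prop, value))
    return types, props


# Invented sample lines; the offsets are not taken from real output.
sample = [
    "addToType p9770486-fig_4_2 10 4 protein",
    "addToType p9770486-fig_4_2 0 60 scope",
    "setSpanProp p9770486-fig_4_2 0 60 semantics AB",
]
types, props = parse_minorthird(sample)
for record in types + props:
    print(record)
```

Because the annotations are stand-off, a span's text can be recovered by reading LENGTH bytes starting at byte START of the named file.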

Other options

Option	Explanation
`-help`	Gives brief command line help
`-gui`	Pops up a window that allows you to interactively fill in the other arguments, monitor the execution of the annotation process, etc.
`-showLabels`	Pops up a window that displays the set of documents being labeled. (This is not recommended for a large document collection, due to memory usage.)
`-showResult`	Pops up a window that displays the result of the annotation. (Again, not recommended for a large document collection.)
`-format strings`	Outputs results as a tab-separated table, instead of minorthird format. The first column summarizes the type of the span, the file the span was taken from, and the start and end byte positions, in a colon-separated format. (E.g., "cellLine:p11029059-fig_4_1:1293:1303".) The remaining column(s) are the text that is contained in the span (e.g., "HeLa cells", for the span above) almost exactly as it appears in the document; the only change is that newlines are replaced with spaces.
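
A row of the `-format strings` table can be taken apart as in the following Python sketch. It relies only on the layout described above (a colon-separated first column, then tab-separated text columns); the function name parse_strings_line is hypothetical, and the example row is built from the sample values quoted in the table.

```python
def parse_strings_line(line):
    """Parse one row of the tab-separated '-format strings' output.

    The first column is TYPE:FILE:START:END; the remaining column(s)
    hold the span text. START and END are split off from the right so
    that a colon inside the file name would not break the parse.
    """
    columns = line.rstrip("\n").split("\t")
    span_type, rest = columns[0].split(":", 1)
    file_name, start, end = rest.rsplit(":", 2)
    return {
        "type": span_type,
        "file": file_name,
        "start": int(start),
        "end": int(end),
        "text": columns[1:],
    }


row = parse_strings_line("cellLine:p11029059-fig_4_1:1293:1303\tHeLa cells")
print(row["type"], row["file"], row["start"], row["end"], row["text"])
```

In this example the end position minus the start position (1303 - 1293 = 10) equals the length of "HeLa cells".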

References

Zhenzhen Kou, William W. Cohen & Robert F. Murphy (2005): High-Recall Protein Entity Recognition Using a Dictionary. In ISMB-2005.
William W. Cohen, Richard Wang & Robert F. Murphy (2003): Understanding Captions in Biomedical Publications. In KDD 2003: 499-504.
Robert F. Murphy, Zhenzhen Kou, Juchang Hua, Matthew Joffe & William W. Cohen (2004): Extracting and Structuring Subcellular Location Information from On-line Journal Articles: The Subcellular Location Image Finder. In KSCE-2004.
The SLIF home page

Acknowledgements

A number of people have contributed to these tools, including William Cohen, Zhenzhen Kou, Quinten Mercer, Robert Murphy, Richard Wang, and other members of the SLIF team. The initial development of these tools was supported by grant 017396 from the Commonwealth of Pennsylvania Tobacco Settlement Fund. Further development is supported by National Institutes of Health grant R01 GM078622.