Rule Induction Toolkit

Description:

Synchronous grammar rule learning for syntax-based machine translation. For more details on the algorithms used in the toolkit, please refer to the papers below:

Download:

To run:

java -jar ruleinduction.jar <configfile>

A sample config file format can be seen here.

Formats and Details:

More details on the parameters and file formats follow:

Input Files:

Output Files:

INPUT_MODE:

OUTPUT_MODE:

OUTPUT_MODE: Samples

CORPUS_FILE Format:

Any number of sentences can be given as input in this format. Sentences should be separated by a new line. Any line starting with a semicolon is a comment. A sample entry for one sentence is shown below:
;; This is a comment
SentenceIndex:1
SL: Resumption of the session
TL: reprise de la session
Alignment:((1,1),(2,2),(3,3),(4,4))
Type: S
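
To make the format above concrete, here is a minimal Python sketch of a reader for it. It is not part of the toolkit: the names `CorpusEntry` and `parse_corpus` are hypothetical, and it assumes that each entry begins with a `SentenceIndex` line, that fields appear as `Key: value` lines exactly as in the sample, and that alignment pairs are written as `(source,target)` word positions.

```python
import re
from dataclasses import dataclass, field

# Hypothetical container for one corpus entry; field names are illustrative only.
@dataclass
class CorpusEntry:
    index: int
    source: str = ""
    target: str = ""
    alignment: list = field(default_factory=list)
    entry_type: str = ""

# Matches one "(i,j)" pair inside the Alignment field.
PAIR_RE = re.compile(r"\((\d+)\s*,\s*(\d+)\)")

def parse_corpus(text):
    """Parse entries in the CORPUS_FILE layout shown in the sample."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (anything starting with a semicolon).
        if not line or line.startswith(";"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip(), value.strip()
        if key == "SentenceIndex":
            # Assumption: SentenceIndex opens a new entry.
            current = CorpusEntry(index=int(value))
            entries.append(current)
        elif current is None:
            raise ValueError("field before any SentenceIndex line: " + line)
        elif key == "SL":
            current.source = value
        elif key == "TL":
            current.target = value
        elif key == "Alignment":
            current.alignment = [(int(s), int(t)) for s, t in PAIR_RE.findall(value)]
        elif key == "Type":
            current.entry_type = value
    return entries

SAMPLE = """;; This is a comment
SentenceIndex:1
SL: Resumption of the session
TL: reprise de la session
Alignment:((1,1),(2,2),(3,3),(4,4))
Type: S
"""

for entry in parse_corpus(SAMPLE):
    print(entry)
```

Starting a new entry on `SentenceIndex` rather than on a blank line keeps the sketch independent of how entries are separated in a given file.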

Config file:

Depending on the mode of operation (T2S or T2T), the TPARSE_FILE may be optional. Everything else is required unless marked optional in the sample. A sample config file is given below.

###START OF SAMPLE CONFIG#########
VRULES_ROOT=C:/rulelearner/

#### RULE LEARNING MODES ####
INPUT_MODE=T2T
OUTPUT_MODE=T2T

##### Parallel Treebank
CORPUS_FILE=C:/rulelearner/ger/ec.txt
SPARSE_FILE=C:/rulelearner/ger/en1.parsed
TPARSE_FILE=C:/rulelearner/ger/de1.parsed

GRA_FILE=C:/rulelearner/ger/grammar.gra
PTABLE_FILE=C:/rulelearner/ger/phrases.phr
LEXICON_FILE=C:/rulelearner/ger/lexicon.lex

# Output modes can be 'NONE', 'AVENUE' or 'ONELINE'
GRA_FORMAT=ONELINE
PTABLE_FORMAT=ONELINE
LEXICON_FORMAT=ONELINE

# Output Case - LOWER, or TRUE (Default is the true case)
OUTPUT_CASE=LOWER

# (Optional parameter) If not included, runs on the entire corpus
STOPAT=100000

# Some extra parameters you may not need, but just set for now 
TOOL_MODE=AVENUE
MAX_RULE_SIZE=10
OUTPUT_CASE=MIXED
LEXICALIZATION=NONE
MARKOVIZATION=NONE

###END OF SAMPLE CONFIG#########
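
Before running the toolkit, it can help to check a config file for repeated keys; the sample above, for instance, sets OUTPUT_CASE twice. The following Python sketch is one way to do that. It is not the toolkit's own parser: `parse_config` is a hypothetical helper, it assumes `KEY=VALUE` lines with `#` comments as in the sample, and it keeps the last value for a repeated key, which may or may not match how the toolkit resolves duplicates.

```python
def parse_config(text):
    """Read KEY=VALUE lines; return (settings, keys that appeared more than once)."""
    settings = {}
    duplicates = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        key, value = key.strip(), value.strip()
        if key in settings and key not in duplicates:
            duplicates.append(key)
        # Assumption: the last occurrence wins; the toolkit may behave differently.
        settings[key] = value
    return settings, duplicates

# Short excerpt in the style of the sample config.
SAMPLE_CONFIG = """###START OF SAMPLE CONFIG#########
INPUT_MODE=T2T
OUTPUT_MODE=T2T
# Output Case - LOWER, or TRUE (Default is the true case)
OUTPUT_CASE=LOWER
MAX_RULE_SIZE=10
OUTPUT_CASE=MIXED
###END OF SAMPLE CONFIG#########
"""

settings, duplicates = parse_config(SAMPLE_CONFIG)
print("settings:", settings)
print("repeated keys:", duplicates)
```

Reporting repeated keys instead of silently overwriting them makes conflicting settings easy to spot before a long run.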