For Research
        This page lists software I have developed. Most of the code
          is under either the GPL or the LGPL.
          There is no warranty of any kind that the code will work as advertised
          (in fact it might even blow up your machine) - use at your own risk.
          However, if you do find the code useful, or have any comments (bug reports,
          etc.), I'd love to hear from you.
        
          UTools
          UTools is a set of software modules written in C++ for performing
            sentence generation/analysis using a unification-based grammar
            formalism (example grammar). It
            is a modernized and extended version of the original GenKit, written
            by Tomita and Nyberg in 1988 (in LISP), and is a close relative of KANTOO (also
            written in C++), which is used in the KANT project
            at LTI, CMU, although the latter
            is closed source. UTools is an independent implementation and is currently
            used in various projects at LTI such as Avenue.
          To illustrate what it can do, here is how UTools can be used to
            generate a sentence. First, you need a feature structure representing
            the meaning of the sentence you want to generate, like this one ("A
            small dog all of a sudden bites the girl in white with the teeth."):
             ((SUBJ
                  (*OR*
                   ((PRED DOG)
                    (FIN -)
                    (PERS 3)
                    (NUM SG)
                    (BREED CHIHUAHUA))
                   ((PRED DOG)
                    (FIN -)
                    (PERS 3)
                    (NUM SG)
                    (BREED PITBULL))))
              (OBJ
                  ((PRED GIRL)
                    (FIN +)
                    (PERS 3)
                    (NUM SG)
                    (COAT_COLOR WHITE)))
              (INST
                  ((PRED TOOTH)
                    (FIN +)
                    (PERS 3)
                    (NUM PL)))
              (PRED BITE)
              (TENSE PRES)) 
          The feature structure is basically a tree structure specifying what
            the subject is (either a 3rd person singular chihuahua or a pitbull),
            what the object is (the poor girl), what the instrument is
            (tooth in plural), and what the action is (bite!).
            You then need to provide a grammar and
            a lexicon for UTools to generate the
            target sentence. A trace of this process can be seen here.
            You can learn more about this process by reading the
            manual of UKernel:
           B. Han and A. Lavie. UKernel:
              A Unification Kernel. Technical Report CMU-LTI-03-177,
              Language Technologies Institute, Carnegie Mellon University, August
              13, 2004.
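           The idea of unifying feature structures like the one above can be
           sketched in a few lines of Python. This is only an illustration under
           stated assumptions: feature structures are modeled as nested dicts,
           the *OR* disjunction as a list of alternatives, and the names unify,
           unify_any and UnificationFailure are hypothetical. None of this is
           the UTools/UKernel C++ API.

```python
# Illustrative sketch only: feature structures as nested dicts, atoms as strings.
# The names below are hypothetical and do not come from UTools or UKernel.


class UnificationFailure(Exception):
    """Raised when two feature structures carry conflicting values."""


def unify(a, b):
    """Return the unification of two feature structures, or raise on conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            # Shared features must unify recursively; new features are added.
            result[key] = unify(result[key], value) if key in result else value
        return result
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} conflicts with {b!r}")


def unify_any(alternatives, constraint):
    """Treat a list as a disjunction (like *OR*): keep the alternatives that unify."""
    survivors = []
    for alternative in alternatives:
        try:
            survivors.append(unify(alternative, constraint))
        except UnificationFailure:
            pass
    return survivors


# The two SUBJ alternatives from the example above (FIN omitted for brevity).
subj_alternatives = [
    {"PRED": "DOG", "PERS": "3", "NUM": "SG", "BREED": "CHIHUAHUA"},
    {"PRED": "DOG", "PERS": "3", "NUM": "SG", "BREED": "PITBULL"},
]

# A hypothetical agreement constraint that a grammar rule might impose.
third_singular = {"PERS": "3", "NUM": "SG"}

print(unify_any(subj_alternatives, third_singular))        # both alternatives survive
print(unify_any(subj_alternatives, {"BREED": "PITBULL"}))  # only the pitbull survives
print(unify_any(subj_alternatives, {"NUM": "PL"}))         # nothing survives
```

           A rule that demands a plural subject rules out both alternatives,
           which is how a unification-based grammar rejects a rule that does
           not fit the input.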
           Currently UTools has the following components (C++):
          
            - Generator (download):
              This is a simple recursive-descent sentence generator - given a
              feature structure as input, a natural language sentence is generated
              by executing grammar rules.
            - UKernel (download):
              This is the core engine of UTools - it provides the basic functionality
              for executing the unification rules in a grammar.
            - Toolbox (download):
              This is a library implementing basic data structures such as token
              strings/dictionaries, tree templates and context-free grammars
              with prefix searching, etc.
          Notably missing are two modules. The first is a front-end module
            that parses grammar files into the corresponding data structures
            to drive the system (otherwise there is no way for UTools to understand
            the grammar rules you developed in plain text!). The second missing
            module is a parser. With these missing pieces added, UTools would be
            a complete suite meeting a broad range of needs in developing a natural
            language application.
          Thus comes my call for help: my plan is to add a front-end
            module with bindings into popular scripting languages, e.g., Python,
            and to add an Earley
            parser for efficient parsing. These two components should be
            written in C/C++ as well. With limited time on my hands, however, I would
            like to invite anyone who is interested to join me - we can
            even move this project to Sourceforge if
            necessary. The project has to stay open source, though (either GPL or LGPL).
            Please contact me (email at the bottom of this page) if you are interested.
          
          Analyzer
          Analyzer is a program for building a bilingual dictionary from a
            parallel corpus. Analyzer uses a steady-state Genetic Algorithm to
            find a better translation solution. It also uses part-of-speech information
            from the target language to 'optimize' the translation. You need
            to download Toolbox and ePost to
            build and run this code. 
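          For readers unfamiliar with the term, a steady-state genetic algorithm
          replaces one individual per step instead of a whole generation. The
          sketch below shows that loop on a toy bit-string problem. It is an
          assumption-laden illustration, not Analyzer's code: the function name
          steady_state_ga, the tournament selection, the one-point crossover, the
          single-bit mutation and the count-the-ones fitness function (a stand-in
          for translation quality) are all hypothetical choices.

```python
import random


def steady_state_ga(fitness, genome_length, pop_size=20, steps=500, seed=0):
    """Toy steady-state GA over bit strings (requires genome_length >= 2)."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(genome_length)] for _ in range(pop_size)
    ]
    for _ in range(steps):
        # Tournament selection: each parent is the fitter of two random individuals.
        parent_a = max(rng.sample(population, 2), key=fitness)
        parent_b = max(rng.sample(population, 2), key=fitness)

        # One-point crossover followed by flipping a single random bit.
        cut = rng.randrange(1, genome_length)
        child = parent_a[:cut] + parent_b[cut:]
        position = rng.randrange(genome_length)
        child[position] = 1 - child[position]

        # Steady-state step: the child replaces the current worst individual,
        # and only if it is strictly better. The rest of the population survives.
        worst = min(range(pop_size), key=lambda i: fitness(population[i]))
        if fitness(child) > fitness(population[worst]):
            population[worst] = child
    return max(population, key=fitness)


# Toy fitness: the number of 1 bits in the genome.
best = steady_state_ga(sum, genome_length=16)
print(best, sum(best))
```

          Because only the worst individual is ever replaced, and only by a
          better child, the best fitness in the population never decreases.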
          Paper: B. Han. Building
              a Bilingual Dictionary with Scarce Resources: A Genetic Algorithm
              Approach. In the Student Research Workshop, the Second Meeting
              of the North American Chapter of the Association for Computational
              Linguistics (NAACL-2001), Pittsburgh, 2001.
          You can also find the slides in the Publications section. 
          v1_4-20010903: Now compiled
            under g++ 3.0 with -Wall -pedantic, along with some minor fixes. 
          
          ePost
          ePost is an adapted version of Brill's part-of-speech tagger for
            English (it was actually pulled out of Analyzer). The differences
            from Brill's original code are:
          
            - Many memory-related bugs have been fixed, so you can now run the
              tagger more than once at run time.
            - A C++ class wrapper is defined so it's easier to use in C++.
            - Now runs under Windows/MSVC++ 6.0.
          You might also be interested in a cool Java/Perl tagger developed
            by Jimmy Lin at the
            University of Maryland (based on ePost) - you can download it there!
          v1_0-20020722: fixed warnings/errors
            for gcc 3.1.
          
          
         
        
        For Fun
        This is non-research-related code (at least for now). Again, use at
        your own risk.
        
          JunkMatcher
           Icon courtesy of Steve Caplin
           JunkMatcher (on Sourceforge.net)
            is a versatile spam filter add-on for Mac OS X. Although Apple's built-in Mail.app has
            a wonderful, statistically trained junk filter, spammers nowadays
            use various
            tricks to conceal their real message (use of graphics,
            encoded characters, etc.). If you're going nuts about your spam problem,
            I've written a cocktail-styled tool that makes use of various effective
            techniques such as Bayesian filtering, IP-based blocking
            and flexible regular
            expressions to identify that sneaky junk mail. 
          receivedDB
          receivedDB is a set
            of Python scripts for parsing
            the Received: headers in emails. A database file (text)
            is kept separate from the code so new header patterns can be added
            for recognizing new kinds of headers. This is useful for tracking
            the origin of emails and detecting header forgery.
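          The general approach of matching Received: headers against a table of
          patterns kept apart from the parsing logic can be sketched as follows.
          This is a minimal illustration and not receivedDB's actual code or
          database format: the RECEIVED_PATTERNS table, the two regular
          expressions, the field names and the parse_received function are all
          assumptions, and the host names and address are made-up examples.

```python
import re

# Hypothetical pattern table. New header shapes are supported by adding
# entries here, without touching parse_received().
RECEIVED_PATTERNS = [
    (
        "from-paren-ip-by",
        re.compile(
            r"from\s+(?P<helo>\S+)\s+"
            r"\((?P<rdns>\S+)?\s*\[(?P<ip>[0-9a-fA-F.:]+)\]\)\s+"
            r"by\s+(?P<by>\S+)",
            re.IGNORECASE,
        ),
    ),
    (
        "from-by",
        re.compile(r"from\s+(?P<helo>\S+)\s+by\s+(?P<by>\S+)", re.IGNORECASE),
    ),
]


def parse_received(value):
    """Return the fields of the first matching pattern, or None if none match."""
    # Unfold the header: folded lines and runs of whitespace become one space.
    unfolded = " ".join(value.split())
    for name, pattern in RECEIVED_PATTERNS:
        match = pattern.search(unfolded)
        if match:
            fields = {k: v for k, v in match.groupdict().items() if v is not None}
            fields["pattern"] = name
            return fields
    return None


sample = (
    "from mail.example.org (mail.example.org [192.0.2.10])\n"
    "\tby mx.example.net with ESMTP id abc123; Mon, 1 Jan 2024 00:00:00 +0000"
)
print(parse_received(sample))
print(parse_received("from localhost by mx.example.net with local"))
```

          A header that matches no known pattern returns None, which is itself a
          useful signal: it is either a new kind of header worth adding to the
          table or a candidate for closer inspection.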