Proteome and Genome Wide Analysis
Machine Learning for Transmembrane helix prediction
TMpro is an algorithm for transmembrane helix prediction, built by analogy to the latent semantic analysis model. A web server makes the algorithm available to the scientific community and allows up to 4000 sequences to be analyzed at a time. Current and future work involves designing learning algorithms that take into account additional sources of information, some of which may be partial or unreliable.
Sequence-based prediction of genes that escape inactivation in the DNA
Biological Language Modeling Toolkit (BLMT):
A toolkit to compute n-gram frequencies (n-mer / k-mer / oligomer frequencies) from protein or nucleotide sequence data was built previously. It processes protein or genome sequence data into suffix arrays and computes a variety of sequence features such as n-grams and Yule values. The source code is in C and may be installed on any standard computer. The system has been tested on up to 25 MB of data at a time. The web interface provides an interactive way to compute these features without installing the software locally. A number of applications have been built on the toolkit, e.g. comparison of Yule values of hydrophobic segments in transmembrane and globular proteins, n-gram comparison between human and mouse genomes, and a scalable algorithm for variable number tandem repeats (VNTRs).
Current and future work involves improving the scalability of the algorithms and developing novel applications.
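As an illustration of the n-gram frequencies described above, the following is a minimal Python sketch that counts overlapping n-grams in a single sequence. It is not the toolkit's code: the toolkit is written in C and works on suffix arrays, whereas this sketch uses a plain counter, and the function name ngram_counts and the example sequence are hypothetical.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count every overlapping n-gram (n-mer / k-mer) in one sequence.

    Naive illustration only; a suffix-array implementation such as the
    one the toolkit uses scales far better on genome-sized input.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


# Hypothetical example sequence, not taken from any data set.
counts = ngram_counts("MKVLAAGLLK", 2)
for ngram, count in counts.most_common(3):
    print(ngram, count)
```

Sliding the window one position at a time is what makes the n-grams overlap, so a sequence of length L yields L - n + 1 n-grams in total.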
Genome Sequence Analysis with BLM toolkit
Analyzing protein sequences as if they were natural language texts allows sequence analyses analogous to "topic segmentation" and "document classification". We computed the n-gram frequencies of 44 different organisms using the n-gram comparison functions provided by the Biological Language Modeling Toolkit and performed Markovian n-gram analysis, Zipf analysis and n-gram phrase analysis, leading to the identification of genome signatures of organisms.
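A Zipf analysis of this kind ranks n-grams by frequency and examines how frequency falls with rank on a log-log scale. The sketch below is one simple way to do that in Python, assuming an ordinary least-squares fit of log(frequency) against log(rank); the function name zipf_slope and the example counts are hypothetical, and the project's own analysis may have been carried out differently.

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log(frequency) versus log(rank).

    `counts` maps each n-gram to its count. A slope close to -1 is the
    classic Zipf pattern seen in natural language word frequencies.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct n-grams")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical counts chosen so that frequency is exactly 120 / rank.
example = {"AL": 120, "LL": 60, "LA": 40, "VL": 30}
print(zipf_slope(example))
```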
Comparison of transmembrane and soluble-hydrophobic helices
Transmembrane (TM) helix prediction algorithms often incorrectly predict globular helices and signal peptide sequences to be of TM type. The goal of this project was to determine whether correlations between amino acids differ among globular helices, signal peptide sequences and actual transmembrane regions. Yule’s Q-statistic was computed using the BLM Toolkit for the three data sets. The results show that Yule values vary between the three data sets and may prove to be useful features for TM prediction algorithms.
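For reference, Yule's Q for a 2x2 contingency table with cell counts n11, n10, n01 and n00 is (n11*n00 - n10*n01) / (n11*n00 + n10*n01), and ranges from -1 to +1. The Python sketch below applies it to a pair of residues x and y separated by a fixed distance along the sequence. How the table is filled in here is an assumption made for illustration, and the function name yule_q and its parameters are hypothetical; the BLM Toolkit's own implementation may differ.

```python
def yule_q(sequences, x, y, distance=1):
    """Yule's Q for residue x at position i and residue y at i + distance.

    The 2x2 table is filled over all position pairs (assumed layout):
      n11: x then y        n10: x then not-y
      n01: not-x then y    n00: not-x then not-y
    Returns None when the denominator is zero (Q is undefined).
    """
    n11 = n10 = n01 = n00 = 0
    for seq in sequences:
        for i in range(len(seq) - distance):
            first_is_x = seq[i] == x
            second_is_y = seq[i + distance] == y
            if first_is_x and second_is_y:
                n11 += 1
            elif first_is_x:
                n10 += 1
            elif second_is_y:
                n01 += 1
            else:
                n00 += 1
    denominator = n11 * n00 + n10 * n01
    if denominator == 0:
        return None
    return (n11 * n00 - n10 * n01) / denominator


# Hypothetical toy input: A is always followed by L, so Q is +1.
print(yule_q(["ALALAL"], "A", "L"))
```

Computing the statistic separately for each data set (globular helices, signal peptides, TM regions) and each residue pair gives values that can be compared across the sets.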
Universal Digital Library, Language Technologies
Om Transliteration Editor
A large number of different languages are spoken in India. The languages and scripts are distinct from each other, but all Indian languages are phonetic in nature. We developed a transliteration scheme, Om, which exploits this phonetic nature of the alphabet. Om uses ASCII characters to represent Indian language alphabets and can be read directly in English by the large number of users who cannot read the scripts of Indian languages other than their mother tongue. It is also useful in computer applications where local language tools are not yet available, such as email and chat. We also developed a text editor for Indian languages that integrates Om input for many Indian languages into a word processor such as Microsoft Winword®. The text editor was also developed on the Java® platform, so it can run on UNIX machines as well. This transliteration scheme is proposed as a possible standard for Indian language transliteration and keyboard entry.
Multilingual Book Reader: Transliteration, Word-to-Word and Full-text Translation
India is a multilingual nation with 22 recognised official languages and has literature in all of them; this literature is represented in the Digital Library of India (DLI), which holds over 120,000 books. DLI has driven the creation of a large number of applications to process and present the Indian language content. In this paper, we present the creation of a multilingual book reader interface for DLI that supports transliteration and “good enough translation” features, making it possible for readers to read a book that is written in another language.
Telugu Morphological Generator
Telmore is a morphological generator tool for Telugu nouns and verbs. Nouns generator: For nouns, it takes a word and its "class" as input and generates morphological forms as output. It generates a total of 17 noun forms across case (nominative, genitive, accusative, dative, locative, instrumental and vocative), gender (masculine, feminine or neutral) and number. Verbs generator: For verbs, it takes a word in the infinitive t'a form (ichchut'a, geluchut'a, raayut'a) and generates its morphological forms as output. The output has 130 forms: combinations of 2 numbers (singular, plural), 3 genders (male, female, neutral), 3 persons (1st, 2nd and 3rd person) and 7 tenses/moods (present, past, future, aorist affirmative, aorist negative, imperative and prohibitive), plus 4 independent participles. Input and output of Telugu text are in Om transliteration.
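The count of 130 verb forms is consistent with a full cross product of the listed categories plus the participles: 2 x 3 x 3 x 7 = 126, and 126 + 4 = 130. The Python sketch below enumerates those slots to check the arithmetic. It assumes every combination is realised as a separate form, which the description suggests but does not state outright, and it does not generate any Telugu text; all names in it are hypothetical.

```python
from itertools import product

NUMBERS = ("singular", "plural")
GENDERS = ("male", "female", "neutral")
PERSONS = ("1st", "2nd", "3rd")
TENSES_MOODS = (
    "present", "past", "future",
    "aorist affirmative", "aorist negative",
    "imperative", "prohibitive",
)
INDEPENDENT_PARTICIPLES = 4  # listed separately from the inflected slots

# One slot per (number, gender, person, tense/mood) combination.
slots = list(product(NUMBERS, GENDERS, PERSONS, TENSES_MOODS))
total_forms = len(slots) + INDEPENDENT_PARTICIPLES

print(len(slots), total_forms)
```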