Time: 9:30 am The goal of this research is to use speech recognition errors to improve language model generation. This research is done within the context of project LISTEN's Reading Tutor which uses speech recognition to listen to children read stories one sentence at a time, and detects reading errors. Since the sentence the child is attempting to read is known ahead of time, the speech recognizer uses a separate language model for each sentence that consists of only the words in the sentence, and a few other words or phoneme sequences that the child is predicted to utter. We construct and evaluate an algorithm that learns to improve on such language models. Specifically, given a set of transcribed utterances and their corresponding language models, we recognize the utterances, and compute a measure of tracking error rate by comparing the output hypotheses with the transcripts and the target text. Next we identify those n-grams in the language models that contributed to an increase or decrease in tracking error. Finally we create a feature vector for each such n-gram, and then induce a classifier that outputs the probability that a given n-gram will decrease tracking error. Given this classifier, we then implement a simple "nudge" algorithm that increases or decreases the transition probabilities of n-grams in unseen language models based on whether these n-grams are predicted to decrease or increase tracking error. We show that this algorithm can reduce the tracking error rate by more than 24% on unseen test data.
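The "nudge" step described above can be sketched as follows. This is a minimal illustration, not the Reading Tutor's implementation: the abstract does not give the update rule, so the multiplicative `step`, the neutral `threshold` of 0.5, the per-history renormalization, and the function name `nudge` are all assumptions.

```python
def nudge(bigram_probs, p_helpful, step=0.1, threshold=0.5):
    """Raise or lower transition probabilities, then renormalize per history.

    bigram_probs: {(history, word): probability} for one sentence's language model.
    p_helpful:    {(history, word): classifier probability that the n-gram
                   decreases tracking error} (hypothetical classifier output).
    """
    nudged = {}
    for (history, word), prob in bigram_probs.items():
        # N-grams the classifier has not scored are treated as neutral.
        score = p_helpful.get((history, word), threshold)
        if score > threshold:
            factor = 1.0 + step      # predicted to decrease tracking error
        elif score < threshold:
            factor = 1.0 - step      # predicted to increase tracking error
        else:
            factor = 1.0
        nudged[(history, word)] = prob * factor

    # Renormalize so the transitions out of each history still sum to one.
    totals = {}
    for (history, _), prob in nudged.items():
        totals[history] = totals.get(history, 0.0) + prob
    return {key: prob / totals[key[0]] for key, prob in nudged.items()}


if __name__ == "__main__":
    model = {("the", "cat"): 0.5, ("the", "dog"): 0.5}
    scores = {("the", "cat"): 0.9, ("the", "dog"): 0.2}
    print(nudge(model, scores))
```

Renormalizing per history keeps the nudged model a valid probability distribution; other choices, such as additive updates in log space, would fit the description equally well.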
Time: 10:00 am We present an integrated phrase segmentation/alignment algorithm (ISA) for Statistical Machine Translation. Unlike other methods, it needs neither an initial word-to-word alignment nor an initial segmentation of the monolingual text into phrases; instead, it segments the sentences into phrases and finds their alignments simultaneously. For each sentence pair, ISA builds a two-dimensional matrix in which the value of each cell is the Point-wise Mutual Information (MI) between the corresponding source and target words. Based on the similarities of MI values among cells, we identify the aligned phrase pairs. Once all the phrase pairs are found, we know both how to segment a sentence into phrases and how the source and target sentences align. We use monolingual bigram language models to estimate the joint probabilities of the identified phrase pairs. The joint probabilities are then normalized to conditional probabilities, which are used by the decoder. Despite its simplicity, this approach yields phrase-to-phrase translations with significantly higher precision than our baseline system, in which phrase translations are extracted from the HMM word alignment. When we combine the phrase-to-phrase translations generated by this algorithm with the baseline system, the improvement in translation quality is even larger.
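The matrix that ISA builds for a sentence pair can be illustrated with a small sketch. It covers only the matrix construction, not the phrase identification step, and it assumes that MI is estimated from sentence-level co-occurrence counts over a parallel corpus; the abstract does not say how the probabilities are estimated, and the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(corpus):
    """Count in how many sentence pairs each word and each word pair occurs."""
    src, tgt, joint = Counter(), Counter(), Counter()
    for s_sent, t_sent in corpus:
        s_set, t_set = set(s_sent), set(t_sent)
        src.update(s_set)
        tgt.update(t_set)
        joint.update((s, t) for s in s_set for t in t_set)
    return src, tgt, joint, len(corpus)


def pmi_matrix(s_sent, t_sent, src, tgt, joint, n):
    """Rows are source words, columns are target words, cells are PMI values."""
    matrix = []
    for s in s_sent:
        row = []
        for t in t_sent:
            c = joint[(s, t)]
            # PMI = log( p(s,t) / (p(s) p(t)) ) with p estimated as count / n.
            row.append(math.log(c * n / (src[s] * tgt[t])) if c else float("-inf"))
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    toy = [(["das", "haus"], ["the", "house"]),
           (["das", "buch"], ["the", "book"])]
    counts = cooccurrence_counts(toy)
    for row in pmi_matrix(*toy[0], *counts):
        print(row)
```

In the toy corpus the cell for "haus"/"house" is the only one with positive MI, which is the kind of contrast a phrase-identification step could exploit.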
Time: 11:00 am In most practical systems that incorporate automatic natural language analysis, real time has long been a forgotten dimension. In the few that do use it, the treatment of time is often over-simplified and not flexible enough for the sophisticated reasoning required in more advanced tasks. The problem is twofold: the nuances of time manifested in natural language require a rich yet compositional representation, and the need for advanced reasoning demands a more inference-ready encoding of time. Previous work has explored the reasoning aspect of time, such as work on various temporal logics and temporal constraint problems, as well as the annotation and representation of time in natural language, such as the recent development of the DAML Ontology of Time, Timex2 and TimeML. However, we feel that the needs at both ends of the spectrum have not been fully satisfied. In this work we propose a time calculus for natural language, set in the framework of constraint satisfaction problems (CSP). Architecturally, real-world calendars are modeled as the lowest-level CSP and provide the basic vocabulary for meaning composition. Temporal expressions are then captured using a typed formal language, in which phenomena such as granularity conversion and re-interpretation are handled via type coercion. With the proposed operators and relations, both qualitative and quantitative constraints among various temporal entities can then be expressed, and the resulting temporal constraint satisfaction problem (TCSP) is solved using a modified all-pairs shortest path algorithm. As with previous work on TCSP, this formulation will enable the answering of various interesting temporal queries.
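For readers unfamiliar with how shortest paths solve temporal constraints, here is a minimal sketch. It uses the standard, unmodified Floyd-Warshall algorithm on a simple temporal problem with one interval per constraint; the talk's modified algorithm and its typed language are not reproduced, and the function name `tighten` and the example events are assumptions.

```python
INF = float("inf")


def tighten(n, constraints):
    """Check consistency and tighten bounds for n time points.

    constraints: list of (i, j, lo, hi), each meaning lo <= t_j - t_i <= hi.
    Returns (consistent, d), where d[i][j] is the tightest upper bound
    on t_j - t_i implied by all constraints together.
    """
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for i, j, lo, hi in constraints:
        d[i][j] = min(d[i][j], hi)    # t_j - t_i <= hi
        d[j][i] = min(d[j][i], -lo)   # t_i - t_j <= -lo

    # All-pairs shortest paths over the distance graph.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]

    # A negative cycle means the constraints cannot all hold.
    consistent = all(d[i][i] >= 0 for i in range(n))
    return consistent, d


if __name__ == "__main__":
    # Hypothetical events, in minutes: 0 = meeting starts, 1 = meeting ends, 2 = lunch.
    ok, d = tighten(3, [(0, 1, 30, 60), (1, 2, 10, 20)])
    print(ok, "lunch is between", -d[2][0], "and", d[0][2], "minutes after the start")
```

A query such as "how long after the meeting starts can lunch be?" is then answered by reading the tightened bounds directly from the matrix.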
Time: 11:30 am Competitive research systems for Statistical Machine Translation (SMT) all employ some method of phrase-level translation in addition to or as a replacement for the word-level models originally proposed by (Brown et al., 1993). While extraction methods for these phrases differ among systems (see (Vogel et al., 2003), (Marcu and Wong, 2002), (Zens et al., 2002) for examples), the systems themselves all must combine phrase translation candidates in a useful way during decoding in order to take full advantage of them. This presentation treats the combination of partial translations in an SMT system currently used at CMU. Specifically, I address the inability of the decoder in this system to combine phrase translations that overlap. As an example, consider the following two translation pairs:
a b c -> w x y
b c d -> x y z
While the phrases a b c and b c d can be translated into w x y and x y z, respectively, the traditional system is unable to use this information to translate a b c d as w x y z. The planned talk describes a series of experiments on allowing such overlapping phrase translations. I will present translation results on the Arabic-English development set used in the TIDES evaluations this Spring, along with an analysis of the number and quality of overlapping phrases that were generated. Interesting issues raised by this work include the effect of chaining longer and longer phrase rules together, the role of these long phrases in combination with reordering models which may contradict them, and the effect on decoder speed and the number of translation hypotheses generated for a single test sentence. Overall this presentation represents an effort to help the SMT system move beyond memorization of the training data in its use of phrase-level translations. References
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311.
Daniel Marcu and William Wong. 2002. A Phrase-based, Joint Probability Model for Statistical Machine Translation. In Proceedings of EMNLP-02, Philadelphia, July.
Stephan Vogel, Ying Zhang, Fei Huang, Alicia Tribble, Ashish Venugopal, Bing Zhao, and Alex Waibel. 2003. The CMU Statistical Translation System. To appear in MT-Summit, New Orleans, September.
Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-Based Statistical Machine Translation. In KI-2002: 25th Annual German Conference on AI, Springer Verlag, September.
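The a b c / b c d example from this abstract can be made concrete with a short sketch. It is only an illustration of the idea, not the CMU decoder: it assumes monotone order and requires a non-empty overlap on both the source and the target side, and the helper names `overlap` and `merge_overlapping` are hypothetical.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_overlapping(pair_a, pair_b):
    """Chain two phrase pairs whose source sides and target sides both overlap.

    Each pair is (source_tokens, target_tokens). Returns the merged pair,
    or None when either side has no overlap.
    """
    (src_a, tgt_a), (src_b, tgt_b) = pair_a, pair_b
    k_src = overlap(src_a, src_b)
    k_tgt = overlap(tgt_a, tgt_b)
    if k_src == 0 or k_tgt == 0:
        return None
    return src_a + src_b[k_src:], tgt_a + tgt_b[k_tgt:]


if __name__ == "__main__":
    first = ("a b c".split(), "w x y".split())
    second = ("b c d".split(), "x y z".split())
    print(merge_overlapping(first, second))
```

Requiring the target sides to overlap as well is one simple way to avoid chaining phrase pairs that agree on the source but translate the shared words differently.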
Time: Noon Logistic Regression (LR) has been widely used in statistics for many years, and has recently received extensive study in the machine learning community due to its close relations to Support Vector Machines (SVM) and AdaBoost. In this paper, we use a modified version of LR to approximate the optimization of SVM by a sequence of unconstrained optimization problems. We prove that our approximation will converge to SVM, and propose an iterative algorithm called "MLR-CG" which uses Conjugate Gradient as its inner loop. A multiclass version, "MMLR-CG", is also obtained after simple modifications. We compare MLR-CG with SVM_light over different text categorization collections, and show that our algorithm is much more efficient than SVM_light when the number of training examples is very large. Results of the multiclass version MMLR-CG are also reported.
Time: 2:00 pm A potentially useful feature of information retrieval systems for students is the ability to identify documents that are not only relevant to the query, but also a good match for the student's reading level. Manually obtaining an estimate of reading difficulty for each document is not feasible for very large collections, so we require an automated technique. Traditional readability measures such as Flesch-Kincaid perform very poorly on Web pages and other non-traditional documents, because of unreliable sentence length estimates and other factors. Recasting the well-studied problem of readability in terms of text categorization, we describe a new method based on simple statistical language modeling techniques. We show that by using a mixture model to interpolate evidence of a word's frequency across grades, it is possible to build a classifier achieving good performance across a range of individual grade levels. In addition, the classifier is not specific to any subject area and can be built using relatively little training data.
The rapid growth of information on the Internet demands intelligent information agents that can sift through all the available information and find what is most valuable to us. Collaborative filtering is an intelligent approach that makes recommendation decisions for a specific user based on the judgments of users with similar tastes. This work presents a flexible mixture model (FMM) for collaborative filtering. FMM is based on the idea that users and objects in a collaborative filtering system are different concepts and should be clustered separately, and that at the same time each should be allowed to belong to more than one cluster. Our second observation is that users with similar or even identical preference patterns may have totally different rating behaviors. Therefore, we propose a decoupled model (DM) to explicitly extract user preference values from the surface rating values. An empirical study over two datasets of movie ratings has shown that our new algorithm substantially outperforms five other collaborative filtering algorithms. This is joint work with Rong Jin, ChengXiang Zhai and Jamie Callan.
Time: 3:00 pm Identification of keywords is an important task in language technologies research. For example, keyword extraction is needed in question answering and summarization. A task-oriented dialog system may recognize keywords from a task-specific lexicon instead of recognizing entire utterances. In the language of proteins, “keywords” are termed “motifs”: short amino acid sequences that are conserved across a family or subfamily of proteins because they are the binding sites for protein-protein interactions typical of that family or subfamily. Identifying motifs is an important task in bioinformatics because they aid in large-scale protein-protein interaction prediction and new drug design. However, as in many Asian languages, there are no word boundaries in protein sequences, making the task more difficult. Moreover, unlike for human languages, we have yet to build a lexicon for the protein language. G-protein coupled receptors (GPCR) comprise one of the largest superfamilies of proteins found in the body (Gether, 2000), and are the target of approximately 60% of current drugs on the market (Muller, 2000). They are also one of the most challenging datasets in protein classification due to the extreme diversity among their members (Moriyama and Kim, 2003). The GPCR superfamily is organized hierarchically in various levels of subfamilies. Karchin et al. (2002) tested a set of classifiers of varying complexity, from k-NN to SVM, in GPCR classification at the superfamily and subfamily levels, and showed that the more complex SVM classifier performed better than the other classifiers at subfamily-level classification. Here, we show that by choosing the right features, namely n-gram counts selected by chi-square, a feature selection method successful in document classification (Yang and Pedersen, 1997), the simpler Naïve Bayes classifier can outperform the SVM. In addition, the selected n-grams appear to have biological significance, since they correlate with motifs previously identified through wet-lab experiments.
References
U. Gether. Uncovering Molecular Mechanisms Involved in Activation of G Protein-Coupled Receptors. Endocrine Reviews, 21(1):90-113, 2000.
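The chi-square selection of n-grams mentioned in the 3:00 pm abstract can be sketched as follows. This is a minimal illustration using the standard two-by-two chi-square statistic for term and class from the text categorization literature; the presence/absence counting, the function names, and the toy sequences (made-up strings, not real protein data) are assumptions.

```python
from collections import Counter


def ngrams(seq, n):
    """Set of distinct length-n substrings of a sequence string."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def chi_square_scores(sequences, labels, target, n=2):
    """Chi-square association between each n-gram and the `target` class."""
    total = len(sequences)
    in_class = sum(1 for y in labels if y == target)
    with_gram, with_gram_in_class = Counter(), Counter()
    for seq, y in zip(sequences, labels):
        for g in ngrams(seq, n):
            with_gram[g] += 1
            if y == target:
                with_gram_in_class[g] += 1

    scores = {}
    for g, present in with_gram.items():
        a = with_gram_in_class[g]       # in class, n-gram present
        b = present - a                 # not in class, n-gram present
        c = in_class - a                # in class, n-gram absent
        d = total - in_class - b        # not in class, n-gram absent
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[g] = total * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    seqs = ["ACAC", "ACGG", "GGTT", "TTGG"]
    labs = ["x", "x", "y", "y"]
    ranked = sorted(chi_square_scores(seqs, labs, "x").items(),
                    key=lambda item: -item[1])
    print(ranked[:3])
```

The top-scoring n-grams would then serve as the features of a Naïve Bayes classifier. Note that a high score marks association in either direction, so an n-gram that never occurs in the target class can rank as high as one that occurs only there.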
Time: 4:00 pm Many natural language applications, such as speech recognition, information retrieval, and machine translation, employ classification with statistical machine learning methods. To perform classification well, one needs a large amount of labeled data, which is often hard to obtain. Unlabeled data, on the other hand, may be relatively easy to collect, but it has traditionally been ignored for classification. It is of great interest to find ways to use both labeled and unlabeled data. I propose an approach based on a Gaussian random field model to learn from both labeled and unlabeled data. Labeled and unlabeled data are represented as vertices in a weighted graph, with edge weights encoding the similarity between instances. The learning problem is then formulated in terms of a Gaussian random field on this graph, where the mean of the field is characterized in terms of harmonic functions, and is efficiently obtained using matrix methods or belief propagation. The resulting learning algorithms have intimate connections with random walks, electric networks, and spectral graph theory. Promising experimental results are presented for synthetic data, digit classification, and text classification tasks.
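The harmonic property at the heart of this approach says that the value at each unlabeled vertex equals the weighted average of its neighbors' values. A minimal sketch follows; it substitutes simple iterative averaging for the matrix methods or belief propagation named in the abstract, and the function name, the fixed iteration count, and the toy chain graph are assumptions.

```python
def harmonic_labels(weights, labeled, iterations=200):
    """Approximate the harmonic function on a weighted graph.

    weights: {node: {neighbor: weight}}, symmetric, with every unlabeled
             node assumed to have at least one neighbor.
    labeled: {node: 0.0 or 1.0} for the labeled vertices, which stay fixed.
    """
    # Start unlabeled vertices at the neutral value 0.5.
    f = {node: labeled.get(node, 0.5) for node in weights}
    for _ in range(iterations):
        for node, nbrs in weights.items():
            if node in labeled:
                continue
            total = sum(nbrs.values())
            # Replace the value by the weighted average of the neighbors.
            f[node] = sum(w * f[nb] for nb, w in nbrs.items()) / total
    return f


if __name__ == "__main__":
    # Chain a - b - c - d with unit weights; a is labeled 1, d is labeled 0.
    graph = {"a": {"b": 1.0},
             "b": {"a": 1.0, "c": 1.0},
             "c": {"b": 1.0, "d": 1.0},
             "d": {"c": 1.0}}
    print(harmonic_labels(graph, {"a": 1.0, "d": 0.0}))
```

On the chain, the values fall off linearly from the positive to the negative label, which matches the electric-network reading of the model: the labeled vertices act as voltage sources and the unlabeled values as the resulting potentials.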
Time: 4:30 pm My colleagues and I have analyzed speech interface requirements over a broad range of applications to find a small but credible requirements basis for configurable speech interfaces. Based on these analyses, I have built the Speech Graffiti Personal Universal Controller (SGPUC or Controller), a personal universal interface for human-device speech interaction. Its specification language and protocol effectively separate the SGPUC architecture from the devices that it controls, which allows users to carry the controller with them as a personal speech interface and use it to interact with any adapted device. The development of numerous adapted devices has demonstrated the SGPUC to be a sufficiently generic interface, and careful user studies are being planned to demonstrate the quality of those speech interfaces. The research, which is part of my upcoming master's thesis, attempts to show that a high-quality, low-cost human-device speech interface can be built that is largely device agnostic, which benefits manufacturers and interface users alike. These investigations also help to validate the principles of Speech Graffiti as a speech interface paradigm, and they provide a baseline for future study in this area. My talk will focus on dialog issues that we have faced in the design of effective speech interactions, and I will also provide an overview of the dialog and gadget controller system architecture. I will demonstrate some actual devices, probably a media player and some light switches.