
Past Projects

Summer 2012 Microblog Search

In Summer 2012, we participated in the ad-hoc search task of the TREC Microblog Track. We focused on the vocabulary mismatch problem between tweets and queries and proposed two approaches to address it. The first is query expansion through pseudo-relevance feedback; the second is document expansion of tweets using the web documents linked from the body of the tweet. The two approaches gave additive gains in MAP and P@30, and our best run was in the top 10 of the automatic runs submitted.
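As a rough illustration of query expansion through pseudo-relevance feedback, the sketch below treats the top-ranked tweets of an initial retrieval as if they were relevant and adds their most common new terms to the query. The function name `expand_query`, its parameters, the sample tweets, and the simple document-frequency scoring are all assumptions made for this example; the term weighting actually used in the TREC runs is not described here.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=3, n_new_terms=2):
    """Append the most frequent new terms from the top-k retrieved documents.

    Hypothetical helper: scoring is plain document frequency, which is an
    assumption of this sketch, not the method used in the submitted runs.
    """
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        # Count each term at most once per feedback document.
        counts.update(set(doc.lower().split()))
    original = set(query_terms)
    candidates = [(term, n) for term, n in counts.items() if term not in original]
    # Highest count first; alphabetical order breaks ties deterministically.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return list(query_terms) + [term for term, _ in candidates[:n_new_terms]]


# Made-up tweets standing in for the ranked output of an initial search.
tweets = [
    "earthquake hits japan coast",
    "japan earthquake tsunami warning",
    "tsunami warning issued for japan coast",
    "new phone released today",
]
expanded = expand_query(["earthquake"], tweets, top_k=3, n_new_terms=2)
print(expanded)
```

The expanded query can then be run again, so that tweets mentioning the added terms but not the original query word can still be retrieved.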

Yubin Kim, Reyyan Yeniterzi and Jamie Callan, Overcoming Vocabulary Limitations in Microblogs, in Proceedings of the 21st Text REtrieval Conference (TREC 2012), National Institute of Standards and Technology, special publication. In press.

 

Fall 2008 - Spring 2009 Master's Thesis: Syntax-to-morphology alignment and constituent reordering in factored phrase-based statistical machine translation from English to Turkish

We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures. Our approach relies on syntactic analysis on the source side (English) and then encodes a wide variety of local and non-local syntactic structures as complex structural tags which appear as additional factors in the training data. On the target side (Turkish), we only perform morphological analysis and disambiguation but treat the complete complex morphological tag as a factor, instead of separating morphemes. We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores. Our maximal set of source and target side transformations, coupled with some additional techniques, provides a 39% relative improvement from a baseline 17.08 to 23.78 BLEU, all averaged over 10 training and test sets. Now that the syntactic analysis on the English side is available, we also experiment with more long distance constituent reordering to bring the English constituent order close to Turkish, but find that these transformations do not provide any additional consistent tangible gains when averaged over the 10 sets.
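The relative improvement quoted above follows directly from the two BLEU scores. The short check below reproduces the arithmetic; the function name `relative_improvement` is only an illustrative choice.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# BLEU scores reported above: baseline 17.08, best system 23.78.
gain = relative_improvement(17.08, 23.78)
print(f"{gain:.1f}%")  # prints 39.2%
```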

This project was supported by the Qatar Foundation through Carnegie Mellon University's Seed Research program.

Reyyan Yeniterzi and Kemal Oflazer, Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish, in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), Uppsala, Sweden, 2010.

 

Summer 2008 Evaluating the Effects of Synthesization on Text De-identification
  • Bradley Malin, Ph.D., Department of Biomedical Informatics Vanderbilt University Medical Center
  • John Aberdeen, the MITRE Corporation
  • Samuel Bayer, the MITRE Corporation
  • Lynette Hirschman, Ph.D., the MITRE Corporation
  • Ben Wellner, the MITRE Corporation

De-identified medical records are critical to biomedical research. Text de-identification software exists, including “resynthesis” components that replace real identifiers with synthetic identifiers. The goal of this research is to evaluate the effectiveness and examine possible bias introduced by resynthesis on de-identification software. We evaluated the open-source MITRE Identification Scrubber Toolkit, which includes a resynthesis capability, with clinical text from Vanderbilt University Medical Center patient records. We investigated four record classes from over 500 patients' files, including laboratory reports, medication orders, discharge summaries and clinical notes. We trained and tested the de-identification tool on real and resynthesized records. We measured performance in terms of precision, recall, F-measure and accuracy for the detection of protected health identifiers as designated by the HIPAA Safe Harbor Rule.

The de-identification tool was trained and tested on a collection of real and resynthesized Vanderbilt records. Results for training and testing on the real records were 0.990 accuracy and 0.960 F-measure. The results improved when the tool was trained and tested on resynthesized records, with 0.998 accuracy and 0.980 F-measure, but deteriorated moderately when it was trained on real records and tested on resynthesized records, with 0.989 accuracy and 0.862 F-measure. Moreover, the results declined significantly when the tool was trained on resynthesized records and tested on real records, with 0.942 accuracy and 0.728 F-measure. The de-identification tool achieves high accuracy when training and test sets are homogeneous (i.e., both real or both resynthesized records). The resynthesis component regularizes the data to make them less “realistic,” resulting in loss of performance, particularly when training on resynthesized data and testing on real data.
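The gap between accuracy and F-measure in these results is easier to read with the definitions in hand. The sketch below computes all four measures from per-token decisions; evaluating at the token level, the function name `evaluate`, and the toy labels are assumptions made for illustration and are not taken from the study.

```python
def evaluate(gold, predicted):
    """Compute precision, recall, F-measure and accuracy.

    gold, predicted: equal-length lists of booleans, where True means the
    token is marked as a protected identifier (token-level scoring is an
    assumption of this sketch).
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}


# Toy example: 3 identifier tokens among 10; one is missed, one is a false alarm.
gold = [True, True, True] + [False] * 7
predicted = [True, True, False, True] + [False] * 6
metrics = evaluate(gold, predicted)
print(metrics)
```

Because most tokens in a record are not identifiers, the true negatives keep accuracy high even when several identifiers are missed, whereas F-measure ignores true negatives and drops more sharply.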

The project was supported by the Vanderbilt International Office (VIO) Grants Program.

Reyyan Yeniterzi, John Aberdeen, Sam Bayer, Ben Wellner, Lynette Hirschman, and Bradley Malin, Effects of Personal Identifier Resynthesis on Clinical Text De-identification, Journal of the American Medical Informatics Association, 17: 159-168, 2010.

John Aberdeen, Samuel Bayer, Reyyan Yeniterzi, Ben Wellner, Cheryl Clark, David Hanauer, Bradley Malin and Lynette Hirschman, The MITRE Identification Scrubber Toolkit: Design, Training, Assessment, International Journal of Medical Informatics, forthcoming.

 

Spring 2008 Developing A Natural Language Based Querying System for Integrated Biomedical Ontologies

We introduce a controlled natural language for biomedical queries, called BioQueryCNL, and present an algorithm to convert a biomedical query in this language into a program in answer set programming (ASP), a formal framework to automate reasoning about knowledge. BioQueryCNL allows users to express complex queries (possibly containing nested relative clauses and cardinality constraints) over biomedical ontologies; and such a transformation of BioQueryCNL queries into ASP programs is useful for automating reasoning about biomedical ontologies by means of ASP solvers. We precisely describe the grammar of BioQueryCNL, implement our transformation algorithm, and illustrate its applicability to biomedical queries by some examples.

Esra Erdem and Reyyan Yeniterzi, Transforming Controlled Natural Language Biomedical Queries into Answer Set Programs, Proceedings of the BioNLP 2009 Workshop at the Annual Meeting of the Association for Computational Linguistics (ACL).

 

Fall 2007 Determining the important career factors which affect the success of a manager as a CEO
  • Suveyda Yeniterzi, M. Sc., Computer Science and Engineering, Sabanci University
  • Ugur Sezerman, Ph.D., Biological Science and Engineering, Sabanci University
  • Nilay Noyan, Ph.D., Manufacturing Systems / Industrial Engineering, Sabanci University
  • Ayse Karaevli, Faculty of Management, Sabanci University

We approached this problem in two ways. Our first method was to use factor analysis to examine the underlying structure among the variables. As the second approach, we used a genetic algorithm to find a subset of features that helps us classify CEOs more accurately.

 

Fall 2007 Developing A New Approach to Measure the Similarities of Protein Structures Using Network Properties
  • Suveyda Yeniterzi, M. Sc., Computer Science and Engineering, Sabanci University
  • Alper Kucukural, Ph.D. student, Biological Science and Engineering, Sabanci University
  • Ugur Sezerman, Ph.D., Biological Science and Engineering, Sabanci University
  • Nilay Noyan, Ph.D., Manufacturing Systems / Industrial Engineering, Sabanci University

Protein structure prediction is one of the most important research areas in bioinformatics. CASP is a worldwide experiment in this area that assesses the quality of the methods and results of international research. CASP evaluation is based on comparing each model with the corresponding native structure.
In this work we aim to estimate a new function that measures the similarity between model and native protein structures. Moments of graph-theoretical properties were used to derive a similarity measure between two protein structures, and multiple linear regression was applied to these graph properties to estimate the new function.

Suveyda Yeniterzi, Reyyan Yeniterzi, Alper Kucukural, Nilay Noyan and Ugur Sezerman, A New Approach to Measure the Similarities of Protein Structures Using Network Properties, presented at HIBIT08, International Symposium on Health Informatics and Bioinformatics, May 18-20, 2008, Istanbul, Turkey.

A New Approach to Measure the Similarities of Protein Structures Using Network Properties, poster presented at BIOSYSBIO 2008, Synthetic Biology, Systems Biology and Bioinformatics, April 20-22, 2008, London, UK.

 

Fall 2006 - Spring 2007 Graduation Project: Developing an online multiplayer game to collect statistical data for word alignments between English-Turkish parallel sentences

Supervised by Kemal Oflazer, Ph.D., Computer Science and Engineering, Sabanci University

  • Hanife Kebapci, M. Sc., Computer Science and Engineering, Sabanci University
  • Ahmet Hakan Goral

The ESP Game is a famous example of Games With a Purpose: games that people play while, in the background, the human computation they perform is collected for use in research. Luis von Ahn, Ph.D., developed many such games to improve the accuracy of search and other computational tasks.

Today, many machine learning applications need more data to become more accurate, and statistical machine translation is one of them. In this project we aim to address this need by collecting word alignments with a game called "E.T. English Turkish Alignment Game". In this game, two players simultaneously try to align the words of the same English and Turkish sentences. Alignments on which both players agree are stored, together with statistics about them.
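The agreement step can be pictured as a set intersection over the links each player draws. The sketch below is only an illustration under assumed data structures: alignments are represented as (English index, Turkish index) pairs, and the names `agreed_alignments` and `pair_counts`, as well as the sample sentence pair and player inputs, are hypothetical rather than taken from the game's implementation.

```python
from collections import Counter


def agreed_alignments(player_a, player_b):
    """Return the alignment links that both players selected."""
    return set(player_a) & set(player_b)


# Hypothetical sentence pair and player inputs.
english = ["I", "read", "the", "book"]
turkish = ["kitabı", "okudum"]

# Each link is (index into english, index into turkish).
player_a = {(1, 1), (3, 0), (2, 0)}
player_b = {(1, 1), (3, 0), (0, 1)}

# Running statistics: how often each word pair was agreed upon.
pair_counts = Counter()
for e_idx, t_idx in agreed_alignments(player_a, player_b):
    pair_counts[(english[e_idx], turkish[t_idx])] += 1

print(sorted(pair_counts.items()))
```

Keeping only links that two independent players agree on is the same idea the ESP Game uses for image labels: agreement serves as a cheap check on the quality of each contribution.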

Fall 2006 - Spring 2007 Using Genetic Algorithms to Select the Minimum Number of Features for Classification
  • Suveyda Yeniterzi, M. Sc., Computer Science and Engineering, Sabanci University
  • Alper Kucukural, Ph.D. student, Biological Science and Engineering, Sabanci University
  • Ugur Sezerman, Ph.D., Biological Science and Engineering, Sabanci University

Selecting the most relevant factors from genetic profiles that can optimally characterize cellular states is of crucial importance in identifying complex disease genes and biomarkers for disease diagnosis and for assessing drug efficiency. In this work, we present an approach that uses a genetic algorithm for the feature subset selection problem and can be used to select an optimal set of genes for the classification of gene expression data. We implemented a dynamic parent generation procedure inspired by nature. The idea that fitter and fewer genes (features) make up fitter, more evolved and more efficient parents enabled us to dynamically reduce the number of genes. In this way we could obtain the optimal number of features with the highest classification accuracy for each data set.
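For readers unfamiliar with genetic algorithms for feature subset selection, the following is a minimal generic sketch. It uses truncation selection, one-point crossover, bit-flip mutation and elitism; it does not reproduce the dynamic parent generation procedure described above, whose details are not given here. The function `select_features`, its parameters, and the toy fitness function (which stands in for classification accuracy and penalizes larger subsets) are all assumptions made for illustration.

```python
import random


def select_features(n_features, fitness, pop_size=20, generations=30,
                    mutation_rate=0.05, seed=0):
    """Generic GA over boolean feature masks (illustrative sketch only)."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]   # truncation selection
        children = [ranked[0]]             # elitism: keep the best mask
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # one-point crossover
            child = mother[:cut] + father[cut:]
            # Flip each bit with a small probability.
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


# Toy fitness: reward three "informative" features, penalize subset size.
# In practice this would be the accuracy of a classifier trained on the
# selected genes; the indices below are made up.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.1 * sum(mask)


best = select_features(10, toy_fitness)
print([i for i, keep in enumerate(best) if keep])
```

The size penalty in the toy fitness plays the role of the "fitter and fewer" idea: among masks with equal predictive value, the one that keeps fewer features scores higher.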

Alper Kucukural, Reyyan Yeniterzi, Suveyda Yeniterzi and Ugur Sezerman, Evolutionary Selection of Minimum Number of Features for Classification of Gene Expression Data Using Genetic Algorithms, presented at GECCO 2007, Genetic and Evolutionary Computation Conference, July 7-11, 2007, London, England.

Suveyda Yeniterzi, Reyyan Yeniterzi, Alper Kucukural and Ugur Sezerman, Feature Selection with Genetic Algorithms on Medical Data, presented at HIBIT07, International Symposium on Health Informatics and Bioinformatics, April 30-May 2, 2007, Antalya, Turkey.

 

Summer 2006 Automatic Speech Recognition on 911 Calls

Supervised by Dilek Hakkani Tur, Ph.D., Senior Researcher at ICSI

Automatic speech recognition (ASR) is the process of finding the most likely word sequence for a given acoustic speech signal. An ASR system consists of several components, such as feature extraction, the acoustic model, and the decoder, which consists of a language model and a lexical model. In this project we mainly dealt with the language model and the lexical model. We used off-the-shelf acoustic models and, as a result, produced an ASR system for the recognition of some 911 audio files.

 

Spring 2006 Developed an Online Search Database for SU Sponsored Research Award/Proposal Projects
  • Suveyda Yeniterzi, M. Sc., Computer Science and Engineering, Sabanci University
  • Akdes Serin, Ph.D. student, International Max Planck Research School for Computational Biology and Scientific Computing

We developed an online search database application with the necessary administrative functions. Apache, PHP and MySQL were used.

Summer 2005 Transcription Factor Binding Site Determination Using Data Mining Methods

Transcription factors (TFs) control the expression levels of genes by binding to regulatory DNA sequences in the genome. Finding these regulatory sequences will enable determination of TFs. We used data mining tools to find TF binding motifs. Using structural TF-DNA complex information, we performed association rule mining to determine the binding residues of TFs. By combining these rules, we built a predictor that can predict the binding site. Moreover, using the rules derived from the genomic sequences together with TF sequences, our algorithm is able to determine the possible regulatory motifs of a given TF.

Transcription Factor Binding Site Determination Using Data Mining Methods, poster presented at FEBS 2006, Federation of European Biochemical Societies, June 24-29, 2006, Istanbul, Turkey.