HIDEKI SHIMA

Last Updated: Feb 25, 2012
6th year Ph.D. student
Language Technologies Institute
School of Computer Science
Carnegie Mellon University

Office: 6609 Gates Hillman Complex
5000 Forbes Avenue
Pittsburgh, PA 15213 USA
Email: hideki at cs.cmu.edu

I am a sixth-year Ph.D. candidate at LTI in CMU, advised by Dr. Teruko Mitamura. My research interests include automatically acquiring paraphrase knowledge using semi-supervised machine learning models, and applying it to Question Answering, Recognizing Textual Entailment, automatic evaluation of Information Access systems, Information Extraction, and Information Retrieval. I am one of the organizers of RITE (Recognizing Inference in TExt), a shared task on recognizing textual entailment, paraphrase, and contradiction held at NTCIR-9 and NTCIR-10. I'm currently supported by DARPA and the Qatar National Research Fund.

News: My paper has been accepted at LREC 2012. In Feb 2012, I received the Allen Newell Award for Research Excellence with Dr. Eric Nyberg, Dr. Teruko Mitamura, and Dr. Nico Schlaefer [full story].

 
EDUCATION Ph.D. Candidate,
(Aug 2006 - Present)
Carnegie Mellon University, Pittsburgh, PA
Language Technologies, School of Computer Science

M.S.,
(Aug 2004 - Aug 2006)
Carnegie Mellon University, Pittsburgh, PA
Language Technologies, School of Computer Science

B.S.,
(Apr 2000 - Mar 2004)
Waseda University, Tokyo, Japan
Information and Computer Science

RESEARCH PROJECTS
(Winter 2011 - Present)
SmartReader - sponsored by NPRP under QNRF
We are developing intelligent educational software for English language learners. My contributions include analyzing the functionality of the system, designing the architecture, and implementing an AJAX-based web client that displays NLP annotations produced by a remote UIMA server. My current goal is to implement a Sentence Simplification module based on rich paraphrase rules acquired through bootstrapping.

(Spring 2010 - Present)
Machine Reading Program - sponsored by DARPA under AFRL
As part of the RACR (Reading and Contextual Reasoning) team in the Machine Reading Program, we aim to build a universal text engine that captures knowledge from naturally occurring text and transforms it into the formal representations used by artificial intelligence (AI) reasoning systems. I have contributed in two ways so far: 1. building a mixture-of-experts system that merges the outputs of NLP components from the several universities collaborating in the project; 2. building automatic evaluation software that measures the accuracy of entity and relation mention extraction and also provides tools to analyze errors and visualize results as diagrams.

(Sep 2011 - Feb 2012)
Yahoo! FREP Project - employed by Yahoo! Labs (remote part-time)
We are developing a question answering prototype that can process complex questions asked on a community Q&A site. Position: Research Scientist Student.

(June 2009 - Aug 2009)
Watson (DeepQA) Project - employed by IBM Research (full-time)
At the IBM T.J. Watson Research Center, New York, I spent three months as a research intern on the Watson (DeepQA) Question Answering project, which created a QA system that outperformed human quiz champions on the Jeopardy! television show. I developed one of the algorithms used to score answer candidates. The algorithm analyzes supporting evidence found in a set of text passages retrieved for each candidate answer, and estimates how well the passages support the answer by modeling the semantic similarity between them and the Jeopardy! clue.

(Fall 2008 - Spring 2010)
KIJI QA Project - sponsored by IBM Research
We built an English-Japanese Complex Cross-lingual QA system, in collaboration with IBM Research - Tokyo, which can answer various kinds of questions (e.g. definition, biography, relationship, event, person, location) that may be asked in a Business Intelligence scenario. The system was evaluated in NTCIR ACLIA and performed well. The project also resulted in an HTML-based annotation viewer for UIMA.

(Fall 2004 - Fall 2008)
JAVELIN QA Project - sponsored by AQUAINT under ARDA/DTO/IARPA
As a graduate research assistant, I contributed to building open-domain Factoid and Complex Question Answering systems, where I worked on the cross-lingual English-to-Japanese (EJ) and monolingual Japanese-to-Japanese (JJ) modules. My experience spans many aspects of QA research, e.g. Question Analysis, Named Entity Transliteration, Document Retrieval, Information Extraction, Answer Summarization, a web-based demo, and batch evaluation with automatic error analysis. As a result of intensive research effort, the JAVELIN system achieved strong results in competition-style, evaluation-oriented QA tasks similar to TREC and CLEF; we achieved the best result among participants in the NTCIR-6 CLQA JJ subtask, the NTCIR-7 ACLIA CCLQA EJ and JJ tasks, and the NTCIR-7 ACLIA IR4QA EJ task.

PUBLICATIONS
(Google Scholar)
Shima, Hideki and Teruko Mitamura. 2012. "Diversifiable Bootstrapping for Acquiring High-Coverage Paraphrase Resource", in Proceedings of the Language Resources and Evaluation Conference (LREC) 2012, Turkey. (PDF), (PPTX).

Murdock, J. William, James Fan, Adam Lally, Hideki Shima, and Branimir Boguraev. 2012. "Textual Evidence Gathering and Analysis". IBM Journal of Research and Development, Special Issue on DeepQA. (Link)

Shima, Hideki, Hiroshi Kanayama, Cheng-Wei Lee, Chuan-Jie Lin, Teruko Mitamura, Yusuke Miyao, Shuming Shi, and Koichi Takeda. 2011. "Overview of NTCIR-9 RITE: Recognizing Inference in TExt", in Proceedings of NTCIR-9 Workshop, Japan. (PDF), (CODE)

Shima, Hideki, Yuanpeng Li, Naoki Orii, and Teruko Mitamura. 2011. "LTI's Textual Entailment Recognizer System at NTCIR-9 RITE", in Proceedings of NTCIR-9 Workshop, Japan. (PDF).

Shima, Hideki and Teruko Mitamura. 2011. "Diversity-aware Evaluation for Paraphrase Patterns", in Proceedings of TextInfer 2011: The EMNLP 2011 Workshop on Textual Entailment. Edinburgh, Scotland. (PDF), (CODE)

Shima, Hideki and Teruko Mitamura. 2010. "Bootstrap Pattern Learning for Open-Domain CLQA", in Proceedings of NTCIR-8 Workshop, Japan. (PDF)

Mitamura, Teruko, Hideki Shima, Tetsuya Sakai, Noriko Kando, Tatsunori Mori, Koichi Takeda, Chin-Yew Lin, Ruihua Song, Chuan-Jie Lin, and Cheng-Wei Lee. 2010. "Overview of the NTCIR-8 ACLIA Tasks: Advanced Cross-Lingual Information Access", in Proceedings of NTCIR-8 Workshop, Japan. (PDF)

Sakai, Tetsuya, Hideki Shima, Noriko Kando, Ruihua Song, Chuan-Jie Lin, Teruko Mitamura, and Miho Sugimoto. 2010. "Overview of NTCIR-8 ACLIA IR4QA", in Proceedings of NTCIR-8 Workshop, Japan. (PDF)

Sakai, Tetsuya, Noriko Kando, Hideki Shima, Chuan-Jie Lin, Ruihua Song, Miho Sugimoto and Teruko Mitamura. 2009. "Ranking the NTCIR ACLIA IR4QA Systems without Relevance Assessments", DBSJ Journal, Vol.8, No.2, pp.1-6 (2009) (PDF)

Sakai, Tetsuya, Noriko Kando, Chuan-Jie Lin, Ruihua Song, Hideki Shima, and Teruko Mitamura. 2009. "Revisiting NTCIR ACLIA IR4QA with Additional Relevance Assessments", IPSJ SIG Technical Report Vol.2009-DBS-148 No.9 / Vol.2009-FI-95 No.9, 2009.

Shima, Hideki, Ni Lao, Eric Nyberg and Teruko Mitamura. 2008. "Complex Cross-lingual Question Answering as Sequential Classification and Multi-Document Summarization Task", in Proceedings of NTCIR-7 Workshop, Japan. (PDF)

Lao, Ni, Hideki Shima, Teruko Mitamura and Eric Nyberg. 2008. "Query Expansion and Machine Translation for Robust Cross-Lingual Information Retrieval", in Proceedings of NTCIR-7 Workshop, Japan. (PDF)

Mitamura, Teruko, Eric Nyberg, Hideki Shima, Tsuneaki Kato, Tatsunori Mori, Chin-Yew Lin, Ruihua Song, Chuan-Jie Lin, Tetsuya Sakai, Donghong Ji and Noriko Kando. 2008. "Overview of the NTCIR-7 ACLIA: Advanced Cross-Lingual Information Access", in Proceedings of NTCIR-7 Workshop, Japan. (PDF)

Sakai, Tetsuya, Noriko Kando, Chuan-Jie Lin, Teruko Mitamura, Hideki Shima, Donghong Ji, Kuang-Hua Chen and Eric Nyberg. 2008. "Overview of the NTCIR-7 ACLIA IR4QA Task", in Proceedings of NTCIR-7 Workshop, Japan. (PDF)

Mitamura, Teruko, Frank Lin, Hideki Shima, Mengqiu Wang, Jeongwoo Ko, Justin Betteridge, Matthew Bilotti, Andrew Schlaikjer and Eric Nyberg. 2007. "JAVELIN III: Cross-Lingual Question Answering from Japanese and Chinese Documents", in Proceedings of NTCIR-6 Workshop, Japan. (PDF)

Shima, Hideki and Teruko Mitamura. 2007. "JAVELIN III: Answering Non-Factoid Questions in Japanese", in Proceedings of NTCIR-6 Workshop, Japan. (PDF)

Mitamura, Teruko, Mengqiu Wang, Hideki Shima and Frank Lin. 2006. "Keyword Translation Accuracy and Cross-Lingual Question Answering in Chinese and Japanese", in Proceedings of EACL 2006 Workshop on MLQA (PDF)

Shima, Hideki, Mengqiu Wang, Frank Lin and Teruko Mitamura. 2006. "Modular Approach to Error Analysis and Evaluation for Multilingual Question Answering", in Proceedings of LREC 2006 (PDF)

Lin, Frank, Hideki Shima, Mengqiu Wang and Teruko Mitamura. 2005. "CMU JAVELIN System for NTCIR5 CLQA1", in Proceedings of the NTCIR-5 Workshop, Tokyo, Japan (PDF)
 
AWARD   The Allen Newell Award for Research Excellence (2012)

PROFESSIONAL ACTIVITIES
  Workshop Chair: IEEE EMRITE 2012
Organizer: NTCIR-10 RITE, DARPA MRP Kick Off - Student Summit (2011), NTCIR-9 RITE, NTCIR-8 ACLIA and NTCIR-7 ACLIA
Program Committee: AIRS (2011), EMNLP (2011), AIRS (2010)
Reviewer: ACL (2010), ACM CIKM (2009), AIRS (2009), ACM SIGIR (2008)
 
TEACHING ASSISTANT   Spring 2009, 11-792 Software Engineering II (graduate level): Advising four student projects: HoneyDew (meeting scheduling agent that interprets emails), WebRecommender (web page recommendation system), STAT (unsupervised learning toolkit), PIGOptimizer (Hadoop's command optimization subproject).

Fall 2008, 2009 and 2010, 11-791 Software Engineering I (graduate level): Designing and grading individual assignments, exams and team projects. Giving a tutorial lecture on the tools and skills needed in the team project, including Subversion, Trac, Maven2 and Test-Driven Development with JUnit.

SOFTWARE   WS4J (WordNet Similarity for Java) provides APIs for several semantic relatedness algorithms for, in theory, any WordNet instance. The codebase has been mostly ported from WordNet-Similarity-2.05.

Wikipedia Redirect extracts pairs of a title and a redirected title (e.g. "USA" -> "United States") from a Wikipedia dump in any language. It is useful for addressing vocabulary mismatch in text, especially for proper nouns.

DIMPLE (DIversity-aware Metric for Pattern Learning Experiment) evaluates paraphrase patterns while taking lexical diversity into account. The software comes with a data loader for the RTE, MS Paraphrase, and TREC Complex QA evaluation datasets, which can be reused in other projects.

RITE SDK provides a Java framework for rapidly building a Textual Entailment recognition system, particularly for participating in the NTCIR-9 RITE evaluation task. RITE SDK comes with sample code, so you can quickly build a working system by modifying it.

SEPIA is a web-based tool for topic development and evaluation for Information Retrieval and Complex Question Answering. It was officially used in NTCIR-7 ACLIA, NTCIR-8 ACLIA and GeoTime.

JAWJAW is a Java API for the Japanese/English WordNet. Try the web-based demo for semantic-relatedness matrix calculation.

Project Hello World is a collection of tiny code samples for various things you can do in Java, with or without external libraries (e.g. jaxeclapi, sesame, UIMA, commons-cli, commons-exec, jfreechart, json, diff, MD5, javax.mail, DOM, SAX, zip).

HTML CAS Consumer is a UIMA component that produces annotated documents as HTML files. With the HTML output, you can easily demonstrate annotations to someone who doesn't have a UIMA environment. (To be released.) A static demo is available here.

Indri CAS Consumer is a UIMA component that produces offset annotations for Indri. With this component, you can easily create an Indri index that supports document retrieval with annotated queries. Structured indexing (e.g. syntactic dependency, predicate-argument structure) is also supported. It is robust enough to work on a gigabyte-scale corpus. (To be released.)

UCR, or UIMA Component Repository, is a web-based repository where developers can upload their UIMA components to share. As one of the founding members of a team of CMU students and IBM researchers, I contributed to object-oriented analysis, design and implementation, especially of the search component.

... And many other (re-)implementations including:
  • Recall-optimized sequential classifier for answer-bearing sentence extraction
  • Text summarizer based on Maximal Marginal Relevance
  • Bootstrapping relation-instance learner based on Espresso
  • Pattern-based Pseudo-Relevance Feedback addressing NE vocabulary mismatch
  • Integrated automatic evaluation toolkit with BLEU, METEOR, ROUGE, BE, POURPRE,...
  • Factoid QA batch evaluation & error analysis tool (see sample output)
  • Japanese Named Entity tagger based on CRFs
  • Javelin web-based demo (see screendump)
  • Wikipedia inter-page and inter-language link mining (11-772 project, see figures)
  • Shallow Semantic Parser based on Tree-CRFs (11-748 team project)
  • Baum-Welch unsupervised learner for HMMs (11-761 assignment)
  • Machine-generated fake text classifier (11-761 team project)
  • Spam filter (15-681 assignment)
  • Protein search engine on Medline biomedical corpus (11-791 team project)
  • Web-mining tool for proper noun translation (11-731 project, see result)
  • English-French alignment tool (11-731 assignment)
  • Web-mining tool for person name transliteration (undergrad project)
  • Wikipedia-gloss annotator for kids (undergrad project, see screendump)
  • Robust bitmap emboldening algorithm (undergrad project with Microsoft, see result)
  • AIBO remote controller with gyroscope and head-mounted display (undergrad project)

COURSEWORK
15-681 Machine Learning
11-796 Question Answering Lab
11-792 Software Engineering II
11-791 Software Engineering
11-772 Analysis of Social Media
11-761 Language and Statistics
11-748 Information Extraction
11-741 Information Retrieval
11-731 Machine Translation
11-721 Grammars and Lexicons
11-711 Algorithms for NLP


MISC  
Languages Spoken:   Japanese (mother tongue), English (fluent)
Languages Researched:   Japanese, English, Chinese
Languages Studied:   French (2 years), Chinese (1 year), Spanish (1 year)