HIDEKI SHIMA

4th-year Ph.D. student
Language Technologies Institute
Carnegie Mellon University

Office: 6609 Gates Hillman Complex
5000 Forbes Avenue
Pittsburgh, PA 15213 USA
Email: hideki at cs.cmu.edu

I am a fourth-year Ph.D. candidate at CMU, advised by Dr. Teruko Mitamura. In Fall 2004, I became a master's student at the LTI. From then until Fall 2008, I was involved with the JAVELIN Question Answering project, sponsored by the AQUAINT program under ARDA/DTO/IARPA, as a graduate research assistant. I contributed to building open-domain Factoid and Complex Question Answering systems, where I worked on the crosslingual English-to-Japanese (EJ) and monolingual Japanese-to-Japanese (JJ) modules. My experience spans various aspects of QA research, including Question Analysis, Named Entity Transliteration, Document Retrieval, Information Extraction, Answer Summarization, web-based demos, and batch evaluation with automatic error analysis. As a result of this intensive research effort, the JAVELIN system achieved remarkable results in competition-style, evaluation-oriented QA tasks similar to TREC and CLEF: we achieved the best result among participants in the NTCIR-6 CLQA JJ subtask, the NTCIR-7 ACLIA CCLQA EJ and JJ tasks, and the NTCIR-7 ACLIA IR4QA EJ task.
Since Fall 2008, I have been working on the KIJI Question Answering project, a joint project of CMU and IBM Tokyo Research Laboratory, where I am responsible for building "Complex QA for Business Intelligence" that meets real-world business application needs. During Summer 2009, I joined the Watson (Jeopardy!) QA Project at IBM T.J. Watson Research Center, New York, for three months as a research intern. My research interests center on Question Answering, automatic evaluation methods for Information Access research, Information Extraction, and Information Retrieval.
 
EDUCATION
Ph.D. Candidate, Carnegie Mellon University, Pittsburgh, PA (Aug 2006 - Present)
Language Technologies, School of Computer Science

M.S., Carnegie Mellon University, Pittsburgh, PA (Aug 2004 - Aug 2006)
Language Technologies, School of Computer Science

B.S., Waseda University, Tokyo, Japan (Apr 2000 - Mar 2004)
Information and Computer Science
 
RECENT PUBLICATIONS
Sakai, Tetsuya, Noriko Kando, Hideki Shima, Chuan-Jie Lin, Ruihua Song, Miho Sugimoto and Teruko Mitamura. 2009. "Ranking the NTCIR ACLIA IR4QA Systems without Relevance Assessments", DBSJ Journal, Vol.8, No.2, pp.1-6 (2009) (PDF)

Sakai, Tetsuya, Noriko Kando, Chuan-Jie Lin, Ruihua Song, Hideki Shima, and Teruko Mitamura. 2009. "Revisiting NTCIR ACLIA IR4QA with Additional Relevance Assessments", IPSJ SIG Technical Report Vol.2009-DBS-148 No.9 / Vol.2009-FI-95 No.9, 2009.

Shima, Hideki, Ni Lao, Eric Nyberg and Teruko Mitamura. 2008. "Complex Cross-lingual Question Answering as Sequential Classification and Multi-Document Summarization Task", in Proceedings of NTCIR-7 Workshop, Japan. (PDF)

Lao, Ni, Hideki Shima, Teruko Mitamura and Eric Nyberg. 2008. "Query Expansion and Machine Translation for Robust Cross-Lingual Information Retrieval", in Proceedings of NTCIR-7 Workshop, Japan. (PDF)

Mitamura, Teruko, Eric Nyberg, Hideki Shima, Tsuneaki Kato, Tatsunori Mori, Chin-Yew Lin, Ruihua Song, Chuan-Jie Lin, Tetsuya Sakai, Donghong Ji and Noriko Kando. 2008. "Overview of the NTCIR-7 ACLIA: Advanced Cross-Lingual Information Access", in Proceedings of NTCIR-7 Workshop, Japan. (PDF)

Sakai, Tetsuya, Noriko Kando, Chuan-Jie Lin, Teruko Mitamura, Hideki Shima, Donghong Ji, Kuang-Hua Chen and Eric Nyberg. 2008. "Overview of the NTCIR-7 ACLIA IR4QA Task", in Proceedings of NTCIR-7 Workshop, Japan. (PDF)

Mitamura, Teruko, Frank Lin, Hideki Shima, Mengqiu Wang, Jeongwoo Ko, Justin Betteridge, Matthew Bilotti, Andrew Schlaikjer and Eric Nyberg. 2007. "JAVELIN III: Cross-Lingual Question Answering from Japanese and Chinese Documents", in Proceedings of NTCIR-6 Workshop, Japan. (PDF)

Shima, Hideki and Teruko Mitamura. 2007. "JAVELIN III: Answering Non-Factoid Questions in Japanese", in Proceedings of NTCIR-6 Workshop, Japan. (PDF)

Mitamura, Teruko, Mengqiu Wang, Hideki Shima and Frank Lin. 2006. "Keyword Translation Accuracy and Cross-Lingual Question Answering in Chinese and Japanese", in Proceedings of EACL 2006 Workshop on MLQA (PDF)

Shima, Hideki, Mengqiu Wang, Frank Lin and Teruko Mitamura. 2006. "Modular Approach to Error Analysis and Evaluation for Multilingual Question Answering", in Proceedings of LREC 2006 (PDF)

Lin, Frank, Hideki Shima, Mengqiu Wang and Teruko Mitamura. 2005. "CMU JAVELIN System for NTCIR5 CLQA1", in Proceedings of the NTCIR-5 Workshop, Tokyo, Japan (PDF)
 
PROFESSIONAL ACTIVITIES
Task organizer: NTCIR-8 ACLIA and NTCIR-7 ACLIA
Invited reviewer: ACM CIKM (2009), AIRS (2009), ACM SIGIR (2008), and NTCIR-7 ACLIA (2008)
 
TEACHING ASSISTANT
Spring 2009, 11-792 Software Engineering II (graduate level): Advising four student projects: HoneyDew (meeting scheduling agent that interprets emails), WebRecommender (web page recommendation system), STAT (unsupervised learning toolkit), PIGOptimizer (Hadoop's command optimization subproject).

Fall 2008 & 2009, 11-791 Software Engineering I (graduate level): Designing and grading individual assignments, exams and team projects. Giving a tutorial lecture on the tools and skills needed in the team project, including Subversion, Trac, Maven2 and Test-Driven Development with JUnit.

SOFTWARE
EPAN is a web-based tool for topic development and evaluation for Information Retrieval and Complex Question Answering. Officially adopted in NTCIR-7 ACLIA. (Becoming an open source project in May 2009)

JAWJAW is a Java API for the Japanese/English WordNet. Try the web-based demo of semantic-relatedness matrix calculation, a feature to be distributed in the next release.

HTML CAS Consumer is a UIMA component that produces annotated documents as HTML files. With the HTML output, you can easily demonstrate annotations to someone who doesn't have a UIMA environment. (To be released.) A static demo is available here.

Indri CAS Consumer is a UIMA component that produces offset annotations for Indri. With this component, you can easily create an Indri index so that document retrieval with annotated queries becomes possible. Structured indexing (e.g. syntactic dependency, predicate-argument structure) is also supported. Robust enough to work on a gigabyte-class corpus. (To be released.)

UCR, or UIMA Component Repository, is a web-based repository where developers can upload their UIMA components to share. As one of the start-up members, a group consisting of CMU students and IBM researchers, I contributed to object-oriented analysis, design and implementation, especially in the search part.

... And many other (re-)implementations including:
  • Recall-optimized sequential classifier for answer-bearing sentence extraction
  • Text summarizer based on Maximal Marginal Relevance
  • Bootstrapping relation-instance learner based on Espresso
  • Pattern-based Pseudo-Relevance Feedback addressing NE vocabulary mismatch
  • Integrated automatic evaluation toolkit with BLEU, METEOR, ROUGE, BE, POURPRE,...
  • Factoid QA batch evaluation & error analysis tool (see sample output)
  • Japanese Named Entity tagger based on CRFs
  • Javelin web-based demo (see screendump)
  • Wikipedia inter-page and inter-language link mining (11-772 project, see figures)
  • Shallow Semantic Parser based on Tree-CRFs (11-748 team project)
  • Baum-Welch unsupervised learner for HMMs (11-761 assignment)
  • Machine-generated fake text classifier (11-761 team project)
  • Spam filter (15-681 assignment)
  • Protein search engine on Medline biomedical corpus (11-791 team project)
  • Web-mining tool for proper noun translation (11-731 project, see result)
  • English-French alignment tool (11-731 assignment)
  • Web-mining tool for person name transliteration (undergrad project)
  • Wikipedia-gloss annotator for kids (undergrad project, see screendump)
  • Robust bitmap emboldening algorithm (undergrad project with Microsoft, see result)
  • AIBO remote controller with gyroscope and head-mounted display (undergrad project)

COURSEWORK
15-681 Machine Learning
11-792 Software Engineering II
11-791 Software Engineering I
11-772 Analysis of Social Media
11-761 Language and Statistics
11-748 Information Extraction
11-741 Information Retrieval
11-731 Machine Translation
11-721 Grammars and Lexicons
11-711 Algorithms for NLP


MISC  
Languages Spoken:   Japanese (mother tongue), English (fluent)
Languages Researched:   Japanese, English, Chinese
Languages Studied:   French (2 years), Chinese (1 year), Spanish (1 year)
Personal Website:   http://shi.ma/hideki