CMU - IR Discussion Series

Friday, October 1, 2004 - 1:30, NSH 4513
Title: Combining Language Modeling Approach with String-matching in Near-Duplicate Detection in E-Rulemaking
Speaker: Puck Treeratpituk

Abstract:

By law, an administrative agency must evaluate the public's comments on proposed regulations during the rulemaking process. These public comments typically contain many exact duplicates and near-duplicates of form letters. We focus on the automatic detection of near-duplicates (form letters) in this domain. We propose a simple and efficient method that uses the similarity between language models of documents, together with fingerprinting, to identify near-duplicate comments. This method combines word-order information with the "bag-of-words" approach. We then describe an experiment showing that this simple method can provide reasonable performance in detecting near-duplicates in public comment data.
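
The following is a minimal illustrative sketch of the general idea in the abstract, not the speaker's actual method. The abstract does not specify the language model, the smoothing, the similarity measure, the fingerprinting scheme, or any thresholds, so everything here is an assumption: add-one smoothed unigram models, Jensen-Shannon divergence as the bag-of-words signal, MD5 hashes of word trigrams as order-sensitive fingerprints, and the hypothetical cutoffs max_div and min_overlap.

```python
import hashlib
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def unigram_model(tokens, vocab, alpha=1.0):
    # Add-alpha smoothed unigram language model over a shared vocabulary.
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def js_divergence(p, q):
    # Jensen-Shannon divergence (base 2): 0 for identical models, at most 1.
    total = 0.0
    for w in p:
        m = 0.5 * (p[w] + q[w])
        total += 0.5 * p[w] * math.log2(p[w] / m)
        total += 0.5 * q[w] * math.log2(q[w] / m)
    return total


def fingerprints(tokens, k=3):
    # Hash every run of k consecutive words; this keeps word-order information.
    shingles = (" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in shingles}


def jaccard(a, b):
    # Set overlap in [0, 1]; two empty sets are treated as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, max_div=0.2, min_overlap=0.5):
    # Hypothetical decision rule: both signals must agree.
    ta, tb = tokenize(doc_a), tokenize(doc_b)
    vocab = set(ta) | set(tb)
    divergence = js_divergence(unigram_model(ta, vocab), unigram_model(tb, vocab))
    overlap = jaccard(fingerprints(ta), fingerprints(tb))
    return divergence <= max_div and overlap >= min_overlap
```

In this sketch, a comment whose words are the same as a form letter's but in a different order has a language-model divergence of zero yet shares few or no fingerprints, which is the kind of case where adding word-order information to a pure bag-of-words comparison would matter.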