I will lead the discussion next week on a theme I'm calling "Agile Methods for Computational Linguistic Data Collection." This includes web mining, active learning, and tools for eliciting information from non-linguists/non-technical experts.
Here are the papers I expect to cover. Note that the focus will be on the parts most relevant to the theme, which will likely mean glossing over some technical details.
Alexis Palmer, Taesun Moon, Jason Baldridge, Katrin Erk, Eric
Campbell, & Telma Can (2010). Computational strategies for reducing
annotation effort in language documentation. Linguistic Issues in
Language Technology, 3(1).
http://elanguage.net/journals/index.php/lilt/article/view/663
Fei Xia & William Lewis (2007). Multilingual structural projection
across interlinear text. NAACL-HLT.
http://www.aclweb.org/anthology/N07-1057
Marelie Davel & Etienne Barnard (2004). The efficient generation of
pronunciation dictionaries: human factors during bootstrapping.
INTERSPEECH.
http://www.is.cs.cmu.edu/SpeechSeminar/Papers4Review/FrA2702p.20_p864.pdf
John Kominek & Alan W. Black (2006). Learning pronunciation
dictionaries: language complexity and word selection strategies.
HLT-NAACL.
http://www.aclweb.org/anthology/N06-1030
John Kominek, Sameer Badaskar, Tanja Schultz, & Alan W. Black (2008).
Improving speech systems built from very little data. INTERSPEECH.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.151.3270&rep=rep1&type=pdf
Related:
Vamshi Ambati & Stephan Vogel (2010). Can crowds build parallel
corpora for machine translation systems? NAACL HLT 2010 Workshop on
Creating Speech and Language Data with Amazon's Mechanical Turk.
http://www.aclweb.org/anthology/W10-0710
cf. Vamshi's slides from September's Crowdsourcing Lunch:
http://www.cs.cmu.edu/~vamshi/Vamshi/Publications_files/vamshi_crowdlunch.pdf
John Kominek (2009). TTS From Zero: Building Synthetic Voices for New
Languages. Ph.D. dissertation, Carnegie Mellon University.
http://www2.lti.cs.cmu.edu/Research/Thesis/john_kominek.pdf