I will lead the discussion next week on a theme I'm calling "Agile Methods for Computational Linguistic Data Collection." This includes web mining, active learning, and tools for eliciting information from non-linguists/non-technical experts.

Here are the papers I expect to cover. Note that the focus will be on the parts most relevant to the theme, which will likely mean glossing over some technical details.

Alexis Palmer, Taesun Moon, Jason Baldridge, Katrin Erk, Eric Campbell, & Telma Can (2010). Computational strategies for reducing annotation effort in language documentation. Linguistic Issues in Language Technology, 3(1).
http://elanguage.net/journals/index.php/lilt/article/view/663

Fei Xia & William Lewis (2007). Multilingual structural projection across interlinear text. NAACL-HLT.
http://www.aclweb.org/anthology/N07-1057

Marelie Davel & Etienne Barnard (2004). The efficient generation of pronunciation dictionaries: human factors during bootstrapping. INTERSPEECH.
http://www.is.cs.cmu.edu/SpeechSeminar/Papers4Review/FrA2702p.20_p864.pdf

John Kominek & Alan W. Black (2006). Learning pronunciation dictionaries: language complexity and word selection strategies. HLT-NAACL.
http://www.aclweb.org/anthology/N06-1030

John Kominek, Sameer Badaskar, Tanja Schultz, & Alan W. Black (2008). Improving speech systems built from very little data. INTERSPEECH.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.151.3270&rep=rep1&type=pdf

Vamshi Ambati & Stephan Vogel (2010). Can crowds build parallel corpora for machine translation systems? NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.
http://www.aclweb.org/anthology/W10-0710

cf. Vamshi's slides from September's Crowdsourcing Lunch:
http://www.cs.cmu.edu/~vamshi/Vamshi/Publications_files/vamshi_crowdlunch.pdf

John Kominek (2009). TTS From Zero: Building Synthetic Voices for New Languages. Ph.D. dissertation, Carnegie Mellon University.
http://www2.lti.cs.cmu.edu/Research/Thesis/john_kominek.pdf