John Wieting




I am a PhD student at Carnegie Mellon University, advised by Taylor Berg-Kirkpatrick and Graham Neubig. I also collaborate with Kevin Gimpel at the Toyota Technological Institute at Chicago. Previously, I completed my MS with Dan Roth, who is now at the University of Pennsylvania.

My interests lie in machine learning, learning theory, optimization, natural language processing, and computer vision. My current research focuses on machine learning and natural language processing.


I received BS degrees in Mathematics and in Chemistry, with Honors, from the University of Wisconsin in 2011, and my MS in Computer Science from the University of Illinois in May 2014 under the supervision of Dan Roth.

Key Publications

  • (EMNLP 2020) A Bilingual Generative Transformer for Semantic Sentence Embedding (pdf)

  • (ACL 2019) Simple and Effective Paraphrastic Similarity from Parallel Translations (pdf)

  • (ACL 2019) Beyond BLEU: Training Neural Machine Translation with Semantic Similarity (pdf)

  • (ICLR 2019) No Training Required: Exploring Random Encoders for Sentence Classification (pdf)

  • (NAACL 2018) Adversarial Example Generation with Syntactically Controlled Paraphrase Networks (pdf)

  • (ACL 2018) Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations (pdf)

  • (EMNLP 2017) Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext (pdf)

  • (ACL 2017) Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings (pdf)

  • (EMNLP 2016) Charagram: Embedding Words and Sentences via Character n-grams (pdf)

  • (ICLR 2016, oral) Towards Universal Paraphrastic Sentence Embeddings (pdf)

  • (TACL 2015) From Paraphrase Database to Compositional Model and Back (pdf|bib)

Some Old Projects

    1. Generalization of Strongly Convex Online Learning Algorithms Download : This paper presents a discussion and fills in some of the gaps I encountered when reading Sham Kakade's paper. The main idea is that in batch learning we want to bound generalization error with high probability, whereas in online learning we want to bound the regret: the difference between the total loss incurred on all examples seen so far and the total loss of the optimal function in our class. This paper relates the two for a particular class of online learning algorithms, and in doing so characterizes the convergence rate of these algorithms with high probability, not just in expectation. It is a nice application of learning theory, an interest of mine.
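The online-to-batch idea above can be sketched in a few lines. This is a hypothetical toy setup, not the paper's actual analysis: online gradient descent is run over a stream, the cumulative (regret-style) loss is tracked, and the averaged iterate is returned as the batch predictor, since averaging is the standard way a regret bound is converted into a generalization bound.

```python
import numpy as np

# Toy online-to-batch conversion (illustrative sketch, not the paper's setup):
# run online gradient descent on a stream of (x, y) pairs, accumulate the
# loss suffered *before* each update (as in a regret bound), and return the
# averaged iterate as the batch predictor.

def squared_loss(w, x, y):
    return 0.5 * (w @ x - y) ** 2

def grad(w, x, y):
    return (w @ x - y) * x

def online_to_batch(stream, dim, lr=0.1):
    w = np.zeros(dim)
    iterates = []
    cumulative_loss = 0.0
    for x, y in stream:
        cumulative_loss += squared_loss(w, x, y)  # loss incurred online
        w = w - lr * grad(w, x, y)                # online gradient step
        iterates.append(w.copy())
    # Online-to-batch conversion: the average of the online iterates is the
    # predictor whose generalization error the regret bound controls.
    return np.mean(iterates, axis=0), cumulative_loss

# Noiseless linear-regression stream with a known target vector.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
stream = [(x, w_true @ x) for x in rng.normal(size=(200, 2))]

w_avg, total_loss = online_to_batch(stream, dim=2)
```

The averaged iterate `w_avg` ends up close to `w_true`, while `total_loss` records the online loss that a regret bound would control.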

    2. Tiered Clustering Model for Lexical Entailment Download : In this project, I investigated clustering contexts to improve lexical entailment, trying two approaches. The first was a tiered clustering model, a nonparametric Bayesian algorithm, so the number of clusters need not be specified in advance (which is nice, although hyperparameters can still affect it). It is also hierarchical in the sense that a word can belong to one of two topics: a background topic or a forefront topic (i.e., one of the clusters). The second approach was a simple greedy set-covering algorithm. The paper suggests that clustering, especially tiered clustering, can help lexical entailment, but that care is needed in how the latent senses are combined to form the new representation of a word.

    3. Learning and Inference in Entity Relation Identification Download : This paper investigates three approaches to discovering entities and relations in text. The first uses a collection of local classifiers; the second adds integer linear programming (ILP) inference on top of the local classifiers; and the third uses inference-based training (IBT), which differs from the second in that the result of the ILP inference is used to update the weight vectors. This approach can be shown to be equivalent to Collins's structured perceptron (discussed in the paper). The results are somewhat surprising in that there is not much difference in performance, although IBT generally does best. A lot of extra work for a small improvement, but at least the joint predictions are more coherent.
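The inference-based training loop described above can be sketched on a toy problem. Everything here is a hypothetical illustration, not the paper's model or data: each output is a pair (entity, relation) subject to the constraint that a relation can only fire when the entity label allows it, constrained inference is a brute-force argmax standing in for an ILP solver, and the perceptron update is driven by that joint inference result.

```python
import itertools
import numpy as np

# Toy inference-based training (IBT) sketch. Illustrative only: the joint
# prediction used for the perceptron update comes from constrained
# inference (brute-force search standing in for an ILP solver).

def feasible():
    # All (entity, relation) pairs satisfying the constraint:
    # a relation can only fire if the entity label allows it.
    return [(e, r) for e, r in itertools.product([0, 1], repeat=2)
            if not (r == 1 and e == 0)]

def features(x, y):
    e, r = y
    # Simple joint feature map: input features copied per active label.
    return np.concatenate([x * e, x * r])

def infer(w, x):
    # Constrained argmax over joint assignments (ILP stand-in).
    return max(feasible(), key=lambda y: w @ features(x, y))

def train_ibt(data, dim, epochs=10):
    w = np.zeros(2 * dim)
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = infer(w, x)
            if y_pred != y_gold:
                # Structured-perceptron update driven by inference output.
                w += features(x, y_gold) - features(x, y_pred)
    return w

# Linearly separable toy data: the gold pair is (1, 1) when x[0] > 0.
data = [(np.array([1.0, 0.0]), (1, 1)),
        (np.array([-1.0, 0.0]), (0, 0)),
        (np.array([2.0, 1.0]), (1, 1)),
        (np.array([-2.0, 1.0]), (0, 0))]
w = train_ibt(data, dim=2)
```

Because the update compares gold and predicted *joint* assignments under the constraint, the learned weights can never be pushed toward the infeasible pair (0, 1), which is the sense in which the outputs stay coherent.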

    4. Two Dimensional Non-causal HMM for Texture Classification Download : This paper uses a special HMM that models dependencies not only from left to right but also vertically and diagonally. We implemented this HMM and applied it to texture classification (i.e., deciding whether an image shows bark, water, granite, etc.).

    5. Constrained Conditional Model Java Library : Code I hope to release at some point when I have more time. It lets one create a simple CCM that can do multiclass learning, multiclass learning with ILP inference, or structured learning with a structured perceptron or structured SVM with ILP inference. Simple and lightweight, useful for quick implementations or for learning how these models work.


Teaching

    1. Teaching Assistant for CS 546: Machine Learning in Natural Language Processing (Spring 2013)

    2. Teaching Assistant for CS 125: Introduction to Computer Science for Majors (Fall 2012)

      1. rated as Excellent (by ICES scoring, submitted by students)

    3. Teaching Assistant for CS 421: Programming Languages and Compilers (Summer 2012)

    4. Teaching Assistant for CS 125: Introduction to Computer Science for Majors (Spring 2011)

      1. rated as Outstanding (Top 10%) (by ICES scoring, submitted by students)

    5. Teaching Assistant for CS 125: Introduction to Computer Science for Majors (Fall 2011)

      1. rated as Excellent (by ICES scoring, submitted by students)

    6. Instructor for CS 199: Honors Projects for CS 125 (Fall 2011)

Selected Awards