Machine Learning Thesis Proposal
- Newell-Simon Hall
- QIZHE XIE
- Ph.D. Student
- Machine Learning Department
- Carnegie Mellon University
Towards Data-Efficient Machine Learning
Deep learning works well when (1) the problem is regular enough and (2) there is enough training data to adequately and representatively reflect that regularity. As researchers' ambitions grow, problems with less regularity are being addressed, which require more data to achieve strong performance. In addition, as researchers push the boundary of deep learning, state-of-the-art models become increasingly data-hungry. Hence, when labeled data is scarce, it can be difficult to train deep learning models to perform well, yet the cost of creating larger labeled datasets is often prohibitive. To tackle this challenge, we first develop insights about collecting labeled datasets for irregular/difficult problems and then show that deep learning can leverage several kinds of information and data to improve data efficiency.
The thesis consists of two parts: methods to collect high-quality labeled data for irregular/difficult problems, and algorithms for data-efficient learning. In the first part, we take machine comprehension as a case study and demonstrate that collecting data from exams is far more effective than alternatives such as automatic generation and crowd-sourcing. In the second part, to address the costly data collection process, we show that the required amount of labeled data can be greatly reduced by making use of (1) unlabeled data, (2) labeled data from another domain, and (3) prior knowledge about the task at hand. First, when unlabeled data from the domain of interest is available, semi-supervised learning can effectively improve the performance of deep learning models. Second, when labeled data from a similar domain is available, transfer learning or domain adaptation can transfer the knowledge learned in that domain. Finally, when we have prior knowledge about the task at hand, models that utilize it can make better use of the limited data.
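To make the semi-supervised direction concrete, the following is a minimal sketch of pseudo-labeling (self-training), one common semi-supervised learning scheme: train on the labeled data, assign labels to unlabeled points the model is confident about, and retrain on the enlarged set. The toy 1-D threshold classifier, the data, and the confidence measure here are all illustrative assumptions, not the method proposed in the thesis.

```python
# Pseudo-labeling (self-training) sketch on a toy 1-D problem.
# The "model" is a single threshold t: predict 1 if x >= t else 0.

def fit_threshold(xs, ys):
    """Pick the threshold that maximizes training accuracy."""
    best_t, best_acc = xs[0], -1.0
    for t in sorted(xs):
        acc = sum((x >= t) == bool(y) for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict(t, x):
    return 1 if x >= t else 0

def confidence(t, x):
    # Toy confidence score: distance from the decision boundary.
    return abs(x - t)

# A few labeled points and several unlabeled ones (illustrative data).
labeled_x = [0.1, 0.2, 0.8, 0.9]
labeled_y = [0, 0, 1, 1]
unlabeled_x = [0.05, 0.15, 0.3, 0.7, 0.85, 0.95]

# Step 1: train on labeled data only.
t = fit_threshold(labeled_x, labeled_y)

# Step 2: pseudo-label the unlabeled points the model is confident about.
pseudo = [(x, predict(t, x)) for x in unlabeled_x if confidence(t, x) > 0.2]

# Step 3: retrain on labeled plus pseudo-labeled data.
t2 = fit_threshold(labeled_x + [x for x, _ in pseudo],
                   labeled_y + [y for _, y in pseudo])
```

In practice the same loop is run with a neural network and a probability-based confidence threshold; the unlabeled data effectively supplies extra supervision at no labeling cost.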
Eduard Hovy (Chair)
Quoc Le (Google Brain)