The project would primarily entail systems-building and software design for the tool; however, the student would do all of the following:
We expect this project would be well-suited for a student with experience in software engineering and an interest in applications and research in machine learning. Contact Klas Leino.
Project Suggestions from CS Department PhD students
I'd love to have an undergrad continue the work I started with Glacier, my
class immutability system for Java. I'd like to build a type inference
system, apply it to the JDK, and then do a corpus study to see how
applicable Glacier is. I'm also looking for students to help with the
Obsidian implementation; a request is already in the queue to go out in
One common way to evaluate automated program repair is to use test cases: if all the test cases pass, then the fix matched what the developer wanted. However, these approaches do not have as much information as someone working on the project would (for example: which solution out of the space of possible solutions is the best). It would be interesting to see how much participants without experience in an application project are able to infer about the correct fix, and what the participants use to make that inference.
Using data-driven models to accurately predict how humans learn is a
difficult task, because human learning is sophisticated. When we use a
statistical model of student learning that is overly simple, we can avoid
overfitting, but a number of other issues can arise due to the bias of the
overly simple model. In particular, we have recently shown that depending
on how much data we use from each student, we can learn drastically
different parameters for a particular statistical model of student
learning. In this project, we want to try to find statistical models that
are not susceptible to this issue. I am looking for a student who is interested in machine learning and is also passionate about student learning and education.
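As a rough illustration of the issue (this is a toy model, not the statistical model from our work): if students are actually a mixture of fast and slow learners but we fit a single shared learning rate, the fitted rate can shift depending on how many trials per student we use.

```python
# Toy sketch (illustrative only): fit a single learning rate to a mixed
# population of fast and slow learners, using either few or many trials
# per student, and compare the resulting parameter estimates.
import numpy as np

rng = np.random.default_rng(0)

def p_correct(r, t, p0=0.5):
    # Exponential learning curve: accuracy rises from p0 toward 1 at rate r.
    return 1.0 - (1.0 - p0) * (1.0 - r) ** t

# Simulate 200 students: half learn fast (r=0.3), half slowly (r=0.03).
T = 40
rates = np.r_[np.full(100, 0.3), np.full(100, 0.03)]
trials = np.arange(T)
probs = 1.0 - 0.5 * (1.0 - rates[:, None]) ** trials[None, :]
data = rng.random((200, T)) < probs  # correct/incorrect per student per trial

def fit_rate(data, k):
    # Grid-search MLE for one shared rate using only the first k trials.
    grid = np.linspace(0.01, 0.5, 200)
    best, best_ll = grid[0], -np.inf
    for r in grid:
        p = p_correct(r, trials[:k])
        ll = np.sum(np.where(data[:, :k], np.log(p), np.log(1 - p)))
        if ll > best_ll:
            best, best_ll = r, ll
    return best

r_short = fit_rate(data, 5)   # few trials per student
r_long = fit_rate(data, 40)   # many trials per student
print(f"fitted rate (5 trials): {r_short:.3f}, (40 trials): {r_long:.3f}")
```

Because the single-rate model is misspecified for the mixed population, the estimate it settles on depends on how much of each student's trajectory it sees.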
We find that there is a general lack of a robust compiler infrastructure
for the Java language that impedes the rapid development of static and
dynamic analyses. It would be great if someone could work on improving the current tools. The current approach I have to understanding the user's graph computation is a combination of pattern matching on source code to look for common patterns, possibly augmented with dynamic analyses. I believe there are more principled ways to approach this: for example, a data-driven approach to identifying how people write the common computations we want to look for and refining the pattern matching accordingly, or even making use of auxiliary information like variable/method names and code comments to identify computations.
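A minimal sketch of the kind of source-level pattern matching described above, combining a syntactic pattern with identifier-name hints. The patterns and method names (`neighbors`, `adj`, etc.) are illustrative guesses, not taken from any real tool:

```python
# Flag lines of Java-like source that look like a common graph computation
# (neighbor iteration), using a regex pattern plus auxiliary name hints.
# All patterns here are hypothetical examples, not from an actual analysis.
import re

NEIGHBOR_LOOP = re.compile(
    r"for\s*\(\s*\w+\s+(\w+)\s*:\s*(\w+)\.(?:neighbors|adj|successors)\s*\(")
NAME_HINTS = re.compile(r"(?i)\b(graph|node|vertex|edge|bfs|dfs)\b")

def find_graph_patterns(source: str):
    hits = []
    for lineno, line in enumerate(source.splitlines(), 1):
        if NEIGHBOR_LOOP.search(line) or NAME_HINTS.search(line):
            hits.append((lineno, line.strip()))
    return hits

java_snippet = """\
int total = 0;
for (Node n : graph.neighbors(v)) {
    total += n.weight;
}
"""
print(find_graph_patterns(java_snippet))
```

A data-driven version could learn which patterns and names actually predict graph computations instead of hard-coding them as above.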
I have a variety of smallish project ideas that a qualified undergraduate student could potentially work on. These include the following:
1. Investigating capsule networks as a way to reduce the overall footprint (memory and operations) of an inference model.
2. Helping write intermittent-computing code to support audio processing library functions.
3. Investigating how to use non-negative matrix factorization to reduce the size of fully-connected layers in neural networks.
4. Investigating how capacitor size affects the fraction of time spent running for an intermittent device, across a range of capacitor sizes and capacitor technologies.
Projects three and four are probably the most suitable for an undergraduate, but projects one and two might be possible for a motivated individual.
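The idea behind project 3 can be sketched in a few lines: approximate a non-negative weight matrix W of a fully-connected layer with a rank-r factorization W ≈ A·B, replacing one wide layer with two thin ones. The shapes and rank below are illustrative choices, not from any particular network:

```python
# Hedged sketch of project 3: compress a non-negative fully-connected
# weight matrix with rank-r NMF using Lee & Seung multiplicative updates.
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in, r = 128, 256, 16
W = rng.random((n_out, n_in))  # stand-in for a trained non-negative weight matrix

# Multiplicative-update NMF: keep A and B non-negative throughout.
A = rng.random((n_out, r)) + 0.1
B = rng.random((r, n_in)) + 0.1
for _ in range(200):
    B *= (A.T @ W) / (A.T @ A @ B + 1e-9)
    A *= (W @ B.T) / (A @ B @ B.T + 1e-9)

rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
orig_params = W.size                 # 128 * 256 = 32768
compressed_params = A.size + B.size  # 128*16 + 16*256 = 6144
print(f"relative error: {rel_err:.3f}")
print(f"params: {orig_params} -> {compressed_params}")
```

The research question would be how much approximation error a real network tolerates before accuracy drops, and how that trades off against the parameter savings.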
I'm interested in working with someone on analyzing policies which schedule
jobs with known sizes on multiple processors. Specifically, I'm interested
in the competitive ratio of the Shortest Remaining Processing Time policy
on many processors. I want to transfer the existing worst-case analysis in this area to the stochastic domain and improve upon it. An interest in theory is all that's needed; no prior exposure is necessary.
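For concreteness, here is a toy discrete-time simulation of the SRPT policy on m processors, computing total flow time (completion minus arrival). This is only an illustrative baseline for the kind of policy being analyzed, not the analysis itself:

```python
# Toy SRPT-on-m-processors simulator with unit time steps.
def srpt_total_flowtime(jobs, m):
    # jobs: list of (arrival_time, size).
    remaining = {}  # job index -> remaining work
    finish = {}
    arrivals = sorted(range(len(jobs)), key=lambda j: jobs[j][0])
    t, i = 0, 0
    while i < len(arrivals) or remaining:
        # Admit jobs that have arrived by time t.
        while i < len(arrivals) and jobs[arrivals[i]][0] <= t:
            j = arrivals[i]
            remaining[j] = jobs[j][1]
            i += 1
        if remaining:
            # Run the m jobs with the shortest remaining processing time.
            running = sorted(remaining, key=lambda j: remaining[j])[:m]
            for j in running:
                remaining[j] -= 1
                if remaining[j] == 0:
                    finish[j] = t + 1
                    del remaining[j]
        t += 1
    return sum(finish[j] - jobs[j][0] for j in range(len(jobs)))

jobs = [(0, 3), (0, 1), (1, 2), (2, 1)]
print(srpt_total_flowtime(jobs, m=2))  # total flow time for this instance
```

A simulator like this is useful for sanity-checking conjectured competitive ratios against small adversarial instances before attempting a proof.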
Data stream algorithms have become a huge field of algorithmic research due
to the proliferation of massive data sets which cannot be stored in
memory. We have shown that for certain data streams which do not delete an
unbounded fraction of the items they insert, more space efficient
algorithms for L_0- and L_1-norm-related problems are feasible. So far, improvements for L_2 algorithms have remained elusive; can we make any?
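To fix terminology: the quantities in question are norms of the stream's frequency vector, and "bounded deletions" bounds the fraction of inserted items that are later deleted. The exact baseline below stores every count, which is precisely what small-space streaming algorithms avoid:

```python
# Illustrative exact baseline: L_0 (distinct items with nonzero count),
# L_1 (sum of absolute counts), and the deletion fraction of a stream.
from collections import Counter

def stream_stats(stream):
    # stream: iterable of (item, +1 or -1) updates.
    counts = Counter()
    inserts = deletes = 0
    for item, delta in stream:
        counts[item] += delta
        if delta > 0:
            inserts += delta
        else:
            deletes -= delta
    l0 = sum(1 for c in counts.values() if c != 0)
    l1 = sum(abs(c) for c in counts.values())
    alpha = deletes / inserts if inserts else 0.0  # bounded-deletion parameter
    return l0, l1, alpha

stream = [("a", 1), ("b", 1), ("a", 1), ("b", -1), ("c", 1)]
print(stream_stats(stream))
```

The research question is how much below this exact baseline's linear space one can get for each norm when alpha is bounded away from 1.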
Jay Yoon Lee
Dependency parsing in a multilingual setting: languages share commonalities, especially if they belong to the same language family (e.g., Germanic). Can we share parsing rules across languages and gain the effect of having more data? In pursuing this project, I hope to see improvement over a single-language parser, especially in low-resource (smaller labeled data) cases, using a multi-task/continual learning approach.
GenAMap (http://genamap.org) is an open-source platform for genomic
association mapping. We are looking for motivated undergraduates to guide
the development of the software. Specifically, we are looking for a student
interested in the design of parallelized machine learning systems to
integrate our genomic algorithms with the open-source Petuum
framework. Knowledge of C++ is required.
for a lot of student projects - e.g., adding optimization, geometry,
machine learning algorithms, developing techniques to efficiently transmit
3D data over a network (ties in with the point cloud project), expanding
the functionality and improving the performance of the linear algebra
module, building creative interfaces for developing 3D geometry pipelines.
Yong Kiam Tan
I am interested in working with any undergrad on CakeML-related projects,
especially verifying new compiler optimizations. Some background in logic,
PL and compilers is needed. The student might need to learn interactive
theorem proving in HOL4, but that can be picked up at the start of the
project. Non-exhaustive list of ideas here: https://cakeml.org/projects.
Rapid Type Analysis (RTA) [1] is a widely used whole-program interprocedural analysis for OO languages such as Java. Java projects often share a large common codebase (e.g., the JDK). To speed up RTA on large Java projects, one crucial point is to reduce redundant analyses of the common codebase. I want to develop a library summarization technique [2,3] to achieve this goal.
[1] David F. Bacon and Peter F. Sweeney. 1996. Fast static analysis of C++ virtual function calls. In Proceedings of the 11th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '96). ACM, New York, NY, USA, 324-341.
[2] Hao Tang, Di Wang, Yingfei Xiong, Lingming Zhang, Xiaoyin Wang, and Lu Zhang. 2017. Conditional Dyck-CFL Reachability Analysis for Complete and Efficient Library Summarization. In Proceedings of the 26th European Symposium on Programming Languages and Systems - Volume 10201, Hongseok Yang (Ed.). Springer-Verlag New York, Inc., New York, NY, USA.
[3] Sulekha Kulkarni, Ravi Mangal, Xin Zhang, and Mayur Naik. 2016. Accelerating program analyses by cross-program training. SIGPLAN Not. 51, 10 (October 2016), 359-377.
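The core RTA fixpoint can be sketched in a few lines: a virtual call site only dispatches to classes that have been instantiated somewhere in already-reachable code. The tiny "program" below is a hypothetical example, not real Java:

```python
# Hedged sketch of the RTA fixpoint over a toy program representation:
# method -> (classes it instantiates, list of (receiver class, method) calls).
program = {
    "Main.main": ({"A"}, [("A", "run"), ("B", "run")]),
    "A.run":     (set(), []),
    "B.run":     ({"B"}, []),  # never reached: B is never instantiated
}

def rta(entry, program):
    instantiated, reachable = set(), set()
    worklist = [entry]
    while worklist:
        m = worklist.pop()
        if m in reachable:
            continue
        reachable.add(m)
        new_classes, _ = program[m]
        instantiated |= new_classes
        # Re-examine call sites in all reachable methods: a call C.m is
        # taken only if class C is instantiated in reachable code.
        for rm in list(reachable):
            for cls, name in program[rm][1]:
                if cls in instantiated:
                    worklist.append(f"{cls}.{name}")
    return reachable

print(rta("Main.main", program))  # B.run is pruned: B is never instantiated
```

A library summary would precompute, for the shared codebase, how sets like `instantiated` and `reachable` propagate, so each client analysis pays only for its own code.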
Students can participate in a related research project by implementing novel lock-free and lock-based data structures based on recent publications in this area. These data structures could serve as a benchmarking baseline for testing transactional memory implementations and other parallel systems. Students could gain substantial experience and knowledge by conducting performance analysis and tuning on data structures they have coded themselves.
There are many possible projects outlined in the Penrose issue tracker: https://github.com/penrose/penrose/issues
- add signed distance functions for arbitrary objects
- design a prodirect manipulation capability
- design semantics or extensibility mechanisms for Substance and Style
- design an optimization-aware GUI
- add a domain to Penrose (like group theory)
- perform user studies and evaluations of the system
The project on spherically inverted convex hulls has some potential
applications to collision testing or other forms of simulation, since it
enables relatively efficient point inclusion tests against non-convex
volumes -- in the limit, all star domains can be efficiently tested. There
are some problems that an undergrad might be able to help with: selecting a
good sampling of inversion centers, for instance, or reducing the number of
half-space tests by simplifying the convex hull in the inverted space.
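The half-space test the paragraph relies on is simple to state: a convex region stored as {x : Ax <= b} admits point-inclusion testing with one dot product per face. The unit square below is an illustrative stand-in for a convex hull in the inverted space; simplifying that hull means reducing the number of rows of A:

```python
# Minimal half-space point-inclusion test against a convex region A @ x <= b.
import numpy as np

# Unit square [0,1]^2 as an intersection of 4 half-spaces.
A = np.array([[-1.0,  0.0],   # -x <= 0   (x >= 0)
              [ 1.0,  0.0],   #  x <= 1
              [ 0.0, -1.0],   # -y <= 0   (y >= 0)
              [ 0.0,  1.0]])  #  y <= 1
b = np.array([0.0, 1.0, 0.0, 1.0])

def inside(p):
    # One dot product per half-space; cost is linear in the number of faces.
    return bool(np.all(A @ p <= b))

print(inside(np.array([0.5, 0.5])), inside(np.array([1.5, 0.5])))
```

For a star domain, the same test runs against the hull of the inverted points, so fewer half-spaces after simplification means proportionally cheaper inclusion queries.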
Back to CS300 home page.