Several DOE reports have identified scale, energy consumption, and resilience as the main roadblocks on the path to exascale; in turn, energy consumption correlates with communication. Progress in these areas is impeded both by the lack of a suitable software infrastructure (to control communication and to handle fault detection, isolation, and recovery) and by a lack of understanding of fundamentals: the inherent communication complexity of problems and algorithms, and the inherent resilience of problems and algorithms. Our talk will focus on formulating these fundamental issues and proposing some open problems.
All of this is based on the view that the challenge of exascale, as well as the commercial exploitation of large commodity clusters, forces us to confront failure as a central element of computation.
(1) Develop simple but powerful (formal, concurrent) calculi that integrate shared-memory concurrency and communication (cf. APGAS) with elasticity and resilience. (Elasticity is the ability of a computation to grow or shrink based on the availability of computational resources. Resilience is the ability to recover from the failure of a computational element (node, sub-node) while redoing only the portions of the computation assigned to the failed element.) Resilience for concurrent functional data structures is relatively easy (cf. Hadoop); the challenge is to get its benefits without giving up the performance benefits of shared-memory concurrency.
An associated challenge is to develop calculi that permit the compiler/programmer to formally establish that certain desirable application invariants (for concurrent, distributed/multi-node programs) continue to hold even in the presence of node failure. This requires integrating notions of spatiality, failure, and knowledge into process calculi. (Cf. Knight, Panangaden, Palamidessi, and Valencia on spatial and epistemic modalities for the Concurrent Constraint Programming calculus, forthcoming in CONCUR.)
(2) Develop a framework for types based on assertions about patterns of communication, computation, and concurrency (cf. work on session types by Yoshida, Honda, et al., and some of our recent work on clocked types).
(3) Develop frameworks for computer-aided development of concurrent/distributed/resilient programs (cf. sketching, by Solar-Lezama, Bodik, and colleagues), exploiting symbolic analysis, (constrained) type inference, etc.
(4) (4 is the new 3.) Develop a framework for "domain-specific calculi". Arguably, some of the power of core process calculi like CSP, CCS, and the pi-calculus arises from the distillation of concurrency notions at the expense of any realistic notion of data. (Yes, channels are very powerful. No, I don't want to write my distributed maximal-clique graph algorithm using channels.)
Our work on Concurrent Constraint Programming showed how one could develop a very simple but powerful concurrent calculus while leaving it parametric in a very rich notion of data (essentially leveraging the power of logic to describe data structures). Certain basic results about the calculus (e.g., determinacy of programs) continue to hold independent of the choice of the data parameter.
The goal of domain-specific calculi is to make it very simple to develop formal computational calculi for a variety of application domains in such a way that they directly reflect the notions of data and computation appropriate to each domain.
The large chip makers' (not completely voluntary) switch to multiprocessor architectures circa 2004 was to a large extent due to thermal/energy issues. Looking forward, thermal/energy issues are the main impediment to Moore's law translating into an exponential increase in the number of processors per chip, and they are the main impediment to the development of exascale computing. Some of these issues are algorithmic in nature and involve managing energy as a computational resource. I believe it will be beneficial to develop a theory/science of energy/temperature/power as computational resources, based on reasoning about these resources in simple models, much as the traditional theory of computer science is based on reasoning about time and space as computational resources in simple models such as circuits and random-access machines. However, it seems that different models are required to study energy as a computational resource, because the physics of time and space is sufficiently different from the physics of energy (for example, one consequence of the Church-Turing thesis is that certain computations require a minimum amount of time and space, but there appear to be no physical laws that mandate a minimum energy for computation). I will give a couple of current research challenges related to energy as a computational resource.
MapReduce has been the workhorse of large-scale data analysis for the past ten years, powering everything from search engines and dating websites to document translation and bioinformatics. However, there is a large class of problems for which the tradeoffs made in the MapReduce paradigm are at odds with the underlying problem structure. In this talk we will describe some of the known knowns and known unknowns in this area.
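As a reminder of the paradigm's shape, and of the shuffle step whose all-to-all communication cost is often where those tradeoffs bite, here is a minimal single-process sketch of MapReduce-style word counting; the function names are illustrative, not any particular framework's API.

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key (the all-to-all communication step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["to be or not to be", "to know is to know"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

Problems whose structure is iterative or graph-shaped pay the full shuffle cost on every round, which is one way the paradigm's tradeoffs clash with the underlying problem.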
Emerging real-world graph problems include detecting community structure in large social networks, improving the resilience of the electric power grid, and detecting and preventing disease in human populations. Unlike traditional applications in computational science and engineering, solving these problems at scale often raises new challenges because of sparsity and the lack of locality in the data, the need for additional research on scalable algorithms and for frameworks for solving these problems on high-performance computers, and the need for improved models that also capture the noise and bias inherent in the torrential data streams. In this talk, the speaker will discuss the opportunities and challenges in massive data-intensive computing for applications in computational biology, genomics, and security.
A standard, well-studied definition of "efficient" parallel algorithms is the class of NC algorithms. Such algorithms run in polylogarithmic parallel time (depth, or span) and polynomial work (number of operations). Although the class does give a sense of the limits of parallelism for a problem, the definition allows polynomially more resources than sequential algorithms for the same problem. This makes the class of little use when energy or the total cost of a computation is of concern, which is almost always the case. An alternative definition of an "efficient" parallel algorithm is the class of algorithms that asymptotically require no more work than the best sequential algorithm for the problem but achieve polynomial speedup (sequential time / parallel time). For example, under this definition a linear-depth, quadratic-work algorithm is efficient if the optimal sequential solution to the problem requires quadratic time.
This class is incomparable with NC (i.e., some algorithms are in NC but not in this class, and vice versa). In this talk I will describe some open and important problems regarding this class, including the basic problems of matrix inversion and single-source shortest paths.
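To make the work and span measures concrete, the following toy instrumentation (my own illustration, not from the talk) counts both quantities for a balanced parallel reduction, which does O(n) work in O(log n) span and is therefore work efficient relative to the sequential sum.

```python
def reduce_tree(xs):
    """Balanced reduction over a nonempty list.
    Returns (sum, work, span), where work counts total combining
    operations and span counts the longest dependence chain."""
    n = len(xs)
    if n == 1:
        return xs[0], 0, 0
    mid = n // 2
    lsum, lwork, lspan = reduce_tree(xs[:mid])   # conceptually runs in
    rsum, rwork, rspan = reduce_tree(xs[mid:])   # parallel with the other half
    # One combining operation; span is the deeper subtree plus that op.
    return lsum + rsum, lwork + rwork + 1, max(lspan, rspan) + 1

total, work, span = reduce_tree(list(range(1024)))
```

For n = 1024 this yields work n - 1 = 1023 and span log2(n) = 10: the same work as the sequential sum, with polynomial (here, near-linear) speedup available.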
Work-efficient, highly parallel algorithms for problems such as single-source shortest paths and maximum flow have been notoriously hard to find. More recent work has centered on fast and fairly parallel solvers for SDD linear systems as a primitive for parallelizing these problems. This work is only the beginning of an approach to producing real parallel code and algorithms.
This talk discusses research challenges for parallel computing as applied to big-data applications. First, I discuss the problem of ingesting and indexing massive data sets. Traditional storage systems (databases and file systems) based on B-trees are I/O-bound on many workloads. Write-optimized indexing is a technique that can dramatically reduce the number of I/Os associated with some traditional workloads, enabling big-data applications to scale by orders of magnitude. However, write-optimized storage systems are CPU-bound on many more workloads, and parallelism has emerged as a critical issue for such systems. The talk is based partly on my experience at Tokutek, which has built a database storage engine.
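A toy model (my own illustration, not Tokutek's actual data structure) of why write optimization helps: if the index buffers inserts and flushes a batch of B keys in one write, point inserts cost roughly 1/B I/Os each instead of roughly one I/O each, at the price of extra CPU work managing buffers, which is one source of the CPU-bound behavior noted above.

```python
class BufferedIndex:
    """Toy write-optimized index: batch inserts, flush in one 'I/O'."""
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.buffer = []
        self.store = {}      # stands in for the on-disk structure
        self.io_count = 0    # number of batch writes issued

    def insert(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.io_count += 1               # one batched write
            self.store.update(self.buffer)   # CPU work: merge the buffer in
            self.buffer = []

idx = BufferedIndex(buffer_size=64)
for i in range(1000):
    idx.insert(i, i * i)
idx.flush()
```

Here 1000 inserts cost 16 batch writes rather than ~1000 individual ones; an unbuffered B-tree-style index would pay closer to one I/O per insert on a random workload.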
The computing industry faces a critical crisis: in many environments, the gap between the parallelism we can create in hardware and the parallelism available in software has reached unprecedented proportions. This gap imperils the historic generation-to-generation performance scaling that has long driven the computer hardware and software industries. It has created a renewed interest in all things parallel (compilation, languages, etc.), along with a renewed urgency. However, this is not a new field; as a result, this new rush into parallel research has not been greeted by low-hanging fruit. The easy problems were solved years ago. What are left are the hard problems, such as codes and algorithms historically resistant to parallelism. It is time to start picking through the neglected corners and finding new opportunities for parallelism. Here are a few of those.
1. Parallel speedup of serial code. Many important problems and algorithms are inherently serial. Even highly parallel code, given enough hardware parallelism, eventually becomes bottlenecked on its serial sections. The need for performance scaling demands that we continue to speed up these codes. However, the roadmaps for future high-performance processors have stripped us of all the traditional mechanisms for speeding up a single thread and left us only one: hardware parallelism. So how do we use parallel hardware to get parallel speedup of serial code? We have a series of techniques in this area that we refer to as "non-traditional parallelism", because they achieve parallel speedup without actually offloading any computation from the original serial thread. But I believe there are more opportunities.
2. Support for short/short-lived/agile threads. The large pockets of parallelism are easy to exploit. However, small pockets are another story altogether. On modern machines, it costs so much to create and start a new thread that unless the thread contains hundreds of thousands of instructions, the startup costs cannot be amortized. If instead we could profitably exploit a 100-instruction thread (rather than a 200,000+-instruction one), the parallelism available to the programmer would multiply. Several newly proposed execution and compilation models create short threads, but today's systems cannot take advantage of them. Even if we eliminate software overheads and assume hardware support, the cost is still orders of magnitude too high.
3. Parallelism available in unwritten code. The greatest hope for new parallelism is code not yet written. This argues for progress in two directions. First, how do we educate programmers whose parallel-programming productivity is as high as that of traditional serial programmers? Second, how do we accelerate the discovery of new parallel opportunities? For the latter, I don't believe we can ask a select few to tell us where the opportunities are; I believe most will come from unexpected sources. Can we create the illusion of massively parallel hardware in the hands of a reasonable number of creative people (students?) and see what they do with it?
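The amortization arithmetic behind point 2 can be sketched with a simple cost model (the numbers below are illustrative placeholders, not measurements): a task of w instructions launched on a thread with startup overhead o only pays off when w is large relative to o.

```python
def parallel_benefit(work, overhead, procs):
    """Toy model: speedup of splitting `work` instructions across
    `procs` threads, each paying `overhead` instructions to start.
    A result below 1 means parallelization is a slowdown."""
    serial_cost = work
    parallel_cost = work / procs + overhead
    return serial_cost / parallel_cost

# With a hypothetical 10,000-instruction startup overhead, a
# 100-instruction task is a huge slowdown, while a
# 1,000,000-instruction task amortizes the overhead and wins.
small = parallel_benefit(work=100, overhead=10_000, procs=8)
large = parallel_benefit(work=1_000_000, overhead=10_000, procs=8)
```

Driving the effective overhead down by orders of magnitude is exactly what would move the break-even point from hundreds of thousands of instructions toward the 100-instruction threads described above.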
Anyone attending a workshop titled "Research directions in the principles of parallel computation" is well aware that, as parallel computing professionals, we live in difficult times. Future systems will be composed of a heterogeneous ensemble of processing elements distributed in space from your pocket to an anonymous data center. These future systems will have so many components running at near-threshold voltages and at such high densities that errors (including silent errors) will be common during the execution of a typical application. While the details from different prognosticators vary, these hardware trends are clear.
What is not clear is how software will need to adapt to run on these new systems. The software challenges raised by this vision of the future dwarf anything facing hardware researchers. Of these research challenges, the following three (listed in order of importance) are the most critical:
In this talk, I will clearly explain what each of these research problems entails and describe some possible approaches we can take to solve them.
Modern high performance machines look increasingly different from those in the past. They are more hierarchical, with non-uniform memory access within a node and even within a single socket, resulting in a wider range of communication costs. They consist of heterogeneous computational elements, providing different performance and capabilities at different energy costs. Fault-tolerance is a growing concern due to a trade-off between failure rates and power use at the chip level, combined with a growing number of components in large scale systems. In this talk, we discuss three approaches to these challenges, focusing on machine hierarchy. The first is to expose the problem directly to the user in the programming model, and we present the hierarchical partitioned global address space (HPGAS) and recursive single-program, multiple-data (RSPMD) models that do so for machine hierarchy. Other solutions include using compiler analysis to automatically tackle the problem and building domain-specific libraries that hide it from the application programmer. We briefly discuss the latter two approaches, as well as some open questions in handling the three problems of hierarchy, heterogeneity, and resilience.
Build-first, figure-out-how-to-program-later approaches, and the vendor hardware that followed, have repeatedly collided with efforts to place parallel computing on a principled foundation. It has been difficult to argue for the robustness/validity of principled insights pursued by NSF-funded research when the only platforms available mandate overcoming not only fundamental obstacles but also accidental design elements. For decades, these approaches have also led to crisis after crisis in performance and ease of programming, harmed the establishment of a solid scientific approach to many aspects of parallel computing, and skewed a formative stage in the upbringing of the next generation of parallel computing researchers.
I will present evidence that a principle-based foundation for parallel computing is feasible. In particular, recent experimental results demonstrate order-of-magnitude improvements of our explicit multi-threaded (XMT) platform over recent commercial platforms on: (i) speedups for the most advanced irregular PRAM algorithms in the literature, as just reported at SPAA'12; and (ii) ease of programming, demonstrated, for example, by enabling high-school students to solve PhD-level problems.
The biggest research challenge in the principles of parallel computing is to finally establish that principles actually matter and can once and for all remedy the ills of the field. Cheerleading industrial accidents is inconsistent with advocating principles. Advancing XMT into a serious new "stack" proposition will require a dual effort: 1. Applications. Now that XMT has delivered on its promise to effectively support PRAM algorithms, establish, at least in principle, that XMT can provide order-of-magnitude improvements on applications of sufficient scientific (and/or market) interest. 2. Wall-clock runtime. To establish, in principle, that the XMT platform is attractive for applications, upgrade the current prototype to yield wall-clock speedups, making it appealing for application developers to try it.
Hardware support for multicore synchronization is changing. Both IBM and Intel now provide (different kinds of) hardware support for transactional memory. I think the effect of this change will be pervasive, affecting all levels of the software stack. Here are some open questions: How can we redesign data structures and libraries to exploit these changes? What features are missing from current hardware TM? Are coarse- and medium-grained speculation here to stay? What is the next step in multicore synchronization?
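To fix intuitions about the programming model at stake, here is a toy software emulation of an optimistic transaction (my own sketch, not the IBM or Intel hardware interface): record versions on read, buffer writes, then validate and publish atomically, retrying on conflict.

```python
import threading

class TVar:
    """A transactional variable: a value plus a version counter."""
    def __init__(self, value):
        self.value, self.version = value, 0
        self.lock = threading.Lock()

def atomically(txn):
    """Run txn(read, write) optimistically; retry until it commits."""
    while True:
        reads, writes = {}, {}          # tvar -> version seen / new value
        def read(tv):
            if tv in writes:
                return writes[tv]
            reads.setdefault(tv, tv.version)
            return tv.value
        def write(tv, val):
            reads.setdefault(tv, tv.version)
            writes[tv] = val
        result = txn(read, write)
        # Commit: lock in a global order, validate versions, publish.
        tvars = sorted(set(reads) | set(writes), key=id)
        for tv in tvars:
            tv.lock.acquire()
        try:
            if any(tv.version != v for tv, v in reads.items()):
                continue                 # conflict: a var changed; retry
            for tv, val in writes.items():
                tv.value, tv.version = val, tv.version + 1
            return result
        finally:
            for tv in tvars:
                tv.lock.release()

a, b = TVar(100), TVar(0)
def transfer(read, write):
    write(a, read(a) - 30)
    write(b, read(b) + 30)
atomically(transfer)
```

Hardware TM offers roughly this interface at cache-line granularity, with conflicts detected by the coherence protocol rather than version checks, which is what makes redesigning data structures around it such an open question.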
The multicore revolution represents a paradigm shift in computing, as parallelism in its various forms becomes ubiquitous. Barring a few exceptions, the theory of parallel algorithms has remained focused on the decades-old approaches of designing algorithms for the PRAM model and for bulk-synchronous computing. Research on cost measures and algorithmic strategies geared to current parallel environments, such as multicores, is a key research area that needs to be strongly encouraged. In addition, heterogeneity in computing is here to stay, and a key research challenge is to design parallel algorithms that run efficiently both across a wide range of machine parameters and across distributed and shared-memory environments.
This talk will describe some of our results in the above areas and will discuss research directions and challenges. Developing strategies for energy efficiency and fault tolerance in parallel computing is another key research challenge.
What makes parallel programming hard is nondeterminism. Indeed, deterministic parallel programs are arguably as easy to understand as serial programs, but concurrent programs are hard to understand and even harder to get right. We need to move parallel application programming away from concurrency and toward determinism.
The research community has already made strides in this direction. Concurrency platforms for multicore parallel programming, such as Cilk, encapsulate the nondeterminism of scheduling, allowing programmers to write deterministic parallel codes. Race-detection tools, such as Cilkscreen, offer provable guarantees of determinism for programs free of determinacy races. Reducer hyperobjects encapsulate the nondeterminism of updates to nonlocal variables, providing serial semantics for parallel updates. Pedigree mechanisms support deterministic parallel random-number generation. Blelloch and his collaborators have shown that some applications that seem to require concurrency for good performance can, in fact, be programmed deterministically without paying a high cost.
These contributions are only first steps, however, since many applications cannot be programmed using fork-join models except by resorting to concurrency. We need to identify parallel-programming patterns that currently require concurrency and encapsulate the nondeterminism with linguistics that provide serial semantics. We need to develop compilers, efficient runtime systems, and productivity tools that support these linguistics. Finally, we need to test the linguistics and tools by parallelizing large-scale applications and benchmarks, and then close the feedback loop by enhancing the technology to make it simpler, more efficient, and more robust.
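The reducer idea mentioned above can be sketched in a few lines (a schematic illustration, not the Cilk runtime): each strand updates a private view, and views are combined with an associative operation in the serial order of the strands, so the result is the same as a serial execution regardless of how work was scheduled.

```python
class ListReducer:
    """Toy reducer: per-strand views of a list, merged in serial order."""
    def __init__(self):
        self.view = []

    def append(self, x):
        self.view.append(x)              # strand-local update: no races

    @staticmethod
    def combine(left, right):
        """Associative merge that preserves serial (left-to-right) order."""
        merged = ListReducer()
        merged.view = left.view + right.view
        return merged

# Two halves of a loop run as separate strands, each with its own view;
# the runtime would create and combine the views automatically.
left, right = ListReducer(), ListReducer()
for i in range(0, 4):
    left.append(i)
for i in range(4, 8):
    right.append(i)
result = ListReducer.combine(left, right).view
```

Because combining is associative and applied in serial order, the reduced list matches what the serial loop would produce, which is exactly the serial semantics for parallel updates described above.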