Research Interests
Multi-core and multi-processor architecture, memory
systems, database systems, data-intensive high-performance computing.
Current Research
Technological advancements in semiconductor
fabrication have led to an abundance of on-chip transistors. Within a decade,
chips are projected to employ hundreds of cores and tens of megabytes of
on-chip cache. However, even today's processors are far from realizing their
full potential, as they spend much of the execution time stalled on
long-latency data accesses. At the same time, current commercial server
software is not ready for a dramatic shift in the underlying hardware
architecture. Conventional software offers only limited parallelism and exhibits
adverse data access and sharing patterns that hinder performance even on
today's processors. My research targets novel large-scale multicore designs and
highly parallel software architectures, along with scalable techniques to
evaluate their performance. It proceeds along three synergistic fronts:
· Scalable Chip Hardware. To explore the vast design space of large-scale
multicore processors efficiently, I am
developing ADviSE (Analytic Design Space Exploration) as part of
my dissertation research. ADviSE is a collection of
analytic models that estimate the overall performance of multicore designs in
order to divide the shared on-chip hardware resources optimally. The models conform to
physical constraints (i.e., area, power, thermal, bandwidth) and respect the
trade-offs between performance, core count and technology, cache size,
operational frequency and voltage, power and bandwidth. The overall analysis
devises design guidelines for multicore processors, targeting both
peak-performance and power-optimal designs across process technologies up to
20nm. Among other predictions, the models forecast that the growing cross-chip
communication latencies necessitate a departure from conventional cache designs
with a single, uniform access latency. Instead, future
designs will decompose the cache into slices distributed across the entire chip
and co-locate each slice with one or more cores. Such distributed shared caches
expose a continuum of latencies to the application, making the hit time a
function of the line's physical location within the aggregate cache. R-NUCA
(Reactive Non-Uniform Cache Architecture) is a novel distributed shared
cache architecture for multicore processors running commercial or
scientific/desktop workloads. R-NUCA is scalable and simple to implement, and it minimizes
data access latency without wasting capacity. It minimizes on-chip
communication and maximizes hit rate through replication and migration of cache
blocks, based on the type of each access. While processors have
experienced unprecedented performance improvements, DRAM speeds have lagged
behind, resulting in an ever-increasing processor-memory performance gap. Spatio-Temporal Memory
Streaming (STeMS) is a new memory system architecture in which data
moves in correlated groups (streams) rather than as individual cache blocks to
enhance memory-level parallelism, hide memory latency, and improve on-chip
storage utilization and pin bandwidth. Our preliminary results indicate that a
STeMS-based system can eliminate over 60% of shared cache misses in on-line
transaction processing server software.
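To make the notion of location-dependent hit time concrete, the following is a minimal sketch of a tiled chip in which each tile holds one core and one cache slice. It assumes simple address interleaving across slices and a mesh interconnect; it is not the R-NUCA placement policy, and the mesh size, block size, and cycle counts are hypothetical values chosen only for illustration.

```python
from typing import Tuple

# Hypothetical parameters (assumptions, not taken from any real design).
MESH_DIM = 4            # tiles per side of the square mesh
BLOCK_BYTES = 64        # cache block size in bytes
SLICE_HIT_CYCLES = 10   # hit latency inside a slice, excluding the network
HOP_CYCLES = 3          # latency of one network hop between adjacent tiles


def home_slice(address: int) -> int:
    """Interleave consecutive cache blocks across all slices on the chip."""
    return (address // BLOCK_BYTES) % (MESH_DIM * MESH_DIM)


def tile_coords(tile: int) -> Tuple[int, int]:
    """Map a tile number to its (row, column) position on the mesh."""
    return divmod(tile, MESH_DIM)


def hit_latency(core_tile: int, address: int) -> int:
    """Round-trip hit latency from a core to the slice that holds the block."""
    r0, c0 = tile_coords(core_tile)
    r1, c1 = tile_coords(home_slice(address))
    hops = abs(r0 - r1) + abs(c0 - c1)  # Manhattan distance on the mesh
    return SLICE_HIT_CYCLES + 2 * hops * HOP_CYCLES


# Same core, two different blocks: one in the local slice, one in the far corner.
local_hit = hit_latency(core_tile=0, address=0)
remote_hit = hit_latency(core_tile=0, address=15 * BLOCK_BYTES)
```

Under these assumptions the same core sees a hit time that grows with the distance to the block's slice, which is the continuum of latencies that placement, replication, and migration policies try to exploit.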
· Scalable Parallel Software. Conventional
software offers limited parallelism because it has been optimized for
architectures with core-private resources and coarse-grain OS-managed threads
with no resource usage coordination. Our work in StagedDB/CMP
proposes software staging to expose high levels of fine-grain parallelism to
the execution system and render data access and sharing patterns predictable.
Staging decomposes otherwise single-threaded requests into smaller tasks that
can execute in parallel. At the same time, it makes memory accesses
predictable, so that simple architectural mechanisms (e.g., software-controlled
hardware streaming engines) can remove data access latencies from the
workload's critical path.
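The staging idea can be illustrated with a small sketch in which a request flows through a chain of stages, each served by its own worker and connected to the next by a queue. This is only an illustration of the general structure, not the StagedDB/CMP implementation; the function names and the three example stages (parse, scan, aggregate) are hypothetical.

```python
import queue
import threading


def run_stage(work, inbox, outbox):
    """Worker loop for one stage: apply `work` to each packet and pass it on."""
    while True:
        packet = inbox.get()
        if packet is None:       # sentinel: propagate shutdown downstream
            outbox.put(None)
            return
        outbox.put(work(packet))


def staged_pipeline(stages, packets):
    """Push packets through a chain of stages, one worker thread per stage."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for p in packets:
        queues[0].put(p)
    queues[0].put(None)          # no more input
    results = []
    while True:
        item = queues[-1].get()
        if item is None:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Hypothetical stages of a tiny "query": parse -> scan -> aggregate.
def parse(text):
    return [int(x) for x in text.split(",")]


def scan(values):
    return [v for v in values if v % 2 == 0]


def aggregate(values):
    return sum(values)


totals = staged_pipeline([parse, scan, aggregate], ["1,2,3,4", "10,11,12"])
```

Because each stage runs the same code on every packet it receives, its data accesses are more regular than those of a monolithic request handler, and different requests can occupy different stages at the same time. Python threads are used here only to show the structure; they do not provide true multicore parallelism for CPU-bound work.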
· Scalable Performance Evaluation Techniques of Large-Scale Systems. Computer architects have long relied on software
simulation to measure dynamic performance metrics (e.g., CPI) of a proposed
design. Unfortunately, with the ever-growing size and complexity of modern
hardware, detailed software simulators have become four or more orders of
magnitude slower than their hardware counterparts. Slow simulation has barred
researchers from attempting complete benchmarks and input sets or realistic
system sizes on detailed simulators. The SimFlex
project targets fast, accurate and flexible simulation of large-scale
multiprocessor and multicore systems. SimFlex is proceeding along two
synergistic fronts: (a) Flexus, a powerful and flexible full-system simulator
framework that relies heavily on well-defined component interface models to
facilitate both model integration and compile-time simulator optimization; and (b)
SMARTS, a simulation methodology that applies rigorous statistical sampling
theory to reduce simulation turnaround by several orders of magnitude, while
achieving high accuracy and confidence in estimates.
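The sampling idea can be sketched as follows: instead of simulating every instruction in detail, measure CPI on a small, regularly spaced subset of equal-sized instruction windows and report the estimate together with a confidence interval. This is a minimal sketch under stated assumptions, not the SMARTS implementation; the function name, the synthetic per-window CPI values, and the normal approximation used for the interval are assumptions made for illustration.

```python
import math
import random


def systematic_sample_cpi(unit_cpis, interval, offset=0, z=1.96):
    """Estimate mean CPI from every `interval`-th measurement window.

    `unit_cpis` holds the CPI of consecutive windows with equal instruction
    counts. Returns (estimate, half_width), where half_width is the
    half-width of an approximate 95% confidence interval for z=1.96.
    """
    sample = unit_cpis[offset::interval]
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two sampled windows")
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, half_width


# Synthetic per-window CPI values (assumption: a real study would obtain
# these from detailed simulation of the sampled windows only).
rng = random.Random(0)
population = [1.0 + rng.random() for _ in range(100_000)]

estimate, half_width = systematic_sample_cpi(population, interval=100)
true_cpi = sum(population) / len(population)
```

In this sketch only one window in a hundred is measured, yet the estimate lands close to the full-population average, and the interval width tells the user whether the sample is large enough for the desired confidence.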