Research Interests

 

Multi-core and multi-processor architecture, memory systems, database systems, data-intensive high-performance computing.

Current Research

 

Advances in semiconductor fabrication have led to an abundance of on-chip transistors. Within a decade, chips are projected to employ hundreds of cores and tens of megabytes of on-chip cache. However, even today's processors are far from realizing their full potential, as they spend much of their execution time stalled on long-latency data accesses. At the same time, current commercial server software is not ready for a dramatic shift in the underlying hardware architecture. Conventional software offers only limited parallelism and exhibits adverse data access and sharing patterns that hinder performance even on today's processors. My research targets novel large-scale multicore designs and highly parallel software architectures, along with scalable techniques to evaluate their performance. It proceeds along three synergistic fronts:

 

·         Scalable Chip Hardware. To efficiently explore the vast design space of large-scale multicore processors, I am developing ADviSE (Analytic Design Space Exploration) as part of my dissertation research. ADviSE is a collection of analytic models that estimate the overall performance of multicore designs in order to optimally divide the shared on-chip hardware resources. The models conform to physical constraints (i.e., area, power, thermal, bandwidth) and respect the trade-offs among performance, core count and technology, cache size, operating frequency and voltage, power, and bandwidth. The overall analysis yields design guidelines for multicore processors, targeting both peak-performance and power-optimal designs across process technologies down to 20nm. Among other predictions, the models forecast that growing cross-chip communication latencies necessitate a departure from conventional cache designs with a single, uniform access latency. Instead, future designs will decompose the cache into slices distributed across the entire chip and co-locate each slice with one or more cores. Such distributed shared caches expose a continuum of latencies to the application, making the hit time a function of the line's physical location within the aggregate cache.

R-NUCA (Reactive Non-Uniform Cache Architecture) is a novel distributed shared cache architecture for multicore processors running commercial, scientific, or desktop workloads. R-NUCA is scalable, simple to implement, and minimizes data access latency without wasting capacity. It minimizes on-chip communication and maximizes hit rate by replicating and migrating cache blocks based on the type of each access.

While processors have experienced unprecedented performance improvements, DRAM speeds have lagged behind, resulting in an ever-increasing processor-memory performance gap. Spatio-Temporal Memory Streaming (STeMS) is a new memory system architecture in which data moves in correlated groups (streams) rather than as individual cache blocks, in order to enhance memory-level parallelism, hide memory latency, and improve the utilization of on-chip storage and pin bandwidth. Our preliminary results indicate that a STeMS-based system can eliminate over 60% of shared cache misses in online transaction processing server software.

 

·         Scalable Parallel Software. Conventional software offers limited parallelism because it has been optimized for architectures with core-private resources and coarse-grain, OS-managed threads with no coordination of resource usage. Our work in StagedDB/CMP proposes software staging to expose high levels of fine-grain parallelism to the execution system and to render data access and sharing patterns predictable. Staging decomposes otherwise single-threaded requests into smaller tasks that can execute in parallel. At the same time, it makes memory accesses predictable, so that simple architectural mechanisms (e.g., software-controlled hardware streaming engines) can remove data access latencies from the workload's critical path.

 

·         Scalable Performance Evaluation Techniques for Large-Scale Systems. Computer architects have long relied on software simulation to measure dynamic performance metrics (e.g., CPI) of a proposed design. Unfortunately, with the ever-growing size and complexity of modern hardware, detailed software simulators have become four or more orders of magnitude slower than their hardware counterparts. Slow simulation has prevented researchers from running complete benchmarks and input sets, or realistic system sizes, on detailed simulators. The SimFlex project targets fast, accurate, and flexible simulation of large-scale multiprocessor and multicore systems. SimFlex is proceeding along two synergistic fronts: (a) Flexus, a powerful and flexible full-system simulator framework that relies heavily on well-defined component interface models to facilitate both model integration and compile-time simulator optimization; and (b) SMARTS, a simulation methodology that applies rigorous statistical sampling theory to reduce simulation turnaround by several orders of magnitude while achieving high accuracy and confidence in estimates.
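
The statistical sampling idea behind SMARTS can be illustrated with a minimal sketch. This is not the SMARTS implementation: the function names, the synthetic CPI trace, and the sampling interval are hypothetical, and the sketch assumes equal-sized measurement units (so the mean of per-unit CPIs equals the overall CPI) and a normal approximation for the 95% confidence interval (z = 1.96). Only every k-th unit is "simulated in detail", and the estimate is reported together with its confidence half-width.

```python
import math
import random
import statistics


def sample_mean_cpi(cpi_per_unit, interval, z=1.96, offset=0):
    """Estimate mean CPI by systematic sampling.

    Measures every `interval`-th unit starting at `offset` and returns
    (estimate, confidence half-width, number of sampled units).
    z=1.96 corresponds to ~95% confidence under a normal approximation.
    """
    sample = cpi_per_unit[offset::interval]
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two sampled units")
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * std_err, n


def synthetic_trace(num_units, seed=0):
    """Hypothetical per-unit CPI values standing in for a detailed simulation."""
    rng = random.Random(seed)
    # Three made-up program behaviors (low, medium, high CPI) plus small noise.
    return [rng.choice((0.8, 1.2, 3.0)) + rng.gauss(0, 0.05)
            for _ in range(num_units)]


if __name__ == "__main__":
    trace = synthetic_trace(100_000)
    estimate, half_width, n = sample_mean_cpi(trace, interval=100)
    full_mean = statistics.fmean(trace)
    print(f"sampled units: {n} of {len(trace)}")
    print(f"estimated CPI: {estimate:.3f} +/- {half_width:.3f}")
    print(f"full-trace CPI: {full_mean:.3f}")
```

In this toy setting, measuring 1% of the units stands in for the reduction in detailed simulation work, and the half-width quantifies the confidence in the estimate; a real sampling simulator must also keep microarchitectural state warm between measured units, which this sketch does not model.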