Homepage of Michael Papamichael

Computer Science Department

School of Computer Science

Michael K. Papamichael

FACS (FPGA Accelerated Multiprocessor Cache Simulator)

Group Members

Michael Papamichael
Wei Yu
Yongjun Jeon

Project Description
Current architectural-level, full-system, software-based simulators (e.g. Virtutech Simics) are limited in throughput, especially when simulating multiprocessor systems. The slowdown becomes even greater when additional modules, such as cache simulators, are attached to the simulator. Recent research in hybrid simulation has greatly accelerated simulation using FPGAs (e.g. Protoflex). This corresponds to arrow 1 in figure 1. However, attaching software-based modules (e.g. cache simulators) to FPGA-accelerated simulators greatly slows down the overall simulation speed. The goal of this project is to develop a functional simulator of a multiprocessor cache using FPGAs, which corresponds to arrow 2 in figure 1. Our ultimate goal is to integrate our cache module with Protoflex on a BEE2 board. We will evaluate our success based on the speedup we achieve compared to a purely software-based simulation.

Project Poster - The project poster can be found here.

Project Proposal - The full project proposal can be found here.

Project Milestone Report - The project milestone report can be found here.

Final Project Report - The final project report can be found here.

Results
Our current configuration simulates a Piranha-based cache model for a 16-CPU multiprocessor. Each processor has two 64KByte 2-way set associative L1 caches, one for instructions and one for data. All 16 CPUs share a common 16MByte 8-way set associative L2 cache. L1 and L2 cache lines are 64 bytes. For each L1 cache (both instruction and data) we maintain 5 counters, which are stored in a single on-chip memory shared by all L1 caches. The number in parentheses after each statistic is the hardware counter that stores it.

Number of Read (load) Hits (HW counter: 6)
Number of Write (store) Hits (HW counter: 4)
Number of Read (load) Misses (HW counter: 2)
Number of Write (store) Misses (HW counter: 0)
Number of Write (store) Misses that lead to an upgrade (cache has read-only copy) (HW counter: 5)

Below we present a simple example of a few memory references along with the expected cache behavior:

CPU 1 Reads (loads) from address 0x103C0 (TAG=0x2, INDEX=0xf, OFFSET=0x0)
--> expected result: Read Miss
CPU 1 Writes (stores) to address 0x103C2 (TAG=0x2, INDEX=0xf, OFFSET=0x2)
--> expected result: Write Miss that causes an upgrade
CPU 1 Reads (loads) from address 0x103C4 (TAG=0x2, INDEX=0xf, OFFSET=0x4)
--> expected result: Read Hit
CPU 2 Reads (loads) from address 0x103C6 (TAG=0x2, INDEX=0xf, OFFSET=0x6)
--> expected result: Read Miss that will also downgrade cache of CPU 1
CPU 2 Writes (stores) to address 0x103C8 (TAG=0x2, INDEX=0xf, OFFSET=0x8)
--> expected result: Write Miss that causes an upgrade and also invalidates cache of CPU 1
CPU 2 Reads (loads) from address 0x183C8 (TAG=0x3, INDEX=0xf, OFFSET=0x8)
--> expected result: Read Miss that fills second way of set at index f
CPU 2 Reads (loads) from address 0x103C8 (TAG=0x2, INDEX=0xf, OFFSET=0x8)
--> expected result: Read Hit
CPU 2 Reads (loads) from address 0x283C8 (TAG=0x5, INDEX=0xf, OFFSET=0x8)
--> expected result: Read Miss that replaces second way of cache (replacing 0x183C8)
CPU 2 Reads (loads) from address 0x183C8 (TAG=0x3, INDEX=0xf, OFFSET=0x8)
--> expected result: Read Miss
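The TAG, INDEX and OFFSET fields listed for each reference follow directly from the L1 geometry (64KByte, 2-way set associative, 64-byte lines, hence 512 sets). The following Python sketch shows the split; the function name `split_address` and the constant names are illustrative and are not taken from the project sources.

```python
# L1 geometry taken from the configuration described above.
CACHE_BYTES = 64 * 1024
WAYS = 2
LINE_BYTES = 64

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # 512 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1      # 6 bits of byte offset
INDEX_BITS = NUM_SETS.bit_length() - 1         # 9 bits of set index


def split_address(addr):
    """Return (tag, index, offset) for a byte address (hypothetical helper)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


for addr in (0x103C0, 0x103C2, 0x183C8, 0x283C8):
    tag, index, offset = split_address(addr)
    print(f"0x{addr:05X}: TAG=0x{tag:x}, INDEX=0x{index:x}, OFFSET=0x{offset:x}")
```

All addresses in the example map to set 0xf, which is why a third distinct tag forces a replacement in the 2-way set.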

The cumulative expected statistics for the above sequence of memory references, as reported by TraceCMPFlex, are given below.

Statistic	L1 of CPU 1	L1 of CPU 2
Number of Read (load) Hits	1	1
Number of Write (store) Hits	0	0
Number of Read (load) Misses	1	4
Number of Write (store) Misses	0	0
Number of Write (store) Misses that lead to an upgrade (cache has read-only copy)	1	1
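As a cross-check, the expected statistics can be reproduced with a small software model. The sketch below is not TraceCMPFlex and not our hardware design. It assumes LRU replacement and a simple two-state (read-only / writable) line model, both inferred from the example sequence, and it ignores the L2 cache and the instruction/data split. All class and function names are hypothetical.

```python
from collections import defaultdict

# L1 geometry from the text: 64 KByte, 2-way, 64-byte lines -> 512 sets.
LINE_BYTES, WAYS, NUM_SETS = 64, 2, 512
STATS = ("read_hit", "write_hit", "read_miss", "write_miss", "write_upgrade")


class L1Cache:
    def __init__(self):
        # Each set is a list of [tag, state] entries ordered LRU -> MRU.
        # state is "S" (read-only copy) or "M" (writable copy).
        self.sets = defaultdict(list)
        self.stats = dict.fromkeys(STATS, 0)

    def lookup(self, addr):
        line = addr // LINE_BYTES
        ways = self.sets[line % NUM_SETS]
        tag = line // NUM_SETS
        entry = next((e for e in ways if e[0] == tag), None)
        return ways, tag, entry


def access(caches, cpu, kind, addr):
    """Apply one reference ("R" or "W") and return the event name."""
    ways, tag, entry = caches[cpu].lookup(addr)
    if entry is None:
        event = "read_miss" if kind == "R" else "write_miss"
        if len(ways) == WAYS:
            ways.pop(0)                 # evict the LRU way (assumed policy)
        entry = [tag, "S"]
        ways.append(entry)
    else:
        ways.remove(entry)
        ways.append(entry)              # mark as most recently used
        if kind == "R":
            event = "read_hit"
        elif entry[1] == "M":
            event = "write_hit"
        else:
            event = "write_upgrade"     # write to a read-only copy
    if kind == "W":
        entry[1] = "M"
    # Coherence side effects on the other L1 caches.
    for other, cache in caches.items():
        if other == cpu:
            continue
        o_ways, _, o_entry = cache.lookup(addr)
        if o_entry is not None:
            if kind == "W":
                o_ways.remove(o_entry)  # invalidate the remote copy
            else:
                o_entry[1] = "S"        # downgrade the remote copy
    caches[cpu].stats[event] += 1
    return event


# The nine references from the example: (CPU, kind, address).
TRACE = [
    (1, "R", 0x103C0), (1, "W", 0x103C2), (1, "R", 0x103C4),
    (2, "R", 0x103C6), (2, "W", 0x103C8), (2, "R", 0x183C8),
    (2, "R", 0x103C8), (2, "R", 0x283C8), (2, "R", 0x183C8),
]


def run(trace, cpus=(1, 2)):
    caches = {cpu: L1Cache() for cpu in cpus}
    for cpu, kind, addr in trace:
        access(caches, cpu, kind, addr)
    return caches


for cpu, cache in run(TRACE).items():
    print(f"CPU {cpu}: {cache.stats}")
```

Under these assumptions the model yields one read hit, one read miss and one upgrade for CPU 1, and one read hit, four read misses and one upgrade for CPU 2, in agreement with the table.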

To verify the correct behavior of our design we fed the same sequence of references to our hardware cache model and observed the HW statistics counters using Chipscope. Chipscope is a tool that allows real-time monitoring of FPGA signals. Below is a snapshot of Chipscope that shows the references being fed to our cache model as well as some of the response signals (Read, Hit, Upgrade):

The next snapshot shows the gathered statistics for the L1 caches of CPU 1 and CPU 2, which match the expected results shown above. The first digit of the IdxR signal indicates the cache number, while the second digit indicates which counter is being read. The DoutR signal is the corresponding counter value, which is delayed by 1 cycle with respect to IdxR. This delay occurs because the data read from a memory corresponds to the address provided in the previous cycle. As an example, the DoutR value 4 belongs to the IdxR value 22, which corresponds to the Read Misses (HW counter 2) of CPU 2.
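The IdxR/DoutR pairing described here can be sketched in Python. The sketch assumes that each IdxR digit is one hexadecimal digit (4 bits). The sample traces are hypothetical values built from the expected statistics of CPU 2, not values read from the snapshot, and the function names are illustrative.

```python
# HW counter numbers as listed in the Results section.
COUNTER_NAMES = {
    6: "Read Hits",
    4: "Write Hits",
    2: "Read Misses",
    0: "Write Misses",
    5: "Write Misses that lead to an upgrade",
}


def decode_idx(idx):
    """Split a two-digit IdxR value into (cache number, counter number).

    Assumes each digit is one hexadecimal digit (4 bits).
    """
    return idx >> 4, idx & 0xF


def pair_samples(idx_trace, dout_trace):
    """Pair each IdxR sample with the DoutR sample seen one cycle later."""
    return [(decode_idx(i), d) for i, d in zip(idx_trace, dout_trace[1:])]


# Hypothetical per-cycle samples: DoutR lags IdxR by one cycle, so the
# first DoutR sample belongs to an address issued before the trace starts.
idx_trace = [0x20, 0x22, 0x25, 0x26]
dout_trace = [0, 0, 4, 1, 1]

for (cache, counter), value in pair_samples(idx_trace, dout_trace):
    print(f"cache {cache}, {COUNTER_NAMES[counter]}: {value}")
```

With this alignment, the value 4 is attributed to IdxR 22, that is, to the Read Misses counter of cache 2, as in the example given in the text.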

More complicated test cases were also used to thoroughly verify the correct behavior of our cache model. Below is a snapshot of the results of a sequence of 200 memory references that involve all 16 CPUs. The X axis corresponds to all of the available L1 statistics and the Y axis corresponds to the value of each statistics counter. (The statistics memory is continuously scanned, which explains the repetitive nature of the graph.)