Class 11. Storage and Locality

So far we have concentrated on the processor, with a simple model of
memory as an array of bytes. While valid, this model is not quite
right. Memory is actually implemented as a hierarchy of storage
devices that can have a big impact on the performance of your
programs. The design of this hierarchy is a fundamental consequence
of the properties of storage technologies and application programs.

Today's topics:
1. Storage technologies
2. Program locality

***********************
1. Storage technologies
***********************

**************************
Random-access memory (RAM)
**************************
	Trans	Access Needs	Needs
	per bit	time   refresh	EDC	Cost	Apps
SRAM	4-6	1x     No	Maybe	400x	Cache mems
DRAM	1	5-10x  Yes	Yes	1x	Main mems, frame bufs


DRAM organization:
d x w DRAM chip: dw total bits: d w-bit supercells
Example: 16 x 8 DRAM chip
Sketch out picture  (CS:APP Fig 6.3):
    * CPU
    * memory controller		
    * 2 bit address  (-->)
    * 8 bit data (<-->)
    * Memory chip
        * 4x4 array
	* internal row buffer       

Accessing supercell (i,j):
* RAS (Row access strobe): selects row i and copies to row buffer
* CAS (Col access strobe): selects col j from row buffer

Combining DRAM chips to form modules (banks):
* Example: 64 MB memory module consisting of 8 8M x 8 DRAM chips
* Sketch out picture  (CS:APP Fig 6.5):
* (RAS,CAS) concurrently to each chip selects 64 bits

* DRAM0 -> bits 0-7
* DRAM1 -> bits 8-15
* ...
* DRAM6 -> bits 48-55
* DRAM7 -> bits 56-63

At a higher level, this is the same technique used by the
high-performance banked memory systems of supercomputers.

Enhanced DRAMS:
Same core with better interface (exploit row buffer, sync)
* SDRAM (Synch DRAM): single-edge clocking (1 bit/pin/clock) 
* DDR SDRAM (Double data-rate SDRAM) double edge clocking (2 bits/pin/clock)
* DDR2 SDRAM: incremental improvement to DDR SDRAM

***********************
How CPU accesses memory
***********************

Sketch on a separate board and then save:

    *****CPU chip*****
    register <--> ALU
    file
     |
    bus interface      <--system bus--> I/O bridge <--mem bus--> main mem
    *****CPU chip*****

Memory read transaction (movl A, %eax): 
1. CPU places A on memory bus
2. Memory retrieves data and places it on bus
3. CPU reads data from bus and copies to %eax

Key idea: It takes time to read and write memory.

*****
Disks
*****

Disk geometry (sketch on board:
* platters
* surfaces
* tracks
* sectors and gaps
(Aligned tracks form a cylinder)

Disk capacity: 
* 512 bytes/sector
* 300 sectors/track (on average)
* 20,000 tracks/surface
* 2 surfaces/platter
* 5 platters/disk
Capacity= 512 x 300 x 20,000 x 2 x 5 = 30 GB

Note: To disk vendors, 1 GB = 10^9 bytes! (instead of 2^30 bytes)

Disk operation (sketch out):
* arm with read/write head moves (seeks) radially
* platter spins at fixed rotational rate

Typical parameters:
* rotational rate: 7,200 RPMs 
* seek time: 5 ms 	

Access time = T_avg_seek + T_avg_rotation + T_avg_transfer
* T_avg_seek: ~5 ms
* T_avg_rotation: 1/2 x 1/RPMs x 60s/1m (~4 ms)
* T_avg_transfer: 1/RPM x 1/avg sectors/track x 60s/1m (~0.02 ms)

Key ideas:
1. First bit per sector is expensive, remaining bits free
2. Twice seek time is a reasonable guess for access time
3. Disk much slower than memory
   * 40,000x (SRAM)
   * 2,500x (DRAM)

*********************
How CPU accesses disk
*********************

Add sketch of I/O bus onto system picture.

Disk controllers present abstraction of a sequence
of logical blocks to the CPU: LBN0, LBN1, LBN2, LBN3, ....

Reading the disk:
1. CPU writes command, LBN, and dest memory address to port (address)
associated with disk controller.

2. Disk controller reads sector and performs direct memory access
(DMA) into main memory.

3. When DMA completes, disk controller notifies CPU with interrupt
(asserts an interrupt pin on CPU).


******************
Key Storage Trends
******************
1. Fast storage devices are more expensive per byte, have less capacity
(Show table of storage devices 1980-2005)

2. The gap between CPU and memory speeds is widening
(Show cpu-memory gap graph 1980-2005)

***********
2. Locality
***********

Principle of locality: programs tend to use data items near other
recently used items, or that were recently used themselves. 

One of the big ideas of computer science, because it has such
enormous impact on the design of hardware and software systems.

If a program references some data item, then it is likely to:
1. reference that data item in the near future (temporal locality).
2. reference a nearby item soon (spatial locality).

**************************************
Locality of references to program data:
**************************************

Example of good locality:

int sumvec(int v[N]) 
{  	
    int i, sum = 0; 

    for (i = 0; i < N; i++)
	sum += v[i];
    return sum;	
}

Example of good locality:

int sumarrayrows(int a[M][N]) 
{  	
    int i, j, sum = 0;

    for (i = 0; i < M; i++) 
	for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;	
}


Example of bad locality:

int sumarraycols(int a[M][N]) 
{  	
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
	for (i = 0; i < M; i++) 
	    sum += a[i][j];
    return sum;	
}

*******************************
Locality of instruction fetches
*******************************

Example of good instruction locality:

int sumvec(int v[N]) 
{  	
    int i, sum = 0; 

    for (i = 0; i < N; i++)
	sum += v[i];
    return sum;	
}

*******************
Summary of locality
*******************

Qualitatively evaluating programs for locality:
1. Referencing the same variable -> good temporal locality.
2. smaller strides -> good spatial locality.
3. Stride-1 (sequential) -> best spatial locality.
3. Loops -> good instruction locality.

******************
Locality Exercises
******************

Permute the loops to give good locality:
a[i][j][k]

int sumarray3d(int a[N][N][N]) 
{  	
    int i, j, k, sum = 0;

    for (i = 0; i < N; i++) {
	for (j = 0; j < N; j++) {
	    for (k = 0; k < N; k++) {
		sum += a[k][i][j];
	    }
	}
    }
    return sum;	
}

Rank the following examples:

Solution: best to worst: 
clear1 (stride 1 in memory),
clear2 (semi-sequential)
clear3 (stride-6 access)

#define N 1000
typedef struct {
    int vel[3];
    int acc[3];
} point;

point p[N];


void clear1(point *p, int n) 
{
    int i, j;

    for (i = 0; i < n; i++) {
	for (j = 0; j < 3; j++) 
	    p[i].vel[j] = 0;
	for (j = 0; j < 3; j++) 
	    p[i].acc[j] = 0;
    }
}


void clear2(point *p, int n) 
{
    int i, j;

    for (i = 0; i < n; i++) {
	for (j = 0; j < 3; j++) {
	    p[i].vel[j] = 0;
	    p[i].acc[j] = 0;
	}
    }
}

void clear3(point *p, int n) 
{
    int i, j;

    for (j = 0; j < 3; j++) {
	for (i = 0; i < n; i++)
	    p[i].vel[j] = 0;
	for (i = 0; i < n; i++)
	    p[i].acc[j] = 0;
    }
}