---------------------------DO-NOT-DELETE-THIS-LINE--------------------------
1996 ISCA Electronic Referee's Report

Please return this report by Friday January 5 to jouppi@wrl.pa.dec.com
with a carbon-copy (cc) to thomas.gross@cs.cmu.edu

Please do not alter any existing fields/labels in this form.

Paper Number: 155
Paper Title: Improving Cache Performance of Plasma Simulation Codes through All-Cache Virtual Machine Emulation
---------------------------DO-NOT-DELETE-THIS-LINE--------------------------
Place a number for each item below corresponding to your evaluation.
(5=outstanding, 4=high, 3=medium, 2=fair, 1=poor)

Confidence in your evaluation: 4
Interest/Importance to ISCA: 4
Quality of presentation: 3
Technical contribution of paper: 3.5
---------------------------DO-NOT-DELETE-THIS-LINE--------------------------
(5=strong accept, 4=accept, 3=maybe, 2=reject, 1=strong reject)

Recommended action for paper: 3.5
---------------------------DO-NOT-DELETE-THIS-LINE--------------------------
Regardless of the recommendations, please state specific comments and
suggestions that can be communicated to the author(s):

Author Comments
===============

I found this to be a very interesting paper. The idea of a single unified approach to improving processor locality and memory reference locality for these kinds of codes is excellent.

One concern I have is that the paper never shows how significant the cache performance improvement is when multiple physical processors are used. Is the improvement still important when interprocessor communication is significant? It would be useful to know in what regimes improving the cache performance matters.

The code examples appear to be hand-transformed. From sections 4-5, it was unclear to me whether an actual compiler was used or not. I finally guessed that it was not, because there are typos in the code examples.
It took a while to plod through the measurements section because the graphs are too dense: 1000 points in a 6cm x 10cm box is just impossible to read. It is unclear whether the outlier points are significant or not. A better approach would be to plot the average time per iteration. Also, the y axis should start at 0.
===========================DO=NOT=DELETE=THIS=LINE==================
Comments on this paper for the Program Committee,
TO BE WITHHELD FROM THE AUTHOR(S):

Overview
========

The authors are interested in improving the cache performance of codes that roughly have the structure

    do i
       do j independent
          I(j) = f(A(I(j)))
       enddo
    enddo

with particle-in-cell (PIC) codes as an exemplar. The performance of such codes will decline over time because the locality of the reference pattern (I) will slowly degrade (as particles move from cell to cell).

Their approach is to consider the arrays I and A as partitioned over "All-Cache Virtual Machines" (ACVMs), where the total memory of each ACVM is the cache size of the processor. Periodically, or at some threshold of off-ACVM accesses, they move elements of I (and the associated computation) to the ACVMs that contain the needed elements of A. Cache locality is improved by executing each ACVM's portion of the computation sequentially. The technique is similar to how CHAOS manages irregular computations on multicomputers; the innovation here is to apply it to improving cache locality.

The authors suggest extensions to HPF to support their technique, hand-compile a PIC program that uses these extensions, and present measurements that show that their technique gives 10-50% improvements in execution time on the SP2, Cray, and SP2 over a cache-unaware implementation. Their measurements also show that their approach is faster than frequently sorting the I array.

Comments
========

I would mildly accept the paper.
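To make my reading of the technique concrete, here is a minimal Python sketch of the scheme summarized in the Overview. It is my own illustration, not the authors' code or their HPF extensions. The function names (owner, off_acvm_fraction, migrate, sweep, run), the 25% migration threshold, and the assumption that each ACVM owns one contiguous, equal-sized slice of A are all hypothetical choices made for the example. Python lists also say nothing about real cache behavior; the sketch only shows the bookkeeping.

```python
def owner(ref, block_size):
    """Index of the ACVM whose slice of A contains element `ref` (assumed layout)."""
    return ref // block_size


def off_acvm_fraction(acvms, block_size):
    """Fraction of references held by one ACVM that point into another ACVM's slice of A."""
    total = sum(len(refs) for refs in acvms)
    if total == 0:
        return 0.0
    off = sum(1 for k, refs in enumerate(acvms)
              for r in refs if owner(r, block_size) != k)
    return off / total


def migrate(acvms, block_size):
    """Move every element of I to the ACVM that owns the element of A it needs."""
    moved = [[] for _ in acvms]
    for refs in acvms:
        for r in refs:
            moved[owner(r, block_size)].append(r)
    return moved


def sweep(acvms, A, f):
    """One outer iteration, I(j) = f(A(I(j))), executed one ACVM after another."""
    return [[f(A[r]) for r in refs] for refs in acvms]


def run(acvms, A, f, block_size, iterations, threshold=0.25):
    """Iterate, migrating whenever off-ACVM accesses exceed the (assumed) threshold."""
    for _ in range(iterations):
        if off_acvm_fraction(acvms, block_size) > threshold:
            acvms = migrate(acvms, block_size)
        acvms = sweep(acvms, A, f)
    return acvms
```

Here f must return a valid index into A, and len(A) must equal block_size times the number of ACVMs. The alternative the authors compare against, frequently sorting I, would correspond to calling migrate on every iteration regardless of the threshold.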
The overall approach of extending support for parallel irregular codes down to a uniprocessor as a technique for improving cache locality is very nice and surprisingly intuitive. What is missing is a more thorough comparison to other techniques for improving cache locality and measurements of how important this is in the context of a multiprocessor or multicomputer. I really would like to know whether interprocessor communication costs swamp improvements in cache performance.
---------------------------DO-NOT-DELETE-THIS-LINE--------------------------
Referee's name: Peter A. Dinda
Referee's affiliation: Carnegie Mellon University
+++++++++++++++++++++++++++DO+NOT+DELETE+THIS+LINE+++++++++++++++++++++++++