%% You can use this file to create your answer for Exam 2
%% Fill in the places labeled by comments.
%% Generate a PDF document by with the command 'pdflatex exam2-answer'
\documentclass[11pt]{article}
\usepackage{times}
\usepackage{listings}
\usepackage{enumerate}
\usepackage{courier}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{graphicx}
\usepackage{pdflscape}
\usepackage{amsmath}
%% Page layout
\oddsidemargin 0pt
\evensidemargin 0pt
\textheight 600pt
\textwidth 469pt
\newcommand{\answerwidth}{400pt}
\setlength{\parindent}{0em}
\setlength{\parskip}{1ex}
% Insert page breaks in answer and solution
\newcommand{\checkpage}{\newpage}
%% Customization to listing
\lstset{basicstyle=\ttfamily\small,language=C,morekeywords={size_t,uint_64t,pragma,omp,parallel}}
%% Enumerate environment with alphabetic labels
\newenvironment{choice}{\begin{enumerate}%
\def\labelenumi{\Alph{enumi}.}}{\end{enumerate}}
\newenvironment{subchoice}{\begin{enumerate}[(1)]}{\end{enumerate}}
%% Environment for supplying answers to problem
\newenvironment{answer}[1]{\begin{minipage}[t][#1]{\answerwidth}\textit{Answer}:}{\end{minipage}}
\begin{document}
\begin{flushright}
{\large\bf Full Name: \makebox[2in][l]{
%% Put your name on the next line
}} \\[1ex]
{\large\bf Andrew Id: \makebox[2in][l]{\tt
%% Put your Andrew ID on the next line
}} \\[1ex]
\end{flushright}
\vspace*{0.3in}
\begin{center}
\LARGE
15-418/618 \\
Exam 2
\\ Answer Sheet %
\end{center}
\section*{Problem 1: Multiple Choice (15 points)}
Problems that state ``list all that apply'' may have any number of
correct answers (including zero). If none are correct, list ``none.''
The other problems have unique answers.
\begin{choice}
\item
Which of the following memory consistency policies allow the following behavior? Assume x and y
start with value zero. (List all that apply.)
\begin{verbatim}
Thread 0 Thread 1
-------- --------
x <- 1 y <- 1
print(y) print(x)
Output:
0
0
\end{verbatim}
\begin{choice}
\item Non-coherent shared memory.
\item Coherent shared memory.
\item Total-store-order (TSO) consistency.
\item Sequential consistency.
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item
What is the key optimization enabled by total-store-order (TSO)
consistency vs. sequential consistency (SC)? (List all that apply.)
\begin{choice}
\item MESI coherence.
\item Prefetching.
\item Atomic memory operations.
\item Store buffers.
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item
Which of the following statements are true about implementations of MPI (list all that apply)
\begin{choice}
\item MPI assumes the program is running on a machine with a cache-coherent shared address space.
\item The MPI library allocates the needed space for all buffers.
\item Synchronous communication creates a synchronization barrier for the sender and receiver.
\item A sender can modify the send buffer as soon as the call to \verb@MPI_Send@ returns.
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item
Which of the following statements are true about implementations of OpenMP (list all that apply)
\begin{choice}
\item OpenMP assumes the program is running on a machine with a cache-coherent shared address space.
\item OpenMP can only exploit parallelism by splitting up the index ranges in {\tt for} loops.
\item OpenMP implements all atomic operations via locks.
\item OpenMP reduction assumes the reduction operation is associative.
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item
Assume that in a computer system, all operations continually abort and
retry. Which one of the following could describe this scenario?
\begin{choice}
\item Deadlock.
\item Livelock.
\item Starvation.
\item None of the above.
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item To implement a lock-free single reader, single writer
unbounded queue, how many additional pointers does the program need
beyond those needed for a sequential
implementation?
\begin{choice}
\item 0
\item 1
\item 2
\item 3
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item
Which of the following statements is/are true? List all that apply.
\begin{choice}
\item Test \& Test and Set locks guarantee fairness of locks, as opposed to Test and Set locks
\item It is better to spinlock than block a thread if scheduling overhead is larger than expected wait time
\item You can replace all uses of locks with atomic transaction regions
\item Pessimistic conflict detection detects conflicts earlier than optimistic detection
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item
Which of the following are domain-specific languages? List all that apply.
\begin{choice}
\item Ruby.
\item Liszt.
\item Halide.
\item C++.
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item
You are tasked with implementing a hardware transactional memory system in a new multicore processor with a local cache for each processor.
Which of the following pieces of information are necessary for you to proceed with your implementation? (List all that apply)
\begin{choice}
\item The cache coherence protocol
\item The clock rate of each core
\item The number of cores
\item The versioning policy
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item
When training neural networks on a cluster and using a parameter server, what can we always expect?
\begin{choice}
\item Parameters are always fresh
\item The neural network is running in some model-parallel manner
\item Training will always converge
\item Updates to the server happen asynchronously
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item
Which of the following statements is/are true? (list all that apply)
\begin{choice}
\item ASICs have longer development workflows than FPGAs
\item An ASIC circuit is more reconfigurable than a FPGA circuit
\item ASICs developed for a given task are faster than FPGAs at that task
\item ASICs developed for a given task use more power than FPGAs for the same task
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item
\begin{lstlisting}
lock: ts R3, memory[address]
bnz R3, lock
unlock : st memory[address], #0
\end{lstlisting}
Assume the above assembly code follows a RISC ISA, with \textbf{ts}
representing the test and set instruction. Assume that four threads
running on four different processors are trying to acquire the
lock. Which of the following is a/are plausible scenarios due to given
lock mechanism?
\begin{choice}
\item Starvation
\item Deadlock
\item Above lock doesn't ensure mutual exclusion
\item All of the above
\end{choice}
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\item Which of the following performance evaluation tools has the least overhead?
\begin{choice}
\item valgrind
\item gprof
\item top
\item pin
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\end{choice}
\item What is the worst case latency of an $n$-node mesh network?
\begin{choice}
\item $O(\sqrt{n})$
\item $O(\log n)$
\item $O(n)$
\item $O(n \, \log n)$
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\end{choice}
\item Which of the following interconnect networks is non-blocking?
\begin{choice}
\item Mesh
\item Fat Tree
\item Tree
\item Multi-stage Logarithmic
\begin{answer}{0.25in}
%% Provide answer here
\end{answer}
\end{choice}
\end{choice}
\newpage
\section*{Problem 2: Locks and memory consistency (18 points)}
\subsection*{2A: Lock properties (12 points)}
Provide brief answers to the following questions:
\begin{subchoice}
\item
The C keyword
\texttt{volatile}
indicates to the compiler that changes to a value may occur between
different accesses, even if it does not appear to be modified. Why
must the field \texttt{serving} in \verb@ticket_lock_t@ be declared as
volatile to guarantee correct compilation?
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\item Does this lock guarantee fairness (i.e., any request will
eventually be granted)? Explain.
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\item
Why can the incrementing of \texttt{next->serving} in function
\verb@ticket_lock_unlock@ be done with
ordinary addition rather than atomic fetch-and-add?
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\item
We saw in Exercise 4 that we could improve the performance of a lock
based on atomic compare-and-swap by inserting a random delay in the
busy-waiting loop, with the amount of the delay following an
exponential distribution.
In similar experiments for
a ticket lock, this scheme leads to worse performance.
Explain why that could be the case.
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\newpage
\item
Suppose the definition of type \verb@ticket_value_t@ were changed
from \texttt{unsigned} \texttt{long} to \texttt{unsigned} \texttt{char}.
Would this impose any additional limits on the
values of \texttt{niters} or \texttt{nthreads} in the main program? Explain.
\begin{answer}{2in}
%% Provide answer here
\end{answer}
\item
Suppose you run the main program on an 8-core
processor with \texttt{nthreads} set to 8. The machine uses a
bus-based cache coherence system following an
MSI
protocol. Assume at some point in the main program execution that
Thread 0 is in its critical section and all other threads are in
their busy-wait loops, attempting to acquire the lock, with Thread 1
having the smallest ticket value and Thread 7 having the largest.
For all of the caches, describe
which field(s) of the lock would be held in each cache and what their
state(s) would be.
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\newpage
\item
Describe all of the bus traffic and cache transitions that would occur due to the
following sequence.
Describe only the traffic relevant to the lock. Assume there are no conflict misses.
For each of the actions listed, describe any resulting actions that would occur by other threads or caches until a steady state is reached.
\begin{enumerate}
\item Thread 0 exits its critical section and releases the lock, and Thread 1 then acquires the lock and enters its critical section.
\begin{answer}{1.5in}
%% Provide answer here
\end{answer}
\item Thread 0 attempts to acquire the lock and re-enters the busy-wait loop.
\begin{answer}{1.5in}
%% Provide answer here
\end{answer}
\end{enumerate}
\end{subchoice}
\newpage
\subsection*{2B: Memory consistency (6 points)}
For each of the six
possible fence locations either: 1) justify why it is required by describing
a scenario in which incorrect behavior could arise without it, or 2) justify why
it is not required.
\begin{description}
\item[Possible fence \#1] $\;$ \\
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\item[Possible fence \#2] $\;$ \\
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\end{description}
\begin{description}
\item[Possible fence \#3] $\;$ \\
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\item[Possible fence \#4] $\;$ \\
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\newpage
\item[Possible fence \#5] $\;$ \\
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\item[Possible fence \#6] $\;$ \\
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\end{description}
\newpage
\section*{Problem 3: Managing Concurrency (10 points)}
Provide brief answers to the following questions:
\begin{choice}
\item Your friend writes three unit tests that each execute
a pair of operations concurrently on the list shown above.
\begin{itemize}
\item Test 1: \texttt{insert\_from\_head(1)}, \texttt{delete\_from\_head(9)}
\item Test 2: \texttt{insert\_from\_head(8)}, \texttt{delete\_from\_head(2)}
\item Test 3: \texttt{insert\_from\_head(8)}, \texttt{insert\_from\_tail(1)}
\end{itemize}
The first two unit tests complete without error, but the third test
goes badly and it does not terminate with the right answer.
Describe what behavior is observed and why the problem occurs. (All
unit tests start with the list in the state shown above.)
\begin{answer}{1.5in}
%% Provide answer here
\end{answer}
\item Imagine you removed all locks and implemented
\texttt{insert\_from\_head}, \texttt{delete\_from\_head}, and
\texttt{insert\_from\_tail} by placing the entire body of these functions
in an atomic block for execution on a system \textbf{supporting
lazy optimistic transactional memory with aggressive conflict resolution}.
Does this fix the correctness
problem described in part A? Why or why not?
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\newpage
\item Consider two transactions simultaneously performing
\texttt{insert\_from\_head(3)} and \texttt{delete\_from\_head(9)}. Assume both
transactions start at the same time on different cores and the
transaction for \texttt{insert\_from\_head(3)} proceeds to commit while
the \texttt{delete\_from\_head(9)} transaction has just iterated to the
node with value 6. Must either of the two transactions abort in
this situation? Why? \textbf{(Remember this is an lazy optimistic
transactional memory system!)}
\begin{answer}{1.5in}
%% Provide answer here
\end{answer}
\item Must either transaction abort if the transaction for
\texttt{delete\_from\_head(9)} proceeds to commit before the transaction
for \texttt{insert\_from\_head(3)} does? Why? \textbf{Please assume that
at the time of the attempted commit, \texttt{insert\_from\_head(3)} has
iterated to node 2, but has not begun to modify the list.}
\begin{answer}{1.5in}
%% Provide answer here
\end{answer}
\item Must either transaction abort if the situation in part
D is changed so that \texttt{delete\_from\_head(9)} attempts to commit
first, but by this time \texttt{insert\_from\_head(3)} has made updates to
the list (although not yet initiated its commit)? Why?
\begin{answer}{1.5in}
%% Provide answer here
\end{answer}
\end{choice}
\newpage
\section*{Problem 4: Homogeneous vs.\ Heterogeneous Parallelism (20 points)}
This problem concerns the costs and benefits of heterogeneous parallelism.
We will begin by considering multicore CPU processors,
comparing the performance of symmetric (homogeneous) and asymmetric (heterogeneous) designs.
Imagine you are a hardware architect designing a new multicore CPU.
Suppose that the CPU can take at most $N$ resources
(e.g., resources could be area of at most 200 mm$^2$ or power of at most 100 W).
You can design CPU cores with the following characteristics:
\begin{itemize}
\item \textbf{Out-of-order (``OOO''):}
The OOO core achieves sequential speedup (vs.\ some baseline core) as a function of resources $r$
\begin{equation}
\text{speedup}_O(r) = \sqrt{r+1} - 1
\end{equation}
\item \textbf{In-order (``IO''):}
The IO core achieves sequential speedup (vs.\ some baseline core) as a function of resources $r$:
\begin{equation}
\text{speedup}_I(r) = 1 - \frac{1}{2r + 1}
\end{equation}
\end{itemize}
For this problem, all speedups are normalized to a core with sequential performance $= 1$.
Feel free to give approximate numerical answers (e.g., within 0.1 of optimal),
and we recommend that you write a simple program or use spreadsheet software to get numerical answers.
\paragraph{4A: Homogeneous multicore (7 points):}
Suppose an application spends a fraction $f$ of its execution running in parallel with ideal speedup,
and $1-f$ executing sequentially.
(1) What is the speedup {\bf in the sequential region} of the program for a single core of size $r$?
(Give an expression involving $N$, $f$, and $r$.)
OOO Core.
\begin{answer}{1in}
%% Provide answer here
\end{answer}
IO Core.
\begin{answer}{1in}
%% Provide answer here
\end{answer}
\checkpage
(2) What is the speedup {\bf in the parallel region} of the program for a homogeneous multicore?
(Give an expression involving $N$, $f$, and $r$.)
OOO Core.
\begin{answer}{2in}
%% Provide answer here
\end{answer}
IO Core.
\begin{answer}{1in}
%% Provide answer here
\end{answer}
(3) What is the {\bf end-to-end speedup} for a homogeneous multicore?
(Give an expression involving $N$, $f$, and $r$.)
OOO Core.
\begin{answer}{3in}
%% Provide answer here
\end{answer}
\checkpage
IO Core.
\begin{answer}{2in}
%% Provide answer here
\end{answer}
(4) Suppose that $f = 0.9$ and $N = 16$.
What speedup is achieved by the {\bf best} homogeneous OOO and IO designs?
Which performs better? How many cores does the best design have?
\begin{answer}{1.25in}
%% Provide answer here
\end{answer}
(5) Now suppose that $f = 0.8$ and $N = 16$.
What speedup is achieved by the {\bf best} homogeneous OOO and IO designs?
Which performs better? How many cores does the best design have?
\begin{answer}{1.25in}
%% Provide answer here
\end{answer}
(6) Compare your last two answers for $f = 0.9$ to $f = 0.8$.
What do they say about how applications affect the best architecture?
About the resilience of different architectural choices to different applications?
\begin{answer}{1.25in}
%% Provide answer here
\end{answer}
\checkpage
\paragraph{4B: Heterogeneous multicore (8 points):}
Now suppose that you are able to combine different kinds of cores in your multicore design.
Let's consider the performance of a heterogeneous multicore CPU
using a mix of OOO and IO cores, written as a function of $f$, $N$, $r_O$ (out-of-order core size)
and $r_I$ (in-order core size).
\emph{Note: The optimal heterogeneous design will consist of a single OOO core plus many IO cores.
The sequential region runs on the OOO core.}
(1) What is the speedup {\bf in the sequential region} of the program?
(Give an expression involving $N$, $f$, $r_O$, and $r_I$.)
\begin{answer}{1.25in}
%% Provide answer here
\end{answer}
(2) What is the speedup {\bf in the parallel region} of the program for a heterogeneous multicore?
(Give an expression involving $N$, $f$, $r_O$, and $r_I$.)
\begin{answer}{1.25in}
%% Provide answer here
\end{answer}
(3) What is the {\bf end-to-end speedup} for a heterogeneous multicore?
(Give an expression involving $N$, $f$, $r_O$, and $r_I$.)
\begin{answer}{2.5in}
%% Provide answer here
\end{answer}
\checkpage
(4) Suppose that $f = 0.9$ and $N = 16$ and, for now, fix in-order cores at size $r_I = 1$.
What speedup does the {\bf best} heterogeneous design achieve?
\begin{answer}{1.25in}
%% Provide answer here
\end{answer}
(5) What is the {\bf best} overall speedup when the {\bf in-order core size $r_I$ varies} from 0.1, 0.5, 1.0, 2.0, to 4.0?
\begin{answer}{1.25in}
%% Provide answer here
\end{answer}
(6) Qualitatively, what does this say about the optimal heterogeneous architecture?
(Does it remind you of any designs you've used this semester?)
\begin{answer}{2.0in}
%% Provide answer here
\end{answer}
\checkpage
\paragraph{4C: Scaling (5 points):}
Finally, let's use the model you've already developed
to see how things change with different resource budgets.
For the rest of this problem, fix $f = 0.9$ and $r_I = 0.1$ in the heterogeneous design.
(1) What are the best {\bf heterogeneous} and {\bf homogeneous} designs when $N = 4$?
\begin{answer}{1.25in}
%% Provide answer here
\end{answer}
(2) What are the best {\bf heterogeneous} and {\bf homogeneous} designs when $N = 64$?
\begin{answer}{1.25in}
%% Provide answer here
\end{answer}
(3) What does this say about the benefits of heterogeneity as resource budgets increase over time?
\begin{answer}{1.25in}
%% Provide answer here
\end{answer}
\newpage
\section*{Problem 5: Memory Performance during Convolution Computation (18 points)}
\subsection*{5A: Sequential performance (8 points)}
\begin{choice}
\item Consider the following code:
\begin{lstlisting}
#define LEN 5932
float data[LEN];
float sum = 0.0;
for (int i = 0; i < LEN; i++)
sum += data[i];
\end{lstlisting}
Calculate the CPIS for this code assuming array \texttt{data} is
initially in the following memory levels. Remember: report these
numbers to the first decimal place.
\begin{description}
\item[L1 cache.]
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item[L2 cache.]
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item[Main memory.]
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\end{description}
\item Determine the following regarding function \texttt{convolve}:
\begin{subchoice}
\item The total number of input data elements accessed during the computation,
and the highest memory level in which this would fit (where L1 $>$ L2 $>$ Memory.)
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item The number of input data elements accessed for one value of outer loop index \texttt{i},
and the highest memory level in which this would fit.
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item The number of times each input data element is accessed for one value of outer loop index \texttt{i}.
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\end{subchoice}
\newpage
\item Determine the following regarding function \texttt{convolve}:
\begin{subchoice}
\item The total number of weight values accessed during the computation,
and the highest memory level in which this would fit.
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item The number of weight values accessed for one value of outer loop index \texttt{i},
and the highest memory level in which this would fit.
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item The number of times each weight value is accessed for one value of outer loop index \texttt{i}.
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\end{subchoice}
\item Determine the single-core performance of function \texttt{convolve}:
\begin{subchoice}
\item
Describe the access pattern for the
input data within the innermost two loops, in terms of which levels
of memory the data would reside in.
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item
What would be the resulting average number of cycles per load operation for the input data?
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item
Describe the access pattern for the
weights within the innermost two loops, in terms of which levels
of memory the weights would reside in.
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item
What would be the resulting average number of cycles per load operation for the weights?
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item
What is the CPIS for the function?
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\end{subchoice}
\end{choice}
\newpage
\subsection*{5B: Parallel performance (10 points)}
\begin{choice}
\item
Determine the multicore performance of function \texttt{convolve\_par} when running with $T$ threads for $T \leq 8$.
\begin{subchoice}
\item
For a given thread, describe the access pattern for the
input data within the innermost two loops, in terms of which levels
of memory the data would reside in.
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item
For a given thread, what would be the resulting average number of cycles per load operation for the input data?
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item
For all threads, describe the access pattern for the
weights within the innermost two loops, in terms of which levels
of memory the weights would reside in.
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item
For a given thread, what would be the resulting average number of cycles per load operation for the weights?
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item
What is the CPIS per thread for the function?
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\item
What is the speedup, relative to sequential execution, as a function of $T$?
\begin{answer}{0.5in}
%% Provide answer here
\end{answer}
\end{subchoice}
\newpage
\item
Describe how you could restructure the order in which elements are
accessed within the inner loop of the code
to give the program
superlinear speedup. You can assume that $T \in \{1, 2, 4, 8\}$. You may wish to make use of the values
\texttt{thread\_count} and \texttt{thread\_id} that are computed within the parallel block. You should write code for the restructured loop.
\begin{answer}{3.0in}
%% Provide answer here
\end{answer}
\item
What would be the resulting CPIS per thread for $T \in \{2, 4, 8\}$? Explain.
\begin{answer}{1.5in}
%% Provide answer here
\end{answer}
\item
What would be the speedup, relative to the sequential code for $T \in \{2, 4, 8\}$?
\begin{answer}{1.5in}
%% Provide answer here
\end{answer}
\end{choice}
\newpage
\section*{Pledge}
On the answer sheet, you must (electronically) sign the following pledge:
\begin{quotation}
I do hereby swear that, in generating my answers to this exam, I made
use of only the resources explicitly listed in the exam document. I
did not have any communication of any form with anyone other than the
course instructors and teaching assistants about the contents of
this exam. \\[1ex]
Signed
%% Put your full name here
\end{quotation}
\end{document}