Automatically Identifying Opportunities for Using Special Purpose Instructions

Edward Hogan

Glenn Judd

Shafeeq Sinnamohideen

1 Motivation

Over the last few years, there has been a trend in the microprocessor industry in which CPU vendors have been adding new specialized instructions to their processors. Many of these instructions have been added with the stated purpose of providing better performance for specialized applications. Examples of these instructions include Single-Instruction-Multiple-Data (SIMD) for use in multimedia applications or fused instructions (e.g. Multiply-Add) for use in numerical applications. Additionally, reconfigurable processors[6], which can be configured for arbitrary operations, take this concept to the extreme - a program can configure the processor for the instructions it expects will provide the best performance.

The benefits of specialized instructions bring with them a set of challenges. First, it can be difficult for a compiler to effectively determine optimal use of these instructions. Furthermore, it is potentially hard for a developer, coding in assembly language, to determine when it may be useful to use a new specialized instruction. This may be because the developer is not familiar with the available specialized instruction set, or may not understand which places in the code most heavily affect performance.

We have developed a tool that alleviates this difficulty by identifying potential places where a specialized instruction could be used. In addition to determining places where specialized instructions might be used, our approach helps determine how beneficial the use of the specialized instruction might be; this is done by profiling the program and identifying sections of code that are executed frequently (so called “hot spots”).

In short, we provide developers with suggestions of places in their code where they may be able to take advantage of specialized instructions. These suggestions target the program’s hot spots so that the optimizations can greatly improve performance.

2 Related Work

A tool that provides similar functionality already is the VTune Performance Analyzer[4] from Intel. VTune makes use of standard profiling techniques and hardware counters in Intel's Pentium and later processors to trace the execution of an x86 program. It also has "coaches" that can suggest locations where using Intel's Streaming SIMD Extensions[5] would increase program performance. It is, however, more restricted than we would like. First, it is confined to a single ISA, the x86, and to a fixed set of new instructions. This works well for its intended purpose, helping programmers optimize their code for the Pentium III, but falls short of supporting the arbitrary new instructions that a reconfigurable processor could execute. It is likely that VTune's analysis algorithm is capable of doing this, considering that it must be flexible to deal with the complexity of the x86 ISA, but that functionality is not available to the user.

Compilers also perform similar optimization processes on the intermediate representation of a program during compilation. Semantic-preserving transformations are used to transform one sequence of operations into another, possibly shorter, sequence. Strength reduction, which replaces complex operations with a sequence of simpler ones, is the opposite of what we want to do. This strategy makes sense, since complex operations frequently take longer to execute than simpler ones, but in our case, we expect that the new instruction is faster than a sequence of instructions, otherwise the instruction would not have been added to the ISA. The major difference between our post-compilation analyzer and an optimizer within the compiler is that while the compiler is constrained to generating code that is correct under all circumstances, we merely try to detect that a transformation is possible. The task of determining the safety of the transformation is left to the programmer or an automatic optimizer. Also, we may have an execution trace available that allows us to identify the common execution path through loops and optimize for it, whereas the compiler cannot know which path through a loop is the common case.

3 Methodology

Our system consists of two self-contained components: an optimization analysis tool and a hot spot trace generator. The optimization analysis tool analyzes a sequence of assembly code and finds places where specialized instructions can possibly be applied to optimize the code. Specialized instructions are defined using a language that describes patterns of assembly language instructions that can be replaced with a specialized instruction. We call this language the Instruction Optimization Description Language (IODL), and the patterns described by it we call IODL Patterns. Using IODL Patterns, arbitrary instructions can easily be defined which can then be tested against sequences of assembly code.

Our optimization analysis tool is capable of analyzing arbitrary sequences of assembly code; however, it is desirable to optimize the sections of code that most affect a program’s performance. The hot spot trace generator facilitates this by analyzing a running application, and identifying sections of the program that are executed frequently. It then generates a trace of instructions of the most frequently executed sections of the program.

3.1 IODL Pattern Definition

In order for our system to search and find potential places for multimedia instruction replacement, it is necessary for the user to be able to describe to the system the patterns of instructions to look for. This is done by the creation of instruction optimization description language (IODL) patterns. These patterns are human readable, and have an HTML-like syntax. Currently there are 3 tag pairings used in IODL, they are:

<INSTRUCTION> - Used to mark the beginning and end of the instruction patterns.

<TARGET> - Used to provide a user-friendly description of the pattern.

<IODL> - Used to mark the beginning and end of a single pattern.

The descriptions of the individual assembly instructions to search for are described by providing the name of the instruction, followed by alphanumeric strings that serve as a virtual register name for the specific registers used. The names of the register can also be the string “*”, this is a wildcard symbol that instructs the pattern matching facility to match this symbol with any register. The IODL descriptions use the Alpha instruction format.

Below is the IODL pattern describing a sample parallel add instruction. There are two descriptions of this instruction, a vector + vector version, and a vector + scalar version.

<TARGET> A Parallel Add Instruction </TARGET>

<IODL>

addq a b c

addq d e f

addq g h i

addq j k l

</IODL>

<IODL>

addq a b c

addq d b e

addq f b g

addq h b i

</IODL>

</INSTRUCTION>

3.2 IODL Dependency Graphs

When an IODL Pattern is read in, a dependency graph is generated for the pattern. For each statement in the IODL Pattern, a node is created in the graph. A node’s parents indicate nodes that have instructions on which the child nodes depend. Since Alpha instructions have at most two input registers, each Alpha instruction can have a direct data dependency on at most two other instructions. As a result, each node in the graph can have at most two parents. In addition, as the number of instructions that can depend on the output of an instruction is unbounded, the number of children that a node may have is likewise unbounded. In reality, however, the number of children that a node may have is limited by the fact that registers are reused.

IODL Dependency Graphs are not required to be contiguous. SIMD instructions are a common example of instructions having disjoint IODL Dependency Graphs.

3.3 Trace Dependency Graphs

Trace dependency graphs are dependency graphs generated from sequences of assembly code. They are created in the exact same fashion as IODL Dependency graphs. Note that trace dependency graphs can be generated from any sequence of assembly code whether it is from an actual trace, or comes from assembly source code. Also note our system does not treat control instructions in any special fashion.

3.4 Building Dependency Graphs

IODL Dependency Graphs and Trace Dependency Graphs are built in the exact same manner using a fairly straightforward algorithm:

· For each graph linear a list of instructions that defines the graph is maintained. This list is ordered according to the temporal ordering of the instructions. A linear list of “roots” is also maintained. Roots in a dependency graph are nodes which do not depend on any other nodes.

· When an instruction is added to the graph, a new (empty) dependency graph node is created. Next the list is traversed from the first instruction to the last instruction in temporal order. At each node in the list, if the new node depends on the node in the list, the node in the list is set to be a parent of the current node. After the entire list has been traversed the node’s parents will have been determined. For each parent, add the new node as a child. If the node has no parents, it is added to the list of roots.

· After all instructions in a pattern have been added to the graph, the graph is examined in order to determine which parts of the graph are disjoint. For each disjoint segment of the graph, a “primary root” is determined. The primary root is the first root (in temporal order) of the graph segment. Determining primary roots begins by traversing the list of roots. For each root, all reachable nodes from the root are “marked”. Next, the list of roots is traversed until an unmarked root is round. If such a root is found, it is added to the list of primary roots. This new primary root’s graph segment is then marked and the process continues until all roots have been marked and all primary roots have been determined.

As an example consider the following specialized instruction that computes the sum of two multiply-add instruction sequences:

Sum of two multiply-accumulations

</target>

<iodl>

mult a, b, c

add c, d, e

mult f, g, h

add h, i, j

add e, j, k

</iodl>

</instruction>

The following figure shows the IODL dependency graph generated for this instruction. The numbers on the upper left-hand corner of each graph node show the order in which nodes are added to the graph.

Figure 1: Sample IODL Dependency Graph

3.5 IODL Pattern Matcher

The IODL Pattern Matcher component is responsible for matching the specialized instructions defined in the IODL Patterns to the actual hot spot trace. Primarily, there are two varieties of instructions that the pattern matcher recognizes. First, it can match patterns for which the dependency trees for the operation has multiple root nodes that are independent This is the typical pattern found in Single-Instruction-Multiple-Data (SIMD) operations and is depicted in Figure 2. Second, it can also match patterns for instructions in which the dependency tree of the operation contains multiple paths that eventually merge into a single node. This pattern is typical of most fused instructions and is depicted in Figure 3.

Although, these are two predominant patterns found in most applications we have considered, the pattern-matching algorithm is capable of matching any arbitrary dependency graph in which each node may have multiple children and parent nodes. The pattern-matching algorithm will be described below.

3.6 Pattern-Matching Algorithm

The algorithm that matches the IODL descriptions, which are provided by the user, to the hot spot traces, which are retrieved by the Atom tool, works as follows.

In general, the algorithm operates in a depth-first manner, matching the entire descendant tree of a root node in the IODL description. Once the entire tree has been matched successfully for the root, the next root in the IODL description is chosen and the matching process begins again. A complete match can only be declared when the descendant trees of each root node are matched completely with the dependency tree found in the trace.

A match of a trace node and an IODL node is declared when the following conditions are true. First, the nodes’ operations must be identical. Second, the registers referred to by nodes must also be identical. Since the IODL description uses user chosen names for the registers, a match occurs when the chosen name of the register can be bound to the actual machine register. This register binding is described later. The last requirement to declare that two nodes match is that each child nodes of the IODL node must match a child node in the trace.

Although the pattern-matching algorithm matches the IODL graph to the trace graph in a depth-first manner, to make sure that all possible matches are found, this algorithm needs to be substantially more complex. The process of selecting the order that the IODL graph roots or node children are matched to the trace graph must be done by using every permutation of the roots and children that is possible. This must be done since the order that a match is made may influence if the match succeeds. This occurs because more than one IODL node may match to the same node in the trace graph. In short, this increases the complexity of the algorithm to n!, for n children of a node. Although, this number may appear intimidating on paper, in practice there are several short-cuts to make sure that not all of the n! permutations are tried.

Figure 4: Example IODL Graph (left) and Trace Graph (right)

Figure 4 shows an example of how one match can occur. The matching begins by selecting the first root node in the IODL graph (labeled root A), and matching it to a corresponding node in the trace. This search, matching the root node to the trace nodes is done using a breadth-first search. Once the root node has been matched and the bindings updated, a depth-first search and match begins at this node. The matching process matches the children marked 2, 3, and 4 in Figure 4 above. At this point, the algorithm has successfully matched an entire subtree of one of the IODL graphs two roots. The algorithm then selects the next root node (labeled root B) and search through the trace in a breadth-first manner until a match with this node is found. Again, when this root node is matched with the trace and marked with the number 5, its children are then matched in a depth-first manner, marking the node labeled 6. Finally, when we examine the deepest child in the root B subtree, we realize that this node actually resides as both a child of root A and root B nodes. Extra checks must be made at this node to guarantee that the parent nodes, marked 2 and 6, can both be found in this node’s list of parents. In this manner, the entire IODL graph can be pattern-matched against the entire trace.

3.7 Register Binding

Register binding is the process through which register arguments to an instruction can be determined to be matching. Consider the example described below in Table 1.

User-defined IODL description	Actual Trace Instructions
addq a b c	addq t3, t6, t7
addq d * e	addq t1, a1, t2
mulq c e *	mulq t7, t2, t7

Table 1: Sample Register Bindings

Initially, the register-binding table is empty. Register binding is done when the pattern matcher notices that it can create a mapping from the actual machine registers to the user-created virtual register names. In the first instruction, the pattern-matcher would map the string name ‘a’ to the register t3, the string name ‘b’ to the register t6, and the string name ‘c’ to the register t7. In the second instruction, the pattern-matcher would map the string name ‘d’ to the register t1 and map the string name ‘e’ to the register t2. Since the instruction uses a wildcard, the match to the a1 register is not stored in the registry-binding table. Finally, in the third instruction, the values of the string name ‘c’ would be looked up in the binding table and it would successfully match with the machine register t7, the string name ‘e’ would also successfully match with the machine register t2. Because the destination register of the third IODL instruction is a wildcard, it does not matter that this machine register is also bound to the string name ‘c’.

3.8 Hot Spot Trace Generation

Focusing on the hot spots in the execution of a program allows the pattern matcher to concentrate on the sections that will offer the greatest improvement. While the pattern matcher can handle the an entire program, only analyzing a relatively small portion at a time is useful because it keeps the number of possible combinations of instructions, which directly affects the search time, manageable. A profiling tool can be used to determine which sections of the code are executed most frequently, and thus would be most beneficial to examine. Additionally, some of the complex instructions we want to optimize for may replace multiple iterations of a loop with a single instruction. In order to determine if the loop is executed at least enough times to be able use such an instruction, we need a trace of the order in which basic block are executed. Finally, knowing the execution frequency of basic blocks and the commonly taken path will allow the programmer, or other optimizer, to choose the most optimal substitution from the set of possible optimizations.

The Hot Spot Trace Generator collects all the necessary data necessary to do this and provides it to the pattern matcher. It consists of two halves - trace collection and hotspot detection. The trace collection half uses Atom [3] to instrument an Alpha binary so that it is functionally identical to the original binary except that it also writes a file indicating which basic blocks are executed. Once the trace file has been generated it is given as input to the hotspot extractor. The hot spot extractor is a Perl script that examines the trace and the assembly source file and produces a trace for each hotspot found. The user can select the number of hotspots desired, the length of the trace desired, and the minimum iterations through that hot spot the trace must contain. An example of its use is:

hotspot gzip.dis gzip.trace 20 1000 10

The script counts the execution frequencies of each basic block and finds the most common one. It then steps through the trace beginning at the first occurrence of that block, printing out instructions until either there are enough instructions, or the block has been executed enough times. The process is then repeated for the next most frequent block until enough hotspots have been output. If two (or more) blocks repeat in series, as would happen for blocks that are part of the same loop, the other block's instructions will be output in trace order as well. The tool remembers this, however, and will skip the hotspot starting at the other block, since that hotspot will be the same as the first one, with a shifted starting point. The optimization of logging only the starting addresses of basic blocks in the trace reduces the size of the trace file significantly, but requires that the script read the other instructions in the block from the assembly file, which is a small cost.

4 Results

In order to test the ability of the IODL tool to adequately find places in code that can be replaced with multimedia or complex instructions, several existing Alpha binaries were analyzed with the Atom tool to find hot spots in the code that could be optimized. Due to disk space and processing time limitations, the traces were limited to 25MB in length, which corresponds to 2.25 million basic blocks. Each hot spot was limited to 1000 instructions per length and the top 20 hotspots from each application were produced. These hot spots in the code were then analyzed by the IODL Pattern-Matcher to determine if any complex instructions could be used to possibly improve the performance of the application. Because the hot spots are the most frequently run blocks of code in the program any improvements to this code can dramatically improve the execution of the program.

Four programs were analyzed using Atom and the IODL Pattern-Matcher, these programs were applu, compress, mpg123, and mpeg_play. These programs can be described briefly as:

· applu	A floating point partial differential equation program.
· compress	A Spec95 benchmark that uses the Lempel-Ziv compression algorithm.
· mpg123	A streaming MPEG audio player.
· mpeg_play	A streaming MPEG video player.

Each application’s hot spot trace was analyzed for several classes of complex instructions. Primarily, the trace were searched for possible Single-Instruction Multiple-Data (SIMD) instructions and fused instructions. SIMD instructions allow multiple instructions of the same class to operate simultaneously over separate data registers. Fused instructions allow several different operations to take place on the same set of data registers simultaneously. The third and final set of instructions that was considered in the pattern-matching was application-specific instructions. For this set of instructions, we examined places in the application’s assembly code that we believed were part of the application’s core functionality. We then made IODL patterns to recognize these potential sources of improvement and searched for these patterns in the code. This technique is useful in reconfigurable computing environments, where the optimization information could be fed back into the system to create a new optimized instruction for a specific optimization.

The classes of instructions searched are described in Table 2, and are enumerated in detail in Appendix 1. The data types that these instructions operate on are described in Table 3.

Longword Fused	Quadword Fused
Longword SIMD Scalar	Quadword SIMD Scalar
Longword SIMD Vector	Quadword SIMD Vector
T_Float Fused	S_Float Fused
T_Float SIMD Scalar	S_Float SIMD Scalar
T_Float SIMD Vector	S_Float SIMD Vector
Logical SIMD Scalar	Logical SIMD Vector
Application-Specific Instructions

Table 2: Categories of Searched Instructions

C declaration	Alpha Data Type	Size (Bytes)	C declaration	Alpha Data Type	Size (Bytes)
char	Byte	1	long int	Quadword	8
short	Word	2	long unsigned	Quadword	8
int	Longword	4	char *	Quadword	8
unsigned	Longword	4	float	S_Float	4
			double	T_Float	8

Table 3: Alpha Data Types

4.1 Instruction Match Count

The first results of interest are the Instruction Match Counts, these results measure the number of matches on a particular complex instruction that are found in the hot spots of program. These numbers can be quite high because the basic blocks that contain these possible optimizations can be repeated if they may exist inside a code loop. The pattern matcher creates an instruction match count for each of the many hot spots that it analyses for each application. The 5, 6, 7, and 8 depict a summary of the instruction match counts for all of the hot spot traces of the applu, compress, mpg123, and mpeg_play applications.

Figure 5, 6, 7, and 8: Instruction Match Counts for applu, compress, mpeg_play, and mpg123.

4.2 Per-Trace Line Match Histogram

The second set of results collected for each hotspot by the pattern-matcher was a histogram depicting the occurrence and frequency that a particular instruction line in the program trace could be replaced with an optimized instruction. In some cases, it is possible for multiple complex instructions to match to the same instruction, this can occur for example when an instruction could be part of a fused instruction or a SIMD instruction. In these cases, both suggestions are made to the user who must choose the final course of action. In addition, in many cases the same optimized instruction pattern matched multiple times with the same line in the program trace. This occurs because for many optimized instructions, especially SIMD operations, the pattern matcher will find each unique combination of the instructions that fits the pattern. For example, given a set of four instructions {a, b, c, d} where each could be a part of a SIMD operation where two operations are done in parallel, the pattern matcher would discover the matches {{a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}}. In this instance, it would report that each instruction line was matched three times and that six SIMD operations are possible. In practice, it would only be possible to reduce the original sequence into two SIMD operations. The designer of the program can, after analyzing the match output, modify their program based on their knowledge of the system, making an informed decision about which of the set of reported matches best fits into the program.

Histograms were collected for each of the many hotspots in the targeted applications. The figures 9, 10, 11, and 12 show histograms for a single hotspot of application that contained interesting results.

Figure 9, 10, 11, and 12: Per Line Match histograms for selected applu, compress, mpeg_play, and mpg123 hotspots.

5 Analysis

From the results received, there are several conclusions that we can make after analyzing the collected data. Table 4 below restates the instruction match count data that was shown in the Results section above. From this data, we can make several conclusions about the properties of the targeted applications and the IODL pattern matching system.

	applu	compress	mpeg_play	mpg123
compress: sub; add; cmov; s8add	-	2	-	-
L_integer SIMD scalar subtraction	-	-	-	1040
L_integer SIMD vector addition	-	-	7	-
Logical SIMD scalar and	-	-	300	-
Logical SIMD scalar sll	-	-	-	720
Logical SIMD vector bis	-	301	23	-
mpeg_play: extract high; shift right	-	-	32	14
mpg123: extracts; ors; shifts	-	-	-	18
mpg123: insert; mask; or	-	2	4	2
Q_integer SIMD vector addition	-	47	-
T_float fused multiplication addition	13	-	-	-
T_float fused multiplication subtraction	8	-	-	2
T_float fused subtraction multiplication	6	-	-	-
T_float SIMD scalar division	-	2	-	-
T_float SIMD scalar multiplication	481	-	-	-
T_float SIMD vector addition	272	-	-	-
T_float SIMD vector multiplication	264	-	-	2100
T_float SIMD vector subtraction	51	-	-	-

Table 4: Instruction Match Results Summary

First, we can see from the applu results that this program can be heavily optimized through the use of optimized versions of T_Float instructions. Since this application does intensive differential equation processing, this fits the profile of an optimizable program. It is interesting to note that both scalar and vector versions of optimizations can done, in addition fused instructions seem to possible as well as SIMD instructions.

Next, it appears that the compress program can be optimized through many integer operation optimizations. A SIMD bis instruction (logical or) and a SIMD vector addition are most common and could probably contribute to the program speed-up. In addition, it is interesting to notice that it is possible to construct an instruction specific to the compress program that could be used to speed up the main compression loop. This instruction is named “compress: sub; add; cmov; s8add” and is represented in Table 5 below:

Compress: sub; add; cmov; s8add

subq a, *, a

addq a, *, b

cmovlt a, b, a

s8addq a, *, c

mpeg_play: extract high; shift right

extqh *, *, a

sra a, *, a

mpg123: extracts, or, shifts

extbl a, *, a

extbl b, *, b

sll a, *, a

bis a, b, a

sll a, *, c

mpg123: insert; mask; or

inswl *, a, b

mskwl c, a, c

bis c, b, c

Table 5: Application Specific IODL descriptions

Finally, from the results of the pattern match searches on the mpeg_play video application and the mpg123 audio tool, it is important to notice that there are many opportunities to optimize these applications. The mpeg_play video player can take advantage of many SIMD logical and instructions, vector addition, and vector bis instructions. The mpg123 audio player takes advantage primarily of a separate set of optimizations, it can use SIMD shift logical left, SIMD multiplication, and SIMD subtraction. However, although these programs differ in some regards, they can both take advantage of several uniquely configured instructions. First, they both can use an ‘word insert, mask, and or’ that we originally found while choosing possible optimizations for the mpg123 application. In fact, this combination can also be use to improve the compress program. In addition, both the mpeg_play and the mpg123 programs can take advantage of an extract high, shift right combination that was originally found to optimize the mpeg_play code. In short, many unusual combinations of instructions can be used to optimize code even in unusual combinations. From the application specific optimizations that we designed, we were surprised that these combinations could then be used in other applications. This can be of particular interest to reconfigurable computing environments where an application may be able to select specific optimizations that improve its own execution speed.

6 Conclusions

As instructions specialized for particular classes of applications such as multimedia applications become more common, the importance of determining their potential benefits increases. As shown above, our system can effectively assist in determining the benefits of these specialized instructions by gathering traces of application hot spots, and determining potential uses of new instructions in these hot spots. Our system can also assist reconfigurable systems by determining the potential benefit of instructions created on the fly.

Further, our results have shown that while many applications have obvious uses for specialized instructions, many applications of specialized instructions are non-obvious. Hence, an automatic analysis tool is critical in order to effectively use many specialized instructions.

7 Future Work

Though our overall approach and implementation have been shown to be effective, several enhancements could be made to our system. This section briefly describes some of the enhancements that would be most beneficial.

While our IODL descriptions are not limited to the Alpha instruction set, they are limited to Alpha-like instruction sets. Namely, IODL descriptions can only represent instructions, which have at most two inputs, and at most a single output. A more powerful scheme could allow for arbitrary numbers of inputs and outputs. Further, our IODL statements require either a virtual register name, or a wildcard value. A more powerful scheme could allow for logical expressions stating which virtual registers are and are not acceptable. This would allow for more compact IODL descriptions since, in many instances, users would no longer be required to create multiple descriptions for a single instruction. In addition, allowing logical expressions would increase the accuracy of matches generated through the removal of false matches that are caused by the limited expressiveness of IODL descriptions.

Also, when a single trace is analyzed many suggested instruction matches are found that conflict with each other. For instance, a fused multiply-accumulate and a SIMD vector multiply may both share a common source register. As another example, consider that multiple matches may be found for a four-way SIMD vector multiply, but only a subset of these matches may be used since some of them have the same source registers. Our current system simply reports all possible matches. A more useful system would report tradeoffs among matches (e.g. which matches cannot coexist) as well as suggest the relative benefit available from each match.

The detection of hot spots can also be improved significantly. The current system makes the assumption that the first occurrence of a frequently executed block will be representative of its behavior throughout the program. While this makes sense for loops, it is not so true if the most frequent block is a procedure that is called in one pattern during startup, and another during the body of the program. Considering the period between occurrences of a given block when counting them might solve this problem. If a block occurs within a loop that has different behaviors at different times, would effectively be considered as two separate candidate hotspots. For example, block A in the trace ABABABAB would be considered distinct from A in ABCDABCD. The hot spot detector could then know that ABAB... is a better choice, since it contains more iterations of A. Faced with the worst case of ABBB...AAAA where there are more A's than B's, the current simple detector will output ABBB as the trace of hotspot A. Using, the minimum iteration parameter, however, will ensure that the trace will at least contain that number of occurrences of A. Another improvement would be searching for the most common sequences of blocks, which would go a step further and distinguish ABCABC from ADEADE. This way, the pattern matcher would be run on both cases, instead of just the one that occurs first.

Finally, our current system determines how beneficial candidate instructions are for a particular program, but our system does not provide any mechanism for suggesting candidate instructions on its own. Adding this capability would greatly assist reconfigurable computing.

8 References

[1] Alpha Architecture Handbook, Version 3, Digital Equipment Corporation, Maynard, Massachusetts, October 1996.

[2] Assembly Language Programmer’s Guide, Digital Equipment Corporation, Maynard, Massachusetts, March 1996.

[3] Digital Unix Programmer's Guide, Digital Equipment Corporation, Maynard, Massachusetts,1999

[4] "VTuneä Performance Analyzer" 4.0, Intel Corporation, 1999

[5] Shreekant Thakkar, Tom Huff, "The Internet Streaming SIMD Extensions" in Intel Technology Journal, 2nd quarter 1999, August 1999

[6] Seth Copen Goldstein, Herman Schmit, Matthew Moe, Bihai Budiu, Srihari Cadambi, R. Reed Taylor, Ronald Laufer, "PipeRench : A Coprocessor for Streaming Multimedia Acceleration" in IEEE Symposium on Computer Architecture 1999

9 Appendix I: Targeted Specialized Instructions

The following table lists the specialized instructions that we attempted to search for in each of the traces we analyzed. The instructions are grouped into categories describing their general type of operation and the datatype on which they operate.

Longword Fused	Quadword Fused
L_integer fused addition division	Q_integer fused addition division
L_integer fused addition multiplication	Q_integer fused addition multiplication
L_integer fused division addition	Q_integer fused division addition
L_integer fused division subtraction	Q_integer fused division subtraction
L_integer fused multiplication addition	Q_integer fused multiplication addition
L_integer fused multiplication subtraction	Q_integer fused multiplication subtraction
L_integer fused subtraction division	Q_integer fused subtraction division
L_integer fused subtraction multiplication	Q_integer fused subtraction multiplication
Longword SIMD Scalar	Quadword SIMD Scalar
L_integer SIMD scalar addition	Q_integer SIMD scalar addition
L_integer SIMD scalar division	Q_integer SIMD scalar division
L_integer SIMD scalar multiplication	Q_integer SIMD scalar multiplication
L_integer SIMD scalar subtraction	Q_integer SIMD scalar subtraction
Longword SIMD Vector	Quadword SIMD Vector
L_integer SIMD vector addition	Q_integer SIMD vector multiplication
L_integer SIMD vector division	Q_integer SIMD vector division
L_integer SIMD vector multiplication	Q_integer SIMD vector remainder
L_integer SIMD vector remainder	Q_integer SIMD vector addition
L_integer SIMD vector scaled by 4 addition	Q_integer SIMD vector subtraction
L_integer SIMD vector scaled by 4 subtraction	Q_integer SIMD vector scaled by 4 addition
L_integer SIMD vector scaled by 8 addition	Q_integer SIMD vector scaled by 4 subtraction
L_integer SIMD vector scaled by 8 subtraction	Q_integer SIMD vector scaled by 8 addition
L_integer SIMD vector subtraction	Q_integer SIMD vector scaled by 8 subtraction
T_Float Fused	S_Float Fused
T_float fused addition division	S_float fused addition division
T_float fused addition multiplication	S_float fused addition multiplication
T_float fused division addition	S_float fused division addition
T_float fused division subtraction	S_float fused division subtraction
T_float fused multiplication addition	S_float fused multiplication addition
T_float fused multiplication subtraction	S_float fused multiplication subtraction
T_float fused subtraction division	S_float fused subtraction division
T_float fused subtraction multiplication	S_float fused subtraction multiplication
T_Float SIMD Scalar	S_Float SIMD Scalar
T_float SIMD scalar addition	S_float SIMD scalar addition
T_float SIMD scalar division	S_float SIMD scalar division
T_float SIMD scalar multiplication	S_float SIMD scalar multiplication
T_float SIMD scalar subtraction	S_float SIMD scalar subtraction
T_Float SIMD Vector	S_Float SIMD Vector
T_float SIMD vector addition	S_float SIMD vector addition
T_float SIMD vector division	S_float SIMD vector division
T_float SIMD vector multiplication	S_float SIMD vector multiplication
T_float SIMD vector subtraction	S_float SIMD vector subtraction

Logical SIMD Scalar	Logical SIMD Vector
Logical SIMD scalar and	Logical SIMD vector and
Logical SIMD scalar bic	Logical SIMD vector bic
Logical SIMD scalar bis	Logical SIMD vector bis
Logical SIMD scalar eqv	Logical SIMD vector eqv
Logical SIMD scalar ornot	Logical SIMD vector ornot
Logical SIMD scalar sll	Logical SIMD vector sll
Logical SIMD scalar sra	Logical SIMD vector sra
Logical SIMD scalar srl	Logical SIMD vector srl
Logical SIMD scalar xor	Logical SIMD vector xor

Application Specific Instructions

mpeg_play: extract high; shift left

mpeg_play: extract high; shift right

mpeg_play: extract low; shift left

mpeg_play: extract low; shift right

mpg123: insert; mask; or

mpg123: extracts; ors; shifts