This section presents preliminary results on how PE and RTCG can make the bitmap operations of 2D graphics code more modular and flexible, and sometimes faster as well. First, the bitmap data structure and some of its common operations are introduced. The implementation of two of these operations (bitblit and dithering) is then sketched and explored further in two subsections. The difficulties of dithering are compounded in area operations such as filtering. I suggest synchronous real-time languages as a promising source of techniques for efficiently compiling complex data-flow programs with partial evaluation. Finally, we look at the display of an MPEG2 stream.
What is 2D graphics? 2D graphics is primarily concerned with the representation and manipulation of bitmapped images. A bitmap is a two-dimensional array of pixels whose row boundaries may fall in the middle of machine words. A pixel may have any number of bits, or may be an index into a color table; it may have any number of channels, where a channel is one component of a pixel (such as red, green, blue, or alpha). [enumerate the most common kinds of bitmaps] Manipulation includes per-pixel operations such as posterization and gamma correction; area operations such as filtering and cellular automata; blitting; and scan-conversion (drawing lines and circles).
Because images are 2D arrays at heart, the loop language presented in section loop is directly applicable to performing per-pixel operations on bitmaps.
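To make this concrete, here is a minimal sketch of one such per-pixel pass, posterization, written as the kind of lookup-table loop a specializer could unfold. The function name and the flat 8-bit-grayscale layout are assumptions for illustration; the loop language itself is not shown.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-pixel pass: posterize one 8-bit channel down to
   `levels` distinct values.  The 256-entry table plays the role of
   the static data a specializer could fold into the loop body. */
void posterize(uint8_t *pixels, size_t n, unsigned levels)
{
    uint8_t table[256];
    for (unsigned i = 0; i < 256; i++)
        table[i] = (uint8_t)((i * levels / 256) * 255 / (levels - 1));
    for (size_t i = 0; i < n; i++)
        pixels[i] = table[pixels[i]];
}
```

With `levels == 2` this reduces the image to pure black and white; the table build is static work, and the inner loop touches each pixel exactly once.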
Bitblit (short for `bit block transfer') is a fundamental operation
used for everything from scrolling a window to printing text. The
classic example of RTCG is Pike's implementation of bitblit [blit]. Section bcopy presents preliminary results showing how
to implement bcopy (a special case of bitblit) with cogen.
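The inner loop that makes bcopy interesting can be sketched as follows; this is an illustrative C version (names assumed), in which the bit offset `shift` is exactly the value that RTCG in the style of Pike's blit would make static, turning the variable shifts into immediate operands.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the inner loop of a bitblit-style bcopy: the source run
   is offset by `shift` bits from word alignment, so each destination
   word must be assembled from two adjacent source words.  When
   shift != 0 the source must hold nwords + 1 words. */
void shifted_copy(uint32_t *dst, const uint32_t *src,
                  size_t nwords, unsigned shift)
{
    if (shift == 0) {                 /* aligned fast path */
        for (size_t i = 0; i < nwords; i++)
            dst[i] = src[i];
        return;
    }
    for (size_t i = 0; i < nwords; i++)
        dst[i] = (src[i] >> shift) | (src[i + 1] << (32 - shift));
}
```

Specializing to a known `shift` (and to the aligned case in particular) deletes the test and the double shift from every iteration.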
Dithering is the operation used to display a 24-bit image on a display with only a few bits per pixel. What follows is a sketchier description of a general but fast medium-quality dither procedure. The ditherer handles interleaved or sequential channels in any order (while still sharing loads and base addresses), and dithers down to any number of bits in each channel. It is built from the following components: a 1D linear byte-looper, a software cache, a 1D-feedback dither, and a byte permuter. The ditherer is specialized to the memory layout of its arguments.
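For concreteness, here is a much simpler ordered (Bayer-matrix) dither, not the feedback ditherer described above, reducing one 8-bit channel to `bits` bits. The names and the single-channel layout are assumptions; the point is that `bits`, the matrix, and the channel layout are all candidates for specialization, after which the per-pixel work fuses into straight-line code.

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal ordered-dither sketch: quantize one 8-bit channel to
   `bits` bits, perturbing each pixel by a 2x2 Bayer threshold so
   quantization error is spread spatially. */
void ordered_dither(uint8_t *pix, size_t w, size_t h, unsigned bits)
{
    static const int bayer[2][2] = { {0, 2}, {3, 1} };
    int levels = (1 << bits) - 1;
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++) {
            /* threshold scaled into the gap between quantization steps */
            int t = (bayer[y & 1][x & 1] * 255) / (4 * levels);
            int v = pix[y * w + x] + t;
            if (v > 255) v = 255;
            pix[y * w + x] = (uint8_t)((v * levels / 255) * (255 / levels));
        }
}
```

With `bits == 1` a mid-gray image comes out as a checkerboard, which is the expected behavior of an ordered dither at 50% coverage.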
Efficient implementations of area operations (e.g. cellular automata, filtering, and error-diffusion dithering) use a sliding window to reduce memory references. If the area is 3x3, then a 3x4 window can reduce memory references (though not cache misses) significantly. Similar operations include 2x2 decimation, pixel replication, tiling, and mirroring. To avoid shuffling data inside these loops, they can be unrolled once for each column of the window. [need a diagram] (4 scan lines * 32 bits/pixel * 1k pixels = 16k bytes)
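The sliding-window idea is easiest to see in one dimension. In this sketch (names assumed) of a 3-tap horizontal box filter, each pixel is loaded once and then reused from registers as the window slides; unrolling the loop three times, once per window column as suggested above, would eliminate the register shuffling entirely.

```c
#include <stdint.h>
#include <stddef.h>

/* 3-tap horizontal box filter with a sliding register window.
   Output dst[i] = average of src[i-1], src[i], src[i+1] for
   1 <= i <= n-2; endpoints are left untouched. */
void box3(const uint8_t *src, uint8_t *dst, size_t n)
{
    if (n < 3) return;
    unsigned a = src[0], b = src[1], c;
    for (size_t i = 1; i + 1 < n; i++) {
        c = src[i + 1];              /* the only load per output pixel */
        dst[i] = (uint8_t)((a + b + c) / 3);
        a = b;                       /* slide the window */
        b = c;
    }
}
```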
I think such problems may be solved with synchronous streams. The connections between the streams are static, but the data flowing through them is dynamic. Proper scheduling involves repeating loops of different lengths until they sync up; this can be handled by the cogen's memoization if the right values are made static. Cogen might be applied to something like an interpreter for a synchronous real-time language such as [StateCharts] or [signal].
The above discussion focused on computations where each datum is read exactly once, so temporary storage can be avoided by dynamically building loops. But sometimes temporary storage cannot be avoided: for example, when you compose two filters, or when you multiply two matrices. This is where blocking and tiling come in.
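Blocking is the standard transformation for the multiple-read case; the matrix-multiply sketch below (sizes and names are assumptions for illustration) operates on BxB tiles so that each tile of the inputs is reused from cache B times instead of being re-streamed from memory.

```c
#include <stddef.h>

#define N 8   /* matrix dimension (assumed square, a multiple of B) */
#define B 4   /* block size; in practice chosen to fit the cache */

/* Blocked (tiled) matrix multiply: c = a * b.  The three outer loops
   walk tiles, the three inner loops do a dense BxB x BxB product. */
void matmul_blocked(const double a[N][N], const double b[N][N],
                    double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            c[i][j] = 0.0;
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t kk = 0; kk < N; kk += B)
                for (size_t i = ii; i < ii + B; i++)
                    for (size_t k = kk; k < kk + B; k++)
                        for (size_t j = jj; j < jj + B; j++)
                            c[i][j] += a[i][k] * b[k][j];
}
```

A specializer could fix N and B and fully unroll the inner tile, which is exactly the kind of static loop structure cogen handles well.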
RTCG can also be used to simplify the implementation of primitives like line drawing. The eight symmetric cases of Bresenham's algorithm may be generated from a single loop. [Sproull] shows how a number of efficient variations of the algorithm may be derived by code transformations; perhaps some of them can be expressed with cogen.
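As a point of reference, here is the first-octant Bresenham inner loop (the function name and the points-array interface are assumptions for illustration). The eight symmetric cases differ only in which coordinate is stepped and in the sign of each step, so a cogen could in principle instantiate all eight variants from one annotated loop with the octant static.

```c
/* First-octant Bresenham line (0 <= dy <= dx, stepping x by +1).
   Writes (x, y) pairs into pts and returns the number of points. */
int line_octant0(int x0, int y0, int x1, int y1, int pts[][2])
{
    int dx = x1 - x0, dy = y1 - y0;
    int err = 2 * dy - dx;           /* scaled decision variable */
    int n = 0;
    for (int x = x0, y = y0; x <= x1; x++) {
        pts[n][0] = x;
        pts[n][1] = y;
        n++;
        if (err > 0) { y++; err -= 2 * dx; }
        err += 2 * dy;
    }
    return n;
}
```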
The caches and byte memory operations available in general-purpose CPUs already capture most of the savings from making the shifts and masks static in the bcopy and dither examples. However, DSP chips (and the Alpha [Alpha]) sometimes lack such features, and sometimes do provide explicit access to on-chip RAM, SRAM, and DRAM.
An important example is fast decompression and display of a video stream. MPEG2 expands to full color, but we want to display it on a cheap 8-bit display. Pixels arrive in 8x8 blocks which must be dithered and then moved to the frame buffer. In the 2.5D and 3D realms, the pixels are projected and clipped in what is usually another layer of processing. Isn't this something like prman's CAT (coherent access texture) feature? Looking `end to end' (network device to frame buffer), how much opportunity for cogen is there? How close to the `minimum' number of layers can cogen go, without using an unnatural means of organization to get there?