Reference Design Release Notes

For instructions on how to extract and run the reference design, please read refdes_readme.txt.


Version 1.0 Release, June 9, 2000
The following changes have been made since the last release:

1.  Transmit now has 6 fill threads.  The following projects now have 6 transmit fill threads:
L3fwd16, pfwd_2f, pfwd16_rx16_tx6, pfwd16_tx6, pfwd8_1f_rx16_tx6, and pfwd8_1f_tx6.  This new 
transmit design significantly speeds up the transmit processing.  One FBOX is assigned to the
even TFIFO elements, and the other FBOX is assigned the odd TFIFO elements in this design.  Each 
FBOX is also assigned specific ports to service.  For example, in the pfwd8_1f_rx16_tx6 project, 
FBOX 4 is responsible for all packets output on the 8 10/100 ports and it uses only the even TFIFO
elements, while FBOX 5 is responsible for all gigabit port packets and uses only the odd TFIFO
elements.  The four threads of each transmit FBOX perform the following functions: thread 0 is 
the transmit scheduler and threads 1, 2, and 3 are fill threads.  When the scheduler has no 
packets ready to be transmitted it instructs the next fill thread to skip the next TFIFO element.
When a fill thread gets a skip assignment it sets the ERR field in the transmit status word of its
assigned TFIFO element.  When the transmit state machine sees the ERR field set it does not 
transmit the data in the TFIFO element and goes on to the next TFIFO element.  Interthread signals 
are now used by the fill threads to guarantee the TFIFO elements are validated in order.  
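
The even/odd element split and the skip mechanism described in item 1 can be modeled
roughly as follows. This is an illustrative C sketch, not the reference microcode: the
16-element TFIFO depth, the struct layout, and the names tfifo_elem, owner_fbox,
next_elem, fill_or_skip, and tsm_should_transmit are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TFIFO_ELEMS 16u  /* assumed TFIFO depth for this sketch */

/* Hypothetical model of one TFIFO element's transmit status. */
struct tfifo_elem {
    bool valid;  /* element has been validated by a fill thread */
    bool err;    /* ERR field: set when the element is to be skipped */
};

/* Even elements belong to the first transmit FBOX (0), odd to the second (1). */
static int owner_fbox(unsigned elem)
{
    return (int)(elem & 1u);
}

/* The transmit state machine walks the elements in order and wraps around. */
static unsigned next_elem(unsigned elem)
{
    return (elem + 1u) % TFIFO_ELEMS;
}

/* A fill thread either fills its element or, on a skip assignment,
 * sets ERR so the transmit state machine passes over it. */
static void fill_or_skip(struct tfifo_elem *e, bool skip)
{
    assert(e != NULL);
    e->err = skip;
    e->valid = true;
}

/* An element is transmitted only if it is valid and ERR is clear. */
static bool tsm_should_transmit(const struct tfifo_elem *e)
{
    assert(e != NULL);
    return e->valid && !e->err;
}
```

In this model a skipped element is still validated, so the state machine is never left
waiting on it; it simply sees ERR set and advances with next_elem().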

2.  Dual gigabit port design pfwd_2f added.  This design has 16 receive threads and 6 transmit
threads.  8 receive threads and 3 transmit fill threads are dedicated to each gigabit port. 

3.  Variable sized packets are now supported.  The reference designs now support packets up to
1518 bytes long on B0 silicon with the core clock running at 200 MHz and the IX bus running at 
80 MHz.

4.  B0 chip feature "dual validate bit" functionality implemented.  When the IXP-1200 is 
configured to run in the "dual validate bit" mode the transmit state machine will start to
process a TFIFO element only after both of the following are true: 
  A. A transmit fill thread has written XMIT_VALIDATE (using fast_wr) for the TFIFO element.
  B. An SDRAM signal is received by the transmit state machine indicating the data transfer
     from SDRAM to the TFIFO has completed.
This new functionality speeds up the transmit microcode because the fill threads no longer have to
wait for the SDRAM data transfer to the TFIFO element to complete before writing the
XMIT_VALIDATE and moving on to process the next mpacket.
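
The two conditions in item 4 amount to a pair of valid bits per TFIFO element that can be
set in either order. A minimal C sketch of that rule follows; the struct and function
names (tfifo_valid, on_xmit_validate, on_sdram_done, ready_to_transmit) are hypothetical
and do not correspond to any hardware register or microcode symbol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-element state in "dual validate bit" mode. */
struct tfifo_valid {
    bool ucode_valid;  /* condition A: fill thread wrote XMIT_VALIDATE */
    bool sdram_done;   /* condition B: SDRAM-to-TFIFO transfer completed */
};

/* Condition A arrives from the fill thread. */
static void on_xmit_validate(struct tfifo_valid *v)
{
    assert(v != NULL);
    v->ucode_valid = true;
}

/* Condition B arrives from the SDRAM unit. */
static void on_sdram_done(struct tfifo_valid *v)
{
    assert(v != NULL);
    v->sdram_done = true;
}

/* The element may be processed only after both conditions hold. */
static bool ready_to_transmit(const struct tfifo_valid *v)
{
    assert(v != NULL);
    return v->ucode_valid && v->sdram_done;
}
```

Because the order of the two events does not matter, the fill thread can write its
validate immediately and move on, which is the source of the speedup described above.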

5. The reference designs now use more macros from the microcode/include library.  Areas
significantly improved in this regard are the IP verification, IP modification, and 
route lookup (rec_ipverify, rec_lmatch) routines.  This modification makes the microcode
easier to use, read, and maintain.


Known issues with this release:

1. If a non-valid IP packet is received, it will be discarded and the thread that received it will
go off-line (it will not send thread done to the scheduler, and will not be available for 
further packets). This can be useful for debugging, because you can then look at the read
transfer registers, which hold the actual bogus packet data.

2. While microcode supports both bridge and route lookups, it depends upon the tables that must
be created and maintained by core applications.

3. Layer 3 routing does not insert MAC addresses into the packet. It does however modify checksum
and time to live, and updates the first 32 bytes of the packet where the modified MAC addresses
would be inserted.

4. Some fields in the transactor memory control registers mem_ctl0 and mem_ctl1, when changed, 
have no effect.  Please see the transactor release notes for more details.

5. Bridge spanning tree is not implemented.

6. Priority-based queuing is not yet implemented. The multiple priority queues are present, but
the selection isn't made by the transmit arbiter.

7. Queuing of packets from/to core is not yet implemented.

8. Processing 64 byte packets using the IXF1002 dual port gigabit MAC cannot execute at 
line rate.  The MAC has a restriction that only 2 packets can be in its TFIFO at a time.
The transmit microcode special-cases SOP in order to ensure that this MAC
restriction is not violated.  A future version of the MAC will not have this restriction and,
as a result, the reference design processing performance will increase. 



================================================================================================



------------------------------------------------------------------------------------------------


Previous releases
------------------------------------------------------------------------------------------------


Beta 4 Release,  March 3, 2000
The following changes have been made since the last release:

1. A version for B0 chip using 16 receive threads and no transmit scheduler is implemented. This 
can be run in the workbench project pfwd16_b0. The B0 chip does not require a single thread to
check receive request available before sending receive request. Instead, the receive request can
be issued, and the requesting thread will stall if no register is available to hold the request.
This enables us to eliminate the receive scheduler, thus making another microengine available.
Also, pfwd16_b0 uses both sdram memory banks, and achieves higher performance due to the hiding of
precharge cycles. Measured receive performance was 3M min-size packets per second; transmit was
2.4M min-size packets per second.

2. Fast port design pfwd8_1f has been added. This allows us to run the eval card configuration
on the transactor, where fast port is mac 1, port 0. In the pfwd16_1f design, fast port is 
mac2, port0. 8 100M, 1 fast port (refdes8_1f_hw project) has been tested on the evaluation card.
See the Evaluation Card User's Guide for further information.

3. Fast port designs pfwd8_1f and pfwd16_1f use packet_gen script to source packets on the
transactor. Variable-size packets of 64-512 bytes run on these. Because this is an oversubscribed
situation, the gigabit port rate is throttled by setting its rate to 900Mbits in the packet_gen8*
and packet_gen16* scripts. The rate setting is actually higher than the resulting rate, due to
round-off in the rate calculation.

4. An error in the microcode checksum routine was corrected, in which the final carry was not 
included in the calculation.
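
For reference, the role of the final carry in a ones' complement checksum can be shown
with a short C sketch. This is a generic Internet-checksum routine written for
illustration, not a translation of the microcode; the function name inet_checksum is
an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement checksum over big-endian 16-bit words. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(data != NULL || len == 0);

    /* Add up all complete 16-bit words. */
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];

    /* An odd trailing byte is padded with zero on the right. */
    if (len & 1u)
        sum += (uint32_t)data[len - 1] << 8;

    /* Fold carries back into the low 16 bits. A loop is needed because
     * the first fold can itself produce a carry; dropping that final
     * carry is the kind of error described above. */
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Running the routine over a header that already contains a correct checksum yields 0,
which is a convenient way to verify a received header.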

5. A 3-MAC-mode, uni-directional bug in packet_gen was fixed, where an incorrect delay was used
for MAC receive select decode.

6. Packet cancel has been implemented and tested in receive microcode, for fast port min-size 
and variable size packets.

7. It has been verified that the transmit fifo output pointer correctly functions with the A2 chip.

8. A bug in the settings of sdram mem init and control registers has been fixed. This caused severe
performance degradation for the reference design. With this fix we saw a big improvement in hardware
reference design performance. We have correlated journal results of hardware and transactor
timings. For receive threads, latency from receipt of receive control to enqueue is an average 
641 cycles for the transactor and 653 cycles for the hardware, a 2% difference. Because memory 
refresh is not modeled in the transactor, a 2-4% difference is expected.

9. The new eval card (1 octal MAC + 1 gigabit MAC) has been tested with core clock of 176MHZ, with 64 byte
min-size packets. Interpacket gap was 1.04us for Octal MAC, 0.64us for Gigabit MAC, with 8 100M ports and 1 
1G port forwarding. The core function NetApp_GigInit sets the clock rate for the eval card to 176MHZ.
The validation module (2 octal MACs) has been tested at 162MHZ. For the 12 port version, interpacket
gap was tested to 1.04us with all 12 100M ports forwarding min-size packets. For the 16 port version, 
interpacket gap was tested to 1.2us with 15 100M ports forwarding min-size packets. Again, transmit
performance limits overall forwarding (see issue #5 below). Receive lookup and queueing rate comparison
is as described in #8 above.
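
To relate the interpacket gaps in item 9 to a packet rate, the following C sketch converts
a frame size, link speed, and gap into packets per second. It assumes the gap is measured
from the end of one frame to the start of the next preamble and that each frame carries
the standard 8 bytes of preamble and start-of-frame delimiter; neither assumption is
stated in these notes, and the function name packets_per_second is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PREAMBLE_BYTES 8u  /* assumed: Ethernet preamble + SFD */

/* Packets per second for back-to-back frames separated by gap_ns. */
static uint64_t packets_per_second(uint32_t frame_bytes, uint32_t link_mbps,
                                   uint32_t gap_ns)
{
    assert(link_mbps > 0);

    /* Bits on the wire per frame, including preamble. */
    uint64_t bits = (uint64_t)(frame_bytes + PREAMBLE_BYTES) * 8u;

    /* Time on the wire in nanoseconds: bits / (Mbit/s) = microseconds. */
    uint64_t wire_ns = bits * 1000u / link_mbps;

    assert(wire_ns + gap_ns > 0);
    return 1000000000ull / (wire_ns + gap_ns);
}
```

Under these assumptions, packets_per_second(64, 100, 1040) gives roughly 147,000 packets
per second per 100M port, and packets_per_second(64, 1000, 640) gives roughly 822,000
packets per second for the gigabit port.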

10.  The Bridge reference design has been verified on both the transactor and hardware.  Also, both
the router and bridge libraries can be included in the NetApp application. 

Known issues with this release:


1. If a non-valid IP packet is received, it will be discarded and the thread that received it will
go off-line (it will not send thread done to the scheduler, and will not be available for 
further packets). This can be useful for debugging, because you can then look at the read
transfer registers, which hold the actual bogus packet data.

2. While microcode supports both bridge and route lookups, it depends upon the tables that must
be created and maintained by core applications.

3. Variable-size packets have not been tested on the hardware. In addition, discard for 
variable sized packets on fast port is not fully implemented.

4. Layer 3 routing does not insert MAC addresses into the packet. It does however modify checksum
and time to live, and updates the first 32 bytes of the packet where the modified MAC addresses
would be inserted.

5. As can be seen below in the performance measurements, receive performance is higher than
transmit performance. If packets are received at a higher rate than can be transmitted, eventually
the design will run out of buffers (no_buf_available# at receive). The receive thread will then
loop until a buffer is available (one is freed by transmit side). In the next release this will
be changed to discard packets when a limit has been reached. Also next release will have a
new transmit design that uses 6 transmit threads instead of 4, thus improving transmit performance
significantly.

6. Transactor memory control registers mem_ctl0 and mem_ctl1, when changed, have no effect. 
Because of this, bank size must be 16MB. For the pfwd16_b0 design described above, sdram size
had to be set to 64MB. This configuration may not run on laptops, due to the size of memory 
allocated.

7. Microcode examples: the crc32_8 example is incorrect. Each table lookup is used in generation of a
new crc, which in turn is used in the next table lookup. This requires a serial set of sram reads, 
each with a context swap. Here is the inner loop.

for each byte, in order,
{
	// microcode:
	alu_shf[addr, byte, XOR, crc, >>24]
	sram[read, result, crc32_8_table, addr, 1], ctx_swap
	alu[crc, result, XOR, crc, <<8]
}
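
The same inner loop can be written in C to make the serial dependency explicit: each
table index depends on the crc produced by the previous lookup. The sketch below is an
illustration only. The table contents are generated here for the common polynomial
0x04C11DB7, which is an assumption, since these notes do not say what crc32_8_table
holds; the function names are also hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_8_table[256];

/* Build the 256-entry table for an MSB-first CRC (assumed polynomial). */
static void crc32_8_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n << 24;
        for (int k = 0; k < 8; k++)
            c = (c & 0x80000000u) ? (c << 1) ^ 0x04C11DB7u : (c << 1);
        crc32_8_table[n] = c;
    }
}

/* One table lookup per byte; each step needs the crc from the previous one. */
static uint32_t crc32_8_update(uint32_t crc, const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        uint32_t addr = buf[i] ^ (crc >> 24);   /* alu_shf[addr, byte, XOR, crc, >>24] */
        uint32_t result = crc32_8_table[addr];  /* sram[read, result, ...], ctx_swap   */
        crc = result ^ (crc << 8);              /* alu[crc, result, XOR, crc, <<8]     */
    }
    return crc;
}
```

Call crc32_8_init() once, then crc32_8_update() with the starting crc value and the bytes
in order. The initial value and any final inversion depend on the CRC variant in use.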

8. Bridge spanning tree is not implemented.

9. Priority-based queuing is not yet implemented. The multiple priority queues are present, but
the selection isn't made by the transmit arbiter.

10. Queueing of packets from/to core is not yet implemented.


The performance measurements shown in the refdes_readme.txt file were taken on the following dates
on previous releases (The measurements are old):
	7-23-99	rtm.exe
	4-16-99 bridge.exe


========================================================================================================



Beta Release 7/30/99

1. The microcode can be assembled into several configurations, using these batch files:

pfwd12.bat		12 100M port, big-endian, packet buffer_count 100
pfwd16.bat		16 100M port, big-endian, packet buffer_count 100
pfwd16_1f.bat	16 100M port, 1G port, big-endian, packet buffer_count 100
pfwd8_1f.bat	8 100M port, 1G port, big-endian, packet buffer_count 100
pfwd12hw.bat	12 100M port, little-endian, packet buffer_count 300
pfwd16hw.bat	16 100M port, little-endian, packet buffer_count 300 

The buffer count initialization is performed by the rec_scheduler*. When running on the
transactor, the smaller number is usually sufficient. However, when running on the SA1200
hardware, a higher number is desirable to account for queue bottlenecks. The Octal MAC
is configurable to little-endian or big-endian. The current version of Octal MAC
driver defaults to little-endian. See the readme.txt under workbench_projects for a description
of how to run big-endian on the SA1200 hardware.


2. A single refdes.uof file is created by the linker. This enables loading microcode from a
remote file. Also, the batch files create a ucld.c, which can be linked in with Core libraries,
enabling loading microcode from a memory buffer.


3. There is an fbi xmit_outptr hardware bug in the A1 chip with a software workaround that seems
to work. The xmit_outptr from the fbi drops bits occasionally due to a clock synchronization problem. 
tx_scheduler.uc has a window check to determine if the value is in the range of possible values.
If not, it ignores it. A new xmit_outptr is pushed with every autopush. When a bad value is 
pushed, and tx_scheduler ignores it, a good value will be received on the next autopush, and the
tx_scheduler will resynchronize with the good value. If a bad value is pushed and it is in the
window, tx_scheduler uses it, freeing an element. The tx_scheduler loop time is 56 cycles, and the
auto push rate should be set to be as fast as possible, assuring that tx_scheduler can make no
more than 1 assignment between auto pushes. Thus the tx_scheduler should pick up a good value
the next time it gets around to checking the xmit_outptr. (this may take some fine tuning).

If you suspect this problem, assemble tx_scheduler.uc with -D RECORD_OUTPTRS. This will journal 
the outptrs into SRAM starting at 0x4400, where they can be scanned with a for loop. For example, 
at the workbench command line, in hardware mode, do:
int i;
void check_outptrs(int start, int count){
	for(i=0;i<count;i++){
		if(((sram[start+i]+1)&0xf) != sram[start+i+1]){
			printf("outptr inc by more than 1, i=%d, i+1=%d\n", i, i+1);
		}
	}
}
check_outptrs(0x4400, 10000);
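
The window check itself can be sketched in C as follows. This is an assumption about how
such a check might look, not the code in tx_scheduler.uc: it treats the output pointer as
a 4-bit counter (matching the 0xf mask used above) and accepts a pushed value only if it
has advanced from the last accepted value by no more than the number of elements currently
assigned. The names outptr_in_window, last, pushed, and outstanding are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define OUTPTR_MASK 0xfu  /* 4-bit output pointer, as in the check above */

/* Return true if the pushed output pointer is a plausible successor of
 * the last accepted one, given how many elements are outstanding. */
static bool outptr_in_window(unsigned last, unsigned pushed, unsigned outstanding)
{
    assert(outstanding <= OUTPTR_MASK);

    /* Distance travelled, modulo the pointer range. */
    unsigned advance = (pushed - last) & OUTPTR_MASK;

    return advance <= outstanding;
}
```

A value outside the window is ignored and the next autopush is used instead. As noted
above, a corrupted value that happens to fall inside the window cannot be detected this way.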


4. There is an optimize_mem hardware bug in the A1 chip where sram references can be dropped,
causing the microcode to hang while waiting for the sram signal. For the reference designs,
optimize_mem has been removed for sram reads and read_locks.


5. There was a microcode bug in fast port transmit, where the last 64 byte element of the last
packet is not sent. The tx_fill_f is modified to do the same thing as tx_fill.uc. While polling 
for a new transmit assignment, it checks for an sdram signal to indicate completion of the sdram
to tfifo transfer. When the transfer is complete, it validates it to the fbi.


6. There was a problem where tx_fill_f microcode issued a scratch test and set (no signal) 
followed by a read (sig_done). The previous code incorrectly assumed there was ordering in fbi 
for test and set followed by another read. However, the test and set could finish after the read.
In this case the tx_fill code resumed too early, using uncompleted test and set results. This is
fixed by having tx_scheduler place message id in the assignment, whereby tx_fill checks the
message id.


7. The B0 chip will have some changes that improve performance. For transmit, a mode will allow
the sdram[tfifo_wr..] to be issued without having to check for completion. fbi will maintain 2
sets of valid bits, one from microcode, one from sdram, for each tfifo element. When both are set,
the element will be transmitted to the fbus. This results in a 5-10% improvement in transmit performance.
tx_scheduler_b0.uc and tx_scheduler_f_b0.uc set the xmit_rdy_ctl bit 11 to enable this mode.
See also tx_fill_b0.uc and tx_fill_f_b0.uc for a sample implementation.


8. With assemble switch -D PROFILE on tx_fill.uc and rec.uc, these counters will be incremented:
TOTAL_RECEIVES for packets enqueued, TOTAL_TRANSMITS for packets transmitted. These are defined
by mem_map.h to be scratch_pad 0xc1 and 0xc2, respectively.


9. packet.cpp. pk_endian_swap function has been fixed to correctly swap the bytes of each 
32 bit word. This is used in packet_gen.ind. It will generate little-endian packets if you
#define LITTLE_ENDIAN before running packet_gen.ind.
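
Swapping the bytes of each 32 bit word, as described in item 9, can be sketched in C like
this. The code is illustrative and is not the pk_endian_swap source; the names swap32 and
swap_words are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the four bytes of one 32 bit word. */
static uint32_t swap32(uint32_t w)
{
    return (w >> 24) |
           ((w >> 8) & 0x0000ff00u) |
           ((w << 8) & 0x00ff0000u) |
           (w << 24);
}

/* Swap every word of a packet buffer in place. */
static void swap_words(uint32_t *words, size_t count)
{
    assert(words != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        words[i] = swap32(words[i]);
}
```

Applying swap_words() twice returns the buffer to its original contents, so the same
routine converts in either direction.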


10. fdrive.cpp. Corrected printf statement in PacketPool::Show (called by command-line function
pkpool_show).


11. rtm.cpp. RTM_RtAdd, changed 'end_bit -=4' to 'end_bit +=4'. The bug worked for NT simulations,
but not with GNU on StrongARM.


12. scheduler microcode. For contexts that are killed with ctx_arb[kill], insert 2 nop instructions
before the kill. This is required for SA1200 hardware. Modified scheduler microcode.


13. The amba simulation interface is now used to initialize microengines. The ambaio interface is 
initialized during rt_init and called by rt_load_ucfile. This results in the simulator
running approx. 1000 cycles while reads/writes from core libraries to microengines are simulated. 
Following this the microengines are enabled, and the model runs as before. 



------------------------------------------------------------------------------------------------
Base Level 5 Release
Don Hooper 4/16/99

changed since last release:

1. stdmac.uc. freelist and hashdb macros have been added
	freelist_create, malloc and free. 
	The reference design uses freelist_create to allocate packet buffer (sdram) 
	and queue(sdram) space for packets at microcode startup. This is done by the 
	receive scheduler when it starts up. mem_map.h is used to define freelist type, 
	whether it allocates SRAM, SDRAM or both SRAM and SDRAM. For SDRAM mem_map.h
	also defines the SDRAM base address for the region being allocated.

	hashdb_add48, hashdb_lookup48, hash_add64, hashdb_lookup64, hashdb_resolve. 
	These macros can be used to store and retrieve arbitrary information at a hash 
	database, indexed by a 48 or 64 bit value. The lookups are highly efficient, 
	because they use the hardware hash feature that can result in minimal collisions. 
	The BL5 reference design now has a bridge manager in the StrongARM core libraries. 
	It is used to add bridge entries and maintain the lookup tables.
	Microcode calls hashdb_resolve for MAC DA and SA lookups. This macro uses stdfunc.uc
	where there is the function hashdb_resolve_func. hashdb_resolve will setup arguments,
	perform branch and link (balr in stdmac.uc), and execute the function. Upon completion
	the code branches back to the link_pc setup by the balr. 
	
	Under qa-test\uc_tests\macro are hashdb_test.ind and hashdb_test.uc. This test
	exercises the hashdb functions for both 48 bit and 64 bit indexes. The hashdb_add*
	macros can be used by microcode for spanning tree or IP(SADA) connection lookups.

2. bridge reference design. 
	Under the 12 port version, there is a script that exercises the bridge manager core
	library to populate bridge lookup tables and entries, creates bridge packets using
	packet_gen console functions, and runs the bridge forwarding microcode. These scripts
	are pfwd12_bi_min_br.ind (top script) and packet_gen_br.ind.  Use the Br_RtAdd command
	at the VxWorks console to add route entry to the bridge table when running on hardware/VxWorks.
	The first two Br_RtAdd parameters describe the 6-byte net-order MAC address and the third
	parameter is the associated port.  For example, to add an entry that will route bridge
	packets with MAC address aabbccddeeff to port 0; type Br_RtAdd(0xaabbccdd,0xeeff,0) at the
	VxWorks console after the NetApp has been initialized.  The Bridge microcode is at
	rec_bridge.uc.  The Software Spec is updated to cover the Bridge Manager.
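
	The split of a 6-byte MAC address into the first two Br_RtAdd parameters, as in the
	example above, can be sketched in C. The helper below is hypothetical (it is not part
	of the Bridge Manager); it only shows how the upper four bytes and lower two bytes
	of a net-order address map onto the two values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the first two Br_RtAdd arguments. */
struct br_mac_args {
    uint32_t hi;  /* first four bytes of the MAC address */
    uint32_t lo;  /* last two bytes of the MAC address */
};

/* Pack a net-order 6-byte MAC address into the two argument values. */
static struct br_mac_args mac_to_br_args(const uint8_t mac[6])
{
    struct br_mac_args a;

    assert(mac != NULL);
    a.hi = ((uint32_t)mac[0] << 24) | ((uint32_t)mac[1] << 16) |
           ((uint32_t)mac[2] << 8) | mac[3];
    a.lo = ((uint32_t)mac[4] << 8) | mac[5];
    return a;
}
```

	For the address aabbccddeeff this yields 0xaabbccdd and 0xeeff, matching the
	Br_RtAdd(0xaabbccdd,0xeeff,0) call shown above.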

	Thread 18 is used as a microcode service library. FBOX 4 (receive scheduler) was the logical 
	choice, since it had some spare compute bandwidth and lots of unused microcode storage. The
	script calls br_sv_init(); to the Bridge Manager on core to load the service library
	imported function info and to load the hardware hash multiplier. In this version multiplier
	is 1 and Bridge Manager does not call the service library to get a hardware hash translation.
	This will be done in the next release. The microcode does call the hardware hash translation,
	but since the multiplier is 1, we get out what we put in.
	A bug exists in the lookup: if a MAC address comes in with lower bits that match
	an existing forwarding entry, it will get a hit on that forwarding entry even if the
	upper bits mismatch. This is simply resolved by comparing the 47:16 bit hash remainder
	to the hash remainder stored in the forwarding database. This will be fixed in the next release.
	
3. packet_gen MAC device capability. 
	The packet_gen library has been enhanced to simulate MAC devices (receive and transmit side)
	on the fbus. This is done in C (see source code in the packet_gen project). 
	The packet_genLib.lib is linked with all executables and dlls that contain either the route table
	Manager or the Bridge Manager: rtm.exe, bridge.exe, SA1200Core_dll.dll, WIN32_SA1200CORE.exe.
	packet_gen.h exports pkpool_add_from_buf() that can be called by another application to insert
	packets into the packet pool. Alternatively, the transactor command line can be used to create packets
	and insert them in the packet pool. packet_gen will then drive packets from the packet pool
	into the transactor over the fbus interface. At the transactor command line, type fb_help();
	to get a list of all functions. You can create valid bridge packets or IP packets with checksum, 
	insert fields in packets, add them to pool, configure mac devices, set rates, enable and disable
	ports, write packets to file, etc.

	fb_print(); can be invoked to print the values of fbus interface pins each cycle.
	fb_noprint(); will turn off printing the values of fbus interface pins.

	pfwd12* and pfwd16* scripts use the new packet_gen MAC devices to drive and monitor the fbus.
	pfwd16_1f* scripts use the previous pfwd_mac script that emulates MAC ports driving packets to fbus.


4. fast port bug fixes. 
	Race conditions existed in the fast port save and restore. These have been corrected to use
	test and set for output port save and restore, and to lock the receive state mailbox (which is
	used for receive state save/restore between mpackets on greater-than-minsize packets). The mailbox
	is locked until the eop min-packet comes in. A test and set is performed by the thread getting the
	next sop, to ensure it doesn't save state while the previous packet is still being processed by other
	receive threads. There is no performance difference with this vs the previous version.

------------------------------------------------------------------------------------------------
Base Level 5 Interim Release
Don Hooper 2/4/99

changed since last release:

1. endian swap macros. These include a set of endian-independent field extract, compare, branch, 
increment, decrement and add macros. This enables the same reference design to be assembled 

2. endian- macros used in receive microcode for field compare, compare and branch, checksum update, 
time-to-live decrement and checksum verification

3. packet generating script will drive packets in big or little-endian format.

4. new standard macros. bit set and bit clear,  indirect shift, sdram_r_fifo_rd, delay, msgq (init, send 
and receive), msg (send and receive)

5. 16 100M port and 1G/8 100M port designs. Ready count must increment twice before port mask is cleared. 
Also, slow ports use save and restore between mpackets to free up receive threads for higher performance. 
No longer using binding array in scheduler to bind port to thread.

6. Receive microcode, 16 port and 1G/8 100M port. Receive threads save and restore buf_handle and 
rec_state between mpackets.

7. Fast port scheduler and receive. Use the mpacket sequence from the receive request as the index to save 
and restore receive buf_handle and rec_state.

8. Packet discard. In rec_state, packet discard bit is set if packet is to be discarded. As further 
elements of the packet are received, they are ignored (no storage to sdram) if this bit is set.

9. 16 port and fast port reference design performance improvement by ~10%.

11. Reference Designs Illustrations document.

12. There is an example 2 gig port receive scheduler (rec_scheduler_f2.uc). It assembles, but has not 
been tested. It is based on the tested rec_scheduler_f.uc code.


------------------------------------------------------------------------------------------------
Don Hooper 12/22/98
Base Level 4 Release

A. Major Changes from BL3:

1. Top level scripts run the three reference designs in 4 configurations each, bi-directional,
uni-directional, min-size packets, variable size packets.

2. Added bus-watchers to pfwd_mac, to check that f_dat_lo, f_dat_hi, fbe_l0_l, fbe_hi_l, rdybus, 
sop, eop, and rxfail are being driven correctly.

3. Correctly implemented autopush prevent window in receive and transmit schedulers. The code
performs in_autopush#: br_inp_state[push_protect, in_autopush#], followed immediately by a read 
of the pushed transfer registers, so as not to read these registers while they are being written.

4. Fixed rx_fall bug that caused receive microcode to hang.

5. Modified tx_fill_f.uc to not send an early transmit validate at startup. 


----------------------------------------------------------------------------------------------------------
Don Hooper 10/5/98
Software Version: SA1200 base level 3
------
Notes:
------
What has changed since last release:

1. Sixteen 100M port version is added.
2. Variable packet length support is added for all versions.
3. Packets are stored in 2k sdram buffers.
4. Complete rewrite of tx_scheduler_f and tx_fill_f. Fast port queueing differs in that a
circular array is used to hold fast port queue heads. Transmit scheduler assigns tfifo 
elements and preferred queue to fast port transmit fill threads. Transmit fill decides 
whether to use the elements for a continuing long packet or for new packet.
5. A receive scheduler that achieves a maximum receive rate of 2.0 gigabits per second is added.
See notes below.

----------------------------------------------------------------------------------------------------------
Don Hooper 8/5/98
Software Version: SA1200 base level 3 interim
------
Notes:
------
What has changed since last release:

1. A problem had existed in the 7/29/98 transactor release where if a receive request was issued to a non-ready
fast port, fbi would crash with an invalid receive request message. There was a workaround in drive_rdy_f.ind
to always keep fast port ready. The workaround has been removed.

2. Non-SOP code has been added to receive microcode. It assembles, but is not yet tested.


Outstanding Issues:

1. There will be a change to the transactor allowing the gigabit sequence count to be incremented independent of
writing THREAD_DONE. This allows THREAD_DONE to be written early, and the sequence check and increment to
be written later just prior to the enqueue. Writing THREAD_DONE early can result in a performance improvement
because it takes ~250 cycles from fast_wr[*, THREAD_DONE] until the receive thread is awakened by FBI 
indicating there is a new packet in the rfifo. If THREAD_DONE is written at the end of processing the header,
the receive thread will have to wait, doing nothing for many cycles. 
When it is supported, as soon as the mpacket has been transferred to sdram, we will fast_wr[THREAD_DONE],
and at enqueue time we will perform the sequence check and fast_wr[INCR_SOP1] for the fast port.

2. ind scripts do not yet drive long packets (non-min-size).

3. exception packets are not yet passed to core or host in the reference designs.

4. packets are not yet sent from core or host in the reference designs.

5. multicast packets are not yet supported in the reference designs.

6. layer 2 bridge code is not operational. There is a known working version that relies on tables in sram space 
created by ind scripts, but this is not compatible with the sram areas allocated by mem_map.h. Therefore
the tables are not created in this version and bridge code assembles but does not forward.



----------------------------------------------------------------------------------------------------------

Don Hooper 7/29/98
Software Version: SA1200 base level 3 interim
------
Notes:
------
What has changed since last release:

Notes on FAST PORT implementation July 29, 1998

	rec_scheduler_f.uc
	rtm_load_f.ind
	dfh		7/29/98		Rename rec_scheduler2_f.uc to rec_scheduler_f.uc. 
						Rename rtm_load2_f.ind to be rtm_load_f.ind. This version
						does dynamic port assignment (ports are no longer statically
						bound to threads), bookkeeping for the oversubscribed network of
						1 Gig port and 16 100M ports. drive_rdy_f.ind turns the ready
						bits for these ports on and off at network rates.	

	pfwd_f.ind			set core clock 166MHZ, fbus clock 66MHZ, pci clock 66MHZ

	drive_rdy_f.ind		drives port ready bits on and off for gigabit port, 16 100M ports



----------------------------------------------------------------------------------------------------------

Don Hooper	7/7/98

Software Version: Base Level 2 Patch compatible with transactor of the same date.
------
Notes:
------
What has changed since Base Level 2:


1. rec_nextpac goes into a timeout loop at startup, to wait for a safe time to pop 
a new buffer descriptor from the freelist. rec_scheduler creates a freelist of 50 buffers.

2. rec_nextpac initiates pop to get a new buffer descriptor from the freelist prior 
to waiting on the start_receive signal from FBI

3. rec_ipverify uses the buffer descriptor and moves the second 32 bytes of the packet
from rfifo to sdram early

4. rec_bridge uses the buffer descriptor and moves the 64 bytes of the packet
from rfifo to sdram early. It then waits on the completion of this before going
to sdram for the forwarding lookup.

5. rec_lmatch waits on the completion of the initial rfifo to sdram transfer before 
going to sdram for the forwarding lookup. It then writes to THREAD_DONE early to 
give receive_scheduler time to schedule the element, so that the thread can pick up the
next 64 bytes without too much dead time between the receive thread going idle
and receiving start_receive.

6. The receive control from fbi is moved into a status register in the format that
transmit will use it, in order to save cycles. Note the format of receive control
register from the fbi has changed so byte count, eop and sop line up in the same byte.

7. Fixed a bug in rec_scheduler3 where it hung with ctx_arb[fbi] due to a skip 
(when port is not ready).

8. Used a macro for stage_rec_ready and stage_thread_done in both rec_scheduler2 and
rec_scheduler3. Simplified the calculation of rec_rdy_true.

9. rec_enqueue.uc, tx_fill.uc and mem_map.h use XMIT_PW1E_ADDR to hold a vector of ports
with 1 or more elements, and XMIT_PW2E_ADDR to hold a vector of ports with 2 or more 
elements. This is in preparation for fast port work to come. For now, XMIT_PW1E_ADDR works
the same as the previous XMIT_PWNP_ADDR, and XMIT_PW2E_ADDR is also set and cleared. Note,
 .if structured assembly and macros are used here, so the assembly must be done
with the optimize -O switch, in order to get microwords into the defer shadows.
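
The relationship between the two port vectors in item 9 can be sketched in C. This is an
illustration only, not the logic in rec_enqueue.uc or tx_fill.uc: the struct, the
function name update_port_vectors, and the idea of recomputing both bits from a per-port
element count are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of the two scratch vectors, one bit per port. */
struct port_vectors {
    uint32_t pw1e;  /* ports with 1 or more elements queued */
    uint32_t pw2e;  /* ports with 2 or more elements queued */
};

/* Recompute one port's bits from its current element count. */
static void update_port_vectors(struct port_vectors *v, unsigned port,
                                unsigned count)
{
    assert(v != NULL);
    assert(port < 32);

    uint32_t bit = (uint32_t)1 << port;

    if (count >= 1)
        v->pw1e |= bit;
    else
        v->pw1e &= ~bit;

    if (count >= 2)
        v->pw2e |= bit;
    else
        v->pw2e &= ~bit;
}
```

With this layout a scheduler can test a single word to find ports that have at least one
element, and a second word to find ports that can be given two elements at once.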

10. tx_scheduler. inserted a ctx_arb[fbi] to wait on the completion of the write of the
transmit schedule assignment to scratch.

11. tx_fill moves sdram to tfifo earlier to avoid a long wait on the completion of this.
This enables the fill operation to complete and send XMIT_VALIDATE earlier.

12. There is a rec.bat for assembling microcode with bridge and route. Some registers
have been removed to reduce the total local register count for receive. output_port,
enqueue_cmd, rcv_cmd information is combined in the status reg.


13. Fast port changes 
	dfh		7/7/98		Fixed problem where elements were blocked when transmit 
						assigns 2 elements

The differences with the 100M 16 port release

Ports: 1-2 fast ports and 16 100M ports
Look in the following files for FAST PORT, #ifdef FAST_PORT_ENABLED, 
	#ifdef FAST_PORT1, #ifdef FAST_PORT2

	mem_map.h			Defines XMIT_GIG_COUNTS_ADDR for use by receive and tx_scheduler
						Removed the reference to XMIT_PW2E.

	pfwd_f.ind			Calls pfwd_packet_f.ind and rtm_load_f.ind
						initializes fast port queue descriptor 1 with 0x00010000
						This primes the fast port element count with a bitset/bitclear
						byte. See rec_enqueue and tx_scheduler_f comments. 

	pfwd_packet_f.ind	Sets ports 16 and 17 ready
	 					watch aa sets pins fast_rx1 = 1 and fast_rx2 = 1;
						Calls drive_rdy_f.ind which drives port ready bits on and off.
    
	rtm_load2_f.ind		Route lookup results in half the packets to output port 16
	
	rec_scheduler2_f.uc	Assigns fast port 16 to any receive thread. Uses find first bit set
						to get slow port ready. Alternates between fast and slow port
						assignments. Tracks slow ports started and limits them to a max
						starts allowed which is derived from fast port ready.
	
	pfwd_f.bat			Assembles and links FAST PORT reference design microcode

	rec_f.uc			Defines FAST_PORT_ENABLED and FAST_PORT1

	rec_nextpac.uc		If msg = cancel, returns THREAD_DONE immediately. Sets fast
						port 2 or 1 in status 31:30.
	
	rec_lmatch.uc		Waits on enqueue sequence number so that all packets
						received from a fast port are enqueued in order.
	
	rec_enqueue.uc		Allows up to 32 ports.				
						Updates the fast port element bitset/bitclear in 
						queue descriptor1. Sets the queue count in 
						scratch XMIT_GIG_COUNTS_ADDR.
						Does read_lock at FPORT_MUTEX_ADDR before enqueueing.
						Enqueues on an array of 8 buckets for fast ports. This eliminates
						a read_lock bottleneck.			
	
	tx_scheduler_f.uc	Schedules 1 or 2 fast ports and 16 100M ports. Blocks the 100M
						ports with a bit set in 31:16 of PW1E. Assigns 1 or 2 elements
						to fast ports by comparing the number of elements on the fast port
						queue (calculated from a nibble count set in scratch XMIT_GIG_COUNTS_ADDR)
						to a local count of elements for that queue. 
						Inserts bucket in assignment. Bucket is derived from XMIT_GIG_COUNTS
						and local element count for the fast port.

	tx_fill_f.uc		Clears the 100M block with a bit clear in 31:16 of PW1E. Sends 1
						or 2 elements to fast port.
						Dequeues from an array of 8 buckets for fast ports.			




