/////////////////////////////////////////////////////////////////////////////////////////
//                                                                      
//                  I N T E L   P R O P R I E T A R Y                   
//                                                                      
//     COPYRIGHT (c)  2001 BY  INTEL  CORPORATION.  ALL RIGHTS          
//     RESERVED.   NO  PART  OF THIS PROGRAM  OR  PUBLICATION  MAY      
//     BE  REPRODUCED,   TRANSMITTED,   TRANSCRIBED,   STORED  IN  A    
//     RETRIEVAL SYSTEM, OR TRANSLATED INTO ANY LANGUAGE OR COMPUTER    
//     LANGUAGE IN ANY FORM OR BY ANY MEANS, ELECTRONIC, MECHANICAL,    
//     MAGNETIC,  OPTICAL,  CHEMICAL, MANUAL, OR OTHERWISE,  WITHOUT    
//     THE PRIOR WRITTEN PERMISSION OF :                                
//                                                                      
//                        INTEL  CORPORATION                            
//                                                                     
//                     2200 MISSION COLLEGE BLVD                        
//                                                                      
//               SANTA  CLARA,  CALIFORNIA  95052-8119 
//
/////////////////////////////////////////////////////////////////////////////////////////
// 		
//		Change History
// 		--------------
//
// Date			Description											Whom
// ---------------------------------------------------------------------------------
//
// 12/26/02    	Sort Microengine of Egress Scheduler 				urn/kt             
//                                                                  
/////////////////////////////////////////////////////////////////////////////////////////
/*
This ME is the first of 3 MEs that form a context pipeline implementing the modified 
DRR scheduler function.

The basic scheme for this DRR implementation is defined below.

Every queue has pre-assigned parameters that define its transfer rate. 
In the DRR scheme every active queue collects credit as time progresses and 
spends that credit as packets arrive. If a queue has not collected enough credit 
when a packet arrives, the packet is scheduled for transmission at a later 
time, once its queue has collected enough credit. 
This Scheduler implements the DRR scheme by assigning each queue a per-round 
quantum. For example, suppose there are 2 active queues: 
queue X can transmit 10 bytes per round, queue Y can transmit 20 bytes per round, and
the last packet of each queue was scheduled for transmission in the current round.
If queues X and Y each receive a 40-byte packet, the packet in queue X has to 
wait 4 rounds and the one in queue Y has to wait 2 rounds from the current 
round.
Assuming the current round is round 0, the SOP of queue X's packet is placed in 
the FIFO used by round 4 and the SOP of queue Y's packet is placed in the FIFO used 
by round 2. Rounds are defined separately for every port. As ports are served in 
Round Robin order, packets are drained
from these FIFOs and sent to the transmit functional block.
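
The round arithmetic described above can be sketched in C as follows. This is 
only an illustration, not the microengine code: the names drr_queue, drr_enqueue 
and ROUND_MASK are hypothetical, the quantum is assumed to be a power of two 
(2^quantum_shift bytes per round), and a 12-bit round number is assumed.

```c
#include <assert.h>
#include <stdint.h>

// Assumption: round numbers wrap at 12 bits.
#define ROUND_MASK 0xFFFu

// Hypothetical per-queue state, loosely modelled on the queue structure.
typedef struct {
    uint32_t credit_used;    // bytes already used in the last scheduled round
    uint32_t last_round;     // round in which the last packet was scheduled
    uint32_t quantum_shift;  // n, where 2^n is the quantum per round
} drr_queue;

// Returns the round in which a packet of packet_length bytes is placed
// and updates the queue state accordingly.
static uint32_t drr_enqueue(drr_queue *q, uint32_t packet_length)
{
    assert(q != 0 && q->quantum_shift < 32u);

    uint32_t quantum_mask = (1u << q->quantum_shift) - 1u;
    uint32_t total_bytes  = q->credit_used + packet_length;

    // Whole quanta consumed tell how many rounds to move forward.
    uint32_t round = (q->last_round + (total_bytes >> q->quantum_shift))
                     & ROUND_MASK;

    // The remainder (total MOD quantum) is carried into that round.
    q->credit_used = total_bytes & quantum_mask;
    q->last_round  = round;
    return round;
}
```

With quanta of 16 and 32 bytes per round, a 64-byte packet arriving at two idle 
queues in round 0 is placed in round 4 and round 2 respectively.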

The entire DRR implementation is divided in 3 MEs called Sort, Count, and Port Scheduler.

Sort ME receives enqueue information through the next neighbor ring from the stats ME, and next dequeue round
	requests through a scratch ring (written by the Port Scheduler ME).

	A next dequeue round request is sent every time the packet count of the round being dequeued
	hits a low-water mark.
	After receiving a next dequeue round request for port X, Sort marks the current round for port X 
	unavailable for future enqueuing and sends it to the Count ME, which in turn sends 
	it to the Port Scheduler ME. After making the current round unavailable, Sort moves on to
	the next round and enqueues packets in that bin and onwards. 
	
	Sort ME decides the round in which a received packet is placed. This depends on the packet 
	size, the quantum assigned to the packet's queue, and the credit the queue has used when the packet 
	arrives. For this calculation the following queue structure is stored in SRAM for each queue.
	
	+---------------------+---------------------------+-----------------------------+
	|			queue_current_credit_used	(# of bytes used in the last round)		|	LW0
	+---------------------+---------------------------+-----------------------------+
	|			last_round_scheduled (where the last packet for this queue 			|
	|			is scheduled)		   	   											|	LW1
	+---------------------+---------------------------+-----------------------------+
	|			enqueued_packets_count	(Used to find out when queue is empty)	    |	LW2
	+---------------------+---------------------------+-----------------------------+
	|			queue_quantum_mask_and_shift  										|
	|			quantum mask [31:8]: (quantum per round - 1. Used as a mask)		|	LW3
	|			quantum shift [7:0]: (n where 2^n is quantum per round)				|
	+---------------------+---------------------------+-----------------------------+

	16 queue structures are cached in local memory and indexed by the CAM

	Interfaces:

	 * Enqueue request from Statistics microengine to Sort microengine on NN ring 
 
 		packet_length				 : 32 bits	- LW0
 	 	port_number					 : 16 bits	- LW1
 	 	queue_number				 : 16 bits	- LW1
 	 	sop_handle					 : 32 bits  - LW2

	 * Enqueue request and new round response from Sort microengine to Count microengine
		sop_handle					 : bits 31:0		for enqueue request
															
		port_round_number			 : bits 31:0		for enqueue request
															[19:16] port number [11:0] round number
		
		port_next_dequeue_round		 : bits 31:0		for response to Next Dequeue Round
		 													Request from Port Scheduler microengine
		 													[19:16] port number [11:0] round number
		 													[31] invalid bit: 1 = invalid response
		 																	0 = valid response 
		 													[30:20] reserved (must be 0)
	 
	  If enqueue request or new round response is not valid then 
	  round number will be 0 (0 is INVALID_ROUND_NUM)
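
	The [19:16] port / [11:0] round layout and the [31] invalid bit described 
	above can be sketched in C as below. This is an illustrative sketch only: 
	the helper names and the PORT_SHIFT, PORT_MASK, ROUND_MASK and INVALID_BIT 
	constants are hypothetical and are not taken from scheduler_packet.h.

```c
#include <stdint.h>

// Assumed field positions, following the interface description above.
#define PORT_SHIFT   16u
#define PORT_MASK    0xFu        // port number lives in bits [19:16]
#define ROUND_MASK   0xFFFu      // round number lives in bits [11:0]
#define INVALID_BIT  (1u << 31)  // bit [31]: 1 = invalid response

// Build a port_round word; reserved bits [30:20] and [15:12] stay 0.
static uint32_t pack_port_round(uint32_t port, uint32_t round)
{
    return ((port & PORT_MASK) << PORT_SHIFT) | (round & ROUND_MASK);
}

// Extract the port number from a port_round word.
static uint32_t unpack_port(uint32_t word)
{
    return (word >> PORT_SHIFT) & PORT_MASK;
}

// Extract the round number from a port_round word.
static uint32_t unpack_round(uint32_t word)
{
    return word & ROUND_MASK;
}

// A next-dequeue-round response is valid when bit 31 is clear.
static int response_is_valid(uint32_t word)
{
    return (word & INVALID_BIT) == 0;
}
```

	For example, port 3 and round 0x123 pack into the word 0x00030123.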
*/

#ifndef __SORT_UC__
#define __SORT_UC__

#include 	"scheduler_packet.h"
#include 	"stdmac.uc"
#include 	"xbuf.uc"
#include	"test_util.uc"

// Location in local memory where the per queue data structures are stored
// for the sort microengine.
// At most 16 queue structures are cached in local memory. Each 
// structure occupies a 64-byte block, so the total memory used is 0x400 bytes.
#define_eval    QUEUE_STRUCS_LM_BASE        0

// Location in local memory where the data structures per PORT for current round
// number and two previous round numbers are stored for the sort microengine
#define	PER_PORTS_CURR_ROUND_STRUCTURE_BASE		0x400


#define	INIT_CURR_ROUND							0x3
#define	INIT_CURR_ROUND_MINUS1					0x2
#define	INIT_CURR_ROUND_MINUS2					0x1


//Define value to write into SAME_ME_SIGNAL CSR
.sig	volatile prev_th_sig
#define next_th_signal		((1 << NEXT_CONTEXT_BIT) | (&prev_th_sig << SIGNAL_NUMBER_FIELD))

.sig	volatile wrback_lru_done
.sig	volatile read_queue_data_done
.sig 	volatile get_round_request_done
.sig	volatile read_deq_cntr_done

/*******************************************************************************/

#macro init_local_csr()
.begin

	.reg ctx_enable_data nn_ring_empty_val

	local_csr_rd[ctx_enables]
	immed[ctx_enable_data, 0]

	;Bits [19:18] control the threshold at which NN_Empty is asserted.
	;Set [19:18] to 1:0 to specify that the message on the NN-ring is 3 longwords
	move(nn_ring_empty_val, 0x80000)
	
	alu[ctx_enable_data, ctx_enable_data, OR, nn_ring_empty_val]
	local_csr_wr[ctx_enables, ctx_enable_data]

	cam_clear

	local_csr_wr[nn_get, 0]
	local_csr_wr[nn_put, 0]

.end
#endm

/*******************************************************************************/

#macro init_per_port_curr_round_data_in_lm()
.begin

	.reg init_port, init_port_shifted
	.reg init_lmemaddr entry0 entry1 entry2

	;Initialize port data structures in local memory
	immed32[init_lmemaddr, PER_PORTS_CURR_ROUND_STRUCTURE_BASE ]
	immed[init_port, 0x0]
	
	.while (init_port < 16)

		local_csr_wr[active_lm_addr_1, init_lmemaddr]
		alu[init_port_shifted, --, B, init_port,<<16]
		alu[init_lmemaddr, init_lmemaddr, +, 0x10] 
		
		alu[entry0, init_port_shifted, OR, INIT_CURR_ROUND]
		alu[entry1, init_port_shifted, OR, INIT_CURR_ROUND_MINUS1]
		alu[entry2, init_port_shifted, OR, INIT_CURR_ROUND_MINUS2]
		alu[lm_curr_round,--,B,entry0]
		alu[lm_curr_round_minus1,--,B,entry1]
		alu[lm_curr_round_minus2,--,B,entry2]

		alu[init_port, init_port, +, 1]
		

	.endw

.end

#endm

/*******************************************************************************/

#macro sort_init()
.begin

	move(port_round_mask, PORT_ROUND_MASK_VAL)
	alu[zero, --, B, 0]

	#if ((SCHED_QUEUE_STRUCTURES_BASE << (31 - QUEUE_STRUCTURES_PER_PORT_SHIFT)) != 0)
		#error "Queue structures base must be aligned at boundary of \
				1 << QUEUE_STRUCTURES_PER_PORT_SHIFT"
	#endif
	move(queue_structures_base, SCHED_QUEUE_STRUCTURES_BASE)

	#if ((PER_PORTS_CURR_ROUND_STRUCTURE_BASE << (31 - PER_PORT_CURR_ROUNDS_SHIFT)) != 0)
		#error "Port current rounds structure base must be aligned at boundary of \
				1 << PER_PORT_CURR_ROUNDS_SHIFT"
	#endif
	move(port_curr_rounds_base, PER_PORTS_CURR_ROUND_STRUCTURE_BASE)	
	
	.if (ctx() == 0)
		init_local_csr()

		init_per_port_curr_round_data_in_lm()
			
		ctx_arb[system_init_sig]

	.else
		ctx_arb[prev_th_sig]	
	.endif

.end
#endm

#macro read_sratch_ring($next_dequeue_round_request)

	;check scratch ring for NEW ROUND REQUEST message from COUNT or PORT SCHEDULER ME
	#define_eval _RING_ADDR (SORT_COUNT_SCHEDULER_SCR_RING * 4)
	scratch[get, $next_dequeue_round_request, zero, _RING_ADDR , 1], \
			sig_done[get_round_request_done]
	#undef _RING_ADDR 
#endm

/*******************************************************************************/

#macro read_nn_ring_msg_and_calc_addrs(packet_length, queue_number, port_number, \
		sop_handle, this_port_curr_rounds_addr, eq_struct_addr, port_queue_data_base)

	alu[packet_length,--,B,*n$index++]

	ld_field_w_clr[queue_number,0011,*n$index]

	alu_shf[port_number, --, B, *n$index++, >>16]
	alu[sop_handle,--,B,*n$index++]

	;loading lm_addr1 for lm_curr_round, currBin_minus1 and currBin_minus2 info
	alu_shf[this_port_curr_rounds_addr, port_curr_rounds_base, OR, port_number, \
			<<PER_PORT_CURR_ROUNDS_SHIFT]	
	//local_csr_wr[active_lm_addr_1, temp]

	;calculate the base of the queue structures of this port

	alu_shf[port_queue_data_base, queue_structures_base, OR, \
				port_number, <<QUEUE_STRUCTURES_PER_PORT_SHIFT]

	alu_shf[eq_struct_addr, port_queue_data_base, or, queue_number, <<QUEUE_STRUCTURE_SHIFT]
	
#endm

/*******************************************************************************/

#macro write_to_nn_ring(sop_handle, enqueue_port_and_round, next_dequeue_round)

nextneighbor_full#:
	br_inp_state[NN_full, nextneighbor_full#]

	alu[*n$index++, --, B, sop_handle]
	alu[*n$index++, --, B, enqueue_port_and_round] 
	alu[*n$index++,--, B, next_dequeue_round] 

#endm

/*******************************************************************************/

#macro cam_lookup_and_get_queue_structures(eq_struct_addr, $xfer_in, $xfer_out)

.begin
	.reg	cam_result		// result of CAM lookup
	.reg	cam_entry		// entry number from CAM lookup
	.reg	cam_tag			// tag in CAM for entry 

	cam_lookup[cam_result, eq_struct_addr], lm_addr0[0]
	
	//	Check lookup result
	br_bset[cam_result, 7, sort_cam_hit_phase_1#]
	
sort_cam_miss_phase_1#:
	// this is a CAM miss case
	// LRU 	queue structure => SRAM; SRAM queue structure => $Xfer

	//Queue structures are placed in 64-byte blocks to make use of
	//the entry number in lm_addr#[9:6] after cam_lookup 
	sram[read, $xfer_in0, eq_struct_addr, 0, 8], sig_done[read_queue_data_done]
	;get CAM entry
	alu[cam_entry, 0xF,and, cam_result, >>3]
	; read CAM tag which is the queue address in SRAM
	cam_read_tag[cam_tag, cam_entry]							
	
 	// Move the modified part of the  LRU queue structure 
	// to transfer registers. l$index0 points to the beginning of 
	// the LRU queue structure
	alu[$xfer_out0, --, B, lm_q_current_credit_used]
 	alu[$xfer_out1, --, B, lm_q_last_round_scheduled]
 	alu[$xfer_out2, --, B, lm_q_enqueued_packets_count]

	//alu[addr_out, port_queue_data_base, or, cam_tag, <<6]	; get LRU CRX SRAM address
	sram[write, $xfer_out0, cam_tag, 0, QUEUE_STRUCTURE_LWS_TO_WRITE_BACK], \
		sig_done[wrback_lru_done]

	cam_write[cam_entry, eq_struct_addr, NORMAL_CAM_STATE]	; update CAM LRU entry

	ctx_arb[prev_th_sig, read_queue_data_done, wrback_lru_done, get_round_request_done], \
			all, br[sort_cam_miss_phase_2#]

sort_cam_hit_phase_1#:

	;read dequeue_counter
	sram[read, $xfer_in/**/_DEQ_COUNTER_INDEX, eq_struct_addr, \
		(_DEQ_COUNTER_INDEX * 4), 1], sig_done[read_queue_data_done]


	ctx_arb[prev_th_sig, read_queue_data_done, get_round_request_done], \
			all, br[sort_phase_2#]

	;Phase 2
sort_cam_miss_phase_2#:

	;move data to local memory. Only the first 4 longwords of the queue structure
	;need to be moved because the rest are reserved words and the
	;dequeue counter
/*	
 #define lm_q_current_credit_used		l$index0[0]
 #define lm_q_last_round_scheduled		l$index0[1] //[19:16] port [11:0] round 
 #define lm_q_enqueued_packets_count	l$index0[2]

 #define lm_q_quantum_mask_and_shift	l$index0[3] //[31:8] mask, [7:0] shift
*/		

	alu[lm_q_current_credit_used, --, B, $xfer_in0]
 	alu[lm_q_last_round_scheduled, --, B, $xfer_in1]
 	alu[lm_q_enqueued_packets_count, --, B, $xfer_in2]
 	alu[lm_q_quantum_mask_and_shift, --, B, $xfer_in3]
	
.end

#endm

/*******************************************************************************/

#macro update_current_rounds_data()
	alu[lm_curr_round_minus2, --, B, lm_curr_round_minus1]
	alu[lm_curr_round_minus1, --, B, lm_curr_round]
	alu[lm_curr_round, lm_curr_round, +, 0x1]
	alu[lm_curr_round, port_round_mask, AND, lm_curr_round]
#endm

/*******************************************************************************/

#macro calc_round_and_update_queue_structure(final_port_and_round)
.begin
	.reg quantum_mask quantum_shift

	alu_shf[quantum_mask, --, B, lm_q_quantum_mask_and_shift, >>8]
	ld_field_w_clr[quantum_shift, 0001, lm_q_quantum_mask_and_shift]

	alu[--,$xfer_in[/**/_DEQ_COUNTER_INDEX/**/], -, lm_q_enqueued_packets_count]
	bne[queue_not_empty#]
	//if queue empty make last_bin=curr_round and credit used=0
	alu[lm_q_current_credit_used, --, B, 0]
	alu[lm_q_last_round_scheduled, --, B, lm_curr_round]

queue_not_empty#:
	alu[total_bytes_used, lm_q_current_credit_used, +, packet_length]

	alu[--, quantum_shift, OR, 0x0]
	alu_shf[final_port_and_round,--, B, total_bytes_used, >>indirect]

	alu[final_port_and_round, final_port_and_round, +, lm_q_last_round_scheduled]	
	; masking off the roll over bits
	alu[final_port_and_round, port_round_mask, AND, final_port_and_round]
	
	alu[lm_q_enqueued_packets_count, lm_q_enqueued_packets_count, +, 1]
	
	;the credit used in the current round is total bytes used MOD round
	;quantum. This is the same as "AND" total_bytes_used with (round_quantum - 1)	
	alu[lm_q_current_credit_used, total_bytes_used, AND, quantum_mask]	

	alu[--,final_port_and_round, -, lm_curr_round_minus1]
	beq[change_enqueue_round#]

	alu[--,final_port_and_round, -, lm_curr_round_minus2]
	bne[no_change_enqueue_round#]

change_enqueue_round#:
	alu[final_port_and_round, --, B, lm_curr_round]

no_change_enqueue_round#:
	alu[lm_q_last_round_scheduled, --, B, final_port_and_round]

.end

#endm
/******************************************************************************/
#macro activate_lm_curr_rounds_for_next_dequeue_port(port_curr_rounds_base, \
											$next_dequeue_round_request)
.begin
	.reg temp

	alu_shf[temp, port_curr_rounds_base, OR, $next_dequeue_round_request, \
			<<PER_PORT_CURR_ROUNDS_SHIFT]
	;needs 3 cycles after CSR write before current_round data can be read
	local_csr_wr[active_lm_addr_1, temp]
.end
#endm

/*******************************************************************************
* Sort macro
	Phase 1

	If no message on NN-ring
		1)	Check Scratch ring for New Round Request. 
		2)	If there is message
				a.	Read message
				b.	Wait for (scratch read done  && prev_thread)
		    Else
				Go back to the beginning
	Else
		1)	Read message on NN-ring
		2)	Extract packet length, queue number, port number, SOP from the message 
		3)	Calculate queue address in SRAM. Use Queue address as tag to look up CAM.
		4)	If miss
			a.	Evict LRU
			b.	Issue read for queue info
			c.  Read queue dequeue counter in SRAM 
			d.	Check Scratch ring for New Round Request. 
			e.	If there is message
					Read message
					Wait for (done_writeback && done_queue_struct_read && \
								done_counter_read && scratch_read_done && prev_thread)
				Else
					Wait for (done_writeback && done_queue_struct_read && \
								done_counter_read && prev_thread)
	Phase 2
	5)	Move queue data to lmem

	6)  Compare enqueue and dequeue packets counter. 
		If enqueue counter == dequeue counter --> queue is empty
			last round scheduled = current round 	
	7)	Compute round number for enqueued packet with reference from last
		 	round scheduled
	8)	Update queue structures: credit used, Queue Count, Last round scheduled.
	9)	Send message on NN-ring
	10)	Increment current round (only for a valid New Round Request)

********************************************************************************/

#macro sort ()

.begin
	.reg 	sop_handle packet_length port_number queue_number
	.reg 	port_queue_data_base this_port_curr_rounds_addr final_port_and_round

	.reg 	eq_struct_addr addr_out total_bytes_used enqueue_round
	.reg 	 $next_dequeue_round_request temp

	//Allocate xfer regs to read in new queue structure in CAM miss case
	//and 1 more for dequeue_counter
 
	#define_eval _NUM_REGS	(QUEUE_STRUCTURE_SIZE / 4)
	#define_eval _DEQ_COUNTER_INDEX	(_NUM_REGS 	- 1)	
	xbuf_alloc($xfer_in, _NUM_REGS, read)
	xbuf_alloc($xfer_out,_NUM_REGS, write)

	;Phase 1
new_phase_start#:

	;check scratch ring for NEW ROUND REQUEST message from COUNT or PORT SCHEDULER ME
	read_sratch_ring($next_dequeue_round_request)

	local_csr_wr[same_me_signal, next_th_signal]	
	br_inp_state[NN_EMPTY, no_enqueue_msg#]

	read_nn_ring_msg_and_calc_addrs(packet_length, queue_number, port_number, \
		sop_handle, this_port_curr_rounds_addr, eq_struct_addr, port_queue_data_base)
	
	local_csr_wr[active_lm_addr_1, this_port_curr_rounds_addr]
	
    //CAM lookup and read queue structure
	cam_lookup_and_get_queue_structures(eq_struct_addr, $xfer_in, $xfer_out)

sort_phase_2#:

	calc_round_and_update_queue_structure(final_port_and_round)

	;Take care of the next dequeue round request from scratch ring
	;First, check for invalid new request message
	br_bclr[$next_dequeue_round_request, 31, invalid_next_dequeue_round_request#]

valid_next_dequeue_round_request#:
	;if valid message, then calculate local memory address for current rounds data
	activate_lm_curr_rounds_for_next_dequeue_port(port_curr_rounds_base, \
											$next_dequeue_round_request)

	write_to_nn_ring(sop_handle, final_port_and_round, lm_curr_round)

	update_current_rounds_data()

	br[new_phase_start#]

invalid_next_dequeue_round_request#:
	
	alu[temp,--, B, 1, <<31]	;invalid response to Next Dequeue Round Request
	write_to_nn_ring(sop_handle, final_port_and_round, temp)

	br[new_phase_start#]

no_enqueue_msg#:
	;If NN-ring is empty, swap out until scratch ring get is done
	ctx_arb[prev_th_sig, get_round_request_done], all

	;Upon waking up, check for invalid new request message
	br_bclr[$next_dequeue_round_request, 31, new_phase_start#]
	
	alu[sop_handle, --, B, 0]
	;no enqueue message was received, so mark the enqueue request invalid
	alu[final_port_and_round, --, B, 1, <<31]
	
	br[valid_next_dequeue_round_request#]

done#:
	#undef _NUM_REGS
	#undef _DEQ_COUNTER_INDEX
.end

#endm

/////////////////////////////////////////////////////////////////////////////////////////

/////////////////////////////////////////////////////////////////////////////////////////
//
// This is where the code begins..................
//
/////////////////////////////////////////////////////////////////////////////////////////
main#:
.begin

	.reg zero
	.reg port_round_mask
	.reg queue_structures_base port_curr_rounds_base
	
	sort_init()
	
	// Here is where the main loop begins
	.while (1)
		sort();
	.endw

	//Should never get here. This instruction is to avoid warning 383 
	nop
.end

/////////////////////////////////////////////////////////////////////////////////////////

#endif // __SORT_UC__

/////////////////////////////////////////////////////////////////////////////////////////