Design document for certain parts of TCP implementation
-------------------------------------------------------
			Index
			-----

	1. Some General Notes
	2. Segment Receive Handling
	3. Segment Send Handling
	4. Retransmission
	5. Urgent Data Processing
	6. PUSH Handling
	7. Buffer Management
	8. Connection Termination
	9. Flow Control
	10. Delayed Acknowledgements
	11. Nagle's Algorithm
	12. Silly Window Syndrome (SWS) Avoidance
	13. Sliding Window
	14. ICMP Support
	15. Probing Zero Windows (Persist Timer)
	16. Integration with RouterWare
	17. Initialization of Connection Record Fields
	18. Aborting Connections
	19. ?? Points
	20. FIN handling
	21. Security and Precedence
	22. Sequence Numbers
	23. Determining the available free space in the send window
============================================================================
1. Some General Notes
---------------------

The way RouterWare operates, all processing always takes place in
the foreground, with each process getting its turn one by one. SO IN
ALL PLACES, THERE IS NO NEED TO TAKE CARE OF CRITICAL SECTIONS OR
MULTIPLE ACCESS TO CRITICAL DATA SPACE.

(At least this assumption is made in the code)

=============================================================================
2. Segment Receive Handling
---------------------------

During init a receive window is set up. The window is a physical buffer.
On each ACK or other packet sent from the receiving side to the sender,
the CURRENT window size is advertised. The receive buffer is used as a 
circular queue with the indices 'last' and 'first'. EACH CONNECTION has 
a receive window.


Data Space:

Each connection record has the following fields related to receive.

	buf_size- SIZE of the full receive window; always constant
	first - INDEX to the first location in receive window from where 
		an app's socket receive can pick up bytes; updated only
		when app picks up bytes
	filled - COUNT of bytes from 'first' that are valid for
		being picked up by a socket receive; updated on
		segments received and app picking up bytes
	nxt   - SEQUENCE NUMBER of next segment expected from TCP sender;
		updated on successful segment receives
	buf   - ptr to receive window buffer; always constant

Always,
	Number of bytes received = 'filled'
	Next advertised window size = Full window size - 'filled'
	Next ack number = 'nxt'
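
The bookkeeping above can be sketched roughly as below. This is only an
illustration under assumptions: the struct name 'rx_window' and the
function names are hypothetical and are not taken from the actual
sources; only the field names follow the list above.

```c
#include <stddef.h>

/* Hypothetical receive-side record; field names follow the list above. */
struct rx_window {
    unsigned char *buf;   /* receive window buffer; always constant */
    size_t buf_size;      /* size of the full receive window; constant */
    size_t first;         /* index of first byte the app can pick up */
    size_t filled;        /* count of valid bytes starting at 'first' */
    long nxt;             /* next sequence number expected */
};

/* Next advertised window size = full window size - 'filled'. */
size_t rx_advertised_window(const struct rx_window *w)
{
    return w->buf_size - w->filled;
}

/* Socket receive: copy out up to 'len' bytes, wrapping at the buffer end.
   Only 'first' and 'filled' change; 'nxt' is untouched by the app. */
size_t rx_app_read(struct rx_window *w, unsigned char *dst, size_t len)
{
    size_t n = (len < w->filled) ? len : w->filled;
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = w->buf[(w->first + i) % w->buf_size];
    w->first = (w->first + n) % w->buf_size;
    w->filled -= n;
    return n;
}
```

Note that the advertised window grows only when the app picks up bytes,
since that is the only time 'filled' goes down.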

Acknowledgement Handling:

1. ACK for a received segment will be done by a timer process or any
   other send that may take place (piggy-backing).
2. On a segment received and filled into the receive window, set a flag
   indicating that an ACK is due. The timer process will delay and
   ACK appropriately.
3. If a received segment does not indicate a proper receive, but instead
   reveals a lost segment or a checksum error, force an IMMEDIATE ACK and
   don't flag a delayed ACK.


Special Decisions:

1. Drop segments that can cause holes in the receive window (i.e. the
   new segment falls within the receive window, but some segment in
   between is not received yet).

   This is opposed to what the RFC says.
   REASON:
	It is a pain to keep track of such holes and to make sure that
	packets are received and copied into the window properly.
   DANGER:
	If the sending TCP has some algorithm (the RFC tells how to
	detect if only a segment or part of it in between a window is
	lost) to detect lost packets at the send end, it may take time
	to send.

   When a segment is dropped this way, if the segment has the FIN control
   bit set, it is ignored in the hope that the other end will retransmit
   the FIN bit too. We could keep the info for delayed action perhaps, but 
   this is easier.

2. The receive window size will be 1024+512 = 1536 bytes (just > 1518,
   the ethernet max). This is because the only TCP app we support is
   TELNET which deals most of the time with small packets.

Initialization:

Program Init
------------
For the specified max number of connections allowed, receive window
buffers are allocated for each connection (each of 1536 bytes) and
ptrs in the connection record are set to point to the respective
buffers. 'wnd' is filled. filled=first=nxt=0.

Connection init
---------------
nxt=init receive sequence number


Actions on Segment reception
----------------------------

1. Check if the received packet or PART of it will fall within the 
   receive window.
2. If so, check if the sequence number is same as expected.
3. If so, copy all or part of the packet, update 'nxt', flag for an
   ACK.
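
The three steps above, together with the decision to drop segments that
would cause holes, might look roughly like this. It is a sketch under
assumptions: 'rx_conn', 'rx_segment' and the 'ack_due' flag are
hypothetical names, only data-carrying segments are considered, and
sequence number wraparound is ignored for brevity.

```c
#include <stddef.h>

enum rx_result { RX_DROPPED, RX_ACCEPTED };

/* Hypothetical receive-side state for one connection. */
struct rx_conn {
    unsigned char *buf;
    size_t buf_size;
    size_t first;
    size_t filled;
    long nxt;
    int ack_due;   /* hypothetical flag: a delayed ACK is pending */
};

enum rx_result rx_segment(struct rx_conn *c, long seq,
                          const unsigned char *data, size_t len)
{
    size_t room = c->buf_size - c->filled;
    size_t n = (len < room) ? len : room;
    size_t tail = (c->first + c->filled) % c->buf_size;
    size_t i;

    /* Steps 1 and 2: take only a segment that starts exactly at 'nxt'
       and for which there is room; anything else is dropped. */
    if (seq != c->nxt || n == 0)
        return RX_DROPPED;

    /* Step 3: copy all or part of the data, update 'nxt', flag an ACK. */
    for (i = 0; i < n; i++)
        c->buf[(tail + i) % c->buf_size] = data[i];
    c->filled += n;
    c->nxt += (long)n;
    c->ack_due = 1;
    return RX_ACCEPTED;
}
```

When only part of a segment fits, 'nxt' advances by the part copied, so
the ACK asks the sender to resend the rest.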


=============================================================================
3. Segment Send Handling
------------------------

Send can be viewed in many ways --

1. Send as initiated by an app using a SEND or WRITE socket call.
2. Send as initiated by an internal TCP process to retransmit a previous
   segment.
3. Send as initiated by an internal TCP process that finds that a 
   send has to be done.

Internally, TCP will maintain a send window. This is a physical buffer
of size 1536 (enough for TELNET). When a socket SEND or WRITE is done,
the send or write will fill this send window to the extent possible and
return an appropriate value to the calling app (0 or the number of
bytes copied to the send window). A separate process will periodically
check the send window and decide if an actual physical send has to
be done. If so, the sending action is done. As per specs, send is done
to the window size and then an ACK is waited for. When an ACK comes in,
the send window is updated accordingly. On each physical send, a 
retransmit timer is started. When the timer expires, if the ACK has
not come in, the same segment is retransmitted.

Each connection record has the following fields related to send.

	una	- last unacknowledged SEQUENCE number; updated whenever
		  a proper ACK is received to the value ACK'ed
	nxt	- next SEQUENCE number to use; updated on a physical
		  packet send to current value + size of data sent;
		  every SYN bit counts as one more sequence number
	filled	- COUNT of bytes freshly added to the window to send;
		  updated when an app fills the send window thro' a 
		  socket call (incr) and when a segment is sent out on the
		  network the first time (decr); beware -- this does not
		  tell how many bytes are in the send window
	buf	- PTR to the send window buffer; always constant
	bstart  - INDEX to first UNACK'ed byte in buffer; retransmissions
		  will be considered from this point onwards and this
		  corresponds to the 'una' sequence number; updated when
		  'una' is updated
	bnext   - INDEX to first freshly added byte (from where 'filled'
		  count starts) and this corresponds to the 'nxt' 
		  sequence number; updated every time 'nxt' is updated

	buf_size- MAX buffer space for the send window; always constant
	wnd	- SIZE of the CURRENT send window; variable size; 
		  updated from received packets when it is safe to
		  do so -- i.e. when an ACK signalling that all of the
		  window was successfully received arrives
	lwseq	- SEQUENCE number of incoming segment the last time
		  the window size 'wnd' was updated
	lwack	- ACK SEQUENCE number of incoming segment the last time
		  the window size 'wnd' was updated
	mss	- max seg size as learnt from window advertisements from
		  the receiving end; updated each time a segment is got
		


Computation,
	nxt - una : number of bytes yet to be ack'ed and so possibly
		included for retransmission
	filled : fresh bytes added; in case of a retransmission, number
		of bytes sent = (nxt - una) + filled
Always,
	Send will take place so long as there are bytes within the
	send window
	The "current send window" is the filled portion of the "full
	send window"
	The "full send window" can be throttled from the receiving
	side using window advertisements. 
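
How the sequence numbers and the buffer indices move in pairs can be
sketched as below. The struct and function names are hypothetical, the
SYN/FIN sequence number accounting is left out, and wraparound of the
sequence space is ignored here for brevity (see point 22).

```c
#include <stddef.h>

/* Hypothetical subset of the send-side fields listed above. */
struct send_state {
    long una;          /* sequence number of first unack'ed byte */
    long nxt;          /* next sequence number to use */
    size_t bstart;     /* buffer index corresponding to 'una' */
    size_t bnext;      /* buffer index corresponding to 'nxt' */
    size_t filled;     /* fresh bytes not yet sent even once */
    size_t buf_size;   /* size of the send window buffer */
};

/* First physical send of n fresh bytes: 'nxt' and 'bnext' move together
   and 'filled' is decremented. */
void send_fresh(struct send_state *s, size_t n)
{
    if (n > s->filled)
        n = s->filled;
    s->nxt += (long)n;
    s->bnext = (s->bnext + n) % s->buf_size;
    s->filled -= n;
}

/* Proper ACK: 'una' and 'bstart' move together. Returns bytes ack'ed,
   or 0 if the ACK is old or acknowledges data never sent. */
size_t send_acked(struct send_state *s, long ack)
{
    long n = ack - s->una;

    if (n <= 0 || ack > s->nxt)
        return 0;
    s->una = ack;
    s->bstart = (s->bstart + (size_t)n) % s->buf_size;
    return (size_t)n;
}
```

After any sequence of these calls, nxt - una is the number of bytes
still waiting for an ACK, as in the computation above.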

Special Decisions
-----------------
1. The send window will have a max of 1536 bytes and all incoming
   window adverts will be shortened to this size if required. Our
   app is only TELNET after all.

Maybes:
-------
1. Maybe we need not support using the Window advertisements and
   can always NOT allow the sliding window to shrink or grow. Our
   app is just telnet which will not put so much demand on the
   network. Prob. is what will people think if they notice that
   we are doing such a thing, huh?

Initialization
--------------
Program Init:
-------------
For the specified max number of connections allowed, send window
buffers are allocated for each connection (each of 1536 bytes) and
ptrs in the connection record are set to point to the respective
buffers. 'wnd' and 'buf_size' are filled. filled=first=nxt=0.


Connection init
---------------
wnd = window size
nxt = isn


==========================================================================
4. Retransmission
-----------------
As soon as a segment is transmitted, info regarding the segment is
entered into a structure (RTX_INFO_RECORD) and the structure queued
in a queue of segments that have been transmitted on a connection and
not yet acknowledged.

The timer process checks if this queue is non-empty and increments
a counter. When the counter reaches a max value (RTO value), a
retransmit function is called. In the process the bit-flag TCPF_RTXON
is set. When this flag is on, Karn's Algorithm is in progress with
regard to estimation of round-trip-time.

Whenever a segment with an ACK comes in, the ACK number is checked
to see which segments in the retransmit queue are acknowledged; the
retransmit timer may then be stopped and the acknowledged segments
deleted from the queue.

Retransmission uses the following fields in each connection record

	rtx_retries: number of times a retransmit will be retried; if
		this reaches a configured max, the connection will be
		aborted
	rtx_counter: incremented on each timer tick if there are any
		queued items in the retransmit queue; if this reaches
		a max value as signified by the retransmit timeout, then
		retransmission is attempted
	rtx_timeout: computed estimate of the timeout to be used on
		retransmission; computed from the RTT estimates using
		Jacobson's algorithm
	conn_flags: the bit TCPF_RTXON set indicates Karn's algorithm is in 
		action.
	rtx_queue: ptr to linked list of segments awaiting acknowledgement.
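
A minimal sketch of how the timer process could drive these fields on
each tick is given below. Assumptions: the struct, the enum and the
function 'rtx_tick' are hypothetical, 'queue_len' stands in for a
non-empty 'rtx_queue', 'max_retries' stands in for the configured max,
and the bit value of TCPF_RTXON is made up for the example.

```c
#define TCPF_RTXON 0x01u   /* bit value is an assumption */

struct rtx_state {
    int queue_len;         /* stand-in for a non-empty 'rtx_queue' */
    int rtx_counter;       /* ticks since the timer was (re)started */
    int rtx_timeout;       /* retransmit timeout, in ticks */
    int rtx_retries;       /* retransmits attempted so far */
    int max_retries;       /* hypothetical configured max */
    unsigned conn_flags;
};

enum rtx_action { RTX_NONE, RTX_RESEND, RTX_ABORT };

/* Called once per timer tick for each connection. */
enum rtx_action rtx_tick(struct rtx_state *s)
{
    if (s->queue_len == 0) {          /* nothing awaiting an ACK */
        s->rtx_counter = 0;
        return RTX_NONE;
    }
    if (++s->rtx_counter < s->rtx_timeout)
        return RTX_NONE;              /* timeout not reached yet */

    s->rtx_counter = 0;               /* restart the timer */
    if (s->rtx_retries >= s->max_retries)
        return RTX_ABORT;             /* retries reached the max */
    s->rtx_retries++;
    s->conn_flags |= TCPF_RTXON;      /* Karn's algorithm now in action */
    return RTX_RESEND;
}
```

The caller would retransmit the first queued segment on RTX_RESEND and
abort the connection on RTX_ABORT.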

Special Decisions
-----------------
1. The RFCs are not clear as to what to put a retransmit timer on --
   just the first unacked segment or all segments in the retransmit queue.
   And the sample sources are a pain to wade through. Comer's
   implementation talks of "accepted implementation strategy" of having
   a retransmit timer on only the first unacked segment (the first
   segment in the retransmit queue).

   So we too will have a single retransmit timer (for the first segment
   alone) rather than one for each unacked segment in the retransmit
   queue. When the timer expires, the first segment in the retransmit
   queue will be retransmitted and the timer restarted for the same
   segment. If the segment is subsequently ack'ed, then the retransmit
   timer will be restarted for the next element in the retransmit queue.
   This will actually increase the length of time we wait for an ack for
   the next element in the retransmit queue.
2. "Repacketization" as mentioned in RFC 1122, Host Requirements, will
   not be done on retransmission. Simpler to implement. Next version
   can improve on this. Improvement is easy. Instead of keeping the full
   segment even though part of it is ack'ed, we could update globals
   suitably so that only the unacked part is kept track of. The only thing
   is to take care of the 'send_una_amount' variable properly (see point 23).

24, June, 1996 - Sanjay -- Some modification done so that if partial segment
is acked, that much free space is made available.

Retransmit timeout calculation
------------------------------

The method of calculation is the same as mentioned in RFC 1122 --
Jacobson's method combined with Karn's algorithm.

The only hitch is the way the current round trip time is estimated --
the time between a segment being sent and an ack received for the segment.

When a segment is sent, if the TCPF_TIMEACK flag in the 'conn_flags'
field is not set, the current timer tick count is stored in the
'last_seg_sent_tick' field of the connection record and the flag is
set. The next time an ACK comes in that does not ACK a retransmit but
has an ack value greater than or equal to the sequence number of the
segment that was sent, and the TCPF_TIMEACK flag is set, the flag is
turned off and a rough round trip estimate is made as below
	 current rtt = current tick count - last_seg_sent_tick
Using this estimate, Jacobson's algorithm is used to compute the
retransmit timeout. The field 'seg_for_rtt_estimate' will hold the sequence
number for which the current RTT will be estimated.

If the TCPF_RTXON flag is set then this estimation of time is not
done and Karn's algorithm of exponential backoff is done.

If a retransmitted packet gets an acknowledgement, the TCPF_RTXON flag
will be reset. Then the round-trip-time estimator will start working
again.

Initial round-trip-estimate is as mentioned in the RFCs.
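
For illustration, the estimator could be written in integer tick
arithmetic as below. This is a sketch, not the actual code: it uses the
commonly published scaled form of Jacobson's algorithm (gains of 1/8 and
1/4, RTO = smoothed RTT + 4 * mean deviation), and the struct, the
function names, the 'max_rto' parameter and the bit value of TCPF_RTXON
are all assumptions.

```c
#define TCPF_RTXON 0x01u   /* bit value is an assumption */

struct rtt_est {
    long srtt8;            /* smoothed RTT, scaled by 8, in ticks */
    long rttvar4;          /* mean deviation, scaled by 4, in ticks */
    long rto;              /* current retransmit timeout, in ticks */
    unsigned conn_flags;
};

/* Feed one round trip measurement into Jacobson's estimator. */
void rtt_sample(struct rtt_est *e, long current_tick, long last_seg_sent_tick)
{
    long m;

    if (e->conn_flags & TCPF_RTXON)
        return;                       /* Karn: no estimate while retransmitting */
    m = current_tick - last_seg_sent_tick;   /* current rtt */
    m -= e->srtt8 / 8;                /* error against smoothed RTT */
    e->srtt8 += m;                    /* srtt += err / 8 */
    if (m < 0)
        m = -m;
    m -= e->rttvar4 / 4;
    e->rttvar4 += m;                  /* rttvar += (|err| - rttvar) / 4 */
    e->rto = e->srtt8 / 8 + e->rttvar4;
}

/* Karn: on a retransmit, back off exponentially up to a cap. */
void rtt_backoff(struct rtt_est *e, long max_rto)
{
    e->conn_flags |= TCPF_RTXON;
    e->rto *= 2;
    if (e->rto > max_rto)
        e->rto = max_rto;
}

/* A retransmitted packet was ack'ed: the estimator may work again. */
void rtt_retransmit_acked(struct rtt_est *e)
{
    e->conn_flags &= ~TCPF_RTXON;
}
```

With a smoothed RTT of 10 ticks and a mean deviation of 2 ticks, a new
sample of 18 ticks moves the smoothed RTT to 11 and the RTO to 25.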

Retransmission of SYN and FIN segments
--------------------------------------
The way we have implemented our TCP, SYN and FIN segments travel alone
and never piggy-backed with data. So whenever a SYN or FIN segment is
sent, such segments are also added to the retransmit queue, but with 0
segment length. On retransmission, if a 0 segment length segment is
found, and the current state is anything before "established", the
segment is treated as indicating a SYN segment. Else it is treated as
indicating a FIN segment. Hope this works. If piggy-backing is used,
then this approach has to be redesigned. Maybe you will need one more
field in the retransmit info record.
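
The rule for zero-length entries can be sketched as follows. The state
enum and the function name are hypothetical; the only thing assumed
about the real state numbering is that states before "established"
compare lower than ESTABLISHED.

```c
/* Hypothetical state numbering: states before ESTABLISHED come first. */
enum tcp_state {
    ST_CLOSED, ST_LISTEN, ST_SYN_SENT, ST_SYN_RCVD,
    ST_ESTABLISHED, ST_CLOSE_WAIT, ST_LAST_ACK
};

enum rtx_kind { RTXK_DATA, RTXK_SYN, RTXK_FIN };

/* Decide what a queued retransmit entry stands for. */
enum rtx_kind rtx_classify(enum tcp_state state, unsigned seg_len)
{
    if (seg_len != 0)
        return RTXK_DATA;
    /* Zero length: SYN before "established", FIN otherwise. */
    return (state < ST_ESTABLISHED) ? RTXK_SYN : RTXK_FIN;
}
```

This works only because SYN and FIN segments never carry data here; a
piggy-backed design would need an explicit kind field in the record.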
=========================================================================
5. Urgent Data Processing
-------------------------
Urgent data processing has a lot of hassles as mentioned in Comer's
book (Internetworking with TCP/IP, Volume 2, Chapter 15). For our
purposes (TELNET implementation), we do not need to do all the
jugglery as mentioned there. We will adopt a simple-simple approach
as described below. Sample Telnet clients should be tested out for
this purpose thoroughly on URGent data.

BASIC ASSUMPTION: 
	THE ONLY TCP APPLICATION USING OUR TCP WILL BE A TELNET APP.

When a segment with the URG bit set comes in, we call an application
(TELNET) registered function informing the app that there was some
urgent data. The app function simply increments a counter based on
how many times the function was called. This counter is decremented 
whenever a "Data Mark" (DM) is encountered in the received stream. This 
much is sufficient for TELNET.

The urgent pointer field IS IGNORED and the urgent data is added to
the normal data stream (receive window). Since we ignore the urgent
pointer field, we also neatly side-step the BSD implementation versus 
the other TCP implementation problems.

The TCP_PER_CONN structure will have a field that will point to a callback
structure. Apps can register a callback function thro' a call provided in
the TCP software. This call should be called after socket() and bind() calls.
Call is register_urg_func_with_tcp(socket_descriptor).
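
The callback and counter scheme might be sketched as below. All names
here are hypothetical ('urg_callback', 'tcp_conn_urg',
'telnet_session' and the functions); the real callback structure and
the full signature of register_urg_func_with_tcp() are not shown in this
document, so registration is represented only by filling in the pointer.

```c
#include <stddef.h>

/* Hypothetical callback structure pointed to by the connection record. */
typedef void (*urg_func)(void *app_ctx);

struct urg_callback {
    urg_func func;
    void *app_ctx;
};

struct tcp_conn_urg {
    struct urg_callback *cb;   /* NULL if the app registered nothing */
};

/* TCP side: called when a segment with the URG bit set comes in. The
   urgent pointer itself is ignored; the data goes into the normal
   receive window. */
void tcp_on_urg_segment(struct tcp_conn_urg *c)
{
    if (c->cb != NULL && c->cb->func != NULL)
        c->cb->func(c->cb->app_ctx);
}

/* App (TELNET) side: a simple counter. */
struct telnet_session {
    int urgent_pending;
};

void telnet_urg_notify(void *ctx)
{
    ((struct telnet_session *)ctx)->urgent_pending++;
}

/* Called when a "Data Mark" (DM) is seen in the received stream. */
void telnet_on_data_mark(struct telnet_session *s)
{
    if (s->urgent_pending > 0)
        s->urgent_pending--;
}
```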

BUT:
What happens when the receive window is throttled to a size of 0?
As per specs we still need to do out-of-band processing, but here we
can't do so.

=============================================================================
6. PUSH Handling
----------------

If PSH bit is set in an incoming segment, when the segment is processed
a flag (TCPF_RECVPSH) is set in the connection record. This flag is
acted upon when a socket receive is called and reset then.

All socket send calls result in a flag (TCPF_SENDPSH) being set. If this
flag is set, the send routine sends all the data in the send window and
then resets the flag.

The flag is a bit flag in the 'conn_flags' field of the connection
record.

Note that the method of handling PSH on receives will work without
the jugglery as described in Comer's book (Internetworking with TCP/IP,
Volume 2, Chapter 15), because we will not be queueing segments that
arrive out of sequence.

=============================================================================
7. Buffer Management
--------------------

Socket sends fill up the send window. The send process checks on the
send window and does a send by alloc'ing a buffer (tcp_get_user_buffer()),
filling it and sending it. This buffer sent is queued up in the retransmit
queue. The transmit completion post routine will NOT free the sent buffer
as it is still needed for retransmit purposes. The buffer will only be
freed (tcp_free_user_buffer()) when ACK for the segment queued for
retransmit is received.

Also when aborting connections, all queued retransmit buffers will be
freed.

Send and receive window buffers are allocated when the app attempts to
setup a connection. Connection is disallowed if a buffer cannot be
allocated. This scheme does not lock up buffers if connections do not
exist.

=============================================================================
8. Connection Termination
-------------------------
Since only a TELNET app is envisaged, the connection termination is only
a passive close operation which causes the TCP state machine to move from
the ESTABLISHED state to the CLOSE_WAIT state to the LAST_ACK state.

In the CLOSE_WAIT state, the server will issue a socket CLOSE call. The
close will set a bit (TCPF_SENDFIN) in the connection 'conn_flags' field.
TCP will then send out a FIN segment as a separate segment (without any
data or other control bits). This segment is sent only when all data in
the send window has been sent out (though may not be ack'ed yet) and the
TCPF_SENDFIN flag is set when checking in the router foreground. The
FIN segment sent out is also added to the retransmit queue to wait for
an ACK in response.
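
The condition checked in the foreground could be sketched as below. The
struct and function names are hypothetical, and the bit value of
TCPF_SENDFIN is an assumption made for the example.

```c
#define TCPF_SENDFIN 0x01u   /* bit value is an assumption */

struct close_conn {
    unsigned conn_flags;
    unsigned send_filled;    /* fresh bytes not yet sent even once */
};

/* Socket CLOSE: only records the request; nothing is sent here. */
void tcp_request_close(struct close_conn *c)
{
    c->conn_flags |= TCPF_SENDFIN;
}

/* Foreground check: returns 1 when the FIN segment should go out, i.e.
   the flag is set and all data has been sent (ack'ed or not). */
int tcp_fin_ready(const struct close_conn *c)
{
    return (c->conn_flags & TCPF_SENDFIN) != 0 && c->send_filled == 0;
}
```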

===========================================================================
9. Flow Control
---------------

The congestion control algorithms are not implemented (as we will
be having only a TELNET session). Consequently the send window won't
close to 0 perhaps.

============================================================================
10. Delayed Acknowledgements
----------------------------

ACK's will be delayed by 200ms as in BSD unix (explained in R. Stevens --
TCP/IP Illustrated, Vol. 1). There is a single timer that goes off every
200ms. When this timer expires, all sessions are checked for pending
ACK's and an ACK segment is sent out.

20, July, 1996 - Sanjay -- Changed strategy

Now we ACK immediately if the incoming segment has the PUSH bit set.
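
A sketch of the resulting strategy is given below. The flag name
TCPF_ACKDUE, its bit value, the struct and the function names are all
hypothetical; 'acks_sent' merely stands in for actually sending an ACK
segment.

```c
#define TCPF_ACKDUE 0x01u    /* hypothetical flag name and bit value */

struct ack_conn {
    unsigned conn_flags;
    int acks_sent;           /* stand-in for sending an ACK segment */
};

static void send_ack_now(struct ack_conn *c)
{
    c->conn_flags &= ~TCPF_ACKDUE;
    c->acks_sent++;
}

/* On a data segment: ACK at once if PUSH is set, else flag for later. */
void ack_on_data_segment(struct ack_conn *c, int psh_set)
{
    if (psh_set)
        send_ack_now(c);
    else
        c->conn_flags |= TCPF_ACKDUE;
}

/* The single 200ms timer: check all sessions for pending ACKs.
   Returns the number of ACKs sent. */
int ack_timer_200ms(struct ack_conn *conns, int count)
{
    int i, sent = 0;

    for (i = 0; i < count; i++) {
        if (conns[i].conn_flags & TCPF_ACKDUE) {
            send_ack_now(&conns[i]);
            sent++;
        }
    }
    return sent;
}
```

Because there is one timer for all sessions, the actual delay for a
given ACK is anywhere between 0 and 200ms.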

============================================================================
11. Nagle's Algorithm
---------------------

NOT IMPLEMENTED -- THOUGH COULD BE BENEFICIAL TO THE TELNET APP.
BUT MOSTLY BENEFICIAL AT THE CLIENT END.

============================================================================
12. Silly Window Syndrome (SWS) Avoidance
-----------------------------------------

Receive side:
-------------
The RFC's and the sample implementations are confusing and not clear on
this point. Our implementation is as per the explanation in Richard Stevens'
book TCP/IP Illustrated, Vol. 1. 

Related fields in connection record:

	recv_last_advert - last advertised window size from the receive side

Send side:
----------
Incorporated in a minimal form. See "tcp_send_fgnd()".

Related fields in connection record:

	send_max_advert - max of received send window advertisements

============================================================================
13. Sliding Window
------------------
Note that in our current design, sliding window is implicit. Packets
will be sent to window size. Window is unusable as long as the sent
packets are not acknowledged (our retransmit stuff is kept in the
send window itself).
============================================================================
14. ICMP Support
----------------
ICMP packets are not supported by the TCP stack. Next version. Now we
have just a server application that responds to successful clients.
ICMP messages that actually need to be supported are
- Source Quench
	Congestion control message
- Destination Unreachable
	Inform socket app or abort connection based on error code
- Time Exceeded
	Inform socket app
- Parameter Problem
	Inform socket app
============================================================================
15. Probing Zero Windows (Persist Timer)
----------------------------------------

Our implementation is as below (simple and neat)
1. When a valid zero window advertisement is got, if the retransmit queue
   is not empty, the contents of the retransmit queue will behave as a 
   window probe on each retransmit. So nothing is done. If the retransmit
   queue is empty, a persist timer is started. 
2. When the persist timer expires, and there is some data to be sent,
   a 1-byte data packet is sent and added to the retransmit queue.
   If there is no data to be sent nothing is done.
3. On a socket send, if something enters the send window and the 
   retransmit queue is empty and the send window size is 0 and the
   persist timer is 0, a 1 byte probe packet is prepared and sent and
   added to the retransmit queue.
4. From here on, the retransmit mechanism takes over. The only variation
   is that if the send window is 0, the retransmit will go on forever
   without aborting connection.
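
The four rules above can be sketched as a set of small decision
functions. The struct, the enum and the function names are hypothetical;
'rtx_queue_len' stands in for the state of the retransmit queue and a
'persist_timer' of 0 means the timer is not running.

```c
enum zw_action { ZW_NOTHING, ZW_START_PERSIST, ZW_SEND_PROBE };

struct zw_conn {
    int rtx_queue_len;        /* entries waiting in the retransmit queue */
    unsigned send_wnd;        /* current send window size */
    unsigned send_filled;     /* fresh bytes waiting to be sent */
    unsigned persist_timer;   /* 0 means not running */
};

/* Rule 1: a valid zero window advertisement has arrived. */
enum zw_action zw_on_zero_advert(const struct zw_conn *c)
{
    /* Queued retransmits already act as window probes. */
    return (c->rtx_queue_len > 0) ? ZW_NOTHING : ZW_START_PERSIST;
}

/* Rule 2: the persist timer has expired. */
enum zw_action zw_on_persist_expiry(const struct zw_conn *c)
{
    return (c->send_filled > 0) ? ZW_SEND_PROBE : ZW_NOTHING;
}

/* Rule 3: a socket send has just put data into the send window. */
enum zw_action zw_on_socket_send(const struct zw_conn *c)
{
    if (c->send_filled > 0 && c->rtx_queue_len == 0 &&
        c->send_wnd == 0 && c->persist_timer == 0)
        return ZW_SEND_PROBE;
    return ZW_NOTHING;
}

/* Rule 4: retransmit goes on forever while the send window is 0. */
int zw_may_abort_on_retries(const struct zw_conn *c)
{
    return c->send_wnd != 0;
}
```

A ZW_SEND_PROBE result means a 1-byte data packet is sent and added to
the retransmit queue, after which the retransmit mechanism takes over.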

Special decision
----------------
1. The initial persist timer value is the same as the initial retransmit
   timeout.

The above is not exactly the mechanism specified. But it will work.
============================================================================
16. Integration with RouterWare
-------------------------------

Specialities are
	1. Initialization method - The init function needs to be
		supplied to a init table maintained in global data space.
	2. Configuration method - RouterWare has a different kind of
		string based configuration methodology that we support.
	3. How TCP stack gets control -
		On packet reception - Rx routine is called by IP layer
		On timer tick - the timer function is called which also
			behaves as our foreground
		A control entry point - called sometimes (mostly during
			init) by LSL.
============================================================================
17. Initialization of Connection Record Fields
----------------------------------------------
Conn. record fields are inited when you acquire a free connection record.
Init is done to default. It is at these times that the send and receive
windows (buffers) are allocated.

When a record is freed, the corresponding buffers are deallocated.

During the progress of the connection state machine, the fields will
change values.
============================================================================
18. Aborting Connections
------------------------

(Understood this a little late in the game)

When aborting TCP connections, the connection record state is 
changed to the CLOSED state. CLOSED state records are not usable by
anyone for any purpose. These are not actually free. CLOSED state
records will become FREE (usable for future connections) when the
socket user makes a socket call and realises that the connection
has closed due to some error.

============================================================================
19. ?? Points
-------------
Since the TCP is assumed to be used only by a server app, in certain
areas of the code, support for TCP client software is not fully
implemented or the support is weak. Such areas are commented with the
comments appearing between double question marks (??). Search the
text for such comments if implementation is desired.
============================================================================
20. FIN handling
----------------
We will accept a FIN in a segment only if it comes after all data has been
acknowledged.
============================================================================
21. Security and Precedence
---------------------------
NOT IMPLEMENTED
============================================================================
22. Sequence Numbers
--------------------

All sequence number fields and variables should be of type TCPSEQ. This
is a signed long to take care of comparisons in the face of wraparounds
in the integer number space. See Douglas Comer's book Internetworking with
TCP/IP Vol 2. (the first chapter on TCP) for an explanation on this.
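
The idea can be illustrated as below. This is a sketch under an
assumption: TCPSEQ is defined here as a 32-bit signed type (int32_t),
which matches a signed long only on platforms where long is 32 bits.
The helper names are hypothetical. The subtraction is done in unsigned
arithmetic to avoid signed overflow, and the result is then viewed as
signed.

```c
#include <stdint.h>

typedef int32_t TCPSEQ;   /* assumed 32-bit signed; see note above */

/* Signed distance from b to a in the circular sequence space. */
TCPSEQ seq_diff(TCPSEQ a, TCPSEQ b)
{
    return (TCPSEQ)((uint32_t)a - (uint32_t)b);
}

int seq_lt(TCPSEQ a, TCPSEQ b)  { return seq_diff(a, b) < 0; }
int seq_leq(TCPSEQ a, TCPSEQ b) { return seq_diff(a, b) <= 0; }
int seq_gt(TCPSEQ a, TCPSEQ b)  { return seq_diff(a, b) > 0; }
int seq_geq(TCPSEQ a, TCPSEQ b) { return seq_diff(a, b) >= 0; }
```

With these, a sequence number just before the wrap point compares as
less than one just after it, which a plain unsigned compare would get
wrong. The comparison is meaningful only while the two numbers are less
than half the sequence space apart.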

============================================================================
23. Determining the available free space in the send window
-----------------------------------------------------------

(Design change info added 23, June, 1996 -- Sanjay).
Initially we were calculating available free space using the 'send_bstart' 
and 'send_bnext' indices to compute the amount of unacknowledged data.
But when 'send_bnext' == 'send_bstart', the condition could be that the
buffer is either full or empty. So there was a bug such that on this
condition, we were assuming the buffer is empty and filling up the space
occupied by unacked data. 

So to be able to properly determine free space (we need this info during
a socket send call), a new field is added -- 'send_una_amount' to each
connection record. Whenever a regular send is done, this counter is
incremented by the send amount. And whenever an acknowledgement occurs,
this counter is decremented by the amount of data acknowledged. So it
will accurately reflect the amount of unacknowledged data. Using this
variable, free space in the send window can be computed as

  send_buf_size - (send_filled + send_una_amount)

'send_filled' keeps track of amount of data in the send window that has
not been sent even once.
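
The accounting described above can be sketched as follows. The field
names are the ones used in this section; the struct name and the
function names are hypothetical.

```c
#include <stddef.h>

struct send_space {
    size_t send_buf_size;     /* max buffer space for the send window */
    size_t send_filled;       /* bytes not yet sent even once */
    size_t send_una_amount;   /* bytes sent but not yet acknowledged */
};

/* Free space = send_buf_size - (send_filled + send_una_amount). */
size_t send_free_space(const struct send_space *s)
{
    return s->send_buf_size - (s->send_filled + s->send_una_amount);
}

/* Socket send: accept as much as fits; returns the bytes accepted. */
size_t send_app_fill(struct send_space *s, size_t len)
{
    size_t room = send_free_space(s);
    size_t n = (len < room) ? len : room;

    s->send_filled += n;
    return n;
}

/* Regular (first time) send of n bytes onto the network. */
void send_first_transmit(struct send_space *s, size_t n)
{
    if (n > s->send_filled)
        n = s->send_filled;
    s->send_filled -= n;
    s->send_una_amount += n;
}

/* Acknowledgement of 'acked' bytes; a partial ACK frees that much. */
void send_ack_received(struct send_space *s, size_t acked)
{
    if (acked > s->send_una_amount)
        acked = s->send_una_amount;
    s->send_una_amount -= acked;
}
```

Since neither counter depends on the buffer indices, the full and empty
cases are no longer confused when 'send_bnext' == 'send_bstart'.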

============================================================================


