Audio Packets

 
CU-SeeMe Development Group                                                     
Technical Document (Draft):  005
Protocol Feature Designer: Derived from Macintosh program Maven by Charlie Kline
Document Author: Richard Kennerly, rbk1@cornell.edu
Revised: February 28, 1996

                          CU-SeeMe Audio

Status:
-------
Implemented in Mac CU-SeeMe 0.60 and earlier
Implemented in Windows CU-SeeMe 0.66


Abstract:
---------

   The audio portion of CU-SeeMe for Windows was based on the Mac
version's audio.  CU-SeeMe audio on the Macintosh was ported from
the Mac program Maven, written by Charlie Kline.  The network packet
format of Maven is based on that of the Unix program VAT.  I don't
have any documentation for either VAT or Maven, so this document is
based on my interpretation of how Mac CU-SeeMe, Maven, and VAT
operate.

   The compression and decompression algorithms for CU-SeeMe audio
were taken directly from Maven, with a slightly modified calling
convention: the PC version adds flexibility as to how many bytes in
a compressed buffer are to be uncompressed by a given call to
decompress().  Several different buffer sizes are in use now and
there will probably be more in the future.  Currently there are
buffer sizes of 20, 40, 50, 80, and 100 milliseconds.
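
   As a small worked example of what these buffer sizes mean in
samples, the sketch below converts a buffer duration to a sample
count.  It assumes the 8000 samples per second playout rate stated
in the Time Stamp description later in this document; the names
SAMPLE_RATE_HZ and samples_per_buffer() are hypothetical and do not
come from the CU-SeeMe sources.

```c
#include <stdint.h>

/* Assumed playout rate: 8000 samples per second for all encodings. */
#define SAMPLE_RATE_HZ 8000u

/* Number of audio samples covered by a buffer of the given duration
 * in milliseconds (hypothetical helper, for illustration only). */
static uint32_t samples_per_buffer(uint32_t buffer_ms)
{
    return SAMPLE_RATE_HZ * buffer_ms / 1000u;
}
```

   Under that assumption a 20 millisecond buffer holds 160 samples
and a 100 millisecond buffer holds 800 samples.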

   What differentiates CU-SeeMe audio most from other multimedia
audio applications is that the network and the client system itself
add an unpredictable delay to the playout of audio data.  A
"Playout Buffer" needs to be used to cache audio data so that it
can be played out steadily while audio packets arrive unevenly
over time.  This adds a certain amount of delay - the more sure
you want to be about getting every packet, the more delay you need
to add.  But excessive delay reduces the usefulness of the audio
so a compromise needs to be reached.  The current PC implementation
has a fixed delay in the playout buffer; this will be changed in
a later version to adapt to differing network delay situations.
   

Definitions:
------------

#define AUDF_MULAW8           0   /* 64 kb/s 8 kHz mu-law encoded
                                     PCM */
#define AUDF_CELP             1   /* 4.8 kb/s FED-STD-1016 CELP */
#define AUDF_G721             2   /* 32 kb/s CCITT ADPCM */
#define AUDF_GSM              3   /* 13 kb/s Groupe Special 
                                     Mobile */

#define AUDF_DELTAMOD        26   /* 16 kb/s 2 bit delta mod [cvk] */
#define AUDF_LINEAR8         27   /* 64 kb/s 8 kHz linear PCM
                                     [macintosh] */
#define AUDF_LPC4            28   /* 4.8 kb/s LPC, 4 frames */
#define AUDF_LPC1            29   /* 4.8 kb/s LPC, 1 frame */
#define AUDF_IDVI            30   /* 32 kb/s Intel DVI ADPCM */
#define AUDF_UNDEF           31   /* undefined */
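
   The bit rates in the comments above give a rough idea of how
many bytes of compressed audio one buffer occupies on the network.
The sketch below is only an estimate under that assumption: it uses
the nominal rates from the comments, rounds up to whole bytes, and
ignores any framing a particular codec may add.  The function names
audf_nominal_bps() and nominal_payload_bytes() are hypothetical.

```c
#include <stdint.h>

#define AUDF_MULAW8    0
#define AUDF_CELP      1
#define AUDF_G721      2
#define AUDF_GSM       3
#define AUDF_DELTAMOD 26
#define AUDF_LINEAR8  27
#define AUDF_LPC4     28
#define AUDF_LPC1     29
#define AUDF_IDVI     30
#define AUDF_UNDEF    31

/* Nominal bit rate in bits per second, taken from the comments on
 * the AUDF_* definitions.  Returns 0 for undefined or unknown codes. */
static uint32_t audf_nominal_bps(int fmt)
{
    switch (fmt) {
    case AUDF_MULAW8:   return 64000u;
    case AUDF_CELP:     return 4800u;
    case AUDF_G721:     return 32000u;
    case AUDF_GSM:      return 13000u;
    case AUDF_DELTAMOD: return 16000u;
    case AUDF_LINEAR8:  return 64000u;
    case AUDF_LPC4:     return 4800u;
    case AUDF_LPC1:     return 4800u;
    case AUDF_IDVI:     return 32000u;
    default:            return 0u;
    }
}

/* Estimated compressed size in bytes of one buffer of buffer_ms
 * milliseconds, rounded up to a whole byte. */
static uint32_t nominal_payload_bytes(int fmt, uint32_t buffer_ms)
{
    /* bits/second * milliseconds / 1000 = bits; / 8 = bytes. */
    uint32_t bit_ms = audf_nominal_bps(fmt) * buffer_ms;
    return (bit_ms + 7999u) / 8000u;
}
```

   For example, by this estimate a 20 millisecond buffer of
AUDF_IDVI audio is about 80 bytes, and the same buffer of
AUDF_MULAW8 audio is about 160 bytes.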


Packet Format:
--------------

 VAT packet header (0 is MSB):

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | V |           |T|   |         |                               |
 | e |    NSID   |S| 0 |  Audio  |        Conference ID          |
 | r |           | |   | Format  |                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               |
 |                 Time Stamp (in audio samples)                 |
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


typedef struct {
        VideoPacketHeader     vph;        // Standard CU-SeeMe
                                          //  header
        unsigned char         nsid;       // Number of speakers
                                          //  contributing to
                                          //  this sound data.
#define NSID_MASK             0x3f        // Low 6 bits of nsid hold
                                          //  the speaker count
        unsigned char         flags;      // Talkspurt flag and
                                          //  audio format
#define VATHF_NEWTS           0x80        // Set if start of
                                          //  new talkspurt
#define VATHF_FMTMASK         0x1f        // Audio format bits
        unsigned short        confid;     // Conference ID
        unsigned long         ts;         // Time Stamp
}               vat_hdr_t; 


   nsid - Number of speakers contributing to this sound segment.
The IP addresses of the speakers follow the VAT header.

   flags - NEWTS - New Talk Spurt.  This just means that this is
the first packet in a 'talk spurt'.  There is no corresponding 'End
Talk Spurt'.  Since the first packet in a group of audio packets
may be lost, the receiver must not depend on receiving this packet
in order to start playing audio.  The TalkSpurt indicator may be
used by a variable delay algorithm to indicate a point in the
playout buffer at which timing can be adjusted (to increase or
decrease the delay in the playout buffer).

   FMTMASK - Format Mask.  This is just a code indicating the
encoding format used in the audio data.  All CU-SeeMe receivers
can decode all types of encoding.  The sender chooses the audio
encoding type and can switch from one method to another between
talk spurts without any sort of hand-shaking.

   confid - Conference ID.  We don't use this currently, but obviously, 
this field could be used to reject audio data for other conferences or 
subconferences on a reflector that supported multiple conferences.

   ts - Time Stamp.  Calibrated in samples.  All CU-SeeMe audio
plays out at 8000 samples per second regardless of the encoding
method and bandwidth used on the network.
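
   To make the field layout concrete, here is a minimal sketch of
decoding the 8 bytes of the VAT header shown in the diagram above.
It assumes the fields arrive in network (big-endian) byte order, as
the "0 is MSB" diagram suggests, and that the caller has already
skipped the standard CU-SeeMe header.  The type vat_fields_t and the
functions vat_parse() and ts_to_ms() are hypothetical names used
only for illustration.

```c
#include <stdint.h>

#define NSID_MASK     0x3f   /* low 6 bits of the first byte */
#define VATHF_NEWTS   0x80   /* start of new talkspurt       */
#define VATHF_FMTMASK 0x1f   /* audio format bits            */

/* Decoded view of the VAT header (hypothetical, for illustration). */
typedef struct {
    unsigned version;        /* top 2 bits of the first byte  */
    unsigned nsid;           /* number of speakers            */
    int      new_talkspurt;  /* nonzero if NEWTS is set       */
    unsigned format;         /* one of the AUDF_* codes       */
    uint16_t confid;         /* conference ID                 */
    uint32_t ts;             /* time stamp in samples         */
} vat_fields_t;

/* Decode 8 header bytes, assuming big-endian field order. */
static vat_fields_t vat_parse(const uint8_t b[8])
{
    vat_fields_t f;
    f.version       = (unsigned)(b[0] >> 6);
    f.nsid          = (unsigned)(b[0] & NSID_MASK);
    f.new_talkspurt = (b[1] & VATHF_NEWTS) != 0;
    f.format        = (unsigned)(b[1] & VATHF_FMTMASK);
    f.confid        = (uint16_t)(((unsigned)b[2] << 8) | b[3]);
    f.ts            = ((uint32_t)b[4] << 24) | ((uint32_t)b[5] << 16) |
                      ((uint32_t)b[6] << 8)  |  (uint32_t)b[7];
    return f;
}

/* Convert a time stamp in samples to milliseconds at 8000 samples
 * per second (8 samples per millisecond). */
static uint32_t ts_to_ms(uint32_t ts)
{
    return ts / 8u;
}
```

   A receiver written this way does not depend on the host's byte
order or structure padding, which matters if vat_hdr_t is overlaid
directly on a receive buffer on a little-endian PC.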

Sending Audio Data:
-------------------

   It is assumed that sufficient bandwidth exists to send audio
data.

   The audio send packet is filled out as follows:

nsid = 1;

flags = (bTalkSpurt ? 0x80 : 0) | (AudioEncoding & 0x1f);

confid = 0?  (Currently this field is not used RBK 1/6/96)

ts = sample count index.  We should modify this to increment with
time even when we're not sending audio.

if "Private Talking" is selected:

header.routing.destFamily = KCLIENT.
header.routing.destPort = VIDEO_PORT.
header.routing.destAddr = IP address of client that we're directing our audio to.

if "Private Talking" is not selected:

header.routing.destFamily = KGROUP.
header.routing.destPort = VIDEO_PORT.
header.routing.destAddr = 0

header.routing.srcFamily = KCLIENT.
header.routing.srcPort = VIDEO_PORT.
header.routing.srcAddr = My IP Address.

header.message = 0.
header.sequence = (lastsequence++).
header.dataType = KAUDIOTYPE.
header.length = size of audio data plus VAT header.
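
   The VAT portion of the steps above can be sketched as a small
packing routine.  This is an illustration under stated assumptions,
not the CU-SeeMe source: it writes the fields in network
(big-endian) byte order, leaves the two version bits as zero, and
leaves out the standard CU-SeeMe header (routing, sequence, dataType
and length), whose layout is not described in this document.  The
names vat_pack() and VAT_HDR_BYTES are hypothetical.

```c
#include <stdint.h>

#define NSID_MASK     0x3f   /* low 6 bits of the first byte */
#define VATHF_NEWTS   0x80   /* start of new talkspurt       */
#define VATHF_FMTMASK 0x1f   /* audio format bits            */
#define VAT_HDR_BYTES 8      /* size of the VAT header       */

/* Fill the 8 VAT header bytes for an outgoing audio packet.
 * talkspurt: nonzero for the first packet of a talk spurt.
 * encoding:  one of the AUDF_* codes.
 * ts:        sample count index. */
static void vat_pack(uint8_t out[VAT_HDR_BYTES], int talkspurt,
                     unsigned encoding, uint32_t ts)
{
    out[0] = (uint8_t)(1u & NSID_MASK);          /* nsid = 1, version bits 0 (assumed) */
    out[1] = (uint8_t)((talkspurt ? VATHF_NEWTS : 0) |
                       (encoding & VATHF_FMTMASK));
    out[2] = 0;                                  /* confid = 0 (unused) */
    out[3] = 0;
    out[4] = (uint8_t)(ts >> 24);                /* time stamp, big-endian */
    out[5] = (uint8_t)(ts >> 16);
    out[6] = (uint8_t)(ts >> 8);
    out[7] = (uint8_t)ts;
}
```

   With this layout, header.length for a packet would be the number
of compressed audio bytes plus VAT_HDR_BYTES.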

Playout Buffer:
---------------

   The playout buffer holds received sound data until it is time
to play it.  If sound data were submitted to the sound system
immediately after it was received, there would be jitter, since
packets don't arrive evenly over time.
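
   A fixed-delay playout schedule of the kind described in the
abstract can be sketched as follows.  This is a simplified
illustration, not the CU-SeeMe implementation: each packet is
scheduled at the arrival time of the first packet, plus a fixed
delay, plus the packet's offset in the stream as given by its time
stamp (8 samples per millisecond).  The type playout_t and both
function names are hypothetical.

```c
#include <stdint.h>

/* State for one sender (hypothetical, for illustration). */
typedef struct {
    int      started;           /* nonzero once the first packet is seen */
    uint32_t first_ts;          /* time stamp of the first packet        */
    uint32_t first_arrival_ms;  /* local arrival time of that packet     */
    uint32_t delay_ms;          /* fixed playout delay                   */
} playout_t;

/* Local time in milliseconds at which a packet with time stamp ts
 * should start playing.  The first packet seen sets the reference. */
static uint32_t playout_time_ms(playout_t *p, uint32_t ts,
                                uint32_t arrival_ms)
{
    if (!p->started) {
        p->started = 1;
        p->first_ts = ts;
        p->first_arrival_ms = arrival_ms;
    }
    return p->first_arrival_ms + p->delay_ms + (ts - p->first_ts) / 8u;
}

/* Nonzero if the packet arrived after its scheduled playout time.
 * The signed difference keeps the test valid across timer wrap. */
static int arrived_too_late(uint32_t arrival_ms, uint32_t playout_ms)
{
    return (int32_t)(arrival_ms - playout_ms) > 0;
}
```

   With a 100 millisecond delay, a packet may arrive up to 100
milliseconds later than its ideal time and still be played; a larger
delay tolerates more jitter at the cost of more latency.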

   Here is a diagram of the timing variables in the Playout buffer:

               | PlayTime
               |
               v
System _ _ _ _ x x x x x x x x _ x x _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

               |<--AudioQueueLead-->|<---delay--->|
CUSeeMe_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ x x x x x x _ x _ _ _ _ _ _ _ _ _ _ _ _
                            ^       ^             ^
                            |       |             |
                            |       | MustQueue   | ArriveIndex [NumChans]
                            | NextToQueue


Playout Buffer Timing variables:
  
   AudioQueueLead - The arbitrarily chosen number of milliseconds
of sound data that we want queued up at a minimum.  This must
be large enough that the sound system does not run out of buffers
because we did not get to run this function soon enough.  If it is
too large there are large delays in the audio playout.  The value
is dynamically adjusted based on measured system performance.  It
is saved in cuseeme.ini since it's reasonable to assume that system
performance will be roughly the same each session.

   PlayTime - when the sound is expected to be heard on the speakers.
This variable is initialized when any channel starts up; it
marches along with time based on the timeGetTime() call.

   NextToQueue - The time index at which this function should start
processing the CU-SeeMe queue.  It is basically where we left
off last time.  In theory, in the diagram above, this variable
slides to the left as time goes on while we're not being called.
After we finish it will be set to at least MustQueue.  It may end
up greater than MustQueue if the audio on all active channels is
being received steadily.  If it becomes less than PlayTime it means
we were called too late - there was a gap in the playout and we
need to decide whether to drop or squish sound in.

   MustQueue - The time index up to which audio must be transferred
and mixed from the CU-SeeMe audio buffers to the system.  This
marches along with time in front of the PlayTime variable based
on AudioQueueLead.  dwCTickZero is initialized based on this.  When
we are in this function, iMustQueue is set to timeGetTime(), which
is basically 'now'.

   ArriveIndex - The time index at which we expect to be receiving
audio data from the sender.  This is initialized based on the
time of arrival of the first audio packet.  Each channel has a
different value for this depending on past variance in packet
arrival time from the sender.
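
   The relationship between PlayTime, AudioQueueLead, MustQueue and
NextToQueue can be sketched as below.  This is one reading of the
descriptions above, not the actual CU-SeeMe code: MustQueue is taken
to be PlayTime plus AudioQueueLead, and when NextToQueue has fallen
behind PlayTime the sketch simply skips the missed interval (the
"drop" choice; squeezing the sound in is the other option).  All
type and function names are hypothetical.

```c
#include <stdint.h>

/* Persistent state between calls (hypothetical). */
typedef struct {
    uint32_t audio_queue_lead_ms;  /* AudioQueueLead                   */
    uint32_t next_to_queue_ms;     /* NextToQueue: where we left off   */
} queue_state_t;

/* What one call should do (hypothetical). */
typedef struct {
    int      underrun;  /* nonzero if we were called too late       */
    uint32_t gap_ms;    /* length of the playout gap, if any        */
    uint32_t from_ms;   /* start of the interval to transfer + mix  */
    uint32_t to_ms;     /* end of that interval (MustQueue)         */
} queue_plan_t;

/* Decide which time interval to move from the CU-SeeMe buffers to
 * the sound system, given the current PlayTime in milliseconds. */
static queue_plan_t plan_queue(queue_state_t *s, uint32_t play_time_ms)
{
    queue_plan_t plan;
    uint32_t must_queue = play_time_ms + s->audio_queue_lead_ms;

    /* Called too late: NextToQueue is already behind PlayTime. */
    plan.underrun = (int32_t)(play_time_ms - s->next_to_queue_ms) > 0;
    plan.gap_ms = plan.underrun ? play_time_ms - s->next_to_queue_ms : 0u;
    if (plan.underrun)
        s->next_to_queue_ms = play_time_ms;  /* drop the missed span */

    plan.from_ms = s->next_to_queue_ms;
    if ((int32_t)(must_queue - s->next_to_queue_ms) > 0) {
        plan.to_ms = must_queue;
        s->next_to_queue_ms = must_queue;    /* at least MustQueue */
    } else {
        plan.to_ms = plan.from_ms;           /* already far enough ahead */
    }
    return plan;
}
```

   For example, with AudioQueueLead at 200 milliseconds and
NextToQueue at 1000, a call at PlayTime 900 queues the interval
from 1000 to 1100.  If the next call does not come until PlayTime
1150, the sketch reports a 50 millisecond gap and queues from 1150
to 1350.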