Internet Engineering Task Force              Kretschmer-AT&T/Basso-AT&T
INTERNET DRAFT                           Civanlar-AT&T/Quackenbush-AT&T
File:draft-kretschmer-mpeg2aac-01.txt                       Snyder-AT&T
                                                          June 25, 1999
                                             Expires: December 25, 1999


               RTP Payload Format for MPEG-2 AAC Streams


STATUS OF THIS MEMO

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This document describes a payload format for transporting MPEG-2 AAC
   encoded data using RTP.  MPEG-2 AAC is a recent standard from
   ISO/IEC for the coding of multi-channel audio data.  Several
   services provided by RTP are beneficial for MPEG-2 AAC encoded data
   transport over the Internet.  Additionally, the use of RTP makes it
   possible to synchronize MPEG-2 AAC data with other real-time
   streams.

Kretschmer/Basso/Civanlar/Quackenbush/Snyder                    [Page 1]

INTERNET-DRAFT   RTP Payload Format for MPEG-2 AAC Streams     June 1999

1. Introduction

   The ISO/IEC MPEG-2 Advanced Audio Coding (AAC) [1] technology
   delivers unsurpassed multichannel audio quality at rates at or below
   64 kbps/channel.  It has a flexible bitstream syntax that supports
   from 1 to 48 audio channels, up to 16 subwoofer channels and up to
   16 embedded data channels.  AAC supports a wide range of sampling
   frequencies (from 16 kHz to 96 kHz), which enables it to have an
   extremely wide range of bitrates.
   This permits it to support applications ranging from professional or
   home theater sound systems to Internet music broadcast systems.

   The benefits of using RTP for MPEG-2 AAC data stream transport
   include:

     i. Ability to synchronize MPEG-2 AAC streams with other RTP
        payloads

    ii. Monitoring MPEG-2 AAC delivery performance through RTCP

   iii. Combining MPEG-2 AAC and other real-time data streams received
        from multiple end-systems into a set of consolidated streams
        through RTP mixers

    iv. Converting data types, etc. through the use of RTP translators.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [3].

1.1 Overview of MPEG-2 AAC

   AAC combines the coding efficiencies of a high resolution filter
   bank, a powerful model of audio perception, backward-adaptive
   prediction, joint channel coding, and Huffman coding to deliver
   excellent signal compression.

   In 1998 the MPEG Audio subgroup tested the family of MPEG audio
   coders (see
   http://www.tnt.uni-hannover.de/project/mpeg/audio/public/w2006.pdf).
   The test results indicate that for a stereo signal, AAC at 96 kb/s
   has audio quality comparable to MPEG-1 Layer 3 ("mp3") at 128 kb/s.
   Therefore, at equivalent quality levels, AAC offers approximately
   1/3 greater compression than Layer 3.

   AAC is a block oriented, variable rate coding algorithm: the AAC
   encoder reads 1024 samples of the input signal and writes a variable
   number of compressed output bits that represent that block of input
   data.  A sample can comprise one or more channels.  Rate control can
   be used in the encoder such that the output bit rate is averaged to
   a predetermined rate, as would be required for constant-rate
   communication channels.
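
   The relation between mean bit rate, sampling rate and average block
   size can be made concrete with a small calculation.  The Python
   sketch below is illustrative only: the function names and the
   44.1 kHz sampling rate are assumptions and are not defined by this
   document.

```python
SAMPLES_PER_BLOCK = 1024  # samples consumed per AAC block, as stated above


def block_duration_s(sample_rate_hz: float) -> float:
    """Duration of the audio represented by one block, in seconds."""
    return SAMPLES_PER_BLOCK / sample_rate_hz


def average_block_bytes(bit_rate_bps: float, sample_rate_hz: float) -> float:
    """Average size of one compressed block, in bytes, at a given mean rate."""
    return bit_rate_bps * block_duration_s(sample_rate_hz) / 8.0


if __name__ == "__main__":
    fs = 44100.0  # assumed sampling rate; not mandated by the text
    print(round(block_duration_s(fs) * 1000, 2))   # 23.22 (milliseconds)
    print(round(average_block_bytes(96000, fs)))   # 279
    print(round(average_block_bytes(16000, fs)))   # 46
```

   Individual blocks vary around this average because AAC is a variable
   rate coder; rate control fixes only the long-term mean.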
   Each block of AAC compressed bits is called a "raw data block", and
   it has the useful property that it can be decoded "stand-alone",
   that is, without knowledge of information in prior bitstream blocks.
   This is ideal for packet communication channels: if the payload of a
   packet is a single raw data block, packet framing facilitates
   encoder and decoder synchronization and, most importantly, loss of a
   single packet does not impair the decodability of adjacent packets.

1.2 Bitstream Syntax

   As already stated, a raw data block represents audio data for a time
   period of 1024 samples and may also contain related information and
   other data.  The syntax of an AAC bitstream is as follows:

      => => []

   where indicates the AAC bitstream, indicates intermediate tokens,
   indicates terminal tokens and [] indicates one or more occurrences.
   A raw_data_block ends with a token that indicates the end of the
   raw_data_block, followed by a variable length token that forces the
   total length of the raw_data_block to be an integral number of
   bytes.  In general, intermediate tokens are not an integral number
   of bytes in length.

   The element tokens are strings of bits of varying length, and can be
   any of the following:

      represent a single audio channel
      represent a stereo presentation (2 channels)
      a mechanism for multi-channel compression
      represent a special effects channel
      represent "user data"
      a mechanism for describing the bitstream content
      a mechanism to use bits (for constant rate channels)

   The above elements can occur several times in a single
   raw_data_block.  For example, the raw_data_block for a 5.1 surround
   sound signal would be:

      ...

   corresponding to the center, left and right, left surround and right
   surround, and effects channels.  Multiple occurrences of an element
   are disambiguated by means of a unique 4-bit id inside the element.
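
   The element types listed above can be modelled as a small
   enumeration.  The Python sketch below is an illustration under
   stated assumptions: the mnemonics and 3-bit codes are those commonly
   given for the element identifier in ISO/IEC 13818-7 [1] and should
   be checked against the standard; the names SynEle, CHANNELS,
   channel_count and FIVE_ONE are hypothetical and are not defined by
   this document.

```python
from enum import IntEnum


class SynEle(IntEnum):
    """3-bit element identifiers (assumed from ISO/IEC 13818-7)."""
    SCE = 0  # single channel element: one audio channel
    CPE = 1  # channel pair element: a stereo pair (2 channels)
    CCE = 2  # coupling channel element: multi-channel compression
    LFE = 3  # low frequency effects element: special effects channel
    DSE = 4  # data stream element: "user data"
    PCE = 5  # program config element: describes bitstream content
    FIL = 6  # fill element: uses up bits for constant rate channels
    END = 7  # terminates the raw_data_block


# Audio channels carried by each element type; other elements carry none.
CHANNELS = {SynEle.SCE: 1, SynEle.CPE: 2, SynEle.LFE: 1}


def channel_count(block):
    """Count the audio channels in a list of elements that ends with END."""
    if not block or block[-1] is not SynEle.END or SynEle.END in block[:-1]:
        raise ValueError("block must contain exactly one END, placed last")
    return sum(CHANNELS.get(element, 0) for element in block)


# 5.1 layout: center, left/right, left/right surround, effects channel.
FIVE_ONE = [SynEle.SCE, SynEle.CPE, SynEle.CPE, SynEle.LFE, SynEle.END]

if __name__ == "__main__":
    print(channel_count(FIVE_ONE))  # 6
```

   A real parser would also read the 4-bit id that follows each element
   identifier in order to tell apart the two stereo pairs in this
   example.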
2. Issues covered by this Payload Format

2.1 Repair Information to reconstruct lost AAC Frames

   A smart AAC decoder can mitigate the effects of lost packets using
   techniques such as interpolation in the spectral domain.  However,
   if the raw_data_block in a packet is perceptually very meaningful
   and also highly unpredictable (e.g. the onset of a cymbal crash),
   then the encoder may choose to send repair information associated
   with that raw_data_block.  We call the variable size array
   containing such information RepairData.  The RepairData in a given
   packet is typically associated with a raw_data_block that will be
   decoded in the future.  The association between the raw_data_block
   and the RepairData is obtained by means of a specific field called
   RSEQ.  The syntax of the RepairData and the AAC raw_data_block is
   the same.

   In practice, the RepairData can be a highly compressed monophonic
   version of the signal being transmitted.  For example, an AAC stereo
   signal coded to an average rate of 96 kb/s corresponds to a
   raw_data_block size of 279 bytes.  A RepairData version of that
   block, compressed to 16 kb/s, would be 46 bytes.  Given that
   perceptually critical blocks might occur only once per 100 or more
   blocks, the average rate imposed by the RepairData is very low.

   The usage of the RepairData information is similar to the one
   proposed in [4].  RepairData MAY be provided for every frame, but
   its provision is OPTIONAL in general.

2.2 Fragmentation of AAC Frames

   Since it is advantageous to put one AAC raw_data_block per packet,
   it is desirable to limit the size of the AAC raw_data_block to less
   than the path-MTU.  If this is not possible, the raw_data_block can
   be fragmented across several packets.
   In this case, the raw_data_block can be fragmented at element
   boundaries, with the LEN field used to indicate the length of the
   element to within a byte and the UBITS field used to indicate the
   length of the element to a bit.  The LEN and UBITS information
   allows re-assembly of the raw_data_block without knowledge of the
   syntax of the bits within each element in the raw_data_block.

2.3 Priority of AAC Frames

   Priority information is very important for AAC streaming over lossy
   channels, since it allows packet losses and/or given bandwidth
   constraints to be handled adaptively.  Four priority levels are
   defined: 0, 1, 2 and 3.  The priority level expresses the perceptual
   entropy of the AAC frame.  Priority information is coded for every
   AAC frame in the Priority Quantifier (PQ), which is 2 bits in
   length.  For a given RTP packet such PQs are organized in a Priority
   vector.

2.4 Interleaving of AAC Frames

   Instead of using a static interleaving scheme (e.g. 7x7), frames are
   grouped by priority: only frames with the same priority MUST be
   grouped.  The sequence numbers SEQ of the AAC frames and RSEQ of the
   REPAIRDATA are used to restore the actual order on the receiver
   side.  Hence, the interleaving scheme does not have to be defined
   rigidly.

2.5 Example RTP Packet Sequence

   The example below shows how a sequence of AAC frames (a...p) with
   assigned priorities (0=low, 3=high) MAY be grouped.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | 0 | 0 | 0 | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 3 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Proposed interleaving/grouping of AAC frames and assigned RepairData
   R(x) being sent within the following RTP packets:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |a g j|b h k|c i l|  d  |  e  |  f  | m q |  n  |  o  |  p  |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |     |R(d) |R(e) |R(f) |     |R(n) |R(o) |R(p) |     |     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3. RTP AAC Payload Format

   The AAC specific RTP payload consists of a 32 or 64 bit header, a
   variable-size RepairData array containing information needed to
   reconstruct lost AAC frames, and a variable number of AAC frames.
   The header contains a vector of Priority Quantifiers (PQ) that tells
   the decoder the priority of the current and previous packets for
   reconstructing the original signal.  The X bit specifies whether the
   header contains 12 or 28 PQs.  REPAIRLEN specifies the length of the
   RepairData array expressed in 32-bit words.  REPAIRLEN MUST be set
   to 0 if the RepairData array is empty.  Every REPAIRDATA array is
   preceded by a sequence number RSEQ and a length specifier RLEN;
   every AAC frame is preceded by a sequence number SEQ and a length
   specifier LEN.  In case of fragmented AAC frames, UBITS specifies
   the number of unused bits in the last byte, since frame fragments
   may not be byte aligned.
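
   The header layout described above can be sketched in Python as
   follows.  This is a minimal illustration, not a reference
   implementation.  It assumes that REPAIRLEN is 7 bits wide (the 32
   header bits minus the X bit and a 24-bit vector of twelve 2-bit
   PQs), that fields are packed most significant bit first, and that
   PQ(t) occupies the most significant bits of the vector.  The 8-bit
   SEQ, 12-bit LEN and 4-bit UBITS widths are taken from the field
   descriptions in this section.  All function names are hypothetical.

```python
def pack_header(repair_len_words, pqs):
    """Build the 32- or 64-bit payload header from REPAIRLEN and a PQ list."""
    if len(pqs) not in (12, 28):
        raise ValueError("priority vector must hold 12 or 28 PQs")
    if not 0 <= repair_len_words <= 0x7F:
        raise ValueError("REPAIRLEN is assumed to be 7 bits wide")
    if any(not 0 <= pq <= 3 for pq in pqs):
        raise ValueError("each PQ is 2 bits wide")
    x = 1 if len(pqs) == 28 else 0
    vector = 0
    for pq in pqs:                      # PQ(t) first, i.e. most significant
        vector = (vector << 2) | pq
    vector_bits = 2 * len(pqs)          # 24 or 56
    word = (x << (7 + vector_bits)) | (repair_len_words << vector_bits) | vector
    return word.to_bytes(4 + 4 * x, "big")


def parse_header(payload):
    """Return (x, repair_len_words, pqs, header_len_bytes)."""
    if len(payload) < 4:
        raise ValueError("payload shorter than the minimum header")
    x = payload[0] >> 7
    repair_len_words = payload[0] & 0x7F
    header_len = 8 if x else 4
    if len(payload) < header_len:
        raise ValueError("payload shorter than the extended header")
    count = 28 if x else 12
    vector = int.from_bytes(payload[1:header_len], "big")
    pqs = [(vector >> (2 * (count - 1 - i))) & 0x3 for i in range(count)]
    return x, repair_len_words, pqs, header_len


def parse_frame_header(data):
    """Return (seq, length, ubits) from the 24-bit SEQ/LEN/UBITS prefix."""
    if len(data) < 3:
        raise ValueError("frame header needs 3 bytes")
    seq = data[0]
    length = (data[1] << 4) | (data[2] >> 4)
    ubits = data[2] & 0x0F
    return seq, length, ubits


if __name__ == "__main__":
    header = pack_header(0, [3] + [0] * 11)
    print(header.hex())                                # 00c00000
    print(parse_header(header)[:2])                    # (0, 0)
    print(parse_frame_header(bytes([5, 0x11, 0x73])))  # (5, 279, 3)
```

   Because REPAIRLEN counts 32-bit words, a receiver that does not use
   repair information can skip 4 * REPAIRLEN bytes after the header to
   reach the first AAC frame.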

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |X|REPAIRLEN    |PRI VECTOR                                     | Header
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |PRI VECTOR (continued), if X==1                                |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |RSEQ           |RLEN           |REPAIRDATA 1                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                               .                               | Repair
   |                               .                               | Data
   |                               .                               |
   |               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |RSEQ           |RLEN           |REPAIRDATA N   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                                                               |
   |                                                               |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |SEQ            |LEN                    |UBITS  |AAC FRAME 1    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                               .                               |
   |                               .                               |
   |                               .                               | AAC
   |               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Frames
   |               |SEQ            |LEN                    |UBITS  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |AAC FRAME N                                                    |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   PRI VECTOR: Priority vector.  It contains either 12 or 28 Priority
      Quantifiers (PQ).  The size of a PQ element is 2 bits.  Hence,
      four different priority levels can be assigned to an RTP packet.
      0 means low and 3 means high priority.  The first PQ refers to
      the current packet.  The following PQs refer to the most recent
      previous packets.  So, the vector looks like this:
      {PQ(t), PQ(t-1), PQ(t-2)...}

   X: Vector Extension.  If set, the priority vector uses 56 instead of
      24 bits.  Hence, another 32-bit word is required.

   REPAIRLEN: The total number of 32-bit words containing Repair Data
      for previous frames.  If REPAIRLEN=0 then there is no repair
      information.

   RSEQ: The SEQ number of the AAC frame the REPAIRDATA belongs to.

   RLEN: The length in bytes of REPAIRDATA.

   REPAIRDATA: An 8-bit aligned data array containing repair
      information.  This information can be ignored and is not
      mandatory.
      It should contain information that helps the decoder to
      reconstruct a lost frame as close to the original as possible.

   SEQ: 8 bits.  The sequence number of the AAC frame.  The application
      has to make sure that the sequence numbers of interleaved frames
      do not overlap.

   LEN: 12 bits.  The length of the actual AAC frame.

   UBITS: 4 bits.  The number of unused bits in the last byte of the
      AAC frame if the frame is fragmented.

3.1 RTP Header Fields Usage

   The RTP header fields are used as follows:

   Payload Type (PT): The assignment of an RTP payload type for this
      new packet format is outside the scope of this document, and will
      not be specified here.  It is expected that the RTP profile for a
      particular class of applications will assign a payload type for
      this encoding, or if that is not done then a payload type in the
      dynamic range shall be chosen.

   Marker (M) bit: Set to one to mark the last fragment (or only
      fragment) of an AAC frame.

   Extension (X) bit: Defined by the RTP profile used.

   Timestamp (TS): 32-bit 90 kHz timestamp representing the
      presentation time of the AAC frame.  The timestamp is the same
      for all packets that make up a fragmented AAC frame.  Timestamps
      are recommended to start at a random value for security reasons.

   SSRC: Set as described in RFC 1889 [2].

   CC and CSRC fields are used as described in RFC 1889 [2].

4. References

   [1] ISO/IEC 13818-7, Advanced Audio Coding (AAC).

   [2] Schulzrinne, Casner, Frederick, Jacobson, RTP: A Transport
       Protocol for Real Time Applications, RFC 1889, Internet
       Engineering Task Force, January 1996.

   [3] S. Bradner, Key words for use in RFCs to Indicate Requirement
       Levels, RFC 2119, March 1997.

   [4] Perkins, Kouvelas, Hodson, Hardman, Handley, Bolot,
       Vega-Garcia, Fosse-Parisis, RTP Payload for Redundant Audio
       Data, draft-ietf-avt-redundancy-revised-00.txt.

5. Authors' Addresses

   Mathias Kretschmer
   AT&T Labs - Research
   180 Park Ave.
   Florham Park, NJ 07932
   USA
   e-mail: mathias@research.att.com

   Andrea Basso
   AT&T Labs - Research
   100 Schultz Drive
   Red Bank, NJ 07701
   USA
   e-mail: basso@research.att.com

   M. Reha Civanlar
   AT&T Labs - Research
   100 Schultz Drive
   Red Bank, NJ 07701
   USA
   e-mail: civanlar@research.att.com

   Schuyler R. Quackenbush
   AT&T Labs - Research
   180 Park Ave.
   Florham Park, NJ 07932
   USA
   e-mail: srq@research.att.com

   James H. Snyder
   AT&T Labs - Research
   180 Park Ave.
   Florham Park, NJ 07932
   USA
   e-mail: jhs@research.att.com