Internet-Draft Unicode Subsets June 2024
Bray & Hoffman Expires 9 December 2024
Workgroup: Network Working Group
Intended Status: Standards Track
Authors: T. Bray (Textuality Services), P. Hoffman (ICANN)

Unicode Character Repertoire Subsets

Abstract

This document discusses specifying subsets of the Unicode character repertoire for use in protocols and data formats. It also specifies those subsets as PRECIS profiles.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 9 December 2024.

1. Introduction

When a protocol or data format has text fields, that text is normally composed of Unicode [UNICODE] characters, to support use by speakers of many languages. The "set of all Unicode characters" is generally not a good choice for use in text fields. Instead, subsets such as those discussed in this document are typically used.

In this document, "subset" means a subset of the Unicode character repertoire. This document specifies subsets that exclude some or all of the entities that are "problematic" as defined in Section 2.2. Authors should have a way to concisely and exactly reference a stable specification that identifies which subset a protocol or data format accepts.

This document discusses issues that apply in choosing subsets, names two subsets that have been popular in practice, and suggests one new subset. The goal is to provide a convenient target for cross-reference from other specifications.

1.1. Notation

In this document, the numeric values assigned to Unicode characters are provided in hexadecimal. In the text, Unicode's standard "U+" notation is used, with the value zero-padded to at least four hexadecimal digits. For example, "A", decimal 65, would be expressed as U+0041, and "🖤" (Black Heart), decimal 128,420, would be U+1F5A4.
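
As an illustration of this notation, the following Python sketch formats an integer as a "U+" string. The function name u_plus is hypothetical; it is not defined by [UNICODE] or by this document.

```python
def u_plus(code_point: int) -> str:
    """Return code_point in "U+" notation, padded to at least four hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return f"U+{code_point:04X}"


print(u_plus(65))      # the letter "A"
print(u_plus(128420))  # Black Heart
```

The two calls print U+0041 and U+1F5A4, matching the examples in the text.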

Groups of numeric values described in Section 4 are given in ABNF [RFC5234]. In ABNF, hexadecimal values are preceded by "%x" rather than "U+".

All the numeric ranges in this document are inclusive.

The subsets are described both in ABNF and as PRECIS profiles [RFC8264].

2. Characters and Code Points

Definition D9 in section 3.4 of [UNICODE] defines "Unicode codespace" as "a range of integers from 0 to 10FFFF₁₆". Definition D10 defines "code point" as "Any value in the Unicode codespace".

The Unicode Standard's definition of "Unicode character" is conceptual. However, each Unicode character is assigned a code point, used to represent the characters in computer memory and storage systems and, in specifications, to specify allowed subsets.

There are 1,114,112 code points; as of Unicode 15.1 (2023), fewer than 150,000 have been assigned to characters. It is difficult to specify that unassigned code points should be avoided, because they regularly become assigned when new characters are added to Unicode.

Section 6.1 defines a new PRECIS base class that encompasses all Unicode code points. This base class is used for the PRECIS profiles for the subsets defined in this document.

2.1. Transformation Formats

Unicode describes a variety of "transformation formats", ways to marshal code points into byte sequences. A survey of transformation formats is beyond the scope of this document. However, it is useful to note that the "UTF-16" format represents each code point with one or two 16-bit code units, and the "UTF-8" format uses variable-length sequences of one to four bytes.
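
The difference between the two formats can be observed with a short Python sketch. It relies only on the built-in str.encode method; the helper name encoded_sizes is an assumption made for illustration.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 16-bit unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-be" is used so that no byte order mark is prepended.
    utf16_units = len(ch.encode("utf-16-be")) // 2
    return utf8_bytes, utf16_units


for ch in ("A", "\u00E9", "\u20AC", "\U0001F5A4"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only the last sample, which lies above U+FFFF, needs two 16-bit units in UTF-16; its UTF-8 form needs four bytes.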

The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], says "Protocols MUST be able to use the UTF-8 charset", which becomes a mandate to use UTF-8 for any protocol or data format that specifies a single transformation format. UTF-8 is widely used for interoperable data formats such as JSON, YAML, and XML.

2.2. Problematic Code Points

This section classifies as "problematic" all the code points which can never represent useful text and in some cases can lead to software misbehavior. This is a low bar; the PRECIS [RFC8264] framework's "IdentifierClass" and "FreeformClass" exclude many more code points which can cause problems when displayed to humans, in some cases presenting security risks. Specifications of fields in protocols and data formats whose contents are designed for display to and interactions with humans would benefit from careful consideration of the issues described by PRECIS; its more-restrictive subsets might be better choices than those specified in this document.

Definition D10a in section 3.4 of [UNICODE] defines seven code point types. Three types of code points are assigned to entities which are not actually characters or whose value as Unicode characters in text fields is questionable: "Surrogate", "Control", and "Noncharacter". In this document, "problematic" refers to code points whose type is "Surrogate" or "Noncharacter", and to "legacy controls" as defined in Section 2.2.2.2.

2.2.1. Surrogates

A total of 2,048 code points, in the range U+D800-U+DFFF, are divided into two blocks called "high surrogates" and "low surrogates"; collectively the 2,048 code points are referred to as "surrogates". Surrogates may only be used in Unicode texts encoded in UTF-16, where a high-surrogate/low-surrogate pair represents a code point greater than U+FFFF.

A surrogate which occurs in text encoded in any transformation format other than UTF-16 has no meaning and may cause malfunction in software that encounters it. In particular, it is impossible to represent a surrogate in well-formed UTF-8.
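
The following Python sketch illustrates both points: the arithmetic that maps a surrogate pair to a code point, and the refusal of a strict UTF-8 encoder to serialize a lone surrogate. The function names are hypothetical, and the encoder behavior shown is that of Python's built-in codec; other platforms may behave differently.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 high/low surrogate pair to the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high-surrogate/low-surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encodes_as_utf8(text: str) -> bool:
    """Report whether Python's strict UTF-8 encoder accepts text."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(hex(combine_surrogates(0xD83D, 0xDDA4)))  # the pair for U+1F5A4
print(encodes_as_utf8("\U0001F5A4"))            # a code point above U+FFFF
print(encodes_as_utf8("\uD83D"))                # a lone high surrogate
```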

2.2.2. Control Codes

Section 23.1 of [UNICODE] introduces the "Control Codes", for compatibility with legacy pre-Unicode standards. They comprise 65 code points in the ranges U+0000-U+001F ("C0 Controls") and U+0080-U+009F ("C1 Controls"), plus U+007F, "DEL".

2.2.2.1. Useful Controls

The C0 Controls include newline (U+000A), carriage return (U+000D), and tab (U+0009); this document refers to these three characters as the "useful controls".

2.2.2.2. Legacy Controls

Aside from the useful controls, the control codes are mostly obsolete and generally lack interoperable semantics. This document uses the phrase "legacy controls" to describe control codes that are not useful controls.

Since the code points for C0 Controls include the 32 smallest integers including zero, they are likely to occur in data as a result of programming errors.
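
The classification in this section can be sketched in Python as follows. The names USEFUL_CONTROLS, is_control_code, and is_legacy_control are illustrative assumptions, not terms defined by [UNICODE].

```python
USEFUL_CONTROLS = frozenset({0x09, 0x0A, 0x0D})  # tab, newline, carriage return


def is_control_code(cp: int) -> bool:
    """True for the C0 Controls, DEL, and the C1 Controls."""
    return 0x00 <= cp <= 0x1F or 0x7F <= cp <= 0x9F


def is_legacy_control(cp: int) -> bool:
    """True for control codes other than the three useful controls."""
    return is_control_code(cp) and cp not in USEFUL_CONTROLS


controls = [cp for cp in range(0x110000) if is_control_code(cp)]
legacy = [cp for cp in controls if is_legacy_control(cp)]
print(len(controls), len(legacy))
```

The counts are 65 control codes, of which 62 are legacy controls.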

2.2.3. Noncharacters

Certain code points are classified as "noncharacters", and [UNICODE] asserts repeatedly that they are not designed or used for open interchange.

Code points are organized into 17 "planes", each containing 2^16 (65,536) code points. The last two code points in each plane are noncharacters: U+00FFFE, U+00FFFF, U+01FFFE, U+01FFFF, U+02FFFE, U+02FFFF, and so on, up to U+10FFFE, U+10FFFF.

The code points in the range U+FDD0-U+FDEF are noncharacters.
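
Taken together, the two rules above identify 66 noncharacters: 2 in each of the 17 planes, plus the 32 in the range U+FDD0-U+FDEF. A minimal Python sketch of the test, using the hypothetical name is_noncharacter, is:

```python
PLANE_SIZE = 2 ** 16
CODESPACE_SIZE = 17 * PLANE_SIZE  # 1,114,112 code points in total


def is_noncharacter(cp: int) -> bool:
    """True if cp is one of the noncharacter code points."""
    in_fdd0_range = 0xFDD0 <= cp <= 0xFDEF
    # The low 16 bits are FFFE or FFFF: the last two code points of a plane.
    last_two_of_plane = (cp & 0xFFFE) == 0xFFFE
    return in_fdd0_range or last_two_of_plane


noncharacters = [cp for cp in range(CODESPACE_SIZE) if is_noncharacter(cp)]
print(len(noncharacters))
```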

3. Dealing With Problematic Code Points

[RFC9413], "Maintaining Robust Protocols", provides a thorough discussion of strategies for dealing with issues in input data, for example problematic code points.

Different types of problematic code points cause different issues. Noncharacters and legacy controls are unlikely to cause software failures, but they cannot usefully be displayed to humans, and they can be used in attacks that mislead human readers when software attempts to display them [TR36].

Surrogate code points have been observed to cause software failures. The behavior of software which encounters them is unpredictable and differs in programming-language implementations, even between different API calls in the same language.

Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence which would map to a surrogate is ill-formed. Thus, in theory, if a specification requires that input data be encoded with UTF-8, implementors should never have to concern themselves with surrogates.

Unfortunately, industry experience teaches that problematic code points, including surrogates, can and do occur in program input where the source of input data is not controlled by the implementor. In particular, the specification of JSON allows any code point to appear in object member names and string values [RFC8259]; the following is a conforming JSON text:

{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}

The value of the "example" field contains the C0 Control NUL, the C1 Control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two escaped UTF-16 surrogate code points. It is unlikely to be useful as the value of a text field. That value cannot be serialized into well-formed UTF-8, but the behavior of libraries asked to parse the sample is unpredictable; some will silently parse this and generate an ill-formed UTF-8 string.
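
As one data point, the sketch below feeds the sample to the json module in Python's standard library and lists the resulting code points. It reflects the behavior of CPython as understood at the time of writing and is not a statement about JSON parsers in general; the variable names are illustrative.

```python
import json

text = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'
value = json.loads(text)["example"]

# List the code points the parser produced for the string value.
code_points = [f"U+{ord(ch):04X}" for ch in value]
print(code_points)

# Check whether a strict UTF-8 encoder accepts the parsed value.
try:
    value.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
print(utf8_ok)
```

Here the parser accepts the text without complaint and yields four code points, one of them an unpaired surrogate, so the later attempt to encode the value as UTF-8 fails.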

Reasonable options for dealing with problematic input include, first, rejecting text containing problematic code points, and second, replacing them with placeholders. (As an exception, [UNICODE] notes that it may in some cases be appropriate, specifically for noncharacters, to treat them as non-problematic unassigned code points.)

Silently deleting an ill-formed part of a string is a known security risk. Responding to that risk, [UNICODE] section 3.2 recommends dealing with ill-formed byte sequences by signaling an error, or replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER), although some popular software platforms, notably Java, use "?".

[RFC9413] emphasizes that when encountering problematic input, software should consider the field as a whole, not individual code points or bytes.
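
The two options discussed above, rejection and replacement, can be sketched in Python as follows. This is an illustration under stated assumptions, not a recommended implementation: the function names are hypothetical, and the predicate is_problematic simply restates the definitions in Section 2.2.

```python
REPLACEMENT = "\uFFFD"  # REPLACEMENT CHARACTER


def is_problematic(cp: int) -> bool:
    """True for surrogates, legacy controls, and noncharacters."""
    surrogate = 0xD800 <= cp <= 0xDFFF
    control = cp <= 0x1F or 0x7F <= cp <= 0x9F
    legacy_control = control and cp not in (0x09, 0x0A, 0x0D)
    noncharacter = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    return surrogate or legacy_control or noncharacter


def reject_problematic(field: str) -> str:
    """Option 1: refuse the whole field if any code point is problematic."""
    for ch in field:
        if is_problematic(ord(ch)):
            raise ValueError(f"problematic code point U+{ord(ch):04X}")
    return field


def replace_problematic(field: str) -> str:
    """Option 2: substitute U+FFFD for each problematic code point."""
    return "".join(REPLACEMENT if is_problematic(ord(ch)) else ch for ch in field)


print(replace_problematic("a\x00b\uDEADc\n") == "a\uFFFDb\uFFFDc\n")
```

Neither function deletes code points, in keeping with the warning above about silent deletion.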

4. Subsets

This section describes subsets that can be used in specifying acceptable content for text fields in protocols and data types. Specifications can refer to these subsets by the names "Unicode Scalars", "XML Characters", and "Unicode Assignables".

4.1. Unicode Scalars

Definition D76 in section 3.9 of [UNICODE] defines the term "Unicode scalar value" as "Any Unicode code point except high-surrogate and low-surrogate code points."

The "Unicode Scalars" subset can be expressed as an ABNF production:

unicode-scalar =
   %x0-D7FF / %xE000-10FFFF  ; exclude surrogates

This subset is the default for CBOR [RFC8949], and has the advantage of excluding surrogates. However, it includes legacy controls and noncharacters.

This subset is called the UnicodeScalarsClass for use in PRECIS. Its registration template can be found in Section 6.2.1.
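
A membership test for this subset is a direct transcription of the ABNF. The Python sketch below uses the hypothetical names is_unicode_scalar and in_unicode_scalars.

```python
def is_unicode_scalar(cp: int) -> bool:
    """True if cp matches the unicode-scalar production."""
    return 0x0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def in_unicode_scalars(text: str) -> bool:
    """True if every code point of text is a Unicode scalar value."""
    return all(is_unicode_scalar(ord(ch)) for ch in text)


print(in_unicode_scalars("a\x00\uFFFF"))  # legacy control and noncharacter pass
print(in_unicode_scalars("\uDEAD"))       # an unpaired surrogate does not
```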

4.2. XML Characters

The XML 1.0 Specification [XML], in its grammar production labeled "Char", specifies a subset of Unicode code points that excludes surrogates, legacy C0 Controls, and the noncharacters U+FFFE and U+FFFF.

The "XML Characters" subset can be expressed as an ABNF production:

xml-character =
   %x9 / %xA / %xD /   ; useful controls
   %x20-D7FF /         ; exclude surrogates
   %xE000-FFFD /       ; exclude FFFE and FFFF nonchars
   %x10000-10FFFF

While this subset does not exclude all the problematic code points, the C1 Controls are less likely than the C0 Controls to appear erroneously in data, and have not been observed to be a frequent source of problems. Also, the noncharacters greater in value than U+FFFF are rarely encountered.

This subset is called the XMLCharactersClass for use in PRECIS. Its registration template can be found in Section 6.2.2.
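
The sketch below expresses the same subset in Python as a table of inclusive ranges, following the "Char" production of [XML], whose final range begins at #x10000. The names XML_CHARACTER_RANGES and is_xml_character are assumptions made for illustration.

```python
XML_CHARACTER_RANGES = (
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),  # useful controls
    (0x20, 0xD7FF),                      # stops before the surrogates
    (0xE000, 0xFFFD),                    # stops before FFFE and FFFF
    (0x10000, 0x10FFFF),                 # planes 1 through 16
)


def is_xml_character(cp: int) -> bool:
    """True if cp falls in one of the inclusive ranges above."""
    return any(low <= cp <= high for low, high in XML_CHARACTER_RANGES)


for cp in (0x00, 0x89, 0xD800, 0xFFFE, 0x1F5A4, 0x7FFFF):
    print(f"U+{cp:04X}", is_xml_character(cp))
```

The sample values show the trade-off described above: the C1 Control U+0089 and the noncharacter U+7FFFF are accepted, while U+0000, U+D800, and U+FFFE are not.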

4.3. Unicode Assignables

This document defines the "Unicode Assignables" subset as all the Unicode code points that are not problematic. This subset comprises all code points that are currently assigned, or might in future be assigned, to characters that are not legacy control codes.

Unicode Assignables can be expressed as an ABNF production:

unicode-assignable =
   %x9 / %xA / %xD /               ; useful controls
   %x20-7E /                       ; exclude C1 Controls and DEL
   %xA0-D7FF /                     ; exclude surrogates
   %xE000-FDCF /                   ; exclude FDD0 nonchars
   %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
   %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
   %x30000-3FFFD / %x40000-4FFFD /
   %x50000-5FFFD / %x60000-6FFFD /
   %x70000-7FFFD / %x80000-8FFFD /
   %x90000-9FFFD / %xA0000-AFFFD /
   %xB0000-BFFFD / %xC0000-CFFFD /
   %xD0000-DFFFD / %xE0000-EFFFD /
   %xF0000-FFFFD / %x100000-10FFFD

This subset is called the AssignablesClass for use in PRECIS. Its registration template can be found in Section 6.2.3.
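
Because the ranges for planes 1 through 16 follow a regular pattern, an implementation can generate them rather than list them. The Python sketch below does so; the names ASSIGNABLE_RANGES and is_unicode_assignable are hypothetical.

```python
ASSIGNABLE_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),  # useful controls
    (0x20, 0x7E),                        # stops before DEL and the C1 Controls
    (0xA0, 0xD7FF),                      # stops before the surrogates
    (0xE000, 0xFDCF),                    # stops before the FDD0 noncharacters
    (0xFDF0, 0xFFFD),                    # stops before FFFE and FFFF
]
# Planes 1 through 16: everything except the last two code points of each.
ASSIGNABLE_RANGES += [
    (plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 17)
]


def is_unicode_assignable(cp: int) -> bool:
    """True if cp falls in one of the inclusive ranges above."""
    return any(low <= cp <= high for low, high in ASSIGNABLE_RANGES)


total = sum(high - low + 1 for low, high in ASSIGNABLE_RANGES)
print(total)
```

The total is 1,111,936, which equals the 1,114,112 code points of the codespace less 2,048 surrogates, 62 legacy controls, and 66 noncharacters.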

5. Using Subsets

Many IETF specifications rely on well-known data formats such as JSON, I-JSON, CBOR, YAML, and XML. These formats specify default subsets. For example, JSON allows object member names and string values to include any Unicode code point, including all the problematic types.

A protocol based on JSON can be made more robust and implementor-friendly by restricting the contents of object member names and string values to one of the subsets described in Section 4. Equivalent restrictions are possible for other packaging formats such as I-JSON, XML, YAML, and CBOR.

Note that escaping techniques such as those in the JSON example in Section 3 cannot be used to circumvent this sort of restriction, which applies to data content, not textual representation in packaging formats. If a specification restricted a JSON field value to the Unicode Assignables, the example would remain a conforming JSON text but the data it represents would not constitute Unicode Assignable code points.
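
A check of this kind is applied after parsing, to the data rather than to the escaped text. The Python sketch below assumes a hypothetical protocol that restricts every object member name and string value to the Unicode Assignables; the function names are illustrative, and the parser is Python's standard json module.

```python
import json


def is_unicode_assignable(cp: int) -> bool:
    """True if cp is not problematic as defined in Section 2.2."""
    if cp in (0x09, 0x0A, 0x0D):          # useful controls
        return True
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:  # legacy controls
        return False
    if 0xD800 <= cp <= 0xDFFF:            # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:  # noncharacters
        return False
    return cp <= 0x10FFFF


def strings_in(value):
    """Yield every object member name and string value in parsed JSON data."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from strings_in(item)
    elif isinstance(value, dict):
        for name, member in value.items():
            yield name
            yield from strings_in(member)


def accepts(json_text: str) -> bool:
    """Parse json_text, then test the decoded strings, not the escapes."""
    data = json.loads(json_text)
    return all(
        is_unicode_assignable(ord(ch)) for s in strings_in(data) for ch in s
    )


print(accepts(r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'))
print(accepts(r'{"example": ["plain text\n", {"k": "\u00e9"}]}'))
```

The first text is the example from Section 3 and is not accepted; the second contains only Unicode Assignables and is accepted.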

6. IANA Considerations

6.1. Addition to the PRECIS Base Classes Registry

This document defines a new PRECIS base class to be entered into the "PRECIS Base Classes" registry.

The registration template for this new base class is as follows:

Base Class: UnicodeBaseClass

Description: All Unicode code points. It can be used by PRECIS profiles that are simple subsets of Unicode characters.

Note that this base class and the profiles in the following section are specified with straightforward ABNF expressions. This contrasts with the more flexible and expressive approach in [RFC8264], but is appropriate since the subsets are much simpler and less restrictive.

Reference: Section 2 of this RFC

Applicability: This class is undesirable for direct use in protocols because many code points are problematic in the sense defined in Section 2.2.

6.2. Additions to the PRECIS Profiles Registry

This document defines new PRECIS profiles to be entered into the "PRECIS Profiles" registry.

Note that these profiles are oblivious to many of the issues that are considered in depth in [RFC8264], including case mapping, normalization, and directionality. As noted above, specifications for text fields that are designed for display to and interaction with humans would benefit from consideration of those issues.

6.2.1. UnicodeScalarsClass Profile

The registration template for this class is as follows:

Name: UnicodeScalarsClass

Base Class: UnicodeBaseClass

Applicability: Protocols that want to include all Unicode code points except surrogates

Replaces: None

Width Mapping Rule: None

Additional Mapping Rule: None

Case Mapping Rule: None

Normalization Rule: None

Directionality Rule: None

Enforcement: Not specified

Specification: Section 4.1 of this RFC

6.2.2. XMLCharactersClass Profile

The registration template for this class is as follows:

Name: XMLCharactersClass

Base Class: UnicodeBaseClass

Applicability: Protocols that want to allow the same Unicode code points that are allowed in XML

Replaces: None

Width Mapping Rule: None

Additional Mapping Rule: None

Case Mapping Rule: None

Normalization Rule: None

Directionality Rule: None

Enforcement: Not specified

Specification: Section 4.2 of this RFC

6.2.3. AssignablesClass Profile

The registration template for this class is as follows:

Name: AssignablesClass

Base Class: UnicodeBaseClass

Applicability: Protocols that want to allow all Unicode code points that are currently assigned, or might be assigned in the future, to characters that are not "legacy controls" as defined in Section 2.2.2.2

Replaces: None

Width Mapping Rule: None

Additional Mapping Rule: None

Case Mapping Rule: None

Normalization Rule: None

Directionality Rule: None

Enforcement: Not specified

Specification: Section 4.3 of this RFC

7. Security Considerations

Unicode Security Considerations [TR36] is a wide-ranging survey of the issues implementors should consider while writing software to process Unicode text. Many of the attacks it discusses are aimed at deceiving human readers, but vulnerabilities involving issues such as surrogates and noncharacters are also covered, and in fact can contribute to human-deceiving exploits.

The Security Considerations in Section 12 of [RFC8264] generally apply to this document as well.

Note that the Unicode-character subsets specified in this document include a successively-decreasing number of problematic code points, and thus should be less and less susceptible to vulnerabilities. The Section 4.3 subset, "Unicode Assignables", excludes all of them.

8. Normative References

[RFC5234]
Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, DOI 10.17487/RFC5234, <https://www.rfc-editor.org/info/rfc5234>.
[RFC8264]
Saint-Andre, P. and M. Blanchet, "PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols", RFC 8264, DOI 10.17487/RFC8264, <https://www.rfc-editor.org/info/rfc8264>.
[TR36]
The Unicode Consortium, "Unicode Security Considerations", <https://www.unicode.org/reports/tr36/>. Note that this reference is to the latest version of this document, rather than to a specific release. It is not expected that future updates will affect the referenced discussions.
[UNICODE]
The Unicode Consortium, "The Unicode Standard", <http://www.unicode.org/versions/latest/>. Note that this reference is to the latest version of Unicode, rather than to a specific release. It is not expected that future changes in the Unicode Standard will affect the referenced definitions.

9. Informative References

[RFC2277]
Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277, <https://www.rfc-editor.org/info/rfc2277>.
[RFC8259]
Bray, T., Ed., "The JavaScript Object Notation (JSON) Data Interchange Format", STD 90, RFC 8259, DOI 10.17487/RFC8259, <https://www.rfc-editor.org/info/rfc8259>.
[RFC8949]
Bormann, C. and P. Hoffman, "Concise Binary Object Representation (CBOR)", STD 94, RFC 8949, DOI 10.17487/RFC8949, <https://www.rfc-editor.org/info/rfc8949>.
[RFC9413]
Thomson, M. and D. Schinazi, "Maintaining Robust Protocols", RFC 9413, DOI 10.17487/RFC9413, <https://www.rfc-editor.org/info/rfc9413>.
[XML]
Bray, T., Paoli, J., McQueen, C.M., Maler, E., and F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth Edition)", <http://www.w3.org/TR/2008/REC-xml-20081126/>. Note that this reference is to a specific release, based on a history of previous "Edition" releases having changed this production.

Acknowledgements

Thanks are due to Guillaume Fortin-Debigaré, who filed an Errata Report against RFC 8259, The JavaScript Object Notation, noting frequent references to "Unicode characters", when in fact the RFC formally specifies the use of Unicode Code Points.

Thanks also to Asmus Freytag for careful review and many constructive suggestions aimed at making the language more consistent with the structure of the Unicode Standard.

Thanks also to James Manger for the correctness of the ABNF and JSON samples.

Thanks also to Peter Saint-Andre for harmonization with PRECIS.

Authors' Addresses

Tim Bray
Textuality Services
Paul Hoffman
ICANN