Unicode Extensions for BCP 47

IETF BCP 47, Tags for Identifying Languages, defines the language identifiers (tags) used on the Internet and in many standards. It has an extension mechanism that allows additional information to be included. The Unicode Consortium is the maintainer of the extension 'u' for Locale Extensions, as described in RFC 6067, and the extension 't' for Transformed Content, as described in RFC 6497.

For more details on the valid subtags for these extensions, their syntax, and their meanings, see LDML Section 3.7 Unicode BCP 47 Extension Data.

The subtags available for use in the 'u' extension supply the additional information needed to identify locales. The 'u' subtags consist of a set of keys and associated values (types). For example, a locale identifier for British English with numeric collation has the following form:

en-GB-u-kn-true

The subtags available for use in the 't' extension supply the additional information needed to identify transformed content, or to request that content be transformed in a certain way. For example, the language tag "ja-Kana-t-it", when used as a content tag, indicates Japanese Katakana transformed from Italian. It can also be used as a request for a given transformation.
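
The structure of the two example tags above can be illustrated with a minimal Python sketch. It assumes a well-formed tag with no repeated singletons and no grandfathered forms, and it does not check subtags against the CLDR data. The function names `split_extensions` and `parse_u_keywords` are hypothetical, chosen only for this illustration.

```python
def split_extensions(tag):
    """Split a tag into its base part and a dict of extension subtags.

    Assumption: the tag is well-formed. Any one-character subtag is treated
    as a singleton that starts an extension; after the private-use
    singleton 'x', every remaining subtag belongs to 'x'.
    """
    base, extensions, current = [], {}, None
    for subtag in tag.lower().split("-"):
        if len(subtag) == 1 and current != "x":
            current = subtag
            extensions[current] = []
        elif current is None:
            base.append(subtag)
        else:
            extensions[current].append(subtag)
    return "-".join(base), extensions


def parse_u_keywords(u_subtags):
    """Group 'u' extension subtags into attributes and key/type pairs.

    Two-character subtags are treated as keys; longer subtags before the
    first key are attributes, and those after a key form its type.
    """
    attributes, keywords, key = [], {}, None
    for subtag in u_subtags:
        if len(subtag) == 2:
            key = subtag
            keywords[key] = []
        elif key is None:
            attributes.append(subtag)
        else:
            keywords[key].append(subtag)
    return attributes, {k: "-".join(v) for k, v in keywords.items()}


base, extensions = split_extensions("en-GB-u-kn-true")
print(base, extensions)                     # en-gb {'u': ['kn', 'true']}
print(parse_u_keywords(extensions["u"]))    # ([], {'kn': 'true'})
print(split_extensions("ja-Kana-t-it"))     # ('ja-kana', {'t': ['it']})
```

The sketch lowercases the tag before splitting, since language tags are compared without regard to case.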

Machine-Readable Files for Validity Testing

Beginning with CLDR version 1.7.2, machine-readable files are available listing the valid attributes, keys, and types for each successive version of LDML. The most recent version is always available at http://unicode.org/Public/cldr/latest/core.zip. Inside that file, the directory "common/bcp47/" contains the data files defining the valid attributes, keys, and types. For example, the timezone.xml file looks like the following:

<keyword>
  <key name="tz" alias="timezone">
    <type name="adalv" alias="Europe/Andorra"/>
    <type name="aedxb" alias="Asia/Dubai"/>
    ...
  </key>
</keyword>

Using this data, an implementation would determine that "fr-u-tz-adalv" and "fr-u-tz-aedxb" are both valid.
All releases including the latest are listed on http://cldr.unicode.org/index/downloads, with a link to each respective data directory under the column heading Data.
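
The validity check described above can be sketched in Python as follows. This is a minimal illustration, not a full validator: the embedded XML is only the two-entry excerpt shown earlier rather than the complete timezone.xml, the helper names `load_valid_types` and `is_valid_keyword` are hypothetical, and special types that require reference to LDML are not handled.

```python
import xml.etree.ElementTree as ET

# Assumption: a two-entry excerpt standing in for common/bcp47/timezone.xml.
SAMPLE_TIMEZONE_XML = """
<keyword>
  <key name="tz" alias="timezone">
    <type name="adalv" alias="Europe/Andorra"/>
    <type name="aedxb" alias="Asia/Dubai"/>
  </key>
</keyword>
"""


def load_valid_types(xml_text):
    """Map each key name to the set of type names listed under it."""
    root = ET.fromstring(xml_text)
    valid = {}
    for key in root.iter("key"):
        names = {t.get("name") for t in key.iter("type")}
        valid.setdefault(key.get("name"), set()).update(names)
    return valid


def is_valid_keyword(valid, key, type_name):
    """Return True if the key/type pair appears in the loaded data."""
    return type_name.lower() in valid.get(key.lower(), set())


valid = load_valid_types(SAMPLE_TIMEZONE_XML)
print(is_valid_keyword(valid, "tz", "adalv"))   # True
print(is_valid_keyword(valid, "tz", "aedxb"))   # True
print(is_valid_keyword(valid, "tz", "zzzzz"))   # False
```

To check a whole tag such as "fr-u-tz-adalv", an implementation would first extract the key "tz" and the type "adalv" from the 'u' extension and then apply a lookup of this kind.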

Validating some of the data in the CLDR data files also requires reference to LDML, as described in Appendix Q of LDML. For example, LDML defines the type 'codepoints' to define specific code point ranges in Unicode for specific purposes.

Additional Information

The following is not necessary for correct validation of the -u- extension, but may be useful for some readers. 

Each release has an associated data directory of the form "http://unicode.org/Public/cldr/<version>", where "<version>" is replaced by the release number. For example, for version 1.7.2, the "core.zip" file is located at http://unicode.org/Public/cldr/1.7.2/core.zip.  

The version number for any file is given by the directory from which it was downloaded. If that information is no longer available, the version can still be determined by looking at the common/dtd/ldml.dtd file in the core.zip, at the cldrVersion attribute of the version element, as in the following declaration. This information is also accessible with a validating XML parser.

<!ATTLIST version cldrVersion CDATA #FIXED "1.8" >
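
As a rough illustration, the version can be read out of the DTD text with a regular expression. The Python sketch below assumes the declaration has the same shape as the line above, and the function name `cldr_version_from_dtd` is hypothetical. A validating XML parser, as mentioned, is the more robust route.

```python
import re

# Matches a declaration of the form shown above and captures the fixed value.
_CLDR_VERSION = re.compile(
    r'<!ATTLIST\s+version\s+cldrVersion\s+CDATA\s+#FIXED\s+"([^"]+)"'
)


def cldr_version_from_dtd(dtd_text):
    """Return the fixed cldrVersion value found in DTD text, or None."""
    match = _CLDR_VERSION.search(dtd_text)
    return match.group(1) if match else None


line = '<!ATTLIST version cldrVersion CDATA #FIXED "1.8" >'
print(cldr_version_from_dtd(line))   # 1.8
```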

For each release after CLDR 1.8, types introduced in that release are also marked in the data files by the XML attribute "since", such as in the following example: <type name="adp" since="1.9"/> 
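
The "since" attribute makes it possible to list the types introduced in a given release. The following Python sketch shows one way to do that. The wrapping elements and the second type entry are assumptions added only so that the sample parses and has something to filter out, and `types_since` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: only the "adp" entry comes from the text above; the wrapper
# elements and the "xxx" entry are made up for this illustration.
SAMPLE_XML = """
<keyword>
  <key name="example">
    <type name="xxx"/>
    <type name="adp" since="1.9"/>
  </key>
</keyword>
"""


def types_since(xml_text, release):
    """Return the names of types whose 'since' attribute equals release."""
    root = ET.fromstring(xml_text)
    return [t.get("name") for t in root.iter("type") if t.get("since") == release]


print(types_since(SAMPLE_XML, "1.9"))   # ['adp']
```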

The data is also currently maintained in a source code repository, with each release tagged, for viewing directly without unzipping. 
For example, see: http://unicode.org/repos/cldr/tags/release-1-9/common/bcp47/. The current development snapshot is found at http://unicode.org/repos/cldr/trunk/common/bcp47/.