IETF BCP 47 Tags for Identifying Languages defines the language identifiers (tags) used on the Internet and in many standards. It has an extension mechanism that allows additional information to be included. The Unicode Consortium is the maintainer of the extension ‘u’, as described in https://tools.ietf.org/html/rfc6067.
The subtags available for use in the 'u' extension carry additional information needed for identifying locales. They consist of a set of keys and associated values (types). For example, a locale identifier for British English with numeric collation has the following form:
en-GB-u-kn-true
The allowable keys and types, and their respective meanings, are defined in Section 3, Unicode Language and Locale Identifiers of LDML. That section also includes BNF syntax for testing well-formed extensions, and the definition of the canonical form. For future use, the specification also provides for attributes.
Beginning with CLDR version 1.7.2, machine-readable files are available listing the valid attributes, keys, and types for each successive version of LDML. The most recent version is always available at http://unicode.org/Public/cldr/latest/core.zip. Inside that file, the directory "common/bcp47/" contains the data files defining the valid attributes, keys, and types. For example, the timezone.xml file looks like the following:
<key name="tz" alias="timezone">
<type name="adalv" alias="Europe/Andorra"/>
<type name="aedxb" alias="Asia/Dubai"/>
Using this data, an implementation would determine that "fr-u-tz-adalv" and "fr-u-tz-aedxb" are both valid. All releases, including the latest, are listed on http://cldr.unicode.org/index/downloads, with a link to each respective data directory under the column heading Data.
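The lookup described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a full BCP 47 parser: the function names load_valid_types, u_extension_pairs, and is_valid_u_extension are hypothetical, the sample fragment is closed with a </key> tag so that it parses on its own, and types that need LDML rules rather than a simple list are not handled.

```python
import xml.etree.ElementTree as ET

# The timezone.xml fragment shown above, closed with </key> (an assumption)
# so that it parses as a standalone document.
SAMPLE = """
<key name="tz" alias="timezone">
  <type name="adalv" alias="Europe/Andorra"/>
  <type name="aedxb" alias="Asia/Dubai"/>
</key>
"""

def load_valid_types(xml_text):
    """Map each key name to the set of type names listed for it."""
    root = ET.fromstring(xml_text.strip())
    return {
        key.get("name"): {t.get("name") for t in key.iter("type")}
        for key in root.iter("key")
    }

def u_extension_pairs(tag):
    """Return (key, type) pairs found in the 'u' extension of a tag."""
    pairs, key, values, in_u = [], None, [], False
    for sub in tag.lower().split("-")[1:]:
        if len(sub) == 1:
            # Another singleton ends the 'u' extension; 'x' starts private use.
            if in_u or sub == "x":
                break
            in_u = sub == "u"
        elif in_u and len(sub) == 2:
            # A two-letter subtag starts a new key.
            if key is not None:
                pairs.append((key, "-".join(values) or "true"))
            key, values = sub, []
        elif in_u and key is not None:
            values.append(sub)
        # Longer subtags before the first key are attributes; skipped here.
    if key is not None:
        pairs.append((key, "-".join(values) or "true"))
    return pairs

def is_valid_u_extension(tag, table):
    """True if every key/type pair in the tag appears in the table."""
    return all(k in table and t in table[k] for k, t in u_extension_pairs(tag))

table = load_valid_types(SAMPLE)
print(is_valid_u_extension("fr-u-tz-adalv", table))
print(is_valid_u_extension("fr-u-tz-aedxb", table))
```

With only the timezone fragment loaded, a tag such as en-GB-u-kn-true is rejected by this sketch because the "kn" key is not in the table; a fuller version would load every data file in "common/bcp47/".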
Some data in the CLDR data files cannot be validated from the files alone and also requires reference to LDML, according to Appendix Q of LDML. For example, LDML defines the type 'codepoints' to specify particular code point ranges in Unicode for specific purposes.
The following is not necessary for correct validation of the -u- extension, but may be useful for some readers.
Each release has an associated data directory of the form "http://unicode.org/Public/cldr/<version>", where "<version>" is replaced by the release number. For example, for version 1.7.2, the "core.zip" file is located at http://unicode.org/Public/cldr/1.7.2/core.zip.
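As a small illustration of this URL pattern, the following Python sketch builds the core.zip location for a given release. The function name core_zip_url is hypothetical; only the URL form comes from the text above.

```python
BASE_URL = "http://unicode.org/Public/cldr"

def core_zip_url(version="latest"):
    """Return the core.zip URL for a release number such as '1.7.2'."""
    return f"{BASE_URL}/{version}/core.zip"

print(core_zip_url("1.7.2"))
```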
The version number for any file is given by the directory from which it was downloaded. If that information is no longer available, the version can still be determined from the common/dtd/ldml.dtd file in core.zip, which fixes the value of the cldrVersion attribute of the version element. This information is also accessible with a validating XML parser. The declaration looks like the following:
<!ATTLIST version cldrVersion CDATA #FIXED "1.8" >
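One way to read this value without a validating parser is a plain text search over the DTD. The sketch below assumes the declaration has the shape shown above; the function name cldr_version_from_dtd is hypothetical.

```python
import re

# Matches the fixed cldrVersion declaration, tolerating variable whitespace.
CLDR_VERSION_RE = re.compile(
    r'<!ATTLIST\s+version\s+cldrVersion\s+CDATA\s+#FIXED\s+"([^"]+)"'
)

def cldr_version_from_dtd(dtd_text):
    """Return the fixed cldrVersion value, or None if it is not declared."""
    match = CLDR_VERSION_RE.search(dtd_text)
    return match.group(1) if match else None

line = '<!ATTLIST version cldrVersion CDATA #FIXED "1.8" >'
print(cldr_version_from_dtd(line))
```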
For each release after CLDR 1.8, types introduced in that release are also marked in the data files by the XML attribute "since", such as in the following example:
<type name="adp" since="1.9"/>
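The "since" attribute makes it possible to list the types added in a particular release. The Python sketch below is illustrative only: the function name types_introduced_in and the wrapping <sample> element are hypothetical, and the sample simply combines two type entries quoted in this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element around type entries quoted in this document.
SAMPLE = """
<sample>
  <type name="adp" since="1.9"/>
  <type name="adalv" alias="Europe/Andorra"/>
</sample>
"""

def types_introduced_in(xml_text, version):
    """Return names of types whose 'since' attribute equals the version."""
    root = ET.fromstring(xml_text.strip())
    return [t.get("name") for t in root.iter("type") if t.get("since") == version]

print(types_introduced_in(SAMPLE, "1.9"))
```

Types without a "since" attribute are simply skipped by this filter.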
The data is also currently maintained in a source code repository, with each release tagged, for viewing directly without unzipping. For example, see: http://unicode.org/repos/cldr/tags/release-1-9/common/bcp47/. The current development snapshot is found at http://unicode.org/repos/cldr/trunk/common/bcp47/.