BCP47 Syntax Mapping

In the current LDML specification, a Unicode Locale Identifier is composed of a Unicode Language Identifier plus optional locale extensions. A Unicode Language Identifier is fully compatible with a BCP47 language tag, but the syntax of locale extensions ("@" key "=" type (";" key "=" type)* ) is not. LDML attempts to define a systematic mapping, but the current definition may truncate a key or type value to 8 characters (and/or remove "-" from some type values) because of the syntax restrictions on BCP47 language subtags. The current definition relies on BCP47 private use features, but we want to make locale extensions formal by writing a new RFC that reserves a singleton letter for this usage. That avoids conflicts with other private use values and lets software developers write a parser for Unicode locale extensions with confidence.

BCP 47 is undergoing a revision which should be done soon:

Once we define a formal representation of Unicode locale extensions in BCP47 syntax, we no longer have any good reason to use the @key1=type1;key2=type2... syntax for Unicode Locale Identifiers other than backward compatibility. This document proposes that we retire the proprietary syntax and migrate to the new syntax, which is fully supported by BCP47 language tags.

There are several options for representing keyword key/type pairs in BCP47 syntax. The examples in the following proposal assume that the letter "u" is reserved for Unicode locale extensions; however, we could use any of the possible extension singletons: [0-9 a-w y z].

The table below shows the locale extension keys/values currently defined by the LDML specification.

Key/Type Definitions

Collation Parameters

1. Proposed BCP47 subtag syntax

This document proposes the syntax described by the BNF below.

locale-extensions = locale-singleton "-" extension *("-" extension)

extension = key "-" type

locale-singleton = "u"

key = 2alphanum

type = 3*8alphanum

alphanum = (ALPHA / DIGIT)

Example:

en-US-u-ca-islamicc-co-phonebk

This corresponds to the former syntax:

en-US@calendar=islamic-civil;collation=phonebook
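
The grammar above can be checked mechanically. The following is a minimal sketch, not a reference implementation: the function names parse_locale_extensions and split_locale are hypothetical, lowercasing the subtags is a choice made here (BCP47 tags are case-insensitive), and splitting at the first "-u-" is a simplification that assumes the singleton "u" proposed in this document.

```python
import re

# From the BNF above: key = 2alphanum, type = 3*8alphanum.
_KEY = r"[A-Za-z0-9]{2}"
_TYPE = r"[A-Za-z0-9]{3,8}"
_EXTENSIONS = re.compile(rf"u(?:-{_KEY}-{_TYPE})+", re.IGNORECASE)


def parse_locale_extensions(ext):
    """Parse a string such as 'u-ca-islamicc-co-phonebk' into a dict.

    Raises ValueError if the string does not match the proposed grammar.
    """
    if not _EXTENSIONS.fullmatch(ext):
        raise ValueError(f"not a valid locale extension: {ext!r}")
    subtags = ext.lower().split("-")[1:]  # drop the singleton "u"
    # Subtags alternate key, type, key, type, ...
    # The proposal does not say how duplicate keys are handled;
    # in this sketch a later duplicate overwrites an earlier one.
    return dict(zip(subtags[0::2], subtags[1::2]))


def split_locale(tag):
    """Split a full identifier into (language identifier, extensions dict).

    Simplification: the extension part is assumed to start at the
    first "-u-" in the tag.
    """
    parts = re.split(r"-u-", tag, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 1:
        return tag, {}
    return parts[0], parse_locale_extensions("u-" + parts[1])
```

Because keys are always 2 characters and types are 3 to 8 characters, a single regular expression is enough to reject malformed input such as a type that is too short or a key with no type.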

2. Keys

Key names, and only key names, are always exactly 2 characters long, and types (values) are always longer than 2 characters. This proposal defines new canonical key names below.

The motivation is to reduce string size and to make sure that keys and values do not overlap syntactically.

3. Types

3.1 Collation

3.1.1 Collation (co) types

3.1.2 Collation Strength (ks) types

3.1.3 Collation Alternate (ka) types

Current          Proposed

non-ignorable    noignore

shifted          shifted

3.1.4 Collation Backwards (kb) / Normalization (kk) / Case Level (kc) / Hiragana Quaternary (kh) / Numeric (kn) types

Current    Proposed

yes        true

no         false
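
The value changes listed in 3.1.3 and 3.1.4 amount to two small lookup tables. The sketch below covers only those pairs; the function name convert_type is hypothetical, and raising KeyError for any other key is a choice made here rather than something this proposal specifies.

```python
# Legacy -> proposed type values, limited to the pairs in 3.1.3 and 3.1.4.
ALTERNATE_TYPES = {"non-ignorable": "noignore", "shifted": "shifted"}
BOOLEAN_TYPES = {"yes": "true", "no": "false"}

# Keys from 3.1.4 whose types change from yes/no to true/false.
BOOLEAN_KEYS = {"kb", "kk", "kc", "kh", "kn"}


def convert_type(key, legacy_type):
    """Return the proposed type value for a new-style two-letter key."""
    if key == "ka":
        table = ALTERNATE_TYPES
    elif key in BOOLEAN_KEYS:
        table = BOOLEAN_TYPES
    else:
        # Other keys are outside the scope of this sketch.
        raise KeyError(f"no mapping sketched for key {key!r}")
    return table[legacy_type]
```

For example, convert_type("ka", "non-ignorable") gives "noignore", which fits within the 8-character limit on type subtags.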

3.1.5 Collation Case First (kf) types

3.1.6 Collation Variable Top (kv) type

The variable top parameter is specified by a code point in the format uXXuYYYY. No changes are required.

3.2 Calendar (ca)

3.3 Currency (cu) types

ISO 4217 codes (3-letter alphabetic) are used for currency. No changes are required.

3.4 Number System (nu) types

The current CVS snapshot implementation uses CSS3 names. This proposal changes all type names to script codes, with one exception (arabext).

3.5 Time Zone (tz) types

CLDR uses Olson tzids. These IDs are usually formed as <continent>+"/"+<exemplar city> and are relatively long. To satisfy the syntax requirements discussed in this document, we need to map each of these IDs to a unique, relatively short ID. The UN LOCODE is designed to assign unique location codes, and it satisfies most of the requirements. A LOCODE consists of a 2-letter ISO country code and a 3-letter location code. This proposal suggests that the 5-letter LOCODE be used as the short time zone ID if the exemplar city has an exact match in the LOCODE repertoire. Some Olson tzids do not have a direct mapping in LOCODE. In those cases, we assign our own codes, using 3-4 or 6-8 letter codes to distinguish them from LOCODEs. For the Olson tzids Etc/GMT*, this proposal suggests "UTC" + ["E" | "W"] + nn (hour offset); for example, UTCE01 means 1 hour east of UTC (Etc/GMT-1). The proposed short ID list is attached to this document.
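
The Etc/GMT* rule can be sketched as a small function. This is an illustration only: the name short_id_for_etc_gmt is hypothetical, and returning None for a zero offset or for any other tzid is an assumption, since this proposal describes only the "E"/"W" forms. Note that Etc/GMT IDs use an inverted sign, so "-" means east of UTC.

```python
import re

_ETC_GMT = re.compile(r"Etc/GMT([+-])(\d{1,2})")


def short_id_for_etc_gmt(tzid):
    """Map 'Etc/GMT-1' to 'UTCE01' and 'Etc/GMT+5' to 'UTCW05'.

    Returns None for IDs that are not of the form Etc/GMT+n or
    Etc/GMT-n, and for a zero offset, which the rule does not cover.
    """
    m = _ETC_GMT.fullmatch(tzid)
    if m is None:
        return None
    sign, hours = m.group(1), int(m.group(2))
    if hours == 0:
        return None
    # In Etc/GMT IDs, "-" means east of UTC and "+" means west.
    direction = "E" if sign == "-" else "W"
    return f"UTC{direction}{hours:02d}"
```

The resulting IDs are 6 characters long, which keeps them distinct from 5-letter LOCODEs and within the 3 to 8 character limit on type subtags.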