Within programs and structured data, languages are indicated with stable identifiers of the form en, fr-CA, or zh-Hant. The standard Unicode language identifiers follow IETF BCP 47, with some small differences defined in UTS #35: Locale Data Markup Language (LDML). Locale identifiers use the same format, with certain possible extensions.

Often it is not clear which language identifier to use. For example, what most people call Punjabi in Pakistan actually has the code 'lah' and the formal name "Lahnda". There are many other cases where the same name is used for different languages, or where the name that people search for is not listed in the IANA registry.

Moreover, a language identifier uses not only the 'base' language code, like 'en' for English or 'ku' for Kurdish, but also certain modifiers, such as en-CA for Canadian English or ku-Latn for Kurdish written in Latin script. Each of these modifiers is called a subtag (or sometimes a code), and subtags are separated by "-" or "_". The language identifier itself is also called a language tag, and sometimes a language code.

Here is an example of the steps to take to find the right language identifier to use. Let's say you want to find the identifier for a language called "Ganda", which you know is spoken in Uganda. You'll first pick the base language subtag as described below, then add any necessary script/territory subtags, and then verify. If you can't find the name after following these steps, or have other questions, ask on the Unicode CLDR Mailing List.

If you are looking at a prospective language code, like "swh", the process is similar; follow the steps below, starting with the verification.

Choosing the Base Language Code
Choosing Script/Territory Subtags

If you need a particular variant of a language, then you'll add additional subtags, typically script or territory. Consult Sample Subtags for the most common choices. Again, review Caution! below.

Verifying Your Choice
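As a rough aid when checking a candidate identifier, the structure described above (a base language subtag, then optional script and territory subtags, separated by "-" or "_") can be sketched in Python. This is a minimal illustration, not the UTS #35 grammar: the function name `split_language_identifier` is hypothetical, the pattern assumes 2- or 3-letter language subtags, 4-letter script subtags, and 2-letter or 3-digit territory subtags, and it ignores variants and extensions. It only checks the shape of an identifier, not whether the subtags are registered.

```python
import re

# Simplified shape of a language identifier: language[-script][-territory].
# Assumption: variants and locale extensions are out of scope for this sketch.
_IDENTIFIER = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:[-_](?P<script>[A-Za-z]{4}))?"
    r"(?:[-_](?P<territory>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_language_identifier(identifier):
    """Split an identifier such as 'ku_latn_tr' into its subtags.

    Hypothetical helper for illustration; returns a dict with the
    conventional casing (language lower, script Title, territory UPPER).
    """
    match = _IDENTIFIER.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a simple language identifier: {identifier!r}")
    script = match.group("script")
    territory = match.group("territory")
    return {
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "territory": territory.upper() if territory else None,
    }
```

For instance, `split_language_identifier("en-CA")` yields the language `en` with territory `CA` and no script, while an extlang-style string such as `zh-yue` is rejected because `yue` fits neither the script nor the territory shape.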
Documenting Your Choice

If you are requesting a new locale / language in CLDR, please include the links to the particular pages above so that we can process your request more quickly, as we have to double check before any addition. The links will be of the form:
Caution!

Canonical Form
Macrolanguages

ISO (and hence BCP 47) has the notion of an individual language (like en = English) versus a Collection or Macrolanguage. For compatibility, Unicode language and locale identifiers always use the Macrolanguage to identify the predominant form. Thus the Macrolanguage subtag "zh" (Chinese) is used instead of "cmn" (Mandarin). Similarly, suppose that you are looking for Kurdish written in Latin letters, as in Turkey. It is a mistake to think that, because that region is in the north, you should use the subtag 'kmr' for Northern Kurdish. You should instead use ku-Latn-TR. See also: ISO 636 Deprecation Requests.

Unicode language identifiers do not allow the "extlang" form defined in BCP 47. For example, use "yue" instead of "zh-yue" for Cantonese.

Ethnologue

When searching, such as with site:ethnologue.com ganda, be sure to completely disregard matches in Ethnologue 14: these are out of date and do not have the right codes!

The Ethnologue is a great source of information, but it must be approached with a certain degree of caution. Many of the population figures are far out of date, or not well substantiated. The Ethnologue also focuses on native, spoken languages, whereas CLDR and many other systems focus on written language, for computer UI and document translation, and on fluent speakers (not necessarily native speakers). So, for example, it would be a mistake to look at http://www.ethnologue.com/show_country.asp?name=EG and conclude that the right language subtag for the Arabic used in Egypt is "arz", which has the largest population. Instead, the right code is "ar", Standard Arabic, which is the one used for document and UI translation.

Wikipedia

Wikipedia is also a great source of information, but it must be approached with a certain degree of caution as well. Be sure to follow up on references, not just look at articles.
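Returning to the Macrolanguages caution: the two rules stated there (prefer the macrolanguage subtag for the predominant form, and avoid the extlang form) can be sketched as a small rewrite step. This is only an illustration under stated assumptions: the function name `prefer_macrolanguage` is hypothetical, and the two tables contain only the examples named in this page, not the actual CLDR alias data.

```python
# Illustrative tables only: they hold just the examples mentioned in the text.
# Assumption: real tooling would read the full alias data from CLDR instead.
PREDOMINANT_TO_MACROLANGUAGE = {
    "cmn": "zh",  # Mandarin -> Chinese
    "kmr": "ku",  # Northern Kurdish -> Kurdish
}
EXTLANG_TO_LANGUAGE = {
    ("zh", "yue"): "yue",  # "zh-yue" -> "yue" (Cantonese)
}


def prefer_macrolanguage(identifier):
    """Rewrite the base language subtag according to the tables above.

    Hypothetical helper; other subtags are passed through unchanged.
    """
    subtags = identifier.replace("_", "-").split("-")
    # Collapse an extlang pair such as zh-yue into the single language subtag.
    if len(subtags) >= 2:
        pair = (subtags[0].lower(), subtags[1].lower())
        if pair in EXTLANG_TO_LANGUAGE:
            subtags = [EXTLANG_TO_LANGUAGE[pair]] + subtags[2:]
    # Replace a predominant individual language with its macrolanguage.
    base = subtags[0].lower()
    subtags[0] = PREDOMINANT_TO_MACROLANGUAGE.get(base, base)
    return "-".join(subtags)
```

With these tables, `prefer_macrolanguage("kmr-Latn-TR")` gives `ku-Latn-TR`, matching the Kurdish example above, and identifiers whose base subtag is not listed are returned with only the separator normalized to "-".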