Collation sequences can be quite tricky to specify. The locale-based collation rules in Unicode CLDR specify customizations of the standard data for UTS #10: Unicode Collation Algorithm (UCA). Requests to change the collation order for a given locale, or to supply additional variants, need to follow the guidelines in this document.

Filing a Request
Requests to change the collation order for a given locale, or to supply additional variants, should be filed as bugs at http://unicode.org/cldr/trac/newticket.

Rules
The request should present the precise change expressed as rules. For readability, the rules should be supplied in the core syntax as specified in the table "Specifying Collation Ordering" in http://www.unicode.org/reports/tr35/. The rules must also be Minimal Rules as described below.

    & c < cs
    & cs <<< ccs / cs

Test Data
Please supply short test cases that illustrate the correct sorting behavior as a list of lines in sorted order. Try to include cases that show the boundary behavior by including suffixes, such as the following to illustrate that "cs" and "ccs" sort specially.

    c
    cy
    cs
    cscs
    ccs
    cscsy
    ccsy
    csy
    d

Justification
Provide justification for your change. Citations should be to authoritative pages on the web, in English.

Testing Your Request
Please test out any suggested rules before filing a bug, using Locale Explorer.
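
Independently of that tool, the test data above can be sanity-checked offline against a toy model of the two example rules. The following Python sketch is only an illustration under stated assumptions: the names `collation_key` and `CS_PRIMARY` are hypothetical, plain letters are weighted by code point rather than by real UCA data, and only the primary and tertiary levels are modeled.

```python
# Toy model of the rules "& c < cs" and "& cs <<< ccs / cs".
# Assumption: ordinary letters get a primary weight derived from their code
# point; this is NOT real UCA/DUCET data.

CS_PRIMARY = 2 * ord("c") + 1  # hypothetical weight: after "c", before "d"


def collation_key(text):
    """Return (primary weights, tertiary weights) for a lowercase ASCII string."""
    primary, tertiary = [], []
    i = 0
    while i < len(text):
        if text.startswith("ccs", i):
            # "ccs" expands to cs + cs, with a tertiary difference from "cscs".
            primary += [CS_PRIMARY, CS_PRIMARY]
            tertiary += [1, 0]
            i += 3
        elif text.startswith("cs", i):
            # "cs" is a contraction with its own primary weight.
            primary.append(CS_PRIMARY)
            tertiary.append(0)
            i += 2
        else:
            primary.append(2 * ord(text[i]))
            tertiary.append(0)
            i += 1
    return tuple(primary), tuple(tertiary)


expected = ["c", "cy", "cs", "cscs", "ccs", "cscsy", "ccsy", "csy", "d"]
# Sort the reversed list so the input order cannot accidentally help.
actual = sorted(reversed(expected), key=collation_key)
print(actual == expected)
```

A check like this does not replace testing against a real collation implementation; it only confirms that the test data and the intent of the rules agree with each other.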

Determining the Order
The exact collation sequence for a given language may be difficult to determine. The base ordering of characters can be fairly straightforward, but there are quite a few other complications involved. Most standards that specify collation, such as DIN or CS, are not targeted at algorithmic sorting, and are not complete algorithmic specifications. For example, CSN 97 6030 requires transliteration of foreign scripts, but there are many choices as to how to transliterate, and the exact mechanism is not specified. It also specifies that geometric shapes are sorted by the number of vertices and edges, which is, at a minimum, difficult to determine and is subject to variation in glyphs.

The CLDR goals are to match the sorting of exemplar letters and common punctuation and leave everything else to the standard UCA ordering. For more information, see UTS #10: Unicode Collation Algorithm (UCA).

Determining Level Differences
It is often tricky to determine the exact relationship between characters. In the UCA, case and similar variant differences are at a third (tertiary) level, while accent and similar differences are at a second (secondary) level, and base letter differences are at the first (primary) level. That results in an order like the following:
That is, the difference between c and C is weaker than the difference between c and ç, which in turn is weaker than the difference between c and d.

For any two characters α and β, it may be very clear that α < β, but not clear what the right level difference is. To establish this, see if you can find examples of two words of the following form.

Primary Test
You now need to distinguish which of the non-primary level differences you could have. So try again, this time seeing if you can find examples of two words of the following form, where you know that A and Á have a clear secondary difference in the script.

Secondary Test1
Now the ordering of these two strings tells you whether or not the difference between α and β is a secondary difference. Alternatively, you can look for words of the form:

Secondary Test2
where b < B at a tertiary level. If you get the above ordering for Secondary Test2, you also know that the difference between α and β is at a secondary level. The Test2 form is often easier to find examples for.

If you have established that the characters have neither a primary nor a secondary difference, the following can be used in a similar fashion to test whether or not the difference is at a tertiary level.

Tertiary Test
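
The three levels discussed in this section can be pictured with a small Python sketch. It is a deliberate simplification and not the UCA itself: it assumes that the primary level is the lowercased base letter, the secondary level is the set of combining marks on each base letter, and the tertiary level is upper versus lower case. The function names `level_keys` and `difference_level` are hypothetical.

```python
import unicodedata


def level_keys(text):
    """Return (primary, secondary, tertiary) keys under the simplified model."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Accents attach to the preceding base letter (secondary level).
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.lower())                  # base letter
        secondary.append(())                        # no accents seen yet
        tertiary.append(1 if ch.isupper() else 0)   # case
    return tuple(primary), tuple(secondary), tuple(tertiary)


def difference_level(a, b):
    """Return 1, 2, or 3 for the first level at which a and b differ, else 0."""
    for level, (x, y) in enumerate(zip(level_keys(a), level_keys(b)), start=1):
        if x != y:
            return level
    return 0


print(sorted(["d", "ç", "C", "c"], key=level_keys))
print(difference_level("c", "C"), difference_level("c", "ç"), difference_level("c", "d"))
# A later primary difference outweighs an earlier tertiary one:
print(sorted(["cb", "Ca"], key=level_keys))
```

Because every primary weight is compared before any secondary or tertiary weight, a weaker difference early in a word only matters when the rest of the word is otherwise identical at the stronger levels. That is why pairs of test words with different continuations can reveal the level of a difference.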

Contractions
Characters may behave differently in different contexts. For example, "ch" in Slovak sorts after H. A sequence of characters that behaves that way is called a contraction. Another common case of contractions is in syllabaries, where a sequence of characters forming a syllable collates as a unit.

Note that contractions are typically rather expensive in implementations: they take more storage, and are much slower to compare. So they should be avoided where possible. For example, suppose that we have the following sequence in a dictionary (where the uppercase characters represent characters in the target script):

    KB
    ... // combinations of K with consonants
    KZ
    KA
    KE
    KI
    KO
    KU
    LB
    ...

There are two ways to produce this ordering. One is to have KA, KE, KI, etc. be contractions. The other is to order all the vowels after all the consonants. Where the latter is sufficient, it is strongly preferred.

Minimal Rules
The goal is always to specify the minimal differences from the DUCET. For example, take the case of Slovak, where everything sorts as in DUCET except for certain characters. The following rules place the characters ä, č, đ, and the sequence "ch" (and their case variants) at the appropriate positions in the sorting sequence, and with the appropriate strengths:

Minimal Rules
It would be possible instead to have rules that list every letter used by Slovak [a á ä b c č d ď e é f-h {ch} i í j-l ĺ ľ m n ň o ó ô p-r ŕ s š t ť u ú v-y ý z ž], looking something like the following.

Maximal Rules
The Maximal Rules format is not accepted in CLDR. The reasons are:
    & ඖ # U+0D96 SINHALA LETTER AUYANNA
    < ඉ # U+0D89 SINHALA LETTER IYANNA
    < ඊ # U+0D8A SINHALA LETTER IIYANNA

Pitfalls
There are a number of pitfalls with collation, so be careful. In some cases, such as Hungarian or Japanese, the rules can be fairly complicated (of course, reflecting that the sorting sequence for those languages is complicated).
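
One avoidable pitfall, raised under Contractions, is reaching for contractions when a simple reordering of letters is enough. The Python sketch below illustrates the preferred alternative for the KB ... KU dictionary example. As in that example, Latin capitals stand in for the characters of the target script; the names `ORDER` and `reordered_key` are hypothetical, and the model covers a single level only.

```python
import string

VOWELS = "AEIOU"
CONSONANTS = "".join(ch for ch in string.ascii_uppercase if ch not in VOWELS)

# Single-character ordering: every consonant first, then every vowel.
ORDER = {ch: rank for rank, ch in enumerate(CONSONANTS + VOWELS)}


def reordered_key(word):
    """Compare words letter by letter using the consonants-first order."""
    return tuple(ORDER[ch] for ch in word)


words = ["KA", "KB", "KE", "KI", "KO", "KU", "KZ", "LB"]
reordered = sorted(words, key=reordered_key)
print(reordered)
```

No sequence such as KA or KE needs its own entry here: the dictionary order falls out of the per-character order alone, which is why this approach is cheaper than contractions where it is sufficient.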

