The locale-based collation rules in Unicode CLDR specify modifications of the standard UCA data, also known as DUCET. Requests to change the collation order for a given locale, or to supply additional variants, need to follow the guidelines in this document.
Determining the Order
The exact collation sequence for a given language may be difficult to determine. The base ordering of characters can be fairly straightforward, but there are quite a few other complications involved. Most standards that specify collation, such as DIN or CS, are not targeted at algorithmic sorting, and are not complete algorithmic specifications. For example, CSN 97 6030 requires transliteration of foreign scripts, but there are many choices as to how to transliterate, and the exact mechanism is not specified. It also specifies that geometric shapes are sorted by the number of vertices and edges, which is, at a minimum, difficult to determine and is subject to variation in glyphs. The CLDR goals are to match the sorting of exemplar letters and common punctuation and leave everything else to the standard UCA ordering. For more information, see UTS #10: Unicode Collation Algorithm (UCA).
Determining Level Differences
It is often tricky to determine the exact relationship between characters. In the UCA, case and similar variant differences are at a third (tertiary) level, while accent and similar differences are at a second (secondary) level, and base letter differences are at the first (primary) level. That results in an order like the following:
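As a rough illustration of the three levels, here is a toy sketch in Python. It is not the UCA: the function name `toy_sort_key` and the way weights are derived (base letter from NFD decomposition, accents from combining marks, case from `isupper`) are assumptions made only for this example.

```python
import unicodedata

def toy_sort_key(text):
    """Build a toy three-level key: base letters, then accents, then case."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # A combining mark is a secondary difference on the preceding base letter.
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.lower())                  # primary: base letter
        secondary.append(())                        # secondary: no accent yet
        tertiary.append(1 if ch.isupper() else 0)   # tertiary: case
    # Comparing the tuple compares all primaries first, then all
    # secondaries, then all tertiaries.
    return (primary, secondary, tertiary)

words = ["dab", "çab", "Cab", "cab"]
ordered = sorted(words, key=toy_sort_key)
print(ordered)  # ['cab', 'Cab', 'çab', 'dab']
```

The case difference only matters when primaries and secondaries are equal, and the accent difference only matters when primaries are equal.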
That is, the difference between c and C is weaker than the difference between c and ç, which in turn is weaker than the difference between c and d. For any two characters α and β, it may be very clear that α < β, but not be clear what the right level difference is. To establish this, see if you can find examples of two words of the following form.
Primary Test
| ICU Format | XML Format |
|---|---|
| & A < ä <<< Ä | <reset>A</reset> <p>ä</p> <t>Ä</t> |
| & C < č <<< Č | <reset>C</reset> <p>č</p> <t>Č</t> |
| & D < đ <<< Đ | <reset>D</reset> <p>đ</p> <t>Đ</t> |
| & H < ch <<< cH <<< Ch <<< CH | <reset>H</reset> <p>ch</p> <t>cH</t> <t>Ch</t> <t>CH</t> |
| ... | ... |
It would be possible instead to have rules that list every letter used by Slovak [a á ä b c č d ď e é f-h {ch} i í j-l ĺ ľ m n ň o ó ô p-r ŕ s š t ť u ú v-y ý z ž], looking something like the following.
Maximal Rules
| ICU Format | XML Format |
|---|---|
| & A << á <<< Á < ä <<< Ä < b <<< B < c <<< C < č <<< Č < d ... | <reset>A</reset> <s>á</s> <t>Á</t> <p>ä</p> <t>Ä</t> <p>b</p> <t>B</t> <p>c</p> <t>C</t> <p>č</p> <t>Č</t> <p>d</p> ... |
The Maximal Rules format is not accepted in CLDR. The reasons are:
- Every time a character is tailored, the data for that character takes up more room in typical implementations. That means that the data for collation is larger, downloads of collation libraries with that data are slower, sort keys are longer, and performance is slower; sometimes very much so.
- Related characters in the same script end up in a peculiar order. For example, if the Slovak tailoring omits ƀ, then it would sort after z.
You can see what the UCA currently does with a given script by looking at the charts at Unicode Collation Charts, or at the UCA in ICU-style rules. For example, suppose that U+0D89 SINHALA LETTER IYANNA and U+0D8A SINHALA LETTER IIYANNA needed to come after U+0D96 SINHALA LETTER AUYANNA, in primary order, and that otherwise DUCET was ok. Then you would give the following rules:
& ඖ # U+0D96 SINHALA LETTER AUYANNA
< ඉ # U+0D89 SINHALA LETTER IYANNA
< ඊ # U+0D8A SINHALA LETTER IIYANNA
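To make the effect of such a minimal rule concrete, here is a small Python sketch. It is only a toy model: the helper `apply_primary_tailoring` is hypothetical, and the base order is assumed to be plain code point order for U+0D85..U+0D96 rather than the real DUCET data.

```python
def apply_primary_tailoring(base_order, anchor, tailored):
    """Move the tailored characters, in order, to just after the anchor.

    Every character that is not tailored keeps its relative position.
    """
    order = [ch for ch in base_order if ch not in tailored]
    i = order.index(anchor) + 1
    return order[:i] + list(tailored) + order[i:]

# Toy stand-in for the default order (assumption: code point order).
base = [chr(cp) for cp in range(0x0D85, 0x0D97)]

# Equivalent in spirit to: & U+0D96 < U+0D89 < U+0D8A
tailored_order = apply_primary_tailoring(base, "\u0D96", ["\u0D89", "\u0D8A"])
print(" ".join(tailored_order))
```

Only the two moved letters change position; the rest of the script is left to the default ordering, which is the point of keeping the rules minimal.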
Pitfalls
There are a number of pitfalls with collation, so be careful. In some cases, such as Hungarian or Japanese, the rules can be fairly complicated (of course, reflecting that the sorting sequence for those languages is complicated).
- Only tailor expected data. We focus on the required collation sequence for a given language with normal data. So we don't include full-width characters for a European collation sequence, such as
  - ... CSCS <<< ＣＳＣＳ ...
  - ... CSCS <<< \uFF23\uFF33\uFF23\uFF33 ... (equivalently)
- Tailor trailing contractions. If a sequence of characters is treated as a unit for collation, it should be entered as a contraction.
& c < ch
One might think that a sequence like "dz" doesn't require that, since it would always come after "d" followed by any other letter; it is a "trailing contraction". But in unusual cases, that wouldn't be true; if "dz" is a unit sorted as if it were a distinct letter after "d", one should get the ordering "dα" < "dz". The correct behavior will only happen if "dz" is a contraction, such as
& d < dz
- Watch out for Expansions. If you have a rule like &cs < d, and "cs" has not occurred in a previous rule as a contraction, then this is automatically considered to be the same as &c < d / s; that is, the d expands as if it were a "cs" (actually, primary greater than a "cs", since we wrote "<"). This expansion takes effect until the next primary difference.
So suppose that "ccs" is to behave as if it were "cscs", and take case differences into account. You might try to do this with the rules on the left:
| Rules (Wrong) | Actual Effect |
|---|---|
| & C < cs <<< Cs <<< CS | & C < cs <<< Cs <<< CS |
| & cscs <<< ccs | & cs <<< ccs / cs |
| <<< Cscs <<< Ccs | <<< Cscs / cs <<< Ccs / cs |
| <<< CSCS <<< CCS | <<< CSCS / cs <<< CCS / cs |

But since the CSCS has not been made a contraction in previous rules, this produces an automatic expansion, one that continues through the entire sequence of non-primary differences, as shown on the right. This is not what is wanted: each item acts like it expands compared to the previous item. So CCS, for example, will act like it expands to CSCScs!
What you actually want is the following:
| Rules (Right) | Actual Effect |
|---|---|
| & C < cs <<< Cs <<< CS | & C < cs <<< Cs <<< CS |
| & cscs <<< ccs | & cs <<< ccs / cs |
| & Cscs <<< Ccs | & Cs <<< Ccs / cs |
| & CSCS <<< CCS | & CS <<< CCS / CS |

In short, when you have expansions, it is always safer and clearer to express them with separate resets. There are only a few exceptions to this, notably when CJK characters are interleaved with Hangul Syllables.
- Minimal Rules. Example: Maltese was sorting character sequences before a base character using the following style:
& B
< ċ
<<< Ċ
< c
<<< C
The correct rules should be the minimal ones.
& [before 1] c < ċ <<< Ċ
This finds the highest primary (that's what the 1 is for) character less than c, and uses that as the reset point. For Maltese, the same technique needs to be used for ġ and ż.
- Blocking Contractions. Contractions can be blocked with CGJ, as described in the Unicode Standard and in the Characters and Combining Marks FAQ.
- Case Combinations. Normally all combinations of case need to be supplied for contractions. That is, if ch is a contraction, then you would have the rules ... ch <<< cH <<< Ch <<< CH. The reason for this is so that all case variants sort at the same primary level: thus lowercasing a string will not affect its primary order. Cases such as McHugh are handled like other instances where contractions should be blocked.
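
The trailing-contraction pitfall described earlier in this list can be sketched with a toy model in Python. Everything here is an assumption made for illustration: the function `collation_elements`, the `WEIGHTS` table, and its numeric primary weights are invented and are not taken from UCA or CLDR data.

```python
def collation_elements(text, weights, contractions=()):
    """Split text into collation units (longest contraction first) and
    return the list of toy primary weights for those units."""
    units = []
    i = 0
    while i < len(text):
        for unit in sorted(contractions, key=len, reverse=True):
            if text.startswith(unit, i):
                break              # a contraction matches at position i
        else:
            unit = text[i]         # no contraction: single character
        units.append(weights[unit])
        i += len(unit)
    return units

# Toy primary weights (assumed): "dz" sits just after "d",
# and Greek alpha sorts after all Latin letters.
WEIGHTS = {"d": 10, "dz": 11, "e": 12, "z": 30, "α": 40}

words = ["dz", "dα"]
without_dz = sorted(words, key=lambda w: collation_elements(w, WEIGHTS))
with_dz = sorted(words, key=lambda w: collation_elements(w, WEIGHTS, contractions=("dz",)))

print(without_dz)  # ['dz', 'dα']  (d+z compared letter by letter)
print(with_dz)     # ['dα', 'dz']  ("dz" acts as one letter after "d")
```

Without the contraction, "dz" is compared as d followed by z and sorts before "dα"; only when "dz" is a single unit does "dα" < "dz" hold.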
