Transform Fallback
We need to more clearly describe the presumed lookup fallback for transforms:
Code equivalence
- A lone script code or long script name is equivalent to the BCP 47 syntax: Latn = Latin = und-Latn.
- “und” from BCP 47 is treated the same as the special code “any” in transform IDs.
- In the unlikely event of a collision between a special transform code (any, hex, fullwidth, etc.) and a BCP 47 language code, we need a disambiguation rule. Initial suggestion: add “_ZZ” to the language code.
- For the special codes, we should probably switch to aliases that have a low probability of collision, e.g., always longer than 3 letters.
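The equivalences above can be sketched as a small normalization step applied before any lookup. This is only an illustration: the function name, the choice of “any” as the canonical spelling, and the table of long script names are assumptions made for the example, and the “_ZZ” collision rule is left out because it is still only a suggestion.

```python
# Hypothetical table of long script names; a real implementation would take
# these from script metadata rather than a hand-written dict.
LONG_SCRIPT_NAMES = {
    "Latin": "Latn",
    "Cyrillic": "Cyrl",
    "Greek": "Grek",
    "Arabic": "Arab",
}


def normalize_code(code: str) -> str:
    """Map equivalent spellings of a source or target code to one form."""
    code = code.replace("-", "_")
    # Long script name: Latin -> Latn.
    if code in LONG_SCRIPT_NAMES:
        return LONG_SCRIPT_NAMES[code]
    # "und" and the special code "any" are treated the same.
    if code.lower() in ("und", "any"):
        return "any"
    # und_Latn -> Latn: an undetermined language plus a script is the script.
    parts = code.split("_")
    if len(parts) == 2 and parts[0].lower() == "und" and len(parts[1]) == 4:
        return parts[1].title()
    return code
```

With this sketch, Latn, Latin, and und-Latn all normalize to Latn, while a full language ID such as az_Arab_IR passes through unchanged.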
Language tag fallback
If the source or target is a Unicode language ID, then the language tag fallback is followed, with some additions. For example, for az_Arab_IR:
- 01. az_Arab_IR
- 02. az_Arab
- 03. az_IR
- 04. az
- 05. Arab
- 06. Cyrl
The fallback additions are:
- We also fall back through the country (03). This is along the lines we’ve otherwise discussed for BCP 47 support, and something we should clarify in the spec.
- Once the language is reached, we fall back to script: first the specified script if there is one (05), then the likely script for the language (06), if it differs from 05.
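As a minimal sketch of the chain described above, the function below builds the fallback list from already-parsed subtags. The function name and signature are hypothetical, and the likely script is passed in by the caller rather than computed, since this note does not say where it comes from.

```python
from typing import List, Optional


def language_fallback_chain(
    language: str,
    script: Optional[str],
    region: Optional[str],
    likely_script: Optional[str],
) -> List[str]:
    """Build the fallback chain for one source or target language ID."""
    chain = []
    if script and region:
        chain.append(f"{language}_{script}_{region}")
    if script:
        chain.append(f"{language}_{script}")
    # Addition 1: also fall back through the country.
    if region:
        chain.append(f"{language}_{region}")
    chain.append(language)
    # Addition 2: after the language, fall back to the specified script,
    # then to the likely script if it is different.
    if script:
        chain.append(script)
    if likely_script and likely_script != script:
        chain.append(likely_script)
    return chain


# Reproduces the az_Arab_IR example, taking Cyrl as the likely script as
# the example list does.
print(language_fallback_chain("az", "Arab", "IR", "Cyrl"))
```

For an ID with no explicit script, such as ru_RU with likely script Cyrl, the same function yields ru_RU, ru, Cyrl.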
Laddered fallback
The source, target, and variant use “laddered” fallback. That is, in pseudocode:
for variant in variant-chain
    for target in target-chain
        for source in source-chain
            transform = lookup source-target/variant
            if transform != null return transform
..
For example, here is the chain for ru_RU-el_GR/BGN. I’m spacing out the source, target, and variant for clarity.
- 01. ru_RU - el_GR /BGN
- 02. ru - el_GR /BGN
- 03. Cyrl - el_GR /BGN
- 04. ru_RU - el /BGN
- 05. ru - el /BGN
- 06. Cyrl - el /BGN
- 07. ru_RU - Grek /BGN
- 08. ru - Grek /BGN
- 09. Cyrl - Grek /BGN
- 10. ru_RU - el_GR
- 11. ru - el_GR
- 12. Cyrl - el_GR
- 13. ru_RU - el
- 14. ru - el
- 15. Cyrl - el
- 16. ru_RU - Grek
- 17. ru - Grek
- 18. Cyrl - Grek
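The laddered lookup and the chain above can be reproduced with a short sketch. The function names and the dict-based registry are hypothetical, the three chains are written out by hand, and the variant chain of BGN followed by "no variant" is an assumption read off the example. Returning None after the loops is also an assumption, since the pseudocode elides what happens when nothing matches.

```python
from typing import Dict, Iterator, List, Optional


def candidate_ids(
    source_chain: List[str],
    target_chain: List[str],
    variant_chain: List[Optional[str]],
) -> Iterator[str]:
    """Yield transform IDs in laddered order: source varies fastest."""
    for variant in variant_chain:
        for target in target_chain:
            for source in source_chain:
                suffix = f"/{variant}" if variant else ""
                yield f"{source}-{target}{suffix}"


def lookup(
    registry: Dict[str, object],
    source_chain: List[str],
    target_chain: List[str],
    variant_chain: List[Optional[str]],
) -> Optional[object]:
    """Return the first registered transform along the ladder, if any."""
    for candidate in candidate_ids(source_chain, target_chain, variant_chain):
        transform = registry.get(candidate)
        if transform is not None:
            return transform
    return None  # assumption: no match means no transform


# Chains for ru_RU-el_GR/BGN; None stands for "no variant".
SOURCE_CHAIN = ["ru_RU", "ru", "Cyrl"]
TARGET_CHAIN = ["el_GR", "el", "Grek"]
VARIANT_CHAIN = ["BGN", None]

for candidate in candidate_ids(SOURCE_CHAIN, TARGET_CHAIN, VARIANT_CHAIN):
    print(candidate)
```

Because the variant is the outermost loop, every candidate carrying /BGN is tried before any candidate without a variant, which matches the order of the list above.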
Comments:
- The above is not how the ICU code works. That code actually discards the variant if the exact match is not found, so lines 02-09 are not queried at all. I think that is definitely a mistake.
- Personally, I think the above chain might not be optimal: it would be better to have BGN be stronger than a country difference, but not as strong as a script difference. However, in conversations with Markus, I was convinced that a simple story for how it works is probably the best, and the above is simpler to explain and easier to implement.
Model Requirements
We have an implicit requirement that no variant is populated unless there is also a no-variant version. We need to make sure that this is maintained by the build tools and/or tests. That is, if we have fa-Latn/BGN, we should have fa-Latn as well.
The other piece of this is that every transform should have an explicit variant name, so that people can be explicit about the variant even if we change the default later on. The upshot is that a no-variant version should always just be an alias to one of the variant versions. Operationally, that means the following actions:
Case 1. Only fa-Latn/BGN exists. Add an alias from fa-Latn to fa-Latn/BGN.
Case 2. Only foo-Latn exists. Rename it to foo-Latn/SOMETHING, and then apply Case 1.
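A build-tool or test check for these two cases might look like the sketch below. It is only an illustration under stated assumptions: the function name is hypothetical, transform IDs are treated as plain strings, the variant name for Case 2 is supplied by the caller, and when a base has several variants the sketch aliases to the first one in sorted order purely as a placeholder policy.

```python
from typing import Dict, List, Set, Tuple


def plan_aliases(
    ids: Set[str], default_variant: str
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (renames, aliases) needed to satisfy the model requirements."""
    variants_by_base: Dict[str, List[str]] = {}
    for tid in sorted(ids):
        if "/" in tid:
            base = tid.split("/", 1)[0]
            variants_by_base.setdefault(base, []).append(tid)

    renames: Dict[str, str] = {}
    aliases: Dict[str, str] = {}

    # Case 2: only a no-variant version exists. Rename it, then alias it.
    for tid in sorted(ids):
        if "/" not in tid and tid not in variants_by_base:
            renamed = f"{tid}/{default_variant}"
            renames[tid] = renamed
            aliases[tid] = renamed

    # Case 1: variant versions exist without a no-variant version.
    for base, variants in variants_by_base.items():
        if base not in ids:
            aliases[base] = variants[0]  # placeholder choice of default

    # IDs that already exist both with and without a variant are left alone:
    # the ID strings cannot show whether the no-variant one is an alias.
    return renames, aliases


renames, aliases = plan_aliases({"fa-Latn/BGN", "foo-Latn"}, "SOMETHING")
print(renames)
print(aliases)
```

Run as a test, a non-empty result would flag data that violates the requirement; run as a build step, the two dicts describe the renames and aliases to apply.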