Unihan Data

Background

In CLDR, we use this data for sorting and romanization of Chinese data. Both of these need to be weighted for proper names, since those are the items most commonly needed (contact names, map locations, etc.).

Sorting
1. A major collation for simplified Chinese compares characters first by pinyin, then (if the same pinyin) by total strokes. It thus needs the most common (simplified) pinyin values, and total (simplified) strokes.
2. A major collation for traditional Chinese compares characters by total (traditional) strokes. It needs reliable total (traditional) strokes.
3. For both of these, we use the Unicode radical-stroke (kRSUnicode) as a tie-breaker. The pinyin values need to be the best single-character readings (without context).
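
The two orderings above can be sketched as chained comparators. This is a minimal illustration, not the code in GenerateUnihanCollators.java: the `HanEntry` class, its field names, and the ASCII tone-digit pinyin spelling (`yi1`, `er4`) are assumptions made for the example, and plain string comparison stands in for a real pinyin ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class HanSortSketch {
    /** Hypothetical per-character record; names are illustrative only. */
    static final class HanEntry {
        final String character;
        final String pinyin;        // assumed ASCII form with a tone digit, e.g. "yi1"
        final int totalStrokes;     // simplified or traditional count, as appropriate
        final int radical;          // radical number from kRSUnicode
        final int residualStrokes;  // residual stroke count from kRSUnicode

        HanEntry(String character, String pinyin, int totalStrokes,
                 int radical, int residualStrokes) {
            this.character = character;
            this.pinyin = pinyin;
            this.totalStrokes = totalStrokes;
            this.radical = radical;
            this.residualStrokes = residualStrokes;
        }

        String pinyin() { return pinyin; }
        int totalStrokes() { return totalStrokes; }
        int radical() { return radical; }
        int residualStrokes() { return residualStrokes; }
    }

    // Tie-breaker shared by both collations: radical, then residual strokes.
    static final Comparator<HanEntry> RADICAL_STROKE =
        Comparator.comparingInt(HanEntry::radical)
                  .thenComparingInt(HanEntry::residualStrokes);

    // Simplified: pinyin first, then total strokes, then radical-stroke.
    static final Comparator<HanEntry> PINYIN_ORDER =
        Comparator.comparing(HanEntry::pinyin)
                  .thenComparingInt(HanEntry::totalStrokes)
                  .thenComparing(RADICAL_STROKE);

    // Traditional: total strokes first, then radical-stroke.
    static final Comparator<HanEntry> STROKE_ORDER =
        Comparator.comparingInt(HanEntry::totalStrokes)
                  .thenComparing(RADICAL_STROKE);

    static final List<HanEntry> SAMPLE = List.of(
        new HanEntry("一", "yi1", 1, 1, 0),
        new HanEntry("二", "er4", 2, 7, 0),
        new HanEntry("三", "san1", 3, 1, 2),
        new HanEntry("十", "shi2", 2, 24, 0));

    /** Returns the characters of the entries, concatenated in the given order. */
    static String sortedCharacters(List<HanEntry> entries, Comparator<HanEntry> order) {
        List<HanEntry> copy = new ArrayList<>(entries);
        copy.sort(order);
        StringBuilder out = new StringBuilder();
        for (HanEntry e : copy) {
            out.append(e.character);
        }
        return out.toString();
    }
}
```

With the four sample characters, `sortedCharacters(SAMPLE, PINYIN_ORDER)` gives 二三十一, and `sortedCharacters(SAMPLE, STROKE_ORDER)` gives 一二十三, where 二 precedes 十 only because of the radical-stroke tie-breaker.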
Romanization
1. We need to have the most common pinyin values. These can include contextual readings (e.g., readings that span more than one character).

Tool

There is a file called GenerateUnihanCollators.java which is currently used to generate the CLDR data, making use of Unihan data plus some special data files. The code is pretty crufty, since it was mostly designed to synthesize data from different sources before kMandarin and kTotalStrokes were expanded in Unihan. It currently lives in the unicodetools project, since it needs to be run against draft versions of the UCD.

As input, it uses the Unicode properties, plus the following:

It creates a number of files in {Generated}/cldr/han/kMandarin.txt

Take Han-Latin.txt, and insert into /cldr/common/transforms/Han-Latin.txt, replacing the lines between
- # START AUTOGENERATED Han-Latin.xml
- # END AUTOGENERATED Han-Latin.xml
Diff to sanity check. Run the Transform tests (or just all of them), then check in.
Take the strokeT.*\.txt files, and pinyin.*\.txt and insert them in the appropriate slots in
1. pinyin.txt → # START AUTOGENERATED PINYIN LONG (sort by pinyin then kTotalStrokes then kRSUnicode)
2. pinyin_short.txt → # START AUTOGENERATED PINYIN SHORT (sort by pinyin then kTotalStrokes then kRSUnicode)
3. strokeT.txt → # START AUTOGENERATED STROKE LONG
4. strokeT_short.txt → # START AUTOGENERATED STROKE SHORT
Diff to sanity check.
Run tests, check in.
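
The "replace the lines between the markers" step can be sketched as a small helper. This is only an illustration of the manual procedure, not part of the existing tool; the class and method names are hypothetical, and it assumes each marker appears on a line of its own in the target file.

```java
import java.util.ArrayList;
import java.util.List;

class AutogeneratedSectionSketch {
    /**
     * Returns a copy of target in which everything strictly between the
     * start-marker line and the end-marker line is replaced by generated.
     * Both marker lines are kept.
     */
    static List<String> replaceBetween(List<String> target, String startMarker,
                                       String endMarker, List<String> generated) {
        int start = indexOfLineContaining(target, startMarker, 0);
        if (start < 0) {
            throw new IllegalArgumentException("Missing marker: " + startMarker);
        }
        int end = indexOfLineContaining(target, endMarker, start + 1);
        if (end < 0) {
            throw new IllegalArgumentException("Missing marker: " + endMarker);
        }
        List<String> result = new ArrayList<>(target.subList(0, start + 1));
        result.addAll(generated);
        result.addAll(target.subList(end, target.size()));
        return result;
    }

    /** Index of the first line at or after from that contains marker, or -1. */
    private static int indexOfLineContaining(List<String> lines, String marker, int from) {
        for (int i = from; i < lines.size(); i++) {
            if (lines.get(i).contains(marker)) {
                return i;
            }
        }
        return -1;
    }
}
```

For the transform file, the markers would be `# START AUTOGENERATED Han-Latin.xml` and `# END AUTOGENERATED Han-Latin.xml`. Failing loudly when a marker is missing is a deliberate choice here, and the diff afterwards remains the real sanity check.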

The tool also generates some files that we should take back to the Unihan people. Either changes should be made in Unihan, or we should drop the items from our patch files. Examples:

kTotalStrokesReplacements.txt

It shows the cases where the bihua values differ from those in Unihan.

imputedStrokes.txt

It shows the cases where a stroke count is synthesized from radical/stroke information. This is only approximate, but better than sorting all such characters at the bottom. It is only used if there is no Unihan or bihua information.
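
One plausible way to synthesize such a count is to add the radical's own stroke count to the residual stroke count in kRSUnicode. The sketch below assumes that formula; the page does not say exactly how the tool computes it. The class name is hypothetical, and the radical table is a three-entry subset for illustration only.

```java
import java.util.Map;
import java.util.OptionalInt;

class ImputedStrokesSketch {
    // Stroke counts for a few Kangxi radicals, keyed by radical number.
    // Illustrative subset; a real table would cover all 214 radicals.
    static final Map<Integer, Integer> RADICAL_STROKES = Map.of(
        1, 1,    // 一
        9, 2,    // 人
        85, 4);  // 水

    /**
     * Estimates total strokes from a kRSUnicode value such as "85.7".
     * Uses the first value if several are given, and ignores the
     * apostrophe that marks a simplified radical form.
     */
    static OptionalInt impute(String kRSUnicode) {
        String first = kRSUnicode.trim().split("\\s+")[0];
        String[] parts = first.replace("'", "").split("\\.");
        if (parts.length != 2) {
            return OptionalInt.empty();
        }
        try {
            int radical = Integer.parseInt(parts[0]);
            int residual = Integer.parseInt(parts[1]);
            Integer radicalStrokes = RADICAL_STROKES.get(radical);
            if (radicalStrokes == null) {
                return OptionalInt.empty(); // radical not in the sample table
            }
            return OptionalInt.of(radicalStrokes + residual);
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }
}
```

The estimate is approximate because a radical is often written in a variant form with a different stroke count. For example, `impute("85.7")` returns 11, which is one too many for a character that uses the three-stroke form 氵 of radical 85.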

Stopgap

As a proxy for the best pinyin, we pick from the many pinyin choices in Unihan using an algorithm that Richard supplied. There is also a small patch file, based on having native Chinese speakers look over the data. Any patches should be pulled back into Unihan. The algorithm is:

Take the first pinyin found in the following sources, in this order. Where there are multiple choices in a field, use the first.

patchFile
kMandarin // moved up in CLDR 30.
kHanyuPinlu
kXHC1983
kHanyuPinyin
bihua

Then, if it is still missing, try to map to a character that does have a pinyin. If we find one, stop and use it.

Radical => Unified
kTraditionalVariant
kSimplifiedVariant
NFKD
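
The selection procedure above can be sketched as two loops: one over the pinyin sources in priority order, and one over the character mappings used as a fallback. This is a simplified illustration, not the tool's actual code. The class and method names are hypothetical, and it assumes each source has already been parsed into a map from character to an ordered list of readings, which hides the differing field syntaxes in Unihan.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

class PinyinPickerSketch {
    /**
     * Looks the character up in each source in priority order and returns
     * the first reading of the first source that has one.
     */
    static Optional<String> direct(String ch, List<Map<String, List<String>>> sources) {
        for (Map<String, List<String>> source : sources) {
            List<String> readings = source.get(ch);
            if (readings != null && !readings.isEmpty()) {
                return Optional.of(readings.get(0));
            }
        }
        return Optional.empty();
    }

    /**
     * Tries a direct lookup first. If that fails, maps the character through
     * each fallback mapping in order and stops at the first mapped character
     * that does have a pinyin.
     */
    static Optional<String> pick(String ch,
                                 List<Map<String, List<String>>> sources,
                                 List<Map<String, String>> mappings) {
        Optional<String> found = direct(ch, sources);
        if (found.isPresent()) {
            return found;
        }
        for (Map<String, String> mapping : mappings) {
            String mapped = mapping.get(ch);
            if (mapped != null) {
                Optional<String> viaMapped = direct(mapped, sources);
                if (viaMapped.isPresent()) {
                    return viaMapped;
                }
            }
        }
        return Optional.empty();
    }
}
```

Here `sources` would hold the patch file, kMandarin, kHanyuPinlu, kXHC1983, kHanyuPinyin, and bihua data in that order, and `mappings` would hold the four character mappings in the order listed. Note that this sketch applies each mapping only once and does not chain mappings together; whether the real tool does so is not stated here.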

OLD

~~DRAFT!!~~

~~In 1.9, we converted to using Unihan data for CLDR collation and transliteration. We’ve run into some problems (pedberg - see for example #3428), and this is a draft proposal for how to resolve them.~~

Longer Term

~~The following are (draft) recommendations for the UTC.~~

Define the kMandarin field to contain one or two values. If there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). If the values would be the same, there is only one value. (pedberg - it is already defined that way)
The preferred value should be the one that is most commonly used, with a focus on proper names (persons or places). For example, if reading X has 30% of the frequency of Y, but X is used with proper names but Y is not, X would be preferred.
Define the kTotalStrokes field to be what is most appropriate for use with zh-Hant, and add a new field, kTotalSimplifiedStrokes, to be what is most appropriate for use with zh-Hans. (pedberg - The kTotalStrokes field is already defined to be the value “for the character as drawn in the Unicode charts”, which may not match the value for zh-Hant; we may need to add 2 stroke count fields.)
~~Get a commitment from the IRG to supply these values for all new characters. Set in place a program to add/fix values for existing characters.~~

~~Once this is in place, remove the now-superfluous patch files in the CLDR collation/transliteration generation.~~

Short Term (1.9.1)

~~Modify the pinyin to choose the 1.8 CLDR transliteration value first, then fall back to the others.~~
Have two transliteration pinyin variants: Names and General. Make the default for pinyin be “Names”. (There are only currently 2 differences.) (pedberg - Yes, but there is a ticket to add more, see ~~#3381, which covers some of the problems from #3428 above)~~
~~Use the default pinyin for collation.~~
~~Add two total-strokes patch files for the collation generator, one for simplified and one for traditional.~~
~~In the generator, have two different total-strokes used for simplified vs traditional.~~

~~pedberg comments:~~

~~We need to ensure that the transliteration value is consistent with the pinyin collator.~~
~~The 1.8 transliterator had many errors, I don’t think a wholesale fallback to that is a good idea.~~
~~Using the name reading rather than the general reading for standard pinyin collation might produce unexpected results.~~
~~Why not just specify the name reading when that is desired? No need to make it the default if it is the less common reading.~~