Updating Language Groups


  1. (Prerequisite: you must be able to build CLDR locally with Maven.)
  2. Run GenerateLanguageContainment
    1. cd cldr/tools
    2. mvn -DCLDR_DIR=________/cldr -Dexec.mainClass=org.unicode.cldr.tool.GenerateLanguageContainment exec:java -pl cldr-rdf
  3. This will create {workspace}/cldr/common/supplemental/languageGroup.xml
    1. Copy the console log into debugLog.txt to help debug any problems. (The tool should be modified to do this itself.)
  4. Run TestLanguageGroup and fix problems if necessary:
    1. OVERRIDES: If a language code moves or is deleted, consider adding an override to GenerateLanguageContainment.
      1. Additions go in EXTRA_PARENT_CHILDREN
        1. If you add something, you might have to remove it somewhere else; otherwise you'll get a "duplicate parent" error in TestLanguageGroup.
      2. Removals go in REMOVE_PARENT_CHILDREN
        1. "*" for value means all.
    2. Example: pcm [Nigerian Pidgin] [pcm] - not in languages/isolates.json nor languageGroup.xml
      1. Go to https://en.wikipedia.org/wiki/Nigerian_Pidgin (by searching)
      2. Under language family, click on the ancestor. Keep clicking until you find a language group with an "ISO 639-2 / 5" code.
      3. Follow the ancestor chain (see below); in this case we find kri.
      4. Go to GenerateLanguageContainment.EXTRA_PARENT_CHILDREN, add .put("kri", "pcm")
    3. Example: inc [Indic] is not an ancestor of trw [Torwali]: expected true
      1. Go to https://en.wikipedia.org/wiki/Torwali_language (find by searching). 
      2. Under language family, click on the ancestor. Keep clicking until you find a language group with an "ISO 639-2 / 5" code.
      3. That page says 'inc', so we have a case where Wikidata is out of sync with Wikipedia.
      4. Go to GenerateLanguageContainment.EXTRA_PARENT_CHILDREN, add .put("inc", "trw")
    4. Occasionally LanguageGroup.java will need some fixes instead, once you have done the research.
    5. Once you are done, rerun GenerateLanguageContainment and TestLanguageGroup
      1. You may need to repeat the process to get a full chain of ancestors.
      2. Example: For X-based creoles, we use X as the parent, so for the first example above we also needed .put("en", "kri")
  5. Run ChartLanguageGroup
    1. Review {workspace}/cldr-aux/charts/<number>/supplemental/language_groups.html
  6. Check in
    1. {workspace}/cldr/common/supplemental/languageGroup.xml
    2. {workspace}/cldr/tools/cldr-rdf/external/*.tsv (intermediate tables, for tracking)
    3. (???) {workspace}/cldr-aux/charts/<number>/supplemental/language_groups.html
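
The override steps in item 4 can be pictured with a small, self-contained Java sketch. It is not the code in GenerateLanguageContainment: the class name LanguageOverrideSketch, its methods, and the single child-to-parent map are all hypothetical, chosen only to illustrate the idea. It assumes each code has at most one parent (which is what the "duplicate parent" error suggests) and that "*" in a removal means "every child of that parent". The codes used are the ones from the two examples above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the actual GenerateLanguageContainment code. */
final class LanguageOverrideSketch {

    /** Records parent -> child as a child -> parent link; rejects a second, different parent. */
    static void addParentChild(Map<String, String> childToParent, String parent, String child) {
        String old = childToParent.putIfAbsent(child, parent);
        if (old != null && !old.equals(parent)) {
            throw new IllegalStateException(
                    "duplicate parent for " + child + ": " + old + " and " + parent);
        }
    }

    /** Removes a link; "*" as the child is assumed to remove every child of the parent. */
    static void removeParentChild(Map<String, String> childToParent, String parent, String child) {
        if ("*".equals(child)) {
            childToParent.values().removeIf(parent::equals);
        } else {
            childToParent.remove(child, parent);
        }
    }

    /** Walks upward from code, returning [code, parent, grandparent, ...]; stops on a cycle. */
    static List<String> ancestorChain(Map<String, String> childToParent, String code) {
        List<String> chain = new ArrayList<>();
        for (String cur = code; cur != null && !chain.contains(cur); cur = childToParent.get(cur)) {
            chain.add(cur);
        }
        return chain;
    }

    /** True if ancestor appears strictly above code in its chain. */
    static boolean isAncestor(Map<String, String> childToParent, String ancestor, String code) {
        List<String> chain = ancestorChain(childToParent, code);
        return chain.indexOf(ancestor) > 0;
    }

    public static void main(String[] args) {
        Map<String, String> childToParent = new LinkedHashMap<>();
        addParentChild(childToParent, "en", "kri");   // X-based creole: parent is X
        addParentChild(childToParent, "kri", "pcm");  // first example above
        addParentChild(childToParent, "inc", "trw");  // second example above
        System.out.println(ancestorChain(childToParent, "pcm"));     // [pcm, kri, en]
        System.out.println(isAncestor(childToParent, "inc", "trw")); // true
    }
}
```

In this sketch, adding a second parent for a code without first removing the old link throws, which mirrors why an addition sometimes has to be paired with a removal.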