CLDR File

Introduction

CLDRFile is a workhorse for CLDR tools. It has grown and changed over the years, without a serious rearchitecture, so there are oddities in its usage and construction that we'll try to describe here.

The CLDRFile is logically a representation of one or more XML files as a set of <key, value> pairs, where the key is a path, and the value is the element value at the path. This structure allows the XML file to use inheritance, and allows it to be modified without worrying about the exact structure. No matter what changes are made to it, when a CLDRFile is written out into a file, the resulting XML file is syntactically correct.

The logical contents of a CLDRFile may be represented by multiple files on disk. The DtdType class contains information about which directories are used for which types of files.

Because CLDRFile incorporates inheritance, and because the data source may be a database and not just in-memory storage, the actual contents are held in a separate object called XMLSource. There are two types of XMLSource: resolved and unresolved. An unresolved XMLSource is a fairly simple mapping, while a resolved XMLSource references other XMLSource objects, namely the ones in its inheritance tree. Only CLDRFiles of type DtdType.ldml support (or need) inheritance.
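The difference between an unresolved and a resolved source can be illustrated with a small self-contained sketch. It does not use the real XMLSource API: the classes SimpleSource and ResolvingSource, the method findSourceLocale, and the sample values are all hypothetical, and the sketch assumes a plain parent chain (de_CH, then de, then root). Real inheritance has more to it (aliases, for instance), which is ignored here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Demo driver; all class names in this sketch are hypothetical, not CLDR API. */
final class InheritanceSketch {
    static final String TERRITORY_CH =
            "//ldml/localeDisplayNames/territories/territory[@type=\"CH\"]";

    /** Builds a resolved view over the assumed chain de_CH -> de -> root. */
    static ResolvingSource build() {
        SimpleSource root = new SimpleSource("root").put(TERRITORY_CH, "CH");
        SimpleSource de = new SimpleSource("de").put(TERRITORY_CH, "Schweiz");
        SimpleSource deCH = new SimpleSource("de_CH"); // no local value for this path
        return new ResolvingSource(List.of(deCH, de, root));
    }

    public static void main(String[] args) {
        ResolvingSource resolved = build();
        System.out.println(resolved.getValue(TERRITORY_CH));         // value inherited from de
        System.out.println(resolved.findSourceLocale(TERRITORY_CH)); // locale that supplied it
    }
}

/** Stand-in for an unresolved source: a plain path-to-value map. */
final class SimpleSource {
    final String localeId;
    private final Map<String, String> values = new HashMap<>();

    SimpleSource(String localeId) {
        this.localeId = localeId;
    }

    SimpleSource put(String path, String value) {
        values.put(path, value);
        return this;
    }

    /** Returns only what this source itself contains, or null. */
    String getValue(String path) {
        return values.get(path);
    }
}

/** Stand-in for a resolved source: walks the inheritance chain. */
final class ResolvingSource {
    private final List<SimpleSource> chain; // most specific locale first, root last

    ResolvingSource(List<SimpleSource> chain) {
        this.chain = new ArrayList<>(chain);
    }

    String getValue(String path) {
        SimpleSource source = findSource(path);
        return source == null ? null : source.getValue(path);
    }

    String findSourceLocale(String path) {
        SimpleSource source = findSource(path);
        return source == null ? null : source.localeId;
    }

    private SimpleSource findSource(String path) {
        for (SimpleSource source : chain) {
            if (source.getValue(path) != null) {
                return source; // first source in the chain that has a value wins
            }
        }
        return null;
    }
}
```

The unresolved stand-in answers only from its own map, while the resolved one delegates to a list of unresolved sources, which mirrors the split described above.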

The structure is a bit more complicated than just a <path,value> pair:

    1. For inheritance, the path keys remove attributes that don't count for the purposes of inheritance, leaving a "distinguishing" path. So there is a separate mapping from the distinguishing path to the "full" path, which retains those attributes. There is API to get the distinguishing path from a full path, and to get the full path for a distinguishing path in a given locale.

    2. There is a separate mapping for comments. This indicates where a comment attaches to an element (before, at, or after). If the CLDRFile has been modified to remove the path, then the comment is still printed, but in a manufactured location.

    3. As a CLDRFile is read from disk, certain changes may be made to the structure, including the addition of attributes in order to get the ordering correct.

    4. There is a special mechanism used for paths that have no value, internally called extraPaths. These are returned when iterating with fullIterable(). They are typically paths whose attributes depend on the nature of the locale. For example, paths are added based on the pluralCategories supported by the locale. These valueless paths are presented in the SurveyTool, where vetters can add missing values.

    5. A by-product of the <key,value> pair nature is that no CLDR XML file can have mixed content: the content of every element is either a value, or a sequence of child elements.
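The first point in the list above, the split between distinguishing and full paths, can be sketched as follows. This is a minimal illustration, not the CLDRFile implementation: the class and method names are hypothetical, and the set of non-distinguishing attributes is hard-coded to draft and references purely as an assumption (in the real tools this information comes from the DTD data).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of keying values by distinguishing path. */
final class DistinguishingPathSketch {
    // Assumption for illustration: only draft and references are non-distinguishing.
    private static final Pattern NON_DISTINGUISHING =
            Pattern.compile("\\[@(?:draft|references)=\"[^\"]*\"\\]");

    private final Map<String, String> values = new HashMap<>();    // distinguishing path -> value
    private final Map<String, String> fullPaths = new HashMap<>(); // distinguishing path -> full path

    /** Strips the non-distinguishing attributes from a full path. */
    static String toDistinguishing(String fullPath) {
        return NON_DISTINGUISHING.matcher(fullPath).replaceAll("");
    }

    void add(String fullPath, String value) {
        String key = toDistinguishing(fullPath);
        values.put(key, value);
        fullPaths.put(key, fullPath);
    }

    String getValue(String path) {
        return values.get(toDistinguishing(path));
    }

    String getFullPath(String path) {
        return fullPaths.get(toDistinguishing(path));
    }

    public static void main(String[] args) {
        DistinguishingPathSketch file = new DistinguishingPathSketch();
        String full = "//ldml/localeDisplayNames/languages/language"
                + "[@type=\"fr\"][@draft=\"unconfirmed\"]";
        file.add(full, "French");
        String key = toDistinguishing(full);
        System.out.println(key);                   // path without the draft attribute
        System.out.println(file.getValue(key));    // value stored under the distinguishing path
        System.out.println(file.getFullPath(key)); // full path, draft attribute included
    }
}
```

Keeping two maps with the same key means a lookup by either form of the path reaches the same entry, while the attributes that were stripped remain available for writing the file back out.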

Because of these complications, when a new element or attribute is added to the DTD, adjustments have to be made to the code in CLDRFile and other places. For details, see Updating DTDs.

Input

When a file is read, special attributes in DtdData are used to determine which elements are not reordered or inherited. Those elements get an artificial attribute "_q" with an increasing number as they are read.
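The effect of the "_q" attribute can be sketched in isolation. The class below is hypothetical and does not use DtdData: the set of ordered element names is passed in by hand (tRule is used only as an assumed example), the counter is assumed to start at 0, and the attribute is simply appended to the end of the path. The point is that otherwise identical paths become distinct keys and keep their reading order.

```java
import java.util.Set;

/** Hypothetical sketch: number ordered elements with an artificial _q attribute. */
final class OrderedElementSketch {
    private final Set<String> orderedElements; // assumption: supplied by hand, not from DtdData
    private int counter = 0;                   // assumption: numbering starts at 0

    OrderedElementSketch(Set<String> orderedElements) {
        this.orderedElements = orderedElements;
    }

    /**
     * Appends [@_q="n"] when the last element of the path is an ordered element.
     * Simplification: assumes no '/' occurs inside attribute values.
     */
    String tag(String path) {
        String last = path.substring(path.lastIndexOf('/') + 1);
        int bracket = last.indexOf('[');
        String name = bracket < 0 ? last : last.substring(0, bracket);
        if (!orderedElements.contains(name)) {
            return path; // ordinary element: left untouched
        }
        return path + "[@_q=\"" + (counter++) + "\"]";
    }

    public static void main(String[] args) {
        OrderedElementSketch reader = new OrderedElementSketch(Set.of("tRule"));
        String rule = "//supplementalData/transforms/transform/tRule";
        System.out.println(reader.tag(rule)); // first rule read
        System.out.println(reader.tag(rule)); // second rule read, now a different key
        System.out.println(reader.tag("//supplementalData/version")); // unchanged
    }
}
```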

Output

The write method writes out an XML file. The elements and attributes are ordered based on their order in the DTD file. The attribute value ordering is determined by a comparator: getAttributeValueComparator(). [Note: we should also move these into special attributes in the DTD file.]

Certain attributes are suppressed when written out. These are in defaultSuppressionMap. [TBD check on this, might be old.]

Some of the code uses a MapComparator, which is logically built from an ordered list of items and compares two items as follows:

    1. If both items are in the ordered list, it returns the ordering according to that list. Else if exactly one is, it comes first (TBD check this).

    2. If both items are numeric (all digits), then the ordering is the normal numeric order. Else if exactly one is, it comes first (TBD check this).

    3. Else return the "natural" comparison: return ((Comparable)a).compareTo(b);
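The three rules above can be sketched as a standalone comparator over strings. This is not the MapComparator source: the class name ListFirstComparator is hypothetical, and the "exactly one comes first" branches follow the description above, which is itself marked as unverified. Ties between numerically equal strings (such as "7" and "007") are broken by natural order, which is an added assumption.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the comparison rules described for MapComparator. */
final class ListFirstComparator implements Comparator<String> {
    private final Map<String, Integer> position = new HashMap<>();

    ListFirstComparator(List<String> ordered) {
        for (String item : ordered) {
            position.putIfAbsent(item, position.size()); // index in the ordered list
        }
    }

    private static boolean isNumeric(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    @Override
    public int compare(String a, String b) {
        // Rule 1: items in the ordered list sort by list position, ahead of all others.
        Integer pa = position.get(a);
        Integer pb = position.get(b);
        if (pa != null && pb != null) return pa.compareTo(pb);
        if (pa != null) return -1;
        if (pb != null) return 1;
        // Rule 2: all-digit items sort numerically, ahead of non-numeric items.
        boolean na = isNumeric(a);
        boolean nb = isNumeric(b);
        if (na && nb) {
            int c = new BigInteger(a).compareTo(new BigInteger(b));
            return c != 0 ? c : a.compareTo(b); // assumed tie-break for "7" vs "007"
        }
        if (na) return -1;
        if (nb) return 1;
        // Rule 3: natural comparison.
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        List<String> items =
                new ArrayList<>(Arrays.asList("other", "10", "b", "one", "9", "a", "zero"));
        items.sort(new ListFirstComparator(
                Arrays.asList("zero", "one", "two", "few", "many", "other")));
        System.out.println(items); // prints [zero, one, other, 9, 10, a, b]
    }
}
```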

Usage

The iteration of paths in a CLDRFile returns the paths in arbitrary (hashed) order. There are, however, some options that let you iterate through a subset of the paths: only paths that match a prefix string, or only those that match a regular expression. You can also use a comparator (e.g., CLDRFile.ldmlComparator) to change the order of iteration. Example:

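    // Iterate over all paths (empty prefix) in the order given by the comparator.
    for (Iterator<String> it = cldrFile.iterator("", CLDRFile.ldmlComparator); it.hasNext();) {
        String path = it.next();                        // distinguishing path
        String value = cldrFile.getStringValue(path);   // element value at that path
        String fullpath = cldrFile.getFullXPath(path);  // path including all attributes
    }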
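The filtering and ordering options can also be modeled without CLDRFile at all. The sketch below is a hypothetical stand-in over a plain HashMap: the class PathIterationSketch and its methods are not CLDR API, and Comparator.naturalOrder() merely stands in for a DTD-aware comparator. It shows why an explicit comparator is needed when a stable order matters.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical model of prefix and regex filtering over hashed paths. */
final class PathIterationSketch {
    private final Map<String, String> values = new HashMap<>(); // iteration order is unspecified

    void add(String path, String value) {
        values.put(path, value);
    }

    /** Paths starting with the prefix; sorted only if a comparator is given. */
    List<String> paths(String prefix, Comparator<String> order) {
        List<String> result = new ArrayList<>();
        for (String path : values.keySet()) {
            if (path.startsWith(prefix)) {
                result.add(path);
            }
        }
        if (order != null) {
            result.sort(order);
        }
        return result;
    }

    /** Paths containing a match for the regular expression; sorted only if a comparator is given. */
    List<String> pathsMatching(Pattern regex, Comparator<String> order) {
        List<String> result = new ArrayList<>();
        for (String path : values.keySet()) {
            if (regex.matcher(path).find()) {
                result.add(path);
            }
        }
        if (order != null) {
            result.sort(order);
        }
        return result;
    }

    public static void main(String[] args) {
        PathIterationSketch file = new PathIterationSketch();
        file.add("//ldml/numbers/symbols/decimal", ".");
        file.add("//ldml/numbers/symbols/group", ",");
        file.add("//ldml/characters/moreInformation", "?");
        System.out.println(file.paths("//ldml/numbers", Comparator.naturalOrder()));
        System.out.println(file.pathsMatching(Pattern.compile("/symbols/"), Comparator.naturalOrder()));
    }
}
```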

[TBD Update example and add full iterable.]