Updating DTDs

Introduction

CLDR makes special use of XML because of the way it is structured. In particular, the XML is designed so that you can read in a CLDR XML file and interpret it as an unordered list of <path,value> pairs, called a CLDRFile internally. Path/value pairs can be added or deleted, and then the CLDRFile can be written back out to disk, resulting in a valid XML file. That is a very powerful mechanism, and it is also what makes the CLDR inheritance model possible.
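
As a rough illustration of the path/value view, the following Python sketch flattens a small XML document into <path,value> pairs. It is not the CLDRFile implementation (which is Java); the function name to_path_values and the sample data are assumptions made for this example, and attributes are emitted in document order rather than in CLDR's canonical order.

```python
import xml.etree.ElementTree as ET

def to_path_values(xml_text):
    """Flatten an XML document into {path: value} for its leaf elements."""
    root = ET.fromstring(xml_text)
    pairs = {}

    def walk(elem, prefix):
        # Attributes become [@name="value"] predicates on the path step.
        step = elem.tag + "".join(
            '[@%s="%s"]' % (k, v) for k, v in elem.attrib.items())
        path = prefix + "/" + step
        children = list(elem)
        if not children:
            # Only leaf nodes carry values (no mixed content).
            pairs[path] = (elem.text or "").strip()
        for child in children:
            walk(child, path)

    walk(root, "/")
    return pairs

# Hypothetical sample data, for illustration only.
SAMPLE = """<ldml><localeDisplayNames><languages>
<language type="en">English</language>
<language type="fr">French</language>
</languages></localeDisplayNames></ldml>"""

pairs = to_path_values(SAMPLE)
```

Each leaf ends up under a key such as //ldml/localeDisplayNames/languages/language[@type="en"], which is the kind of path string used throughout the rest of this page.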

Sounds simple, right? But it isn't quite that easy.

Summary

In summary, when you add an element, attribute, or new kind of attribute value:
  1. Make sure that you don't break any of the invariants below (read through once to make sure you get them)!
  2. Element. If an element is added, run FindDTDOrder (making sure that your dtd cache is cleared), and paste the resulting list into both CLDRFile elementOrdering and supplementalMetadata <elementOrder>. If the element is ordered, add to CLDRFile.orderedElements and supplementalMetadata <serialElements>.
  3. Attribute. If a new attribute is added, put it into CLDRFile.attributeOrdering and supplementalMetadata <attributeOrdering>, near the end (but before draft, references, etc). All new ldml.dtd attributes need to be added to CLDRFile.distinguishingData. All attribute values should have a supplementalMetadata <attributeValues> element.
    1. If a new attribute is distinguishing, add to CLDRFile.isDistinguishing.
    2. Never have an attribute be both distinguishing and not distinguishing!! (we have some old cases for compatibility, where the element makes a difference)
    3. Don't introduce any default DTD values. If you did, you'd have to update CLDRFile.defaultSuppressionMap and supplementalMetadata <suppress>.
  4. Attribute Value. If you add a new kind of value to an attribute, also adjust the corresponding supplementalMetadata <attributeValues> element.
    • To check for problems in attribute values after you've done this, run ConsoleCheckCLDR -f en -z FINAL_TESTING -e -c comprehensive
    • If you missed any codes, you will get the error message: "Unexpected Attribute Value"
  5. Validation
    1. Run QuickCheck to make sure that the PrettyPath still works, and everything validates.
  6. Add Documentation, PrettyPath, Examples, and Tests.
  7. Add any tests/examples to CheckCLDR
    1. In particular, add to CheckNew (note: we might change this to use Easy Steps)
  8. (Optional) add additional data. If the data is just seed data (that you aren't sure of), make sure that you have draft="unconfirmed" on the leaf nodes.

  • Note: it would make the code easier to manage and less fragile if we had a different ordering for ldml.dtd than for the other DTDs! So if we can get around to that....
  • We should at least change the CLDRFile table names so that they match the metadata ones.
  • We should also add a test that the CLDRFile data is in sync with the supplementalMetadata. We probably don't want to make the former use the latter, just for efficiency.

Details

Elements

We never have "mixed" content. That is, no element values can occur in anything but leaf nodes. You can never have <x>abcd<y>def</y></x>. You must instead introduce another element, such as: <x><z>abcd</z><y>def</y></x>
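
A quick way to check a candidate structure against this rule is sketched below in Python. The helper has_mixed_content is a hypothetical name for this example and is not part of the CLDR tools.

```python
import xml.etree.ElementTree as ET

def has_mixed_content(elem):
    """True if any element that has child elements also carries text."""
    children = list(elem)
    if children:
        # Text directly before the first child.
        if (elem.text or "").strip():
            return True
        for child in children:
            # Text after a child (its "tail") also belongs to the parent.
            if (child.tail or "").strip() or has_mixed_content(child):
                return True
    return False

bad = ET.fromstring("<x>abcd<y>def</y></x>")
good = ET.fromstring("<x><z>abcd</z><y>def</y></x>")
```

Here has_mixed_content(bad) is true and has_mixed_content(good) is false, matching the two examples in the paragraph above.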

There is a strong distinction between rule elements and structure elements. Example: in collations you have <p>x</p><p>y</p> representing x < y. Clearly changing the order would cause problems! There are restrictions on this, however:
  1. Rule elements must be written in the same order they are read.
  2. They can't inherit.
  3. You can't (easily) add to them programmatically.
  4. You can't mix rule and structure elements under the same parent element. That is, if you can have <x><y>...</y><z>...</z></x>, then either y and z must both be rule or both be structure elements.
  5. In our code, rule elements have their ordering preserved by a fake attribute, _q="nnn", that is added when reading.
  6. The CLDRFile code has a list of these, in the right order, as orderedElements. If you ever add a rule element to a DTD, you MUST add it there. Be careful to preserve the above invariants.
    • Note: we should change the name orderedElements for clarity.
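
To make the _q mechanism concrete, here is a minimal Python sketch of numbering rule elements while reading. The set ORDERED_ELEMENTS stands in for CLDRFile.orderedElements, the function name read_with_q is hypothetical, and the exact format of the counter in the real code is not specified here; the sketch also ignores real attributes for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for CLDRFile.orderedElements.
ORDERED_ELEMENTS = {"p"}

def read_with_q(xml_text):
    """Return (path, value) pairs, tagging rule elements with a _q counter."""
    root = ET.fromstring(xml_text)
    pairs = []
    counter = 0

    def walk(elem, prefix):
        nonlocal counter
        step = elem.tag
        if elem.tag in ORDERED_ELEMENTS:
            # The fake attribute records the order in which elements were read.
            step += '[@_q="%d"]' % counter
            counter += 1
        path = prefix + "/" + step
        children = list(elem)
        if not children:
            pairs.append((path, (elem.text or "").strip()))
        for child in children:
            walk(child, path)

    walk(root, "/")
    return pairs

pairs = read_with_q("<rules><p>x</p><p>y</p></rules>")
```

Because each <p> gets a distinct _q value, the two otherwise identical paths stay distinct and can be written back out in the order they were read.
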
In order to write out an XML file correctly, we also have to know the valid ordering of paths for elements that are not ordered. This is done via an elementOrdering list in CLDRFile. This list is generated automatically from the DTD, using a program called util/FindDTDOrder. It reads through the DTDs and constructs a list, whereby if the DTD orders X before Y, then the list does. For example, take:
  • <!ELEMENT localeDisplayNames (alias | (localeDisplayPattern?, languages?, scripts?, territories?...
From this, we can derive that localeDisplayPattern < languages < scripts < territories. The full list is constructed by merging. If there are any cycles in the ordering, then FindDTDOrder will throw an exception, and you have to fix it. That also means that we cannot have complicated DTDs; each non-leaf node will be of the form:
  • <!ELEMENT foo (alias | (first?, second*, third?, ... special*))>
The subelements of an element will vary between * and ?. Note however that all leaf nodes MUST allow for the attributes alt=... draft=... and references=.... So that the alt can work, the leaf nodes MUST occur in their parent as *, not ?, even if logically there can be only one. For example, even though logically there is only a single quotationStart, we see:
  • <!ELEMENT delimiters (alias | (quotationStart*, ...
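
The merging of per-element child orders into one global list can be thought of as a topological sort. The Python sketch below illustrates that idea under stated assumptions: it is not the FindDTDOrder code, merge_orderings is a hypothetical name, the input sequences are assumed to have been extracted from the DTD already, and it uses the standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter, CycleError

def merge_orderings(sequences):
    """Merge several 'X before Y' sequences into one consistent list.

    Raises CycleError if two sequences disagree about the order.
    """
    sorter = TopologicalSorter()
    for seq in sequences:
        for name in seq:
            sorter.add(name)            # make sure every element is present
        for before, after in zip(seq, seq[1:]):
            sorter.add(after, before)   # 'before' must precede 'after'
    return list(sorter.static_order())

# Child orders as they might be read from two <!ELEMENT ...> declarations
# (the second one is invented for illustration).
order = merge_orderings([
    ["localeDisplayPattern", "languages", "scripts", "territories"],
    ["languages", "scripts", "variants"],
])
```

If one declaration lists a before b and another lists b before a, merge_orderings raises CycleError, which corresponds to the exception described above: the DTD has to be fixed before a single ordering exists.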

Attributes

The attribute order is much more flexible, since it doesn't affect the validity of the file. That is, in XML the following are equal:
  • <info iso4217="ADP" digits="0" rounding="0"/>
  • <info digits="0" rounding="0" iso4217="ADP"/>
However, when this is turned into a path, the order does matter. That is, as strings the following are not equal:
  • //supplementalData/currencyData/fractions/info[@iso4217="ADP"][@digits="0"][@rounding="0"]
  • //supplementalData/currencyData/fractions/info[@digits="0"][@rounding="0"][@iso4217="ADP"]
The ordering of attributes in the string path and in the output file is controlled by attributeOrdering in CLDRFile. Although there shouldn't be, there are probably code dependencies in CLDR on the ordering in various places, so don't change it without knowing what you are doing. In particular, certain attributes always come first (like _q and type), and certain others always come last (like draft and references). Normally you add new attributes to the middle somewhere.
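
The effect of a fixed attribute ordering can be sketched in Python as follows. The list ATTRIBUTE_ORDERING here is a small hypothetical stand-in for CLDRFile.attributeOrdering (the real list is longer and may differ), and path_step is a name invented for this example.

```python
# Hypothetical, abbreviated stand-in for CLDRFile.attributeOrdering:
# _q and type first, draft and references last, others in the middle.
ATTRIBUTE_ORDERING = ["_q", "type", "iso4217", "digits", "rounding",
                      "draft", "references"]

def path_step(element, attributes):
    """Build one path step with attributes in the canonical order."""
    rank = {name: i for i, name in enumerate(ATTRIBUTE_ORDERING)}
    # An attribute missing from the list raises KeyError, which is a
    # reminder that new attributes must be added to the ordering.
    ordered = sorted(attributes.items(), key=lambda kv: rank[kv[0]])
    return element + "".join('[@%s="%s"]' % kv for kv in ordered)

a = path_step("info", {"iso4217": "ADP", "digits": "0", "rounding": "0"})
b = path_step("info", {"digits": "0", "rounding": "0", "iso4217": "ADP"})
```

Both calls yield the same string, info[@iso4217="ADP"][@digits="0"][@rounding="0"], so the two XML spellings of the element map to a single path.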

When computing the file ordering, we compare paths using CLDRFile.ldmlComparator. Here is the basic ordering algorithm:

Walk through the elements in the path. For each element and its attributes:
  1. compare the corresponding elements at that level in the respective paths; if unequal, return their ordering
    • If they are orderedElements, treat them as equal (the _q attributes will distinguish them).
    • Otherwise the "less than" ordering is given by elementOrdering.
  2. otherwise compare the respective attributes and attribute values, one by one:
    1. if the attributes are unequal, return their ordering (according to attributeOrdering)
    2. if the attribute values are unequal, return their ordering
While attribute value orderings are mostly alphabetic, we do have a number of tweaks in getAttributeValueComparator so that values come in a reasonable order, such as "sun" < "mon" < "tues" < ...
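
The algorithm above can be sketched in Python as shown below. This is an illustration, not CLDRFile.ldmlComparator: the tables are tiny hypothetical stand-ins, paths are assumed to be pre-parsed into lists of (element, [(attribute, value), ...]) with attributes already in canonical order, and the tie-breaking on length is an assumption of this sketch.

```python
from functools import cmp_to_key

# Hypothetical stand-ins for the CLDRFile tables.
ELEMENT_ORDERING = ["ldml", "dates", "day", "rules", "p"]
ORDERED_ELEMENTS = {"p"}
ATTRIBUTE_ORDERING = ["_q", "type"]
WEEKDAYS = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"]

def cmp(a, b):
    return (a > b) - (a < b)

def value_key(attr, value):
    """Sort key for attribute values: special cases first, then alphabetic."""
    if attr == "_q":
        return (0, int(value))
    if value in WEEKDAYS:
        return (0, WEEKDAYS.index(value))
    return (1, value)

def compare_paths(path1, path2):
    for (elem1, attrs1), (elem2, attrs2) in zip(path1, path2):
        both_ordered = elem1 in ORDERED_ELEMENTS and elem2 in ORDERED_ELEMENTS
        if elem1 != elem2 and not both_ordered:
            return cmp(ELEMENT_ORDERING.index(elem1),
                       ELEMENT_ORDERING.index(elem2))
        for (name1, val1), (name2, val2) in zip(attrs1, attrs2):
            if name1 != name2:
                return cmp(ATTRIBUTE_ORDERING.index(name1),
                           ATTRIBUTE_ORDERING.index(name2))
            if val1 != val2:
                return cmp(value_key(name1, val1), value_key(name2, val2))
        if len(attrs1) != len(attrs2):      # assumption: fewer attributes first
            return cmp(len(attrs1), len(attrs2))
    return cmp(len(path1), len(path2))      # assumption: shorter path first

def day(name):
    return [("ldml", []), ("dates", []), ("day", [("type", name)])]

days = sorted([day("tue"), day("sun"), day("mon")],
              key=cmp_to_key(compare_paths))
day_names = [p[-1][1][0][1] for p in days]
```

With these tables the three day paths sort as sun, mon, tue rather than alphabetically, and rule elements compare by the numeric value of their _q attribute.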

There is an important distinction for attributes. The distinguishing attributes are relevant to the identity of the path and for inheritance. For example, in <language type="en"...> the type is a distinguishing attribute. The non-distinguishing attributes instead carry information, and aren't relevant to the identity of the path, nor are they used in the ordering above. Non-distinguishing attributes in the ldml DTD cause problems: try to design all future DTD structure to avoid them; put data in element values, not attribute values. It is ok to have data in attributes in the other DTDs. The distinction between the distinguishing and non-distinguishing attributes is captured in the distinguishingData in CLDRFile. So by default, always put new ldml attributes in this array.
  • (Note: we should change this to be exclusive instead of inclusive, to reduce the possibility for error.)
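
One way to picture the distinction: stripping the non-distinguishing attributes from a full path leaves the path's identity. The Python sketch below assumes a small hypothetical set of distinguishing attribute names (the real data lives in CLDRFile and is keyed more finely), and distinguishing_path is a name invented for this example.

```python
import re

# Hypothetical, abbreviated set of distinguishing attribute names.
DISTINGUISHING = {"type", "alt", "key"}

# Matches one [@name="value"] predicate in a path string.
ATTRIBUTE = re.compile(r'\[@([^=\]]+)="([^"]*)"\]')

def distinguishing_path(full_path):
    """Drop every attribute that is not part of the path's identity."""
    def keep(match):
        return match.group(0) if match.group(1) in DISTINGUISHING else ""
    return ATTRIBUTE.sub(keep, full_path)

full = ('//ldml/localeDisplayNames/languages/'
        'language[@type="en"][@draft="unconfirmed"]')
identity = distinguishing_path(full)
```

Here identity keeps [@type="en"] and drops the draft attribute, so a draft and a non-draft version of the same item resolve to the same path.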

Attribute Values

We use some default attribute values in our DTD, such as
  • <!ATTLIST decimalFormat type NMTOKEN "standard" >
This was probably a mistake, since it makes the interpretation of the file depend on the DTD; we might fix it some day, maybe if we go to Relax, but for now just don't introduce any more of these. It also means that we have a table in CLDRFile with these values: defaultSuppressionMap.

When you make a draft attribute on a new element, don't copy the old ones like this:

<!ATTLIST xxx draft ( approved | contributed | provisional | unconfirmed | true | false ) #IMPLIED ><!-- true and false are deprecated. -->

That is, we don't want the deprecated values on new elements. Just make it:

<!ATTLIST xxx draft ( approved | contributed | provisional | unconfirmed ) #IMPLIED >

Metadata

In addition, we try to capture the above information in supplementalMetadata. We are not yet complete, but the idea is that someone should be able to determine the status for all of the above from the data instead of looking at CLDRFile. That means that when you update CLDRFile, you need to update the supplementalMetadata also.

The DTD cannot do anything like the level of testing for legitimate values that we need, so supplementalMetadata also has a set of <validity> data for checking attribute values. For example, we see:
  • <attributeValues attributes="validSubLocales" type="list">$locale</attributeValues>
This means that whenever you see that attribute, it can be tested for a list of values that are contained in the variable $locale, defined above. Some of these variables are lists, and some are regex. When you add a new attribute to ldml, you should add a <validity> element if possible.
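
The kind of check this enables is sketched below in Python. The variable table, its contents, and the function invalid_values are hypothetical; the real variables are defined in supplementalMetadata and the real checking is done by the CLDR test tools.

```python
import re

# Hypothetical variable definitions: each is either a list of allowed
# values or a regular expression, as described above.
VARIABLES = {
    "$locale": ("list", {"en", "fr", "de"}),
    "$digits": ("regex", re.compile(r"[0-9]+")),
}

def invalid_values(variable, attribute_value, is_list_attribute=False):
    """Return the values that the given variable does not allow."""
    kind, spec = VARIABLES[variable]
    # type="list" attributes hold several space-separated values.
    items = attribute_value.split() if is_list_attribute else [attribute_value]
    if kind == "list":
        return [v for v in items if v not in spec]
    return [v for v in items if not spec.fullmatch(v)]

problems = invalid_values("$locale", "en fr xx", is_list_attribute=True)
```

In this example problems contains only "xx", the one value not covered by the hypothetical $locale list; an empty result means every value passed.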

Don't Reuse

For many reasons, you never reuse an element name or attribute name unless you mean precisely the same thing, and the item is used in the same way. So to="2009-05-21" is always an attribute that means an end date. Be very careful about new elements with the same name as old ones. You can't have <territory> be an orderedElement in one place, and a non-orderedElement in another. The attribute type=... is always used as an id. For historical reasons, sometimes it is distinguishing and sometimes not (this is very painful, don't add to it!). It is also not used as the id in numberingSystems.

Documentation, PrettyPath, Examples, and Tests

Don't forget the following!
  1. If possible, add a quick sanity test for your new feature. See unittest/NumberingSystemsTest as an example. Remember to add to unittest/TestAll.
  2. If it is an ldml DTD change:
    1. For anything but trivial list items you'll also want to add to the test/ExampleGenerator so that survey tool users see examples of your structure in place.
    2. If there are things you can do to fix the user data on entry, add to test/DisplayAndInputProcessor
    3. Consider also adding a survey test, to check for bad user input. Look at test/CheckDates to see how this is done.
    4. If there are coverage requirements, modify test/CoverageLevel
    5. Update the util/data/prettyPath.txt file for showing a short path name.
  3. If it is a supplemental DTD change:
    1. Add code to util/SupplementalDataInfo to fetch the data.
    2. You should develop a chart program that shows your data in http://www.unicode.org/cldr/data/charts/supplemental/index.html
  4. File a separate bug to update the spec for your DTD change so that it doesn't get lost.
(TBD: add details for the above; in the meantime, ask for help on the mailing list).