CLDR Process

Introduction

This document describes the Unicode CLDR Technical Committee’s process for data collection, resolution, public feedback and release.

The process is designed to be light-weight; in particular, the meetings are frequent, short, and informal. Most of the discussion happens over email or in virtual meetings, with a database recording requested changes (see Requesting Changes).
When gathering data for a region and language, it is important to have multiple sources for that data to produce the most commonly used data. The initial versions of the data were based on the best available sources, and updates with new data and improvements are released twice a year through work by contributors inside and outside of the Unicode Consortium.
It is important to note that CLDR is a Repository, not a Registration. That is, contributors should NOT expect that their suggestions will simply be adopted into the repository; instead, they will be vetted by other contributors.
The CLDR Survey Tool is the main channel for collecting data, and bug/feature requests are tracked in a database (see Requesting Changes).
Final approval of the release of any version of CLDR rests with the CLDR Technical Committee.

Formal Technical Committee Procedures

For more information on the formal procedures for the Unicode CLDR Technical Committee, see the Technical Committee Procedures for the Unicode Consortium.

Specification Changes

The UTS #35: Locale Data Markup Language (LDML) specification is kept up to date with each release, with changed or added structure for new data types or other features.

The CLDR TC maintains redirects from each major CLDR release to the latest version of the spec for that release. For example, reports/tr35/46/tr35.html will redirect to the 46.1 version of the specification, since 46.1 was the latest revision of LDML for 46. To make this clear to readers, the modifications for any dot release of CLDR will always include the modifications for any release(s) during that major version, and will clearly delineate the modifications of that dot release in their own section. CLDR 46.1 modifications is a good example of this.
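The redirect rule can be pictured as a lookup from each major version to its latest revision. The Python sketch below is only an illustration: the function name `latest_revisions` and the input list are hypothetical, and it does not describe how the redirects are actually implemented on the Unicode site.

```python
def latest_revisions(released):
    """Map each major version to its latest released revision.

    released: version strings such as "46" or "46.1" (hypothetical input).
    """
    latest = {}
    for version in released:
        major, _, minor = version.partition(".")
        key = (int(major), int(minor or 0))
        # Keep the entry with the highest dot-release number per major version.
        if key[0] not in latest or key > latest[key[0]][0]:
            latest[key[0]] = (key, version)
    return {major: version for major, (_, version) in latest.items()}


# With 46 and 46.1 both released, requests for "46" resolve to "46.1".
print(latest_revisions(["45", "46", "46.1"]))
```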

Requests for changes are entered in the bug/feature request database (file a ticket).
Structural changes are always backwards-compatible. That is, previous files will continue to work. Deprecated elements remain, although their usage is strongly discouraged.
There is a standing policy for structural changes that require non-trivial code for proper implementation, such as time zone fallback or alias mechanisms.
- These require design discussions in the CLDR Design Working Group and approval by the CLDR Technical Committee.
- Complex changes may require prototypes that demonstrate correct function according to the proposed specification.
- When a spec change or clarification affects existing ICU APIs, CLDR will discuss the change with ICU, and an ICU member will be a required reviewer on the pull request.
New sections may be added to the specification with the status of Technical Preview or Final Candidate depending on how comprehensive the new section is, what type of feedback the Technical Committee requires, and whether the feedback period needs to extend across one or more releases.
New features in the spec will be marked as Technical Preview if the following conditions are true:
- They are intended for implementation in ICU and/or ICU4X (e.g., excluding annotations for emoji)
- They make compliant pre-existing ICU or ICU4X APIs become non-compliant
- They won’t be implemented by either ICU or ICU4X (in at least draft status) in the synchronized CLDR release

Status	Description
Technical Preview	- The specification section is fairly complete but not stable, included in the release to gather feedback. - Features may be modified or removed based upon feedback. A section in Technical Preview may remain in Technical Preview in the following release if more feedback is needed, or could advance to Final Candidate or to stable. - It is similar to elements marked with @TECHPREVIEW in the DTD as described in the LDML.
Final Candidate	- The specification section is complete and considered ready for release, and is expected to become stable in the next release. An optional Final Candidate stage follows a period of feedback in Technical Preview where final feedback is desired. Changes will only be made if serious issues are discovered during this feedback period.
Stable	- The specification section has been approved as stable by the Technical Committee; any changes must be backward compatible. Deprecated elements will remain, although their usage is strongly discouraged.

Data Submission and Vetting

The contributors of locale data are expected to be language speakers residing in the country/region. In particular, national standards organizations are encouraged to be involved in the data vetting process.

In order to add a locale to the repository, Core data (see Core data for new locales) is needed. The content is collected from language experts and is reviewed by the CLDR Technical Committee. This information is the minimum needed for the Survey Tool to offer the locale for further data collection and is required for a new locale to be added to CLDR. With the Core data added to CLDR, further data collection takes place using the Survey Tool (see Getting Started).

The following 4 states are used to differentiate the data contribution levels. The initial data contributions are normally marked as draft; this may be changed once the data is vetted.

Level 1: unconfirmed
Level 2: provisional
Level 3: contributed (= minimally approved)
Level 4: approved (equivalent to an absent draft attribute)

Implementations may choose the level at which they wish to accept data. They may choose to accept even unconfirmed data if having some data is better than no data for their purpose. Approved data are vetted by language speakers; however, this does not mean that the data is guaranteed to be error-free – this is simply the best judgment of the vetters and the committee according to the process.
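As an illustration of how an implementation might choose its acceptance level, the following minimal Python sketch ranks the four draft states and drops values below a chosen threshold. The function name `accept` and the sample data are hypothetical; only the four state names, their order, and the treatment of an absent draft attribute come from the list above.

```python
# Draft states in increasing order of confidence, as listed above.
DRAFT_LEVELS = ["unconfirmed", "provisional", "contributed", "approved"]


def accept(items, minimum="contributed"):
    """Keep only values whose draft status is at or above the minimum level.

    items: (value, draft) pairs; a missing draft attribute (None) is
    treated as "approved".
    """
    threshold = DRAFT_LEVELS.index(minimum)
    kept = []
    for value, draft in items:
        level = DRAFT_LEVELS.index(draft or "approved")
        if level >= threshold:
            kept.append(value)
    return kept


# Hypothetical sample data: (value, draft status) pairs.
sample = [("January", None), ("Jan.", "provisional"), ("Janu", "unconfirmed")]
print(accept(sample))                         # a strict consumer
print(accept(sample, minimum="unconfirmed"))  # "some data is better than no data"
```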

Survey Tool User Levels

There are multiple levels of access and control:

Vetter Level	Number of Votes	Description
TC Member	50	- Manage users in their organization - Can vet and submit data for all locales (high level votes are only used to correct issues with CLDR Technical Committee’s approval - otherwise the voting level of their organization would be used) - Can see the email addresses for all vetters in their organization
TC Organization Manager	6	- Manage users in their organization - Can vet and submit data for all locales (however, their vetting work is only done to correct issues) - Can see the email addresses for all vetters in their organization
Organization Manager	4	- Manage users in their organization - Can vet and submit data for all locales (only done to correct issues) - Can see the email addresses for all vetters in their organization
TC Organization Vetter	6	- Can vet and submit data for a particular set of locales - Can see the email addresses for vetters that submitted data in their locales - Cannot manage other users
Organization Vetter	4	- Can vet and submit data for a particular set of locales - Can see the email addresses for vetters that submitted data in their locales - Cannot manage other users
Guest Vetter	1	- Can vet and submit data for a particular set of locales - Cannot see others’ email addresses - Cannot manage other users
Locked Vetter	0	- When a user is locked during a vetting cycle, their vote is considered a zero weight and is no longer counted.

These levels are decided by the technical committee and the TC representative for the respective organizations.

Unicode TC members (full/institutional/supporting) can create new user accounts at the TC, Manager, Vetter, or Guest level.
Vetters of TC Organizations that are fully engaged in the CLDR Technical Committee are given the higher vote level of a TC Organization Vetter to reflect their level of expertise and coordination in the workings of CLDR and the survey tool as compared to the regular Organization Vetter level.
Other organizations, including liaison or associate members, can create user accounts at the Manager, Vetter, or Guest level.
Users who sign up for an account outside of any other organization are added to the special Unaffiliated organization, and are given the level of Guest.
The TC may move users between organizations if needed, without losing voting records or history.

Voting Process

Note: within the Survey Tool, click on the ⓘ symbol in the right hand sidebar to access the “Vote Explainer” which explains how a particular item’s result was calculated.

Each user gets a vote on each value, but the strength of the vote varies according to the user level (see table above).
For each value, each organization gets a vote based on the maximum (not cumulative) strength of the votes of its users who voted on that item.
- For example, if an organization has 10 Vetters for one locale and the highest-level user who voted has a vote strength of 4, then the vote count attributed to the organization as a whole is 4 for that item.
If the users within an organization vote for different values, a user with a higher voting level will overrule a vote at a lower voting level.
- For “TC Organizations” (see above), a later vote at the same level will override an earlier vote: “latest wins”.
- For all other organizations, a value with more users voting for it will win over a value with fewer users voting for it.
- If there’s still a tie, the tie will be broken arbitrarily by comparing the text lexically.
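The rules above for deriving a single organization vote can be sketched in Python as follows. This is a simplified model, not the Survey Tool's actual code: the `Vote` class, its `order` field (a sequence number standing in for the time a vote was cast), and the function name are hypothetical. It also assumes that only votes at the organization's highest voting level are compared, and that the final lexical tie-break picks the smaller string; the text above does not specify either detail.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Vote:
    user: str
    value: str
    strength: int  # votes for the user's level, e.g. 4 for an Organization Vetter
    order: int     # larger means cast later (hypothetical sequence number)


def organization_vote(votes, is_tc_org):
    """Return (value, strength) for one organization on one item, or None."""
    if not votes:
        return None
    top = max(v.strength for v in votes)  # maximum, not cumulative
    top_votes = [v for v in votes if v.strength == top]
    if is_tc_org:
        # "Latest wins" among votes at the same (highest) level.
        latest = max(v.order for v in top_votes)
        candidates = [v.value for v in top_votes if v.order == latest]
    else:
        # The value backed by more users wins.
        counts = Counter(v.value for v in top_votes)
        best = max(counts.values())
        candidates = [value for value, n in counts.items() if n == best]
    # Remaining ties: compare the text (assumption: smallest string wins).
    return min(candidates), top


votes = [
    Vote("a", "Jan.", 4, 1),
    Vote("b", "Jan.", 4, 2),
    Vote("c", "Jan", 4, 3),
    Vote("d", "Janv", 1, 4),  # lower level, overruled by the level-4 votes
]
print(organization_vote(votes, is_tc_org=False))  # majority among top-level votes
print(organization_vote(votes, is_tc_org=True))   # latest top-level vote
```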

Optimal Field Value

For each release, there is one optimal (or “winning”) field value determined by the following process:

Add up the votes for each value from each organization. (Each organization votes for only one value, see above.)
Sort the possible alternative values for a given field:
- by the most votes (descending)
- then by UCA order of the values (ascending)
The first value is the optimal value (O).
The second value (if any) is the next best value (N).

Draft Status of Optimal Field Value

Let O be the optimal value’s vote, N be the vote of the next best value (or zero if there is none), and G be the number of organizations that voted for the optimal value. Let oldStatus be the draft status of the previously released value.
Assign the draft status according to the first of the conditions below that applies:

Resulting Draft Status	Condition
approved	- O > N and O ≥ 8, for established locales* - O > N and O ≥ 4, for other locales
contributed	- O > N and O ≥ 4 and oldStatus < contributed - O > N and O ≥ 2 and G ≥ 2
provisional	O ≥ N and O ≥ 2
unconfirmed	otherwise

* Established locales are currently identified in coverageLevels.xml by approvalRequirement[@votes="8"].
- Some specific items have an even higher threshold. See approvalRequirement elements in coverageLevels.xml for details.
If the oldStatus is better than the new draft status, then no change is made. Otherwise, the optimal value and its draft status are made part of the new release.
- For example, if the new optimal value does not have the status of approved, and the previous release had an approved value (one that does not have an error and is not a fallback), then that previously-released value stays approved and replaces the optimal value in the following steps.
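The table and the oldStatus rule can be read as a short decision procedure. The Python sketch below is one possible reading, not the Survey Tool's implementation: the function names are hypothetical, and `required_votes` is a parameter introduced here to carry the approval threshold (8 for established locales, 4 for other locales, or a higher value for the specific items mentioned above).

```python
STATUS_ORDER = ["unconfirmed", "provisional", "contributed", "approved"]


def rank(status):
    """Position of a draft status, from weakest (0) to strongest (3)."""
    return STATUS_ORDER.index(status)


def new_draft_status(o, n, g, old_status, required_votes=8):
    """Return the first matching row of the table above.

    o, n: votes of the optimal and next best value; g: number of
    organizations that voted for the optimal value.
    """
    if o > n and o >= required_votes:
        return "approved"
    if o > n and ((o >= 4 and rank(old_status) < rank("contributed"))
                  or (o >= 2 and g >= 2)):
        return "contributed"
    if o >= n and o >= 2:
        return "provisional"
    return "unconfirmed"


def released_status(o, n, g, old_status, required_votes=8):
    """Keep the old status when it is better than the newly computed one."""
    new = new_draft_status(o, n, g, old_status, required_votes)
    return old_status if rank(old_status) > rank(new) else new


print(new_draft_status(8, 4, 2, "unconfirmed"))  # established locale
print(new_draft_status(4, 0, 1, "unconfirmed", required_votes=4))
print(released_status(4, 0, 1, "contributed"))
```

In the last call the newly computed status is weaker than the previous one, so under this reading the previously released value and its status would be retained.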

It is difficult to develop a formulation that provides for stability, yet allows people to make needed changes. The CLDR committee welcomes suggestions for tuning this mechanism. Such suggestions can be made by filing a ticket.

Data Resolution

After data has been collected and vetted, it goes through a curation stage to ensure that it is free of errors for the release:

Collision errors are resolved by retaining one of the values and removing the other(s).
The resolution choice is based on the judgment of the committee, typically according to which field is most commonly used.
- When an item is removed, an alternate may then become the new optimal value.
- All values with errors are removed.
Non-optimal values are handled as follows:
- Those with no votes are removed.
- Those with votes are marked with alt=proposed and given the draft status: unconfirmed
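The handling of non-optimal values can be sketched as a simple filter. In this Python illustration the function name and the dictionary record shape are hypothetical; only the alt=proposed marking and the unconfirmed draft status are taken from the rules above, and values with errors are assumed to have been removed already.

```python
def resolve_non_optimal(vote_totals, optimal):
    """Return the records kept for one field after resolution.

    vote_totals: mapping of candidate value -> total votes.
    optimal: the winning value, which is kept as the main record.
    """
    kept = [{"value": optimal}]
    for value, votes in vote_totals.items():
        if value == optimal:
            continue
        if votes > 0:
            # Non-optimal values with votes survive as unconfirmed proposals.
            kept.append({"value": value, "alt": "proposed", "draft": "unconfirmed"})
        # Non-optimal values with no votes are dropped.
    return kept


print(resolve_non_optimal({"Jan.": 8, "Jan": 6, "Janv": 0}, optimal="Jan."))
```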

If a locale does not have minimal data (at least at a provisional level), then it may be excluded from the release. Where this is done, it may be restored to the repository for the next submission cycle.

This process can be fine-tuned by the Technical Committee as needed, to resolve any problems that turn up. A committee decision can also override any of the above process for any specific values.

For more information see the key links in CLDR Survey Tool (especially the Vetting Phase).

Notes:

If data has a formal problem, it can be fixed directly (in GitHub) without going through the above process. Examples include:
- syntactic problems in pattern, extra trailing spaces, inconsistent decimals, mechanical sweeps to change attributes, translatable characters not quoted in patterns, changing ‘ (punctuation mark) to curly apostrophe or s-cedilla to s-comma-below, removing disallowed exemplar characters (non-letter, number, mark, uppercase when there is a lowercase).
- These are changed in-place, without changing the draft status.
Linguistically sensitive data should always go through the survey tool. Examples include:
- names of months, territories, number formats, changing ASCII apostrophe to U+02BC modifier letter apostrophe or U+02BB modifier letter turned comma, or U+02BD modifier letter reversed comma, adding/removing normal exemplar characters.
The Technical Committee can authorize bulk submissions of new data directly, with all new data marked draft="unconfirmed" (or other status decided by the committee), but only where the data passes the CheckCLDR console tests.
The survey tool does not currently handle all CLDR data. For data it doesn’t cover, the regular bug system is used to submit new data or ask for revisions of this data. In particular:
- Collation, transforms, or text segmentation, which are more complex.
  - For collation data, see the comparison charts or the XML data at /common/collation/
  - For transforms, see the XML data at /common/transforms/
- Non-linguistic locale data:
  - XML data: /common/supplemental/
  - HTML view: supplemental data charts

Prioritization

There may be conflicting common practices or standards for a given country and language. Thus LDML provides keyword variants to reflect the different practices (for example, for German it allows the distinction between PHONEBOOK and DICTIONARY collation).

When there is an existing national standard for a country that is widely accepted in practice, the goal is to follow that standard as much as possible. Where the common practice in the country deviates from the national standard, or if there are multiple conflicting common practices, or options in conforming to the national standard, or conflicting national standards, multiple variants may be entered into the CLDR, distinguished by keyword variants or variant locale identifiers.

Where a data value is identified as following a particular national standard (or other reference), the goal is to keep that data aligned with that standard. There is, however, no guarantee that data will be tagged with any or all of the national standards that it follows.

Maintenance Releases

Maintenance releases, such as 26.1, are issued whenever the standard identifiers change (that is, BCP 47 identifiers, Time zone identifiers, or ISO 4217 Currency identifiers). Updates to identifiers will also mean updating the English names for those identifiers.

Corrigenda may also be included in maintenance releases. Maintenance releases may also be issued if there are substantive changes to supplemental data (non-language data such as script info or transforms) or other critical data changes that impact the community of CLDR data users.

The structure and DTD may change, but except for additions or for small bug fixes, data will not be changed in a way that would affect the content of resolved data.

Data Retention Policy

Public Feedback Process

The public can supply formal feedback into CLDR via the CLDR Survey Tool or by filing a ticket. There is also a public forum for questions at CLDR Mailing List (details on archives are found there).

There is also a members-only CLDR mailing list for members of the CLDR Technical Committee.

Public Review Issues may be posted in cases where broader public feedback is desired on a particular issue.

Be aware that changes and updates to CLDR will only be taken in response to information entered in the CLDR Survey Tool or by filing a ticket. Discussion on public mailing lists is not monitored; no actions will be taken in response to such discussion – only in response to filed bugs. The process of checking and entering data takes time and effort; so even when bugs/feature requests are accepted, it may take some time before they are in a release of CLDR.

Data Release Process

Version Numbering

The locale data is frozen per version. Once a version is released, it is never modified. Any change, however minor, means that a newer version of the locale data is released. The version numbering scheme is “xy.z”, where z is incremented for maintenance releases, and xy is incremented for regular semi-annual releases as defined by the schedule.
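A minimal Python sketch of the “xy.z” scheme follows. The helper names are hypothetical, and the sketch assumes that a regular release is written without a “.0” suffix (as in “46”); it illustrates only how the two counters move, not any official tooling.

```python
def parse_version(version):
    """Split "46.1" into (46, 1); a bare "46" is treated as (46, 0)."""
    major, _, minor = version.partition(".")
    return int(major), int(minor or 0)


def next_maintenance(version):
    """Increment z: the next maintenance release after the given version."""
    major, minor = parse_version(version)
    return f"{major}.{minor + 1}"


def next_regular(version):
    """Increment xy: the next regular semi-annual release."""
    major, _ = parse_version(version)
    return str(major + 1)


print(next_maintenance("46"))  # maintenance release following 46
print(next_regular("46.1"))    # regular release following the 46 series
```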

Release Schedule

Early releases of a version of the common locale data will be issued as either alpha or beta releases, available for public feedback. The dates for the next scheduled release will be on CLDR Project.

The schedule milestones are listed below.

Milestone	JiraPhase	Description
Survey Tool Shakedown		Selected survey tool users try out the survey tool and supply feedback. The contributed data will be considered as real data.
Data Submission	dsub	All registered Survey Tool users can add data and vet (vote for) data.
Data Vetting	dvet	The focus of Survey Tool users shifts to resolving data differences/disputes and errors.
Data Resolution		Data contribution is closed for general contributors. The Technical Committee will close remaining errors and issues found during the release process.
Alpha and Beta releases	rc	The release candidates are available for testing. Only showstoppers will be triaged and fixed at this point.
Release	final	Release completed with referenceable release notes and links.

Labels in the JiraPhase column correspond to the phase field in Jira. The phase field is used to identify tickets that need to be completed before the start of each milestone (see the table above).

Meetings and Communication

The currently-scheduled meetings are listed on the Unicode Calendar. Meetings are held by phone and Google Meet, weekly. Additional meetings are scheduled depending on the need and participants’ availability.

There is an internal email list for the Unicode CLDR Technical Committee, open to Unicode members and invited experts. All national standards bodies who are interested in locale data are also invited to become involved by establishing a Liaison membership in the Unicode Consortium, to gain access to this list.

Officers

The current Technical Committee Officers are:

Chair: Mark Davis (Google)
Vice-Chair: Annemarie Apple (Google)