Specifying text break variants in locale IDs

   
Author Peter Edberg
Date 2014-11-11, last update 2016-10-20
Status Proposal
Feedback to pedberg (at) apple (dot) com
Bugs See list below

This proposal discusses options for extending Unicode locale identifiers to specify text break variants with a locale. It was prompted by CLDR and ICU bugs including the following, as well as by other requests:

I. Options needed (as known so far)

A. Grapheme cluster break

Need to choose one of the following (current CLDR/ICU implementation uses extended grapheme clusters):

B. Word break

Currently uses dictionary-based break for sequences in CJK scripts (Han/Kana/Hangul) or SE Asian scripts (LineBreak property value SA/Complex_Context: Thai, Lao, Khmer, Myanmar, etc.); we need a locale keyword that can turn this on or off (i.e. off to use basic UAX #29 word break), at least for CJK.

C. Sentence break

We need a locale keyword to control use of ULI suppressions data (i.e. to determine whether we should wrap the UAX29-based break iterator in a FilteredBreakIterator instance for the locale, and to determine which suppressions set to use).

D. Line break (highest priority)

Currently ICU uses dictionary-based break for text in SE Asian scripts only. The two most important needs for line break control are:

Other desirable capabilities include:

II. Notes on CSS level 3 line break

(from draft of Jun 2015, http://dev.w3.org/csswg/css-text/#line-breaking)

CSS has two independent properties for controlling line break behavior:

A. The line-break property

This is mainly about break behavior for punctuation and symbols, though it does affect small kana. The rules are intended to specify behavior that may be language-specific, but explicit rules are provided for CJK. Besides the “auto” value, there are three specific values for this property.

B. The word-break property

This only controls break opportunities between letter-like characters (including ideographs), and has 3 possible values. Symbols that break in the same way as letters are affected in the same way by these options.

III. Proposed -u extension keys

A. For control of grapheme cluster break

For gb, current default is extended.

<key name="gb" description="Grapheme cluster break type key">

	<type name="legacy" description=“Grapheme break using UAX #29 legacy grapheme clusters"/>

	<type name="extended" description="Grapheme break using UAX #29 extended grapheme clusters"/>

	<type name="aksara" description="Grapheme break adding aksaras to extended grapheme clusters”/>

B. For control of word break

(Type key not needed yet and values undetermined, just reserve it)

<key name="wb" description="Word break type key">

Will also need a word break parameter key to control whether dictionary-based work break is used, probably need separate control for at least for CJ, Korean, and SEAsian scripts; no key proposed yet.

C. For control of sentence break

(Type key not needed yet and values undetermined, just reserve it)

<key name="sb" description="Sentence break type key">

For ss, current default is none.

<key name=“ss” description=“Sentence break parameter key to control use of suppressions data”>

	<type name=“none” description="Don’t use segmentation suppressions data"/>

	<type name=“standard” description="Use segmentation suppressions data of type standard"/>

D. For control of line break

The current proposal is to use the type to specify the CSS line-break property; this can be used in older implementations as e.g. “@lb=strict”. One or more additional parameter keywords are provided to permit control of the CSS word-break property and to permit control of whether AI is treated as AL or ID.

D1. Supporting CSS line-break

For lb, the current default is normal for the “ja” locale, but strict for others; it should probably be normal for all since the distinction is mainly relevant for Japanese), and the discussion below assumes that change.

<key name="lb" description="Line break type key">

	<type name="strict" description="CSS lev 3 line-break=strict, e.g. treat CJ as NS"/>

	<type name="normal" description="CSS lev 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh"/>

	<type name="loose" description="CSS lev 3 line-break=loose"/>

D2. Supporting other controls including CSS word-break (for line break), first idea (2014-11)

For the other controls, including support of the CSS word-break property, I think it is best to have separate control over how certain sets of characters are treated:

<key name="lc” description="Line break class remapping“>

	<type name=“LB_CLASS_MAP_CODE” description=“One or more linebreak class mapping codes, see xxx”/>

where LB_CLASS_MAP_CODE is a sequence of one or more of the following codes (separated by - or _):

Then for example CSS lev 3 word-break=keep-all could be indicated as “-u-lc-hang2al-id2al”.

D3. Supporting CSS word-break (for line break), second idea (2015-07)

I now think explicit remapping of certain classes is the wrong approach for supporting the CSS word-break options for line break control:

Thus I propose the following .

<key name="lw" description="Line break key for CSS lev 3 word-break options">

	<type name="normal" description="CSS lev 3 word-break=normal"/>

	<type name="breakall" description="CSS lev 3 word-break=break-all, allow breaks in words unless forbidden by lb setting"/>

	<type name="keepall" description="CSS lev 3 word-break=keep-all, prohibit breaks in words except for dictionary breaks"/>

E. Other ideas

For linebreak control:

Overall: Another suggestion goes further than the second bullt above: Have just a single keyword to specify all break variants; it would be followed by a list of attributes that would all share a single namespace, and whose names would need to identify which type of break they affected. Examples might include gblegacy, gbextend, gbaksara (use one to specify grapheme break); ssnone, ssstd (use one to specify sentence break suppressions); etc. While this consumes less of the -u keyword namespace, it is less flexible at mapping to values specified in resource attributes, such as different types of sentence break suppression data, unless significant restrictions are placed on those attribute values.

F. Current status

F1. keyword -lb-

In the CLDR meeting of 2014-Nov-19, it was agreed to add the -lb- keyword with at least the values “strict”, “normal” and “loose” for support of the corresponding CSS level 3 behavior; for legacy-style Unicode locale IDs using ‘@’, “lb=” should be used. The implementation details are not yet determined or specified, nor are the details of any locale-specific override behavior.

Current (2015-02-18) work under CLDR #4931 and ICU #9379 includes the following, approved in CLDR and ICU meetings:

  1. Add new CLDR file common/bcp47/segmentation.xml (name OK?) with the following:
<key name="lb" description="Line break type key" since="27">

	<type name="strict" description="CSS level 3 line-break=strict, e.g. treat CJ as NS"/>

	<type name="normal" description="CSS level 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh"/>

	<type name="loose" description="CSS lev 3 line-break=loose"/>

</key>
  1. In CLDR file common/dtd/ldmlICU.dtd, add “alt” as an attribute for the <icu:line …> element (and allow multiple <icu:line …> elements).
  2. In ICU icu/trunk/source/data/xml/brkitr/ files such as root.xml, fi.xml, and ja.xml, add lines mapping the line break types to corresponding rule files, e.g. in root:
<icu:line alt="loose" icu:dependency="line_loose.brk"/>

<icu:line alt="normal" icu:dependency="line_normal.brk"/>

<icu:line alt="strict" icu:dependency="line.brk"/>

(Note that we need to add brkitr locales for zh and zh_Hant since they have non-standard CSS line break types, like ja)

  1. In CLDR, update tools/java/org/unicode/cldr/icu/BreakIteratorMapper.java to handle the alts (3 added lines).
  2. In ICU, add 6 new line break rule files in source/data/brkitr/ (and delete line_ja.txt):
line_loose.txt

line_loose_cj.txt

line_loose_fi.txt

line_normal.txt

line_normal_cj.txt

line_normal_fi.txt

These result in an increase of about 630K bytes (2.5%) in the data file. These can be tailored out in cases for which it is a problem, either by deleting lines from the ICUdata/xml/brkitr/ files if building from CLDR data, or by deleting corresponding lines in the data/brkitr/<locale>.txt files and deleting the unused files from BRK_SOURCE in data/brkitr/brkfiles.mk. #11530 is to investigate a more efficient way of representing the line break rule variants.

Note that the CLDR representation of the line break rules have not yet been updated to match (they are currently ignored when generating ICU data).

  1. In ICU4C, update BreakIterator::makeInstance to map locale to the correct ruleset (about 10 lines, not yet committed)… similar change in ICU4J.
  2. Update testate/rbbitst.txt to test the variants. More extensive monkey tests for the variants are covered by #11531.

F2. keywords -ss-, -lw-

Proposal for CLDR & ICU meetings 2015-Jul-08:

  1. In CLDR file common/bcp47/segmentation.xml add the following (approved in CLDR meeting 2015-Jul-08):
<key name="lw" description="Line break key for CSS lev 3 word-break options" since="28">

	<type name="normal" description="CSS lev 3 word-break=normal"/>

	<type name="breakall" description="CSS lev 3 word-break=break-all, allow breaks in words unless forbidden by lb setting"/>

	<type name="keepall" description="CSS lev 3 word-break=keep-all, prohibit breaks in words except for dictionary breaks"/>

</key>

Default value is “normal”. English names for the values are:

• normal: “Normal line breaks for words”

• breakall: “Allow line breaks in all words”

• keepall: “Prevent line breaks in all words”

<key name="ss" description="Sentence break parameter key to control use of suppressions data" since="28">

<type name="none" description="Don’t use segmentation suppressions data"/>

<type name="standard" description="Use segmentation suppressions data of type standard"/>

</key>

Current default value is “none”. In the future we hope to make the default “standard”. English names for the values are:

  1. In ICU BreakIterator, initial support will be incomplete (details for ICU4C below, similar approach in ICU4J):

a) In ICU4C BreakIterator::makeInstance, for kind = UBRK_SENTENCE, if locale has key “ss” with value “standard”, then call FilteredBreakIteratorBuilder on the result of BreakIterator::buildInstance to produce a new BreakIterator* which supports the sentence break exceptions. Notes:

b) In ICU4C RuleBasedBreakIterator::handleNext and handlePrevious, for now we can implement an approximation of support for the key “lw” values by alteriing the character classes as follows (similar to the behavior in section D2 above):

More complete support is dependent on a mechanism for turning on and off certain rules, see ICU #11530.

IV. Implementation notes

What I had in mind was that the break type selection (gb, lb) would be implemented by selection of different break table resources, while the parameter keywords (ss, lc) would be implemented in code (changing line break classes, perhaps with an annotation in the tables along the lines suggested in http://unicode.org/cldr/trac/ticket/4931). However, it is not clear how to implement selection of different tables given the current resource structure in ICU (which does not exactly mirror the CLDR structure).

A. CLDR XML structure

Currently in CLDR we can have a structure locale-specific break iterator data icu/trunk/source/data/xml/brkitr/xx.xml as follows; except for the suppressions data, this is otherwise ignored for building ICU data (segmentation type is GraphemeClusterBreak, WordBreak, LineBreak, SentenceBreak):

<ldml>

	<segmentations>

		<segmentation type="LineBreak">

			<variables>

				….

			</variables>

			<segmentRules>

				….

			</segmentRules>

		<segmentation type="SentenceBreak">

			<suppressions type="standard">

				…

			</suppressions

We could an attribute “alt” for <segmentation> to specify the specific variant (corresponds to the value for the -gb or -lb keyword, for example), though this would currently be ignored for LDML to ICU conversion:

<segmentation type="LineBreak" alt="strict">

Handling of default values and elements without “alt” is discussed in section E below.

B. ICU XML source structure

In ICU we have XML source data and generated txt data. The XML source structure is specified by

http://www.unicode.org/repos/cldr/trunk/common/dtd/ldmlICU.dtd

and currently looks like this for root (any locale-specific data uses a subset of this):

<special xmlns:icu="http://www.icu-project.org/">

	<icu:breakIteratorData>

		<icu:boundaries>

			<icu:grapheme icu:dependency="char.brk"/>

			<icu:word icu:dependency="word.brk"/>

			<icu:line icu:dependency="line.brk"/> or e.g. "line_xx.brk" in locale-specific data

			…

		</icu:boundaries>

		<icu:dictionaries>

			<icu:dictionary type="Hani" icu:dependency="cjdict.dict"/>

			<icu:dictionary type="Hira" icu:dependency="cjdict.dict"/>

			…

			<icu:dictionary type="Thai" icu:dependency="thaidict.dict"/>

		</icu:dictionaries>

	</icu:breakIteratorData>

</special>

Note that the following attributes for the boundaries subelements (icu:word etc.) are defined in CLDR’s ICU DTD but currently unused:

icu:class NMTOKEN #IMPLIED

icu:append NMTOKEN #IMPLIED

icu:import NMTOKEN #IMPLIED

We could define an additional attribute “alt” and then use that to match the CLDR <segmentations> alt attribute:

<icu:boundaries>

	<icu:grapheme icu:dependency="char.brk"/>

	<icu:grapheme alt="extended" icu:dependency="char.brk"/>

	<icu:grapheme alt="legacy" icu:dependency="char_legacy.brk"/>

	…

	<icu:line icu:dependency="line.brk"/>

	<icu:line alt="normal" icu:dependency="line.brk"/>

	<icu:line alt="strict" icu:dependency="line_strict.brk"/>

	…

</icu:boundaries>

C. ICU txt resource structure

The ICU xml files (and the CLDR xml files, for suppressions data) are processed by CLDR tools such as cldr/trunk/tools/java/org/unicode/cldr/icu/BreakIteratorMapper.java to generate the text resources, for example:

root{

	boundaries{

		grapheme:process(dependency){"char.brk"}

		line:process(dependency){"line.brk"}

		…

		word:process(dependency){"word.brk"}

	}

	dictionaries{

		Hani:process(dependency){"cjdict.dict"}

		Hira:process(dependency){"cjdict.dict"}

		...

		Thai:process(dependency){"thaidict.dict"}

	}

}

xx{

	boundaries{

		line:process(dependency){"line_xx.brk"}

	}

	exceptions{

		SentenceBreak:array{

			"Mr.",

			"Etc.",

			…

		}

	}

}

These files are read by BreakIterator::buildInstance(…) in ICU4C, with a type parameter that maps directly to the key in the boundaries resource: “grapheme”, “line”, etc. Currently there is not a way to add attributes for the boundaries subelements such as line or word. However, we could map the icu:alt values proposed in section C to resource keys with extensions where appropriate:

boundaries{

	grapheme:process(dependency){"char.brk"}

	grapheme_extended:process(dependency){"char.brk"}

	grapheme_legacy:process(dependency){"char_legacy.brk"}

	…

	line:process(dependency){"line.brk"}

	line_normal:process(dependency){"line.brk"}

	line_strict:process(dependency){"line_strict.brk"}

	…

}

BreakIterator::buildInstance is called by BreakIterator::makeInstance, which provides the type keys “grapheme”, “line”, etc. It could use the locale to construct the resource keys with extensions.

D. Current dictionary break implementation

(See also the relevant section of the ICU User Guide)

The use of dictionary break depends on the existence in the rules of a variable “$dictionary” which defines the UnicodeSet of characters for which dictionary break should be used.

For line break, this is defined as “$dictionary = [:LineBreak = Complex_Context:];” where the Line_Break property value Complex_Context is equivalent to SA and applies to most letters, marks, and some other signs in Southeast Asian scripts: Thai, Lao, Myanmar, Khmer, Tai Le, New Tai Lue, Tai Tham, Tai Viet, etc. For word break, in addition to characters with Line_Break property value SA, the $dictionary set includes characters with script Han, Hiragana, Katakana, as well as composed Hangul syllables in the range \uAC00-\uD7A3 (not sure why the latter are included, since we do not have dictionary support for them).

In both cases, the rules are defined to disallow breaks between characters in the $dictionary set. When determine the next or previous break, the iterator first determines the break using the normal rules (which will not break between characters in the $dictionary set); in the process it marks which characters are handled by a dictionary break engine (For each script that has a break dictionary, the associated break engine defines a more specific set of characters to which it applies). If characters handled by a dictionary break engine were encountered, the break iterator then invokes the dictionary break engines to determine breaks within the $dictionary-set span.

E. Multiple rule sets that depend on break type

It would be nice for a given locale to be able to specify, for each break type, which variant is the default for that locale. In root this can just be done by using the resource key without any extension. In other locales, we could do something like this in the CLDR XML:

<segmentations>

	<default type="LineBreak" alt="strict">

	<default type="GraphemeClusterBreak" alt="legacy">

V. Acknowledgments

Thanks to Koji Ishii and the CLDR team for feedback on this document.