The goal is to let clients use the grapheme cluster definition that is appropriate for a given task in a given language. See http://unicode.org/cldr/trac/ticket/2142. Please leave any feedback as comments on that ticket.
The idea has two parts. First, we define explicit boundaries that represent certain common behaviors (such as codepoint breaks or legacy grapheme cluster breaks). Second, for each language we associate each function with the explicit boundary that should be used for that function in that language.
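To make the notion of different explicit boundaries concrete, here is a minimal Python sketch that compares two boundary definitions on the same string. The function names are hypothetical, and the second function is a deliberately simplified stand-in (it only keeps combining marks attached to the preceding character). It is not the legacy or extended grapheme cluster algorithm from UAX #29.

```python
import unicodedata


def codepoint_boundaries(text):
    """Return boundary offsets with a break after every code point."""
    return list(range(len(text) + 1))


def simple_cluster_boundaries(text):
    """Return boundary offsets that never break before a combining mark.

    Simplified illustration only: real grapheme cluster rules (UAX #29)
    cover many more cases than combining marks.
    """
    bounds = [0]
    for i in range(1, len(text)):
        # A nonzero combining class means the character attaches to its base.
        if unicodedata.combining(text[i]) == 0:
            bounds.append(i)
    if text:
        bounds.append(len(text))
    return bounds


sample = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT + 'a'
print(codepoint_boundaries(sample))       # [0, 1, 2, 3]
print(simple_cluster_boundaries(sample))  # [0, 2, 3]
```

With codepoint boundaries, a backspace would remove only the accent. With the cluster-style boundaries, a selection would treat the accented letter as a single unit.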
Here is a proposal for the structure in LDML:
<grapheme-usage type="count">extended</grapheme-usage> <!-- when counting 'user characters' -->
<grapheme-usage type="drop-cap">legacy</grapheme-usage> <!-- paragraph drop-caps -->
<grapheme-usage type="selection">aksara</grapheme-usage> <!-- selection boundaries: highlighting, keyboard arrows, cut&paste -->
<grapheme-usage type="backspace">codepoint</grapheme-usage> <!-- delete previous character -->
<grapheme-usage type="delete">extended</grapheme-usage> <!-- delete next character -->
The above would be tailorable per locale.
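As a rough sketch of how a client might resolve these associations, the Python below looks up the boundary type for a function, falling back from a locale to its parents and finally to root. Several things here are assumptions made for illustration: the example values above are treated as root defaults, the locale "xx" and its override are hypothetical, and fallback by truncating the locale identifier is a simplification of CLDR inheritance.

```python
# Assumption: the values from the example elements are used as root defaults.
USAGE = {
    "root": {
        "count": "extended",
        "drop-cap": "legacy",
        "selection": "aksara",
        "backspace": "codepoint",
        "delete": "extended",
    },
    # Hypothetical tailoring for a made-up locale "xx".
    "xx": {"selection": "extended"},
}


def parent_chain(locale):
    """Yield the locale, then its truncated parents, then 'root'."""
    parts = locale.split("_")
    while parts:
        yield "_".join(parts)
        parts.pop()
    yield "root"


def grapheme_usage(locale, function):
    """Return the boundary type for a function, using the nearest tailoring."""
    for loc in parent_chain(locale):
        value = USAGE.get(loc, {}).get(function)
        if value is not None:
            return value
    raise KeyError(function)


print(grapheme_usage("xx_YY", "selection"))  # extended (tailored in "xx")
print(grapheme_usage("xx_YY", "backspace"))  # codepoint (inherited from root)
```

Keeping the tailoring sparse, with only the differing functions listed per locale, matches the expectation that most locales would inherit the defaults unchanged.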
In segments/root.xml we have GraphemeClusterBreak. We interpret that as extended grapheme clusters for compatibility. We then add rules for:
These would also be tailorable per locale (except CodePoint), but such tailoring should be rare.
Clients like ICU would add new constants for getting BreakIterators (or equivalents). These constants would correspond both to the new explicit rules:
And to the new 'function-based' breaks: