Indic Grapheme Clusters (Draft)

A number of scripts don't break after viramas (halants), so a cluster like ksha (X) is bound together into a unit that behaves like a single character for most operations.

In CLDR 35, this behavior is enabled for 6 scripts: Gujr, Telu, Mlym, Orya, Beng, and Deva. It will be implemented in ICU in its next release.

Adding Scripts

To add another script, please open a new ticket, and:
  1. Provide verification that the implementation below works for the language.
  2. Attach a test file for that script. It must be in precisely the format used in common/testData/segmentation/graphemeCluster.


Adding a script changes the ScriptList used in the definitions below. We need verification that it is acceptable to forbid breaking after a Virama in all of these cases.

Variable Virama: ScriptList, limited to Indic_Syllabic_Category=Virama
Variable LinkingConsonant: ScriptList, limited to Indic_Syllabic_Category=Consonant
Rule 9.3: Don't break within LinkingConsonant Virama LinkingConsonant (allowing also combining marks before and after the Virama)
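
As a rough illustration of Rule 9.3, the following Python sketch checks whether a break is forbidden before a given position. It is not the CLDR or ICU implementation. The character sets are assumptions: a hand-picked Devanagari subset (the virama U+094D and the consonants U+0915..U+0939) stands in for the real Indic_Syllabic_Category data, "combining mark" is approximated by general category M, and the function name no_break_before is hypothetical.

```python
import unicodedata

# Assumption: a small Devanagari subset standing in for the real property data.
VIRAMA = {"\u094D"}                                             # DEVANAGARI SIGN VIRAMA
LINKING_CONSONANTS = {chr(cp) for cp in range(0x0915, 0x093A)}  # KA..HA


def is_combining_mark(ch):
    """Approximate 'combining mark' as general category M, excluding the virama."""
    return unicodedata.category(ch).startswith("M") and ch not in VIRAMA


def no_break_before(text, i):
    """Return True if this simplified Rule 9.3 forbids a break before text[i]."""
    if i <= 0 or i >= len(text) or text[i] not in LINKING_CONSONANTS:
        return False
    j = i - 1
    # Skip combining marks between the virama and the following consonant.
    while j >= 0 and is_combining_mark(text[j]):
        j -= 1
    if j < 0 or text[j] not in VIRAMA:
        return False
    j -= 1
    # Skip combining marks between the leading consonant and the virama.
    while j >= 0 and is_combining_mark(text[j]):
        j -= 1
    return j >= 0 and text[j] in LINKING_CONSONANTS


ksha = "\u0915\u094D\u0937"          # KA + VIRAMA + SSA
print(no_break_before(ksha, 2))      # break before SSA is forbidden
print(no_break_before("\u0915\u0937", 1))  # no virama, so the rule does not apply
```

The sketch only covers Rule 9.3; the other grapheme cluster rules still decide every boundary it does not forbid. It also does not treat ZWJ as a combining mark.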

To see what characters would be affected, look at the following lists (replacing Deva by your script's code). Please also supply links to web pages that substantiate this.

Implementation Details

Most people don't need to know the details, but for the curious there is more information at: