[a-m w-z], or using a combination of Unicode properties such as the following, for the Arabic script characters that have a canonical decomposition: [[:script=arabic:]&[:decompositiontype=canonical:]]

Operation

Enter a UnicodeSet into the Input box, and hit Show Set. You can also choose certain combinations of options for display, such as abbreviated or not. The values you use are encapsulated into a URL for reference.

If you add properties to the Group By box, you can sort the results by property values. For example, if you set it to "General_Category Numeric_Value" (or the short form "gc nv"), you'll see the results sorted first by the general category of the characters, and then by the numeric value.

Syntax

UnicodeSets are defined according to the description in UTS #35: Locale Data Markup Language (LDML), but have some useful extensions in these online demos.

Properties can be specified either with Perl-style notation (\p{script=arabic}) or with POSIX-style notation ([:script=arabic:]).

Properties and values can use either a long form (like "script") or a short form (like "sc").

No argument is equivalent to "Yes"; this is mostly useful with binary properties, like \p{isLowercase}.

Convenience

The following examples illustrate the syntax with a particular property, value pair: the property age and the value 3.2:

The : can be used in the place of =. (Mostly because : doesn't require percent-encoding in URLs.)
The Perl and POSIX syntaxes for negation are \P{...} and [:^...:], respectively. The characters ≠ and ! are added for convenience:
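The notation rules described so far (Perl or POSIX style, = or : as the separator, a missing value meaning "Yes", and \P{...} or [:^...:] for negation) can be sketched as a small parser. This is only an illustration: the names PropertyTerm and parse_property_term are hypothetical, the demo's real parser is not shown here, and the ≠ and ! convenience forms are deliberately left out because their exact syntax is not given in this text.

```python
import re
from typing import NamedTuple


class PropertyTerm(NamedTuple):
    """Hypothetical container for one parsed property term."""
    prop: str
    value: str
    negated: bool


# Perl style: \p{...} or \P{...} (capital P negates).
_PERL = re.compile(r"\\([pP])\{(.+)\}")
# POSIX style: [:...:] or [:^...:] (leading ^ negates).
_POSIX = re.compile(r"\[:(\^?)(.+):\]")


def parse_property_term(text: str) -> PropertyTerm:
    """Parse a single property term written in either notation."""
    text = text.strip()
    match = _PERL.fullmatch(text)
    if match:
        negated = match.group(1) == "P"
    else:
        match = _POSIX.fullmatch(text)
        if match is None:
            raise ValueError(f"not a property term: {text!r}")
        negated = match.group(1) == "^"
    body = match.group(2)
    # Split on the first "=" or ":"; both separators are accepted.
    parts = re.split(r"[=:]", body, maxsplit=1)
    prop = parts[0].strip()
    # No argument is treated as the value "Yes".
    value = parts[1].strip() if len(parts) == 2 else "Yes"
    return PropertyTerm(prop, value, negated)
```

With this sketch, \p{age=3.2}, [:age=3.2:] and [:age:3.2:] all parse to the same property and value, and \P{age=3.2} and [:^age=3.2:] differ only in the negated flag.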
Regular Expressions

For the name property, regular expressions can be used for the value, enclosed in /.../. For example, in the following expression, the first term will select all those Unicode characters whose names contain "CJK". The rest of the expression will then subtract the ideographic characters, showing that these can be used in arbitrary combinations.
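The same idea can be approximated offline with Python's standard unicodedata module. The sketch below selects characters in the Basic Multilingual Plane whose names match a regular expression, then subtracts a second set. Two assumptions apply: the helper name chars_with_name_matching is made up for this example, and because unicodedata does not expose the Ideographic property, names containing "IDEOGRAPH" are used as a rough stand-in for the ideographic characters.

```python
import re
import unicodedata
from typing import Iterable, Set


def chars_with_name_matching(pattern: str,
                             code_points: Iterable[int] = range(0x10000)) -> Set[str]:
    """Return the characters whose Unicode name matches the regex pattern."""
    rx = re.compile(pattern)
    # Code points without a name (unassigned, surrogates, ...) yield "".
    return {
        chr(cp)
        for cp in code_points
        if rx.search(unicodedata.name(chr(cp), ""))
    }


# First term: every BMP character whose name contains "CJK".
cjk_named = chars_with_name_matching(r"CJK")
# Stand-in for the ideographic characters (assumption, see above).
ideograph_named = chars_with_name_matching(r"IDEOGRAPH")
# Set difference plays the role of the subtraction in the expression.
cjk_non_ideographs = cjk_named - ideograph_named
```

For instance, U+2E80 CJK RADICAL REPEAT remains in cjk_non_ideographs, while U+4E00 is in cjk_named but is removed by the subtraction.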
Some particularly useful regex features are:
Caveats:
Property Comparison

Property values can be compared to those for other properties, using the syntax @...@. For example:
There is a special property "cp" that returns the code point itself. For example:
Available Properties

You can see a full listing of the possible properties on http://unicode.org/cldr/utility/properties.jsp. The standard Unicode properties are supported, plus the extra ICU properties. There are some additional properties just in this demo. The easiest way to see the properties for a range of characters is to use a set like [:Greek:] in the Input, and then set the Group By box to the property name.
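To get a feel for what grouping by a property produces, the following sketch groups a range of characters by General_Category using Python's unicodedata module. It is not the demo's implementation: the function name group_by_general_category is hypothetical, and the Greek and Coptic block (U+0370..U+03FF) is used as a simple stand-in for [:Greek:], since unicodedata does not expose the Script property.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_general_category(chars: Iterable[str]) -> Dict[str, List[str]]:
    """Group characters by their General_Category short value (Lu, Ll, ...)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ch in chars:
        groups[unicodedata.category(ch)].append(ch)
    # Sort by category code so the output order is stable.
    return dict(sorted(groups.items()))


# Stand-in for [:Greek:]: the Greek and Coptic block (assumption).
greek_block = [chr(cp) for cp in range(0x0370, 0x0400)]
groups = group_by_general_category(greek_block)

for category, members in groups.items():
    print(category, "".join(members))
```

Each printed line corresponds to one property value, much like one group in the demo's output.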
Normally, \p{isX} is equivalent to \p{toX=@cp@}. There are some exceptions and missing cases.

Note: The Unassigned, Surrogate, and Private Use code points are skipped in the generation of some of these sets.

Casing Properties

Unicode defines a number of string casing functions in Section 3.13 Default Case Algorithms. These string functions can also be applied to single characters.

Warning: the first three sets may be somewhat misleading: isLowercase means that the character is the same as its lowercase version, which includes all uncased characters. To get those characters that are cased characters and lowercase, use
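The effect described in the warning can be reproduced with Python strings. In this sketch, is_lowercase is a hypothetical helper that mirrors the "toLowercase equals the code point itself" reading of isLowercase. Python's built-in str.islower() is shown only as a loosely related check that rejects uncased characters; treating it as identical to the set recommended here would be an assumption.

```python
def is_lowercase(ch: str) -> bool:
    """True when lowercasing leaves the character unchanged.

    This mirrors the isLowercase reading above (toLowercase == cp),
    so uncased characters such as digits also qualify.
    """
    return ch.lower() == ch


for ch in ["a", "A", "1", "!"]:
    # str.islower() needs at least one cased character, so it is
    # False for "1" and "!" even though is_lowercase() is True.
    print(repr(ch), is_lowercase(ch), ch.islower())
```

The digit and the punctuation mark pass the first test but not the second, which is exactly the kind of uncased character the warning refers to.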
Normalization Properties

Unicode defines a number of string normalization functions in UAX #15. These string functions can also be applied to single characters.
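As a small illustration of applying a normalization function to a single character, the sketch below uses unicodedata.normalize from Python's standard library. The helper names to_form and is_in_form are hypothetical; they follow the same "is" versus "to" pattern used for the other properties, where a character is in a form when normalizing it leaves it unchanged.

```python
import unicodedata


def to_form(form: str, ch: str) -> str:
    """Apply a normalization form (NFC, NFD, NFKC, NFKD) to one character."""
    return unicodedata.normalize(form, ch)


def is_in_form(form: str, ch: str) -> bool:
    """True when the character is unchanged by the given normalization form."""
    return to_form(form, ch) == ch


e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
# NFD splits the character into "e" plus U+0301 COMBINING ACUTE ACCENT.
decomposed = to_form("NFD", e_acute)
print([f"U+{ord(c):04X}" for c in decomposed])
print(is_in_form("NFD", e_acute), is_in_form("NFC", e_acute))
```

Here the precomposed character is already in NFC but not in NFD, so the two checks give different answers for the same code point.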