Character Sets

 

The GOLD Builder provides a collection of pre-defined character sets. These include sets that are commonly used to define terminals, as well as characters that are not accessible from the keyboard.

Common Characters

Set Name Description
{HT} Horizontal Tab character {#09}.
{LF} Line Feed character {#10}.
{VT} Vertical Tab character {#11}. This character is rarely used.
{FF} Form Feed character {#12}. This character is also known as "New Page".
{CR} Carriage Return character {#13}.
{Space} Space character {#32}. Technically, this set is not needed, since a space can be written in single quotes: ' '. It was added so that the developer can name the character explicitly, which improves readability.
{NBSP} No-Break Space character {#160}. The No-Break Space character is used to represent a space where a line break is not allowed. It is often used in source code for indentation.
{Euro Sign} The Euro Currency Sign {#8364}. The set is only available in versions 2.0.6 and later of the Builder.

Common Character Sets

Please see the Pre-Defined Character Set Chart for pictures of these characters.

Set Name Description
{Number} 0123456789
{Digit} 0123456789. This set is maintained to support older grammars. The term "digit" is technically inaccurate, and the set will eventually be removed, though not for a long time. Please use the {Number} set instead.
{Letter} abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
{AlphaNumeric} This set includes all the characters in {Letter} and {Number}
{Printable} This set includes all standard characters that can be printed onscreen: the characters from #32 to #127, plus #160 (No-Break Space). The No-Break Space character is included because it is often used in source code.
{Letter Extended} This set includes all letters found among the extended characters of the first 256 characters (ANSI).
{Printable Extended} This set includes all the printable characters above #127. Although rarely used in programming languages, they could be used, for instance, as valid characters in a string literal.
{Whitespace} This set includes all characters that are normally considered whitespace and ignored by the parser. The set consists of the Space, Horizontal Tab, Line Feed, Vertical Tab, Form Feed, Carriage Return and No-Break Space characters.
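
As an illustration only, the set descriptions above can be mirrored with ordinary Python sets. This is a sketch, not part of the Builder: the variable names are hypothetical, and the members are taken directly from the descriptions in the table.

```python
import string

# Hypothetical Python mirrors of the pre-defined sets described above.
NUMBER = set(string.digits)            # {Number}: 0123456789
LETTER = set(string.ascii_letters)     # {Letter}: a-z and A-Z
ALPHANUMERIC = LETTER | NUMBER         # {AlphaNumeric}: union of the two

# {Whitespace}: Space, HT, LF, VT, FF, CR and No-Break Space.
WHITESPACE = {chr(cp) for cp in (32, 9, 10, 11, 12, 13, 160)}

# {Printable}: code points 32 through 127, plus 160 (No-Break Space).
PRINTABLE = {chr(cp) for cp in range(32, 128)} | {chr(160)}

if __name__ == "__main__":
    print(len(LETTER), len(ALPHANUMERIC), len(WHITESPACE), len(PRINTABLE))
```

Expressing {AlphaNumeric} as a union reflects how the table defines it in terms of {Letter} and {Number}.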

Character Constants and Set Ranges

Each character in the Basic Multilingual Plane of the Unicode Character Set is represented by a 16-bit integer. This value is known as a "code point" in Unicode terminology. The characters from 0xD800 to 0xDBFF and from 0xFFF0 to 0xFFFF are reserved by Unicode for encoding. As a result, these values cannot be used.
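
The restriction above can be sketched as a small Python check. The function name is hypothetical, and the bounds are an assumption derived from the reserved ranges stated in this section (zero is also excluded, since the allowed values start at 1).

```python
def is_usable_code_point(cp: int) -> bool:
    """Return True if cp falls in one of the two usable ranges.

    Assumed bounds, following this section: 1 to 0xD7FF and
    0xDC00 to 0xFFEF. Everything else is treated as unusable.
    """
    return 0x0001 <= cp <= 0xD7FF or 0xDC00 <= cp <= 0xFFEF


if __name__ == "__main__":
    for value in (0, 169, 0xD7FF, 0xD800, 0xDBFF, 0xDC00, 0xFFEF, 0xFFF0):
        print(hex(value), is_usable_code_point(value))
```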

The developer can specify any character using either its decimal or hexadecimal code point. Decimal values are denoted by a number-sign prefix (#) and hexadecimal values are denoted by an ampersand (&).

Set ranges were added in version 2.6 of the Builder. They can be specified by using a ".." between two values. Both the start and end values can be in either decimal or hexadecimal.

Set Name Description
{#n} Using this notation, you can specify any character, in particular those not accessible via the keyboard. For instance, {#169} specifies the copyright character ©. The value of n can be any number from 1 to 55295 or from 56320 to 65519.
{&n} This is the hexadecimal notation for a single character. The value of n can be any number from &1 to &D7FF or from &DC00 to &FFEF.
{#n .. #m} Using this notation, you can specify a set containing the characters from n to m. The number-sign denotes a decimal value.
{&n .. &m} Set ranges can also be defined using hexadecimal values.

Examples

Declaration Characters Comments
Example1 = {#65} A This example specifies the character with the Unicode codepoint of 65 - the letter 'A'.
Example2 = {&41} A Hexadecimal value for 'A'.
Example3 = {#65 .. #70} ABCDEF This set range defines a set from the letter 'A' (#65) to 'F' (#70).
Example4 = {&41 .. &46} ABCDEF This is the same set range using the hexadecimal values.
Example5 = {#65 .. &46} ABCDEF Both decimal and hexadecimal notation can be mixed. This, however, can be confusing and it is not recommended.
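
To make the notation concrete, the following Python sketch expands a character constant or set range into the characters it denotes. It is an illustration only and not the Builder's implementation: the function names and the regular expression are hypothetical, and treating a reversed range as an error is an assumption, since this page does not say how the Builder handles that case.

```python
import re

# Matches "{#65}", "{&41}", "{#65 .. #70}" and mixed forms such as "{#65 .. &46}".
_PATTERN = re.compile(
    r"^\{\s*([#&])([0-9A-Fa-f]+)\s*(?:\.\.\s*([#&])([0-9A-Fa-f]+)\s*)?\}$"
)


def _value(prefix: str, digits: str) -> int:
    # '#' marks a decimal value, '&' a hexadecimal one.
    return int(digits, 10 if prefix == "#" else 16)


def expand(declaration: str) -> str:
    """Return the characters denoted by a constant or set range."""
    match = _PATTERN.match(declaration.strip())
    if match is None:
        raise ValueError(f"not a character constant or set range: {declaration!r}")
    p1, d1, p2, d2 = match.groups()
    start = _value(p1, d1)
    end = _value(p2, d2) if p2 else start
    if end < start:
        # Assumption: a reversed range is rejected in this sketch.
        raise ValueError("range end precedes range start")
    return "".join(chr(cp) for cp in range(start, end + 1))


if __name__ == "__main__":
    for text in ("{#65}", "{&41}", "{#65 .. #70}", "{&41 .. &46}", "{#65 .. &46}"):
        print(text, "->", expand(text))
```

Running the script prints A for the two single constants and ABCDEF for each of the three ranges, matching the Characters column in the table above.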

Language Sets

Unicode is divided into distinct sections for many of the world's languages. The GOLD Meta-Language contains a number of predefined sets based on these sections of the Unicode standard.

For more information please visit: www.unicode.org.

Set Name From (Decimal) To (Decimal) From (Hexadecimal) To (Hexadecimal)
{Latin Extended} #256 #687 &100 &2AF
{Greek} #880 #1023 &370 &3FF
{Cyrillic} #1024 #1279 &400 &4FF
{Cyrillic Supplementary} #1280 #1327 &500 &52F
{Armenian} #1328 #1423 &530 &58F
{Hebrew} #1424 #1535 &590 &5FF
{Arabic} #1536 #1791 &600 &6FF
{Syriac} #1792 #1871 &700 &74F
{Thaana} #1920 #1983 &780 &7BF
{Devanagari} #2304 #2431 &900 &97F
{Bengali} #2432 #2559 &980 &9FF
{Gurmukhi} #2560 #2687 &A00 &A7F
{Gujarati} #2688 #2815 &A80 &AFF
{Oriya} #2816 #2943 &B00 &B7F
{Tamil} #2944 #3071 &B80 &BFF
{Telugu} #3072 #3199 &C00 &C7F
{Kannada} #3200 #3327 &C80 &CFF
{Malayalam} #3328 #3455 &D00 &D7F
{Sinhala} #3456 #3583 &D80 &DFF
{Thai} #3584 #3711 &E00 &E7F
{Lao} #3712 #3839 &E80 &EFF
{Tibetan} #3840 #4095 &F00 &FFF
{Myanmar} #4096 #4255 &1000 &109F
{Georgian} #4256 #4351 &10A0 &10FF
{Hangul Jamo} #4352 #4607 &1100 &11FF
{Ethiopic} #4608 #4991 &1200 &137F
{Cherokee} #5024 #5119 &13A0 &13FF
{Ogham} #5760 #5791 &1680 &169F
{Runic} #5792 #5887 &16A0 &16FF
{Tagalog} #5888 #5919 &1700 &171F
{Hanunoo} #5920 #5951 &1720 &173F
{Buhid} #5952 #5983 &1740 &175F
{Tagbanwa} #5984 #6015 &1760 &177F
{Khmer} #6016 #6143 &1780 &17FF
{Mongolian} #6144 #6319 &1800 &18AF
{Latin Extended Additional} #7680 #7935 &1E00 &1EFF
{Greek Extended} #7936 #8191 &1F00 &1FFF
{Hiragana} #12352 #12447 &3040 &309F
{Katakana} #12448 #12543 &30A0 &30FF
{Bopomofo} #12544 #12591 &3100 &312F
{Kanbun} #12688 #12703 &3190 &319F
{Bopomofo Extended} #12704 #12735 &31A0 &31BF
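
As a usage illustration, the ranges in this table can drive a simple lookup that reports which language set a character belongs to. The Python sketch below copies only a handful of rows from the table; the list and function names are hypothetical.

```python
from typing import Optional

# A few rows copied from the table above as (set name, first, last).
# The list is deliberately partial; extend it with further rows as needed.
LANGUAGE_SETS = [
    ("Latin Extended", 0x100, 0x2AF),
    ("Greek", 0x370, 0x3FF),
    ("Cyrillic", 0x400, 0x4FF),
    ("Hebrew", 0x590, 0x5FF),
    ("Thai", 0xE00, 0xE7F),
    ("Hiragana", 0x3040, 0x309F),
    ("Katakana", 0x30A0, 0x30FF),
]


def language_set_of(ch: str) -> Optional[str]:
    """Return the name of the listed set containing ch, or None."""
    cp = ord(ch)
    for name, first, last in LANGUAGE_SETS:
        if first <= cp <= last:
            return name
    return None


if __name__ == "__main__":
    for ch in ("\u03bb", "\u042f", "\u3042", "A"):
        print(repr(ch), language_set_of(ch))
```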

Miscellaneous Character Sets

Set Name Description
{All Valid} The {All Valid} character set contains every valid character in the Basic Multilingual Plane of the Unicode Character Set. This includes the characters from &1 to &D7FF and from &DC00 to &FFEF. Please note that this set also includes whitespace and control characters that are rarely used in programming languages. It is only available in versions 2.6 and later of the Builder.
{ANSI Mapped} This set contains the characters between 128 and 159 that have different values in Unicode. The set is only available in versions 2.0.6 and later of the Builder.
{ANSI Printable} This set contains all printable characters available in ANSI. Essentially, this is a union of {Printable}, {Printable Extended} and {ANSI Mapped}. The set is only available in versions 2.0.6 and later of the Builder.
{Control Codes} This set includes the characters from 1 to 31 and from 127 to 159. It is only available in versions 2.6 and later of the Builder.
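
The relationship between these sets can be checked with a short Python sketch. It is an illustration under stated assumptions: the variable names are hypothetical, and each set is built only from the code point ranges quoted on this page.

```python
# {All Valid}: code points 1 to 0xD7FF and 0xDC00 to 0xFFEF.
ALL_VALID = set(range(0x1, 0xD7FF + 1)) | set(range(0xDC00, 0xFFEF + 1))

# {Control Codes}: code points 1 to 31 and 127 to 159.
CONTROL_CODES = set(range(1, 31 + 1)) | set(range(127, 159 + 1))

# {Whitespace}: Space, HT, LF, VT, FF, CR and No-Break Space.
WHITESPACE = {32, 9, 10, 11, 12, 13, 160}

if __name__ == "__main__":
    # Both sets are contained in {All Valid}, as the note on that set says.
    print(CONTROL_CODES <= ALL_VALID)
    print(WHITESPACE <= ALL_VALID)
    # Whitespace characters that are also control codes: HT, LF, VT, FF, CR.
    print(sorted(WHITESPACE & CONTROL_CODES))
```

Using integer code points rather than characters keeps the comparison independent of how any particular character is displayed.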