Valid identifiers in FORTRAN 66, C, Java, C++ current and future

What’s in a name? Identifiers are used in modern programming languages to refer to types, classes, variables and object instances. While the first programming languages were resource-constrained and ASCII-centered, modern languages are more flexible with regard to the forms identifiers can take.

This post compares the lexical conventions for identifiers (length and character set) in FORTRAN 66, C, Java, and current and future C++.

FORTRAN 66

The original FORTRAN 66 identifiers were defined based on digits and letters as follows:

A symbolic name consists of from one to six alphanumeric characters, the first of which must be alphabetic.
A digit is one of the ten characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
A letter is one of the twenty-six characters; A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z.

So we only have the 26 uppercase letters (there is no lowercase, so case never comes into play) and the ten digits to build identifiers of at most 6 characters. No underscores, no $ signs.
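The rule quoted above is small enough to check mechanically. Here is a minimal sketch in Python; the function name is_fortran66_name is made up for this post, and the check covers only the lexical rule, nothing else about FORTRAN 66.

```python
import re

# FORTRAN 66 symbolic name: one to six characters taken from A-Z and 0-9,
# the first of which must be a letter.
FORTRAN66_NAME = re.compile(r"[A-Z][A-Z0-9]{0,5}")


def is_fortran66_name(name: str) -> bool:
    """Return True if `name` fits the FORTRAN 66 symbolic-name rule."""
    return FORTRAN66_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["X", "ALPHA1", "1ALPHA", "TOOLONG", "MY_VAR", "alpha"]:
        print(candidate, is_fortran66_name(candidate))
```

Only the first two candidates pass: the others start with a digit, run to seven characters, contain an underscore, or use lowercase letters.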

C

ANSI C (or ISO C or C90) as defined by ISO/IEC 9899:1990 says:

An identifier is a sequence of nondigit characters (including the underscore _ and the lower-case and upper-case letters) and digits.
The first character shall be a nondigit character.

C is limited to ASCII letters, but it is case sensitive. Underscore OK, $ not OK.

ISO C relaxed the length limitations set 15 years before in the C Reference Manual that came with 6th Edition Unix, where “no more than the first eight characters are significant, and only the first seven for external identifiers”. The practical length of identifiers in ISO C is constrained by the translation limits a conforming compiler has to support: 31 significant characters for an internal identifier.
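To make the C rules concrete, here is a rough Python sketch. The names is_c90_identifier and may_collide are invented for this post, keywords such as int are not excluded, and the 31 is simply the internal-identifier limit quoted above; a real compiler may well treat more characters as significant.

```python
import re

# C90 identifier: letters, digits and underscore; the first character
# must not be a digit. Case matters, "$" is not allowed.
C90_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Number of significant characters for an internal identifier,
# taken from the translation limit mentioned in the text.
INTERNAL_SIGNIFICANT = 31


def is_c90_identifier(name: str) -> bool:
    """Return True if `name` matches the C90 identifier syntax."""
    return C90_IDENTIFIER.fullmatch(name) is not None


def may_collide(a: str, b: str, significant: int = INTERNAL_SIGNIFICANT) -> bool:
    """Return True if a compiler that only looks at the first
    `significant` characters could not tell `a` and `b` apart."""
    return a[:significant] == b[:significant]


if __name__ == "__main__":
    print(is_c90_identifier("_tmp1"), is_c90_identifier("cost$"))
    prefix = "a" * 31
    print(may_collide(prefix + "x", prefix + "y"))
    # With the 6th Edition Unix limit of eight characters:
    print(may_collide("counter_in", "counter_out", significant=8))
```

The last call shows why the old eight-character limit hurt: counter_in and counter_out share their first eight characters.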

C++ current standard (2003)

The current C++ standard as implemented in currently available compilers has the same character set limitations as C:

identifier:
    nondigit
    identifier nondigit
    identifier digit

nondigit: one of _ a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
digit: one of 0 1 2 3 4 5 6 7 8 9

The limit for the maximum number of characters in an internal identifier, macro name or external identifier is raised to a grandiose 1024.

Java

In the Java Language Specification, Third Edition, an identifier is defined as an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.

The “Java digits” include the ASCII digits 0-9.

A “Java letter” is defined with reference to the 30 Unicode General Categories which also match the “Java Constant Field” values, according to this table:

Abbr  Long                   Description                                         Java Constant Field Value
Cc    Control                a C0 or C1 control code                             CONTROL
Cf    Format                 a format control character                          FORMAT
Cn    Unassigned             a reserved unassigned code point or a noncharacter  UNASSIGNED
Co    Private_Use            a private-use character                             PRIVATE_USE
Cs    Surrogate              a surrogate code point                              SURROGATE
Ll    Lowercase_Letter       a lowercase letter                                  LOWERCASE_LETTER
Lm    Modifier_Letter        a modifier letter                                   MODIFIER_LETTER
Lo    Other_Letter           other letters, including syllables and ideographs   OTHER_LETTER
Lt    Titlecase_Letter       a digraphic character, with first part uppercase    TITLECASE_LETTER
Lu    Uppercase_Letter       an uppercase letter                                 UPPERCASE_LETTER
Mc    Spacing_Mark           a spacing combining mark (positive advance width)   COMBINING_SPACING_MARK
Me    Enclosing_Mark         an enclosing combining mark                         ENCLOSING_MARK
Mn    Nonspacing_Mark        a nonspacing combining mark (zero advance width)    NON_SPACING_MARK
Nd    Decimal_Number         a decimal digit                                     DECIMAL_DIGIT_NUMBER
Nl    Letter_Number          a letterlike numeric character                      LETTER_NUMBER
No    Other_Number           a numeric character of other type                   OTHER_NUMBER
Pc    Connector_Punctuation  a connecting punctuation mark, like a tie           CONNECTOR_PUNCTUATION
Pd    Dash_Punctuation       a dash or hyphen punctuation mark                   DASH_PUNCTUATION
Pe    Close_Punctuation      a closing punctuation mark (of a pair)              END_PUNCTUATION
Pf    Final_Punctuation      a final quotation mark                              FINAL_QUOTE_PUNCTUATION
Pi    Initial_Punctuation    an initial quotation mark                           INITIAL_QUOTE_PUNCTUATION
Po    Other_Punctuation      a punctuation mark of other type                    OTHER_PUNCTUATION
Ps    Open_Punctuation       an opening punctuation mark (of a pair)             START_PUNCTUATION
Sc    Currency_Symbol        a currency sign                                     CURRENCY_SYMBOL
Sk    Modifier_Symbol        a non-letterlike modifier symbol                    MODIFIER_SYMBOL
Sm    Math_Symbol            a symbol of primarily mathematical use              MATH_SYMBOL
So    Other_Symbol           a symbol of other type                              OTHER_SYMBOL
Zl    Line_Separator         U+2028 LINE SEPARATOR only                          LINE_SEPARATOR
Zp    Paragraph_Separator    U+2029 PARAGRAPH SEPARATOR only                     PARAGRAPH_SEPARATOR
Zs    Space_Separator        a space character (of various non-zero widths)      SPACE_SEPARATOR

With the help of this table, we understand that a “Java letter” can be a currency symbol (such as “$”), a connecting punctuation character (such as “_”), or belong to one of the Unicode General Categories Lu, Ll, Lt, Lm or Lo.

It is not clear to me whether “currency symbol” and “connecting punctuation character” mean that the entire CURRENCY_SYMBOL (Sc), CONNECTOR_PUNCTUATION (Pc), DASH_PUNCTUATION (Pd), END_PUNCTUATION (Pe), FINAL_QUOTE_PUNCTUATION (Pf), INITIAL_QUOTE_PUNCTUATION (Pi), OTHER_PUNCTUATION (Po) and START_PUNCTUATION (Ps) Unicode General Categories are included. Maybe somebody with Java skills can fill this void.
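One way to explore this question without a Java toolchain is to look characters up by General Category. The sketch below is in Python and rests on an assumption: that a “Java letter” is any character for which Character.isJavaIdentifierStart returns true, and that the Java API description of that method (letters, LETTER_NUMBER, currency symbols, connecting punctuation) maps to the whole categories Lu, Ll, Lt, Lm, Lo, Nl, Sc and Pc. The name looks_like_java_letter is made up, and Python’s Unicode tables may be a different version from those of a given Java runtime.

```python
import unicodedata

# Assumption: these General Categories make up the "Java letters",
# following the API description of Character.isJavaIdentifierStart.
JAVA_LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Sc", "Pc"}


def looks_like_java_letter(ch: str) -> bool:
    """Approximate the "Java letter" test by General Category lookup."""
    return unicodedata.category(ch) in JAVA_LETTER_CATEGORIES


if __name__ == "__main__":
    # "$", "_", EURO SIGN, UNDERTIE, then dash, bracket, "!" and a digit.
    for ch in ["A", "$", "_", "\u20ac", "\u203f", "-", "(", "!", "7"]:
        category = unicodedata.category(ch)
        print(f"U+{ord(ch):04X} {category} {looks_like_java_letter(ch)}")
```

If that reading of the API is right, the whole Sc and Pc categories qualify (so the euro sign and the undertie would be accepted), while the dash, bracket, quote and other punctuation categories (Pd, Pe, Pf, Pi, Po, Ps) do not.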

The Java programming language gives programmers great liberty in naming identifiers: letters from most of the world’s scripts are allowed (so identifiers can basically be written in the programmer’s native language), and underscore and dollar sign ($) are both OK. An undesirable side effect is that two identifiers differ if they differ in their Unicode code points, even if the glyphs (what you see on the screen) are the same. For example, A and Α are different identifiers in Java because they are respectively LATIN CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA, and a is different from а because they are respectively LATIN SMALL LETTER A and CYRILLIC SMALL LETTER A.
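The look-alike problem is easy to see once the code points are spelled out. This small Python sketch (the helper describe is just an illustrative name) prints the code point and official Unicode name of each character in the two pairs above, using escapes so that the source itself is unambiguous.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Latin letter first, look-alike second.
LOOKALIKES = [("A", "\u0391"), ("a", "\u0430")]

if __name__ == "__main__":
    for latin, other in LOOKALIKES:
        print(describe(latin), "|", describe(other), "| equal:", latin == other)
```

Both comparisons come out as not equal, which is exactly how a compiler that compares identifiers code point by code point sees them.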

C++ upcoming standard (C++0x)

C++0x, the planned new standard for the C++ programming language due to come out in 2011 or 2012, is more elastic than current C++ in its definition of an identifier:

An identifier is an arbitrarily long sequence of letters and digits, starting with a letter.
Upper- and lower-case letters are different. All characters are significant.
A “letter” is the usual a-z, A-Z and _ or a “universal-character-name” or “other implementation-defined characters”.

A “universal-character-name” is an escape of the form \uXXXX or \UXXXXXXXX; the characters it may designate inside an identifier are defined with reference to Annex A (Recommended extended repertoire for user-defined identifiers) of ISO/IEC TR 10176:2003, Fourth edition: Guidelines for the preparation of programming language standards.

According to TR 10176, Annex A, the permitted characters are those which “collectively can be used to generate word-like identifiers for most natural languages of the world”, including “letters (combining or not), syllables, and ideographs together with the modifier letters and marks conventionally used as parts of words”. The acceptable Unicode code points are:

Latin: 0041-005A, 0061-007A, 00AA, 00BA, 00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217, 0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6, 03DA, 03DC, 03DE, 03E0, 03E2-03F3, 1F00-1F15, 1F18-1F1D, 1F20-1F45, 1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D, 1F5F-1F7D, 1F80-1FB4, 1FB6-1FBC, 1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 1FE0-1FEC, 1FF2-1FF4, 1FF6-1FFC
Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4, 04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
Armenian: 0531-0556, 0561-0587
Hebrew: 05D0-05EA, 05F0-05F2
Hebrew (C): 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2
Arabic: 0621-063A, 0640-064A, 0671-06B7, 06BA-06BE, 06C0-06CE, 06D0-06D3, 06D5, 06E5-06E6
Arabic (C): 064B-0652, 0670, 06D6-06DC, 06E7-06E8, 06EA-06ED
Devanagari: 0905-0939, 0950, 0958-0961
Devanagari (C): 0901-0903, 093E-094D, 0951-0952, 0962-0963
Bengali: 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0, 09B2, 09B6-09B9, 09DC-09DD, 09DF-09E1, 09F0-09F1
Bengali (C): 0981-0983, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09E2-09E3
Gurmukhi: 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 0A32-0A33, 0A35-0A36, 0A38-0A39, 0A59-0A5C, 0A5E, 0A74
Gurmukhi (C): 0A02, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D
Gujarati: 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 0AAA-0AB0, 0AB2-0AB3, 0AB5-0AB9, 0ABD, 0AD0, 0AE0
Gujarati (C): 0A81-0A83, 0ABE-0AC5, 0AC7-0AC9, 0ACB-0ACD
Oriya: 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30, 0B32-0B33, 0B36-0B39, 0B5C-0B5D, 0B5F-0B61
Oriya (C): 0B01-0B03, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D
Tamil: 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A, 0B9C, 0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9
Tamil (C): 0B82-0B83, 0BBE-0BC2, 0BC6-0BC8, 0BCA-0BCD
Telugu: 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33, 0C35-0C39, 0C60-0C61
Telugu (C): 0C01-0C03, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D
Kannada: 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3, 0CB5-0CB9, 0CDE, 0CE0-0CE1
Kannada (C): 0C82-0C83, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD
Malayalam: 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39, 0D60-0D61
Malayalam (C): 0D02-0D03, 0D3E-0D43, 0D46-0D48, 0D4A-0D4D
Thai: 0E01-0E30, 0E32-0E33, 0E40-0E46, 0E50-0E59
Thai (C): 0E31, 0E34-0E3A, 0E47-0E4E
Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97, 0E99-0E9F, 0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE, 0EB0, 0EB2-0EB3, 0EBD, 0EC0-0EC4, 0EC6, 0EDC-0EDD
Lao (C): 0EB1, 0EB4-0EB9, 0EBB-0EBC, 0EC8-0ECD
Tibetan: 0F00, 0F40-0F47, 0F49-0F69, 0F88-0F8B
Tibetan (C): 0F18-0F19, 0F35, 0F37, 0F39, 0F71-0F84, 0F86-0F87, 0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9
Georgian: 10A0-10C5, 10D0-10F6
Hiragana: 3041-3093
Katakana: 30A1-30F6, 30FB-30FC
Bopomofo: 3105-312C
Hangul: AC00-D7A3
CJK Unified Ideographs: 4E00-9FA5
Digits: 0030-0039, 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF, 0A66-0A6F, 0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF, 0D66-0D6F, 0E50-0E59, 0ED0-0ED9, 0F20-0F29
Special characters: 00B5, 02B0-02B8, 02BB, 02BD-02C1, 02D0-02D1, 02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 2102, 2107, 210A-2113, 2115, 2118-211D, 2124, 2126, 2128, 212A-2131, 2133-2138, 2160-2182, 3005-3007, 3021-3029

The upcoming version of C++ will be subject to the same confusing same-glyph, different-code-point syndrome as Java: A != Α and a != а.

The good news is that since the “good” code points are listed, it is easier for implementations to check if a character is acceptable or not, whereas for Java it is required to have access to the Unicode tables to know if a character belongs to a certain General Category.
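To illustrate how cheap such a check can be, here is a sketch in Python that stores a handful of the ranges listed above in a sorted table and finds the candidate range by binary search. It is only an excerpt (a real implementation would carry the full list, and would handle “_” and digits separately), and the names ALLOWED_RANGES and in_allowed_ranges are invented for this post.

```python
from bisect import bisect_right

# Excerpt of the ranges listed above (inclusive), sorted by start.
ALLOWED_RANGES = [
    (0x0041, 0x005A),  # Latin A-Z
    (0x0061, 0x007A),  # Latin a-z
    (0x00C0, 0x00D6),  # Latin
    (0x0388, 0x038A),  # Greek
    (0x038E, 0x03A1),  # Greek, includes U+0391 ALPHA
    (0x040E, 0x044F),  # Cyrillic, includes U+0430
    (0x3041, 0x3093),  # Hiragana
    (0x4E00, 0x9FA5),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3),  # Hangul
]
_STARTS = [start for start, _ in ALLOWED_RANGES]


def in_allowed_ranges(ch: str) -> bool:
    """Return True if the code point of `ch` falls in one of the ranges."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp, or -1 if there is none.
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= ALLOWED_RANGES[i][1]


if __name__ == "__main__":
    for ch in ["A", "\u0391", "\u0430", "\u3042", "$", "\u00d7"]:
        print(f"U+{ord(ch):04X}", in_allowed_ranges(ch))
```

No Unicode database is needed, only the table. Note that both members of each look-alike pair pass the check, so the table does nothing to prevent the confusion described above.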