
Valid identifiers in FORTRAN 66, C, Java, C++ current and future

What’s in a name? Identifiers are used in programming languages to refer to types, classes, variables and object instances. While the first programming languages were resource-constrained and ASCII-centered, modern languages are more flexible with regard to the forms identifiers can take.

This post compares the lexical conventions for identifiers (length and character set) in FORTRAN 66, C, Java, and current and future C++.

FORTRAN 66

The original FORTRAN 66 identifiers were defined based on digits and letters as follows:

A symbolic name consists of from one to six alphanumeric characters, the first of which must be alphabetic.
A digit is one of the ten characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
A letter is one of the twenty-six characters; A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z.

So we only have the 26 upper-case letters and the 10 digits to choose from (there is no lower case, so names are effectively case insensitive) to build identifiers of at most six characters. No underscores, no $ signs.
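The rule quoted above is small enough to be sketched as a regular expression. This is only an illustration in Java: the class name Fortran66Names and the method isSymbolicName are made up for this post, and the check covers nothing beyond the quoted definition (length, letters, digits).

```java
import java.util.regex.Pattern;

final class Fortran66Names {
    // One letter followed by up to five letters or digits: six characters at most.
    private static final Pattern SYMBOLIC_NAME = Pattern.compile("[A-Z][A-Z0-9]{0,5}");

    // True if s satisfies the quoted definition of a symbolic name.
    static boolean isSymbolicName(String s) {
        return SYMBOLIC_NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"X", "ALPHA1", "COUNTER", "2PI", "MY_VAR", "alpha"};
        for (String s : samples) {
            System.out.println(s + " -> " + isSymbolicName(s));
        }
    }
}
```

Only X and ALPHA1 pass: COUNTER is too long, 2PI starts with a digit, MY_VAR contains an underscore, and alpha uses letters outside the 26 listed ones.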

C

ANSI C (or ISO C or C90) as defined by ISO/IEC 9899:1990 says:

An identifier is a sequence of nondigit characters (including the underscore _ and the lower-case and upper-case letters) and digits.
The first character shall be a nondigit character.

C is limited to ASCII letters, but it is case sensitive. Underscore OK, $ not OK.

ISO C lifted the length limitations set 15 years earlier in the C Reference Manual that came with 6th Edition Unix, where “no more than the first eight characters are significant, and only the first seven for external identifiers”. The practical length of identifiers in ISO C is constrained by the translation limits that a conforming implementation must support: 31 significant characters for an internal identifier.
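To make the "significant characters" idea concrete, here is a rough sketch, written in Java for consistency with the other examples in this post. The names C90Identifiers, isIdentifier and mayCollide are hypothetical, keywords are not excluded, and the sketch assumes an implementation that honours exactly the 31-character minimum and no more.

```java
final class C90Identifiers {
    // Number of significant characters quoted above for internal identifiers.
    static final int SIGNIFICANT = 31;

    static boolean isNondigit(char c) {
        return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // A nondigit followed by any number of nondigits and digits.
    static boolean isIdentifier(String s) {
        if (s.isEmpty() || !isNondigit(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isNondigit(c) && !isDigit(c)) {
                return false;
            }
        }
        return true;
    }

    // The prefix that such an implementation is guaranteed to look at.
    static String significantPart(String s) {
        return s.length() <= SIGNIFICANT ? s : s.substring(0, SIGNIFICANT);
    }

    // Names that agree on the significant prefix may be treated as one name.
    static boolean mayCollide(String a, String b) {
        return significantPart(a).equals(significantPart(b));
    }
}
```

With this sketch, two 32-character names that differ only in their last character are reported as a possible collision, while count and Count are not, because case is significant.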

C++ current standard (2003)

The current C++ standard as implemented in currently available compilers has the same character set limitations as C:

identifier:
    nondigit
    identifier nondigit
    identifier digit

nondigit: one of _ a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
digit: one of 0 1 2 3 4 5 6 7 8 9

The limit for the maximum number of characters in an internal identifier, macro name or in an external identifier is increased to a grandiose 1024.

Java

In the Java Language Specification, Third Edition an identifier is defined as an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.

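The “Java digits” include the ASCII digits 0-9; the specification defers the complete set to the method Character.isJavaIdentifierPart, which also accepts other Unicode decimal digits.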

A “Java letter” is defined with reference to the 30 Unicode General Categories which also match the “Java Constant Field” values, according to this table:

Abbr	Long	Description	Java Constant Field Value
Cc	Control	a C0 or C1 control code	CONTROL
Cf	Format	a format control character	FORMAT
Cn	Unassigned	a reserved unassigned code point or a noncharacter	UNASSIGNED
Co	Private_Use	a private-use character	PRIVATE_USE
Cs	Surrogate	a surrogate code point	SURROGATE
Ll	Lowercase_Letter	a lowercase letter	LOWERCASE_LETTER
Lm	Modifier_Letter	a modifier letter	MODIFIER_LETTER
Lo	Other_Letter	other letters, including syllables and ideographs	OTHER_LETTER
Lt	Titlecase_Letter	a digraphic character, with first part uppercase	TITLECASE_LETTER
Lu	Uppercase_Letter	an uppercase letter	UPPERCASE_LETTER
Mc	Spacing_Mark	a spacing combining mark (positive advance width)	COMBINING_SPACING_MARK
Me	Enclosing_Mark	an enclosing combining mark	ENCLOSING_MARK
Mn	Nonspacing_Mark	a nonspacing combining mark (zero advance width)	NON_SPACING_MARK
Nd	Decimal_Number	a decimal digit	DECIMAL_DIGIT_NUMBER
Nl	Letter_Number	a letterlike numeric character	LETTER_NUMBER
No	Other_Number	a numeric character of other type	OTHER_NUMBER
Pc	Connector_Punctuation	a connecting punctuation mark, like a tie	CONNECTOR_PUNCTUATION
Pd	Dash_Punctuation	a dash or hyphen punctuation mark	DASH_PUNCTUATION
Pe	Close_Punctuation	a closing punctuation mark (of a pair)	END_PUNCTUATION
Pf	Final_Punctuation	a final quotation mark	FINAL_QUOTE_PUNCTUATION
Pi	Initial_Punctuation	an initial quotation mark	INITIAL_QUOTE_PUNCTUATION
Po	Other_Punctuation	a punctuation mark of other type	OTHER_PUNCTUATION
Ps	Open_Punctuation	an opening punctuation mark (of a pair)	START_PUNCTUATION
Sc	Currency_Symbol	a currency sign	CURRENCY_SYMBOL
Sk	Modifier_Symbol	a non-letterlike modifier symbol	MODIFIER_SYMBOL
Sm	Math_Symbol	a symbol of primarily mathematical use	MATH_SYMBOL
So	Other_Symbol	a symbol of other type	OTHER_SYMBOL
Zl	Line_Separator	U+2028 LINE SEPARATOR only	LINE_SEPARATOR
Zp	Paragraph_Separator	U+2029 PARAGRAPH SEPARATOR only	PARAGRAPH_SEPARATOR
Zs	Space_Separator	a space character (of various non-zero widths)	SPACE_SEPARATOR

With the help of this table, we understand that a “Java letter” can be a currency symbol (such as “$”), a connecting punctuation character (such as “_”), or belong to one of the Unicode General Categories Lu, Ll, Lt, Lm or Lo.

It is not clear to me whether “currency symbol” and “connecting punctuation character” mean the entire CURRENCY_SYMBOL (Sc) and CONNECTOR_PUNCTUATION (Pc) Unicode General Categories, and whether the other punctuation categories, DASH_PUNCTUATION (Pd), END_PUNCTUATION (Pe), FINAL_QUOTE_PUNCTUATION (Pf), INITIAL_QUOTE_PUNCTUATION (Pi), OTHER_PUNCTUATION (Po) and START_PUNCTUATION (Ps), are included as well. Maybe somebody with Java skills can fill this void.
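One way to probe the question empirically: the specification defines a “Java letter” as a character for which Character.isJavaIdentifierStart returns true, so a few sample code points can simply be tested. The class name JavaLetterProbe and the choice of samples are mine, and the exact results depend on the Unicode version shipped with the JDK in use.

```java
final class JavaLetterProbe {
    // Reports the general category (as the Character constant value) and
    // whether the code point may start or continue a Java identifier.
    static String describe(int codePoint) {
        return String.format("U+%04X type=%d start=%b part=%b",
                codePoint,
                Character.getType(codePoint),
                Character.isJavaIdentifierStart(codePoint),
                Character.isJavaIdentifierPart(codePoint));
    }

    public static void main(String[] args) {
        int[] samples = {
            '$',     // DOLLAR SIGN, category Sc
            0x20AC,  // EURO SIGN, category Sc
            '_',     // LOW LINE, category Pc
            0x203F,  // UNDERTIE, category Pc
            '-',     // HYPHEN-MINUS, category Pd
            '(',     // LEFT PARENTHESIS, category Ps
            '!',     // EXCLAMATION MARK, category Po
            0x0661   // ARABIC-INDIC DIGIT ONE, category Nd
        };
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

The euro sign and the undertie are accepted as identifier starts just like $ and _, while the hyphen, the parenthesis and the exclamation mark are rejected. This suggests that the whole Sc and Pc categories are meant and that the other punctuation categories are not. The Arabic-Indic digit is accepted as an identifier part but not as a start.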

The Java programming language allows programmers to name identifiers with great liberty, including most Unicode code points (basically in their native languages), with underscore and dollar sign ($) both OK. An undesirable side effect is that two identifiers differ if they differ in their Unicode code points, even if the glyphs (what you see on the screen) are the same. For example A and Α are different identifiers in Java because they are respectively LATIN CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA, and a is different from а because they are respectively LATIN SMALL LETTER A and CYRILLIC SMALL LETTER A.
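The look-alike problem can be reproduced in a few lines. In this sketch (the class name Homoglyphs is made up), the Greek letter is spelled as a Unicode escape so that the example does not depend on the encoding of the source file; Character.getName requires Java 7 or later.

```java
final class Homoglyphs {
    // Official Unicode name of the first code point of s.
    static String name(String s) {
        return Character.getName(s.codePointAt(0));
    }

    // Declares two distinct local variables whose names render alike.
    static int sum() {
        int A = 1;       // LATIN CAPITAL LETTER A
        int \u0391 = 2;  // GREEK CAPITAL LETTER ALPHA, spelled as a Unicode escape
        return A + \u0391;
    }

    public static void main(String[] args) {
        System.out.println(name("A"));
        System.out.println(name("\u0391"));
        System.out.println(name("a"));
        System.out.println(name("\u0430"));
        System.out.println("A".equals("\u0391"));
        System.out.println(sum());
    }
}
```

The compiler accepts both declarations in the same scope because the two names are different code points, and the method returns 3.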

C++ upcoming standard (C++0x)

C++0x, the planned new standard for the C++ programming language due to come out in 2011 or 2012, is more elastic than current C++ in its definition of an identifier:

An identifier is an arbitrarily long sequence of letters and digits, starting with a letter.
Upper- and lower-case letters are different. All characters are significant.
A “letter” is the usual a-z, A-Z and _ or a “universal-character-name” or “other implementation-defined characters”.

The “universal-character-names” allowed in identifiers are defined with reference to Annex A (Recommended extended repertoire for user-defined identifiers) of ISO/IEC TR 10176:2003, Fourth edition: Guidelines for the preparation of programming language standards.

A “universal-character-name” according to TR 10176, Annex A can be any character which “collectively can be used to generate word-like identifiers for most natural languages of the world”, including “letters (combining or not), syllables, and ideographs together with the modifier letters and marks conventionally used as parts of words”. The acceptable Unicode code points are:

Latin: 0041-005A, 0061-007A, 00AA, 00BA, 00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217, 0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6, 03DA, 03DC, 03DE, 03E0, 03E2-03F3, 1F00-1F15, 1F18-1F1D, 1F20-1F45, 1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D, 1F5F-1F7D, 1F80-1FB4, 1FB6-1FBC, 1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 1FE0-1FEC, 1FF2-1FF4, 1FF6-1FFC
Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4, 04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
Armenian: 0531-0556, 0561-0587
Hebrew: 05D0-05EA, 05F0-05F2
Hebrew (C): 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2
Arabic: 0621-063A, 0640-064A, 0671-06B7, 06BA-06BE, 06C0-06CE, 06D0-06D3, 06D5, 06E5-06E6
Arabic (C): 064B-0652, 0670, 06D6-06DC, 06E7-06E8, 06EA-06ED
Devanagari: 0905-0939, 0950, 0958-0961
Devanagari (C): 0901-0903, 093E-094D, 0951-0952, 0962-0963
Bengali: 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0, 09B2, 09B6-09B9, 09DC-09DD, 09DF-09E1, 09F0-09F1
Bengali (C): 0981-0983, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09E2-09E3
Gurmukhi: 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 0A32-0A33, 0A35-0A36, 0A38-0A39, 0A59-0A5C, 0A5E, 0A74
Gurmukhi (C): 0A02, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D
Gujarati: 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 0AAA-0AB0, 0AB2-0AB3, 0AB5-0AB9, 0ABD, 0AD0, 0AE0
Gujarati (C): 0A81-0A83, 0ABE-0AC5, 0AC7-0AC9, 0ACB-0ACD
Oriya: 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30, 0B32-0B33, 0B36-0B39, 0B5C-0B5D, 0B5F-0B61
Oriya (C): 0B01-0B03, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D
Tamil: 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A, 0B9C, 0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9
Tamil (C): 0B82-0B83, 0BBE-0BC2, 0BC6-0BC8, 0BCA-0BCD
Telugu: 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33, 0C35-0C39, 0C60-0C61
Telugu (C): 0C01-0C03, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D
Kannada: 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3, 0CB5-0CB9, 0CDE, 0CE0-0CE1
Kannada (C): 0C82-0C83, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD
Malayalam: 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39, 0D60-0D61
Malayalam (C): 0D02-0D03, 0D3E-0D43, 0D46-0D48, 0D4A-0D4D
Thai: 0E01-0E30, 0E32-0E33, 0E40-0E46, 0E50-0E59
Thai (C): 0E31, 0E34-0E3A, 0E47-0E4E
Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97, 0E99-0E9F, 0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE, 0EB0, 0EB2-0EB3, 0EBD, 0EC0-0EC4, 0EC6, 0EDC-0EDD
Lao (C): 0EB1, 0EB4-0EB9, 0EBB-0EBC, 0EC8-0ECD
Tibetan: 0F00, 0F40-0F47, 0F49-0F69, 0F88-0F8B
Tibetan (C): 0F18-0F19, 0F35, 0F37, 0F39, 0F71-0F84, 0F86-0F87, 0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9
Georgian: 10A0-10C5, 10D0-10F6
Hiragana: 3041-3093
Katakana: 30A1-30F6, 30FB-30FC
Bopomofo: 3105-312C
Hangul: AC00-D7A3
CJK Unified Ideographs: 4E00-9FA5
Digits: 0030-0039, 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF, 0A66-0A6F, 0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF, 0D66-0D6F, 0E50-0E59, 0ED0-0ED9, 0F20-0F29
Special characters: 00B5, 02B0-02B8, 02BB, 02BD-02C1, 02D0-02D1, 02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 2102, 2107, 210A-2113, 2115, 2118-211D, 2124, 2126, 2128, 212A-2131, 2133-2138, 2160-2182, 3005-3007, 3021-3029

The upcoming version of C++ will be subject to the same confusing same-glyph, different-code-point syndrome as Java: A != Α and a != а.

The good news is that since the “good” code points are listed, it is easier for implementations to check whether a character is acceptable, whereas a Java implementation needs access to the Unicode tables to know whether a character belongs to a certain General Category.
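A check against an explicit list can be sketched as a binary search over a sorted table of ranges. The sketch is again in Java for consistency with the earlier examples; the class name RangeTable is hypothetical, and the table holds only a small subset of the ranges listed above (some Latin, Greek, Cyrillic, Hiragana, CJK and Hangul entries), so it illustrates the technique rather than the full Annex A repertoire.

```java
final class RangeTable {
    // {first, last} pairs copied from the list above, sorted by first code point.
    private static final int[][] RANGES = {
        {0x0041, 0x005A}, {0x0061, 0x007A}, {0x00AA, 0x00AA}, {0x00BA, 0x00BA},
        {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x01F5},
        {0x0386, 0x0386}, {0x0388, 0x038A}, {0x038C, 0x038C}, {0x038E, 0x03A1},
        {0x03A3, 0x03CE},
        {0x0401, 0x040C}, {0x040E, 0x044F},
        {0x3041, 0x3093},
        {0x4E00, 0x9FA5},
        {0xAC00, 0xD7A3}
    };

    // Binary search: true if codePoint falls inside one of the ranges.
    static boolean isAllowed(int codePoint) {
        int lo = 0;
        int hi = RANGES.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < RANGES[mid][0]) {
                hi = mid - 1;
            } else if (codePoint > RANGES[mid][1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] samples = {'A', 0x0391, 0x0430, 0x4E2D, 0x00D7, 0x20AC};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %b%n", cp, isAllowed(cp));
        }
    }
}
```

No Unicode database is needed: the table is the whole specification. The underscore is not in the table because the grammar treats it as a basic letter, so a real lexer would handle it separately.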