Using and Porting GNU Fortran


21.2.11.1 Multi-character Lexemes

Each lexeme carries with it a pointer to where it appears in the source.

To provide the ability for diagnostics to point to column numbers, in addition to line numbers and names, lexemes that represent more than one (significant) character in the source code need, generally, to provide pointers to where each character appears in the source.

This provides the ability to properly identify the precise location of the problem in code like

 
SUBROUTINE X
END
BLOCK DATA X
END

which, in fixed-form source, would result in single lexemes consisting of the strings `SUBROUTINEX' and `BLOCKDATAX'. (The problem is that `X' is defined twice, so a pointer to the `X' in the second definition, along with a follow-up pointer to the corresponding `X' in the first, would be preferable to pointers to the beginnings of the statements.)

This need also arises when parsing (and diagnosing) FORMAT statements.

Further, it arises when diagnosing FMT= specifiers that contain constants (or partial constants, or even propagated constants!) in I/O statements, as in:

 
PRINT '(I2, 3HAB)', J

(A pointer to the beginning of the prematurely terminated Hollerith constant, and/or to the close parenthesis, is preferable to a pointer to the open parenthesis or the apostrophe that precedes it.)

Multi-character lexemes, which would seem to naturally include at least digit strings, alphanumeric strings, CHARACTER constants, and Hollerith constants, therefore need to provide location information on each character. (Maybe Hollerith constants don't, but there is no need to make an exception for them.)
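As a rough illustration of per-character location information, the following C sketch stores one source position for each significant character of a lexeme. The names `src_pos', `lexeme', `lexeme_append', `lexeme_from_statement', and the fixed `LEXEME_MAX' limit are all hypothetical, chosen for this example only; they are not taken from the g77 sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical source position; not g77's actual type. */
struct src_pos {
    int line;    /* 1-based line number */
    int column;  /* 1-based column number */
};

#define LEXEME_MAX 64  /* arbitrary limit, assumed for this sketch */

/* A multi-character lexeme with one position per significant character. */
struct lexeme {
    char text[LEXEME_MAX + 1];
    struct src_pos where[LEXEME_MAX];
    size_t length;
};

/* Append one significant character and its position.
   Returns 0 on success, -1 if the lexeme is full. */
int lexeme_append(struct lexeme *lx, char c, int line, int column)
{
    assert(lx != NULL);
    if (lx->length >= LEXEME_MAX)
        return -1;
    lx->text[lx->length] = c;
    lx->where[lx->length].line = line;
    lx->where[lx->length].column = column;
    lx->length++;
    lx->text[lx->length] = '\0';
    return 0;
}

/* Build a lexeme from one statement, skipping blanks, which are not
   significant in fixed-form source.  first_column is the column of
   stmt[0].  Returns 0 on success, -1 if the statement is too long. */
int lexeme_from_statement(struct lexeme *lx, const char *stmt,
                          int line, int first_column)
{
    assert(lx != NULL && stmt != NULL);
    lx->length = 0;
    lx->text[0] = '\0';
    for (size_t i = 0; stmt[i] != '\0'; i++) {
        if (stmt[i] == ' ')
            continue;
        if (lexeme_append(lx, stmt[i], line, first_column + (int)i) != 0)
            return -1;
    }
    return 0;
}
```

With a structure along these lines, a diagnostic about the doubly defined `X' could point at the position recorded for the last character of `BLOCKDATAX' instead of at the start of the statement.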

The question then arises, what about other multi-character lexemes, such as `**' and `//', and Fortran 90's `(/', `/)', `::', and so on?

It turns out that there is a need to identify the location of the second character of these two-character lexemes. For example, in `I(/J) = K', the slash needs to be diagnosed as the problem, not the open parenthesis. Similarly, it is preferable to diagnose the second slash in `I = J // K' rather than the first, since under the implicit typing rules the compiler would disallow the attempted concatenation of two integers. (Though, since that is more of a semantic issue, the preference is slight.)

Even sequences that could be parsed as digit strings could use location info, for example, to diagnose the `9' in the octal constant `O'129''. (This probably will be parsed as a character string, to be consistent with the parsing of `Z'129A''.)
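A minimal C sketch of how such a diagnostic might find its target, assuming the characters of the constant and their columns are available as parallel arrays. The function names `first_bad_octal_digit' and `bad_octal_column' are hypothetical and serve only to illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first character in digits[0..n) that is not
   an octal digit, or -1 if every character is valid. */
int first_bad_octal_digit(const char *digits, size_t n)
{
    assert(digits != NULL);
    for (size_t i = 0; i < n; i++) {
        if (digits[i] < '0' || digits[i] > '7')
            return (int)i;
    }
    return -1;
}

/* Map the offending digit to its source column using a per-character
   column table.  Returns 0 when all digits are valid (columns are
   assumed to be 1-based, so 0 never names a real column). */
int bad_octal_column(const char *digits, const int *columns, size_t n)
{
    assert(columns != NULL);
    int i = first_bad_octal_digit(digits, n);
    return i < 0 ? 0 : columns[i];
}
```

For the digits `129' recorded at columns 12, 13, and 14, `bad_octal_column' yields 14, the column of the `9'.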

To avoid the hassle of recording the location of the second character, while also preserving the general rule that each significant character is distinctly pointed to by the lexeme that contains it, it's best to simply not have any fixed-size lexemes larger than one character.

This new design is expected to make checking for two `*' lexemes in a row much easier than in the old design, so this is not much of a sacrifice. The lexer probably becomes easier to implement by more than the parser becomes harder.
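The check for two `*' lexemes in a row might look something like the following C sketch, in which every punctuation character is its own lexeme carrying its own position. The type `char_lexeme' and the function `is_power_operator' are hypothetical names for this illustration, not part of g77.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-character lexeme with its own source position. */
struct char_lexeme {
    char ch;
    int line;
    int column;
};

/* Return 1 if the lexemes at index i and i + 1 are both '*', that is,
   if together they form the `**' operator; otherwise return 0. */
int is_power_operator(const struct char_lexeme *lx, size_t count, size_t i)
{
    assert(lx != NULL);
    return i + 1 < count && lx[i].ch == '*' && lx[i + 1].ch == '*';
}
```

Because the second `*' is a lexeme in its own right, its position is available to diagnostics without any special bookkeeping in the lexer.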



  Copyright 2003   by The Free Software Foundation     Updated Jun 2003