<!--
  Copyright 2023, Gerwin Klein, Régis Décamps, Steve Rowe
  SPDX-License-Identifier: CC-BY-SA-4.0
-->

A simple Example: How to work with JFlex   {#Example}
========================================

To demonstrate what a lexical specification with JFlex looks like, this
section presents part of the specification for the Java language.

The example does not describe the whole lexical structure of Java programs,
but only a small and simplified part of it:
- some keywords,
- some operators,
- comments,
- and only two kinds of literals.

It also shows how to interface with the LALR parser generator CUP [@CUP] 
and therefore uses a class `sym` (generated by CUP), where integer constants
for the terminal tokens of the CUP grammar are declared. 

You can find this example in `examples/cup-java-minijava`.

The `examples/cup-java` directory also contains a *complete* JFlex specification of the
lexical structure of Java programs together with the CUP parser specification
for Java by C. Scott Ananian, obtained from the CUP [@CUP] web site (modified to
interface with the JFlex scanner). Both specifications adhere to the Java
Language Specification [@LangSpec].

In `examples/standalone`, you can find a small standalone scanner that
doesn’t need other dependencies or tools like CUP to give you working code. 


```
    /* JFlex example: partial Java language lexer specification */
    import java_cup.runtime.*;

    /**
     * This class is a simple example lexer.
     */
    %%

    %class Lexer
    %unicode
    %cup
    %line
    %column

    %{
      StringBuffer string = new StringBuffer();

      private Symbol symbol(int type) {
        return new Symbol(type, yyline, yycolumn);
      }
      private Symbol symbol(int type, Object value) {
        return new Symbol(type, yyline, yycolumn, value);
      }
    %}

    LineTerminator = \r|\n|\r\n
    InputCharacter = [^\r\n]
    WhiteSpace     = {LineTerminator} | [ \t\f]

    /* comments */
    Comment = {TraditionalComment} | {EndOfLineComment} | {DocumentationComment}

    TraditionalComment   = "/*" [^*] ~"*/" | "/*" "*"+ "/"
    // Comment can be the last line of the file, without line terminator.
    EndOfLineComment     = "//" {InputCharacter}* {LineTerminator}?
    DocumentationComment = "/**" {CommentContent} "*"+ "/"
    CommentContent       = ( [^*] | \*+ [^/*] )*

    Identifier = [:jletter:] [:jletterdigit:]*

    DecIntegerLiteral = 0 | [1-9][0-9]*

    %state STRING

    %%

    /* keywords */
    <YYINITIAL> "abstract"           { return symbol(sym.ABSTRACT); }
    <YYINITIAL> "boolean"            { return symbol(sym.BOOLEAN); }
    <YYINITIAL> "break"              { return symbol(sym.BREAK); }

    <YYINITIAL> {
      /* identifiers */ 
      {Identifier}                   { return symbol(sym.IDENTIFIER); }
     
      /* literals */
      {DecIntegerLiteral}            { return symbol(sym.INTEGER_LITERAL); }
      \"                             { string.setLength(0); yybegin(STRING); }

      /* operators */
      "="                            { return symbol(sym.EQ); }
      "=="                           { return symbol(sym.EQEQ); }
      "+"                            { return symbol(sym.PLUS); }

      /* comments */
      {Comment}                      { /* ignore */ }
     
      /* whitespace */
      {WhiteSpace}                   { /* ignore */ }
    }

    <STRING> {
      \"                             { yybegin(YYINITIAL); 
                                       return symbol(sym.STRING_LITERAL, 
                                       string.toString()); }
      [^\n\r\"\\]+                   { string.append( yytext() ); }
      \\t                            { string.append('\t'); }
      \\n                            { string.append('\n'); }

      \\r                            { string.append('\r'); }
      \\\"                           { string.append('\"'); }
      \\                             { string.append('\\'); }
    }

    /* error fallback */
    [^]                              { throw new Error("Illegal character <"+
                                                        yytext()+">"); }
```

From this specification JFlex generates a `.java` file with one class
that contains code for the scanner. The class will have a constructor
taking a `java.io.Reader` from which the input is read. The class will
also have a function `yylex()` that runs the scanner and that can be
used to get the next token from the input (in this example the function
actually has the name `next_token()` because the specification uses the
`%cup` switch).

As with JLex, the specification consists of three parts, divided by
`%%`:

-   [user code](#ExampleUserCode),
-   [options and declarations](#ExampleOptions) and
-   [lexical rules](#ExampleLexRules).


Code to include {#ExampleUserCode}
---------------

Let’s take a look at the first section, _user code_: The text up to the
first line starting with `%%` is copied verbatim to the top of the
generated lexer class (before the actual class declaration). Apart from
`package` and `import` statements there is usually not much to do here.
If the code ends with a `javadoc` class comment, the generated class will
get this comment; if not, JFlex will generate one automatically.


Options and Macros {#ExampleOptions}
------------------

The second section _options and declarations_ is more interesting. It
consists of a set of options, code that is included inside the generated
scanner class, lexical states and macro declarations. Each JFlex option
must begin a line of the specification and starts with a `%`. In our
example the following options are used:

-   `%class Lexer` tells JFlex to give the generated class the
    name `Lexer` and to write the code to a file `Lexer.java`.

-   `%unicode` defines the set of characters the scanner will
    work on. For scanning text files, `%unicode` should always be used.
    The Unicode version may be specified, e.g. `%unicode 4.1`. If no
    version is specified, the most recent supported Unicode version will
    be used - in JFlex $VERSION, this is Unicode $UNICODE_VER. See also
    [Encodings](#sec:encodings) for more information on character
    sets, encodings, and scanning text vs. binary files.

-   `%cup` switches to CUP compatibility mode to interface
    with a CUP generated parser.

-   `%line` switches line counting on (the current line number
    can be accessed via the variable `yyline`).

-   `%column` switches column counting on (the current column is
    accessed via `yycolumn`).


The code between `%{` and `%}` is copied verbatim into the generated lexer
class source. Here you can declare member variables and functions that are
used inside scanner actions. In our example we declare a `StringBuffer`
`string` in which we will store parts of string literals and two helper
functions `symbol` that create `java_cup.runtime.Symbol` objects with
position information of the current token (see also [JFlex and CUP](#CUPWork)
for how to interface with the parser generator CUP). As with all JFlex
options, both `%{` and `%}` must begin a line.

The specification continues with macro declarations. Macros are abbreviations
for regular expressions, used to make lexical specifications easier to read
and understand. A macro declaration consists of a macro identifier followed
by `=`, then followed by the regular expression it represents. This regular
expression may itself contain macro usages. Although this allows a
grammar-like specification style, macros are still just abbreviations and not
non-terminals – they cannot be recursive. Cycles in macro definitions are
detected and reported at generation time by JFlex.
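
The substitution view of macros can be sketched in plain Java. The class below is a hypothetical illustration and not part of JFlex, and it does not claim to show how JFlex works internally: `MacroExpander`, `define` and `expand` are names made up for this sketch. It replaces each `{Name}` usage by the parenthesised definition of that macro and refuses definitions that refer back to themselves.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of JFlex: expands {Name} usages by substitution. */
class MacroExpander {
  // A macro usage: an identifier in braces. Repetitions such as {2,3} do not match.
  private static final Pattern USAGE =
      Pattern.compile("\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

  private final Map<String, String> macros = new LinkedHashMap<>();

  void define(String name, String regex) {
    macros.put(name, regex);
  }

  /** Returns the definition of name with all macro usages expanded. */
  String expand(String name) {
    return expand(name, new ArrayDeque<>());
  }

  private String expand(String name, Deque<String> inProgress) {
    String body = macros.get(name);
    if (body == null) {
      throw new IllegalArgumentException("undefined macro: " + name);
    }
    if (inProgress.contains(name)) {
      // The macro is already being expanded further up: the definitions are cyclic.
      throw new IllegalStateException("cycle in macro definitions: " + name);
    }
    inProgress.push(name);
    Matcher m = USAGE.matcher(body);
    StringBuilder out = new StringBuilder();
    int last = 0;
    while (m.find()) {
      out.append(body, last, m.start());
      // Parenthesise the expansion so that it stays a single sub-expression.
      out.append('(').append(expand(m.group(1), inProgress)).append(')');
      last = m.end();
    }
    out.append(body, last, body.length());
    inProgress.pop();
    return out.toString();
  }
}
```

With `LineTerminator` and `WhiteSpace` defined as in the example specification, expanding `WhiteSpace` in this sketch yields the definition of `LineTerminator` in parentheses followed by `| [ \t\f]`. Two macros that refer to each other make `expand` throw instead of looping forever.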

Here are some of the example macros in more detail:

-   `LineTerminator` stands for the regular expression that matches an
    ASCII `CR`, an ASCII `LF` or a `CR` followed by `LF`.

-   `InputCharacter` stands for all characters that are not a `CR` or `LF`.

-   `TraditionalComment` is the expression that matches the string `/*`
    followed by a character that is not a `*`, followed by anything that
    does not contain, but ends in `*/`. As this would not match comments
    like `/****/`, we add `/*` followed by an arbitrary number (at least
    one) of `*` followed by the closing `/`. This is not the only, but
    one of the simpler expressions matching non-nesting Java comments.
    It is tempting to just write something like the expression
    `/* .* */`, but this would match more than we want. It would for
    instance match the entire input `/* */ x = 0; /* */`, instead of two
    comments and four real tokens. See the macros `DocumentationComment` and
    `CommentContent` for an alternative.

-   `CommentContent` matches zero or more occurrences of either a
    character other than `*`, or one or more `*` followed by a character
    that is neither `/` nor `*`.

-   `Identifier` matches each string that starts with a character of
    class `jletter` followed by zero or more characters of class
    `jletterdigit`. `jletter` and `jletterdigit` are predefined
    character classes. `jletter` includes all characters for which the
    Java function `Character.isJavaIdentifierStart` returns `true` and
    `jletterdigit` all characters for which
    `Character.isJavaIdentifierPart` returns `true`.
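
The difference between the tempting `/* .* */` and the `CommentContent` style can be tried out with `java.util.regex`. This is only an approximation for illustration: Java regular expressions use a different syntax and matching strategy than JFlex, the class name `CommentRegexDemo` is made up, and the `PRECISE` pattern opens with `/*` rather than `/**` so that it covers ordinary comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: java.util.regex stand-ins for the comment expressions. */
class CommentRegexDemo {
  // The naive idea: "/*", then anything, then "*/".
  static final Pattern GREEDY = Pattern.compile("/\\*.*\\*/");

  // Same shape as "/*" {CommentContent} "*"+ "/" with
  // CommentContent = ( [^*] | \*+ [^/*] )*
  static final Pattern PRECISE = Pattern.compile("/\\*([^*]|\\*+[^/*])*\\*+/");

  /** Length of the match that starts at the beginning of input, or -1 if none. */
  static int matchLength(Pattern p, String input) {
    Matcher m = p.matcher(input);
    return m.lookingAt() ? m.end() : -1;
  }

  public static void main(String[] args) {
    String input = "/* */ x = 0; /* */";
    System.out.println("naive:   " + matchLength(GREEDY, input));
    System.out.println("precise: " + matchLength(PRECISE, input));
  }
}
```

On the input `/* */ x = 0; /* */` the naive pattern consumes the whole line, while the precise one stops after the first `/* */`. The precise pattern also accepts `/****/`.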


The last part of the second section in our lexical specification is a lexical
state declaration: `%state STRING` declares a lexical state `STRING` that can
be used in the _lexical rules_ part of the specification. A state declaration
is a line starting with `%state` followed by a space- or comma-separated list
of state identifiers. There can be more than one line starting with `%state`.


Rules and Actions {#ExampleLexRules}
-----------------

The _lexical rules_ section of a JFlex specification contains regular
expressions and actions (Java code) that are executed when the scanner
matches the associated regular expression. As the scanner reads its input, it
keeps track of all regular expressions and activates the action of the
expression that has the longest match. For instance, given the input
`breaker`, our specification above matches the regular expression for
`Identifier` and not the keyword `break` followed by the Identifier `er`, because rule
`{Identifier}` matches more of this input at once than any other rule in the
specification. If two regular expressions both have the longest match for a
certain input, the scanner chooses the action of the expression that appears
first in the specification. In that way, we get for input `break` the keyword
`break` and not an Identifier `break`.
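
The two selection rules (longest match first, then order of appearance) can be imitated in a few lines of Java. This is a simplified sketch, not the generated scanner: it tries each rule separately with `java.util.regex`, whereas a JFlex scanner handles all rules in one pass. The class `LongestMatchDemo` and its two rules are assumptions made for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: longest match wins, ties go to the earlier rule. */
class LongestMatchDemo {
  // Insertion order stands in for the order of rules in the specification.
  static final Map<String, Pattern> RULES = new LinkedHashMap<>();

  static {
    RULES.put("BREAK", Pattern.compile("break"));
    RULES.put(
        "IDENTIFIER",
        Pattern.compile("\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*"));
  }

  /** Returns the name of the winning rule at the start of input, or null. */
  static String winner(String input) {
    String best = null;
    int bestLength = 0;
    for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
      Matcher m = rule.getValue().matcher(input);
      // Strictly greater: on a tie the earlier rule keeps its place.
      if (m.lookingAt() && m.end() > bestLength) {
        best = rule.getKey();
        bestLength = m.end();
      }
    }
    return best;
  }
}
```

Here `winner("breaker")` returns `IDENTIFIER` because that rule matches seven characters and `BREAK` only five. `winner("break")` returns `BREAK` because both rules match five characters and `BREAK` comes first.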

In addition to regular expression matches, one can use lexical states to
refine a specification. A lexical state acts like a start condition. If the
scanner is in lexical state `STRING`, only expressions that are preceded by
the start condition `<STRING>` can be matched. A start condition of a regular
expression can contain more than one lexical state. It is then matched when
the lexer is in any of these lexical states. The lexical state `YYINITIAL` is
predefined and is also the state in which the lexer begins scanning. If a
regular expression has no start conditions it is matched in *all* lexical
states.

Since there often are sets of expressions with the same start conditions,
they can be grouped:

    <STRING> {
      expr1   { action1 }
      expr2   { action2 }
    }

means that both `expr1` and `expr2` have start condition `<STRING>`.

The first three rules in our example demonstrate the syntax of a regular
expression preceded by the start condition `<YYINITIAL>`.

    <YYINITIAL> "abstract"           { return symbol(sym.ABSTRACT); }

matches the input `abstract` only if the scanner is in its start state
`YYINITIAL`. When the string `abstract` is matched, the scanner function
returns the CUP symbol `sym.ABSTRACT`. If an action does not return a value,
the scanning process is resumed immediately after executing the action.

The rules enclosed in

    <YYINITIAL> { ...

demonstrate the abbreviated syntax and are also only matched in state
`YYINITIAL`.

Of these rules, one is of special interest:

    \"  { string.setLength(0); yybegin(STRING); }

If the scanner matches a double quote in state `YYINITIAL` we have
recognised the start of a string literal. Therefore we clear our
`StringBuffer` that will hold the content of this string literal and
tell the scanner with `yybegin(STRING)` to switch into the lexical state
`STRING`. Because we do not yet return a value to the parser, our
scanner proceeds immediately.

In lexical state `STRING` another rule demonstrates how to refer to the
input that has been matched:

    [^\n\r\"\\]+                   { string.append( yytext() ); }

The expression `[^\n\r\"\\]+` matches all characters in the input up to
the next backslash (indicating an escape sequence such as `\n`), double
quote (indicating the end of the string), or line terminator (which must
not occur in a Java string literal). The matched region of the input is
referred to by `yytext()` and appended to the content of the string
literal parsed so far.
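
To see how the two lexical states cooperate, here is a hand-written Java loop that follows the same steps as the `YYINITIAL` and `STRING` rules for string literals. It is a sketch under simplifying assumptions, not the code JFlex generates: everything outside a string literal is skipped, only the first literal is returned, and the names `StringStateDemo` and `firstStringLiteral` are invented for this example.

```java
/** Illustration only: a manual loop mirroring the YYINITIAL/STRING rules. */
class StringStateDemo {
  enum State { YYINITIAL, STRING }

  /** Returns the content of the first string literal, or null if none is completed. */
  static String firstStringLiteral(String input) {
    State state = State.YYINITIAL;
    StringBuilder string = new StringBuilder();
    int i = 0;
    while (i < input.length()) {
      char c = input.charAt(i);
      if (state == State.YYINITIAL) {
        if (c == '"') {
          // \"  { string.setLength(0); yybegin(STRING); }
          string.setLength(0);
          state = State.STRING;
        }
        i++; // all other input is skipped in this sketch
      } else if (c == '"') {
        // Closing quote: hand the collected value back to the caller.
        return string.toString();
      } else if (c == '\\') {
        char next = i + 1 < input.length() ? input.charAt(i + 1) : '\0';
        switch (next) {
          case 't': string.append('\t'); i += 2; break;
          case 'n': string.append('\n'); i += 2; break;
          case 'r': string.append('\r'); i += 2; break;
          case '"': string.append('"'); i += 2; break;
          default: string.append('\\'); i += 1; break; // lone backslash
        }
      } else if (c == '\n' || c == '\r') {
        // No STRING rule matches a line terminator; the fallback rule reports it.
        throw new IllegalStateException("line terminator in string literal");
      } else {
        // [^\n\r\"\\]+  { string.append( yytext() ); }
        int start = i;
        while (i < input.length() && "\n\r\"\\".indexOf(input.charAt(i)) < 0) {
          i++;
        }
        string.append(input, start, i);
      }
    }
    return null;
  }
}
```

The inner `while` loop plays the role of `[^\n\r\"\\]+`: it consumes as many ordinary characters as possible in one step, and the appended slice corresponds to `yytext()`.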

The last lexical rule in the example specification is used as an error
fallback. It matches any character in any state that has not been
matched by another rule. It doesn’t conflict with any other rule because
it has the lowest priority (because it’s the last rule) and because it
matches only one character (so it can’t have longest match precedence
over any other rule).


How to get it building
----------------------

-   [Install JFlex](#Installing)

-   Write your specification file (or choose one from the `examples`
    directory) and save it, say under the name `java-lang.flex`.

-   Run JFlex with

    `jflex java-lang.flex`

-   JFlex should then show progress messages about generating the
    scanner and write the generated code to the directory of your
    specification file.

-   Compile the generated `.java` file and your own classes. (If you use
    CUP, generate your parser classes first.)

-   That’s it.

