Flex and Bison: Lexical analysis stops before end …

I am using Flex and Bison to parse a simplified SQL grammar. I am running into an issue where Flex will stop tokenizing before the end of the file. When I pass the test case containing just:

create database this;

everything works fine and the tokens CREATE, DATEABASE, and ID are created. But when I pass a test case containing just:

create table this (integer qty);

it will tokenize up to and including “qty” but not read the ‘)’ or the ‘;’. Instead it will jump right into the parsing and print “REJECT” (as it should with the incorrect tokens). How can I fix my code to tokenize everything so that I pass CREATE, TABLE, ID, (, INTEGER, ID, ), ; as tokens to Bison? Below are my .l .y and makefile. Thank you in advance for the help!

sql.l

%{
#include "sql.tab.h"
%}

Delimiter    [ t]
WhiteSpace   {Delimiter}+
Letter       [A-Za-z]
Name         [A-Za-z][A-Za-z0-9_]*
Digit        [0-9]
Integer      {Digit}+
Float        {Digit}+"."{Digit}+
Date         (0[1-9]|1[012])"/"(0[1-9]|[12][0-9]|3[01])"/"[0-9][0-9][0-9][0-9]
Other        [-!@#$%&+:"~`]


%%
{WhiteSpace}                                     { ; }
[Cc][Rr][Ee][Aa][Tt][Ee]                 {printf("token:     CREATEn");return(CREATE);}
[Dd][Rr][Oo][Pp]                         {printf("token: DROPn");return(DROP);}
[Ll][Oo][Aa][Dd]                         {printf("token: LOADn");return(LOAD);}
[Ss][Aa][Vv][Ee]                         {printf("token: SAVEn");return(SAVE);}
[Dd][Aa][Tt][Aa][Bb][Aa][Ss][Ee]         {printf("token: DATABASEn");return(DATABASE);}
[Tt][Aa][Bb][Ll][Ee]                     {printf("token: TABLEn");return(TABLE);}
[Ii][Nn][Ss][Ee][Rr][Tt]                 {printf("token: INSERTn");return(INSERT);}
[Ii][Nn][Tt][Oo]                         {printf("token: INTOn");return(INTO);}
[Ff][Rr][Oo][Mm]                         {printf("token: FROMn");return(FROM);}
[Ww][Hh][Ee][Rr][Ee]                     {printf("token: WHEREn");return(WHERE);}
[Ss][Ee][Tt]                               {printf("token: SETn");return(SET);}
[Dd][Ee][Ll][Ee][Tt][Ee]                 {printf("token: DELETEn");return(DELETE);}
[Uu][Pp][Dd][Aa][Tt][Ee]                 {printf("token: UPDATEn");return(UPDATE);}
[Ss][Ee][Ll][Ee][Cc][Tt]                 {printf("token: SELECTn");return(SELECT);}
[Ww][Ss][Ee][Ll][Ee][Cc][Tt]             {printf("token: WSELECTn");return(WSELECT);}
[Vv][Aa][Ll][Uu][Ee][Ss]                 {printf("token: VALUESn");return(VALUES);}
[Ii][Nn][Tt][Ee][Gg][Ee][Rr]             {printf("token: INTEGERn");return(INTEGER);}
[Nn][Uu][Mm][Bb][Ee][Rr]                 {printf("token: NUMBERn");return(NUMBER);}
[Cc][Hh][Aa][Rr][Aa][Cc][Tt][Ee][Rr]     {printf("token: CHARACTERn");return(CHARACTER);}
[Dd][Aa][Tt][Ee]                               {printf("token: DATEn");return(DATE);}
{Name}                       {printf("token: ID %sn", yytext);return(ID);}
"*"                      {printf("token: *n");return('*');}
"("                      {printf("token: (n");return('(');}
")"                      {printf("token: )n");return(')');}
","                      {printf("token: ,n");return(',');}
"<"                      {printf("token: LTn");return(LT);}
">"                      {printf("token: GTn");return(GT);}
"<="                     {printf("token: LTEQn");return(LTEQ);}
">="                     {printf("token: GTEQn");return(GTEQ);}
"=="                           {printf("token: EQEQn");return(EQEQ);}
"!="                           {printf("token: NOTEQn");return(NOTEQ);}
";"                      {printf("token: ;n");return(';');}
{Float}                  {printf("token: DECn");return(DEC);}
"-"{Float}               {printf("token: DECn");return(DEC);}
{Integer}                {printf("token: INTn");return(INT);}
"-"{Integer}             {printf("token: INTn");return(INT);}
"'"[ ta-zA-Z0-9]+"'"    {printf("token: STRn");return(STR);}
{Date}                     {printf("token: DATETYPEn");return(DATETYPE);}
n           { ; } 
.                    {printf("Character %s not tokenizedn", yytext);exit(0)
;}

sql.y

%{ 
#define YYSTYPE char*
#include <stdio.h>
%}

%start START
%token ID 
%token CREATE DROP LOAD SAVE DATABASE TABLE INSERT INTO FROM 
%token WHERE SET DELETE UPDATE SELECT WSELECT VALUES
%token DEC INT STR DATETYPE DATE NUMBER CHARACTER INTEGER
%token LT GT LTEQ GTEQ EQ EQEQ NOTEQ


%%
START          : COMMAND_LIST  

COMMAND_LIST   : COMMAND COMMAND_LIST2

COMMAND_LIST2  : COMMAND COMMAND_LIST2 
           | /* EMPTY */

COMMAND        : SYSTEM_COMMAND 
           | DDL_COMMAND 
           | DML_COMMAND

SYSTEM_COMMAND : CREATE_DATABASE_COMMAND
               | DROP_DATABASE_COMMAND
               | SAVE_COMMAND
               | LOAD_DATABASE_COMMAND


DDL_COMMAND    : CREATE_TABLE_COMMAND
               | DROP_TABLE_COMMAND

DDL_COMMAND    : CREATE_TABLE_COMMAND
               | DROP_TABLE_COMMAND

DML_COMMAND    : INSERT_INTO_COMMAND
               | DELETE_FROM_COMMAND
               | UPDATE_COMMAND
               | SELECT_COMMAND
               | W_SELECT_COMMAND

CREATE_DATABASE_COMMAND : CREATE DATABASE ID ';'

DROP_DATABASE_COMMAND   : DROP DATABASE ID ';'

SAVE_COMMAND            : SAVE ';'

LOAD_DATABASE_COMMAND   : LOAD DATABASE ID ';'

CREATE_TABLE_COMMAND    : CREATE TABLE ID '(' FIELD_DEF_LIST ')' ';'

DROP_TABLE_COMMAND      : DROP TABLE ID ';'

INSERT_INTO_COMMAND     : INSERT INTO ID INSERT_INTO_COMMAND2

INSERT_INTO_COMMAND2    : '(' FIELD_LIST ')' VALUES '(' LITERAL_LIST ')' ';'
                        | VALUES '(' LITERAL_LIST ')' ';'

DELETE_FROM_COMMAND     : DELETE FROM ID DELETE_FROM_COMMAND2

DELETE_FROM_COMMAND2    : WHERE CONDITION ';' | ';'

UPDATE_COMMAND          : UPDATE ID SET ID '=' LITERAL UPDATE_COMMAND2

UPDATE_COMMAND2         : ',' ID '=' LITERAL UPDATE_COMMAND2      
                        | UPDATE_COMMAND3

UPDATE_COMMAND3         : WHERE CONDITION ';' 
            | ';'

SELECT_COMMAND          : SELECT '*' FROM ID ';'

W_SELECT_COMMAND        : WSELECT W_SELECT_COMMAND2

W_SELECT_COMMAND2       : '*' FROM ID W_SELECT_COMMAND3
                        | '(' FIELD_LIST ')' FROM ID W_SELECT_COMMAND3

W_SELECT_COMMAND3       : WHERE CONDITION ';' 
                | ';'

FIELD_DEF_LIST      : FIELD_DEF FIELD_DEF_LIST2 

FIELD_DEF_LIST2     : ',' FIELD_DEF FIELD_DEF_LIST2 
                | /* EMPTY */ 

FIELD_DEF       : FIELD_TYPE ID 

FIELD_LIST      : ID FIELD_LIST2

FIELD_LIST2         : ',' ID FIELD_LIST2 
                | /* EMPTY */ 

FIELD_TYPE              : INTEGER '(' INT ')' 
                        | INT
                        | NUMBER '(' INT ')' 
                        | NUMBER '(' INT ',' INT ')'
                        | NUMBER
                        | CHARACTER '(' INT ')'
                        | DATE

LITERAL_LIST            : LITERAL LITERAL_LIST2

LITERAL_LIST2           : ',' LITERAL LITERAL_LIST2 
                | /* EMPTY */

LITERAL                 : INT 
                        | DEC 
                        | STR 
                        | DATETYPE

CONDITION               : ID COMP LITERAL 

COMP                    : LT 
            | GT 
            | EQEQ 
            | LTEQ 
            | GTEQ
            | NOTEQ 
%%

main()
{
   yyparse();
}
yyerror()
{
   printf("REJECTn");
}
yywrap()
{
   printf("ACCEPTn");
}

makefile

trial: lex.yy.o sql.tab.o
    cc -o trial lex.yy.o sql.tab.o 

sql.tab.o: sql.tab.c
    cc -c sql.tab.c

sql.tab.c: sql.y
    bison -d sql.y

lex.yy.o: lex.yy.c 
    cc -c lex.yy.c

lex.yy.c: sql.l sql.tab.c
    flex sql.l

Btw I have some print statements in the .l file for debugging. And the liked executable is called ‘trial’.

Answer

As soon as the bison parser gets an error (something that can’t be parsed), it will call yyerror with a “syntax error” message and then go into error recovery mode. In error recovery mode, it looks for rules with error on the right hand side (error recovery rules), and will use one of them to try to recover. If the grammar has no error recovery rules (as yours), then it won’t find any, so yyparse will simply return without trying to read any further tokens.

If you want to try to recover from the syntax error and continue parsing alternate commands (generally, stuff after the part with the syntax error), you need to add some error recovery rules. Writing good error recovery rules is somewhat tricky, and you can do lots of different things. In your case, the simplest thing you can try is to just recover at the top level, so it will chuck the rest of the current command (with the sytax error) and attempt to find the next command:

START          : /* empty */
               | START COMMAND

COMMAND        : SYSTEM_COMMAND 
               | DDL_COMMAND 
               | DML_COMMAND
               | error ';'

Note that I’ve gotten rid of the useless, redundant COMMAND_LIST rules and changed it to use left recursion instead of righ recursion (as bison is generally better with left recursion — you should only use right recursion if you really need it and know what you are doing).

This error recovery rule will be active whenever you are trying to parse commands, and will work by throwing away input tokens until a ‘;’ is seen. Once it gets to the ‘;’, it will treat that as being the end of the command (it will reduce the COMMAND : error ';' rule, then the START : START COMMAND rule), and continue parsing (looking for another COMMAND)

Note that once an error recovery rule has recovered from an error, the parser will continue to parse, and if gets to the end of the input without further error, yyparse will return success.

Flex and Bison: Lexical analysis stops before end of input

Advertisement

Answer