I am using Flex and Bison to parse a simplified SQL grammar. I am running into an issue where Flex will stop tokenizing before the end of the file. When I pass the test case containing just:
create database this;
everything works fine and the tokens CREATE, DATEABASE, and ID are created. But when I pass a test case containing just:
create table this (integer qty);
it will tokenize up to and including “qty” but not read the ‘)’ or the ‘;’. Instead it will jump right into the parsing and print “REJECT” (as it should with the incorrect tokens). How can I fix my code to tokenize everything so that I pass CREATE, TABLE, ID, (, INTEGER, ID, ), ; as tokens to Bison? Below are my .l .y and makefile. Thank you in advance for the help!
sql.l
%{ #include "sql.tab.h" %} Delimiter [ t] WhiteSpace {Delimiter}+ Letter [A-Za-z] Name [A-Za-z][A-Za-z0-9_]* Digit [0-9] Integer {Digit}+ Float {Digit}+"."{Digit}+ Date (0[1-9]|1[012])"/"(0[1-9]|[12][0-9]|3[01])"/"[0-9][0-9][0-9][0-9] Other [-!@#$%&+:"~`] %% {WhiteSpace} { ; } [Cc][Rr][Ee][Aa][Tt][Ee] {printf("token: CREATEn");return(CREATE);} [Dd][Rr][Oo][Pp] {printf("token: DROPn");return(DROP);} [Ll][Oo][Aa][Dd] {printf("token: LOADn");return(LOAD);} [Ss][Aa][Vv][Ee] {printf("token: SAVEn");return(SAVE);} [Dd][Aa][Tt][Aa][Bb][Aa][Ss][Ee] {printf("token: DATABASEn");return(DATABASE);} [Tt][Aa][Bb][Ll][Ee] {printf("token: TABLEn");return(TABLE);} [Ii][Nn][Ss][Ee][Rr][Tt] {printf("token: INSERTn");return(INSERT);} [Ii][Nn][Tt][Oo] {printf("token: INTOn");return(INTO);} [Ff][Rr][Oo][Mm] {printf("token: FROMn");return(FROM);} [Ww][Hh][Ee][Rr][Ee] {printf("token: WHEREn");return(WHERE);} [Ss][Ee][Tt] {printf("token: SETn");return(SET);} [Dd][Ee][Ll][Ee][Tt][Ee] {printf("token: DELETEn");return(DELETE);} [Uu][Pp][Dd][Aa][Tt][Ee] {printf("token: UPDATEn");return(UPDATE);} [Ss][Ee][Ll][Ee][Cc][Tt] {printf("token: SELECTn");return(SELECT);} [Ww][Ss][Ee][Ll][Ee][Cc][Tt] {printf("token: WSELECTn");return(WSELECT);} [Vv][Aa][Ll][Uu][Ee][Ss] {printf("token: VALUESn");return(VALUES);} [Ii][Nn][Tt][Ee][Gg][Ee][Rr] {printf("token: INTEGERn");return(INTEGER);} [Nn][Uu][Mm][Bb][Ee][Rr] {printf("token: NUMBERn");return(NUMBER);} [Cc][Hh][Aa][Rr][Aa][Cc][Tt][Ee][Rr] {printf("token: CHARACTERn");return(CHARACTER);} [Dd][Aa][Tt][Ee] {printf("token: DATEn");return(DATE);} {Name} {printf("token: ID %sn", yytext);return(ID);} "*" {printf("token: *n");return('*');} "(" {printf("token: (n");return('(');} ")" {printf("token: )n");return(')');} "," {printf("token: ,n");return(',');} "<" {printf("token: LTn");return(LT);} ">" {printf("token: GTn");return(GT);} "<=" {printf("token: LTEQn");return(LTEQ);} ">=" {printf("token: GTEQn");return(GTEQ);} "==" {printf("token: EQEQn");return(EQEQ);} "!=" {printf("token: NOTEQn");return(NOTEQ);} ";" {printf("token: ;n");return(';');} {Float} {printf("token: DECn");return(DEC);} "-"{Float} {printf("token: DECn");return(DEC);} {Integer} {printf("token: INTn");return(INT);} "-"{Integer} {printf("token: INTn");return(INT);} "'"[ ta-zA-Z0-9]+"'" {printf("token: STRn");return(STR);} {Date} {printf("token: DATETYPEn");return(DATETYPE);} n { ; } . {printf("Character %s not tokenizedn", yytext);exit(0) ;}
sql.y
%{ #define YYSTYPE char* #include <stdio.h> %} %start START %token ID %token CREATE DROP LOAD SAVE DATABASE TABLE INSERT INTO FROM %token WHERE SET DELETE UPDATE SELECT WSELECT VALUES %token DEC INT STR DATETYPE DATE NUMBER CHARACTER INTEGER %token LT GT LTEQ GTEQ EQ EQEQ NOTEQ %% START : COMMAND_LIST COMMAND_LIST : COMMAND COMMAND_LIST2 COMMAND_LIST2 : COMMAND COMMAND_LIST2 | /* EMPTY */ COMMAND : SYSTEM_COMMAND | DDL_COMMAND | DML_COMMAND SYSTEM_COMMAND : CREATE_DATABASE_COMMAND | DROP_DATABASE_COMMAND | SAVE_COMMAND | LOAD_DATABASE_COMMAND DDL_COMMAND : CREATE_TABLE_COMMAND | DROP_TABLE_COMMAND DDL_COMMAND : CREATE_TABLE_COMMAND | DROP_TABLE_COMMAND DML_COMMAND : INSERT_INTO_COMMAND | DELETE_FROM_COMMAND | UPDATE_COMMAND | SELECT_COMMAND | W_SELECT_COMMAND CREATE_DATABASE_COMMAND : CREATE DATABASE ID ';' DROP_DATABASE_COMMAND : DROP DATABASE ID ';' SAVE_COMMAND : SAVE ';' LOAD_DATABASE_COMMAND : LOAD DATABASE ID ';' CREATE_TABLE_COMMAND : CREATE TABLE ID '(' FIELD_DEF_LIST ')' ';' DROP_TABLE_COMMAND : DROP TABLE ID ';' INSERT_INTO_COMMAND : INSERT INTO ID INSERT_INTO_COMMAND2 INSERT_INTO_COMMAND2 : '(' FIELD_LIST ')' VALUES '(' LITERAL_LIST ')' ';' | VALUES '(' LITERAL_LIST ')' ';' DELETE_FROM_COMMAND : DELETE FROM ID DELETE_FROM_COMMAND2 DELETE_FROM_COMMAND2 : WHERE CONDITION ';' | ';' UPDATE_COMMAND : UPDATE ID SET ID '=' LITERAL UPDATE_COMMAND2 UPDATE_COMMAND2 : ',' ID '=' LITERAL UPDATE_COMMAND2 | UPDATE_COMMAND3 UPDATE_COMMAND3 : WHERE CONDITION ';' | ';' SELECT_COMMAND : SELECT '*' FROM ID ';' W_SELECT_COMMAND : WSELECT W_SELECT_COMMAND2 W_SELECT_COMMAND2 : '*' FROM ID W_SELECT_COMMAND3 | '(' FIELD_LIST ')' FROM ID W_SELECT_COMMAND3 W_SELECT_COMMAND3 : WHERE CONDITION ';' | ';' FIELD_DEF_LIST : FIELD_DEF FIELD_DEF_LIST2 FIELD_DEF_LIST2 : ',' FIELD_DEF FIELD_DEF_LIST2 | /* EMPTY */ FIELD_DEF : FIELD_TYPE ID FIELD_LIST : ID FIELD_LIST2 FIELD_LIST2 : ',' ID FIELD_LIST2 | /* EMPTY */ FIELD_TYPE : INTEGER '(' INT ')' | INT | NUMBER '(' INT ')' | NUMBER '(' INT ',' INT ')' | NUMBER | CHARACTER '(' INT ')' | DATE LITERAL_LIST : LITERAL LITERAL_LIST2 LITERAL_LIST2 : ',' LITERAL LITERAL_LIST2 | /* EMPTY */ LITERAL : INT | DEC | STR | DATETYPE CONDITION : ID COMP LITERAL COMP : LT | GT | EQEQ | LTEQ | GTEQ | NOTEQ %% main() { yyparse(); } yyerror() { printf("REJECTn"); } yywrap() { printf("ACCEPTn"); }
makefile
trial: lex.yy.o sql.tab.o cc -o trial lex.yy.o sql.tab.o sql.tab.o: sql.tab.c cc -c sql.tab.c sql.tab.c: sql.y bison -d sql.y lex.yy.o: lex.yy.c cc -c lex.yy.c lex.yy.c: sql.l sql.tab.c flex sql.l
Btw I have some print statements in the .l file for debugging. And the liked executable is called ‘trial’.
Advertisement
Answer
As soon as the bison parser gets an error (something that can’t be parsed), it will call yyerror
with a “syntax error” message and then go into error recovery mode. In error recovery mode, it looks for rules with error
on the right hand side (error recovery rules), and will use one of them to try to recover. If the grammar has no error recovery rules (as yours), then it won’t find any, so yyparse will simply return without trying to read any further tokens.
If you want to try to recover from the syntax error and continue parsing alternate commands (generally, stuff after the part with the syntax error), you need to add some error recovery rules. Writing good error recovery rules is somewhat tricky, and you can do lots of different things. In your case, the simplest thing you can try is to just recover at the top level, so it will chuck the rest of the current command (with the sytax error) and attempt to find the next command:
START : /* empty */ | START COMMAND COMMAND : SYSTEM_COMMAND | DDL_COMMAND | DML_COMMAND | error ';'
Note that I’ve gotten rid of the useless, redundant COMMAND_LIST
rules and changed it to use left recursion instead of righ recursion (as bison is generally better with left recursion — you should only use right recursion if you really need it and know what you are doing).
This error recovery rule will be active whenever you are trying to parse commands, and will work by throwing away input tokens until a ‘;’ is seen. Once it gets to the ‘;’, it will treat that as being the end of the command (it will reduce the COMMAND : error ';'
rule, then the START : START COMMAND
rule), and continue parsing (looking for another COMMAND
)
Note that once an error recovery rule has recovered from an error, the parser will continue to parse, and if gets to the end of the input without further error, yyparse
will return success.