2021-06-05

Compilation Principles: Intermediate Code Generation

Download source code here

The grammar provided for the experiment is as follows:

\begin{array}{lcl} program & \to & block\\ block & \to & \{\;decls\;stmts\;\}\\ decls & \to & decls\;decl\;|\;\varepsilon\\ decl & \to & type\;\textbf{id}\;;\\ type & \to & type\;[\;\textbf{num}\;]\;|\;\textbf{basic}\\ stmts & \to & stmts\;stmt\;|\;\varepsilon\\ stmt & \to & \textbf{id}=expr\;;\\ & | & \textbf{if}\;(\;bool\;)\;stmt\\ & | & \textbf{if}\;(\;bool\;)\;stmt\;\textbf{else}\;stmt\\ & | & \textbf{while}\;(\;bool\;)\;stmt\\ & | & \textbf{do}\;stmt\;\textbf{while}\;(\;bool\;)\;;\\ & | & \textbf{break}\;;\\ bool & \to & bool\;||\;join\;|\;join\\ join & \to & join\;\&\&\;equality\;|\;equality\\ equality & \to & equality==rel\;|\;equality\;\text{!=}\;rel\;|\;rel\\ rel & \to & expr<expr\\ & | & expr<=expr\\ & | & expr>expr\\ & | & expr>=expr\\ & | & expr\\ expr & \to & expr+term\\ & | & expr - term\\ & | & term\\ term & \to & term\,*\,unary\\ & | & term\;/\;unary\\ & | & unary\\ unary & \to & !\;unary\;|\;-unary\;|\;factor\\ factor & \to & (\;expr\;)\;|\;\textbf{id}\;|\;\textbf{num} \end{array}

where \textbf{basic} includes four types: int, float, char, and bool.

1 Grammar Rewriting

1.1 Error Correction

Since array types are defined in type but not reflected in assignment expressions, array types are not implemented here, and decl is modified to:

decl\to \textbf{basic}\;\textbf{id}\;;

Since expr only involves four arithmetic operations, negative numbers, and negation, without evaluation of boolean expressions and logical operations, ||, \&\&, >, >=, <, and <= are also included in the operations of expr.

Since the final factor can only derive \textbf{num} (i.e., integers), while \textbf{basic} also supports floating-point, character, and boolean types, a new terminal symbol \textbf{const} needs to be defined to represent all constants.

1.2 Conversion to Ambiguous Grammar

Since ambiguous grammar is more concise and consistent with most syntax-directed definitions for intermediate code generation given in textbooks, the original grammar can be converted into an equivalent ambiguous grammar. First, modify bool as follows:

\begin{array}{lcl} bool & \to & bool\;||\;bool\\ & | & bool\;\&\&\;bool\\ & | & !\;bool\\ & | & (\;bool\;)\\ & | & expr<expr\\ & | & expr<=expr\\ & | & expr>expr\\ & | & expr>=expr\\ & | & expr\\ \end{array}

Then modify expr as follows:

\begin{array}{lcl} expr & \to & expr\;||\;expr\\ & | & expr\;\&\&\;expr\\ & | & expr<expr\\ & | & expr<=expr\\ & | & expr>expr\\ & | & expr>=expr\\ & | & expr+expr\\ & | & expr-expr\\ & | & expr\,*\,expr\\ & | & expr\;/\;expr\\ & | & !\;expr\\ & | & -\;expr\\ & | & (\;expr\;)\\ & | & \textbf{id}\\ & | & \textbf{const} \end{array}

The shift-reduce and reduce-reduce conflicts of the ambiguous grammar are resolved by specifying the precedence and associativity of operators. The precedence and associativity are specified in grammar.y as follows:

%left OR
%left AND
%left EQ NE
%left LT LE GT GE
%left '+' '-'
%left '*' '/'
%right INV NOT

where INV stands for negation (negative numbers), and NOT stands for logical negation.

1.3 Error Recovery

To prevent yacc from exiting immediately when a syntax error is detected, but to continue translation and find remaining possible errors, an error production is added to stmt:

stmt\to\textbf{error}\;;

When encountering an error, Yacc will keep looking ahead until it encounters a semicolon (i.e., the end of the erroneous statement), then reduce the preceding error to \textbf{error}, and continue translating the subsequent content to achieve error recovery.

1.4 Synthesized Attributes

Since yacc adopts a bottom-up LR translation scheme, each non-terminal and terminal symbol is set to have the following synthesized attributes, defined as the yystack structure in types.h:

typedef struct yystack
{
    char name[MAXLEN];
    constant_val val;
    type type;
    int addr;
    goto_list *t_list;
    goto_list *f_list;
    goto_list *n_list;
} yystack;

#define YYSTYPE yystack

where:

Attribute NameMeaning
nameName of the terminal symbol \textbf{id}
valConstant value of the terminal symbol \textbf{const}
typeType of terminal and non-terminal symbols (int, float, char, bool)
addrAddress of the temporary variable corresponding to the non-terminal symbol
t_listList of jump instructions to be backfilled with target addresses when the condition is true (corresponding to truelist in textbooks)
f_listList of jump instructions to be backfilled with target addresses when the condition is false (corresponding to falselist in textbooks)
n_listList of jump instructions to be backfilled with target addresses at the end of the statement (corresponding to nextlist in textbooks)

2 Lexical Analysis

In lexer.l, for the \textbf{basic} terminal symbol, set the corresponding type attribute and return it:

int     { yylval.type = INT; return BASIC; }
float   { yylval.type = FLOAT; return BASIC; }
char    { yylval.type = CHAR; return BASIC; }
bool    { yylval.type = BOOL; return BASIC; }

For the constant \textbf{const}, also set the corresponding type and parse the constant value using different methods according to different types. Integers and floating-point numbers are implemented through atoi and atof, character types take the second character of the string (the first is a single quote), and boolean types are set directly:

true    { yylval.type = BOOL; yylval.val.boolean = true; return CONST; }
false   { yylval.type = BOOL; yylval.val.boolean = false; return CONST; }
{num}   { yylval.type = INT; yylval.val.num = atoi(yytext); return CONST; }
{real}  { yylval.type = FLOAT; yylval.val.real = atof(yytext); return CONST; }
{char}  { yylval.type = CHAR; yylval.val.ch = yytext[1]; return CONST; }

For \textbf{id}, directly copy the matching result to the name attribute, and its type can only be determined during syntax analysis.

{id}    { strcpy(yylval.name, yytext); return ID; }

The regular expressions for matching identifiers, integers, floating-point numbers, and characters are defined as follows, supporting the E notation for floating-point numbers.

id      [A-Za-z][0-9A-Za-z]*
num     [0-9]+
real    [0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?
char    '.'

3 Identifiers and Constants

3.1 Symbol Table

Identifiers, symbol tables, and related functions are defined in types.h as follows:

typedef struct symbol
{
    char name[MAXLEN];
    type type;
    int addr;
    int width;
    struct symbol *next;
} symbol;

typedef struct symbol_list
{
    symbol *head;
    struct symbol_list *prev;
    int begin;
    int end;
} symbol_list;

symbol *new_symbol(symbol_list *list, char *name, type type);
symbol *find_symbol(symbol_list *list, char *name);
symbol *find_local_symbol(symbol_list *list, char *name);
symbol_list *new_symbol_list(symbol_list *prev, int addr);
void print_symbol_list(symbol_list *list);
void delete_symbol_list(symbol_list *list);

Among them, the identifier structure records the name, type, address, and memory width occupied by the identifier, and records the pointer to the next symbol to form a linked list. Since identifier definitions can appear in different blocks, and the innermost block can reference all defined identifiers in the current and upper blocks, it is necessary to save the pointer prev to the upper-level symbol table of the symbol table.

Dynamic creation and deletion of symbol tables at different levels are realized by modifying the production of block:

\begin{array}{ll} block\to \{\;begin\;decls\;stmts\;end\;\}\\ begin\to\varepsilon & \{\;slist = new\_symbol\_list();\;\}\\ end\to\varepsilon & \{\;delete\_symbol\_list(slist);\;\} \end{array}

where slist is a global variable representing the symbol table of the outermost block currently.

The specific implementation in grammar.y is as follows: when creating a symbol table for the first time, it is necessary to specify the starting address of the symbol table. Here, SYMBOL_BEGIN is used to represent it, which is 1000. Subsequent symbol tables start from the end address of the upper-level symbol table. Before the symbol table is released, print_symbol_list is called to print the current symbol table.

block   : '{' begin decls stmts end '}'
begin   :
        {
            if (slist == NULL)
                slist = new_symbol_list(NULL, SYMBOL_BEGIN);
            else
                slist = new_symbol_list(slist, slist->end);
        }
end     :
        {
            symbol_list *slist2 = slist;
            if (error == 0)
                print_symbol_list(slist2);
            slist = slist2->prev;
            delete_symbol_list(slist2);
        }

3.2 Constant Table

The constant table is similar to the symbol table, but only a global constant table clist is needed, starting from address number 3000 (CONSTANT_BEGIN), and there is no need to save hierarchical relationships. The relevant function definitions are as follows:

constant *new_constant(constant_list *list, constant_val val, type type);
constant *find_constant(constant_list *list, constant_val val, type type);
constant_list *new_constant_list(int addr);
void print_constant_list(constant_list *list);
void delete_constant_list(constant_list *list);

3.3 Identifier Declaration and Reference

Identifiers are added to the symbol table when the declaration is parsed during syntax analysis, because only then can the type of the identifier be known.

decl\to\textbf{basic}\;\textbf{id}\;;\qquad\{\;new\_symbol(slist,\,\textbf{id}.name,\,\textbf{basic}.type);\;\}

The new_symbol function creates a new identifier, specifies the address of the new identifier through the current symbol table information, and determines the width occupied by the new identifier through the type. When inserting into the symbol table, it is also necessary to consider whether there are duplicate defined identifiers. Since identifiers with the same name in the inner layer will override those in the outer layer, only the current layer’s symbol table needs to be searched, which is realized by the find_local_symbol function. The specific implementation of the semantic rules in grammar.y is as follows:

decl    : BASIC ID ';'
        {
            if (find_local_symbol(slist, $2.name) == NULL)
                new_symbol(slist, $2.name, $1.type);
            else
            {
                char msg[MAXLEN];
                sprintf(msg, "duplicate declearation of symbol \"%s\"", $2.name);
                yyerror(msg);
            }
        }

When referencing an identifier, assign the attributes of the identifier to the current non-terminal symbol:

expr\to\textbf{id}\qquad\begin{array}{l}\{\;expr.type=\mathbf{id}.type;\\\;\;expr.addr=\textbf{id}.addr;\;\}\end{array}

Of course, it is also necessary to check whether the symbol has been defined; otherwise, an error is reported, but a new symbol can be inserted immediately for error recovery processing. The specific implementation is as follows:

expr    : ID
        {
            symbol *id = find_symbol(slist, $1.name);
            if (id == NULL)
            {
                char msg[MAXLEN];
                sprintf(msg, "undeclared symbol \"%s\"", $1.name);
                yyerror(msg);
                id = new_symbol(slist, $1.name, INT);
            }
            $$.type = id->type;
            $$.addr = id->addr;
        }

3.4 Constant Reference

For constants, they need to be added to the global constant table clist when they appear, and their attributes are assigned to non-terminal symbols:

expr\to\textbf{const}\qquad\begin{array}{l}\{\;expr.type=\mathbf{const}.type;\\\;\;expr.addr=new\_constant(clist,\,\textbf{const}.val,\,\textbf{const}.type);\;\}\end{array}

To address the problem of repeated references to the same constant, a check for duplicate constants is added inside the new_constant function. When both the constant value and type are the same, the original constant is returned directly. The specific implementation is similar to the reference of identifiers.

4 Quadruples and Three-Address Code

4.1 Quadruple Table

A quadruple consists of an operator op, parameters arg1, arg2, and result result (arg3). The structures of quadruples and quadruple tables are defined in types.h as follows:

typedef struct quadruple
{
    int op;
    int arg1;
    int arg2;
    int arg3;
    char code[4 * MAXLEN];
} quadruple;

typedef struct quadtable
{
    quadruple *buf;
    int size;
    int bufsize;
} quadtable;

quadtable *new_quadtable();
void print_quadtable(quadtable *table);
void delete_quadtable(quadtable *table);

Among them, the code field of the quadruple is used to store the three-address code. The buf of the quadruple table is used to store quadruples. The code for inserting quadruples into the quadruple table and generating the intermediate code corresponding to the quadruple is as follows.

4.2 Incremental Translation

The gen() function in the textbook is implemented using incremental translation. It is defined as follows, where arg1, arg2, and arg3 are all memory addresses of parameters.

void gen(quadtable *table, symbol_list *slist, constant_list *clist, int op, int arg1, int arg2, int arg3)

The specific implementation steps of this function are as follows. First, the size of the quadruple table uses dynamic memory allocation. If the size is insufficient, realloc is performed during insertion:

if (table->size == table->bufsize)
{
   table->bufsize += MAXLEN;
   table->buf = realloc(table->buf, table->bufsize);
}

Then, assign values in sequence and insert the current quadruple. If the values of arg1, arg2, and arg3 are -1, it means that the parameter is not used or the parameter is to be backfilled.

quadruple *t = table->buf + table->size;
t->op = op;
t->arg1 = arg1;
t->arg2 = arg2;
t->arg3 = arg3;

After insertion, it is necessary to generate the intermediate code corresponding to the quadruple. The address of the intermediate code is the size of the current quadruple table plus the starting address of the code CODE_BEGIN, which is set to 100 here. First, use the arg_name function to search the symbol table and constant table, converting arg1, arg2, and arg3 from memory addresses to corresponding symbol names or string representations of constants. If the value of arg is -1, it corresponds to an empty string for subsequent backfilling.

int addr = table->size + CODE_BEGIN;
char code1[MAXLEN];
char code2[MAXLEN];
char code3[MAXLEN];
arg_name(code1, slist, clist, t->arg1);
arg_name(code2, slist, clist, t->arg2);
arg_name(code3, slist, clist, t->arg3);

Then, generate different forms of three-address intermediate code according to different operators. Since the backpatching method is used, the intermediate code uses one label per line instead of forms like L0 and L1.

switch (t->op)
{
case movi:
case movf:
case movc:
case movb:
    sprintf(t->code, "%d:\t%s = %s\n", addr, code3, code1);
    break;
case addi:
case addf:
    sprintf(t->code, "%d:\t%s = %s + %s\n", addr, code3, code1, code2);
    break;
    ...
case and:
    sprintf(t->code, "%d:\t%s = %s && %s\n", addr, code3, code1, code2);
    break;
    ...
case eq:
    sprintf(t->code, "%d:\t%s = %s == %s\n", addr, code3, code1, code2);
    break;
  ...
case invi:
case invf:
    sprintf(t->code, "%d:\t%s = -%s\n", addr, code3, code1);
    break;
case not:
    sprintf(t->code, "%d:\t%s = !%s\n", addr, code3, code1);
    break;
case ftoi:
case ctoi:
case btoi:
    sprintf(t->code, "%d:\t%s = (int)%s\n", addr, code3, code1);
    break;
    ...
case jmp:
    sprintf(t->code, "%d:\tgoto %s\n", addr, code3);
    break;
    ...
case jle:
    sprintf(t->code, "%d:\tif %s <= %s goto %s\n", addr, code1, code2, code3);
    break;
}

Here, an enumeration type of operators is defined. The mov series is for assignment operations, the add series for addition operations, the inv series for negation operations, the to series for type conversion operations, and the j series for jump and branch instructions. The suffixes i, f, c, and b correspond to int, float, char, and bool types respectively. If a branch instruction needs to be backfilled, since the string corresponding to address -1 is an empty string, the newline character can be overwritten after the original string and content can be appended.

5 Expression Translation

5.1 Temporary Variables

Used to implement \textbf{new}\;Temp() in the textbook. First, maintain a global variable temp_count to represent the subscript of the current temporary variable. Then generate a temporary variable name, add it to the current symbol table in the same way as new_symbol, and finally return the address of the generated temporary variable:

int new_temp(symbol_list *list, type type)
{
    symbol *id = malloc(sizeof(symbol));
    sprintf(id->name, "_t%d", temp_count++);
    id->type = type;
    id->width = type_width[type];
    id->addr = list->end;
    list->end += id->width;
    id->next = list->head;
    list->head = id;
    return id->addr;
}

5.2 Type Conversion

Used to implement the widen() function and max() function in the textbook.

The following enumeration type and conversion matrix conv are defined in types.h, corresponding to the op of type conversion. For example, conv[INT][FLOAT] = conv[0][1] = itof, which means the instruction for converting an integer to a floating-point number. -1 means no conversion is needed.

typedef enum type
{
    INT,
    FLOAT,
    CHAR,
    BOOL
} type;

int conv[4][4] = {
    {-1, itof, itoc, itob},
    {ftoi, -1, ftoc, ftob},
    {ctoi, ctof, -1, ctob},
    {btoi, btof, btoc, -1}};

Then, write the widen function. This function first determines whether conversion is needed according to the types before and after conversion. If no conversion is needed, it directly returns the original parameter address; otherwise, it adds a new intermediate variable through new_temp and generates a type conversion intermediate code through gen, returning the address of the intermediate variable.

int widen(quadtable *table, symbol_list *slist, constant_list *clist, int addr, type type1, type type2)
{
    int op = conv[type1][type2];
    if (op != -1)
    {
        int arg3 = new_temp(slist, type2);
        gen(table, slist, clist, op, addr, -1, arg3);
        return arg3;
    }
    return addr;
}

Next, the max array is defined to indicate the type of the result when numerical values of two types perform four arithmetic operations, which is used to provide a basis for the above type conversion.

int max[4][4] = {
    {INT, FLOAT, INT, INT},
    {FLOAT, FLOAT, FLOAT, FLOAT},
    {INT, FLOAT, CHAR, INT},
    {INT, FLOAT, INT, BOOL}
};

For example, max[CHAR][FLOAT] = MAX[2][1] = FLOAT, which means the result of the operation between a character type and a floating-point type should be a floating-point type.

5.3 Four Arithmetic Operations

Take addition as an example:

expr\to expr_1+expr_2\qquad\begin{array}{l}\{\;expr.type=max(expr_1.type,\,expr_2.type);\\ \;\;expr_1.addr=widen(expr_1.addr,\,expr_1.type,\,expr.type);\\ \;\;expr_2.addr=widen(expr_2.addr,\,expr_2.type,\,expr.type);\\ \;\;expr.addr=\textbf{new}\;Temp();\\ \;\;gen(expr.addr\;'\texttt{=}'\;expr_1.addr\;'\texttt{+}'\;expr_2.addr);\;\}\end{array}

First, determine the type of the result through the max array. Then, perform type conversion on the source operands through the widen function (if needed). Create a new intermediate variable to save the result of the expression, and assign the address of this intermediate variable to the address attribute of the non-terminal expr. Finally, incrementally translate the statement through the gen function to generate intermediate code. The specific implementation is as follows:

expr    : expr '+' expr
        {
            $$.type = max[$1.type][$3.type];
            int arg1 = widen(table, slist, clist, $1.addr, $1.type, $$.type);
            int arg2 = widen(table, slist, clist, $3.addr, $3.type, $$.type);
            int arg3 = new_temp(slist, $$.type);
            gen(table, slist, clist, _add[$$.type], arg1, arg2, arg3);
            $$.addr = arg3;
        }

Among them, the _add array is used to determine which addition instruction to use (i.e., addi or addf) according to the type. Subtraction, multiplication, and division are similar, except for the different operators in gen.

5.4 Boolean Operations

Since boolean operations here appear in the production of expr instead of bool, only the value of boolean operations needs to be considered. Similar to four arithmetic operations, but expr.type is specified as BOOL in advance instead of being determined using the max array.

5.5 Unary Operators and Parentheses

For logical negation operations, the type is also set to BOOL in advance, while for negation (negative number) operations, the type is set to the maximum of the operand type and INT, because for floating-point numbers, negation remains a floating-point number, and for integers, characters, and boolean types, negation becomes a positive number. At the same time, since there is only one operand, arg2 needs to be set to -1 when using gen for incremental translation.

For parentheses, it is simpler to directly assign the attributes to the non-terminal symbol on the left side of the production as they are:

expr\to(\;expr_1\;)\qquad\begin{array}{l}\{\;expr.type=expr_1.type;\\\;\;expr.addr=expr_1.addr;\;\}\end{array}

The specific implementation is as follows:

expr    : '!' expr %prec NOT
        {
            $$.type = BOOL;
            int arg1 = widen(table, slist, clist, $2.addr, $2.type, $$.type);
            int arg3 = new_temp(slist, $$.type);
            gen(table, slist, clist, not, arg1, -1, arg3);
            $$.addr = arg3;
        }
        | '-' expr %prec INV
        {
            $$.type = max[$2.type][INT];
            int arg1 = widen(table, slist, clist, $2.addr, $2.type, $$.type);
            int arg3 = new_temp(slist, $$.type);
            gen(table, slist, clist, _inv[$$.type], arg1, -1, arg3);
            $$.addr = arg3;
        }
        | '(' expr ')'
        {
            $$.type = $2.type;
            $$.addr = $2.addr;
        }

5.6 Assignment Statements

After the expression calculation is completed, it may be necessary to assign it to a defined identifier. However, type conversion is still involved here, and the type of the expression needs to be converted to the type of the identifier itself (if needed) through widen, and the intermediate code for assignment is generated:

stmt\to\textbf{id}=expr\qquad\{\;gen(\textbf{id}\;'\texttt{=}'\;widen(expr.addr,\,expr.type.\,\textbf{id}.type);\;\}

In addition, it is necessary to consider whether the identifier has been defined. If it is not found in all symbol tables, an error is reported, but a new identifier can be created immediately for error recovery. The specific implementation is as follows:

stmt    : ID '=' expr ';'
        {
            $$.n_list = NULL;
            symbol *id = find_symbol(slist, $1.name);
            if (id == NULL)
            {
                char msg[MAXLEN];
                sprintf(msg, "undeclared symbol \"%s\"", $1.name);
                yyerror(msg);
                id = new_symbol(slist, $1.name, INT);
            }
            int arg1 = widen(table, slist, clist, $3.addr, $3.type, id->type);
            int arg3 = id->addr;
            gen(table, slist, clist, _mov[id->type], arg1, -1, arg3);
        }

Among them, _mov is used to determine which move (assignment) instruction to use according to the type, including movi, movf, movc, and movb.

6 Control Flow Translation

6.1 Backpatching Technology

The goto_list type is defined in types.h for the backpatching table structure, i.e., t_list, f_list, n_list (corresponding to truelist, falselist, nextlist in textbooks). It stores the address list of jump instructions to be backfilled with target addresses. Related functions are defined, where the first two new_goto_list and merge_goto_list correspond to the makelist() and merge() functions in the textbook respectively.

typedef struct goto_list
{
    int addr;
    struct goto_list *next;
} goto_list;

goto_list *new_goto_list(int addr);
goto_list *merge_goto_list(goto_list *list1, goto_list *list2);
void delete_goto_list(goto_list *list);

At the same time, the backpatching function backpatch() in the textbook is implemented as follows: for a given backpatching table and backpatching address, access the quadruple in the quadruple table, modify the destination address (third parameter arg3) of its jump instruction, and modify the generated intermediate code by appending the string representation of the backpatching address:

void backpatch(quadtable *table, goto_list *list, int addr)
{
    while (list != NULL)
    {
        quadruple *t = table->buf + list->addr;
        t->arg3 = addr;
        sprintf(t->code + strlen(t->code) - 1, "%d\n", addr);
        list = list->next;
    }
}

6.2 Boolean Expressions

For boolean expressions, it is necessary to maintain truelist and falselist for the non-terminal symbol bool. Boolean expressions are translated using the “short-circuit” method of boolean operators, so the production needs to be further appropriately rewritten:

\begin{array}{ll} bool\to bool_1\;||\;M\;bool_2 & \begin{array}{l}\{\;backpatch(bool_1.falselist,\,M.addr);\\\;\;bool.truelist=merge(bool_1.truelist,\,bool_2.truelist);\\\;\;bool.falselist=bool_2.falselist;\;\}\end{array}\\\\ bool\to bool_1\;\&\&\;M\;bool_2 & \begin{array}{l}\{\;backpatch(bool_1.truelist,\,M.addr);\\\;\;bool.truelist=bool_2.truelist;\\\;\;bool.falstlist=merge(bool_1.falselist,\,bool_2.falstlist);\;\}\end{array}\\\\ bool\to\;!\;bool_1 & \begin{array}{l}\{\;bool.truelist=bool_1.falselist;\\\;\;bool.falselist=bool_1.truelist;\;\}\end{array}\\\\ bool\to(\;bool_1\;) & \begin{array}{l}\{\;bool.truelist=bool_1.truelist;\\\;\;bool.falselist=bool_1.falselist;\;\}\end{array}\\\\ M\to\varepsilon & \{\;M.addr=nextinstr;\;\} \end{array}

where nextinstr is the address of the next instruction, i.e., the length of the current quadruple table plus the starting address of the code segment table->size + CODE_BEGIN. The marker symbol M is introduced to record the starting address of the code corresponding to the second boolean expression of the AND and OR operations.

Taking the OR operation as an example to explain the “short-circuit” method. First, backfill the jump address where the condition of bool_1 is false to the starting address of bool_2, because if the first part of the OR operation is false, the second part may still be true, and the result cannot be obtained directly. However, the jump address where the condition of bool_1 is true is merged with that of bool_2, both being the jump address where the condition of bool is true. In this way, when bool_1 is indeed true, the judgment of bool_2 can be skipped and jump directly. The specific implementation is as follows:

M       :
        {   $$.addr = table->size + CODE_BEGIN;  }
bool    : bool OR M bool
        {
            backpatch(table, $1.f_list, $3.addr);
            $$.t_list = merge_goto_list($1.t_list, $4.t_list);
            $$.f_list = $4.f_list;
        }
        | bool AND M bool
        {
            backpatch(table, $1.t_list, $3.addr);
            $$.t_list = $4.t_list;
            $$.f_list = merge_goto_list($1.f_list, $4.f_list);
        }
        | '!' bool %prec NOT
        {
            $$.t_list = $2.f_list;
            $$.f_list = $2.t_list;
        }
        | '(' bool ')'
        {
            $$.t_list = $2.t_list;
            $$.f_list = $2.f_list;
        }

6.3 Conditional Expressions

Boolean operations can also derive conditional expressions. Conditional expressions are divided into two types: one uses relational operators \textbf{rel}, i.e., ==, !=, >, >=, <, <=; the other directly evaluates the expression as a boolean type. The translation is as follows:

\begin{array}{ll} bool\to expr_1\;\textbf{rel}\;expr_2 & \begin{array}{l}\{\;type=max(expr_1.type,\,expr_2.type);\\\;\;expr_1.addr=widen(expr_1.addr,\,expr_1.type,\,type);\\\;\;expr_2.addr=widen(expr_2.addr,\,expr_2.type,\,type);\\\;\;bool.truelist=makelist(nextinstr);\\\;\;bool.falselist=makelist(nextinstr+1);\\\;\;gen('\texttt{if}'\;expr_1.addr\;\textbf{rel}\;expr_2.addr\;'\texttt{goto \_}');\\\;\;gen('\texttt{goto \_}');\;\}\end{array}\\\\ bool\to expr & \begin{array}{l}\{\;type=max[expr_1.type][\textbf{bool}];\\\;\;expr_1.addr=widen(expr_1.addr,\,expr_1.type,\,type);\\\;\;bool.truelist=makelist(nextinstr);\\\;\;bool.falselist=makelist(nextinstr+1);\\\;\;gen('\texttt{if}'\;expr_1.addr\;'\texttt{==}'\;\textbf{true}\;'\texttt{goto \_}');\\\;\;gen('\texttt{goto \_}');\;\}\end{array} \end{array}

For the first case, it is also necessary to convert the parameters on both sides of the relational operator to the same type through max and widen before comparison. Then add the goto statement where the condition is true to the truelist backpatching table, and add the next (condition not met) goto statement to the falselist backpatching table.

For the second case, first convert the expression to the bool type and compare it with the constant true for equality; the rest is similar to the first case. The specific implementations of the two cases are as follows, with the first case taking greater than or equal to as an example:

bool    : expr GE expr
        {
            int arg1 = widen(table, slist, clist, $1.addr, $1.type, max[$1.type][$3.type]);
            int arg2 = widen(table, slist, clist, $3.addr, $3.type, max[$1.type][$3.type]);
            $$.t_list = new_goto_list(table->size + CODE_BEGIN);
            $$.f_list = new_goto_list(table->size + CODE_BEGIN + 1);
            gen(table, slist, clist, jge, arg1, arg2, -1);
            gen(table, slist, clist ,jmp, -1, -1, -1);
        }
        | expr
        {
            int arg1 = widen(table, slist, clist, $1.addr, $1.type, BOOL);
            int arg2 = new_constant(clist, (constant_val)true, BOOL)->addr;
            $$.t_list = new_goto_list(table->size + CODE_BEGIN);
            $$.f_list = new_goto_list(table->size + CODE_BEGIN + 1);
            gen(table, slist, clist, jeq, arg1, arg2, -1);
            gen(table, slist, clist ,jmp, -1, -1, -1);
        }

6.4 Branch Statements

In addition to boolean expressions, it is also necessary to maintain nextlist for general expression statements, i.e., the list of jump instructions to be backfilled with target addresses at the end of the statement. To this end, the above-mentioned marker M needs to be introduced, and a new marker N is added:

\begin{array}{ll} stmt\to\textbf{if}\;(\;bool\;)\;M\;stmt_1 & \begin{array}{l}\{\;backpatch(bool.truelist,\,M.addr);\\\;\;stmt.nextlist=merge(bool.falstlist,\,stmt_1.nextlist);\;\}\end{array}\\\\ stmt\to\textbf{if}\;(\;bool\;)\;M_1\;stmt_1\;N\;\textbf{else}\;M_2\;stmt_2 & \begin{array}{l}\{\;backpatch(bool.truelist,\,M_1.addr);\\\;\;backpatch(bool.falselist,\,M_2.addr);\\\;\;stmt.nextlist=merge(stmt_1.nextlist,\,merge(N.nextlist,\,stmt_2.nextlist));\;\}\end{array}\\\\ M\to\varepsilon & \{\;M.addr=nextinstr;\;\}\\\\ N\to\varepsilon & \begin{array}{l}\{\;N.nextlist=makelist(nextinstr);\\\;\;gen('\texttt{goto \_}');\;\}\end{array} \end{array}

Taking if-else as an example. Backfill the address where the condition of the boolean expression is true to the starting address of stmt_1, and backfill the address where the condition of the boolean expression is false to the starting address of stmt_2. At the same time, generate a jump instruction through the marker N, so that after stmt_1 is executed, it can jump to the end of the if-else statement instead of entering the next else statement. Finally, merge the backpatching tables of the addresses of the next instructions of stmt_1, N, and stmt_2 to form the backpatching table of the address of the next instruction of stmt. The specific implementation is as follows:

N       :
        {
            $$.n_list = new_goto_list(table->size + CODE_BEGIN);
            gen(table, slist, clist, jmp, -1, -1, -1);
        }
stmt    : IF '(' bool ')' M stmt
        {
            backpatch(table, $3.t_list, $5.addr);
            $$.n_list = merge_goto_list($3.f_list, $6.n_list);
        }
        | IF '(' bool ')' M stmt N ELSE M stmt
        {
            backpatch(table, $3.t_list, $5.addr);
            backpatch(table, $3.f_list, $9.addr);
            $$.n_list = merge_goto_list(merge_goto_list($6.n_list, $7.n_list), $10.n_list);
        }

It should be noted here that the production of N needs to be placed before stmt. Due to the original ambiguity problem of the if-else statement (shift-reduce conflict between shifting else and reducing a single if statement), yacc will default to shifting, thus correctly resolving the ambiguity of else. However, after introducing N, it becomes a reduce-reduce conflict between reducing N and reducing a single if statement. When handling such conflicts, yacc determines the priority according to the order of productions, so the production of N needs to be placed first to ensure that N is reduced first and the next else statement is shifted.

6.5 Loop Statements

In addition to conditional judgments that are basically similar to branch statements, to implement break in loops, it is necessary to additionally define a global stack break_lists to store the addresses of break statements in loops at all levels at the beginning of the loop. Therefore, a marker Q needs to be introduced before each loop, and the role of this marker is to push a new breaklist onto the stack. Once a break is encountered, the top of the stack is merged with the address of the newly generated goto instruction. At the end of the loop, the top of the stack is popped and added to the nextlist of stmt.

\begin{array}{ll} stmt\to\textbf{while}\;Q\;M_1\;(\;bool\;)\;M_2\;stmt_1 & \begin{array}{l}\{\;backpatch(stmt_1.nextlist,\,M_1.addr);\\\;\;backpatch(bool.truelist,\,M_2.addr);\\\;\;stmt.nextlist=merge(bool.falselist,\,breaklists.pop());\\\;\;gen('\texttt{goto}'\;M_1.addr);\;\}\end{array}\\\\ stmt\to\textbf{do}\;Q\;M_1\;stmt_1\;M_2\;\textbf{while}\;(\;bool\;)\;; & \begin{array}{l}\{\;backpatch(bool.truelist,\,M_1.addr);\\\;\;backpatch(stmt_1.nextlist,\,M_2.addr);\\\;\;stmt.nextlist=merge(bool.falstlist,\,breaklists.pop());\;\}\end{array}\\\\ stmt\to\textbf{break}; & \begin{array}{l}\{\;stmt.nextlist=\textbf{null};\\\;\;breaklists.top=merge(breaklists.top,\,makelist(nextinstr));\\\;\;gen('\texttt{goto \_}');\;\}\end{array}\\\\ Q\to\varepsilon & \{\;breaklists.push(\textbf{null});\;\} \end{array}

At the same time, for while loops, an additional goto instruction needs to be generated at the end of the loop to jump to the beginning of the loop. However, do-while does not need this; it is directly determined according to the truelist of the boolean expression. The specific implementation is as follows:

Q       :
        {   break_lists[tos++] = NULL;  }
stmt    : WHILE Q M '(' bool ')' M stmt
        {
            backpatch(table, $8.n_list, $3.addr);
            backpatch(table, $5.t_list, $7.addr);
            $$.n_list = merge_goto_list($5.f_list, break_lists[--tos]);
            gen(table, slist, clist, jmp, -1, -1, $3.addr);
        }
        | DO Q M stmt WHILE M '(' bool ')' ';'
        {
            backpatch(table, $8.t_list, $3.addr);
            backpatch(table, $4.n_list, $6.addr);
            $$.n_list = merge_goto_list($8.f_list, break_lists[--tos]);
        }
        | BREAK ';'
        {
            $$.n_list = NULL;
            if (tos == 0)
                yyerror("\"break\" statement doesn't match any loop");
            else
            {
                break_lists[tos - 1] = merge_goto_list(break_lists[tos - 1], new_goto_list(table->size + CODE_BEGIN));
                gen(table, slist, clist, jmp, -1, -1, -1);
            }
        }

6.6 Backpatching nextlist

The nextlist generated by the above control flow statements temporarily does not determine the address of the next statement of the statement, so it is recorded and left for backfilling later. Therefore, at the end of each statement, the starting address of the next statement is recorded through M, and the backpatching operation is performed:

stmts\to stmts\;stmt_1\;M\qquad\{\;backpatch(stmt_1.nextlist,\,M.addr);\;\}

The specific implementation is as follows:

stmts   : stmts stmt M
        {   backpatch(table, $2.n_list, $3.addr);  }

At the same time, for other stmt productions, such as stmt\to\textbf{id} and stmt\to block, their nexlist should be set to null to prevent errors caused by uninitialized pointers during merge_goto_list.

7 Running Results

7.1 Compilation

Compile in the terminal using the following commands, where yacc uses the -v command to output state transition graph information to y.output:

lex lexer.l
yacc -v -d grammar.y
cc types.c y.tab.c -ly -ll

During yacc compilation, it prompts 5 shift-reduce conflicts and 22 reduce-reduce conflicts, and the specific conditions of the conflicts can be viewed in y.output. The 5 shift-reduce conflicts are conflicts between reducing bool\to expr and shifting AND, OR, and shifting should be prioritized, so the default action is taken. The 21 reduce-reduce conflicts are caused by some of the same productions between bool and expr. Since the production of bool is placed before that of expr, bool is prioritized for reduction, i.e., jump code is generated first instead of evaluating boolean expressions. There is another reduce-reduce conflict caused by the introduction of marker N due to the ambiguity of if-else mentioned above, and the solution is to place the production N in front to prioritize reducing N and shifting else.

If the following test sample is saved as test.c:

{
    int i;
    int sum;
    i = 2;
    while (i <= 100)
    {
        sum = sum + i;
        i = i + 2;
    }
}

Run the following command to get the output result:

./a.out < test.c

The program prints the symbol table, global constant table, quadruple table in hierarchical order, and finally prints the intermediate code corresponding to the quadruples.

7.2 Symbol Table and Constant Table Test

For the following program:

{
    int a; float b; char c;
    a = 137; b = 3.14159;
    if (a == 137)
    {
        bool d; int e;
        d = true; e = 137;
    }
    else
    {
        char d; float e;
        d = '*'; e = 3.14159;
    }
    c = '*';
}

The program outputs the following symbol table and constant table. It can be seen that the three blocks are divided into 3 symbol tables, and duplicate symbol names in different blocks have no impact. The constant table also does not insert the same constant repeatedly:

7.3 Expression Translation Test

Use the following test code involving operator precedence, boolean expression evaluation, and type conversion:

{
    int a;
    float b;
    bool result;
    a = 1 * 1;
    b = a / 2 + 1;
    result = b != 1.5 && a == 1;
}

The translation result is as follows:

It can be seen that for integer division such as a / 2, the translation is correct, and the result is still an integer until it is assigned to b, which is then converted to a floating-point number.

7.4 Control Flow Translation Test

Use the following test code involving translation of if, if-else, while, do-while, break, and reflection of short-circuit code:

{
    int i; int j; float sum;
    i = 0; j = 0;
    while(true)
    {
        i = i + 1;
        do
        {
            sum = sum + 1.0 / i;
            if (sum > 3) break;
            else j = j + 1;
        } while (j < 100);
        if (i >= 100 || sum > 3) break;
    }
}

The result is as follows:

It can be seen that these goto statements are all correctly backfilled.

7.5 Error Recovery Test

Test with the following error-prone program: first, duplicate definition of a in the third line, missing semicolon in the fourth line, reference to undefined variable x in the sixth line, and break not inside any loop statement:

{
    int a; int b;
    float a;
    a = 1
    b = 0;
    if (x < 1)
    {
        a = a - 1;
        break;
    }
}

The program output is as follows, showing all errors:

Download source code here