img
Question:
Published on: 3 December, 2024

Write short notes on:

  • LEX
  • YACC
Answer:

 

LEX:Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem oriented specification for character string matching, and produces a program in a general purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source specifications given to Lex. The Lex written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed.

The general format of Lex source is:

                            {definitions}

                             %%

                            {rules}

                             %%

                            {user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is thus

                                     %%

(no definitions, no rules) which translates into a program which copies the input to the output unchanged.

The general form of a Lex source file is:

                             {definitions}

                             %%

                             {rules}

                             %%

                             {user subroutines}

The definitions section contains a combination of

1) Definitions, in the form ``name space translation''.

2) Included code, in the form ``space code''.

3) Included code, in the form

                                       %

                                       code

                                       %}

4) Start conditions, given in the form

                                %S name1 name2 ...

5) Character set tables, in the form

                          %T

                          number space character-string

                          ...

                          %T

6) Changes to internal array sizes, in the form

                                     %x  nnn

where nnn is a decimal integer representing an array size and x selects the parameter as follows:

                        Letter          Parameter

                                positions

                          n      states

                          e      tree nodes

                          a      transitions

                          k      packed character classes

                          o      output array size

Lines in the rules section have the form ``expression action'' where the action may be continued on succeeding lines by using braces to delimit it.

Regular expressions in Lex use the following operators:

               x        the character "x"

               "x"      an "x", even if x is an operator

               \x       an "x", even if x is an operator.

               [xy]     the character x or y.

               [x-z]    the characters x, y or z.

               [^x]     any character but x.

               .        any character but newline.

               ^x       an x at the beginning of a line.

               <y>x     an x when Lex is in start condition y.

               x$       an x at the end of a line.

               x?       an optional x.

               x*       0,1,2, ... instances of x.

               x+       1,2,3, ... instances of x.

               x|y      an x or a y.

               (x)      an x.

               x/y      an x but only if followed by y.

               {xx}     the translation of xx from the

                        definitions section.

               x{m,n}   m through n occurrences of x

The programs generated by Lex handle character I/O only through the routines input, output, and unput. Thus the character representation provided in these routines is accepted by Lex and employed to return values in yytext.

 

YACC:  Yacc (for "yet another compiler compiler.") is the standard parser generator for the Unix operating system. An open source program, yacc generates code for the parser in the C programming language. The acronym is usually rendered in lowercase but is occasionally seen as YACC or Yacc. The original version of yacc was written by Stephen Johnson at American Telephone and Telegraph (AT&T). Versions of yacc have since been written for use with Ada, Java and several other less well-known programming languages.

yacc is designed for use with C code and generates a parser written in C. The parser is configured for use in conjunction with a lex-generated scanner and relies on standard shared features (token types, yylval, etc.) and calls the function yylex as a scanner coroutine. You provide a grammar specification file, which is traditionally named using a .y extension. You invoke yacc on the .y file and it creates the y.tab.h and y.tab.c files containing a thousand or so lines of intense C code that implements an efficient LALR(1) parser for your grammar, including the code for the actions you specified. The file provides an extern function yyparse.y that will attempt to successfully parse a valid sentence. You compile that C file normally, link with the rest of your code, and you have a parser! By default, the parser reads from stdin and writes to stdout, just like a lex-generated scanner does.

% yacc myFile.y                                                //creates y.tab.c of C code for parser

% gcc -c y.tab.c                                                //compiles parser code

% gcc –o parse y.tab.o lex.yy.o –ll -ly                //link parser, scanner, libraries

                                                 //invokes parser reads from stdin

 

 

yacc File Format

The heart of the input specification is a collection of grammar rules. Each rule describes an allowable structure and gives it a name. For example, one grammar rule might be

        date  :  month_name  day  ','  year   ;

Here, date, month_name, day, and year represent structures of interest in the input process; presumably, month_name, day, and year are defined elsewhere. The comma ``,'' is enclosed in single quotes; this implies that the comma is to appear literally in the input. The colon and semicolon merely serve as punctuation in the rule, and have no significance in controlling the input.

Names refer to either tokens or nonterminal symbols. Yacc requires token names to be declared as such. In addition, for reasons discussed in Section 3, it is often desirable to include the lexical analyzer as part of the specification file; it may be useful to include other programs as well. Thus, every specification file consists of three sections: the declarations, (grammar) rules, and programs. The sections are separated by double percent ``%%'' marks. (The percent ``%'' is generally used in Yacc specifications as an escape character.)

In other words, a full specification file looks like 

%{

Declarations

%}

Definitions

%%

Productions

%%

User subroutines  

The optional Declarations and User subroutines sections are used for ordinary C code that we want copied verbatim to the generated C file, declarations are copied to the top of the file, user subroutines to the bottom. The optional Definitions section is where we configure various parser features such as defining token codes, establishing operator precedence and associatively, and setting up the global variables used to communicate between the scanner and parser. The required Productions section is where we specify the grammar rules. As in lex, we can associate an action with each pattern (this time a production), which allows us to do whatever processing is needed as we reduce using that production.The declaration section may be empty. Moreover, if the programs section is omitted, the second %% mark may be omitted also;

A grammar rule has the form:

        A  :  BODY  ;

A represents a nonterminal name, and BODY represents a sequence of zero or more names and literals. The colon and the semicolon are Yacc punctuation.

Names may be of arbitrary length, and may be made up of letters, dot ``.'', underscore ``_'', and non-initial digits. Upper and lower case letters are distinct. The names used in the body of a grammar rule may represent tokens or nonterminal symbols.

A literal consists of a character enclosed in single quotes ``'''. As in C, the backslash ``\'' is an escape character within literals, and all the C escapes are recognized. Thus

        '\n'           newline

        '\r'             return

        '\''            single quote ``'''

        '\\'           backslash ``\''

        '\t'             tab

        '\b'            backspace

        '\f'            form feed

        '\xxx'        “xxx” in octal

For a number of technical reasons, the NUL character ('\0' or 0) should never be used in grammar rules.

If there are several grammar rules with the same left hand side, the vertical bar ``|'' can be used to avoid rewriting the left hand side. In addition, the semicolon at the end of a rule can be dropped before a vertical bar. Thus the grammar rules

        A       :       B  C  D   ;

        A       :       E  F   ;

        A       :       G   ;

can be given to Yacc as

        A       :       B  C  D

                  |       E  F

                  |       G

                  ;

It is not necessary that all grammar rules with the same left side appear together in the grammar rules section, although it makes the input much more readable, and easier to change.

If a nonterminal symbol matches the empty string, this can be indicated in the obvious way:

        empty :   ;

Names representing tokens must be declared; this is most simply done by writing

        %token   name1  name2 . . .

in the declarations section. (See Sections 3 , 5, and 6 for much more discussion). Every name not defined in the declarations section is assumed to represent a nonterminal symbol. Every nonterminal symbol must appear on the left side of at least one rule.

Of all the nonterminal symbols, one, called the start symbol, has particular importance. The parser is designed to recognize the start symbol; thus, this symbol represents the largest, most general structure described by the grammar rules. By default, the start symbol is taken to be the left hand side of the first grammar rule in the rules section. It is possible, and in fact desirable, to declare the start symbol explicitly in the declarations section using the %start keyword:

        %start   symbol

The end of the input to the parser is signaled by a special token, called the endmarker. If the tokens up to, but not including, the endmarker form a structure which matches the start symbol, the parser function returns to its caller after the endmarker is seen; it accepts the input. If the endmarker is seen in any other context, it is an error.

It is the job of the user-supplied lexical analyzer to return the endmarker when appropriate;

 

 

Random questions