LL(1) Parser for Battle Turing

Tags:

By: Aaron Jencks

Views: 2894
Posted on:

April 30, 2022, 10:13 a.m.

0 👍 0 👎

The first step for parsing, is implementing a grammar. We've already defined the grammar we're going to use in my previous post Context Free Grammar. Below I'm going to demonstrate how to implement the grammar table in python. Keep in mind for this implementation

  • 0 represents EOF
  • -1 represents epsilon

For help understanding the rest of the numbers, I'll refer you back to the list of token values I generated previously Battle Turing's List of Tokens and Operators

I wanted to use a table driven parser, so that I could create an attributed grammar system in the next step of the process, to do this requires 4 steps:

  1. Generate first table
  2. Generate follows table
  3. Generate first+ table
  4. Generate the LL(1) Parsing Table

First

Basically the concept of first is that for every non-terminal in your context free grammar (CFG) you determine the set of first terminal symbols that can be seen by it. I've already implemented a table for it in python, see below:

first_set = [
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 0, -1],                    # S
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, -1],                                 # E
    [302, 303, 305],                                                                        # F
    [ord('{')],                                                                             # B
    [ord('(')],                                                                             # H
    [ord('{')],                                                                             # T
    [ord('='), ord('<'), ord('>'), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!')],  # X
    [ord('='), ord('<'), ord('>'), 311, 314, 315],                                          # O
    [ord('^'), 300, 301],                                                                   # Y
    [ord('^'), 300],                                                                        # N
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 319, 320, 303, 305, -1],                  # U
    [303, 305],                                                                             # I
    [304, -1],                                                                              # J
    [304, -1],                                                                              # K
    [301, -1],                                                                              # L
    [ord('{')],                                                                             # M
    [303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, -1],                            # P
    [303, 305]                                                                              # R
]

This is helpful because the type of parser we're generating is considered backtracking free, which means that using a lookahead symbol, it can accurately determine which production to go to and the first set helps to narrow that decision down.

Follows

The concept of the follows set is that it's the opposite of the first set, basically, for each non-terminal in the CFG, we want to determine the set of terminals that would appear immediately following it, this is helpful for productions that have epsilon, or empty strings, this tells the parser which production to go to after setting the current one to empty. I've implemented this one in python too, see blow:

follow_set = [
    [ord('}'), 0],                                                                          # S
    [ord(';')],                                                                             # E
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0],                   # F
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 304, 0],              # B
    [ord('{')],                                                                             # H
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 0],         # T
    [ord('='), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!'), ord(')')],            # X
    [ord('^'), 300, 301],                                                                   # O
    [ord('='), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!'), ord(')')],            # Y
    [ord('^'), 306, 300, ord(')')],                                                         # N
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0, ord('}')],    # U
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0],              # I
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0],                   # J
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0],              # K
    [ord(';')],                                                                             # L
    [304, 302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0, -1],          # M
    [ord('}')],                                                                             # P
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0],                   # R
]

First+

Okay, so now that the first and follow tables have been compiled, it's time to generate the first+ table, this is a combination of the latter two. If the first set of a production does not include epsilon, then the first+ of that non-terminal is the first set, otherwise, it's the union of the first and follows sets, see below:

first_plus_set = [
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0, ord('}'), -1],     # S
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, ord(';'), -1],                       # E
    [302, 303, 305],                                                                        # F
    [ord('{')],                                                                             # B
    [ord('(')],                                                                             # H
    [ord('{')],                                                                             # T
    [ord('='), ord('<'), ord('>'), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!')],  # X
    [ord('='), ord('<'), ord('>'), 311, 314, 315],                                          # O
    [ord('^'), 306, 300, 301],                                                              # Y
    [ord('^'), 306, 300],                                                                   # N
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, -1, ord('}')],   # U
    [303, 305],                                                                             # I
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0, 304, -1],          # J
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0, 304, -1],     # K
    [301, -1, ord(';')],                                                                    # L
    [ord('{')],                                                                             # M
    [303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, -1, ord('}')],             # P
    [303, 305]                                                                              # R
]

Now that we have the first+ table we are ready to generate our table driven grammar!

Grammar Table

I ran into a few different challenges with this:

  1. I needed to figure out how to represent all of the different productions from the CFG in the table, for this I created a productions list, which I've put below
  2. I struggled to figure out how I wanted to represent the LL(1) table in python, I decided to use a list of dicts, that way I can omit any tokens that aren't accepted by a given non-terminal
production_table = [
    [ProductionEnum.E, ord(';'), ProductionEnum.S],                 # S         0
    [ProductionEnum.F, ProductionEnum.S],
    [ProductionEnum.EOF],
    [309, ProductionEnum.L],                                        # E         3
    [310, ProductionEnum.L],
    [313, ProductionEnum.L],
    [312, ProductionEnum.L],
    [307, 300],
    [318],
    [322],
    [323],
    [321],
    [302, ord(':'), ProductionEnum.B],                              # F         12
    [303, ProductionEnum.H, ProductionEnum.M, ProductionEnum.J],    # R         13
    [305, ProductionEnum.H, ProductionEnum.T],
    [ord('{'), ProductionEnum.S, ord('}')],                         # B         15
    [ord('('), ProductionEnum.X, ord(')')],                         # H         16
    [ord('{'), ProductionEnum.U, ord('}')],                         # T         17
    [ProductionEnum.O, ProductionEnum.Y],                           # X         18
    [316],
    [317],
    [ord('&'), ProductionEnum.X, ProductionEnum.X],
    [ord('|'), ProductionEnum.X, ProductionEnum.X],
    [ord('=')],                                                     # O         23
    [311],
    [ord('<')],
    [ord('>')],
    [314],
    [315],
    [ProductionEnum.N, ProductionEnum.N],                           # Y         29
    [301, 301],
    [ord('^')],                                                     # N         31
    [306],
    [300],
    [ProductionEnum.E, ord(';'), ProductionEnum.U],                 # U         34
    [319, ord(';')],
    [320, ord(';')],
    [ProductionEnum.I, ProductionEnum.U],
    [303, ProductionEnum.H, ProductionEnum.T, ProductionEnum.K],    # I         38
    [305, ProductionEnum.H, ProductionEnum.T],
    [304, ProductionEnum.M],                                        # J         40
    [304, ProductionEnum.T],                                        # K         41
    [301],                                                          # L         42
    [],                                                             # Epsilon   43
    [ord('{'), ProductionEnum.P, ord('}')],                         # M         44
    [ProductionEnum.E, ord(';'), ProductionEnum.P],                 # P         45
    [ord('!'), ProductionEnum.X],                                   # X         46
    [308, 302],                                                     # E         47
    [ProductionEnum.R],                                             # F         48
]

parse_table = [
    {
        0: 2,
        302: 1,
        303: 1,
        305: 1,
        307: 0,
        309: 0,
        310: 0,
        312: 0,
        313: 0,
        318: 0,
        321: 0,
        322: 0,
        323: 0,
        308: 0,
        ord('}'): 43
    },
    {
        307: 7,
        309: 3,
        310: 4,
        312: 6,
        313: 5,
        318: 8,
        321: 11,
        322: 9,
        323: 10,
        308: 47,
        ord(';'): 43
    },
    {
        302: 12,
        303: 48,
        305: 48
    },
    {
        ord('{'): 15
    },
    {
        ord('('): 16
    },
    {
        ord('{'): 17
    },
    {
        ord('='): 18,
        311: 18,
        314: 18,
        315: 18,
        316: 19,
        317: 20,
        ord('&'): 21,
        ord('|'): 22,
        ord('!'): 46
    },
    {
        ord('='): 23,
        311: 24,
        ord('<'): 25,
        ord('>'): 26,
        314: 27,
        315: 28
    },
    {
        ord('^'): 29,
        306: 29,
        300: 29,
        301: 30
    },
    {
        ord('^'): 31,
        306: 32,
        300: 33
    },
    {
        307: 34,
        309: 34,
        310: 34,
        312: 34,
        313: 34,
        318: 34,
        319: 35,
        320: 36,
        321: 34,
        322: 34,
        323: 34,
        308: 34,
        303: 37,
        305: 37,
        ord('}'): 43
    },
    {
        303: 38,
        305: 39
    },
    {
        302: 43,
        303: 43,
        305: 43,
        307: 43,
        309: 43,
        310: 43,
        312: 43,
        313: 43,
        318: 43,
        321: 43,
        322: 43,
        323: 43,
        308: 43,
        0: 43,
        304: 40
    },
    {
        303: 43,
        305: 43,
        307: 43,
        309: 43,
        310: 43,
        312: 43,
        313: 43,
        318: 43,
        321: 43,
        322: 43,
        323: 43,
        308: 43,
        0: 43,
        304: 40
    },
    {
        301: 42,
        ord(';'): 43
    },
    {
        ord('{'): 44
    },
    {
        ord('}'): 43,
        303: 45,
        305: 45,
        307: 45,
        309: 45,
        310: 45,
        312: 45,
        313: 45,
        318: 45,
        321: 45,
        322: 45,
        323: 45,
        308: 45
    },
    {
        303: 13,
        305: 14
    }
]

And there we have it, next time I'll be implementing an attributed grammar system that can parse this!

Comments

You are not logged in, please log in here to comment

Current Comments