Battle Turing

LL(1) Parser for Battle Turing

By: Aaron Jencks

Views: 3150
Posted on:

April 30, 2022, 10:13 a.m.

The first step for parsing, is implementing a grammar. We've already defined the grammar we're going to use in my previous post Context Free Grammar. Below I'm going to demonstrate how to implement the grammar table in python. Keep in mind for this implementation

0 represents EOF
-1 represents epsilon

For help understanding the rest of the numbers, I'll refer you back to the list of token values I generated previously Battle Turing's List of Tokens and Operators

I wanted to use a table driven parser, so that I could create an attributed grammar system in the next step of the process, to do this requires 4 steps:

Generate first table
Generate follows table
Generate first+ table
Generate the LL(1) Parsing Table

First

Basically the concept of first is that for every non-terminal in your context free grammar (CFG) you determine the set of first terminal symbols that can be seen by it. I've already implemented a table for it in python, see below:

first_set = [
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 0, -1],                    # S
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, -1],                                 # E
    [302, 303, 305],                                                                        # F
    [ord('{')],                                                                             # B
    [ord('(')],                                                                             # H
    [ord('{')],                                                                             # T
    [ord('='), ord('<'), ord('>'), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!')],  # X
    [ord('='), ord('<'), ord('>'), 311, 314, 315],                                          # O
    [ord('^'), 300, 301],                                                                   # Y
    [ord('^'), 300],                                                                        # N
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 319, 320, 303, 305, -1],                  # U
    [303, 305],                                                                             # I
    [304, -1],                                                                              # J
    [304, -1],                                                                              # K
    [301, -1],                                                                              # L
    [ord('{')],                                                                             # M
    [303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, -1],                            # P
    [303, 305]                                                                              # R
]

This is helpful because the type of parser we're generating is considered backtracking free, which means that using a lookahead symbol, it can accurately determine which production to go to and the first set helps to narrow that decision down.

Follows

The concept of the follows set is that it's the opposite of the first set, basically, for each non-terminal in the CFG, we want to determine the set of terminals that would appear immediately following it, this is helpful for productions that have epsilon, or empty strings, this tells the parser which production to go to after setting the current one to empty. I've implemented this one in python too, see blow:

follow_set = [
    [ord('}'), 0],                                                                          # S
    [ord(';')],                                                                             # E
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0],                   # F
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 304, 0],              # B
    [ord('{')],                                                                             # H
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 0],         # T
    [ord('='), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!'), ord(')')],            # X
    [ord('^'), 300, 301],                                                                   # O
    [ord('='), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!'), ord(')')],            # Y
    [ord('^'), 306, 300, ord(')')],                                                         # N
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0, ord('}')],    # U
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0],              # I
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0],                   # J
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0],              # K
    [ord(';')],                                                                             # L
    [304, 302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0, -1],          # M
    [ord('}')],                                                                             # P
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0],                   # R
]

First+

Okay, so now that the first and follow tables have been compiled, it's time to generate the first+ table, this is a combination of the latter two. If the first set of a production does not include epsilon, then the first+ of that non-terminal is the first set, otherwise, it's the union of the first and follows sets, see below:

first_plus_set = [
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0, ord('}'), -1],     # S
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, ord(';'), -1],                       # E
    [302, 303, 305],                                                                        # F
    [ord('{')],                                                                             # B
    [ord('(')],                                                                             # H
    [ord('{')],                                                                             # T
    [ord('='), ord('<'), ord('>'), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!')],  # X
    [ord('='), ord('<'), ord('>'), 311, 314, 315],                                          # O
    [ord('^'), 306, 300, 301],                                                              # Y
    [ord('^'), 306, 300],                                                                   # N
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, -1, ord('}')],   # U
    [303, 305],                                                                             # I
    [302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0, 304, -1],          # J
    [307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0, 304, -1],     # K
    [301, -1, ord(';')],                                                                    # L
    [ord('{')],                                                                             # M
    [303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, -1, ord('}')],             # P
    [303, 305]                                                                              # R
]

Now that we have the first+ table we are ready to generate our table driven grammar!

Grammar Table

I ran into a few different challenges with this:

I needed to figure out how to represent all of the different productions from the CFG in the table, for this I created a productions list, which I've put below
I struggled to figure out how I wanted to represent the LL(1) table in python, I decided to use a list of dicts, that way I can omit any tokens that aren't accepted by a given non-terminal

production_table = [
    [ProductionEnum.E, ord(';'), ProductionEnum.S],                 # S         0
    [ProductionEnum.F, ProductionEnum.S],
    [ProductionEnum.EOF],
    [309, ProductionEnum.L],                                        # E         3
    [310, ProductionEnum.L],
    [313, ProductionEnum.L],
    [312, ProductionEnum.L],
    [307, 300],
    [318],
    [322],
    [323],
    [321],
    [302, ord(':'), ProductionEnum.B],                              # F         12
    [303, ProductionEnum.H, ProductionEnum.M, ProductionEnum.J],    # R         13
    [305, ProductionEnum.H, ProductionEnum.T],
    [ord('{'), ProductionEnum.S, ord('}')],                         # B         15
    [ord('('), ProductionEnum.X, ord(')')],                         # H         16
    [ord('{'), ProductionEnum.U, ord('}')],                         # T         17
    [ProductionEnum.O, ProductionEnum.Y],                           # X         18
    [316],
    [317],
    [ord('&'), ProductionEnum.X, ProductionEnum.X],
    [ord('|'), ProductionEnum.X, ProductionEnum.X],
    [ord('=')],                                                     # O         23
    [311],
    [ord('<')],
    [ord('>')],
    [314],
    [315],
    [ProductionEnum.N, ProductionEnum.N],                           # Y         29
    [301, 301],
    [ord('^')],                                                     # N         31
    [306],
    [300],
    [ProductionEnum.E, ord(';'), ProductionEnum.U],                 # U         34
    [319, ord(';')],
    [320, ord(';')],
    [ProductionEnum.I, ProductionEnum.U],
    [303, ProductionEnum.H, ProductionEnum.T, ProductionEnum.K],    # I         38
    [305, ProductionEnum.H, ProductionEnum.T],
    [304, ProductionEnum.M],                                        # J         40
    [304, ProductionEnum.T],                                        # K         41
    [301],                                                          # L         42
    [],                                                             # Epsilon   43
    [ord('{'), ProductionEnum.P, ord('}')],                         # M         44
    [ProductionEnum.E, ord(';'), ProductionEnum.P],                 # P         45
    [ord('!'), ProductionEnum.X],                                   # X         46
    [308, 302],                                                     # E         47
    [ProductionEnum.R],                                             # F         48
]

parse_table = [
    {
        0: 2,
        302: 1,
        303: 1,
        305: 1,
        307: 0,
        309: 0,
        310: 0,
        312: 0,
        313: 0,
        318: 0,
        321: 0,
        322: 0,
        323: 0,
        308: 0,
        ord('}'): 43
    },
    {
        307: 7,
        309: 3,
        310: 4,
        312: 6,
        313: 5,
        318: 8,
        321: 11,
        322: 9,
        323: 10,
        308: 47,
        ord(';'): 43
    },
    {
        302: 12,
        303: 48,
        305: 48
    },
    {
        ord('{'): 15
    },
    {
        ord('('): 16
    },
    {
        ord('{'): 17
    },
    {
        ord('='): 18,
        311: 18,
        314: 18,
        315: 18,
        316: 19,
        317: 20,
        ord('&'): 21,
        ord('|'): 22,
        ord('!'): 46
    },
    {
        ord('='): 23,
        311: 24,
        ord('<'): 25,
        ord('>'): 26,
        314: 27,
        315: 28
    },
    {
        ord('^'): 29,
        306: 29,
        300: 29,
        301: 30
    },
    {
        ord('^'): 31,
        306: 32,
        300: 33
    },
    {
        307: 34,
        309: 34,
        310: 34,
        312: 34,
        313: 34,
        318: 34,
        319: 35,
        320: 36,
        321: 34,
        322: 34,
        323: 34,
        308: 34,
        303: 37,
        305: 37,
        ord('}'): 43
    },
    {
        303: 38,
        305: 39
    },
    {
        302: 43,
        303: 43,
        305: 43,
        307: 43,
        309: 43,
        310: 43,
        312: 43,
        313: 43,
        318: 43,
        321: 43,
        322: 43,
        323: 43,
        308: 43,
        0: 43,
        304: 40
    },
    {
        303: 43,
        305: 43,
        307: 43,
        309: 43,
        310: 43,
        312: 43,
        313: 43,
        318: 43,
        321: 43,
        322: 43,
        323: 43,
        308: 43,
        0: 43,
        304: 40
    },
    {
        301: 42,
        ord(';'): 43
    },
    {
        ord('{'): 44
    },
    {
        ord('}'): 43,
        303: 45,
        305: 45,
        307: 45,
        309: 45,
        310: 45,
        312: 45,
        313: 45,
        318: 45,
        321: 45,
        322: 45,
        323: 45,
        308: 45
    },
    {
        303: 13,
        305: 14
    }
]

And there we have it, next time I'll be implementing an attributed grammar system that can parse this!

Comments

You are not logged in, please log in here to comment

Current Comments

You are not logged in, please log in here

About the Author

Aaron Jencks

Hi, I'm Aaron! Someday I'm going to become the leading researcher in AI and create true artificial intelligence. I spend a lot of my time coding, usually in python, but if the occasion calls for it, I enjoy working in C/C++ as well. Check out my github!

Aaron Jencks's Programming Languages

Contact

email: aaronjencks@gmail.com

Join our new Discord!

Menu