The first step for parsing, is implementing a grammar. We've already defined the grammar we're going to use in my previous post Context Free Grammar. Below I'm going to demonstrate how to implement the grammar table in python. Keep in mind for this implementation
For help understanding the rest of the numbers, I'll refer you back to the list of token values I generated previously Battle Turing's List of Tokens and Operators
I wanted to use a table driven parser, so that I could create an attributed grammar system in the next step of the process, to do this requires 4 steps:
first
tablefollows
tablefirst+
tableBasically the concept of first
is that for every non-terminal in your context free grammar (CFG) you determine the set of first terminal symbols that can be seen by it. I've already implemented a table for it in python, see below:
first_set = [
[302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 0, -1], # S
[307, 309, 310, 312, 313, 318, 321, 322, 323, 308, -1], # E
[302, 303, 305], # F
[ord('{')], # B
[ord('(')], # H
[ord('{')], # T
[ord('='), ord('<'), ord('>'), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!')], # X
[ord('='), ord('<'), ord('>'), 311, 314, 315], # O
[ord('^'), 300, 301], # Y
[ord('^'), 300], # N
[307, 309, 310, 312, 313, 318, 321, 322, 323, 319, 320, 303, 305, -1], # U
[303, 305], # I
[304, -1], # J
[304, -1], # K
[301, -1], # L
[ord('{')], # M
[303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, -1], # P
[303, 305] # R
]
This is helpful because the type of parser we're generating is considered backtracking free, which means that using a lookahead symbol, it can accurately determine which production to go to and the first
set helps to narrow that decision down.
The concept of the follows
set is that it's the opposite of the first
set, basically, for each non-terminal in the CFG, we want to determine the set of terminals that would appear immediately following it, this is helpful for productions that have epsilon, or empty strings, this tells the parser which production to go to after setting the current one to empty. I've implemented this one in python too, see blow:
follow_set = [
[ord('}'), 0], # S
[ord(';')], # E
[302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0], # F
[302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 304, 0], # B
[ord('{')], # H
[302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 0], # T
[ord('='), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!'), ord(')')], # X
[ord('^'), 300, 301], # O
[ord('='), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!'), ord(')')], # Y
[ord('^'), 306, 300, ord(')')], # N
[307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0, ord('}')], # U
[307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0], # I
[302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0], # J
[307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0], # K
[ord(';')], # L
[304, 302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0, -1], # M
[ord('}')], # P
[302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0], # R
]
Okay, so now that the first
and follow
tables have been compiled, it's time to generate the first+
table, this is a combination of the latter two. If the first
set of a production does not include epsilon, then the first+
of that non-terminal is the first
set, otherwise, it's the union of the first
and follows
sets, see below:
first_plus_set = [
[302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0, ord('}'), -1], # S
[307, 309, 310, 312, 313, 318, 321, 322, 323, 308, ord(';'), -1], # E
[302, 303, 305], # F
[ord('{')], # B
[ord('(')], # H
[ord('{')], # T
[ord('='), ord('<'), ord('>'), 311, 314, 315, 316, 317, ord('&'), ord('|'), ord('!')], # X
[ord('='), ord('<'), ord('>'), 311, 314, 315], # O
[ord('^'), 306, 300, 301], # Y
[ord('^'), 306, 300], # N
[307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, -1, ord('}')], # U
[303, 305], # I
[302, 303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 0, 304, -1], # J
[307, 309, 310, 312, 313, 318, 321, 322, 323, 308, 319, 320, 303, 305, 0, 304, -1], # K
[301, -1, ord(';')], # L
[ord('{')], # M
[303, 305, 307, 309, 310, 312, 313, 318, 321, 322, 323, 308, -1, ord('}')], # P
[303, 305] # R
]
Now that we have the first+
table we are ready to generate our table driven grammar!
I ran into a few different challenges with this:
production_table = [
[ProductionEnum.E, ord(';'), ProductionEnum.S], # S 0
[ProductionEnum.F, ProductionEnum.S],
[ProductionEnum.EOF],
[309, ProductionEnum.L], # E 3
[310, ProductionEnum.L],
[313, ProductionEnum.L],
[312, ProductionEnum.L],
[307, 300],
[318],
[322],
[323],
[321],
[302, ord(':'), ProductionEnum.B], # F 12
[303, ProductionEnum.H, ProductionEnum.M, ProductionEnum.J], # R 13
[305, ProductionEnum.H, ProductionEnum.T],
[ord('{'), ProductionEnum.S, ord('}')], # B 15
[ord('('), ProductionEnum.X, ord(')')], # H 16
[ord('{'), ProductionEnum.U, ord('}')], # T 17
[ProductionEnum.O, ProductionEnum.Y], # X 18
[316],
[317],
[ord('&'), ProductionEnum.X, ProductionEnum.X],
[ord('|'), ProductionEnum.X, ProductionEnum.X],
[ord('=')], # O 23
[311],
[ord('<')],
[ord('>')],
[314],
[315],
[ProductionEnum.N, ProductionEnum.N], # Y 29
[301, 301],
[ord('^')], # N 31
[306],
[300],
[ProductionEnum.E, ord(';'), ProductionEnum.U], # U 34
[319, ord(';')],
[320, ord(';')],
[ProductionEnum.I, ProductionEnum.U],
[303, ProductionEnum.H, ProductionEnum.T, ProductionEnum.K], # I 38
[305, ProductionEnum.H, ProductionEnum.T],
[304, ProductionEnum.M], # J 40
[304, ProductionEnum.T], # K 41
[301], # L 42
[], # Epsilon 43
[ord('{'), ProductionEnum.P, ord('}')], # M 44
[ProductionEnum.E, ord(';'), ProductionEnum.P], # P 45
[ord('!'), ProductionEnum.X], # X 46
[308, 302], # E 47
[ProductionEnum.R], # F 48
]
parse_table = [
{
0: 2,
302: 1,
303: 1,
305: 1,
307: 0,
309: 0,
310: 0,
312: 0,
313: 0,
318: 0,
321: 0,
322: 0,
323: 0,
308: 0,
ord('}'): 43
},
{
307: 7,
309: 3,
310: 4,
312: 6,
313: 5,
318: 8,
321: 11,
322: 9,
323: 10,
308: 47,
ord(';'): 43
},
{
302: 12,
303: 48,
305: 48
},
{
ord('{'): 15
},
{
ord('('): 16
},
{
ord('{'): 17
},
{
ord('='): 18,
311: 18,
314: 18,
315: 18,
316: 19,
317: 20,
ord('&'): 21,
ord('|'): 22,
ord('!'): 46
},
{
ord('='): 23,
311: 24,
ord('<'): 25,
ord('>'): 26,
314: 27,
315: 28
},
{
ord('^'): 29,
306: 29,
300: 29,
301: 30
},
{
ord('^'): 31,
306: 32,
300: 33
},
{
307: 34,
309: 34,
310: 34,
312: 34,
313: 34,
318: 34,
319: 35,
320: 36,
321: 34,
322: 34,
323: 34,
308: 34,
303: 37,
305: 37,
ord('}'): 43
},
{
303: 38,
305: 39
},
{
302: 43,
303: 43,
305: 43,
307: 43,
309: 43,
310: 43,
312: 43,
313: 43,
318: 43,
321: 43,
322: 43,
323: 43,
308: 43,
0: 43,
304: 40
},
{
303: 43,
305: 43,
307: 43,
309: 43,
310: 43,
312: 43,
313: 43,
318: 43,
321: 43,
322: 43,
323: 43,
308: 43,
0: 43,
304: 40
},
{
301: 42,
ord(';'): 43
},
{
ord('{'): 44
},
{
ord('}'): 43,
303: 45,
305: 45,
307: 45,
309: 45,
310: 45,
312: 45,
313: 45,
318: 45,
321: 45,
322: 45,
323: 45,
308: 45
},
{
303: 13,
305: 14
}
]
And there we have it, next time I'll be implementing an attributed grammar system that can parse this!
You are not logged in, please log in here to comment