Creating DSL with Antlr4 and Scala

Author: 
Saumitra
Thursday, January 11, 2018

Domain specific languages help a lot in improving developer productivity. First thing which you need while creating a DSL is a parser which can takes a piece of text and transforms it in structured format (like Abstract Syntax Tree) so that your program can understand and do something useful with it. DSL tends to stay for years so while choosing a tool for creating parser for you DSL you need to make sure that it’s easy to maintain and evolve the language. For parsing simple DSL, you can just use regular expression or Scala’s in-built parser-combinators, but for even slightly complex DSL, both become performance and maintenance nightmares.

In this post we will see how to use Antlr to create a basic grammar and use it in Scala. Full code and grammar for this post is available at http://saumitra.me/blog/antlr4-visitor-vs-listener-pattern/

ANTLR4

ANTLR can generate lexers, parsers, tree parsers, and combined lexer-parsers. Parsers can automatically generate abstract syntax trees which can be further processed with tree parsers. ANTLR provides a single consistent notation for specifying lexers, parsers, and tree parsers. This is in contrast with other parser/lexer generators and adds greatly to the tool’s ease of use. It supports:

  1. Tree construction
  2. Tree walking
  3. Error recovery
  4. Error handling
  5. Translation

Antlr supports a large number of target languages, so same grammar can be used for both backend parsing or frontend validations. Following languages are supported:

1 | Ada, Action Script, C. C#, D, Emacs, ELisp, Objective C, Java, Javascript, Python, Ruby, Perl6, Perl, PHP, Oberon, Scala

How ANTLR works

On a high level, here’s what you do to parse something with ANTLR

  1. Create lexer rules
  2. Create parser rules which uses lexer output
  3. Use lexer and parser to generate source code for a target language
  4. Use generated sources to convert some raw input into structured form (AST)
  5. Do something with this structured data

We will understand it with an example. Let’s say we want to create a DSL for allowing arithmetic operation. A valid input (expression) will be:

1 | 3 + (4 * 5)

As humans, if we want to evaluate this expression, here’s what we will do:

  1. Split this expression into different components.
    1. For example in above example, each character belongs to one of these group
      1. Operands (3, 4, 5)
      2. Operation (+ - * /)
      3. Whitespaces
    2. This part is called lexical analysis where you convert raw text (stream of characters) into tokens
  2. Create relationship between tokens

    1. To evaluate it efficiently we can create a tree like structure to define relationship between different expression like this:
    2. This is called AST (Abstract syntax tree) and this gets by applying rules you define in your grammar on input text. Once you have the AST, to evaluate the expression, we need to traverse or ‘walk’ it in a depth first manner. We start at the root ‘+’ and go as deep into the tree as we can along each child, and then evaluate the operations as we come back out of the tree. We will now setup the tools and try creating a simple grammar.

Setup

IDE

ANTLR provides a GUI based IDE for developing grammar. You can download it from http://www.antlr3.org/works/. It combines an excellent grammar-aware editor with an interpreter for rapid prototyping and a language-agnostic debugger for isolating grammar errors.

IntelliJ plugin

Intellij provides a plugin too for ANTLR. Refer this link for adding the plugin https://plugins.jetbrains.com/plugin/7358-antlr-v4-grammar-plugin

Command line setup

You can directly create, test and debug grammar from command line too. Here are the steps on how to:

  1. Download Antlr Jar http://www.antlr.org/download/antlr-4.7-complete.jar
  2. Create an alias for command to generate sources
    1. alias antlr4='java -jar /home/sam/softwares/antlr/antlr-4.6-complete.jar'
  3. Create an alias to test your grammar for some input
    1. alias grun='java -cp ".:/home/sam/softwares/antlr/antlr-4.6-complete.jar" org.antlr.v4.gui.TestRig'

Add this in ~/.bashrc to be able to directly call antlr4 and grun command from anywhere.

Creating grammar

A grammar will consist of 2 parts:

  • Lexer
  • Parser

Both of these can be defined in same file, but for maintenance sake it’s better to define it in separate files. Let’s create lexer and parser for a DSL which will allow basic arithmetic operations on 2 numbers.

Some valid inputs will be:

	1 | lexer grammar ArithmeticLexer;
	2 | 
	3 | WS: [ \t\n]+ -> skip ;
	4 | 
	5 | NUMBER: ('0' .. '9') + ('.' ('0' .. '9') +)?;
	6 | 
	7 | ADD: '+';
	8 | SUB: '-';
	9 | MUL: '*';
	10| DIV: '/';

Let’s first define lexer definitions in a file named ArithmeticLexer.g4 to extract tokens from input:

	1 | parser grammar ArithmeticParser;
	2 | 
	3 | options { tokenVocab=ArithmeticLexer; }
	4 | 
	5 | expr: NUMBER operation NUMBER;
	6 | 
	7 | operation: (ADD | SUB | MUL | DIV);
  • First definition for WS is telling to skip all the space, tabs and newline chars
  • Second definition for NUMBER is telling to extract all numbers as NUMBER token
  • ADD/SUB/MUL/DIV definition is assigning a named token to respective mathematical operator

Now let’s write some parser rules in file named ArithmeticParser.g4 which will processes tokens generated by lexer and create a AST for a valid input

	
	1 | parser grammar ArithmeticParser;
	2 | 
	3 | options { tokenVocab=ArithmeticLexer; }
	4 | 
	5 | expr: NUMBER operation NUMBER;
	6 | 
	7 | operation: (ADD | SUB | MUL | DIV);
  • expr is the base rule and will accept any 2 numbers with one of valid operation.
  • operation rule is telling tokens are valid operations

Generating sources for a target language

Now that we have our grammar, we can generate lexer and parser source in any of the supported language. Run following from command line:

	1 | antlr4 ArithmeticParser.g4	
	2 | antlr4 ArithmeticLexer.g4

By default it will generate sources in java. You can change that by passing language argument. For example, following command generate sources in JavaScript:

	1 | antlr4 -Dlanguage=JavaScript ArithmeticParser.g4

Instead of generating sources individually for lexer and parsr, you can do in same command too:

	1 | antlr4 *.g4	

After you run above, you will see following java source generated in same directory:

	1 ├── ArithmeticLexer.g4
	2 ├── ArithmeticLexer.java
	3 ├── ArithmeticLexer.tokens
	4 ├── ArithmeticParserBaseListener.java
	5 ├── ArithmeticParser.g4
	6 ├── ArithmeticParser.java
	7 ├── ArithmeticParserListener.java
	8 └── ArithmeticParser.tokens

You can also provide a package name for generated sources, like below

	1 | antlr4 -visitor *.g4

Antlr4 provide 2 ways to walk the AST - Listener and Visitor. Antlr doesn’t generate sources for visitor by default. Since we will be using visitor pattern while using it in Scala to avoid mutability, so let’s generate visitor source too. It can be done by providing visitor flag, like below:

	1 | antlr4 -package arithmetic *.g4

Now you will see source for visitor too

	
	1 ├── ArithmeticLexer.g4
	2 ├── ArithmeticLexer.java
	3 ├── ArithmeticLexer.tokens
	4 ├── ArithmeticParserBaseListener.java
	5 ├── ArithmeticParserBaseVisitor.java
	6 ├── ArithmeticParser.g4
	7 ├── ArithmeticParser.java
	8 ├── ArithmeticParserListener.java
	9 ├── ArithmeticParser.tokens
	10└── ArithmeticParserVisitor.java

Next compile the sources

	1 | javac -cp ".:/home/sam/softwares/antlr/antlr-4.6-complete.jar" *.java

Now you are ready to test any input against you DSL.

Testing the DSL

To do that, run following command

	1 | grun Arithmetic expr -tokens

What above command is saying that execute org.antlr.v4.gui.TestRig on Arithmetic grammar and test for rule named expr. -tokens flag will allow us to see the tokens generated by lexer. Enter any valid input like 10 + 3. Press enter. Press Ctrl+D. You will see input like below:

	1 | $ grun Arithmetic expr -tokens
	2 | 10 + 2
	3 | ^D
	4 | [@0,0:1='10',,1:0]
	5 | [@1,3:3='+',<'+'>,1:3]
	6 | [@2,5:5='2',,1:5]
	7 | [@3,7:6='',,2:0
	8 | $ grun Arithmetic expr -tokens

Since it didn’t show any error, it means your input in valid. Each line is showing token value, token name and its start and end offset.

In case of invalid input, Antlr will tell you what it was expecting, like below:

	1 | $ grun Arithmetic expr -tokens
	2 | 10-
	3 | line 2:0 missing NUMBER at ''

In case of valid input, in addition to tokens, you can also see the AST by passing -gui flag, like below:

	1 | $ grun Arithmetic expr -tokens -gui
	2 | 1272.12 * 743.12
	3 | ^D
	4 | [@0,0:6='1272.12',,1:0]
	5 | [@1,8:8='*',<'*'>,1:8]
	6 | [@2,10:15='743.12',,1:10]
	7 | [@3,17:16='',,2:0]

Using generated sources in code

We will now see how to extend generated interfaces and use it from within code. As I mentioned above, antlr4 provides 2 ways to walk the AST - Visitor and Listener. We will first see how to use the listener pattern. Although listener method is commonly used by Java devs, but Scala folks will not like it because it can only return unit, hence you need to use intermediate variables leading to side-effects. Refer to this post for a comparison between two patterns.

First thing you need to do is add Antlr dependency:

	1 | libraryDependencies ++= Seq(
	2 | "org.antlr" % "antlr4-runtime" % "4.6",
	3 | "org.antlr" % "stringtemplate" % "3.2"
	4 | )

Next import the entire generated source in your project and create a parse method which accept an input expression.

	1 | def parse(input:String) = {
	2 | println("\nEvaluating expression " + input)
	3 | 
	4 | val charStream = new ANTLRInputStream(input)
	5 | val lexer = new ArithmeticLexer(charStream)
	6 | val tokens = new CommonTokenStream(lexer)
	7 | val parser = new ArithmeticParser(tokens)
	8 | 
	9 | /* Implement listener and use parser */
	10| }
  • In line 4, we converted input text to a char stream, because lexer operates at char level
  • in line 5, we get a lexer object which uses ArithmeticLexer generated using definitions from ‘ArithmeticLexer.g4’ and pass input stream to it
  • in line 6, we got all the token obtained by applying lexer rules on input text
  • in line 7, we created a parser by applying rules which we defined in ArithmeticParser.g4

Next thing which we need to do is implement some methods in the BaseListener interface. Let’s see what content of generated ArithmeticParserBaseListener.

	1 | public class ArithmeticParserBaseListener implements ArithmeticParserListener {
	2 | //enter and exit methods for grammar rules
	3 | @Override public void enterExpr(ArithmeticParser.ExprContext ctx) { }
	4 | @Override public void exitExpr(ArithmeticParser.ExprContext ctx) { }
	5 | @Override public void enterOperation(ArithmeticParser.OperationContext ctx) { }
	6 | @Override public void exitOperation(ArithmeticParser.OperationContext ctx) { }
	7 | 
	8 | 
	9 | //default grammar independent methods
	10| @Override public void enterEveryRule(ParserRuleContext ctx) { }
	11| @Override public void exitEveryRule(ParserRuleContext ctx) { }
	12| @Override public void visitTerminal(TerminalNode node) { }
	13| @Override public void visitErrorNode(ErrorNode node) { }
	14| }

For every rule which we defined in ArithmeticParser.g4, it created an enter and exit method. Since we had 2 rules, expr and operation, so it created 4 methods. As the name implies, these will get triggered every time walker enters and exits a matched rule. For now, let’s focus on the entry method of our starting rule expr. This problem can be solved by using visitor instead of listener as discussed in this post.

	1 | @Override public void enterExpr(ArithmeticParser.ExprContext ctx) { }

Notice that every rule has a context which has all the Meta information as well as matched input info. Also note that all methods return void which means you need to use mutable variables to store computational values if they needs to be shared among different rules or even by main caller.

So now we create our own class by extending ArithmeticParserBaseListener and implement enterExpr rule.

	1 | class ArithmeticListenerApp extends ArithmeticParserBaseListener {
	2 | 
	3 | override def enterExpr(ctx: ArithmeticParser.ExprContext): Unit = {
	4 | val exprText = ctx.getText
	5 | println(s"Expression after tokenization = $exprText")
	6 |	
	7 | val operands = ctx.NUMBER().toList.map(_.getText)
	8 | val operand1 = parseDouble(operands.lift(0).getOrElse("0.0")).getOrElse(0.0)
	9 | val operand2 = parseDouble(operands.lift(1).getOrElse("0.0")).getOrElse(0.0)
	10| 
	11| val operation = ctx.operation().getText
	12| 
	13| calculate(operand1, operand2, operation) match {
	14| case Some(result) =>
	15| println(s"Result of $operand1 $operation $operand2 = $result")
	16| case None =>
	17| println(s"Failed to evaluate expression. Tokenized expr = $exprText")
	18| }
	19| 
	20| }
	21| 
	22| def parseDouble(s: String): Option[Double] = Try(s.toDouble).toOption
	23| 
	24| def calculate(op1:Double, op2:Double, operation:String):Option[Double] = {
	25| operation match {
	26| case "+" => Some(op1 + op2)
	27| case "-" => Some(op1 - op2)
	28| case "*" => Some(op1 * op2)
	29| case "/" => Try(op1 / op2).toOption
	30| 
	31| case _ =>
	32| println(s"Unsupported operation")
	33| None
	34| }
	35| 
	36| }
  • In line 4, exprText will have tokenized text for this rule.
  • In line 7, expr rule’s context knows about NUMBER and operation. Since NUMBER occurs twice in the rule, hence ctx.NUMBER() will be a list containing 2 numbers.
  • In line 11, we get value of operation from expr rule’s context
  • We calculate the value and since there is no way to return it to caller from enterExpr method, hence we just print it. We could have stored it in some mutable variable in case caller needed it.

Now that we have implemented this, we need to use it in the parse method which we defined earlier, like below:

	1 | def parse(input:String) = {
	2 | println("\nEvaluating expression " + input)
	3 |	
	4 | val charStream = new ANTLRInputStream(input)
	5 | val lexer = new ArithmeticLexer(charStream)
	6 |	val tokens = new CommonTokenStream(lexer)
	7 | val parser = new ArithmeticParser(tokens)
	8 | 
	9 | val arithmeticVisitor = new ArithmeticVisitorApp()
	10| val res = arithmeticVisitor.visit(parser.expr())
	11| println(res)
	12| 
	13| }
	14| def parse(input:String) = {

Let’s test it now on some input expressions:

	1 | val expressions = List(
	2 | "127.1 + 2717",
	3 |	"2674 - 4735",
	4 | "47 * 74.1",
	5 | "271 / 281",
	6 |	"12 ^ 3" // unsupported expression
	7 | )
	8 | expressions.foreach(parse)

You will see following output:

	1 | Evaluating expression 127.1 + 2717
	2 | Expression after tokenization = 127.1+2717
	3 | Result of 127.1 + 2717.0 = 2844.1
	4 | 
	5 | Evaluating expression 2674 - 4735
	6 | Expression after tokenization = 2674-4735
	7 |	Result of 2674.0 - 4735.0 = -2061.0
	8 |	
	9 | Evaluating expression 47 * 74.1
	10| Expression after tokenization = 47*74.1
	11| Result of 47.0 * 74.1 = 3482.7
	12| 
	13| Evaluating expression 271 / 281
	14| Expression after tokenization = 271/281
	15| Result of 271.0 / 281.0 = 0.9644128113879004
	16| 
	17| Evaluating expression 12 ^ 3
	18| line 1:3 token recognition error at: '^'
	19| line 1:5 missing {'+', '-', '*', '/'} at '3'
	20| Expression after tokenization = 123
	21| Unsupported operation
	22| Failed to evaluate expression. Tokenized expr = 123
	

I hope this post gave you an idea on how to get started with creating your own DSL.