Helping the customer to understand what he did a long time ago…

Helping the customer to understand what he did a long time ago…
09 / 03 / 2021Temps de lecture 12 min.

In a galaxy far, far away…

EV-9D9: “How many languages do you speak?

C-3PO: “I am fluent in over six million forms of communication and can readily…”

EV-9D9: “Splendid. We have been without an interpreter since our master got angry with our last protocol droid and disintegrated him.

C-3PO: “Disintegrated?

EV-9D9: “Yes, Disintegrated …. Can you read and write a dialect of SQL, MySQL and generate AST to translate to another database language ? We have 70.000 SQL statements that we don’t really understand …

C-3PO: “Of course I can ! I have a built-in ANTLR4 device that can do that!

EV-9D9: “Perfect! You’re in then! Follow me…

When I started to work for one of my customers, they asked me to change their monolithic platform to a full distributed one with HA. If it was so monolithic and full of SPoF’s, it was not because of a real lack of skills or vision but it was more a history made by events, people and… turn over. The lines of code I found and the design of the platform were telling me a real story implying machines, code and human beings spread over more than 4 years of hotfixes, maintenance and development.

Who is my secret customer ?

Of course, I can’t say any name of company (or I will have to hire Boba Fett …) but I need to describe the flow of data which is the key to designing a new architecture.

First things first, Auto !

The first thing to know is : we’re gonna listen to clusters of machines. These are industrial machines producing a huge amount of data every second. All these expensive bunches of steel and iron are grouped by site and will send all that data through an ftp connection (yes, you read it correctly…) using a 4G or landline connection. Sometimes those sites are really in the middle of nowhere (it sounds like Tatooine, I know…).

Second things… Second !

These data going through the FTP pipe are then processed and inserted in a MySQL database. All this data will provide dashboards and visual tools to monitor customer sites.

What’s up doc ?

My customer needed to build a brand new platform that will be able to scale-out, to be easy to maintain and, of course, that will be iso-functional to the existing one.

That’s when the tricky part comes in : all the computation logic for aggregations are stored in the SQL database as SQL strings (e.g. : “AVG(FIELD1)”) in a table. When the computation needs to be done, all requests are concatenated in one to be executed later.

The caveat is : it’s really hard to maintain a system where you have strings that contain something written in a language (In that case SQL) that you manipulate like simple pieces you put together. Without knowing if the concatenated query is valid or makes sense, running it is at your own risk. If you have a few hundreds of it, then you can have human eyes to check it … but when it’s 70.000 request blocks then… you don’t want to hear about “testing it all by hand » !

Goal of my mission (you said impossible ?)

The main goal of my mission was to build the new platform and migrate the current SQL logic to something else. The chosen database to store all messages from the different sites was Elasticsearch. The problem was : the SQL computation was not translatable to an Elasticsearch query or aggregation.

So we decided to explore all these SQL requests to try to simplify and maybe afterwards to translate to Elasticsearch “jsonish” queries (with painless script of course) or to run a dockerized SQL engine for some of them in a streaming way. I mean whenever we receive a message from FTP it will trigger computation using a SQL engine on a limited time window. These aggregations will then consolidate Elasticsearch database. Helping the customer to understand these queries would maybe bring questions or a different point of view of the running platform.

How to evaluate complexity of SQL queries stored as strings ?

The answer is : use a lexer and a parser that understand SQL grammar. This, maybe, looks tricky but if you use ANTLR4 … it’s not !

My idea was to use Zeppelin and therefore Spark to load and work on these bunch of queries. The advantage of doing this is that you can try and fail, changing your code quickly in the Zeppelin notebook and find exactly what you were looking for.

So the first thing to do was to package a lexer/parser in order to add it as a spark dependency and use it directly from the notebook.

Find the grammar for your parser

To find the right grammar for my parser, I checked this repository. It’s a compilation of a lot of grammars organized by folder. I found the droids ones I was looking for (even if a weird old bearded man told me : “these aren’t the ones you’re looking for”). There were 2 files : one for the lexer (MySqlLexer.g4 : tokens allowed) and one for the parser (MySqlParser.g4: rules). These are the 2 files I downloaded to prepare my lexer/parser project.

Compile and package the SQL lexer/parser

To compile and package the lexer/parser, I wrote a small project using maven and its really useful plugin antlr4-maven-plugin. First I wrote a pom.xml to describe the project (groupId, artifactId, …) and some version stuff. Here is the pom.xml (I removed some verbose xml to make it more readable) :

<project ...>
    <name>MySQL Antlr 4 Lexer/Parser</name>


Then in the root folder of the project (next to the pom.xml) I created one folder to put all the grammar files : src/main/antlr4/com/agaetis/antlr4/mysql. Here is the structure of the project :

File structure

When you have your project setup, the next step is to package the jar. For that purpose, of course, we use the antlr4 maven plugin like this :

mvn clean install

After that you should have in your local maven repository the jar that you will use with Zeppelin/Spark.


Add the parser package as a dependency in the Zeppelin Spark Interpreter

When you have your parser packaged and ready to be used, you need to tell Zeppelin and Spark to make it available from your notebook. In the interpreter section you just have to search for spark interpreter, edit it and add in the end of the page the dependency to the package located in your maven repository.

All requests that will be parsed are in a MySQL database, therefore I added a dependency to the connector to access a MySQL database and because we’re going to use ANTLR4, obviously, I added as well the ANTLR4 runtime dependency.

Load queries from the database

Here is the snippet to create a dataframe from a MySQL table :

val options = Map(    "url"       -> dbUrl,
    "dbtable"   -> tableName,
    "user"      -> userName,
    "password"  -> userPassword,
    "driver"    -> "com.mysql.jdbc.Driver"
val queries ="jdbc").options(options).load



That means, you’ll be able to use Spark dataframe operators to manipulate (filter, map …) your queries. How to compute the complexity of a SQL statement ?

The most intuitive way to compute the complexity of a SQL query (the one that came in my mind first), was, IMHO, the computation of the complexity of the AST (Abstract Syntax Tree) that represents a given SQL query (that’s why I used a library like ANTLR).

Let’s take an example : here is a query that is just a SELECT statement without any FROM part or WHERE statement.

SELECT IF(AVG(age) > 10, (AVG(age) * 20), 0)

ANTLR4 will compute an AST for that query that looks like this :

Example of an AST (simplified) for a SQL query

In that case, the purpose of the query is : if the average of the field age is greater than 10, then the value returned is the average of the field age multiplied by 20, otherwise the result is 0. This query is not a complicated one but it shows what the AST looks like.

We can see that we have five layers (I don’t count the first one AST in the picture). My first idea was to compute the number of elements for each layer times the layer index (first level has index equals to 1) and then to multiply them. In practice, the number returned in that case is too high. So I computed the natural logarithm of each number of nodes per layer times the level of the layer (plus 1 because it’s a logarithm) and then multiplied them to apply finally a last natural logarithm.

So in this example the complexity would be :


which equals to :


That gives 5.48. When we use this query in the function implemented in the notebook we don’t have the same result because I simplified the abstract syntax tree to illustrate the complexity factor computation. ANTLR4 generates a more complex tree from this same query and therefore, we get a higher complexity factor when it is executed in the notebook.

Here is the code to compute the complexity of a query :

import com.agaetis.antlr4.mysql.MySqlParserBaseListener
import com.agaetis.antlr4.mysql.MySql Lexer
import com.agaetis.antlr4.mysql.MySqlParser
import org.antlr.v4.runtime.ParserRuleContext
import org.antlr.v4.runtime.CharStream
import org.antlr.v4.runtime.CharStreams
import org.antlr.v4.runtime.CommonTokenStream
import org.antlr.v4.runtime.tree.ParseTree
import org.antlr.v4.runtime.tree.ParseTreeWalker
import scala.collection.mutable.Map

class MyListener extends MySqlParserBaseListener {
    val nodeCountPerLevel = LinkedHashMap[Int, Int]();
    override def enterEveryRule(ctx:ParserRuleContext) {
        nodeCountPerLevel(ctx.depth()) = nodeCountPerLevel.getOrElse(ctx.depth(),0) + 1

class MyErrorListener extends BaseErrorListener {
    override def syntaxError( recognizer: Recognizer[_, _],
                              offendingSymbol:Object, line:Int ,
                              charPositionInLine:Int ,
                              msg:String ,
                              e:RecognitionException) = {
                                  println(s"ERROR $msg")
                                  throw new ParseCancellationException(msg);

def evalComplexity(input:String, exprType:String) : Double = {
    if(input == null || input.isEmpty)
    else {
        try {
            val inputCharStream = CharStreams.fromString(input)
            val sqlLexer:MySqlLexer = new MySqlLexer(inputCharStream)
            val tokenStream = new CommonTokenStream(sqlLexer)
            val parser = new MySqlParser(tokenStream)
            parser.addErrorListener(new MyErrorListener());
            val tree:ParseTree = exprType match {
                case "SELECT" => parser.selectStatement()
                case "EXPR" => parser.expression()
                case _ => parser.root()
            val walker = new ParseTreeWalker()
            val listener = new MyListener()
            walker.walk(listener, tree)
            val nodeCountPerLevel = listener.nodeCountPerLevel
            val score = {
                case (k,v) => { 
                    Math.log(k * v) + 1.0
            }.reduce( (a,b) => a * b )
        } catch {
            case e:ParseCancellationException => -1.0


Here is an example of using this function :

Using dataframe (where the queries are) and udf’s (user defined functions), you can add a complexity column to your dataframe like this :

def evalComplexityUdf = udf((input:String) => {
    evalComplexity(input, "EXPR")

val allQueriesWithComplexity = allQueries.withColumn("valCplx", evalComplexityUdf($"query"))


Then you can display and see the complexity of any query in the dataset :

And so what?

Then we were able to identify complex queries, to analyze why it was so complex and to check if it was still in use in the system. There was also an advantage in using this method : we found few invalid statements in the MySQL database (missing parenthesis, invalid variable names …). These queries have since been corrected or removed if it was not used anymore.

The customer was able to understand a bit more what was in the database, and it was a good start to rethink the way these queries should be kept or mapped to Elasticsearch.


When you have a SQL-based system that contains SQL queries which will be concatenated or manipulated as strings then you can assume that it’s not the best practice you can find. Indeed, it’s a trick that hides a deep misconception problem of your platform. In that case you can either run away and say you can’t do anything for them or you can pluck up the courage to give sense to customer’s data and make them understand that there is always a better way : rethink the data modeling and use a database that fits better the use case.

Space Odyssey – Warner Bros. Entertainment Inc.

Even if the project looks like a space odyssey to the customer (like fighting against an “already in place” platform to change the way it thinks), there’s always a way or a workaround to ignite a seismic shift.

Un sujet vous intéresse ? Une question ? Contactez-nous !

Logo Agaetis
Nos adresses
9, allée Evariste Galois
63170 Aubière
Tél. 04 73 35 47 51
21, rue de la banque
75002 Paris
Tél. 01 44 63 53 13
52, Quai Rambaud
69002 Lyon
© 2021 Agaetis - Tous droits réservés

Ce site utilise des cookies à des fins de mesures d'audience, ainsi que pour améliorer votre expérience de navigation