Processing Big Data with Maestro and Spark SQL
In this tutorial, I will show you how I built an Apache Spark code generator with the Maestro Framework.
The purpose of the code generator is to let me configure an arbitrary number of tables and run a SELECT query against them in Spark using a Scala script.
The Scala script is auto-generated based on my configuration of tables and input formats.
The output is always produced as a single CSV file.
We will be using two CSV files as input: Users.csv and Content.csv.
Please download the example app to get the full example based on this tutorial.
Let's begin.
Step 1 - Writing the Scala Script Templates
We need Maestro to write us a Scala script for Spark based on configuration.
The configuration will look like this (State.json):
{
"sparkConfig": {
"tables": [
{ "table": "Users", "location": "Users.csv", "format": "csv", "header": "true" },
{ "table": "Content", "location": "Content.csv", "format": "csv", "header": "true" }
],
"query": "<Loaded from File>",
"workingDir": ".",
"args": "-i SparkScript.scala",
"exe": "spark-shell"
}
}
We will specifically read from the "tables" key and the "query" key.
Our Spark SQL query will be loaded from a text file, SelectQuery.txt, which I will show later.
Here is the first-level template, ScalaTemplate.txt:
#@$.sparkConfig.tables@
val dfMaestroSparkResult = spark.sql("""#@$.sparkConfig.query@""")
dfMaestroSparkResult.coalesce(1).write.format("csv").save("Output")
System.exit(0)
Notice the syntax we use to read config keys here: #@[jsonPath]@.
Now we also need to create a second-level template, ScalaInnerTemplate.txt. The reason is that $.sparkConfig.tables is an array.
We will need Maestro Framework to iterate this array and produce multiple instances of our template with different values.
Here is the second-level template, ScalaInnerTemplate.txt:
val df#@$.level2Item.table@ = spark.read.format("#@$.level2Item.format@").option("header", "#@$.level2Item.header@").load("#@$.level2Item.location@")
df#@$.level2Item.table@.createOrReplaceTempView("#@$.level2Item.table@")
Notice how we reference level2Item here; it refers to the current item in the "tables" array shown earlier.
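To make this concrete, here is roughly what the generated SparkScript.scala should look like for the configuration above, assuming SelectQuery.txt contains the query shown later in Step 4 (this is a sketch of the expected output, not a file taken from the example package):
// One read/view pair is generated per entry in the "tables" array
val dfUsers = spark.read.format("csv").option("header", "true").load("Users.csv")
dfUsers.createOrReplaceTempView("Users")
val dfContent = spark.read.format("csv").option("header", "true").load("Content.csv")
dfContent.createOrReplaceTempView("Content")
val dfMaestroSparkResult = spark.sql("""SELECT CONCAT(u.name, ': ', c.content) AS statement FROM Users u JOIN Content c ON u.user_id = c.user_id ORDER BY u.user_id DESC""")
dfMaestroSparkResult.coalesce(1).write.format("csv").save("Output")
System.exit(0)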
Step 2 - Writing the App Logic
The app logic is always the simplest part with the Maestro Framework.
Below is the plain English SparkQueryApp.txt file:
alias Dacris.Maestro as the
read state from file using the core
read string from file using the core with queryConfig
txt file from template using the core with templateFile
move file using the storage with scalaOut
run exe using the core with sparkConfig
The steps here are:
1. Read State.json.
2. Read SelectQuery.txt, which contains our Spark SQL query.
3. Apply our Scala templates to generate a Scala script, saved as Output.txt.
4. Rename Output.txt to SparkScript.scala.
5. Run spark-shell with our SparkScript.scala.
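Based on the "exe" and "args" values in the sparkConfig shown earlier (with workingDir set to the current directory), the command that step 5 runs should be roughly equivalent to:
spark-shell -i SparkScript.scala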
Step 3 - Creating State.json
The config keys named in the "with" clauses of the plain English app logic file all exist inside State.json.
State.json is the default configuration of a Maestro Framework app.
In Maestro, the state is where data and configuration reside while the app is running.
Please refer to the example package to see how we configured the State.json.
Step 4 - Creating SelectQuery.txt
Let's create our select query now, in Spark SQL.
What is our aim? Our data is a Content table and a Users table.
We want to create a dialogue between the users as a CSV file.
Example:
User 1: Hi there, User 2!
User 2: Hello there, User 1!
In order to do this, we need to get the user name from the Users table and the content from the Content table, and concatenate them into a single string.
Here's what the final SQL query looks like:
SELECT
CONCAT(u.name,': ',c.content) AS statement
FROM
Users u
JOIN Content c ON u.user_id = c.user_id
ORDER BY
u.user_id DESC
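If you would like to test the query outside of Maestro, below is a minimal spark-shell sketch that builds the two tables in memory and runs the same SQL. The sample rows are hypothetical; only the column names (user_id, name, content) are taken from the query above.
import spark.implicits._

// Hypothetical sample rows; column names match those referenced in the query.
val users = Seq((1, "User 1"), (2, "User 2")).toDF("user_id", "name")
val content = Seq((1, "Hi there, User 2!"), (2, "Hello there, User 1!")).toDF("user_id", "content")
users.createOrReplaceTempView("Users")
content.createOrReplaceTempView("Content")

// Run the same SELECT and print the concatenated statements.
spark.sql("""
  SELECT CONCAT(u.name, ': ', c.content) AS statement
  FROM Users u
  JOIN Content c ON u.user_id = c.user_id
  ORDER BY u.user_id DESC
""").show(false)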
Conclusions
In total, it took me one hour to create this application, and I wrote only 34 lines of code, including all templates and the query.
The advantage of this approach is that I can configure any data sources I want without changing the code.
For instance, if I wanted to read two CSV files and one Parquet file for my SQL query, I would just have to change a few configuration values in State.json.
More advanced applications can be created as well, such as ones that string together multiple queries.
In such a case, the templates would have three levels: one level for SQL queries, one level for data sources, and the root template.
This is the magic of Maestro templates: You can make apps flexible through configuration-based code generation.