Multiple chained queries in SQL on CSV processor

While working on data flow and using the SQL on CSV processor, you may encounter the problem of writing a series of queries where the result of the first query becomes the input for the following one. Depending on the complexity of the queries and the size of the table, nested queries are sometimes not the best option in this case.

WITH clause

One of the solutions for this problem is utilising a WITH the clause which allows you to name the temporary table and to reference it in your query multiple times using its name. This way you can simplify the code, make it more readable and avoid multiple subqueries which makes the query less expensive in terms of performance. Here is a simple example of using a WITH clause:

Query:

WITH survived as (
SELECT * FROM TITANIC t
WHERE t.Survived = 1
)
SELECT COUNT(*) FROM survived

Input File titanic.csv

Output File survived.csv

After running the configuration you should see the result file survived.csv in the Data Out bucket with the header and 1 row

COUNT(*)
109

The drawback of this solution is that you can chain only 2 queries and the result of the query inside the WITH clause won’t be saved. Below we describe another solution that is not limited to 2 queries.

Chained queries

It is possible to use an output of one query as an input for the next one by applying the logic of the folder structure of Meiro Integration. 

The folder structure of all components is similar and generally looks like this:

data/
    in/
        /tables
        /files
    out/
        /tables
        /files

The result of each query in SQL on CSV processor is written by Meiro Integrations to the folder data/out/tables. The input tables are read from the data/in/tables by default, but it is possible to access the data/out/tables by going two levels up to to the parent directory (data):../../out/tables/.

Let’s try to re-write the query from our earlier example into two separate queries.

Query (1)

SELECT * FROM TITANIC t
WHERE t.Survived = 1

Input File (1) titanic.csv

Output File (1) survived.csv

Query (2)

SELECT COUNT(*) FROM survived

Input File (2) ../../out/tables/survived.csv

Note that we are accessing the Query (1) result in the Input File field.

Output File (2) count_result.csv

After running the configuration, 2 tables will be saved to data/out/tables 

  1. Table survived.csv the result of a Query (1)
  2. Table count_result.csv the result of Query (2) where the input data was the table survived.csv.

Using this technique, you can combine as many queries as you need and, if necessary, save the results of the queries.

Learn more: about SQL on CSV processor, please refer to this article.