Multiple chained queries in SQL on CSV processor
While working on data flow and using the SQL on CSV processor, you may encounter the problem of writing a series of queries where the result of the first query becomes the input for the following one. Depending on the complexity of the queries and the size of the table, nested queries are sometimes not the best option in this case.
WITH clause
One of the solutions for this problem is utilising a WITH
the clause which allows you to name the temporary table and to reference it in your query multiple times using its name. This way you can simplify the code, make it more readable and avoid multiple subqueries which makes the query less expensive in terms of performance. Here is a simple example of using a WITH
clause:
Query:
WITH survived as (
SELECT * FROM TITANIC t
WHERE t.Survived = 1
)
SELECT COUNT(*) FROM survived
Input File titanic.csv
Output File survived.csv
After running the configuration you should see the result file survived.csv
in the Data Out bucket with the header and 1 row
COUNT(*)
109
The drawback of this solution is that you can chain only 2 queries and the result of the query inside the WITH
clause won’t be saved. Below we describe another solution that is not limited to 2 queries.
Chained queries
It is possible to use an output of one query as an input for the next one by applying the logic of the folder structure of Meiro Integration.
The folder structure of all components is similar and generally looks like this:
data/
in/
/tables
/files
out/
/tables
/files
The result of each query in SQL on CSV processor is written by Meiro Integrations to the folder data/out/tables
. The input tables are read from the data/in/tables
by default, but it is possible to access the data/out/tables
by going two levels up to to the parent directory (data
):../../out/tables/
.
Let’s try to re-write the query from our earlier example into two separate queries.
Query (1)
SELECT * FROM TITANIC t
WHERE t.Survived = 1
Input File (1) titanic.csv
Output File (1) survived.csv
Query (2)
SELECT COUNT(*) FROM survived
Input File (2) ../../out/tables/survived.csv
Note that we are accessing the Query (1) result in the Input File field.
Output File (2) count_result.csv
After running the configuration, 2 tables will be saved to data/out/tables
- Table
survived.csv
the result of a Query (1) - Table
count_result.csv
the result of Query (2) where the input data was the tablesurvived.csv
.
Using this technique, you can combine as many queries as you need and, if necessary, save the results of the queries.
Learn more: about SQL on CSV processor, please refer to this article.
No Comments