Processor Command Line Interface Code

The Processor Command Line Interface Code transforms files in a configuration using a Bash script. Bash is a command-line shell for interacting with the operating system. A Bash script lets you run a whole sequence of commands, which might be a single simple command, a list of commands, or functions, loops, and other control-flow constructs.

Requirements

Fundamental knowledge of programming concepts and some experience with the Bash scripting language.

Useful Links: GNU Coreutils manual, Shell Scripting Tutorial

Features

Distribution: Debian Jessie

Available Utilities:

Complete BASH with standard Unix utilities.
jq
Additional utilities can be installed on request.

Limitations:

2 GB RAM
1 vCPU
3 hours of execution

Data In/Data Out

Data In

Files for processing and transformation are located in the /data/in/files/ or /data/in/tables/ folder, depending on the previous component in the dataflow.

Data Out

Files should be moved to /data/out/tables or /data/out/files, depending on the needs of the next component.

Learn more: for details about the folder structure, please see this article.

Code Editor

Script

This field is intended for the Bash script that you write for processing the file.

How to search & replace within a code editor

Script location and paths

The script file script.sh is located in the root /data folder. For accessing the data files, you can use an absolute or relative path.

Data In

an absolute path /data/in/tables/.. and /data/in/files/..
a relative path in/tables/.. and in/files/..

Data Out

an absolute path /data/out/tables/.. and /data/out/files/..
a relative path out/tables/.. and out/files/..
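
As a rough illustration of the two path styles, the sketch below builds a throwaway copy of the folder layout and reads the same file through a relative and an absolute path. The temporary directory stands in for /data and sample.csv is a made-up file name; neither comes from the platform itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: a temporary directory stands in for /data so the sketch runs anywhere.
data_dir="$(mktemp -d)"
mkdir -p "$data_dir/in/tables"
printf 'id,name\n1,alice\n' > "$data_dir/in/tables/sample.csv"   # hypothetical input file

# script.sh lives in the root data folder, so relative paths are resolved from there.
cd "$data_dir"

# Count the lines of the same file through both path styles.
relative_rows="$(wc -l < in/tables/sample.csv)"
absolute_rows="$(wc -l < "$data_dir/in/tables/sample.csv")"

echo "relative path: $relative_rows lines"
echo "absolute path: $absolute_rows lines"
```

Both commands read the same file, so the two counts match.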

Standard output

The activity log is the Meiro Integrations analog of the console log. If you run echo 'Hello world!' in the configuration, the system writes the result to the activity log.
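
Since echo output ends up in the activity log, short progress messages can make a run easier to follow. The sketch below is only an illustration: the temporary directory stands in for /data, and the file names a.csv and b.csv are invented.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: a temporary directory stands in for /data.
data_dir="$(mktemp -d)"
mkdir -p "$data_dir/in/tables" "$data_dir/out/tables"
touch "$data_dir/in/tables/a.csv" "$data_dir/in/tables/b.csv"   # hypothetical inputs
cd "$data_dir"

echo "Starting processing"

# Copy each CSV file and report it on standard output.
processed=0
for file in in/tables/*.csv; do
    cp "$file" out/tables/
    processed=$((processed + 1))
    echo "Copied $file to out/tables/"
done

echo "Finished: $processed file(s) processed"
```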

Examples

In this section, we demonstrate how to solve frequently encountered problems with files and tables:

Moving and renaming files.
Creating headers for a table.
Downloading a file into the required folder.

Example 1

By default, connectors and processors save CSV files in /data/out/tables. The next configuration in the dataflow will find them in /data/in/tables. At the same time, the AWS S3 loader requires all files to be located in /data/out/files. You can move files from the tables folder to the files folder.

Move all files from /data/in/tables to /data/out/files

mv in/tables/* out/files/

It is possible to move only the required files using wildcards (glob patterns). Move CSV files from /data/in/tables to /data/out/files

mv in/tables/*.csv out/files/

Move and rename a file. Move test.csv from /data/in/tables to /data/out/files and rename it to newfile.csv

mv in/tables/test.csv out/files/newfile.csv
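
One thing to watch for with wildcard moves: if nothing matches the pattern, Bash passes the pattern to mv unchanged and mv fails. A possible guard, sketched below with a temporary directory standing in for /data and invented file names, is to collect the matches into an array first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: a temporary directory stands in for /data.
data_dir="$(mktemp -d)"
mkdir -p "$data_dir/in/tables" "$data_dir/out/files"
touch "$data_dir/in/tables/orders.csv" "$data_dir/in/tables/notes.txt"   # hypothetical files
cd "$data_dir"

# With nullglob set, a pattern that matches nothing expands to an empty list.
shopt -s nullglob
csv_files=(in/tables/*.csv)
shopt -u nullglob

if [ "${#csv_files[@]}" -gt 0 ]; then
    mv -- "${csv_files[@]}" out/files/
    echo "Moved ${#csv_files[@]} CSV file(s) to out/files/"
else
    echo "No CSV files found in in/tables/"
fi
```

Only orders.csv is moved here; notes.txt stays in in/tables because it does not match the pattern.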

Example 2

This example demonstrates how to download a file from a URL.

Output file: /data/out/tables/titanic_data.csv

URL: http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

wget -O /data/out/tables/titanic_data.csv "http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"

Example 3

Sometimes the data you receive does not have any headers, which makes it inconvenient for later transformations. This example demonstrates how to create a new file with the headers for the table and join it with the table you have. For the purposes of this demonstration, we create the example file ourselves using the script.

Create an example file: table with 5 rows and 2 columns

# Append one row per number: the number and its double
for num in 1 2 3 4 5
do
    echo "$num,$((num*2))" >> out/files/numbers.csv
done

Create a file with headers for the columns.

echo 'number,multiply_by_two' > out/files/headers.csv

Append the rows from numbers.csv to headers.csv, which then contains the complete table

cat out/files/numbers.csv >> out/files/headers.csv
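
An alternative that leaves both source files untouched is to write the header and the rows into a third file. This is only a sketch: the temporary directory stands in for /data, and numbers_with_header.csv is an invented output name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: a temporary directory stands in for /data.
data_dir="$(mktemp -d)"
mkdir -p "$data_dir/out/files" "$data_dir/out/tables"
cd "$data_dir"

# Rebuild the example table: 5 rows, 2 columns.
for num in 1 2 3 4 5; do
    echo "$num,$((num * 2))"
done > out/files/numbers.csv

# Write the header line first, then the data rows, into a new file.
{
    echo 'number,multiply_by_two'
    cat out/files/numbers.csv
} > out/tables/numbers_with_header.csv   # hypothetical output name

cat out/tables/numbers_with_header.csv
```

Because the braces group both commands under a single redirection, the output file is written once and numbers.csv is never modified.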

Reproducing and debugging

If you want to run the code on your own computer for testing and debugging, or write the script in a local IDE and copy-paste it into a Meiro Integrations configuration, the easiest way is to reproduce the folder structure as follows:

/data
    script.sh
    /in
          /tables
          /files
    /out
          /tables
          /files

The script file should be located in the /data folder, input files and tables in in/files and in/tables respectively, and output files and tables in out/files and out/tables respectively.

To reproduce the scripts from Example 1, save any CSV file named test.csv to the folder /data/in/tables, paste the code from the example into the script file, and run it. The files will be moved to the folder /data/out/files and, in the last case, renamed.
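
The folder structure above could be created with a few commands. In this sketch the base directory defaults to a temporary directory rather than /data, since writing to the filesystem root usually needs elevated permissions. The DATA_DIR variable is an assumption of the example, not something the platform defines.

```shell
#!/usr/bin/env bash
set -euo pipefail

# DATA_DIR is a hypothetical variable for this sketch; set it to choose the base folder.
data_dir="${DATA_DIR:-$(mktemp -d)}"

# Create the input and output folders in one call.
mkdir -p "$data_dir"/in/tables "$data_dir"/in/files \
         "$data_dir"/out/tables "$data_dir"/out/files

# Create an empty script.sh if there is none yet.
[ -f "$data_dir/script.sh" ] || touch "$data_dir/script.sh"

echo "Local data folder ready at: $data_dir"
find "$data_dir" -mindepth 1 | sort
```

Running the script from inside that folder (for example with cd followed by bash script.sh) keeps relative paths such as in/tables working the same way as in the configuration.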