Processor Python from Git repository

The Python from Git repository processor allows you to run Python code located in a Git repository. Meiro clones the Git repository into the /data/repository folder and runs the code in the configuration.

A code repository is a file archive, often hosted on a web platform, where the source code for software, web pages, and other projects is kept publicly or privately. Git is a version control system that tracks changes in a project or a set of files, which helps a team of developers collaborate and manage the project. It stores all the information about the project in a special data structure called a Git repository, kept in a Git directory inside your project directory.

Python is an interpreted, high-level, general-purpose programming language. Because of its convenience, it has become a popular tool for data analytics tasks such as data cleaning and data analysis.

Requirements

To set up a configuration for the Python from Git repository component, you need an account on a version-control platform that supports Git (the most common are GitHub, Bitbucket, and GitLab) and a repository on it that contains a Python script.

To work comfortably with the Python from Git repository component, you need to be familiar with:

Programming in general
Python syntax (enough for writing simple scripts)
Git, a version control system
Any version-control platform that supports Git (for example, GitHub)

Topics in Python you need to be familiar with include (but are not limited to):

Data structures
Control Flow tools
Working with files (opening, reading, writing, unpacking, etc.)
Modules and packages

Useful links for Python

Useful links for a version-control platform

Features

OS: Debian Jessie

Version: Python 3.6

Packages pre-installed:

apache-libcloud, awscli
beautifulsoup4, boto
cx_Oracle
flake8
GitPython, google-api-python-client, google-auth-httplib2, google-cloud-bigquery, google-cloud-storage
html5lib, httplib2
ijson
ipython
Jinja2, jsonschema
lxml
matplotlib, maya
numpy
oauthlib
pandas, pandas2, pymongo, pytest, python-dateutil, PyYAML
requests, requests-oauthlib
scikit-learn, scikits.statsmodels, scipy, seaborn, simplejson, SQLAlchemy, statsmodels
tqdm
urllib3

Data In/Data Out

Data In

Files for processing and transformation are located in the in/tables/ folder (CSV files) or the in/files/ folder (all other file types), depending on the file type and on the previous component in the dataflow.

Data Out

Output files should be written to the out/tables folder (CSV files) or the out/files folder (all other file types), depending on the file type and on what the next component needs.

You can learn more about folder structure in configuration here.

Runtime

Repository (required)

This is the link to the Git repository that you are planning to work with.

To find it on GitHub, click the Clone or Download button in the repository and copy the displayed link.

User (optional)

Your account name on a version control platform (required only for private repositories).

Password (optional)

The password to your account on a version control platform (required only for private repositories).

Branch or Tag (required)

The name of the branch or tag of the Git repository you are using.

To read more about the concept of branches in Git please refer to this article.

Code Editor

Parameters

The Parameters field corresponds to a property in the config.json file. It lets you store values outside the script and make them accessible from it. The field must be in JSON format and represents a collection of property-value pairs. In the script, the values are accessible via the property “parameters”.

Parameters can be useful in different cases, such as:

Keeping sensitive information, such as a username and password or an API key for authentication on a third-party platform, outside your code repository.
Keeping parameters of the test environment, so after moving to a stage or production phase, you can change parameters fast without changing the script.

Use parameters to keep any values that you want to be easily accessible and changeable throughout the Meiro platform. This is demonstrated in the example section below.
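As a minimal sketch of how a script might read these values, assuming the configuration file is available at /data/config.json (the path used in Example 1 below). The helper name load_parameters and the parameter name api_key are hypothetical and used only for illustration.

```python
import json


def load_parameters(config_path="/data/config.json"):
    """Return the "parameters" object from the configuration file as a dict."""
    with open(config_path) as f:
        config = json.load(f)
    # Fall back to an empty dict if the configuration has no parameters.
    return config.get("parameters", {})


# Typical use inside main.py ("api_key" is a hypothetical parameter name):
#   params = load_parameters()
#   api_key = params.get("api_key")
```

Reading optional values with .get() keeps the script from failing with a KeyError when a parameter has not been set in the configuration.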

Requirements for the script

Structure of the project

The file in the repository containing the code you want to run should have the name main.py. If there is no such file in the repository, Meiro will return an error in the activity log.

If your code is long, consider splitting it into modules: logical blocks and functions. We advise you to think through the structure of the project, as all your modules should be accessible through the script file main.py.

Installing packages

The Python from Git repository processor comes pre-installed with the most common packages, listed in the Features section above. However, if you need additional packages, it is possible to install them.

Meiro uses the pip utility to install packages. To install additional packages, create a requirements file named requirements.txt in your repository and list the names of the packages and their versions (if needed) in it. Generally, all the packages available on pypi.org can be installed. If versions are not given, the latest versions will be installed.
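For illustration, a requirements.txt file might look like the fragment below. The package names and version numbers are examples only, not recommendations; the syntax is the standard pip requirements format.

```text
# requirements.txt (illustrative; package names and versions are examples only)

# Pin an exact version:
xlrd==1.2.0

# Require a minimum version:
openpyxl>=3.0

# No version given, so pip installs the latest available release:
Unidecode
```

Pinning exact versions makes runs more reproducible, while unpinned entries pick up whatever release is newest at install time.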

Paths

Use absolute paths when you need to access data files in your script.

Input files

Absolute paths: /data/in/tables/.. and /data/in/files/..

Output files

Absolute paths: /data/out/tables/.. and /data/out/files/..
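One way to keep these absolute paths in a single place is sketched below with pathlib. The helper names (input_table, input_file, output_table, output_file) are hypothetical, and the optional root argument is an assumption added so the same code can point at another folder when testing locally.

```python
from pathlib import Path

# Root of the data folder inside the processor, as described above.
DATA_ROOT = Path("/data")


def input_table(name, root=DATA_ROOT):
    """Absolute path of a CSV file in in/tables."""
    return root / "in" / "tables" / name


def input_file(name, root=DATA_ROOT):
    """Absolute path of a non-CSV file in in/files."""
    return root / "in" / "files" / name


def output_table(name, root=DATA_ROOT):
    """Absolute path of a CSV file in out/tables."""
    return root / "out" / "tables" / name


def output_file(name, root=DATA_ROOT):
    """Absolute path of a non-CSV file in out/files."""
    return root / "out" / "files" / name
```

For example, input_table("titanic.csv") resolves to /data/in/tables/titanic.csv.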

Standard output

The analogue of the console log in Meiro is the activity log, where you can view the output of your script, including errors and exceptions.

Other requirements

When working with a dataflow, you will work with files and tables a lot. Typically, you open the input file, transform the data, and write it to the output file. There are no special requirements for the script beyond those mentioned in this section and Python syntax itself.
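The open, transform, write pattern described above can be sketched with the standard csv module. This is an illustration only: the function name copy_filtered_rows, the filtering rule, and the file names in the usage comment are assumptions, not part of the processor.

```python
import csv


def copy_filtered_rows(in_path, out_path, column, value):
    """Copy rows whose `column` equals `value` from in_path to out_path.

    Returns the number of data rows written (the header is not counted).
    """
    written = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[column] == value:
                writer.writerow(row)
                written += 1
    return written


# Possible use inside main.py (hypothetical file names):
#   copy_filtered_rows("/data/in/tables/titanic.csv",
#                      "/data/out/tables/titanic_male.csv", "Sex", "male")
```

Processing the file row by row keeps memory use low for large inputs; for more complex transformations, pandas (pre-installed in the processor) may be more convenient.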

You can check a few examples of scripts solving common tasks in the section below.

Examples

Example 1

This example shows simple code that downloads an open dataset from an external source, writes it to an output file, and prints part of it to standard output (the console log). Usually, you will open a file from the input bucket that was downloaded by a connector component, but in some cases requesting data from an external resource may be necessary.

Additionally, this example demonstrates how the parameters feature can be applied. In this example, the URL of the dataset and its path and name are saved in the parameters property of the config.json file. All these values can easily be changed without changing the script in the repository.

Parameters

{
  "url": "http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv",
  "path": "newfolder",
  "name": "titanic.csv"
}

Script (main.py in Git repository)

# import the necessary libraries
import json
from pathlib import Path

import pandas as pd
import requests

# read the url, path and name parameters from the config.json file
with open('/data/config.json') as f:
    params = json.load(f)['parameters']
url = params['url']
path = params['path']
name = params['name']

# request the url and stop with an error if the download failed
r = requests.get(url)
r.raise_for_status()

# build the full path for the file and create the parent directory
out_dir = Path('/data/out/tables') / path
out_dir.mkdir(parents=True, exist_ok=True)
full_path = out_dir / name

# create the output file (or overwrite it if it exists) and write the content of the response
with open(full_path, 'wb') as f:
    f.write(r.content)

# read the output file and print the first 10 rows to the console log
data = pd.read_csv(full_path, header=0)
print(data[["Name", "Survived"]].head(10))

Example 2

This example illustrates opening, filtering and writing a CSV file using pandas, which is one of the most common Python libraries for data analytics. In this script, we’ll use the Titanic dataset, which contains data on 887 of the real Titanic passengers. This dataset is open and very common in data analytics and data science courses.

Let’s imagine we need to analyze the data of male and female passengers separately and want to write the data to two separate files. Data in this example was previously downloaded using a Connector component. The section below shows how you can reproduce the code on your computer.

Script (main.py in Git repository)

# import the pandas library
import pandas as pd

# read data from the input file and keep it in the data variable
data = pd.read_csv('/data/in/tables/titanic.csv')

# filter rows with male and female passengers, and reset the index
# of each dataset so it starts from 0 and grows incrementally
data_male = data[data['Sex'] == 'male'].reset_index(drop=True)
data_female = data[data['Sex'] == 'female'].reset_index(drop=True)

# write data to output files
data_female.to_csv('/data/out/tables/titanic_filter_female.csv')
data_male.to_csv('/data/out/tables/titanic_filter_male.csv')

Reproducing and debugging

If you want to run the code on your computer for testing and debugging, or write the script in a local IDE before using it in a Meiro configuration, the easiest way is to reproduce the folder structure below:

/data
    /in
        /tables
        /files
    /out
        /tables
        /files
    /repository
        main.py
        requirements.txt

Input files and tables should be placed in the corresponding subfolders of in/ (in/files and in/tables), and output files and tables are written to out/files and out/tables respectively.

To recreate Example 2, download the dataset, save it to the folder /data/in/tables as titanic.csv, paste the code from the example into the script file in the repository folder, and run it. New files will be written to the folder /data/out/tables/. Note that both examples use absolute paths because of the specifics of the Python from Git repository processor.
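The folder structure can also be created with a few lines of Python. This is a minimal sketch: the function name create_local_structure is hypothetical, and the root folder is passed in as an argument because creating /data itself may require elevated permissions on some systems.

```python
from pathlib import Path


def create_local_structure(root):
    """Create the in/, out/ and repository/ folders under `root`.

    Existing folders are left untouched, so the function is safe to call again.
    """
    root = Path(root)
    for sub in ("in/tables", "in/files", "out/tables", "out/files", "repository"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


# Possible use (creates the structure at /data, matching the examples):
#   create_local_structure("/data")
```

If you create the structure somewhere other than /data, scripts that use the absolute /data paths need their paths adjusted for the local run.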