
Install Apache Airflow on Ubuntu 18.04

Airflow is one of the most popular workflow management solutions; it lets you author, schedule, and monitor workflows. This blog post walks through installing Airflow on an Ubuntu 18.04 server.

Requirements

  • Python 2.7
  • pip
  • Ubuntu 18.04 Server (at least 4 GB RAM size)

Install Python and pip

We will be using Python 2.7 in this tutorial. Let's start by installing Python and setuptools on your Ubuntu machine.

sudo apt-get install python-setuptools

pip is a Python package management tool. We'll use it to install the packages Airflow requires.

sudo apt-get install python-pip

Note: It is recommended to use the latest pip version. To upgrade pip, use the command given below.

sudo pip install --upgrade pip

Installing PostgreSQL for Airflow

Airflow comes with a SQLite database backend by default, but SQLite limits Airflow to running one task at a time and is not suitable for real data pipelines. We need a more powerful database system like PostgreSQL, an open-source database management system with a robust feature set, data integrity, and extensibility. We will install PostgreSQL and configure it for use with Airflow.

sudo apt-get install postgresql postgresql-contrib

With PostgreSQL installed by the command above, we will now create a database for Airflow and grant access to our user. Let's open psql, the command-line tool for Postgres.

sudo -u postgres psql

After logging in successfully, we will get the psql prompt (postgres=#). We will create a new role and grant it the required privileges.

CREATE ROLE ubuntu LOGIN;
CREATE DATABASE airflow;
GRANT ALL PRIVILEGES on database airflow to ubuntu;
ALTER ROLE ubuntu SUPERUSER;
ALTER ROLE ubuntu CREATEDB;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public to ubuntu;
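
You can verify that the role was created with the expected attributes using psql's \du meta-command:

postgres=# \du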

Now connect to the airflow database and check the connection information.

postgres=# \c airflow

After a successful connection, the prompt changes to airflow=#. We will verify this by fetching the connection info:

airflow=# \conninfo

\conninfo command output:

You are connected to database "airflow" as user "postgres" via socket in "/var/run/postgresql" at port "5432".

We'll change settings in the pg_hba.conf file to configure PostgreSQL as Airflow requires. You can run SHOW hba_file; in psql to find the location of pg_hba.conf; it is most likely at /etc/postgresql/*/main/pg_hba.conf.

Open pg_hba.conf with vim and change the address for IPv4 connections to 0.0.0.0/0. Then, in postgresql.conf (in the same directory), set listen_addresses = '*'.
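
For reference, the edited lines might look like the sketch below. Note that the trust auth method is an assumption on my part: it matches the password-less connection string used later in this post, but it accepts connections without a password, so it is only appropriate on a trusted network (otherwise keep md5 and give the role a password).

# in /etc/postgresql/*/main/pg_hba.conf
host    all             all             0.0.0.0/0               trust

# in /etc/postgresql/*/main/postgresql.conf
listen_addresses = '*'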

We will restart PostgreSQL to load changes.

sudo service postgresql restart
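
You can confirm that PostgreSQL came back up cleanly by checking its status:

sudo service postgresql status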

Install Airflow

With PostgreSQL installed and configured, we will next install Airflow and configure it.

Set the AIRFLOW_HOME environment variable to ~/airflow.

export AIRFLOW_HOME=~/airflow
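
Optionally, to make this setting persist across shell sessions, append it to your ~/.bashrc:

echo 'export AIRFLOW_HOME=~/airflow' >> ~/.bashrc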

Install Ubuntu dependencies required for Apache Airflow.

  • sudo apt-get install libmysqlclient-dev (for the airflow mysql subpackage)
  • sudo apt-get install libssl-dev (for the airflow crypto subpackage)
  • sudo apt-get install libkrb5-dev (for the airflow kerberos subpackage)
  • sudo apt-get install libsasl2-dev (for the airflow hive subpackage)

After installing these dependencies, install Airflow and its packages.

sudo pip install apache-airflow

For other subpackages like celery, async, crypto, rabbitmq, etc., check the Apache Airflow installation page.
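
For example, pip's extras syntax installs Airflow together with named subpackages in one step (the celery and crypto extras shown here are among those listed on that page):

sudo pip install 'apache-airflow[celery,crypto]'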

After successfully installing Airflow, we will initialise Airflow's database:

airflow initdb

An airflow.cfg file should now have been generated in the Airflow home directory. We will tweak some configuration there to get better Airflow functionality.

We will use CeleryExecutor instead of the SequentialExecutor that comes with Airflow by default. Change:

executor = CeleryExecutor

For the DB connection we will point Airflow at the PostgreSQL database airflow that we created in the earlier step:

sql_alchemy_conn = postgresql+psycopg2://ubuntu@localhost:5432/airflow
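
Since the role name matches the ubuntu system user, you can sanity-check access to this database over the local socket using peer authentication (assuming you are logged in as the ubuntu user):

psql -d airflow -c '\conninfo'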

To remove the example DAGs from the home page, set the load_examples variable to False:
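
load_examples = False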

Set broker_url and celery_result_backend to the same RabbitMQ connection string, as shown below:

broker_url = amqp://guest:guest@localhost:5672//
celery_result_backend = amqp://guest:guest@localhost:5672//

After making all these settings, save the configuration file and exit.

To load the new configuration, we should run:

airflow initdb

Installing Rabbitmq

RabbitMQ is a message broker that is required to run Airflow DAGs with Celery. RabbitMQ can be installed with the following command:

sudo apt install rabbitmq-server

We will set NODE_IP_ADDRESS=0.0.0.0 in the RabbitMQ configuration file located at

/etc/rabbitmq/rabbitmq-env.conf

Now start the RabbitMQ service:

sudo service rabbitmq-server start
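
You can confirm that the broker is running with rabbitmqctl:

sudo rabbitmqctl status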

Installing Celery

Celery is a distributed task queue for Python that can use RabbitMQ as its message broker. We can install Celery using pip:

sudo pip install celery

Some Celery versions are not compatible with Airflow, so you should check which versions Airflow supports. Celery versions from 3.1.17 up to (but not including) 4.0 are known to be compatible with Airflow.
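
You can check which Celery version pip installed with:

pip show celery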

If pip installed a newer version, uninstall it with

sudo pip uninstall celery

and install a compatible version like this:

sudo pip install 'celery>=3.1.17,<4.0'


Starting Airflow

All the required installation and configuration is done. We will create a dags folder in the Airflow home directory, i.e. at /home/ubuntu/airflow:

mkdir -p /home/ubuntu/airflow/dags/

and then we'll start all the Airflow services to bring up the Airflow web UI:

airflow webserver
airflow scheduler
airflow worker

If you want to keep Airflow continuously up, you should run these commands with the -D flag, e.g. airflow webserver -D; this will run the service as a daemon in the background. You need to do this for each of the services if you want to keep them all up.
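
For example, to daemonize all three services:

airflow webserver -D
airflow scheduler -D
airflow worker -D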

Stopping Airflow

When you run Airflow as a daemon, it becomes a little trickier to stop. First you have to get the process ID of the Airflow service and then kill it using sudo.

cat $AIRFLOW_HOME/airflow-webserver.pid

The above command will print the Airflow webserver's process ID; now kill it using the command:

sudo kill -9 {process_id of airflow}
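
The two steps can also be combined into a single line (using the PID file path shown above):

sudo kill -9 $(cat $AIRFLOW_HOME/airflow-webserver.pid)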

To start Airflow again, use the commands airflow webserver, airflow scheduler, and airflow worker.

Airflow runs on port 8080 by default; the port can be changed in airflow.cfg. Visit localhost:8080 to see Airflow's user interface.

Pranav is a software developer at Vuja De. His work includes managing Amazon AWS, other cloud services, and internal infrastructure.
