

Apache Airflow is a tool for describing, executing, and monitoring workflows. At Slack, we use Airflow to orchestrate and manage our data warehouse workflows, which include product and business metrics, and it is also used for different engineering use-cases (e.g. …).

As of September 2019, Slack has over 12 million daily active users performing 5 billion actions on average every week. This data is then processed by hundreds of workflows running Apache Hive, Apache Spark, and Presto on Airflow that submit jobs to our clusters running on thousands of EC2 instances. This results in over 700 billion records loaded daily into our S3 Data Warehouse.

For two years we've been running Airflow 1.8, and it was time for us to catch up with the recent releases and upgrade to Airflow 1.10. In this post we'll describe the problems we encountered and the solutions we put in place.

We had the following requirements for the upgrade:

- Reliability: The Airflow scheduler and webserver should run without issues after the upgrade. We are running hundreds of workflows that manage the state of thousands of tasks, and all of these should be scheduled and executed successfully.
- Fast rollback: Besides all the bug fixes and improvements that this new version brings, it also involves a backwards-incompatible schema upgrade on Airflow's metadata database. If things go wrong, we want to be able to roll back to the previous Airflow version and schema quickly.
- Minimized downtime: We want to reduce the time Airflow is down during the upgrade, so we don't affect our Airflow users and don't miss SLAs, as folks rely on having data on time.
- History preserved: The metadata of previous runs should be preserved, so that we can run backfills and don't have to update start_dates on the DAGs.

We considered a couple of strategies for the Airflow upgrade:

- Red-Black upgrade: We run the old and new versions of Airflow side by side and move a small set of workflows over at a time. This is the more reliable option, but it wasn't feasible: to do this, we would need each version of Airflow to point to its own metadata database, since sharing the same database can cause the same tasks to be scheduled on both Airflows (resulting in duplicates). Creating two databases, one for each version, and moving DAGs over piecewise would result in losing history (see the configuration sketch after this list).
- Big-Bang upgrade: We test as much as possible in dev and move all DAGs to the new version in one big bang! If there are issues, we either fix forward or roll back the upgrade.
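To make the metadata-database constraint behind the red-black option concrete, here is a minimal sketch of how each deployment could be pointed at its own database using Airflow's standard AIRFLOW__CORE__SQL_ALCHEMY_CONN environment override for the [core] sql_alchemy_conn setting. The hostnames, database names, and the configure_metadata_db helper are hypothetical and for illustration only; the post does not describe our actual connection setup.

```python
# Hypothetical sketch: give each Airflow version its own metadata database so
# the two schedulers never share (and double-schedule) the same task instances.
import os

# Placeholder connection strings; real hosts and credentials are not from the post.
METADATA_DBS = {
    "1.8": "mysql://airflow:PASSWORD@airflow-18-db.internal:3306/airflow_v18",
    "1.10": "mysql://airflow:PASSWORD@airflow-110-db.internal:3306/airflow_v110",
}

def configure_metadata_db(version: str) -> None:
    """Point this host's scheduler/webserver at a version-specific metadata DB."""
    os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN"] = METADATA_DBS[version]

# On the new ("black") hosts, run before starting the scheduler and webserver:
configure_metadata_db("1.10")
```

The downside called out above still applies: with two separate databases, run history is split between them, which is why this path was ruled out.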

Since the red-black upgrade was not feasible and involved more risk to data quality, we went ahead with the big-bang upgrade. The challenge here was that we didn't have very good test data and had only a handful of DAGs in dev, so we added some of our critical DAGs from prod to dev for testing.

Steps needed for the big-bang upgrade

Following are the high-level steps for the upgrade. We did these steps in dev first and then in our prod environment:

- Launch an instance with Airflow 1.10 installed.
- Fix incompatibilities with libraries that are installed on top of Airflow.
- Test, and either fix forward or roll back (a smoke-test sketch is included at the end of this section).

We wanted to upgrade the database in a way where we can roll back quickly and also minimize overall downtime. To optimize the upgrade based on the initial requirements, we considered some approaches for Database Backup and Schema Upgrade, which we will delve into next.

- Snapshot: This is a pretty straightforward approach where we first take a snapshot of the master, then upgrade to the new schema, and roll back to the snapshot if we see issues (see the sketch below).
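As a rough illustration of the snapshot approach, and assuming a MySQL-backed metadata database reachable via mysqldump/mysql (the post does not spell out the exact commands, so the host, credentials, and dump path below are placeholders): take a logical dump before the upgrade, run Airflow 1.10's schema migration with airflow upgradedb, and restore the dump if anything goes wrong.

```python
# Minimal sketch of the snapshot-based upgrade flow, assuming a MySQL metadata
# database. Hosts, users, and paths are placeholders; 'airflow upgradedb' is the
# 1.10 CLI command that applies the backwards-incompatible schema migrations.
import subprocess

DB_HOST = "airflow-db.internal"            # placeholder
DB_USER = "airflow"                        # placeholder; password assumed in ~/.my.cnf
DB_NAME = "airflow"                        # placeholder
DUMP_FILE = "/backups/airflow_pre_1_10.sql"

def snapshot_metadata_db() -> None:
    """Take a snapshot (logical dump) of the metadata database before upgrading."""
    with open(DUMP_FILE, "w") as out:
        subprocess.run(
            ["mysqldump", "-h", DB_HOST, "-u", DB_USER, DB_NAME],
            stdout=out, check=True,
        )

def upgrade_schema() -> None:
    """Apply Airflow 1.10's metadata schema migrations."""
    subprocess.run(["airflow", "upgradedb"], check=True)

def rollback_metadata_db() -> None:
    """Restore the pre-upgrade snapshot if we see issues after the upgrade."""
    with open(DUMP_FILE) as dump:
        subprocess.run(
            ["mysql", "-h", DB_HOST, "-u", DB_USER, DB_NAME],
            stdin=dump, check=True,
        )

if __name__ == "__main__":
    snapshot_metadata_db()
    upgrade_schema()
    # If the scheduler, webserver, or DAG runs misbehave on 1.10, roll back:
    # rollback_metadata_db()
```

One trade-off worth noting: anything written to the metadata database after the snapshot is taken is lost if we roll back to it.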

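For the "test, and either fix forward or roll back" step above, a quick smoke check might look like the sketch below. It simply verifies that the 1.10 CLI can import and list all DAGs without errors (airflow list_dags is the 1.10-era command); the shape of this check is an assumption for illustration, not the actual validation we ran.

```python
# Hypothetical smoke check after switching to Airflow 1.10: confirm the CLI can
# import and list all DAGs cleanly. A non-zero exit is the cue to fix forward
# or roll back.
import subprocess
import sys

def dags_parse_cleanly() -> bool:
    """Return True if 'airflow list_dags' (the 1.10 CLI) exits successfully."""
    result = subprocess.run(
        ["airflow", "list_dags"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True,
    )
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result.returncode == 0

if __name__ == "__main__":
    sys.exit(0 if dags_parse_cleanly() else 1)
```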