Commit f563e58 (1 parent: a6090fc)

Adds quick airflow comparison for docs

This is a quick comparison to show how lightweight Hamilton is and that it can also be used within Airflow.

File tree: 1 file changed (+118, −1 lines)


docs/code-comparisons/airflow.rst

Airflow
======================

For more details see this `Hamilton + Airflow blog post <https://blog.dagworks.io/p/supercharge-your-airflow-dag-with>`_.

**TL;DR:**

1. Hamilton complements Airflow. It'll help you write better, more modular, and testable code.
2. Hamilton does not replace Airflow.

High-level differences:
-----------------------

* Hamilton is a micro-orchestrator. Airflow is a macro-orchestrator.
* Hamilton is a Python library standardizing how you express Python pipelines, while Airflow is a complete platform and
  system for scheduling and executing pipelines.
* Hamilton focuses on providing a lightweight, low-dependency, flexible way to define data pipelines as Python functions,
  whereas Airflow is a whole system that comes with a web-based UI, scheduler, and executor.
* Hamilton pipelines are defined using pure Python code that can be run anywhere Python runs. While Airflow uses
  Python to describe a DAG, that DAG can only be run by the Airflow system.
* Hamilton complements Airflow, and you can use Hamilton within Airflow. The reverse is not true.
* You can use Hamilton directly in a Jupyter notebook or a Python web service. You can't do this with Airflow.
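The "micro-orchestrator" idea can be made concrete: Hamilton infers the dependency graph from function parameter names. The toy resolver below illustrates that rule only — the names `resolve` and `funcs` are invented for this sketch and this is not Hamilton's actual implementation:

```python
import inspect

# Pipeline steps: a parameter name refers to the upstream function of
# the same name, which is how Hamilton infers the DAG from signatures.
def raw_data():
    return [1, 2, 3]

def processed_data(raw_data):
    return [x * 2 for x in raw_data]

def resolve(funcs, target, inputs=None):
    """Toy resolver: compute `target` by recursively resolving each
    parameter name as either a provided input or another function."""
    inputs = dict(inputs or {})
    if target in inputs:
        return inputs[target]
    fn = funcs[target]
    kwargs = {name: resolve(funcs, name, inputs)
              for name in inspect.signature(fn).parameters}
    return fn(**kwargs)

funcs = {f.__name__: f for f in (raw_data, processed_data)}
print(resolve(funcs, 'processed_data'))  # [2, 4, 6]
```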

Code examples:
--------------
Looking at the two examples below, you can see that Hamilton is a more lightweight and flexible way to define data pipelines.
No scheduling configuration is required to run the code, because Hamilton runs the pipeline in the same process as the
caller, which makes pipelines easier to test and debug. Airflow, on the other hand, is a complete system for scheduling and
executing pipelines, and is more complex to set up and run. Note: if you put the contents of ``run.py`` in a function within
``example_dag.py``, the Hamilton pipeline could be used inside an Airflow ``PythonOperator``!

Hamilton:
_________
The code below shows how to define a simple data pipeline using Hamilton. The pipeline consists of three functions
that are executed in sequence. The pipeline is defined in a module called ``pipeline.py``, and then executed in a separate
script called ``run.py``, which imports the pipeline module and executes it.

.. code-block:: python

    # pipeline.py
    def raw_data() -> list[int]:
        return [1, 2, 3]

    def processed_data(raw_data: list[int]) -> list[int]:
        return [x * 2 for x in raw_data]

    def load_data(processed_data: list[int], client: SomeClient) -> dict:
        # Hamilton wires in `processed_data` by matching the parameter
        # name to the function above.
        metadata = client.send_data(processed_data)
        return metadata

    # run.py -- this is the script that executes the pipeline
    import pipeline
    from hamilton import driver

    dr = driver.Builder().with_modules(pipeline).build()
    metadata = dr.execute(['load_data'], inputs=dict(client=SomeClient()))
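Because the pipeline steps are plain Python functions, they can also be unit-tested without the Hamilton driver at all. A sketch, using a hypothetical `FakeClient` stand-in for the abstract `SomeClient` (both the fake and its `rows_sent` field are invented here for illustration):

```python
# Hypothetical test double for the abstract SomeClient, so the pipeline
# functions can be exercised without any real backend.
class FakeClient:
    def send_data(self, data):
        return {'rows_sent': len(data)}

# The same functions as in pipeline.py -- plain Python, so a unit test
# can simply call them directly and check the results.
def raw_data():
    return [1, 2, 3]

def processed_data(raw_data):
    return [x * 2 for x in raw_data]

def load_data(processed_data, client):
    return client.send_data(processed_data)

result = load_data(processed_data(raw_data()), client=FakeClient())
print(result)  # {'rows_sent': 3}
```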

Airflow:
________
The code below shows how to define the same pipeline using Airflow. The pipeline consists of three tasks that are executed
in sequence. The entire pipeline is defined in a module called ``example_dag.py``, and then executed by the Airflow scheduler.

.. code-block:: python

    # example_dag.py
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from datetime import datetime, timedelta

    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': datetime(2023, 1, 1),
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    dag = DAG(
        'example_dag',
        default_args=default_args,
        description='A simple DAG',
        schedule_interval=timedelta(days=1),
    )

    def extract_data():
        return [1, 2, 3]

    def transform_data(data):
        return [x * 2 for x in data]

    def load_data(data):
        client = SomeClient()
        client.send_data(data)

    extract_task = PythonOperator(
        task_id='extract_data',
        python_callable=extract_data,
        dag=dag,
    )

    transform_task = PythonOperator(
        task_id='transform_data',
        python_callable=transform_data,
        op_args=['{{ ti.xcom_pull(task_ids="extract_data") }}'],
        dag=dag,
    )

    load_task = PythonOperator(
        task_id='load_data',
        python_callable=load_data,
        op_args=['{{ ti.xcom_pull(task_ids="transform_data") }}'],
        dag=dag,
    )

    extract_task >> transform_task >> load_task
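As the note above suggests, the contents of `run.py` can be wrapped in a callable and handed to a `PythonOperator`. A sketch, assuming the `sf-hamilton` and `apache-airflow` packages and the `pipeline.py` module from the Hamilton example; the function name `run_hamilton_pipeline` is invented here, and the task wiring is shown as a comment so the snippet stays self-contained:

```python
# example_dag.py (additional task) -- illustrative sketch only.
def run_hamilton_pipeline():
    # The contents of run.py, moved into a callable. Imports happen
    # lazily so the DAG file stays cheap for Airflow to parse.
    import pipeline                  # the Hamilton module from the example above
    from hamilton import driver

    dr = driver.Builder().with_modules(pipeline).build()
    return dr.execute(['load_data'], inputs=dict(client=SomeClient()))

# Wired into the DAG exactly like the other tasks:
#
#   hamilton_task = PythonOperator(
#       task_id='run_hamilton_pipeline',
#       python_callable=run_hamilton_pipeline,
#       dag=dag,
#   )
```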
