The usual expectation for job scheduling is that the job starts at the date we specify (as with cron jobs); in Airflow, it does not quite work like that…
Notes about this post
My experience is with Airflow version 2.3.3; the underlying concepts still apply to other versions, though some terminology may differ.
The current iteration of this post does not delve into Airflow Timetables (dynamic scheduling) or the best-practice property of idempotency for DAGs (rerunning a DAG always produces the same effect).
Airflow Scheduling
Start Date
The Start Date parameter defines the date from which your DAG as a whole starts running. The best way to declare your start date is with a static value, as the Airflow documentation warns of possible errors otherwise. The following code has this effect.
"start_date": pendulum.datetime(2022, 10, 15, tz="Europe/London"),
# Or
"start_date": datetime.datetime(2022, 10, 15)
Data Interval
The Data Interval is the period of time each DAG run covers, and its length comes from how often you schedule the DAG; think of it as the window of data your workflow collects.
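As a minimal sketch (assuming Airflow 2.3-style arguments; the DAG id and task are hypothetical), an hourly schedule gives every DAG run a one-hour Data Interval:

import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="hourly_example",
    start_date=pendulum.datetime(2022, 11, 15, tz="UTC"),
    schedule_interval="@hourly",  # Data Interval length = 1 hour
    catchup=False,
) as dag:
    # Placeholder task so the DAG is valid
    do_nothing = EmptyOperator(task_id="do_nothing")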
Logical Date (aka Data Interval Start)
The Logical Date is the start date and time of the Data Interval. It is often confused with the time the DAG is triggered; in fact, it marks the start of the window of data collected for your workflow. It did not help that, before Airflow 2.2, the Logical Date was called the Execution Date.
Data Interval End
This is the end of the period the run collects data for. Once the Data Interval End is reached, we move on to actually triggering the tasks within our DAG.
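Both boundaries are exposed to tasks as template variables. A small sketch, assuming it sits inside a DAG definition like the hourly example above (the task id is hypothetical):

from airflow.operators.bash import BashOperator

# Prints the boundaries of the Data Interval for each run
print_window = BashOperator(
    task_id="print_window",
    bash_command=(
        "echo 'logical date / interval start: {{ data_interval_start }}' && "
        "echo 'interval end: {{ data_interval_end }}'"
    ),
)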
⭐DAG Trigger⭐
Only after the Data Interval is complete does the DAG trigger and run the tasks.
| Run ID | Schedule Period (data_interval_start — data_interval_end) | ⭐Triggers at⭐ |
| --- | --- | --- |
| scheduled__2022-11-15T00:00:00+00:00 | 2022-11-15, 00:00:00 UTC — 2022-11-15, 01:00:00 UTC | 2022-11-15, 01:00:00 UTC |
| scheduled__2022-11-15T01:00:00+00:00 | 2022-11-15, 01:00:00 UTC — 2022-11-15, 02:00:00 UTC | 2022-11-15, 02:00:00 UTC |
| scheduled__2022-11-15T02:00:00+00:00 | 2022-11-15, 02:00:00 UTC — 2022-11-15, 03:00:00 UTC | 2022-11-15, 03:00:00 UTC |
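The pattern in the table can be reproduced with a little date arithmetic: each run triggers at its data_interval_end, one schedule interval after its logical date. A rough illustration (plain Python, not Airflow code):

import pendulum

interval = pendulum.duration(hours=1)
logical_date = pendulum.datetime(2022, 11, 15, 0, tz="UTC")

for _ in range(3):
    data_interval_end = logical_date + interval
    print(f"run with logical date {logical_date} triggers at {data_interval_end}")
    logical_date = data_interval_end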
A Practical Example With My Mistake Tracker (Productivity Project)
For my project, I wanted to collect data on the music I listen to during the day. To automate the ETL process for this task, I used Apache Airflow as my workflow management tool.
I was not interested in Catchup, so I set the Start Date parameter to the following day. If I had wanted Catchup, I would have needed my DAG to be idempotent (completed as of 9/11/22) and my requests to the Spotify API to stay under the rate limits.
The Data Interval is set to hourly because the Spotify API endpoint for recently played songs returns at most 50 songs per request; an hour is a short enough window that this limit is never hit, avoiding the possibility of data loss.
To finish up, the DAG for the Mistake Tracker has a start date of 2022-11-15. Once that date is reached, the first Data Interval starts and ends an hour later. When the Data Interval is over, the DAG is triggered, running the ETL process to collect details of the songs played during that hour. The next Data Interval then begins, and this process continues until you either manually pause the DAG or add an end_date parameter to the DAG file.
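Putting the scheduling pieces together, the Mistake Tracker DAG looks roughly like the sketch below; the dag_id, task id and extract function are placeholders for the real Spotify ETL code, not the actual project source.

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_recently_played(**context):
    # Placeholder: call the Spotify recently-played endpoint for songs
    # played between data_interval_start and data_interval_end.
    print(context["data_interval_start"], context["data_interval_end"])


with DAG(
    dag_id="mistake_tracker",
    start_date=pendulum.datetime(2022, 11, 15, tz="UTC"),
    schedule_interval="@hourly",  # one-hour Data Interval
    catchup=False,                # no backfilling of past intervals
) as dag:
    extract = PythonOperator(
        task_id="extract_recently_played",
        python_callable=extract_recently_played,
    )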