With job scheduling, the usual expectation is that a job starts at the date we specify (as with cron jobs); in Airflow, it does not quite work like this…
Although my experience is with Airflow version 2.3.3, the underlying concepts remain applicable, though some terminology may have changed since.
The current iteration of this post does not delve into Airflow Timetables (dynamic scheduling) or the best practice property of Idempotency for DAGs (rerunning the DAG will always have the same effect).
The start_date parameter provides the date from which your DAG as a whole starts running. The clearest approach is a static value, as the Airflow documentation warns that dynamic values (such as the current time) can cause errors. The following code has this effect.
"start_date": pendulum.datetime(2022, 10, 15, tz="Europe/London"),
# Or
"start_date": datetime.datetime(2022, 10, 15)
The Data Interval is how often you would like to run your DAG; think of it as the time window over which your workflow collects data.
The Logical Date refers to the start of the Data Interval. It is often confused with the time the DAG is triggered; however, it is actually the starting point of the window from which your workflow gets its data. It did not help that before Airflow 2.2, the Logical Date was called the Execution Date.
The data_interval_end marks the end of the period for which data is collected. Once this point is reached, we move to the stage of actually triggering the tasks within our DAG.
DAG Trigger
Only after the Data Interval is complete does the DAG trigger and run the tasks.
| Run ID | Schedule Period (data_interval_start — data_interval_end) | Triggers at |
|---|---|---|
| scheduled__2022-11-15T00:00:00+00:00 | 2022-11-15, 00:00:00 UTC — 2022-11-15, 01:00:00 UTC | 2022-11-15, 01:00:00 UTC |
| scheduled__2022-11-15T01:00:00+00:00 | 2022-11-15, 01:00:00 UTC — 2022-11-15, 02:00:00 UTC | 2022-11-15, 02:00:00 UTC |
| scheduled__2022-11-15T02:00:00+00:00 | 2022-11-15, 02:00:00 UTC — 2022-11-15, 03:00:00 UTC | 2022-11-15, 03:00:00 UTC |
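The relationship between the Data Interval and the trigger time can be sketched in plain Python. This is a simplified model of the behaviour in the table above, not Airflow's actual scheduler code: for an hourly schedule, each run is identified by its data_interval_start, and it only fires once its data_interval_end has passed.

```python
from datetime import datetime, timedelta, timezone

def hourly_runs(start_date, count):
    """Model an hourly schedule: each run covers [start, start + 1h)
    and triggers only once that interval has closed."""
    interval = timedelta(hours=1)
    runs = []
    for i in range(count):
        data_interval_start = start_date + i * interval
        data_interval_end = data_interval_start + interval
        runs.append({
            "run_id": f"scheduled__{data_interval_start.isoformat()}",
            "data_interval_start": data_interval_start,
            "data_interval_end": data_interval_end,
            # The DAG triggers at the end of its interval, not the start.
            "triggers_at": data_interval_end,
        })
    return runs

first = hourly_runs(datetime(2022, 11, 15, tzinfo=timezone.utc), 3)[0]
print(first["run_id"])       # scheduled__2022-11-15T00:00:00+00:00
print(first["triggers_at"])  # 2022-11-15 01:00:00+00:00
```

Note how the run labelled with the 00:00 Logical Date does not actually execute until 01:00 — exactly the confusion described above.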
For my project, I wanted to collect data on the music I listen to during the day, using Apache Airflow as the workflow management tool to automate the ETL process.
I was not interested in Catchup, so I set the Start Date parameter for the day after. If I had wanted Catchup, I would have needed my DAG to be Idempotent (completed as of 9/11/22) and for my requests to the Spotify API to not reach the rate limits.
The Data Interval is set to hourly because the Spotify API endpoint for recently played songs limits retrieving details to 50 songs per request; the Data Interval of an hour is a safe duration to avoid the possibility of data loss.
To finish up, the DAG for the Mistake Tracker has a start date of 2022-11-15. Once reached, the first Data Interval starts and ends an hour later. After the Data Interval is over, the DAG is triggered, running the ETL process to collect details of the songs played during that period. Once completed, the subsequent Data Interval starts, and this process continues until you either manually stop the DAG or add an end_date parameter to the DAG file.
There comes a time, when answering a question with data, that not all the information you need sits within a single table. That is where knowing how to join tables becomes invaluable.
We will go through an understanding of Joins and how to apply them in MySQL.
Joins help you connect multiple tables that share a column. The tables below are related through the office_code column in Employees and the office_ID column in Offices, which we can use to connect them.
Employees
| ID | first_name | last_name | office_code | boss_ID |
|---|---|---|---|---|
| 100 | Donatello | Bianchi | 001 | null |
| 101 | Ugo | Cavallo | 001 | 100 |
| 102 | Gionata | Lemmi | 001 | 100 |
| 103 | Madeline | Fournier | 002 | 100 |
| 104 | Luuk | Roosa | 002 | 100 |
| 105 | Niall | McCabe | 003 | 101 |
| 106 | Rosalva | Rojas | null | 101 |
| 107 | Kostis | Antonis | null | 101 |
Offices
| office_ID | city | phone_number |
|---|---|---|
| 001 | Seattle | +1202-555-0183 |
| 002 | Dublin | +353 01 918 3457 |
| 003 | Sydney | +61 2 5550 9137 |
| 004 | Busan | +82 2 407 5914 |
Left Join
A LEFT JOIN returns all rows from the left table (employees), together with any matching rows from the right table (offices).
SELECT *
FROM employees
LEFT JOIN offices
ON employees.office_code = offices.office_ID
| ID | first_name | last_name | office_code | boss_ID | office_ID | city | phone_number |
|---|---|---|---|---|---|---|---|
| 100 | Donatello | Bianchi | 001 | null | 1 | Seattle | +1202-555-0183 |
| 101 | Ugo | Cavallo | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 102 | Gionata | Lemmi | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 103 | Madeline | Fournier | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 104 | Luuk | Roosa | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 105 | Niall | McCabe | 003 | 101 | 3 | Sydney | +61 2 5550 9137 |
| 106 | Rosalva | Rojas | null | 101 | null | null | null |
| 107 | Kostis | Antonis | null | 101 | null | null | null |
Right Join
A RIGHT JOIN returns all rows from the right table (offices), together with any matching rows from the left table (employees).
SELECT *
FROM employees
RIGHT JOIN offices
ON employees.office_code = offices.office_ID
| ID | first_name | last_name | office_code | boss_ID | office_ID | city | phone_number |
|---|---|---|---|---|---|---|---|
| null | null | null | null | null | 4 | Busan | +82 2 407 5914 |
| 100 | Donatello | Bianchi | 001 | null | 1 | Seattle | +1202-555-0183 |
| 101 | Ugo | Cavallo | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 102 | Gionata | Lemmi | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 103 | Madeline | Fournier | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 104 | Luuk | Roosa | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 105 | Niall | McCabe | 003 | 101 | 3 | Sydney | +61 2 5550 9137 |
As you can see above, the second table (offices) returned the Busan office details, although there are no recorded employees at this location.
Inner Join
An INNER JOIN returns only the rows that have a matching value in both tables.
SELECT *
FROM employees
INNER JOIN offices
ON employees.office_code = offices.office_ID
| ID | first_name | last_name | office_code | boss_ID | office_ID | city | phone_number |
|---|---|---|---|---|---|---|---|
| 100 | Donatello | Bianchi | 001 | null | 1 | Seattle | +1202-555-0183 |
| 101 | Ugo | Cavallo | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 102 | Gionata | Lemmi | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 103 | Madeline | Fournier | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 104 | Luuk | Roosa | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 105 | Niall | McCabe | 003 | 101 | 3 | Sydney | +61 2 5550 9137 |
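The difference between LEFT and INNER join is easy to verify with SQLite (a quick sketch using Python's built-in sqlite3 module; the table and column names follow the examples above — note MySQL and SQLite share this join syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (ID TEXT, first_name TEXT, last_name TEXT,
                        office_code TEXT, boss_ID TEXT);
CREATE TABLE offices (office_ID TEXT, city TEXT, phone_number TEXT);
INSERT INTO employees VALUES
  ('100','Donatello','Bianchi','001',NULL),
  ('101','Ugo','Cavallo','001','100'),
  ('102','Gionata','Lemmi','001','100'),
  ('103','Madeline','Fournier','002','100'),
  ('104','Luuk','Roosa','002','100'),
  ('105','Niall','McCabe','003','101'),
  ('106','Rosalva','Rojas',NULL,'101'),
  ('107','Kostis','Antonis',NULL,'101');
INSERT INTO offices VALUES
  ('001','Seattle','+1202-555-0183'),
  ('002','Dublin','+353 01 918 3457'),
  ('003','Sydney','+61 2 5550 9137'),
  ('004','Busan','+82 2 407 5914');
""")

left_rows = conn.execute(
    "SELECT * FROM employees LEFT JOIN offices "
    "ON employees.office_code = offices.office_ID").fetchall()
inner_rows = conn.execute(
    "SELECT * FROM employees INNER JOIN offices "
    "ON employees.office_code = offices.office_ID").fetchall()

print(len(left_rows))   # 8: every employee kept, even without an office
print(len(inner_rows))  # 6: only employees with a matching office
```

The two employees without an office_code appear in the LEFT JOIN result (with NULL office columns) but drop out of the INNER JOIN entirely.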
Self Join
A self join duplicates a single table, treats each copy as distinct, and joins them into one result. But why would you need to do this? For hierarchical data.
Each row in the Employees’ table references their boss’s ID number. We may want the information of the higher-up to show on the same row as the subordinate.
-- I'm narrowing down the columns viewed after the join
SELECT emp.ID, emp.first_name, emp.last_name, emp.office_code,
emp.boss_ID, boss.first_name AS boss_fname,
boss.last_name as boss_lname, boss.office_code AS boss_office_code
FROM employees AS emp
INNER JOIN employees AS boss
ON emp.boss_ID = boss.ID
| ID | first_name | last_name | office_code | boss_ID | boss_fname | boss_lname | boss_office_code |
|---|---|---|---|---|---|---|---|
| 101 | Ugo | Cavallo | 001 | 100 | Donatello | Bianchi | 001 |
| 102 | Gionata | Lemmi | 001 | 100 | Donatello | Bianchi | 001 |
| 103 | Madeline | Fournier | 002 | 100 | Donatello | Bianchi | 001 |
| 104 | Luuk | Roosa | 002 | 100 | Donatello | Bianchi | 001 |
| 105 | Niall | McCabe | 003 | 101 | Ugo | Cavallo | 001 |
| 106 | Rosalva | Rojas | null | 101 | Ugo | Cavallo | 001 |
| 107 | Kostis | Antonis | null | 101 | Ugo | Cavallo | 001 |
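The same self join can be checked with SQLite (again a sketch via Python's sqlite3 module, with the Employees data from above and a trimmed column list):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (ID TEXT, first_name TEXT, "
             "last_name TEXT, office_code TEXT, boss_ID TEXT)")
conn.executemany("INSERT INTO employees VALUES (?,?,?,?,?)", [
    ('100', 'Donatello', 'Bianchi', '001', None),
    ('101', 'Ugo', 'Cavallo', '001', '100'),
    ('102', 'Gionata', 'Lemmi', '001', '100'),
    ('103', 'Madeline', 'Fournier', '002', '100'),
    ('104', 'Luuk', 'Roosa', '002', '100'),
    ('105', 'Niall', 'McCabe', '003', '101'),
    ('106', 'Rosalva', 'Rojas', None, '101'),
    ('107', 'Kostis', 'Antonis', None, '101'),
])

# The table joined to itself under two aliases: emp (subordinate), boss.
rows = conn.execute("""
    SELECT emp.ID, emp.first_name, boss.first_name AS boss_fname
    FROM employees AS emp
    INNER JOIN employees AS boss ON emp.boss_ID = boss.ID
""").fetchall()

print(len(rows))  # 7: everyone except the top boss, whose boss_ID is NULL
```

Because this is an inner join, Donatello (boss_ID of NULL) matches no boss row and is excluded from the result.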
Data Science
| Value | Prefix | Symbol | Scientific Notation | E Notation | Order of Magnitude |
|---|---|---|---|---|---|
| 0.000000000001 | pico– | p | ×10⁻¹² | 1E-12 | -12 |
| 0.000000001 | nano– | n | ×10⁻⁹ | 1E-9 | -9 |
| 0.000001 | micro– | µ | ×10⁻⁶ | 1E-6 | -6 |
| 0.001 | milli– | m | ×10⁻³ | 1E-3 | -3 |
| 0.01 | centi– | c | ×10⁻² | 1E-2 | -2 |
| 0.1 | deci– | d | ×10⁻¹ | 1E-1 | -1 |
| 1 | (none) | — | ×10⁰ | 1E0 | 0 |
| 10 | deka– | da | ×10¹ | 1E1 | 1 |
| 100 | hecto– | h | ×10² | 1E2 | 2 |
| 1’000 | kilo– | k | ×10³ | 1E3 | 3 |
| 1’000’000 | mega– | M | ×10⁶ | 1E6 | 6 |
| 1’000’000’000 | giga– | G | ×10⁹ | 1E9 | 9 |
| 1’000’000’000’000 | tera– | T | ×10¹² | 1E12 | 12 |
| 1’000’000’000’000’000 | peta– | P | ×10¹⁵ | 1E15 | 15 |
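E notation maps directly onto float literals in most programming languages. A small Python illustration of the table above (order_of_magnitude is a hypothetical helper, not a standard function):

```python
import math

# E notation is just a compact way of writing powers of ten.
assert 1e3 == 1_000           # kilo
assert 1e-3 == 0.001          # milli
assert 1e9 == 1_000_000_000   # giga

def order_of_magnitude(x):
    """Exponent of ten for a value, e.g. 1e6 -> 6 (mega)."""
    return round(math.log10(x))

print(order_of_magnitude(1e-12))  # -12 (pico)
print(order_of_magnitude(1e15))   # 15 (peta)
```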