With job scheduling, the usual expectation is that a job starts at the date we specify (as with cron jobs); in Airflow, it does not quite work like this…
Although my experience is with Airflow version 2.3.3, the underlying concepts remain applicable, though some terminology may have changed since.
The current iteration of this post does not delve into Airflow Timetables (dynamic scheduling) or the best practice property of Idempotency for DAGs (rerunning the DAG will always have the same effect).
The start_date parameter provides the date from which your DAG as a whole starts running. The clearest approach is a static value, as the Airflow documentation warns that dynamic values (such as the current time) can cause errors. The following code has this effect.
"start_date": pendulum.datetime(2022, 10, 15, tz="Europe/London"),
# Or
"start_date": datetime.datetime(2022, 10, 15)
The Data Interval is how often you would like to run your DAG; think of it as the time window over which your workflow collects data.
The Logical Date refers to the start of the Data Interval. It is often confused with the time the DAG is triggered; however, it is actually the starting point of the window from which your workflow gets its data. It did not help that before Airflow 2.2, the Logical Date was called the Execution Date.
The data_interval_end marks the end of the period for which data is collected. Once this point is reached, we move to the stage of actually triggering the tasks within our DAG.
DAG Trigger
Only after the Data Interval is complete does the DAG trigger and run the tasks.
| Run ID | Schedule Period (data_interval_start — data_interval_end) | Triggers at |
|---|---|---|
| scheduled__2022-11-15T00:00:00+00:00 | 2022-11-15, 00:00:00 UTC — 2022-11-15, 01:00:00 UTC | 2022-11-15, 01:00:00 UTC |
| scheduled__2022-11-15T01:00:00+00:00 | 2022-11-15, 01:00:00 UTC — 2022-11-15, 02:00:00 UTC | 2022-11-15, 02:00:00 UTC |
| scheduled__2022-11-15T02:00:00+00:00 | 2022-11-15, 02:00:00 UTC — 2022-11-15, 03:00:00 UTC | 2022-11-15, 03:00:00 UTC |
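The relationship between the Data Interval and the trigger time can be sketched in plain Python. This is a simplified model of the behaviour in the table above, not Airflow's actual scheduler code: for an hourly schedule, each run is identified by its data_interval_start, and it only fires once its data_interval_end has passed.

```python
from datetime import datetime, timedelta, timezone

def hourly_runs(start_date, count):
    """Model an hourly schedule: each run covers [start, start + 1h)
    and triggers only once that interval has closed."""
    interval = timedelta(hours=1)
    runs = []
    for i in range(count):
        data_interval_start = start_date + i * interval
        data_interval_end = data_interval_start + interval
        runs.append({
            "run_id": f"scheduled__{data_interval_start.isoformat()}",
            "data_interval_start": data_interval_start,
            "data_interval_end": data_interval_end,
            # The DAG triggers at the end of its interval, not the start.
            "triggers_at": data_interval_end,
        })
    return runs

first = hourly_runs(datetime(2022, 11, 15, tzinfo=timezone.utc), 3)[0]
print(first["run_id"])       # scheduled__2022-11-15T00:00:00+00:00
print(first["triggers_at"])  # 2022-11-15 01:00:00+00:00
```

Note how the run labelled with the 00:00 Logical Date does not actually execute until 01:00 — exactly the confusion described above.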
For my project, I wanted to collect data on the music I listen to during the day, using Apache Airflow as the workflow management tool to automate the ETL process.
I was not interested in Catchup, so I set the Start Date parameter for the day after. If I had wanted Catchup, I would have needed my DAG to be Idempotent (completed as of 9/11/22) and for my requests to the Spotify API to not reach the rate limits.
The Data Interval is set to hourly because the Spotify API endpoint for recently played songs limits retrieving details to 50 songs per request; the Data Interval of an hour is a safe duration to avoid the possibility of data loss.
To finish up, the DAG for the Mistake Tracker has a start date of 2022-11-15. Once reached, the first Data Interval starts and ends an hour later. After the Data Interval is over, the DAG is triggered, running the ETL process to collect details of the songs played during that period. Once completed, the subsequent Data Interval starts, and this process continues until you either manually stop the DAG or add an end_date parameter to the DAG file.
There comes a time, when answering a question with data, that not all the information you need sits within a single table. That is where knowing how to join tables becomes invaluable.
We will go through an understanding of Joins and how to apply them in MySQL.
Joins help you connect multiple tables that share a column. The tables below are related through the office_code column in Employees and the office_ID column in Offices, which we can use to connect them.
Employees
| ID | first_name | last_name | office_code | boss_ID |
|---|---|---|---|---|
| 100 | Donatello | Bianchi | 001 | null |
| 101 | Ugo | Cavallo | 001 | 100 |
| 102 | Gionata | Lemmi | 001 | 100 |
| 103 | Madeline | Fournier | 002 | 100 |
| 104 | Luuk | Roosa | 002 | 100 |
| 105 | Niall | McCabe | 003 | 101 |
| 106 | Rosalva | Rojas | null | 101 |
| 107 | Kostis | Antonis | null | 101 |
Offices
| office_ID | city | phone_number |
|---|---|---|
| 001 | Seattle | +1202-555-0183 |
| 002 | Dublin | +353 01 918 3457 |
| 003 | Sydney | +61 2 5550 9137 |
| 004 | Busan | +82 2 407 5914 |
Left Join
A LEFT JOIN returns all rows from the left table (employees), together with any matching rows from the right table (offices).
SELECT *
FROM employees
LEFT JOIN offices
ON employees.office_code = offices.office_ID
| ID | first_name | last_name | office_code | boss_ID | office_ID | city | phone_number |
|---|---|---|---|---|---|---|---|
| 100 | Donatello | Bianchi | 001 | null | 1 | Seattle | +1202-555-0183 |
| 101 | Ugo | Cavallo | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 102 | Gionata | Lemmi | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 103 | Madeline | Fournier | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 104 | Luuk | Roosa | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 105 | Niall | McCabe | 003 | 101 | 3 | Sydney | +61 2 5550 9137 |
| 106 | Rosalva | Rojas | null | 101 | null | null | null |
| 107 | Kostis | Antonis | null | 101 | null | null | null |
Right Join
A RIGHT JOIN returns all rows from the right table (offices), together with any matching rows from the left table (employees).
SELECT *
FROM employees
RIGHT JOIN offices
ON employees.office_code = offices.office_ID
| ID | first_name | last_name | office_code | boss_ID | office_ID | city | phone_number |
|---|---|---|---|---|---|---|---|
| null | null | null | null | null | 4 | Busan | +82 2 407 5914 |
| 100 | Donatello | Bianchi | 001 | null | 1 | Seattle | +1202-555-0183 |
| 101 | Ugo | Cavallo | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 102 | Gionata | Lemmi | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 103 | Madeline | Fournier | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 104 | Luuk | Roosa | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 105 | Niall | McCabe | 003 | 101 | 3 | Sydney | +61 2 5550 9137 |
As you can see above, the second table (offices) returned the Busan office details, although there are no recorded employees at this location.
Inner Join
An INNER JOIN returns only the rows that have a matching value in both tables.
SELECT *
FROM employees
INNER JOIN offices
ON employees.office_code = offices.office_ID
| ID | first_name | last_name | office_code | boss_ID | office_ID | city | phone_number |
|---|---|---|---|---|---|---|---|
| 100 | Donatello | Bianchi | 001 | null | 1 | Seattle | +1202-555-0183 |
| 101 | Ugo | Cavallo | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 102 | Gionata | Lemmi | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 103 | Madeline | Fournier | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 104 | Luuk | Roosa | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 105 | Niall | McCabe | 003 | 101 | 3 | Sydney | +61 2 5550 9137 |
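The difference between LEFT and INNER join is easy to verify with SQLite (a quick sketch using Python's built-in sqlite3 module; the table and column names follow the examples above — note MySQL and SQLite share this join syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (ID TEXT, first_name TEXT, last_name TEXT,
                        office_code TEXT, boss_ID TEXT);
CREATE TABLE offices (office_ID TEXT, city TEXT, phone_number TEXT);
INSERT INTO employees VALUES
  ('100','Donatello','Bianchi','001',NULL),
  ('101','Ugo','Cavallo','001','100'),
  ('102','Gionata','Lemmi','001','100'),
  ('103','Madeline','Fournier','002','100'),
  ('104','Luuk','Roosa','002','100'),
  ('105','Niall','McCabe','003','101'),
  ('106','Rosalva','Rojas',NULL,'101'),
  ('107','Kostis','Antonis',NULL,'101');
INSERT INTO offices VALUES
  ('001','Seattle','+1202-555-0183'),
  ('002','Dublin','+353 01 918 3457'),
  ('003','Sydney','+61 2 5550 9137'),
  ('004','Busan','+82 2 407 5914');
""")

left_rows = conn.execute(
    "SELECT * FROM employees LEFT JOIN offices "
    "ON employees.office_code = offices.office_ID").fetchall()
inner_rows = conn.execute(
    "SELECT * FROM employees INNER JOIN offices "
    "ON employees.office_code = offices.office_ID").fetchall()

print(len(left_rows))   # 8: every employee kept, even without an office
print(len(inner_rows))  # 6: only employees with a matching office
```

The two employees without an office_code appear in the LEFT JOIN result (with NULL office columns) but drop out of the INNER JOIN entirely.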
Self Join
A self join duplicates a single table, treats each copy as distinct, and joins them into one result. But why would you need to do this? For hierarchical data.
Each row in the Employees’ table references their boss’s ID number. We may want the information of the higher-up to show on the same row as the subordinate.
-- I'm narrowing down the columns viewed after the join
SELECT emp.ID, emp.first_name, emp.last_name, emp.office_code,
emp.boss_ID, boss.first_name AS boss_fname,
boss.last_name as boss_lname, boss.office_code AS boss_office_code
FROM employees AS emp
INNER JOIN employees AS boss
ON emp.boss_ID = boss.ID
| ID | first_name | last_name | office_code | boss_ID | boss_fname | boss_lname | boss_office_code |
|---|---|---|---|---|---|---|---|
| 101 | Ugo | Cavallo | 001 | 100 | Donatello | Bianchi | 001 |
| 102 | Gionata | Lemmi | 001 | 100 | Donatello | Bianchi | 001 |
| 103 | Madeline | Fournier | 002 | 100 | Donatello | Bianchi | 001 |
| 104 | Luuk | Roosa | 002 | 100 | Donatello | Bianchi | 001 |
| 105 | Niall | McCabe | 003 | 101 | Ugo | Cavallo | 001 |
| 106 | Rosalva | Rojas | null | 101 | Ugo | Cavallo | 001 |
| 107 | Kostis | Antonis | null | 101 | Ugo | Cavallo | 001 |
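The same self join can be checked with SQLite (again a sketch via Python's sqlite3 module, with the Employees data from above and a trimmed column list):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (ID TEXT, first_name TEXT, "
             "last_name TEXT, office_code TEXT, boss_ID TEXT)")
conn.executemany("INSERT INTO employees VALUES (?,?,?,?,?)", [
    ('100', 'Donatello', 'Bianchi', '001', None),
    ('101', 'Ugo', 'Cavallo', '001', '100'),
    ('102', 'Gionata', 'Lemmi', '001', '100'),
    ('103', 'Madeline', 'Fournier', '002', '100'),
    ('104', 'Luuk', 'Roosa', '002', '100'),
    ('105', 'Niall', 'McCabe', '003', '101'),
    ('106', 'Rosalva', 'Rojas', None, '101'),
    ('107', 'Kostis', 'Antonis', None, '101'),
])

# The table joined to itself under two aliases: emp (subordinate), boss.
rows = conn.execute("""
    SELECT emp.ID, emp.first_name, boss.first_name AS boss_fname
    FROM employees AS emp
    INNER JOIN employees AS boss ON emp.boss_ID = boss.ID
""").fetchall()

print(len(rows))  # 7: everyone except the top boss, whose boss_ID is NULL
```

Because this is an inner join, Donatello (boss_ID of NULL) matches no boss row and is excluded from the result.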
Data Science
| Value | Prefix | Symbol | Scientific Notation | E Notation | Order of Magnitude |
|---|---|---|---|---|---|
| 0.000000000001 | pico– | p | ×10⁻¹² | 1E-12 | -12 |
| 0.000000001 | nano– | n | ×10⁻⁹ | 1E-9 | -9 |
| 0.000001 | micro– | µ | ×10⁻⁶ | 1E-6 | -6 |
| 0.001 | milli– | m | ×10⁻³ | 1E-3 | -3 |
| 0.01 | centi– | c | ×10⁻² | 1E-2 | -2 |
| 0.1 | deci– | d | ×10⁻¹ | 1E-1 | -1 |
| 1 | (none) | — | ×10⁰ | 1E0 | 0 |
| 10 | deka– | da | ×10¹ | 1E1 | 1 |
| 100 | hecto– | h | ×10² | 1E2 | 2 |
| 1’000 | kilo– | k | ×10³ | 1E3 | 3 |
| 1’000’000 | mega– | M | ×10⁶ | 1E6 | 6 |
| 1’000’000’000 | giga– | G | ×10⁹ | 1E9 | 9 |
| 1’000’000’000’000 | tera– | T | ×10¹² | 1E12 | 12 |
| 1’000’000’000’000’000 | peta– | P | ×10¹⁵ | 1E15 | 15 |
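E notation maps directly onto float literals in most programming languages. A small Python illustration of the table above (order_of_magnitude is a hypothetical helper, not a standard function):

```python
import math

# E notation is just a compact way of writing powers of ten.
assert 1e3 == 1_000           # kilo
assert 1e-3 == 0.001          # milli
assert 1e9 == 1_000_000_000   # giga

def order_of_magnitude(x):
    """Exponent of ten for a value, e.g. 1e6 -> 6 (mega)."""
    return round(math.log10(x))

print(order_of_magnitude(1e-12))  # -12 (pico)
print(order_of_magnitude(1e15))   # 15 (peta)
```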