James Glassey's Blog
https://blog.jamesglassey.com

Why Do My Apache Airflow Tasks Not Work on Time?
https://blog.jamesglassey.com/why-do-my-apache-airflow-tasks-not-work-on-time/
Fri, 25 Nov 2022 17:29:58 +0000


With most job schedulers, such as cron, we expect a job to start at the date and time we specify. Airflow does not quite work like this…

Notices about this post

My experience is with Airflow version 2.3.3. The underlying concepts should still apply to other versions, though some terminology may differ.

The current iteration of this post does not delve into Airflow Timetables (dynamic scheduling) or the best practice property of Idempotency for DAGs (rerunning the DAG will always have the same effect).

Airflow Scheduling

Start Date

The Start Date parameter sets the date from which your DAG as a whole starts running. Use a static value for the start date; the Airflow documentation warns that a dynamic value can lead to errors. Either of the following lines has this effect.

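# Assumes `import datetime` and `import pendulum` at the top of the DAG file.
"start_date": pendulum.datetime(2022, 10, 15, tz="Europe/London"),  # timezone-aware
# Or
"start_date": datetime.datetime(2022, 10, 15)  # naive (no timezone attached)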
[Figure: Calendar for airflow scheduler]

Data Interval

The Data Interval is the period of time that each DAG run covers; its length follows from how often you schedule the DAG to run. Think of it as the time window in which the data for your workflow is collected.

Logical Date (aka Data Interval Start)

The Logical Date is the date and time at which the Data Interval starts. It is often mistaken for the time the DAG is triggered, but it only marks the start of the period that the run's data comes from. It did not help that, before Airflow 2.2, the Logical Date was called the Execution Date.

Data Interval End

This is the end of the period that the data is collected for. Once it has passed, we move on to actually triggering the tasks within our DAG.

⭐DAG Trigger⭐

Only after the Data Interval is complete does the DAG trigger and run the tasks.

Job Scheduling

| Run ID | Schedule Period (data_interval_start to data_interval_end) | ⭐Triggers at⭐ |
| --- | --- | --- |
| scheduled__2022-11-15T00:00:00+00:00 | 2022-11-15, 00:00:00 UTC to 2022-11-15, 01:00:00 UTC | 2022-11-15, 01:00:00 UTC |
| scheduled__2022-11-15T01:00:00+00:00 | 2022-11-15, 01:00:00 UTC to 2022-11-15, 02:00:00 UTC | 2022-11-15, 02:00:00 UTC |
| scheduled__2022-11-15T02:00:00+00:00 | 2022-11-15, 02:00:00 UTC to 2022-11-15, 03:00:00 UTC | 2022-11-15, 03:00:00 UTC |

Table: DAG with an @hourly schedule interval
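
The pattern in the table can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not Airflow's scheduler: the function name `hourly_runs` is made up for illustration, and it assumes the run is triggered exactly when the Data Interval ends (in practice the scheduler triggers it shortly afterwards).

```python
from datetime import datetime, timedelta, timezone


def hourly_runs(start_date, count, interval=timedelta(hours=1)):
    """Yield (data_interval_start, data_interval_end, triggers_at) tuples.

    Hypothetical helper: it mimics the table above, not Airflow internals.
    """
    interval_start = start_date
    for _ in range(count):
        interval_end = interval_start + interval
        # A run can only be triggered once its Data Interval is over,
        # so the trigger time equals the end of the interval here.
        yield interval_start, interval_end, interval_end
        interval_start = interval_end


start = datetime(2022, 11, 15, tzinfo=timezone.utc)
runs = list(hourly_runs(start, 3))

for interval_start, interval_end, triggers_at in runs:
    run_id = f"scheduled__{interval_start.isoformat()}"
    print(f"{run_id}  {interval_start:%H:%M} to {interval_end:%H:%M}  "
          f"triggers at {triggers_at:%H:%M}")
```

Note that the Run ID carries the Logical Date (the start of the interval), which is why a run labelled 00:00 does not start until 01:00.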

A Practical Example With My Mistake Tracker (Productivity Project)

For my project, I wanted to collect data on the music I listen to during the day. I used Apache Airflow as my workflow management tool to automate the ETL process for this task.

I was not interested in Catchup, so I set the Start Date parameter to the following day. If I had wanted Catchup, I would have needed my DAG to be Idempotent (completed as of 9/11/22) and my requests to the Spotify API to stay within the rate limits.

The Data Interval is set to hourly because the Spotify API endpoint for recently played songs returns details for at most 50 songs per request. I am unlikely to play more than 50 songs within an hour, so an hourly Data Interval is a safe duration for avoiding data loss.

To finish up, the DAG for the Mistake Tracker has a start date of 2022-11-15. Once that date is reached, the first Data Interval starts, and it ends an hour later. After the Data Interval is over, the DAG is triggered and runs the ETL process to collect details of the songs played during that hour. The next Data Interval then starts, and the cycle continues until you either stop the DAG manually or add an end_date parameter to the DAG file.

All You Need To Know About Joins In MySQL
https://blog.jamesglassey.com/all-you-need-to-know-about-joins-in-mysql/
Wed, 28 Sep 2022 17:14:16 +0000

When To Use a Join?

When answering a question with data, there comes a time when not all the information needed is in a single table. That is where knowing how to join tables becomes invaluable.

We will go through what Joins are and how to apply them in MySQL.

What are Joins?

Joins help you connect multiple tables that have a column in common. The tables below are related through the office_code column in Employees and the office_ID column in Offices, which we can use to connect them.

Employees

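| ID | first_name | last_name | office_code | boss_ID |
| --- | --- | --- | --- | --- |
| 100 | Donatello | Bianchi | 001 | null |
| 101 | Ugo | Cavallo | 001 | 100 |
| 102 | Gionata | Lemmi | 001 | 100 |
| 103 | Madeline | Fournier | 002 | 100 |
| 104 | Luuk | Roosa | 002 | 100 |
| 105 | Niall | McCabe | 003 | 101 |
| 106 | Rosalva | Rojas | null | 101 |
| 107 | Kostis | Antonis | null | 101 |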

Offices

| office_ID | city | phone_number |
| --- | --- | --- |
| 001 | Seattle | +1202-555-0183 |
| 002 | Dublin | +353 01 918 3457 |
| 003 | Sydney | +61 2 5550 9137 |
| 004 | Busan | +82 2 407 5914 |

Types of Joins

Left Join

The returned table contains every row from the left table (employees), together with the matching rows from the right table (offices). Where an employee has no matching office, the office columns are null.

-- Keep every employee, even those without a matching office
SELECT *
FROM employees
LEFT JOIN offices
ON employees.office_code = offices.office_ID;
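
| ID | first_name | last_name | office_code | boss_ID | office_ID | city | phone_number |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | Donatello | Bianchi | 001 | null | 1 | Seattle | +1202-555-0183 |
| 101 | Ugo | Cavallo | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 102 | Gionata | Lemmi | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 103 | Madeline | Fournier | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 104 | Luuk | Roosa | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 105 | Niall | McCabe | 003 | 101 | 3 | Sydney | +61 2 5550 9137 |
| 106 | Rosalva | Rojas | null | 101 | null | null | null |
| 107 | Kostis | Antonis | null | 101 | null | null | null |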
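
To experiment with the Left Join without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is an assumption standing in for MySQL (the LEFT JOIN syntax used here is the same in both), and the tables are cut down to two rows each with fewer columns.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offices (office_ID TEXT, city TEXT);
    CREATE TABLE employees (ID INTEGER, first_name TEXT, office_code TEXT);
    INSERT INTO offices VALUES ('001', 'Seattle'), ('004', 'Busan');
    INSERT INTO employees VALUES (100, 'Donatello', '001'), (106, 'Rosalva', NULL);
""")

# Every employee is returned; Rosalva has no office, so her city is NULL (None).
left_rows = conn.execute("""
    SELECT employees.ID, employees.first_name, offices.city
    FROM employees
    LEFT JOIN offices
    ON employees.office_code = offices.office_ID
    ORDER BY employees.ID
""").fetchall()

print(left_rows)
```

The Busan office does not appear at all, because a Left Join only guarantees rows from the left table.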

Right Join

The returned table contains every row from the right table (offices), together with the matching rows from the left table (employees). Where an office has no matching employee, the employee columns are null.

SELECT * 
FROM employees
RIGHT JOIN offices
ON employees.office_code = offices.office_ID
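
| ID | first_name | last_name | office_code | boss_ID | office_ID | city | phone_number |
| --- | --- | --- | --- | --- | --- | --- | --- |
| null | null | null | null | null | 4 | Busan | +82 2 407 5914 |
| 100 | Donatello | Bianchi | 001 | null | 1 | Seattle | +1202-555-0183 |
| 101 | Ugo | Cavallo | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 102 | Gionata | Lemmi | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 103 | Madeline | Fournier | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 104 | Luuk | Roosa | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 105 | Niall | McCabe | 003 | 101 | 3 | Sydney | +61 2 5550 9137 |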

As you can see above, the second table (offices) returned the Busan office details, although there are no recorded employees at this location.
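
A Right Join can also be written as a Left Join with the two tables swapped, which is handy on engines that lack RIGHT JOIN. The sketch below shows this with Python's built-in sqlite3 module; SQLite is an assumption standing in for MySQL, and the tables are reduced to two rows each.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offices (office_ID TEXT, city TEXT);
    CREATE TABLE employees (ID INTEGER, first_name TEXT, office_code TEXT);
    INSERT INTO offices VALUES ('001', 'Seattle'), ('004', 'Busan');
    INSERT INTO employees VALUES (100, 'Donatello', '001'), (106, 'Rosalva', NULL);
""")

# "employees RIGHT JOIN offices" rewritten as "offices LEFT JOIN employees":
# every office is kept, and Busan gets NULL (None) employee columns.
right_rows = conn.execute("""
    SELECT employees.ID, employees.first_name, offices.city
    FROM offices
    LEFT JOIN employees
    ON employees.office_code = offices.office_ID
    ORDER BY offices.office_ID
""").fetchall()

print(right_rows)
```

This time Rosalva is missing, because she has no office and only rows from the offices table are guaranteed.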

Inner Join

The returned table contains only the rows whose join value appears in both tables. Employees without an office and offices without employees are both left out.

SELECT * 
FROM employees
INNER JOIN offices
ON employees.office_code = offices.office_ID
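
| ID | first_name | last_name | office_code | boss_ID | office_ID | city | phone_number |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | Donatello | Bianchi | 001 | null | 1 | Seattle | +1202-555-0183 |
| 101 | Ugo | Cavallo | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 102 | Gionata | Lemmi | 001 | 100 | 1 | Seattle | +1202-555-0183 |
| 103 | Madeline | Fournier | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 104 | Luuk | Roosa | 002 | 100 | 2 | Dublin | +353 01 918 3457 |
| 105 | Niall | McCabe | 003 | 101 | 3 | Sydney | +61 2 5550 9137 |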

Self-Join

A Self-Join joins a table to itself: the table is referenced twice under different aliases, and the two references are joined as if they were separate tables. But why would you need to do this? One common case is hierarchical data.

Each row in the Employees table references the ID number of that employee's boss. We may want the boss's information to show on the same row as the subordinate.

-- I'm narrowing down the columns viewed after the join
SELECT emp.ID, emp.first_name, emp.last_name, emp.office_code,
       emp.boss_ID, boss.first_name AS boss_fname,
       boss.last_name AS boss_lname, boss.office_code AS boss_office_code
FROM employees AS emp
INNER JOIN employees AS boss
ON emp.boss_ID = boss.ID;
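
| ID | first_name | last_name | office_code | boss_ID | boss_fname | boss_lname | boss_office_code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 101 | Ugo | Cavallo | 001 | 100 | Donatello | Bianchi | 001 |
| 102 | Gionata | Lemmi | 001 | 100 | Donatello | Bianchi | 001 |
| 103 | Madeline | Fournier | 002 | 100 | Donatello | Bianchi | 001 |
| 104 | Luuk | Roosa | 002 | 100 | Donatello | Bianchi | 001 |
| 105 | Niall | McCabe | 003 | 101 | Ugo | Cavallo | 001 |
| 106 | Rosalva | Rojas | null | 101 | Ugo | Cavallo | 001 |
| 107 | Kostis | Antonis | null | 101 | Ugo | Cavallo | 001 |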
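
The result above has no row for Donatello (ID 100): his boss_ID is null, so the Inner Join finds no match for him. The sketch below, using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption) and a three-row version of the table, compares the Inner Join with a Left Join that keeps him.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (ID INTEGER, first_name TEXT, boss_ID INTEGER);
    INSERT INTO employees VALUES
        (100, 'Donatello', NULL), (101, 'Ugo', 100), (105, 'Niall', 101);
""")

# The same table appears twice, under the aliases emp and boss.
query = """
    SELECT emp.first_name, boss.first_name AS boss_fname
    FROM employees AS emp
    {join_type} JOIN employees AS boss
    ON emp.boss_ID = boss.ID
    ORDER BY emp.ID
"""

inner_rows = conn.execute(query.format(join_type="INNER")).fetchall()
left_rows = conn.execute(query.format(join_type="LEFT")).fetchall()

print(inner_rows)  # employees that have a boss
print(left_rows)   # all employees; the top of the hierarchy has no boss
```

Which join type to use depends on whether the person at the top of the hierarchy should appear in the output.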
Metrics That I Need to Know About
https://blog.jamesglassey.com/metrics-that-i-need-to-know-about/
Mon, 04 Jul 2022 18:52:24 +0000

This document lists metrics that I think are important to know about. I will add tooltips and blog posts for individual metrics when a project requires me to understand them more deeply.

Contents

To Measure Financial Performance

  • Net Profit
  • Net Profit Margin
  • Gross Profit Margin
  • Operating Profit Margin
  • EBITDA
  • Revenue Growth Rate
  • Total Shareholder Return (TSR)
  • Economic Value Added (EVA)
  • Return on Investment (ROI)
  • Return on Capital Employed (ROCE)
  • Return on Assets (ROA)
  • Return on Equity (ROE)
  • Debt-to-Equity (D/E) Ratio
  • Cash Conversion Cycle (CCC)
  • Working Capital Ratio
  • Operating Expense Ratio (OER)
  • CAPEX to Sales Ratio
  • Price Earnings Ratio (P/E Ratio)
  • Accounting Rate of Return (ARR)

Metrics to Understand Customers and Users

  • Net Promoter Score (NPS)
  • Customer Retention Rate
  • Customer Satisfaction Index
  • Customer Profitability Score
  • Customer Lifetime Value (LTV)
  • Customer Turnover Rate (Churn)
  • Customer Engagement
  • Customer Complaints
  • Active Users

Metrics to Gauge Market and Marketing Efforts

  • Market Growth Rate
  • Market Share
  • Brand Equity
  • Cost per Lead
  • Conversion Rate
  • Search Engine Rankings (by keyword) and click-through rate
  • Page Views and Bounce Rate
  • Customer Online Engagement Level
  • Online Share of Voice (OSOV)
  • Social Networking Footprint
  • Klout Score
  • Social Engagement

Metrics to Measure Operational Performance

  • Six Sigma Level
  • Capacity Utilisation Rate (CUR)
  • Process Waste Level
  • Order Fulfilment Cycle Time
  • Delivery In Full, On Time (DIFOT) Rate
  • Inventory Shrinkage Rate (ISR)
  • Project Schedule Variance (PSV)
  • Project Cost Variance (PCV)
  • Earned Value (EV) Metric
  • Innovation Pipeline Strength (IPS)
  • Return on Innovation Investment (ROI2)
  • Time to Market
  • First Pass Yield (FPY)
  • Rework Level
  • Quality Index
  • Overall Equipment Effectiveness (OEE)
  • Process or Machine Downtime Level
  • First Contact Resolution (FCR)

Metrics to Understand Employees and Their Performance

  • Human Capital Value Added (HCVA)
  • Revenue Per Employee
  • Employee Satisfaction Index
  • Employee Engagement Level
  • Staff Advocacy Score
  • Employee Churn Rate
  • Average Employee Tenure
  • Absenteeism Bradford Factor
  • 360-Degree Feedback Score
  • Salary Competitiveness Ratio (SCR)
  • Time to Hire
  • Training Return on Investment

Metrics to Measure Your Environmental and Social Sustainability Performance

  • Carbon Footprint
  • Water Footprint
  • Energy Consumption
  • Saving Levels Due to Conservation and Improvement Efforts
  • Supply Chain Miles
  • Waste Reduction Rate
  • Waste Recycling Rate
  • Product Recycling Rate
  • ESG Score

References:

  • The inspiration and the information to have a list of metrics came from Bernard Marr.
Probability Distributions to Cover
https://blog.jamesglassey.com/probability-distributions-to-cover/
Sun, 15 May 2022 18:23:21 +0000
A list of probability distributions.

This page keeps track of my written pieces on different probability distributions.

Discrete Probability Distributions

  • Bernoulli distribution
  • Binomial distribution
  • Benford distribution
  • Geometric distribution
  • Poisson distribution
  • Uniform distribution
  • Hypergeometric distribution
  • Negative binomial distribution

Continuous Probability Distributions

  • Continuous Uniform
  • Exponential Distribution
  • Gamma Distribution
  • Normal / Gaussian Distribution
  • Chi-squared Distribution
Data Project Aims
https://blog.jamesglassey.com/data-project-aims/
Wed, 06 Apr 2022 19:15:49 +0000
The intended qualities of a data science project.

[Ongoing List]

Checklist

  • Intended quality of Code
    • Reproducible
    • Generic and Reusable (Can be used with other data inputs)
    • Extendable
    • Modularised
    • Tested Code
    • Version Control (Git and GitHub)
  • Documentation
    • Project Scope
    • Iterative nature of the project
    • Explanation for choices
    • Assumptions of models used
Useful Resources And Figures For Me
https://blog.jamesglassey.com/useful-resources/
Thu, 31 Mar 2022 08:10:07 +0000
References to external resources that I have found useful with the occasional link.

[An Ongoing list]

Computing:

Coding References:

Computing References

Markdown references:

Design

Compress Image files

Communication

Writing style guide:

  • The Economist Style guide, 12th edition

Organisation

Task tracking management

Time tracking

Figures

General

  • Lex Fridman

Data Science

  • Eric Weber
  • Cassie Kozyrkov
Scientific Notation
https://blog.jamesglassey.com/scientific-notation/
Thu, 17 Mar 2022 18:06:01 +0000
A summary of how to represent very large or small numbers.

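| Number | Prefix | Symbol | Scientific Notation | E notation* | Order of Magnitude |
| --- | --- | --- | --- | --- | --- |
| 0.000000000001 | pico- | p | ×10^-12 | 1E-12 | -12 |
| 0.000000001 | nano- | n | ×10^-9 | 1E-9 | -9 |
| 0.000001 | micro- | µ | ×10^-6 | 1E-6 | -6 |
| 0.001 | milli- | m | ×10^-3 | 1E-3 | -3 |
| 0.01 | centi- | c | ×10^-2 | 1E-2 | -2 |
| 0.1 | deci- | d | ×10^-1 | 1E-1 | -1 |
| 1 | - | - | - | - | - |
| 10 | deka- | da | ×10^1 | 1E1 | 1 |
| 100 | hecto- | h | ×10^2 | 1E2 | 2 |
| 1’000 | kilo- | k | ×10^3 | 1E3 | 3 |
| 1’000’000 | mega- | M | ×10^6 | 1E6 | 6 |
| 1’000’000’000 | giga- | G | ×10^9 | 1E9 | 9 |
| 1’000’000’000’000 | tera- | T | ×10^12 | 1E12 | 12 |
| 1’000’000’000’000’000 | peta- | P | ×10^15 | 1E15 | 15 |
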
* The ‘E’ in E notation does not represent the exponential or the constant ‘e’; it is a placeholder for ‘times 10 to the power of…’. To avoid possible confusion, ‘E’ is uppercased.

Further Information

  • In E notation, it is common to see representations such as 2.34000E+3, which is equivalent to 2.34×10^3.
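
Most programming languages read and write E notation directly. As a small illustration, here is how it looks in Python using only the standard library; the variable names are made up for this sketch.

```python
from decimal import Decimal

# Parsing: E notation is accepted wherever a number is read from text.
value = float("2.34000E+3")

# Formatting: ".5E" keeps five digits after the decimal point.
# Python pads the exponent to at least two digits.
formatted = f"{value:.5E}"

# Decimal.adjusted() gives the exponent of the leading digit,
# i.e. the order of magnitude, without floating-point rounding.
order_of_magnitude = Decimal("2.34000E+3").adjusted()
pico_order = Decimal("1E-12").adjusted()

print(value, formatted, order_of_magnitude, pico_order)
```

Python writes the exponent as +03 rather than +3, but both forms denote the same number.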
