The Normality of Data - Geordie Consulting

While the concept of ETL pipelines may seem straightforward, the reality is that developing mechanisms to support these pipelines in data projects is a task of immense complexity that requires careful planning and execution.

ETL, or Extract, Transform, and Load, involves extracting data from the source system, processing it to align with the intended data models, and then loading it into those models. These three steps may appear straightforward but often become challenging due to seemingly simple assumptions or expectations that can lead to considerable time and resource expenditure. In our video series on Single Pane of Glass for Enterprise Analytics, we’ve purposefully chosen data that mirrors scenarios encountered by many of our clients. We are adopting the perspective of “New York City” to evaluate how various departments or functions are performing across different areas (the Boroughs). When we examine the CitiBike data, we notice it lacks Borough information and only presents Latitude and Longitude details for the bike stations.

The challenge presented here is quite common: business attributes must be incorporated into data to effectively utilise it within the data model. We face a critical business requirement to report by Borough, which must be addressed. With over 2000 Stations in our selected data, and a few new stations being added monthly, we have two primary challenges. Initially, the Project Team or your Centre of Excellence must cleanse and fix these 2000+ stations. Subsequently, the ongoing maintenance of this list needs to be transitioned back to the team managing the process, implying that a simple and low-impact solution should be provided to the business. Our extensive experience shows that such solutions often become overly complex, neglecting the KISS (Keep It Simple Stupid) principle. In our video, we demonstrate a streamlined approach triggered by an email, though we will discuss an alternative below.

1. Manual – Get source Data

2. Automatic – Source Data processed by the solution

3. Automatic – Data compared to distinct Station Names

4. Automatic – New Station list sent to Team

5. Manual – Updates processed

Steps 4 and 5 can be integrated within Microsoft Teams, so instead of sending an email, the entire team receives a message in Teams. The list filtered to the “Unknown Boroughs” will be accessible on another Tab within the Team. This centralizes everything, eliminating the need for an additional app that people might forget to access and update. The training required is minimal: “A message comes through into our main Teams chat informing us of items to check; we visit the other tab and update the items by clicking the link and identifying the location.”

The solution is intentionally embedded in the team’s existing workflow because the number of required interactions is low; developing a specific app for handling 20 items a month would be costly, even with PowerApps, and would likely result in data not being properly maintained. A potential downside is the necessity to use a Microsoft List, which is practically limited to about 10,000 rows (paging solutions need consideration at over 5,000 rows, although the real limit is in the millions). Given the current ~2500 + 20 monthly additions, this solution will likely last beyond the lifespan of our data platform, with an estimated 10+ years before reaching 5,000 rows.

In summary, it is evident that while some solutions may seem unnecessary initially, they ultimately become critical. Although it was reasonable to expect complete address availability for all stations, managing the lookup and validation of critical business data became essential. We considered automating this completely using Google or Azure Maps APIs to look up Latitude and Longitude and return full street addresses. Such a solution would take about a day to write and test, presenting the primary challenge. Our no-app solution took under half a day to set up and deploy, with only an additional hour needed for documentation. In contrast, an app would require at least half a day to document. An automated solution would also introduce potential errors and necessitate ongoing maintenance. Our approach adheres to the “KISS” principle by providing a straightforward solution seamlessly integrated into user’ workflows.