Month: November 2020

Data Quality in Real Time

When I started working on data & analytics projects, I realized that testing a workflow during development is only ‘part’ of the story. We can ‘assume’ what kind of data we will get in production, but we cannot be certain, and therefore testing data quality in real time is paramount.


Through reading about, exploring and helping teams fix data quality issues, the importance of these checks became evident. Over the years I’ve seen their impact compound, and they are now an integral part of any data strategy I work on. The impact these checks have had is tremendous; I’ll highlight a few of the most important benefits here.


Builds trust in analytics


This is usually the toughest part of any data program – building trust with decision makers in the analytics results being presented. A common obstacle to building that trust is skewed information coming out of the analytics, caused by data quality issues across the data pipeline.


When the analytics team is questioned about it, it naturally takes them time to figure out what the problem could be. With complex data pipelines the challenge is compounded, and it can take a while to find the source of the problem. This slowness creates further distrust, and the team’s capacity shifts away from delivering new features towards debugging old ones.


Once teams start running data quality checks across the different stages of the data pipeline, they can identify data quality issues almost as soon as they occur. Teams are then able to preemptively highlight a potential data quality issue and are well equipped to fix it, since the source of the problem is now much easier to find.


This changes the conversation from “why isn’t the data making sense” to proactive alerting about a potential data quality issue that might cause skewed metrics – and sometimes fixing the problem even before consumers get a chance to look at the alerts or the skewed results.
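As a rough illustration of what such an alert can look like, here is a minimal Python sketch. The check name, threshold and logging setup are my own assumptions for the example, not any specific tool or framework:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dq_checks")

def row_count_check(stage: str, actual_rows: int, expected_min: int) -> bool:
    """Alert when a pipeline stage produces suspiciously few rows."""
    if actual_rows < expected_min:
        logger.warning(
            "Data quality alert at stage '%s': got %d rows, expected at least %d",
            stage, actual_rows, expected_min,
        )
        return False
    logger.info("Stage '%s' passed row count check (%d rows)", stage, actual_rows)
    return True

# Run right after ingestion, before downstream transformations consume the data.
row_count_check("ingestion", actual_rows=1200, expected_min=10000)
```

The same pattern – measure, compare against an expectation, alert on breach – applies whether the check runs in a scheduler, a streaming job or inside the pipeline itself.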


Reduced rework


Debugging data quality issues is, unfortunately, a common activity you’d find on analytics teams’ sprint boards. True, there will always be ‘some’ capacity going towards this, simply because teams usually have very little control over the data ingested into the data lake. However, this can get very bad if the appropriate enablers are not in place; I’ve seen rework consume as much as 50% of a team’s capacity across a PI (quarter)!


With data quality checks running in production across the pipeline, this manual toil reduces significantly. Teams mostly don’t have to spend time ‘finding’ the cause of a problem; instead they already have a very good idea of its source.


Mental peace


Not to mention the peace of mind teams get from this. The thought that you never know what kind of nasty data might make its way into your analytics is pretty draining. Once teams see this benefit, they often become the biggest drivers of implementing these checks.


Patterns of issues


Ever felt déjà vu while fixing an issue? Yeah, that’s all too common. Over a longer period of time, similar kinds of issues keep coming up again and again. Since the occurrences are spread across time, it’s not easy to make the connection.


By running data quality checks in real time, teams build a very valuable data set showing data quality across the pipeline over a long period of time. Capturing trends and correlating them with possible causes becomes much easier.
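A simple way to start building that data set is to record every check outcome somewhere queryable. The sketch below is illustrative only: the file name and columns are my own invention, and in practice the results would more likely land in a table in the warehouse itself.

```python
import csv
from datetime import datetime, timezone

RESULTS_FILE = "dq_results.csv"  # illustrative location for the check history

def record_check_result(stage: str, check_name: str, passed: bool, detail: str = "") -> None:
    """Append one check outcome so trends can be analysed later."""
    with open(RESULTS_FILE, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), stage, check_name, passed, detail]
        )

# Over weeks this log becomes a small time series of data quality per pipeline stage,
# which can be grouped by check name to spot the issues that keep recurring.
record_check_result("curation", "null_rate_customer_id", passed=False, detail="3.2% nulls")
```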


How to build data quality checks?


There are a couple of ways I’ve seen this built – I’ll talk about some best practices for building these checks in a separate post.
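In the meantime, the basic shape of a check is simple: measure something about the data, compare it to an expectation, and report the result. The sketch below assumes a pandas DataFrame and a made-up threshold; tools such as Great Expectations, Deequ or dbt tests wrap the same idea in a more complete framework.

```python
import pandas as pd

def check_null_rate(df: pd.DataFrame, column: str, max_null_rate: float = 0.01) -> dict:
    """A single data quality check: fail if too many values in a column are null."""
    null_rate = df[column].isna().mean()
    return {
        "check": f"null_rate_{column}",
        "passed": bool(null_rate <= max_null_rate),
        "observed": float(null_rate),
        "threshold": max_null_rate,
    }

# Run the same check at whichever pipeline stage the column should already be populated.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [101, None, 103]})
print(check_null_rate(orders, "customer_id"))
```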

Baking quality into your data pipeline

When organizations start on their data analytics journey, it mostly begins as an MVP (Minimum Viable Product) that evolves over time. While that is exactly how things should go, sometimes they miss out on building the right foundations as they evolve their MVP. A big factor that gets missed is data quality, data governance & stewardship.


In my experience, building a data pipeline is not the challenging part – ensuring the data saved in our warehouse is of quality (fit for purpose) is the tricky part. Unfortunately, this is easier said than done and requires some foundational work and structure to consistently deliver quality data.


About the talk


In this talk at TestBash New Zealand 2020 I gave a quick introduction to how data analytics works and discussed some basics of building quality into your data pipeline.


https://twitter.com/manoj9788/status/1330052146839031816?s=20&t=1tsAnJjd0vYObUMHmeyknw
https://twitter.com/heather_reiduff/status/1330051793909407744?s=20&t=0GH-GbSMtKsJHtLIjJopZQ

It all starts with data


It’s surprising how easy it is for teams to forget the importance of having ‘fit for purpose’ data – be that for business intelligence, analytics, and especially AI / ML initiatives. This is precisely what I like to remind the teams I work with: it all starts with having the right data.


The Pipeline


To build enterprise data assets, we often get data from a lot of different sources across the organization, and perhaps from 3rd party sources as well. These sources have their own paradigms in which the data is developed and consumed, and they may not always be consistent with one another.


To homogenize all this data and make it fit for purpose, it passes through a number of stages: cleansing, curating, transforming and modelling. All of these activities combined are called the data pipeline.
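In code, a pipeline can be thought of as a chain of stage functions, each taking the output of the previous one. The sketch below is purely illustrative – the stage names mirror the ones above, while the columns and transformations are invented placeholders:

```python
import pandas as pd

def cleanse(raw: pd.DataFrame) -> pd.DataFrame:
    # e.g. drop obvious duplicates and standardise column names
    return raw.drop_duplicates().rename(columns=str.lower)

def curate(clean: pd.DataFrame) -> pd.DataFrame:
    # e.g. keep only the records the business domain actually needs
    return clean.dropna(subset=["customer_id"])

def transform(curated: pd.DataFrame) -> pd.DataFrame:
    # e.g. derive the measures the reports and models will consume
    curated = curated.copy()
    curated["amount_nzd"] = curated["amount"] * curated["fx_rate"]
    return curated

def run_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    # each hand-off between stages is a natural place to run data quality checks
    return transform(curate(cleanse(raw)))
```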


The image below gives a high-level overview of a sample data pipeline:


Data Quality


To get the desired output from analytics, the quality of data across the data pipeline has to be measured and fixed. It often happens that at the analytics stage we get data which doesn’t add up, and backtracking from there to figure out exactly what went wrong is a tedious job.


To avoid this, data quality must be measured across all stages. Here is a sample process for writing these quality checks:


At different stages, different types of data quality checks might be more important. While ingesting data, we would be more concerned with whether the source data conforms to the defined schema. As we move along the pipeline, the objectives of the checks change according to the underlying ETL process.
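For example, an ingestion-time schema check might look something like this minimal sketch (the expected schema and the ‘orders’ feed are invented for illustration):

```python
import pandas as pd

# Hypothetical contract for an ingested 'orders' feed.
EXPECTED_SCHEMA = {"order_id": "int64", "customer_id": "int64", "amount": "float64"}

def check_schema(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of schema problems in the ingested data (empty list = pass)."""
    problems = []
    for column, dtype in expected.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    return problems

incoming = pd.DataFrame({"order_id": [1, 2], "customer_id": [101, 102], "amount": [9.99, 20.0]})
print(check_schema(incoming, EXPECTED_SCHEMA))  # an empty list means the feed matches the contract
```

Further along the pipeline, the same structure can carry checks for things like referential integrity, freshness or business rules.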

Getting the quality of the data right is a journey that never ends. Being agile about this and consistently acting on the feedback gathered from these quality checks is paramount for success in any data analytics initiative.

Data Warehouse vs OLTP

OLTP / traditional database designs are perhaps the most common database type in use. The data warehouse as a concept was introduced in the 1980s, but widespread adoption came a good few years later.


Since OLTP designs are very common and well known, it helps to understand the difference in purpose between traditional OLTP DBs and data warehouses.


OLTP USAGE


Traditional DBs are primarily the persistence layer of an application, where data changes very frequently and requires a quick operational response time. It’s all about the speed at which CRUD operations can be performed and ensuring high availability of data for other processes.


To make that happen, these DBs are designed with the following in mind:


  • They are very quick at inserting / updating records, those being the most frequent operations.

  • The DB can be accessed by many operations at once, so locking needs to be minimized.

  • Typical read operations fetch a small amount of data, so queries should be lightweight and not heavy on processing.

  • It’s crucial that the data is consistent, that entries match and are linked properly, which leads to:

  • Normalization, which helps reduce redundancy and maintain consistency, making updates easier (illustrated in the sketch after this list).
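To make the normalization point concrete, here is a toy sketch – plain Python with entirely made-up data rather than SQL – of the same order data in denormalized and normalized form:

```python
# Denormalized: customer details repeated on every order row.
denormalized_orders = [
    {"order_id": 1, "customer_name": "Acme Ltd", "customer_city": "Auckland", "amount": 120.0},
    {"order_id": 2, "customer_name": "Acme Ltd", "customer_city": "Auckland", "amount": 80.0},
]

# Normalized: customer details stored once, orders reference them by key.
customers = {101: {"name": "Acme Ltd", "city": "Auckland"}}
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 120.0},
    {"order_id": 2, "customer_id": 101, "amount": 80.0},
]

# Updating the customer's city now touches one record instead of every order row,
# which is exactly the kind of frequent, small write an OLTP system is optimised for.
customers[101]["city"] = "Wellington"
```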

DATA WAREHOUSE


Data warehouses have a completely different purpose. OLTPs maintain data for ‘systems of record’, while data warehouses are meant to serve as a single source of enterprise-wide data for reporting. That means OLTPs see small, quick and very frequent updates, while data warehouses (DW) are used for querying very large data sets and receive mostly ‘batch’ updates which don’t have to be as performance critical.


To elaborate further, this is how data warehouses differ from OLTPs:


  • The most important purpose of a DW is the retrieval of large data sets through queries, rather than very limited data sets.

  • Since the primary job is retrieval of data, the performance of ‘read’ operations is essential.

  • Typically there are not many operations trying to access the DW at the same time, so locking is not an issue.

  • A DW is a copy of OLTP data – that means we typically will not have frequent inserts or updates like an OLTP system, since those are the customer-facing systems.

  • Since inserts / updates will not be as frequent, the performance of write operations is not a major concern.