Getting Started with Data Observability

In recent years, organizations have invested heavily in becoming data-driven, with the goal of personalizing customer experiences, optimizing business processes, and informing strategic business decisions. As a result, modern data environments are constantly evolving and growing more complex. In general, more data means more business insights, which can lead to better decision-making. However, more data also means more complex infrastructure, which can lower data quality, increase the chance of data breaking, and consequently erode trust within organizations and raise the risk of non-compliance with regulations. The data observability category, which has developed quickly over the past couple of years, aims to solve these challenges by enabling organizations to trust their data at all times. Although the category is relatively young, it already includes a wide variety of players with different offerings that apply various technologies to solve data quality problems.

But let’s start from the beginning. In this blog, I aim to give an overview of data observability by answering the following questions:

What is data observability?
How is observability different from testing and monitoring?
What are the pillars of data observability?
What are some use cases of data observability?
Do you need to adopt a data quality program?
What should you look for in a data observability tool?

What is data observability?

The term “observability” was originally borrowed from control theory and adopted by software engineering before it spread to the data world. The foundations laid by cloud platforms like AWS enabled the rise of software observability tools like Datadog and New Relic, which revolutionized the software world. These tools gave engineering teams the ability to collect metrics across their systems and build a complete understanding of the systems’ health. In simpler words, observability allows engineers to understand whether a system is working the way it should.

As the cloud makes it possible to host more and more infrastructure components (such as databases, servers, and API endpoints), it has become increasingly important to carefully monitor and maintain visibility over these complex systems. Today, software engineers cannot imagine working without the centralized view across their systems that software observability tools provide. Salma Bakouk and I discussed the relationship between software and data observability in more detail in this blog.

Today, the data space is facing challenges similar to those software faced a few years ago. Organizations are consuming ever larger quantities of data, which feed a growing number of use cases and products. As previously mentioned, more data also means a higher chance of it breaking, and organizations are facing major data quality issues.

Too often, data engineers don’t get to work on valuable, revenue-generating activities because they are stuck fixing pipeline issues or trying to understand where the absurd numbers in a business dashboard came from. On top of the revenue lost when data engineers focus on repetitive work, not having an observability program in place can cause several other problems: loss of trust in data within the organization, reduced team productivity and morale, the risk of non-compliance with existing regulations, and lower-quality decision-making.

How is observability different from testing and monitoring?

Now you may be asking yourself, “But what about the existing data quality solutions? How do they differ from what you’ve just explained?” Let’s dive deep into the differences between data quality testing, monitoring, and observability.

On the one hand, testing used to be the preferred method until companies started consuming so much data that quality issues became increasingly hard to detect and predict. Organizations can put hundreds of tests in place to cover the predictable issues. However, there will always be countless other, unanticipated ways for the data to break along its journey.

Data quality monitoring, on the other hand, is often used interchangeably with observability. In reality, data quality monitoring is the first step needed to enable observability: it alerts users when data assets or datasets don’t match the established metrics and parameters. In this way, teams get visibility over the quality status of their assets. However, this approach ultimately has the same limitation as testing. Although you have visibility over the status of your assets, you have no way of knowing how to solve the issue. Observability is designed to overcome this challenge.
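
The kind of rule-based monitoring described above can be sketched in a few lines of Python. This is only an illustration: the `Rule` class, the `check_thresholds` helper, the metric names, and the thresholds are all hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A fixed expectation for one metric of one asset (illustrative only)."""
    asset: str
    metric: str
    min_value: float
    max_value: float


def check_thresholds(observed: dict, rules: list) -> list:
    """Return one alert message per rule whose observed value is missing or out of range."""
    alerts = []
    for rule in rules:
        value = observed.get((rule.asset, rule.metric))
        if value is None:
            alerts.append(f"{rule.asset}.{rule.metric}: no value observed")
        elif not (rule.min_value <= value <= rule.max_value):
            alerts.append(
                f"{rule.asset}.{rule.metric}: {value} outside "
                f"[{rule.min_value}, {rule.max_value}]"
            )
    return alerts


# Hypothetical metrics collected for two tables.
observed = {
    ("orders", "row_count"): 120,
    ("orders", "null_rate"): 0.02,
    ("customers", "row_count"): 9800,
}
rules = [
    Rule("orders", "row_count", 1000, 50000),
    Rule("orders", "null_rate", 0.0, 0.05),
    Rule("customers", "row_count", 5000, 20000),
]

alerts = check_thresholds(observed, rules)
for alert in alerts:
    print(alert)
```

With these made-up numbers, only the `orders` row count triggers an alert. Note that the alert says what is out of range, but nothing about why it happened or which dashboards depend on the table, which is the gap the text attributes to monitoring on its own.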

Observability constantly collects signals across the entire data stack (logs, jobs, datasets, pipelines, BI dashboards, data science models, etc.), enabling monitoring and anomaly detection at scale. In other words, observability can be defined as an overseeing layer for the data stack, ensuring that the data is reliable and traceable at every stage of the pipeline, regardless of the processing point at which it resides.
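
As a rough illustration of what anomaly detection on a collected signal can look like, the sketch below flags a daily row count that deviates strongly from its own history using a z-score. The z-score is a deliberately simple stand-in chosen for this example, not the method of any specific platform; the `is_anomalous` helper, the threshold of 3, and the sample counts are assumptions, and the sketch ignores trends and seasonality.

```python
from statistics import mean, stdev


def is_anomalous(history: list, latest: float, z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `z_threshold` standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly constant history: any different value is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical row counts for a table over the previous seven days.
daily_row_counts = [10120, 9980, 10050, 10210, 9890, 10030, 10110]

print(is_anomalous(daily_row_counts, 10075))  # a typical day
print(is_anomalous(daily_row_counts, 4200))   # a sudden drop in volume
```

Unlike a fixed threshold, the expected range here is learned from the signal's own history, so the same function can be applied to many assets without hand-tuning a rule for each one.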

To better understand what observability is and the value it can bring to organizations, let’s dive into some of its main technological components.

What are the pillars of data observability?

Although Data Observability originated in Software Observability, there are some significant differences to keep in mind. Software Observability is built on three pillars:

Metrics
Traces
Logs

And while these pillars were enough to build viable full-stack observability frameworks for modern cloud applications, they do not translate directly to data and its infrastructure. At Sifflet, we built our framework around the following pillars of data observability:

Metrics — measure the quality of the data.
Metadata — have access to data about the data.
Lineage — know the dependencies between data assets.

Data Observability should be perceived as an overseeing layer to make your Modern Data Stack more proficient and ensure that data is reliable regardless of where it sits. In the Full Data Stack Observability approach, each component of the Modern Data Stack is perceived as a compartment that serves a purpose in the data journey. These compartments have a logic in which they operate and release information that can be leveraged to understand and observe the metadata, the data itself, the lineage, and the resulting data objects (metrics, charts, dashboards, etc.). To this end, the extensive lineage between the data assets and the objects across the data stack is the backbone of the Full Data Stack Observability framework.

To add some context to this definition, let’s look at some of the most critical capabilities and use cases.

Anomaly detection at both the metadata and the data level. The idea is to introduce a set of metrics that can help define the health state of a data platform. Some standard business-agnostic metrics include:

Freshness/Timeliness: is the data up-to-date?
Completeness/Volume: are there missing values? Missing rows? Missing records? Incomplete pipelines?
Duplication: are there any duplicates?
Schema: did the structure change? Did a maintenance SQL query change something that is causing pipelines to break or dashboards to become obsolete?
Accuracy: is the data within a specific range? In a particular format?
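
To make these metrics more tangible, here is a minimal sketch that computes one simple version of each over a small batch of rows. Everything specific in it is an assumption made for illustration: the `health_metrics` helper, the `order_id` / `amount` / `updated_at` columns, the expected schema, and the valid amount range of 0 to 10,000.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EXPECTED_COLUMNS = {"order_id", "amount", "updated_at"}  # assumed schema


def health_metrics(rows: list, now: datetime) -> dict:
    """Compute simple, business-agnostic health metrics for a list of row dicts."""
    timestamps = [r["updated_at"] for r in rows if r.get("updated_at") is not None]
    id_counts = Counter(r.get("order_id") for r in rows)
    return {
        # Freshness: age of the most recent record.
        "freshness": now - max(timestamps) if timestamps else None,
        # Volume: number of rows received.
        "row_count": len(rows),
        # Completeness: rows with a missing amount.
        "missing_amount": sum(1 for r in rows if r.get("amount") is None),
        # Duplication: extra rows that share an order_id with another row.
        "duplicates": sum(count - 1 for count in id_counts.values()),
        # Schema: does any row have columns other than the expected set?
        "schema_drift": any(set(r) != EXPECTED_COLUMNS for r in rows),
        # Accuracy: amounts outside the assumed valid range.
        "out_of_range": sum(
            1 for r in rows
            if r.get("amount") is not None and not (0 <= r["amount"] <= 10_000)
        ),
    }


# Hypothetical batch of rows, with a fixed "now" so the example is reproducible.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "amount": 25.0, "updated_at": now - timedelta(hours=30)},
    {"order_id": 2, "amount": None, "updated_at": now - timedelta(hours=26)},
    {"order_id": 2, "amount": 40.0, "updated_at": now - timedelta(hours=26)},
    {"order_id": 3, "amount": -5.0, "updated_at": now - timedelta(hours=25)},
]

metrics = health_metrics(rows, now)
for name, value in metrics.items():
    print(name, value)
```

In practice, values like these would be tracked over time per asset, so that a change in any of them can be detected rather than judged in isolation.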

Now, say you’ve implemented an anomaly detection model that can consume and produce lineage information, and you receive an alert notifying you that there is an anomaly; what’s next?

There are three things your team needs to do:

Incident management: the first step is understanding the business impact of the anomaly. What does it mean for the data consumers? Who is using these dashboards, charts, or ML models? Does this data feed into other dashboards? What is the data flow?
Root cause analysis: the next step involves data engineers, who, once aware of the anomaly, need to understand where the incident originated.
Post-mortem: finally, a good practice for teams is to take the time to understand what happened and learn how to prevent the issue from happening again. Implementing a purposeful post-mortem analysis is key to achieving sustainable data health.

These three practices are powered by lineage. Extensive lineage allows teams to quickly gain visibility over the dashboards, charts, or models impacted by the incident and to trace the issue back to its origin. Lineage models can show teams the upstream dependencies (what we like to call the “left-hand side of the warehouse”, or simply everything the data went through before getting into the data warehouse or data lake), helping them better understand where the issue stems from. On top of that, lineage models can link to the applications, jobs, and orchestrators that gave rise to the anomalous data model.

Lineage: lineage represents the dependencies between the data assets within an organization. As data volumes grow and platforms become more complex, manually keeping track of how one asset relates to another becomes impossible. But why is keeping track of the dependencies even relevant? Let’s look at a few scenarios that a practitioner at your average data-mature organization deals with daily.

Brian from Engineering is releasing a new version of one of the company’s core products. Brian conducts a check within his team and pushes the release overnight, oblivious to how that might impact the data team. The data team is run by Susie, whom he rarely interacts with. Susie wakes up to find that the data her team is using to prepare for an important board meeting is gone. Lineage (and, granted, better team collaboration) would have allowed Brian to check the downstream dependencies of his actions beyond the scope covered by his team and notify Susie of the potential impact before the release was rolled out, so that she could take appropriate action.

Let’s look at another example.

Mona, a CEO, checks the quarter’s revenue numbers for the Asia business on her way to an important press conference. She immediately spots something that seems odd to her: the revenue number from Japan. She calls Jacob, the Head of Data, and asks him to justify the number. Jacob then calls his team into an urgent meeting and asks them to investigate the discrepancy. The business has grown so fast over the past couple of years that it has become impossible for the team to keep up with records and documentation. Jacob knows that it will take them hours, if not days, to get to the bottom of the issue, so he apologizes to Mona (who will have to “improvise”) and says his team will do better next time. There are a few things to dissect here. First, the team could’ve avoided the scramble had they implemented an anomaly detection model that flagged the potential error well ahead of Mona’s presentation. Second, once the anomaly had been detected, extensive lineage, from ingestion to BI, would’ve allowed Jacob’s team to:

Know the downstream impact of the anomaly and alert the right people, so Mona wouldn’t have had to discover it on the spot, and
Assess the upstream root cause and take appropriate action to resolve the issue.
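
Both questions, what is impacted downstream and where to look upstream, come down to traversing a dependency graph. The sketch below shows one way to do that with a breadth-first search. The asset names and edges are invented for illustration, and the `downstream` and `upstream` helpers are hypothetical, not part of any tool's API.

```python
from collections import deque

# Hypothetical lineage: each edge points from an upstream asset to a downstream one.
EDGES = [
    ("crm_export", "raw_customers"),
    ("payments_api", "raw_orders"),
    ("raw_customers", "dim_customers"),
    ("raw_orders", "fct_revenue"),
    ("dim_customers", "fct_revenue"),
    ("fct_revenue", "revenue_dashboard"),
    ("fct_revenue", "churn_model"),
]


def _adjacency(edges, reverse=False):
    """Build an adjacency list, optionally with every edge flipped."""
    graph = {}
    for src, dst in edges:
        a, b = (dst, src) if reverse else (src, dst)
        graph.setdefault(a, []).append(b)
    return graph


def _reachable(edges, start, reverse=False):
    """Breadth-first search: every asset reachable from `start`, in visit order."""
    graph = _adjacency(edges, reverse)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def downstream(edges, asset):
    """Assets that consume `asset`, directly or indirectly (impact analysis)."""
    return _reachable(edges, asset)


def upstream(edges, asset):
    """Assets that feed `asset`, directly or indirectly (root cause candidates)."""
    return _reachable(edges, asset, reverse=True)


impacted = downstream(EDGES, "raw_orders")
candidates = upstream(EDGES, "fct_revenue")
print("Impacted by raw_orders:", impacted)
print("Possible sources for fct_revenue:", candidates)
```

In the scenario above, the first call corresponds to knowing which dashboards and models to flag before Mona sees them, and the second to the list of places Jacob's team would inspect, nearest dependencies first.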

Do you need to adopt a data quality program?

The short answer to this question is yes. Every organization needs to implement solutions or processes that ensure that its data is reliable and high-quality at all times. This can look different for every organization. What is similar in every business, however, is the risk you run when you don’t implement data quality monitoring and swiftly troubleshoot the issues you encounter. These risks include the following:

Decreased decision-making quality.

Data-driven decision-making is supposed to enhance business decisions. However, this is not the case when the data used to drive decisions is of poor quality. In fact, it may have the opposite effect, potentially leading to:

Material loss
Missed opportunities
Misallocated resources

Erosion of trust in data within teams.

Errors in reporting can be costly. For example, in the case of public companies, bad data can lead to errors in reporting, which can in turn hurt the stock price and result in fines. In the same way, as mentioned above, data-driven decision-making only leads to positive results if the data is accurate and up-to-date. Companies facing frequent data quality issues may lose their teams’ trust in the use of data within the organization, jeopardizing the data culture, which can have major negative consequences for the company in the long term.

Risk of non-compliance with regulations.

There are a few ways in which not implementing thorough data quality monitoring can create data governance challenges within organizations:

First, thanks to real-time data streams, today’s data flows are dynamic, making it challenging to keep the documentation of your data up to date.
Second, as the data stack grows, so does the team. This can become a great challenge, as more people working with data means more risk of unexpected changes that can decrease its quality. Read more about how we believe you should structure your team in this blog.
Third, data democratization allows more stakeholders to make data-informed business decisions and create a data culture within the organization. At the same time, as different roles have different expertise and confidence around data, things can start going wrong, leading to data governance challenges.

In general, regulations can be intimidating because of the numerous engineering challenges they can bring to organizations. These challenges, however, are not impossible to overcome if you adopt the right tools and practices.

What should you look for in a data observability tool?

If you’ve made it this far into the blog, it probably means that you are interested in using observability within your company, and possibly even choosing an observability tool. The current market has plenty of players, and there are many different tools you can choose from. Benchmarking goes beyond the scope of this blog, but here I’d like to give a brief overview of the key aspects to keep in mind when looking into data observability tools for your organization.

Number and diversity of integrations. This is probably an obvious point, but if your data observability tool is to become an overseeing layer for your data stack, it needs to integrate with all (or at least most) of your other tools. In general, the more integrations the tool has on both sides of the data warehouse (or data lake), the higher the level of visibility you’ll manage to achieve.

Lineage visualization. How easy is it to visualize and navigate the lineage? This depends on the UX of the lineage view as well as on the number of integrations your tool has: the more integrations your data observability solution offers, the more complete your lineage will be.

Field-level lineage. Field-level lineage allows you to get more granularity in your data. So, having access to field-level lineage rather than only table lineage allows you to better and more quickly identify and troubleshoot potential anomalies.
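
To illustrate the difference in granularity, here is a minimal sketch that compares the impact of a change when lineage is tracked per column versus per table. The column names, the `COLUMN_LINEAGE` mapping, and the helper functions are all hypothetical and only meant to show the idea.

```python
# Hypothetical column-level lineage: downstream column -> upstream columns it is derived from.
COLUMN_LINEAGE = {
    "fct_revenue.revenue_usd": ["raw_orders.amount", "raw_orders.currency"],
    "fct_revenue.order_date": ["raw_orders.created_at"],
    "fct_revenue.customer_region": ["dim_customers.region"],
    "revenue_dashboard.japan_revenue": [
        "fct_revenue.revenue_usd",
        "fct_revenue.customer_region",
    ],
    "revenue_dashboard.orders_per_day": ["fct_revenue.order_date"],
}


def table_of(column: str) -> str:
    """Return the table part of a 'table.column' name."""
    return column.split(".", 1)[0]


def impacted_columns(changed: str) -> set:
    """All columns that depend on `changed`, directly or transitively."""
    impacted, frontier = set(), {changed}
    while frontier:
        # Next layer: columns derived from anything in the current frontier.
        frontier = {
            col
            for col, sources in COLUMN_LINEAGE.items()
            if frontier & set(sources) and col not in impacted
        }
        impacted |= frontier
    return impacted


def impacted_tables(changed: str) -> set:
    """What table-level lineage alone would report for the same change."""
    return {table_of(col) for col in impacted_columns(changed)}


print(sorted(impacted_columns("raw_orders.currency")))
print(sorted(impacted_tables("raw_orders.currency")))
```

With table-level lineage, a change to `raw_orders.currency` would mark the whole `revenue_dashboard` as suspect. The column-level view narrows this to `japan_revenue` and shows that `orders_per_day` is unaffected, which is what makes troubleshooting faster.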

Ease of set-up and migration. By the time you adopt a data observability platform, you may already have other quality or catalog solutions in place. Your new data observability tool should provide automated ways to migrate your data from one tool to the other, so you don’t have to do it manually.

Variety in the catalog of monitoring tests. This point heavily depends on how the organization plans to use the observability tool. When researching a new tool, you’ll have to check which tests it offers and whether they match the expectations of your organization.

Needs of your user personas. Finally, it is essential to understand who will be using the tool and how, because different tools have very different offerings. On the one hand, if the data observability solution will be used by technical personas, you’ll need to look for one that provides an SDK or CLI, giving engineers the ability to interact with the tool programmatically. On the other hand, if it will mostly be used by business personas, you’ll need to prioritize UX and UI to make sure that the tool is user-friendly and easy to understand.

Conclusion

To conclude, the complexity of the modern data stack is no longer an excuse for poor data quality. Embracing complexity while ensuring that data quality is not left behind is possible with full data stack observability. More on the full data stack observability approach in this blog.

Data observability can be considered the overseeing layer that enables data users (e.g., data analysts, data scientists) to always be in the know about the status of their data, while also giving them the ability to swiftly resolve any incidents by leveraging data monitoring, lineage, and metadata management practices. However, not every data observability tool is the same. Always ask yourself who will be using the tool (will it be used by data engineers? Or by data scientists? etc.) and how they will use it, to ensure you choose the right tool for your organization.

The Modern Data Stack is not complete without an overseeing observability layer to monitor data quality throughout the entire data journey. Do you want to learn more about Sifflet’s Full Data Stack Observability approach? Would you like to see it applied to your specific use case? Book a demo or get in touch for a two-week free trial!