ENTROPY

Single Source of Truth - A Modern Data Myth?

Post by Salma Bakouk & Benedetta Cittadin

Data-driven decision-making has placed increased importance on collecting and analyzing data within companies. Acting on data-derived information is crucial for organizations; however, data teams spend too much time debating which numbers from different sources are the right ones to use. The concept of a Single Source of Truth (SSOT) promises to solve this issue by ensuring that everyone in an organization bases their decision-making on the same data. This is achieved by aggregating data from many sources into a single location. In other words, a Single Source of Truth is a centralized repository for all the data within an organization.

It is crucial for companies that want to extract the most value from their data to ensure that everyone speaks the same language and bases their decision-making on the same information. However, what does Single Source of Truth mean practically? And most importantly, is it achievable within organizations? We sat down with Edouard Flouriot from Aircall and Alexandre Cortyl from Dashlane to answer these questions and provide more clarity on this widely discussed topic.

What does Single Source of Truth mean?

The concept of a Single Source of Truth gets thrown around in discussions on companies’ data management. However, its definition varies depending on whom you ask.

When asked about this, Edouard explained that it is a highly complex and largely unattainable concept. He argued that a Single Source of Truth could work only if a single application were used to run everything within the company; in reality, more nuance is needed. Alexandre agreed that the Single Source of Truth is an ideal that is often hard to achieve in practice because data can be seen from many different perspectives. For him, each tool may have its own measurements, specificities, or best practices, which inevitably leads to several sources of truth or different perspectives. However, organizations that want to get more value out of their data cannot allow silos, and the data team plays a crucial role in preventing them in two main ways:

  • First, the data team needs to limit the number of interfaces to access data coming from different sources, hence reducing friction within the company. One of the strategies that allow you to do this is to build the modern data stack around the warehouse.
  • Second, the data team needs to align people on what “the single source of truth” is, if there is one to choose amongst them. In other words, it is imperative to align everyone on which specific definition or context you assume within a concept. This then becomes the “truth” you build consensus around.

What about the stakeholders? How important is it that this definition resonates with them and their overall objectives?

It is key to ensure that the main stakeholders have easy access to one specific source rather than a multitude of them, so that their trust in the data you are showcasing is not undermined. This is critical to avoid the worst-case scenario, in which stakeholders see different data in different dashboards. When asked how often this happens, Alexandre explained that it depends on many factors, but mainly on how mature the organization is in providing complete access to information. He added that it is essential to avoid this situation because stakeholders will start challenging the data quality, and their trust in the information they are given will diminish. This in turn affects how often they use data and how much they rely on it for their decision-making. So this is something a data team has to master: ensuring that the organization provides stakeholders with consistent information across time and perspectives. To do this, data teams need to ensure that reports and analyses are produced in a repeatable and consistent way.

Edouard added that while stakeholders might not actively participate in the data team's data quality management (DQM) roadmap, they care strongly about it. Stakeholders use data to improve their activities and reach their targets. To do so, they want the maximum level of trust in the data they use, and they rely on the data team to monitor its quality. For Edouard, it is critical to put data at the center of the strategy and ensure that the data team centralizes and controls the significant data interactions of stakeholders. In this way, it is also possible to improve data quality at scale. His organization put some best practices in place to implement the needed change. First, creating a vertical organization within the data team ensured that data people were very close to stakeholders and to how they interacted with the data. Practically, this means creating different teams dedicated to different sides of the organization: one team works with product managers, another works with customer-facing teams, and so on. Second, focusing on culture and data literacy is instrumental. In the past, most of the dashboards and monitoring came from the data team, but today decentralization is needed more and more. For Edouard, decentralization is possible when technical barriers are lowered and when you ensure that everyone uses the same metric definitions. The key to enabling this change is to focus on quality and governance, which can only be achieved by changing the culture.

At what point should you start thinking about data quality?

When asked about this, Alexandre said that the earlier, the better. For him, it is also essential to put robust processes in place, such as CI/CD and code review. Additionally, it is crucial to agree on clear definitions and documentation of the data, with a specific emphasis on expectations. What are you expecting from the data? When are you hoping to get it, and in what format? Answering these questions ensures that there are no diverging perspectives on what works and what doesn't. On top of this, you should also apply automated tests, checks, and monitoring.

For Edouard, many best practices from the software engineering world can be applied to data. Data teams are generally focused on delivering value and impact, so software engineering basics are not always top of mind. However, they are vital. He recalled an episode within his organization in which the CEO shared his concerns about the reliability of the data he was using. To solve this issue, the team took the time to audit the systems and understand how they could mitigate the risk of failure. The first source of unreliability they found was the stack itself, so they decided to review it. Second, they focused on strengthening reliability: they enlarged the team so that it could take more care in how new changes and fixes were released into the pipelines.
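As an illustration of Alexandre's questions (what do you expect from the data, when, and in what format), the sketch below writes such an expectation down as code and checks a delivery against it. It is a minimal example, not a method described by either interviewee: the `Expectation` class, the `check_expectation` function, and the `orders` table with its columns are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Expectation:
    """A documented expectation for one dataset (hypothetical structure)."""
    table: str
    columns: dict  # column name -> expected Python type (the "format")
    due_by: time   # latest acceptable arrival time each day (the "when")


def check_expectation(exp, rows, arrived_at):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # "When are you hoping to get it?"
    if arrived_at.time() > exp.due_by:
        problems.append(
            f"{exp.table}: arrived at {arrived_at:%H:%M}, "
            f"expected by {exp.due_by:%H:%M}"
        )

    # "In what format?"
    for i, row in enumerate(rows):
        missing = set(exp.columns) - set(row)
        if missing:
            problems.append(f"{exp.table}: row {i} is missing {sorted(missing)}")
            continue
        for name, expected_type in exp.columns.items():
            if not isinstance(row[name], expected_type):
                problems.append(
                    f"{exp.table}: row {i} column '{name}' is "
                    f"{type(row[name]).__name__}, expected {expected_type.__name__}"
                )
    return problems


# Hypothetical example data.
orders = Expectation("orders", {"order_id": int, "amount": float}, time(6, 0))
late_and_malformed = check_expectation(
    orders,
    rows=[{"order_id": "1", "amount": 9.5}, {"order_id": 2}],
    arrived_at=datetime(2024, 1, 1, 7, 15),
)
for problem in late_and_malformed:
    print(problem)
```

A check like this could run in CI or right after each load, so that a late or malformed delivery is flagged before stakeholders see it in a dashboard.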

Data Observability — Validation of the Single Source of Truth

Every organization wants to establish a Single Source of Truth. However, its actual implementation is often a topic for debate. Luckily, it comes down to people and tooling.

People or Internal Data Literacy

Gartner defines Data Literacy as the ability to read, write and communicate data in context, including an understanding of data sources and constructs, analytical methods and techniques applied, and the ability to describe the use case application and resulting value.

Everyone in an organization needs to be aligned and speak the same language when it comes to data. This starts with the following steps.

  1. Top-down commitment: C-level engagement and support are paramount. Senior executives’ commitment and efforts toward data literacy will help instill a data-driven culture within the organization.
  2. Use case definition: although this might sound counterintuitive, it is essential to start by understanding what you expect to achieve from the data, the business use cases you are looking to cover, and the expectations of different areas of the business. No important detail should be left uncovered, from how many people will rely on data in their day-to-day work to the dashboard formats your stakeholders are most inclined to use. Data collection is a means to an end; a deep understanding of your business needs when it comes to data is a prerequisite to ensuring your data assets are relevant.
  3. Scope and organization of the data team: once you have a better understanding of the use cases, depending on the complexity of those but also the size of your business, you might think of having a vertical (decentralized) data team with the domain expertise to serve each business domain. It is crucial in this case that the knowledge and the tooling within the data team remain consistent to avoid creating unnecessary silos and favor knowledge sharing. In addition, this kind of set-up also helps improve company-wide data literacy as the barriers between the business and engineering are naturally softened.
  4. (Actual) Data Literacy: this is the part where you hire data coordinators or data project managers, or designate a person or group internally, who can help bridge the gap between engineering and the business, improve business understanding of the data, and enhance overall adoption. Ultimately, business users are data consumers and customers of the data producers, and they should be seen and treated as such.

Tooling

While we can all agree that achieving a Single Source of Truth starts with people and organization, choosing the right tools is key to successful adoption. Data Observability tools help ensure the reliability of the data and increase trust in data assets. They do so by providing complete visibility of the data assets, which allows data teams to monitor the health of the data platform proactively. Including Data Observability in your data stack (or on top of your stack) can help your organization establish a Single Source of Truth.

Observability is a concept that came from the Software Development world, or DevOps more specifically. In DevOps, the notion of Observability is centered around three main pillars (traces, logs, and metrics), which represent the Golden Triangle of any infrastructure and application monitoring framework. Solutions like Datadog, New Relic, Splunk, and others have paved the way for what has become standard practice in software development; it only makes sense that some of their best practices would translate into Data Engineering, a subset of Software Engineering. In data, testing the code or monitoring the infrastructure that produces it simply isn't enough, particularly as external sources and internal data-producing processes grow exponentially. Essentially, data observability is an umbrella term for processes such as automated monitoring, alerting, and triaging that allow data producers and consumers to know when data breaks and to resolve the issues in near real-time. Most importantly, data observability provides enough information to enable data users to resolve problems quickly and prevent the same types of errors from occurring again.

How to pick the best Data Observability tool?

A capable Data Observability tool should help your teams not only detect anomalies automatically but also resolve issues efficiently and get ahead of downstream impact. It should include the following features:

  • Automated Data Quality Monitoring: your most recurrent data quality issues should be detected automatically and with minimal to no effort. These include Freshness (Is the data up to date?), Completeness (Is the volume of data as expected?), Schema Change (Has the structure of my data changed? Did a maintenance SQL query change something, and is it causing pipelines to break or dashboards to become obsolete?), Duplication, etc.
  • Automated Field-level Data Lineage computation: Lineage should be computed from Data Ingestion to Data Consumption (BI or ML). This enables efficient root cause analysis of data issues for data engineers, as well as proactive incident reporting to data consumers.
  • Powerful Metadata Search Engine: With the increasing number of data sources, assets, processes, and stakeholders, your data observability solution must help you quickly navigate this complexity. Hence, an excellent Data Observability tool should include a powerful search engine that enables you to locate your data assets quickly and easily.
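To make the first two features more concrete, here is a rough sketch of two of the checks named above (completeness and duplication) and of how field-level lineage can be used to list the downstream assets affected by an issue. It does not describe how any particular observability product works: the function names, the z-score threshold, and the `LINEAGE` graph with its field names are all assumptions made for this example.

```python
from collections import deque
from statistics import mean, stdev


def volume_is_anomalous(history, today, z_threshold=3.0):
    """Completeness: flag today's row count if it deviates strongly
    from the historical row counts (simple z-score, assumed threshold)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold


def duplicate_keys(rows, key):
    """Duplication: return the key values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


# Hypothetical field-level lineage: upstream field -> fields derived from it,
# from ingestion through to BI and ML consumption.
LINEAGE = {
    "raw.orders.amount": ["staging.orders.amount_eur"],
    "staging.orders.amount_eur": ["marts.revenue.daily_revenue"],
    "marts.revenue.daily_revenue": [
        "bi.revenue_dashboard.total",
        "ml.forecast.revenue_feature",
    ],
}


def downstream_fields(lineage, start):
    """Breadth-first walk of the lineage graph to list every field
    that could be affected by an issue in `start`."""
    impacted, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


if volume_is_anomalous([1000, 1020, 980, 1010, 990], today=400):
    # Tell data consumers which assets to treat with caution.
    print(downstream_fields(LINEAGE, "raw.orders.amount"))
```

Walking the same graph in the opposite direction, from a broken dashboard field back towards ingestion, gives data engineers a starting point for root cause analysis.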

A solution like Sifflet allows you to unlock Full Data Stack Observability. It brings extensive Field Level Lineage across your whole data pipeline, from Ingestion to BI, and lets you fully automate your Data Quality Monitoring workflow by covering thousands of tables with fundamental data quality checks in a couple of minutes (auto-coverage feature). Get in touch for a demo or a 15-day free trial: contact@siffletdata.com