Sep 16, 2021
Data Infrastructure

GDPR Compliance in Big Data Infrastructure 

Post by
Salma Bakouk

What is GDPR? What engineering challenges does it bring? And how do you build the right data infrastructure to ensure GDPR compliance?

The GDPR, or the General Data Protection Regulation, is an EU regulation that went into effect on May 25th, 2018. Since it was first introduced in 2016, companies that do business in Europe have invested a lot of resources to achieve compliance. Your company, like many others, probably hired a Data Protection Officer and ran several training sessions so that your staff understood the new rules. You put in place new processes to document and classify the data you have, established consent procedures, conducted several information audits, and reviewed your data governance process. According to the FT, companies globally spent billions of dollars preparing for the enforcement of the regulation. And according to a PwC report, 88% of companies spend more than $1 million, and 40% spend more than $10 million, on maintaining GDPR compliance.

And yet.

To cite a few, Google, H&M, TIM, British Airways, and Marriott have paid hefty fines for failure to comply with the guidelines. You’ve probably also heard of the record Amazon GDPR fine. On July 30th, 2021, news broke that Luxembourg’s National Commission for Data Protection (CNPD) had hit Amazon with a record-breaking €746 million ($887 million) GDPR fine over its use of customer data for targeted advertising purposes. The fine is unprecedented; it is the largest GDPR fine issued to date and is more than double the amount of all other GDPR fines combined.

So what is GDPR? What engineering challenges does it bring? And how do you build the proper data infrastructure to ensure GDPR compliance?

What is GDPR?

I’ll keep it short. GDPR is a regulation that requires businesses to protect the personal data and privacy of EU citizens for transactions that occur within EU member states. It came into force on May 25th, 2018, and aims to give every EU citizen the right to know and decide how their data is used, stored, protected, transferred, and deleted.

Here is an excellent summary of what GDPR means and the broader scope of what it covers.

What Engineering challenges does it bring?

To understand the challenges GDPR brings to your data infrastructure, let’s go over some of the framework's pillars. Under GDPR, EU citizens are given a particular set of rights:

  • The right to be forgotten

Complete elimination of users’ data must be conducted upon their request. In the era of auto-scheduled backups, non-volatile storage systems, and all-pervasive caching, this represents a real engineering challenge.

  • The right to data portability

Also described as the right to request: users have the right to retrieve all the information a company has collected from them in an exportable, universally readable format, which is another technical challenge to overcome.

  • The right to object to processing

While keeping the data (upon user consent and for “necessary” business operations) is allowed, additional explicit user consent is still required to process the data. Think about how you are going to exclude certain records when writing SQL code…

  • The right to rectification

This gives users the right to change their PII data in your system as they see fit. From an engineering perspective, this means your company will need to have a way of tagging and enumerating PII-related data.

  • The right to be informed

Users need to be informed when their data is being collected and for what purposes. In case of a data breach, users need to be notified as soon as possible, and data protection authorities informed within 72 hours.

Taking all this into account and mindful of other broader engineering challenges GDPR might bring, let’s dive into the obstacles Data Engineers face and how to resolve them.

Personal Data

Let’s start from here. An essential requirement for compliance with GDPR is locating, enumerating, and accessing all user data classified as PII. You also need to think about PII security measures (encryption and fine-grained security access for PII), but that’s another complex topic.

Gone are the days when you could hoard all sorts of data and wish for the best. Clear scope for data collection, storage, and flow through all the processes within your organization is paramount to ensuring a technical foundation for GDPR compliance.

The right to be forgotten

The idea here is having the ability to remove ALL records related to a user that are distributed across several databases, tables, and systems should they make the request. Pseudonymization of data (GDPR articles 6, 25, 32, 89) could bring a solution to this, but we’ll dive into that later.

Let’s take a look at some technical challenges that present themselves here:

  • Bulk deletion from all storage, particularly where the entries are used in aggregate metrics or have different identifiers, can be extremely hard to implement correctly. Orchestrating a cascade-like deletion without assessing the impact on other related data assets and BI reports is a recipe for disaster. Data lineage here is critical; more on this later.
  • Deleting rows that were once allocated to a specific user vs. replacing PII data with a “removed user” placeholder can present some storage challenges, especially in SQL-based databases.
  • Backups can quickly go from being a routine procedure to becoming an absolute engineering nightmare. You are faced with two options: no backup of PII at all, or pseudonymization of the backup. If you choose pseudonymization, you need a proper mechanism that matches user IDs to PII identifiers in a way that allows you to get rid of “forgotten” user IDs without having to go through the backup again to delete the users’ data. In addition, you should keep a table of forgotten user IDs in a separate database with a different backup/restore process so that when you restore a backup, you omit the forgotten users.
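
The backup option above can be sketched roughly as follows. This is a minimal illustration, not a reference design: it assumes backups only ever contain pseudonymous tokens, that the token-to-PII mapping lives in a separate store, and that forgotten tokens are tracked in their own set. The names `PiiVault`, `forget_user`, and `restore_backup` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PiiVault:
    """Hypothetical store mapping pseudonymous tokens to PII.

    It is assumed to live outside the regular backup cycle, so erasing an
    entry here does not require rewriting existing backups.
    """
    pii_by_token: dict[str, dict] = field(default_factory=dict)
    forgotten_tokens: set[str] = field(default_factory=set)

    def forget_user(self, token: str) -> None:
        # Drop the PII and remember the token so later restores can skip it.
        self.pii_by_token.pop(token, None)
        self.forgotten_tokens.add(token)


def restore_backup(backup_rows: list[dict], vault: PiiVault) -> list[dict]:
    """Return backup rows, omitting rows that belong to a forgotten token."""
    return [
        row for row in backup_rows
        if row["user_token"] not in vault.forgotten_tokens
    ]


vault = PiiVault(pii_by_token={
    "tok_a": {"email": "a@example.com"},
    "tok_b": {"email": "b@example.com"},
})
backup = [
    {"user_token": "tok_a", "order_total": 40},
    {"user_token": "tok_b", "order_total": 15},
]

vault.forget_user("tok_a")
restored = restore_backup(backup, vault)
```

The backup itself is never edited here; the forgotten-token set does the filtering at restore time, which is why it needs its own, separately managed backup process.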

The right to object to processing

This is arguably the least challenging requirement, although preventing an automated system from processing the data it stores may seem like an arduous and almost illogical task. Here are a few ideas:

  • Adding a table of “restricted users” and filtering the outputs against it. Easy and works for most modern databases.
  • Adding a boolean column/field to all tables and collections containing PII is another way of solving this. It is less straightforward and requires some careful orchestration.
  • A more sophisticated solution would be to add a button labeled “restrict processing” to your admin panel and to the users' settings page; when clicked, it should mark the profile as restricted. That should create a “wall” preventing the back-office staff from accessing the data and processing it.
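
As an illustration of the first idea, here is a small sketch using Python's built-in `sqlite3` module. The `events` and `restricted_users` tables and the `processable_events` helper are made-up names for this example; the pattern is an anti-join that drops any row whose user appears in the restriction table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, page TEXT);
    CREATE TABLE restricted_users (user_id INTEGER PRIMARY KEY);
    INSERT INTO events VALUES (1, '/home'), (2, '/pricing'), (3, '/docs');
    INSERT INTO restricted_users VALUES (2);
""")


def processable_events(connection: sqlite3.Connection) -> list[tuple]:
    """Return events only for users who have not restricted processing."""
    query = """
        SELECT e.user_id, e.page
        FROM events AS e
        LEFT JOIN restricted_users AS r ON r.user_id = e.user_id
        WHERE r.user_id IS NULL          -- keep rows with no restriction entry
        ORDER BY e.user_id
    """
    return connection.execute(query).fetchall()


rows = processable_events(conn)
```

In practice you would probably wrap this filter in a view or a shared query layer, so that individual analysts cannot forget to apply it.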

The right to be informed

Under GDPR, when collecting user data for a particular business use case, users will frequently have to understand and agree to whatever your organization wants to use their data for. This can be an absolute nightmare if your organization, like many, doesn’t fully understand the data it has, where it is being stored, or its specific business use case once it’s been stored. Often organizations “hoard” data hoping that the use cases will define themselves later. To ensure GDPR compliance, companies will need to understand precisely why they are collecting data and get into the habit of tagging that data at the time of collection.

Another difficulty here is defining the scope of use. Say, for example, a user only consents to “marketing purposes”: you’re going to need to track and enforce that restriction from collection to use (cf. Data Lineage).

The right to rectification

This might seem like an obvious rule but isn't always diligently followed. A user should be able to access and edit all sorts of personal data you’ve collected about them, including PII you would’ve fetched from third parties (Salesforce, Apple login, Facebook ID, etc.). As a general rule, all personal data should be editable through the UI.

How to build the proper Data Infrastructure to ensure GDPR compliance?

Let me start by saying that when it comes to building a Data Infrastructure that complies perfectly with GDPR, there is no magic recipe or one size fits all approach. However, some best practices and tools can enhance overall governance.

In no particular order, the below tools can be used in conjunction with each other or to complement an existing infrastructure to achieve Data protection and GDPR compliance. Perhaps the key thing to keep in mind when building your data infrastructure is that you need to quickly access ALL the PII data within your organization, tag it, and segregate it into a separate table, making it easy to encompass and delete if necessary.

Data Lineage

Data Lineage allows organizations to trace the movement of data from its source to its point of use, providing visibility into how it has changed from point to point. It is already widely used in heavily regulated industries such as banking, insurance, and healthcare as a way to maintain data-related regulatory compliance. It is particularly relevant to the “right to be forgotten” and “right to be informed” pillars: by visualizing and mapping metadata, Data Lineage lets you understand in which tables PII resides and presents the entire lineage and interdependencies from a specific BI report back through the tables and ETL processes. This helps with (1) keeping a record of usages by the business and (2) assessing the impact of a row deletion and ensuring it is orchestrated in a GDPR-compliant manner.
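
To make the impact assessment concrete, here is a toy sketch of walking a lineage graph to list everything downstream of a table that holds PII. The graph contents and the `downstream_assets` function are invented for illustration; a real lineage tool would build this graph from query logs and ETL metadata rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets built from it.
LINEAGE = {
    "raw.users": ["staging.users_clean"],
    "staging.users_clean": ["marts.customer_360", "marts.churn_features"],
    "marts.customer_360": ["bi.revenue_dashboard"],
    "marts.churn_features": [],
    "bi.revenue_dashboard": [],
}


def downstream_assets(graph: dict[str, list[str]], source: str) -> list[str]:
    """Breadth-first walk returning every asset that depends on `source`."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(graph.get(source, []))
    while queue:
        asset = queue.popleft()
        if asset in seen:
            continue  # an asset can be reachable through several paths
        seen.add(asset)
        order.append(asset)
        queue.extend(graph.get(asset, []))
    return order


# Everything that may need attention before deleting rows from raw.users.
impacted = downstream_assets(LINEAGE, "raw.users")
```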

Data Lineage also eases some of the challenges related to “the right to access”.

Source: Sifflet

Data Catalog or a Metadata Discovery tool

A Data Catalog is a detailed inventory of all data assets in an organization and their metadata. It is designed to help data professionals quickly find the most appropriate data for any analytical business purpose by facilitating the access and classification of data at scale.

For GDPR purposes, the catalog solution must have the ability to automate data profiling and data tagging, thus allowing the organization to take those tags and feed them into a metadata processing engine and eventually anonymize that data from raw files to the initial data load.

Using a combination of Lineage and Catalog should allow you to quickly find, access, and tag the PII data within your storage systems and to orchestrate deletions effectively by visualizing and mapping metadata and interdependencies between data assets.
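
As a rough idea of what automated profiling and tagging might look like, the sketch below tags a column as PII when most of its sampled values match a pattern. The two regular expressions, the 0.8 threshold, and the `tag_columns` function are assumptions made for this example; real catalog tools rely on much richer profiling than this.

```python
import re

# Illustrative patterns only; not an exhaustive or production-grade detector.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s\-]{7,}$"),
}


def tag_columns(sample: dict[str, list[str]], threshold: float = 0.8) -> dict[str, str]:
    """Tag a column as PII when enough sampled values match a pattern."""
    tags: dict[str, str] = {}
    for column, values in sample.items():
        non_empty = [v for v in values if v]
        if not non_empty:
            continue
        for tag, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in non_empty if pattern.match(v))
            if hits / len(non_empty) >= threshold:
                tags[column] = tag
                break  # first matching tag wins in this simple sketch
    return tags


sample = {
    "contact": ["ana@example.com", "li@example.org", ""],
    "mobile": ["+33 6 12 34 56 78", "0612345678"],
    "plan": ["free", "pro", "pro"],
}
pii_tags = tag_columns(sample)
```

The resulting tags are the kind of metadata that can then be fed into deletion, pseudonymization, or reporting jobs.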

Data warehouse architecture modules

  • Consent

When it comes to consent, GDPR requires explicit permission from the user to process their data. Practically speaking, there should be an explicit “check box” on the UI for each particular processing activity. From a data storage perspective, you should keep these consent checkboxes in separate columns in the database. If the user unchecks a box in their profile to withdraw consent, a fetching mechanism will allow you to identify the processes that rely on that user's PII and exclude it.
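
A minimal sketch of this layout, using `sqlite3` from the Python standard library: one consent column per processing activity, plus helpers to withdraw consent and to select only consenting users. The column names (`consent_marketing`, `consent_analytics`) and both helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        email TEXT,
        consent_marketing INTEGER NOT NULL DEFAULT 0,
        consent_analytics INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO users VALUES
        (1, 'ana@example.com', 1, 1),
        (2, 'li@example.org', 1, 0);
""")

# Whitelist of consent columns, so column names are never taken from user input.
CONSENT_COLUMNS = {"consent_marketing", "consent_analytics"}


def withdraw_consent(connection: sqlite3.Connection, user_id: int, consent_column: str) -> None:
    """Record that a user unchecked one consent box."""
    if consent_column not in CONSENT_COLUMNS:
        raise ValueError(f"unknown consent column: {consent_column}")
    connection.execute(
        f"UPDATE users SET {consent_column} = 0 WHERE user_id = ?", (user_id,)
    )


def recipients_for(connection: sqlite3.Connection, consent_column: str) -> list[str]:
    """Return emails of users who currently consent to the given activity."""
    if consent_column not in CONSENT_COLUMNS:
        raise ValueError(f"unknown consent column: {consent_column}")
    rows = connection.execute(
        f"SELECT email FROM users WHERE {consent_column} = 1 ORDER BY user_id"
    ).fetchall()
    return [email for (email,) in rows]


before = recipients_for(conn, "consent_marketing")
withdraw_consent(conn, 1, "consent_marketing")
after = recipients_for(conn, "consent_marketing")
```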

  • Pseudonymisation engine

As defined in article 4 of GDPR, pseudonymization is a data management and de-identification process by which personally identifiable information (PII) can no longer be attributed to a specific data subject without the use of additional information. When appropriately implemented, pseudonymization ensures a certain level of protection during the processing of personal data. Although pseudonymized data is not exempt from data privacy requirements, as re-identification remains possible, it can prove advantageous for comprehensive data analysis, particularly (if done correctly and consistently across all data processes) in the data warehouse domain comprising analytics.
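
One common way to implement this, shown here only as a sketch, is to replace identifiers with a keyed hash. The example uses HMAC-SHA256 from the Python standard library; the choice of HMAC, the `pseudonymize` function, and the placeholder key are assumptions for illustration, not something GDPR prescribes.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed, deterministic token (HMAC-SHA256)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for illustration. In practice the key plays the role of the
# "additional information" and must be stored apart from the pseudonymized data.
KEY = b"replace-with-a-secret-from-your-key-store"

row = {"email": "ana@example.com", "order_total": 40}
safe_row = {
    "user_token": pseudonymize(row["email"], KEY),
    "order_total": row["order_total"],
}
```

Because the same input and key always produce the same token, joins and aggregates across warehouse tables keep working on `user_token` without exposing the original identifier.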

  • Deletion engine

When it comes to carrying out PII deletions, a good practice is to conduct them in batches; the best way to incorporate this is to flag and date PII data as part of your data management process and then run batch deletions once a month, within the 30-day window stated by GDPR.
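
A flag-and-date batch job could look roughly like the sketch below. The `erasure_requests` table, the `run_deletion_batch` function, and the sample dates are hypothetical; the 30-day constant simply mirrors the window mentioned above.

```python
import sqlite3
from datetime import date, timedelta

ERASURE_DEADLINE = timedelta(days=30)  # window as described in the text above

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE erasure_requests (
        user_id INTEGER PRIMARY KEY,
        requested_on TEXT,   -- ISO date when the user asked to be forgotten
        erased_on TEXT       -- NULL until the batch job has processed it
    );
    INSERT INTO users VALUES
        (1, 'ana@example.com'), (2, 'li@example.org'), (3, 'sam@example.net');
    INSERT INTO erasure_requests VALUES
        (1, '2021-09-01', NULL), (3, '2021-09-10', NULL);
""")


def run_deletion_batch(connection: sqlite3.Connection, today: date) -> list[int]:
    """Erase all users with a pending request; return IDs that were overdue."""
    pending = connection.execute(
        "SELECT user_id, requested_on FROM erasure_requests "
        "WHERE erased_on IS NULL ORDER BY user_id"
    ).fetchall()
    overdue: list[int] = []
    for user_id, requested_on in pending:
        if today - date.fromisoformat(requested_on) > ERASURE_DEADLINE:
            overdue.append(user_id)  # processed late: worth alerting on
        connection.execute("DELETE FROM users WHERE user_id = ?", (user_id,))
        connection.execute(
            "UPDATE erasure_requests SET erased_on = ? WHERE user_id = ?",
            (today.isoformat(), user_id),
        )
    connection.commit()
    return overdue


overdue = run_deletion_batch(conn, today=date(2021, 10, 5))
remaining = [row[0] for row in conn.execute("SELECT user_id FROM users")]
```

Returning the overdue IDs highlights a limit of a strictly monthly schedule: a request filed just after one run may already be past the window by the next run, so a shorter interval or an alert may be needed.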

  • Information reports and tracking

By leveraging your metadata management and lineage engines, you should be able to automate the identification and localization of personal data within your data warehouse and storage systems and query the metadata table to generate reports for individuals who make the request.
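
The sketch below illustrates the idea of querying a metadata table to assemble a per-user report. It assumes a hypothetical `pii_columns` table listing which columns hold PII and that every listed table has a `user_id` column; the table names and the `export_personal_data` function are invented for this example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, email TEXT, plan TEXT);
    CREATE TABLE orders (user_id INTEGER, shipping_address TEXT, total REAL);
    -- Hypothetical metadata table, e.g. filled by a catalog's PII tagging.
    CREATE TABLE pii_columns (table_name TEXT, column_name TEXT);
    INSERT INTO users VALUES (1, 'ana@example.com', 'pro');
    INSERT INTO orders VALUES (1, '1 Rue de Rivoli, Paris', 40.0);
    INSERT INTO pii_columns VALUES
        ('users', 'email'), ('orders', 'shipping_address');
""")


def export_personal_data(connection: sqlite3.Connection, user_id: int) -> str:
    """Build a JSON report of every tagged PII value held for one user."""
    report: dict[str, list[dict]] = {}
    catalog = connection.execute(
        "SELECT table_name, column_name FROM pii_columns "
        "ORDER BY table_name, column_name"
    ).fetchall()
    for table_name, column_name in catalog:
        # Identifiers come from our own metadata table, never from user input.
        rows = connection.execute(
            f'SELECT "{column_name}" FROM "{table_name}" WHERE user_id = ?',
            (user_id,),
        ).fetchall()
        report.setdefault(table_name, []).extend(
            {column_name: value} for (value,) in rows
        )
    return json.dumps(report, indent=2, sort_keys=True)


report_json = export_personal_data(conn, user_id=1)
```

JSON is used here only as one example of an exportable, widely readable format.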


It goes without saying: data is every modern company’s greatest asset, and therefore, regulating Data is indirectly shaping the economy and the way modern-day organizations conduct their businesses.

The introduction of the European Union’s General Data Protection Regulation (GDPR) in 2018 pioneered Data Protection Regulation by introducing a new set of rules for global companies operating in the EU. The California Consumer Privacy Act (CCPA) came into play later, on the 1st of January 2020, while similar legislation has also been passed in countries such as China, New Zealand, Canada, and South Africa.

Despite being EU legislation, the GDPR has far-reaching implications. Since its enforcement began on May 25th, 2018, companies like Amazon, Google, H&M, and British Airways, to cite a few, have paid hefty fines for failure to comply with the guidelines.

Two main factors make it challenging to support GDPR in Big Data technologies: the immutable nature of storage and the infeasibility of partitioning datasets by a single user.

Although there is no one-size-fits-all approach, implementing certain technologies when building your data infrastructure, such as lineage and metadata management, can prove effective in keeping track of the PII within your organization and easing GDPR compliance while still being able to rely on data to drive your decision making.
