Tracing Lineage with SQL Table Tracer
Table lineage is essential for ensuring the reliability, transparency, and observability of data systems. Here's how we built a SQL Table Tracer.

Introduction
Table-level lineage—the relationship between upstream (source) and downstream (destination) tables in SQL queries—is fundamental to understanding and maintaining data observability in modern systems. The SQL query history is the definitive source of truth for these relationships, as it captures the actual data transformations and dependencies within a system.
SQL Table Tracer (STT) is a lightweight library designed to automate the extraction of table-level lineage directly from individual SQL queries. By providing a single method call, STT takes a query and its dialect as input and reliably outputs:
- The destination table (if applicable).
- A list of upstream tables referenced in the query.
This simple utility is the cornerstone of lineage workflows, serving as the foundational building block for more complex processes such as stitching together end-to-end lineage across systems.
Background and Motivation
The Importance of Table Lineage
Table lineage is essential for ensuring the reliability, transparency, and observability of data systems. Key use cases include:
- Impact Analysis: Assessing which downstream systems are affected by changes to upstream tables.
- Debugging Pipelines: Identifying dependencies to trace the root cause of data issues.
- Auditing and Compliance: Demonstrating how data flows across the organization to meet regulatory requirements.
Why Build SQL Table Tracer?
Automating table-level lineage extraction requires accurate parsing and interpretation of SQL queries across different dialects. While some tools and libraries exist, none met the specific needs of our use case, prompting us to develop our own solution.
- Issues with Apache Calcite: Calcite is a widely used query parser but fell short for our requirements. It supports a limited number of SQL dialects, struggles with non-standard syntax, and often requires significant customization to handle real-world queries. These limitations made it impractical for consistent, reliable lineage extraction.
- Lack of Suitable Open-Source Options: At the time of development, we could not find an open-source SQL parser that:
- Supported a wide range of SQL dialects out of the box.
- Delivered high accuracy in extracting table relationships.
- Could easily integrate into a lightweight, purpose-driven lineage extraction tool.
Role of SQL Table Tracer
While STT is not an all-encompassing lineage platform, it plays a critical role in enabling such systems. By automating the extraction of table-level lineage from individual SQL queries, STT:
- Provides the raw data needed to build end-to-end lineage across systems.
- Acts as a consistent, reliable method for lineage extraction, regardless of query complexity or SQL dialect.
Our Approach
To address these challenges, we built SQL Table Tracer using Antlr, a flexible and powerful parser generator. Antlr enables us to:
- Leverage well-maintained grammars for multiple SQL dialects.
- Extend and customize these grammars to support dialect-specific features and edge cases.
- Ensure scalability and performance, even for large or complex SQL queries.
By taking this approach, SQL Table Tracer achieves a level of accuracy and reliability that would not have been possible with existing tools. The result is a library that is lightweight, focused, and capable of serving as the cornerstone for broader lineage workflows.
Core Functionality
SQL Table Tracer provides a single, intuitive interface. At its core, it processes a SQL query and outputs lineage information in a structured format.
Workflow Overview
- Input: The SQL query and its dialect (if known - e.g., PostgreSQL, MySQL, Snowflake).
- Processing: The library parses the query to identify:
- The operation being performed (e.g., SELECT, INSERT).
- The destination table being written to (if any).
- A list of upstream tables the query depends on (if any).
- Output: A structured object containing the lineage information.
For example, with this query as input:
INSERT INTO sales_summary SELECT SUM(sales) FROM sales_data;
We should get this output:
{
  "operation": "INSERT",
  "destinationTable": "sales_summary",
  "upstreamTables": ["sales_data"]
}
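As a concrete sketch of the single method call, the snippet below shows what a caller might write. The entry point SqlTableTracer.extractLineage and the Dialect enum are illustrative assumptions, since this post does not spell out the exact API surface:
// Hypothetical usage sketch; the actual method and enum names may differ.
String query = "INSERT INTO sales_summary SELECT SUM(sales) FROM sales_data;";
TableLineage lineage = SqlTableTracer.extractLineage(query, Dialect.POSTGRESQL);
System.out.println(lineage.operation());        // INSERT
System.out.println(lineage.destinationTable()); // sales_summary
System.out.println(lineage.upstreamTables());   // [sales_data]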
Key Features
- Minimal API: The library offers a simple interface, making it easy to integrate into larger workflows or pipelines.
- Dialect Support: Handles variations in SQL syntax across supported dialects.
- Maintainability and Evolvability:
- The tool is designed for ease of adding new dialects: developers can quickly extend support by integrating Antlr grammars for additional SQL dialects.
- Existing dialect implementations can be refined and evolved over time to handle new features or edge cases, ensuring the tool adapts to changing requirements.
- The modular design separates parsing logic from lineage extraction, enabling targeted updates without affecting the overall system.
- Focus on Table-Level Dependencies: By narrowing its scope to table relationships, STT remains lightweight and efficient while providing the foundational data needed for more complex lineage workflows.
Implementation Details
SQL Table Tracer achieves its functionality through a carefully designed implementation that balances flexibility, maintainability, and performance.
Unified Grammar for All Dialects
To support multiple SQL dialects while maintaining a single point of entry, SQL Table Tracer uses a single Antlr4 grammar. This eliminates the need for multiple parsers and allows all dialect-specific logic to be centralized in one interpreter.
The key to this approach is Antlr4's Semantic Predicates, which enable the grammar to dynamically adapt its behavior based on the specified dialect. By leveraging Semantic Predicates, the parser can evaluate dialect-specific rules while keeping the overall structure unified.
Example: Semantic Predicates in the Grammar
sqlQuery
: createTableAsSelect
| createSchema
| createTable
| createView
| {redshift}? vacuum
| delete
| insert
| select
| update
| merge
| copy
;
createTableAsSelect
: {redshift}? CREATE skip_to_table? TABLE destinationTableName stuff_in_parenthesis? (BACKUP (YES | NO))? tableAttributes* AS select
| {synapse}? CREATE TABLE destinationTableName stuff_in_parenthesis? WITH stuff_in_parenthesis AS select optionClause?
;
- Dialect Flags: At runtime, flags for dialects like redshift, synapse, and snowflake are set through a setDialect() method. These flags control the evaluation of Semantic Predicates, enabling or disabling specific rules based on the active dialect.
- Flexible Parsing: The same parser can seamlessly handle queries for all supported dialects, activating only the rules relevant to the specified dialect.
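To make the flag mechanics concrete, here is a sketch of how they could be wired up. Antlr4 lets a grammar declare a superclass for the generated parser (via the superClass option), and predicates like {redshift}? then read boolean fields on that class at parse time. The class, enum, and field names below are illustrative assumptions, not STT's actual source:
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.TokenStream;

// Base class for the generated parser; semantic predicates such as
// {redshift}? evaluate these boolean fields while parsing.
public abstract class SqlParserBase extends Parser {
    public enum Dialect { REDSHIFT, SYNAPSE, SNOWFLAKE }

    protected boolean redshift;
    protected boolean synapse;
    protected boolean snowflake;

    protected SqlParserBase(TokenStream input) {
        super(input);
    }

    // Enables exactly the predicate flags matching the requested dialect.
    public void setDialect(Dialect dialect) {
        redshift = dialect == Dialect.REDSHIFT;
        synapse = dialect == Dialect.SYNAPSE;
        snowflake = dialect == Dialect.SNOWFLAKE;
    }
}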
Placeholder Rules for Skipping Irrelevant Query Parts
One trick we found that accelerates writing the grammar is placeholder rules, such as stuff_in_parenthesis, to skip over parts of the SQL query that are not currently relevant to table-level lineage extraction. These rules are designed to handle non-essential components, such as column definitions, storage settings, or other optional clauses, without requiring the implementation of detailed parsing logic.
Some examples:
- stuff_in_parenthesis: Used to skip over parenthesized sections like column definitions or parameter lists. A SQL create table statement is a perfect example of where this can come in handy:
CREATE TABLE Persons (PersonID int, LastName varchar(255), FirstName varchar(255), Address varchar(255), City varchar(255));
- stuff_in_quotes: Used to skip over (or bundle) quoted sections.
- skip_to_token: Jumps directly to the specified Token, bypassing intermediate syntax.
This approach keeps the grammar lightweight and extensible while ensuring that the parsing process focuses on the key elements needed to extract table-level lineage.
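One plausible way to express such a rule in Antlr4 is to match balanced parentheses recursively while consuming any other token. This is a sketch of the general pattern (with assumed LPAREN/RPAREN tokens), not necessarily STT's exact rule:
stuff_in_parenthesis
: LPAREN ( stuff_in_parenthesis | ~(LPAREN | RPAREN) )* RPAREN
;
Because the rule recurses on itself, nested parentheses such as varchar(255) inside a column list are skipped correctly.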
Single Interpreter for Lineage Extraction
The unified grammar is paired with a single interpreter (implementation of Antlr visitors) that processes the parser's output to extract lineage information.
This design ensures:
- Consistency: All dialects share the same extraction logic, simplifying maintenance and reducing the potential for errors.
- Ease of Extension: New dialects can be added by updating the grammar to account for their specific syntax; the interpreter, on the other hand, generally remains untouched.
- Centralized Control: A single entry point ensures consistent behavior across all dialects and makes debugging simpler.
Advantages of This Approach:
- Maintainability: With a single grammar and interpreter, the codebase is easier to understand, modify, and extend.
- Evolvability: Adding support for new dialects or evolving existing ones requires changes only in targeted parts of the grammar.
- Performance: By avoiding multiple parsers and redundant logic, the implementation remains lightweight and efficient.
The Main Visitor
The main visitor, and really the core of the library, is the TableLineageExtractor. It extends the base visitor and is responsible for constructing the TableLineage object:
public record TableLineage(
    Operation operation,
    TableName destinationTable,
    Set<TableName> upstreamTables) {...}
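TableName is referenced but not defined in this post; for the examples that follow, assume it is a simple value type along these lines:
// Illustrative stand-in: the real TableName may carry schema or
// catalog qualifiers in addition to the bare table name.
public record TableName(String name) {}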
As with any other Antlr language application, we implement the visitor methods for the relevant parser rules in order to build the TableLineage object. For instance, just after the root node, we have the sqlQuery rule (grammar snippet above), which we use to extract the Operation before delegating the rest to its children.
Now you might be tempted to create and maintain TableLineage as a mutable field in TableLineageExtractor, mutating it to add relevant information on each rule visit. When visiting nodes of a join rule, for example, we could extract the table reference from the context and add it to upstreamTables:
public class TableLineageExtractor extends SqlBaseVisitor<Void> {
    private final TableLineage tableLineage = new TableLineage(...);
    ...
    @Override
    public Void visitJoin(JoinContext ctx) {
        // Mutate shared state on every join rule visit.
        tableLineage.upstreamTables().add(extractTableFromJoinContext(ctx));
        return null;
    }
}
This works… until we need to handle more complex, context-dependent cases like CTEs.
The Issue With CTEs
Common Table Expressions (CTEs) introduce a significant challenge for a mutable approach to building the TableLineage object. Unlike straightforward query structures, CTEs create temporary named result sets that can be referenced multiple times in subsequent parts of the query. This creates two key issues:
1. Deferred Resolution: The upstream tables used within a CTE are not immediately part of the main query's lineage. Instead, they need to be resolved in the context of their usage in the larger query. For example:
WITH filtered_data AS (
SELECT * FROM raw_data WHERE event_type = 'purchase'
)
SELECT region, COUNT(*)
FROM filtered_data
GROUP BY region;
Here, filtered_data depends on raw_data, but filtered_data itself is referenced later in the main query. Both layers of dependency need to be captured accurately.
2. Multiple Scope Levels: A single query can define multiple CTEs, each with their own internal scope. These scopes must be tracked separately to avoid collisions or missing dependencies. For example:
WITH cte1 AS (
SELECT * FROM table1
),
cte2 AS (
SELECT * FROM cte1
)
SELECT *
FROM cte2;
In this case, cte1 depends on table1, and cte2 depends on cte1. The correct lineage should trace cte2 → cte1 → table1. With a single mutable lineage object there is no clean way to track these per-CTE scopes: the CTE names themselves risk being recorded as real upstream tables while their actual sources are lost. A mutable approach to constructing the table lineage is therefore not a good solution for CTEs.
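For reference, the desired final output for the query above would surface only the physical table, with the CTE names resolved away; assuming the same output format shown earlier, that would be:
{
  "operation": "SELECT",
  "destinationTable": null,
  "upstreamTables": ["table1"]
}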
Enter Monoid
To address the challenges posed by CTEs and simplify the aggregation of lineage information, we reimagine TableLineage as a Monoid. In functional programming, a Monoid is an abstraction that represents a type equipped with a binary operation (for combining two instances) and an identity element (representing the "empty" value).
Another way to think about this is as an algebra: a set (defined by the object), a binary operation (combine), and an identity element.
By leveraging this abstraction, TableLineage becomes inherently composable, enabling us to aggregate lineage data seamlessly as we traverse the query.
The Key Properties of TableLineage as a Monoid:
- Binary Operation (Combine): The binary operation defines how two TableLineage objects can be merged. For lineage extraction, this involves:
- Combining the upstream tables from both objects into a unified set.
- Preserving or updating the destination table and operation as necessary.
public TableLineage combine(TableLineage other) {
return new TableLineage(
this.operation != null ? this.operation : other.operation,
this.destinationTable != null ? this.destinationTable : other.destinationTable,
Stream.concat(this.upstreamTables.stream(), other.upstreamTables.stream())
.collect(Collectors.toSet())
);
}
- Identity Element: The identity element represents an "empty" lineage object, with no operation, no destination table, and an empty set of upstream tables. This acts as the starting point for lineage aggregation.
public static final TableLineage IDENTITY =
    new TableLineage(null, null, Set.of());
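With combine and IDENTITY in place, the Monoid laws are easy to sanity-check. This sketch assumes the IDENTITY constant above and the TableName stand-in sketched earlier; since TableLineage is a record, value equality makes the assertions straightforward:
import java.util.Set;

public class MonoidLawsDemo {
    public static void main(String[] args) {
        var a = new TableLineage(null, null, Set.of(new TableName("raw_data")));
        var b = new TableLineage(null, null, Set.of(new TableName("sales_data")));
        var c = new TableLineage(null, null, Set.of(new TableName("dim_users")));

        // Identity laws: combining with IDENTITY changes nothing.
        assert a.combine(TableLineage.IDENTITY).equals(a);
        assert TableLineage.IDENTITY.combine(a).equals(a);

        // Associativity: grouping does not matter, which is exactly what
        // makes reducing over child nodes (shown below) well-defined.
        assert a.combine(b).combine(c).equals(a.combine(b.combine(c)));
    }
}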
How the Monoid Design Solves the CTE Problem
- Incremental Composition: As each AST node is visited, it returns a TableLineage object. For example, visiting a CTE node produces its own lineage, while the main query produces another. These intermediate results are then combined using the Monoid's combine operation, building the full lineage step by step.
- Simplicity in Aggregation: With a Monoid design, the aggregation logic becomes straightforward. Each node contributes its lineage independently, and the combining logic ensures that dependencies are resolved and merged correctly.
- Consistency and Safety: The immutability of TableLineage ensures that no intermediate results are accidentally overwritten or mutated, reducing the risk of subtle bugs and making the code easier to reason about.
Finally, when applied to TableLineageExtractor, we get something like this:
public class TableLineageExtractor extends SqlBaseVisitor<TableLineage> {
    @Override
    public TableLineage visitSqlQuery(SqlQueryContext ctx) {
        return switch (ctx.getChild(0)) {
            case CreateTableContext createTableContext ->
                new TableLineage(Operation.CREATE_TABLE, null, Set.of()).combine(visit(createTableContext));
            // ... cases for the remaining operations
        };
    }

    @Override
    public TableLineage visitChildren(RuleNode node) {
        var ctx = (ParserRuleContext) node;
        return ctx.children.stream().map(this::visit).reduce(IDENTITY, TableLineage::combine);
    }
}
While CTEs illustrate the Monoid composition approach, they are far from the only construct that benefits from it. Subqueries present another significant complexity in lineage tracking, yet with a Monoid-based design, handling subquery lineage becomes essentially "free" in terms of implementation complexity. I'll leave it to you, dear reader, to reason through that.