Leon Tchikindas
Leon Tchikindas has been working in analytics for more than seven years, helping businesses grow sales revenue, increase product engagement, and amplify marketing ROI. He currently serves as Head of Analytics and Business Operations at Periscope Data.

Automated Identification and Graphing of SQL Dependencies

July 6, 2017

If you’re anything like me—and if you’re reading this, you probably are—you’ve found yourself in a position where you’ve created 60+ interdependent (600+ line) data transformations defined in SQL as a core step of your ETL. You may even be sitting there thinking how did I get here, and what do I do now?

All of your queries are running, and your SQL-defined charts are returning results, but when it’s maintenance time, you don’t have a clear picture of the cascade of views, tables, and CSV files that your analysis depends on. At least it’s lucky that we’re good analysts, and we don’t make mistakes like circular references … right? How many of you ran straight for the Git repo?

You can sit down and start coding directed acyclic graphs (DAGs) in something like Airflow, but that’s a laborious task that involves reading through tons of SQL files and recording dependencies. There must be a better way!

I searched far and wide and found nothing, so I decided to dust off my Python skills from a previous life. This blog post will walk you through what I wrote, and leave you with a (mostly) working mapper, so you too can identify, clean up, and maintain your SQL ETL. If you want a copy of the script to run on your own set of SQL, email us at hello@periscopedata.com and we’ll send it right over.

First things first, you’re going to need Python (I chose to use 2.7, because that’s what I’m used to), an IDE (Spyder was readily available), and a handful of packages. One of the folks on my team pointed me to Anaconda Navigator, which effortlessly set up all of the above.

Before we dive in, here’s an overview of what the code will do:

  1. Import relevant packages
  2. Open the CSV file that contains the relevant SQL
  3. Iterate over the CSV, extract SQL and metadata
  4. Clean SQL
  5. Extract parents from SQL
  6. Define relevant node-set
  7. Graph and color the defined node-set

Import Relevant Packages

We’re going to need to import a couple of packages to make this work:

  • os—used to access the current working directory
  • argparse—parses command line arguments
  • re—regex library for parsing out table names from SQL
  • csv—a library for navigating CSV files
  • networkx—a Python graph library used for both creating and visualizing the directed acyclic graph

Some of these may require additional installation; I had to use Brew to get it set up.  

Open the CSV File That Contains the Relevant SQL

To begin, you need to get all your DDLs for your SQL-defined views, charts, and CSVs into a single table. We want the following four columns:

  • Query ID (can just be auto-increment)
  • Query Name
  • Query SQL
  • Object Type (CSV, Chart, View)

We use Periscope Data for our reporting needs, so it’s straightforward to access the repository of SQL that I’ve created for my data transformations. With Periscope Data’s new Usage Data feature, you can easily pull this information:

select
    sql_views.id thing_id
    , sql_views.name
    , sql_views.sql
    , 'view' as tpe
from
    periscope_usage_data.sql_views
where
    sql_views.deleted_at is null
union all
select
    charts.id
    , charts.name
    , charts.sql
    , 'chart' as tpe
from
    periscope_usage_data.charts
join periscope_usage_data.dashboards on
    charts.dashboard_id = dashboards.id
where
    charts.deleted_at is null
    and dashboards.deleted_at is null
union all
select
    csvs.id
    , csvs.name
    , '' as sql
    , 'csv' as tpe
from
    periscope_usage_data.csvs

Once we’ve got our query data, we can open it up in Python.

I’ve broken the code up into functional pieces, with everything coming together in the main() function near the bottom. The rest of the tutorial will bounce around a bit so we can follow the code as it executes, rather than in the order it is written. I’ll try my best to keep you oriented, but just in case, you can get the full Python file by emailing hello@periscopedata.com.

We will create two data structures from the information in this CSV:

  1. A dictionary, whose keys are the names of all of the objects and whose values are the object type (view, csv, chart). This will later be used for node color coding.
  2. A similar dictionary with the same keys, but with the direct parents of each object as the values.
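
For a toy three-object lineage (the names here are purely illustrative: a hypothetical CSV upload feeding a view, which in turn feeds a chart), the two structures would look like this:

```python
# Hypothetical example: customers_csv feeds customers_view,
# which in turn feeds revenue_chart
node_info_dict = {
    'customers_csv': 'csv',
    'customers_view': 'view',
    'revenue_chart': 'chart',
}
parent_dict = {
    'customers_csv': set(),               # a CSV upload has no parents
    'customers_view': {'customers_csv'},
    'revenue_chart': {'customers_view'},
}
```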

Before going much further, let’s disambiguate some DAG terminology:

Each object is a node, each arrow is an edge describing a relationship. Parents, children, ancestors, and descendants are concisely explained in the following diagram:

Our first function will then be defined as:

def get_node_info(csv_infile_path):

It will take one parameter, the location and name of the CSV file we generated from the SQL above; for example:

'/Users/username/Gviz/digraph.csv'

We create the desired dictionaries using {} notation:

   node_info_dict = {}
   parent_dict = {}

and immediately open the file:

   with open(csv_infile_path) as infile:

Using the with statement means we don’t have to worry about closing the file later; we just un-indent. This could also have been written as infile = open(csv_infile_path, 'rb'), but that would have required cleanup work at the end.
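
A quick illustration (using a hypothetical temp file): the file object is closed automatically the moment the with block ends.

```python
import os
import tempfile

# create an empty temp file to play with
fd, path = tempfile.mkstemp()
os.close(fd)

with open(path) as infile:
    contents = infile.read()

print(infile.closed)  # True -- `with` closed the file on un-indent
os.remove(path)
```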

The open command opens the CSV as a text file, but we know it’s a CSV that has attributes like headers which we’d like to use. To take advantage of that, we will read our CSV file with csv.DictReader, which allows us to call values based on headers, among a few other neat tricks:

       infile_reader = csv.DictReader(infile)

Though it may seem like we could have nested those commands into a single line, saving the naming of one variable, we actually can’t. with calls enter and exit methods on the object it’s given, and csv.DictReader doesn’t implement them. This would have made our code error.
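
You can verify this yourself (shown here in Python 3 syntax): csv.DictReader has no __exit__ method, so handing it directly to with would raise an error.

```python
import csv
import io

# an in-memory stand-in for our CSV file
reader = csv.DictReader(io.StringIO("name,tpe\ncustomers_view,view\n"))
print(hasattr(reader, '__exit__'))  # False -- not a context manager
first_row = next(reader)
print(first_row['tpe'])             # view
```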

Iterate Over CSV, Extract SQL and Metadata

Once we have our CSV loaded, we can iterate over it and pull out all of the information we want to put in our node_info_dict:

       for row in infile_reader:
           # pull data from CSV
           object_name = row['name']
           object_type = row['tpe']
           sql = row['sql']

We’re not done yet. We don’t just want the SQL, we want the parents in the SQL statement.

Luckily, there are only two places where you can define a source table in SQL: immediately following a from or a join keyword. It’s time for regex!

Before we dig into the details, let’s review what we want to accomplish:

  1. Clean up the SQL:
     • Lowercase everything
     • Remove block comments
     • Remove line comments
     • Replace all whitespace with single spaces
  2. Get the set of parents
  3. Get the set of common table expressions (CTEs / with statements)
  4. Remove CTEs from the set of parents
  5. Remove sub-selects
  6. Clean up brackets*

*Periscope uses bracketed table names for views defined in the product; e.g., [customers_view].

To achieve this, we define a set of RegEx strings:

REG_BLOCK_COMMENT = re.compile(r"(/\*)[\w\W]*?(\*/)", re.I)
REG_LINE_COMMENT = re.compile(r"(--.*)", re.I)
REG_BRACKETS = re.compile(r"\[|\]|\)|\"", re.I)
REG_PARENTS = re.compile(r"(?<=join\s)+[\S\.\"\']+|(?<=from\s)+[\S\.\"\']+", re.I)
REG_CTES = re.compile(r"(\S+)\sas\W*\(", re.I)

I won’t dive into how these work—that could be a blog post all on its own. In the meantime, we've already written a few posts on RegEx: Getting Started With Regular Expressions and String Parsing in SQL.

SQL Cleanup

We use a handful of these for the cleanup step described above, by defining a clean_sql function:

def clean_sql(sql):
   c_sql = sql.lower()  # lowercase everything (for easier match)
   c_sql = REG_BLOCK_COMMENT.sub('', c_sql)  # remove block comments
   c_sql = REG_LINE_COMMENT.sub('', c_sql)  # remove line comments
   c_sql = ' '.join(c_sql.split())  # replace \n and multi space w space
   return c_sql

c_sql = ' '.join(c_sql.split()) is a little tricky. This bit splits the SQL statement into an array based on arbitrary whitespace as a delimiter, and rejoins that array with a single space as a delimiter. This is to reduce multi-spaces, line breaks, and combinations of the two down to a single space.
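
A quick illustration of the trick:

```python
# newlines and runs of spaces all collapse to a single space
messy = "select foo,   bar\n\nfrom    baz"
clean = ' '.join(messy.split())
print(clean)  # select foo, bar from baz
```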

Extract Parents From SQL

Next we create functions that get the parents and CTEs:

# this returns the unique set of parents per query
def get_parents(c_sql):
   parents = set(REG_PARENTS.findall(c_sql))  
   return parents
# this returns the unique set of ctes per query, so we can exclude them from the list of parents
def get_ctes(c_sql):
   ctes = set(REG_CTES.findall(c_sql))  
   return ctes
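
Here’s a self-contained sanity check of the extraction on a small query. (For the demo I’ve simplified the parent regex slightly by dropping the quantifier on the lookbehind; it doesn’t change these matches.)

```python
import re

# simplified variants of REG_PARENTS / REG_CTES for the demo
reg_parents = re.compile(r"(?<=join\s)[\S\.\"\']+|(?<=from\s)[\S\.\"\']+", re.I)
reg_ctes = re.compile(r"(\S+)\sas\W*\(", re.I)

c_sql = ("with recent as ( select * from events ) "
         "select * from recent join users on users.id = recent.user_id")

parents = set(reg_parents.findall(c_sql))
ctes = set(reg_ctes.findall(c_sql))
print(sorted(parents))         # ['events', 'recent', 'users']
print(sorted(parents - ctes))  # ['events', 'users'] -- the CTE is excluded
```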

Coming back to our iteration over the CSV in get_node_info, we call all of these:

           # clean the sql
           c_sql = clean_sql(sql)
           # get the set of parents
           parents = get_parents(c_sql)
           # get set of ctes to exclude from parents
           ctes = get_ctes(c_sql)
           # remove CTES from parent dict
           for cte in ctes:
               parents.discard(cte)
           # get rid of brackets in views
           c_parents = set()
           for parent in parents:
               if parent[:1] != '(':  # skip sub-selects
                   c_parents.add(REG_BRACKETS.sub('', parent))
           # add the object name and type and direct parents to the dict
           node_info_dict[object_name] = object_type  
           parent_dict[object_name] = c_parents
   return (parent_dict, node_info_dict)

Define Relevant Node-set

My first iteration of this project would draw a full map of the entire view and chart ecosystem. Here’s a screenshot of what that looked like:

Needless to say, that isn’t the most useful visualization of my career. I quickly realized that specifying a focal object around which to graph ancestors and descendants makes for a much more digestible map.

However, in order to achieve that, we first need to find all of the relatives of a given node and only graph that set.  

This tool allows the user to specify a direction: ancestors, descendants, or both. In this tutorial, we will go through the process of identifying a node set that contains both, as it will require us to go through every relationship type.

We define a function that returns a relevant node set based on a parent dictionary, a focal node, and a direction:

def get_node_set(parent_dict, focal_node=None, direction=None):

As before, we define some placeholders to keep our data:

   descendant_dict = {}  # intended to store all descendants (any generation)
   ancestor_dict = {}  # intended to store all ancestry (any generation)
   node_set = set()  # final result is stored and then returned via this set
   node_set.add(focal_node)

Here’s where it gets a little tough. If you recall, our parent_dict is a key:value dictionary that maps an object to the set of its direct parents. For example, parent_dict['A'] = {'C', 'B', 'D'} where C, B, and D together create A:

create table A as
select
    foo
from
    C
join B using (id)
join D using (id)

But what if we wanted a dictionary of children instead, so that we can look at descendants and not just ancestors? This would make C, B, and D the keys, and A one of the values for each.

To derive that, we loop over all of the parents and create a reverse dictionary called child_dict. Notice how there are two loops: one over each key in parent_dict, and one over each node in the value set. Deduplication is handled natively by writing to set and dictionary objects:

# this reverses a parent tree to a child tree
def get_child_dict(parent_dict):
   child_dict = {}
   for node in parent_dict:
       for parent in parent_dict[node]:
           if parent not in child_dict:
               # set([node]), not set(node): calling set() on a string
               # would split it into individual characters
               child_dict[parent] = set([node])
           else:
               child_dict[parent].add(node)
   return child_dict
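
Here’s an equivalent, slightly more compact sketch using dict.setdefault, checked on a toy tree:

```python
def get_child_dict(parent_dict):
    child_dict = {}
    for node in parent_dict:
        for parent in parent_dict[node]:
            # setdefault creates the empty set on first sight of a parent
            child_dict.setdefault(parent, set()).add(node)
    return child_dict

toy_parents = {'a': {'b', 'c'}, 'b': {'c'}}
print(get_child_dict(toy_parents))  # b is a parent of a; c of both a and b
```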

We then call this function and store the child_dict for use in just a bit:

   child_dict = get_child_dict(parent_dict)  # immediate children

Now we have a parent list and a child list, but what we really want is all the ancestors and descendants of a given node. Given this data structure, we must execute a tree traversal! This is where a recursive function comes in handy, since the parent has parents, and so on. Here’s the result:

# this traverses an arbitrary tree (parents or children) to get all ancestors or descendants
def traverse_tree(node, d_tree, been_done=None):
   if been_done is None:  # avoid sharing a mutable default between calls
       been_done = set()
   tree_outputs = set(d_tree.get(node, set()))  # direct relatives
   for key in d_tree.get(node, set()):  # 2nd step relatives
       if key not in been_done:
           been_done.add(key)  # to break any circular references
           tree_outputs = tree_outputs.union(traverse_tree(key,
                   d_tree, been_done))
   return tree_outputs
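
To see the been_done guard earn its keep, here’s a self-contained run over a tiny tree with a deliberate circular reference:

```python
def traverse_tree(node, d_tree, been_done=None):
    if been_done is None:
        been_done = set()
    tree_outputs = set(d_tree.get(node, set()))  # direct relatives
    for key in d_tree.get(node, set()):
        if key not in been_done:
            been_done.add(key)  # breaks circular references
            tree_outputs |= traverse_tree(key, d_tree, been_done)
    return tree_outputs

# a's parent is b, b's parent is c, and c circularly claims a as a parent
tree = {'a': {'b'}, 'b': {'c'}, 'c': {'a'}}
print(sorted(traverse_tree('a', tree)))  # ['a', 'b', 'c'] -- and it terminates
```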

Using this function, we build a dictionary that contains ancestors, and another dictionary that contains descendants:

   # build descendant dict
   for node in child_dict:
       descendant_dict[node] = traverse_tree(node, child_dict,
               been_done=set())
   # build ancestor dict
   for node in parent_dict:
       ancestor_dict[node] = traverse_tree(node, parent_dict,
               been_done=set())

Depending on whether we want ancestors, descendants, or everything related to a node, we pick some overlap of these sets as our relevant set list.
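
For the “both” direction, for example, the relevant set is just the focal node unioned with its entries in the two dictionaries (toy values shown; the names are illustrative):

```python
focal_node = 'customers_view'
ancestor_dict = {'customers_view': {'customers_csv'}}
descendant_dict = {'customers_view': {'revenue_chart'}}

# focal node plus everything above and below it in the lineage
node_set = ({focal_node}
            | ancestor_dict.get(focal_node, set())
            | descendant_dict.get(focal_node, set()))
print(sorted(node_set))
```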

Build the Graph

Finally, with the relevant nodes and the parent-child relationships, we can build our DAG:

def build_d_graph(parent_dict, node_set, node_type_dict):
   G = nx.DiGraph()  # initialize graph object
   for node in node_set:
       # Add color nodes
       if node_type_dict.get(node, None) == 'view':
           G.add_node(node, color='green')
       elif node_type_dict.get(node, None) == 'chart':
           G.add_node(node, color='blue')
       elif node_type_dict.get(node, None) == 'csv':
           G.add_node(node, color='red')
       # add edges and non-color nodes
       for parent in parent_dict.get(node, set()):
           G.add_edge(parent, node)
   return G
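
Here’s a self-contained sketch of exercising that logic on the toy lineage from earlier (hypothetical names; the color mapping mirrors the if/elif chain above):

```python
import networkx as nx

node_type_dict = {'customers_csv': 'csv', 'customers_view': 'view',
                  'revenue_chart': 'chart'}
parent_dict = {'customers_view': {'customers_csv'},
               'revenue_chart': {'customers_view'}}
node_set = {'customers_csv', 'customers_view', 'revenue_chart'}
colors = {'view': 'green', 'chart': 'blue', 'csv': 'red'}

G = nx.DiGraph()
for node in node_set:
    G.add_node(node, color=colors.get(node_type_dict.get(node)))
    for parent in parent_dict.get(node, set()):
        G.add_edge(parent, node)  # edges point parent -> child

print(G.number_of_nodes(), G.number_of_edges())  # 3 2
print(list(nx.topological_sort(G)))              # csv -> view -> chart
```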

And done!

Using some Python, you can now build clear visualizations that let you better understand your data, dependencies, and ETL processes, without having to recode those relationships after each SQL adjustment. This will help you fix or simplify your ETL processes and create a more dependable analytics stack.

Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.