Lesson #3: Generating SQL to Profile Table Data

If you are here to use the data-profiling-SQL-generating example provided with Aginity Pro, you can jump straight to Step 4 - Aginity Active Catalog Example. If you want to dig into the concepts behind that example, then read on.

A large part of the day-to-day work of both data analysts and data engineers is exploring data in database tables. For small datasets, analysts have many tools they can use. In addition to showing five-number summaries, these tools might display distribution graphs or plots of relationships between columns. However, in the world of cloud-based data, these tools might not be available. The data might be too big to bring back to the tool, or there might be restrictions related to connections or even legal concerns with transferring data. Sometimes, in order to begin exploring our data, we need to begin that exploration directly in SQL.

Of course, SQL is great for aggregating data. MPP cloud databases like Redshift, Snowflake, Netezza, and Hive are optimized for responding to this kind of SQL query. SQL is the most straightforward language for expressing ideas about data relationships and performing manipulations on data. However, there are no generally available facilities to explore “all” the data. For example, in Python, R, or SAS, there are language constructs that let me say “apply this function to all of the columns of this dataset.” In SQL, on the other hand, we have to explicitly list each column transformation that we want performed.

All database systems have the concept of “system tables” or “system catalog” or “catalog tables” that are tables or views that contain information about all of the database objects, including table and column names. In this tutorial, we’ll learn how to use these system tables to automatically generate SQL that can summarize our data.

Step 1 - What do you mean “generate SQL”?

I love programming languages. Many languages are great for generating source code, either in their own language or in a different one. SQL is not a language that naturally comes to mind when we think about doing a lot of string manipulation. However, database administrators have a long history of writing SQL that generates SQL.

You might run into the term “dynamic SQL”. This is the technical term used by database systems that include a procedural language: the procedural program builds a SQL statement as a string at runtime and then submits it for execution from within that same program.
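For contrast, here is a minimal sketch of what dynamic SQL looks like in a Redshift stored procedure; the procedure name is just a placeholder for illustration.

-- A dynamic SQL sketch: the statement is built as a string at runtime
-- and executed from within the same procedural program.
create or replace procedure count_rows(tbl varchar)
as $$
declare
    result bigint;
begin
    execute 'select count(*) from ' || tbl into result;
    raise info 'Row count for %: %', tbl, result;
end;
$$ language plpgsql;

call count_rows('pg_catalog.stl_query');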

For our purposes, we are technically using “static SQL”: by the time the SQL statement is submitted to the database system, it is already a complete SQL statement. However, before submitting that statement, we will run an earlier query whose output is the static SQL that will profile our data. This idea will become clearer as we work through some examples.
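As a tiny preview, here is a sketch of that two-step idea: this query's output is itself a handful of complete, static SQL statements, one per system log table, each of which you could copy and run on its own.

-- Generate a row-count statement for a few of the Redshift STL log tables
select 'select count(*) as row_count from ' || table_schema || '.' || table_name || ';' as generated_sql
from information_schema.tables
where table_schema = 'pg_catalog'
and table_name like 'stl_%'
limit 5;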

Step 2 - What are “system tables”?

Just one more term before we jump into an example. Database “system tables” or “system views” hold information about the database itself. In Redshift (and in most database systems based on PostgreSQL), the view information_schema.columns contains all of the columns for all of the tables in the currently active database.

[Image: generate_columns_table.png]
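If you want to look at this view for yourself, a quick query against a few of its more useful columns looks like this:

-- Inspect the metadata that information_schema.columns exposes for one table
select table_schema, table_name, column_name, ordinal_position, data_type
from information_schema.columns
where table_name = 'stl_plan_info'
order by ordinal_position;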

As an initial example, let’s say that we want to generate a SQL statement to get the maximum value from each column in the stl_plan_info table that has “node” in its name. Here is a query to get the column names we are interested in:

select column_name, data_type
from information_schema.columns
where table_name = 'stl_plan_info'
and data_type = 'integer'
and column_name like '%node%'
;
[Image: generate_results_1.png]

Using some SQL string concatenation, we can generate the aggregation SQL based on those criteria:

select
'select '||quote_literal(column_name)||' as col_name, max('||column_name||') as max_value from stl_plan_info;' as generated_sql
from information_schema.columns
where table_name = 'stl_plan_info'
and data_type = 'integer'
and column_name like '%node%'
;
[Image: generate_results_2.png]

You’ll notice that this naive example produces one SQL statement per column, which might not be what we want. We’ll revisit this issue later.

Note

A note on the examples used here.

Redshift includes STL Tables that contain log information for events that occur on the system. Some of these tables are only accessible to administrators. However, STL Tables pertaining to the queries that you execute are available to you. In order for you to follow along with this tutorial, when we need to profile some data we’ll point to one of these STL Tables that is guaranteed to be available.

Method

As you can guess from looking at the brief example above, SQL that generates SQL can get complicated pretty quickly. The general method for developing these queries is first to figure out what you want the profiling SQL to look like. Then, write a simple query to generate just one portion of that target SQL. Keep adding to your query until you achieve the target SQL.
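For example, if the target is a profile of a single timestamp column in stl_query, you might start by writing that statement by hand; it has the same shape as the generated SQL we will see below.

-- The hand-written target: profile one column, then grow the generator toward it
select 'starttime' as col_name,
       count(starttime) as row_count,
       min(starttime) as min,
       max(starttime) as max,
       count(distinct starttime) as distinct_count
from pg_catalog.stl_query;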

In this case, we have a wonderfully complex SQL-generating query provided by Aginity as an example in the Active Catalog. Below, we’ll look at some general principles, then explore this example query.

Step 3 - Data Profiling

If we already know which columns of a table are “interesting”, then we can just write SQL to explore those columns. However, when we are presented with a new table, we don’t know which columns are interesting and which are not. We don’t know whether a column contains just a few values repeated over and over, or whether there are millions of unique values. We don’t know whether the date columns represent only today, or whether they stretch back 20 or 30 years. Gaining this knowledge about the data is one reason that we profile.

Another reason that we profile tables is to get a handle on data quality. If our STATE table has 50 unique values one day and 482 unique values the next, then we might need to investigate the ETL process because something has clearly gone wrong. Other changes are more subtle. If the average transaction_count is 4,927,642 one day and 3,477,923 the next, is there a problem with the data? Maybe, maybe not. However, we can capture the profile data each day and store it in a table. Then we can check the standard deviation of the average_transaction_count to see whether there might be a problem worth investigating.
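Here is a minimal sketch of that idea, assuming a hypothetical profile_history table that is loaded once a day with the output of the profiling SQL; the table and column names are illustrative only.

-- Hypothetical table for storing daily profiling results
create table if not exists profile_history (
    profile_date date,
    table_name   varchar(128),
    column_name  varchar(128),
    row_count    bigint,
    average      double precision
);

-- Flag any day whose average is more than two standard deviations from the norm
select profile_date, column_name, average
from (
    select profile_date, column_name, average,
           avg(average)    over (partition by column_name) as overall_avg,
           stddev(average) over (partition by column_name) as overall_stddev
    from profile_history
    where table_name = 'transactions'
) h
where abs(average - overall_avg) > 2 * overall_stddev;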

Data Types

This might seem obvious, but it bears emphasizing. Different data types have different aggregation functions. Taking the average (mean) of a list of dates doesn’t make any sense. You might be able to get the max and min of a character string, but that doesn’t really give the same insight as the same aggregation over numeric data.

A final consideration is that physical data types are not always useful for determining aggregation. For example, the ID column of a table might be of type bigint. You can take the min, max, and average of this column, but that doesn’t tell you anything useful about the data. So, feel free to create your own version of this example – that’s why Aginity provided it – that takes into account your local naming standards and business knowledge to avoid performing aggregation on columns that won’t provide useful information.
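For instance, a lightly adapted version of the earlier column-selection query might simply filter out columns that follow an ID naming convention; the pattern here is only an illustration, so substitute your own standards.

-- Skip surrogate-key style columns when choosing numeric columns to profile
select column_name, data_type
from information_schema.columns
where table_name = 'stl_plan_info'
and data_type in ('bigint', 'integer', 'smallint')
and column_name not like '%id'   -- excludes userid and nodeid, for example
;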

Step 4 - Aginity Active Catalog Example

In Aginity Pro, in the “Catalog” tab, you should see a folder labeled Sample Catalog. Open the folder path Sample Catalog -> Examples -> Redshift -> Data Profile.

[Image: generate_catalog_navigation.png]

Set up your Query Window with this query:

select query from @{/Sample Catalog/Examples/Redshift/Data Profile/Profile Data in Table};

and run with these parameters:

[Image: generate_params.png]

Your results will look like this:

[Image: generate_results_3.png]

with the “query” column containing the text of the data profiling SQL.

Here is the SQL that was returned with some formatting applied.

 SELECT 'starttime',
        count(starttime) AS ROW_COUNT,
        min(starttime) AS MIN,
        max(starttime) AS MAX,
        count(DISTINCT starttime) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'endtime',
        count(endtime) AS ROW_COUNT,
        min(endtime) AS MIN,
        max(endtime) AS MAX,
        count(DISTINCT endtime) AS distinct_count
 FROM pg_catalog.stl_query;

 SELECT 'userid',
        count(userid) AS ROW_COUNT,
        min(userid) AS MIN,
        max(userid) AS MAX,
        avg(userid) AS average,
        count(DISTINCT userid) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'query',
        count(query) AS ROW_COUNT,
        min(query) AS MIN,
        max(query) AS MAX,
        avg(query) AS average,
        count(DISTINCT query) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'xid',
        count(xid) AS ROW_COUNT,
        min(xid) AS MIN,
        max(xid) AS MAX,
        avg(xid) AS average,
        count(DISTINCT xid) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'pid',
        count(pid) AS ROW_COUNT,
        min(pid) AS MIN,
        max(pid) AS MAX,
        avg(pid) AS average,
        count(DISTINCT pid) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'aborted',
        count(aborted) AS ROW_COUNT,
        min(aborted) AS MIN,
        max(aborted) AS MAX,
        avg(aborted) AS average,
        count(DISTINCT aborted) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'insert_pristine',
        count(insert_pristine) AS ROW_COUNT,
        min(insert_pristine) AS MIN,
        max(insert_pristine) AS MAX,
        avg(insert_pristine) AS average,
        count(DISTINCT insert_pristine) AS distinct_count
 FROM pg_catalog.stl_query
 UNION ALL
 SELECT 'concurrency_scaling_status',
        count(concurrency_scaling_status) AS ROW_COUNT,
        min(concurrency_scaling_status) AS MIN,
        max(concurrency_scaling_status) AS MAX,
        avg(concurrency_scaling_status) AS average,
        count(DISTINCT concurrency_scaling_status) AS distinct_count
 FROM pg_catalog.stl_query;


 SELECT 'label',
        count(label) AS ROW_COUNT,
        max(top10_literals) AS top10_literals,
        count(DISTINCT label) AS distinct_count
 FROM pg_catalog.stl_query r
 CROSS JOIN
   (SELECT listagg(label, ',') top10_literals
    FROM
      (SELECT top 10 label
       FROM
         (SELECT label,
                 count(*) cnt
          FROM pg_catalog.stl_query
          GROUP BY label)
       ORDER BY cnt DESC)) AS rr
 UNION ALL
 SELECT 'database',
        count(DATABASE) AS ROW_COUNT,
        max(top10_literals) AS top10_literals,
        count(DISTINCT DATABASE) AS distinct_count
 FROM pg_catalog.stl_query r
 CROSS JOIN
 (SELECT listagg(DATABASE, ',') top10_literals
    FROM
      (SELECT top 10 DATABASE
       FROM
         (SELECT DATABASE,
                 count(*) cnt
          FROM pg_catalog.stl_query
          GROUP BY DATABASE)
       ORDER BY cnt DESC)) AS rr
 UNION ALL
 SELECT 'querytxt',
        count(querytxt) AS ROW_COUNT,
        max(top10_literals) AS top10_literals,
        count(DISTINCT querytxt) AS distinct_count
 FROM pg_catalog.stl_query r
 CROSS JOIN
   (SELECT listagg(querytxt, ',') top10_literals
    FROM
      (SELECT top 10 querytxt
       FROM
         (SELECT querytxt,
                 count(*) cnt
          FROM pg_catalog.stl_query
          GROUP BY querytxt)
       ORDER BY cnt DESC)) AS rr ;

Here we see three SQL statements: one for the two time columns, one for the seven numeric columns, and one for the three text columns.

Step 5 - Digging into the Example

Let’s open up the example. In the Active Catalog, navigate as before, but rather than double-clicking Profile Data in Table, this time drag it into the Query Window. This will expand the Catalog Item so that it looks like this:

  (select
  case

      when section = 'numeric' then
          'select '''||column_name||''', count('||column_name||') as row_count, min('||column_name||') as min, max('||column_name||') as max, avg('||column_name||') as average,
              count(distinct '||column_name||') as distinct_count
          from '||$schema||'.'||$table_name
      when section = 'text' then
          'select '''||column_name||''', count('||column_name||') as row_count,  max(top10_literals) as top10_literals,
              count(distinct '||column_name||') as distinct_count
          from '||$schema||'.'||$table_name||' r
          cross join (select listagg( '||column_name||', '','') top10_literals from (select top 10 '||column_name||' from (select '||column_name||', count(*) cnt from '||$schema||'.'||$table_name||' group by '||column_name||') order by cnt desc)) as rr '
      when section = 'datetime' then
          'select '''||column_name||''', count('||column_name||') as row_count, min('||column_name||') as min, max('||column_name||') as max,
              count(distinct '||column_name||') as distinct_count
          from '||$schema||'.'||$table_name
  end ||
  case when ordinal_position = (
  select max(ordinal_position) from
  (
      select column_name, ordinal_position, 'numeric'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('bigint', 'double precision', 'integer', 'numeric', 'real', 'smallint')
              union all
      select column_name, ordinal_position, 'text'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('"char"', 'character', 'character varying', 'text')
              union all
      select column_name, ordinal_position, 'datetime'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('abstime', 'date', 'timestamp with time zone', 'timestamp without time zone')
  ) c2 where c2.section = c1.section ) then ';' else
  ' union all'  end as query, section, ordinal_position
  from
  (
      select column_name, ordinal_position, 'numeric'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('bigint', 'double precision', 'integer', 'numeric', 'real', 'smallint')
              union all
      select column_name, ordinal_position, 'text'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('"char"', 'character', 'character varying', 'text')
              union all
      select column_name, ordinal_position, 'datetime'::varchar(50) section
      from information_schema.columns
      where table_schema = $schema
          and table_name = $table_name
          and data_type in ('abstime', 'date', 'timestamp with time zone', 'timestamp without time zone')
  ) c1
  ) r order by section , ordinal_position;

You can learn a lot by digging into this example and adapting it for your own purposes, creating your own Active Catalog entries. Here, I’ll draw your attention to two particular aspects.

The problem of stopping

In our initial example, we generated one statement per “node” column. We might try to combine these into a single statement with this (broken) code:

select
'select '||quote_literal(column_name)||' as col_name, max('||column_name||') as max_value from stl_plan_info union all' as generated_sql
from information_schema.columns
where table_name = 'stl_plan_info'
and data_type = 'integer'
and column_name like '%node%'
;

which, after formatting, produces this SQL:

SELECT 'plannode'    AS col_name,
       max(plannode) AS max_value
FROM   stl_plan_info
UNION ALL
SELECT 'nodeid'    AS col_name,
       max(nodeid) AS max_value
FROM   stl_plan_info
UNION ALL

The final “UNION ALL” doesn’t belong there. This is a common problem when generating “delimited” items: you need to handle the last item slightly differently than the rest of the items. How does our Active Catalog example handle this?

Look particularly at this construction, where the concatenation operator (||) joins two case expressions:

case .. end || case .. end as query

The first case expression generates the aggregate functions, very much like our simplified example, with one branch for each group of data types. The second case expression checks whether the current column is the last one in its group by comparing its ordinal_position to the max(ordinal_position) for that group. (ordinal_position is given to us by the information_schema.columns view as the ordering of column names in the table.) When the column is the last one, the expression appends a semicolon (;); otherwise, it appends ' union all'.

In our simplified example, which doesn’t need the first case expression, the construction looks like this:

select
'select '||quote_literal(column_name)||' as col_name,
    max('||column_name||') as max_value from stl_plan_info' ||
case when ordinal_position =
  (select max(ordinal_position) from
    (select column_name, ordinal_position from information_schema.columns
     where table_name = 'stl_plan_info' and data_type = 'integer' and column_name like '%node%')
   )
  then ';' else ' union all'
  end as generated_sql
from information_schema.columns
where table_name = 'stl_plan_info'
and data_type = 'integer'
and column_name like '%node%'
order by ordinal_position
;

which produces the desired output, giving us a single, well-constructed SQL statement covering both columns. In our simplified version, the subselect with max(ordinal_position) can only ever produce one value. Because the Active Catalog version handles multiple data types, there is a separate “last column” for each section, so it uses a technique that we won’t cover here, called a correlated subquery, to manage that situation; this is what requires the c1 and c2 aliases that appear near the end of the expanded Catalog Item.

Rolling up using listagg()

A final consideration with the Active Catalog version is the situation with the “top 10” most frequently occurring values for text data. Here is the generated code for the querytxt column:

select 'querytxt', count(querytxt) as row_count,  max(top10_literals) as top10_literals,
            count(distinct querytxt) as distinct_count
        from pg_catalog.stl_query r
        cross join (select listagg( querytxt, ',') top10_literals
                    from
                     (select top 10 querytxt from
                          (select querytxt, count(*) cnt
                           from pg_catalog.stl_query group by querytxt
                          ) order by cnt desc
                     )
                   ) as rr ;

Let’s examine this from the inside out. The innermost subselect groups by querytxt and returns the count(*) for each value of our target text column. The next layer up uses Redshift’s handy TOP clause to get the 10 querytxt values with the highest counts. But we want those querytxt values rolled up into a single row rather than spread across multiple rows. This is exactly what the LISTAGG function does: it takes the values from the querytxt column and concatenates them into a single long string named “top10_literals”. This probably isn’t too useful for a complex text value like querytxt, but LISTAGG is great to have in your bag of tricks.
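As a stand-alone illustration, here is a small sketch that rolls the distinct database names from stl_query into one comma-separated string:

-- Roll distinct database names up into a single comma-separated value
select listagg(distinct database, ', ') within group (order by database) as databases
from pg_catalog.stl_query;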

Conclusion

First, use the Active Catalog. Seriously, right now open a Redshift connection tab in Aginity Pro, navigate to the Profile Data in Table object, double-click it, put select * in front of the inserted text, hit F5, and start using the generated code.

The idea of SQL that generates SQL is kind of mind-bending, but it is a technique that has been in use for several decades. You can find examples of it all over the web for every conceivable database system. I always learn something from deconstructing other people’s code.

When writing your own SQL-generating SQL, start slowly. Once you have a technique that works, put that into the Active Catalog so that you can find it when you need it and even share it with your teammates.