Extracting String Items from JSON Arrays in Snowflake

Introduction

Have you ever struggled with handling JSON data in SQL queries? If so, you’re not alone. In this post, we’ll explore a real-world scenario where we tackled this challenge in Snowflake, transforming complex JSON structures into structured, queryable data.

Understanding the Challenge

In one of our ETL projects, we process raw data ingested from a third-party REST API into a Snowflake database. Some parent-child relationships within this data are stored as JSON strings inside VARCHAR columns.

Here is a simplified example for the purposes of this article:

Consider a table storing Person records where the Countries column contains country details in JSON format. This field can hold different types of values:

  • A single JSON object:
    {“id”:11,”name”:”France”}
  • A JSON array with multiple records:
    [{“id”:22,”name”:”Germany”}, {“id”:33,”name”:”Italy”}]
  • Blank value:
    NULL

We need to extract country names associated with each person as a simple comma-separated string for further processing. For instance, given PersonID = 3, we aim to extract “Germany, Italy”. Let’s walk through how to achieve this efficiently in Snowflake.

Extracting String Items from JSON Arrays

Step 1: Create a Sample Table

First, let’s set up a temporary table with sample data:

CREATE OR REPLACE TEMPORARY TABLE tmpSampleData (PersonID int, Countries VARCHAR(16777216));

INSERT INTO tmpSampleData VALUES (1, '{"id":11,"name":"France"}');
INSERT INTO tmpSampleData VALUES (2, '{"id":22,"name":"Germany"}');
INSERT INTO tmpSampleData VALUES (3, '[{"id":22,"name":"Germany"}, {"id":33,"name":"Italy"}]');
INSERT INTO tmpSampleData VALUES (4, '[{"id":44,"name":"USA"}, {"id":33,"name":"Italy"}, {"id":55,"name":"Spain"}]');
INSERT INTO tmpSampleData VALUES (5, NULL);

Step 2: Extract Country Names

Now, let’s write a SQL query to extract country names and convert them into a comma-separated string:

SELECT a.*, b.*
FROM tmpSampleData a
LEFT OUTER JOIN (
   SELECT PersonID,
          LISTAGG(CountryName, ', ') WITHIN GROUP(ORDER BY CountryName) AS CountryName
   FROM (SELECT PersonID, items.value:name as CountryName
         FROM tmpSampleData, LATERAL FLATTEN(PARSE_JSON(Countries)) AS items
         WHERE ARRAY_SIZE(PARSE_JSON(Countries)) IS NOT NULL
         UNION ALL
         SELECT PersonID, PARSE_JSON(Countries):name
         FROM tmpSampleData
         WHERE ARRAY_SIZE(PARSE_JSON(Countries)) IS NULL)
   GROUP BY PersonID) b ON a.PersonID = b.PersonID
ORDER BY a.PersonID;

Here is the output of the query where you can see the desired output in the rightest column:

Query Breakdown

1) Parsing JSON Data

Since Countries is stored as a VARCHAR, we use PARSE_JSON(Countries) to convert it into a  VARIANT data type in Snowflake which can store either OBJECT or ARRAY.

2) Handling Different JSON Structures

Parsing our JSON would return either an array of objects or a single object. We need to handle each case differently. We do this in this sub-query where we combine the two cases with a UNION ALL clause:

SELECT PersonID, items.value:name as CountryName
FROM tmpSampleData, LATERAL FLATTEN(PARSE_JSON(Countries)) AS items
WHERE ARRAY_SIZE(PARSE_JSON(Countries)) IS NOT NULL
UNION ALL
SELECT PersonID, PARSE_JSON(Countries):name
FROM tmpSampleData
WHERE ARRAY_SIZE(PARSE_JSON(Countries)) IS NULL


The top part handles the records where the JSON contain an array of items. We recognize these records by checking the ARRAY_SIZE() function looking for non-null values: ARRAY_SIZE(PARSE_JSON(Countries)) IS NOT NULL.

The bottom part handles the records where the JSON contain a single object (or NULL). We recognize these records by checking the ARRAY_SIZE() function looking for null value: ARRAY_SIZE(PARSE_JSON(Countries)) IS NULL.

3) Flatten the Items

The top part flattens the array of countries linked to each person. We do this by combining these two functions:

a) FLATTEN – This flattens the compound array into multiple rows

b) LATERAL – When this function is used in a FROM clause it behaves like joining the main table to a correlated subquery producing a combination for each record from the main query and an associated record from the subquery.

Here is an example how the countries linked to each person are flattened out:

The bottom part directly uses the object notation accessing the “name” property using the semi-colon syntax: PARSE_JSON(Countries):name

4) Aggregating Country Names

Once we have flattened the country names for all persons from the source table, we are using the LISTAGG() function to combine the records back to a comma-separated string containing all countries per person:

LISTAGG(CountryName, ‘, ‘) WITHIN GROUP(ORDER BY CountryName) AS CountryName

GROUP BY  PersonID

The LISTAGG() returns the concatenated input values, separated by the delimiter string which is comma in this case. The input values are coming from the items contained into each PersonID group.

5) Merging Data with the Original Table

Finally, we LEFT JOIN our results back to the original table to retain other fields from the main table.

Real-World Use Cases & Lessons Learned

In our ETL pipeline, we handle multiple JSON fields using a similar approach—creating specific LEFT JOIN subqueries for each JSON attribute. If we only need the flattened data (without aggregation), we can skip the LISTAGG() step.

Mastering these Snowflake functions makes working with JSON much more manageable. Stay tuned for more insights into Snowflake and data analytics challenges!

Happy querying!

Leave a Reply

Your email address will not be published. Required fields are marked *