A deep dive into Snowflake’s data clean rooms

  • February 01, 2024

In my previous article, I introduced the concept of data clean rooms and explained why they’re gaining popularity within the tech industry. In this article, I’ll take you on a deeper dive into the architecture of Snowflake's data clean rooms and explain why Snowflake has become a major provider of them.

If you compare the previous data clean rooms that Snowflake has provided over the past few years with the new data clean rooms (announced in June 2023 and currently available as a part of a private preview), you’ll find that they have different mechanisms. The existing data clean rooms, called Framework-based Data Clean Rooms, or Framework-based DCR, are like a recipe that combines various Snowflake features to create data clean rooms. These new data clean rooms, called Native Data Clean Rooms, or Native DCR, are provided as a completely native feature.

Native Data Clean Rooms are an easy-to-use innovation. While existing data clean rooms can be difficult to implement, Native Data Clean Rooms have been designed from the ground up to make implementation easy, not just to automate it. Native Data Clean Rooms enhance the data-sharing capabilities that Snowflake has to offer. On the other hand, Framework-based Data Clean Rooms will continue to provide the most stringent privacy control mechanisms.

Let's look at Framework-based DCR and Native DCR.

Framework-based data clean rooms

Framework-based data clean rooms are made up of a combination of existing features. Snowflake provides the source code that users run to set up their accounts with tables, views, shares, streams, tasks, row access policies, Python User-Defined Functions (UDF), and more. Snowflake also provides a public Streamlit App for building demo environments automatically. Essentially, with the framework-based approach, you build the data clean rooms yourself.

The framework is built around a distinctive concept: query templates. The data provider creates templates of aggregation queries that follow agreed-upon rules, and the data consumer can execute only those queries. Consumers can't reference the data directly; they can only aggregate it in the ways the templates permit. As a result, only the aggregation results are shared, as if through a virtual data clean room, rather than the data itself.

The provider uses Jinja, a Python templating library, to create and maintain query templates. Consumers don't directly execute queries against the shared data. Instead, they request the dimensions they need from the provider according to a prior agreement, and the provider generates a query from the template that returns that information. The consumer then obtains the results by executing the query supplied by the provider.
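To make the template idea concrete, here is a minimal sketch of how a provider might render an agreed-upon aggregation query from the dimensions a consumer requests. It is only an illustration: it uses Python's standard-library string.Template as a stand-in for Jinja, and the table name, column names, allow-list and the render_query helper are all hypothetical rather than part of Snowflake's framework.

```python
from string import Template

# Hypothetical allow-list of dimensions agreed between provider and consumer.
ALLOWED_DIMENSIONS = {"region", "age_band", "product_category"}

# Hypothetical aggregation-only template; string.Template stands in for Jinja.
QUERY_TEMPLATE = Template(
    "SELECT $dimensions, COUNT(*) AS customers\n"
    "FROM provider_db.shared.customers\n"
    "GROUP BY $dimensions\n"
    "HAVING COUNT(*) >= $min_group_size"
)


def render_query(requested_dimensions, min_group_size=5):
    """Render the aggregation query for a consumer request.

    Raises ValueError if nothing is requested or if any requested
    dimension is outside the agreed allow-list.
    """
    if not requested_dimensions:
        raise ValueError("at least one dimension is required")
    rejected = [d for d in requested_dimensions if d not in ALLOWED_DIMENSIONS]
    if rejected:
        raise ValueError(f"dimensions not permitted: {rejected}")
    return QUERY_TEMPLATE.substitute(
        dimensions=", ".join(requested_dimensions),
        min_group_size=int(min_group_size),
    )


if __name__ == "__main__":
    print(render_query(["region", "age_band"]))
```

Because the consumer only ever supplies dimension names that are checked against the allow-list, the rendered query can never select raw columns such as an email address.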

Check out this graphic to see the flow implemented using Snowflake's existing features, such as data sharing, stored procedures, Python User-Defined Functions (UDF), streams, tasks and row access policies. Configuration and usage are accomplished by executing the source code. This is a complex mechanism, but the critical point is that instead of sharing data directly, the consumer only has access to aggregate results through templated queries.

The aggregation policy for native data clean rooms

Native data clean rooms are built on two entirely new access control policies: an Aggregation Policy and a Projection Policy. Each can be used individually or in combination. Implementing these new data clean rooms doesn't require templates or complex workflows. The provider only needs to apply the policies to tables to achieve functionality similar to that of the Framework-based Data Clean Rooms.

When the Aggregation Policy is applied to a table, queries against that table return results only if they aggregate the data. In addition, a minimum number of records per aggregation group can be defined. If an aggregation group contains fewer than the defined number of records, its results won't be available to the consumer. This prevents individuals from being identified when the number of records in a group is too small.

Download this image for a code suggestion to create a policy and set it in a table.

In this way, it’s easy to implement the most important aspect of data clean rooms, which is to return only aggregate results. It can be done with exactly two commands. By adding a CASE clause, it’s also possible to describe conditions, such as what policies are applied based on roles or accounts.
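The effect of the minimum group size can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not Snowflake's implementation: the function name, the sample rows and the threshold of 5 are assumptions chosen for illustration.

```python
from collections import Counter


def aggregate_with_min_group_size(rows, group_key, min_group_size):
    """Count rows per group and withhold groups below the threshold."""
    counts = Counter(row[group_key] for row in rows)
    # Groups with too few records are dropped so that a small group
    # cannot be used to single out an individual.
    return {group: n for group, n in counts.items() if n >= min_group_size}


# Hypothetical provider data: 6 rows for East, 5 for West, 2 for North.
sample_rows = (
    [{"region": "East"} for _ in range(6)]
    + [{"region": "West"} for _ in range(5)]
    + [{"region": "North"} for _ in range(2)]
)

if __name__ == "__main__":
    # North has only 2 records, so it is withheld from the result.
    print(aggregate_with_min_group_size(sample_rows, "region", 5))
```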

The projection policy for native data clean rooms

Another important aspect of the security of data clean rooms is the capability to hide data that you don't want to be visible. However, if the columns themselves are removed, consumers can't filter the data that they need. Therefore, it’s necessary to make the columns available as keys to explore the necessary information but not to show them as a result. The feature for this purpose is known as the Projection Policy.

Download this image for a code suggestion to create a policy set on a column.

This means that the consumer can't see the personal information that the provider has, such as customers’ email addresses. However, consumers can use their list of email addresses to analyze the provider's data set. Similar to the Aggregation Policy, the Projection Policy is easy to implement. In addition, you can also set the condition in which the policy is applied, depending on the role or account.
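The distinction between using a column as a key and showing it in the results can be sketched as follows. This is a toy model under stated assumptions, not Snowflake's API: run_query, ProjectionBlocked, the protected column set and the sample data are all hypothetical.

```python
class ProjectionBlocked(Exception):
    """Raised when a query tries to output a protected column."""


# Hypothetical set of columns covered by a projection policy.
PROTECTED_COLUMNS = {"email"}


def run_query(rows, select, where=None):
    """Return the selected columns for rows that match the filter.

    Protected columns may be read inside `where`, but never appear in `select`.
    """
    blocked = PROTECTED_COLUMNS.intersection(select)
    if blocked:
        raise ProjectionBlocked(f"cannot project: {sorted(blocked)}")
    matches = rows if where is None else [r for r in rows if where(r)]
    return [{col: r[col] for col in select} for r in matches]


# Hypothetical provider table and consumer-side list of known emails.
provider_rows = [
    {"email": "a@example.com", "segment": "gold"},
    {"email": "b@example.com", "segment": "silver"},
    {"email": "c@example.com", "segment": "gold"},
]
consumer_emails = {"a@example.com", "b@example.com"}

if __name__ == "__main__":
    # Allowed: email is used only as a filter key, not in the output.
    print(run_query(provider_rows, ["segment"],
                    where=lambda r: r["email"] in consumer_emails))
```

Selecting the email column itself, for example run_query(provider_rows, ["email"]), raises ProjectionBlocked in this sketch.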

The integration of multiple policies

What would happen if you combined the Aggregation Policy, the Projection Policy and the existing Row Access Policy?

The Aggregation Policy forces queries to aggregate. The Projection Policy prevents the keys used to retrieve data from being visible in the results. The Row Access Policy restricts aggregation to the data that is allowed to be shared with the consumer. Together, they address most of the common requirements for data clean rooms: hide raw data, show only aggregate results and expose only the data allowed for each consumer. Each policy has different characteristics, but in most situations the combination is enough to provide highly effective data security.
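As a rough illustration of how the three rules compose, the sketch below applies them in sequence to a single query. Everything in it is hypothetical (the clean_room_query function, the shared_with field, the consumer names and the threshold of 2); it models the combined effect described above rather than how Snowflake evaluates policies.

```python
from collections import Counter


def clean_room_query(rows, consumer, match_values, group_by,
                     key_column="email", min_group_size=2):
    """Apply row access, projection and aggregation rules to one query."""
    # Projection rule: the key column may filter rows but not label the output.
    if group_by == key_column:
        raise PermissionError(f"{key_column} cannot appear in results")
    # Row access rule: keep only rows shared with this consumer.
    visible = [r for r in rows if consumer in r["shared_with"]]
    # The protected key is used to match the consumer's own list.
    matched = [r for r in visible if r[key_column] in match_values]
    # Aggregation rule: return counts only, and only for large enough groups.
    counts = Counter(r[group_by] for r in matched)
    return {g: n for g, n in counts.items() if n >= min_group_size}


# Hypothetical provider table with per-row sharing permissions.
shared_rows = [
    {"email": "a@example.com", "segment": "gold", "shared_with": {"acme"}},
    {"email": "b@example.com", "segment": "gold", "shared_with": {"acme"}},
    {"email": "c@example.com", "segment": "silver", "shared_with": {"acme"}},
    {"email": "d@example.com", "segment": "gold", "shared_with": {"globex"}},
]
all_emails = {r["email"] for r in shared_rows}

if __name__ == "__main__":
    print(clean_room_query(shared_rows, "acme", all_emails, "segment"))
```

In this example the consumer "acme" never sees the row reserved for "globex", never sees an email address, and never sees the single-record "silver" group.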

These policies will be standard features of Snowflake for every user after General Availability. Most importantly, they work with Snowflake data sharing. Therefore, both providers and consumers can give and receive only the insights they need, without giving their data to the other party.

Native data clean rooms app

The Native Data Clean Rooms App is a Snowflake Native Application, available on Snowflake Marketplace, that allows these new access policies to be easily configured via a web interface. Snowflake Native Applications on Snowflake Marketplace are usually provided by third-party vendors, but this application is an exception: it's built and provided directly by Snowflake.

The functionality is limited and requires the user to create an Aggregation Policy, a Projection Policy and a Row Access Policy via the command line in advance. The application then allows users to attach those policies to each table and to offer them as an application through the Marketplace or a private listing.

One of the advantages of this application is the ability to set policies via a web interface, but that may not be its biggest benefit. The bigger benefit is the ability to use the same concepts of testing, staging, versioning and usage monitoring as other native applications. Of course, users can also create their own more advanced applications focused on their own data; Snowflake's offering is intended only as a simple provider application. It’s important to keep in mind that this is a private preview, and these features may change significantly before General Availability.

The future of Snowflake’s data clean rooms

In native data clean rooms today, the Aggregation Policy can be used for setting a minimum number of records in a group, which prevents the analysis results from being used to identify individuals. While this is an effective approach, it's somewhat primitive.

Therefore, Snowflake is working to integrate a technique called Differential Privacy into its data clean rooms. Differential Privacy is a mathematical technique used to prevent the inference of personal information from analytical results. It masks individual characteristics in the analysis results by adding carefully calibrated random noise, so that the results remain statistically useful while the presence or absence of any one person has little effect on them. This technique is already being used with Google Maps, as well as device analytics for Apple products.
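One common way to achieve this is the Laplace mechanism, sketched below for a simple count. This is a textbook example chosen for illustration; nothing here indicates which mechanism Snowflake will use, and the function names, the epsilon value and the true count of 100 are all assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centred on zero.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Return a noisy count under the Laplace mechanism.

    A count has sensitivity 1 (adding or removing one person changes it
    by at most 1), so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    # Smaller epsilon means more noise and stronger privacy.
    for epsilon in (0.1, 1.0, 10.0):
        print(epsilon, round(private_count(100, epsilon, rng), 2))
```

Each individual answer is perturbed, yet the average over many queries stays close to the true value, which is why aggregate analysis remains useful.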

Snowflake announced the development of this feature in November 2023. While the timeline has not yet been determined, it'll eventually be incorporated into the native data clean rooms.

Data clean rooms are available today at various stages of maturity, with further advancements planned: framework-based data clean rooms that are already available, native data clean rooms that are currently in preview, and Differential Privacy techniques that'll be implemented in the years to come. In any case, Snowflake's native data clean rooms provide the means necessary to extract value from data, even in situations where the data itself isn't visible.

This is an important way to extract value from data in today’s world where privacy is so important. Using data clean rooms will lower the barriers for data sharing. Companies that have had difficulty sharing and monetizing their data in the past will be able to further expand their data ecosystem across all industries.

NTT DATA is not only an elite service partner of Snowflake; we also have a great track record as a data provider and application provider in the Snowflake Marketplace. We have a strong understanding of Snowflake data clean rooms, and we’d love to share that knowledge with you. If you are interested in a data clean room, please contact us.

Ryo Shibuya

Ryo Shibuya is a highly accomplished Cloud Data Architect with a wealth of experience in the field. With a career spanning over a decade, Ryo holds multiple certifications in Snowflake and AWS, and he has been instrumental in implementing data analytics platforms and cloud data warehousing solutions for various organizations. Notably, Ryo played a key role in the early adoption of Snowflake technology in the Japanese market, spearheading transformative change. Ryo's expertise extends to providing technical consulting for numerous Snowflake implementation projects and developing educational programs on data analytics platforms. For his supreme Snowflake expertise, Snowflake awarded Ryo the title of Snowflake Data Superhero, and he is considered a true thought leader in the industry.

 
