Snowflake Summit 2020
By any standards, 2020 has been quite a year for Snowflake. Just a few months ago they completed the largest software IPO in history, and last week, on 18th November, they held their annual summit, showcasing some of the areas in which the money raised will be used.
The engineers from the DTSQUARED Snowflake team were all in attendance at the summit. As we are already intimately familiar with the platform, the team was drawn to the presentations in which Snowflake discussed and demonstrated the new and extended capabilities in their product pipeline.
This blog presents an overview of the features that were showcased, with, in some cases, a few thoughts on how these features might be applied based upon our clients' usage of the platform.
The Data Cloud
Today the world’s data is fragmented and siloed, living in millions of different places, which makes it extremely difficult to analyse. The philosophy underpinning the data cloud is to deliver a single platform able to hold the world’s data in one place in order to facilitate analysis. That’s worth repeating: the world’s data, in one single platform.
The Snowflake Data Cloud currently comprises 22 regions (a Snowflake region maps directly to a cloud vendor region), and all regions inter-connect. This connectivity is used to share metadata about the data held in each region and to replicate data between regions as required (for sharing and/or contingency). Not only can Snowflake customers store, query and internally share their data; the data cloud also allows them to share it with any other Snowflake customer and to connect it to the world’s data held in Snowflake with zero latency and zero friction.
Snowflake accounts are created per region and can be grouped together under a single organisation-level account (all managed in SQL). To make it easy for an organisation to manage its global presence on the data cloud, work is currently underway to support centralised management of global users and roles, and to enable centralised monitoring of usage across an entire organisation’s data. More regions are in the pipeline too.
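As an illustration, organisation-level administration is exposed through SQL. The sketch below assumes the ORGADMIN role; the account, admin and region names are placeholders of our own, and the syntax was still in preview at the time of the summit.

```sql
-- Assumes the ORGADMIN role; names and region are hypothetical
USE ROLE ORGADMIN;

-- List the accounts already in the organisation
SHOW ORGANIZATION ACCOUNTS;

-- Create a new account in another region, joining it to the organisation
CREATE ACCOUNT emea_analytics
  ADMIN_NAME     = admin_user
  ADMIN_PASSWORD = '<placeholder>'
  EMAIL          = 'admin@example.com'
  EDITION        = ENTERPRISE
  REGION         = AWS_EU_WEST_1;
```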
The Snowflake Data Cloud has been a priority for the company for some time and has now moved from a vision to a reality.
The Data Marketplace
The Snowflake Data Marketplace, introduced in 2019, enables data to be shared both cross-region and cross-cloud without the producer ever surrendering custody of the data. More than 100 companies now provide data through the marketplace, and the number of providers is growing at a distinctly accelerating rate.
The Data Marketplace has now been extended to facilitate data services in addition to data sharing. A demonstration, using a fictional insurance company, showcased the data service made available by the Quantifind team. By sharing their entity/party data, the insurer was able to leverage the Quantifind fraud risk classification service to flag entities who, for example, should be subject to heightened scrutiny when their insurance claims are validated.
The end-to-end process of finding and leveraging such external services can be completed in mere minutes, a capability that could otherwise take many days and potentially require bespoke code to be written.
Snowpark
Snowpark is a family of libraries that allow code written in Java, Scala or Python to be executed natively within Snowflake. Snowpark delivers the power of a declarative SQL statement directly into the programming languages data engineers are most familiar with.
Again, the fictional insurance company made for a great demonstration of this new feature. A small piece of Scala code was shown that read data from a table holding the transcripts of customer phone calls and passed it to a pre-trained analytics model to assess the ‘sentiment’ of each call, with the result written back into Snowflake. In this way, ‘difficult’ conversations could easily be identified and then reviewed for training purposes.
The Snowflake team described Snowpark as being ‘transformative for data programmability’ and we have to agree with this sentiment; the use cases opened up by this capability are near-infinite. A technical architecture that leverages Snowpark in place of technologies such as Spark would be both simpler and likely more cost-effective for clients.
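To give a feel for the shape of such a pipeline, here is a minimal Scala sketch in the style of the demonstration. The table and column names, the connection profile and the scoreSentiment helper (standing in for the pre-trained model) are all our own assumptions, not the code shown at the summit.

```scala
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

object CallSentiment {
  // Stand-in for the pre-trained sentiment model used in the demo
  def scoreSentiment(transcript: String): Double = ???

  def main(args: Array[String]): Unit = {
    // Connection details come from a local properties file (assumed)
    val session = Session.builder.configFile("snowflake.properties").create

    // Register the model call as a UDF so it executes inside Snowflake
    val sentiment = udf((t: String) => scoreSentiment(t))

    // Read the transcripts, score each call, write the results back
    session.table("CALL_TRANSCRIPTS")
      .withColumn("SENTIMENT", sentiment(col("TRANSCRIPT")))
      .write.mode(SaveMode.Overwrite).saveAsTable("CALL_SENTIMENT")
  }
}
```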
Performance
The Search Optimization Service (SOS), introduced earlier this year and designed to allow Snowflake to handle queries that are more OLTP in nature than OLAP, is being extended to include pattern matching within strings. We have a current use case in which millions of digitised contracts are uploaded to Snowflake and then scanned for the inclusion of standard contract clauses; once this extension to SOS is available, that query process could be orders of magnitude faster.
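Enabling the service is a single statement today; the pattern-matching support described above was still in the pipeline at the time of writing. The table and column names below are illustrative.

```sql
-- Enable the Search Optimization Service on a (hypothetical) contracts table
ALTER TABLE contracts ADD SEARCH OPTIMIZATION;

-- Point-lookup queries benefit today; with the announced extension,
-- substring searches like this one should too
SELECT contract_id
FROM contracts
WHERE contract_text LIKE '%limitation of liability%';
```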
Sticking with performance, a new Query Acceleration Service was also announced. This service will automatically identify queries that would benefit from increased parallelisation and scale them out, dramatically improving performance. Queries running against large data sets should see the biggest benefit; Snowflake quoted a fifteenfold speed-up for some queries in their own testing.
The term ‘time to insight’ was used at the summit, defined as the time it takes from the production of data to that data being available to a business consumer. At present the Snowpipe ingestion latency often defines the critical path in this process and can add a minute to the time to insight. Snowflake may never become a real-time data store, but with this latency it is difficult to consider it a near-real-time store either. Development work in flight is targeting an ingestion latency of just a second or two, with an end-to-end time-to-insight improvement of the order of 10x expected.
A headline-grabbing figure presented was that, over the last 12 months, the compilation time of queries that previously took more than one second to compile has improved by more than 50%.
All performance improvements benefit Snowflake customers not just in terms of the SLA they can offer their business users but also in terms of cost. As Snowflake charges for compute by time, to the nearest second, every increase in performance should translate into a lower overall cost to customers.
Even external functions, announced earlier this year as the ability for Snowflake to call an AWS Lambda function, received a make-over with the announcement that the capability is being extended to the Azure and GCP equivalents.
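For reference, the AWS flavour that is already available looks like the sketch below; the integration name, role ARN, endpoint URLs and function signature are placeholders of our own.

```sql
-- API integration pointing at an AWS API Gateway endpoint (placeholder values)
CREATE API INTEGRATION fraud_api_int
  API_PROVIDER         = aws_api_gateway
  API_AWS_ROLE_ARN     = 'arn:aws:iam::123456789012:role/snowflake-ext-fn'
  API_ALLOWED_PREFIXES = ('https://abc123.execute-api.eu-west-1.amazonaws.com/prod')
  ENABLED              = TRUE;

-- External function backed by a Lambda behind that gateway
CREATE EXTERNAL FUNCTION score_entity(entity_name STRING)
  RETURNS VARIANT
  API_INTEGRATION = fraud_api_int
  AS 'https://abc123.execute-api.eu-west-1.amazonaws.com/prod/score';

-- Callable like any other function
SELECT entity_name, score_entity(entity_name) FROM entities;
```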
Security and Governance
Data governance is a maturing discipline and still means many things to many people. To frame the improvements they are making in this area, Snowflake presented a handy definition of governance: the ability to know your data, to manage your data and to collaborate with confidence.
Dynamic data masking was announced earlier this year, allowing data in one or more columns to be masked as a function of the role of the user viewing the data. Now a complementary data tagging capability has been announced, allowing a table or a column to be tagged. In time it will be possible to apply policies, such as dynamic data masking, automatically to columns tagged as sensitive, for example.
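A sketch of what this looks like in SQL; the policy logic, role and object names are our own illustrations, and the tag syntax shown reflects the preview at the time.

```sql
-- Mask email addresses for everyone except a privileged role (names are illustrative)
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
    ELSE '*** MASKED ***'
  END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY mask_email;

-- Tag the column as sensitive; in time, policies could attach to the tag itself
CREATE TAG sensitivity;
ALTER TABLE customers MODIFY COLUMN email SET TAG sensitivity = 'PII';
```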
Snowflake acquired the CryptoNumerics platform in July this year, and in time this is expected to push the governance capabilities of the platform further by automatically discovering sensitive data. Such data would then be tagged automatically, with the relevant masking policy applied automatically too.
Snowflake is now adding row access policies, enabling the rows a user sees in a table to be filtered as a function of the role and/or region of the user. For example, the US sales team would see only its relevant data and the UK team only its own, while the head of sales, or perhaps the finance team, would be able to see all data.
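This is the sort of policy definition involved; the entitlements table, role names and columns below are hypothetical, and the feature was still in preview at the time.

```sql
-- Hypothetical mapping of roles to the sales regions they may see
CREATE TABLE region_entitlements (role_name STRING, region STRING);

CREATE ROW ACCESS POLICY sales_region_policy AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() IN ('HEAD_OF_SALES', 'FINANCE')   -- these roles see everything
  OR EXISTS (
    SELECT 1 FROM region_entitlements e
    WHERE e.role_name = CURRENT_ROLE() AND e.region = region
  );

ALTER TABLE sales ADD ROW ACCESS POLICY sales_region_policy ON (region);
```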
All governance features in Snowflake are developed such that they are also automatically applied when data is shared, replicated (cross region and/or cross cloud) or copied.
One Last Thing
To date Snowflake has delivered a best-of-breed cloud data platform with support for structured and semi-structured data. To complete the suite of data types, Snowflake announced support for unstructured data, such as image files and videos. Data of this type will apparently also be storable in Snowflake and queryable using standard SQL. This capability doesn’t just expand the Snowflake offering; it also begins to move the offering ‘upstream’ in the data management capability of an organisation.
With support for structured, semi-structured and unstructured data in a platform with exemplary security and governance capabilities, unrivalled performance and the ability to support the world’s data at a price point that is both transparent and highly competitive, Snowflake is really changing the art of the possible.
Their vision of ‘enabling every company to leverage the world’s data through seamless and governed access’, which only a year or two ago would have felt like a ridiculous notion, now feels like a simple question of when, not if.
There is a lot to get excited about in the Snowflake pipeline; however, it can be difficult to know the status of each feature. We’ve pulled together the status of the features presented at the summit in the table below and will continue to share updates as and when we hear of them. What we’d really like to see is Snowflake providing a clear road-map showing when these features are likely to move through the release workflow and into general availability (accepting that estimates can and will change). Clarity on the basis upon which new features like Snowpark will be charged would also be helpful at this stage.
We can’t wait to get going with these new features and to continue to work closely with Snowflake to see all the ways in which we can use these features to deliver value to the current and future DTSQUARED client base.
Snowflake’s ambition is clear. They moved away from their ‘warehouse’ branding some time ago, but their rebranding to ‘platform’ doesn’t just mean they are a lake and a warehouse that can support AI / ML / BI functions. Instead, their ‘platform’ has everything in its sights: OLTP databases as a means of feeding data into Snowflake could become a thing of the past, with source systems in time writing and reading data directly in Snowflake. That source data could then be securely combined with not just the entire data of an organisation but the entire data of the world. Their IPO price may have raised a few eyebrows, but imagine the company’s worth if and when they deliver on this ambition.
If you are interested in either joining our rapidly growing Snowflake team or leveraging the team in the delivery of a Snowflake project, then please get in touch.
Steve Jenkings – “The Data Cloud really sets Snowflake apart from all other CDWs, and as more providers and services are added to the Data Marketplace, our clients will see significant and near-instant value. As a software engineer, the thought of the complexity of the code and release process that supports both these capabilities makes my head spin!”
Mike Reay – “For me the most exciting thing to come out of the conference was Snowpark. The ability to develop business logic in familiar programming languages and execute everything within Snowflake means that very complex solutions can be built entirely within Snowflake without ever having to take any data out.”
Charlie Birch – “I love the row access policy capability, having the data in a single table and controlling access in this way will give real value to business users whilst minimising complexity within the technical team”.
Suni Minhas – “Another vote for Snowpark; as a Python developer I’m particularly excited to see this addition to Snowflake. Could it be possible to combine Snowpark and unstructured data to give a much deeper understanding of the world we live in? For example, combining machine learning with audio call data in order to improve our emergency services provision.”