Reimagine Cloud-Based Data Processing and Optimize Costs with AWS Glue and Snowflake
Snowflake has made a significant impact on the enterprise data landscape with its groundbreaking data warehouse solution. As organizations continue to transition from traditional on-premises solutions to modern, cloud-based platforms, AWS Glue has emerged as an innovative, highly efficient, serverless data integration service. Combined, AWS Glue and Snowflake offer a powerful toolset for data processing. Here, we delve into how decoupling data processing from the data warehouse by pairing AWS Glue with Snowflake can increase performance by as much as 120% and reduce costs by as much as 89%.
AWS Glue and Snowflake: Increase Performance, Decrease Costs
During the past few years of a booming economy and low interest rates, enterprises were less concerned about data infrastructure costs. Many had adopted a fully native Snowflake architecture, including running their data processing workloads with Snowflake tools such as Snowpipe, tasks, and the newly released Snowpark. As these enterprises scaled on Snowflake and the economy entered a downturn at the end of 2022, they began looking to optimize the cost of their Snowflake environments and realized that at least 40% of their bill was driven by data processing (ETL/ELT) costs¹. Below is the high-level architecture these enterprises were utilizing:
Those same enterprises running Snowflake on AWS decided they needed to rearchitect their data stack to lower costs and increase performance. In comes AWS Glue, with its ability to harness the distributed processing power of Apache Spark and the flexibility of multi-language support. AWS Glue’s suite of tools integrates seamlessly with Snowflake, enabling a decoupling of data processing from the data warehouse that provides many benefits. Beyond increased performance and decreased costs, further adjacent benefits emerged. Once the data is processed by AWS Glue jobs, the results can be stored in a separate Amazon Simple Storage Service (Amazon S3) bucket before being written to Snowflake. Amazon Athena can then be used to query the data and verify the expected results before the write to Snowflake. Amazon SageMaker can also be leveraged to run machine learning models against the data in Amazon S3, regardless of the data structure. Below is the high-level re-architecture of the Snowflake native architecture:
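The decoupled flow described above — Glue transforms land in a staging S3 bucket, Athena validates the staged data, and only then does the load into Snowflake happen — can be sketched as a simple orchestration. This is an illustrative sketch only: the function names, bucket path, and table name below (`stage_to_s3`, `validate_with_athena`, `copy_into_snowflake`, `example-staging-bucket`, `analytics.events`) are hypothetical placeholders, not real Glue, Athena, or Snowflake APIs. A production Glue job would write Parquet via PySpark and load via the Spark–Snowflake connector or a `COPY INTO` statement.

```python
# Illustrative orchestration of the decoupled pipeline:
# Glue job output -> S3 staging -> Athena validation -> Snowflake load.
# Every function here is a hypothetical stand-in for the real AWS/Snowflake call.

STAGING = {}  # simulates the S3 staging bucket (path -> staged rows)

def stage_to_s3(rows, path="s3://example-staging-bucket/processed/batch-001/"):
    """Stand-in for a Glue job writing transformed Parquet to a staging bucket."""
    STAGING[path] = rows
    return path

def validate_with_athena(path, expected_min_rows):
    """Stand-in for an Athena check (e.g. SELECT COUNT(*)) over the staged data."""
    return len(STAGING.get(path, [])) >= expected_min_rows

def copy_into_snowflake(path):
    """Stand-in for loading the validated data, e.g. COPY INTO ... FROM @stage."""
    return f"COPY INTO analytics.events FROM '{path}'"

def run_pipeline(rows, expected_min_rows=1):
    """Stage, validate, then load -- Snowflake only sees data that passed the check."""
    path = stage_to_s3(rows)
    if not validate_with_athena(path, expected_min_rows):
        raise ValueError(f"Athena validation failed for {path}")
    return copy_into_snowflake(path)
```

The key design point the sketch captures is ordering: because the warehouse is decoupled from processing, a bad batch is caught at the Athena step and never consumes Snowflake compute.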
Effectual saw this trend in the market and began advising their Snowflake customers to run proofs of concept (POCs) replacing Snowflake native data processing with AWS Glue. Effectual partnered with many enterprises and proved the hypothesis to be true. In one specific case, Effectual’s customer had many siloed data sources consisting of custom web scrapers, data warehouses, and databases across on-premises and cloud environments. This customer was utilizing Snowpipe SQL and Snowflake Tasks to load the data from these disparate sources into Snowflake tables. Their costs were rising rapidly as data volumes grew, and the performance of the data processing was suffering, which impacted their ability to deliver products to their customers faster. Effectual migrated their Snowflake data processing workloads to AWS Glue, resulting in a 30% cost reduction and a 35% performance increase.
The Results: Together, AWS Glue and Snowflake Revolutionize Cloud-Based Data Processing
We set out to test this architecture for high-volume data loads. The first test was to process and transform 40 TB of Parquet files stored in Amazon S3 and then store the results in Snowflake. We found that for a data set of this size, the Snowflake warehouse needed to be scaled up to XL, as the smaller sizes hit resource limits and could not finish the data load. We tested equivalent AWS Glue DPU capacity sizes, with the following results:
- AWS Glue was up to 78% less expensive and a minimum of 23% more performant when compared to a Snowflake XL Warehouse
- The smallest AWS Glue compute size tested (8 DPUs) could process the 40 TB and write to Snowflake successfully, while Snowflake XS–L warehouses could not finish the data load within 4 hours
- AWS Glue is better suited for ETL transformations on large amounts of data
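Back-of-the-envelope comparisons like the one above come from two simple cost models: Glue bills DPUs × runtime × a per-DPU-hour rate, while a Snowflake warehouse bills credits per hour × runtime × the price per credit. The defaults below are assumptions to check against your own pricing: $0.44/DPU-hour is the published AWS Glue Spark rate in many regions at the time of writing, an XL warehouse consumes 16 credits/hour, and the $3/credit figure is purely hypothetical since credit prices vary by contract and edition.

```python
def glue_job_cost(dpus: int, hours: float, rate_per_dpu_hour: float = 0.44) -> float:
    """Cost of an AWS Glue Spark job: DPUs x runtime x per-DPU-hour rate.

    The default $0.44/DPU-hour is an assumption -- confirm against the
    current AWS Glue pricing page for your region.
    """
    return dpus * hours * rate_per_dpu_hour

def snowflake_warehouse_cost(credits_per_hour: float, hours: float,
                             price_per_credit: float) -> float:
    """Cost of a running Snowflake warehouse: credits/hour x runtime x credit price.

    An XL warehouse consumes 16 credits/hour; the credit price varies
    by contract, so it is a required parameter rather than a default.
    """
    return credits_per_hour * hours * price_per_credit

# Smallest Glue capacity from the test above (8 DPUs) running for 4 hours:
print(round(glue_job_cost(8, 4.0), 2))                    # 14.08
# An XL warehouse (16 credits/h) running 2 hours at a hypothetical $3/credit:
print(round(snowflake_warehouse_cost(16, 2.0, 3.0), 2))   # 96.0
```

Plugging in your actual runtimes and negotiated credit price is the fastest way to sanity-check whether a migration like the one tested here would pay off for your workloads.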
The second test was to process and transform the same 40 TB of Parquet files in Amazon S3, this time utilizing Snowpark instead. The results were the following:
- AWS Glue was at least 89% less expensive and 120% more performant when compared to a Snowpark XL Warehouse
- Both the XL and 2XL Snowpark-optimized warehouses had to be cancelled after running for more than 4 hours
With AWS Glue, developers and data analysts can focus on data analysis rather than data plumbing. AWS Glue handles the undifferentiated heavy lifting of data processing, freeing the data teams to concentrate on delivering business value.
Optimize Your Snowflake Environments with Effectual’s Snowflake Optimization Accelerator
Effectual is an AWS Premier Consulting Partner with deep expertise in AWS and Snowflake technologies. We’ve developed an accelerator offering to optimize enterprise Snowflake environments. By leveraging the right tool for the right job, Snowflake and AWS are truly better together, enabling data architectures optimized for cost, performance, and analytics.
1. Sourced from internal data.