July 29, 2019 | Point of View

Evaluating modern data warehousing platforms with a performance per-dollar approach

Introduction

A modern data warehouse platform acts as the centralized reporting foundation for any data-driven organization. If you are not building a data warehouse, chances are you are considering one, or you are maintaining an existing warehouse in order to centralize and tap into actionable data within your organization.

This whitepaper guides decision-makers through the important factors at play when assessing data warehouse platforms and provides tangible metrics to support those decisions. The performance-per-dollar approach is an effective way to evaluate platforms because it focuses on two metrics that matter to every customer: performance and cost of ownership. These metrics tend to matter most because they quantify the operational cost of the platform at the required performance and scale. To make the comparison easier, we combined them into a single performance-per-dollar metric, described below.

As a group of data and analytics practitioners, we wanted to see how vendor claims held up in practice. This is not a rigorous scientific study, but an exercise our data and analytics team undertook to make well-informed recommendations when helping organizations choose the data warehouse platform that best fits their needs.

Evaluation: Five leading data warehouse platforms

For this comparison, we started by selecting four leading data warehousing platforms that our data engineering practice at West Monroe uses to implement data solutions. This was a narrowly focused analysis based on cost of performance; we do not claim to have tested features comprehensively. Our data engineering solutions include dashboarding, data warehousing, data source integration, and advanced analytics such as machine learning and AI. These solutions require a foundational platform that offers a wide array of features alongside powerful performance.

To round out the list to five platforms, we selected the new, GPU-based Brytlyt platform. GPU computing is a fast-evolving field, and we wanted to gain experience with it to see how it compares with more traditional columnar databases. 

For consistency, we chose to standardize on AWS cloud infrastructure where possible, minimizing any differences in infrastructure architectures. 

Measurement 

Having all five contenders hosted in the cloud enables easy conversion of performance to an hourly run cost, which we use to normalize observed performance results.
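The exact formula is not spelled out here, but one minimal way to express the normalization is to divide query throughput by the hourly run cost of the deployment. The Python sketch below illustrates this; the function name and the sample numbers are ours for illustration and are not measurements from this study.

    from statistics import mean

    def performance_per_dollar(query_times_sec, hourly_cost_usd):
        # query_times_sec: wall-clock times of the benchmark queries, in seconds
        # hourly_cost_usd: published on-demand price of the deployment, per hour
        # Returns queries-per-hour divided by dollars-per-hour, i.e. how much
        # work the platform delivers for each dollar spent while it is running.
        queries_per_hour = 3600.0 / mean(query_times_sec)
        return queries_per_hour / hourly_cost_usd

    # Hypothetical timings and prices, for illustration only.
    small = performance_per_dollar([2.1, 3.4, 1.8], hourly_cost_usd=4.00)
    large = performance_per_dollar([0.9, 1.6, 0.7], hourly_cost_usd=16.00)
    print(f"small: {small:.1f} queries/hr per $, large: {large:.1f} queries/hr per $")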

For each technology, we defined four deployment sizes, from small to extra-large, in order to align contenders as closely as possible on cost per hour, compute, and storage capacity. The table in Appendix A summarizes the sizing.

Small dataset – 9 million records 

When running queries on the small dataset, we found Brytlyt to be dominant: its performance metrics were consistently two to three times higher than those of the other technologies tested. See details in Figure B below.

Medium dataset – 108 million records 

Running queries on the medium-sized dataset pushed the GPU-based Brytlyt to its limits. At the largest server size, Brytlyt still delivered substantially higher performance than the other technologies measured; however, memory constraints prevented it from operating on this dataset at the lower cost tiers. This can be seen in Figure C below.

Medium dataset – excluding Brytlyt 

Excluding Brytlyt, as seen in Figure D below, allows us to better see the performance of the other technologies. SQL Server shows diminishing returns on spend here, while the other technologies continue to show near-linear growth.

Large dataset – 1.1 billion records 

For the large dataset, we have no Brytlyt metrics because the data was too large to fit into a memory table and run queries at any of our tested server sizes. Of the remaining technologies, Snowflake is clearly dominant here, achieving five to six times the performance of the others and continuing to scale linearly with spend. Redshift and Azure DW start with lower performance at low spend and scale less per dollar as spend increases. SQL Server shows the lowest starting performance overall and flat-lines completely beyond $8/hour (16 cores) as spending increases. This can be seen in Figure E.


Conclusion

1| Snowflake came out on top in almost all areas. 

Using our performance-per-dollar measure, Snowflake delivers four times the cost efficiency of Redshift and Azure SQL DW.

Additional observations:

  • If your usage patterns are intermittent and spiky, as is typical for many data warehousing implementations, Snowflake can drive even greater efficiency with automatic start/stop and instant, seamless scaling
  • Snowflake was fastest to load data from AWS S3 

It was, in general, the easiest platform to work with: no external tools are required, and loading, querying, and monitoring can all be done via a web-based UI. A minimal sketch of these capabilities follows below.
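To make the first two observations concrete, the hedged Python sketch below uses the snowflake-connector-python client to turn on auto-suspend/auto-resume for a warehouse and to bulk-load a table straight from S3. The account, warehouse, table, and bucket names are placeholders of ours, not the configuration used in this evaluation.

    import snowflake.connector  # pip install snowflake-connector-python

    # Placeholder credentials and object names -- substitute your own.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="BENCH_WH", database="BENCH_DB", schema="PUBLIC",
    )
    cur = conn.cursor()

    # Suspend the warehouse after 60 idle seconds and resume it on the next
    # query, so an intermittent, spiky workload only pays while queries run.
    cur.execute("ALTER WAREHOUSE BENCH_WH SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE")

    # Bulk-load a flattened dataset directly from S3 -- no external ETL tool.
    cur.execute("""
        COPY INTO trips_flat
        FROM 's3://my-benchmark-bucket/trips/'
        CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
    conn.close()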

2| Brytlyt outperformed all systems on complex query executions. 

Although we were unable to test the largest dataset due to memory constraints, our results show clear benefits versus the competition on a price-for-performance basis if you primarily run extremely complex or compute-intensive queries.

Additional observations: 

  • Based on our experience, the platform is not yet mature enough to act as an enterprise data warehouse 
  • It’s more of a high-performance caching/calculation engine than a traditional reporting database 
  • GPU-enabled tables lack persistence and need to be loaded into memory before each use 
  • GPU memory management lacks stability, resulting in frequent crashes 
  • GPU tables don’t support common SQL data types such as exact numeric/decimal, which limits their suitability for precise financial calculations (see the sketch after this list) 
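The sketch below, plain Python rather than anything Brytlyt-specific, illustrates why that last limitation matters: binary floating-point arithmetic drifts on currency values that an exact decimal type handles precisely.

    from decimal import Decimal

    # Summing currency as binary floats accumulates rounding error; an exact
    # decimal type (the kind the GPU tables lacked) does not.
    line_items_float = [0.10] * 3
    line_items_exact = [Decimal("0.10")] * 3

    print(sum(line_items_float))   # 0.30000000000000004 -- off by a sliver
    print(sum(line_items_exact))   # 0.30 -- exact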

3| AWS Redshift and Azure SQL DW were very close in all tests, illustrating how competitive AWS and Azure are in their top-tier offerings. 

  • Both have two compute density options and easily scale into PB range 
  • Azure offers more flexibility by scaling storage independently from compute 

4| SQL Server worked well for small datasets but did not scale well. 

This was expected, as it is unable to leverage the massively parallel processing (MPP) scale of the other platforms.

Future ideas 

As previously stated, this was not an opinion-based result, but one based on testing with parameters built around cost of performance. By focusing on a single metric, we were able to uncover interesting results. In future work, we hope to expand on these results to include larger datasets, as well as a wider array of query types, most notably JOIN performance. For this first pass, we focused on a fully flattened, denormalized dataset, as it has become our target model for use with RAP.

Appendix: Platform cost and sizing

Appendix A

The table below illustrates the sizes of each platform that we selected for our testing, aligned into size groups for an apples-to-apples comparison. Results were further adjusted by total hourly cost.
