Icicle is fast. By compiling all queries to efficient C code with unboxed data structures, we can ensure that our queries run efficiently.
The benchmarks below are true, but also unfair to Icicle’s detriment.
We thought of a set of queries which would be easy and possible to write in SQL, then wrote simple versions in Icicle, and optimised the daylights out of the SQL versions to get the results show below.
It took about half an hour to write the Icicle, and many iterations to nut out the SQL.
The thing is though, when we just gave the definitions for the queries we wanted to an engineer who hadn’t been thinking about this problem for the past few years, they took 2 weeks to write the SQL to perform them, and they ran 1000x slower than our naïve Icicle version.
We generated a small 5gb dataset of simulated supermarket basket data including:
On this data, we generated 206 simple queries using the SKUs, costs, and brands; and 21 more complex queries which required an extra group by and join in all SQL engines.
We ran these queries on Icicle on a single machine, as well as clusters of Redshift, Presto, and Spark; optimising all as best we could.
Icicle | AWS Redshift | Apache Spark | Presto | |
---|---|---|---|---|
Machines | 1x c5.9xlarge | 2x dc2.large | 2x m5.xlarge | |
Cores | 16 used (of 18) | 4 | 8 | |
Runtime: 206 | 48.9s | 8m 41s | 24m | - |
Runtime: 206 + 21 | 81.8s | Crashed | 31m | Crashed |
Icicle was simpler to write and faster. While our first iterations in SQL ran many times slower that our most optimised versions; we didn’t touch our Icicle queries after they were first written.