Accelerating your Datalake tables with a Cache of Db2 Warehouse MQTs
Over the last year, a bit of a revolution has emerged in the Business Intelligence (BI) space with the introduction of watsonx.data, IBM's newest data store built on an open lakehouse architecture, and the modernization of the Db2 Warehouse storage architecture to support Cloud Object Storage natively. With the new Db2 Warehouse storage architecture, customers can create native Db2 tables on Cloud Object Storage and achieve 34X storage savings over the previous generation for the most demanding BI and AI workloads, while at the same time achieving 4X performance improvements. One of the challenges of this new landscape is that your data can now end up spread across multiple repositories. For this reason, this year Db2 Warehouse also introduced seamless integration with watsonx.data to break down these silos, allow the free movement of data between the two, and, more importantly, make it possible to stitch the two together and query all your data from a single Db2 Warehouse engine. In this blog post, we explore some of the performance challenges of this integration and propose a solution using MQTs on Native Cloud Object Storage.
Challenges of Datalake tables
A fundamental piece of the integration of watsonx.data and Db2 Warehouse is the new Datalake table. This is a new kind of table in Db2 Warehouse that lets you define an "external" Datalake table over data in your watsonx.data data store, stored in Iceberg and other open data formats, and then query that Datalake table from Db2 Warehouse. Internally, to access these tables, Db2 Warehouse has a new component named the "ODF Scheduler" (ODF is short for Open Data Format) that splits the multiple objects that form the Datalake table in order to distribute them across the Db2 Warehouse MPP nodes. With this, each time a query is executed, the external Datalake table must be accessed by the corresponding Db2 nodes from the source location, which incurs the full communication and data processing costs. This becomes more significant with frequent access to the same data source and in concurrent workloads.
Accelerating Datalake table access using MQTs
To alleviate those challenges, Db2 Warehouse release 11.5.9 introduced full support for MQTs (Materialized Query Tables) over Datalake tables. With this support, you can create a column-organized MQT as a Native Cloud Object Storage (COS) MQT over a Datalake table and get the full performance benefit of both column-organized tables and Native COS tables. You can then scale up your workload over a mix of tables from your Db2 Warehouse and watsonx.data data store to the full scale of what your Db2 Warehouse deployment can do, all while keeping all your data stored in Cloud Object Storage.
Let's go to the punch line first to tease you with what you can accomplish with this integration. The following chart highlights the performance results that we discuss in more detail in the rest of this blog post. It compares the throughput of directly querying the Datalake tables against the throughput with MQTs created over those Datalake tables. In this example we ran a typical BI workload, BDInsights, with a mix of query complexities, and as you can see, using MQTs achieved more than a 40X throughput improvement. The chart also breaks down the throughput improvement by query complexity, as the workload was dominated by short-duration simple queries, which benefited the most from the MQTs.
Why this difference?
Quite a few factors help the MQTs perform significantly better than direct access to the source Datalake tables. At the core of it, MQTs are native Db2 tables, so they benefit from all the performance smarts that have been put into Db2 over its 30 years of satisfying the most demanding workloads. They are of course partitioned across the multiple members of the MPP architecture, and they benefit from the many performance enhancements that make column-organized tables on Native Cloud Object Storage industry leading.
Underneath this, queries over the source Datalake table are identified by the query compiler as having a matching MQT that can produce the same result, and the compiler generates a query plan over the MQT instead of accessing the source table. These MQTs are considerably smaller than the external Datalake tables they reference, as they benefit from the high level of compression achieved with column-organized tables in Db2. In addition, they are column-clustered when stored in the new Native COS storage hierarchy. This results in much faster access, even when they are not readily accessible in the multi-tier cache (the in-memory buffer pool and the caching tier on locally attached NVMe drives), and even though they are stored in the same type of cloud object storage repository as the source Datalake table. Once cached, query access is even faster and can be part of a highly concurrent workload without incurring any of the network cost.
The nuts and bolts
This section shows an example of an Iceberg Datalake table and MQTs defined on that table. First, we define a Datalake table in the Apache Iceberg data format named SixColsIce in the DataSchem schema, with six columns, stored in Cloud Object Storage.
CREATE DATALAKE TABLE dataschem.sixcolsice(
cint INT,
cvarchar STRING,
cfloat FLOAT,
cdouble DOUBLE,
cdate DATE,
cts TIMESTAMP)
STORED BY ICEBERG
LOCATION 'DB2REMOTE://datalakealias//sixcolsice';
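How the Datalake table gets its data depends on your environment: it may already be populated from watsonx.data or another engine, or you can load it from Db2 itself. As a minimal, purely illustrative sketch (assuming your release allows direct INSERT statements against Iceberg Datalake tables; the values are made up), rows could be inserted like this:
INSERT INTO dataschem.sixcolsice
VALUES (10, 'widget', 1.5, 2.5, '2024-01-15', '2024-01-15-10.30.00');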
After one has loaded the table and collected statistics on it, one can create row-organized and column-organized MQTs over it. The following is an example of a column-organized MQT created in the objstorespace1 table space, the default table space for Native COS in Db2 Warehouse on Cloud environments.
CREATE TABLE mqtice AS (
SELECT cint, cfloat, cdouble
FROM dataschem.sixcolsice
WHERE cint > 7)
DATA INITIALLY DEFERRED
REFRESH DEFERRED
MAINTAINED BY USER
DISABLE QUERY OPTIMIZATION
ORGANIZE BY COLUMN IN objstorespace1;
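For comparison, a row-organized MQT over the same Datalake table differs only in its organization clause (a sketch; the MQT name is illustrative, and the table is left in the default row-organized table space):
CREATE TABLE mqtice_row AS (
SELECT cint, cfloat, cdouble
FROM dataschem.sixcolsice
WHERE cint > 7)
DATA INITIALLY DEFERRED
REFRESH DEFERRED
MAINTAINED BY USER
DISABLE QUERY OPTIMIZATION
ORGANIZE BY ROW;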
MQTs on Datalake tables can only be defined as MAINTAINED BY USER, which means the user needs to load rows into the MQTs, activate them, and maintain them manually. An example of loading and activating the MQT is as follows:
SET INTEGRITY FOR mqtice ALL IMMEDIATE UNCHECKED;
INSERT INTO mqtice SELECT cint, cfloat, cdouble FROM dataschem.sixcolsice WHERE cint > 7;
Next, you can collect statistics on the table so that the query compiler can choose the best query plan for the MQT, using the following:
CALL sysproc.admin_cmd('runstats on table mqtice on all columns');
Finally, you can activate the MQT to be usable in query compilation as follows:
ALTER MATERIALIZED QUERY mqtice SET ENABLE QUERY OPTIMIZATION;
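Because these MQTs are maintained by user, they do not stay in sync automatically when the underlying Datalake table changes. One simple, if coarse, refresh approach (a sketch; incremental strategies are also possible) is to empty and repopulate the MQT and then refresh its statistics:
DELETE FROM mqtice;
INSERT INTO mqtice SELECT cint, cfloat, cdouble FROM dataschem.sixcolsice WHERE cint > 7;
CALL sysproc.admin_cmd('runstats on table mqtice on all columns');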
Db2 has mechanisms that need to be activated for user-maintained MQTs to be considered during query compilation and execution. One option is to set the database configuration parameters "dft_refresh_age" to ANY and "dft_mttb_types" to include USER, which makes this apply to all queries against the database (see the sketch after the next example). Alternatively, one can set special registers per Db2 connection; these must be set in the same connection that compiles and executes the queries: set the MQT refresh age to ANY and the maintained table types to include USER. For example, the following statements allow the MQT to be considered when running the query over the base table dataschem.sixcolsice:
SET CURRENT REFRESH AGE ANY;
SET CURRENT MAINTAINED TABLE TYPES FOR OPTIMIZATION ALL;
SELECT cint, cfloat, cdouble
FROM dataschem.sixcolsice
WHERE cint > 7;
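As an alternative to setting the special registers in every connection, the database-wide configuration mentioned above can be applied once, for example through ADMIN_CMD (a sketch; this updates the currently connected database, and the dft_mttb_types value should be chosen to include the MQT types you want considered, such as USER or ALL):
CALL sysproc.admin_cmd('UPDATE DB CFG USING DFT_REFRESH_AGE ANY DFT_MTTB_TYPES ALL');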
Rules of Thumb for MQT Creation
As you saw in the previous example, we created an MQT over a subset of the base Datalake table, but this is not the only approach to creating MQTs. In this section we provide guidance on choosing MQTs for Datalake tables for different workloads. Before applying these rules of thumb, you should analyze the workload and identify the Datalake tables that appear in it. The rules of thumb below are written for column-organized MQTs. One could apply them to both row-organized and column-organized MQTs, but the first rule of thumb is to use column-organized MQTs, as these achieve the highest performance benefits, can be stored directly on COS, and also benefit from the multi-tier cache and storage savings of COS (refer to the performance section below to learn more about the benefit of column-organized MQTs on Native COS).
Rules of thumb include:
- Simple Datalake table caching: Create a column-organized MQT for each Datalake table accessed in the workload. Define the MQT with all columns from the Datalake table, i.e., use a query of the form "SELECT * FROM <DATALAKE TABLE>". Including all columns allows the MQT to remain usable by queries that reference other columns of the Datalake table if the workload changes. Alternatively, the MQT query could select only the columns of the Datalake table that the workload accesses. Note that if the MQT is column organized and stored in Native COS, then including all columns is the better choice, since column organization and column clustering already optimize the data that needs to be read when the MQT is matched to a workload query, without users needing to think about it. These MQTs also allow a query plan that matches the column-organized MQT to execute operations such as joins with that MQT in the columnar data engine (i.e., in an explained plan, those operations would appear under a CTQ operator).
- Datalake table caching with aggregation: Use only single-table column-organized MQTs on the Datalake tables, but include aggregations derived from the workload. For example, have the MQT include a GROUP BY on Datalake columns and aggregate functions in the SELECT clause that will match queries in the workload, and choose these MQTs such that each is less than about 20% of the size of the original Datalake table it references.
- For example, if a workload query is of the form SELECT cint, MAX(cfloat), COUNT(cdouble) FROM dataschem.sixcolsice GROUP BY cint, then one could use that same query to define the MQT, as sketched below.
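As a concrete illustration of the aggregation rule of thumb, that workload query could be used almost verbatim to define the MQT (a sketch following the same pattern as the earlier example; the MQT name and column aliases are illustrative):
CREATE TABLE mqtice_agg AS (
SELECT cint, MAX(cfloat) AS max_cfloat, COUNT(cdouble) AS count_cdouble
FROM dataschem.sixcolsice
GROUP BY cint)
DATA INITIALLY DEFERRED
REFRESH DEFERRED
MAINTAINED BY USER
DISABLE QUERY OPTIMIZATION
ORGANIZE BY COLUMN IN objstorespace1;
As with the earlier MQT, it then needs to be brought out of set integrity pending state, populated, analyzed with RUNSTATS, and enabled for query optimization.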
Performance Results
Earlier in this blog post we mentioned the 40X improvement we observed when running the BDInsights workload. In this section we dig into the details of that result. Our test environment was a Db2 Warehouse on Cloud on AWS environment with 2 physical nodes and 4 logical Db2 nodes per physical node. Each of the two physical nodes was an AWS r5dn.8xlarge instance with 32 CPUs, 256 GB of memory, 2 x 600 GB locally attached NVMe drives, and 25 Gbps networking. The system under test was in AWS's us-east-2 (Ohio) region.
The system has a Db2 table space USERSPACE1 using network-attached block storage on Amazon EBS, and a table space OBJSTORESPACE1 for storing column-organized tables in native Db2 format using an AWS S3 bucket for cloud object storage. The Native COS table space benefits from the multi-tier cache on the locally attached NVMe drives, and the Datalake tables are stored in a separate AWS S3 bucket in the same region.
For concurrent query experiments we ran our Big Data Insights (BDInsights, or just BDI) concurrent query workload on a 1 TB TPC-DS database. The BDI workload is an IBM-defined workload that models a day in the life of a Business Intelligence application and uses the TPC-DS schema. The query workload is based on a retail database with in-store, on-line, and catalog sales of merchandise. Three types of users are represented in the workload, running three types of queries:
- Simple: Returns dashboard analysts run queries that investigate rates of return and their impact on the business bottom line.
- Intermediate: Sales report analysts generate sales reports to understand the profitability of the enterprise.
- Complex: Deep-dive analysts (data scientists) run deep-dive analytics to answer questions identified by the returns dashboard and sales report analysts.
To mimic a complex query workload, we used a mix of queries consisting of 59 simple queries, 23 intermediate queries, and 5 complex queries. To drive higher load on the test system, we ran 6 concurrent streams of queries. Five streams ran only simple queries, and the sixth stream ran the remaining intermediate and complex queries (i.e., the "heavy" queries). Each stream shuffled its query order and ran its queries repeatedly until the test ended. We conducted multiple 4-hour tests: 3 tests with MQTs enabled and 4 tests with MQTs disabled. When MQTs are disabled the queries take longer, so to gain more confidence in the query durations and resource utilization we ran an extra test (4 tests instead of 3) with MQTs disabled. For each query executed, the test driver fetched all rows in the answer set from the Db2 engine. The test driver ran external to the AWS environment.
We tested a simple set of MQTs in which each of the 24 TPC-DS tables has an associated MQT in table space OBJSTORESPACE1 with the same table schema and table data. In other words, we cached each Datalake table in an MQT, resulting in 24 MQTs. It took 2-3 hours to complete the load of the MQTs from the base tables and gather their statistics, loading the MQTs serially one after another. The appendix below includes more details on the MQTs we used.
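For illustration, under this scheme the MQT for the STORE table would look roughly like the following (a sketch only; the schema and MQT names are illustrative, and the actual definitions we used may differ slightly):
CREATE TABLE mqt_store AS (
SELECT * FROM dataschem.store)
DATA INITIALLY DEFERRED
REFRESH DEFERRED
MAINTAINED BY USER
DISABLE QUERY OPTIMIZATION
ORGANIZE BY COLUMN IN objstorespace1;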
The impact of the MQTs on query performance varies by query. The time to process some queries is dominated by the time to retrieve table data stored as Iceberg tables in COS; these queries generally see a significant performance benefit from the MQTs. Other queries spend much more time processing data, such as joining or grouping, and relatively little time accessing the Datalake source table data. The MQTs generally provided less benefit for these queries, although they would benefit more if the MQT definitions already contained these pre-computations (which was not the case in these experiments).
One way to quantify the impact of MQTs is to calculate a ratio using this formula:
Query speed up factor = “Duration without MQTs” / “Duration with MQTs”
We observed that average elapsed times for each query were the same or shorter after enabling MQTs. Thus, the “speed up factor” for each query was 1.0 or higher.
Note: When calculating speed up factors using the formula above, we didn’t want to over-exaggerate the speed up. Thus, for “Duration with MQTs” values that were under 1 second (many of the simple queries are sub-second with MQTs enabled), we rounded the values up to 1 second.
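For example (illustrative numbers only): a simple query that took 90 seconds without MQTs and 0.4 seconds with MQTs is reported as a 90x speed up (90 / 1) rather than 225x (90 / 0.4).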
MQTs significantly shortened the elapsed time to complete the queries in our environment with a high concurrency workload.
The graph below shows the range of speed up factor values for the following three buckets of BDI queries:
- Simple queries: the 59 simple queries in the workload.
- Heavy queries: we grouped together the 23 intermediate queries and the 5 complex queries, for a total of 28 queries.
- All queries: all 87 queries.
The average query sped up about 27x. Among the 59 simple queries, those queries taking over 100 seconds to complete (without MQTs) were the queries with the highest speed up factors after enabling MQTs. These queries generally access several Iceberg tables, including one or more large Iceberg tables (SALES and RETURNS tables).
Some of the simple queries were already completing quickly without MQTs, typically because they only reference one or two smaller dimension tables, such as the STORE table. Simple queries that were already completing in 10 seconds or less without MQTs had more modest speed up factors, which were, of course, at most 10, since durations with MQTs are rounded up to at least 1 second.
With MQTs enabled, all the simple queries (except one) finished under 5 seconds, with over two-thirds of the simple queries finishing within 1 second.
The range of speed up factor for the heavy queries is still impressive, up to 76x, but the average of 14.8x is lower than the 33.3x average we saw for simple queries. This is because heavy queries generally still take some time to complete even with MQTs enabled. The simple queries tend to be dominated by table data retrieval time, while the heavy queries also spend significant time processing the data when computing the final query result. With MQTs enabled, one heavy query finishes under 2 seconds, but the rest take 5 seconds or longer to complete, with the average heavy query duration being about 27 seconds.
Another reason that the maximum speed up for the heavy queries is lower than for the simple queries is that there are half as many heavy queries as simple queries, reducing the odds of encountering queries with higher-than-average speed up factors.
As you look at the vertical range bars in the graph, you’ll notice that the averages are well below the middle of the vertical bars. With our workload, we saw that a few queries sped up dramatically after enabling MQTs, and those queries extended the vertical range lines higher in the graph. For example, for simple queries, the highest speed up factor is 156x. The next highest speed up factor across the simple queries is 90x, much closer to the maximum (76x) for the heavy queries.
The range for all 87 queries is derived by combining (taking maximum and minimum values) the simple and heavy ranges.
Conclusion
Db2 Warehouse is increasingly being used to access data in existing data lakehouses backed by object storage, especially in the cloud. Such table data may consist of files in raw text format or, more likely, in modern open data formats such as Parquet, and in some cases it is stored in the Apache Iceberg open table format in order to benefit from ACID support and other capabilities like snapshotting.
In this blog post, we saw that columnar MQTs are an effective mechanism to accelerate queries against Datalake tables. Running a highly concurrent warehouse workload consisting of a range of easy to hard queries, throughput increased on the order of 40x just by enabling MQTs, and individual queries sped up by as much as 160x. Of course, the benefits of MQTs must be weighed against their costs, including the need to refresh them regularly to stay in sync with the base Datalake tables. MQTs also require additional storage, but when stored as column-organized tables in Native COS they not only benefit from the compression of column organization in Db2, but also from the multi-tier cache of Native COS, with no cost difference compared to the storage used by the base Datalake tables themselves.
About the Authors
Daniel Zilio has a PhD in Computer Science and has been with IBM for 25 years. Daniel has been one of the fathers of physical database design methods within IBM, including developing methods to automatically select indexes, DPF distribution keys, as well as MQTs. He was also a senior member of the Db2 compiler team and has recently worked on introducing column organized MQTs on Native COS. Daniel can be reached at zilio@ca.ibm.com.
John Poelman has been testing the performance and scalability of relational tables stored in open data formats for nearly a decade now, originally with a focus on Apache Hadoop-based offerings such as IBM Big SQL. John's recent work focuses on optimizing the performance of the Datalake table and data virtualization capabilities integrated into the Db2 family of products. John can be reached at poelman@us.ibm.com.
Christian Garcia-Arellano is a Senior Technical Staff Member, Master Inventor, and lead architect in the Db2 Kernel Development team at the IBM Toronto Lab, and has an MSc in Computer Science from the University of Toronto. Christian has been working in various Db2 kernel development areas since 2001. Initially he worked on the development of the self-tuning memory manager (STMM) and later led several of the high availability features for Db2 pureScale that make it the industry-leading database in availability. More recently, Christian was one of the architects for Db2 Event Store and the lead architect of the Native Cloud Object Storage feature in Db2 Warehouse. Christian can be reached at cmgarcia@ca.ibm.com.
Appendix – MQT Details
The table below describes the 24 MQTs. We created one MQT for each of the 24 TPC-DS tables, which are the base tables used by the BDInsights query workload. The table columns are as follows:
- TPC-DS table: Name of the TPC-DS table that the MQT is derived from.
- Cardinality: Number of rows in the MQT (and underlying Datalake table).
- FPAGES: Total number of 32 KB pages used by the MQT.
- Insert time (seconds): The time to read the data from the corresponding Datalake Iceberg table and insert the data into the MQT.
- Runstats time (seconds): The time to calculate runstats for the MQT.
Table 1: MQT details
TPC-DS table | Cardinality | FPAGES | Insert time (seconds) | Runstats time (seconds)
CALL_CENTER | 32 | 256 | 2.75 | 0.354
CATALOG_PAGE | 30016 | 808 | 3.177 | 0.36
CATALOG_RETURNS | 143916328 | 303984 | 48.298 | 45.352
CATALOG_SALES | 1439017576 | 4511056 | 1672.627 | 541.574
CUSTOMER | 12000000 | 27936 | 11.794 | 3.72
CUSTOMER_ADDRESS | 6000000 | 7272 | 7.036 | 1.743
CUSTOMER_DEMOGRAPHICS | 1920800 | 1960 | 4.667 | 0.573
DATE_DIM | 73040 | 2984 | 2.981 | 0.422
HOUSEHOLD_DEMOGRAPHICS | 7200 | 104 | 2.51 | 0.341
INCOME_BAND | 16 | 72 | 2.685 | 0.351
INVENTORY | 696000000 | 136112 | 22.227 | 33.328
ITEM | 300000 | 6800 | 4.847 | 0.641
PROMOTION | 1496 | 160 | 2.536 | 0.35
REASON | 32 | 104 | 2.452 | 0.345
SHIP_MODE | 16 | 104 | 2.452 | 0.341
STORE | 1008 | 240 | 2.674 | 0.365
STORE_RETURNS | 287768368 | 459616 | 79.289 | 73.944
STORE_SALES | 2878307312 | 5660856 | 2792.35 | 803.302
TIME_DIM | 86376 | 1272 | 2.863 | 0.387
WAREHOUSE | 16 | 120 | 2.714 | 0.34
WEB_PAGE | 3024 | 168 | 2.745 | 0.347
WEB_RETURNS | 71928528 | 148304 | 27.177 | 20.969
WEB_SALES | 719545456 | 2260080 | 773.949 | 290.296
WEB_SITE | 32 | 216 | 2.66 | 0.359