Db2 catalog tables - Statistics
WRITTEN BY THE IBM DATA MANAGEMENT CONSOLE DEVELOPMENT GROUP
Jun Liu, Guo Bin Zhao, and Yinmei Chen
Db2 maintains a set of catalog tables that describe the metadata of each database, including tables, indexes, views, schemas, procedures, sequences, and so on. Authorized users can query these catalog tables to get useful information, such as how many tables are stored in the database, how many columns are defined for each table, and what the column types are. But much of the information in the catalog tables is used by Db2 itself. A good example is the statistics associated with database objects. Statistics are a very important kind of information in a Db2 system: the optimizer, the "brain" of Db2, is a cost-based optimizer that calculates costs based on statistics about database objects (tables, columns, indexes, and so on). The following sections describe the types of Db2 statistics in detail, how they are used in the Db2 cost model, and how to collect them efficiently.
(Note that this article is applicable to Db2 for LUW).
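For example, a simple catalog query like the following (a sketch using the standard SYSCAT views; the schema filter is only illustrative) lists the user tables in a database together with their column counts and, once statistics have been collected, their cardinalities (CARD is -1 if statistics have not been collected):

SELECT TABSCHEMA, TABNAME, COLCOUNT, CARD
FROM SYSCAT.TABLES
WHERE TYPE = 'T' AND TABSCHEMA NOT LIKE 'SYS%'
ORDER BY TABSCHEMA, TABNAME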
Statistics types: cardinality, frequency and quantile
The statistics used by the Db2 optimizer fall into three main types: cardinality, frequency, and quantile. These three types of statistics are applicable to different types of queries. Let's first look at the specific definitions and examples of these three types of statistics, and then look at what the corresponding columns are in the main Db2 catalog table.
Cardinality: Indicates the number of rows in a row set.
Frequency: These statistics provide information about the column data values with the highest number of duplicates, the second-highest number of duplicates, and so on, up to the level specified by the num_freqvalues database configuration parameter. To disable the collection of frequent-value statistics, set num_freqvalues to 0.
Quantile: These statistics provide information about how data values are distributed relative to other values. Called K-quantiles, these statistics represent the value V for which at least K values are less than or equal to V. A K-quantile can be computed by sorting the values in ascending order; the K-quantile value is the value in the Kth position from the low end of the range. The num_quantiles database configuration parameter specifies the number of sections into which the column data values should be grouped; it can be set to a value between 2 and 32,767, and the default value is 20.
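As a sketch (assuming a database named SAMPLE), these defaults can be changed at the database level with the UPDATE DATABASE CONFIGURATION command; they can also be overridden for specific columns in the RUNSTATS statement, as shown later in this article:

UPDATE DB CFG FOR SAMPLE USING NUM_FREQVALUES 10 NUM_QUANTILES 20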
In the Db2 catalog tables (views), these three kinds of statistics are spread across different catalog tables, as shown in the following tables (views):
Table 1: Table statistics (SYSCAT.TABLES and SYSSTAT.TABLES)
| Statistics | Description |
| --- | --- |
| CARD | Table cardinality |
| FPAGES | The number of pages used in a table |
| NPAGES | The number of pages containing rows |
| OVERFLOW | The number of rows overflowed |
| ACTIVE_BLOCKS | For MDC tables, the total number of occupied blocks |
Table 2: Column statistics (SYSCAT.COLUMNS and SYSSTAT.COLUMNS)
| Statistics | Description |
| --- | --- |
| COLCARD | Column cardinality |
| AVGCOLLEN | The average length of the column |
| HIGH2KEY | The second-highest value in the column |
| LOW2KEY | The second-lowest value in the column |
| NUMNULLS | The number of null values in the column |
| SUB_DELIM_LENGTH | The average length of each delimiter separating sub-elements |
Table 3: Column distribution statistics (SYSCAT.COLDIST and SYSSTAT.COLDIST)
| Statistics | Description |
| --- | --- |
| DISTCOUNT | If TYPE is Q, the number of distinct values that are less than or equal to COLVALUE |
| TYPE | Indicator of whether the row provides frequency (F) or quantile (Q) statistics |
| SEQNO | Sequence number: for TYPE = F, the frequency ranking (N identifies the Nth most frequent value); for TYPE = Q, the quantile number |
| COLVALUE | The data value |
| VALCOUNT | If TYPE = F, VALCOUNT is the number of occurrences of COLVALUE in the column. If TYPE = Q, VALCOUNT is the number of rows whose value is less than or equal to COLVALUE. |
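To see the frequency and quantile rows that RUNSTATS stores for a particular column, a query such as the following can be used (a sketch; MYSCHEMA, T1, and C1 are placeholder names):

SELECT TYPE, SEQNO, COLVALUE, VALCOUNT, DISTCOUNT
FROM SYSCAT.COLDIST
WHERE TABSCHEMA = 'MYSCHEMA' AND TABNAME = 'T1' AND COLNAME = 'C1'
ORDER BY TYPE, SEQNO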
Table 4: Multiple-column statistics (SYSCAT.COLGROUPS and SYSSTAT.COLGROUPS)

| Statistics | Description |
| --- | --- |
| COLGROUPCARD | Column group cardinality |
Table 5: Multiple-column distribution statistics (SYSCAT.COLGROUPDIST and SYSSTAT.COLGROUPDIST)

| Statistics | Description |
| --- | --- |
| TYPE | F = frequency value; Q = quantile value |
| ORDINAL | The ordinal position of the column within the column group |
| COLVALUE | The data value, or a null value, as character text |
Table 6: Index key statistics (SYSCAT.INDEXES and SYSSTAT.INDEXES)

| Statistics | Description |
| --- | --- |
| NLEAF | Number of leaf pages |
| NLEVELS | Number of index levels |
| FIRSTKEYCARD | Number of distinct first-key values |
| FIRST2KEYCARD | Number of distinct keys using the first two columns of the index |
| FIRST3KEYCARD | Number of distinct keys using the first three columns of the index |
| FIRST4KEYCARD | Number of distinct keys using the first four columns of the index |
| FULLKEYCARD | Number of distinct full-key values |
| CLUSTERRATIO | Degree of data clustering with the index |
| CLUSTERFACTOR | Finer measurement of the degree of clustering |
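Similarly, the index statistics above can be inspected with a query like this one (a sketch; MYSCHEMA and T1 are placeholder names):

SELECT INDSCHEMA, INDNAME, NLEAF, NLEVELS, FIRSTKEYCARD, FULLKEYCARD, CLUSTERRATIO
FROM SYSCAT.INDEXES
WHERE TABSCHEMA = 'MYSCHEMA' AND TABNAME = 'T1'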
Table cardinality: For a normal table, it indicates the total number of rows of data in the table; for an auxiliary table, it indicates the total number of LOBs in the auxiliary table. In this article we consider normal tables, where the table cardinality is the CARD column in the system view SYSCAT.TABLES. Example:
Table 7: Table cardinality example

| Col1 | Col2 |
| --- | --- |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |

The table cardinality is 3, because the table contains three rows.
Column cardinality: For a normal table, it indicates the number of different values in the corresponding column of the table. It is the COLCARD column in the system view SYSCAT.COLUMNS. Example:
Table 8: Column cardinality example

| Col1 | Col2 |
| --- | --- |
| 3 | 1 |
| 3 | 2 |
| 3 | 2 |
Col1's column cardinality is 1, because there is only one distinct value, which is '3'.
Col2's column cardinality is 2, because there are two distinct values, '1' and '2'.
Column group: A combination of columns (at least one column). Example:
Table 9: Column group example

| Col1 | Col2 | Col3 |
| --- | --- | --- |
| 1 | 2 | 3 |
Column groups can be defined such as (Col1), (Col2), (Col1,Col2), (Col2,Col1), (Col1,Col3,Col2), and so on.
Column group cardinality: Indicates the number of different values of a column group in a table. In the system view SYSCAT.COLGROUPS, the column group cardinality corresponds to the COLGROUPCARD column. Example:
Table 10: Column group cardinality

| Col1 | Col2 | Col3 |
| --- | --- | --- |
| 11 | 21 | 31 |
| 11 | 22 | 31 |
| 11 | 22 | 32 |
| 11 | 22 | 32 |
The column group (Col1,Col2,Col3)'s column group cardinality is 3. The distinct values it contains are as follows:

| Col1 | Col2 | Col3 |
| --- | --- | --- |
| 11 | 21 | 31 |
| 11 | 22 | 31 |
| 11 | 22 | 32 |
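As a quick sanity check (a sketch, assuming the data above is stored in a table named T10), the same column group cardinality can be computed directly with SQL:

SELECT COUNT(*) AS COLGROUPCARD
FROM (SELECT DISTINCT Col1, Col2, Col3 FROM T10) AS DV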
Index full key cardinality: Indicates the number of distinct values of the complete (full) index key. It is the FULLKEYCARD column in the system view SYSCAT.INDEXES. Example: for the data in Table 10, create the following index.
CREATE INDEX INDEX1 ON T10 (Col3,Col1,Col2)
The index full key cardinality is 3. The distinct values of the index full key (Col3,Col1,Col2) are as follows:

Table 11: Index full key cardinality

| Col3 | Col1 | Col2 |
| --- | --- | --- |
| 31 | 11 | 21 |
| 31 | 11 | 22 |
| 32 | 11 | 22 |
Index first key cardinality: Indicates the number of different values of the first key column of the index. It is the FIRSTKEYCARD column in the system view SYSCAT.INDEXES. Example: for INDEX1, the index first key cardinality is 2. The index first key is Col3, which has the following distinct values:
Table 12: Index first key cardinality example

| Col3 |
| --- |
| 31 |
| 32 |
Column frequency: The frequency of column (group) values, expressed in percentage form. It corresponds to the VALCOUNT column in the system view SYSCAT.COLDIST. Example:
Table 13: Column group frequency example

| Col1 |
| --- |
| 1 |
| 2 |
| 2 |
| 3 |

For column group (Col1):

| Col1 value | Column group frequency |
| --- | --- |
| 1 | 0.25 |
| 2 | 0.50 |
| 3 | 0.25 |
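For illustration only (assuming Col1 belongs to this 4-row table and num_freqvalues is at least 3), the corresponding frequency rows in SYSCAT.COLDIST store occurrence counts rather than percentages; dividing VALCOUNT by the table cardinality gives the frequencies shown above:

| TYPE | SEQNO | COLVALUE | VALCOUNT |
| --- | --- | --- | --- |
| F | 1 | '2' | 2 |
| F | 2 | '1' | 1 |
| F | 3 | '3' | 1 |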
Histogram frequency: Indicates how often a column (group) value falls into a specified range, expressed in percentage form. It is the DISTCOUNT column in the system view SYSCAT.COLDIST. Example:
Table 14: Histogram frequency example

| Col1 |
| --- |
| 1.1 |
| 2.1 |
| 2.3 |
| 3.5 |

For column group (Col1), the histogram frequency is as follows:

| | Minimum value in range | Maximum value in range | Histogram frequency |
| --- | --- | --- | --- |
| Range 1 | 1 | 2 | 0.25 |
| Range 2 | 2 | 3 | 0.5 |
| Range 3 | 3 | 4 | 0.25 |
How Db2 estimates cost based on statistics
In the previous section, we looked at the classification of statistics. So how exactly does Db2 use the various kinds of statistics for cost estimation? The statistics that Db2 can use differ from case to case; that is, not all the statistics collected by the user can be used in cost estimation. We often encounter situations where the user collects a lot of statistics, but most of them are unnecessary while some critical statistics are missing, which both places a huge burden on the system and fails to solve the performance problem.
To understand what kind of statistics should be collected under what circumstances, we first need to understand how the Db2 optimizer uses statistics for cost estimation. Please refer to the article "Access Paths Part 1: The introduction of Db2 optimizer and cost model".
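As a very simple illustration (a sketch only, not the full cost model): for the data in Table 13, the table cardinality is 4 and the frequency statistic for the value 2 in Col1 is 0.50, so for a predicate such as Col1 = 2 the optimizer can estimate about CARD × selectivity = 4 × 0.50 = 2 qualifying rows. Without the frequency statistic it would have to fall back on a uniformity assumption, CARD / COLCARD = 4 / 3 ≈ 1.3 rows, which is less accurate for skewed data.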
Methods and principles of collecting statistics
Generally speaking, in order to get accurate statistics, a Db2 database administrator (DBA) or database application developer can collect the required statistics with the RUNSTATS utility provided by Db2. But using RUNSTATS to collect complete statistics for the whole database can be time-consuming; for large databases storing huge amounts of data (such as a banking system database), a complete RUNSTATS can in some cases take a day or even several days, which is obviously too costly. Therefore, in actual production, users often use RUNSTATS to collect statistics for some database objects, and then collect statistics for another set of database objects after a period of time. The advantage of this approach is that it is more flexible: the user only needs to collect statistics on the database objects of interest, and the cost is lower.
But this "flexible" RUNSTATS approach also has its disadvantages. Data in the Db2 database may change over time, for example through inserts or deletes, and such data changes do not automatically update the corresponding Db2 statistics. This leads to inconsistent or even contradictory statistics collected by the user at different points in time. Such inconsistencies between statistics have a significant negative impact on the Db2 optimizer and may result in poor SQL query performance; these conditions may indeed exist in the user's database at some point in time, and they are often difficult to detect.
In the following subsections, we describe the methods of collecting statistics and the principles of how they should be collected in different situations.
Methods of collecting statistics
- RUNSTATS statements that collect table and index statistics:
  - Table: RUNSTATS ON TABLE T1 (collects basic statistics on table T1, e.g., NPAGES, FPAGES, etc., and basic statistics on all columns of table T1, e.g., COLCARD, HIGH2KEY, LOW2KEY, etc.)
  - Index: RUNSTATS ON TABLE T1 FOR INDEXES ALL (collects basic statistics for all indexes on the table, e.g., NLEVELS, NLEAF, FIRSTKEYCARD, FULLKEYCARD, etc.)
  - Table and index: RUNSTATS ON TABLE T1 AND INDEXES ALL (collects basic statistics on table T1 and all its indexes)
The above RUNSTATS statements that collect basic statistics are very common and fast to execute, and many users like to use them. But there are problems: first, only basic statistics are collected, which may not be enough in some cases; second, a lot of useless data is often collected, especially when statistics are gathered for columns that do not appear in any query at all.
Let's look at some extended statements.
- Table: RUNSTATS ON TABLE T1 WITH DISTRIBUTION (collects distribution statistics for all columns on table T1, including frequencies and quantiles)
- Index: RUNSTATS ON TABLE T1 FOR DETAILED INDEXES ALL (collects detailed statistics, such as finer-grained clustering information, for all indexes on table T1)
- Table and index: RUNSTATS ON TABLE T1 WITH DISTRIBUTION AND DETAILED INDEXES ALL (collects distribution statistics for table T1 and detailed statistics for all its indexes)
Similar to the previous RUNSTATS statements, these extended statements still do not specify which columns distribution statistics should be collected on, which can cause over-collection. In addition, because collecting distribution statistics is expensive, the RUNSTATS statements above usually consume a lot of resources and time. So in most cases, users are recommended to collect statistics on specified columns.
- Collect statistics for specific columns and column groups.

Suppose there are 10 columns, C1, C2, ..., C10, on table T1, and two of them, C1 and C2, are often used in queries. Then we can collect statistics for these two columns specifically, as follows.
RUNSTATS ON TABLE T1 WITH DISTRIBUTION ON COLUMNS (C1, C2) (collects basic statistics on table T1, as well as basic and distribution statistics on columns C1 and C2)
In addition, we can customize the default values for distribution statistics, for example:
RUNSTATS ON TABLE T1 WITH DISTRIBUTION ON COLUMNS (C1, C2 NUM_FREQVALUES 50 NUM_QUANTILES 75) (collects distribution statistics on C1 and C2; for C2, the top 50 frequent values are collected and the histogram is split into 75 ranges)
Suppose there are two indexes on table T1, IX1(C1,C2) and IX2(C3,C4,C5). Generally, the database administrator or application developer has already taken index usage into account when designing the indexes, so statistics on the index columns should always be collected, as follows.
RUNSTATS ON TABLE T1 ON KEY COLUMNS (collects column statistics for the key columns of indexes IX1 and IX2 on table T1)
If the index columns are correlated, statistics on column groups also need to be collected, as follows.
RUNSTATS ON TABLE T1 ON COLUMNS (C1, (C3, C4), C2) (collects basic statistics on column C1, column group (C3, C4), and column C2)
Principles of collecting statistics
This section focuses on the general principles of collecting statistics. Statistics are very important for the Db2 optimizer; when the stored data changes, statistics must be collected again to reflect the latest changes and avoid misleading the optimizer. Collection is recommended in the following cases.
- Table statistics need to be collected after loading data into a table with LOAD.
- After indexes have been created, statistics need to be collected on newly created indexes.
- After reorganizing table data with REORG, statistics on the table need to be collected.
- After running a large number of INSERT, DELETE, MERGE, and UPDATE operations, statistics need to be collected again on the tables and associated indexes. The predefined function UDI_COUNT, available in Db2 V11, can be used to find out the number of times these statements have been executed, which helps to decide whether statistics need to be re-collected.
- If a procedure is very critical, it is recommended to re-collect statistics on its dependent tables and indexes before binding (BIND) the procedure.
- After running any of the following procedures without collecting inline statistics: RECOVER TABLESPACE, REBUILD indexes, or REORG indexes.
- After running an ALTER TABLE ROTATE PARTITION statement, run RUNSTATS together with REORG.
In addition to the above, as mentioned before, to improve the performance of RUNSTATS and reduce the disk space needed to store statistics, try to collect distribution statistics for only selected columns. Also, after each collection of statistics, the related applications need to be rebound (REBIND) so that the Db2 optimizer can calculate the optimal access plan based on the new statistics.
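For example (a sketch; MYSCHEMA.PKG1 is a placeholder package name), the rebind can be done from the command line processor:

REBIND PACKAGE MYSCHEMA.PKG1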
If users do not have enough time to collect all the statistics at one time, they can group the Db2 objects to be collected and collect one or several groups of statistics at a time. The grouping should put related tables, indexes, and statistical views in the same group to minimize inconsistency. If inconsistency does occur after several rounds of collection at different times, for example distribution statistics of a table are collected at the beginning, then after some time the table's data changes and the user collects statistics on its indexes, the Db2 optimizer returns a warning message (SQL0437W, reason code 6) after RUNSTATS executes, to remind the user to re-collect the distribution statistics of the table.
The following are some best practices:
- The collection of statistics should be targeted; in particular, the columns for which statistics are collected should appear in join predicates, WHERE conditions, GROUP BY clauses, and so on. If the columns are index key columns, you can specify ONLY ON KEY COLUMNS in the RUNSTATS statement to constrain the scope of the collection.
- Customize num_freqvalues and num_quantiles for the specified tables and columns to avoid over-collection.
- When collecting DETAILED statistics for indexes, adding the SAMPLED DETAILED clause can reduce background computation time. The SAMPLED DETAILED clause provides adequate accuracy in most cases.
- When creating a new index on a table, adding the COLLECT STATISTICS clause collects the index statistics while the index is being created, thus avoiding a separate collection (see the sketch after this list).
- When a large number of table records are inserted or deleted, if the data in the column where the statistics are collected changes significantly, the RUNSTATS statement needs to be executed to collect them again.
- The RUNSTATS statement collects statistics on only a single database partition. If data is distributed unevenly (skewed) across multiple database partitions, you may need to use the REDISTRIBUTE DATABASE PARTITION GROUP command to redistribute the data first, and then run the RUNSTATS statement.
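The following sketch illustrates two of the practices above in statement form (T1, C3, and IX3 are placeholder names): the first statement collects index statistics while the index is created, and the second uses sampling to reduce the cost of detailed index statistics.

CREATE INDEX IX3 ON T1 (C3) COLLECT STATISTICS
RUNSTATS ON TABLE T1 AND SAMPLED DETAILED INDEXES ALL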
If you don't have the above query performance tuning expertise, or you don't want to manually analyze the statements to build the RUNSTATS command, IBM Data Management Console (DMC) provides functionality to generate the RUNSTATS commands that collect only the necessary statistics for a specific SQL statement or for multiple SQL statements (a workload). Within DMC, you simply input your SQL statements and start a tuning task, and DMC generates the recommended RUNSTATS commands for you. Download IBM Data Management Console from https://www.ibm.com/products/db2-data-management-console and give it a try.
Conclusion
This article introduced the statistics in the Db2 catalog tables, including the statistics types, how Db2 estimates cost based on statistics, the methods and principles of collecting statistics, some best practices, and the tool (IBM Data Management Console) that automates statistics collection. We hope it helps you better understand the statistics in the Db2 catalog tables and how to collect and use them.