Access Paths Part 1: An Introduction to the Db2 Optimizer and Cost Model

Posted By: Jun Liu

Written By: IBM Data Management Console Development Group
Jun Liu, Yan Zeng, Jun Qing Huang, Bian Li

 


Whether they are beginners or experienced DBAs, most database users have questions about the behavior of Db2, especially regarding the execution performance of SQL statements. Common questions include: This SQL is very simple and returns only a few records, so why does Db2 take so long? There is a primary key index on this table, so why doesn't Db2 seem to use it at all? Why is this SQL sometimes fast and sometimes unbearably slow, when the only difference is the input parameters of a predicate? To answer these questions, one must first understand how SQL statements are processed internally by the Db2 optimizer.

 

  • Access path

In a sense, SQL is a declarative language: it describes only the data the program is interested in, not an algorithm for obtaining that data. Therefore, there are often many ways to execute a given SQL statement. These different methods are known as execution plans or access paths. Although different access paths for an SQL statement produce the same result set, they are unlikely to perform at the same level. Db2 must weigh various factors and try to obtain the results in the best way; to do so, it creates an access path. The access path defines how to access each table, which indexes to use, and which join methods to use to combine tables or intermediate results into the expected output. A good, or optimal, access path is essential for the efficient execution of SQL statements. If one day you find that the execution performance of an SQL statement is not as expected and want to tune it at the SQL level, understanding the execution plan Db2 has chosen becomes indispensable: whether tables are accessed efficiently, whether appropriate indexes are selected, and whether the join order and join methods between tables are appropriate.

 

  • Db2 optimizer and cost model

The optimizer is the heart and soul of Db2. Functionally, it is equivalent to an expert system: a set of standard rules. Regardless of how the data is stored and manipulated, Db2 accesses the data based only on the SQL definition. Separating the data access criteria from the physical storage characteristics is called "physical data independence", and the Db2 optimizer is the component that realizes it. For example, if some indexes are deleted, Db2 can still access the data through a table scan, although perhaps not as efficiently. Likewise, you can add a new column to a table, and Db2 can still manipulate the data without any change to the program code. All of this is possible because the physical access path to the data is not hard-coded by the programmer but generated automatically by Db2. This generated path is the access path, which, as described above, defines how to access each table, which indexes to use, and which join methods to use to combine tables or intermediate results into the expected output.



At run time, the optimizer draws on a great deal of information and performs a very complex calculation. To make the way it works more intuitive, you can picture the optimizer as a process consisting of four steps:

 

  • Parse the received SQL statement and validate it at the syntactic and semantic levels to ensure the input SQL is correct.
  • Analyze the current environment information and generate the optimal execution plan, which may include rewriting the original SQL.
  • Create machine-readable instructions to execute the optimized SQL.
  • Store them for subsequent execution.

 

For step 2, on what basis does the Db2 optimizer judge the optimal execution plan for an SQL statement? The optimizer is a cost-based optimizer (CBO), which means it always tries to develop the execution plan with the least overall cost for the query. To achieve this, the Db2 optimizer applies a query cost formula that evaluates and weighs four factors for each candidate execution plan: CPU cost, I/O cost, the statistics in the Db2 system catalog, and the actual SQL statement. The CPU cost can be broken down further into the time to locate a data page in the buffer pool (page cost) and the time to read a data record while applying the predicates to it (scan cost).

 

Let's look at a simple example to understand cost estimation:

 

SELECT C3 FROM T1 WHERE C1 = 100 AND C2 > 800

 

This example is very simple: find the values of C3 in table T1 that satisfy two conditions, C1 = 100 and C2 > 800. Table T1 has 50 columns, C1, C2, ..., C50. T1 has a total of 100,000 records, occupying 5000 data pages. In Db2 statistics, this is CARD = 100,000 and NPAGES = 5000 for T1. Three columns are referenced in this query: C1 and C2 appear in the predicates of the WHERE clause; C3 appears in the SELECT clause. Columns in different positions place different demands on the statistics; normally the optimizer only needs statistics for the columns that appear in predicates. In this example, column C1 has 100 distinct values and column C2 has 1000 distinct values. The second-highest value of C2 is 1000 and the second-lowest is 1 (the highest and lowest values are thrown out as potential outliers, and the second-highest and second-lowest are stored instead). The statistics are therefore: COLCARD = 100 for C1; COLCARD = 1000, HIGH2KEY = 1000, and LOW2KEY = 1 for C2. Formatting the query and appending all the statistics gives:

 

SELECT C3
FROM T1              -------------- CARD=100,000, NPAGES=5000
WHERE C1 = 100       -------------- COLCARD=100
AND C2 > 800         -------------- COLCARD=1000, HIGH2KEY=1000, LOW2KEY=1

 

Now let's follow the Db2 optimizer's method of cost estimation. First, the optimizer generates multiple access paths before estimating their costs. For this query, based on the definitions of the user's table and indexes, the optimizer can choose between two methods: a full table scan and an index scan. In the case of a full table scan, Db2 needs to access all the data pages of the table. Assume it takes 1 ms to read each data page into the buffer pool, for a total of 5000 ms of I/O time.

 

Next, calculate the CPU time. All 5000 data pages are now in the buffer pool. When Db2 performs a full table scan, the CPU spends time locating each data page in the buffer pool; assuming it takes 0.1 ms to locate one page, this is 500 ms of CPU time in total (page cost = 500 ms). The other part of the CPU overhead is the time to evaluate the predicates while reading each data record. Different types of predicates require different amounts of computation; in this example, assuming each predicate takes 0.01 ms to evaluate, the time required for 100,000 records and 2 predicates is 100,000 x 2 x 0.01 ms = 2000 ms (scan cost = 2000 ms).

The final estimated cost of this query using a full table scan is therefore:

 

IO cost + CPU cost = NPAGES x IO cost/page + (NPAGES x page cost/page + CARD x scan cost/record)
                   = 5000 ms + (500 ms + 2000 ms) = 7500 ms
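To make the arithmetic concrete, here is a minimal sketch of this estimate in Python. The unit costs (1 ms per page read, 0.1 ms page cost, 0.01 ms per predicate evaluation) are the assumed values of this example, not real Db2 constants.

# Assumed unit costs from the example (illustrative only, not real Db2 constants)
IO_MS_PER_PAGE = 1.0   # read one data page into the buffer pool
PAGE_MS = 0.1          # locate one page in the buffer pool (page cost)
PRED_MS = 0.01         # evaluate one predicate against one record (scan cost)

def full_scan_cost(npages, card, num_predicates):
    io_cost = npages * IO_MS_PER_PAGE
    cpu_cost = npages * PAGE_MS + card * num_predicates * PRED_MS
    return io_cost + cpu_cost

# T1: CARD=100,000, NPAGES=5000, two predicates (C1 = 100 and C2 > 800)
print(full_scan_cost(npages=5000, card=100_000, num_predicates=2))  # 7500.0 ms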

 

Now let's look at the estimated cost of an index scan. Assume that table T1 has an index named IX1 built on column C1. Some basic statistics are collected on IX1: the number of data pages occupied by the index's leaf nodes (NLEAF), the height of the index (NLEVEL), and the number of distinct values of the index key (FIRSTKEYCARD and FULLKEYCARD), as follows:

 

IX1 (C1 ASC)         ------------- NLEAF = 50
                     ------------- NLEVEL = 2
                     ------------- FIRSTKEYCARD = 100
                     ------------- FULLKEYCARD = 100

 

When Db2 performs an index scan, the time spent falls into two parts: the time to scan the index itself, and the time to read the table records through the RIDs (pointers to data records) stored in the index. The index is stored as a B+ tree. When Db2 scans the index, it starts comparing at the root node, descends into the subtree containing the values that satisfy the condition, and finally arrives at the leaf nodes. In this process, the number of non-leaf nodes Db2 must access equals the height of the index, NLEVEL. The number of leaf nodes Db2 then accesses depends on how many records satisfy the condition; not all leaf nodes need to be read. So how many leaf nodes will be read? Here we need the statistic FIRSTKEYCARD, which for this single-column index is equivalent to the COLCARD of C1, to calculate the predicate's selectivity. The selectivity of the predicate C1 = 100 is the number of records in T1 that satisfy the predicate divided by the total number of records in T1. By default, Db2 assumes the data in the table is uniformly distributed, so for column C1 with COLCARD distinct values, each value accounts for 1/COLCARD of the rows; this fraction is called the filter factor (FF). In this example it is 1/100, meaning 1/100 of the table's data satisfies the predicate. Likewise, only 1/100 of the RIDs stored in the leaf nodes need to be read, i.e. 1/100 x NLEAF = 0.5, rounded up to one index page. The I/O time for scanning the index is therefore:

 

IO cost = (NLEVEL + max(NLEAF x FF, 1)) x IO cost/page = (2 + max(50 x 1/100, 1)) x 1 ms = 3 ms

 

During the index scan, each time a non-leaf node is accessed, its stored key values must be compared against the search key to determine which subtree satisfies the condition. This also costs some CPU time, estimated as follows:

 

CPU cost = NLEVEL x (page cost/page + scan cost/record) = 2 x (0.1 ms + 0.01 ms) = 0.22 ms

 

So the total time required to scan the index is: IO cost + CPU cost = 3 ms + 0.22 ms = 3.22 ms
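Continuing the sketch, here is the index-traversal portion in Python, under the same assumed unit costs as before:

# Same assumed unit costs as in the full-table-scan sketch
IO_MS_PER_PAGE = 1.0
PAGE_MS = 0.1
PRED_MS = 0.01

def index_access_cost(nlevel, nleaf, ff):
    # Pages read: the non-leaf levels plus the qualifying leaf pages (at least one)
    io_cost = (nlevel + max(nleaf * ff, 1)) * IO_MS_PER_PAGE
    # Key comparisons while descending the B+ tree
    cpu_cost = nlevel * (PAGE_MS + PRED_MS)
    return io_cost + cpu_cost

# IX1: NLEVEL=2, NLEAF=50, FF = 1/COLCARD = 1/100
print(round(index_access_cost(nlevel=2, nleaf=50, ff=1/100), 2))  # 3.22 ms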

 

The second stage is to read the table records through the RIDs. The number of records that need to be read is CARD x FF = 100,000 x 1/100 = 1000 records. Assuming the qualifying records are evenly distributed across the table, this touches NPAGES x FF = 5000 x 1/100 = 50 data pages. The I/O time is then estimated as follows:

 

IO cost = NPAGES x FF x IO cost/page = 50 x 1 ms = 50 ms

 

The predicate C2 > 800 must be evaluated while reading the table records, which requires CPU time, estimated as follows:

 

CPU cost = NPAGES x FF x page cost/page + CARD x FF x scan cost/record
         = 50 x 0.1 ms + 1000 x 0.01 ms = 15 ms

 

So for the stage of reading the table data, the total time Db2 estimates is:

 

IO cost + CPU cost = 50 ms + 15 ms = 65 ms

 

Adding up the two stages gives the total time to execute the SQL through index IX1: 3.22 ms + 65 ms = 68.22 ms. This is far less than the full table scan (7500 ms), which is why database administrators routinely create indexes on tables to improve efficiency.
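The RID fetch stage and the end-to-end comparison complete the sketch, again with the example's assumed unit costs:

# Same assumed unit costs as in the earlier sketches
IO_MS_PER_PAGE = 1.0
PAGE_MS = 0.1
PRED_MS = 0.01

def rid_fetch_cost(npages, card, ff, remaining_predicates):
    io_cost = npages * ff * IO_MS_PER_PAGE                        # 50 pages -> 50 ms
    cpu_cost = (npages * ff * PAGE_MS                             # 5 ms
                + card * ff * remaining_predicates * PRED_MS)     # 10 ms
    return io_cost + cpu_cost

# Index access (3.22 ms) plus fetching rows and applying C2 > 800 (65 ms),
# versus the 7500 ms full table scan estimated above
total = 3.22 + rid_fetch_cost(npages=5000, card=100_000, ff=1/100, remaining_predicates=1)
print(round(total, 2))  # 68.22 ms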

 

The simple example above shows how, in Db2's cost model, a cost is calculated for each operation: full table scan, index scan, table join, data sort, and so on. Of course, the real formulas are not as simple as in this example, and Db2 considers many more factors. But for many SQL tuning cases, a rough cost estimate along these lines is enough for analysis.

 

  • Query Tuning Tips

Index scan or full table scan? This question is an extension of the example above, and the table and index statistics drive the answer. The most important statistics are the number of data pages in the table, the number of data pages in the index, and the selectivity of the predicates. When the table has many data pages, the index has few, and the predicates are very selective, the index scan generally costs less. Conversely, if the table has few data pages, the index is nearly as large as the table, and the predicates are not selective, Db2 prefers a full table scan. More complicated cases involve data prefetching, random I/O, and so on, which are beyond the scope of this article; the sketch below makes the basic trade-off concrete.
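As a rough illustration only: the comparator below reuses this article's simplified formulas and adds one assumption of its own, a higher cost for the random page reads of the RID fetch stage (RAND_MS = 4 ms is invented for illustration; the article's model ignores this distinction, and real Db2 costing is far more elaborate):

# Simplified break-even check between a full table scan and an index path.
# RAND_MS (random-read penalty for RID fetches) is an invented, illustrative value.
SEQ_MS = 1.0    # sequential page read, as in the example
RAND_MS = 4.0   # assumed random page read cost (not from the article)
PAGE_MS = 0.1
PRED_MS = 0.01

def prefer_index(npages, card, nlevel, nleaf, ff, preds):
    full_scan = npages * SEQ_MS + npages * PAGE_MS + card * preds * PRED_MS
    index_path = ((nlevel + max(nleaf * ff, 1)) * SEQ_MS          # traverse the index
                  + nlevel * (PAGE_MS + PRED_MS)
                  + npages * ff * RAND_MS                         # random RID fetches
                  + npages * ff * PAGE_MS
                  + card * ff * (preds - 1) * PRED_MS)            # remaining predicates
    return index_path < full_scan

# Selective predicate (FF = 1/100) on a large table: the index path wins
print(prefer_index(npages=5000, card=100_000, nlevel=2, nleaf=50, ff=1/100, preds=2))  # True
# Unselective predicate (FF = 0.9): the full table scan wins
print(prefer_index(npages=5000, card=100_000, nlevel=2, nleaf=50, ff=0.9, preds=2))    # False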

 

Which index is best when there are several? This problem is very common. The answer comes from comparing the selectivity of the predicates that match the keys of each index, and sometimes from the type of predicate and its position in the index key. The general principle: when the leading key of an index matches an "=" predicate that is highly selective, even close to unique, that index is almost certainly the best choice. In the example above, if the COLCARD of C1 were close to the number of records in T1, an index with C1 as the leading key would be optimal. Db2 also considers stability when selecting indexes. For two equally selective predicates, one with "=" and one with ">", the former is more stable: when the input parameters change, the performance stays more consistent. Of course, some situations, such as data skew, require special consideration. A small ranking sketch follows.
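A toy ranking of candidate indexes by the filter factor of the predicate matching their leading key. Both index definitions and the range filter factor here are hypothetical (one common interpolation for C2 > 800 is (HIGH2KEY - 800) / (HIGH2KEY - LOW2KEY), roughly 0.2); real Db2 index selection weighs far more factors:

# Hypothetical candidates: the predicate operator and filter factor of each leading key
candidates = [
    {"index": "IX1", "leading_col": "C1", "op": "=", "ff": 1 / 100},  # C1 = 100
    {"index": "IX2", "leading_col": "C2", "op": ">", "ff": 0.2},      # C2 > 800 (interpolated)
]

# Prefer the smallest filter factor; break ties in favor of "=" predicates,
# whose performance stays stable when input parameters change
best = min(candidates, key=lambda c: (c["ff"], c["op"] != "="))
print(best["index"])  # IX1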

 

When do you need to avoid sorting? I have encountered many sorting situations, because Db2 sorts in many places: for the DISTINCT, ORDER BY, and GROUP BY keywords in the SQL, and as part of some join methods, such as merge scan join. Sorting is not bad in all cases. A very simple example: data is accessed through a primary key index, but the column to be sorted is not in the index key. Even adding this column to the index key has very little effect, because after Db2 scans through the primary key it returns at most one record, so there is nothing to sort. The sorts that really need to be avoided are those over very large, expensive intermediate sets. One case I encountered before was a join of two very large tables followed by a DISTINCT, which produced a huge sort. After checking the join conditions, I pushed the DISTINCT down onto each table and created appropriate indexes, which finally eliminated the sort. The sketch below illustrates the idea.
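A minimal sketch of the idea in Python, with lists of (key, value) tuples standing in for the two tables. The actual fix was an SQL rewrite plus new indexes, and pushing DISTINCT below a join is only valid when the columns dropped before deduplication cannot change the result:

# Deduplicate each input first, then join: the huge post-join DISTINCT
# becomes two small dedups over the individual tables.
def dedup_then_join(t1_rows, t2_rows):
    left = set(t1_rows)     # DISTINCT pushed down to the first table
    right = set(t2_rows)    # DISTINCT pushed down to the second table
    return [(a, b) for (k1, a) in left for (k2, b) in right if k1 == k2]

rows1 = [(1, "x"), (1, "x"), (2, "y")]   # duplicates collapse before the join
rows2 = [(1, "p"), (1, "p"), (2, "q")]
print(dedup_then_join(rows1, rows2))     # [('x', 'p'), ('y', 'q')] (order may vary)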

 

What should you watch out for with nested loop join? Nested loop join is one of Db2's table join methods. Its principle is simple: scan one table (the outer table) to obtain all the records that satisfy its conditions, then for each such record access the second table (the inner table) according to the join conditions; after scanning the inner table once per outer record, the qualifying records from the two tables are merged together. The critical factors are the number of qualifying records in the outer table and the access method for the inner table. Generally speaking, if the outer table has many qualifying records and the inner table's access method is poor, such as a full table scan, you must pay attention. You will rarely see that situation in an access path graph, though, because Db2's statistics usually reveal the high cost and it chooses another join method. The more common problem case is when the estimated number of qualifying outer records is very small and the inner table uses a full table scan. Be careful here, because in some cases Db2 may have misestimated the outer record count; the most common reasons are missing statistics and predicate types whose selectivity is hard to estimate. In such cases, split the query and run the outer part in the actual environment to see how many records really qualify. If the number differs greatly from Db2's estimate, find out why before looking for a better plan. Incidentally, Db2 provides a useful feature for collecting actual cardinalities at run time, which is very handy for diagnosing such estimation problems. The sketch below shows the shape of the algorithm.
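A minimal sketch of the algorithm's shape in Python, with lists standing in for tables; the inner loop here is exactly the "full table scan per outer record" case discussed above:

# Nested loop join: for each qualifying outer record, scan the inner table.
# Cost grows with (qualifying outer records) x (cost of one inner access),
# which is why a misestimated outer cardinality plus an inner full scan hurts.
def nested_loop_join(outer, inner, outer_pred, join_pred):
    result = []
    for o in outer:
        if not outer_pred(o):
            continue                 # local predicate on the outer table
        for i in inner:              # full scan of the inner table, once per outer row
            if join_pred(o, i):
                result.append((o, i))
    return result

orders = [(1, 100), (2, 250), (3, 90)]        # hypothetical (id, amount) rows
customers = [(1, "A"), (2, "B"), (3, "C")]    # hypothetical (id, name) rows
matches = nested_loop_join(orders, customers,
                           outer_pred=lambda o: o[1] > 95,       # amount > 95
                           join_pred=lambda o, i: o[0] == i[0])  # join on id
print(matches)  # [((1, 100), (1, 'A')), ((2, 250), (2, 'B'))]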

 

  • Conclusion

That brings this article to an end. We hope you now have a preliminary understanding of how Db2 handles SQL statements and of the Db2 optimizer and its cost model. More, and richer, use cases will be provided for your reference in the future.