Logical Schema Backup and Restore for Columnar Tables, Part 2
The Logical Backup and Restore utilities for columnar tables, introduced in the 11.5.8 release, are a powerful feature that provides a flexible and fine-grained solution to both schema/table movement and data disaster recovery. This article is Part 2 of a series on this feature. In this blog we will explore what can go wrong while executing Logical Backup and Restore, how to troubleshoot, and what steps to take to alleviate the problem.
Looking under the hood
Externally, as described in the documentation, the feature is used by the end user through a call to the LOGICAL_BACKUP or LOGICAL_RESTORE stored procedures. Under the hood, the stored procedures invoke a collection of Python scripts that establish a connection back to the server and execute a series of SQL statements that leverage the new metadata added as part of row modification tracking enablement.
Since the scripts are treated as a regular application by the server, very little diagnostic information is recorded in the usual db2dump folder and db2diag.log file. These usually contain important diagnostic information when problems originate in the engine itself, but, just as for any other application, nothing will be recorded by the Logical Backup and Restore procedures if the problem is “logical” – meaning that it originates in the scripts – which is the main focus of this article.
Instead, the main diagnostic information produced by the Logical Backup and Restore utilities is recorded into a tracelog file. This file is the go-to debug source for most problems. In addition, like any other application that executes SQL statements, the stored procedures also return SQLCA data to the caller, which contains useful information as well.
It should be noted that problems often do not originate in the Logical Backup or Restore utilities themselves, and an investigation into them may resemble peeling an onion. Nonetheless, the information below should help you understand what was going on at the time of the failure and point to the next step of investigation if it does not resolve the issue at once.
Starting with SQLCA
The Logical Backup and Restore stored procedures return a SQLCA data structure back to the end user. In many situations this information is sufficient to diagnose the source of the issue. However, in some cases the errors themselves are not helpful and more detailed information is required – for example, when the stored procedures return the generic SQL1797 error.
Another issue is that the message portion of the SQLCA structure is limited to 70 characters, which often is not enough to describe the error in full detail – for example, when the source of the problem is a table whose name is longer than 70 characters. In many cases we shorten the messages to make the most of the space available, but truncation of valuable information can still occur, and the SQLCA by itself becomes insufficient. In this case, we need to dig deeper.
And for that we need to move onto the next level, which is the tracelog.
Finding the Error log folder
As mentioned above, the main source of information for problem determination after the SQLCA is the error logs. By default these logs are stored in the <install path>/tmp/bnr/logs folder on the server; the location can be overridden with the -errorlogdir option. The user can also control whether to keep the files via the -keeplogs option: files are always kept on error, and on a successful operation they are either removed, stored in the backup image, or left in the errorlogdir.
The automatic cleanup of diagnostic files on success prevents excessive disk clutter. However, it is sometimes useful to keep logs even on success, for example to diagnose a performance problem or to compare one invocation to another.
Similar to the backup image, which is associated with a timestamp label, the tracelog files related to a particular invocation of LOGICAL_BACKUP or LOGICAL_RESTORE are stored in a folder whose name is a timestamp identifying that specific invocation of the stored procedure. However, note that for LOGICAL_BACKUP this folder name (timestamp) may differ from the backup image timestamp. This is because the errorlog folder has to be created before LOGICAL_BACKUP has the opportunity to generate a unique timestamp to be used as the backup image label. Most of the time these two timestamps differ by a few seconds and can generally be correlated if a few trailing digits are ignored. The tracelog folder timestamp generated as part of LOGICAL_RESTORE or LOGICAL_BACKUP_DETAILS reflects when the script is executed and is not related to the source backup image label at all.
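Since the two timestamps usually differ by only a few seconds, correlating an errorlog folder with a backup image label amounts to checking that they fall within a small window. A minimal sketch of this idea (the helper name and tolerance are illustrative, not part of the utilities):

```python
from datetime import datetime

def timestamps_match(folder_ts: str, image_ts: str, tolerance_seconds: int = 60) -> bool:
    """Heuristically correlate an errorlog folder timestamp with a
    backup image label by checking they are within a small window."""
    fmt = "%Y%m%d%H%M%S"
    t1 = datetime.strptime(folder_ts, fmt)
    t2 = datetime.strptime(image_ts, fmt)
    return abs((t2 - t1).total_seconds()) <= tolerance_seconds

# The errorlog folder is created slightly before the image label is generated:
print(timestamps_match("20240223132733", "20240223132746"))  # → True
```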
Three types of files can be found in this error log folder:
- Tracelog file
  - The main diagnostic file, discussed in detail below. Only a single file is created.
- Log file
  - A summary of the information in the tracelog file. Since it is a summary, it will not be discussed further.
- Connector log files
  - When the target media is TSM, S3, or IBM COS, files are transferred to and from remote storage. Connector log files contain diagnostic information related to the transfer of files. One log file is created for each file transferred.
Example: contents of errorlogdir after a successful Logical Backup operation (with -s3 and -keeplogs all_in_errorlogdir):
db2 "call sysproc.LOGICAL_BACKUP('-type full -schema rmt -s3 -access-key <access key> -secret-key <secret key> -default-region us-east-2 -bucket-url db2qa-regr -endpoint <endpoint> -errorlogdir /home/user/tmp/logs -keeplogs all_in_errorlogdir')"
$ ls /home/user/tmp/logs/20240223132733
1.ddl.20240223132746.connector.log
backup20240223132733.log
backup20240223132733.log.tracelog
.backup.info.json.20240223132801.connector.log
.info.json.20240223132757.connector.log
query.20240223132754.connector.log
schema.post.ddl.20240223132735.connector.log
schema.pre.ddl.20240223132743.connector.log
schema.skip.ddl.20240223132739.connector.log
Looking at a Tracelog file
The tracelog file in the errorlog folder contains the most detailed information on what is executed as part of a Logical Backup or Restore operation.
Another important consideration is that the logical backup and restore application is multi-threaded, with all concurrent threads running within the scripts logging to the same tracelog file. Multi-line messages are grouped: one thread cannot place a line into the middle of another thread's multi-line message. All messages are prefixed with a timestamp, and the same timestamp is used for all lines of a multi-line message. This can be used to recognize where one message ends and another starts, as well as how long it takes to execute a command. The thread number is not included in the message, so it is usually impossible to know which message is the “next” one written by the same thread – unless an explicit marker (like “Start of ..” and “End of ..”) is used. A notable exception is messages related to table unload: all operations related to the unload of a particular table are performed by a single thread, so a table name can be used to construct a coherent stream of messages.
Example of entries in tracelog file:
2022-11-14 13:43:22.279150 TRACE: Lock schema tables for schema RMT...
2022-11-14 13:43:22.279252 INFO: Checking if row modification tracking is enabled in schema "RMT".
2022-11-14 13:43:22.279314 TRACE: Debug: SELECT 1 AS TRUE FROM SYSCAT.SCHEMATA WHERE SCHEMANAME='RMT' AND ROWMODIFICATIONTRACKING='Y'
2022-11-14 13:43:22.286735 INFO: Row modification tracking is enabled in schema "RMT".
2022-11-14 13:43:22.286833 TRACE: Debug: lock table "RMT"."T1" in intent none mode
2022-11-14 13:43:22.310807 TRACE: Debug: lock table "RMT"."T2" in intent none mode
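The grouping rule described above – consecutive lines sharing the same timestamp prefix belong to one message – can be sketched as a small parser. This is an illustrative sketch under that assumption; the function name and regular expression are not part of the utilities:

```python
import re

# Each tracelog line starts with "YYYY-MM-DD HH:MM:SS.ffffff LEVEL: text";
# all lines of a multi-line message share the same timestamp.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) (\w+): (.*)$")

def group_messages(lines):
    """Group consecutive tracelog lines that share a timestamp into messages."""
    messages = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts, level, text = m.groups()
        if messages and messages[-1][0] == ts:
            messages[-1][2].append(text)  # continuation of a multi-line message
        else:
            messages.append((ts, level, [text]))
    return messages

lines = [
    '2022-11-14 13:43:22.279150 TRACE: Lock schema tables for schema RMT...',
    '2022-11-14 13:43:22.279252 INFO: Checking if row modification tracking is enabled in schema "RMT".',
]
print(len(group_messages(lines)))  # → 2
```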
Failures related to setup
Example 1: PYTHON_PATH dbm config is not set
$ db2 "call sysproc.logical_backup('-type full -path /home/user/tmp/backup -schema rmt -errorlogdir /home/user/tmp/logs')"
SQL1797N The "SYSPROC.LOGICAL_BACKUP" utility has failed with error
"PYTHON_PATH can not be empty". SQLSTATE=5UA0Q
- No tracelog file is created
- Solution: update PYTHON_PATH database manager configuration to point to python3 and recycle the instance.
Example 2: ibm_db is not installed
$ db2 "call sysproc.logical_backup('-type full -path /home/user/tmp/backup -schema rmt -errorlogdir /home/user/tmp/logs')"
SQL1797N The "SYSPROC.LOGICAL_BACKUP" utility has failed with error "Error
loading DB2 API lib or CLI driver file: /". SQLSTATE=5UA0Q
- Tracelog error:
2024-02-10 08:53:18.350117 INFO: Could not load the Db2 API library or the CLI driver file: /home/user/sqllib/lib64/libdb2.so. Please set up the environment by running db2profile for the user and ensure the Python library ibm_db_dbi is installed.
2024-02-10 08:53:18.350823 ERROR: Error loading DB2 API lib or CLI driver file: /home/user/sqllib/lib64/libdb2.so.
2024-02-10 08:53:18.351022 ERROR: No module named 'ibm_db_dbi'
- Solution: install ibm_db
Example 3: target backup image path is not authorized to be used by user:
$ db2 "call sysproc.logical_backup('-type full -schema rmt -path /home/user/tmp/another_user -errorlogdir /home/user/tmp/logs')"
SQL1797N The "SYSPROC.LOGICAL_BACKUP" utility has failed with error "mkdir:
cannot create directory ‘/home/user/". SQLSTATE=5UA0Q
- Tracelog error:
2024-02-23 10:36:47.851554 TRACE: Current DB User: USER
2024-02-23 10:36:47.852012 TRACE: Running command [mkdir -p "/home/user/tmp/another_user/20240223103647/1/1"].
2024-02-23 10:36:47.852012 TRACE:
2024-02-23 10:36:47.863016 TRACE: RC: 1.
2024-02-23 10:36:47.863016 TRACE: STDOUT: []
2024-02-23 10:36:47.863016 TRACE: STDERR: [mkdir: cannot create directory ‘/home/user/tmp/another_user/20240223103647’: Permission denied
2024-02-23 10:36:47.863016 TRACE: ]
2024-02-23 10:36:47.863016 TRACE:
2024-02-23 10:36:47.863196 ERROR: mkdir: cannot create directory ‘/home/user/tmp/another_user/20240223103647’: Permission denied
- Solution: allow "user" to access the path, or use another path.
Failures during initialize phase (pre-checks)
Logical Backup does place a few restrictions on the schema and tables to be backed up. They are enforced in the “initialize phase”, prior to any data unload.
Example 1: Table type is not supported by LOGICAL_BACKUP
$ db2 "call sysproc.logical_backup('-type full -path /home/user/tmp/backup -schema rmt -errorlogdir /home/user/tmp/logs')"
SQL1797N The "SYSPROC.LOGICAL_BACKUP" utility has failed with error "non-RMT
table "T1".". SQLSTATE=5UA0Q
- Tracelog
2024-02-10 09:03:45.006388 INFO: Checking if row modification tracking is enabled in tables.
2024-02-10 09:03:45.006462 TRACE: Debug: SELECT DISTINCT(TABNAME) FROM SYSCAT.COLUMNS WHERE HIDDEN = 'S' AND COLNAME = 'CREATEXID' AND TABSCHEMA = 'RMT' GROUP BY TABNAME
2024-02-10 09:03:45.083474 ERROR: non-RMT table "T1".
- Solution: ensure table T1 is enabled for row modification tracking.
- Note: the SELECT above returns the list of tables enabled for row modification tracking. The error is produced from the difference between the list of all tables in the schema and the list of tables enabled for row modification tracking.
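The check can be thought of as a simple set difference. A hypothetical sketch of that logic (the helper and the table lists are illustrative; the real scripts obtain both lists from SYSCAT queries):

```python
def find_non_rmt_tables(all_tables, rmt_enabled_tables):
    """Tables present in the schema but not enabled for row modification
    tracking (i.e. lacking the hidden CREATEXID column)."""
    return sorted(set(all_tables) - set(rmt_enabled_tables))

# Example: T1 exists in schema RMT but is not RMT-enabled.
print(find_non_rmt_tables(["T1", "T2"], ["T2"]))  # → ['T1']
```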
Example 2: Index object for a table is defined in another schema
$ db2 "call sysproc.logical_backup('-type full -path /home/user/tmp/backup -schema rmt -errorlogdir /home/user/tmp/logs')"
SQL1797N The "SYSPROC.LOGICAL_BACKUP" utility has failed with error "List of
index names for which there is a mismat". SQLSTATE=5UA0Q
- Tracelog error:
2024-02-10 09:05:52.107628 ERROR: List of index names for which there is a mismatch between indschema and tabschema for an entry in syscat.indexes:
2024-02-10 09:05:52.107628 ERROR: "['USER .INX']"
2024-02-10 09:05:52.108092 ERROR: index not in backup schema "RMT".
- Solution: Ensure schema is self-contained and no related objects exist in other schemas.
Example 3: Logical Backup uses temporary tables to facilitate data unload, and no suitable temp tablespace exists
$ db2 "call sysproc.logical_backup('-type full -path /home/user/tmp/backup -schema rmt -errorlogdir /home/user/tmp/logs')"
SQL0286N A table space could not be found with a page size of at least "4096"
that authorization ID "USER" is authorized to use. SQLSTATE=42727
- Tracelog error:
2024-02-10 09:14:28.552686 ERROR: ExecuteSqlOrRaise exception with query [
2024-02-10 09:14:28.552686 ERROR: DECLARE GLOBAL TEMPORARY TABLE SESSION.EXCLUDE AS
2024-02-10 09:14:28.552686 ERROR: (
2024-02-10 09:14:28.552686 ERROR: SELECT MEMBER,
…
2024-02-10 09:14:28.552686 ERROR: )
2024-02-10 09:14:28.552686 ERROR: WITH DATA NOT LOGGED ON COMMIT PRESERVE ROWS WITH REPLACE
2024-02-10 09:14:28.552686 ERROR: DISTRIBUTE BY HASH(MEMBER) ORGANIZE BY COLUMN]. Details: ibm_db_dbi::ProgrammingError: Statement Execute Failed: [IBM][CLI Driver][DB2/LINUXX8664] SQL0286N A table space could not be found with a page size of at least "4096" that authorization ID "USER" is authorized to use. SQLSTATE=42727 SQLCODE=-286
- Solution: Ensure a USER TEMPORARY tablespace exists and that the user is authorized to use it.
Example 4: On restore, the target schema exists and -drop-existing is not specified
$ db2 "call sysproc.logical_restore('-type full -path /home/user/tmp/backup -schema rmt -errorlogdir /home/user/tmp/logs -timestamp 20240210091731')"
SQL1797N The "SYSPROC.LOGICAL_RESTORE" utility has failed with error "Schema
"RMT" already exists. Full schema-level". SQLSTATE=5UA0Q
- Tracelog error:
2024-02-10 09:17:56.025612 TRACE: Debug: select count(*) as exist from syscat.schemata where schemaname='RMT'
2024-02-10 09:17:56.027031 INFO: Schema "RMT" exists.
2024-02-10 09:17:56.027086 ERROR: Schema "RMT" already exists. Full schema-level restore requires a mandatory -drop-existing option to be specified.
- Solution: Drop the schema explicitly, specify the -drop-existing option, or restore into another schema using the -target-schema option.
Failures during LOGICAL_BACKUP table unload
Logical Backup is not expected to fail during table unload, as all pre-conditions are validated in the “initialize phase”. However, errors are still possible – for example, when the system runs out of disk space, or when errors are encountered during execution of the data unload query. In such situations, the first step is to identify which table hit the problem, the failing statement, and the actual SQLCODE returned by Db2.
Example: LOGICAL_BACKUP command is interrupted
$ db2 "call sysproc.logical_backup('-type full -schema rmt -path /home/user/tmp/backup -errorlogdir /home/user/tmp/logs')"
SQL0952N Processing was cancelled due to an interrupt. SQLSTATE=57014
- Tracelog error (some long lines cropped):
2024-02-23 09:52:13.514383 TRACE: Debug: CREATE EXTERNAL TABLE "RMT"."SYSINS_T1" ("SYSROWID" "SYSIBM ".BIGINT NOT NULL, "A" "SYSIBM ".INTEGER, "B" "SYSIBM ".
2024-02-23 09:52:13.682040 ERROR: ExecuteSqlOrRaise exception with query [CREATE EXTERNAL TABLE "RMT"."SYSINS_T1" ("SYSROWID" "SYSIBM ".BIGINT NOT NULL, "A" "S
2024-02-23 09:52:13.682634 ERROR: Failed to unload table "RMT"."T1":
2024-02-23 09:52:13.682634 ERROR: ibm_db_dbi::InternalError: Statement Execute Failed: [IBM][CLI Driver][DB2/LINUXX8664] SQL0952N Processing was cancelled due to an interrupt. SQLSTATE=57014 SQLCODE=-952
- Next step: Investigate the problem outlined in tracelog (in example above: the interrupt).
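Pulling the failing table and SQLCODE out of tracelog ERROR lines like the ones above can be automated with a short regular-expression sketch (the function name and patterns are illustrative, keyed to the message formats shown in this article):

```python
import re

def extract_failure(tracelog_lines):
    """Find the failed table and the Db2 SQLCODE from tracelog ERROR lines."""
    table = sqlcode = None
    for line in tracelog_lines:
        m = re.search(r'Failed to unload table ("[^"]+"\."[^"]+")', line)
        if m:
            table = m.group(1)
        m = re.search(r"SQLCODE=(-?\d+)", line)
        if m:
            sqlcode = int(m.group(1))
    return table, sqlcode

lines = [
    '2024-02-23 09:52:13.682634 ERROR: Failed to unload table "RMT"."T1":',
    '2024-02-23 09:52:13.682634 ERROR: ... SQLSTATE=57014 SQLCODE=-952',
]
print(extract_failure(lines))  # → ('"RMT"."T1"', -952)
```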
Performance problems
If Logical Backup or Logical Restore completed successfully but with unsatisfactory performance, the first step is to identify which command took a long time to complete. One simple way of doing this is to look at the timestamps of consecutive messages and note the difference. In the example below, the unload of a table took excessive time.
Example: LOGICAL_BACKUP command completes successfully but takes too long to run.
$ db2 "call sysproc.logical_backup('-type full -schema rmt -path /home/user/tmp/backup -errorlogdir /home/user/tmp/logs')"
SQL1796I Logical backup utility has completed successfully, timestamp for the
backup image is "20240223090322". SQLSTATE=01541
- Timestamp traces from tracelog
2024-02-23 09:03:23.811072 TRACE: Debug: INSERT INTO "RMT"."SYSINS_T1" ("SYSROWID", "A", "B", "C") WITH
2024-02-23 09:03:23.811072 TRACE: EXCLUDE_MAX(MEMBER, TID) AS (SELECT MEMBER, MAX(TID)
2024-02-23 09:03:23.811072 TRACE: FROM SESSION.EXCLUDE GROUP BY MEMBER)
…
2024-02-23 09:03:23.811072 TRACE: </OPTGUIDELINES>*/
2024-02-23 09:03:23.811072 TRACE:
2024-02-23 09:13:01.273538 TRACE: Connection <ibm_db_dbi.Connection object at 0x7fd4537e19b0> : Successfully rollback external table created.
2024-02-23 09:13:01.273637 TRACE: set_autocommit: True
- Next step: notice that the INSERT command start timestamp is 09:03:23.811072 and the next command start timestamp is 09:13:01.273538, indicating that the INSERT took almost 10 minutes to process. Further steps are to investigate the performance of the INSERT – check disk I/O speed (since it is inserting into an external table), or analyze the explain plan of the query that feeds the insert.
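Scanning a tracelog for large gaps between consecutive message timestamps can be scripted. A sketch under the timestamp format shown in the examples above (the threshold, names, and reporting style are illustrative):

```python
import re
from datetime import datetime

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) ")

def find_slow_steps(lines, threshold_seconds=60.0):
    """Report gaps larger than the threshold between consecutive tracelog
    timestamps; the command logged before a gap is the slow one."""
    gaps = []
    prev_ts = prev_line = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if prev_ts is not None:
            delta = (ts - prev_ts).total_seconds()
            if delta > threshold_seconds:
                gaps.append((delta, prev_line))
        prev_ts, prev_line = ts, line
    return gaps

lines = [
    "2024-02-23 09:03:23.811072 TRACE: Debug: INSERT INTO ...",
    "2024-02-23 09:13:01.273538 TRACE: Connection ... rollback external table created.",
]
for delta, line in find_slow_steps(lines):
    print(f"{delta:.0f}s spent after: {line}")
```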
Hangs
An investigation of hangs, or suspected hangs, can start by looking at the last few commands logged in the tracelog file. Commands are logged before they are executed, but successful execution may not always be indicated. Also keep in mind that the script is multi-threaded, so it may not be obvious which of the last few commands is hanging. In this case the tracelog file information should be used to narrow the scope.
Example: LOGICAL_BACKUP command hangs.
$ db2 "call sysproc.logical_backup('-type full -schema rmt -path /home/user/tmp/backup -errorlogdir /home/user/tmp/logs')"
<command does not return>
- Last commands logged in tracelog (no ERROR messages):
2024-02-23 08:48:36.000245 TRACE: Debug:
2024-02-23 08:48:36.000245 TRACE: DELETE FROM SYSLOGICALBAR.SYSDDLHISTORY
2024-02-23 08:48:36.000245 TRACE: WHERE TABSCHEMA = 'RMT' AND (
…
2024-02-23 08:48:36.026175 TRACE: Debug:
2024-02-23 08:48:36.026175 TRACE: DELETE FROM SYSLOGICALBAR.SYSINVISIBILITYLIST
2024-02-23 08:48:36.026175 TRACE: WHERE BACKUPID IN
…
2024-02-23 08:48:36.058725 TRACE: Debug:
2024-02-23 08:48:36.058725 TRACE: DELETE FROM SYSLOGICALBAR.SYSBACKUPIDS
2024-02-23 08:48:36.058725 TRACE: WHERE SCHEMANAME = 'RMT' AND BACKUPIDLAST != '20240223084833'
2024-02-23 08:48:36.058725 TRACE:
- Next step: One of these DELETE commands is hanging; further investigation is required. For example, check lock waits and workload management queueing, dump stacks, and perform application and system load analysis.
Conclusion
Logical Backup and Restore are very powerful utilities that underneath execute a complex process consisting of many suboperations, usually in the form of SQL statements.
As we discussed in this blog, to diagnose failures it is important to start with the SQLCODE returned, and then look into the corresponding tracelog for more details. In most user-error cases, this is sufficient to identify the problem and point to steps to remediate it.
If a further investigation is required and IBM support needs to be involved, the information collected in the tracelog file is crucial for a complete analysis.
We hope this blog has given you more insight into the internals of Logical Backup and Restore, along with some useful guidelines for doing your own troubleshooting.
Aleks Santars is an alumnus of the University of Waterloo and joined IBM Db2 LUW almost 20 years ago as a co-op student.
To this day he is fascinated with Db2 and Db2's impact on the world, the never-ending stream of challenges, and the amazing team of people who are fun to work with and to learn from.
Currently Aleks is a Senior Software Developer in the Db2 engine and oversees a few projects, including Logical Backup and Restore.