Db2 (LUW) User Defined Functions
This article will explore User Defined Functions (UDFs) on the Db2 (LUW) platform. Most of the functionality is also available on Db2 for iSeries. The Db2 for z/OS platform only supports a subset of the UDF functionality available on Db2.
UDF Types – External and Native SQL
In the early days of Db2, the only way to produce a UDF for Db2 was to write it in a third-party language. These are known as “external UDFs”. It is still possible to do this. Most UDFs are now written in SQL. However, there are several reasons why developing an external UDF may still be the best option –
- Where performance of the UDF is extremely time critical and a highly efficient language such as C or C++ can provide the necessary boost
- Where a third-party language has prebuilt libraries which provide functionality that applications would like to exploit within SQL
- Where developer skills are available in one of the supported languages
Recently there has been a revival in interest in external UDFs, since Db2 recently has provided support for UDFs written with Python.
External UDFs can either be scalar (returning a single value) or table (returning a table).
On all Db2 platforms, external UDFs can be written in C, C++, Java and Python. Additionally on the Windows platform there is support for UDFs written as a .Net function or an OLE automation object. Finally on platforms that support OLE DB, it is possible to produce a table function which wraps an OLE DB provider.
As an alternative to writing stored procedures in an external language Db2 supports writing UDFs using SQL. Db2 implements the standard SQL/PSM procedural extensions as well as support for Oracle’s proprietary PL/SQL dialect.
This article focuses primary on SQL UDFs.
A generic syntax diagram for SQL UDFs is –
CREATE OR REPLACE FUNCTION <function-name> (IN|OUT|INOUT <param-name> <data-type>, …) RETURNS <data-type> <option-list> <SQL-function-body>; |
Scalar UDFs – Inline and Full Function
The simplest form of SQL UDF is the “inline UDF”, which consists of a function body which is a single SQL expression. For example, here is a simple UDF to return an input date field as a CHAR(8) expression in YYYYMMDD format.
CREATE FUNCTION UDF_DATE2CHAR_INLINE (paramDate DATE) RETURNS CHAR(8) LANGUAGE SQL CONTAINS SQL NO EXTERNAL ACTION DETERMINISTIC RETURN CAST(TO_CHAR(paramDate,'YYYYMMDD') as CHAR(8)); |
Note the single SQL expression, beginning with the “CAST” function. Since the default parameter type is “IN” this does not need to be specified.
By contrast the same function written with a body that can include any number of SQL/PSM statements could be coded as follows –
CREATE FUNCTION UDF_DATE2CHAR (paramDate DATE) RETURNS CHAR(8) LANGUAGE SQL CONTAINS SQL NO EXTERNAL ACTION DETERMINISTIC BEGIN RETURN CAST(TO_CHAR(paramDate,'YYYYMMDD') as CHAR(8)); END |
Note that the RETURN statement is now surrounded by a BEGIN / END clause.
While these two UDFs produce the same result, internally the implementation is different.
Here are selected columns from SYSCAT.ROUTINES for these two functions which highlight these differences –
ROUTINENAME |
IMPLEMENTATION |
LIB_ID |
UDF_DATE2CHAR |
db2pvm!pvm_entry |
847245216 |
UDF_DATE2CHAR_INLINE |
- |
- |
Note that the inline UDF has both the IMPLEMENTATION and LIB_ID columns set to NULL. The LIB_ID column is the cross-reference to SYSCAT.PACKAGES and shows that there is no statically defined package for an inline UDF.
On the Db2 for z/OS platform, the presence of a package and thus the need to load the package into the package cache can present a significant overhead for UDFs that are used frequently and where there is limited memory available. To investigate whether a similar situation occurred on Db2 (LUW) a series of tests were conducted where the inline and full-function UDF were invoked 200k times. Thankfully there was little difference in either the CPU or execution times, and the explain plans looked virtually identical.
For all but the simplest UDFs, full-function UDFs are easier to write and maintain and offer a range of benefits such as ability to code exception handling. So in general it is better to implement full-function UDFs for most UDFs.
As an example of why this is important, consider the reverse function to the one previously shown, which will take a CHAR(8) string in “CCYYMMDD” format and return a DATE. Unlike the previous function where the input date is strictly controlled by Db2 data typing the input character string could contain any 8 characters, not just those which represent a date. So more rigorous input checking and associated error handling must be introduced. The code for such a function could be implemented –
CREATE OR REPLACE FUNCTION UDF_CHAR2DATE (paramChar CHAR(8)) RETURNS DATE LANGUAGE SQL CONTAINS SQL NO EXTERNAL ACTION DETERMINISTIC BEGIN DECLARE varDate DATE DEFAULT NULL; DECLARE CONTINUE HANDLER FOR SQLSTATE ‘22007’ BEGIN SET varDate = NULL; END; SET varDate = DATE(TO_DATE (SUBSTR(CHAR(paramChar),1,4) CONCAT ‘-‘ CONCAT (SUBSTR(CHAR(paramChar),5,2) CONCAT ‘-‘ CONCAT (SUBSTR(CHAR(paramChar),7,2), ‘YYYY-MM-DD’)); RETURN varDate; END |
In this example we see the following features –
- Multiple statements within the UDF body with each statement terminating in a semicolon
- Definition of a variable (varDate)
- Definition of an error handler to deal with any issues with conversion of the input character string into date format
Since this is just a short example, it does not cover all scenarios – for example we might want to add code to ensure that the three components of a date (year, month and day) are within correct ranges so that any errors returned could be more specific.
The UDFs shown so far are known as “scalar UDFs”. This means that they return a single value from an execution of the UDF. Scalar UDFs are invoked in exactly the same way as the many scalar functions provided as part of the core Db2 functionality -
SELECT UDF_DATE2CHAR(MY_DATE), -- execution of UDF YEAR(MY_DATE) -- execution of Db2 inbuilt function FROM MYTABLE ; |
Table UDFs
Scalar functions are by far the most common type of UDFs that are developed. However, there is a growing interest in developing table UDFs – a type of UDF where the returned result is a table rather than a single value. Db2 provides many built-in table functions, particular for returning monitoring information, and as experience of using these has grown the possibilities for user-defined table functions have become obvious.
A simple table function can wrap a SELECT statement. For example, this is a table functions which takes a date as input and returns all tables which haven’t been used since this date -
CREATE FUNCTION UDF_OLD_TABLES (paramDate DATE) RETURNS TABLE (tabschema VARCHAR(128) ,tabname VARCHAR(128)) LANGUAGE SQL CONTAINS SQL NO EXTERNAL ACTION DETERMINISTIC BEGIN RETURN SELECT TABSCHEMA, TABNAME FROM SYSCAT.TABLES WHERE LASTUSED < paramDate; END |
Here is a SELECT which utilizes this UDF to return tables from schema “MYSCHEMA” which have not be used this year –
SELECT t.TABNAME FROM TABLE (UDF_OLD_TABLES(‘2023-01-01’)) AS t WHERE t.TABSCHEMA = ‘MYSCHEMA’; |
Note the following –
- The general pattern for calling a table UDF is “TABLE (udf(params)) AS correlation-token”
- A table UDF can then be accessed as any other table or inner SELECT -
- Only the required columns need selected
- Columns from the table UDF can also be used in predicates (or any other SQL function)
Comparing Table UDFs with Alternative Solutions
As we have seen, table UDFs allow storing code in the database to manipulate data and return a table of data. There are at least two alternatives which provide similar functionality –
- Views
- Stored procedures
In this section we will compare these three options.
Firstly we will compare table UDFs to views. Both produce a tabular result which can be accessed from a SELECT statement. However what is possible within a view definition is much more limited since it can only contain a single SQL expression. This makes complex logic challenging to write, in the same way that writing complex logic in an inline scalar UDF is difficult. In addition, it is not possible to have user-defined error handling in views – what errors are returned are determined by standard Db2 rules.
By contrast, stored procedures support the full SQL/PSM functionality just like UDFs. In fact, stored procedures are able to return multiple result sets to an application, whereas a table UDF is only able to return a single tabular result set. In practice however, this functionality is little used – partly because coding the processing of multiple result sets can be complex but also because many application frameworks do not support this feature.
The main difference between table UDFs and stored procedures is how they are executed – stored procedures are invoked using the CALL statement while table UDFs are accessible from standard SQL statements such as SELECT. While in general the difference in calling mechanism is not an issue when invoking directly from third-party programming languages, there are many tools and frameworks which only support the use of SELECT to obtain data and therefore cannot invoke stored procedures. In these situations the ability to call table UDFs from SELECT is critical – even if only used as a wrapper for an existing stored procedure.
Unique Table Functionality : The PIPE Statement
There is one situation where using a table UDF has significant benefits over using a stored procedure even where both invocation methods are available. This is when result rows must be returned based on complex logic. Often input parameters are used to retrieve some information, perform some calculations and then a decision is made whether the calculated values should be returned or not.
The traditional method of coding such functionality involves the use of a temporary table, with the pattern for the stored procedure or function being –
- Read and validate input parameters
- Define temporary table
- Define cursor to read temporary table
- Perform logic and insert data into temporary table as required
- Open cursor on temporary table
- Return to calling application
This involves a reasonably complicated amount of coding, typically involving checking that the temporary table does not already exist or contain data (not all temporary tables are as temporary as most people imagine). When trying to code this within a UDF, getting it to work can be a challenge, due to more rigid scoping rules for temporary tables and cursor definitions – perhaps designed to indicate that there is a better alternative.
That alternative is the PIPE statement which is very simple –
PIPE(variable1, variable2, ….); |
At each point where previously an INSERT into the temporary table would have been required, simple code a PIPE statement to indicate the data needs to be returned as part of the returned table. No definition or management of temporary tables and cursor is required, making the coding much easier.
However, this is not the only benefit that PIPE brings. As well as the complex coding, using temporary tables can result in issues with logging, particularly in environments such as HADR where BLOCKNONLOGGED should be set to disallow non-logged operations. It is often a surprise to many people that writing data to DGTTs (Declared Global Temporary Tables) is a logged operation. If the result set to be returned from the stored procedure or table UDF is large, either “log full” errors can be experienced or the amount of logging being carried out by this single process can be disruptive to other applications.
By contrast the PIPE command does not seem to exhibit any of these issues, and also has similar or better performance as compared to the temporary table solution. In our testing we were able to PIPE out hundreds of millions of rows without issue even on a very modest system, where we started to exhibit logging issues within a few hundred thousand rows.
One small difference is that PIPE does not allow the order of data to be returned to be specified as you would be able to do with a cursor using the ORDER BY clause. So when using PIPE in a table UDF ordering will have to be included in the calling SELECT statement.
Please be aware that the PIPE functionality is only available on Db2 (LUW) and Db2 for iSeries, not Db2 for z/OS.
To allow us to contrast the use of a temporary table as compared to the PIPE command there follows a simple (albeit contrived) example. It reads the EMPLOYEE table from the SAMPLE database and returns all the data in it multiple times.
First we have the code using a DGTT to store the data to be returned temporarily.
CREATE OR REPLACE FUNCTION NoPipeTest() RETURNS TABLE (EMPNO CHAR(6) ,LASTNAME VARCHAR(15) ,BIRTHDATE DATE) LANGUAGE SQL MODIFIES SQL DATA SPECIFIC NOPIPETEST NOT DETERMINISTIC NO EXTERNAL ACTION CALLED ON NULL INPUT BEGIN
DECLARE varCount INTEGER;
DECLARE TABLE_ALREADY_EXISTS CONDITION FOR '42710' ; DECLARE CONTINUE HANDLER FOR TABLE_ALREADY_EXISTS BEGIN END;
DECLARE GLOBAL TEMPORARY TABLE SESSION.RETURN_DATA (EMPNO CHAR(6) ,LASTNAME VARCHAR(15) ,BIRTHDATE DATE ) WITH REPLACE ;
SET varCount = 1;
my_loop: LOOP FOR C0 AS SELECT EMPNO, LASTNAME, BIRTHDATE FROM EMPLOYEE DO INSERT INTO SESSION.RETURN_DATA (EMPNO, LASTNAME, BIRTHDATE) VALUES (C0.EMPNO, C0.LASTNAME, C0.BIRTHDATE) ; END FOR; SET varCount = varCount + 1; IF varCount = 1000 THEN LEAVE my_loop; END IF; END LOOP my_loop; RETURN SELECT EMPNO, LASTNAME, BIRTHDATE FROM SESSION.RETURN_DATA; END # |
In this case we return the data inflated 1000 times, which is all that could be achieved on the test system before hitting a log full condition.
By contrast the same functionality implemented with a PIPE statement is –
CREATE OR REPLACE FUNCTION PipeTest() RETURNS TABLE (EMPNO CHAR(6) ,LASTNAME VARCHAR(15) ,BIRTHDATE DATE) LANGUAGE SQL --MODIFIES SQL DATA SPECIFIC PIPETEST NOT DETERMINISTIC NO EXTERNAL ACTION CALLED ON NULL INPUT BEGIN
DECLARE varCount INTEGER;
SET varCount = 1;
my_loop: LOOP FOR C0 AS SELECT EMPNO, LASTNAME, BIRTHDATE FROM EMPLOYEE DO PIPE (C0.EMPNO, C0.LASTNAME, C0.BIRTHDATE) ; END FOR; SET varCount = varCount + 1; IF varCount = 1000 THEN LEAVE my_loop; END IF; END LOOP my_loop;
RETURN; END # |
The code is much shorter (32 lines rather than 45 when all similar components are formatted identically). It is also much simpler, with all the code to set up and manage the DGTT removed and a single PIPE statement added. Also in this case we were able to run the data inflated by 1 million times without any issues, as compared to a maximum of about 1000 for the DGTT example.
Other (Less Common) UDF Types
Scalar and table UDFs are by far the most common UDFs that are coded. In the following section we will explore other, more esoteric types of UDFs –
- Row UDFs
- Sourced UDFs
- Function Templates and Mappings
- User-defined Aggregate Functions
Row UDFs
A row UDF is used in conjunction with structured data types, which have been available as part of Db2’s object-relational functionality for many years but are rarely used.
It takes as input a variable which is a structured data type and returns a row of data where the various elements of the structured data type have been extracted.
For example, consider the following structured type definition –
DECLARE TYPE AIRCRAFT AS ROW (REGISTRATION CHAR(10) ,AIRCRAFT_TYPE VARCHAR(50) ,CONSTRUCTORS_NUMBER CHAR(20)); |
A function can then be written to unpack the data from the structured type -
CREATE FUNCTION FROM_AIRCRAFT (A AIRCRAFT) RETURNS ROW (REGISTRATION CHAR(10) ,AIRCRAFT_TYPE VARCHAR(50) ,CONTRUCTORS_NUMBER CHAR(20)) LANGUAGE SQL CONTAINS SQL NO EXTERNAL ACTION DETERMINISTIC RETURN VALUES (A..REGISTRATION ,A..AIRCRAFT_TYPE ,A..CONSTRUCTORS_NUMBER); |
Note the use of the “double dot” syntax to unpack the various elements of the structured type variable (input parameter “A” of type “AIRCRAFT”).
Sourced UDFs
Sourced UDFs are also related to Db2’s object-relational functionality.
Distinct types can be defined, using built-in data types as a basis. For example, we may decide that we want to ensure that in all situations where we are dealing with monetary values we want to consistently use DECIMAL(9,2) as the underlying definition. To do this we can define a distinct type called MONEY, which uses DECIMAL(9,2) as the basis –
CREATE DISTINCT TYPE MONEY AS DECIMAL(9,2) WITH COMPARISONS; |
We can then create tables using MONEY as the data type -
CREATE TABLE ORDER (ORDER_ID BIGINT ,SALE_AMOUNT MONEY); |
We can then add data into the table using a regular INSERT statement –
INSERT INTO ORDER (ORDER_ID, SALE_AMOUNT) VALUES (1,100.25), (2,99.75); |
But when we then try to use Db2’s built-in functions to analyse the data we encounter an error -
SELECT SUM(SALE_AMOUNT) AS TOTAL_SALES FROM ORDER;
SQL0440N No authorized routine named "SUM" of type "FUNCTION" having compatible arguments was found. SQLSTATE=42884 |
This is because there is not built-in function called SUM() which takes data type MONEY as an input parameter. To allow this to happen we have to create a sourced UDF which uses the built-in function as its source.
CREATE FUNCTION SUM (MONEY) SYSIBM.SUM(DECIMAL(9,2));
SELECT SUM(SALE_AMOUNT) AS TOTAL_SALES FROM ORDER;
TOTAL_SALES =========== 200.00 |
The need to do this for every built-in function that you want to use with a user-defined data type might just be one of the reasons why this functionality is rarely used!
Function Templates and Mappings
Function templates and mappings are used to provide and control access to functions on federated systems. There are two situations where this is useful –
- Where both the local and remote systems have a function with the same name and you want to favour use of one over the other, either because they have different functionality or one of them provides better performance and the optimizer does not choose it
- Where a function exists on the remote server that you wish to exploit on the local system
First we define a local function template -
CREATE FUNCTION UDF_CHAR2DATE (CHAR(8)) RETURNS DATE AS TEMPLATE DETERMINISTIC NO EXTERNAL ACTION; |
Then we provide a function mapping to link the local function template to the function on the remote system. This can either be done for an individual remote system, or for all systems with the same federation type.
This example maps a single system -
CREATE FUNCTION MAPPING MY_ZOS_CHAR2DATE FOR UDF_CHAR2DATE(CHAR(8)) SERVER ZOS1 OPTIONS (REMOTE_NAME 'CHAR2DATE'); |
This example maps to all systems where the remote federated server is of type DB2/ZOS (which is the federated type for Db2 for z/OS systems) -
CREATE FUNCTION UDF_CHAR2DATE (CHAR(8)) RETURNS DATE AS TEMPLATE DETERMINISTIC NO EXTERNAL ACTION; CREATE FUNCTION MAPPING MY_ZOS_CHAR2DATE FOR UDF_CHAR2DATE(CHAR(8)) SERVER TYPE DB2/ZOS OPTIONS (REMOTE_NAME 'CHAR2DATE'); |
If there are functions of the same name on both the local and remote system and you want to disallow use of the remote function then use the "DISABLE" option on the function mapping.
Aggregate Functions
Db2 provides a large set of aggregate functions for use in both grouping sets and OLAP functions.
However, if a situation is encountered where a suitable aggregate function does not exist then it is possible to write a user-defined aggregate function. You might want to do this for a number of reasons including –
- You want to implement a type of aggregation using an algorithm which Db2 does not provide
- You want to perform an aggregation at a different level of precision to the one provided by the built-in aggregation function
An aggregate function is made up of four components (three stored procedures and a UDF) which are brought together by a wrapper function created with the AGGREGATE keyword. Communication between the various components is by way of state variables which are defined as part of the function definition in a WITH clause. The aggregate function definition specifies the four components using four keywords –
- INITIALIZE: a stored procedure which sets up the execution environment and returns OUT parameters matching the state variables
- ACCUMULATE: a stored procedure which adds the information from a single row of input data to the accumulated totals. It takes the value from the input row as an IN parameter and also has the state variables as INOUT parameters
- MERGE: a stored procedure, which is used to bring together data from different nodes of a multi-node cluster. It contains two sets of parameters matching the state variables, one set defined as IN and the other set defined as INOUT
- FINALIZE: a UDF which produces the final result. It has input parameters matching the state variables
The following code implements an aggregation function MYMEAN() which mimics the functionality of the built-in MEAN() function but with different precision.
First we define the INITIALIZE stored procedure -
CREATE OR REPLACE PROCEDURE mymean_initialize (OUT sum DECFLOAT ,OUT count INT) LANGUAGE SQL CONTAINS SQL BEGIN SET sum = 0; SET count = 0; END |
Here we initialize the two state variables required to calculate the mean. The first one keeps a running total of the values whereas the second one keeps track of the number of rows.
Then we define the ACCUMULATE stored procedure –
CREATE OR REPLACE PROCEDURE mymean_accumulate (IN input DECFLOAT ,INOUT sum DECFLOAT ,INOUT count INT) LANGUAGE SQL CONTAINS SQL BEGIN SET sum = sum + input; SET count = count + 1; END |
The value from the input row is passed into this stored procedure as the IN parameter. It is added to the “sum” state variable. The “count” state variable is incremented by one.
Next we define the MERGE stored procedure -
CREATE OR REPLACE PROCEDURE mymean_merge (IN sum DECFLOAT ,IN count INT ,INOUT mergesum DECFLOAT ,INOUT mergecount INT) LANGUAGE SQL CONTAINS SQL BEGIN SET mergesum = sum + mergesum; SET mergecount = count + mergecount; END |
The two IN parameters are values which would be passed into the stored procedure in turn from each of the partitions. On each execution of this stored procedure the cumulative totals for all partitions processed are returned on the INOUT parameters, so that after this procedure has been invoked for every partition the INOUT values contain the total of the state variables from each partition.
And lastly we define the UDF which provides the FINALIZE functionality -
CREATE OR REPLACE FUNCTION mymean_finalize (sum DECFLOAT ,count INT) LANGUAGE SQL CONTAINS SQL RETURNS DECFLOAT(34) BEGIN RETURN (sum / count); END |
This function performs the final calculation, which in the case of a mean function simply divides the total accumulated value from all rows, provided by the “sum” state variable, by the number of rows processed, provided by the “count” state variable.
To bring all these pieces together we finally define the aggregate function mymean() itself -
CREATE OR REPLACE FUNCTION mymean(DECFLOAT) RETURNS DECFLOAT(34) AGGREGATE WITH (sum DECFLOAT, count INT) USING INITIALIZE PROCEDURE mymean_initialize ACCUMULATE PROCEDURE mymean_accumulate MERGE PROCEDURE mymean_merge FINALIZE FUNCTION mymean_finalize |
Note how we define the two state variables, “sum” and “count”, in the “WITH” clause and how we define the four components with the “USING” clause.
To demonstrate the difference between the built in AVG() function and our MYMEAN() UDF -
db2 "select avg(salary) as SYS_AVG from employee"
SYS_AVG ------------------------------ 58155.357142857142857142857142
db2 "select db2inst1.mymean(salary) as MYMEAN from employee"
MYMEAN ----------------------------------- 58155.35714285714285714285714285714 |
That brings to an end our journey through the various types of user defined functions available within Db2 (LUW). For reference, much of this functionality is available on Db2 for iSeries but only a small subset is available on Db2 for z/OS.