MainframeSupport tip about UNICODE trouble in DB2 version 8 and later

MainframeSupports
tip week 33/2009:

DB2 version 8 and later supports data stored in UNICODE format. It means that some characters occupies one byte, others two bytes and in some rare cases a character may occupy up to four bytes. When DB2 is or was migrated to version 8 the characters in the catalog is or was converted to UNICODE. IBM has decided for all of us that storing good old EBCDIC characters in the DB2 catalog is over. Does it or did it have any impact at your installation?

The answer depends on whether the language you are using contains language specific characters or not. If you have used or uses any language specific characters in table names, column names or other DB2 names you have an impact. You may also have an impact on the home grown programs reading from the DB2 catalog. If you use language specific characters and wants to store them in ordinary DB2 tables as UNICODE you also have to plan very carefully. There may be a lot of challenges ahead. For ordinary DB2 tables fortunately you still have the choice of keeping them in EBCDIC or ASCII format.

The problems arise because language specific characters stored in UNICODE will occupy two bytes while the part of the alphabet common with the english alphabet will occupy one byte. First problem is the SUBSTR function which works on bytes and not on characters. Any SUBSTR on a CHAR/VARCHAR column in a UNICODE table is doomed to misery if the column contains characters occupying more than one byte. IBM has thought about this and has introduced the SUBSTRING function which works as SUBSTR, but on a character level instead of byte level. Dear reader, if you are working with a language containing language specific characters, please use SUBSTRING instead of SUBSTR.

In order to make us developers work even more, IBM has introduced a fourth parameter on SUBSTRING which must be present. Therefore a simple change all of SUBSTR to SUBSTRING is not enough. Some examples:

SELECT SUBSTRING('æøå', 1, 1, CODEUNITS32) FROM SYSIBM.SYSDUMMY1; SELECT SUBSTRING('æøå', 1, 1, CODEUNITS16) FROM SYSIBM.SYSDUMMY1; SELECT SUBSTRING('æøå', 1, 1, OCTETS) FROM SYSIBM.SYSDUMMY1;

The characters æ, ø and å are language specific characters for danish. If the table were in EBCDIC format you would use a SUBSTR('æøå', 1, 1). Now you have to choose from three different flavours. Fortunately the OCTETS parameter can be eliminated, because it turns SUBSTRING into an ordinary SUBSTR. Unfortunately I cannot tell you when to use CODEUNITS16 or CODEUNITS32. I am pretty sure that CODEUNITS16 is sufficient in most european countries using a latin alphabet, but I am unable to decipher IBM's explanations about when to use which parameter. By the way the SUBSTRING function is made available to you when DB2 version 8 runs in New Function Mode. Your installation may be running version 8 in Compability Mode and then SUBSTRING is not available. SUBSTRING is of course available in all stages of version 9 and later.

If you have reached this paragraph without having a shock or without quite understanding the consequences of what you have been reading so far, I am pretty sure the shock will come eventually. With DB2 version 8 IBM has managed to introduce a new version of DB2 which is not backward compatible. If you turn your DB2 table into a UNICODE table you may experience that your programs does not work any more unless you make changes to them. Have a nice day.

Previous tip in english Sidste danske tip Tip list

MainframeSupportstip week 33/2009:

MainframeSupports
tip week 33/2009: