Multibit Variable Width Encoded Datatypes - Questions To Myself

As some errors made by programmers are so blindingly obvious, you do not need to check whether the code would even, for example, compile!

Here is an example, in C.

  int main()
  {
    char test1a[3] = "abc";
    char test1b[3] = {'a','b','c'};
    char test2a[3] = "abcd";
    char test2b[3] = {'a','b','c','d'};
    return 0;
  }

I haven’t checked that this code would compile. Why would I do this? It is so obvious that the default behaviour of the compiler would be to raise a warning during compilation as a worst-case scenario, or as the code is provably wrong, just fail outright with a compilation error (and if the compiler reports nowt, change compiler, and/or programming language!).

I hope you spotted the bug; variables test2a and test2b don’t have ‘space’ for 4 characters – they are defined with space allocated for 3 characters. I am not even a C programmer, and if you are not either, it doesn’t matter, my point is that the issue is so obvious. It isn’t even difficult to see that execution of the binary as a result of compilation, if it passed the compilation phase, would even result in some sort of buffer overrun.

In summary the example above shows an array explicitly defined of size of 3 char; 3 characters, 3 bytes, and “abcd” is 4 characters, 4 bytes, ….

Another programming language example, this time Oracle PL/SQL.

SQL>
SQL>
SQL> CREATE OR REPLACE PACKAGE test_pkg AS
  2     PROCEDURE doStuff0;
  3     PROCEDURE doStuff10;
  4     PROCEDURE doStuff11;
  5  END test_pkg;
  6  /

Package created.

SQL>
SQL>
SQL> CREATE OR REPLACE PACKAGE BODY test_pkg AS
  2      x11 CONSTANT VARCHAR2(10) := 'abcdefghijk';  --11 characters
  3      x10 CONSTANT VARCHAR2(10) := 'ABCDEFGHIJ';   --10 characters
  4     PROCEDURE doStuff0 IS
  5      BEGIN
  6       NULL;
  7      END;
  8     PROCEDURE doStuff10 IS
  9      BEGIN
 10       DBMS_OUTPUT.PUT_LINE(x10);
 11      END;
 12     PROCEDURE doStuff11 IS
 13      BEGIN
 14       DBMS_OUTPUT.PUT_LINE(x11);
 15      END;
 16  END;
 17  /

Package body created.

SQL>
SQL>
SQL> EXEC dbms_output.enable(100000);

PL/SQL procedure successfully completed.

SQL>
SQL> EXEC dbms_output.put_line('test')
test

PL/SQL procedure successfully completed.

SQL>
SQL>
SQL> EXEC test_pkg.doStuff0;
BEGIN test_pkg.doStuff0; END;

*
ERROR at line 1:
ORA-06502: PL/SQL: numeric or value error: character string buffer too small
ORA-06512: at "XXX.TEST_PKG", line 2
ORA-06512: at line 1


SQL> EXEC test_pkg.doStuff0;

PL/SQL procedure successfully completed.

SQL> EXEC test_pkg.doStuff0;

So as above, no default behaviour error or compiler warning (10.2.0.1.0 on Solaris). This is not very good. Furthermore the error is observed at runtime, during package initialisation, but the error is only observed the first time a method within the PSM package is invoked, not the second or subsequent times. This is worse. I assume that the error is observed the first time the method is invoked, not subsequent invocations, as the package is already initialised, despite failing.

In summary, as I have written the prior paragraph, things have gone from ‘not very good’ to ‘worse’.

When I encountered this issue with Oracle, some months back now, I wanted a mechanism to find issues of this type. Naively I queried USER_OBJECTS or ALL_OBJECTS for the object status of INVALID. No such luck! I wanted too for the bug to alternatively be related to the database character sets (WE8ISO8859P1 and AL16UTF16). No such luck there either! This is truly awful. I am confident that compilation of my C program above would result in at least a warning message as default compiler behaviour, but hopefully attempts to compile the database package would have instead resulted in any sort of warning or error, and not the observed only at runtime scenario, and only then sometimes.

In summary, in light of the additional paragraph above, things have gone from ‘not very good’, to ‘worse’, to ‘truly awful’.

I think the PL/SQL issue above is just crap programming from Oracle. I do not believe I have misunderstood the issue here. I can see their (Oracle’s) perspective however. They built PL/SQL upon IDL (DIANA) and Ada, all ancient stuff, they did an initial poor job, and their technology is still tied to all this legacy crap, and since then there are billions of lines of PL/SQL code in the wild. It is just not possible for them to roll out the opposite of a breaking change to fix the issue.

Warnings for this type of error can be enabled with:

ALTER SESSION SET PLSQL_WARNINGS = 'ENABLE:ALL';

Not enabled as default behaviour, but it is what it is, and the ‘compiler’ does produce a warning in this case.

Lastly, and before I move on, it is noted that Oracle database PSM package initialisation has recently changed in 12c. Deity Steven Feuerstein recently documented the behavioural changes on his blog.

But (1) …. the VARCHAR2 size definition in Oracle is actually one of two options, VARCHAR2(10 CHAR) or VARCHAR2(10 BYTE) (the PL/SQL above uses the default, where either CHAR or BYTE has not been specified) and in single byte character sets for example, a VARCHAR2(10 BYTE) declaration will allocate space for 10 characters/bytes, but in multibyte character sets, say two byte character sets, only space for 5 characters would be allocated. This means that the same code may execute fine with one character set, but error with another, but, as observed above, the runtime error would only be observed the first time, then nothing after that. This is insane behaviour!

So what is the Oracle default, CHAR or BYTE? The answer to this question is that it is a database configuration setting – parameter NLS_LENGTH_SEMANTICS. Can you imagine the support and triaging issues required to nail down this type of PSM package initialisation buffer overflow, coupled with all JIRA/Remedy ticketing harassment, concluding at some point with someone saying “It works OK for me”. The warning signs are (formally) there too – Oracle documentation on this issue specifically writes:

Caution: Oracle strongly recommends that you do NOT set the NLS_LENGTH_SEMANTICS parameter to CHAR in the instance or server parameter file. This may cause many existing installation scripts to unexpectedly create columns with character length semantics, resulting in runtime errors, including buffer overflows.

But (2) …. given the option to specify the size of a variable in either BYTEs or CHARs, and putting aside the Oracle and PL/SQL oddies/issues/bugs/observations/features (whatever you want to call them), has a mainstream programming language ever existed where a string can be defined in bits and compressed? And if so, does this programming language exist now?

I am not positing some home-rolled compressed binary datatype representing a string, my mindset is more in line with manipulating a string as a first class citizen native datatype in every way, comparable to other native language strings, that could, in a contrived language, be both declared and initialised as below:

 string[10 bits] = "AAAAAAAAAbbbbbbbbbbbbc";

The string “AAAAAAAAAbbbbbbbbbbbbc” is clearly many characters, and would not fit into a string defined as 10 chars, but with a definition of 10 bits, it could – a variant of Huffman coding/compression immediately comes to mind.

Bar some very niche application in a very specialised domain, I cannot think that a programming language of this type has any application nowadays. Gone are the days where it was necessary to store dates with 2 digits rather than 4 digits to shave away few bytes here and there – a 2TB Hard Drive retails at a £74.99 or less if you shop around. I have no knowledge whether in days gone by, there ever existing a programming language where strings could be declared in bits and stored in a reduced form.

I’ve posted the same question twice above. So to answer my own questions – what is the answer? I don’t know.

— Published by Mike, 09:42:15 24 July 2017 (CET)

Connect with:

Multibit Variable Width Encoded Datatypes - Questions To Myself

Leave a Reply

Archive

Categories