Kevin Bacon is one cool dude. From his performances in The River Wild and Footloose, to the unforgettable Hollow Man, he constantly exudes a cool cat persona. It is undeniable.
Outside the cinema, the man’s man pops ups up in the most unexpected of other places too, from TV advertisements (in the UK) for EE with Britney Spears, to a graph database implementation determining the minimum number of hops between Kevin and another actor (linking them by movie performances – two actors are linked if they have been in a movie together). I have just checked – Kevin Bacon and President Donald J. Trump have a Bacon Number of 2. Now that is really quite impressive! 🙂
In my career as a software developer, on more than one occasion I have found myself in a situation where someone has helped me out, where this is really a euphemism for them making a pigs ear out of something, with the best of intentions of course, but their help causes grief. It’s often the sort of gift that doesn’t stop giving once you realise that this help, if a design implementation, cannot be easily refactored or rolled back.
Recently I encountered a problem that I have encountered thrice before, where someone helped me out by creating a large database table containing related pairs of values. The data was expected to be used in a way where a value of 1 should be returned by a PSM function if there is a connection/link between the two pairs (eg. Trump and Kevin Bacon), or 0 if there is no link. On the face of things, this is a similar issue to traversing a list to arrive at the Bacon Number within some sort of graph database, that has been implemented using elegant algorithms and structures for linking pairs of actors and movies. Unfortunately, someone helped me out by sabotaging the initial design. The knock-on implication is that other systems with a dependency on this data are implemented sub-optimally (or putting things somewhat more colloquially and bitterly, the bodged design will make the code I have to write execute excruciatingly slowly, and I have to take the wrap for it).
The offending table was defined with just two VARCHAR2 columns. It contained over 21 million rows of data pairs (growing by 1.5million/month – this is a time bomb!). I include some example data below, using Christian names rather than the domain specific data and jargon that would detract from my argument. The (adjusted) business requirement was to determine whether one Christian name was an alias for another, possibly by traversing a hierarchy or linked list of sorts.
SQL> SQL> SQL> SELECT * 2 FROM aliasPairs; P1 P2 ---------- ---------- Elisabeth Elizabeth Elisabeth Betty Beth Liz LIZ Elisabeth Lilibet Elizabeth Micheal Michael Mike Michael Micky Michael Richard Rich Richard Rick Richard Ricky Dick Richard 12 rows selected. SQL> SQL>
On the face of it, this doesn’t look too bad a design does it? I just have to write a trivial hierarchical query (I will use the superseded Oracle START WITH CONNECT BY syntax, although I recommend using the CTE/Recursive With as discussed in a prior blog article) as shown below:
SQL> SQL> SQL> SQL> SELECT DISTINCT p1 2 FROM aliasPairs 3 START WITH p1 = 'Liz' 4 CONNECT BY NOCYCLE PRIOR p1 = p2; no rows selected SQL> SQL>
Oh, hold on. This is wrong. There is a case sensitivity issue, viz. Liz and LIZ. Here’s a second attempt with Liz in capital letters, and converting the other row values to their upper case equivalents for comparison too.
SQL> SQL> SQL> SQL> SELECT DISTINCT p1 2 FROM aliasPairs 3 START WITH UPPER(p1) = 'LIZ' 4 CONNECT BY NOCYCLE PRIOR UPPER(p1) = UPPER(p2); P1 ---------- Beth LIZ SQL> SQL>
Oh, hold on, again. This isn’t correct either. This is rapidly turning into a software development exercise by trial and error – what about Elizabeth, and Lilibet, and the others in the ill designed database table? It appears that this is not a trivial hierarchical query at all. What’s the trick? How can I make this a hierarchical query – that is after all the only real approach I have open to me given I am working with this crippled design?
My solution was to create an intermediary table result set (using the WITH clause, or more formally a Subquery Refactoring Clause), as a union of (p1, p2) and (p2, p1); ie. ensuring that there will be a link connection in the hierarchical data so I can use a hierarchical query to retrieve the correct result set. Here is the code excerpt in its cut-down form (note the second Subquery Refactoring Clause, introduced as the function had to do something else as well, that isn’t included herein, keep reading).
SQL> SQL> SQL> SQL> WITH x AS 2 ( 3 SELECT UPPER(p1) p1, UPPER(p2) p2 4 FROM aliasPairs 5 UNION 6 SELECT UPPER(p2) p1, UPPER(p1) p2 7 FROM aliasPairs 8 ), 9 y AS ( 10 SELECT DISTINCT p1 11 FROM x 12 START WITH p1 = 'LIZ' 13 CONNECT BY NOCYCLE PRIOR p1 = p2 14 ) 15 SELECT p1 16 FROM y; P1 ---------- ELISABETH BETH ELIZABETH LILIBET LIZ BETTY 6 rows selected. SQL> SQL> SQL> SQL>
Ah, that’s better. A wrapped function using this core code, and three usage examples, is shown below – is there a connection between Elizabeth and Liz (ie. one is a shortened version of the other, or a synonym, although that really does have an alternative meaning), and Mike and Liz. Yes, yes, and no.
SQL> SQL> SQL> SQL> CREATE OR REPLACE FUNCTION testPair(vp1 IN aliasPairs.p1%TYPE, vp2 IN aliasPairs.p2%TYPE) RETURN NUMBER IS 2 retValue NUMBER; 3 BEGIN 4 WITH x AS 5 ( 6 SELECT UPPER(p1) p1, UPPER(p2) p2 7 FROM aliasPairs 8 UNION 9 SELECT UPPER(p2) p1, UPPER(p1) p2 10 FROM aliasPairs 11 ), 12 y AS ( 13 SELECT DISTINCT p1 14 FROM x 15 START WITH p1 = UPPER(vp1) 16 CONNECT BY NOCYCLE PRIOR p1 = p2 17 ) 18 SELECT CASE WHEN COUNT(*) = 0 THEN 0 ELSE 1 END 19 INTO retValue 20 FROM y 21 WHERE p1 = UPPER(vp2); 22 RETURN retValue; 23 END testPair; 24 / Function created. SQL>
Example usage as below:
SQL> SQL> SQL> SELECT testPair('Liz', 'Elizabeth') FROM DUAL; TESTPAIR('LIZ','ELIZABETH') --------------------------- 1 SQL> SELECT testPair('LIz', 'EliZABeth') FROM DUAL; TESTPAIR('LIZ','ELIZABETH') --------------------------- 1 SQL> SELECT testPair('Liz', 'Mike') FROM DUAL; TESTPAIR('LIZ','MIKE') ---------------------- 0 SQL> SQL>
This code implementation is just so awful. I am embarrassed even now writing up this blog article describing the approach. I am also reliving all the pain I felt the first time I wrote the code. A cost from the PLAN table for querying the indexed pairs table query is around 27000. Should this number not mean anything to you – just accept that it is shockingly high and a better design would reduce the cost my 2-3 orders of magnitude! Put another way, for the large table containing these pairs values, the SQL (or PL/SQL) above will not execute very fast at all. If the PSM function is invoked for a large number of rows in some result-set, performance issues are compounded.
Now what if I added to the problem by stating that rather than direct column comparisons, a regex was involved. Yes, the initial design, the help provided, was really this bad! The PSM function had to be bloated out with more code that caused performance woes, resulting in the second WITH clause, containing a REGEXP_LIKE. Your heart should sink!
In this organisation the perception is that the reason the query executes slowly is because Oracle is slow (after all SQL Server is so much faster isn’t it? Our other systems don’t suffer from this type of performance issue do they?). The fact is, if I am permitted to digress to facts for a moment, the performance issues are a result of a flawed design, it has been flawed from the outset, and it is flawed as someone helped me out by taking a piece of low hanging fruit (populating pairs into a table with two columns) and then bodging it. Before this design could be reigned in, or even some query profiling performed on the pairs table containing more than a couple of dozen rows, the implementation was rolled out to production systems and dependent technology built upon the design.
Although it was very tempting to borrow someone elses userID to check my code into Subversion (so my name didn’t appear in the source code revision history), I checked the code in under my own userID. Subversion has blame functionality. I was tempted to use this too. I didn’t. The serious take home message is that when someone helps you out and they don’t know what they are doing, it often has the opposite effect (of course this argument applies to bad design regardless, even from experienced developers, but the I’ve helped you out scenario seems to crop up repeatedly, and from people that know a bit but not much more IT).
Lastly, if Kevin Bacon were a software developer, an Oracle developer, he would have designed the system properly from the outset. He does nothing wrong. Well either that is true, or I have a man crush.
— Published by Mike, 06:57:12 19 June 2017 (CET)
By Month: November 2022, October 2022, August 2022, February 2021, January 2021, December 2020, November 2020, March 2019, September 2018, June 2018, May 2018, April 2018
Apple, C#, Databases, Faircom, General IT Rant, German, Informatics, LINQ, MongoDB, Oracle, Perl, PostgreSQL, SQL, SQL Server, Unit Testing, XML/XSLT
Leave a Reply