When you have to resort to statistics, you know it's time to redesign the software, or throw it away!

Yesterday the system was running fine. Today it is slow. What changed? How many times has this damn question been asked? The answer is, of course, nothing, a little, or a lot, depending on your perspective, and especially so for systems where the rate-determining step involves a database.

Imagine some system where, this month, the response time for aggregating the data for a critical report is 2hrs. Imagine the projection for next month is 2hrs 20mins, the month after that 2hrs 40mins, then 3hrs, and so on: a finger-in-the-air estimate of an additional 20mins/month (due to an ever-increasing amount of data populating the database). At some point, the time taken to aggregate the data for the report would become unbearable, impact BAU processes, the phones would start ringing, and Directors would start banging their fists (or perhaps their heads) on the desk.

You could plot the performance metrics on a graph and extrapolate them forward for the purposes of planning in a task to address the performance issues. I have contrived some data and done just that to support my thesis.

So

  • Assuming today is month 11, then from the prior 11 months' performance metrics (the blue line in the graph), the line's slope can be determined with a little undergraduate maths, y = mx + c (the relationship might be quadratic or something else entirely; my point is that it isn't difficult to forecast system performance for the months ahead by fitting a line to historical data and extrapolating – this is far from rocket science; see the SQL sketch after this list).
  • Extrapolating out past month 11 (again, the blue line on the graph), it can be calculated, and visually represented, that around month 31 the system performance would have degraded past the point where it would be tolerated.
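That forecast is trivial to mechanise. Here is a minimal sketch in SQL, assuming a hypothetical report_metrics table with one row per month (month_no, runtime_mins) and a purely illustrative tolerance threshold of 240 minutes; Oracle's built-in REGR_SLOPE and REGR_INTERCEPT aggregates hand you m and c directly:

    -- Hypothetical table: report_metrics(month_no, runtime_mins), one row per month.
    -- Fit y = m*x + c over the first 11 months of metrics, then compute the
    -- first month at which the projected runtime crosses the threshold.
    WITH fit AS (
      SELECT REGR_SLOPE(runtime_mins, month_no)     AS m,
             REGR_INTERCEPT(runtime_mins, month_no) AS c
      FROM   report_metrics
      WHERE  month_no <= 11
    )
    SELECT m, c,
           CEIL((240 - c) / m) AS breach_month  -- first month where m*x + c >= 240
    FROM   fit;

One query, no cleverness, and you know roughly when the phones will start ringing.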

Any developer worth their salt, or manager worth the job title for that matter (this is merely SLA monitoring), would plan in some technical task to address the performance degradation prior to month 31. Put another way, today is month 11, and month 31 is therefore 20 months away, so there is time to address other, more urgent work now. This is the common sense approach, isn't it? Or basic task prioritisation?

Enter the Oracle Cost Based Optimiser (CBO). The CBO is the crucial component of the Oracle (O)RDBMS for determining which of the many possible execution strategies should be executed. The deciding factor in execution strategy, or plan, is cost, where cost means minimal system resource use – which is really an alternative way of saying a function of time (a simplification, but this is the essence of the approach). Worded differently, there are likely many possible execution strategies, and the CBO determines and selects the one it estimates will execute the fastest, and this is the one used to retrieve the data.
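To see what the CBO has decided for a given statement, and the cost it attaches to that decision, EXPLAIN PLAN is enough (the table, columns, and bind variable below are illustrative):

    -- Ask the CBO which plan it would choose for this (illustrative) query.
    EXPLAIN PLAN FOR
      SELECT o.order_id, o.order_total
      FROM   orders o
      WHERE  o.customer_id = :cust_id;

    -- Show the chosen plan. The COST column is the CBO's estimate of relative
    -- resource use, not a measured execution time.
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);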

The source of truth used by the CBO for determining the optimal execution strategy/path is a set of statistics: hardware performance metrics (single/multiblock read times), CPU speed, and the distribution of the underlying data. From this information, for each SQL query, the CBO determines, for example, whether nested loops are used, which table is accessed first, and whether to use indexes (and if so, which ones) or perform full table scans, and so on.
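Those data-distribution statistics are gathered (and, typically nightly, refreshed) via DBMS_STATS. A sketch, with the schema and table names assumed:

    -- Refresh the optimiser statistics the CBO relies on (illustrative names).
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname          => 'APP',
        tabname          => 'ORDERS',
        estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
        cascade          => TRUE  -- gather the index statistics as well
      );
    END;
    /

    -- Inspect what the CBO now believes about the table.
    SELECT table_name, num_rows, blocks, last_analyzed
    FROM   user_tab_statistics
    WHERE  table_name = 'ORDERS';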

As the underlying data changes, it is quite feasible that the execution strategy/plan changes. I refer to this as Execution Plan Lability rather than Execution Plan Stability – the theme, however, is that Oracle will choose a different execution strategy based on the statistics for the underlying data distribution at that time. Of course, if an alternate execution plan does produce a lower cost, and does execute faster, this is a good thing. On many, many occasions, however, it does not (and yes, I want to have my cake and eat it too). On many occasions, for no apparent reason (likely a small edge-case change in the underlying data distribution, a model or scenario not accommodated within the CBO, or a bug), a totally inappropriate execution plan may be chosen once the database statistics are refreshed (which may occur nightly). When this happens, execution performance may change dramatically and detrimentally, and seemingly without reason.

This scenario is depicted in orange in the graph above. In this scenario, without warning, around month 20 the system is hit by significant performance issues due to the lability of the execution plan. This is a terrible surprise – after all, how could any organisation plan for this scenario? The answer is that it cannot, and the incontrovertible fact is that this is an unpredictable system. Furthermore, to rub salt into the wound, the system has been designed to be unpredictable. Is it really defensible to retain a system that exhibits such behaviour?
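Tellingly, the standard defences against this lability amount to switching the cleverness off. Two common moves, sketched below with assumed names (the sql_id value is a placeholder; in practice it would come from V$SQL for the statement in question):

    -- 1. Freeze the statistics on a volatile table so the nightly refresh
    --    cannot silently change the CBO's inputs (illustrative names).
    BEGIN
      DBMS_STATS.LOCK_TABLE_STATS(ownname => 'APP', tabname => 'ORDERS');
    END;
    /

    -- 2. Capture the currently-good plan as a SQL Plan Baseline so the CBO
    --    cannot swap it for a worse one later ('abcd1234efgh5' is a placeholder).
    DECLARE
      n PLS_INTEGER;
    BEGIN
      n := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(sql_id => 'abcd1234efgh5');
    END;
    /

That you must lock the statistics or pin the plan to get predictable behaviour rather supports the charge that the system is unpredictable by design.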

No! That is my answer.

In summary

  • determining an execution plan for retrieving the results of a database query is not a difficult task, but it has been made difficult, and deliberately unpredictable – or deliberately labile, if you prefer.
  • the unpredictability erodes all confidence in the system – for developers, DBAs, end users, and the organisation as a whole.
  • the unpredictability is a result of using statistics to make decisions, in this case on execution path/strategy. Remember, statistics are really just another name for coin flipping. They are not facts, especially in the manner they are being used here. By way of example, in the past year, statistics (although we refer to them as ‘polls’) told us that Brexit wouldn’t happen, and that if Trump rather than Hillary Clinton won the US Presidential election, it would be a giant surprise (the boffins at Princeton discuss this ‘giant surprise’ well).

Obviously I have a bee in my bonnet about the Oracle CBO (and similar implementations in other DBMSs, including Microsoft SQL Server) and its approach to determining execution path strategy – and more pointedly, about execution plan lability. Am I justified in whinging about this approach while not offering an alternative? Perhaps not, but there remains nothing wrong with voicing an opinion on why the current model for determining execution path strategy is broken.

— Published by Mike, 13:36:12 09 May 2017 (CET)
