Big O Analysis and Hybrid Data Structures

One of the key techniques for optimizing Java code is Big O analysis. Most of us were taught this back in university, in an undergraduate Data Structures or Algorithms course.

https://www.geeksforgeeks.org/analysis-algorithms-big-o-analysis/

In actual real-world practice, most algorithms are self-evident. The big takeaways were "watch your loops" and the ability to shop the Java Collections Framework for the right class. Need fast inserts? ArrayList. Need fast lookups? HashMap or HashSet. The PriorityQueue, for example, supports logarithmic insertion and constant-time retrieval of the best value.

https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html

Implementation note: This implementation provides O(log(n)) time for the enqueuing and dequeuing methods (offer, poll, remove() and add); linear time for the remove(Object) and contains(Object) methods; and constant time for the retrieval methods (peek, element, and size).
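To put that Javadoc note in concrete terms, here is a small illustration of those costs (the values are arbitrary, chosen just for the demo):

import java.util.Comparator;
import java.util.PriorityQueue;

public class PriorityQueueCosts {
    public static void main(String[] args) {
        // Smallest value first by default; pass a Comparator if "best" means largest.
        PriorityQueue<Integer> queue = new PriorityQueue<>(Comparator.reverseOrder());

        queue.offer(5);                      // O(log n) insert
        queue.offer(42);
        queue.offer(17);

        Integer best = queue.peek();         // O(1): the best value is always at the head
        Integer taken = queue.poll();        // O(log n): remove the head and re-heapify
        boolean has17 = queue.contains(17);  // O(n): the queue is NOT organized for searching

        System.out.println(best + " " + taken + " " + has17); // 42 42 true
    }
}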

So most of us, over the years, stopped taking Big O seriously. You pick the Java Collections class that best suits your needs, consider that selection optimal, finish your development, and move on to testing.

But sometimes this fails us. Sometimes we end up with unacceptable run times. You go through a code profile or use the PSR logs, and you find that everything within the process operates as expected. You have no long-running queries, no FULL GCs, no database contention, no missing indexes. Recalculating stats on Oracle doesn’t help. Your algorithm just takes a LONG TIME.

In the case of MEIO, we literally had a single thread that took over 48 hours to complete!

So what do you do in these cases? Simple. Dig into your Big O analysis. 😊

Big O Arithmetic…

Before I jump in, let’s go over the basics of big O arithmetic.

  • A simple function like writing a log statement is constant time O(c)...very fast.

  • A function that loops over all input data is linear O(n)...pretty good.

  • A function that searches a binary tree for an item is logarithmic O(log n)...better than linear.

  • A function that loops over all input and must search a binary tree at each step is superlinear O(n log n).

  • A function that loops over all input and, at each step, loops over all input again is quadratic O(n²).

  • If you have two functions with the same run time, you simply keep one of them.

    • O(c + c) is the same as O(c)

  • If you have two functions and one has a longer run time, you keep only the longer run time.

    • O(c + n) is the same as O(n)

  • If you have a loop calling another loop, the run times are multiplied together (see the sketch just after this list).

    • O(n * n) is the same as O(n²)
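To make the arithmetic concrete, here is a toy Java sketch of those three shapes (the work inside each method is just a placeholder):

import java.util.List;

public class BigOArithmetic {
    // O(c): constant, no matter how big the input is
    static void logIt(List<String> buffers) {
        System.out.println("processing " + buffers.size() + " buffers");
    }

    // O(n): one pass over the input
    static void touchEach(List<String> buffers) {
        for (String b : buffers) {
            b.length();                  // constant work per element
        }
    }

    // O(n * n) = O(n²): a loop inside a loop multiplies, it does not add
    static void compareAll(List<String> buffers) {
        for (String a : buffers) {
            for (String b : buffers) {
                a.equals(b);             // constant work, done n * n times
            }
        }
    }
}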

Quick Note about Incremental Improvements …

Before I continue, I do need to mention incremental improvements. We’ve all read about the performance hit of string manipulation. RPL, for example, used to make extensive use of JSON arrays. A configuration parameter lived inside a JSON array, and at every site, for every time, it would search that JSON array for the value. The JSON array is essentially a string, and all of these lookups were string manipulation to find the parameter value. That turns out to be a very expensive way of fetching parameter values. We improved RPL a ton by evaluating all the JSON array parameters once, up front, and storing them in a POJO. So if that saves a ton of time, is Big O analysis a waste of time? Should we just hunt for these incremental improvements?

There is some truth to that. An O(n) algorithm that makes calls to expensive constant-time functions will be slower than an O(n) algorithm that does not. If I loop n times and spend 5 minutes per iteration, the run time is n * 5 minutes. If I cut that 5 minutes to 2.5 minutes, it only takes n * 2.5 minutes; it’s twice as fast. But as n grows, the algorithmic complexity (the Big O) takes over: an O(n²) algorithm at 2.5 minutes per step will eventually be slower than an O(n) algorithm at 5 minutes per step. And trust me, we only care about this argument when n is a very large number.

So it's important to trim the excessive calls to long functions, but it’s not a panacea. You MUST control the Big O.
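The RPL change described above probably looked, in spirit, something like this sketch. All class and method names here are invented for illustration; the point is simply to pay the parsing cost once:

import java.util.HashMap;
import java.util.Map;

// Hypothetical: parse the JSON-array configuration once, cache the values,
// and make every later per-site, per-time lookup cheap.
public class ParameterCache {
    private final Map<String, String> values;

    private ParameterCache(Map<String, String> parsedValues) {
        this.values = new HashMap<>(parsedValues);
    }

    // The expensive step happens exactly once per configuration,
    // not once per site per time bucket.
    public static ParameterCache fromJsonArray(String jsonArray) {
        Map<String, String> parsed = parseWithYourJsonLibrary(jsonArray); // e.g. Jackson, org.json
        return new ParameterCache(parsed);
    }

    public String get(String parameterName) {
        return values.get(parameterName);   // average constant time, no string scanning
    }

    private static Map<String, String> parseWithYourJsonLibrary(String jsonArray) {
        // Placeholder so the sketch compiles without pulling in a JSON dependency.
        return new HashMap<>();
    }
}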

Looking at Data Structures…

To get value out of Big O, you must look deeply into the process. Remember, those Java Collections have run times for their functions. The Java Collections classes are very well engineered, optimized, tested, and used in millions of applications, but they still have run-time costs, and those costs factor into your algorithm just as much as your own code does.

  • HashMap

    • Insert (put)

      • Average constant O(c); worst case logarithmic O(log n) when a bucket degrades into a tree (Java 8+)

    • Lookup (get / containsKey)

      • Average constant O(c); worst case logarithmic O(log n)

  • ArrayList

    • Insert (add at the end)

      • Amortized constant O(c)

    • Get by index

      • Constant O(c)

    • Search for a value (contains / indexOf)

      • Linear O(n)

You may write a simple linear algorithm. Loop through every buffer, determine whether it is a good buffer, and then do some action on that buffer. The action may be simple arithmetic, so you may claim this algorithm is linear O(n). But how did we determine it was a good buffer? Assume you keep an ArrayList of good buffers and call contains() on that ArrayList to check. For Big O analysis, your algorithm is not…

For each buffer
    Is it a good buffer?
    Act on buffer

Instead, it is really…

For each buffer
    Look the buffer up in the good-buffer ArrayList
    Act on buffer

Which is really…

For each buffer
    For each buffer in the good-buffer ArrayList
        Stop if this is the right buffer
    Act on buffer

You see it now, right? The nested loop. This isn’t linear. This is quadratic O(n²).
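In Java, the innocent-looking version might look something like this (the buffer IDs and the act() step are placeholders I have invented):

import java.util.ArrayList;
import java.util.List;

public class GoodBufferScanQuadratic {

    // O(n²): contains() walks the ArrayList element by element for EVERY buffer.
    static void process(List<Long> allBufferIds, ArrayList<Long> goodBufferIds) {
        for (Long bufferId : allBufferIds) {          // outer loop: n
            if (goodBufferIds.contains(bufferId)) {   // hidden inner loop: up to n more steps
                act(bufferId);
            }
        }
    }

    static void act(Long bufferId) {
        // simple arithmetic on the buffer: constant time
    }
}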

But this doesn’t matter in the modern world, right? One thousand buffers only require one million steps. These P7 processors can knock that out in a second. But what if we have 60,000 items with an average of 100 sites for each item, giving us 6,000,000 buffers? That’s 36,000,000,000,000 steps. I don’t care how big your processor is; that’s A LOT of steps.

Okay, so we spend a lot of time looping – what can we do about it? Well, in this case, it’s obvious. The “good buffer” check is just an oversight. The developer moved quickly through an obvious check and accidentally produced a performance hiccup. The answer is to choose a better data structure. What if we use a HashMap (or HashSet)?

So our algorithm looks like…

For each buffer
    Look the buffer up in the good-buffer hash map
    Act on buffer

Which is really

For each buffer
    Hash the buffer and jump straight to its bucket (average constant O(c) time; worst case, a treeified bucket, O(log n))
    Act on buffer

This means your algorithm is, at worst, the number of buffers times log n: “superlinear,” or what I call n log n. On average it is simply linear. Even in the worst case, those 6,000,000 buffers are now handled in about 22 * 6,000,000, or 132,000,000, steps. Still a nice big number, but a heck of a lot fewer zeros than 36,000,000,000,000. Am I right?
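Continuing the same hypothetical example, the fix is essentially a one-type change:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GoodBufferScanLinear {

    // contains() on a HashSet hashes the key and checks one bucket:
    // average constant time, so the whole pass is effectively linear.
    static void process(List<Long> allBufferIds, Set<Long> goodBufferIds) {
        for (Long bufferId : allBufferIds) {
            if (goodBufferIds.contains(bufferId)) {
                act(bufferId);
            }
        }
    }

    static void act(Long bufferId) {
        // simple arithmetic on the buffer: constant time
    }

    public static void main(String[] args) {
        Set<Long> good = new HashSet<>(Arrays.asList(1L, 3L, 5L));
        process(Arrays.asList(1L, 2L, 3L, 4L, 5L), good);
    }
}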

What happens if there isn’t a good data structure for you to choose?!?!?

How can this happen?

We all love the Java Collections classes - well constructed, well tested, used all over the world - but each one is designed for one and only one purpose. A PriorityQueue, for example, organizes the collection so that finding the best value is done in constant time. But what if you need to find an arbitrary element in the PriorityQueue? That takes linear time! Remember what happened with the ArrayList in the previous example. Java Collections classes will fail you IF you need to use them for data that must be sorted along two or more dimensions.

What?

Imagine we have an object called a Foo. A Foo can have a parent Foo and many child Foos, so you can see the need for a data structure that lets you find all the children (and their children) of a given Foo. But we also need to find the Foo that needs the most help, so you can imagine a data structure that keeps all the Foos sorted so you can always grab the one that needs the most help. And what if, after we help a Foo, we need to find all of that Foo’s children and see whether they need help as well?

Let’s say we create both data models. We build a network of references for the parent/child relationship (each Foo holds a reference to its parent and an ArrayList of references to its children), and we use a PriorityQueue for the amount of help a Foo requires. In this case, we are using the optimal data structure for each job. Finding the children of a Foo is constant O(c) time. Finding the best value is also constant O(c) time (look it up in the Java docs if you wish). So every step MUST be constant time, yes? But look more closely…

Find the Foo that needs the most help
Help Foo
Find Foo’s children
Help Foo’s children

But this isn’t the whole story…

Find the Foo that needs the most help (peek at the PriorityQueue, constant O(c) time)
Help Foo (simple arithmetic, constant O(c) time)
Find that Foo in the parent/child relationship structure (linear O(n) time)
Find the children of Foo (constant O(c) time)
Help Foo’s children

Whoops. See how the linear O(n) sneaks in? And if we have to help Foos a number of times directly related to the number of Foos (i.e., the number of helps roughly equals the number of Foos), it gets worse…

Repeat roughly as many times as there are Foos (linear O(n))
    Find the Foo that needs the most help (peek at the PriorityQueue, constant O(c) time)
    Help Foo (simple arithmetic, constant O(c) time)
    Find that Foo in the parent/child relationship structure (linear O(n) time)
    Find the children of Foo (constant O(c) time)
    Help Foo’s children

See the quadratic O(n²)? The two linear O(n) steps, one inside the other? We're back to 36,000,000,000,000 steps to help all the Foos.
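Here is one plausible Java rendering of that two-model design (all names are invented). The comments call out where the linear scan sneaks in:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class NaiveFooHelper {

    // Model 2: the parent/child network node.
    static class Foo {
        final long id;
        double helpNeeded;
        Foo parent;
        final List<Foo> children = new ArrayList<>();
        Foo(long id, double helpNeeded) { this.id = id; this.helpNeeded = helpNeeded; }
    }

    // Model 1: the "who needs help most" view, with no link back to the network.
    static class HelpTicket {
        final long fooId;
        final double helpNeeded;
        HelpTicket(long fooId, double helpNeeded) { this.fooId = fooId; this.helpNeeded = helpNeeded; }
    }

    static void helpEveryone(List<Foo> network, List<HelpTicket> tickets) {
        PriorityQueue<HelpTicket> byHelp =
                new PriorityQueue<>(Comparator.comparingDouble((HelpTicket t) -> t.helpNeeded).reversed());
        byHelp.addAll(tickets);

        while (!byHelp.isEmpty()) {                   // roughly n iterations: the outer O(n)
            HelpTicket ticket = byHelp.poll();        // best value: cheap

            // The dimension switch: the ticket knows nothing about parents or children,
            // so we scan the whole network to find the node. O(n), INSIDE the loop.
            Foo foo = findById(network, ticket.fooId);

            help(foo);                                // simple arithmetic: O(c)
            for (Foo child : foo.children) {          // following references: O(c) per child
                help(child);
                // (Re-prioritizing a helped child would mean PriorityQueue.remove(Object),
                //  which is also linear: a second hidden O(n).)
            }
        }
    }

    static Foo findById(List<Foo> network, long id) {
        for (Foo f : network) {                       // the linear scan that bites us
            if (f.id == id) return f;
        }
        return null;
    }

    static void help(Foo f) {
        f.helpNeeded = Math.max(0, f.helpNeeded - 1); // stand-in for the real "help" arithmetic
    }
}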

How do you fix it?

Notice where the cost comes in: when we switch dimensions from the amount of help required to the parent/child relationship! It’s this switch that forces us into the quadratic O(n²). THIS IS WHAT YOU WANT TO LOOK FOR IN YOUR DESIGNS.

Imagine if you did not have to switch data structures when you switch dimensions. Imagine a Foo Node that could be a member of the parent/child reference network as well as the help-value sort. But how? Which Java Collection supports these two dimensions?

The answer is NONE. Every Java Collection supports a single dimension.

So do we give up? No. What you need to do is build your own data structure with elements of BOTH the parent/child reference network and the value sort. What if a Foo Node contains the parent reference, the ArrayList of child references, and also the references required to participate in a Red-Black tree (Google binary search tree algorithms if you need a refresher), where the Red-Black tree is sorted by the amount of help needed? This turns our algorithm into…

Repeat roughly as many times as there are Foos (linear O(n))
    Find the Foo that needs the most help (the Foo Node at the far end of the Red-Black tree, logarithmic O(log n) time, or constant if you keep a pointer to it)
    Help Foo (simple arithmetic, constant O(c) time)
    Find the Foo in the parent/child relationship structure (you are already holding the Foo Node, constant O(c) time)
    Find the children of Foo (constant O(c) time)
    Help Foo’s children

Boom! The inner O(n) goes away. O(n²) collapses to O(n log n), which at this scale might as well be linear. A run time of 48 hours becomes four hours. You become a rock star. 😊
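Here is a minimal sketch of the hybrid idea, with invented names. It uses java.util.TreeSet, which is itself a red-black tree under the hood, as a stand-in for the hand-rolled intrusive tree described above:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class HybridFooHelper {

    // One node, two dimensions: parent/child references AND membership in the sorted tree.
    static class FooNode {
        final long id;                 // tie-breaker so equal help values still compare as distinct nodes
        double helpNeeded;
        FooNode parent;
        final List<FooNode> children = new ArrayList<>();
        FooNode(long id, double helpNeeded) { this.id = id; this.helpNeeded = helpNeeded; }
    }

    // TreeSet is a red-black tree; it stands in for the intrusive tree described in the text.
    private final TreeSet<FooNode> byHelp = new TreeSet<>(
            Comparator.comparingDouble((FooNode f) -> f.helpNeeded).thenComparingLong(f -> f.id));

    void register(FooNode node) {
        byHelp.add(node);                             // O(log n)
    }

    void helpAllFoos() {
        while (!byHelp.isEmpty()) {                   // roughly n iterations
            FooNode worst = byHelp.pollLast();        // most help needed: O(log n)
            help(worst);                              // simple arithmetic: O(c)

            // The dimension switch is free: the node we are holding IS the parent/child node.
            for (FooNode child : worst.children) {
                byHelp.remove(child);                 // O(log n): take it out BEFORE changing its sort key
                help(child);
                if (child.helpNeeded > 0) byHelp.add(child);   // O(log n) re-insert
            }
            if (worst.helpNeeded > 0) byHelp.add(worst);
        }
    }

    private void help(FooNode f) {
        f.helpNeeded = Math.max(0, f.helpNeeded - 1); // stand-in for the real "help" arithmetic
    }
}

The key constraint is that a node’s sort key must not change while it sits in the tree, which is why each Foo is pulled out, helped, and re-inserted; each of those operations costs O(log n), not O(n), because we already hold the node.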

Summary

If I define an engine as (task generate, read, process, write) and you notice a very long process step, run the PSR Logger or a code profile. If you don’t see calls to functions you shouldn’t be calling (database queries in the process step, for example), you need to do Big O analysis. With Big O you are looking for loops. Loops can be hidden in the data structure you choose. Make sure you are using the correct data structure, and look for cases in which you are trying to use a single data structure along two different dimensions. The cost of this mistake can be as much as an order of magnitude (10x run time) or worse!