I'm just a biologist who switched to Python because Excel and Origin weren't dealing very well with my ever-increasing pile of data. (Typical data: every row is a cell in a tissue sample, every column is a quantified parameter of that cell (size, marker intensity, ...), and I typically deal with tens to hundreds of tissue samples.) Pandas is great; I spend my time turning DataFrames into histograms, scatter plots and ROC curves in Jupyter Notebooks. I have the feeling knowing Big-O is not very relevant to that. Still, learning new languages, new words and new abstractions is almost guaranteed to influence one's way of working and thinking at some level.
So, indeed, the book probably doesn't hurt ;)
Edit: Just glanced over the link in Practicality's comment about Big-O, and sure enough, I think it may actually be useful as my ever-increasing pile of data grows even further! I have to admit: as the parameters increase, I find myself running overnight calculations more and more.
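(For the curious, the kind of table I mean looks roughly like this. This is only a sketch: the column names like "area_um2" and "marker_intensity" are made up for illustration, but the shape is accurate.)

    # Rough sketch only; column names are invented for illustration.
    import pandas as pd
    import matplotlib.pyplot as plt

    # Each row is one cell, each column a quantified parameter of that cell.
    cells = pd.DataFrame({
        "sample_id":        ["t01", "t01", "t02", "t02"],
        "area_um2":         [112.4, 98.7, 130.1, 87.3],
        "marker_intensity": [0.81, 0.42, 0.95, 0.30],
    })

    # Typical day: turn one of the columns into a histogram.
    cells["marker_intensity"].hist(bins=10)
    plt.xlabel("marker intensity")
    plt.show()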
How does it change what you do, though? I did signal processing with massive data streams rather than bioinformatics, but I assume the situation is similar. The algorithms are what they are. They are complex mathematical equations or transformations that need to be run on data and are often optimized without being able to change their asymptotic complexity.
Skiena's Algorithm Design Manual mentions him being brought in as an algorithmic consultant to modify some genetics-analysis software so that it'd actually finish, but I don't really remember the details or know enough about the field to give you plausible examples.
I can see that; I did a lot of similar work with signal processing algorithms. None of what I did affected asymptotic complexity at all, though. The asymptotic complexity was tied to the algorithms chosen, and changing those was an issue of trading computational performance for system performance.
EDIT: Misstated the big-O in this particular case (should've found my coworker's actual code). Both are O(m x n); one just has a larger constant.
Here's a pattern I've noticed in code written by a lot of people for processing a data file (Python-esque, using a match() function that's "left as an exercise for the reader" to implement):
    def search(filename, value):
        with open(filename, "r") as f:
            for line in f:
                if match(value, line):
                    print(line)
                # we don't care about not matching

    def main():
        for v in [search1, search2, search3, ...]:
            search("data.dat", v)
What happened is that one time they needed that search function, so they wrote search and it worked well. Then they realized they could run that same search function repeatedly, and for small data files and few searches it was quick enough. But the performance is O(m^2 x n) [EDIT: originally wrote O(m x n)], where m is the number of lines and n is the number of search values. [EDIT: wrong: a second m because it takes time proportional to the size of the file to read the file.]
The data file is read every time something is searched. If you've got an SSD, it's not really noticeable. If you've got a spinning disk, it becomes a problem. If you're hitting network storage, you're downloading that file n times. The main issue is that each read (each iteration of the inner for loop) hits the hard drive, the network, or similar. A simple performance hack is to move the read into main, put the whole file into one list of lines, and pass that list to search instead of the filename (modifying search appropriately):
    def search(data, value):
        for line in data:
            if match(value, line):
                print(line)
            # we don't care about not matching

    def main():
        with open("data.dat", "r") as f:
            data = f.read().splitlines()
        for v in [search1, search2, search3, ...]:
            search(data, v)
It's still O(m x n). [EDIT: It's now O(m x n), not the O(m^2 x n) claimed above; we've removed one of the m factors because we do the read once, and never again.]
For very large files and very large search parameter lists, this will still take a long time, but it's much faster than the previous version when you're dealing with large files.
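If you'd rather see the difference than take my word for it, a rough timing harness looks something like this. This is only a sketch: match() is just a substring test standing in for the real one, and the file size and number of searches are arbitrary illustrative values; how big the gap is depends heavily on where the file lives, as noted above.

    # Rough timing sketch; the trivial match() and the sizes are illustrative only.
    import time

    def match(value, line):
        return value in line

    def search_reread(filename, value):
        # the original pattern: open and scan the file for every search value
        with open(filename, "r") as f:
            for line in f:
                if match(value, line):
                    pass  # print(line) in the real thing

    def search_in_memory(data, value):
        for line in data:
            if match(value, line):
                pass  # print(line) in the real thing

    # Build a synthetic data file.
    with open("data.dat", "w") as f:
        for i in range(200_000):
            f.write("record %d with some text\n" % i)

    values = [str(i) for i in range(50)]

    t0 = time.perf_counter()
    for v in values:
        search_reread("data.dat", v)
    t1 = time.perf_counter()

    with open("data.dat", "r") as f:
        data = f.read().splitlines()
    for v in values:
        search_in_memory(data, v)
    t2 = time.perf_counter()

    print("re-read every time: %.2fs, read once: %.2fs" % (t1 - t0, t2 - t1))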
EDIT:
Shortest code I can think of to get the actual worst case that I've had a few coworkers pull off:
    def search(filename, value):
        with open(filename, "r") as f:
            data = f.read().splitlines()
        for line in data:
            if match(value, line):
                print(line)
            # we don't care about not matching

    def main():
        for v in [search1, search2, search3, ...]:
            search("data.dat", v)
With, of course, other code in between because as vonmoltke points out, the above has clear problems. My point was about the structure of the bad pattern, not the specific implementation of it.
I would have never written the first example in the first place, and I don't need Big O calculus to tell me it's a bad idea. Even with a single file it is obvious that the initial implementation is doing unnecessary work and that re-reading a file from disk every time is ridiculous (unless the file is too large for memory, in which case I would pass the list of search terms to the search function and check each line for all terms as the lines are read).
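A sketch of that last variant, with match() still left to the reader and the search-term list as hand-wavy as in the examples above:

    def search_streaming(filename, values):
        # One pass over the file; each line is checked against every search
        # term as it is read, so only one line needs to be in memory at a time.
        with open(filename, "r") as f:
            for line in f:
                for v in values:
                    if match(v, line):
                        print(line)
                        break  # assuming one report per matching line is enough

    def main():
        search_streaming("data.dat", [search1, search2, search3, ...])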
It only looks like a bad idea because of the close proximity in my example. What I've seen normally is that it's grown into something mimicking this structure, but actually far more complex. The point where the file read happens isn't so near the top, so that refactor is less obvious, and it's so deep that the person who puts it into that outer loop in main may not realize what's happening internally (fully, at least).
I'm trying to recall the structure of another case where this happened with a more complex internal algorithm. The solution was far less obvious, but required similar refactorings. In that case it was both reading the file multiple times and a several-layers-deep loop where one layer (by far the longest running) could be refactored to happen only once instead of 100 or so times; we flipped some of the loops around (moved it to be the outer loop, similar to the idea of moving the loop over all lines to be the outer loop in my other example). Big-O wasn't essential (for me), because I'd internalized that sort of thinking. But that explanation was essential for my colleagues (EEs, a couple of years out of school) who hadn't been exposed to that construct before (at least not enough for it to stick).
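In rough outline it was something like this (names invented; the real code was much messier and the slow step wasn't a file read sitting in plain sight):

    # read_and_preprocess() and analyze() are invented stand-ins for the real steps.

    # Before: the long-running step sits inside the per-parameter loop,
    # so it runs once per parameter set (100 or so times).
    def process_before(filename, param_sets):
        for params in param_sets:
            lines = read_and_preprocess(filename)   # by far the slowest step
            for line in lines:
                analyze(line, params)

    # After: the slow step is hoisted out and runs exactly once, and the loop
    # over its output becomes the outer loop.
    def process_after(filename, param_sets):
        lines = read_and_preprocess(filename)
        for line in lines:
            for params in param_sets:
                analyze(line, params)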
> It only looks like a bad idea because of the close proximity in my example. What I've seen normally is that it's grown into something mimicking this structure, but actually far more complex. The point where the file read happens isn't so near the top, so that refactor is less obvious, and it's so deep that the person who puts it into that outer loop in main may not realize what's happening internally (fully, at least).
I don't see how Big O calculus helps here. If you have enough understanding to run that analysis you have enough understanding to see it is trivially a dumb idea.
> I'm trying to recall the structure of another case where this happened with a more complex internal algorithm. The solution was far less obvious, but required similar refactorings. In that case it was both reading the file multiple times and a several-layers-deep loop where one layer (by far the longest running) could be refactored to happen only once instead of 100 or so times; we flipped some of the loops around (moved it to be the outer loop, similar to the idea of moving the loop over all lines to be the outer loop in my other example). Big-O wasn't essential (for me), because I'd internalized that sort of thinking. But that explanation was essential for my colleagues (EEs, a couple of years out of school) who hadn't been exposed to that construct before (at least not enough for it to stick).
What's funny is that I am an EE, and the way I internalized complexity analysis and optimization was to look at the number of operations being performed, along with the cost of those operations, and to design the code so that it used the fewest resources. Only later did I learn this "Big O" thing, and it seemed stupid because it was overly complex and told me to throw out significant factors that I had spent my career worrying about. I still don't really see its value over more detailed methods that seem trivially easy to me, like simply deriving an approximation of the complete polynomial describing the runtime, memory usage, or what have you. I am a systems engineer and have a bias towards modeling things, though.
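To make that concrete with the file-search example above (the constants are made up, just to show the kind of model I mean):

    # m = lines in the file, n = number of search values; constants are invented.
    def runtime_reread(m, n, c_open=1e-3, c_read=5e-7, c_match=1e-7):
        # T(m, n) = n * (c_open + c_read*m + c_match*m): open, read and match per search
        return n * (c_open + (c_read + c_match) * m)

    def runtime_read_once(m, n, c_open=1e-3, c_read=5e-7, c_match=1e-7):
        # T(m, n) = c_open + c_read*m + c_match*m*n: read once, match per search
        return c_open + c_read * m + c_match * m * n

    # Both models are O(m*n); the full polynomial keeps the factors that Big-O
    # throws away, which is exactly where the real-world speedup lives.
    print(runtime_reread(1_000_000, 100), runtime_read_once(1_000_000, 100))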