I'm just a biologist who switched to Python because Excel and Origin weren't dealing very well with my ever-increasing pile of data. (Typical data: every row is a cell in a tissue sample, every column is a quantified parameter of that cell (size, marker intensity, ...), and I typically deal with tens to hundreds of tissue samples.) Pandas is great; I spend my time turning DataFrames into histograms, scatter plots and ROC curves in Jupyter Notebooks. I have the feeling knowing Big-O is not very relevant to that. Still, learning new languages, new words and new abstractions is almost guaranteed to influence one's way of working and thinking at some level.
So, indeed, the book probably doesn't hurt ;)
Edit: Just glanced over the link in Practicality's comment about Big-O, and sure enough, I think it may actually be useful as my ever-increasing pile of data grows even further! I have to admit: as the parameters increase, I find myself running overnight calculations more and more.
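(For the curious, the kind of table I mean looks roughly like this. This is only a sketch: the column names like "area_um2" and "marker_intensity" are made up for illustration, but the shape is accurate.)

    # Rough sketch only; column names are invented for illustration.
    import pandas as pd
    import matplotlib.pyplot as plt

    # Each row is one cell, each column a quantified parameter of that cell.
    cells = pd.DataFrame({
        "sample_id":        ["t01", "t01", "t02", "t02"],
        "area_um2":         [112.4, 98.7, 130.1, 87.3],
        "marker_intensity": [0.81, 0.42, 0.95, 0.30],
    })

    # Typical day: turn one of the columns into a histogram.
    cells["marker_intensity"].hist(bins=10)
    plt.xlabel("marker intensity")
    plt.show()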
How does it change what you do, though? I did signal processing with massive data streams rather than bioinformatics, but I assume the situation is similar. The algorithms are what they are. They are complex mathematical equations or transformations that need to be run on data and are often optimized without being able to change their asymptotic complexity.
Skiena's Algorithm Design Manual mentions him being brought in as an algorithmic consultant to modify some genetics-analysis software so that it'd actually finish, but I don't really remember the details or know enough about the field to give you plausible examples.
I can see that; I did a lot of similar work with signal processing algorithms. None of what I did affected asymptotic complexity at all, though. The asymptotic complexity was tied to the algorithms chosen, and changing those was an issue of trading computational performance for system performance.
EDIT: Misstated the big-O in this particular case (should've found my coworker's actual code). Both are O(m x n); one just has a larger constant.
Here's a pattern I've noticed in code written by a lot of people for processing a data file (Python-esque, using a match() function that's "left as an exercise for the reader" to implement):
    def search(filename, value):
        with open(filename, "r") as f:
            for line in f:
                if match(value, line):
                    print(line)
                # we don't care about not matching

    def main():
        for v in [search1, search2, search3, ...]:
            search("data.dat", v)
What happened is that one time they needed that search function, so they wrote search and it worked well. Then they realized they could run that same search function repeatedly, and for small data files and few searches it was quick enough. But the performance is O(m^2 x n) [EDIT: originally wrote O(m x n)], where m is the number of lines and n is the number of search values. [EDIT: wrong: a second m because it takes time proportional to the size of the file to read the file.]
The data file is read every time something is searched. If you've got an SSD, it's not really noticeable. If you've got a spinning disk, it becomes a problem. If you're hitting network storage, you're downloading that file n times. The main issue is that each read (each iteration of the inner for loop) hits the hard drive, the network, or similar. A simple performance hack is to move the read into main, put the whole file into one list of lines, and pass that list to search instead of the filename (modifying search appropriately):
    def search(data, value):
        for line in data:
            if match(value, line):
                print(line)
            # we don't care about not matching

    def main():
        with open("data.dat", "r") as f:
            data = f.read().splitlines()
        for v in [search1, search2, search3, ...]:
            search(data, v)
It's still O(m x n). [EDIT: It's now O(m x n), not the O(m^2 x n) claimed above; we've removed one of the m factors because we do the read once, and never again.]
For very large files and very large search parameter lists, this will still take a long time, but it's much faster than the previous version when you're dealing with large files.
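If you'd rather see the difference than take my word for it, a rough timing harness looks something like this. This is only a sketch: match() is just a substring test standing in for the real one, and the file size and number of searches are arbitrary illustrative values; how big the gap is depends heavily on where the file lives, as noted above.

    # Rough timing sketch; the trivial match() and the sizes are illustrative only.
    import time

    def match(value, line):
        return value in line

    def search_reread(filename, value):
        # the original pattern: open and scan the file for every search value
        with open(filename, "r") as f:
            for line in f:
                if match(value, line):
                    pass  # print(line) in the real thing

    def search_in_memory(data, value):
        for line in data:
            if match(value, line):
                pass  # print(line) in the real thing

    # Build a synthetic data file.
    with open("data.dat", "w") as f:
        for i in range(200_000):
            f.write("record %d with some text\n" % i)

    values = [str(i) for i in range(50)]

    t0 = time.perf_counter()
    for v in values:
        search_reread("data.dat", v)
    t1 = time.perf_counter()

    with open("data.dat", "r") as f:
        data = f.read().splitlines()
    for v in values:
        search_in_memory(data, v)
    t2 = time.perf_counter()

    print("re-read every time: %.2fs, read once: %.2fs" % (t1 - t0, t2 - t1))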
EDIT:
Shortest code I can think of to get the actual worst case that I've had a few coworkers pull off:
    def search(filename, value):
        with open(filename, "r") as f:
            data = f.read().splitlines()
        for line in data:
            if match(value, line):
                print(line)
            # we don't care about not matching

    def main():
        for v in [search1, search2, search3, ...]:
            search("data.dat", v)
With, of course, other code in between because as vonmoltke points out, the above has clear problems. My point was about the structure of the bad pattern, not the specific implementation of it.
I would have never written the first example in the first place, and I don't need Big O calculus to tell me it's a bad idea. Even with a single file it is obvious that the initial implementation is doing unnecessary work and that re-reading a file from disk every time is ridiculous (unless the file is too large for memory, in which case I would pass the list of search terms to the search function and check each line for all terms as the lines are read).
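A sketch of that last variant, with match() still left to the reader and the search-term list as hand-wavy as in the examples above:

    def search_streaming(filename, values):
        # One pass over the file; each line is checked against every search
        # term as it is read, so only one line needs to be in memory at a time.
        with open(filename, "r") as f:
            for line in f:
                for v in values:
                    if match(v, line):
                        print(line)
                        break  # assuming one report per matching line is enough

    def main():
        search_streaming("data.dat", [search1, search2, search3, ...])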
It only looks like a bad idea because of the close proximity in my example. What I've seen normally is that it's grown into something mimicking this structure, but actually far more complex. The point where the file read happens isn't so near the top, so that refactor is less obvious, and it's so deep that the person who puts it into that outer loop in main may not realize what's happening internally (fully, at least).
I'm trying to recall the structure of another case where this happened with a more complex internal algorithm. The solution was far less obvious, but required similar refactorings. In that case it was both reading the file multiple times and a several-layers-deep loop where one layer (by far the longest running) could be refactored to happen only once instead of 100 or so times; we flipped some of the loops around (moved it to be the outer loop, similar to the idea of moving the loop over all lines to be the outer loop in my other example). Big-O wasn't essential (for me), because I'd internalized that sort of thinking. But that explanation was essential for my colleagues (EEs, a couple of years out of school) who hadn't been exposed to that construct before (at least not enough for it to stick).
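In rough outline it was something like this (names invented; the real code was much messier and the slow step wasn't a file read sitting in plain sight):

    # read_and_preprocess() and analyze() are invented stand-ins for the real steps.

    # Before: the long-running step sits inside the per-parameter loop,
    # so it runs once per parameter set (100 or so times).
    def process_before(filename, param_sets):
        for params in param_sets:
            lines = read_and_preprocess(filename)   # by far the slowest step
            for line in lines:
                analyze(line, params)

    # After: the slow step is hoisted out and runs exactly once, and the loop
    # over its output becomes the outer loop.
    def process_after(filename, param_sets):
        lines = read_and_preprocess(filename)
        for line in lines:
            for params in param_sets:
                analyze(line, params)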
> It only looks like a bad idea because of the close proximity in my example. What I've seen normally is that it's grown into something mimicking this structure, but actually far more complex. The point where the file read happens isn't so near the top, so that refactor is less obvious, and it's so deep that the person who puts it into that outer loop in main may not realize what's happening internally (fully, at least).
I don't see how Big O calculus helps here. If you have enough understanding to run that analysis you have enough understanding to see it is trivially a dumb idea.
> I'm trying to recall the structure of another case where this happened with a more complex internal algorithm. The solution was far less obvious, but required similar refactorings. In that case it was both reading the file multiple times and a several-layers-deep loop where one layer (by far the longest running) could be refactored to happen only once instead of 100 or so times; we flipped some of the loops around (moved it to be the outer loop, similar to the idea of moving the loop over all lines to be the outer loop in my other example). Big-O wasn't essential (for me), because I'd internalized that sort of thinking. But that explanation was essential for my colleagues (EEs, a couple of years out of school) who hadn't been exposed to that construct before (at least not enough for it to stick).
What's funny is that I am an EE, and the way I internalized complexity analysis and optimization was to look at the number of operations being performed, along with the cost of those operations, and to design the code so that it used the fewest resources. Only later did I learn this "Big O" thing, and it seemed stupid because it was overly complex and told me to throw out significant factors that I had spent my career worrying about. I still don't really see its value over more detailed methods that seem trivially easy to me, like simply deriving an approximation of the complete polynomial describing the runtime, memory usage, or what have you. I am a systems engineer and have a bias towards modeling things, though.
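To make that concrete with the file-search example above (the constants are made up, just to show the kind of model I mean):

    # m = lines in the file, n = number of search values; constants are invented.
    def runtime_reread(m, n, c_open=1e-3, c_read=5e-7, c_match=1e-7):
        # T(m, n) = n * (c_open + c_read*m + c_match*m): open, read and match per search
        return n * (c_open + (c_read + c_match) * m)

    def runtime_read_once(m, n, c_open=1e-3, c_read=5e-7, c_match=1e-7):
        # T(m, n) = c_open + c_read*m + c_match*m*n: read once, match per search
        return c_open + c_read * m + c_match * m * n

    # Both models are O(m*n); the full polynomial keeps the factors that Big-O
    # throws away, which is exactly where the real-world speedup lives.
    print(runtime_reread(1_000_000, 100), runtime_read_once(1_000_000, 100))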