Here's a quick example of a case where knowledge of data structures and algorithmic complexity is useful, regardless of the language.
Let's say you've got two lists, named "A" and "B", each containing 1000 integers. You want to return a list of the unique integers that are present in both lists (i.e. the intersection of the two lists). These 'lists' could be arrays or linked lists.
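For concreteness, here's a scaled-down version of what we want (just a handful of values instead of 1000):

A = [1, 2, 2, 3, 5]
B = [2, 3, 3, 4]
# The intersection here is {2, 3}: each shared value should show up once in the
# output, no matter how many times it appears in A or B.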
The brute force method would be something like this:
Out = []
for a in A:
    for b in B:
        if a == b and b not in Out:  # 'b not in Out' is an O(n) list search
            Out.append(b)
return Out
But this can be very slow: in the worst case you end up in the territory of O(n^3) iterations across A, B, and Out (internally the language is generally iterating across Out to answer that membership check). In this specific example, that works out to something on the order of 1000 x 1000 x 1000 = 1,000,000,000 iterations, vastly more than the size of the original lists.
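The hidden cost is that a membership test on a plain list is itself a loop. Conceptually, 'b not in Out' expands to something like this (a sketch of the idea, not the actual interpreter internals):

def list_contains(lst, value):
    # What 'value in lst' boils down to for a plain list: walk every element.
    for item in lst:
        if item == value:
            return True
    return False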
---
A better way is to sort both lists and then walk along them in parallel, comparing as you go (this syntax assumes the lists are specifically arrays, but it'd effectively be the same for linked lists):
Out = []
A.sort()
B.sort()
a = 0
b = 0
while a < len(A) and b < len(B):
    vala = A[a]
    valb = B[b]
    if vala == valb:
        if not Out or Out[-1] != vala:  # sorted order puts duplicates next to each other, so this keeps Out unique
            Out.append(vala)
        a += 1
        b += 1
    elif vala < valb:
        a += 1
    else:  # vala > valb
        b += 1
return Out
This is definitely better than the first version. Now we're performing two sorts, each O(n log n), then doing a single linear pass across both lists in parallel, which is O(n). So the overall complexity is O(n log n), dominated by those initial sorts. As a ballpark, each sort costs around 1000 x log2(1000), or roughly 10,000 comparisons, and the merge pass is at most about 2,000 steps, so call it somewhere in the low tens of thousands of operations. The exact number of comparisons in a sort varies with the algorithm used and how the lists were ordered to begin with.
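As a quick back-of-envelope check (a rough model of comparison counts, not an exact measurement):

import math

n = 1000
per_sort = n * math.log2(n)   # ~9,966 comparisons per sort, as a rough model
merge_pass = 2 * n            # at most ~2,000 index advances in the merge
print(round(2 * per_sort + merge_pass))  # ~21,900 -- tiny next to the ~10^9 above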
Not bad, definitely better than what we were doing before. But we might be able to do a little better...
---
Yet another way is to use hash sets, which have O(1) average-case insertions and lookups, and only store unique values: if you insert the same value twice, the second insertion is effectively ignored. We can do something like this:
Out_set = set()
A_set = set(A)
for b in B:
    if b in A_set:  # O(1) average-case hash lookup
        Out_set.add(b)
return list(Out_set)  # optional, could return Out_set directly
Now we end up with an algorithm that's O(n): we iterate over A once to build A_set, then iterate over B once, and each membership check is an O(1) hash lookup into A_set. Finally we iterate over Out_set once to build the output list; that last step is optional, since we could return Out_set directly, and it doesn't affect the algorithmic complexity either way. Now we're at roughly 2,000-3,000 iterations, depending on whether we return a set or a list. And this version arguably looks a bit simpler than the sort-and-merge version above.
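Worth noting: Python's built-in set type already exposes intersection directly, so in practice the whole thing collapses to a one-liner that does the same hashing work under the hood:

Out = list(set(A) & set(B))   # or: list(set(A).intersection(B))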
---
So just by using our knowledge of data structures, we've turned ~1,000,000,000 iterations into ~2,000-3,000 iterations. This is on the order of reducing the distance to the moon to around a kilometer. And that's just with two lists that are limited to 1000 items each.
And this sort of thing is a pretty common scenario in any language, no matter how 'abstracted' it is.
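If you'd rather see it than take my word for it, here's a rough benchmark sketch (the list sizes and value range are arbitrary, and actual timings will vary with your machine and data):

import random
import time

A = [random.randrange(5000) for _ in range(1000)]
B = [random.randrange(5000) for _ in range(1000)]

def brute_force(A, B):
    out = []
    for a in A:
        for b in B:
            if a == b and b not in out:
                out.append(b)
    return out

def with_sets(A, B):
    return list(set(A) & set(B))

for fn in (brute_force, with_sets):
    start = time.perf_counter()
    fn(A, B)
    print(fn.__name__, "took", time.perf_counter() - start, "seconds")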
Thanks a bunch for the detailed examples. I've noticed this is a huge problem in my code: too many nested for loops. I've generally gotten better about using built-in functions like set (in Python at least):
Out = list(set(A+B)) # right?
Or even DISTINCT in SQL (when dealing with webapps).
And now I never doubt the usefulness of fuzzing the DB with hundreds of thousands of random entries. That quick bit of code that took 40ms to run over several dozen entries turns into a minute or more over several thousand... Ouch! ;-)