Certainly an interesting discussion of some algorithms here, however...
"In fact the number of checks grows quadratically with the length of your list"
No, this is not necessarily true at all. If you are talking about an interview-style question where we are discussing a linked list in memory, not some giant implicit list of integers, then it is very easy to beat O(n^2). If the list size is known, it's very easy to construct a hash table or hash set where collisions are extremely low and O(1) lookups are expected, giving an O(n) algorithm. A binary search tree would give O(n log n).
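For concreteness, something along these lines is what I have in mind (a rough Python sketch; the Node type here is just a stand-in for whatever the real list uses):

    class Node:
        # stand-in singly-linked-list node, purely for illustration
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    def has_cycle_hashset(head):
        # remember every node seen so far; O(n) expected time, O(n) extra space
        seen = set()
        node = head
        while node is not None:
            if node in seen:        # default object hashing is by identity
                return True
            seen.add(node)
            node = node.next
        return False                # fell off the end, so no cycle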
Part of the problem is that the list size is not known. If the list size were known to be N, then you simply traverse N elements and check if the tail is null or not.
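In other words, roughly (a Python sketch, assuming a node type with a next field):

    def has_cycle_known_size(head, n):
        # if the list holds at most n nodes, follow n links from the head;
        # still being on a node after n links is only possible with a cycle
        node = head
        for _ in range(n):
            if node is None:
                return False
            node = node.next
        return node is not None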
To some extent, yes. Suppose you want to use the technique of running around, remembering everything you've seen, then spotting the first duplication. If you're not allowed to mark the nodes then you need to create and maintain some sort of external data structure to remember everything you've seen. Space complexity is O(n); time complexity depends on the technique, but since you have no information about the size you can assume O(n log n).
But the space complexity is the killer, so let's assume you can mark each node. If the mark starts off in a known state then you're OK and the algorithm is trivial.
So what if you don't know the starting state of the mark on each node? That means you have to run around the list once, setting each mark to a known state. But how do you know when you're done? There might be a loop, so you need a loop detection algorithm.
So even if you're allowed to mark the nodes this is non-trivial.
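For what it's worth, the trivial version I mean looks something like this (a Python sketch, assuming you're allowed to scribble a spare boolean onto each node and that the mark genuinely starts out clear):

    def has_cycle_by_marking(head):
        # O(n) time, O(1) extra space, but it mutates the nodes
        node = head
        while node is not None:
            if getattr(node, "mark", False):   # seen before, so there's a cycle
                return True
            node.mark = True
            node = node.next
        return False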
Space complexity is the issue for only a subset of problems, those with extremely large input sets. I think those cases are the exception rather than the rule. I certainly agree that special techniques are required with gigantic or streaming data sets. Finding that size limit is probably interesting... I guess my inclination is to go for either the hash table or search tree solution unless I really, really have to implement some special algorithm.
How would you ever not know the start state of the mark on each node? It's a field you invented; null field = not seen. If you are talking about a very large list, it's not going to fit into memory - things tend to get extremely tricky with really big data sets like that. Perhaps it's a node per file? A file where you could add an additional field?
Your comments are interesting; they show just how much my background has colored my thinking and caused me to take for granted situations that other people just don't think of.
> Space complexity is the issue for only a subset of problems ...
Hmm. I deal on a daily basis with datasets that are several GB in size, and don't regard them as especially large. If I wanted to find a loop in a linked list, I'd worry about allocating memory to a subsidiary data structure that was subsequently going to be thrown away. Given that I often don't know the size of the data set, I can't easily create a hash structure, so having a wide selection of algorithms like this immediately to hand gives me a good chance of finding something that will work.
My first application of this technique was in an implementation of a variation on the Pollard Rho method of factoring. In that case it's impossible to create a parallel data structure - the teleporting turtle-type of technique is essential.
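For readers who haven't met it, here's a textbook sketch of the Floyd-style Pollard rho (not my exact variation) - the "list" is the implicit sequence x, f(x), f(f(x)), ..., and there is nothing to hang a mark or a hash entry on without storing the whole sequence:

    from math import gcd

    def pollard_rho(n, c=1):
        # pseudo-random sequence x -> x^2 + c (mod n); a cycle modulo an
        # unknown factor of n is detected with the tortoise and hare
        f = lambda x: (x * x + c) % n
        tortoise = hare = 2
        d = 1
        while d == 1:
            tortoise = f(tortoise)              # one step
            hare = f(f(hare))                   # two steps
            d = gcd(abs(tortoise - hare), n)
        return d if d != n else None            # None: retry with another c

For example, pollard_rho(8051) should turn up the factor 97 (8051 = 83 * 97).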
> my inclination is to go for either the hash table
> or search tree solution unless I really, really
> have to implement some special algorithm.
It seems to me that in most of the cases you're considering you wouldn't have a linked list to start with. I get the feeling that the situations where one would seriously consider using a linked list as opposed to a hash table or tree search are those situations where you do have to worry about these issues.
As usual the answer is really to question why there's a linked list to start with.
> How would you ever not know the start state
> of the mark on each node? It's a field you
> invented, ...
Again, not always the case. I regularly deal with libraries where I'm given an API, but I don't get to define the fields. This happened to me recently: I had a complex data structure and I wanted a flag on each node, but couldn't have it. The first solution we thought of built a subsidiary data structure to keep the additional information, but that destroyed the performance. In the end we used a trick like the teleporting turtle, which reminded me of Floyd's algorithm, and so I wrote it up.
Interesting point about APIs. Did you try doing one O(n) pass to dump the list into a secondary format? Just walk the list, creating a new list of your own with the new field and then walk it again to do cycle detection? That's two O(n) walks, still O(n) and should take like an hour to implement. I'm assuming you can get some unique ID for each node from the API, which seems reasonable.
Could you give some more insights into the scenarios where you don't know the size of the data? I agree it is an interesting algorithm and I would like to have a better grasp over use cases.
>you wouldn't have a linked list to start with
In my experience I don't typically turn up linked lists in the wild, just relationships, which means graphs, and some of them have a structure that maps onto a linked list. I find myself thinking more about the general problem of cycle detection in graphs - a linked list is, after all, only a special case of a graph. Any relationship-centric data can be scraped and end up as a big pile of data where you want to find cycles.
I guess my thinking was colored by the start of the article - "It's a classic interview question: Given a (singly-)linked list, determine whether or not it has a cycle."
If the question had said, given an immense linked list or in constant space or without knowing the size, I would have thought differently. If someone gives me a linked list, I assume I have the whole thing.
I work with large databases fairly regularly myself. These usually come from some data collection effort (e.g. scraping) and I end up with a pile of data I need to make sense of. In my experience, identifiers are typically a relatively small amount of data compared to the rest. Just as a concrete example, I'm currently forklifting about 4.5M entities in 22GB, with the largest set of identifiers coming out to about 250M. So, with long identifiers that means about 2GB of identifiers for the largest entity class - a relatively easy to deal with number.
> Did you try doing one O(n) pass to dump the
> list into a secondary format?
In this case it wasn't a list, and there was the problem of knowing whether I'd already walked the entire structure, because there may have been loops, and it's the entire loop detection problem all over again. That was one problem with trying to build the subsidiary structure. We had to do it lazily, and, oh, it's too complex to describe while obfuscating what we were doing, and I can't tell you the details because it's commercial in confidence. Sorry.
I should add that a lot of my contexts are semi-embedded systems, so limitations in memory, disk, CPU and bandwidth are all usual.
> Could you give some more insights into the scenarios
> where you don't know the size of the data?
Sadly, no, sorry. In some sense the original problem really encapsulates the essence. Your ideas are valid in many, perhaps virtually all cases, but sometimes you need to solve the problem as given, without finding tricks to get around it. I come across problems like that, which is why I like to have a wide range of techniques.
> I guess my thinking was colored by the start of the
> article - "It's a classic interview question: Given
> a (singly-)linked list, determine whether or not it
> has a cycle."
Noted - I'll look at adding a side note to put it into the more general context, to help avoid boxing people's thinking in.
Thanks for the comments and the interesting article. I can see your point about how traversing the structure can result in the same cycle detection issue. I'm actually very interested in algorithms, but lately I try to do everything I can to avoid implementing them, basically for maintenance reasons, because I am totally in a get-the-thing-out-the-door mindset. I guess my typical problem-solving philosophy is to try the laziest practical solution, simply from having too much to do. Good luck in your work!
Not necessarily: here's a file or chunk of memory that's 10MB in size representing a list - is there a cycle? That's a different problem than an infinite sequence of integers.
(I know this was 17 days ago, but I just noticed this comment).
Being contained in a finite amount of memory does not give a list a size. Lists with cycles may occupy a finite number of nodes, but saying they have a "size", despite the fact that you can iterate over them infinitely and never reach the end, is oxymoronic.
An infinite sequence (which a list with a cycle is) has no size.
> doubling the size of the hash table whenever
> there are too many entries in it.
And then re-hashing and dealing with all that implies. Surely all of that is much harder than chasing a pair of pointers down the list.
Also, it doesn't work in the case of, say, Pollard's Rho method of factoring. I really need to change the original article to make it clear that that's a case I'm interested in.
First of all there is a bug in his first pseudocode set. As listed, the hare can easily skip past the turtle unnoticed. If the loop is a particular size, the hare can keep skipping past the turtle forever, and the algorithm will operate forever without indicating there is a loop. He needs to perform a comparison after each step of the hare.
Regarding the teleporting turtle optimisation, I am not convinced it is an actual optimisation. It will work faster in certain cases, but not in others. For example, imagine there is a big loop in the list. Let's say the hare is halfway through the loop when the turtle just enters the beginning of the loop. At this point, the hare needs to catch up to the turtle and is only half a loop behind. However, imagine that the algorithm decides to teleport the turtle to the hare at precisely this point in time. Then the hare is an entire loop behind the turtle, which means it will be twice as slow to catch up with it, just because of the teleporting.
So the "optimisation" is faster than the classic algorithm for some types of lists and slower for others. So you cannot say whether it is an optimisation unless you know something about the lists you will have to handle. So it is not really an optimisation per se.
I suspect the teleporting turtle algorithm may be considered an optimisation because people assume that most lists will not have a loop and a list with a loop is a rare anomaly. In that case it is in fact an optimisation because it will verify a large non-looped list faster than the classic algorithm.
I don't think you're correct that the hare can skip past the tortoise indefinitely. Certainly, I cannot find such a loop.
On the question of whether it is a constant factor optimization, though, it rather depends on the relative costs of following a link versus the bookkeeping accounting for steps taken. Empirically tested in a simple program, teleporting turtle (TT) is slower than tortoise and hare (TH) when simply following linked lists laid out adjacently in memory. When retrieving a link is made more costly, TT pulls ahead. I cannot find any scenario where TT follows more links than TH, only fewer or the same.
So it seems that following links is the thing that is being optimized for, which makes sense for finding loops in PRNGs and similar problems: you want to optimize to reduce the number of links taken. As I explain in a different message, because each iteration of TH is so much more costly than TT, TT can't really lose.
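The test was roughly of this shape - a Python re-sketch rather than the actual program, counting only link follows on a rho-shaped list (a straight tail leading into a loop):

    class N:
        # bare node just for this experiment
        def __init__(self):
            self.next = None

    def make_rho(tail_len, loop_len):
        # tail_len nodes in a straight run, then a loop of loop_len nodes
        nodes = [N() for _ in range(tail_len + loop_len)]
        for a, b in zip(nodes, nodes[1:]):
            a.next = b
        nodes[-1].next = nodes[tail_len]        # close the loop
        return nodes[0]

    def th_follows(head):
        # tortoise and hare, counting every node.next dereference
        count = 0
        tortoise = hare = head
        while True:
            hare = hare.next; count += 1        # hare, step 1
            if hare is None or hare is tortoise:
                return count
            hare = hare.next; count += 1        # hare, step 2
            if hare is None or hare is tortoise:
                return count
            tortoise = tortoise.next; count += 1

    def tt_follows(head):
        # teleporting turtle: one link follow per tick, teleports are free
        count, limit, steps = 0, 1, 0
        turtle = hare = head
        while True:
            hare = hare.next; count += 1
            if hare is None or hare is turtle:
                return count
            steps += 1
            if steps == limit:
                turtle, limit, steps = hare, limit * 2, 0

    head = make_rho(tail_len=1000, loop_len=1000)
    print(th_follows(head), tt_follows(head))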
Actually it seems you are right on both counts. In the first case the hare would not actually jump over the turtle except at the very beginning. It is funny how that works; it is rather counter-intuitive.
And in the second case, I think in the worst case scenario the optimization performs the same number of link followings as the original algorithm. That is because the original algorithm performs duplicative work as the turtle has to follow the same links the hare did.
But of course if following a link was that expensive, one would modify the original algorithm to save all the links that the hare traversed so that one does not have to pay for the turtle to follow them. This will result in the original algorithm being faster in certain circumstances than the optimisation (although it may use a bit more memory).
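Roughly what I mean, as a sketch (hypothetical Python, names invented; the point is only that the tortoise replays stored nodes instead of dereferencing links):

    def th_saving_hare_path(head):
        # record every node the hare visits so the tortoise can replay them
        # for free; same comparisons as the classic tortoise and hare, but
        # only the hare ever follows a link - at the price of O(n) memory
        seen = [head]          # seen[i] is the node i links from the head
        t = 0                  # tortoise's index into the recorded path
        hare = head
        while True:
            for _ in range(2):             # hare: two link follows per tick
                if hare is None:
                    return False           # fell off the end, no cycle
                hare = hare.next
                seen.append(hare)
            t += 1                         # tortoise: replay, no link follow
            if hare is not None and hare is seen[t]:
                return True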
Well, anyway, I was wrong at the top; I am sorry if I offended anyone by stating that they had bugs in their pseudocode. I should not argue with stuff that has obviously been academically developed, argued, and double-checked to death before I have seen it.
In terms of saving the links, I think it's not necessarily the wisest course for scenarios like PRNGs etc., where that may mean saving millions or billions of intermediate number results.
The TH algorithm is Floyd's algorithm, while the TT algorithm is Brent's algorithm in the Wikipedia cycle detection link.
Reading the article, I too felt that another check was needed, to prevent the hare from sometimes leapfrogging eternally.
While drafting a snappy comeback to your comment that there was no such loop, I tried to find the smallest loop that demonstrated the bug, and came across a very nice pattern:
1: (1,1),(1,1)
2: (1,1),(1,2),(1,1)
3: (1,1),(3,2),(2,3),(1,1)
4: (1,1),(3,2),(1,3),(3,4),(1,1)
5: (1,1),(3,2),(5,3),(2,4),(4,5),(1,1)
6: (1,1),(3,2),(5,3),(1,4),(3,5),(5,6),(1,1)
...
Put another way, once the tortoise enters the loop, it will take <=n-1 steps (where n is the size of the loop) for them to come together (without teleportation).
This problem seems like way too much fun to use in an interview!
The point is that (hare position minus tortoise position, modulo loop size if there is a loop) increases by exactly 1 every time. So if there is a loop it will eventually reach 0, fewer than (loop size) steps after entering the loop.
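Concretely, a sketch of the usual tortoise and hare, with that offset argument in the comments:

    def has_cycle_tortoise_hare(head):
        # Floyd's tortoise and hare: one comparison per tick is enough,
        # because inside a loop (hare - tortoise) mod loop_size grows by
        # exactly 1 per tick and so must eventually hit 0 - no leapfrogging
        tortoise = hare = head
        while hare is not None and hare.next is not None:
            tortoise = tortoise.next
            hare = hare.next.next
            if tortoise is hare:
                return True
        return False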
Odd. The "teleporting turtle" was the first solution I thought I recalled after being reminded of this problem by the link, but when I read the tortoise/hare solution, I knew that was the original solution I had read when I first heard about the problem.
In other words, while trying to remember the details of the tortoise and hare solution, I turned it into the described teleporting turtle solution, the intuition being that doubling the length of the cycle check would amortize to constant time per link.
The cycle might not include the head of the list. You need to keep moving the turtle until it is in the cycle. Otherwise the rabbit will enter the circle and keep running around and never see the turtle again.
If you have the tortoise and hare start at 1, but there is a loop going 2 -> 3 -> 2 -> ... then, without moving the tortoise, the hare will never find the tortoise or the end of the list.
To say, "now on every clock tick we only take one step, not three," seems to imply that the optimized version takes 1/3 as many steps. The mammal still doesn't catch the reptile without doing two full laps while the reptile does one.
For an n-entry loop: The original version has the tortoise take n steps and the hare take 2n steps. The optimized version has the rabbit take 2n steps and the turtle take lg(n) steps.
For an n-entry line: The original version has the tortoise take n/2 steps and the hare take n steps. The optimized version has the rabbit take n steps and the turtle take lg(n) steps.
Your misreading: the turtle doesn't move at all. It teleports to the rabbit's location. (Counting following a link as a move, which seems to be the costly operation being optimized for.)
For every full iteration of the loop, TH (tortoise and hare) follows 3 links, while TT (teleporting turtle) follows 1 link. In the loop case, where the loop of length n is linked in after k steps, there must be at least k iterations of both loops (the turtle or tortoise must reach the loop). In the no-loop case, there must be at least n iterations of the TT case but TH can get away with n/2 iterations. Unfortunately, each iteration of TH has more than twice the cost of an iteration of TT, when optimizing for following links.
If both Tortoise and Hare have just entered a cycle of length n, the original algorithm takes n iterations before the Tortoise is caught (2n steps for the Hare, n steps for the Tortoise). Likewise, TT takes n steps (although it may take some additional steps before the turtle is teleported into the loop; once this has happened, it's n steps.)
> once this has happened, it's n steps
Unless the turtle teleports some time during those n steps. The only way the turtle won't teleport during that time is if the rabbit has already taken at least n steps to get to the loop, in which case, it only eliminates up to 1/4 of the mammal moves.