I'm not a chip designer. I don't know the best decoding algorithms in the field.
What I can say is that I've come up with a decoding algorithm that's clearly O(n) total work and O(log(n)) depth. From there, additional work could discover faster methodologies.
The importance of proving O(n) total work is not "X should be done in this manner". It's that "X's worst-case asymptotic complexity is at most this value".
I presume that the actual chip designers making decoders are working on something better than what I came up with.
> Proof?
It's not an easy proof, but I think I've given enough information in sibling posts that you could work it out yourself over the next hour if you really cared to.
The Hillis / Steele paper is considered to be a good survey of parallel computing methodologies in the 1980s, and is a good read anyway.
One key idea that helped me figure it out is that you can run FSMs "backwards". Consider an FSM that matches "abcd" (state 1 = no match; state 2 = "a" matched; state 3 = "ab" matched; state 4 = "abc" matched; state 5 = "abcd" matched).
You can see rather easily that FSM(a)(b)(c)(d), applied left to right, is the normal way FSMs work.

Reverse-FSM(d) == 4 if and only if the 4th character is d; for any other character, Reverse-FSM returns the initial state. Reverse-FSM(d)(c) == 3.

In contrast, we can also run the FSM in the standard forward direction: FSM(a)(b) == 3.

Because FSM(a)(b) == 3 and Reverse-FSM(d)(c) == 3, the two states match up, and we know the final state was 5.
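To make the forward/reverse picture concrete, here's a minimal Python sketch of the 5-state "abcd" matcher. This is my own illustrative code (function names and the restart-on-'a' convention are my assumptions), not anything from a real decoder:

```python
EXPECTED = "abcd"

def forward(s, start=1):
    """Normal left-to-right FSM: state 1 = no match ... state 5 = "abcd" matched."""
    state = start
    for ch in s:
        if state == 5:
            pass                           # already matched; stay in state 5
        elif ch == EXPECTED[state - 1]:
            state += 1                     # advance on the expected character
        else:
            state = 2 if ch == "a" else 1  # restart on 'a', else back to no-match
    return state

def reverse_required(suffix):
    """Reverse FSM: which forward state must precede this suffix for the
    whole match to complete?  Returns None if no state works."""
    if EXPECTED.endswith(suffix):
        return 5 - len(suffix)             # "d" -> 4, "cd" -> 3, ...
    return None

assert forward("ab") == 3                  # FSM(a)(b) == 3
assert reverse_required("cd") == 3         # Reverse-FSM(d)(c) == 3
# The two sides meet at state 3, so the final state is 5 (full match).
assert forward("abcd") == 5
```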
As such, we can run the FSM backwards. By carefully crafting "reverse-FSM" and "forward-FSM" operations, we can run the finite state machine in any order.

Does that help explain the concept? The "later" FSM operations, applied to the later characters, are "backwards" operations (working backwards, e.g., from state 4 to the state 3 that must have preceded it), while the "earlier" FSM operations on the first characters are the traditional forward stepping of the FSM.

When the operations meet in the middle (i.e., FSM(a)(b) meets reverse-FSM(c)), all you do is compare states: reverse-FSM(c) declares a match if and only if the incoming state is 3. FSM(a)(b) is clearly in state 3, so you can combine the two and conclude that the state after the 3rd character is in fact 4.
-----
Last step: once you think in terms of forward and reverse FSMs, the "middle" FSMs are similar. Consider FSM2(b), which processes the 2nd character: its output is the link 2->3 if and only if the 2nd character is b.

FSM(a) == 2 and Reverse-FSM(d)(c) == 3, and the middle FSM2 returns (2->3). So we know the two sides match up: state 2 enters the 2->3 link, which lands on exactly the state 3 that the reverse side requires.
So we can see that all bytes: the first bytes, the last bytes, and even the middle bytes, can be processed in parallel.
For example, let's take FSM2(b)(c) and process the two middle bytes before we process the beginning or end. The 2nd byte is b and the 3rd is c, so FSM2(b)(c) creates a 2->4 link (if the state entering FSM2(b)(c) is 2, then the state leaving it is 4).

Now we do FSM(a), which puts us in state 2, and since FSM2(b)(c) has been precomputed into the 2->4 link, that puts us at state 4.
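The 2->4 link can be sketched concretely: represent each byte's effect not as a state but as a mapping from entry state to exit state. Function composition is associative, so the mappings can be combined in any order, even though the bytes themselves are ordered. A minimal sketch, assuming the 5-state "abcd" matcher from the text (the dict encodings are my own illustration):

```python
# Effect of each middle byte as a mapping state_in -> state_out,
# for the matcher: 1 = no match, 2 = "a", 3 = "ab", 4 = "abc", 5 = "abcd".
M_b = {1: 1, 2: 3, 3: 1, 4: 1, 5: 5}   # effect of reading 'b'
M_c = {1: 1, 2: 1, 3: 4, 4: 1, 5: 5}   # effect of reading 'c'

def compose(f, g):
    """Apply f, then g.  Composition is associative, so the order in which
    we *combine* mappings is free, even though the bytes are not."""
    return {s: g[f[s]] for s in f}

middle = compose(M_b, M_c)          # FSM2(b)(c), computed before touching a or d
assert middle[2] == 4               # the 2 -> 4 link from the text

state_after_a = 2                   # FSM(a) == 2
assert middle[state_after_a] == 4   # combine: we are in state 4 after "abc"
```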
So we can really process the FSM in any order. A Kogge-Stone-style parallel prefix over these transition mappings gives us O(log(n)) depth; a work-efficient scan (Brent-Kung / Blelloch style) keeps the total work at O(n). Done.
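The whole pipeline can be sketched as a prefix scan over per-byte mappings. This is a sequential Python model of the parallel algorithm (my own illustrative code): each round of the while-loop is fully parallel across `i`, and there are O(log n) rounds. Note this Hillis-Steele/Kogge-Stone shape does O(n log n) total work; a Brent-Kung-style scan would bring that down to O(n) at the same depth.

```python
EXPECTED = "abcd"

def byte_map(ch):
    """One byte's effect as a mapping state_in -> state_out, for the
    5-state "abcd" matcher (restart-on-'a' is an assumed convention)."""
    m = {}
    for s in range(1, 6):
        if s == 5:
            m[s] = 5                       # stay matched
        elif ch == EXPECTED[s - 1]:
            m[s] = s + 1                   # advance on the expected character
        else:
            m[s] = 2 if ch == "a" else 1   # restart on 'a', else no match
    return m

def compose(f, g):
    """Apply f, then g; associativity is what lets the scan reorder work."""
    return {s: g[f[s]] for s in f}

def scan(maps):
    """Inclusive prefix-compose: afterwards maps[i] covers bytes 0..i."""
    n, d = len(maps), 1
    while d < n:                   # O(log n) rounds
        nxt = list(maps)
        for i in range(d, n):      # every i is independent -> parallel lanes
            nxt[i] = compose(maps[i - d], maps[i])
        maps, d = nxt, d * 2
    return maps

prefixes = scan([byte_map(ch) for ch in "abcd"])
assert prefixes[-1][1] == 5   # start in state 1, end in the full-match state
```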
It's pretty obvious to me that the x86 prefix extensions to the ISA form a Chomsky Type-3 (regular) grammar. That means a (simple) regex can describe the x86 prefixes, which can be converted into a nondeterministic finite automaton, which can be converted into a deterministic finite state machine.

Or in more colloquial terms: you "parse" a potential x86 instruction by starting with the left-most byte and reading one byte at a time until you have a complete instruction.

Any grammar you can parse one byte at a time, left to right, with only a fixed amount of state is a Chomsky Type-3 grammar. (In contrast, 5 + 3 * 2 cannot be parsed that way: the 3 * 2 needs to be grouped before the 5 + part.)
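As a toy illustration of "read one byte at a time until you have a complete instruction": a grammar of the shape prefix* opcode is regular, and a DFA consumes it left to right. The byte values below are simplified stand-ins, not a real x86 decoder:

```python
# Toy Type-3 "instruction" grammar: zero or more prefix bytes, then one opcode.
# The byte sets here are illustrative placeholders, nowhere near real x86 tables.
PREFIXES = {0x66, 0x67}
OPCODES = {0x90, 0x01}

def parse(byte_seq):
    """Return the number of bytes in one instruction, or None on reject.
    Each step looks at one byte and a fixed amount of state: a DFA."""
    for i, b in enumerate(byte_seq):
        if b in PREFIXES:
            continue          # stay in the "reading prefixes" state
        if b in OPCODES:
            return i + 1      # accept state: instruction complete
        return None           # reject: unknown byte
    return None               # ran out of bytes mid-instruction

assert parse([0x66, 0x66, 0x90]) == 3   # two prefixes + opcode
assert parse([0x90]) == 1               # bare opcode
```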