It's important to note that the datasets used are sparse, and that the key to this algorithm is better exploitation of that sparsity. The GPU-over-CPU advantage is a lot smaller if you need sparse operations, even with conventional algorithms.
"It should
be noted that these datasets are very sparse, e.g., Delicious
dataset has only 75 non-zeros on an average for input fea-
tures, and hence the advantage of GPU over CPU is not
always noticeable."
In other words, they got a good speedup on their problem, but it might not apply to your problem.
WaveNet, if I remember correctly, uses a 1-from-256 encoding of its input features, and a 1-from-256 encoding of its output features.
It is extremely sparse.
If you look at language modeling, then things there are even sparser - a typical neural language model has 1-from-several-hundred-thousand for a full language (for Russian, for example, it is in the range of 700K..1.2M words, and it is much worse for Finnish and German) and 1-from-a-couple-of-tens-of-thousands for a byte-pair-encoded language (most languages have an encoding that reduces the token count to about 16K distinct tokens, see [1] for such an example).
The image classification task also has sparsity at the output and, if you implement it as an RNN, sparsity at the input (a 1-from-256 encoding of intensities).
Heck, you can engineer your features to be sparse if you want to.
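To make that sparsity concrete, here's a quick numpy sketch of my own (not from the paper or any of these models) of a 1-from-256 one-hot input:

    import numpy as np

    # Hypothetical 1-from-256 encoding of a single intensity value.
    intensity = 137                     # the observed value
    one_hot = np.zeros(256)             # dense representation: 256 floats
    one_hot[intensity] = 1.0

    # Sparse representation: just the non-zero coordinate.
    sparse = [intensity]                # one integer instead of 256 floats
    print(int(one_hot.sum()), sparse)   # 1 [137]

A dense kernel still multiplies by all 255 zeros; a sparse one only has to touch the single non-zero entry.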
I also think that this paper is an example of "if you do not compute it, you do not have to pay for it", just like in the GNU grep case [2].
Embedding tables aren't hard on the GPU (being only a lookup table), and the output softmax still requires you to do the full matrix multiply. The label may be sparse, but the computation is far from sparse.
> The reverse is true, embeddings are both the performance and memory-footprint bottleneck of modern NN models.
They may be a bottleneck, but the alternative is worse -- you can't fit complex models with large vocabularies into GPU memory using sparse one-hot encodings.
Technically, the sparse one-hot encoding is the most efficient in terms of memory footprint. You simply store the non-zero coordinates.
The problem in practice for GPUs is that sparse vector/matrix operations are too inefficient.
The whole point of something like this paper is to skip the entire 'densification' step and deal with the sparse matrix input directly as a sparse matrix. The LSH used in this paper improves on directly using SpMSpV (sparse-matrix, sparse-vector multiplication), as that is also inefficient on CPUs, although to a lesser extent than on GPUs.
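For illustration only - this is my own toy scipy sketch with made-up sizes, not the paper's SLIDE code - you can keep the input in coordinate form and do the sparse product directly, never materializing the dense vector:

    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)

    # Hypothetical Delicious-like setup: huge feature space, ~75 non-zeros per example.
    n_features, n_hidden, nnz = 500_000, 128, 75
    W = sparse.random(n_hidden, n_features, density=0.01, format="csr", random_state=0)

    # Sparse input: only the non-zero coordinates and their values are stored.
    cols = rng.choice(n_features, size=nnz, replace=False)
    x = sparse.csr_matrix((np.ones(nnz), (np.zeros(nnz, dtype=int), cols)),
                          shape=(1, n_features))

    h = x @ W.T      # sparse product; no 500,000-wide dense vector is ever built
    print(h.shape)   # (1, 128)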
> If you look at language modeling, then things there are even sparser - a typical neural language model has 1-from-several-hundred-thousand for a full language
Most real-world models don't use one-hot encodings of words -- they use embeddings instead, which are very dense vector representations of words. Outside of the fact that embeddings don't blow out GPU memory, they're also semantically encoded, so similar words cluster together.
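For anyone unfamiliar, here's a minimal numpy sketch (sizes are made up, not from any particular model) of what an embedding lookup is: the sparse token index just selects a row of a dense table, so the one-hot vector is never materialized.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab_size, embed_dim = 50_000, 300              # hypothetical sizes
    table = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

    token_ids = np.array([17, 4203, 17, 49_999])     # sparse, categorical input
    vectors = table[token_ids]                       # dense (4, 300) output: just a row gather
    print(vectors.shape)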
First, you need to compute these embeddings at least once - and there's your sparsity again! Second, these embeddings may differ between tasks, and the accuracy you get from using them may differ too.
For example, the embeddings produced by the CBOW and skip-gram word2vec models are strikingly different in the cosine-similarity sense - different classes of words come out as similar under CBOW versus skip-gram.
So you agree that the problem is fundamentally sparse? Embeddings are used to make sparse (e.g. categorical) data feasible on GPUs, and real-world models are limited by how large they can make the embeddings and still fit in GPU memory. Embedding lookups are also a compute bottleneck:
Why aren't GPUs better at sparse matrix math? Generally, sparse operations are memory-bandwidth limited, but GPUs/TPUs still have much faster memory than CPUs and more memory bandwidth in general (roughly a factor of 4 or so between the latest CPUs and GPUs).
Sparse matrix math basically boils down to indirect array references: A[B[i]]. GPUs generally trade memory latency for bandwidth, relying on having a lot of other work available to hide that latency. But because there's no work between the first and second load, you can no longer hide the memory latency of the second load with extra work.
CPUs, by contrast, have a thorough caching hierarchy that tends to focus on minimizing memory latency, so it doesn't take as long to do the second load compared to a GPU.
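A minimal numpy sketch of that A[B[i]] pattern (my own toy example, arbitrary sizes): the second load can't be issued until the index from the first load arrives, and there's no arithmetic in between to hide the latency behind.

    import numpy as np

    rng = np.random.default_rng(0)

    A = rng.random(10_000_000)                      # big array sitting in DRAM
    B = rng.integers(0, A.size, size=1_000_000)     # data-dependent indices

    # The gather A[B[i]]: for each i, load B[i], then load A at that (random) address.
    gathered = A[B]
    print(gathered[:3])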
Yeah, on the GPU you ideally need your threads to load consecutive memory locations to utilize the memory bandwidth properly. Random indexing blows this out of the water. I guess you could pre-process on the CPU, though, to pack the sparse data for better GPU efficiency.
Another thing to note is that sparsity is being leveraged even to build a more efficient version of hardware. A good example of this is the Cerebras Waferscale chip that was announced recently. I'm assuming the author was unaware of developments on the hardware side of things.
"SLIDE doesn’t need GPUs because it takes a fundamentally different approach to deep learning. The standard “back-propagation” training technique for deep neural networks requires matrix multiplication, an ideal workload for GPUs. With SLIDE, Shrivastava, Chen and Medini turned neural network training into a search problem that could instead be solved with hash tables"
Seems this is related to an adaptive approach which GPUs don't currently support, but could be made to. I think this means the next version of TPUs will support it, and then GPUs will follow closely after.
No, their approach changes the fundamental access pattern into something anathema to GPU and TPU architectures.
In ELI5 or layman's terms: current GPU/TPU accelerators are specialized in doing very regular and predictable calculations very fast. In deep learning a lot of those predictable calculations are not needed, like multiplying by zero. This approach leverages that and does only the minimal necessary calculations, but that makes the work very irregular and unpredictable. Regular CPUs are better suited for that kind of irregular computation, because most other general-purpose software is like that as well.
They don't implement SLIDE on GPU, so we don't actually know if the CPU is faster than the GPU. It's a comparison of SLIDE on CPU vs. full softmax & sampled softmax on GPU.
They should at least give a rationale for why a GPU would not speed up a locality-sensitive-hashing-based algorithm. GPUs are used to compute hashes fast (they were once used for Bitcoin mining).
That's a valid point, but it does not settle the question.
Obviously the memory throughput will not be as high as with matrix calculations, but the algorithm could still be optimized for the GPU. GPUs can do random access and have large, fast memories. What is the difference in memory line size? 64 bytes vs. 128 bytes?
GPUs would still be faster than CPUs. You describe them as high-latency but their memory latency is comparable to CPUs. That's why ethash mining or equihash mining (workloads bottlenecked by short ≤32-byte random memory reads) is still faster on GPUs than on CPUs. Also see https://news.ycombinator.com/item?id=22505029
32-byte accesses are not short. 8 bytes (a double-precision float) is shorter, and that is what makes sparse matrix multiplication hard on a GPU.
Also, the SHA-256(d?) employed by ethash is actually quite long - 80 cycles at the very least (a cycle per round). In mining you can interleave the hashing computation for one header with the memory loads required for mining another header, and from what I know this is what CUDA code on a GPU will do.
The sheer amount of compute power makes ethash mining faster on GPU.
Reads shorter than 64 bytes on a CPU all cost you the same: a packet of 64 bytes on the memory bus, because that's the atom size of modern CPUs' DDR4 memory controllers...
On GPUs the atom size is 32/64 bytes. So GPUs are always better than or equal to CPUs when it comes to small reads/writes.
It's true that the compute part of ethash is not negligible, but to give you one more data point: equihash spends even less compute on hashing, and GPUs still dominate CPUs.
A slightly more elaborate answer than the sibling post, to drive home how much happens on a simple read that is not cached:
- request to L1D cache, misses
- request to L2D cache, misses
- packet is dropped on the mesh network to access L3D, likely misses
- L3D requests load from memory from the memory controller, load is put in queue
- DRAM access latency: ~100-150
- above chain in reverse
This is the best-case scenario on a miss, because there could be a DTLB miss on the address (which is why huge pages are crucial in the paper) or there could be dirty cache lines somewhere in other cores that trigger the coherency mechanism.
Yeah, this is a line of research that's been ongoing for a while, and "CPUs, not GPUs" seems like its rallying cry, but this seems to involve multiple incomparables - like a combination of apples-to-oranges and bananas-to-peaches comparisons (different algorithms on different platforms doing different tasks judged according to different success criteria, jeesh).
The thing is, there's no benchmark for just neural-netting. Neural nets do well on a variety of image recognition tasks, but the nets that do well do so as particular trained instances on particular data. And, moreover, it's not just about doing well on a benchmark but about generalizing well. Essentially, neural nets aren't simply algorithms but paradigms (for good and ill).
Instead of the babble about CPU vs. GPU, they should be talking about the strengths and weaknesses of their approach to image recognition and related tasks. Their locality-sensitive hashing approach is certainly interesting, and some parts of it might hypothetically be useful for dealing with the weaknesses of the neural net approach (fragility, black-box unexplainability, etc).
As someone who has ported memory-bound workloads to GPU, I'd say SLIDE looks like it would run even faster on GPU than on CPU. I skimmed the paper, and SLIDE is described as a memory-bound workload, specifically random memory accesses to 2-10 GB of data. GPUs excel at this type of workload. For example, the Ethereum PoW (ethash) is memory-bound, and GPU ethash implementations are faster than CPU ones.
So I don't understand why the authors don't mention the possibility of implementing SLIDE on GPU. Of course, I could have missed something (I spent less than 10min reading the paper)...
No, that is definitely not the case; not all memory-bound workloads are of the same type.
Hash tables produce lots of data-dependent random accesses into DRAM, which are definitely not better on GPUs: warp divergence, bank conflicts, partial cache-line access inefficiencies, etc.
Even CPUs struggle on this type of pointer-chasing workload due to cache inefficiencies. Open addressing schemes such as Robin Hood hashmaps are popular because they reduce the amount of pointer chasing.
Your example is a false comparison, generating a hash is very different from pointer chasing the address generated by the hash.
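To illustrate the open-addressing point above (toy Python sketch of mine, not SLIDE's hash tables): lookups probe adjacent slots of one flat array, so they stay within a few cache lines instead of following a heap pointer per collision the way a chained table does. Robin Hood hashing refines this same probing idea.

    # Toy open-addressing (linear probing) map. Lookups scan adjacent slots of one
    # flat array instead of chasing per-bucket linked-list pointers.
    EMPTY = None

    class FlatMap:
        def __init__(self, capacity=64):
            self.slots = [EMPTY] * capacity

        def _probe(self, key):
            i = hash(key) % len(self.slots)
            while self.slots[i] is not EMPTY and self.slots[i][0] != key:
                i = (i + 1) % len(self.slots)   # next adjacent slot: cache-friendly
            return i

        def put(self, key, value):
            self.slots[self._probe(key)] = (key, value)

        def get(self, key, default=None):
            slot = self.slots[self._probe(key)]
            return default if slot is EMPTY else slot[1]

    m = FlatMap()
    m.put("neuron_42", 0.93)
    print(m.get("neuron_42"))   # 0.93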
GPUs still have higher memory bandwidth than CPUs though. So if you miss in caches all the time, GPUs can still potentially come out on top (as long as you can keep the higher memory latency under control, which the massive effective hyper-threading of GPUs should be able to help with). That's the point of using GPUs for solving memory-hard problems for mining those "ASIC-resistant" coins.
"Your example is a false comparison, generating a hash is very different from pointer chasing the address generated by the hash."
That's not true at all in the case of ethash, where the running time is dominated by waiting for memory read ops to complete, not waiting for ALU ops (hashing) to finish executing.
I have also written an Equihash miner where the workload is similar: running time dominated by hashtable reads or writes, and can confirm GPUs beat CPUs.
I reiterate: GPUs excel at data-dependent random reads, compared to CPUs. Sure, these are very difficult workloads for both CPUs and GPUs, but GPUs still trump CPUs. That's because the atom size (the minimum number of bytes a GPU/CPU can read/write on the memory bus) is the same or better on GPU - 64 bytes on CPU (DDR4), and 32/64 bytes on GPU (HBM2) - and GPUs have significantly higher memory throughput, up to 1000 GB/s, while CPUs are still stuck at around 200 GB/s per socket (AMD EPYC Rome, 8-channel DDR4-3200).
So in ethash or Equihash mining workloads, many data-dependent read ops across a multi-GB data structure (much larger than local caches) will be mostly bottlenecked by (1) the maximum number of outstanding read ops the CPU/GPU memory controller can handle and (2) the overall memory throughput. In the case of GPUs, (1) is not really a problem, so you end up being bottlenecked by overall memory throughput. That's why GPUs win.
As of 3-4 years ago I remember Intel CPUs having a maximum of 10 outstanding memory operations, so (1) was the bottleneck. But things could have changed with more recent CPUs. In any case, even if (1) is not a bottleneck on CPUs, their lower memory throughput guarantees they will perform worse than GPUs on such workloads.
Correct, GPUs can indeed do a better job at hiding latency through massive parallelism.
My expertise might be outdated here, but the problem used to be that actually getting to that max bandwidth with divergent warps and uncoalesced reads was just impossible.
Is this still the case with Volta? Did you avoid these issues in your Equihash implementation?
Divergent warps are still a huge problem (but SLIDE doesn't have this problem AFAIK).
Uncoalesced reads are not a problem severe enough to make GPUs underperform CPUs. Or, said another way, uncoalesced reads come with a roughly equally bad performance impact on both GPUs and CPUs.
The only reason GPUs can hide the latency is the massive parallelism in the problem space (computing the hash for nonce n doesn't block nonce n + 1). This algorithm involves a lot of data dependency, so a machine training these networks may actually be memory-latency bound (unless you are training a ton of neural networks and can hide the latency), which is extremely bad for GPUs.
Well, depends on the size of the hash table and the particular memory access patterns. Lookups into many small hash tables, or workloads in which most threads in a threadgroup all fetch the same entry, can be very efficient on GPUs. Sparse virtual texturing is often implemented with a hash table and works well on GPUs because the hash table involved has both of these properties.
Yes, a very good point. I am assuming the tables are quite large due to the workload. If they're large enough to benefit from large pages via reduced DTLB misses, they're likely too large for warp-local memory :)
Ethash mining is not memory bound. It has, if I remember correctly, a precomputed buffer used by many operations.
To quote Wikipedia: "Ethash uses an initial 1 GB dataset known as the Ethash DAG and a 16 MB cache for light clients to hold. These are regenerated every 30,000 blocks, known as an epoch."
You generate it once, access many times.
If each memory access is 32 bytes, then you need no more than 32*2^20 (~33M) parallel computations to cover the whole 1 GB buffer with completely sequential accesses. To very probably not miss a prefetched DRAM block (the main throughput bottleneck), which is typically 4K..8K in size, you need about 128 times fewer computation threads. And that number (~250 thousand threads) is well within reach for current GPU models.
The state of the NN weights, on the other hand, changes after every batch.
For reference since it's kind of buried/obfuscated: Their point of comparison is between an NVIDIA Tesla V100 and a 44-core CPU. The latter is probably something like a Xeon E5-2699, which has a list price of $4115 USD. Hard to find accurate pricing data for the V100, but it looks like it was in the $7-10k USD range back in 2018. Still a cool cost/performance improvement but not as massive as I was expecting before I looked up the test hardware.
It's not just that the CPU is cheaper than the GPU. They also claim it's faster than the more expensive GPU. Meaning if you rent a box in the cloud for training it'll cost you less since you'll rent it for less time, and if you buy the hardware you can train more models and thus need less hardware to train as much as with GPUs before.
So if it's 3.5 times faster, then you require only roughly 30% of the time (or hardware) it took before; both savings combined seem pretty significant to me.
Edit: quick napkin math. Let's take your $7k figure for the GPU and $4115 for the CPU. We need roughly 30% of that CPU, so 4115 * 0.3 = 1234.5, and 1234.5 / 7000 ≈ 0.176.
> Their point of comparison is between an NVIDIA Tesla V100 and a 44-core CPU.
Not quite. Further in the paper: "Similarly, for larger Amazon dataset, the red and black line never intersect, and the red and blue line intersects on 8 cores. This means that SLIDE beats Tensorflow-GPU with as few as 8 CPU cores and Tensorflow-CPU with as few as 2 CPU cores."
Yes, they have a beast machine, but it can outperform the GPU with only 8 cores in some cases.
Also worth noting that that Xeon processor takes somewhere around 350-400W and the V100 is also in the range of 300W. So not huge on energy savings. Although a really cool, potentially industry shaking breakthrough, regardless.
The V100 would need to run for 3.5 times as long, so would use way more energy. E.g. training for 1 hour on CPU: 400W×1h = 400Wh of energy, vs training for 3.5h on GPU: 300W×3.5h = 1050Wh of energy.
Since the article doesn't specify how power draw was measured, I assume that they measured _total system_ power consumption. This is not to be confused with the CPU power requirements as it includes the entire platform and all losses (e.g. power supply, main board, peripherals, cooling, ...).
The new AMD EPYC 7742 should have double the memory bandwidth (they claim they are memory bound) and 1.5x the cores for a similar list price.
But in any case, I would not expect serious research to outright beat the current NNs by 10x on all dimensions. That will take some time (and may not fully happen), and this paper is certainly a great advance.
Maybe the real benefit here is that, due to the asynchronous nature of the algorithm, you could easily scale to more cores and the new approach will scale with them, whereas with the GPU you reach a point of diminishing returns at some point?
A more correct headline would be: "An algorithm beats a poorly optimized GPU implementation on a narrow problem that uses an extremely sparse dataset using $8.5K worth of CPU".
This is for recommenders only, and it does not translate to anything else. I don't know why everyone seems to misrepresent this as NVIDIA's undoing. Read the freaking paper, people.
I think I'm right in saying that the type of locality-sensitive hash systems they're talking about are not entirely dissimilar to Igor Aleksander's WISARD, RAM-based recognisers from the 80's. I suspect the latter is a special case of the former. How far off-base am I?
I feel like this is more a stunt for the new Intel Xeon. Creating a new paradigm on top of a heterogeneous hardware dependency is misleading.
Linear analysis could be explored more. They could achieve the same with less. Holomorphic functional analysis.
I don't think so. Keep in mind that DL training is only a minuscule part of GPGPU applications. HPC is still a huge market with a strong demand for compute accelerator cards.
In practice, TPUs and SoCs with inference extensions are a much bigger threat for NVIDIA in the cloud and automotive business.
NVidia could add special accelerator instructions to their GPUs to efficiently implement the same algorithms and then they would be significantly ahead of Intel and AMD.
This is obvious to me. Since Hadoop came out, (a lot of) people have been giving up on even forming algorithms and just dumping data into machine learning and hoping for the best. I recall someone high up at Google complaining about it.
We need to get back to forming algorithms as well as concepts and first principles. We cannot and should not expect ML to brute force finding patterns and just sit back and relax.
Here is another prediction for you: we will not solve ray-tracing in games and movie CGI with more hardware. We will need some algorithm that gets us 80-90% of the way there in a smart way.
This was my first thought. Well, to be more complete - smart algorithms beat dumb algorithms even if the dumb algorithms use hardware acceleration (unless the problem is trivial anyway.) Smart algorithms plus hardware acceleration beats smart algorithms on general purpose hardware. Smart algorithms are just better.
Sorry for the off topic comment but this then/than mistake I read every day is just getting on my nerves.
"
What to Know:
Than and then are different words. Than is used in comparisons as a conjunction, as in "she is younger than I am," and as a preposition, "he is taller than me." Then indicates time. It is used as an adverb, "I lived in Idaho then," noun, "we'll have to wait until then," and adjective, "the then governor."" [1]
You think that's a bad language mistake? The title (and the article itself) says that their CPU run is 3.5x faster than the GPU. But actually it's 3.5x as fast, which is a radically different thing: "3.5x faster" would mean "4.5x as fast", in the same way that "50% bigger" means "150% of the size", not "50% of the size".
Edit: Clearly this is a contentious comment, and even those of us who see things this way mostly seem to agree that we understand the intended meaning (but things get fuzzier with smaller numbers expressed as percentages, e.g. "120% faster"). Surely, though, it makes more sense to use the completely precise phrasing "3.5x as fast", especially for the main statement of the main result in an academic paper.
> I think "two times faster" would mean "twice as fast", not "three times as fast".
I don't see how that's the case. What is "50% bigger"? I understand it as 150% of the size. Similarly, I would understand "90% bigger" to mean "190% of the size" and "100% bigger" to mean "200% of the size", and I'm sure that is how they are used in practice.
So then surely "200% bigger" means "300% of the size"? And "two times bigger" - which is mathematically identical to "200% bigger" - would be three times the size. I acknowledge that the phrase is often used the way you say, but I don't think that is its literal meaning, and if you used that in a contract I think it would legally be interpreted in the way I've said (I'm thinking of consumer law in a situation where something is "n times bigger for the same price").
All this applies analogously to speed and "x times faster". I gave the examples above with size because I think it's a bit more common to talk about it that way, and speed is a bit more subtle because what we're measuring here is time, which is something divided by "speed", and the numerator isn't clear (number of training runs, perhaps).
Interestingly, in Russian "100% bigger" and "two times bigger" take different prepositions, so it's clearer that in the first case you do a sum (x + 100%·x = 2x), and in the second case you do a multiplication (2·x = 2x).
I'm quite sure it's the same in English (sum with % and multiplication with times), but I'm not a native English speaker.
I'm a native English speaker, I don't think I would ever say "two times bigger", sounds like bad grammar. I would read "2x the size" as "twice the size" or "two times the size", but not "add on twice the size".
So I agree with the article, "3.5 times faster (1 hour vs. 3.5 hours)" is perfectly correct, and it is OK to abbreviate "3.5 times" as "3.5x".
That's just sloppy language and you got used to it. Twice as fast means twice as fast. Two times faster means two times faster. One time faster would mean twice as fast. 0.5x faster would mean 50% faster or 150% as fast. 0.5x slower would mean 50% slower or 50% as fast.
I agree and, if you are mistaken, I'd have to observe that this is a spectacularly unclear way of communicating an increase.
The meaning of "two times faster" as "twice as fast" is certainly the way such a statement would generally be interpreted everywhere I've worked.
It is of course possible that the meaning suggested by quietbritishjim is archaic British, but I certainly don't believe it's current: I've worked in various places in Cambridge and London for the past 18 years and, as I say, have never encountered it.
> if you are mistaken, I'd have to observe that this is a spectacularly unclear way of communicating an increase.
I absolutely agree with that, and in practice if I ever see that turn of phrase with anything more than 100% then I assume that they are using it the way that you're thinking of. But I maintain this is not the literal meaning. At the same time, I'm not saying people should be subtracting 100% to make the number mathematically correct, that would definitely be bizarre. Instead, I'm saying they shouldn't be using such a weird turn of phrase in the first place, so the headline should simply be "Deep learning on CPU 3.5x as fast as on GPU"
> I've worked in various places Cambridge and London for the past 18 years and, as I say, have never encountered it.
You have really never encountered an item in a supermarket saying "now 20% bigger!"? Thinking about it now, they're usually charging the same amount as the old size (otherwise it's not much to brag about really) in which case they use the vastly clearer phrase "20% extra free", but I'm sure I've seen the former phrase.
> A personal note: I know the disparagement of Times-er from long ago, from my grade school years, I think; I was taught that two times more than X really means 'three times as many as X'. Since authority figures insisted on this interpretation, I avoided the construction entirely (as, as far as I know, I still do). Yet I've never stopped asking, "Why don't you understand the clear meaning of what people are saying?"
>> That last reaction incorporates one criticism of Times-er, namely that it is "illogical" or "irrational": X times more than Y MUST MEAN 'Y plus X-times-Y (that is, 'X+1 times Y'), not 'X times as many/great as Y' (that is, 'X times Y'). (In the most common variant of this reaction, X times more than Y is disparaged because it is said to be ambiguous, with both the 'X times Y' and 'X+1 times Y' interpretations.)
>> The appeal here is to the idea that ordinary-language expressions are simply realizations of logical (or arithmetical) formulas. This is just backwards. The formulas are there to represent the meanings of expressions; they are not the prior reality, merely cloaked in (those devilishly vague) words of actual languages.
What annoys me even more is that I keep using this comparison in the wrong way despite knowing better. BTW, what do you say if you made something twice as fast? Once faster?
People using 'further' instead of 'farther' (used for distance) used to get me upset, but people mistake them so often that I realized I was getting upset for no good reason. I know what they meant even if they were ignorant of the grammar. I agree that in an academic paper you don't want to see grammar mistakes, but in many other contexts if you understand what is meant, it's no use getting bent out of shape.
I think it could be easily attributed to the fact that most non-native speakers learn English at school by studying its grammar in written form, where the two words are distinct. Native speakers, instead, learn English as their spoken language, where the words sound basically identical to each other.
As a non-native, I learned "then" and "than" in different contexts, months if not years apart. I also learned them in speech and in writing at the same time.
Please don't quiz me on how to read bear, pear, tear, fear, spear, clear, and dear.
I do understand it but, for my colleagues and friends who don't have English as their first language, it adds another caveat to learn and remember without a logical basis. That's another place to introduce ambiguity and errors.
I don't get angry at non-standard usage but I think it's important not to ignore the purpose of consistent style.
yea very true! and as someone who grew up speaking french, i want people to stop making the viola/voila mistake too. viola is an instrument. voila is what you mean most of the time.
also off-topic, but i heard the word senglish, for singaporean english, on a podcast yesterday. that made me realize how english has the potential to become a universal language with each country having their own version.
french people already use the term frenglish when they mix english words with french. we could have spanglish, japenglish, germanish, etc. they don't have to be called those though.
it would be totally awesome to be able to communicate with almost everyone in the world. just like the internet!
It's not not knowing the meaning, it's simply a proofing mistake in phonemic writing. For a native speaker, spelling as it sounds is second nature, making morphological distinctions is harder.
For a non-native speaker it's hard to tell the two apart. Then and than sound pretty much the same. And even if one knows the difference, it's easy to make a typo.
In the sort of "international pidgin English" that's spoken anywhere outside the UK such subtle differences should just be ignored.
I don't know about that; this kind of mistake (then/than, effect/affect) gets on my nerves as well, and I'm definitely not a native English speaker.
Not to mention that this specific kind of mistake (similar sounds) comes at least as often from native speakers as from non-native ones, in my experience / native language.
In French, a lot of people mistake Ça for Sa for example, native and non-native alike
I don't agree,
I'm not a native speaker and, probably like many, I learned English by reading, so those two words sound very different in my head.
I'm always lost when I see this mistake.
I don’t think you’ll get much play for suggesting (to an audience of at least some programmers) that we should allow for more ambiguity in language, heh.
The programmers I’ve met without an eye for detail are usually ones I do not like working with.
Hehe, true, but unlike programming languages, the languages humans use for communication are "sloppy" and ambiguous by definition. Grammar rules have been invented after the fact to create the illusion that there's order where none exists.
English allows much more "freedom" than many other languages (e.g. German), maybe that's one reason why it has been so successful in the end.
If they are just ignored, how will anyone learn? If they really were small (to the meaning of the sentence) differences, then whatever, but switching than for then changes the sentence.
"It should be noted that these datasets are very sparse, e.g., Delicious dataset has only 75 non-zeros on an average for input fea- tures, and hence the advantage of GPU over CPU is not always noticeable."
In other words, they got a good speedup on their problem, but it might not apply to your problem.