It's important to note that the datasets used are sparse, and that the key to this algorithm is better exploitation of that sparsity. The GPU-over-CPU advantage is a lot smaller if you need sparse operations, even with conventional algorithms.
"It should
be noted that these datasets are very sparse, e.g., Delicious
dataset has only 75 non-zeros on an average for input fea-
tures, and hence the advantage of GPU over CPU is not
always noticeable."
In other words, they got a good speedup on their problem, but it might not apply to your problem.
WaveNet, if I remember correctly, uses a 1-from-256 encoding of its input features, and a 1-from-256 encoding of its output features.
It is extremely sparse.
If you look at language modeling, then things there are even sparser - a typical neural language model has 1-from-several-hundred-thousand for a full language (for Russian, for example, it is in the range of 700K..1.2M words, and it is much worse for Finnish and German) and 1-from-a-couple-of-tens-of-thousands for a byte-pair-encoded language (most languages have an encoding that reduces the token count to about 16K distinct tokens, see [1] for such an example).
The image classification task also has sparsity at the output and, if you implement it as an RNN, sparsity at the input (a 1-from-256 encoding of intensities).
Heck, you can engineer your features to be sparse if you want to.
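To make that sparsity concrete, here's a quick numpy sketch of my own (not from the paper or any of these models) of a 1-from-256 one-hot input:

    import numpy as np

    # Hypothetical 1-from-256 encoding of a single intensity value.
    intensity = 137                     # the observed value
    one_hot = np.zeros(256)             # dense representation: 256 floats
    one_hot[intensity] = 1.0

    # Sparse representation: just the non-zero coordinate.
    sparse = [intensity]                # one integer instead of 256 floats
    print(int(one_hot.sum()), sparse)   # 1 [137]

A dense kernel still multiplies by all 255 zeros; a sparse one only has to touch the single non-zero entry.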
I also think that this paper is an example of "if you do not compute it, you do not have to pay for it", just like in the GNU grep case [2].
Embedding tables aren't hard on the GPU (being only a lookup table), and the output softmax still requires you to do the full matrix multiply. The label may be sparse, but the computation is far from sparse.
> The reverse is true, embeddings are both the performance and memory-footprint bottleneck of modern NN models.
They may be a bottleneck, but the alternative is worse -- you can't fit complex models with large vocabularies into GPU memory using sparse one-hot encodings.
Technically, the sparse one-hot encoding is the most efficient in terms of memory footprint. You simply store the non-zero coordinates.
The problem in practice for GPUs is that sparse vector/matrix operations are too inefficient.
The whole point of something like this paper is to skip the entire 'densification' step and deal with the sparse matrix input directly as a sparse matrix. The LSH used in this paper improves on directly using SpMSpV (sparse-matrix, sparse-vector multiplication), as that is also inefficient on CPUs, although to a lesser extent than on GPUs.
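For illustration only - this is my own toy scipy sketch with made-up sizes, not the paper's SLIDE code - you can keep the input in coordinate form and do the sparse product directly, never materializing the dense vector:

    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)

    # Hypothetical Delicious-like setup: huge feature space, ~75 non-zeros per example.
    n_features, n_hidden, nnz = 500_000, 128, 75
    W = sparse.random(n_hidden, n_features, density=0.01, format="csr", random_state=0)

    # Sparse input: only the non-zero coordinates and their values are stored.
    cols = rng.choice(n_features, size=nnz, replace=False)
    x = sparse.csr_matrix((np.ones(nnz), (np.zeros(nnz, dtype=int), cols)),
                          shape=(1, n_features))

    h = x @ W.T      # sparse product; no 500,000-wide dense vector is ever built
    print(h.shape)   # (1, 128)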
> If you look at language modeling, then things there are even sparser - a typical neural language model has 1-from-several-hundred-thousand for a full language
Most real-world models don't use one-hot encodings of words -- they use embeddings instead, which are very dense vector representations of words. Outside of the fact that embeddings don't blow out GPU memory, they're also semantically encoded, so similar words cluster together.
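For anyone unfamiliar, here's a minimal numpy sketch (sizes are made up, not from any particular model) of what an embedding lookup is: the sparse token index just selects a row of a dense table, so the one-hot vector is never materialized.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab_size, embed_dim = 50_000, 300              # hypothetical sizes
    table = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

    token_ids = np.array([17, 4203, 17, 49_999])     # sparse, categorical input
    vectors = table[token_ids]                       # dense (4, 300) output: just a row gather
    print(vectors.shape)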
First, you need to compute these embeddings at least once - and there's your sparsity again! Second, these embeddings may differ between tasks, and the accuracy you get from using them may differ too.
For example, the embeddings produced by the CBOW and skip-gram word2vec models are strikingly different in the cosine-similarity sense - different classes of words come out as similar under CBOW versus skip-gram.
So you agree that the problem is fundamentally sparse? Embeddings are used to make sparse (e.g. categorical) data feasible on GPUs, and real-world models are limited by how large they can make the embeddings and still fit in GPU memory. Embedding lookups are also a compute bottleneck:
Why aren't GPUs better at sparse matrix math? Generally, sparse operations are memory-bandwidth limited, but GPUs/TPUs still have much faster memory than CPUs and more memory bandwidth in general (roughly a factor of 4 or so between the latest CPUs and GPUs).
Sparse matrix math basically boils down to indirect array references: A[B[i]]. GPUs generally trade memory latency for bandwidth, relying on having a lot of other work available to hide that latency. But because there's no work between the first and second load, you can no longer hide the memory latency of the second load with extra work.
CPUs, by contrast, have a thorough caching hierarchy that tends to focus on minimizing memory latency, so it doesn't take as long to do the second load compared to a GPU.
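A minimal numpy sketch of that A[B[i]] pattern (my own toy example, arbitrary sizes): the second load can't be issued until the index from the first load arrives, and there's no arithmetic in between to hide the latency behind.

    import numpy as np

    rng = np.random.default_rng(0)

    A = rng.random(10_000_000)                      # big array sitting in DRAM
    B = rng.integers(0, A.size, size=1_000_000)     # data-dependent indices

    # The gather A[B[i]]: for each i, load B[i], then load A at that (random) address.
    gathered = A[B]
    print(gathered[:3])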
Yeah, on the GPU you ideally need your threads to load consecutive memory locations to utilize the memory bandwidth properly. Random indexing blows this out of the water. I guess you could pre-process on the CPU, though, to pack the sparse data for better GPU efficiency.
Another thing to note is that sparsity is being leveraged even to build a more efficient version of hardware. A good example of this is the Cerebras Waferscale chip that was announced recently. I'm assuming the author was unaware of developments on the hardware side of things.
"SLIDE doesn’t need GPUs because it takes a fundamentally different approach to deep learning. The standard “back-propagation” training technique for deep neural networks requires matrix multiplication, an ideal workload for GPUs. With SLIDE, Shrivastava, Chen and Medini turned neural network training into a search problem that could instead be solved with hash tables"
Seems this is related to an adaptive approach which GPUs don't currently support, but could be made to. I think this means the next version of TPUs will support it, and then GPUs will follow closely after.
No, their approach changes the fundamental access pattern into something anathema to GPU and TPU architectures.
In ELI5 or layman's terms: current GPU/TPU accelerators are specialized in doing very regular and predictable calculations very fast. In deep learning a lot of those predictable calculations are not needed, like multiplying by zero. This approach leverages that and does only the minimal necessary calculations, but that makes the work very irregular and unpredictable. Regular CPUs are better suited for that kind of irregular computation, because most other general-purpose software is like that as well.
They don't implement SLIDE on GPU, so we don't actually know if the CPU is faster than the GPU. It's a comparison of SLIDE on CPU vs. full softmax & sampled softmax on GPU.
They should at least give a rationale for why a GPU would not speed up a locality-sensitive-hashing-based algorithm. GPUs are used to compute hashes fast (they were once used for Bitcoin mining).
That's a valid point, but it does not settle the question.
Obviously the memory throughput will not be as high as with matrix calculations, but the algorithm could still be optimized for the GPU. GPUs can do random access and have large, fast memories. What is the difference in memory line size? 64 bytes vs. 128 bytes?
GPUs would still be faster than CPUs. You describe them as high-latency but their memory latency is comparable to CPUs. That's why ethash mining or equihash mining (workloads bottlenecked by short ≤32-byte random memory reads) is still faster on GPUs than on CPUs. Also see https://news.ycombinator.com/item?id=22505029
32-byte accesses are not short. 8 bytes (a double-precision float) is shorter, and that is what makes sparse matrix multiplication hard on a GPU.
Also, the SHA-256(d?) employed by ethash is actually quite long - 80 cycles at the very least (a cycle per round). In mining you can interleave the hashing computation for one header with the memory loads required for mining another header, and from what I know this is what CUDA code on a GPU will do.
The sheer amount of compute power makes ethash mining faster on GPU.
Reads shorter than 64 bytes on a CPU all cost you the same: a packet of 64 bytes on the memory bus, because that's the atom size of modern CPUs' DDR4 memory controllers...
On GPUs the atom size is 32/64 bytes. So GPUs are always better than or equal to CPUs when it comes to small reads/writes.
It's true that the compute part of ethash is not negligible, but to give you one more data point: equihash spends even less compute on hashing, and GPUs still dominate CPUs.
A slightly more elaborate answer than the sibling post, to drive home how much happens on a simple read that is not cached:
- request to L1D cache, misses
- request to L2D cache, misses
- packet is dropped on the mesh network to access L3D, likely misses
- L3D requests load from memory from the memory controller, load is put in queue
- DRAM access latency: ~100-150
- above chain in reverse
This is the best-case scenario on a miss, because there could be a DTLB miss on the address (which is why huge pages are crucial in the paper) or there could be dirty cache lines somewhere in other cores that trigger the coherency mechanism.
Yeah, this is a line of research that's been ongoing for a while, and "CPUs, not GPUs" seems like its rallying cry, but this seems to involve multiple incomparables - like a combination of apples-to-oranges and bananas-to-peaches comparisons (different algorithms on different platforms doing different tasks judged according to different success criteria, jeesh).
The thing is, there's no benchmark for just neural-netting. Neural nets do well on a variety of image recognition tasks, but the nets that do well do so as particular trained instances on particular data. And, moreover, it's not just about doing well on a benchmark but about generalizing well. Essentially, neural nets aren't simply algorithms but paradigms (for good and ill).
Instead of the babble about CPU vs. GPU, they should be talking about the strengths and weaknesses of their approach to image recognition and related tasks. Their locality-sensitive hashing approach is certainly interesting, and some parts of it might hypothetically be useful for dealing with the weaknesses of the neural net approach (fragility, black-box unexplainability, etc).
As someone who has ported memory-bound workloads to GPU, I'd say SLIDE looks like it would run even faster on GPU than on CPU. I skimmed the paper, and SLIDE is described as a memory-bound workload, specifically random memory accesses to 2-10 GB of data. GPUs excel at this type of workload. For example, the Ethereum PoW (ethash) is memory-bound, and GPU ethash implementations are faster than CPU ones.
So I don't understand why the authors don't mention the possibility of implementing SLIDE on GPU. Of course, I could have missed something (I spent less than 10min reading the paper)...
No, that is definitely not the case; not all memory-bound workloads are of the same type.
Hash tables produce lots of data-dependent random accesses into DRAM, which are definitely not better on GPUs: warp divergence, bank conflicts, partial cache-line access inefficiencies, etc.
Even CPUs struggle on this type of pointer-chasing workload due to cache inefficiencies. Open addressing schemes such as Robin Hood hashmaps are popular because they reduce the amount of pointer chasing.
Your example is a false comparison, generating a hash is very different from pointer chasing the address generated by the hash.
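To illustrate the open-addressing point above (toy Python sketch of mine, not SLIDE's hash tables): lookups probe adjacent slots of one flat array, so they stay within a few cache lines instead of following a heap pointer per collision the way a chained table does. Robin Hood hashing refines this same probing idea.

    # Toy open-addressing (linear probing) map. Lookups scan adjacent slots of one
    # flat array instead of chasing per-bucket linked-list pointers.
    EMPTY = None

    class FlatMap:
        def __init__(self, capacity=64):
            self.slots = [EMPTY] * capacity

        def _probe(self, key):
            i = hash(key) % len(self.slots)
            while self.slots[i] is not EMPTY and self.slots[i][0] != key:
                i = (i + 1) % len(self.slots)   # next adjacent slot: cache-friendly
            return i

        def put(self, key, value):
            self.slots[self._probe(key)] = (key, value)

        def get(self, key, default=None):
            slot = self.slots[self._probe(key)]
            return default if slot is EMPTY else slot[1]

    m = FlatMap()
    m.put("neuron_42", 0.93)
    print(m.get("neuron_42"))   # 0.93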
GPUs still have higher memory bandwidth than CPUs though. So if you miss in caches all the time, GPUs can still potentially come out on top (as long as you can keep the higher memory latency under control, which the massive effective hyper-threading of GPUs should be able to help with). That's the point of using GPUs for solving memory-hard problems for mining those "ASIC-resistant" coins.
"Your example is a false comparison, generating a hash is very different from pointer chasing the address generated by the hash."
That's not true at all in the case of ethash, where the running time is dominated by waiting for memory read ops to complete, not waiting for ALU ops (hashing) to finish executing.
I have also written an Equihash miner where the workload is similar: running time dominated by hashtable reads or writes, and can confirm GPUs beat CPUs.
I reiterate: GPUs excel at data-dependent random reads, compared to CPUs. Sure, these are very difficult workloads for both CPUs and GPUs, but GPUs still trump CPUs. That's because the atom size (the minimum number of bytes a GPU/CPU can read/write on the memory bus) is the same or better on GPU - 64 bytes on CPU (DDR4), and 32/64 bytes on GPU (HBM2) - and GPUs have significantly higher memory throughput, up to 1000 GB/s, while CPUs are still stuck at around 200 GB/s per socket (AMD EPYC Rome, 8-channel DDR4-3200).
So in ethash or Equihash mining workloads, many data-dependent read ops across a multi-GB data structure (much larger than local caches) will be mostly bottlenecked by (1) the maximum number of outstanding read ops the CPU/GPU memory controller can handle and (2) the overall memory throughput. In the case of GPUs, (1) is not really a problem, so you end up being bottlenecked by overall memory throughput. That's why GPUs win.
As of 3-4 years ago I remember Intel CPUs having a maximum of 10 outstanding memory operations, so (1) was the bottleneck. But things could have changed with more recent CPUs. In any case, even if (1) is not a bottleneck on CPUs, their lower memory throughput guarantees they will perform worse than GPUs on such workloads.
Correct, GPUs can indeed do a better job at hiding latency through massive parallelism.
My expertise might be outdated here, but the problem used to be that actually getting to that max bandwidth with divergent warps and uncoalesced reads was just impossible.
Is this still the case with Volta? Did you avoid these issues in your Equihash implementation?
Divergent warps are still a huge problem (but SLIDE doesn't have this problem AFAIK).
Uncoalesced reads are not a problem severe enough to make GPUs underperform CPUs. Or, said another way, uncoalesced reads come with a roughly equally bad performance impact on both GPUs and CPUs.
The only reason GPUs can hide the latency is the massive parallelism in the problem space (computing the hash for nonce n doesn't block nonce n + 1). This algorithm involves a lot of data dependency, so a machine training these networks may actually be memory-latency bound (unless you are training a ton of neural networks and can hide the latency), which is extremely bad for GPUs.
Well, depends on the size of the hash table and the particular memory access patterns. Lookups into many small hash tables, or workloads in which most threads in a threadgroup all fetch the same entry, can be very efficient on GPUs. Sparse virtual texturing is often implemented with a hash table and works well on GPUs because the hash table involved has both of these properties.
Yes, a very good point. I am assuming the tables are quite large due to the workload. If they're large enough to benefit from large pages via reduced DTLB misses, they're likely too large for warp-local memory :)
Ethash mining is not memory bound. It has, if I remember correctly, a precomputed buffer used by many operations.
To quote Wikipedia: "Ethash uses an initial 1 GB dataset known as the Ethash DAG and a 16 MB cache for light clients to hold. These are regenerated every 30,000 blocks, known as an epoch."
You generate it once, access many times.
If each memory access is 32 bytes, then you need no more than 32*2^20 (~33M) parallel computations to cover the whole 1 GB buffer with completely sequential accesses. To very probably not miss a prefetched DRAM block (the main throughput bottleneck), which is typically 4K..8K in size, you need about 128 times fewer computation threads. And that number (~250 thousand threads) is well within reach for current GPU models.
The state of the NN weights, on the other hand, changes after every batch.
For reference since it's kind of buried/obfuscated: Their point of comparison is between an NVIDIA Tesla V100 and a 44-core CPU. The latter is probably something like a Xeon E5-2699, which has a list price of $4115 USD. Hard to find accurate pricing data for the V100, but it looks like it was in the $7-10k USD range back in 2018. Still a cool cost/performance improvement but not as massive as I was expecting before I looked up the test hardware.
It's not just that the CPU is cheaper than the GPU. They also claim it's faster than the more expensive GPU. Meaning if you rent a box in the cloud for training it'll cost you less since you'll rent it for less time, and if you buy the hardware you can train more models and thus need less hardware to train as much as with GPUs before.
So if it's 3.5 times faster, then you require only roughly 30% of the time (or hardware) it took before; both savings combined seem pretty significant to me.
Edit: quick napkin math. Let's take your $7k figure for the GPU and $4115 for the CPU. We need roughly 30% of that CPU, so 4115 * 0.3 = 1234.5, and 1234.5 / 7000 ≈ 0.176.
> Their point of comparison is between an NVIDIA Tesla V100 and a 44-core CPU.
Not quite. Further in the paper: "Similarly, for larger Amazon dataset, the red and black line never intersect, and the red and blue line intersects on 8 cores. This means that SLIDE beats Tensorflow-GPU with as few as 8 CPU cores and Tensorflow-CPU with as few as 2 CPU cores."
Yes, they have a beast machine, but it can outperform the GPU with only 8 cores in some cases.
Also worth noting that that Xeon processor takes somewhere around 350-400W and the V100 is also in the range of 300W. So not huge on energy savings. Although a really cool, potentially industry shaking breakthrough, regardless.
The V100 would need to run for 3.5 times as long, so would use way more energy. E.g. training for 1 hour on CPU: 400W×1h = 400Wh of energy, vs training for 3.5h on GPU: 300W×3.5h = 1050Wh of energy.
Since the article doesn't specify how power draw was measured, I assume that they measured _total system_ power consumption. This is not to be confused with the CPU power requirements as it includes the entire platform and all losses (e.g. power supply, main board, peripherals, cooling, ...).
The new AMD EPYC 7742 should have double the memory bandwidth (they claim they are memory bound) and 1.5x the cores for a similar list price.
But in any case, I would not expect serious research to outright beat the current NNs by 10x on all dimensions. That will take some time (and may not fully happen), and this paper is certainly a great advance.
Maybe the real benefit here is that, due to the asynchronous nature of the algorithm, you could easily scale to more cores and the new approach will scale with them, whereas with the GPU you reach a point of diminishing returns at some point?
A more correct headline would be: "An algorithm beats a poorly optimized GPU implementation on a narrow problem that uses an extremely sparse dataset using $8.5K worth of CPU".
This is for recommenders only, and it does not translate to anything else. I don't know why everyone seems to misrepresent this as NVIDIA's undoing. Read the freaking paper, people.
I think I'm right in saying that the type of locality-sensitive hash systems they're talking about are not entirely dissimilar to Igor Aleksander's WISARD, RAM-based recognisers from the 80's. I suspect the latter is a special case of the former. How far off-base am I?
I feel like this is more a stunt for the new Intel Xeon. Creating a new paradigm on top of a heterogeneous hardware dependency is misleading.
Linear analysis could be explored more. They could achieve the same with less. Holomorphic functional analysis.
I don't think so. Keep in mind that DL training is only a minuscule part of GPGPU applications. HPC is still a huge market with a strong demand for compute accelerator cards.
In practice, TPUs and SoCs with inference extensions are a much bigger threat for NVIDIA in the cloud and automotive business.
NVidia could add special accelerator instructions to their GPUs to efficiently implement the same algorithms and then they would be significantly ahead of Intel and AMD.
This is obvious to me. Since Hadoop came out, (a lot of) people have been giving up on even forming algorithms and just dumping data into machine learning and hoping for the best. I recall someone high up at Google complaining about it.
We need to get back to forming algorithms as well as concepts and first principles. We cannot and should not expect ML to brute force finding patterns and just sit back and relax.
Here is another prediction for you: we will not solve ray-tracing in games and movie CGI with more hardware. We will need some algorithm that gets us 80-90% of the way there in a smart way.
This was my first thought. Well, to be more complete - smart algorithms beat dumb algorithms even if the dumb algorithms use hardware acceleration (unless the problem is trivial anyway.) Smart algorithms plus hardware acceleration beats smart algorithms on general purpose hardware. Smart algorithms are just better.
Sorry for the off topic comment but this then/than mistake I read every day is just getting on my nerves.
"
What to Know:
Than and then are different words. Than is used in comparisons as a conjunction, as in "she is younger than I am," and as a preposition, "he is taller than me." Then indicates time. It is used as an adverb, "I lived in Idaho then," noun, "we'll have to wait until then," and adjective, "the then governor."" [1]
You think that's a bad language mistake? The title (and the article itself) says that their CPU run is 3.5x faster than the GPU. But actually it's 3.5x as fast, which is a radically different thing: "3.5x faster" would mean "4.5x as fast", in the same way that "50% bigger" means "150% of the size", not "50% of the size".
Edit: Clearly this is a contentious comment, and even those of us who see things this way mostly seem to agree that we understand the intended meaning (but things get fuzzier with smaller numbers expressed as percentages, e.g. "120% faster"). Surely, though, it makes more sense to use the completely precise phrasing "3.5x as fast", especially for the main statement of the main result in an academic paper.
> I think "two times faster" would mean "twice as fast", not "three times as fast".
I don't see how that's the case. What is "50% bigger"? I understand it as 150% of the size. Similarly, I would understand "90% bigger" to mean "190% of the size" and "100% bigger" to mean "200% of the size", and I'm sure that is how they are used in practice.
So then surely "200% bigger" means "300% of the size"? And "two times bigger" - which is mathematically identical to "200% bigger" - would be three times the size. I acknowledge that the phrase is often used the way you say, but I don't think that is its literal meaning, and if you used that in a contract I think it would legally be interpreted in the way I've said (I'm thinking of consumer law in a situation where something is "n times bigger for the same price").
All this applies analogously to speed and "x times faster". I gave the examples above with size because I think it's a bit more common to talk about it that way, and speed is a bit more subtle because what we're measuring here is time, which is something divided by "speed", and the numerator isn't clear (number of training runs, perhaps).
Interestingly, in Russian "100% bigger" and "two times bigger" take different prepositions, so it's clearer that in the first case you do a sum (x + 100%·x = 2x), and in the second case you do a multiplication (2·x = 2x).
I'm quite sure it's the same in English (sum with % and multiplication with times), but I'm not a native English speaker.
I'm a native English speaker, I don't think I would ever say "two times bigger", sounds like bad grammar. I would read "2x the size" as "twice the size" or "two times the size", but not "add on twice the size".
So I agree with the article, "3.5 times faster (1 hour vs. 3.5 hours)" is perfectly correct, and it is OK to abbreviate "3.5 times" as "3.5x".
That's just sloppy language and you got used to it. Twice as fast means twice as fast. Two times faster means two times faster. One time faster would mean twice as fast. 0.5x faster would mean 50% faster or 150% as fast. 0.5x slower would mean 50% slower or 50% as fast.
I agree and, if you are mistaken, I'd have to observe that this is a spectacularly unclear way of communicating an increase.
The meaning of "two times faster" as "twice as fast" is certainly the way such a statement would generally be interpreted everywhere I've worked.
It is of course possible that the meaning suggested by quietbritishjim is archaic British, but I certainly don't believe it's current: I've worked in various places in Cambridge and London for the past 18 years and, as I say, have never encountered it.
> if you are mistaken, I'd have to observe that this is a spectacularly unclear way of communicating an increase.
I absolutely agree with that, and in practice if I ever see that turn of phrase with anything more than 100% then I assume that they are using it the way that you're thinking of. But I maintain this is not the literal meaning. At the same time, I'm not saying people should be subtracting 100% to make the number mathematically correct, that would definitely be bizarre. Instead, I'm saying they shouldn't be using such a weird turn of phrase in the first place, so the headline should simply be "Deep learning on CPU 3.5x as fast as on GPU"
> I've worked in various places Cambridge and London for the past 18 years and, as I say, have never encountered it.
You have really never encountered an item in a supermarket saying "now 20% bigger!"? Thinking about it now, they're usually charging the same amount as the old size (otherwise it's not much to brag about really) in which case they use the vastly clearer phrase "20% extra free", but I'm sure I've seen the former phrase.
> A personal note: I know the disparagement of Times-er from long ago, from my grade school years, I think; I was taught that two times more than X really means 'three times as many as X'. Since authority figures insisted on this interpretation, I avoided the construction entirely (as, as far as I know, I still do). Yet I've never stopped asking, "Why don't you understand the clear meaning of what people are saying?"
>> That last reaction incorporates one criticism of Times-er, namely that it is "illogical" or "irrational": X times more than Y MUST MEAN 'Y plus X-times-Y (that is, 'X+1 times Y'), not 'X times as many/great as Y' (that is, 'X times Y'). (In the most common variant of this reaction, X times more than Y is disparaged because it is said to be ambiguous, with both the 'X times Y' and 'X+1 times Y' interpretations.)
>> The appeal here is to the idea that ordinary-language expressions are simply realizations of logical (or arithmetical) formulas. This is just backwards. The formulas are there to represent the meanings of expressions; they are not the prior reality, merely cloaked in (those devilishly vague) words of actual languages.
What annoys me even more is that I keep using this comparison in the wrong way despite knowing better. BTW, what do you say if you made something twice as fast? Once faster?
People using 'further' instead of 'farther' (used for distance) used to get me upset, but people mistake them so often that I realized I was getting upset for no good reason. I know what they meant even if they were ignorant of the grammar. I agree that in an academic paper you don't want to see grammar mistakes, but in many other contexts if you understand what is meant, it's no use getting bent out of shape.
I think it could be easily attributed to the fact that most non-native speakers learn English at school by studying its grammar in written form, where the two words are distinct. Native speakers, instead, learn English as their spoken language, where the words sound basically identical to each other.
As a non-native, I learned "then" and "than" in different contexts, months if not years apart. I also learned them in speech and in writing at the same time.
Please don't quiz me on how to read bear, pear, tear, fear, spear, clear, and dear.
I do understand it but, for my colleagues and friends who don't have English as their first language, it adds another caveat to learn and remember without a logical basis. That's another place to introduce ambiguity and errors.
I don't get angry at non-standard usage but I think it's important not to ignore the purpose of consistent style.
yea very true! and as someone who grew up speaking french, i want people to stop making the viola/voila mistake too. viola is an instrument. voila is what you mean most of the time.
also off-topic, but i heard the word senglish, for singaporean english, on a podcast yesterday. that made me realize how english has the potential to become a universal language with each country having their own version.
french people already use the term frenglish when they mix english words with french. we could have spanglish, japenglish, germanish, etc. they don't have to be called those though.
it would be totally awesome to be able to communicate with almost everyone in the world. just like the internet!
It's not not knowing the meaning, it's simply a proofing mistake in phonemic writing. For a native speaker, spelling as it sounds is second nature, making morphological distinctions is harder.
For a non-native speaker it's hard to tell the two apart. Then and than sound pretty much the same. And even if one knows the difference, it's easy to make a typo.
In the sort of "international pidgin English" that's spoken anywhere outside the UK such subtle differences should just be ignored.
I don't know about that; this kind of mistake (then/than, effect/affect) gets on my nerves as well, and I'm definitely not a native English speaker.
Not to mention that this specific kind of mistake (similar sounds) comes at least as often from native speakers as from non-native ones, in my experience / native language.
In French, a lot of people mistake Ça for Sa for example, native and non-native alike
I don't agree,
I'm not a native speaker and, probably like many, I learned English by reading, so those two words sound very different in my head.
I'm always lost when I see this mistake.
I don’t think you’ll get much play for suggesting (to an audience of at least some programmers) that we should allow for more ambiguity in language, heh.
The programmers I’ve met without an eye for detail are usually ones I do not like working with.
Hehe, true, but unlike programming languages, the languages humans use for communication are "sloppy" and ambiguous by definition. Grammar rules have been invented after the fact to create the illusion that there's order where none exists.
English allows much more "freedom" than many other languages (e.g. German), maybe that's one reason why it has been so successful in the end.
If they are just ignored, how will anyone learn? If they really were small (to the meaning of the sentence) differences, then whatever, but switching than for then changes the sentence.
"It should be noted that these datasets are very sparse, e.g., Delicious dataset has only 75 non-zeros on an average for input fea- tures, and hence the advantage of GPU over CPU is not always noticeable."
In other words, they got a good speedup on their problem, but it might not apply to your problem.