MXNet – Deep Learning Framework of Choice at AWS (allthingsdistributed.com)
99 points by werner on Nov 22, 2016 | 56 comments



Translation from corporatespeak: "We don't have an internally developed framework that can compete with TensorFlow, which is controlled by Google, so we are throwing our weight behind MXNet."

As others have commented here, there is no evidence that MXNet is that much better (or worse) than the other frameworks.


Exactly. Among those DL frameworks, I think what TensorFlow gets most right is the tooling support. Metric collection, visualization, and checkpointing are plug-and-play in TensorFlow; in the others, not so much. For example, a summary (a.k.a. metric collection) is just a subgraph of the whole computational graph, which can be evaluated at any time. A simple and neat abstraction indeed.

Those properties combined make TensorFlow the most engineer/practitioner-friendly choice on the market. If AWS hopes to compete with TensorFlow in all seriousness, they need to catch up on those seemingly trivial but important details.
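For the curious, here's roughly what that plug-and-play summary workflow looks like, assuming TF 1.x-style names (tf.summary.scalar / merge_all / FileWriter); the model and log directory below are just placeholders:

  import numpy as np
  import tensorflow as tf

  x = tf.placeholder(tf.float32, shape=[None, 10], name="x")
  w = tf.Variable(tf.zeros([10, 1]), name="w")
  loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

  # Summaries are just extra nodes in the same graph...
  tf.summary.scalar("loss", loss)
  merged = tf.summary.merge_all()

  with tf.Session() as sess:
      sess.run(tf.global_variables_initializer())
      writer = tf.summary.FileWriter("/tmp/logdir", sess.graph)
      # ...so they can be evaluated whenever you like and written out for TensorBoard.
      summary, _ = sess.run([merged, loss],
                            feed_dict={x: np.zeros((4, 10), np.float32)})
      writer.add_summary(summary, global_step=0)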


Amazon has been building technology based on ML&DL for over 20 years and has developed several frameworks. You must have missed the announcement of this open source framework earlier in the year: https://github.com/amznlabs/amazon-dsstne.


I saw that when it was announced. DSSTNE has failed to capture the hearts and minds of developers. In my experience, it doesn't come up in any conversations about which frameworks to bet on for new product development.

And I'm rooting for Amazon (and Facebook, and Microsoft...). TensorFlow needs competition for the hearts and minds of developers.


This doesn't address the root comment at all. Does Amazon actually think MXNet is the best? Or did they simply choose the next best thing that isn't already backed by another "big four" company (Google -> TensorFlow, Facebook -> Torch). It's hard to believe this is actually about scalability without any data.


Here is a very nice blog article explaining how Amazon is generating recommendations at scale with Apache Spark and Amazon DSSTNE :) https://aws.amazon.com/blogs/big-data/generating-recommendat...


At least MXNet is a good one that deserves more publicity and backing (in terms of maintenance effort). I find it better for the community to have AWS back a good existing open source project than to re-invent a very similar wheel one more time.


I like MXNet, and I think it's great that Amazon is backing it publicly. TensorFlow needs competition for the minds and hearts of developers.


There is a huge distributed performance advantage vs. TensorFlow. You can get a hint from Prof. Carlos Guestrin's keynote talk at Data Science Summit 2016. Also, CMU CS Dean Andrew Moore described MXNet as "the most scalable framework for deep learning I have seen."


This recent OSDI paper [1] has a direct comparison in Fig 8. It appears there is no particularly pronounced distribution or general performance advantage, and TensorFlow actually outperforms MXNet in this comparison.

1: https://www.usenix.org/system/files/conference/osdi16/osdi16...

[full disclosure, I work on the TensorFlow team]


The version tested in the paper presumably did not have P2P, so it was much slower than the current version.


It seems more prevalent now than it used to be that frameworks/libraries are being used as weapons in a sort of mindshare war between the world's megacorps. Or perhaps I'm misremembering history. And I don't mean just AI; just look at Angular (Google) vs. React (Facebook).

It's a bit of a double edged sword. As developers this war gives us free access to well funded and heavily developed tools. The world has been fundamentally changed by their availability. But at the same time we need to understand that the primary reason they exist is to lock developers into a particular vendor. It's most transparent with Google's TensorFlow, where they were obvious about their intentions to offer TensorFlow services on their cloud platform.

This article more than most exemplifies their desperate attempts. For now it seems to remain mostly that, desperate attempts, with the tools remaining more-or-less platform agnostic. But I foresee a grim future where our best libraries and tools are tied inextricably to a commercial ecosystem.


Then use Torch.


Isn't Torch actively supported by Facebook? https://research.facebook.com/research/torch/


Using 3 year-old GPUs on a much deeper network than the other guys(tm) to demonstrate awesome scaling efficiency == Intel-level FUD. Note also the absence of overall batch size.

Wonder what would happen to that scaling efficiency if those GPUs were P40s?

See also the absence of equivalent AlexNet numbers to further obscure attempts at comparing this to the other guys(tm).

Can't wait for Intel's response to this.


What is really fishy is evaluating training time speed ups in terms of throughput. The latency induced by the parallelism mechanism (when using asynchronous data parallelism) might seriously hamper the convergence speed. The presence of this potential problem cannot be detected in the throughput metric. They should have used a convergence metric instead (e.g. training time to reach 99% of the best validation loss).

If they can achieve 109x speed up with 128 GPUs using synchronous data parallelism with a batch size tuned for optimal single GPU convergence time, then this is very impressive (but quite unlikely).

However I don't think that publishing training benchmarks on Inception v3 (vs say AlexNet) is a fraud. Inception v3 is close to the state of the art and very good at using few parameters & inference FLOPS for a good test accuracy.

Inception v3 has been publicly available for quite a long time in a variety of DL toolkits along with pre-trained weights.
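For what it's worth, a minimal sketch of the convergence metric suggested above: time wall-clock until the validation loss first reaches a target, rather than counting images/sec. The train_one_epoch and evaluate callables here are placeholders for whatever the framework provides:

  import time

  def time_to_target(train_one_epoch, evaluate, target_val_loss, max_epochs=200):
      """Wall-clock time until validation loss first drops below target_val_loss."""
      start = time.time()
      for epoch in range(max_epochs):
          train_one_epoch()          # one pass over the training data
          val_loss = evaluate()      # validation loss after this epoch
          if val_loss <= target_val_loss:
              return time.time() - start, epoch + 1
      return None, max_epochs        # never reached the target

The speedup to report would then be single-GPU time divided by distributed time, not a ratio of throughputs.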


The results are reported using synchronized SGD with each GPU using a batch size of 32. More details, such as scripts to reproduce the results and scalability results on various networks (including AlexNet) and various batch sizes, will be available soon. I'll put more technical details, such as implementation details and performance analysis, in my PhD thesis.


Then the total batch size is growing with the number of GPUs and the convergence might be impacted both in terms of speed and solution quality (e.g. https://arxiv.org/abs/1609.04836 ).

I could believe you if you told me that the validation loss and test accuracy of the large distributed model remain as good as those of the sequential, single-GPU model after the same total number of epochs, but this is not a given, and if it's not the case I would find those benchmarks deceptive.


There are a lot of papers talking about the trade-off between algorithm convergence (validation accuracy) and system efficiency. At the least, it is my major PhD research topic at CMU. In the context of synchronized SGD on deep CNN models, my observation is that up to batch size X, the convergence speed is not very sensitive to the batch size; between X and Y, we still get a good convergence rate by tuning the hyper-parameters carefully; but beyond Y, it becomes an interesting research question.

Both X and Y are related to the dataset and network complexity. A rough guess I often use is num_classes < X < 10 * num_classes and Y ~= 10X. To accelerate convergence for batch sizes between X and Y, we can either increase the data augmentation or the learning rate, or both. The basic idea is to add more noise to the SGD training to avoid falling into suboptimal points too easily.

The paper you mentioned studies the extreme case where batch size >> Y. They used CIFAR-10 (num_classes = 10) and a batch size of 20% of num_examples (12K). I was also surprised that they extended our earlier work to CNNs and showed promising results (Sec 4.2).

But as the paper's authors also mention, there is little theory we can say about it. I expect that the research community will have fun with it for a while.

But back to the MXNet benchmark: we did successfully tune the hyper-parameters with 128 GPUs and batch size = 32 * 128 to match the convergence of a single machine on the ImageNet-1K dataset. So we think our setting is reasonable. But the main point here is that we are more interested in showing how fast the system can go, so that researchers can more easily try more efficient distributed algorithms.
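For concreteness, plugging ImageNet-1K into the rule of thumb from the earlier comment (just arithmetic on the numbers already stated, not an extra measurement):

  num_classes = 1000                            # ImageNet-1K
  X_range = (num_classes, 10 * num_classes)     # X somewhere in 1,000 .. 10,000
  Y_range = (10 * X_range[0], 10 * X_range[1])  # Y ~= 10X -> 10,000 .. 100,000
  total_batch = 32 * 128                        # per-GPU batch * number of GPUs = 4096
  # 4096 sits inside the 1,000..10,000 window, i.e. at worst the
  # "tune the hyper-parameters carefully" regime rather than the beyond-Y regime.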


Are you claiming MXNet gets a ~109x speedup in training time to X% accuracy with synchronized SGD on 128 GPUs?


Amazon probably used P2 because they want to advertise it. We can get an almost linear speedup on 10 8xM40 machines using MXNet. Batch size is linearly increased with the number of machines, but empirically it doesn't hurt convergence, at least on ImageNet.

I mean, who cares about AlexNet any more? It's 2016 already. It trains in under 2h on a single machine. Distributing it doesn't make much sense.


Publish those numbers with the sample code to reproduce them. Your first paragraph is enough for an awesome white paper/use case to drive adoption. Don't let silly AWS internal politics get in the way if you work there. Find a workaround.

Amazon is at its best when it's customer obsessed and at its worst when it puts politics first.

All IMO of course.


2 hours to train AlexNet on a single machine? Link please.


https://developer.nvidia.com/cudnn Alex did it on two GTX 580s in 2012. It took him 1 week. It's 60x faster now, even compared to a K40.


Comparisons on AlexNet are not very useful now. I can get AlexNet-like quality a lot cheaper (at test time) now, and for the same computational cost I can get a lot better in quality of results ... or even better if I accept more cost. I can't think of a good reason to evaluate AlexNet nowadays, I'm more annoyed at the other guys(tm) that (exclusively) do, since that means to get meaningful datapoints I need to rerun the experiments myself.


AlexNet #s IMO provide an excellent ballpark estimate of how well balanced compute and communication are in terms of both the framework and the underlying platform.

A platform that runs AlexNet well has excellent computation performance for the convolution layers but it also has excellent algorithms/communication for parallelizing the model/data by whatever means.

Networks that attempt to minimize computation and/or communication are cool, but they should be considered in that light IMO.

It's also a great estimate of the low-end for strong scaling. There's a lot of bread and butter machine learning at this level in my experience.


Okay, with all due respect, this is BS. I love MXNet and think it's underappreciated as well. But pretty much its best feature is the memory mirror (see oneshot908's comment).


This reads weirdly. He talks about how MXNet is the best choice without comparing it to other frameworks. That's the whole point of choosing between things. I'm sure they did the legwork to make this decision, and some insight into that choice might help others follow. Without that, my distrust radar is blinking.


From the OP:

  > a Deep Learning AMI, which comes pre-installed with the popular open source
  > deep learning frameworks mentioned earlier; GPU-acceleration through CUDA
  > drivers which are already installed, pre-configured, and ready to rock
You might want to clarify that the negative reviews [0] are from earlier versions which did not include the CUDA drivers. I recently considered this AMI and rejected it for a class [1] because of these reviews.

[0] https://aws.amazon.com/marketplace/reviews/product-reviews?a...

[1] https://www.meetup.com/Cambridge-Artificial-Intelligence-Mee...


The deep learning AMI now has both CUDA and CUDNN installed.


> we have concluded that MXNet is the most scalable framework

Without backing from any benchmarks? This claim is lazy.


>MXNet can consume as little as 4 GB of memory when serving deep networks with as many as 1000 layers.

So perhaps I'm not well versed enough in deep learning, but does this mean that they solved the vanishing gradient problem? How are they managing to do this?


For deep convnets the vanishing gradient problem can mostly be solved by using residual architectures. See: https://arxiv.org/abs/1603.05027

This is kind of related to solving the vanishing gradient issue in RNNs by using additive recurrent architectures like LSTMs and GRUs.

Alternatively it's possible to use concatenative skip connections as in DenseNets: https://arxiv.org/abs/1608.06993

Still, using 1000 layers is useless in practice. State-of-the-art image classification models are in the range of 30-100 layers, with residual connections and varying numbers of channels per layer depending on the depth, so as to keep a tractable total number of trainable parameters. The 1000-layer nets are just interesting as a memory scalability benchmark for DL frameworks and to validate empirically the feasibility of the optimization problem, but are of no practical use otherwise (as far as I know).
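For anyone who hasn't seen one, here is a minimal pre-activation residual block sketched with MXNet's symbolic API (layer sizes and names are arbitrary; treat it as a sketch rather than a reference implementation):

  import mxnet as mx

  def residual_block(data, num_filter):
      # two 3x3 conv layers, each preceded by batch norm and ReLU...
      bn1   = mx.sym.BatchNorm(data=data)
      act1  = mx.sym.Activation(data=bn1, act_type='relu')
      conv1 = mx.sym.Convolution(data=act1, num_filter=num_filter,
                                 kernel=(3, 3), pad=(1, 1))
      bn2   = mx.sym.BatchNorm(data=conv1)
      act2  = mx.sym.Activation(data=bn2, act_type='relu')
      conv2 = mx.sym.Convolution(data=act2, num_filter=num_filter,
                                 kernel=(3, 3), pad=(1, 1))
      # ...plus the identity shortcut: gradients flow through the addition
      # unattenuated, which is what keeps them from vanishing at depth.
      return data + conv2

  net = mx.sym.Variable('data')
  for _ in range(10):          # stack as many blocks as memory allows
      net = residual_block(net, num_filter=64)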


Thank you!


Vanishing gradient isn't the same as memory efficiency. The memory mirror option is what allows this extremely efficient memory usage, at the cost of being only about 30% more compute intensive.


Yes, but that's not what I asked about.


Vanishing gradients are addressed through model architecture choices: ReLU activation instead of sigmoid or tanh, batch normalization, and LSTMs.

These are orthogonal to memory management and neural net framework choices.
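A quick back-of-the-envelope illustration of the activation part (plain NumPy, nothing framework-specific): the sigmoid's derivative is at most 0.25, so multiplying it across many layers shrinks the gradient exponentially, while ReLU's derivative is 1 wherever the unit is active:

  import numpy as np

  sigmoid   = lambda z: 1.0 / (1.0 + np.exp(-z))
  d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))        # peaks at 0.25 when z = 0
  d_relu    = lambda z: (np.asarray(z) > 0).astype(float)      # 1 wherever the unit is active

  depth = 50
  print(d_sigmoid(0.0) ** depth)   # 0.25**50 ~ 8e-31: the gradient has vanished
  print(d_relu(1.0) ** depth)      # 1.0: the gradient survives the depth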


Did not realize you could use MXNet declaratively (like Tensorflow/Theano) and imperatively (like Torch/Chainer). Can anyone speak more of their imperative usage of MXNet?


It means you can declare GPU arrays like those in NumPy/Torch, operate on them imperatively from the Python side, and mix them with the graph computation, instead of forcing everything to be part of a graph.
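Something like this, assuming MXNet's mx.nd / mx.sym split (the exact binding call may vary between versions):

  import mxnet as mx

  # Imperative side: NDArrays behave like NumPy arrays, possibly on a GPU.
  a = mx.nd.ones((2, 3))          # use ctx=mx.gpu(0) if a GPU is available
  b = a * 2 + 1                   # executed eagerly, line by line
  print(b.asnumpy())

  # Declarative side: build a graph, then feed the NDArrays into it.
  x = mx.sym.Variable('x')
  y = mx.sym.Variable('y')
  z = x * y + x
  ex = z.bind(ctx=mx.cpu(), args={'x': a, 'y': b})
  ex.forward()
  print(ex.outputs[0].asnumpy())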


Does "declaratively" mean the use of expression templates in C++?

I learned about them last week; I don't see too much benefit if the goal is good performance.


No, it means writing a program that defines the structure of a computation graph lazily (without executing the nodes when defining the model) so as to reuse that compute graph in a later step of the program.

The computation graph is an in-memory datastructure that can be introspected by the program itself at runtime so as to do symbolic operations (e.g. compute the gradient of one node in the graph with respect to any ancestor input node).

Theano implements this in pure Python and can generate C or CUDA code from string templates (in Python). TensorFlow has a Python API to assemble pre-built operators, which are mainly written in C++ and use the Eigen linear algebra library.
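A tiny Theano example of that pattern, for concreteness (graph built lazily, gradient derived symbolically, then compiled and run):

  import theano
  import theano.tensor as T

  x = T.dscalar('x')          # building the graph: nothing is computed yet
  y = x ** 2
  dy_dx = T.grad(y, x)        # symbolic differentiation on the in-memory graph

  f = theano.function([x], [y, dy_dx])   # code generation + compilation happen here
  print(f(3.0))               # [array(9.0), array(6.0)]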


"defines structure of a computation graph lazily (without executing the nodes when defining the model)"

But this sounds exactly like expression template.


But neither of the declarative DL toolkits (Theano and TensorFlow) uses that C++ language feature: the computation graph is typically defined by writing a Python script that assembles building blocks dynamically at runtime.

Once the graph is defined, it can be passed, along with concrete values for the input nodes, to the runtime framework to execute the section of the graph of interest (possibly with code generation + compilation).


Li Mu, the core developer behind MXNet, recently started working for Amazon.


[offtopic] I think presentations with ascending bar charts are sort of cliche.


> Machine learning (...) is being employed in a range of computing tasks where programming explicit algorithms is infeasible.

I found this comment interesting. Is this really the summary of what machine learning is about?


Image classification is a classic example of such a task. How exactly would you go about writing an algorithm to tell the difference between a picture of a cat and a picture of a dog?


Well, this might be cheating, but I would apply a bunch of different filters for things like edge detection, etc. Then I would come up with a statistical model that, for each feature, gave the likelihood that the image under consideration was a dog. Then I would aggregate all those results into a final likelihood.

Not trying to be sarcastic, I just can't think of any way other than the ML way.


To further the point: which filters would you choose? What features could you choose heuristically to distinguish between the two? They both have fur, they both have four legs, they both have two eyes, and they both come in a wide variety of colors and patterns... Most dogs have an elongated snout, but not all of them (pugs, bulldogs, etc.).

I would be extremely impressed if someone developed an algorithm that could accomplish this task without using any type of statistical/machine learning.


How to draw an owl: http://imgur.com/gallery/RadSf


Yes! Sometimes you know that a solution will take a particular mathematical form, without knowing what the parameters will be. So you can write down a program (function) that can express any solution of that form, and use an optimization algorithm, e.g. gradient descent on labeled examples, to figure out which specific instance of your possible solutions works best.
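A toy version of that idea in plain NumPy: assume the answer has the form y ~ w*x + b, then let gradient descent on labeled examples find w and b:

  import numpy as np

  # Labeled examples generated from a rule unknown to the learner (y = 3x + 1 plus noise).
  rng = np.random.RandomState(0)
  x = rng.uniform(-1, 1, size=100)
  y = 3 * x + 1 + 0.1 * rng.randn(100)

  w, b, lr = 0.0, 0.0, 0.1
  for _ in range(500):
      err = (w * x + b) - y              # prediction error on all examples
      w -= lr * 2 * np.mean(err * x)     # gradient of mean squared error w.r.t. w
      b -= lr * 2 * np.mean(err)         # ... and w.r.t. b
  print(w, b)                            # close to 3 and 1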


MXNet is the only deep learning framework that has proper support for R. That's why I use it and it is pretty nice IMO.


Isn't TF available in R as of late, too, from the RStudio guys? Still incomplete?


Atrocious syntax.


Can someone please spell out for us muggles what sets these frameworks (Theano, TensorFlow, Torch, CNTK, MXNet) apart? They all seem to be doing essentially the same thing underneath.


Cloud vendor feature signaling, mostly.

Microsoft wants you to use CNTK on Azure. Amazon wants you to use MXNet on AWS. Google wants you to use TensorFlow on GCP.

It's irrelevant whether these frameworks can be used outside their home platform by broke college students. That's a red herring. The cloud vendors are looking to sell enterprise contracts, and they need to check all of the boxes.

This strategy makes complete sense from a business perspective, and you really cannot fault them for doing it.



