Algorithmia founder here. nvidia-docker is helpful but does not address all the issues with running GPU computing inside of docker. There are driver issues on the host OS, and the real challenge is running multiple GPU jobs inside of separate docker containers and sharing the GPU.
I agree that building models is still definitely a big challenge, but the tooling and knowledge are getting better every day. Either way, our goal with Algorithmia is to create a channel for people to make their models available, and to create an incentive for people to put in the effort to train really solid, useful models.
Agreed. I brought up a system with nvidia-docker last week for a computer vision application and while it works, it seems fragile. There are more pieces than there should be and it seems easy to break. I also don't know if we can use multiple containers on one host, but it doesn't sound like it.
It is not the final solution for containerized GPU applications.
Author of nvidia-docker here. You can definitely have multiple containers on each GPU if you want. If you find a bug or if you think the documentation was not great, please file a bug!
Awesome. Thanks for the reply and I apologize for suggesting something incorrect.
It does strike me as tricky needing to match driver versions between the host and the container. Do you know if there is any effort to eliminate that requirement?
Also while we're chatting, is there any hope of NVIDIA open sourcing their linux drivers? How would such a move affect nvidia-docker?
You don't need to match the driver version between the host and the container. Actually, you shouldn't include any driver file inside the container.
All the user-level driver files required for execution are mounted from a volume when the container is started. This way you can deploy the same container on any machine with NVIDIA drivers installed.
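To make the mounting idea concrete, here is a rough sketch of the kind of `docker run` argument list this boils down to. It only builds the argv and runs nothing. The helper name `docker_run_args`, the volume name `nvidia_driver`, the mount point `/usr/local/nvidia`, and the exact device paths are assumptions for illustration, not a statement of what nvidia-docker does internally.

```python
from typing import List, Sequence


def docker_run_args(image: str, command: Sequence[str],
                    gpu_indices: Sequence[int], driver_volume: str) -> List[str]:
    """Build an argv for `docker run` that exposes host GPUs to a container.

    Hypothetical helper: the device paths and mount point are assumptions.
    """
    args = ["docker", "run", "--rm"]
    # Control devices shared by every GPU (assumed device paths).
    for dev in ("/dev/nvidiactl", "/dev/nvidia-uvm"):
        args.append(f"--device={dev}")
    # One device node per GPU the container should see.
    for idx in gpu_indices:
        args.append(f"--device=/dev/nvidia{idx}")
    # Host driver files come in read-only from a volume; the image ships none.
    args.append(f"--volume={driver_volume}:/usr/local/nvidia:ro")
    args.append(image)
    args.extend(command)
    return args


if __name__ == "__main__":
    argv = docker_run_args("nvidia/cuda", ["nvidia-smi"], [0, 1], "nvidia_driver")
    print(" ".join(argv))
```

Because the driver files arrive through the volume rather than the image, the same image name can be passed on any host; only the volume contents differ per machine.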
Thanks for your superb work. Is it possible to use nvidia-docker on several AWS instances, to use multiple GPUs? (To spread training across multiple GPUs for more speed and RAM. TensorFlow and Caffe support distributed training, but I'm not sure if it's viable in dockerized environments on AWS.)
One container can use multiple GPUs on the same machine without problems.
For distributed training (which the official version of Caffe doesn't actually support), you would have to run one container per instance, but this is more a configuration problem at the framework level than a Docker or nvidia-docker problem.
Graphistry co-founder here. We do that every day with nvidia-docker & AWS.
The real challenge is doing this on 100+ GPUs and leveraging multitenancy for an additional 100X+ economy of scale. We're actively working on it, and in my experience, this seems like a classic scheduling area where different domains will want to do it differently. However, even there, it'll end up something like "plug in a new user-level mesos scheduler x", and Nvidia is working on exactly that.
I'll wait for someone at Baidu or the Titan lab to blow up those numbers by another 100-1000X ;-)
Edit: If this sounds like a cool problem, we're leveraging GPU cloud computing and visual graph analytics for event analysis (e.g., core tool for teams in enterprise security). We would love help, esp. on cloud infrastructure or on connecting the eco-system together! Contact build@graphistry.com and we'll figure something out :)
Well yes, you do need to have the driver installed on the host OS :)
You can run multiple containers on the same GPU with nvidia-docker; it's exactly the same as running multiple processes (without Docker) on the same GPU.
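As a minimal sketch of that, the snippet below plans several containers pinned to one GPU. It assumes the `NV_GPU` environment variable is used to select which GPUs the `nvidia-docker` wrapper exposes; the helper name `shared_gpu_jobs` and the image names are made up for illustration, and nothing is actually launched.

```python
from typing import Dict, List, Sequence, Tuple

Job = Tuple[Dict[str, str], List[str]]


def shared_gpu_jobs(images: Sequence[str], gpu_index: int) -> List[Job]:
    """Plan one container per image, all sharing the same GPU.

    Hypothetical helper: returns (env overrides, argv) pairs only.
    """
    jobs: List[Job] = []
    for image in images:
        # Every job gets the same GPU index, so the containers share the
        # device just as independent host processes would.
        env = {"NV_GPU": str(gpu_index)}
        argv = ["nvidia-docker", "run", "--rm", image]
        jobs.append((env, argv))
    return jobs


if __name__ == "__main__":
    for env, argv in shared_gpu_jobs(["vision-job", "training-job"], 0):
        print(env, " ".join(argv))
```

To actually start them, each pair could be handed to `subprocess.Popen(argv, env={**os.environ, **env})`. Memory and compute on the GPU are still shared between the jobs, the same as with plain processes.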