What a great coincidence. I found gpt-2-simple on Friday and just got it running in a Flask app on Fargate a few minutes ago. gpt-2-simple made the process so simple that my biggest problems were infra, not inference.
Have you heard of any success on running in a lambda?
GPT-2 small might be too big/slow for a Lambda (admittedly I'm less familiar with the AWS stack, more familiar with GCP). In the meantime, I do have it running on Cloud Run (https://github.com/minimaxir/gpt-2-cloud-run) with decent success.
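For anyone curious, a Cloud Run deploy of a container like that is roughly two commands (project and service names below are placeholders, and the memory figure is a ballpark for the small model, not a documented requirement):

```shell
# Build the container image with Cloud Build and push it to the registry
# (MY_PROJECT and gpt2-app are placeholder names)
gcloud builds submit --tag gcr.io/MY_PROJECT/gpt2-app

# Deploy to Cloud Run; bump memory above the default since the
# model weights have to fit in RAM
gcloud run deploy gpt2-app \
  --image gcr.io/MY_PROJECT/gpt2-app \
  --memory 2Gi \
  --allow-unauthenticated
```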
Love the writeup! Not sure if it's just me, but showing how your entire infra works is just amazing (I wrote something similar on how I did my infra: https://sdan.xyz/sd2).
How much do you spend per month on this?
Also, why do you need a GPU? Last time I played around with GPT, you didn't need a GPU to run inference (you already have the weight files).
Regarding the diagram: yeah, it's SO much easier to visualize when you see it. Oddly, it takes a bit of time just to make the boxes fit, though. Huge fan of Traefik too :)
Most of my fine-tuning was on 355M (774M is too hard to train).
A GPU is helpful for much faster inference -- for instance, CPU inference is fine if you just want a one-shot generation of 200 words (no revisions), but if you're constantly changing and revising, then speed ends up mattering a lot for UX.
EDIT: Looked over your articles, I'm literally floored/amazed you're in high school and know this much. The whole world is your oyster.
Really cool writeup! I'm surprised you trained with TF and then deployed with PT. I feel like PT is considered easier to train with, while TF is easier and faster for setting up production-level serving. I think you might have had to do less work if you'd used TF Serving.
https://github.com/huggingface/transformers (was) based on PyTorch, so they originally had a lot of the models like gpt2-small. Because of that influence, PyTorch was probably going to win there.
The one con of TF Serving (TFX) is packaging the entire model into the container (so that ends up being 3 GB+?). This was a couple of months ago, so I might be wrong by now... It was an area I wasn't very confident I could do; TF Serving is really new, so there aren't many guides (and many were already out of date).
TF Serving has been around several years and is pretty mature. (It predates TFX). You can stick the model in the container or have the container pull the model from blob storage (or have no container at all if you really want).
It's too bad you didn't find a good guide - if you have the training dump a SavedModelBundle at the end, you can have a production-quality serving microservice up and running in about two lines of code - https://www.tensorflow.org/tfx/serving/docker.
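For reference, those "two lines" look roughly like this per the linked docs (the model name and path are placeholders; TF Serving expects the SavedModel inside a numbered version subdirectory, e.g. `/models/gpt2/1/`):

```shell
# Pull the official TF Serving image
docker pull tensorflow/serving

# Serve the SavedModel over a REST API on port 8501
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/saved_model_dir,target=/models/gpt2 \
  -e MODEL_NAME=gpt2 -t tensorflow/serving
```

After that, predictions are a POST to `http://localhost:8501/v1/models/gpt2:predict`.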
But it doesn't really matter since you got it working.
To give some perspective, the setup of http://textsynth.org consists of a single 250 KB Linux C executable on the server and about 150 lines of JavaScript code on the client, with no dependency on other libraries...
All cloud providers have prepackaged VM/Containers with all the versions aligned and GPU libraries + DL frameworks preinstalled with full GPU support enabled. They are generally called things like Deep Learning container/VM/AMI.
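On GCP, for instance, spinning up a Deep Learning VM looks something like this (the instance name, zone, and accelerator choice are examples, and image family names may have changed since):

```shell
# Create a VM from a prebuilt Deep Learning image with CUDA,
# drivers, and TensorFlow already installed
gcloud compute instances create my-dl-box \
  --zone=us-west1-b \
  --image-family=tf-latest-gpu \
  --image-project=deeplearning-platform-release \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --metadata=install-nvidia-driver=True
```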
>All cloud providers have prepackaged VM/Containers with all the versions aligned and GPU libraries + DL frameworks preinstalled with full GPU support enabled.
Ah, didn't know that. I've been spinning up blank *nix boxes, which ends up a little fiddly until you find a combo that works.