What a great coincidence. I found gpt-2-simple on Friday and just got it running in a Flask app on Fargate a few minutes ago. gpt-2-simple made the process so simple that my biggest problems were infra, not inference.
Have you heard of any success on running in a lambda?
GPT-2 small might be too big/slow for a Lambda (admittedly I'm less familiar with the AWS stack, more familiar with GCP). In the meantime, I do have it running on Cloud Run (https://github.com/minimaxir/gpt-2-cloud-run) with decent success.
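For anyone curious, a Cloud Run deploy of a container like that is roughly two commands (project and service names below are placeholders, and the memory figure is a ballpark for the small model, not a documented requirement):

```shell
# Build the container image with Cloud Build and push it to the registry
# (MY_PROJECT and gpt2-app are placeholder names)
gcloud builds submit --tag gcr.io/MY_PROJECT/gpt2-app

# Deploy to Cloud Run; bump memory above the default since the
# model weights have to fit in RAM
gcloud run deploy gpt2-app \
  --image gcr.io/MY_PROJECT/gpt2-app \
  --memory 2Gi \
  --allow-unauthenticated
```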
Love the writeup! Not sure if it's just me, but showing how your entire infra works is just amazing (I wrote something similar on how I did my infra: https://sdan.xyz/sd2).
How much do you spend per month on this?
Also, why do you need a GPU? Last time I played around with GPT, you didn't need a GPU to run inference (you already have the weight files).
Regarding the diagram: yeah, it's SO much easier to visualize when you see it. Oddly, it takes a bit of time just to make the boxes fit, though. Huge fan of Traefik too :)
Most of my fine-tuning was on 355M (774M is too hard to train).
A GPU is helpful for much faster inference -- for instance, CPU inference is fine if you just want a one-shot generation of 200 words (no revisions), but if you're constantly changing and revising, then speed ends up mattering a lot for UX.
EDIT: Looked over your articles, I'm literally floored/amazed you're in high school and know this much. The whole world is your oyster.
Really cool writeup! I'm surprised you trained with TF and then deployed with PT. I feel like PT is considered easier to train with, while TF is easier and faster for setting up production-level serving. I think you might have had to do less work if you'd used TF Serving.
https://github.com/huggingface/transformers (was) based on PyTorch, so they originally had a lot of the models like gpt2-small. Because of that influence, PyTorch was probably going to win there.
The one con of TF Serving (TFX) is packaging the entire model into the container (so that ends up being 3 GB+?). This was a couple of months ago, so I might be wrong by now... It was an area I wasn't very confident I could do; TF Serving is really new, so there aren't many guides (and many were already out of date).
TF Serving has been around several years and is pretty mature. (It predates TFX). You can stick the model in the container or have the container pull the model from blob storage (or have no container at all if you really want).
It's too bad you didn't find a good guide - if you have the training dump a SavedModelBundle at the end, you can have a production-quality serving microservice up and running in about two lines of code - https://www.tensorflow.org/tfx/serving/docker.
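For reference, those "two lines" look roughly like this per the linked docs (the model name and path are placeholders; TF Serving expects the SavedModel inside a numbered version subdirectory, e.g. `/models/gpt2/1/`):

```shell
# Pull the official TF Serving image
docker pull tensorflow/serving

# Serve the SavedModel over a REST API on port 8501
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/saved_model_dir,target=/models/gpt2 \
  -e MODEL_NAME=gpt2 -t tensorflow/serving
```

After that, predictions are a POST to `http://localhost:8501/v1/models/gpt2:predict`.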
But it doesn't really matter since you got it working.
To give some perspective, the setup of http://textsynth.org consists of a single 250 KB Linux C executable on the server and about 150 lines of JavaScript code on the client, with no dependency on other libraries...
All cloud providers have prepackaged VM/Containers with all the versions aligned and GPU libraries + DL frameworks preinstalled with full GPU support enabled. They are generally called things like Deep Learning container/VM/AMI.
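On GCP, for instance, spinning up a Deep Learning VM looks something like this (the instance name, zone, and accelerator choice are examples, and image family names may have changed since):

```shell
# Create a VM from a prebuilt Deep Learning image with CUDA,
# drivers, and TensorFlow already installed
gcloud compute instances create my-dl-box \
  --zone=us-west1-b \
  --image-family=tf-latest-gpu \
  --image-project=deeplearning-platform-release \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --metadata=install-nvidia-driver=True
```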
>All cloud providers have prepackaged VM/Containers with all the versions aligned and GPU libraries + DL frameworks preinstalled with full GPU support enabled.
Ah, didn't know that. I've been spinning up blank *nix boxes, which ends up a little fiddly until you find a combo that works.