Hacker News

> Most of the cutting edge papers are trained on several $100k worth of GPU time

You can scale some things down. VGG-16 is basically a stack of convolutional layers; there's no reason you need 16 of them with a 224x224x3 input. You can just as easily watch a 4-layer CNN learn filters on 64x64x1 inputs. Obviously if the paper's result comes from sheer compute this won't work, but plenty of results come purely from the architecture.
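A minimal sketch of that downscaling in PyTorch (the class name and layer widths here are illustrative, not from any paper): four conv layers on 64x64x1 inputs instead of VGG-16's sixteen layers on 224x224x3, small enough to train on a laptop.

```python
import torch
import torch.nn as nn

# Hypothetical scaled-down stand-in for VGG-16: four conv/pool stages
# on 64x64 grayscale inputs, halving spatial size at each stage.
class TinyVGG(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 8 -> 4
        )
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

out = TinyVGG()(torch.randn(2, 1, 64, 64))
print(out.shape)  # torch.Size([2, 10])
```

Train it on MNIST-sized data and you can still visualize the first-layer filters, which is the point of the exercise.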

You could also implement and run networks that are designed to be cheap to compute: ResNet and InceptionNet, for example. I think this is a pretty important part of the space right now, considering how performant, general, and therefore inefficient Transformer architectures are.
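The core trick in those families fits in a few lines. A sketch of a ResNet-style bottleneck block in PyTorch (names and widths are illustrative): 1x1 convs shrink and restore the channel count so the expensive 3x3 conv runs on fewer channels, and the identity skip connection is what makes very deep stacks trainable.

```python
import torch
import torch.nn as nn

# Illustrative bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand,
# with the input added back via an identity skip connection.
class Bottleneck(nn.Module):
    def __init__(self, channels, squeeze=4):
        super().__init__()
        mid = channels // squeeze  # run the 3x3 conv on fewer channels
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(),
            nn.Conv2d(mid, channels, 1),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.body(x))  # skip connection preserves shape

x = torch.randn(1, 64, 32, 32)
print(Bottleneck(64)(x).shape)  # same shape as the input: [1, 64, 32, 32]
```

Because the block preserves its input shape, you can stack as many as your budget allows.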




But these are "old" models from 5+ years ago. Implementing them is not going to help you get up to speed with more recent AI research. From the OP's post, it seems like he already knows these basics.



