Nice. I will definitely be taking a look at this. Have you looked at the xFormers library? They are looking at the same problem as you, but their focus is more on providing performant transformer modules using Triton. Using specific components from the library is not that simple, though: I kept running into runtime errors, so I've set it aside for now. I am building something based on the BERT architecture, so I will give this a look. Thanks for all the work!
I would've loved to look at xFormers, but I avoided looking at other implementations to make sure that ours is a clean-room implementation.
Curated Transformers started as a very small library just for spaCy (spaCy 3.7 transformer pipelines use Curated Transformers) with just the older encoder models (BERT, RoBERTa, etc.). Before that, spaCy used Hugging Face Transformers for the provided transformer models, but we wanted something where we could easily hook into different parts of the model (e.g. for distillation).
After the functionality needed for spaCy was done, Matt @ Explosion encouraged us to extend it into a more general PyTorch library that would also support decoder architectures, generation, etc.