Cluegen – Python Data Classes From Type Clues (github.com/dabeaz)
57 points by gigatexal on May 13, 2020 | 33 comments



Python imports and slow startup speed hurt Python developers on moderate and large projects. It's the downside of not having static module exports. I believe the JS/TypeScript folks try to avoid this trap. In Python, to know what a module exports you have to run the module; there is no static-analysis way to find out otherwise.

Making imports fast for data classes may solve some of the problems. Big old projects like Plone/Zope have solved this problem by building a more generic lazy import system that is used extensively.

Use the zope.deferredimport package for this:

https://zopedeferredimport.readthedocs.io/en/latest/narrativ...

Though I am not sure if zope.deferredimport has been updated to play nicely with modern typing tools like editors and MyPy.
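
If pulling in a Zope dependency isn't an option, the same lazy-export idea can be sketched with nothing but the standard library via PEP 562's module-level __getattr__ (Python 3.7+). A rough sketch for a package's __init__.py; the package and attribute names are made up for illustration:

  import importlib

  # exported name -> module that actually defines it (hypothetical names)
  _lazy = {
      "HeavyThing": "mypackage._heavy",
      "OtherThing": "mypackage._other",
  }

  def __getattr__(name):
      # called only when `name` isn't already in the module's globals
      try:
          module = importlib.import_module(_lazy[name])
      except KeyError:
          raise AttributeError(
              f"module {__name__!r} has no attribute {name!r}") from None
      value = getattr(module, name)
      globals()[name] = value   # cache so later lookups skip this hook
      return value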


>Python imports and slow startup speed hurts Python developers on moderate and large projects

What KLOC counts as moderate or large? I have not run into a project where imports took that long.

>I believe JS/TypeScript folks try to avoid this trap.

Have you ever had to run a node project? I haven't benched it, but I think it would be faster to compile a Rust project and start that up than to use node. But I can't web, so maybe there are special ways to make node start quickly that I just don't know about. (I think there's a law where you can get answers more quickly by declaring something impossible, to trigger people into flooding you with great suggestions :-) ).


The problem I experienced multiple times in 50-200 KLOC projects is not the time needed to import the modules, but the memory consumption caused by the imports. Moving some imports from top-level module statements into function bodies could reduce memory consumption several-fold, e.g. from 250 MB per process to 80 MB per process.

One tool I used was https://github.com/mnmelo/lazy_import but I'm not sure it's updated for Python 3.7/3.8.
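
The "move imports into function bodies" change mentioned above is as simple as it sounds; a minimal sketch (pandas is just a stand-in for any heavy dependency):

  def export_report(rows):
      # imported on first call instead of at process startup, so workers
      # that never build a report don't pay the import's memory cost
      import pandas as pd
      return pd.DataFrame(rows).to_csv(index=False)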


Ah, maybe it's the use of pkg_resources that you find to be a problem with imports. That's indeed a horror story, especially with NFS-mounted Python files.


Another approach to speeding up python imports I just found out about yesterday: https://pyoxidizer.readthedocs.io/en/oxidized_importer-0.1/o...


Love this bit from the Q&A section:

> You should pronounce it as "kludg-in" as in "runnin" or "trippin". So, if someone asks "what are you doing?", you don't say "I'm using cluegen." No, you'd say "I'm kludgin up some classes." The latter is more accurate as it describes both the tool and the thing that you're actually doing. Accuracy matters.


I read the README only after this comment and was surprised by how detailed it was. Why would someone put effort into such a complicated tool, benchmark it, etc., and then not really take it seriously? Then I noticed who the author was, and it suddenly made complete sense.


I don't know the author at all really, but I do appreciate a professional in any field who can find humor in their work and avoid taking themselves too too seriously.

The writing in the README reminds me of people like Derek Lowe (author of the In The Pipeline pharma/bio/whatever blog) and John D. Clark (author of Ignition!) who can create exceptional things and deliver knowledge in a humorous and engaging way.


There is no setup.py because we can just copy cluegen.py to our project and modify it. And there will be no new features.

This is probably the first library of its kind that is unpackaged and claims that it doesn't need packaging.


David Beazley usually does that, and also refuses PRs.

He is interested in providing a PoC (like with curio), but not in doing what comes after.


I don’t know, the code itself is shorter and easier to read than the readme initially was. And maybe this will dissuade folks from installing and using it if they’re not committed to maintaining it as part of their dependencies. Dependencies do require your maintenance and oversight of new code revisions, but few people schedule time for that. It’s essential though: any dependency could arbitrarily change its behaviour at any time, technically breaking your code until you can produce a fix. So dependencies are technically yours to maintain; you just choose to limit your maintenance from a “fork” to a version string and API usage, but the maintenance burden still exists...


Yes, and he doesn't care.


That is the dirtiest use of metaprogramming in Python I've ever seen. Well done.


That’s just David being David. It’s awesome. And his talks always illuminate all the cool parts of python as he does crazy things with it.


And also necessary, because Python is the dirtiest VM in popular existence.


The main problem with this approach is that it conflates type hierarchies (through class inheritance) with developer convenience; there are reasons why both attrs and dataclasses chose the decorator approach.
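
For readers who haven't looked at either README, the contrast is roughly this (cluegen hooks in through a base class, dataclasses through a decorator):

  from cluegen import Datum
  from dataclasses import dataclass

  class Point(Datum):      # convenience arrives via inheritance, so it is
      x: int               # entangled with the class's type hierarchy
      y: int

  @dataclass
  class Point2:            # convenience is orthogonal to inheritance;
      x: int               # Point2 is free to subclass whatever it likes
      y: int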


I might quibble at calling this a "problem".

Every ingredient in the rack is spice. There is a "proper" amount to use for the recipe. Let good taste be the guide. Never go Full [language I like to drag].


dabeaz is the King Bumi of the python world.

And, true to his physicist roots, he’s very good at gedankenexperimenten.

I’m unlikely to actually use this and can’t really think why I would, but that’s hardly the point.


I'm trying to figure out if you intend some comparison beyond "somewhat mad but effective."

Is he waiting patiently for the right moment to strike? Does he pretend to be dumb but is actually incredibly smart?

:-)


he’s a mad genius who would rather go cabbage-traincar sledding with his friends than rule.

there’s a lot to admire in that, frankly.


Do they have that in Evanston?


no, just hackney's.



I use pydantic too and it's great. Funny that the README of this new lib claims it hasn't been done before. Good to have choices nevertheless.


It's not the same goal: pydantic is a validation and serialization library. The scope is way bigger.
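
For example (hypothetical model, but this is pydantic's basic usage): pydantic validates and converts input at runtime, which neither cluegen nor dataclasses attempts:

  from pydantic import BaseModel

  class User(BaseModel):
      id: int
      name: str

  u = User(id="42", name="bob")   # "42" is validated and coerced to int
  print(u.id)                     # 42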


i'm curious about that __eq__. it generates code like

  (self.x, self.y) == (other.x, other.y)
but i think this would be more efficient:

  self.x == other.x and self.y == other.y
because you avoid creating a tuple (possibly heap-allocated) and if `.x` differs, you avoid a dictionary lookup¹ for `.y` thanks to short-circuiting. i think i benchmarked it at some point, but that was a while ago (on CPython 3.5)
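
a quick throwaway way to re-check that today (made-up class and values, not the old benchmark):

  import timeit

  class P:
      __slots__ = ("x", "y")
      def __init__(self, x, y):
          self.x, self.y = x, y

  a, b = P(1, 2), P(9, 2)   # first fields differ, so short-circuiting helps

  tuple_eq = lambda s, o: (s.x, s.y) == (o.x, o.y)
  short_eq = lambda s, o: s.x == o.x and s.y == o.y

  print(timeit.timeit(lambda: tuple_eq(a, b), number=1_000_000))
  print(timeit.timeit(lambda: short_eq(a, b), number=1_000_000))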

---

btw, i went with a similar approach (i.e. everything is codegened) in my sum type library: http://github.com/lubieowoce/sumtype

i never got around to generating the methods lazily, maybe i should!

for correctness reasons i switched from interpolating raw strings to <array of (line, indent-level)> and a bunch of wrappers – generating if-elif-else chains via raw strings gets scary. but they work well enough here
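
roughly the shape of that idea (not sumtype's actual code): build the method as (text, indent-level) pairs, render them, then exec into a namespace:

  def render(lines, step="    "):
      # each entry is (source text, indent level); no string interpolation
      return "\n".join(step * indent + text for text, indent in lines)

  lines = [
      ("def __eq__(self, other):", 0),
      ("if self.__class__ is not other.__class__:", 1),
      ("return NotImplemented", 2),
      ("return self.x == other.x and self.y == other.y", 1),
  ]

  namespace = {}
  exec(render(lines), namespace)
  __eq__ = namespace["__eq__"]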

---

1. iirc even though it's using slots, it still has to look up the descriptors for `.x` and `.y` in the class dictionary. another possible optimization would be to trade off memory (and extensibility) for time and "cache" those. something in the vein of

  def generate(...):
    ...
    get_x = MyClass.x.__get__
    get_y = MyClass.y.__get__
    ...
    def eq(self, other):
      return (get_x(self), get_y(self)) == (get_x(other), get_y(other))
  
    MyClass.__eq__ = eq
here, `get_x` and `get_y` would be closed-over in `eq`, so they shouldn't incur a dictionary lookup. which of course adds significant complexity, but that may be a sensible trade-off


> Yes. Yes, you could do that if you wanted your class to be slow to import, wrapped up by more than 1000 lines of tangled decorator magic, and inflexible.

At first I thought: there's no way someone is this categorical about some obscure aspect of python's internals; then I noticed it's from dabeaz. What a man :)


>>> Q: Who maintains cluegen?

>>> A: If you're using it, you do. You maintain cluegen.

while I know and respect dabeaz for all his work, I still prefer to rely on the "official" dataclasses module.



luckily there is a PR that will fix the par.t of th.e read.me that is un.read.able!

https://github.com/dabeaz/cluegen/pull/4/commits/b506ed86697...


But why? This was obviously done on purpose, as a tongue-in-cheek comment on the attrs naming convention...


I was not taking the PR seriously, but found it funny enough to mention.


That's a joke, and Beazley doesn't accept PRs anyway.



