
They can be; pip developers just have to care about this. Nothing you described precludes file-based deduplication, which is what pnpm does for JS projects: it stores all library files in a global content-addressable directory and creates {sym,hard,ref}links into it, depending on what your filesystem supports.
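
To make the idea concrete, here is a rough Python sketch of a content-addressable store with link-based installs. It is not pnpm's or pip's actual code; `install_via_store`, `file_digest`, and the store layout are names I made up for illustration, and it only tries hardlinks with a symlink fallback (no reflinks).

```python
import hashlib
import os
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def install_via_store(src: Path, dest: Path, store: Path) -> Path:
    """Hypothetical helper: keep one copy of src's content in `store`,
    named by its hash, then make `dest` a link to that copy."""
    store.mkdir(parents=True, exist_ok=True)
    blob = store / file_digest(src)
    if not blob.exists():
        shutil.copy2(src, blob)  # first time this content is seen
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists() or dest.is_symlink():
        dest.unlink()
    try:
        os.link(blob, dest)  # hardlink: same inode, no extra data blocks
    except OSError:
        os.symlink(blob, dest)  # fallback, e.g. store on another filesystem
    return blob
```

Installing the same file into two environments leaves a single blob in the store. Note that with hardlinks or symlinks, editing one copy in place changes all of them.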

Being able to mutate files requires reflinks, but they're supported by e.g. XFS, so you don't have to go to COW filesystems that have their own disadvantages.

You can do something like that manually for virtualenvs too, by running rmlint on all of them in one go. It will pick an "original" for each duplicate file, and will replace duplicates with reflinks to it (or other types of links if needed). The downside is obvious: it has to be repeated as files change, but I've saved a lot of space this way.
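
For anyone who wants to see the shape of that pass without installing anything, below is a minimal Python sketch of the same idea. It is not how rmlint is implemented: `dedupe_tree` is a hypothetical helper, it uses hardlinks rather than reflinks (the standard library has no portable reflink call), and it assumes all roots are on one filesystem.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def dedupe_tree(roots):
    """Replace duplicate regular files under `roots` with hardlinks.

    Returns the number of bytes reclaimed. Reads each file fully into
    memory, which is fine for a sketch but not for very large files.
    """
    by_digest = defaultdict(list)
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if path.is_file() and not path.is_symlink():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                by_digest[digest].append(path)

    saved = 0
    for paths in by_digest.values():
        original, *duplicates = paths  # first path found is the "original"
        for dup in duplicates:
            if os.path.samefile(original, dup):
                continue  # already linked by an earlier run
            size = dup.stat().st_size
            tmp = dup.with_name(dup.name + ".dedupe-tmp")
            os.link(original, tmp)  # create the link under a temporary name
            os.replace(tmp, dup)    # then atomically swap it into place
            saved += size
    return saved
```

Something like `dedupe_tree(["venv_a", "venv_b"])` would cover two environments in one go. Because these are hardlinks, a later in-place edit to one file shows up in every linked copy, which is exactly why reflinks are the safer choice where the filesystem supports them.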

Or just use a filesystem that supports deduplication natively like btrfs/zfs.

https://github.com/sahib/rmlint




This is unreasonably dismissive: the `pip` authors care immensely about maintaining a Python package installer that successfully installs billions of distributions across disparate OSes and architectures each day. Adopting deduplication techniques that only work on some platforms, some of the time means a more complicated codebase and harder-to-reproduce user-surfaced bugs.

It can be worth it, but it's not a matter of "care": it's a matter of bandwidth and relative priorities.


"Having other priorities" uses different words to say exactly the same thing. I'm guessing you did not look at pnpm. It works on all major operating systems; deduplication works everywhere too, which shows that it can be solved if needed. As far as I know, it has been developed by one guy in Ukraine.


Send a PR!

Are there package name and version disclosure considerations when sharing packages between envs with hardlinks, and does that matter for this application?

Practically, caching ~/.cache/pip should save resources. From "What to do about GPU packages on PyPI?" https://news.ycombinator.com/item?id=27228963 :

> "[Discussions on Python.org] [Packaging] Draft PEP: PyPI cost solutions: CI, mirrors, containers, and caching to scale" [...]

> How to persist ~/.cache/pip between builds with e.g. Docker in order to minimize unnecessary GPU package re-downloads:

  RUN --mount=type=cache,target=/root/.cache/pip

  RUN --mount=type=cache,target=/home/appuser/.cache/pip



