I Broke Rust's Package Manager for Windows Users (sasheldon.com)
351 points by sasheldon on May 7, 2017 | 162 comments



This is a great example of a broader truth about software. As software grows in usage and use cases, it starts bumping up against edge conditions that need to be handled for various reasons.

Cargo now is becoming stronger and more stable because of bugs like this being discovered. All software goes through this growth cycle. It's great to see these things worked out in the various projects that support Rust.

There is another point here though: anytime the question comes up of just rewriting a piece of software and throwing out all the technical debt, it's not as straightforward as it seems. Remember, buried in that technical debt is a lot of valuable learning written into the code. I haven't worked on Windows directly in years, but I never knew that NUL was a reserved file name. I would have made this mistake, and probably still will in the future.

Which makes me wonder, has anyone written a file name validation crate that guarantees you're not writing to any reserved name on the host OS's filesystem? A quick search of crates.io doesn't turn anything up.
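
For illustration, a minimal sketch of what such a check might look like (the helper name is hypothetical; the rule set follows Microsoft's documented reserved names: CON, PRN, AUX, NUL, COM1-COM9 and LPT1-LPT9, with or without an extension):

  // Hypothetical helper: true if `name` would collide with a
  // Windows reserved device name, with or without an extension
  // ("aux.rs" is just as problematic as "aux").
  fn is_reserved_windows_name(name: &str) -> bool {
      // Only the part before the first dot matters.
      let stem = name.split('.').next().unwrap_or("");
      let upper = stem.to_ascii_uppercase();
      match upper.as_str() {
          "CON" | "PRN" | "AUX" | "NUL" => true,
          // COM1-COM9 and LPT1-LPT9 are also reserved.
          _ => (upper.starts_with("COM") || upper.starts_with("LPT"))
              && upper.len() == 4
              && (b'1'..=b'9').contains(&upper.as_bytes()[3]),
      }
  }

  fn main() {
      assert!(is_reserved_windows_name("nul"));
      assert!(is_reserved_windows_name("aux.rs"));
      assert!(!is_reserved_windows_name("null"));
  }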


It also shows how necessary it is to have some sort of deprecation process. Maintaining nonsensical landmine features for compatibility with an operating system released 36 years ago is putting the interests of MS's lazy long-term users ahead of the interests of its current users. Even if MS maintained a policy of only removing functionality after a 10-year deprecation period, this "feature" would have been gone long ago. Transitions must be orderly, but they should still happen.

It's nice that Rust's toolchain is better able to live with Windows' crazy ecosystem, but that doesn't make Windows any less crazy.


If you think Microsoft supports features long-term out of laziness then you haven't been paying attention. It's a very deliberate choice that helped them grow their business and keep customers.

Transitions are nice from a development perspective, but I can guarantee you'll never hear a user of your library say they're happy about having to rewrite parts of their code.

Also, Windows doesn't have a monopoly on bizarre filenames/features/etc.; you can find plenty of such things in the *nix family as well.

Lastly, Rust is one of the few projects I've seen that has phenomenal Windows support. It's something that's really appreciated and is going to help them capture markets that other software won't.


> If you think Microsoft supports features long-term out of laziness then you haven't been paying attention.

Misreading. GP talked about MS's lazy long-term users; "lazy" applies to the users, not to Microsoft.


> Also, Windows doesn't have a monopoly on bizarre filenames/features/etc.; you can find plenty of such things in the *nix family as well.

Like what? I'm not aware of special file names in arbitrary directories. Only in known/documented ones like /proc or /dev.

I'd say *nix OSes are too lax in what they allow, as any name without a zero byte or slash is valid.


In the order I happen to think of them: Filenames may be straightforward on the filesystem level, but a lot of UNIX programs do weird things with them. Many programs use "-" to mean STDIN or STDOUT as appropriate. Bash has a somewhat ill-conceived feature where it synthesizes /dev/tcp/$host/$port (and /dev/udp/...) paths that read from and write to TCP or UDP sockets. Most people don't know about this, and a few think it's a UNIX feature rather than a bash-ism.

The fact that multiple /s will be normalized to be the same as one sometimes trips up security code, or code trying to validate that some particular file isn't used (e.g., checking that the filename doesn't start with /dev or a list of other blacklisted directories will fail if the user passes //dev).
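
To make that concrete, a small sketch (the blacklist logic is hypothetical): a string-prefix test misses //dev/null, while a component-wise test catches it, because the path parser collapses repeated separators.

  use std::path::{Component, Path};

  // Naive check: "//dev/null" slips past, even though the kernel
  // resolves it to the same file as "/dev/null".
  fn naive_blacklisted(p: &str) -> bool {
      p.starts_with("/dev/")
  }

  // Component-wise check: repeated separators are collapsed by
  // the parser, so "//dev/null" is caught. (Symlinks, discussed
  // next, can still defeat both checks.)
  fn blacklisted(p: &Path) -> bool {
      let mut c = p.components();
      c.next() == Some(Component::RootDir)
          && c.next() == Some(Component::Normal("dev".as_ref()))
  }

  fn main() {
      assert!(!naive_blacklisted("//dev/null")); // missed
      assert!(blacklisted(Path::new("//dev/null"))); // caught
  }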

Symlinks! Oh, gosh, symlinks. Were this not a stream-of-consciousness dump they probably should have come first. You can do terrible things with symlinks, like upload a tarball or zip file that creates a symlink to an arbitrary location on the system, then use that symlink as a directory reference to plop a file down. (Some archivers prevent this, others don't.)

Also, /dev is just a convention, it's possible to place device nodes anywhere you want.

You can also pretty much mount arbitrary things in arbitrary places via bind mounts. Hard links can also cause some fun with code that assumes file systems aren't cyclic. Windows technically has a lot of these features, but they're harder to get to and less well known, whereas UNIX systems use the various link types even in base Linux installs, so they're readily available.


Is there any particular reason not to have something like /dev/tcp as a real filesystem, rather than a pretend game that bash likes to play?


There were several implementations of that idea in the early 1980s. The following paper describes one of them.

More Taste: Less Greed? or Sending UNIX to the Fat Farm[0] describes a V7 derivative that had /dev/deuna, /dev/arp, /dev/ip, and /dev/udp.

[0] http://www.collyer.net/who/geoff/taste.pdf


Not much detail though. Oh well.


The code in Research Unix V8, V9, and V10 is available. Alcatel-Lucent made them public a couple of months ago[0]. Here are the relevant URLs and file paths within the archives. I already had them on my hard drive and it was easy to grep them. I removed a few columns from the output of tar.

V8

http://www.tuhs.org/Archive/Distributions/Research/Dan_Cross...

  12738 Jul 25  1985 usr/sys/inet/tcp_device.c
V9

http://www.tuhs.org/Archive/Distributions/Research/Norman_v9...

  13461 Aug  6  1986 ./sys/inet.old/old/tcp_device.c
  13457 Feb  3  1987 ./sys/inet.old/tcp_device.c
  13457 Feb 24  1987 ./sys/inet/tcp_device.c
V10

http://www.tuhs.org/Archive/Distributions/Research/Dan_Cross...

  13542 Feb 20  1990 lsys/inet/tcp_device.c
  13622 Mar  9  1992 sys/inet/tcp_device.c
[0] https://news.ycombinator.com/item?id=13971909

Edited to fix formatting.


Oh, nice! But I can't find the right man page.


V8 doesn't have a /dev/tcp man page but the interface is documented at /usr/include/sys/inet/tcp_user.h[0].

Here are the commands I used to identify the right file.

  find . -type f -print0 | xargs -0 grep -I "/dev/tcp" | less

[0] https://pastebin.com/8RT5vpH6

Edited to add the command sequence for the historical record.

Edited again to fix wording of the first sentence.


That's not very documented. How would someone use this?


V10 has a man page. Extract v10src.tar and look at man/adm/man4/tcp.4.


Okay, I took a quick look at it, and this seems way too awkward to use from a shell script. It's pretty much C only. I guess Plan 9 does better.


thanks!


Actually, the stronger case is that the feature should be removed from bash. While it's hard to point at a specific security guarantee that UNIX makes that bash violates by making TCP available via the pseudo-filesystem, it is a non-trivial ambient contribution to general insecurity for UNIX systems. (People itching to reply to that sentence, please parse it carefully first; I chose the adjectives quite carefully. In particular, I did not just call UNIX "generally insecure".)


I find this surprising. If someone can run bash, they can do anything anyway. What am I missing?


Sometimes you don't get to "run bash", but just pass certain parameters, or add things on the end, or whatever other monstrosity an application programmer comes up with to use bash to do something. This allows you to do things like potentially redirect files to sockets of your choice, where you might exfiltrate data, or provide unexpected data to internal processes.

You would be correct in then pointing out that if you pass user parameters to bash without treating them as carefully as you'd treat radioactive waste, you're asking for trouble, and that /dev/tcp doesn't offer much that the various "nc"s don't. That's why I was sort of non-committal about condemning them; it's not like they are a massive breach of security. It's just one more thing that can surprise people if they're trying to lock a system down, and that's already a pretty long list. And since it's not clear to me that it could ever be a short list, that's why I wanted to emphasize I wasn't trying to condemn UNIX. It's just a feature that doesn't add much but complexity to bash, while not really offering any functionality that isn't better done with nc or something, and on balance it probably ought to just be removed from an already complicated and security-sensitive program.


I don't know about radioactive waste, but surely allowing untrusted user input into /dev is unrealistically sloppy. (Famous last words?)

I agree that having this as a bash feature versus just using nc doesn't seem to buy much. But I think having these in the actual file system is useful. So why not do both: expunge them from bash, and get them into /dev (or maybe /net, or wherever they belong).


Symlinks are a poor example, IMO. Yes, they need to be carefully handled for security reasons. But they also offer great flexibility that is actually widely used, and that wouldn't be available through other mechanisms.


To paraphrase: Windows NUL is a poor example, IMO. Yes, it needs to be carefully handled for reasons. But it also offers great flexibility that is actually widely used, and that wouldn't be available through other mechanisms.

I rest my case. ;-)


While your reply is genuinely amusing (thank you), how is it actually true?

What do we gain from having NUL everywhere, as opposed to having it in only one specific location, e.g. root?

Also, as an aside, I thought it wasn't a magic file (nul), but rather a magic device (NUL:), which IMO makes a lot of sense.


But that's just not true. They offer less flexibility than would be available through a special namespace prefix like /dev.


It doesn't offer great flexibility though. It has characteristics that made it useful on ancient versions of DOS and now it only offers annoyances that we have to deal with.


Just look at Mac OS X, which is also from the Unix family. It has the feature of decomposing precomposed characters in file names, so if your software writes a file named "café" (caf\xc3\xa9), and later lists the directory, it will find a file named "café" (cafe\xcc\x81). That tends to confuse software which expects to find a file with the same name after creating it, like for instance git.

For a while, if you were in a team in which some developers were on Linux and others were on Mac OS X, and someone on the Linux side checked in a file named with a diacritic, on the Mac OS X side the file appeared to have been deleted (and a new untracked file with the "same name" appeared). Later git grew special code to work around this misfeature.
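
The mismatch is visible from the byte lengths alone; a minimal sketch (both literals render as "café"):

  fn main() {
      let nfc = "caf\u{e9}";   // precomposed: U+00E9 -> 0xc3 0xa9
      let nfd = "cafe\u{301}"; // decomposed: 'e' + U+0301 -> 0xcc 0x81
      assert_ne!(nfc, nfd);    // not byte-equal, so lookups fail
      assert_eq!(nfc.len(), 5);
      assert_eq!(nfd.len(), 6);
  }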

And yes, Linux has the "bizarre feature" of being way too permissive. A filename is a sequence of bytes of which only the null byte and the slash are forbidden, and only a single or double dot have special meaning; one can have files named with control characters, and/or with something which is not valid for the current character encoding (LC_CTYPE), leading to pain for languages which insist that a string must be always valid Unicode (this includes Rust).

But yeah, nothing compares to the madness that is forbidding simple names like "nul" or "con" or "aux" (alone or followed by any extension) in every single directory, made worse by the fact that you can create files with these names if you use a baroque escaping syntax (which is not available for every API), confusing every other program which does not carefully do the same.

And let's not forget that the file you just created might not be readable or writable the next instant, because some other process (usually some sort of "antivirus") decided to open it in an exclusive mode. I've seen several projects add retry loops when opening (or moving, or deleting) a file on Windows to work around that issue.


> It has the feature of decomposing precomposed characters in file names

I was under the impression that the new APFS stopped trying to understand bytes in filenames at all, thereby switching from 'confusion' to [tableflip] as a policy (which is likely an improvement, but also amuses me on the basis it's nice to know [tableflip] is about the only response anybody has to certain unicode-isms)


(Note that Rust just requires the built-in string type to be valid Unicode; you are free to manipulate other kinds of strings, which is exactly how the OS string problem is solved. It also gives you a chance to explicitly handle the errors.)
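
For example, a directory listing hands back OsString rather than String, so a non-Unicode name becomes a case you handle instead of a crash (a small self-contained sketch):

  use std::ffi::OsString;
  use std::fs;

  fn main() -> std::io::Result<()> {
      for entry in fs::read_dir(".")? {
          let name: OsString = entry?.file_name();
          // to_str() returns None for names that aren't valid
          // Unicode, forcing the caller to decide what to do.
          match name.to_str() {
              Some(s) => println!("unicode name: {}", s),
              None => println!("non-unicode name: {:?}", name),
          }
      }
      Ok(())
  }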


    And let's not forget about the fact that the file you just
    created might not be readable or writable the next instant
    because some other process (usually some sort of
    "antivirus") decided to open it in a exclusive mode.
THIS. Spent quite a long time trying to reproduce a Windows-only bug with the old Rails 2 gem unpacker caused by exactly this; the code would create a directory "foo-1.2.3" and then immediately try to write files to it and fail because of an exclusive lock - on an empty directory.


Exclusive mode is useful when used for good reasons - i.e. to get snapshot semantics (no-one else can change this) while reading, or implement atomic changes (no-one else can see the change halfway) when writing.

The problem on Windows is that too many APIs decided that exclusive should be the default mode if none is specified - which is the safer choice in the sense that it gives the most guarantees (and the least surprise) to the caller, but arguably the adverse effects it causes on other apps are more surprising and harmful in the end.


I agree with the pain points that you described.

Each OS has its set of weird, broken, and surprising behavior, most of it in the name of backwards compatibility. There is a group of people that sees one mess as bearable while considering all the others totally brain-dead. Other groups have somewhat different opinions.

Everything sucks. Which one sucks less? I pick the one that I know more about.


Note that many OS operations in general require retry loops on POSIX systems.


Well, Windows technically supports files with the reserved names - if you use the right APIs - but they break many programs including Explorer. You could make an analogy to Unix filenames with spaces or newlines, which can be created but don't work properly with some tools. (For spaces, try 'make CFLAGS="-I/path/with spaces/"' - there is no way to escape it or otherwise make it work. Newlines break a lot of stuff.)


IIRC you can `make "CFLAGS=-I/path/with spaces/"`


That doesn't make a difference - regardless of where you put the opening quote, make gets the string "CFLAGS=-I/path/with spaces" as argv[1]. The quotes do help, as otherwise it gets split up into multiple arguments to make.

But actually, I was wrong - GNU make passes strings to execute to the shell, so you can use nested quotes: CFLAGS='"-I/path/with/spaces"'. Not sure why I thought differently. The shell itself doesn't work this way, though: when it splits a variable into multiple arguments, it just splits by spaces rather than doing any fancier processing. So there are issues with shell scripts.


The Windows command-line client for PostgreSQL used to produce confusing errors on my machine because my development source code directory happened to be called "C:\dev".

What constitutes "bizarre" depends a lot on what your prior assumptions are.


But it isn't just a 10-year-old feature no one uses. It seems that if you write to the NUL file in any directory, it still works the same as writing to /dev/null today. There might be scripts written yesterday that rely on that behavior.


Joel Spolsky famously praised this policy of backward compatibility at all costs, which he called "The Raymond Chen Camp"[1]. Many agreed with him, but I always thought that Microsoft's compatibility ideals were too radical to be real wisdom. At some point the list of features you try to keep compatibility with grows large enough that the Raymond Chen Way becomes unmaintainable.

The received wisdom of the 90s is wrong. Most users don't care about compatibility, as Apple's success has clearly shown, and most companies are now following the Apple road. Large enterprises care about compatibility, and they pay a lot, but this is not a forward-looking market. They'll keep buying new versions of your software because of the compatibility, but if compatibility is the only story you have to offer, you'll slowly lose that market.

I completely agree with you that Microsoft should have had a strategy for deprecating these features back in the 90s, when they were already old.

In this specific case of outdated filename restrictions, a staged migration could have looked like this:

- Windows NT 3.5: allow accessing all filenames with a special prefix (which they actually did).

- Windows NT 4.0: make it easy to migrate to sane filenames by providing an opt-in per-process flag that makes all APIs use them by default. At this point they could easily dogfood and migrate all Microsoft software to the new APIs, so you'd be able to delete these pesky files in Explorer.

- Windows 2000: make the new API flag the default for everything compiled with the latest version of the Windows SDK.

- Windows XP: make the new API the default for any app without a special entry in the compatibility database.

Somewhere along the road, batch files (the only place where compatibility with the old filenames was really necessary) could easily have been made compatible by modifying the batch parser to replace redirections to NUL with redirections to \\?\devices\null or something akin to it. You might see some breakage in scripts which use NUL and CON in a non-standard way (e.g. as an argument), but the migration pain wouldn't be huge, and you could still save an old script with a compatibility flag.

Microsoft obviously didn't take that path, and yeah, all the batch files written back in 1981 may still work without a hitch, but newer things keep breaking in strange ways.

[1] https://www.joelonsoftware.com/2004/06/13/how-microsoft-lost...


Newer things only break in strange ways because they're broken. So rather than break the old stuff, why not fix the new stuff?? - because after all, approximately the only criticism you can't level at the Windows NUL/PRN/COMx/etc. special names is that they're some kind of surprise that appeared suddenly out of nowhere! It's been this way for a very long time.

(I wonder if part of this is the rage of Unix fans discovering that portable means actually, you know, making an effort... and that there's more to it than just checking it builds on x86 Debian as well as x64 Ubuntu...)


You can't just say it's been that way for a long time so it's acceptable, because the industry (and for that matter the Internet) is getting fresh new people every day. You can't expect them not to be surprised, and you can't just arbitrarily require them to know something they haven't stumbled upon until after it caused problems.


"portable means actually, you know, making an effort"

When I hear portable, I immediately think of the Portable Operating System Interface.


> Most users don't care about compatibility, as Apple's success has clearly shown

Apple is not exactly big in the same markets where MS is big, e.g. enterprise. So while I agree that "most" users don't care, the very few who do care might be important customers for MS.

EDIT: grammar


I can't think of a single enterprise where devs don't use MacBook Pros. Sure, they exist, but I haven't run into one.


I can't think of a single enterprise where devs don't use a Dell supplied by the IT department.

The only MacBooks I've seen at the various meetups I've been to were at 'hip' dev shops.


They exist in large quantities - try every .NET shop for starters.


The government, federal, state, and local.


> Most users don't care about compatibility, as Apple's success has clearly shown,

by having 3% of the desktop market and 10% of the smartphone market?


Apple has 18% of the smartphone market: http://www.idc.com/promo/smartphone-market-share/vendor

And 7% of the PC market:

https://www.google.com/amp/s/amp.businessinsider.com/apple-m...

And greater than 10% in the US.


Even those numbers don't exactly scream users don't care about compatibility.


Apple market cap: ~700B, Microsoft market cap ~500B.


It's almost like they are both successful but for different reasons.


Market cap is a meaningless metric. It says more about the state of mind of the general public (greed vs. fear) than about a company's well-being.


How does having a higher market cap mean users don't care about compatibility?


Market cap is a lottery ticket.

I would assign more meaning to cash hoard:

Microsoft: ~$100 billion. Apple: ~$250 billion.


Microsoft has taken a similar approach for adding long filename support to Windows 10[0].

[0] https://blogs.msdn.microsoft.com/jeremykuhne/2016/07/30/net-...


take this post to /r/sysadmin and watch them bring out the pitchforks.


The best demonstration for backwards compatibility: https://www.youtube.com/watch?v=PH1BKPSGcxQ


> I haven't worked on Windows directly in years, but I never knew that NUL was a reserved file name.

It's not. It's a reserved word through the MS-DOS file redirection facilities. If you use the newer file API or the \\?\[path] convention, the reserved words are not an issue and you can create files named after them.


You have to use both, actually. The Unicode API and the \\?\ path prefix. It also astonishes me sometimes how many applications nowadays still choke on Unicode paths.
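
On Windows, Rust's standard library already goes through the Unicode API (CreateFileW), so from Rust the verbatim prefix alone is enough; a hypothetical Windows-only sketch (assumes C:\temp exists):

  use std::fs::{self, File};

  fn main() -> std::io::Result<()> {
      // The \\?\ prefix turns off Win32 name parsing, so a file
      // literally named "nul" can be created and removed - something
      // prefix-unaware tools (Explorer included) choke on.
      File::create(r"\\?\C:\temp\nul")?;
      fs::remove_file(r"\\?\C:\temp\nul")?;
      Ok(())
  }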


> I never knew that NUL was a reserved file name. I would have made this mistake, and probably still will in the future

While we're here: NUL, COM<n>, LPT<n> and AUX are reserved.


> While we're here: NUL, COM<n>, LPT<n> and AUX are reserved.

Worse: they're reserved with any extension. Have a file in your repository called "aux.rs"? It will cause problems on Windows.


Which happened in Servo already: https://github.com/servo/servo/issues/1226


It's swings and roundabouts. Say I create an app or tool that happily resides in c:\proc\whatever, then I turn my attention to creating a Linux version and specify /proc/whatever, then... boom? Sure, it's maybe a convoluted example, but the creator of this "nul" package got burned by something that's actually common knowledge in the MS world.

I think you need to be a wee bit proactive and take a look at your potential deployment targets and try to guard against these types of naming issues. Unix and Linux aren't the only (one true) operating systems in the world.


The right solution, actually, is to use a library that gives you the right path for the thing that you need to do depending on the conventions of the platform. For example QStandardPaths in Qt: http://doc.qt.io/qt-5/qstandardpaths.html

  QString appDataDir = QStandardPaths::writableLocation(QStandardPaths::AppDataLocation);
  // ~/Library/Application Support/<APPNAME> on macOS
  // C:/Users/<USER>/AppData/Roaming/<APPNAME> on Windows
  // ~/.local/share/<APPNAME> on Linux


Still doesn't solve the case where the developer just wants to slap something in the root of the C: drive on Windows from the outset (Cygwin, I seem to remember, defaults to c:\Cygwin, for example... again, slightly convoluted).

Also, those locations are user-specific; there's nothing there to support the use case of an app that's available to all users, or one that's just a system service (/daemon).


And CON, as Macha mentions in a sister comment. An idiom I remember from old times in DOS, for quickly writing some contents into a file - equivalent to `cat > myfile.txt` on Linux:

  COPY CON MYFILE.TXT


You can also do

    type con > myfile.txt


IIRC, CON was reserved, too, at one point (which would mean it most likely would be reserved today)?

Also, have fun trying to delete C:\Program Files\Xerox ;-)


I've done con.py on a Linux system a few times for networking code in different projects and then realised I couldn't clone it on Windows. It comes up infrequently enough that you can forget.


I came across the concept of "Chesterton's fence" a while ago here on Hacker News, which I really like: https://en.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fence


Other magic aliases include CON, PRN, AUX, COM1-9 and LPT1-9. They are aliased to the respective devices in the Win32 namespace "\\.\". COMs and LPTs above 9 don't have aliases in the global namespace and must be accessed explicitly in the Win32 namespace, e.g. "\\.\COM10" (which itself is a symlink to the NT native "\Device\Serial9").

In fact, it is possible to create files named NUL, COM1, etc. using the \\?\ prefix (e.g. "\\?\C:\NUL" is a valid path), which disables the parsing of arcane Win32 magic names. Unfortunately these files cause strange behaviour in applications that don't use that prefix, Explorer included.

source: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...


I still remember using "copy con foo.txt" and ending with ctrl-z to quickly create a file. It was years before I understood how that actually worked.


As the blog post mentioned, we solved the issue by deleting the crate from the package repository and reserving these problematic names. The incident lasted about 2 and a half hours.

Crate names have to be one or more valid idents connected by hyphens, so no other clever names like `/home` would be possible to upload. We already had some crate names reserved and we just needed to add these to the list.


> The incident lasted about 2 and a half hours.

And because it was a weekend, much of that time involved me trying to figure out who had the proper credentials for crates.io, and then texting those people until one of them responded. :)


Reserving just the crate names won't cover your bases, though, no? I'm not clear on what exists as part of a crate—but if there's any user control over the filenames of the contents of the crate (e.g. if the crate's source code is in there) then any crate might contain a file named e.g. "nul.rs", triggering the same problem.


I think you're misunderstanding the problem described in the OP. When you build a project via cargo using the default settings, it fetches the git repository at https://github.com/rust-lang/crates.io-index to enable it to resolve dependencies locally. This git repository contains metadata for each library on crates.io, where the metadata for a given library is located in a file with the same name as the name of that library. When the OP uploaded a library whose name was an illegal filename on Windows, git unexpectedly choked when updating the local crate index repo, impacting all Windows users.

It sounds like the concern you're describing is a different matter. It's likely true that if the source of a crate contains a file named "nul.rs", cargo on Windows will fail if it attempts to git-fetch the source (unless you're using the Windows Subsystem for Linux, anyway). While this would indeed be a problem, it would only affect users who elect to use specific libraries, rather than serving as a denial of service for every Rust user on Windows.


Ah, I was just misunderstanding the format of the repo. I was assuming it was more similar to a ports tree, where each library is specified in the index using a directory which can have random files sitting in it, like Makefiles, .patch files, etc. along with a metadata spec file.

Looking at the repo you linked, there's no allowance for that, so at least in this case you should be safe.


Ah, this is the key info that I was missing. It's not at all clear from the article.


I was going to ask how a remote package could do that. Not knowing how Rust works (or package managers, apparently) I didn't understand how it could be widespread. Makes sense, damn; that's substantial.


I'm not sure how other package managers do it (it should be noted that this approach was designed to avoid some problems that other package managers have encountered), but there is still room for improvement here: ideally, I think we'd be hashing crate names rather than storing them verbatim on the filesystem, to enforce more uniform distribution in the trie.
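
A hypothetical version of that layout (DefaultHasher stands in for the stable hash, e.g. SHA-256, that a real on-disk format would need; the paths are made up):

  use std::collections::hash_map::DefaultHasher;
  use std::hash::{Hash, Hasher};

  // Index entries live at index/ab/cd/<full hex digest> instead of
  // under the crate name itself, so reserved names, case collisions,
  // and trie skew all disappear.
  fn index_path(crate_name: &str) -> String {
      let mut h = DefaultHasher::new();
      crate_name.hash(&mut h);
      let hex = format!("{:016x}", h.finish());
      format!("index/{}/{}/{}", &hex[0..2], &hex[2..4], hex)
  }

  fn main() {
      // e.g. "index/3a/9f/3a9f..." - no "nul" on disk anywhere.
      println!("{}", index_path("nul"));
  }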


Interesting, hashing them makes sense. It was a corner case; a huge outage, but definitely something that was easy to overlook.


There was a bug in Windows 95 (98 too?) where if you tried to open 'nul\nul' or 'con\con' etc, it would BSOD instantly. Provided lots of drive-by fun in computer labs... (got really good at typing win+r con\con)


For more fun, you could also target other machines with SMB shares. \\thevictim\foo\nul\nul would BSOD that machine. Good times.


IIRC, you could also reference it in a HTML page, so the whole computer would crash when that page was viewed.


This is like the 90s era "undefined is not a function."

null is not a problem, but null.null, on the other hand...


For those who don't use Windows and might need this info: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...


That is a great page. Is there also such a page for Linux and Mac OS?


There's path_resolution(7) in the Linux man-pages set plus of course the relevant parts of the POSIX standard.


What I don't understand is why cargo fetches the entire crate list and create a directory for every crate (even if you never install it). Why not just have a single file with the entire list? The issue mentions they use a trie, but why use the filesystem as the trie store? Why not have a single file?


The original authors of cargo, wycats and carllerche, aren't around today to ask (it's a weekend!) though IRC attempted to answer regardless:

  <foo> to keep the number of files in a single directory down
  <foo> tools become unhappy with hundreds of thousands+ of things in a single dir
  <foo> as do filesytems
  <bar> why not just a flat file
  <bar> or sqlite or whatever
  <qux> right now it uses git's deduplication feature
  <qux> aka, when downloading updates you only download the objects that changed
  <qux> but it mostly works on a per file basis
  <qux> so git hashes each file and if the hash didnt change, it doesnt download an update
  <qux> but if it did, it treats it as completely new file, even if its just a little change


Update:

  <wycats> Because of this: https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomment-193772935
  <wycats> I ran some scenarios against huge repos when I first worked on cargo
  <wycats> Trying to minimize the cost of operations
  <wycats> I landed on the current strategy, and GitHub in the above thread more or less endorsed what we were already doing at that time
  <wycats> Also see https://github.com/rust-lang/cargo/issues/2452


It's still fundamentally a waste of disk space. On my system, as of a minute ago, ~/.cargo/registry/index took up about 200MB for three different checkouts (for some reason). After deleting that and running `cargo update`, only one of them is recreated, 104MB. Out of that, 57MB is the JSON files and 47MB is git history. But if I just concatenate all the JSON files, the result is only 33MB, and after gzipping, 3MB. Hypothetically, a non-GitHub-based Cargo could store only those 3MB (using binary deltas to avoid resending it on every update), or even 0MB if it just relied on the server to resolve dependencies.


Once you've gzipped to achieve that 3MB storage, binary deltas are useless. Perhaps the data could be (almost certainly is) transferred gzipped, then expanded to the full 33MB size so binary diffs could be applied to it later, but setting up a system to do binary diffs is a lot of incidental complexity: xdelta is a surprisingly complex format, and bsdiff is really tuned for executables, not arbitrary content (and is pretty complex too).

It sounds like the biggest win would be for cargo to keep using git, but clone the crates.io index as a bare repository rather than checking out the plaintext content. Then it would only take 47MB by your count, which is pretty close to 33MB, and you could still get out the plain content with `git cat-file` and friends.


Technically, Cargo /already/ bundles a full copy of libxdelta as part of libgit2 (in addition to the separate Git binary delta algorithm); I just checked using nm that it's actually included in the binary. It could probably be removed, but, well, it probably adds a lot less than 44MB to the binary size :)

Alternately, since JSON is text, I suppose you could just ensure that whatever emits this hypothetical merged JSON file puts newlines between different packages' entries, and then use a regular text diff (on the uncompressed version, of course). But reading 44MB of JSON isn't instant; it would probably be better to switch to either a binary format, or even something silly like a sorted list of JSON strings separated by newlines.

There would be some incidental complexity around generating and applying the diffs… you'd probably want to precalculate them on the server side, but it could be rather expensive to, on every change, calculate a diff between the current version and every previous change. Instead, you could have daily checkpoints: each day the server would make a checkpoint and calculate a diff to the last N checkpoints; on every update the server would recalculate the diff between the latest checkpoint and HEAD. The client would store both HEAD and a reverse diff to the latest checkpoint (or just store the checkpoint separately and waste a few MB), so when it updates, it could revert to that checkpoint and request the diff from there to the new latest checkpoint; it would also request the diff from the checkpoint to the new HEAD. If its checkpoint is too old then it would just redownload from scratch.

Overall, not a trivial change, but probably not too hard either.

apt-get does something vaguely similar with its pdiff files.


There is a very long comment in the cargo source which explains this decision and some of the trade offs involved. https://github.com/rust-lang/cargo/blob/master/src/cargo/sou...

I know that some of the people who worked on cargo originally had experience with other package managers - mainly bundler - and I believe bundler used to use a single file, but ran into performance issues.


AFAIK this is how the BSD Ports system works too.


Sounds more like a problem with stupid Windows design choices than with anything you did.


Windows is, for better or worse, fiercely proud of its backwards compatibility. So it's not so much a stupid Windows design choice as a 'stupid' DOS 1.0 design choice (and not even so much a choice as simply a quirk of how the DOS 1.0 file system worked) that Windows doesn't want to break backwards compatibility with.


I agree with the parent that it's a bit crazy, but I wouldn't be as critical. To your point: presumably, even if they dropped DOS support, something between DOS and now likely relies on that. It's a fine line.


Stupid design choices on Windows are always justified by "backwards compatibility". What I don't understand is why an app for DOS doesn't work on Windows 10. Or why a lot of Windows XP software doesn't work anymore. Heck, anyone remember Windows Vista and all the mess with "Compatibility Mode" that never worked?


Because there aren't any of those in posix...


Are you kidding? POSIX is perfect /s


Hehe, Linux (and Unix in general, I guess) is just a bundle of text files and C programs, held together by shell scripts and eternal vigilance. It's very impressive and very disappointing at the same time...


If you want to see perfection, check out plan9.


People could also stop using CreateFile without a \\?\ prefix and all the problems would go away. There's not even a MAX_PATH limit on any NT-based Windows version if you do that.


Except that only works for absolute paths. And then also changes other semantics that you might expect to be there - e.g. it disables handling of .. and . to mean parent and current directory (which is valid even in absolute paths, and often useful).


It comes down to tutorials a lot of the time - does any tutorial in C, C#, Python, whatever mention that you should probably refer to paths in Windows like that?


C# can't because .NET has outright banned the usage. But now the length restriction has been lifted, at least in .NET Core. I think you still need Windows 10 with the path length limitation disabled for the full .NET Framework. It's actually a bit confusing because the limitation was lifted in .NET Core first by using the \\?\ prefix internally, but not long after, Windows 10 introduced the ability to remove the limitation for the API without the prefix, though it must be enabled by group policy or a registry patch...

Anyways, it shouldn't really have mattered that Microsoft didn't care much for this for a long time. If everybody else had just been using the prefix since Windows XP came out then Microsoft would soon have been forced to change their own software as well.

For example, with Python and non-MS C (e.g. clang and gcc on MinGW), they should simply have made the standard libraries implement the file API using the prefix. Of course, if you need to call CreateFile directly you would still be on your own, but if everything else created files you couldn't interact with then you would probably fix the problem.


To me, it sounds more like a problem with Rust that a single misnamed package could bring down the whole system. It's essentially a SQL injection attack (without the SQL).


Yep, just not allowing (directly) user controlled file names seems ideal. Maybe just hash the crate names and use the hash as a file name? No more silly restrictions due to platforms. Eliminates issues with some file systems having a length limit too.


In the Mac System 7-ish days, people used to earnestly warn each other not to name a file '.Sony' (a special name reserved for the floppy driver) as it supposedly trashed your HD. Although I've never heard of anyone reproducing it.


Trying to name a folder "trash" led to the error message "The name 'trash' is reserved for the operating system."


Now's your chance: https://archive.org/details/mac_MacOS_7.0.1_compilation

I'd try it myself, but I've only got my phone with me.


7.0.1 might be a little late, at least, for the supposed catastrophic results. It doesn't like the file at all but nothing dreadful seems to happen.


I tried it in that and System 6 (https://archive.org/details/mac_MacOS_6.0.8); System 6 actually didn't care at all. An interesting bug.

I haven't actually done so, but earlier versions are available if you know where to look (https://archive.org/details/mac_Paint_2).


Wouldn't it make sense for Cargo not to use crate names in file names, and use hexadecimally encoded hashes instead?


Yes, or some other identifier that's unique to that crate. Assuming that the crate name is also a valid file name seems risky.


Or have a prefix. It once cost me exam points when I tried to optimize a prefix away. The examiners were right.


You could just hex encode the crate name. No need to hash it too.
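
Which is a one-liner, at the cost of doubling the name length; a sketch:

  // "nul" -> "6e756c": no platform treats the result specially.
  fn hex_encode(name: &str) -> String {
      name.bytes().map(|b| format!("{:02x}", b)).collect()
  }

  fn main() {
      assert_eq!(hex_encode("nul"), "6e756c");
  }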


What did you end up calling the new crate?

Edit: I suggest "terminated"


I haven't published it again because I hadn't thought of a name (and nothing published is using it, so no urgent need).

I like terminated! Good suggestion :D


The .toml file in the master branch on GitHub seems to still call it "nul": https://github.com/SSheldon/nul/blob/master/Cargo.toml

I can't find it on crates.io though.


I guess we can say it's To Be Determined, or entirely scrapped.


Urgh, this "nul" filename / reserved filename bug is probably in a lot of software.


Every MS-DOS programmer of old knows about nul, con, and the other reserved names. Those might come from CP/M actually (so are even older), and Atari TOS had them as well I believe.


I was working on a video project for a local comics convention, and named the project file "con.proj". That file hung around until I upgraded my hard drive because no file manager could delete it.


You can remove such files using Bash on Windows. `rm con.proj` works just fine.


This was years ago, I'm sure there were ways to do it but I didn't know about them.


It's very tricky to do cross-platform file handling, and only the most mature projects have ironed this out. Just look at your pet project and see if it handles:

- Windows and unix line breaks in text files

- Windows and unix path separators

- BOM and non-BOM marked files if parsing UTF

- Forbidden filenames such as in this article

By "handling" I mean it should accept or fail nicely on unexpected input - e.g. say that line breaks should be unix style, or paths should be backslashes etc. Very few projects actually do this well. Even fewer will do even more complex things like handling too long paths with nice error messages etc.


Here is a character encoding issue that I ran into about a year ago.

git does not support UTF-16LE[0]. The result is that UTF-16LE encoded files will be mangled[1] by the line ending conversion. There is at least one generated Visual Studio file (GlobalSuppressions.cs) that is saved in UTF-16 by default.

[0] https://github.com/libgit2/libgit2/issues/1009

[1] https://github.com/Microsoft/Windows-classic-samples/issues/...


Since Windows 10 now comes with an official Linux subsystem, why not just use POSIX APIs and conventions everywhere, and not bother with Windows-specific code if possible?


For one, because the Linux subsystem is an optional install. If you're making anything user-facing you can't rely on it being there - it's really a tool for developers, not end-users.


I'm pretty sure there are already programs out there whose install instructions include activating developer mode and installing WSL Ubuntu.


And it's not even available for server versions...


Depends on what type of application you are making. For a library that can be used in a "real" graphical Windows application, you can't take a POSIX-type shortcut.

I think "if possible" is (at least for now) very rarely the case.


Because this leaks to the user. For example, you'll be dealing with paths like /mnt/c/..., which, if you surface it in your UI, will rather confuse someone who expects C:\...

And speaking of UI, that's one major hurdle right there.


>- Windows and unix line breaks in text files

And Mac classic linebreaks (\r only), for that matter.


- illegal characters in file names such as !, |, ^


I recommend ':?', as it will work in POSIX, but not Windows.


This sounds like the end of Rust for the next month for me...


While I know nothing of Rust, Diesel, or CrateDB, I do know that Windows uses a case-insensitive file system and this fix doesn't seem to take that into consideration. However, the author of the fix does note:

> I believe crates.io's namespace is case insensitive let me know if that's wrong

Someone should probably validate that.


Not quite. Windows uses a case insensitive API on top of a case sensitive file system.

FUN FACT: As of 2017, Windows 10 is (partially) binary-compatible with Ubuntu Linux. Any application that was originally compiled for Linux will still be case-sensitive when running on Windows 10.


Right. I forgot NTFS is in fact case sensitive. Thanks.


I tried `npm install nul` on my win7 VM and it created a folder called nul which I can't get rid of ¬_¬


Well, notably `npm uninstall nul` gets rid of it. But in the Explorer you can't do anything with it.


Came here to mention that guy who registered the "nul" package on npm an hour ago. Found your post. The world is a small place.


Makes me wonder if I can make a crate called "../.." and have it overwrite some user files.


Crate identifiers are required to be [a-zA-Z] for the first character and [a-zA-Z0-9_-] for the rest. So, no. :P
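
That rule is simple enough to check directly; a sketch (not Cargo's actual implementation):

  // First char: ASCII letter; rest: ASCII alphanumerics, '_' or '-'.
  fn is_valid_crate_name(name: &str) -> bool {
      let mut chars = name.chars();
      match chars.next() {
          Some(c) if c.is_ascii_alphabetic() => {
              chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
          }
          _ => false,
      }
  }

  fn main() {
      assert!(is_valid_crate_name("serde"));
      assert!(!is_valid_crate_name("../..")); // rejected
      assert!(!is_valid_crate_name("1password")); // must start with a letter
  }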


More succinctly: [A-z]+[A-z0-9_-]


If you look closely at your comment, you'll realize the hilarious HN bug that prevents me from writing it as you suggest. :P


  [a-zA-Z][a-zA-Z0-9_-]*
FTFY


[A-z] is not the same as [A-Za-z] .

The former includes a few punctuation characters that are between "Z" and "a".


[\]^_` included, or is that a special case?


It looks like the files were managed by Git (see the output where the checkout errors out), so no that won't work.


I thought git handled the Windows special file names. Am I misremembering, or is it a difference (and thus a bug) in libgit2 (used by cargo, afaik)?


I believe that git chokes by default on special file names on Windows, but I think there's a config variable that you can set to fix it. I don't know if libgit2 differs here.


Regular old git will fail; I tried to check out MINIX's source code a month ago and the clone failed since a file was named COM.


Hmm... I feel like someone should stick the reserved names into a json somewhere for easy reference for the next package manager.


Remember not to name anything pr#6 either...


Don't they have a continuous integration system where they run the unit tests on all platforms for all checkins to master?


Yes. Why do you ask?


Obviously, he wanted to know.



