
For an alternative view, don't forget to read the section on pipes in The Unix-Haters Handbook: http://web.mit.edu/~simsong/www/ugh.pdf (page 198)



> When was the last time your Unix workstation was as useful as a Macintosh?

Some of that discussion has not aged well :)


The core critique - that everything is stringly typed - still holds pretty well though.

>The receiving and sending processes must use a stream of bytes. Any object more complex than a byte cannot be sent until the object is first transmuted into a string of bytes that the receiving end knows how to reassemble. This means that you can’t send an object and the code for the class definition necessary to implement the object. You can’t send pointers into another process’s address space. You can’t send file handles or tcp connections or permissions to access particular files or resources.


> You can’t send pointers into another process’s address space.

Thank goodness.


To be fair, the same criticism could be made of a socket? I think the issue is that some people want pipes to be something magical that connects their software, not a dumb connection between them.


I don't want all my pipes to be magical all the time, but occasionally I do want to write a utility that is "pipeline aware" in some sense. For example, I'd like to pipe mysql to jq and have one utility or the other realize that a conversion to JSON is needed in the middle for it to work.
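
By hand, the sort of thing I want inserted automatically looks roughly like this (the table and column names are made up, and the awk stage is just a stand-in for whatever converter the negotiation layer would splice in):

    # mysql's --batch output is tab-separated with a header row; turn each
    # data row into a one-line JSON object so jq has something to chew on
    # (no escaping of quotes or tabs in values; purely a sketch)
    mysql -B -e 'SELECT id, name FROM users' mydb |
      awk -F'\t' 'NR==1 { for (i=1; i<=NF; i++) h[i]=$i; next }
                  { printf "{"
                    for (i=1; i<=NF; i++) printf "%s\"%s\":\"%s\"", (i>1 ? "," : ""), h[i], $i
                    print "}" }' |
      jq -r '.name'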

I'm working on a library for this kind of intra-pipeline negotiation. It's all drawing-board stuff right now but I cobbled together a proof of concept:

https://unix.stackexchange.com/a/495338/146169

Do you think this is a reasonable way to achieve the magic that some users want in their pipelines? Or are ancient Unix gods going to smite me for tampering with the functional consistency of tools by making their behavior different in different contexts?


This is interesting, yes. If the shell could infer the content type of data demanded or output by each command in a pipeline, then it could automatically insert type coercion commands or alter the options of commands to produce the desired content types.

You're right that it is in fact possible for a command to find the preceding and following commands using /proc, and figure out what content types they produce / want, and do something sensible. But there won't always be just one way to convert between content types...

Me? I don't care for this kind of magic, except as a challenge! But others might like it. You might need to make a library out of this because when you have something like curl(1) as a data source, you need to know what Content-Type it is producing, and when you can know explicitly rather than having to taste the data, that's a plus. Dealing with curl(1) as a sink and somehow telling it what the content type is would be nice as well.
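
For the curl(1)-as-source case there is at least a cheap way to ask rather than taste, though only as a separate probe request (the URL is a placeholder):

    # %{content_type} is curl's write-out variable for the response Content-Type
    ctype=$(curl -s -o /dev/null -w '%{content_type}' https://example.com/data)
    echo "upstream content type: $ctype"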


My ultimate use case is a contrived environment where I have the luxury of ignoring otherwise blatant feature-gaps--such as compatibility with other tools (like curl). I've come to the same conclusions about why that might be tricky, so I'm calling it a version-two problem.

I notice that function composition notation, specifically the latter half of:

> f(g(x)) = (f o g)(x)

resembles bash pipeline syntax to a certain degree. The 'o' symbol can be taken to mean "following". If we introduce new notation where '|' means "followed by" then we can flip the whole thing around and get:

> f(g(x)) = (f o g)(x) = echo 'x' | g | f
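
Concretely, with each toy function acting as a filter on stdin:

    g() { awk '{ print $1 * 2 }'; }   # g(x) = 2x
    f() { awk '{ print $1 + 1 }'; }   # f(x) = x + 1
    echo 3 | g | f                    # prints 7, i.e. f(g(3)) = (f o g)(3)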

I want to write some set of mathematically interesting functions so that they're incredibly friendly (like, they'll find and fix type mismatch errors where possible, and fail in very friendly ways when not). And then use the resulting environment to teach a course that would be a simultaneous intro into both category theory and UNIX.

All that to say--I agree about finding the magic a little distasteful, but if I play my cards right my students will only realize there was magic in play after they've taken the bait. At first it will all seem so easy...


The magic /proc thing is a very interesting challenge. Trust me, since I read your comments I've thought about how to implement it, though again, it's not the sort of thing I'd build for a production system, just a toy -- a damned interesting one. And as a tool for teaching how to find your way around an OS and get the information you need, it's very nice.

There are three parts to this: a) finding who's before and after the adapter in the pipe, b) figuring out how to use that information to derive content types, c) matching impedances. (b) feels mundane: you'll have a table-driven approach to that. Maybe you'll "taste" the data when you don't find a match in the table? (c) is not always obvious -- often the data is not structured. You might resort to using extended file attributes to store file content-type metadata (I've done this), and if you can find the stdin or other open files of the left-most command in a pipeline, you might be able to guesstimate the content type in more cases. But obviously, a sed, awk, or cut is going to ruin everything. Even something like jq will: you can't assume the output and input will be JSON.
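
To make (a) concrete, this is the kind of toy I have in mind (Linux-only, racy, and it assumes the upstream neighbour still has its write end of the pipe on fd 1):

    # whose stdout feeds our stdin? compare pipe inodes via /proc
    upstream_cmdline() {
        local target fd
        target=$(readlink "/proc/$$/fd/0")           # e.g. "pipe:[123456]"
        case $target in pipe:*) ;; *) return 1 ;; esac
        for fd in /proc/[0-9]*/fd/1; do
            if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
                tr '\0' ' ' < "${fd%/fd/1}/cmdline"  # the neighbour's argv
                echo
                return 0
            fi
        done
        return 1
    }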

At some point you just want a Haskell shell (there is one). Or a jq shell (there is something like it too).

As to the pipe symbol as function composition: yes, that's quite right.


I wonder if something like HTTP’s content negotiation is a good model for this.


That sounds reasonable, I'll look into it--thanks.

I was imagining an algorithm where each pipeline-aware utility can derive port numbers to use to talk/listen to its neighbors. I may be able to use HTTP content negotiation wholesale in that context.


I've been trying to solve the exact same problem with my shell too. Its pipes are typed, and all the builtin commands can then automatically decode those data types via shared libraries, so commands don't need to worry about how to decode and re-encode the data. This means that JSON, YAML, TOML, CSV, Apache log files, S-expressions and even tabulated data from `ps` (for example) can all be transparently handled the same way and converted from one to another without the tools ever needing to know how to marshal or unmarshal that data. For example: you could take a JSON array that's not been formatted with carriage returns and still grep through it item by item as if it were a multi-line string.

However, the problem I face is: how do you pass that data type information over a pipeline when the tools involved exist outside of my shell? It's all well and good having builtins that all follow the convention, but what if someone else wants to write a tool?

My first thought was to use network sockets, but then you break piping over SSH, eg:

    local-command | ssh user@host "| remote-command"
My next thought was that maybe this data should be in-lined - a bit like how ANSI escape sequences are in-lined and terminals don't render them as printable characters. Maybe something like the following as a prefix to STDIN?

    <null>$SHELL<null>
But then you have the problem of tainting your data if any tools are sent that prefix in error.
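
For what it's worth, the mechanics are easy enough to prototype in bash; it's the tainting problem that has no clean answer. Something like this (the tag format is invented for illustration):

    # writer: NUL-delimited type tag, then the payload untouched
    emit_typed() {                        # usage: emit_typed <content-type> <file>
        printf '\0%s\0' "$1"
        cat -- "$2"
    }
    # reader: peel the tag off stdin; bash's read consumes a pipe one byte
    # at a time, so the payload that follows is left for the next command
    read_tag() {
        local tag _
        IFS= read -r -d '' _   || return 1    # swallow the leading NUL
        IFS= read -r -d '' tag || return 1    # everything up to the second NUL
        printf '%s\n' "$tag"
    }
So `emit_typed application/json data.json | { t=$(read_tag); cat; }` leaves the raw JSON on the pipe, with the type available in `$t`.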

I also wondered if setting environment variables might work, but that also wouldn't be reliable for SSH connections.

So as you can see, I've yet to think up a robust way of achieving this goal. However, in the case of builtin tools and shell scripts, I've got it working for the most part. A few bugs here and there, but it's not a small project I've taken on.

If you fancy comparing notes on this further, I'm happy to oblige. I'm still hopeful we can find a suitable workaround to the problems described above.


> ...with my shell too...

I was hoping to stick with bash or zsh, and just write processes that somehow communicate out of band, but I think we're still up against the same problem.

One idea I had was that there's a service running elsewhere which maintains this directed graph (nodes = types, edges = programs which take the type of their "from" node and return the type of their "to" node). When a pipeline is executed, each stage pauses until type matches are confirmed--and if there is a mismatch then some path-finding algorithm is used to find the missing hops.

So the user can leave out otherwise necessary steps, and as long as there is only one path through the type graph which connects them, then the missing step can be "inserted". In the case of multiple paths, the error message can be quite friendly.

This means keeping your context small enough, and your types diverse enough, that the type graph isn't too heavily connected. (Maybe you'd have to swap out contexts to keep the noise down.) But if you have a layer that's modifying things before execution anyway, then perhaps you can have it notice the ssh call and modify it to set up a listener. Something like:

User types:

    local-command | ssh user@host "remote-command"
Shell runs:

    local-command | ssh user@host "pull_metadata_from -r <caller's ip> | remote-command"
Where pull_metadata_from phones home to get the metadata, then passes along the data stream untouched.

Also, if you're writing the shell anyway then you can have the pipeline run each process in a subshell where vars like TYPE_REGISTRY_IP and METADATA_INBOUND_PORT are defined. If they're using the network to type-negotiate locally, then why not also use the network to type-negotiate through an ssh tunnel?

This idea is, of course, over-engineered as hell. But then again this whole pursuit is.


> I was hoping to stick with bash or zsh, and just write processes that somehow communicate out of band, but I think we're still up against the same problem.

Yeah we have different starting points but very much similar problems.

tbh the idea behind my shell wasn't originally to address typed pipelines; that was just something that evolved from it quite by accident.

Anyhow, your suggestion of overwriting / aliasing `ssh` is genius. Though I'm thinking rather than tunnelling a TCP connection, I could just spawn an instance of my shell on the remote server and then do everything through normal pipelines, as I now control both ends of the pipe. It's arguably got fewer proverbial moving parts compared to a TCP listener (which might then require a central data type daemon et al), and I'd need my software running on the remote server for the data types to work anyway.

There is obviously a fair security concern some people might have about that, but if we're open and honest about it and offer an "opt in/out", where opting out would disable support for piped types over SSH, then I can't see people having an issue with it.

Coincidentally, I used to do something similar in a previous job where I had a pretty feature-rich .bashrc and no Puppet, so `ssh` was overwritten with a bash function that copied my .bashrc onto the remote box before starting the remote shell.
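
Roughly this sort of thing, reconstructed from memory (the paths are illustrative, and it only handles the plain `ssh user@host` case):

    # wrap ssh: push the local rc file over, then start an interactive
    # remote shell that sources it
    ssh() {
        scp -q ~/.bashrc "$1":/tmp/.bashrc_travelling &&
            command ssh -t "$1" 'exec bash --rcfile /tmp/.bashrc_travelling -i'
    }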

> This idea is, of course, over-engineered as hell. But then again this whole pursuit is.

Haha so true!

Thanks for your help. You may have just solved a problem I've been grappling with for over a year.


I was thinking something similar, buried in a library that everyone could link. It seems... awfully awkward to build, much less portably.

This reminds me of how busted Linux is for not having a SO_PEERCRED equivalent for TCP sockets. You can actually get that information by walking /proc/net/tcp or by using AF_NETLINK sockets and inet_diag, but there is a race condition such that this isn't 100% reliable. SO_PEERCRED would [have to] be.


The problem with that is that each command in the pipeline would have to somehow be modified to convey content-type metadata. Perhaps we could have a way to send ancillary metadata (a la Unix domain sockets' SCM_* messages).


Yes. The compromise of just using an untyped byte stream in a single linear pipeline was a fair tradeoff in the 70s, but it is nearly 2020 and we can do better.


We have done better. The shell I'm writing is typed, and I know I'm not the only person to do this (eg PowerShell). The issue here is really more with POSIX compatibility, but if you're willing to step away from that then you might find an alternative that better suits your needs.

Thankfully switching shells is as painless as switching text editors.


> Thankfully switching shells is as painless as switching text editors.

So, somewhere between, "That wasn't as bad as I feared," and, "Sweet Jesus, what fresh new hell have I found myself in"?


haha yes. I was thinking more about launching the shell but you're absolutely right that learning the syntax of a new shell is often non-trivial.


I'm not going to argue that UNIX got everything right, because I don't believe that to be the case either, but I don't agree with those specific points:

> This means that you can’t send an object and the code for the class definition necessary to implement the object.

To some degree you can, and I do just this with my own shell I've written. You just have to ensure that both ends of the pipe understand what is being sent (eg is it JSON, text, binary data, etc). Even with typed shells (such as PowerShell), you still need both ends of the pipe to understand what to expect to some extent.

Having this whole thing happen automatically with a class definition is a little optimistic though. Not least of all because not every tool would be suited for every data format (eg a text processor wouldn't be able to do much with a GIF even if it has a class definition).

> You can’t send pointers into another process’s address space.

Good job too. That seems a very easy path to exploits. Thankfully it's less of an issue these days, because copying memory is comparatively quick and cheap compared to when that handbook was written.

> You can’t send file handles

Actually that's exactly how piping works, as technically the standard streams are just files. So you could launch a program with STDIN being a different file from the previous process's STDOUT.
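
For example, from the consumer's point of view these are the same mechanism, with fd 0 just wired to a different object (`producer` and `consumer` are placeholders):

    producer > /tmp/out      # producer's stdout is a regular file
    consumer < /tmp/out      # consumer's stdin is that same file, no pipe involved
    producer | consumer      # or the kernel wires the two fds to a pipe directly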

> or tcp connections

You can if you pass the connection's file descriptor over a UNIX domain socket (a network connection is just a file descriptor too).

> or permissions to access particular files or resources.

This is a little ambiguous. For example, you can pass strings that are credentials. However, you cannot alter the running state of another program via its pipeline (aside from what files it has access to). To be honest I prefer the `sudo` type approach, but I don't know how much of that is because it's better and how much is because it's what I am used to.


>> You can’t send file handles

> Actually that's exactly how piping works

Also SCM_RIGHTS, which exists exactly for this purpose (see cmsg(3), unix(7) or https://blog.cloudflare.com/know-your-scm_rights/ for a gentler introduction and application).

That's been around since 4.3BSD, which predates the Haters Handbook's first edition by about eight years.


And that's how Unix is secretly a capability system


Yeah, I had mentioned UNIX domain sockets. However your post does add a lot of good detail on them which I had left out.


Look at the alternatives though. Would you really want to use something like Spring in shell scripting?


No. I typically use Python as a drop-in replacement for shell scripts longer than ~10 lines of code.


macOS is layered on a UNIX-like OS. You can use pipes in a terminal window.


This comment makes me feel really old.

macOS wasn't always layered on Unix, and the Unix-Haters Handbook predates the switch to the Unix-based Mac OS X.


Of course not, but the switch to BSD fixed a bunch of the underpinnings in the OS and was a sane base to work off of.

Not to put too fine a point on it, but they found religion. Unlike Classic (and early versions of Windows for that matter), there was more to be gained by ceding some control to the broader community. Microsoft has gotten better too (PowerShell, adapting UNIX tools to Windows, and later WSL, where they went all in).

Still, for Apple it meant they had to serve two masters for a while - old school Classic enthusiasts and UNIX nerds. Reading the back catalog of old macOS reviews by John Siracusa (one of my personal nerd heroes) gives you some sense of just how weird this transition was.


The Unix Haters Handbook was published in 1994, when System 7 was decidedly not unix-like.


You can also drop the "-like". (-:

* https://unix.stackexchange.com/questions/1489/


... has it? Most people using Macs never ever open a terminal.


After pipes, the section on find has also not aged well. I can see why GNU and later GNU/Linux replaced most of the old Unices (I mean, imagine having a find that doesn't follow symlinks!). If I may, a bit of code golf on the problem of "print all .el files without a matching .elc":

  find . -name '*.el' | while IFS= read -r el; do [ -f "${el}c" ] || echo "$el"; done
Of course this uses the dreaded pipes and doesn't support the extremely common filenames with a newline in them, so let's do it without them

  find . -name '*.el' -exec bash -c 'el=$0; elc="${el}c"; [ -f "$elc" ] || echo "$el"' '{}' ';'


So the dreaded space-in-filenames is a problem when you pass the '{}' to a script.

The following works very nicely for me:

  find . -name '*.el' -exec file {}c ';' 2>&1 | grep cannot


I should have said "works very nicely for me, including on file names with spaces"


Or p160 by internal numbering.



