That sounds reasonable, I'll look into it--thanks.
I was imagining an algorithm where each pipeline-aware utility can derive port numbers to use to talk/listen to its neighbors. I may be able to use HTTP content negotiation wholesale in that context.
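To make that a bit more concrete, here's the kind of thing I'm picturing a stage doing; PIPE_BASE_PORT, PIPE_STAGE and the /stream endpoint are all made-up names rather than anything that exists yet:

    # a rough sketch, assuming whatever launches the pipeline tells each stage its
    # position via hypothetical PIPE_BASE_PORT / PIPE_STAGE variables
    upstream_port=$(( ${PIPE_BASE_PORT:-42000} + ${PIPE_STAGE:-1} - 1 ))   # neighbour on my left
    listen_port=$(( upstream_port + 1 ))                                   # neighbour on my right fetches from here

    # plain HTTP content negotiation: ask the upstream stage for its stream, state
    # what this stage can accept, and read back whichever Content-Type it picks
    curl -s -H 'Accept: application/json, text/csv;q=0.5' \
        "http://127.0.0.1:${upstream_port}/stream"

Each stage would serve its own output on listen_port the same way, which is where the negotiation actually buys you something.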
I've been trying to solve the exact same problem with my shell too. Its pipes are typed and all the builtin commands can then automatically decode those data types via shared libraries. So commands don't need to worry about how to decode and re-encode the data. This means that JSON, YAML, TOML, CSV, Apache log files, S-Expressions and even tabulated data from `ps` (for example) can all be transparently handled the same way and converted from one to another without the tools ever needing to know how to marshal or unmarshal that data. For example: you could take a JSON array that's not been formatted with carriage returns and still grep through it item by item as if it were a multi-line string.
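To illustrate that last example with ordinary tools (jq standing in here for the decoding my shell does transparently):

    # a compact JSON array has no newlines, so vanilla grep sees one giant record
    printf '[{"user":"alice"},{"user":"bob"}]' | grep bob

    # roughly what a typed pipe does for you under the hood: one element at a time
    printf '[{"user":"alice"},{"user":"bob"}]' | jq -c '.[]' | grep bob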
However, the problem I face is how you pass that data type information over a pipeline from tools that exist outside of my shell. It's all well and good having builtins that all follow that convention, but what if someone else wants to write a tool?
My first thought was to use network sockets, but then you break piping over SSH, e.g.:
local-command | ssh user@host "remote-command"
My next thought was that maybe this data should be inlined, a bit like how ANSI escape sequences are inlined and terminals don't render them as printable characters. Maybe something like the following as a prefix to STDIN?
<null>$SHELL<null>
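The producer side of that would be trivial; a minimal sketch (report.json and remote-command are just stand-ins), with the consumer expected to peek at the first few bytes and strip the header if it's present:

    # announce the typed protocol by prefixing the stream, then send the payload as-is
    { printf '\0%s\0' "$SHELL"; cat report.json; } | remote-command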
But then you have the problem of tainting your data if any tools are sent that prefix in error.
I also wondered if setting environment variables might work, but that also wouldn't be reliable for SSH connections.
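An environment variable only survives the hop if both ends explicitly agree to pass it, e.g. (PIPE_TYPE being a made-up name):

    PIPE_TYPE=json ssh -o SendEnv=PIPE_TYPE user@host remote-command
    # silently dropped unless the server's sshd_config lists "AcceptEnv PIPE_TYPE"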
So as you can see, I've yet to think up a robust way of achieving this goal. However, in the case of builtin tools and shell scripts, I've got it working for the most part. A few bugs here and there, but then it's not a small project I've taken on.
If you fancy comparing notes on this further, I'm happy to oblige. I'm still hopeful we can find a suitable workaround to the problems described above.
I was hoping to stick with bash or zsh, and just write processes that somehow communicate out of band, but I think we're still up against the same problem.
One idea I had was that there's a service running elsewhere which maintains this directed graph (nodes = types, edges = programs which take the type of their "from" node and return the type of their "to" node). When a pipeline is executed, each stage pauses until type matches are confirmed--and if there is a mismatch then some path-finding algorithm is used to find the missing hops.
So the user can leave out otherwise necessary steps, and as long as there is only one path through the type graph which connects them, then the missing step can be "inserted". In the case of multiple paths, the error message can be quite friendly.
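A toy version of the path-finding bit, just to convince myself it hangs together (every type and converter name here is made up):

    # edge list: "fromType toType converter-program"
    edges='apache-log table apache2table
    table csv table2csv
    csv json csv2json'

    echo "$edges" | awk -v from=apache-log -v to=json '
        { prog[$1,$2] = $3; adj[$1] = adj[$1] " " $2 }
        END {
            # breadth-first search, building up the pipeline along each path
            queue[1] = from; path[from] = ""; head = 1; tail = 1
            while (head <= tail) {
                t = queue[head++]
                n = split(adj[t], nbr, " ")
                for (i = 1; i <= n; i++) {
                    if (nbr[i] in path) continue   # already reached this type
                    path[nbr[i]] = path[t] " | " prog[t, nbr[i]]
                    queue[++tail] = nbr[i]
                }
            }
            if (to in path) print "missing hops:" path[to]
            else            print "no conversion path from " from " to " to
        }'

A real version would also refuse (with the friendly error) when more than one path connects the two types, but the shape of the thing is about this simple.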
This means keeping your context small enough, and your types diverse enough, that the type graph isn't too heavily connected. (Maybe you'd have to swap out contexts to keep the noise down.) But if you have a layer that's modifying things before execution anyway, then perhaps you can have it notice the ssh call and modify it to set up a listener. Something like:
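local-command | ssh user@host "pull_metadata_from | remote-command"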
Where `pull_metadata_from` phones home to get the metadata, then passes along the data stream untouched.
Also, if you're writing the shell anyway then you can have the pipeline run each process in a subshell where vars like TYPE_REGISTRY_IP and METADATA_INBOUND_PORT are defined. If they're using the network to type-negotiate locally, then why not also use the network to type-negotiate through an ssh tunnel?
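e.g. the wrapper could rewrite the ssh call to carry the registry over the same connection; a sketch, where TYPE_REGISTRY_PORT is a made-up companion to the variables above:

    # reverse-forward the local type registry through the ssh connection, then point
    # the remote stages at the tunnel so they negotiate exactly like local ones do
    ssh -R "${TYPE_REGISTRY_PORT}:${TYPE_REGISTRY_IP}:${TYPE_REGISTRY_PORT}" user@host \
        "TYPE_REGISTRY_IP=127.0.0.1 TYPE_REGISTRY_PORT=${TYPE_REGISTRY_PORT} remote-command"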
This idea is, of course, over-engineered as hell. But then again this whole pursuit is.
> I was hoping to stick with bash or zsh, and just write processes that somehow communicate out of band, but I think we're still up against the same problem.
Yeah, we have different starting points but very similar problems.
tbh the idea behind my shell wasn't originally to address typed pipelines; that was just something that evolved from it quite by accident.
Anyhow, your suggestion of overwriting / aliasing `ssh` is genius. Though I'm thinking rather than tunnelling a TCP connection, I could just spawn an instance of my shell on the remote server and then do everything through normal pipelines, as I now control both ends of the pipe. It arguably has fewer proverbial moving parts than a TCP listener (which might then require a central data-type daemon and so on), and I'd need my software running on the remote server for the data types to work anyway.
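i.e. something along these lines, where `myshell` stands in for my shell's binary and the quoting is deliberately naive:

    # wrap ssh so the remote half of the pipeline runs inside the typed shell,
    # meaning both ends of the pipe speak the same protocol
    ssh() {
        local host="$1"; shift
        command ssh "$host" "myshell -c '$*'"
    }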
There's obviously a fair security concern some people might have about that, but if we're open and honest about it and offer an opt-in/opt-out (where opting out would disable support for piped types over SSH) then I can't see people having an issue with it.
Coincidentally, I used to do something similar in a previous job where I had a pretty feature-rich .bashrc and no Puppet. So `ssh` was overwritten with a bash function that copied my .bashrc onto the remote box before starting the remote shell.
> This idea is, of course, over-engineered as hell. But then again this whole pursuit is.
Haha so true!
Thanks for your help. You may have just solved a problem I've been grappling with for over a year.
I was thinking something similar, buried in a library that everyone could link. It seems... awfully awkward to build, much less portably.
This reminds me of how busted Linux is for not having an SO_PEERCRED equivalent for TCP sockets. You can actually get that information by walking /proc/net/tcp or by using AF_NETLINK sockets and inet_diag, but there's a race condition such that it isn't 100% reliable. A proper SO_PEERCRED would have to be.
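For the record, the workaround boils down to something like this once you've accepted a loopback connection (PEER_PORT / LISTEN_PORT being whatever the accepted socket reports; ss does the AF_NETLINK + inet_diag dance underneath):

    # look up which local process owns the peer's end of the socket; the peer can
    # exit and its port be reused between this lookup and acting on the answer,
    # which is exactly the race mentioned above
    ss -tnp "sport = :${PEER_PORT:?} and dport = :${LISTEN_PORT:?}"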