The Beauty of Unix Pipelines (2020) (prithu.dev)
97 points by ddtaylor on June 4, 2022 | 54 comments



This should remind us what programming is all about: Take a data set, modify it, present it to the user or write it back to a file or a database.

And we can see why imperative programming is much simpler than OOP or functional programming.

- Take that data!

- Transform the data!

- Take the result of the transformation and transform it further!

- Display the result!

Not:

- Think about an abstraction of the problem

- Define a class which is vaguely related to the problem domain.

- Take the data and press it into the object using a constructor.

- Write some methods which fit the object abstraction but are not related to the actual problem

- Write a couple of methods to administrate the data inside the object

- Do the same for a second class because after doing the first transformation step you have a different kind of object

- Use a mixin to combine the two classes

- Refactor the code to adhere to Clean Code

- Write a helper class which is actually not an abstraction of anything

- Write test classes to test quality into your code because your codebase became too messy.

- Explain to everybody why this approach is superior to just pipelining the data between functions.

- Complain about why the state of the industry is so bad.


I would argue that functional code is much closer to Unix pipelines than imperative code. Each function takes inputs and transforms them. Pipelines are just function composition.

100% agree with everything else you just said though.


I'm glad I'm not the only one who thought this. When I first learned about function composition, a supposedly groundbreaking feature of functional programming, Unix pipes were the first thing that popped into my head, but making that comparison out loud was only met with facepalms and accusations of trolling. But it's literally 'take this input and apply these functions to it, in this order'.
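
To make that concrete (access.log is just a made-up input file, and the names in the comment are informal, not real commands):

  # a pipeline...
  cut -d' ' -f1 access.log | sort | uniq -c | sort -rn
  # ...is the same thing as nested function application, innermost first:
  #   sort_desc(count_runs(sort(first_field(access_log))))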


Yah, I was expecting this comment. I'd agree, but nowadays I see an obsession with abstract issues around functional programming, like closures, monads, recursion, code-is-data and functions as first-class citizens, which might be of theoretical interest but quite often are just code-obfuscating techniques.

When you use (progn ....) in lisp or (do ...) or (-> ...) in clojure you are pretty near to what I understand as imperative but you are leaving the orthodoxy of functional programming.

What is relevant is idempotence, which is easily achievable in functional programming and in procedural/imperative code, as long as you don't use global variables.

Idempotence is generally not a feature of OOP. (OOP quite often is not even deterministic.)


While a lot of the type theory stuff is controversial, closures and recursion are core concepts of functional programming. I mean, they date back to the first ever functional language, lisp. Removing them would be like removing classes in an OOP language.


> The Unix philosophy lays emphasis on building software that is simple and extensible. Each piece of software must do one thing and do it well.

But, you see… his first example shows how the ‘git log’ doesn’t only show the log, but receives a flag to output arbitrary information according to the format string, and ‘uniq’ doesn’t only show unique entries, but receives a flag to count entries.

I would say these are not great examples of “do one thing”, and taken literally, the Unix philosophy would require much more shell script gymnastics to do simple tasks.


Exactly. The definition of “one thing” is unclear. One house is one thing, which also has a thousand things in it. A car does the “one thing” — driving, but steering is arguably a thing as well.


Don't play dumb. When you can describe the function with one simple sentence, it is likely that it does one thing. Cars transport their passengers. The fact that performing this action involves complex processes doesn't matter.

Just like a math theorem can be short and say just one thing and be extremely useful, but its proof could require many steps.

The idea behind the Unix philosophy is not simplicity of implementation - on the contrary many little Unix tools are complex under the hood because they try to do their thing optimally - but modularity. "Do one thing and do it well" to avoid functional overlap. That's simply an obvious and right thing to do in general.

However, this is not an easy principle to follow, because of unrelated concerns like stuffing your resume with impressive stuff - the problem with "simple" is that it is often confused with "easy" - or making things that can be sold - simplicity doesn't sell, "powerful" (as in "I have no idea what this does and I don't need it, but it's good to have 'just in case'") does.


This is a well-known problem of Unix (and software in general) - the original tools have suffered 'feature creep' and expanded far beyond their original intended capabilities.

Discussed here on the venerable cat-v website, which is itself named after this very phenomenon: http://harmful.cat-v.org/cat-v/

It's difficult to know where/when to draw the line though. I went through a phase of being a massive Unix fanboy but found that if you follow the philosophy too strictly, you end up with a tonne of disparate 'tools', several of which you have to string together just to do anything useful. If you have a script that e.g. loops over a list of files, you might have to spawn a large number of processes for every single item in the list, for functionality that feels like it should be built into one program (but if you add one extra feature behind a flag, then there's no reason we shouldn't add another little extra feature behind a flag, and that's how feature creep happens!).

Not only that, but once you start stringing together various Unix utilities, the expected output of each becomes set in stone (even if the output is just a single line of tab-separated numbers with no obvious meaning, e.g. `wc`). Plain text can be a lot more fragile than Unix fans would like to admit. You could add an option to a program to present the output in a different/better format if needed, but then that's another flaggie that could be better handled by a separate utility... (e.g. instead of an 'output as json' flag, why not have a 'convert input to json' utility that we can pipe the output through?)

In time, I've come to use a mix of single-purpose Unix tools with all their assorted flags, along with monolithic one-man-band tools (e.g. `find` - I have very little trust in piping things through xargs sh -c "run two commands on \"\$filename\" and please don"'"'"t fuck up quotes in \"\$filename\"" - or is it {}? or "{}"? or \"{}\"? or \{\}? You get the picture). Piping stuff together is good for when you're writing shell one-liners and making it up as you go along / as needed, but for anything more than that where a script file is needed, these days I'm more inclined to rewrite it in a 'kitchen sink' language such as PHP or Perl.
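
(For the record, the incantation I always have to look up, which sidesteps the quoting mess by letting find hand the filenames to an inline sh script as real arguments, is roughly the following; the two commands inside are just placeholders:)

  find . -type f -name '*.txt' -exec sh -c '
    for f in "$@"; do
      wc -l "$f"      # first command on the file
      head -n 1 "$f"  # second command on the same file
    done
  ' sh {} +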


That philosophy tends to be cargo-culted by those who never used commercial UNIXes.

I have been using UNIX systems since 1993, starting with Xenix, and I have no idea when this started to be something that some people actually believe happens.


It's very nice indeed, but PowerShell pipelines are at another level of beauty: think what you could do if, instead of just STDOUT (and STDERR, to be fair), you had data structures with names.

On Windows, try it for fun:

Get-ChildItem -Path HKLM:\SYSTEM\ControlSet001\Services\DeviceAssociationService\State\Store\ | Select-String -Pattern Bluetooth

Now look at how it parallelizes the request by default, and how it returns several columns of data per line.

Want a regex?

Get-ChildItem -Path HKLM:\SYSTEM\ControlSet001\Services\DeviceAssociationService\State\Store\ | Select-String -Pattern Bluetooth | ForEach-Object { $_ -replace 'HK.*-','' -replace ':', ''}

Want a CSV? Add "| ConvertTo-Csv"

Let's do another, to filter only devices (a named property)

Get-PnpDevice -Class 'Bluetooth' -PresentOnly | Where-Object HardwareID -match 'DEV_' | Get-PnpDeviceProperty | ConvertTo-Csv

Now I'm learning how to do better things like lambdas on the pipeline

PowerShell is weird the first time, but it's so well made it'll give you hours and hours of fun :)


I've said this on here many times, but PS only works well with small files. Your examples with even 50 MB files absolutely crawl when Unix is blazing. We're talking 10 minutes vs like 3 seconds. PS code is nice once you get the hang of it, but I've finally abandoned it for most tasks after a few years due to the performance issues. It turns out converting everything to objects is just slow.


This kind of thing could be done in Unix as well.

In FreeBSD, libxo[0] is integrated into a lot of its utilities, allowing this same type of power.

For example:

  ps aux --libxo json
Or:

  ps aux --libxo xml
At this point you still have plain text, but it now has a well-defined structure that can be interpreted by other utilities. With JSON output you can use a tool like `jq` to slice and dice the data. With XML output you could run it through an XSLT to transform it into whatever you want - this definitely moves outside the realm of simple though.
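
For example, assuming the "process-information"/"process" wrapping that libxo uses for ps output, something like this pulls out the commands running as root:

  ps aux --libxo json | jq -r '."process-information".process[] | select(.user == "root") | .command'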

This is definitely not on the same level as what PowerShell does, but it shows how something similar could exist in the Unix world.

[0] https://www.freebsd.org/cgi/man.cgi?query=libxo&sektion=3


The problem with PowerShell (aside from its verbosity) is that you essentially have to relearn the name of every command and every option, throwing all your existing shell knowledge out.

I earnestly used PS as my main shell for a number of years, but this learning curve was just too steep. Once WSL turned up I jumped ship.


I get it, and I respect the effort to clean up shell and create something coherent, but the commands and casing are really overdone and feel like old IBM mainframe commands that require looking up a physical binder to use effectively. Start by replacing the commands with simple verbs like "walk", "filter" and "convert" and give them some default positional args.

I think a proper shell should still have a few shortcuts that break coherence for convenience, as long as there also is a long form for script readability.


> Start by replacing the commands with simple verbs like "walk", "filter" and "convert" and give them some default positional args

PowerShell commands (more accurately, cmdlets[0]) have a fixed Verb-Noun syntax. Get-ChildItem, Invoke-WebRequest, etc etc.

[0]: https://docs.microsoft.com/en-us/powershell/scripting/powers...


Right, and that verb-noun convention reminds me of https://www.techtarget.com/whatis/reference/Common-AS-400-co...

...and it's not good. At least AS/400 commands had terseness to them. Those powershell commands are ok-ish for scripting but much too long for interactive shell sessions where you need to "explore the space".


> much too long for interactive shell sessions

There exist aliases[0] for the most commonly-used cmdlets.

[0]: https://docs.microsoft.com/en-us/powershell/scripting/learn/...


I tend to agree. After working with data outputs from a variety of Linux sources for years I find the PS output is much more standardized -- or at least PS handles string outputs in a much more standardized way. I love the Select-String ... approach to field display. Making and outputting tables in PS is much slicker than the same magic on Linux. Ugly is a matter of opinion, but the real magic of PS is never having to drop characters or spaces to munge text into a column.


"It's very nice indeed, but PowerShell pipelines are at another level of beauty"

What?!??

"PowerShell is weird"

Yes, totally agree with that.


Sorry, I can't tell if this is sarcasm.

Is there something I'm missing? Even the above examples look awful to me


It's not sarcasm: I actually like PS because it builds on the concept of data pipelines, turning them into something more object-oriented.

The only similar thing in the Linux world is using sqlite as the target of a pipe to do some data processing, like finding, from ps output, the name of the process using the most RAM.

Structurally, the language seems both Perl-inspired and a bit like a hybrid of LISP (for the objects) and SQL (for how most structures are manipulated).


> I actually like PS because it builds on the concept of data pipelines, to turn them into something more object orientend.

It certainly is an interesting idea. I think the .Net dependency makes it more heavyweight than it might have otherwise been. I wonder if anyone at Microsoft has ever thought about porting it from C# to C++ (I suppose using COM/WinRT).

I think a big deficiency with pipes is they don’t come with any protocol for out-of-band content negotiation. Someone could add an IOCTL you could call on a write pipe (or Unix domain socket) to advertise supported MIME types, which blocks waiting for the other end to select one. The read end can call another IOCTL to get the supported MIME types, and then a third to indicate its choice. If the read end doesn’t support this protocol, you could make it so the first IOCTL is unblocked by the first read()/recvmsg() call on the read end, so this protocol still works even if the other end doesn’t support content negotiation. Utilities could default to producing text/plain, but generate application/json instead if the other end supports it.


In the FreeBSD world, we have libxo[0] which allows utilities to produce structured output.

To use your example, taking the output of ps and finding the process using the most RAM, you could do something like this:

  $ ps aux --libxo json|jq '."process-information"."process" | sort_by((.rss)|tonumber) | .[-1]'
  {
    "user": "root",
    "pid": "2372",
    "percent-cpu": "0.0",
    "percent-memory": "11.4",
    "virtual-size": "2149236",
    "rss": "1906584",
    "terminal-name": "- ",
    "state": "SC",
    "start-time": "21May22",
    "cpu-time": "1307:09.11",
    "command": "bhyve: postgresql (bhyve)"
  }
It's not pretty, but it does let you get structured data out of most of the base system utilities to make the slicing and dicing easier.

[0] https://www.freebsd.org/cgi/man.cgi?query=libxo&sektion=3


If "programs ought to do one thing and do it well", then why aren't they just library functions in a scripting language? Scripting languages can probably do the same things (if given minor syntax improvements, see Xonsh) in one line as well, and scale to have more complicated logic if needed. I still don't see the beauty.


Why would having them just be library functions make it better? (I don’t disagree, just curious what you’re thinking.) The actual differences between pipeable programs and library functions might be hard to pin down concretely. Btw do you use unix & pipes a lot, or not much?

Here are a few thoughts, maybe helpful or maybe not. Part of the beauty of pipes and simple lego-brick programs is the complete integration with the operating system, and more specifically the file system. Being able not just to write your state to a file, but furthermore having that be the default mode of operation is pretty powerful, especially combined with special files. Writing to a file in a scripting language that’s not a shell is usually easy, but not the default, and would be cumbersome to do at every step. Part of do one thing well is to keep the shell DRY - don’t add sorting or searching or filtering functionality to another program if you can already use grep or sort or sed. Done right, this helps people stay fluent with a small number of tools. Another implicit benefit of the unix philosophy that wasn’t mentioned here, but does go hand-in-hand with pipes and small-good tools, is the idea to process and work with human readable text files.
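
As a small illustration (app.log is just a stand-in), tee lets you drop an intermediate stage's state into a file without breaking the flow:

  grep ERROR app.log | tee errors.txt | cut -d' ' -f1 | sort | uniq -c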

One way to look at unix beauty is to do some command line data analysis. I like the general philosophy of data exploring with piped commands for one-off and small projects. Sometimes more focused tools are needed, but most of the time perhaps you can get away with unix tools. https://en.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_The_...


I mean... isn't that what is already happening? The shell is the scripting language, and compiled binaries that munge text are the library functions.

Unix/Unixlike operating systems aren't perfect, but they get the job done.


Moreover, something rarely mentioned, multiple processes and pipes can give you some free parallelism courtesy of the scheduler.
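
E.g. in a pipeline like this (big.log.gz being hypothetical), decompression, filtering and counting are three separate processes that the scheduler can overlap on different cores:

  gzip -dc big.log.gz | grep ERROR | wc -l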


The downer is that every function call in this system has only string parameters.


That's because this is the way they thought back then. You can trivially pass json data between processes and make a coherent structured system. Nobody does it though.
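
A throwaway sketch with jq and two made-up records, just to show the plumbing already exists (it prints "postgres"):

  printf '%s\n' '{"name":"nginx","rss_kb":51200}' '{"name":"postgres","rss_kb":204800}' \
    | jq -s 'sort_by(.rss_kb) | last | .name'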


jq[0] has >20k stars on GitHub.

I use it mostly for little cli utilities, so maybe it isn’t an exact refutation of your claim.

[0] https://github.com/stedolan/jq


Sure, but I think the bigger problem is complete systems. Look at a usual Linux box. It has a bunch of files, all with their custom syntax, and you parse them with error-prone shell commands. If everything used json, everyone's lives would've been so much better.

Independent programs using json is a step in the right direction though.


> …then why aren't they just library functions in a scripting language?

I’d say they basically are. What’s the difference between Python’s sort and (let’s say) Bash’s sort? In Python I call sort to order a data structure like a list. In Bash I do the same, but on a different data structure. The crux of the matter is probably buried in semantics.

> I still don't see the beauty.

What I like about shell and pipelines is that they allow me to easily edit and play with text. I don’t need to think about opening nor closing files, I don’t need to think how to best represent the data I’m working with. In minutes I can get the information I need and I can move on.


> If "programs ought to do one thing and do it well", then why aren't they just library functions in a scripting language?

In many cases, the command line tool is a wrapper for an awesome library, and the awesome library has bindings for multiple scripting languages. So in many cases, there's a command, and a library and yeah, you can import the library into your favorite scripting language (or often compiled language, too). In the end, you could argue that all languages are just an abstraction for system calls anyway.

So why the love for the Unix shell? One line is faster for the user than 23 lines. It's the same reason we use compiled languages instead of assembly most of the time. The same applies to using python or javascript instead of C or rust. Fewer lines of code. Bigger abstractions. The problem is solved and the work is done faster.

> I still don't see the beauty.

The whole "collection of small programs that do one thing well" thing is just one way to do it, and there are plenty of examples of programs that are super successful that don't heed that advice in any way at all. In the end, Unix has held up for a remarkably long time... seems like there should be a better way to do it, but it's hard to argue with Unix success in the wild.


You can think of them as library functions, but they all have the same signature: a pair of (byte[]) streams for input and output (plus another for errors), and then `args: string[]` and `options: map<string, string>`. If you're a human that's only really interested in consuming text, this isn't a completely inappropriate interface, and there is some beauty in just how expressive it can be.
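
A shell function makes the analogy explicit; top_n and access.log here are made up, but the function has exactly that signature (stdin/stdout/stderr plus "$@") and drops straight into a pipeline:

  top_n() { sort | uniq -c | sort -rn | head -n "${1:-10}"; }
  cut -d' ' -f1 access.log | top_n 5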


Or perhaps, why not both? Imagine a monorepo where the same source code that produces the "ls" output also provides a library function that returns the same data as structs. Perhaps ls is a bad example, but it would be quite neat if the command line and programming contexts could act on semantically similar procedures.

The Rust effort to implement coreutils anew could possibly pick this up and make something really useful here.


Actually, shell/unix pipes bring to the table a bunch of features: a convenient syntax to connect processes, I/O buffering (usually done by the programs themselves but still), and concurrency.

Languages are an (often) unique combination of (often) non-unique features. That's why shell scripting is a thing.


I was wondering if there's a CLI that can replace sort | uniq -c | sort -nr. I feel like this is a very commonly used pattern.


Obviously you could create a shell alias or write a new utility for that -- but if you're wondering why there isn't a standard UNIX utility for doing that: All the standard text processing utilities were designed in the context of processing files which didn't fit into memory -- with the exception of sort (which uses temporary files) they're all streaming utilities which keep at most a handful of lines in memory at once.

The "normal" way of doing "count how many times lines are present and list them in decreasing order" is to use a hash table which keeps all the (unique) lines in memory -- so that utility simply wasn't possible when all the standard tools were invented.


Roll your own:

  ~/bin/freq
  ==========
  #! /bin/bash

  sort "$@" | uniq -c | sort -nr
Once you get into the habit of capturing your habits as scripts that snap together like German nouns, you'll find the command line a wondrous place.


To improve performance, consider using `export LC_ALL=C` (credit: https://benhoyt.com/writings/count-words/). But note that it sorts uppercase and lowercase separately, and doesn't handle Unicode well.
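
You can also scope the override to the individual commands instead of exporting it for the whole session, e.g. in the freq script above:

  LC_ALL=C sort "$@" | uniq -c | LC_ALL=C sort -rn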


Someone else mentioned using a hash table to accumulate counts; that approach would give something like

  ~/bin/freq
  ==========
  #! /bin/bash

  awk '{acc[$0]++} END {for (item in acc) {print acc[item], item}}' "$@" | sort -nr


i've found data_hacks[^0] to be a useful substitute for interactive use. e.g.

  $ for x in {0..999}
  > do echo $((RANDOM%5))
  > done | bar_chart.py -np
  # each * represents a count of 8. total 1000
  0 [   193] ************************ (19.30%)
  1 [   200] ************************* (20.00%)
  2 [   209] ************************** (20.90%)
  3 [   193] ************************ (19.30%)
  4 [   205] ************************* (20.50%)
  $
[^0]: https://github.com/bitly/data_hacks


Maybe I'm missing something, but what's stopping you from creating a shell alias for exactly that pipeline?


The point is that it's slower than an equivalent do-it-all - sort | uniq is already slow for interactive use if you have a couple thousand input lines (with our computers it should absolutely never take more than a few milliseconds)


Who said anything about perf? @yegle said it was a common pattern and asked for alternatives.

> it should absolutely never take more than a few milliseconds

Why? It surely depends on your use case (highest possible perf is different for many dupes vs few dupes, for example). And a few milliseconds is lower than a human's ability to respond, so doesn't matter for interactive use, it can only matter in batch scenarios.

sort | uniq | sort -nr takes ~300ms on a file of 10k lines of random data that is 0.5MB on my 10 year old mac. It's ~60ms on 2k lines. It's faster on a newer computer, and unless you're doing this a thousand times in a row, there's next to no difference between 3 ms or 300ms to me in an interactive shell...

BTW, what is the alternative? Are you assuming I write something high performance, spend my time figuring it out? Or is there a higher performance sort|uniq already available?


> sort | uniq | sort -nr takes ~300ms on a file of 10k lines of random data that is 0.5MB on my 10 year old mac. It's ~60ms on 2k lines.

it's absolutely intolerable

> It's faster on a newer computer, and unless you're doing this a thousand times in a row, there's next to no difference between 3 ms or 300ms to me in an interactive shell...

having to wait 300ms after an interaction makes my hands shake. If the result is not on screen by the time the enter key has gone back up it's plain bad.


> it's absolutely intolerable

Huh. Okay. Maybe you could answer the actual question: why? That seems like a rather extreme position to take.

300ms was for 10k lines, not your example of 2k lines. 2k lines does come back within 2 repaint frames on my 10 year old mac, which is about as fast as most video game responses to controller input.

PS and I'm still wondering what your proposed alternative is...


> Huh. Okay. Maybe you could answer the actual question: why? That seems like a rather extreme position to take.

it makes me feel physically sick and my hands start shaking when a computer does not answer immediately

> 2k lines does come back within 2 repaint frames on my 10 year old mac, which is about as fast as most video game responses to controller input.

which repaint frames ? e.g. here on my 240hz monitor that would be 8.3 ms which i guess is tolerable for visual input. on a 60hz, that's 33ms which is noticeable when everything else around it is fast

> PS and I'm still wondering what your proposed alternative is...

? it's in the original post - having a utility which does such common piping use cases without having to go through three processes - here doing a time sort | uniq | sort -nr of a 100mb file gives me 2/3 seconds for the first sort, 1s for the uniq, and again a couple seconds for the second sort ; the uniq can stream but the sorts can't, and a single command would likely be able to optimize this with a bit of algorithmic thinking


> my hands start shaking when a computer does not answer immediately

I’m sorry to hear that. It sounds like it could be an anxiety issue somewhat unrelated to what we are talking about? Does it help to stick to very small files so you never have to wait? Does it help when you get progress updates in the interface?

I still don’t understand what you’re suggesting because the idea that you want to give arbitrarily sized inputs to a sort and have it come back instantly is unrealistic. You can only improve the overheads and bandwidth, you can never avoid waiting if you give larger inputs. No matter what solution you suggest, I can give you a file you’ll have to wait for. Speeding up the unix pipeline doesn’t appear to solve the problem you’re describing, that’s why I’m asking for more detail.

> it’s in the original post

What is? I’m not sure what you’re referring to. You only mean the imaginary idea to combine 3 processes into one? @yegle only appeared to ask for a shortcut for a common workflow, and not a more performant alternative. I’m asking you what your preferred practical implementation is. What can you do today to speed up this workflow? And how does that solve the problem of having to wait?

> the uniq can stream but the sorts can’t

Right, this is true regardless of what tool you use. You can eliminate the overheads of starting up 3 processes, but you can’t avoid the second sort, if you ask for a second sort.

Might be worth observing that the overheads of 3 processes get relatively smaller the larger the inputs are. If you’re sorting a gigabyte or terabyte of text, using a single process will be barely better, if at all.

Might also be worth looking at cold cache vs warm cache startup times for 3 processes. Does your 2/3 second change the second time?

PS GNU sort has options that can improve its bandwidth. Maybe try these? https://www.reddit.com/r/commandline/comments/a7hq5n/psa_imp...

This is another reason to avoid writing your own tool, since sort may in reality have a lot of functionality that is not easy to duplicate, and may have ways to gain performance that is hard to beat.
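
For reference, the GNU sort knobs I'd try first (flag names from memory, so double-check the man page) look something like:

  LC_ALL=C sort -S 50% --parallel="$(nproc)" big.txt | uniq -c | sort -rn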


I wish the next data science language had parallel pipelines built in as a default.

I love the way I can pipe stuff in R’s tidyverse package, and I’m a big fan of chained pandas, but I wish they streamed data one row at a time and all the pipes ran in parallel.

Having that as a primitive in your data science language would be really cool.


You should check out https://www.pola.rs/


I made https://io10.dev/ to apply this concept to code notebooks.


I guess one could also refer to the beauty of Lisp REPL, 1961, with the added benefit of having a proper debugging facility and structured data.



