Perl One-Liners Cookbook (learnbyexample.github.io)
126 points by asicsp on Nov 7, 2020 | 47 comments



Hello!

I started tutorials on command line text processing (https://github.com/learnbyexample/Command-line-text-processi...) more than three years ago. I learnt a lot writing them and continue to learn more with experience.

I recently finished the first version of a cookbook on Perl one-liners. With that, five of the major chapters from that repo are now available as better-formatted ebooks, updated for newer software versions, with exercises, solutions, etc.

You can download the pdf/epub versions of the ebooks using the links below (free until this Sunday):

* Perl one-liners cookbook: https://gumroad.com/l/perl-oneliners or https://leanpub.com/perl-oneliners

* Five book bundle of grep/sed/awk/perl/ruby one-liners: https://gumroad.com/l/oneliners or https://leanpub.com/b/oneliners

---

I'd highly appreciate your feedback and hope that you find these resources useful. Happy learning and stay safe :)


How do you picture someone using this?

The reason I ask is that I've never figured out a good way to make use of this type of material. Reading idly means I retain almost nothing. I try to avoid spaced repetition on things I don't actually use (that has turned out to be a good filter to avoid over-adding flashcards).

Whenever I have a problem that can be solved by a Perl one-liner, there are two obstacles to using something like your book:

1. Figuring out that my problem is solvable with a Perl one-liner in the first place, and

2. Finding the right patterns in the book to piece together to achieve my goal.

Have you considered any sort of tiered index to allow one to classify a problem at hand and look up the relevant structures in the book?


I agree – having read the intro and skimmed some other sections, they seem like great foundational introductions and I'm absolutely excited to use perl instead of GNU-like utilities... but I'm very nervous that I wouldn't be able to quickly find how to do it with perl later.

An additional "problems to solve with a perl one-liner" index of some kind might be quite helpful for that...


>Have you considered any sort of tiered index to allow one to classify a problem at hand and look up the relevant structures in the book?

I have wondered a few times if I can build some sort of guide/flowchart that allows one to pick tool(s) for a particular text processing job (for ex: when to choose head/tail/grep/sed/awk/pr/paste/etc). And then within the tool(s) chosen, how to break down a problem and select particular features of that tool. It has remained a fantasy as I haven't tried to actually start doing something about this. I dunno if this is feasible and/or whether I'll be able to do it.

>The reason I ask is that I've never figured out a good way to make use of this type of material.

There are several prerequisites mentioned in the Preface for using this book. You'll have to be familiar with the CLI environment as well as Perl basics.

After that, I feel the best way would be to start using these one-liners in your everyday tasks. If your workflow doesn't suit this type of usage, you'll have to be either motivated enough to explore or take a risk to change/modify your workflow.

I provide several examples for each topic and have exercises at the end of chapters. Broadly speaking, there are three main tasks (the tool in parentheses is the one whose fundamental task it is):

* Filter an input record based on regexp, string or numeric comparisons (grep)

* Search and replace (sed)

* Field based filtering/substitution (awk)

awk and perl can do all of these tasks. However, a tool specialized for a task is usually faster and has custom options that make those tasks easier.
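For a rough side-by-side illustration (a sketch of my own, not taken from the book; file.txt is just a placeholder):

    grep 'error' file.txt                  # filter lines matching a regexp
    perl -ne 'print if /error/' file.txt   # perl equivalent
    sed 's/foo/bar/g' file.txt             # search and replace
    perl -pe 's/foo/bar/g' file.txt        # perl equivalent
    awk -F: '{print $1}' file.txt          # field based extraction
    perl -F: -lane 'print $F[0]' file.txt  # perl equivalent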

So, my best advice would be to practice and find reasons to incorporate cli based solutions in your everyday workflow.


I consider myself an intermediate-or-advanced user of both sed and awk. Those are small enough languages that I was able to study them for a few hours and then know roughly how to solve any problem with them.

Perl one-liners are different in that they rely on a large set of flags and language shortcuts which are harder to remember when your main language isn't Perl.

So it's not about getting used to the POSIX utilities and text processing, it's more about it being infeasible to "just memorise" all these Perl shortcuts.

The traditional solution when "just memory" won't cut it is some sort of index to speed up the search.

The guide/flowchart thing is just what I'm talking about. Botanists use a tiered index to identify plants: they subdivide the flora first by large differences in characteristics and then down to differences you need a magnifier to see. Each observation cuts the potential space by a significant fraction.

Something similar would be useful for this. E.g. "is your input structured or free text?" "Are records separated by linebreak or something else?" "Are there fields in the records?" "Are you looking to perform an aggregation?" And so on. I don't know. It would be neat for many reasons!


>"is your input structured or free text?" "Are records separated by linebreak or something else?" "Are there fields in the records?" "Are you looking to perform an aggregation?"

Yeah, that'd be awesome. I guess I'll just have to give it a shot sometime, get feedback, improve and repeat the process until there's a good enough resource. That'll likely give me good returns in terms of sales too.

For now, writing one yourself could help. Since you know sed/awk, try writing equivalent Perl versions with the help of the book. For options, create a cheatsheet that you can refer to quickly from the command line. Whenever you see something that isn't easily possible with sed/awk, put it under a special section so that you can build a list of where perl would be better (I give some examples of this in the first section of the introduction chapter). I also mention resources for getting started with Perl in the Preface chapter.

I often use my books as a reference guide, because I can't remember all the syntax and gotchas. The situation is worse now compared to a few years back because I have had to dive deeper to write these books. I'm old enough that learning new things has become difficult. I'd like to create a CLI lookup (something similar to the guide/flowchart), but for now the web version created with mdbook serves as the best way for me to look up something.


>Whenever I have a problem that can be solved by a Perl one-liner, there are two obstacles to using something like your book:

1. Figuring out that my problem is solvable with a Perl one-liner in the first place, and

2. Finding the right patterns in the book to piece together to achieve my goal.

The same argument can be made about any programming language. What makes Perl one-liners different?


My comment was not about a programming language, but about a book of example one-liners. I have the same problem with any book of one-liners, no matter the subject language.


nice links! thank you!


My favorite

    perl -ne '$l=$_ if rand()<(1/$.); END{print $l}'
To pick a random line from a stream of unknown length in a single pass while only storing one line. With some more options it will handle a fortune file (or any stream of things that you can go through by record).
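For instance, something along these lines should handle it (a rough sketch, assuming a fortune-style file whose records are separated by "%" lines; the filename is made up):

    perl -ne 'BEGIN{ $/ = "%\n" } $r = $_ if rand() < (1/$.); END{ print $r }' fortunes

Setting $/ makes perl read one record at a time, and $. then counts records rather than lines, so the same logic carries over.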


Can anyone explain how that works?


I can try. The -ne flags just say to execute the given script on every line of input. The body that will be executed, in pseudocode, is

    while get_input:
      if rand() < 1 / current_line_number:  # $. contains the line number of input
        l := current_line_content  # $_ always contains "what you want", in this case the current line

    # when using the -n flag you can use BEGIN {} and END {} blocks
    # to execute code before and after the loop
    print(l)
Now, as to why the random selection works, this is a simplified version of Reservoir Sampling[1]

[1]: https://en.wikipedia.org/wiki/Reservoir_sampling


It seems like nobody else mentioned this, so for completeness: the algorithm is called reservoir sampling.

It allows you to sample n elements from a potentially very big stream with exactly n elements of storage. Incredibly neat technique.

Lots written about it a web search away once you know its name!

(If the sequence is in primary memory with fast random access, it's more efficient to run n iterations of a Fisher-Yates shuffle and pick the first n elements.)
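A rough Perl sketch of the k-element version, for illustration (my own, not from the thread; the script name and usage line are made up):

    # usage: some_command | perl reservoir.pl 5
    use strict;
    use warnings;

    my $k = shift || 1;                       # reservoir size
    my @sample;
    while (<STDIN>) {
        if (@sample < $k) {
            push @sample, $_;                 # fill the reservoir first
        } elsif (rand() < $k / $.) {          # keep line number $. with probability k/$.
            $sample[int rand $k] = $_;        # evict a uniformly random slot
        }
    }
    print @sample;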

Edit: I forgot to refresh the comments page. Someone has mentioned it now. Good!


Without knowing much Perl, I guess that $_ is the current line, and $. the number of lines read.

This will pick the first input line with a 100% chance, overwrite it with the second line with a 50% chance, then overwrite it with the third line with a 33% chance, etc. Proving that this really gives all lines the same chance of ending up in $l sounds like a fun math exercise.


Well, the second line replaces the first line 50% of the time.

3 lines: The 3rd line has the correct 1/3 chance of being written, otherwise (with probability 2/3) one of the first 2 lines remains, and they had an equal 50% chance of being stored just prior to the 3rd line appearing, which now is 50% x 2/3 = 1/3.

4 lines: The 4th line has the correct 1/4 chance of being written, otherwise (with probability 3/4) one of the first 3 lines remains, and they had an equal 1/3 chance of being stored just prior to the 4th line appearing, now 1/3 x 3/4 = 1/4. Same argument applies all the way down. That's very neat! ...

n lines: The nth line has the correct 1/n chance of being written, otherwise (with probability (n-1)/n) one of the first n-1 lines remains, and they had an equal 1/(n-1) chance of being stored just prior to the nth line appearing, now 1/(n-1) x (n-1)/n = 1/n.


> Proving that this really gives all lines the same chance of ending up in $l sounds like a fun math exercise.

Quick try with recursive proof:

Assume all preceding n-1 lines have equal probability of having been picked (p = 1/(n-1)). This is clearly true for the baseline n=2.

Then for line n, it will replace the picked line with probability p=1/n, since that is in the algo. Any previous line will now have p=[1/(n-1)] * (1 - 1/n) = 1/n and the assumption holds.


> Filtering ... perl one-liners can be used for filtering lines matched by a regexp, similar to grep, sed and awk.

In this section, the examples do stuff that is easy to do with grep, sed, or awk, but since perl is more powerful, wouldn't it be better to use examples that aren't as easy to do with those other tools?

For example, I recently found I could use short perl one-liners to filter files in a pipe like using `[[ -f $file ]]` per line. I used to use stest from dmenu like this:

  pacman -Qlq $pkg | stest -f
but now I see I could instead do the following anywhere without installing a special package:

  pacman -Qlq $pkg | perl -lne 'print if -f'
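Other file-test operators drop in the same way, e.g. (untested sketches along the same lines):

  pacman -Qlq $pkg | perl -lne 'print if -d'          # keep only directories
  pacman -Qlq $pkg | perl -lne 'print if -f && -x _'  # regular files that are executable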


>wouldn't it be better to use examples that aren't as easy to do with those other tools?

hmm, food for thought... I wasn't thinking along those lines, since I had already shown/linked powerful one-liners in the very first section: https://learnbyexample.github.io/learn_perl_oneliners/one-li...

I'll see if I can add such examples in the actual content too. I was concentrating more on explaining the syntax, showing features, etc.


> perl -MList::Util=uniq -e 'print uniq <>'

This one is interesting, because I've come across this issue many times where "uniq" will not work the way you perhaps expect it to (if you, like me, don't read the man page first): it only collapses adjacent duplicates, so a value that shows up again later in the file is printed multiple times. Normally I've found that the way to work around this is to sort the output before passing it to uniq, so I was curious whether the perl way was faster or not.
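A tiny demonstration of that behaviour (the duplicates are not adjacent, so nothing is collapsed):

    $ printf 'a\nb\na\n' | uniq
    a
    b
    a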

I wrote 10M random numbers between 1-3 to a text file and tried this benchmark:

> time cat hn.txt | sort -n | uniq

real 0m3,929s

> time cat hn.txt | perl -MList::Util=uniq -e 'print uniq <>'

real 0m2,205s

Almost twice as fast. Lesson learned I guess, don't knock Perl for CLI wrangling :)

I can share one of my frequently used ones as well: if you have a log file with an initial timestamp that is UNIX epoch, you can pipe the text to this perl command to convert it to localtime:

> cat logfile | perl -pe 's/(\d+)/localtime($1)/e'

(I'm sure you could do this with something other than Perl as well, but it does the job well)


If you remove `cat`, it gives me nearly identical run time on my machine. I used `shuf -r -n10000000 -i 1-3 > t1` as input. Also, the output won't be the same, as the perl solution retains input order. `sort -nu` is faster than the `sort+uniq` combo.

    perl -ne 'print if !$h{$_}++'
runs faster than the `sort -u` or `sort+uniq` combo (but this might depend on the input; I'm not sure if it is always faster than uniq)

what's surprising to me is that this was faster than

    awk '!a[$0]++'


(You sound like someone who would probably know this, but a lot of people don't).

Note that in general "sort -nu" is not equivalent to "sort -n | uniq", although in this particular case they are. If you have more than one field on the line, they can differ.

For example, if the input is:

  3 first
  2 second
  3 second
  1 second
  2 first
  1 first
then this is the "sort -nu" output:

  1 second
  2 second
  3 first
whereas "sort -n | uniq" gives:

  1 first
  1 second
  2 first
  2 second
  3 first
  3 second
I'm a bit confused as to how "sort -nu" chooses which record to print when there is more than one with the same key. I would have expected the chosen record to be the one that would have come first in "sort -n" output, so that "sort -n | sort -nu" would be equivalent to "sort -nu", but that is not the case. The pipe gives:

  1 first
  2 first
  3 first
It is as if -nu automatically adds -s.


The manual for the -u option says "output only the first of an equal run", which is effectively the same as what the -s option does.
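For what it's worth, running `sort -ns` on the same example input (my own check, assuming GNU sort) keeps ties in input order:

  1 second
  1 first
  2 second
  2 first
  3 first
  3 second

and taking the first record of each numeric run reproduces the -nu output above.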


Try using mawk instead of GNU awk. It's much faster than either Perl or gawk in this scenario for me. Roughly 2x.

Also, "export LC_ALL=C" typically speeds up gawk quite a lot.


I did try LC_ALL=C with gawk, but that didn't give any improvement for this case.

mawk indeed was much faster


Might be interesting to see how it fares vs "sort -u".


I’m curious, is anyone doing active development with Perl? All the Perl I’ve seen has been in legacy systems.


I think it's very widely used for its original purpose, which is a shell replacement. When I need a short script, particularly for text processing, I'll usually reach for Perl now that the Perl 5/Perl 6 thing has been resolved. Just look at the activity on CPAN: https://metacpan.org/recent

There are two heavily developed/used web frameworks for Perl: https://mojolicious.org/ and http://perldancer.org/

Users of Mojolicious: https://github.com/mojolicious/mojo/wiki/Projects-and-Compan...


Yes, absolutely, perl is part of my standard toolbox for some specific tasks, and I'd say I use it at least once a week.

For example, for short scripts to process/massage data (either one-shot or in a production pipeline), it's very expressive and "to the point", especially if you're going to use regexps.

perl is, in particular, way less verbose than e.g. python when it comes to regexes, because regexes are built into the basic syntax of the language.
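For instance (just an illustrative snippet of my own; the values are made up):

    my $line = "logged 2020-11-07  ";
    if ($line =~ m{(\d{4})-(\d{2})-(\d{2})}) {   # match + capture, built into the syntax
        print "year $1, month $2\n";
    }
    $line =~ s/\s+$//;                           # in-place substitution, no imports needed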

OTOH, there are many, many things that are broken in perl, but if I had to cite two things that made me kind of give up on it for new large and mid-level complexity projects, these would be:

- near impossible to predict perl's handling of "numerical" values

- having to type dollar signs to dereference variables


There are Perl interface libraries for everything, so it is often a really great choice, even if it is no longer sexy now that the peak CGI-days have gone.

One thing that I like about Go is that it promotes testing as something that should be required, but in my experience the vast majority of Perl modules also come with very good test-case coverage, it being the source of the TAP format after all!

I've written a few new sites and APIs using the CGI::Application framework over the past few years, and I'll often knock up simple transformation scripts in Perl to glue APIs together and perform ad-hoc transformations between various tools.

Although I've mostly started writing new services in golang, I don't feel the need to go back and rewrite the stable systems that are in perl.


Yes. Quite a few tools and products at $dayjob are Perl under active development, with large-scale revenue numbers attached to them.

I use perl daily in my work on supercomputers and HPC systems. My code helps configure and control them. It helps simplify users' work.

No. Perl is not dead by any measure.


These responses are great, glad to hear perl is alive and well. I miss working with it and need to use it more often.


I use Raku for shell scripting and system administration. Bash scripts more than 10 lines long scare me.


Yes, working on a greenfield project for one of the largest companies you’ve heard of.


I do. Ideal for deploying on light (edge) servers.


I have lots of programs for loading data from files or sites into a database.

I also have a few Telegram bots written in Perl.


I use it every day for sysadmin stuff.


Many industries use Perl 5 to keep things running. The semiconductor industry, in particular, uses a lot of existing, and develops a lot of new, Perl code to do things commercial software cannot.



http://ziprecruiter.com is largely written in perl


Booking.com, a $73 billion company, uses Perl.

It is both "legacy" and "active development".

s/legacy/valuable/g;


Don’t forget Craigslist


There are "real" companies using Perl presently...Booking and DuckDuckGo and others. Looking at Redmonk, Tiobe, other surveys...Perl isn't at the top of the heap by any measure but it also stubbornly refuses to die. I write Perl every day and frankly don't care if people think it is "dead", and even I am surprised at how prominent it figures in language surveys, still.

Most of the key libraries I use are updated regularly. Perl itself is under active development with regular releases.



Perl has exotic data structures

    1. ARRAYS OF ARRAYS 
    2. HASHES OF ARRAYS
    3. ARRAYS OF HASHES
    4. HASHES OF HASHES
https://perldoc.perl.org/perldsc
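For anyone who hasn't run into these, a minimal sketch of a hash of arrays (my own example, not taken from perldsc):

    my %h = (
        fruits => ['apple', 'mango'],
        colors => ['red', 'blue', 'green'],
    );
    push @{ $h{fruits} }, 'banana';       # dereference the array ref to modify it
    print $h{colors}[1], "\n";            # arrow is optional between subscripts: prints "blue"
    print scalar @{ $h{fruits} }, "\n";   # prints 3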


You probably jest, but perl introduced plain arrays and hashes long before introducing references, and this created a lot of confusion later. Very simple JSON-like structures are pretty confusing in perl now. In most other dynamic languages the distinction between an array and a reference to an array is nonexistent, thankfully.

There are some actually exotic data structures, like pseudo-hashes. And perl can do this! $x = \$x

But having switched to ruby 15 years ago I never missed this particular quirk.


Distinguishing between the two can have massive performance impacts. Another reason is that perl is closer to C than a lot of other languages in this regard.

Copying a reference (which often happens when passing an argument to a function or method) is much quicker (only a 4 or 8 byte copy depending on arch) than doing a single-level shallow copy (4 x length(array) or 8 x length(array) bytes). Obviously, the larger the array, the larger the impact.
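A small sketch of the difference (my own illustration; the sub names are made up):

    my @big = (1 .. 1_000_000);

    sub by_ref  { my ($aref) = @_; return scalar @$aref }   # copies one scalar (the reference)
    sub by_copy { my @copy  = @_;  return scalar @copy  }   # copies every element into @copy

    by_ref(\@big);    # cheap regardless of array size
    by_copy(@big);    # cost grows with the array length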


Well, then why bother copying at all? Ruby passes references everywhere. Those are not pointers, of course, but for small arrays AFAIR those arrays (pointers to the actual data) are contained in the reference itself. And logically it simplifies things so much.



