TSV Utilities: Command line tools for large, tabular data files (github.com/ebay)
467 points by bryanrasmussen on Aug 31, 2019 | 117 comments



I've been noticing more Swiss-army-knife-like CLI tools in the last few years. It would be cool if there were some that could support Avro/Parquet/ORC formats. This one is notable because it's written in D by a megacorp.

Some useful cli data wrangling tools --

https://github.com/BurntSushi/xsv

https://github.com/dinedal/textql

https://github.com/n3mo/data-science

https://stedolan.github.io/jq/

https://gitlab.redox-os.org/redox-os/parallel

https://github.com/willghatch/racket-rash

Would you have any others you recommend?


Big fan of visidata https://visidata.org/ - one of those tools that does a lot but easily surfaces useful things with little complexity. For example: open a file, select a column, and Shift-F gives a histogram. Enter on one of the histogram rows filters the data to just those records.

Can load a variety of data formats + anything pandas can (so parquet works).

Edit - you can do a lot with it, quickly creating columns with Python expressions, plots, joining tables, etc., but it's great for things like "pipe in kubectl output - get the numbers of each job status, wait, why are 10 failed, filter to those..." which works easily with any fixed-width output.
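A rough sketch of that kind of session (the -f fixed filetype flag and the kubectl command here are just illustrative):

    # pipe fixed-width command output straight into visidata,
    # then Shift-F on the STATUS column for a histogram
    kubectl get jobs --all-namespaces | vd -f fixed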


Thanks for this. Looks perfect for my needs. I've got to analyse a hideous generated CSV file with thousands of columns. I was planning to load it into SQLite3 but it has too many columns for import. I was also unpleasantly surprised to find that the column names were too long for PostgreSQL too. Python + pandas can handle it, but a tool for quick interactive exploration is just what I need. Hopefully visidata will provide!


If you still hit size issues, you might be able to throw things through xsv first (list columns, select just a few columns and pipe out, get stats on the contents of columns, samples, etc). It's one of my other go-to tools: https://github.com/BurntSushi/xsv
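For example, a quick first pass over a wide CSV might look something like this (file and column names hypothetical):

    # list the column names
    xsv headers wide.csv

    # pull out a few columns and summarize them
    xsv select id,amount,status wide.csv | xsv stats | xsv table

    # grab a random sample to eyeball
    xsv sample 20 wide.csv | xsv table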


I’ve been looking for exactly this!


I recommend "gron"

https://github.com/tomnomnom/gron

> gron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute 'path' to it. It eases the exploration of APIs that return large blobs of JSON but have terrible documentation.

    ▶ gron "https://api.github.com/repos/tomnomnom/gron/commits?per_page=1" | fgrep "commit.author"
    json[0].commit.author = {};
    json[0].commit.author.date = "2016-07-02T10:51:21Z";
    json[0].commit.author.email = "mail@tomnomnom.com";
    json[0].commit.author.name = "Tom Hudson";
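The reverse direction is handy too: filter the flattened assignments, then rebuild JSON from just the matches with --ungron (file name hypothetical):

    ▶ gron api-response.json | fgrep "commit.author" | gron --ungron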


https://github.com/dflemstr/rq

jq but supports avro, messagepack, protocol buffers, etc.


My go-to tools:

    * vd (visidata)
    * xsv (often piping into vd)
    * jq (though I often forget the syntax =P)
    * gron (nice for finding the json-path to use in the jq script)
    * xmlstarlet (think jq for xml)
    * awk (or mawk when I need even more speed)
    * GNU parallel (or just xargs for simple stuff)



I wrote one for working with newline-delimited JSON and JavaScript expressions: https://github.com/mbostock/ndjson-cli


RBQL: https://github.com/mechatroner/RBQL - SQL dialect with Python or JS expressions. I am the author, BTW


https://github.com/johnkerl/miller

Miller (mlr) is now my go-to tool for csv/tsv manipulation. I've used it from time to time during the past year and it's been a joy to use. I must admit that I have not tried any other tool for the purpose, so I don't know how it compares.

It has a nice verb system, and verbs can be combined with "then" statements: https://johnkerl.org/miller/doc/reference-verbs.html . There is also a small DSL to use with the "put" and "filter" verbs: https://johnkerl.org/miller/doc/reference-dsl.html . Another verb I find very useful is "join", which does pretty much what you would expect it to do.
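A typical invocation, chaining verbs with "then" (column and file names hypothetical):

    # keep two columns, filter on one, then sort descending by it;
    # the DSL refers to fields by name with $
    mlr --tsv cut -f user,amount then filter '$amount > 100' then sort -nr amount data.tsv

    # join against a lookup table on the "user" field
    mlr --tsv join -j user -f users.tsv data.tsv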


jwalk parses JSON into a stream of TSV records: https://github.com/shellbound/jwalk/


>parses large documents slowly, but steadily, in memory space proportional to the key depth of the document

Only if parsing JSON with shell scripts and awk is your idea of the ideal way to "slowly, but steadily" get the job done.

https://github.com/shellbound/jwalk/blob/master/lib/jwalk/co...

I know that everything looks like a nail if your only tool is a hammer, and it's fun to nail together square wheels out of plywood, but there are actually other tools out there with built-in, compiled, optimized, documented, tested, well maintained, fully compliant JSON parsers.


CRUSH looks interesting, but you’ll have to poke around to find out what the tool(s) can do:

https://github.com/google/crush-tools/blob/wiki/CrushTutoria...


Gron -- Make JSON Greppable https://github.com/TomNomNom/gron


If you know a bit of Python, you can use it directly on streams in your shell.

https://github.com/python-mario/mario


If someone gets around to porting it to Java, "wario" would make a great cousin.


Crush Tools

No idea how it compares to others but has been very handy.

https://github.com/google/crush-tools



These tools probably do a good job at processing CSV/TSV/DSV (I haven't tried them). However, I would love it if we could just stop using delimiter-separated value files altogether.

Why? Because the file format is an underspecified, unsafe mess. When you get this kind of a file you have to manually specify its schema when reading it because the file doesn't contain it. Also, due to its underspecification, there are many unsafe implementations that produce broken files that cannot be read without manual fixing in a text editor. Let's just start using safe, well-specified file formats like AVRO, Parquet or ORC.

As a data scientist, I have had lots of issues because the data I got for a project was a CSV/TSV/DSV file. I recently spat out a rant on this topic, so if you want more details, check out https://haveagooddata.net/posts/why-you-dont-want-to-use-csv...


This is addressed in the xsv readme: https://github.com/BurntSushi/xsv#motivation

We can't stop, because it's a de facto standard format for exchange with spreadsheet programs. So long as that's ubiquitous, we might as well write tools to make processing them easier.

Also, I'm not sure why you called CSV unsafe. It's certainly the case that it's severely under-specified, but I don't think there's anything unsafe about it.


> Also, I'm not sure why you called CSV unsafe.

One example of it being unsafe that happened to me: I got a CSV file written by a program with a broken CSV writer that didn't quote string fields when there was a newline in them (in my case only the first half of a newline: a carriage return). Then I read the file with a broken CSV reader that assumed the carriage return meant a new record and filled both parts of the broken line with N/As instead of throwing an error. The data in the sink no longer matched the data in the source. That is a loss of data integrity, which I would call unsafe. It doesn't happen with a file format that serializes your data safely.

Due to the format being underspecified, many people roll their own unsafe CSV writer or CSV reader, thus every CSV file (where you don't completely control the source) is potentially broken.

Edit: Browsing your Github account I found that you implemented a CSV parser in Rust. I didn't know that when I wrote the above comment, so I was definitely not trying to imply that your particular CSV parser is unsafe.


What makes you think that if people manage to misimplement CSV parsers and generators they are not going to misimplement other formats? At least with CSV it’s always easy to implement some sort of heuristic that splits the rows correctly.

The only times when I had to deal with the issues you describe I was supplied with the data from a literally dying company. They just didn’t give a damn. Changing the file formats wouldn’t change anything - they would still find a way to mess it up.


Ah I see, yeah, from where I come from "unsafe" has a bit more weight to it. I'd call what you describe "silently incorrect." Which is also quite bad, to be fair!


> These tools probably do a good job at processing CSV/TSV/DSV (haven't tried them). However, I would love if we could just stop using delimiter-separated value files alltogether.

I hear ya. I have no doubt that "we" could, as in IT professionals. Maybe even the "surrounding" science fields that provide data could make an effort. But you're out of luck when it comes to almost any other field that serves you data, in my experience.

Any org that you can't even discuss the details of the CSV you want with (what's the "C"? UTF-8? Quoting?) has no chance of providing you with something more complex. It's partly our fault: the tools we provide them with suck. Excel's CSV handling is atrocious. Salesforce and similar tools seem to spit out barely consistent data dumps.

Sometimes I feel like 80% of the industry is dealing with sanitizing input.


Another one is the recent https://github.com/cube2222/octosql/ with a sql-based approach (I'm one of the authors)


Miller

https://github.com/johnkerl/miller

I'd normally pass, but this is written in D so I'm looking forward to trying it out.


PowerShell.


Seems like a waste to do CSV->TSV without going all the way:

http://www.tsvx.org/

The problem with TSVs and CSVs is that you might get an odd datatype 1TB into a file. For example, what you expect to be an integer value is somehow a string.

TSVx extends TSV to add standard formats for things like headers, which allows for strict typing. You can do things like export a database table and import it losslessly. You can even export from MySQL and import into PostgreSQL most of the time without pain.

The strict typing also avoids a lot of potential security issues. And in an environment where you control both ends (so you don't need to worry about security of where the file came from), it leads to much nicer APIs: you can refer to things by names rather than column numbers. It's more readable, and if the order or number of columns changes, nothing breaks.


It wouldn't seem that tsvx is "going all the way"; it's a completely different thing that dirties your numbers with silly metadata.


TSVs aren't just numbers. A TSV might be numbers for the first terabyte, and then include a string, just for the heck of it.

I'd call that "dirtying numbers," rather than a removable header which allows you to programmatically validate that your numbers have not, in fact, been dirtied.


tsvx apparently isn't TSV, so it's not a good example of going all the way with TSV.

It has a header that isn't tab-separated. Technically all text files are variable-length TSVs, but a proper TSV file has the same number of columns per row.


Hm interesting, I hadn't seen tsvx. Where is it being used? Are you part of its development?

FWIW I have my own TSV enhancement that is more modest:

https://github.com/oilshell/oil/wiki/TSV2-Proposal

I hope to find time to prototype it in https://www.oilshell.org/ first. Although it's very simple -- about as simple as JSON -- I think it needs some real world use first to prove it.

I think adding metadata is orthogonal to cleaning up the problems with TSV, e.g. that you can't represent strings with tabs (or newlines) in fields. Also I wouldn't want a TSV enhancement to depend on YAML, as it's quite big and confusing (e.g. https://news.ycombinator.com/item?id=17359376)


I think it depends on how YAML is used. Requiring full YAML on an input is cumbersome on users (e.g. if any downstream user MUST parse ANY YAML). On the other hand, a simple format which happens to be YAML-compatible means anyone with a YAML parser can parse it without reinventing wheels. That seems a good thing.

An example of that is JSON. Any YAML parser will parse JSON.

TSVX merely seems to guarantee that the output is a simplified (but compatible) form of YAML. All this means is that any YAML parser will read it, and you don't need to write your own. But it's just a colon-delimited dictionary with a few constraints.


Interesting. Anyone have any experience of using tsvx in the wild?


Hi all, primary tsv-utils author here. Thanks for checking out the tools! I hope some of you find them beneficial.


Thanks for these. I've rolled my own implementation of many of these over the last 20-odd years, most of them living in ugly shell scripts embedded as aliases in my env. This will make my life easier!


Appreciate it, sincerely. Any chance this moves Paypal to offer more real-time reports vs submit-and-wait?


Thanks! These are awesome. The rationale provided is excellent and really explains the design decisions made. Kudos!


It makes me cry how much time has been invested in formats like CSVs, TSVs, etc. ASCII (and UTF-8) has characters reserved for column, row, and even group separation. Just use them and save a lot of pain.


We've discussed this at length and I'm squarely in the DO NOT camp. The drawbacks of using non-readable meta characters exceed the benefits:

1. TSV can be read and imported by almost everything.

2. People can add and adjust TSV files from any editor.

3. What's the way to insert meta characters again in Vim? And now in nano? Argh, I'll just try to copy and paste it. Ugh, that doesn't work.

Just use CSV/TSV folks. Anything more complicated and reach for a better serialization format (json, yaml) and not a better delimiter.


2. More accurately, people can screw up TSV files from any editor. Have you not seen embedded spaces used instead of tabs and totally mess everyone up?


Like a lot of things, #1 is only partly true, particularly if you have embedded tabs.

The number of times I've had to deal with parsing errors because of embedded carriage returns, commas, tabs, etc., sometimes costing millions of dollars is just... upsetting.

It's like JSON... we say everything can parse it, but really what we've got is some approximation that will come back to haunt us... and once you've done all the work to make sure everything is precise and correct, you'd really have saved time if you'd just used the tools we already had in the first place.


Dealing with tabs and carriage returns has sometimes cost you millions of dollars? If so, that sounds like a great story and I would love to hear it!


Especially if (as it sounds) it's happened more than once!


Both JSON and YAML are very difficult to construct parsers for.

To make matters worse, both formats are full of pitfalls:

http://seriot.ch/parsing_json.php

https://arp242.net/yaml-config.html

using either of these, in my opinion, is extremely misguided.


Can't find a source for it now, but one of the design goals of JSON iirc was that it should be easy to parse.

You should have a go at writing a JSON parser to convince yourself it isn't hard. You can manually lex it in less than 150 lines of Python.


The flip side to that is that json and yaml parsers exist in every language, and would be more than capable of replacing any logic you’d find in a CSV.


CSV's are a data structure. They do not contain any logic.


Just use these formats if you want to be stuck in the last century when internationalization was that odd thing you could easily afford to ignore. Otherwise, use a real format.


3. If you don't know how to use your editor, maybe you should learn. If your editor can't insert even all the characters in basic ASCII, it's not an editor.


Go ahead and give every engineer who comes across your meta-delimited file a little rundown on how it works. "Can you show me how to import it into google sheets?" "I need to email it to someone, can you show me how to change it?" "My IDE says I need a plugin to read it? Do you know anything about that? Can you help me set it up?"

They'll all agree how clever and useful the meta characters are, of course, but only after you've given them your time in learning about it. No thanks, no thanks, no thanks. For me, I'd rather deal with a little bit of serialization headache than a support headache.


...and this is why we can't have nice things.


While that might have been a good idea if we'd started a while ago, at this point CSV/TSV are so much more established that using commas or tabs doesn't actually save pain.


But wouldn't you have to account for the possibility that those separators exist in the content you're working with, putting you back at square one in terms of escaping?


That's entirely the point: 0x1C and 0x1E should never actually appear in "normal" text unless someone has explicitly put them there, which is not necessarily true of ',', '\t', or '\n'.
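For what it's worth, a minimal sketch of what that looks like with plain awk, using octal escapes for the ASCII unit separator (0x1F) between fields and the record separator (0x1E) between rows (file name hypothetical):

    # write two records, two fields each, using US/RS instead of tab/newline
    printf 'alice\037100\036bob\037200\036' > data.asv

    # read them back; FS and RS can be set to the same control characters
    awk 'BEGIN { FS = "\037"; RS = "\036" } { print $1, $2 }' data.asv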


We took the wrong path quite some time ago. Most of the POSIX utils like sort, cut, grep only support \n (or sometimes \0) as row separator.


Are you aware of $IFS?


I think $IFS is only used by the shell (with read and expansions), not by e.g. sort. It is so at least on my system.


But, there is nothing to stop your values from containing these characters. So, you still have to escape your input. And once you've done that, you might as well just use csv / tsv.


Actually, the whole point of those characters is that they cannot be used in values.


For some reason, in some circles, it seems to be semi standard to use þ (0xFE aka thorn) as the delimiter and a paragraph symbol (0x14 aka DC4 aka ^T) as the separator. The latter is not to be confused with 0xB6.

Anyway, these characters are presumably not going to occur in ordinary text.
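If you ever need to normalize one of those load files into a plain TSV, and assuming the DC4 character separates fields while the thorn just wraps values, tr handles it in one pass (0x14 is octal \024, 0xFE is octal \376; file names hypothetical):

    # swap the DC4 separator for tabs and drop the thorn "quote" character
    tr '\024' '\t' < export.dat | tr -d '\376' > export.tsv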


All of the "upper half" of the byte range (0x80-0xFF) can occur in ordinary text in "pre-Unicode" encodings.

0xFE is a good example - you may get a customer or employee from Iceland with that character in their name (e.g. https://en.wikipedia.org/wiki/Haf%C3%BE%C3%B3r_J%C3%BAl%C3%A...), or data in Cyrillic cp1251 or koi8-r encoding where 0xFE also represents characters that you'll encounter in surnames, etc.


It would be escaped in that case.


Using CSV and tools like grep and join is very fast for large datasets.
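For example, joining two tab-delimited files on the first column (coreutils join just needs both inputs sorted on the key; file names hypothetical):

    join -t $'\t' -1 1 -2 1 <(sort -k1,1 a.tsv) <(sort -k1,1 b.tsv)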


Using the reserved separator characters wouldn't change that.


I always used cut, sort, and uniq a lot too.


How do you do (for example) column names or types with ASCII only?

How do you extend ASCII?


ESC is the ASCII character for code extension.

ISO 2022 aka ECMA-35¹ standardizes how to use it in general, but the only thing that really caught on is a subset of the terminal control extensions (ISO 6429 aka ECMA-48² aka ‘ANSI’). In the hypothetical alternate universe where people use ASCII-based structured data, ISO 2022 would have standard sequences for such metadata.

If you were rolling your own today, you'd probably wrap the metadata in Application Program Command (ESC _ … ESC \) or Start Of String (ESC X … ESC \) or one of the private-use sequences. ¹ https://www.ecma-international.org/publications/standards/Ec...

² https://www.ecma-international.org/publications/standards/Ec...


There are lots of tools in this space that are similar in a very general way but have widely different design choices. My toolkit is

https://github.com/dkogan/vnlog

It's tsv-utils-like, but is strictly a wrapper around existing tools. So filtering and transformations are interpreted literally as awk (or perl) expressions. And the various cmdline options match the standard tool options because they ARE the standard tools. So you get a very friendly learning curve, but something like tsv-utils is probably faster and probably more powerful. And it looks like tsv-utils references fields by number instead of by name. Many of the others (mine included) use field names, which is a MAJOR usability improvement.

Other tools in no particular order:

https://csvkit.readthedocs.io/

https://github.com/johnkerl/miller

https://github.com/eBay/tsv-utils-dlang

http://harelba.github.io/q/

https://github.com/BatchLabs/charlatan

https://github.com/dinedal/textql

https://github.com/BurntSushi/xsv

https://github.com/dbohdan/sqawk

https://stedolan.github.io/jq/

https://github.com/benbernard/RecordStream


Adding another one which I'm one of the authors of: https://github.com/cube2222/octosql/

It tries to provide ergonomic data munging of various formats, but using a SQL interface, which most people will probably feel immediately at home with.


With respect, who actually feels at home with SQL?

It's about the most alien and obtuse language I've ever had the misfortune of encountering, and in that category I rate it worse than COBOL, FORTRAN, Assembly, C, Prolog, Sendmail re-write rules, BASH, and every other language I've ever encountered and had to use, but which I cannot recall at the moment.


> With respect, who actually feels at home with SQL?

So many data analysts that each time a new analytical database engine comes out it needs to support SQL or something like it to get any mindshare.


SQL gets rather mystifying when you move towards complex joins and data mapping. PSQL can also feel sadly out of place.

But at its foundation there is a real, nice clarity in stating what you want.


I am! SQL is the language of data and analytics and it can be quite a joy to use. But maybe I've just Stockholm'd myself into liking it.



Cool that there are some specialty tools in this space and will use if I hit that performance challenge.

Minor nit: I found the performance table (https://github.com/eBay/tsv-utils/blob/master/docs/Performan...) confusing at first and second glance. Alternating colors indicate... different OSes? That doesn't seem to be the important message to convey, since you are trying to show the speed of your tool and not the OS. I'd recommend using the coloration to differentiate between tools instead of OSes.


Thanks for the feedback. I agree, the performance table would benefit from better formatting.


Worth noting clickhouse-local: a full-featured ClickHouse SQL engine for files in CSV/TSV/JSONLines, whatever...

https://www.altinity.com/blog/2019/6/11/clickhouse-local-the...
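A rough sketch of what an invocation looks like (assuming the default stdin table name of "table"; schema and file names hypothetical):

    clickhouse-local \
        --input-format TSV \
        --structure "user String, amount Float64" \
        --query "SELECT user, sum(amount) FROM table GROUP BY user ORDER BY 2 DESC" \
        < payments.tsv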


Nice! Or even just SQLite.
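A minimal sketch of the SQLite route (.mode tabs makes .import treat the input as TSV, and the first row becomes the column names when the table doesn't exist yet; file and table names hypothetical):

    sqlite3 scratch.db <<'EOF'
    .mode tabs
    .import data.tsv data
    SELECT status, count(*) FROM data GROUP BY status ORDER BY 2 DESC;
    EOF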


With AWKA, in my tests I was able to get an instant 100% speed boost without modifying any of my AWK code. So using AWKA would likely beat the tools presented here, without needing to invest the time to learn a new tool, which for me at least is of paramount importance.

https://github.com/noyesno/awka


I remember there was a TUI app that could handle CSV, HDF, etc. I can't find it for the life of me though. I remember the author even made their own TUI library for it. I've been looking for it for a while now; if anyone knows, please let me know.


must be visidata


I think that's the one, thanks so much.


Dear author: thank you for making multiple small programs instead of a single command with 400 flags.


Agreed!

My pet peeve: open source packages that have no separation of the actual IP from its embodiment in an app or environment. Is it so hard to create a library and put the real feature in there, instead of wrapping it up in some run-on main module?

Kudos to this writer for doing it well.


Thank you both! I too like the Unix philosophy of small tools that do a specific job. In a package like this, people may find they only care about one or two tools. They can ignore the rest and mix and match with other tools as they see fit.


I've had something like this for 10 years, self-made. I use cut, grep, sort, and uniq from the system, but added tools to compute counts per field, do value processing like awk (using a one-line Perl script because I didn't want to learn awk), randomise order, split files by percentage, join multiple files by column, and pretty-format the data.

Sometimes I make a chain of them 3-4 lines long, at which point I switch to a script.
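A typical link in one of those chains, roughly (field number and file name hypothetical):

    # top 10 most frequent values of column 3, with counts
    cut -f3 data.tsv | sort | uniq -c | sort -rn | head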


I always get excited by these types of tools and then end up going back to awk. How do you beat something that's already pre-installed on every Unix system for the last 40 years?

Having said that ... these do look like they have some useful extras, in particular, around the annoying part of retaining / manipulating header rows ...


I totally agree. awk, sed, and grep are amazing, and it astonishes me how many people can't fluently write scripts with them. I used to be in that camp until a few months ago and wrote everything in Python, but one day I decided to learn shell scripting and I loved it.

awk is to Python what Python is to Java: what we can do in 10 lines of Python can be done in one line of awk, and it doesn't have to be unreadable!
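For instance, a group-by-and-sum that would take a loop and a dict in Python is a single line (file name hypothetical):

    # sum column 2 per distinct value of column 1 in a TSV
    awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }' data.tsv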

The main pain in the neck is finding a good tutorial. As usual, I started documenting what I learnt in an end-to-end guide here: https://github.com/thewhitetulip/awk-anti-textbook


The original awk book from the creators is in my opinion really good. https://archive.org/download/pdfy-MgN0H1joIoDVoIC7/The_AWK_P...


I found it to be a good introduction, but the issue is that it does not go into depth and teach by example.


I totally agree about the unnecessary verbosity.

For a more concise Python on the command line, I like Mario: https://github.com/python-mario/mario


Is there anything you can do with awk or sed that isn't built in to perl? I know I never used awk after I learned the latter, and sed very, very rarely.


I tried to learn perl but I gave up halfway.


> How do you beat something that's already pre-installed on every Unix system for the last 40 years?

One simple and useful way is by being much faster, which is the case for these tools. There are benchmarks linked in the README.


I agree. Mastering any tool of that sort is a lot of effort and I'd rather invest it in a tool that's reliable and always available.

If you have to explore data anyway, using basic unix tools, piping one into the other, always works. You end up with verbose, but readable code.

Put it into a script, make sure it works in a more general case and you might get close to what some of the more specialized tools can do.


For TSV, coreutils, awk, etc. are nice to work with for sure, but for CSV, not so much.


awk -F, ?

Until you start dealing with quotes.


perl -MText::CSV will take care of that.
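Something like this, as a quote-aware CSV-to-TSV pass (a sketch; embedded tabs or newlines inside fields would still need their own escaping for strict TSV):

    perl -MText::CSV -e '
        my $csv = Text::CSV->new({ binary => 1 });
        while (my $row = $csv->getline(*STDIN)) {
            print join("\t", @$row), "\n";
        }
    ' < data.csv > data.tsv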


At the risk of being flippant: sed 's/,/\t/g'


And hope you don't have any commas in your data?


In my line of work, that's an excellent assumption.


Must be nice to start work with clean data. Allow me to assure you that there are many, many others who don't have the luxury.


I feel like this might explain the atrocious Unicode handling and special character escaping in most eBay properties... Looking at you, Gumtree!


Check out Nushell if you missed it a few days ago on HN: https://github.com/nushell/nushell

Related HN thread: https://news.ycombinator.com/item?id=20783006


Looks cool! The presentation of the benchmark data could be improved by charting the values, however.


Mario is a tool for manipulating data in your shell with Python functions. It supports CSV, JSON, TOML, YAML, and XML.

https://github.com/python-mario/mario



Wouldn't a relational database (e.g. SQLite) be better: faster to query and more space-efficient in most cases? I wonder what use cases would favor large collections of text/TSV files.


If you’re getting TSV files from an external source and are interested in just some aggregate results, you’ll get better performance by computing the smaller results and storing these than if you first import, then process, then delete the data in a SQL db. Also you don’t have to store the data twice.
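That's essentially what tsv-summarize in this package is for: compute the aggregate directly off the file, with no import step (assuming its --group-by/--count/--sum options; field numbers and file name hypothetical):

    # group by field 1, count rows and sum field 3, keeping the header line
    tsv-summarize --header --group-by 1 --count --sum 3 payments.tsv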


People use delimited text files to transfer data from one database to another. I'm thinking of lawyers in particular.

https://en.wikipedia.org/wiki/Load_file


`keep-header` is a superb gem of do-one-thing composability.
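For anyone who hasn't tried it: it runs the rest of the command on everything after the first line, so sorting a file without mangling the header is just (file name hypothetical):

    # numeric sort on the second column; the header line stays on top
    keep-header data.tsv -- sort -t$'\t' -k2,2n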


You might want to check out Apache Drill. It allows you to access structured files as database tables through JDBC/ODBC.


Dremio replaces Drill for a lot of that too; some of the original team worked on it. Here's a tutorial from some folks who have used it: http://www.helicalinsight.com/technical-guide/connecting-csv...


eBay has an API project that appears to have been half-completed and then abandoned. Trying to understand exactly how to use their APIs is extremely confusing because they don't have anything definitive; it's basically a Frankenstein-like API, so I've given up. I frankly wouldn't trust anything coming from eBay at this point; they appear to have extremely poor developer support and no investment in making their APIs better to use.



