I've been noticing more Swiss-army-knife-like CLI tools in the last few years. It would be cool if some of them supported Avro/Parquet/ORC formats. This one is notable because it's written in D by a megacorp.
Big fan of visidata https://visidata.org/ - one of those tools that does a lot but makes useful things easy with little complexity. For example: open a file, select a column, and Shift-F gives a histogram. Enter on one of those rows filters the data down to just them.
Can load a variety of data formats + anything pandas can (so parquet works).
Edit - you can do a lot with it, quickly creating columns with Python expressions, plots, joining tables, etc., but it's great for things like "pipe in kubectl output - get counts of each job status, wait, why are 10 failed, filter to those..." which works easily with any fixed-width output.
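Something like this is my usual starting point (the fixed-width loader name is from memory, so treat it as approximate):

    # pipe fixed-width kubectl output straight into visidata
    kubectl get jobs --all-namespaces | vd -f fixed
    # then Shift-F on the STATUS column for a frequency table,
    # and Enter on a row to filter down to just those jobs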
Thanks for this. Looks perfect for my needs. I've got to analyse a hideous generated CSV file with thousands of columns. I was planning to load it into SQLite3 but it has too many columns for import. I was also unpleasantly surprised to find that the column names were too long for PostgreSQL too. Python + pandas can handle it, but a tool for quick interactive exploration is just what I need. Hopefully visidata will provide!
If you still hit size issues, you might be able to run things through xsv first (list columns, select just a few columns and pipe them out, get stats on column contents, samples, etc.). It's another of my go-to tools https://github.com/BurntSushi/xsv
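For the thousands-of-columns case, a first pass might look something like this (column names made up):

    xsv headers huge.csv            # list all the column names
    xsv select id,status,amount huge.csv \
      | xsv sample 1000 \
      | xsv stats \
      | xsv table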
> gron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute 'path' to it. It eases the exploration of APIs that return large blobs of JSON but have terrible documentation.
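A rough illustration of what that looks like in practice (the API and field names here are placeholders):

    $ gron https://api.example.com/widgets | grep price
    json.widgets[0].price = 9.99;
    json.widgets[3].price = 12.5;
    $ gron https://api.example.com/widgets | grep price | gron --ungron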
* vd (visidata)
* xsv (often piping into vd)
* jq (though I often forget the syntax =P)
* gron (nice for finding the json-path to use in the jq script - see the example after this list)
* xmlstarlet (think jq for xml)
* awk (or mawk when I need even more speed)
* GNU parallel (or just xargs for simple stuff)
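The gron-then-jq combo in practice looks roughly like this (paths are from a hypothetical kubectl dump, so adjust to taste):

    # find the path with gron...
    kubectl get jobs -o json | gron | grep -i failed
    # ...then use it in the jq script
    kubectl get jobs -o json | jq '.items[] | select(.status.failed != null) | .metadata.name'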
Miller (mlr) is now my go-to tool for csv/tsv manipulation.
I've used it from time to time during the past year and it's been a joy to use.
I must admit that I did not try any other tool for the purpose so I don't know how it compares.
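For reference, the kind of thing I use it for most (field names invented for the example):

    # CSV in, pretty-printed summary out: mean and count of amount per region
    mlr --icsv --opprint stats1 -a mean,count -f amount -g region sales.csv

    # or just convert CSV to JSON
    mlr --icsv --ojson cat sales.csv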
I know that everything looks like a nail if your only tool is a hammer, and it's fun to nail together square wheels out of plywood, but there are actually other tools out there with built-in, compiled, optimized, documented, tested, well maintained, fully compliant JSON parsers.
These tools probably do a good job at processing CSV/TSV/DSV (haven't tried them). However, I would love it if we could just stop using delimiter-separated value files altogether.
Why? Because the file format is an underspecified, unsafe mess. When you get this kind of file you have to manually specify its schema when reading it, because the file doesn't contain one. Also, due to its underspecification, there are many unsafe implementations that produce broken files that cannot be read without manual fixing in a text editor. Let's just start using safe, well-specified file formats like Avro, Parquet or ORC.
As a data scientist, I have had lots of issues because the data I got for a project was a CSV/TSV/DSV file. I recently spat out a rant on this topic, so if you want more details, check out https://haveagooddata.net/posts/why-you-dont-want-to-use-csv...
We can't stop, because it's a de facto standard format for exchange with spreadsheet programs. So long as that's ubiquitous, we might as well write tools to make processing them easier.
Also, I'm not sure why you called CSV unsafe. It's certainly the case that it's severely under-specified, but I don't think there's anything unsafe about it.
One example of it being unsafe that happened to me: I got a CSV file written by a program with a broken CSV writer that didn't quote string fields when there was a newline in them (in my case only half of a newline: a bare carriage return). Then I read the file with a broken CSV reader that assumed the carriage return meant a new record and filled both parts of the broken line with N/As instead of throwing an error. As a result, the data in the sink didn't match the data in the source. That is a loss of data integrity, which I would call unsafe. It doesn't happen if you have a file format that serializes your data safely.
Due to the format being underspecified, many people roll their own unsafe CSV writer or CSV reader, thus every CSV file (where you don't completely control the source) is potentially broken.
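To make it concrete, a compliant writer/reader pair keeps that record intact; a sketch (using xsv here only because it has a strict parser, and the expected count is my assumption):

    # one record whose first field contains a bare carriage return
    printf 'name,count\n"line1\rline2",2\n' > ok.csv
    xsv count ok.csv    # a compliant reader should report 1 record
    # a broken writer would emit the field unquoted, and a naive
    # reader would then split it into two rows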
Edit: Browsing your Github account I found that you implemented a CSV parser in Rust. I didn't know that when I wrote the above comment, so I was definitely not trying to imply that your particular CSV parser is unsafe.
What makes you think that if people manage to misimplement CSV parsers and generators they are not going to misimplement other formats? At least with CSV it’s always easy to implement some sort of heuristic that splits the rows correctly.
The only times when I had to deal with the issues you describe I was supplied with the data from a literally dying company. They just didn’t give a damn. Changing the file formats wouldn’t change anything - they would still find a way to mess it up.
Ah I see, yeah, from where I come from "unsafe" has a bit more weight to it. I'd call what you describe "silently incorrect." Which is also quite bad, to be fair!
> These tools probably do a good job at processing CSV/TSV/DSV (haven't tried them). However, I would love it if we could just stop using delimiter-separated value files altogether.
I hear ya. I have no doubt that "we" could, as in IT professionals. Maybe even the "surrounding" science fields that provide data could make an effort. But you're out of luck when it comes to almost any other field that serves you data, in my experience.
Any org you can't even tell the details of the CSV you want (what's the "C"? UTF-8? Quoting?) has no chance of providing you with something more complex.
It's partly our fault: The tools we provide them with suck. Excel's CSV handling is atrocious. Salesforce and similar tools seem to spit out barely consistent data dumps.
Sometimes I feel like 80% of the industry is dealing with sanitizing input.
The problem with TSVs and CSVs is that you might get an odd datatype 1TB into a file. For example, what you expect to be an integer value is somehow a string.
TSVx extends TSV to add standard formats for things like headers, which allows for strict typing. You can do things like export a database table and import it losslessly. You can even export from MySQL and import into PostgreSQL, most of the time without pain.
The strict typing also avoids a lot of potential security issues. And in an environment where you control both ends (so you don't need to worry about security of where the file came from), it leads to much nicer APIs: you can refer to things by names rather than column numbers. It's more readable, and if the order or number of columns changes, nothing breaks.
TSVs aren't just numbers. A TSV might be numbers for the first terabyte, and then include a string, just for the heck of it.
I'd call that "dirtying numbers," rather than a removable header which allows you to programmatically validate that your numbers have not, in fact, been dirtied.
TSVx apparently isn't TSV, so it's not a good example of going all the way with TSV.
It has a header that isn't tab-separated. Technically all text files are variable-length TSVs, but a proper TSV file has the same number of columns per row.
I hope to find time to prototype it in https://www.oilshell.org/ first. Although it's very simple -- about as simple as JSON -- I think it needs some real world use first to prove it.
I think adding metadata is orthogonal to cleaning up the problems with TSV, e.g. that you can't represent strings with tabs (or newlines) in fields. Also I wouldn't want a TSV enhancement to depend on YAML, as it's quite big and confusing (e.g. https://news.ycombinator.com/item?id=17359376)
I think it depends on how YAML is used. Requiring full YAML on input is cumbersome for users (e.g. if any downstream user MUST parse ANY YAML). On the other hand, a simple format which happens to be YAML-compatible means anyone with a YAML parser can parse it without reinventing wheels. That seems a good thing.
An example of that is JSON. Any YAML parser will parse JSON.
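For instance, assuming PyYAML is installed:

    $ echo '{"name": "test", "values": [1, 2, 3]}' \
        | python3 -c 'import sys, yaml; print(yaml.safe_load(sys.stdin))'
    {'name': 'test', 'values': [1, 2, 3]}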
TSVX merely seems to guarantee that the output is a simplified (but compatible) form of YAML. All this means is that any YAML parser will read it, and you don't need to write your own. But it's just a colon-delimited dictionary with a few constraints.
Thanks for these. I've rolled my own implementations of many of these over the last 20-odd years, most of them living in ugly shell scripts embedded as aliases in my env. This will make my life easier!
It makes me cry how much time has been invested in formats like CSVs, TSVs, etc. ASCII (and UTF-8) has characters reserved for column, row, and even group separation. Just use them and save a lot of pain.
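For the record, the standard tools handle them fine; a sketch using US (0x1F, octal 037) as the field separator and RS (0x1E, octal 036) as the record separator:

    # write two records using the ASCII unit/record separators
    printf 'alice\037admin\036bob\037user\036' > users.dat
    # read them back, printing the second field of each record
    awk 'BEGIN { FS = "\037"; RS = "\036" } { print $2 }' users.dat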
2. More accurately, people can screw up TSV files from any editor. Have you not seen embedded spaces used instead of tabs and totally mess everyone up?
Like a lot of things, #1 is only partly true, particularly if you have embedded tabs.
The number of times I've had to deal with parsing errors because of embedded carriage returns, commas, tabs, etc., sometimes costing millions of dollars is just... upsetting.
It's like JSON... we say everything can parse it, but really what we've got is some approximation that will come back to haunt us... and once you've done all the work to make sure everything is precise and correct, you'd really have saved time if you'd just used the tools we already had in the first place.
The flip side to that is that JSON and YAML parsers exist in every language, and would be more than capable of replacing any logic you'd find in a CSV parser.
Just use these formats if you want to be stuck in the last century when internationalization was that odd thing you could easily afford to ignore. Otherwise, use a real format.
3. If you don't know how to use your editor, maybe you should learn. If your editor can't even insert all the characters in basic ASCII, it's not an editor.
Go ahead and give every engineer who comes across your meta-delimited file a little rundown on how it works. "Can you show me how to import it into google sheets?" "I need to email it to someone, can you show me how to change it?" "My IDE says I need a plugin to read it? Do you know anything about that? Can you help me set it up?"
They'll all agree how clever and useful the meta characters are of course, but only after you've given them your time in learning about it. No thanks, no thanks, no thanks. For me, I'd rather deal with a little bit of serialization headache than a support headache.
While that might have been a good idea if we'd started a while ago, at this point CSV/TSV are so much more established that switching away from commas or tabs doesn't actually save pain.
But wouldn't you have to account for the possibility that those separators exist in the content you're working with, putting you back at square one in terms of escaping?
That's entirely the point: 0x1C and 0x1E should never actually appear in "normal" text unless someone has explicitly put them there, which is not necessarily true of , \t or \n.
But, there is nothing to stop your values from containing these characters. So, you still have to escape your input. And once you've done that, you might as well just use csv / tsv.
For some reason, in some circles, it seems to be semi-standard to use þ (0xFE aka thorn) as the delimiter and a paragraph symbol (0x14 aka DC4 aka ^T) as the separator. The latter is not to be confused with 0xB6.
Anyway, these characters are presumably not going to occur in ordinary text.
All of the "upper ASCII half" can occur in ordinary text in "pre-Unicode" encodings.
0xFE is a good example - you may get a customer or employee from Iceland with that character in their name (e.g.
https://en.wikipedia.org/wiki/Haf%C3%BE%C3%B3r_J%C3%BAl%C3%A...), or data in Cyrillic cp1251 or koi8-r encoding where 0xFE also represents characters that you'll encounter in surnames, etc.
ISO 2022 aka ECMA-35¹ standardizes how to use it in general, but the only thing that really caught on is a subset of the terminal control extensions (ISO 6429 aka ECMA-48² aka ‘ANSI’). In the hypothetical alternate universe where people use ASCII-based structured data, ISO 2022 would have standard sequences for such metadata.
If you were rolling your own today, you'd probably wrap the metadata in Application Program Command (ESC _ … ESC \) or Start Of String (ESC X … ESC \) or one of the private-use sequences.
¹ https://www.ecma-international.org/publications/standards/Ec...
It's tsv-utils-like, but is strictly a wrapper around existing tools. So filtering and transformations are interpreted literally as awk (or perl) expressions. And the various cmdline options match the standard tool options because they ARE the standard tools. So you get a very friendly learning curve, but something like tsv-utils is probably faster and probably more powerful. And it looks like tsv-utils references fields by number instead of by name. Many of the others (mine included) use the field names, which makes for a MAJOR usability improvement.
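The usual awk workaround for getting names is the header-lookup idiom, roughly (column names here are invented):

    # build a name -> column-number map from the header row,
    # then refer to fields by name
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["status"]) == "failed" { print $(col["job"]), $(col["status"]) }
    ' jobs.tsv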
With respect, who actually feels at home with SQL?
It's about the most alien and obtuse language I've ever had the misfortune of encountering, and in that category I rate it worse than COBOL, FORTRAN, Assembly, C, Prolog, Sendmail re-write rules, BASH, and every other language I've ever encountered and had to use, but which I cannot recall at the moment.
Cool that there are some specialty tools in this space; I'll use them if I hit that kind of performance challenge.
Minor nit: I found the performance table (https://github.com/eBay/tsv-utils/blob/master/docs/Performan...) confusing at first and second glance. Alternating colors indicate... different OSes? That doesn't seem to be the important message to convey, since you are trying to show the speed of your tool and not the OS. I'd recommend using the coloration to differentiate between tools instead of OSes.
With AWKA, in my tests I was able to get an instant 100% speed boost without modifying any of my AWK code. So using AWKA would likely beat the tools presented here, without needing to invest the time to learn a new tool, which for me at least is of paramount importance.
I remember there was a TUI app that could handle CSV, HDF, etc. I can't find it for the life of me though. I remember the author even made their own TUI library for it. I've been looking for it for a while now, so if anyone knows, please let me know.
My pet peeve: open source packages that have no separation of the actual IP from the embodiment in an app or environment. Is it so hard to create a library? And put the real feature in there, instead of wrapped up in some run-on main module?
Thank you both! I too like the Unix philosophy of small tools that do a specific job. In a package like this, people may find they only care about one or two tools. They can ignore the rest and mix and match with other tools as they see fit.
I've had something like this for 10 years, self-made. I use cut, grep, sort, uniq from the system, but added tools to compute counts per field, do value processing like awk (using a one-line Perl script because I didn't want to learn awk), randomise order, split files by percentage, join multiple files by column and pretty-format the data.
Sometimes I make a chain of them 3-4 lines long, at which point I switch to a script.
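Most of it boils down to thin wrappers over things like (from memory, so untested):

    # counts per value of field 3
    cut -f3 data.tsv | sort | uniq -c | sort -rn
    # randomise the order of rows
    shuf data.tsv > shuffled.tsv
    # rough 90/10 split by percentage
    awk 'BEGIN { srand() } { print > (rand() < 0.9 ? "train.tsv" : "test.tsv") }' data.tsv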
I always get excited by these types of tools and then end up going back to awk. How do you beat something that's already pre-installed on every Unix system for the last 40 years?
Having said that ... these do look like they have some useful extras, in particular, around the annoying part of retaining / manipulating header rows ...
I totally agree. awk, sed, grep are amazing and it astonishes me how many people aren't able to fluently write scripts in them. I used to be in that camp until a few months ago, and I used to write everything in Python, but one day I decided to learn shell scripting and I loved it.
awk is to Python what Python is to Java: what we can do in 10 lines of Python can be done in one line of awk, and it doesn't have to be unreadable!
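A typical example (a sketch, assuming tab-separated input): summing a column per key is a one-liner,

    # total of column 2 grouped by the key in column 1
    awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }' data.tsv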
Is there anything you can do with awk or sed that isn't built in to perl? I know I never used awk after I learned the latter, and sed very, very rarely.
Wouldn't a relational database be better, faster to query, and more space-efficient in most cases, e.g. SQLite? I wonder what use cases would favor large collections of text / TSV files.
If you’re getting TSV files from an external source and are interested in just some aggregate results, you’ll get better performance by computing the smaller results and storing these than if you first import, then process, then delete the data in a SQL db. Also you don’t have to store the data twice.
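e.g. with tsv-utils itself, something like this keeps only the aggregate (flags from memory, so double-check them):

    # group by field 1, sum field 3, keep the header
    tsv-summarize -H --group-by 1 --sum 3 events.tsv > daily_totals.tsv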
eBay has an API project that appears to have been half-completed and then abandoned. Trying to understand exactly how to use their APIs is extremely confusing because they don't have anything definitive - it's basically a Frankenstein-like API - so I've given up. I frankly wouldn't trust anything coming from eBay at this point; they appear to have extremely poor developer support and no investment in making their APIs better to use.
Some useful CLI data-wrangling tools --
https://github.com/BurntSushi/xsv
https://github.com/dinedal/textql
https://github.com/n3mo/data-science
https://stedolan.github.io/jq/
https://gitlab.redox-os.org/redox-os/parallel
https://github.com/willghatch/racket-rash
Would you have any others you recommend?