I spend a lot of time in the terminal and want to quickly glance at csv files without writing a new script, opening excel, or using a tui. I made tidy-viewer (tv) because current tools like cat and column were not pretty enough.
tv transforms the raw file for display in the following ways:
1. NA detection and highlighting
2. Printing only significant digits
3. Header and footer meta data
I have been using this a lot at work. There is a lot more work to do, but it is in a usable state.
Give it a try! If you like it then star on Github!
The NA detection and highlighting is nice, but I'm not sure how I feel about showing anything other than the exact textual value. I don't mind abridging quotes when they're not necessary, but showing "N/A", NA, etc. as the same value is a bit iffy.
I was going to complain about null bytes in text (never, period), but then realized you actually did mean the U+2400 SYMBOL FOR NULL[0] character itself. That's surprisingly viable (though you do now have to worry about the string "\xE2\x90\x80" ending up in your data).
0: Which is actually incorrectly named - it should be "SYMBOL FOR NUL".
It is rough. There are many ways that different tools put NAs, na, N/A, "", etc. in a file. To choose only "NA" would mean I would be excluding the output of other tools. I chose accessibility over specificity. #trade-offs.
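To make that concrete, here is a minimal sketch (in Rust, since that is what tv is written in) of what lenient NA detection could look like; the set of spellings matched here is illustrative, not tv's actual list:

    // Sketch of lenient NA detection; the spellings matched here are
    // illustrative, not tv's actual list.
    fn is_na(cell: &str) -> bool {
        matches!(
            cell.trim().to_lowercase().as_str(),
            "" | "na" | "n/a" | "nan" | "null" | "none" | "missing"
        )
    }

    fn main() {
        assert!(is_na("N/A"));
        assert!(is_na("  NULL "));
        assert!(!is_na("Navarre")); // whole-cell match only, not a prefix match
    }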
Fair enough, but that doesn't explain why you chose to display all of them as "NA". As you say, there are lots of different ones, hence it would be a bad idea to pick one as the 'default' to display. To me it's important whether something is missing, filled with "N/A", or "null", or "Not Applicable", etc.
Simple: provide CLI switches to let the user decide what they want for NA detection (current behavior as default; the user can provide alternate NA values, per the source file or the natural language it is expressed in), and how they want NAs displayed: as-is, blank, or a consistent custom value (as-is should be the default).
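For illustration only, those switches might look something like this with the clap v4 derive API; the flag names and defaults are invented here, not tv's actual interface:

    // Hypothetical NA-handling switches, sketched with the clap v4 derive API.
    // Flag names and defaults are invented for illustration.
    use clap::Parser;

    #[derive(Parser)]
    struct Cli {
        /// Extra strings to treat as missing, e.g. --na-values n/a --na-values "-"
        #[arg(long = "na-values")]
        na_values: Vec<String>,

        /// How to display missing cells: as-is, blank, or a custom literal
        #[arg(long = "na-display", default_value = "as-is")]
        na_display: String,
    }

    fn main() {
        let cli = Cli::parse();
        println!("extra NA values: {:?}", cli.na_values);
        println!("display mode: {}", cli.na_display);
    }

A "pretty"/"literal" split like the one suggested below could hang off the same mechanism.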
Well, I think being promiscuous with "NA", "N/A", nan, etc. is a separate issue from a blank cell. A blank cell is literally missing. That should be filled with NA.
haha. You are right, "NA" stands for "Not Applicable". That is not always how people/programs use it though. What are some alternatives that you would suggest? I am happy to learn.
I would suggest similar to what other people have suggested where you color the background of the cell red and then just display the literal content of the cell. I think it would be reasonable to have this configurable via command line arguments though, so if you like the "NA" that could also be a mode.
Perhaps it would make sense to have a "pretty" mode and a "literal" mode (which would also turn off the clever processing of numbers)?
First of all - kudos on tackling this task - it is indeed very annoying to get CSVs to render nicely on a terminal.
1. How does tidy-viewer compare with csvlook?
2. Looking at the demo video, there seems to be an odd fixation with "N/A". The CSV spec, AFAIK, doesn't recognize this phrase. I don't understand why someone would expect a quoted string field whose raw characters are "n/a" should be rendered as anything other than n/a (i.e. lowercase and without the quotes). I'm guessing maybe in your workflow you want to use that phrase a lot, but for a tool for the general public I'd not do this kind of interpretation; and I would leave an empty field as empty.
3. tidy-viewer seems to require "unstable library features", or at least ones which were unstable as of Rust 1.48.0. It would be nice if you could be compatible with older rust distributions/versions.
4. Many systems, especially older ones, especially ones which you access remotely and don't have root privileges on, won't have a rust installation. It would be even more convenient if you could provide binaries with little or no extra dynamic library dependencies, which could be used on older / rustless systems. I realize this is a tall order, however.
5. What about scrolling? The worst part of viewing CSVs is having to handle wide ones which exceed the terminal width, and having decent horizontal as well as vertical scrolling ability. less doesn't cut it, because it doesn't keep the header row, plus it doesn't recognize field widths.
6. tidy-viewer does not seem to support wrapping longer fields onto multiple terminal lines.
7. When the user doesn't specify the color scheme, are you choosing one based on the terminal colors, or are you using absolute color values? I suggest the former.
8. tidy-viewer loads and parses the entire CSV immediately; and, in fact, seems to keep two copies of it in memory at once. This means it cannot be used with large files without thrashing; and even if your CSV does fit in global memory, it will still be kind of unusable, trying to dump gigabytes onto the terminal.
Bottom line: A nice initial effort, but the more serious challenges are yet to be tackled, plus it needs to be more robustly cross-platform.
The norm of treating missing data as NA exists in R (which the developer of this is clearly inspired by, based on the GitHub readme). Pandas in Python is stuck with NaN for numeric types (not quite correct) and "" or None for string types. Personally I like the choice to both explicitly render missing data in colour and to apply NA as a placeholder text to display that colour.
The less(1) command has horizontal scrolling, just invoke it with the -S or --chop-long-lines options or toggle that feature while paging a file.
I agree it would be nicer if less(1) had a user-configurable header, with a format option to set it to the contents of the first line of a file or stdin (or perhaps the most recent line matching a regex, to allow for multiple tables), and an option to make the header scroll horizontally in -S or --chop-long-lines mode.
> First of all - kudos on tackling this task - it is indeed very annoying to get CSVs to render nicely on a terminal.
> How does tidy-viewer compare with csvlook?
The most important issue to me is that csvlook is a much less pleasant viewing experience, but there is also this: csvlook reads and parses all of the data. Try pushing diamonds.csv to csvlook. When I do it on my machine it takes 15.228 seconds, while tv takes 0.0042 seconds. For this reason tv is much faster, but speed is not the goal of the package. tv's purpose is to maximize viewer enjoyment.
> 2. Looking at the demo video, there seems to be an odd fixation with "N/A". The CSV spec, AFAIK, doesn't recognize this phrase. I don't understand why someone would expect a quoted string field whose raw characters are "n/a" should be rendered as anything other than n/a (i.e. lowercase and without the quotes). I'm guessing maybe in your workflow you want to use that phrase a lot, but for a tool for the general public I'd not do this kind of interpretation; and I would leave an empty field as empty.
I could not say it better than this:
> The norm of treating missing data as NA exists in R (which the developer of this is clearly inspired by, based on the GitHub readme). Pandas in Python is stuck with NaN for numeric types (not quite correct) and "" or None for string types. Personally I like the choice to both explicitly render missing data in colour and to apply NA as a placeholder text to display that colour.
> 3. tidy-viewer seems to require "unstable library features", or at least ones which were unstable as of Rust 1.48.0. It would be nice if you could be compatible with older rust distributions/versions.
That is a good point. I also release binaries, which I think makes this requirement less pressing. What are your thoughts?
> 4. Many systems, especially older ones, especially ones which you access remotely and don't have root privileges on, won't have a rust installation. It would be even more convenient if you could provide binaries with little or no extra dynamic library dependencies, which could be used on older / rustless systems. I realize this is a tall order, however.
> 5. What about scrolling? The worst part of viewing CSVs is having to handle wide ones which exceed the terminal width, and having decent horizontal as well as vertical scrolling ability. less doesn't cut it, because it doesn't keep the header row, plus it doesn't recognize field widths.
Scrolling is nice. To offer scrolling, the only option I am aware of is turning this cli into a tui. I made the choice early on to take the more minimal path and stick to a cli. The goal is to be a `column` replacement, not a spreadsheet replacement.
> 6. tidy-viewer does not seem to support wrapping longer fields onto multiple terminal lines.
The goal is to glance at the data as a whole, not a single cell or field. If there are cells with long text they get cut at 20 characters. I like this a lot. I would prefer to know that there is a lot of text that I can dig into later, but when I am glancing at the csv I just want an overall picture. In my view tables of data are data visualizations, meaning that I don't have to show everything to understand enough of it.
> 7. When the user doesn't specify the color scheme, are you choosing one based on the terminal colors, or are you using absolute color values? I suggest the former.
Great question. I want to eventually add the ability for users to make a config file with their own colors. At this time I just have absolute presets. If you are interested, I would happily take a contribution that allows users the option to configure tv with some dotfile.
> 8. tidy-viewer loads and parses the entire CSV immediately; and, in fact, seems to keep two copies of it in memory at once. This means it cannot be used with large files without thrashing; and even if your CSV does fit in global memory, it will still be kind of unusable, trying to dump gigabytes onto the terminal.
That is almost true. tidy-viewer reads the entire csv, but only parses the head. If I knew of a way to get the number of rows and columns of a csv without reading the whole file then I would. I know there is a good deal more room for memory optimization. This is not my strength and I am still learning.
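For what it's worth, a single pass can do both jobs: count every record but retain only the first n. A rough sketch with the csv crate (error handling kept minimal); note it still has to scan, though not store, the whole file:

    // One-pass sketch: count all records for the footer metadata, but keep
    // only the first `n` parsed rows for display. Assumes the `csv` crate.
    use std::error::Error;

    fn head_and_count(
        path: &str,
        n: usize,
    ) -> Result<(Vec<csv::StringRecord>, usize), Box<dyn Error>> {
        let mut rdr = csv::Reader::from_path(path)?;
        let mut head = Vec::with_capacity(n);
        let mut count = 0;
        for result in rdr.records() {
            let record = result?;
            if head.len() < n {
                head.push(record); // retain only the rows that will be rendered
            }
            count += 1; // but count everything for the row total
        }
        Ok((head, count))
    }

(Using byte_records() instead of records() would also skip UTF-8 validation on the rows that are only being counted.)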
> Bottom line: A nice initial effort, but the more serious challenges are yet to be tackled, plus it needs to be more robustly cross-platform.
Thanks for the compliment. It is still a work in progress.
The file doesn't need to be pre-parsed. Perhaps give the filename to the utility instead of content via stdin; the filename then gives a hint. If there is none, run "file filename" (via a library) beforehand.
This looks great! I wonder how long it’ll be until someone posts a long awk snippet that will do something similar and claims this isn’t progress, but rest assured that they are wrong. I’m adding tv to my toolbox.
Cool project! I'm familiar with column, and this looks like a good replacement.
Curious, how do you handle formatting on cells with long strings that need to overflow to multiple lines? As soon as you try to optimize the column widths for table length, you start hitting an NP-hard problem.
I actually read that article when I started making the package. You can see some of the input data here: https://github.com/alexhallam/tv/blob/main/data/a.csv. I let the user choose how long the max column width should be, then append "...". The default value is 20 characters.
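The truncation itself can be very simple. A sketch of the idea (counting chars rather than bytes so multi-byte UTF-8 doesn't panic; not tv's actual code):

    // Cap a cell at `max` characters and append an ellipsis.
    fn truncate_cell(cell: &str, max: usize) -> String {
        if cell.chars().count() <= max {
            cell.to_string()
        } else {
            let cut: String = cell.chars().take(max).collect();
            format!("{}...", cut)
        }
    }

    fn main() {
        assert_eq!(truncate_cell("short", 20), "short");
        assert_eq!(truncate_cell("a very long description field", 10), "a very lon...");
    }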
I think it's a great first effort, but there are a number of possible improvements to make. The most obvious one would be to support passing the file as an argument instead of using cat or the redirection operator every time. It's great that it works with stdin to allow piping into it, but it's cumbersome if you just want to take a file and print it, which will no doubt be a common use case.
It works, but almost all UNIX commands that work on pipelines can take a list of files as arguments. Out of the commands I use regularly, "patch" is the only one that works exclusively from stdin, probably because file arguments have a different, somewhat obscure, and probably historical meaning.
If appropriate, using files as arguments instead of using shell pipelines is a best practice. Commands can optimize for that use case, print better error messages, etc...
And it is not a good thing to encourage useless use of cat. If your goal is to show how your tool is to be used with pipelines, show an actually useful pipeline, for example "sed '1b;/abc/!d' file.csv | tv". The "sed" command prints the first line (the header) and all lines containing "abc".
Some (most?) tools that output data in columns and fit each one to the largest value in that column need to scan the whole file as a first pass just to start displaying data.
Not only is that the case with this tool, but from what I'm reading in main.rs it looks like it's also loading the whole file into memory. I was going to say that scanning the file was a deal-breaker, but if true, this is much more resource-intensive.
This looks like a nice tool, but these design choices seem to limit its use to relatively small files. It could be updated to use a read-ahead buffer instead and adjust its output as new lines are discovered with values of different widths, although doing this without a jarring resize could be challenging.
Could someone with better knowledge of Rust than mine confirm this?
I see the full dataset being loaded here[1] and the column widths being computed here.[2]
> these design choices seem to limit its use to relatively small files
1. As a rule-of-thumb, I have been working on functionality before optimization. That said, `tv` is really fast. It is completely false that `tv` only works for relatively small files. I just pushed a 624MB file to `tv`. It ran in 2.8 seconds. With `column` it takes 5.0 seconds. Now, I would love help from programmers smarter than me. I am sure there are a lot of optimization gains to be had in `tv`. I just wanted to make sure potential users are not misled. `tv` is performant.
> Some (most?) tools that output data in columns and fit each one to the largest value in that column need to scan the whole file as a first pass just to start displaying data.
> Not only is it the case with this tool, but from what I'm reading in main.rs it looks like it's also loading the whole file in memory.
2. `tv` reads once, but parses partially. This means that it reads the full file only to grab the number of rows; it parses (takes) only the first n rows.
If the goal is to calculate the correct column width, you have to do one pass through the data before writing the first row.
If the file can be read multiple times (not a UNIX stream), you can just read the file twice.
If the file is a stream, instead of retaining the entire dataset in memory, you can write to a temporary file and re-parse it after calculating the widths.
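A rough sketch of that temp-file approach (assuming the tempfile crate, with naive comma splitting standing in for real CSV parsing):

    // Two-pass sketch for streams: spool stdin to a temp file while tracking
    // the widest cell per column, then rewind and print with the final widths.
    use std::io::{self, BufRead, BufReader, Seek, SeekFrom, Write};

    fn main() -> io::Result<()> {
        let mut spool = tempfile::tempfile()?;
        let mut widths: Vec<usize> = Vec::new();

        // Pass 1: copy the stream and record per-column widths.
        let stdin = io::stdin();
        for line in stdin.lock().lines() {
            let line = line?;
            for (i, field) in line.split(',').enumerate() {
                if widths.len() <= i { widths.push(0); }
                widths[i] = widths[i].max(field.chars().count());
            }
            writeln!(spool, "{}", line)?;
        }

        // Pass 2: rewind the temp file and print padded fields.
        spool.seek(SeekFrom::Start(0))?;
        for line in BufReader::new(spool).lines() {
            let line = line?;
            for (i, field) in line.split(',').enumerate() {
                print!("{:<width$} ", field, width = widths[i]);
            }
            println!();
        }
        Ok(())
    }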
The correct column width is calculated from the first n rows not the full file.
A stream does not work for tv because a stream does not know how many rows are in the file a priori. Displaying the dimensions of the file is a priority for `tv`. I am very happy with that trade-off. I would rather know the dimensions of a file than have a file stream of unknown dimensions.
If you did it the way he's talking about, you would stream through the file to find how many rows it has and write it out as a temp file that you could re-parse for the actual data.
I'm not saying you should or shouldn't, but your use case doesn't bar you from using streams.
I like this idea. I don't think it would be jarring if the read-ahead buffer was a minimal number of lines, i.e. looking like distinct pages. The default could be at least the line height of the terminal, or some multiple.
There could be an option to redisplay the header row for resized "pages".
There could be a CLI switch giving the user control, i.e. make everyone happy.
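A sketch of that paged idea (the page size is hard-coded here where it would really come from the terminal height; naive comma splitting again):

    // Read-ahead paging sketch: buffer one page of lines, size columns for
    // that page only, and repeat the header so each page stands alone.
    use std::io::{self, BufRead};

    fn print_page(header: &str, rows: &[String]) {
        let mut widths: Vec<usize> = Vec::new();
        for line in std::iter::once(header).chain(rows.iter().map(|s| s.as_str())) {
            for (i, field) in line.split(',').enumerate() {
                if widths.len() <= i { widths.push(0); }
                widths[i] = widths[i].max(field.chars().count());
            }
        }
        for line in std::iter::once(header).chain(rows.iter().map(|s| s.as_str())) {
            for (i, field) in line.split(',').enumerate() {
                print!("{:<width$} ", field, width = widths[i]);
            }
            println!();
        }
    }

    fn main() -> io::Result<()> {
        let page_size = 40; // would come from the terminal height
        let stdin = io::stdin();
        let mut lines = stdin.lock().lines();
        let header = match lines.next() { Some(h) => h?, None => return Ok(()) };
        let mut page = Vec::with_capacity(page_size);
        for line in lines {
            page.push(line?);
            if page.len() == page_size {
                print_page(&header, &page);
                page.clear();
            }
        }
        if !page.is_empty() { print_page(&header, &page); }
        Ok(())
    }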
It is more resource intensive, but it pushes the problem you mentioned onto tv. If tv doesn't work with embedded EOLs, then you need to fix your data or fix your tool.
> Just show me the top 5 rows. That's all most people are looking for.
Is it? I'd wager that can't be more than half its use at most. Accessing a specific section that could be anywhere in the file is very common in my experience, as is truly random access. Both of these, as well as the first-few-rows use case, are far better served by a paging system.
$ open test/example.csv | format generic
Login email Identifier One-time password Recovery code First name Last name Department Location
rachel@example.com 9012 12se74 rb9012 Rachel Booker Sales Manchester
laura@example.com 2070 04ap67 lg2070 Laura Grey Depot London
craig@example.com 4081 30no86 cj4081 Craig Johnson Depot London
mary@example.com 9346 14ju73 mj9346 Mary Jenkins Engineering Manchester
jamie@example.com 5079 09ja61 js5079 Jamie Smith Engineering Manchester
My shell also aims for closer compatibility with POSIX (albeit it's not a POSIX shell), so you can use all the same command line tools you're already familiar with (which, for me at least, was the biggest hurdle in my adoption of PowerShell).
It also supports other file types out of the box, e.g. jsonlines:
$ open test/example.csv | format jsonl
["Login email","Identifier","One-time password","Recovery code","First name","Last name","Department","Location"]
["rachel@example.com","9012","12se74","rb9012","Rachel","Booker","Sales","Manchester"]
["laura@example.com","2070","04ap67","lg2070","Laura","Grey","Depot","London"]
["craig@example.com","4081","30no86","cj4081","Craig","Johnson","Depot","London"]
["mary@example.com","9346","14ju73","mj9346","Mary","Jenkins","Engineering","Manchester"]
["jamie@example.com","5079","09ja61","js5079","Jamie","Smith","Engineering","Manchester"]
PowerShell is actually pretty good at manipulating CSV and JSON. However, I would definitely recommend using v7 (i.e. pwsh) since it has many improvements over v5 (default on Windows). For example, Group-Object seems to be several orders of magnitude faster using the latest version.
Edit: this reminds me of Jimmy Kimmel's segment where they "bleep and blur whether they need it or not" so that innocent TV clips appear to have profanity/innuendo etc.
It, for example, allowed me to make an educated guess as to the answer to the question “how does this handle huge files?”. It by default only reads 25 lines.
(That makes the example from the header,
cat diamonds.csv | head -n 35 | tv
a bad example. You shouldn’t need that head in-between.)
However, line 167 says
//.take(row_display_option + 1)
That seems to indicate this reads the entire file into memory, and that guess wasn’t that educated at all.
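For reference, if that take were re-enabled, the record iterator would stop pulling from the reader after n rows instead of draining the whole file. A self-contained sketch (not tv's actual code):

    use std::io;

    // `.take(n)` makes the lazy record iterator stop reading after n rows.
    fn first_n_records(n: usize) -> csv::Result<Vec<csv::StringRecord>> {
        let mut rdr = csv::Reader::from_reader(io::stdin());
        rdr.records().take(n).collect() // stops consuming input after n records
    }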
I have some work to do on the README. I will show the output better. The difficulty with showing the output only is that it does not capture the coloring. Maybe I will show the output, or add a picture, or have an animated gif. Maybe all three.
Hey, great start. I spend half my day in CSVs and I am definitely your target audience. Most of the time I use bat, visidata or tabview. In many ways tabview is the best, though recently the project has been abandoned.
tv looks excellent. Fun name. I think if you added a couple of features it would ascend to my toolbox:
(1) scrolling (horizontal and vertical)
(2) better command line parsing. Running "tv" without stdin or arguments should produce an error/help. Running "tv xyz.csv" should read that file.
cat just regurgitates the contents of the file, but the resulting piped fd is non-seekable. Since almost every command that can operate on a file from stdin can also operate on the file by name/path, at best this is just a needless invocation of a process (i.e. `tv foo.csv` should have been used instead of `cat foo.csv | tv`) - if the app in question can't handle paths, then you can have the shell pipe the file into it instead (e.g. `tv < foo.csv`). At worst, the recipient program would need to buffer the entire contents of the input if it needs to perform non-sequential operations on the source data - this is the case with things like `tac` that need to seek to the end of the input (see https://github.com/neosmart/tac for how `cat foo | tac` requires buffering but both `tac foo` and even `tac < foo` don't).
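To illustrate the seekability point in Rust (a toy example; the file name is made up):

    use std::fs::File;
    use std::io::{Seek, SeekFrom};

    // A real file handle can jump straight to the end, as tac-like tools want;
    // a pipe from `cat` cannot seek at all and would force full buffering.
    fn main() -> std::io::Result<()> {
        let mut f = File::open("foo.csv")?; // hypothetical input file
        let len = f.seek(SeekFrom::End(0))?; // O(1), no reading required
        println!("{} bytes; a reverse reader could start here", len);
        Ok(())
    }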
To some, it's a faux-pas. Personally, I like the aesthetics of cat for my own scripts. It follows the "pipe flowing" idiom better.
There are performance reasons why "useless cat" should be avoided though. So avoid it where performance is important (or when some other hardcore CLI jockey is going to see your code :))
Avoiding “useless cat” on the command line is premature optimization. Sure, don’t do it in a script that is invoked a lot but it shouldn’t be a concern when prototyping a filter pipeline.
$ cat foo | head -4
b
a
c
b
$ cat foo | head -4 | sort
a
b
b
c
$ cat foo | head -4 | sort | uniq -c
1 a
2 b
1 c
$ cat foo | head -4 | sort | uniq -c | sort -k1nr | head -1
2 b
Very nice! How does it handle CSVs that are wider or longer than the terminal? How does it deal with columns that are exceptionally long, or multiline?
Often when working with large CSV files, I'll need to show or hide specific columns, especially if they are very long. Also, grepping the output for a specific line will hide the header as well, not to mention make the output unnecessarily wide if non-matching lines have longer fields than do the matching lines. So a built-in grepping feature would make this very useful.
This is quite nice, but I don't like how it cuts off the output (instead of making it scrollable). Also, why require the use of `cat`? Accepting a filename so I can do `tv foo.csv` would be much more ergonomic, in my opinion.
xsv is one of my favorite data manipulation tools. Also, the author of that package is one of the best developers I know. I use xsv with tv. I normally pipe the output of xsv to tv.
1. As you noted, NA comprehension
2. Column overflow logic for different sized terminals
3. Summary meta data in the header
4. Significant digits logic. This allows users to view more columns than they otherwise would, since decimal dust no longer shifts the columns over (see the sketch after this list).
5. This is the most important! It looks really pretty!
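Here is a minimal sketch of what item 4's significant-digits formatting could look like (three significant figures; tv's actual rounding rules may differ):

    // Round to 3 significant digits so "decimal dust" doesn't widen a column.
    fn sigfig3(x: f64) -> String {
        if x == 0.0 {
            return "0".to_string();
        }
        let magnitude = x.abs().log10().floor() as i32;
        let decimals = (2 - magnitude).max(0) as usize; // 3 sig figs total
        format!("{:.*}", decimals, x)
    }

    fn main() {
        assert_eq!(sigfig3(3.14159265), "3.14");
        assert_eq!(sigfig3(0.000123456), "0.000123");
        assert_eq!(sigfig3(12345.678), "12346");
    }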
There should be a new package next to the traditional GNU tools containing the modern tools we need, e.g. jq, curl, or tv. Sometimes I really miss such an extended software package on some machines.
If you come across an edge case that tv does not handle, let me know. I will add a test csv file to the current portfolio of test csvs. https://github.com/alexhallam/tv/tree/main/data
I have been using tv now for a couple months at work. It has been working well on the data I see. If you find edge cases then please open an issue with an example csv.
I recently moved away from manually building binaries to automated builds for many architectures. I am still learning how to use GitHub Actions to build for a matrix of architectures.
I love visidata! But when I want to just glance at a csv file I reach for tv (I used to use `column` which is more of a tv competitor than visidata). This is for a couple reasons.
1. tv gives a quick summary of the count of rows and columns
2. tv lists at the bottom all the columns that don't fit in the terminal. With vd I have to scroll on wide data.
3. tv guides the eye to missing data better with NA highlights
4. tv has better sigfig logic. I work with files where the decimal dust can get long. Those unnecessary characters push the remaining columns off the screen, which means the user needs to scroll over to see additional columns. I generally think it is better to avoid additional key presses if possible.
5. tv is fast for large files. It does not have to read and format all of the data like vd. tv is focused on looking at the file, not operating on the file. It does not have to do as much as vd, which helps tv with what it is uniquely good at. "Do one thing and do it well."
It does not matter if your file is really wide (lots of columns) or really long: tv will give the user a compact, useful pretty-print of the data. Why not use vd as a TUI spreadsheet and tv for glancing at csv files? They are both great tools in my eyes, with different purposes.
Hey there, VisiData author here. Nice work with tv! I'm sure it's more useful than VisiData for certain use cases. I just want to clear some things up since there are a few misconceptions here (which will happen if you don't use VisiData a lot):
1. In VisiData, the number of rows is always shown in the lower right, and you can see the number of columns with Ctrl+G, or a list of the columns with Shift+C, or Shift+I for the list of columns with summary statistics (mode/distinct/errors/etc). This is an extra keystroke, but the amount of data you can get with that keystroke more than justifies it.
5. VisiData will instantly open and show any file it can, and continue to load the rest until it's done or you press Ctrl+C (or quit). Everything in VisiData is lazily evaluated, so it's not actually doing any more work than tv when you view the first page of rows, and then you can see the next few pages of rows with only one keystroke (PgDn, as opposed to having to edit a command and rerun it). Fewer keypresses ftw!
A lot of people think VisiData is a TUI spreadsheet, but vd is not a "spreadsheet" in the classic sense, as it's not cell-based. Its primary use-case is exploring and wrangling tabular data. It just turns out that this is what a lot of people are doing with their spreadsheets, but they have to bend over backwards to get Excel/whatever to play nice with their data's structure. By the same token, if you try to do little single-cell formulas in VisiData, it's going to be quite difficult.
For people who like static binaries and only need to view a few rows of CSV files, or produce part of a larger report in a pipeline, tv could be a better fit than VisiData, especially if it continues to be maintained. I'm always excited to see new data tools in the terminal space!
Oh, I am sorry. I see I misrepresented VisiData. I apologize. Thank you for the corrections.
I have a lot of respect for your work. Let me know if I can make it up to you. I would be happy to point people to VisiData in my README as a recommendation for a tool built to explore and wrangle tabular data.
Also, thanks for the compliment! Like you, I like seeing more data tools in the terminal.
This is why I love HN. I never knew this existed, and it has become my favorite tool in the 5 minutes since I installed it. It also reminds me of mainframe programs that I encountered in the past. I wish we had more tools like this instead of Electron mouse-click apps, for those of us who prefer speed and the keyboard.
For scripting I would use grep and cut, maybe awk. For scripting with CSV files, at least in my experience, you usually want specific columns from specific lines.
If tv had a switch for specifying only certain columns, that would make the job much easier.
XSV [0] can also pretty-print (minus the colors), but that's just the tip of the iceberg as far as what it can do. It's very handy for quick statistical analysis of CSV input.
That's exactly the comment I was looking for! xsv is super powerful and I think you might both draw inspiration from one another. I read above that tv reads everything into memory: maybe you can borrow some xsv tricks to avoid that. I feel tv looks great for visualising the outcome at the end of a pipeline, perhaps with xsv. I am no Ruby expert either, but this could become a cool Homebrew binary: people on macOS will use it too!
I will add some Homebrew installation instructions. That is now an open issue. I want this tool to be highly accessible. Again, xsv is the best. I like the idea of small utilities that specialize in a specific task.
From the Unix philosophy:
> Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".
Sure! I completely understand, and I am really glad you made this great tool to display the data in such a beautiful format. Maybe there could also be a `--head` flag to display only the top n rows by default, given that the data is read into memory?
It's very weird for a project made to "maximize viewer enjoyment" to not put a space after the prompt. The one saved character on the line is definitely not worth the illegible resulting line: this doesn't maximize my enjoyment at all when viewing the examples.