Crab – SQL for your filesystem (etia.co.uk)
127 points by cogs on Sept 10, 2015 | 43 comments



> Just try these in Bash or PowerShell!

> select fullpath from files where fullpath like '%sublime%' and fullpath like '%settings%' and fullpath not like '%backup%';

This isn't a very good example, because it's trivial to do in bash:

    locate sublime | grep settings | grep -v backup
(Replace `locate sublime` with `find / | grep sublime` if locate's results are too old.)

> select fullpath, bytes from files order by bytes desc limit 5;

This is better. Here it is in bash:

     find / -type f -exec stat -c '%s %n' {} \; | sort -nr | head -n 5
Cherry picking another one that stood out to me.

> select writeln('/Users/SJohnson/dictionary2.txt', data) from fileslines where fullpath = '/Users/SJohnson/dictionary.txt' order by data;

    cd /Users/SJohnson/; sort dictionary.txt > dictionary2.txt
Some of the rest of the examples are trivial in bash, and others look potentially useful. Of course they are trying to demonstrate its capabilities, so the examples are contrived. I can see how this would be useful for someone who doesn't know the command line, but as someone who is proficient in both, I find SQL pretty verbose.

In the real world I'd switch to a scripting language for some of the more complex cases, since they'd be rare.
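
For instance, a minimal Python sketch of the "five largest files" case (the root path and the limit of 5 are illustrative):

    import os

    # Walk the tree once, collecting (size, path) pairs; unreadable or
    # vanished files are skipped rather than aborting the scan.
    sizes = []
    for root, _dirs, names in os.walk('/'):
        for name in names:
            path = os.path.join(root, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                pass

    for size, path in sorted(sizes, reverse=True)[:5]:
        print(size, path)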


Sure, if you already know all that. But

    select fullpath, bytes from files order by bytes desc limit 5;
is infinitely more readable/parseable than:

    find / -type f -exec stat -c '%s %n' {} \; | sort -nr | head -n 5
Just because something exists doesn't mean it isn't an incomprehensible pile.


I think PowerShell still isn't really that hard for this (and it was mentioned earlier).

Something like:

  gci / -Recurse -File | sort Length -Descending | select FullName, Length -First 5
Really doesn't seem that hard in comparison.


Agreed. It's interesting - last time I looked at PowerShell it seemed insanely verbose, but maybe I was looking at some odd examples.


As pointed out a lot of the verbosity in PowerShell is optional these days as there are a good number of aliases for common commands.

The nice thing about the verbosity is kind of similar to the point of SQL, in that it too has something of a natural-language DSL intent (in PowerShell, commands are expected to be "Verb-Noun", and the verbs are for the most part encouraged to come from a relatively small set of common ones). This can make it a bear to fully write out without an IDE (and there are multiple choices these days) or tab completion (which gets better with each release).

The benefit however, is that typically you can very easily read someone's PS1 script if it has been written at full verbosity and know what it is doing. It's often very close to self-documenting at that point.


Oh, it can be. The confusion lies in the fact that you can do basic things half a dozen ways, and I also used some shorter aliases (e.g. Get-ChildItem (or gci, or ls), Sort-Object (sort), Select-Object (select)).

You can sort of choose what level of verbosity you want to encode in your scripts for your sanity at a later date.


Not really. If you write SQL statements for a living, the former is more readable, if you sysadmin for a living, then the latter wins. A sysadmin will use tools like find, sort and head very frequently, whilst SQL statements will be fairly rare.


They are both technical, but 'order by bytes desc' has got to be more expressive than 'sort -nr'. It's almost natural human English, whereas the latter doesn't express anything.

That said, I don't know how much time it would genuinely save. As with most of these tools, you shouldn't be installing them on production servers, so you still have to know Bash anyway.


Crab has tab completion, so there's not too much extra typing, and the replacement of hieroglyphics by words does help readability.

It's especially handy to see what is going on if you wrote it a month ago, and don't remember all the command line switches.


Except Crab isn't as old as find. Find is likely to be around in 20 years.


> > select fullpath, bytes from files order by bytes desc limit 5;

> This is better. Here it is in bash:

> find / -type f -exec stat -c '%s %n' {} \; | sort -nr | head -n 5

This is also a nice example of the biggest problem with piped shell commands: each command is executed in isolation. Because of that you miss out on many optimizations that are possible when you know the full query.
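
To illustrate with that exact query: a planner that knows only the top 5 rows by size are wanted can keep a bounded heap rather than sorting every row, which the sort | head pipeline has no way to know. A Python sketch of the optimization:

    import heapq
    import os

    # Because the whole "query" (top 5 by size) is known up front, a
    # 5-element min-heap replaces the full sort that `sort -nr | head`
    # forces on the pipeline.
    top = []  # min-heap of (size, path), at most 5 entries
    for root, _dirs, names in os.walk('/'):
        for name in names:
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue
            if len(top) < 5:
                heapq.heappush(top, (size, path))
            elif size > top[0][0]:
                heapq.heapreplace(top, (size, path))

    for size, path in sorted(top, reverse=True):
        print(size, path)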


    find / -type f -iname '*sublime*' | grep -i settings | grep -v backup


Reminds me of Facebook's https://osquery.io/. Does this offer anything different?


Crab is aimed specifically at the filesystem, rather than querying devices or processes.

For example wildcard matching on paths, combined with the exec function to run OS commands on the files you get back.


osquery also has some level of setup overhead and requires a daemon running in the background. I don't think this is true of crab.


You don't need to run the daemon. It's also possible to just use it interactively via osqueryi.


Thanks for clarifying. I haven't actually used osquery myself so I wasn't aware.


As a reminder: we actually did have a filesystem with integrated SQL(-like) query features in BeFS, in 1997.


This is pretty cool; I have found myself wanting a tool like this for ages. However, does anyone know of a pure open-source alternative (just for Linux)?


A more unixy alternative to osquery is termsql (https://github.com/tobimensch/termsql): it works with anything on the input, so it's a matter of "ls"-ing the correct folder and then using SQL to output what you want.
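
The core idea is small enough to sketch in Python (an illustration of the approach, not termsql's actual implementation; the nine-column table and naive whitespace splitting are assumptions):

    import sqlite3
    import sys

    # Load whitespace-separated columns from stdin into an in-memory
    # SQLite table. Usage: ls -l | python sketch.py
    db = sqlite3.connect(':memory:')
    db.execute('create table tbl (c0, c1, c2, c3, c4, c5, c6, c7, c8)')
    for line in sys.stdin:
        cols = (line.rstrip('\n').split(None, 8) + [None] * 9)[:9]
        db.execute('insert into tbl values (?,?,?,?,?,?,?,?,?)', cols)

    # Example: the five largest entries by the size column of `ls -l`.
    query = ('select c8, c4 from tbl '
             'order by cast(c4 as integer) desc limit 5')
    for name, size in db.execute(query):
        print(size, name)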



It would not be terribly difficult to create a workalike on other *nixes, as the task is effectively just mapping your filesystem's stat(2) metadata onto an SQLite db, with a virtual table representing file content and a user-defined function that calls exec()*. SQLite gives you most of what you need here for free and makes the rest relatively easy. In fact, a module containing the implementations for the latter can even be loaded directly into the existing sqlite3 CLI.

If you added an inotify daemon and FTS indexes you would have essentially a clone of Spotlight and other indexed filesystem search engines.

* Piping the filenames somewhat obviates this, though.
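
A minimal sketch of that mapping in Python's stdlib sqlite3, where an ordinary table filled by a single walk stands in for a true virtual table, and the exec-style UDF is hypothetical rather than Crab's actual function:

    import os
    import sqlite3
    import subprocess

    db = sqlite3.connect(':memory:')
    db.execute('create table files (fullpath text, bytes int, mtime real)')

    # One pass mapping stat(2) metadata onto rows; a real workalike would
    # use SQLite's virtual table mechanism instead of materialising this.
    for root, _dirs, names in os.walk(os.path.expanduser('~')):
        for name in names:
            path = os.path.join(root, name)
            try:
                st = os.stat(path)
            except OSError:
                continue
            db.execute('insert into files values (?,?,?)',
                       (path, st.st_size, st.st_mtime))

    # A user-defined function so queries can shell out to an OS command,
    # roughly the exec() idea described above (illustrative semantics).
    def run(cmd, path):
        return subprocess.run([cmd, path], capture_output=True,
                              text=True).stdout

    db.create_function('exec', 2, run)

    sql = 'select fullpath, bytes from files order by bytes desc limit 5'
    for row in db.execute(sql):
        print(*row)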


I put up something similar here after experimenting with a related idea: https://github.com/claes/osql


This is a brilliant idea!

You can do most of the same kinds of things via find and grep and some shell foo, but honestly who can remember all of that? Maybe someone smarter than me, but every time I need it I am reading man pages.

The find syntax to get files modified more than 20 minutes ago? How can you remember that? But modified > now() - interval '5 minutes' (well, that's Postgres, but still) I can remember, and I haven't used it in 2 or 3 years, because it's slightly less arbitrary and doesn't have 8 different gotchas.

EDIT:

find . -mmin -5 # that gets you files modified in the last 5 minutes. The part I can't ever remember:

find . -mmin +5 # that gives you files last modified more than 5 minutes ago

find . -mmin 5 # apparently this is files modified exactly 5 minutes ago? The fact that this syntax exists (and is different from +5) seems absurd to me. What is the resolution? It must be minutes. This option exists only to confuse people.
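
For what it's worth, the three cases written out explicitly in Python (a sketch; it assumes find discards the fractional minute when computing age, the same truncation its man page describes for -mtime's days):

    import os
    import time

    # find's numeric tests compare file age in whole minutes:
    #   -mmin -5 -> age < 5,  -mmin +5 -> age > 5,  -mmin 5 -> age == 5
    now = time.time()
    for root, _dirs, names in os.walk('.'):
        for name in names:
            path = os.path.join(root, name)
            try:
                age = int((now - os.stat(path).st_mtime) // 60)
            except OSError:
                continue
            if age < 5:     # find . -mmin -5
                print('last 5 minutes:', path)
            elif age > 5:   # find . -mmin +5
                print('more than 5 minutes ago:', path)
            else:           # find . -mmin 5
                print('exactly 5 minutes ago:', path)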


Maybe an alternative:

find . -mmin <5 # less than 5mins

find . -mmin >5 # more than 5mins

find . -mmin =5 # exact 5mins


"Multi platform" ... only runs on OSX.


We're working on Windows; the query syntax and the functions are going to be the same.


I like to use ad-hoc LINQ queries with LINQPad to get this type of stuff done.

Ex: Directory.GetFiles(theDirectory).Take(50).GroupBy(a => ...)



We looked at Log Parser before building Crab. The syntax was far from standard SQL, and it didn't have joins, etc.

It didn't use string matching to identify subsets of files to query.

And maybe the most important thing, it doesn't have the exec function to run operating system commands on the files you get back in your query results.


I've been looking for something like this (and thinking about developing something if I can't find a satisfactory solution) to use across many different hosts to identify duplicate files, etc. I've got media spread across many different Linux and OS X machines. Can Crab handle this?


Well, a workaround is mounting the remote filesystems and using joins?

Now, be careful: the example finds duplicate file names (with equal size), not duplicate files! Those would require checking the contents of the files.

Also, there are tools like fdupes, with the same problem regarding remote hosts though. Some tools use xattrs to store hashes; some might use DBs. (With the xattr tools, you just run the tool on the remote host first, then on the local host, if you need to save bandwidth.)

I however don't have the perfect answer.


Yeah, ideally what I want is a daemon which hooks into libevent or something similar, and each time a file changes or is created, calculates a checksum and updates other metadata, and then finally provides this information back to a central queryable database.


The example does check the contents, by using the SHA-1 of the files.


We don't have any Linux builds at the moment; if there's demand we'll make some.

We can scan mounted drives, and Crab has a command-line switch to treat names as case-sensitive or not. SHA-1 calls are probably not practical across a network to compare file contents (even scans can be a bit slow), but you might do something by comparing file sizes.
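
Comparing sizes first is the usual trick: files with unique sizes can't be duplicates, so only size collisions ever need their contents read (or shipped across the network) for hashing. A minimal local sketch in Python:

    import hashlib
    import os
    from collections import defaultdict

    def sha1(path, bufsize=1 << 20):
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    # Pass 1: group paths by size, which costs only a stat per file.
    by_size = defaultdict(list)
    for root, _dirs, names in os.walk('.'):
        for name in names:
            path = os.path.join(root, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass

    # Pass 2: hash only the files whose sizes collide.
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha1(path)].append(path)
        for digest, dupes in by_hash.items():
            if len(dupes) > 1:
                print(digest, dupes)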


Makes me think of txt-sushi (http://keithsheppard.name/txt-sushi/). Can be pretty useful instead of using awk.


Reminds me of WinFS, which was cancelled, but I always thought it was a brilliant concept.

https://en.wikipedia.org/wiki/WinFS


I think it was abandoned when Microsoft realized they could achieve similar object orientation built on the same old file system (so no incompatibilities), just with the .NET layer in between. See also PowerShell and the example above. :)


IMHO... doesn't replace the good old for, find, grep, etc.


Does this maintain an inode metadata index as well? Otherwise how will you avoid stat'ing the whole filesystem (or a branch thereof)?

Does it handle extended attributes?


Crab doesn't pick up metadata during the scan, because this would slow the scan too much.

There is a wrapper function 'metadata' which returns a specific metadata item from a file at a given path at query run time, but this basically runs mdls under the hood.

Crab can handle extended attributes using the EVAL function, which runs an OS command and returns the result as a string. But you have to parse the string; for example, to return the size of a resource fork:

    select bytes + coalesce(matchedgroup(eval('ls','-l@',fullpath),'.com.apple.ResourceFork\t(\d+)'),0)


Yay for commercial open source!


It doesn't look like it's open source: you download a time-limited trial version, after which you need to buy a license. And there's no link to the source.



