This is really interesting and very cool to see. The approach we're taking with Eve (gory details at [1]) is that you can treat everything as relational, and doing so provides lots of benefits. One thing that wasn't clear, though, was how to extend that notion down to the OS level for both performance and semantic reasons. It's encouraging to see someone with requirements as deep as Facebook's find that this strategy works in that context.
The next step would be manipulating the OS as relations. E.g. an insert into the process table allows you to actually spawn a process. It would start to get really interesting from there...
Right now osquery only supports read operations, but you're totally right; if you could kick off tasks, kill processes, unload kexts, etc. via INSERT, UPDATE, DELETE, and similar statements, that would be killer!
So in essence, you're motivated by the same underlying concept as the Plan 9/Inferno developers: define a small set of abstractions and apply them ruthlessly, in contrast to the myriad non-uniform interfaces one practically uses every day in an operating system.
For Plan 9, it was "everything is a file": simple operations like bind mounts could compose complex software interactions that would require monolithic protocol and library stacks anywhere else.
Here, it's... "everything is a table"? I'm not very familiar with table-oriented programming, but what advantages does having the RDBMS be the prime metaphor over the file system really bring? Structure? Rob Pike had some interesting words on that: http://slashdot.org/story/50858
"Everything is a file" means that you need to parse files to get useful data out (think things like /etc/mtab and /proc/mounts), which is closely tied to another UNIX philosophy, that tools should generate plain text and parse it using generic text-processing tools. This is great for getting things done quickly. It's also great for security holes (think CVE-2011-1749 and related issues; arguably, see also Shellshock).
One advantage of "everything is a table" is that your structures are well-formatted and there's no risk of problems when you put a space in a pathname. For most implementations of "table", you can also have the data formats be well-typed. This brings reliability and security benefits.
I think there's validity to Rob Pike's argument in many contexts -- for instance, you absolutely won't see me defending the semantic web over the greppable/Googleable one. But in the specific case of text files with a single, well-defined structure, his own argument seems to imply that there's no sense in a second tool having to infer the structure on its own.
(The usual way this is worked around these days is separate files for each field, or files designed to be parseable, which is why Linux's /proc/*/ is such a mess. Compare /proc/self/stat and /proc/self/status, and /proc/self/mounts and /proc/self/mountinfo. Also look around /sys a bit.)
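The hazard described above can be sketched concretely. This is a toy example (the line below is made up, not a real /proc/mounts entry): naive whitespace-delimited parsing silently shifts every field when a pathname contains a space, whereas a structured row carries each field losslessly. (Linux actually escapes such spaces as \040 in /proc/mounts, which every consumer must know to undo; it's the same class of problem.)

```python
# A fake /proc/mounts-style line; the mount point contains a space.
line = "/dev/sda1 /mnt/usb drive ext4 rw 0 0"

# Naive text parsing: split on whitespace and take field 2 as the
# mount point -- the space silently shifts every later field.
fields = line.split()
naive_mountpoint = fields[1]   # "/mnt/usb" -- wrong
naive_fstype = fields[2]       # "drive"    -- also wrong

# Structured representation: each field is its own value, so a
# space in a pathname is just data, not a delimiter.
row = {"device": "/dev/sda1", "mountpoint": "/mnt/usb drive",
       "fstype": "ext4", "options": "rw"}

print(naive_mountpoint, "|", row["mountpoint"])
```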
There's a command-line tool called q which allows performing SQL-like queries directly on text files, basically treating text as data and auto-detecting column types.
Neat, but auto-detection is exactly what I don't want. We have structure on one side. Why round-trip it through an unstructured format and attempt to guess the exact same structure on the other side? If I guess wrong, it's a security hole.
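The round-trip problem above can be sketched with a toy type-guesser (in the spirit of CSV tools, not the actual q implementation): values are serialized to text, then a consumer guesses the types back, and the guess can differ from the original meaning.

```python
# Original values: a version string, an ID with a leading zero,
# and literal text.
original = ["1.10", "007", "no"]

def auto_detect(value):
    # Toy type-guesser: try int, then float, else keep the string.
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

guessed = [auto_detect(v) for v in original]
# "1.10" becomes the float 1.1 (version info lost) and "007"
# becomes the int 7 (leading zero lost); only "no" survives intact.
print(guessed)
```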
This is great. One of the frustrations I've had with Puppet and Ansible is the lack of a clear model for data. It's quite difficult to know the scope, dependencies, and origin of all the variables one deals with.
If one could update tables and then have that representation be reified onto the machines, it would be awesome.
>> The next step would be manipulating the OS as relations.
This approach could be a good fit for package management. So that packages are updated and run within a transaction and changes can be committed in a single seamless step.
Package managers usually have non-idempotent actions that might change parts of the operating system in undesired ways. That means you could not have atomic operations à la SQL. There is one package manager that solves that, Nix (from NixOS), on top of which you could apply something like an SQL language.
I'd rather see an OS and package manager with a "functional" design (as in functional programming languages and functional data structures). This would allow conflicting packages to be installed next to each other in different branches of a functional filesystem.
The Nix package manager refers to itself as "The Purely Functional Package Manager", and that is exactly what it lets you do.
"It provides atomic upgrades and rollbacks, side-by-side installation of multiple versions of a package, multi-user package management and easy setup of build environments."
This seems interesting. But I find it a bit unsatisfying that it can only be used as a package manager then. How about using this functional machinery, e.g., for a general build system? A "functional make" so to speak. And I bet there are plenty of other use cases.
This would really require the underlying package management system to support it, and then it's simply a shim from osquery to the package manager to do the actual work. The main problem would probably be making it work across all the disparate systems it's supposed to support.
That said, it would be awesome to query specific package versions, or even individual package file MD5s from an SQL interface to check system exposure when new exploits come down the pipe.
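A rough sketch of what that exposure check could look like, using an in-memory SQLite table. The `packages` table name, its schema, and the rows are invented for illustration; osquery exposes installed packages through its own platform-specific tables, not this one.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE packages (name TEXT, version TEXT, md5 TEXT)")
db.executemany("INSERT INTO packages VALUES (?, ?, ?)", [
    ("openssl", "1.0.1e", "d41d8cd9"),   # hypothetical rows
    ("openssl", "1.0.1g", "aabbccdd"),
    ("bash",    "4.2",    "11223344"),
])

# "Which installs are exposed?" becomes a WHERE clause. (Plain
# string comparison of versions works here only because the fake
# versions happen to sort lexicographically.)
exposed = db.execute(
    "SELECT name, version FROM packages "
    "WHERE name = 'openssl' AND version < '1.0.1g'").fetchall()
print(exposed)
```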
While you're at it, one related piece that I'd like to see explored is relations-as-code, code-as-relations a la Lisp. It would be interesting to see what a program would look like if represented as tuples in a relation, and able to self-modify by updating that relation.
Abstract, I know, but I don't think anyone has looked at this in any detail yet.
Datalog isn't what I was talking about. I would like to see a programming language where both the basic data type and the representation of code itself are relations, as is the case with Lisp and lists.
Thank you. I have had a misunderstanding of what datalog is for years. I have always thought of it as a DSL for making queries against a database using the relational calculus. I have never considered, nor have the texts I've encountered demonstrated how the syntax of datalog is itself relational, though it seems obvious now that I've noticed.
I have some planned projects which require a homoiconic relational language. I was hoping someone else could be inspired to design such a thing so that I don't have to. It looks like someone already did. I am happy to be proven wrong :)
ibdknox, here's [1] an example of a relational scheme on top of HTTP APIs, maybe this could serve as inspiration for some aspect of your project. (BTW I feel like I keep bringing this [seemingly defunct] project up... oh well).
- several tools and utilities exist to leverage the power of SQL at scale (osqueryd is a full operating system instrumentation tool which allows you to use SQL to instrument your whole infra): https://github.com/facebook/osquery/wiki/using-osqueryd
All in all, WMI is great, no doubt about it, but osquery has a few unique features which make it a cool, interesting product that you can use all across your internal infrastructure.
Didn't want to give off a "someone's been there and done that already" vibe; your version definitely looks much better. Honestly, I found WMI to be lacking and also quite slow.
Of course, it's an intuitive comparison to make. I'm glad you like it! I'm looking forward to working with you and similarly insightful community members on improving osquery for everyone :)
I think Microsoft does deserve some credit for the idea behind WMI, but when I worked with it I always got the sense that the API was sitting on top of a massive hairball that I didn't want to look at for too long. It was slow, but it worked. I'd be thrilled if osquery just achieved the same level of query completeness, and made it easier to find properties and query them. I found WMI confusing in that regard.
A (not so) simple case: How many network interfaces do I have? WMIC gives one answer, Netsh gives another, ncpa.cpl - Network Control Panel gives another, hardware manager may give another. I'm impressed anything works.
A quick look at the Wikipedia page [1] reveals there is, and I always assumed it all kind of works - I know WMI, CIM, and the related stuff, and it is not completely cooked up by Microsoft alone, and therefore I always assumed that this is a standard solution and I could just point a WMI console at a Linux machine if I wanted to. Now I am not sure if I should like this project because it does something useful or dislike it because it is another incompatible solution (and I lean a bit towards a One True Way philosophy).
I think we could just as easily point to SNMP, and the same reasons why that isn't a good argument apply here. They differ in where their priorities lie and what aspect of management they emphasize.
In this case, it's about easily correlating data to pull more complex information out of specific system data sets.
It's hard for me to grok this project's design goals. I mean, the basic idea is simple enough to understand: I want to run SQL queries on metadata about my hosts. I've built and run several different iterations of that same idea, but they didn't require thousands of lines of C++.
The usual implementation is simple: take any host monitor (say, collectd) that can export key/value pairs from a host, or take a log stream over the network and pair it with a host monitor/log scraper to create key/value pairs. Then insert into an SQL engine while appending to a log for a historical record (or PTA/PITR/whatever, I'm not a DBA). Separately you can create a database application to query/modify the database as needed.
But we're talking like, a handful of python scripts that don't ever change except to add new search features. This seems like a big departure from the simplicity of that approach. Am I missing something?
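The pipeline described above fits in a few lines. This is a minimal sketch with invented host names and metric keys; a real version would read from a monitor and write the log to disk.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metrics (host TEXT, key TEXT, value REAL)")
history = []  # stand-in for an append-only log file

def ingest(host, pairs):
    # Append each key/value pair to the historical log, then insert
    # it into the SQL engine for ad hoc querying.
    for key, value in pairs.items():
        history.append(json.dumps({"host": host, "key": key, "value": value}))
        db.execute("INSERT INTO metrics VALUES (?, ?, ?)", (host, key, value))

ingest("web1", {"load1": 0.42, "mem_free_mb": 512})
ingest("web2", {"load1": 3.10, "mem_free_mb": 64})

# Ad hoc query across hosts, as the comment above describes.
hot = db.execute(
    "SELECT host FROM metrics WHERE key = 'load1' AND value > 1").fetchall()
print(hot)
```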
From what I can tell, the advantage of this is that it doesn't require integrating with an actual database, which then has to be maintained and scaled.
When you have tens of thousands of hosts, and this database (because they do use an SQLite backend) grows exponentially, it doesn't matter what the storage or query engine is. Whatever it is will have to be maintained and scaled.
There was a paper at EuroSys this year "Relational access to Unix kernel data structures" which seems to attempt to offer similar functionality. A comparison between the two would be interesting.
Given the number of applications tied to running under a Windows server, yes. Also, 80% of Azure is running Windows host environments... Many small-to-medium businesses are running a handful of Windows servers, and most of them would prefer to simply install new applications on their already existing and under-loaded servers.
On the flip side, while I am more than happy to deploy new development projects to Linux hosts, and even in Docker containers, I'm still stuck developing under Windows, and having to run a VM to do most of my development isn't something I'm all that in favor of.
.Net developers mostly deploy to internal application servers.
That said, Azure is about the only cloud provider that really caters to .Net deployments, that is correct and probably accounts for a lot of it.
My employer's next generation of applications is being deployed on Linux on Azure, because of its better support for our legacy applications (some will be around for quite a while)... which is part of why I said that "yes", running on Windows is important.
If Node and Mongo hadn't run on Windows three years ago, we wouldn't be migrating to Linux today. I introduced Node in order to improve client-side web resources in a few web projects... once that was in place, it was a natural fit for one-off scripts (importers, timed tasks, etc.). From there it became the API service for search (with MongoDB behind IIS/ARR). Because of that, and the stability so far, it's our next-generation platform. None of that would have happened without being able to run on Windows.
What about environments with both Windows and Linux? (and OSX, and BSD, and what not).
I think a lot of people/companies have some of these. Even if you're 90% Linux, for example, and 10% Windows, it's nice to have a tool that works across the board.
Since this also seems to be targeted at laptops, however, I bet Windows is still a large percentage there!
What I think they're saying is that they only test on CentOS and Ubuntu, two significantly different flavors of distro. I don't think they're saying that it won't build on fedora or debian.
The idea that they should test on every distro in order to say that it works on Linux is a bit silly.
Through the eyes of the author, the original comment probably seems obvious in what it's attempting to convey. Through the eyes of other people, its intent can be ambiguous. In my case, very ambiguous. I wouldn't be surprised if you were downvoted just as much for being unclear as for people misunderstanding your point.
This is not exactly the same, but similar in some ways to IBM's S/38, OS/400 or whatever they call it now... In this OS (and the last time I touched it was 15 years ago, so things may have changed) there were no "files" - everything was a database table, that was the only way you could store anything, and it was how you for the most part interoperated with the system, i.e. the OS was essentially a relational DB. http://en.wikipedia.org/wiki/IBM_i
I like this idea very much, but it is perhaps too orderly.
In the spirit of Chaos Monkey, I suggest killing processes via an interface based on Peggle.
Yup! The only reason it's not there today is that we couldn't get a formula merged into mainline Homebrew until the download link was public. It is now, though, so we'll sort this out soon. Now that you mention it, this would be a pretty easy, high-impact contribution that a community member could make as well!
Are there instructions on how to build it? I'm looking around the GitHub page and don't see anything like what dependencies it needs, what infrastructure it uses (looks like CMake?), etc. It's cool to distribute it as a Vagrant source, but I'd like to compile and run this on a Raspberry Pi, which doesn't run Vagrant. Supplying some basic build instructions would really help.
This is pretty neat. I'm a big fan of SQL in general and being able to query system stats like this feels pretty natural to me.
A long time back I created something similar to this atop Oracle[1]. It used a Java function calling out to system functions to get similar data sets (I/O usage, memory usage, etc). It was definitely a hack, but a really pleasant one to use.
It'd be cool to see a foreign data wrapper for PostgreSQL[2] that exposes similar functionality. I'm guessing it'd be pretty easy to put together, as you'd only need to expose the data sets themselves as set-returning functions. PostgreSQL would handle the rest. Though I guess that would limit its usefulness to servers that already have PG installed. Having it be separate like this lets you drop it on any server (looks like it's cross-platform too!).
[1]: I don't remember exactly when but I think 10g had just been released.
After a lost decade of misplaced vitriol directed at relational models and SQL, I'm personally heartened to see a trend back toward human-readable query languages over RPC as an interface, and toward sensible representation of information in relational form, suitable for ad hoc queries and discovery, versus complex implementation-dependent deep-hierarchy spaghetti.
I think it's interesting to see that MIG is in Go and thus cross-platform "by default". It also seems to be more privacy-compliant.
osquery's SQL is sexy however.
That said, I'm also wary of a single piece of software that basically gives you control over absolutely everything (control everyone's laptop, etc., silently and quickly). That's the best rootkit ever. You won't even detect it being compromised because it's a trusted piece of the OS!
MIG author here. First, congrats to Mike Arpaia for shipping! That must have been a busy couple of months :)
There are a number of conceptual differences between MIG, GRR, and osquery. MIG does not retrieve data from endpoints, but instead focuses on answering yes/no questions.
For example: find hosts that have a file in /var/www that contains the regex '12345'. MIG will run the search on all endpoints and return the location of files and hosts that match. But it won't return the files themselves. Privacy is preserved, but investigators need to manually retrieve files if needed. MIG takes this approach to keep the search very fast (parallel AMQP runs in seconds across thousands of endpoints). Retrieving data at large scale is too expensive for Mozilla (bandwidth, storage, execution time), and we like to protect privacy.
I really like the SQL approach taken by osquery. I brainstormed something similar last year [1] but did not get around to it yet. SQL is very natural to use for a lot of IT/Security people, and it's great to be able to transfer that knowledge to search tools.
GRR and OSquery are awesome tools. And I can tell their authors are solving some really hard problems. We're competing a bit, but it's OK to have multiple tools with varying goals. And it's great for borrowing each other's good ideas ;)
Is it possible to create triggers with this kind of emulated database? E.g. insert a row into the notifications table when free space drops below 10%, or, to use their example, when `SELECT name, path, pid FROM processes WHERE on_disk = 0` actually returns a row.
That is a great idea, but unfortunately triggers are not supported for SQLite virtual tables. You could achieve the same thing by periodically running a query to check for such conditions. This is a common way to build alerting systems.
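A minimal sketch of that polling workaround, using an in-memory SQLite table whose schema mimics the `processes` example upthread (the rows and the alert sink are invented; a real deployment would schedule this via cron or a daemon interval):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processes (name TEXT, path TEXT, pid INT, on_disk INT)")
db.execute("INSERT INTO processes VALUES ('evil', '/tmp/evil', 1337, 0)")
db.execute("INSERT INTO processes VALUES ('sshd', '/usr/sbin/sshd', 42, 1)")

alerts = []

def poll_once():
    # Re-run the condition query; any returned row becomes an alert.
    rows = db.execute(
        "SELECT name, path, pid FROM processes WHERE on_disk = 0").fetchall()
    for row in rows:
        alerts.append(row)   # stand-in for sending a notification

poll_once()
print(alerts)
```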
I'd like to try this out, but is there a way to install it that's as simple as all the other Linux software I've ever installed before, one that doesn't require installing Vagrant, installing VirtualBox, downloading an Ubuntu image, and creating a whole VM (which failed) just so I can then make a package I can install? I can usually try these things out without making a whole thing of it.
Nifty, but not incredibly novel. SQLite's virtual tables have been around for some time, albeit with smaller breadth and scope. I think one thing this kind of glosses over is the notion of transactions: what if load or filesystem contents change between independent parts of your query? Are results memoized, recomputed, etc.?
Having said all that I'm going to install it and try it out because it's new and shiny.
This is neat, but why is Facebook making it? I may be too used to working at startups, but it seems to me that "look Mom, SQL!" isn't nearly worth the cost of the engineer hours it must have taken to bring this project to maturity.
I guess it does buy community goodwill to throw handfuls of money off the Facebook float...
I guess because of the learning curve - everybody already knows SQL, and the time spent creating an SQL parser is balanced by the time of thousands of people not needing to learn some new (and most of the time badly designed; look at MongoDB, for example) query language.
This is an excellent idea. While I try to avoid unnecessary abstractions (yes, I'm looking at you, Docker), having a consistent, cross-platform, and familiar API for OS instrumentation seems like a big boon.
At a low complexity cost, there is a chance to free admins from memorizing the idiosyncrasies of each OS's monitoring details.
Looks nice! That said, SQL in this case is just a way to look at the data. Given that most network devices (and printers, and UPSs, and...) in existence use SNMP, it would be nice to have an (SQL?) engine which would query devices via SNMP in the background... If I understand correctly, this solution is tied to servers only.
There is something similar at AWS. The main difference is that it is distributed and can query and aggregate across clusters of thousands of machines. Also, I think it is not built on SQLite but implements a new DB engine.
I don't know why I'm down-voted for the comment above. Relations have certain semantics that are very hard to preserve without RDBMS controlling the data. For example in a relational system a query like this:
`select p0.name, p1.name from processes as p0 join processes as p1 on p1.id = 1 where p0.id = 1;`
Should never produce two different values. This is impossible to guarantee in all cases if the data is updated asynchronously by the OS.
Haven't read the code, but this transactional behaviour should be possible to implement. Just ensure that all bound tables in a query reference the same snapshot.
I get what you are alluding to, and you are right that this would require kernel cooperation. But I still think that consistency needs to be defined before you can make such a claim. For instance, let's say we're joining a table of processes and a table of open network connections. Now we might have network connections that belong to a process not listed - is this consistent? What if the implementation filters out the connections that do not belong to a process?
I guess my point is that a reasonable definition of consistency could be many things, including a definition that says referenced tables are read only once, in the order they are encountered (this should be equivalent to a series of, e.g., ps and netstat commands stored in variables and then manipulated).
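That "read each table once" definition can be sketched as follows: snapshot each source once, then join over the frozen snapshots, so a single query never sees two versions of one row. The snapshot functions and their data are stand-ins, not real OS reads.

```python
import sqlite3

def snapshot_processes():
    # Stand-in for one read of the kernel's process list.
    return [(1, "init"), (42, "sshd")]

def snapshot_connections():
    # Stand-in for one read of the open connections; pid 99 exited
    # between "reads", like the example in the comments above.
    return [(42, 22), (99, 8080)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE procs (pid INT, name TEXT)")
db.execute("CREATE TABLE conns (pid INT, port INT)")
db.executemany("INSERT INTO procs VALUES (?, ?)", snapshot_processes())
db.executemany("INSERT INTO conns VALUES (?, ?)", snapshot_connections())

# Joining the frozen snapshots is repeatable within this query;
# connections whose process vanished simply don't match.
rows = db.execute("SELECT p.name, c.port FROM procs p "
                  "JOIN conns c ON c.pid = p.pid").fetchall()
print(rows)
```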
The SQL engine part of osquery is backed by SQLite's parser, lexer, query planner, etc. MySQL would be awesome as well though, mostly because of how easy it might be to integrate the MySQL remote access protocol to do remotely authenticated queries!
Postgres would probably be even easier, since it already has the framework to do this (foreign data wrappers) and a nice API to write said wrappers in Python[1] - 200 lines to use your SQL client as an email reader[2]; I imagine that wrapping /proc would be considerably less :)
Turns out postgres is even easier. Took 15 minutes to go from "never used FDWs before, but how hard can it be?" to "got access to the process table as SQL" :)
Get a load of Mr. "Im-too-l33t-to-use-MySQL" over here.
Otherwise it sounds like you're insinuating that MySQL is somehow languishing or limping along, which is ridiculous.
MySQL is going to SELECT almost any query faster than Postgres in a single-user/desktop installation, it's going to be easier (although granted less strict) to import data, plus it has Sequel Pro.
Without endorsing the GP's trolling, Postgres generally substantially outperforms MySQL on queries with any complexity to them. Its query optimiser is enormously superior.
The idea that MySQL is much faster than PG is largely a relic of the era of MyISAM, which was very fast for trivial queries, but also terrible in basically every other way.
Interesting. Apparently my received wisdom is out-of-date. I usually use MySQL on the laptop when I have to clean data for analysis because it's already installed, but I will have to look into replacing it.
Someone should make an ORM wrapper. Perhaps with one of those trendy fluent APIs, so you can just pipe the output from one command as input to the next.
Was that a joke? If there's a general-purpose querying mechanism for OS internals, does using JSON strings in a proprietary protocol that you then have to parse make more sense than JDBC? JDBC is the closest thing we have to a standard RDBMS access API, and it's got mechanisms that can be helpful if you want to speed up query execution (and waste fewer resources), like prepared statements. Plus, there's tons of visualization software for it already.
Yeah, it was just a joke about the fact that ORMs are so popular because people think learning SQL is too hard, but if you actually make a fluent ORM, then that's basically exactly what UNIX is to start with.
Except this project takes the SQL approach, not the UNIX approach, and the most popular SQL API is JDBC. Also, I don't know if ORMs are still popular, but their original goal had little to do with simplifying SQL and a lot to do with trying (unsuccessfully) to reconcile the relational model with the OO model, and to save application developers the need to do the translation manually.
This project currently has a proprietary C++ API, so why not standardize with JDBC (at least for those making use of the JVM)?
This uses SQLite virtual tables with a substantial amount of native code, so I imagine adapting the existing SQLite JDBC drivers, which rely on JNI, would be a lot of work. Using a client/server model would probably be easier, though I haven't seen an existing server/driver combo for SQLite.
One idea is to make a PostgreSQL Foreign Data Wrapper (FDW) that would allow using PostgreSQL JDBC (and ODBC) drivers.
Cool. It would be interesting to see if that comes up as a common request; we have found the ability to query system data both in real time and historically to be interesting.
Have you seen the other neat functionality in the Akamai query system? They allow (actually, require) devs to create callbacks that get exposed as query tables for debugging (vs. letting devs onto machines as the first or second line of debugging).
It's an interesting paradigm, though again, the way they use it, like osquery, is real-time only, while the psql/OpenTSDB method allows for history as well as real-time.
osquery's (and Akamai Query's) syntax is definitely simpler. We do plan to add a logical schema to map tags in OpenTSDB to logical columns to help with that.
The original wiki had much to say, over its many discussions, about TOP - table-oriented programming.
Today we know via Category Theory that tables are Turing complete and are actually quite synonymous with CT itself.
In other words, thinking of computation in terms of tables with rows and columns and relationships between tables is an interesting and promising (given CT) approach to computing that has been discussed in the past but then left largely unexplored.
Do you have some examples of how tables are "synonymous with CT itself"?
There may be some application of concepts from category theory to relational algebra, of course, since category theory is incredibly abstract and has some tangential relation with most everything. But it seems a bit too glib to say that tables are "synonymous with CT itself".
I was intrigued by that statement as well and it sent me off on a little research excursion.
Evidently a mathematician at MIT, David Spivak[0], has done some work on Databases as Categories. I found a presentation he gave to Galois[1] and a summary of the talk by E.Z. Yang.[2].
That said, I think "tables synonymous with CT itself" is a bit strong. Rather, the argument is that database schemas are categories if you model them appropriately: tables are objects in the category, and foreign keys are the morphisms/arrows.
[1]: http://incidentalcomplexity.com/2014/10/16/retrospective/