This is really interesting and very cool to see. The approach we're taking with Eve (gory details at [1]) is that you can treat everything as relational, and doing so provides lots of benefits. One thing that wasn't clear, though, was how to extend that notion down to the OS level for both performance and semantic reasons. It's encouraging to see someone with requirements as deep as Facebook's find that this strategy works in that context.
The next step would be manipulating the OS as relations. E.g. an insert into the process table allows you to actually spawn a process. It would start to get really interesting from there...
Right now osquery only supports read operations, but you're totally right; if you could kick off tasks, kill processes, unload kexts, etc. via INSERT, UPDATE, DELETE, and similar statements, that would be killer!
So in essence, you're motivated by the same underlying concept as the Plan 9/Inferno developers: define a small set of abstractions and apply them ruthlessly, in contrast to the myriad non-uniform interfaces one practically uses every day in an operating system.
For Plan 9, it was "everything is a file": simple operations like bind mounts could compose complex software interactions that would require monolithic protocol and library stacks anywhere else.
Here, it's... "everything is a table"? I'm not very familiar with table-oriented programming, but what advantages does having the RDBMS be the prime metaphor over the file system really bring? Structure? Rob Pike had some interesting words on that: http://slashdot.org/story/50858
"Everything is a file" means that you need to parse files to get useful data out (think things like /etc/mtab and /proc/mounts), which is closely tied to another UNIX philosophy, that tools should generate plain text and parse it using generic text-processing tools. This is great for getting things done quickly. It's also great for security holes (think CVE-2011-1749 and related issues; arguably, see also Shellshock).
One advantage of "everything is a table" is that your structures are well-formatted and there's no risk of problems when you put a space in a pathname. For most implementations of "table", you can also have the data formats be well-typed. This brings reliability and security benefits.
I think there's validity to Rob Pike's argument in many contexts -- for instance, you absolutely won't see me defending the semantic web over the greppable/Googleable one. But in the specific case of text files with a single, well-defined structure, his own argument seems to imply that there's no sense in a second tool having to infer the structure on its own.
(The usual way this is worked around these days is separate files for each field, or files designed to be parseable, which is why Linux's /proc/*/ is such a mess. Compare /proc/self/stat and /proc/self/status, and /proc/self/mounts and /proc/self/mountinfo. Also look around /sys a bit.)
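The hazard described above can be sketched concretely. This is a toy example (the line below is made up, not a real /proc/mounts entry): naive whitespace-delimited parsing silently shifts every field when a pathname contains a space, whereas a structured row carries each field losslessly. (Linux actually escapes such spaces as \040 in /proc/mounts, which every consumer must know to undo; it's the same class of problem.)

```python
# A fake /proc/mounts-style line; the mount point contains a space.
line = "/dev/sda1 /mnt/usb drive ext4 rw 0 0"

# Naive text parsing: split on whitespace and take field 2 as the
# mount point -- the space silently shifts every later field.
fields = line.split()
naive_mountpoint = fields[1]   # "/mnt/usb" -- wrong
naive_fstype = fields[2]       # "drive"    -- also wrong

# Structured representation: each field is its own value, so a
# space in a pathname is just data, not a delimiter.
row = {"device": "/dev/sda1", "mountpoint": "/mnt/usb drive",
       "fstype": "ext4", "options": "rw"}

print(naive_mountpoint, "|", row["mountpoint"])
```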
There's a command-line tool called q which allows performing SQL-like queries directly on text files, basically treating text as data and auto-detecting column types.
Neat, but auto-detection is exactly what I don't want. We have structure on one side. Why round-trip it through an unstructured format and attempt to guess the exact same structure on the other side? If I guess wrong, it's a security hole.
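The round-trip problem above can be sketched with a toy type-guesser (in the spirit of CSV tools, not the actual q implementation): values are serialized to text, then a consumer guesses the types back, and the guess can differ from the original meaning.

```python
# Original values: a version string, an ID with a leading zero,
# and literal text.
original = ["1.10", "007", "no"]

def auto_detect(value):
    # Toy type-guesser: try int, then float, else keep the string.
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

guessed = [auto_detect(v) for v in original]
# "1.10" becomes the float 1.1 (version info lost) and "007"
# becomes the int 7 (leading zero lost); only "no" survives intact.
print(guessed)
```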
This is great. One of the frustrations I've had with Puppet and Ansible is the lack of a clear model for data. It's quite difficult to know the scope, dependencies, and origin of all the variables one deals with.
If one could update tables and then have that representation be reified onto the machines, it would be awesome.
>> The next step would be manipulating the OS as relations.
This approach could be a good fit for package management. So that packages are updated and run within a transaction and changes can be committed in a single seamless step.
Package managers usually have non-idempotent actions that might change parts of the operating system in undesired ways. That means you could not have atomic operations à la SQL. There is one package manager that solves that, Nix (from NixOS), on top of which you could apply something like an SQL language.
I'd rather see an OS and package manager with a "functional" design (as in functional programming languages and functional data structures). This would allow conflicting packages to be installed next to each other in different branches of a functional filesystem.
The Nix package manager refers to itself as "The Purely Functional Package Manager", and that is exactly what it lets you do.
"It provides atomic upgrades and rollbacks, side-by-side installation of multiple versions of a package, multi-user package management and easy setup of build environments."
This seems interesting. But I find it a bit unsatisfying that it can only be used as a package manager then. How about using this functional machinery, e.g., for a general build system? A "functional make" so to speak. And I bet there are plenty of other use cases.
This would really require the underlying package management system to support it, and then it's simply a shim from osquery to the package manager to do the actual work. The main problem would probably be making it work across all the disparate systems it's supposed to support.
That said, it would be awesome to query specific package versions, or even individual package file MD5s from an SQL interface to check system exposure when new exploits come down the pipe.
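A rough sketch of what that exposure check could look like, using an in-memory SQLite table. The `packages` table name, its schema, and the rows are invented for illustration; osquery exposes installed packages through its own platform-specific tables, not this one.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE packages (name TEXT, version TEXT, md5 TEXT)")
db.executemany("INSERT INTO packages VALUES (?, ?, ?)", [
    ("openssl", "1.0.1e", "d41d8cd9"),   # hypothetical rows
    ("openssl", "1.0.1g", "aabbccdd"),
    ("bash",    "4.2",    "11223344"),
])

# "Which installs are exposed?" becomes a WHERE clause. (Plain
# string comparison of versions works here only because the fake
# versions happen to sort lexicographically.)
exposed = db.execute(
    "SELECT name, version FROM packages "
    "WHERE name = 'openssl' AND version < '1.0.1g'").fetchall()
print(exposed)
```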
While you're at it, one related piece that I'd like to see explored is relations-as-code, code-as-relations a la Lisp. It would be interesting to see what a program would look like if represented as tuples in a relation, and able to self-modify by updating that relation.
Abstract, I know, but I don't think anyone has looked at this in any detail yet.
Datalog isn't what I was talking about. I would like to see a programming language where both the basic data type and the representation of code itself are relations, as is the case with Lisp and lists.
Thank you. I have had a misunderstanding of what datalog is for years. I have always thought of it as a DSL for making queries against a database using the relational calculus. I have never considered, nor have the texts I've encountered demonstrated how the syntax of datalog is itself relational, though it seems obvious now that I've noticed.
I have some planned projects which require a homoiconic relational language. I was hoping someone else could be inspired to design such a thing so that I don't have to. It looks like someone already did. I am happy to be proven wrong :)
ibdknox, here's [1] an example of a relational scheme on top of HTTP APIs, maybe this could serve as inspiration for some aspect of your project. (BTW I feel like I keep bringing this [seemingly defunct] project up... oh well).
- several tools and utilities exist to leverage the power of SQL at scale (osqueryd is a full operating system instrumentation tool which allows you to use SQL to instrument your whole infra): https://github.com/facebook/osquery/wiki/using-osqueryd
All in all, WMI is great, no doubt about it, but osquery has a few unique features which make it a cool, interesting product that you can use all across your internal infrastructure.
Didn't want to give off a "someone's been there and done that already" vibe; your version definitely looks much better. Honestly, I found WMI to be lacking and also quite slow.
Of course, it's an intuitive comparison to make. I'm glad you like it! I'm looking forward to working with you and similarly insightful community members on improving osquery for everyone :)
I think Microsoft does deserve some credit for the idea behind WMI, but when I worked with it I always got the sense that the API was sitting on top of a massive hairball that I didn't want to look at for too long. It was slow, but it worked. I'd be thrilled if osquery just achieved the same level of query completeness, and made it easier to find properties and query them. I found WMI confusing in that regard.
A (not so) simple case: How many network interfaces do I have? WMIC gives one answer, Netsh gives another, ncpa.cpl - Network Control Panel gives another, hardware manager may give another. I'm impressed anything works.
A quick look at the Wikipedia page [1] reveals there is, and I always assumed it all kind of works - I know WMI, CIM, and the related stuff, and it is not completely cooked up by Microsoft alone, and therefore I always assumed that this is a standard solution and I could just point a WMI console at a Linux machine if I wanted to. Now I am not sure if I should like this project because it does something useful or dislike it because it is another incompatible solution (and I lean a bit towards a One True Way philosophy).
I think we could just as easily point to SNMP, and the same reasons why that isn't a good argument apply here. They differ in where their priorities lie and what aspect of management they emphasize.
In this case, it's about easily correlating data to pull more complex information out of specific system data sets.
It's hard for me to grok this project's design goals. I mean, the basic idea is simple enough to understand: I want to run SQL queries on metadata about my hosts. I've built and run several different iterations of that same idea, but they didn't require thousands of lines of C++.
The usual implementation is simple: take any host monitor (say, collectd) that can export key/value pairs from a host, or take a log stream over the network and pair it with a host monitor/log scraper to create key/value pairs. Then insert into an SQL engine while appending to a log for a historical record (or PTA/PITR/whatever, I'm not a DBA). Separately you can create a database application to query/modify the database as needed.
But we're talking like, a handful of python scripts that don't ever change except to add new search features. This seems like a big departure from the simplicity of that approach. Am I missing something?
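The pipeline described above fits in a few lines. This is a minimal sketch with invented host names and metric keys; a real version would read from a monitor and write the log to disk.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metrics (host TEXT, key TEXT, value REAL)")
history = []  # stand-in for an append-only log file

def ingest(host, pairs):
    # Append each key/value pair to the historical log, then insert
    # it into the SQL engine for ad hoc querying.
    for key, value in pairs.items():
        history.append(json.dumps({"host": host, "key": key, "value": value}))
        db.execute("INSERT INTO metrics VALUES (?, ?, ?)", (host, key, value))

ingest("web1", {"load1": 0.42, "mem_free_mb": 512})
ingest("web2", {"load1": 3.10, "mem_free_mb": 64})

# Ad hoc query across hosts, as the comment above describes.
hot = db.execute(
    "SELECT host FROM metrics WHERE key = 'load1' AND value > 1").fetchall()
print(hot)
```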
From what I can tell, the advantage of this is that it doesn't require integrating with an actual database, which then has to be maintained and scaled.
When you have tens of thousands of hosts, and this database (because they do use an SQLite backend) grows exponentially, it doesn't matter what the storage or query engine is. Whatever it is will have to be maintained and scaled.
There was a paper at EuroSys this year "Relational access to Unix kernel data structures" which seems to attempt to offer similar functionality. A comparison between the two would be interesting.
Given the number of applications tied to running under a Windows server, yes. Also, 80% of Azure is running Windows host environments... Many small-to-medium businesses are running a handful of Windows servers, and most of them would prefer to simply install new applications on their already existing and under-loaded servers.
On the flip side, while I am more than happy to deploy new development projects to Linux hosts, and even in Docker containers, I'm still stuck developing under Windows, and having to run a VM to do most of my development isn't something I'm all that in favor of.
.Net developers mostly deploy to internal application servers.
That said, Azure is about the only cloud provider that really caters to .Net deployments, that is correct and probably accounts for a lot of it.
My employer's next generation of applications is being deployed on Linux on Azure, because of its better support for our legacy applications (some will be around for quite a while)... which is part of why I said that "yes", running on Windows is important.
If Node and Mongo hadn't run on Windows three years ago, we wouldn't be migrating to Linux today. I introduced Node in order to improve client-side web resources in a few web projects... once that was in place, it was a natural fit for one-off scripts (importers, timed tasks, etc.). From there it became the API service for search (with MongoDB behind IIS/ARR). Because of that, and the stability so far, it's our next-generation platform. None of that would have happened without being able to run on Windows.
What about environments with both Windows and Linux? (and OSX, and BSD, and what not).
I think a lot of people/companies have some of these. Even if you're 90% Linux, for example, and 10% Windows, it's nice to have a tool that works across the board.
Since this also seems to be targeted at laptops, however, I bet Windows is still a large percentage there!
What I think they're saying is that they only test on CentOS and Ubuntu, two significantly different flavors of distro. I don't think they're saying that it won't build on fedora or debian.
The idea that they should test on every distro in order to say that it works on Linux is a bit silly.
Through the eyes of the author, the original comment probably seems obvious in what it's attempting to convey. Through the eyes of other people, its intent can be ambiguous. In my case, very ambiguous. I wouldn't be surprised if you were downvoted just as much for being unclear as for people misunderstanding your point.
This is not exactly the same, but similar in some ways to IBM's S/38, OS/400 or whatever they call it now... In this OS (and the last time I touched it was 15 years ago, so things may have changed) there were no "files" - everything was a database table, that was the only way you could store anything, and it was how you for the most part interoperated with the system, i.e. the OS was essentially a relational DB. http://en.wikipedia.org/wiki/IBM_i
I like this idea very much, but it is perhaps too orderly.
In the spirit of Chaos Monkey, I suggest killing processes via an interface based on Peggle.
Yup! The only reason it's not there today is that we couldn't get a formula merged into mainline Homebrew until the download link was public. It is now, though, so we'll sort this out soon. Now that you mention it, this would be a pretty easy, high-impact contribution that a community member could make as well!
Are there instructions on how to build it? I'm looking around the GitHub page and don't see anything like what dependencies it needs, what infrastructure it uses (looks like CMake?), etc. It's cool to distribute it as a Vagrant source, but I'd like to compile and run this on a Raspberry Pi, which doesn't run Vagrant. Supplying some basic build instructions would really help.
This is pretty neat. I'm a big fan of SQL in general and being able to query system stats like this feels pretty natural to me.
A long time back I created something similar to this atop Oracle[1]. It used a Java function calling out to system functions to get similar data sets (I/O usage, memory usage, etc). It was definitely a hack, but a really pleasant one to use.
It'd be cool to see a foreign data wrapper for PostgreSQL[2] that exposes similar functionality. I'm guessing it'd be pretty easy to put together, as you'd only need to expose the data sets themselves as set-returning functions. PostgreSQL would handle the rest. Though I guess that would limit its usefulness to servers that already have PG installed. Having it be separate like this lets you drop it on any server (looks like it's cross-platform too!).
[1]: I don't remember exactly when but I think 10g had just been released.
After a lost decade of misplaced vitriol directed at relational models and SQL, I'm personally heartened to see a trend back toward human-readable query languages over RPC as an interface, and toward sensible representation of information in relational form, suitable for ad hoc queries and discovery, versus complex implementation-dependent deep-hierarchy spaghetti.
I think it's interesting to see that MIG is in Go and thus cross-platform "by default". It also seems to be more privacy-compliant.
osquery's SQL is sexy however.
That said, I'm also wary of a single piece of software that basically gives you control over absolutely everything (control everyone's laptop, etc., silently and quickly). That's the best rootkit ever. You won't even detect it being compromised because it's a trusted piece of the OS!
MIG author here. First, congrats to Mike Arpaia for shipping! That must have been a busy couple of months :)
There are a number of conceptual differences between MIG, GRR, and osquery. MIG does not retrieve data from endpoints, but instead focuses on answering yes/no questions.
For example: find hosts that have a file in /var/www that contains the regex '12345'. MIG will run the search on all endpoints and return the location of files and hosts that match. But it won't return the files themselves. Privacy is preserved, but investigators need to manually retrieve files if needed. MIG takes this approach to keep the search very fast (parallel AMQP runs in seconds across thousands of endpoints). Retrieving data at large scale is too expensive for Mozilla (bandwidth, storage, execution time), and we like to protect privacy.
I really like the SQL approach taken by osquery. I brainstormed something similar last year [1] but did not get around to it yet. SQL is very natural to use for a lot of IT/Security people, and it's great to be able to transfer that knowledge to search tools.
GRR and OSquery are awesome tools. And I can tell their authors are solving some really hard problems. We're competing a bit, but it's OK to have multiple tools with varying goals. And it's great for borrowing each other's good ideas ;)
Is it possible to create triggers with this kind of emulated database? E.g. insert a row into the notifications table when free space drops below 10%, or, to use their example, when `SELECT name, path, pid FROM processes WHERE on_disk = 0` actually returns a row.
That is a great idea, but unfortunately triggers are not supported for SQLite virtual tables. You could achieve the same thing by periodically running a query to check for such conditions. This is a common way to build alerting systems.
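A minimal sketch of that polling workaround, using an in-memory SQLite table whose schema mimics the `processes` example upthread (the rows and the alert sink are invented; a real deployment would schedule this via cron or a daemon interval):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processes (name TEXT, path TEXT, pid INT, on_disk INT)")
db.execute("INSERT INTO processes VALUES ('evil', '/tmp/evil', 1337, 0)")
db.execute("INSERT INTO processes VALUES ('sshd', '/usr/sbin/sshd', 42, 1)")

alerts = []

def poll_once():
    # Re-run the condition query; any returned row becomes an alert.
    rows = db.execute(
        "SELECT name, path, pid FROM processes WHERE on_disk = 0").fetchall()
    for row in rows:
        alerts.append(row)   # stand-in for sending a notification

poll_once()
print(alerts)
```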
I'd like to try this out, but is there a way to install it that's as simple as all the other Linux software I've ever installed before, one that doesn't require installing Vagrant, installing VirtualBox, downloading an Ubuntu image, and creating a whole VM (which failed) just so I can then make a package I can install? I can usually try these things out without making a whole thing of it.
Nifty, but not incredibly novel. SQLite's virtual tables have been around for some time, albeit with smaller breadth and scope. I think one thing this kind of glosses over is the notion of transactions: what if load or filesystem contents change between independent parts of your query? Are results memoized, recomputed, etc.?
Having said all that I'm going to install it and try it out because it's new and shiny.
This is neat, but why is Facebook making it? I may be too used to working at startups, but it seems to me that "look Mom, SQL!" isn't nearly worth the cost of the engineer hours it must have taken to bring this project to maturity.
I guess it does buy community goodwill to throw handfuls of money off the Facebook float...
I guess because of the learning curve - everybody already knows SQL, and the time spent creating an SQL parser is balanced by the time of thousands of people not needing to learn some new (and most of the time badly designed; look at MongoDB, for example) query language.
This is an excellent idea. While I try to avoid unnecessary abstractions (yes, I'm looking at you, Docker), having a consistent, cross-platform, and familiar API for OS instrumentation seems like a big boon.
At a low complexity cost, there is a chance to free admins from memorizing the idiosyncrasies of each OS's monitoring details.
Looks nice! That said, SQL in this case is just a way to look at the data. Given that most network devices (and printers, and UPSs, and...) in existence use SNMP, it would be nice to have an (SQL?) engine which would query devices via SNMP in the background... If I understand correctly, this solution is tied to servers only.
There is something similar at AWS. The main difference is that it is distributed and can query and aggregate across clusters of thousands of machines. Also, I think it is not built on SQLite but implements a new DB engine.
I don't know why I'm down-voted for the comment above. Relations have certain semantics that are very hard to preserve without RDBMS controlling the data. For example in a relational system a query like this:
`select p0.name, p1.name from processes as p0 join processes as p1 on p1.id = 1 where p0.id = 1;`
Should never produce two different values. This is impossible to guarantee in all cases if the data is updated asynchronously by the OS.
Haven't read the code, but this transactional behaviour should be possible to implement. Just ensure that all bound tables in a query reference the same snapshot.
I get what you are alluding to, and you are right that this would require kernel cooperation. But I still think that consistency needs to be defined before you can make such a claim. For instance, let's say we're joining a table of processes and a table of open network connections. Now we might have network connections that belong to a process not listed - is this consistent? What if the implementation filters out the connections that do not belong to a process?
I guess my point is that a reasonable definition of consistency could be many things, including a definition that says referenced tables are read only once, in the order they are encountered (this should be equivalent to a series of, e.g., ps and netstat commands stored in variables and then manipulated).
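That "read each table once" definition can be sketched as follows: snapshot each source once, then join over the frozen snapshots, so a single query never sees two versions of one row. The snapshot functions and their data are stand-ins, not real OS reads.

```python
import sqlite3

def snapshot_processes():
    # Stand-in for one read of the kernel's process list.
    return [(1, "init"), (42, "sshd")]

def snapshot_connections():
    # Stand-in for one read of the open connections; pid 99 exited
    # between "reads", like the example in the comments above.
    return [(42, 22), (99, 8080)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE procs (pid INT, name TEXT)")
db.execute("CREATE TABLE conns (pid INT, port INT)")
db.executemany("INSERT INTO procs VALUES (?, ?)", snapshot_processes())
db.executemany("INSERT INTO conns VALUES (?, ?)", snapshot_connections())

# Joining the frozen snapshots is repeatable within this query;
# connections whose process vanished simply don't match.
rows = db.execute("SELECT p.name, c.port FROM procs p "
                  "JOIN conns c ON c.pid = p.pid").fetchall()
print(rows)
```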
The SQL engine part of osquery is backed by SQLite's parser, lexer, query planner, etc. MySQL would be awesome as well though, mostly because of how easy it might be to integrate the MySQL remote access protocol to do remotely authenticated queries!
Postgres would probably be even easier, since it already has the framework to do this (foreign data wrappers) and a nice API to write said wrappers in Python[1] - 200 lines to use your SQL client as an email reader[2]; I imagine that wrapping /proc would be considerably less :)
Turns out postgres is even easier. Took 15 minutes to go from "never used FDWs before, but how hard can it be?" to "got access to the process table as SQL" :)
Get a load of Mr. "Im-too-l33t-to-use-MySQL" over here.
Otherwise it sounds like you're insinuating that MySQL is somehow languishing or limping along, which is ridiculous.
MySQL is going to SELECT almost any query faster than Postgres in a single-user/desktop installation, it's going to be easier (although granted less strict) to import data, plus it has Sequel Pro.
Without endorsing the GP's trolling, Postgres generally substantially outperforms MySQL on queries with any complexity to them. Its query optimiser is enormously superior.
The idea that MySQL is much faster than PG is largely a relic of the era of MyISAM, which was very fast for trivial queries, but also terrible in basically every other way.
Interesting. Apparently my received wisdom is out-of-date. I usually use MySQL on the laptop when I have to clean data for analysis because it's already installed, but I will have to look into replacing it.
Someone should make an ORM wrapper. Perhaps with one of those trendy fluent APIs, so you can just pipe the output from one command as input to the next.
Was that a joke? If there's a general-purpose querying mechanism for OS internals, does using JSON strings in a proprietary protocol that you then have to parse make more sense than JDBC? JDBC is the closest thing we have to a standard RDBMS access API, and it's got mechanisms that can be helpful if you want to speed up query execution (and waste fewer resources), like prepared statements. Plus, there's tons of visualization software for it already.
Yeah, it was just a joke about the fact that ORMs are so popular because people think learning SQL is too hard, but if you actually make a fluent ORM, then that's basically exactly what UNIX is to start with.
Except this project takes the SQL approach, not the UNIX approach, and the most popular SQL API is JDBC. Also, I don't know if ORMs are still popular, but their original goal had little to do with simplifying SQL and a lot to do with trying (unsuccessfully) to reconcile the relational model with the OO model, and to save application developers the need to do the translation manually.
This project currently has a proprietary C++ API, so why not standardize with JDBC (at least for those making use of the JVM)?
This uses SQLite virtual tables with a substantial amount of native code, so I imagine adapting the existing SQLite JDBC drivers, which rely on JNI, would be a lot of work. Using a client/server model would probably be easier, though I haven't seen an existing server/driver combo for SQLite.
One idea is to make a PostgreSQL Foreign Data Wrapper (FDW) that would allow using PostgreSQL JDBC (and ODBC) drivers.
Cool. It would be interesting to see if that comes up as a common request; we have found the ability to query system data both in real time and historically to be interesting.
Have you seen the other neat functionality in the Akamai query system? They allow (actually, require) devs to create callbacks that get exposed as query tables for debugging (vs. letting devs onto machines as the first or second line of debugging).
It's an interesting paradigm, though again, the way they use it, like osquery, is real-time only, while the psql/OpenTSDB method allows for history as well as real-time.
osquery's (and Akamai Query's) syntax is definitely simpler. We do plan to add a logical schema to map tags in OpenTSDB to logical columns to help with that.
The original wiki had much to say, over its many discussions, about TOP - table-oriented programming.
Today we know via Category Theory that tables are Turing complete and are actually quite synonymous with CT itself.
In other words, thinking of computation in terms of tables with rows and columns and relationships between tables is an interesting and promising (given CT) approach to computing that has been discussed in the past but then left largely unexplored.
Do you have some examples of how tables are "synonymous with CT itself"?
There may be some application of concepts from category theory to relational algebra, of course, since category theory is incredibly abstract and has some tangential relation with most everything. But it seems a bit too glib to say that tables are "synonymous with CT itself".
I was intrigued by that statement as well and it sent me off on a little research excursion.
Evidently a mathematician at MIT, David Spivak[0], has done some work on Databases as Categories. I found a presentation he gave to Galois[1] and a summary of the talk by E.Z. Yang.[2].
That said, I think "tables synonymous with CT itself" is a bit strong. Rather, the argument is that database schemas are categories if you model them appropriately: tables are objects in the category, and foreign keys are the morphisms/arrows.
[1]: http://incidentalcomplexity.com/2014/10/16/retrospective/