Tenzing: A SQL Implementation On The MapReduce Framework (research.google.com)
75 points by motter on Jan 24, 2012 | 16 comments



Isn't this a Google version of Hive, which was open-sourced by Facebook and provides a SQL-style syntax on top of Hadoop? Queries aren't quick; it just allows offline data crunching to be coded quickly without users having to write lots of MapReduce. Cool concept, but don't expect to see the online part of web apps powered by this.
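
For a concrete sense of what gets replaced, here's a toy sketch (function and field names hypothetical; not Hive's or Tenzing's actual API) of the hand-written MapReduce that a one-line GROUP BY stands in for:

    # A query like
    #   SELECT domain, COUNT(*) FROM hits GROUP BY domain
    # saves you from writing map/reduce functions by hand:

    def map_phase(record):
        # emit (key, 1) for every input row
        yield (record["domain"], 1)

    def reduce_phase(key, counts):
        # sum the per-key counts emitted by the mappers
        yield (key, sum(counts))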


Well, this is pretty much a given, isn't it? It's just a SQL implementation; it says nothing about the underlying storage and its guarantees. Given that it runs on GFS and Bigtable, unless those technologies support ACID, don't expect Tenzing to support it either. Here's a quote from the paper:

    Tenzing is not ACID compliant - specifically, we are atomic, consistent and
    durable, but do not support isolation.
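
For a concrete (and entirely hypothetical, non-Tenzing) picture of what missing isolation means:

    # Toy illustration: without isolation, a concurrent reader can
    # observe another query's partial writes.
    table = []

    def writer():
        table.append(("row", 1))  # each row lands atomically...
        table.append(("row", 2))  # ...but the statement as a whole is not isolated

    def reader():
        return len(table)  # may return 0, 1, or 2 if it runs mid-write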


It would be interesting to know the identity of the vendor for "DBMS-X". I work in the "enterprise" data warehouse space and I'm trying to advocate moving away from "database appliances" towards distributed computing, and having a quotable source from Google would be very compelling.


I'm curious - what sorts of work do you do in the data warehousing space? Do you work as a consultant, or as an implementor at a customer of data warehouse products?

It seems to me that the whole industry (DW & ETL) is a dinosaur whose lunch is about to get eaten by some upstarts.


I've read a few books on data warehousing, and maybe you can confirm my suspicion:

Isn't ETL just an acronym that means "I wrote this Perl script to populate the database"?

How on earth is that even an industry?


Simple ETL jobs are mostly just E & L: extract the data from one system, load it into another.

Where things get complex is in the Transform aspect of some jobs. Mapping disparate schemas is complex, often messy work, especially when one (or both) sides of the ETL job have poor or no primary keys and foreign keys, or are even just "mostly standard" CSV files [shudder].
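
For illustration, a toy Transform in Python (the file, schema, and field names are entirely made up):

    import csv
    import hashlib

    def transform(row):
        # normalize a "mostly standard" CSV row into the target schema;
        # the source has no reliable primary key, so synthesize one
        # (all field names here are hypothetical)
        full_name = row["last_nm"].strip() + ", " + row["first_nm"].strip()
        surrogate_key = hashlib.md5(full_name.encode()).hexdigest()
        return {
            "customer_id": surrogate_key,
            "name": full_name,
            "country": row.get("cntry", "UNKNOWN").upper(),
        }

    with open("extract.csv") as f:
        rows = [transform(r) for r in csv.DictReader(f)]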

Also: some ETL jobs can get quite large. I know one guy who had to create an ETL system that continuously moved data from one 1200-table system into some other system. Crazy.


The term "ETL" itself is often used in place of "Data Integration" which is much larger, particularly when it comes to data warehouse design. The wiki article is a good drop off point: http://en.wikipedia.org/wiki/Data_integration

It may be difficult to understand how this is an industry coming from a web development/startup angle (big supposition there), but there are literally thousands of companies with lots of databases varying in age, size, and complexity that need integrating, and plenty of companies competing for that work as either implementors or software providers. A Perl script might do the job, but most products focus on performance, reuse, ease of maintenance, and compatibility across many different database/file types.


Eh. Even if growth slows a lot because more and more new systems are Hadoop/etc., big companies are so tied to their legacy systems that they basically never get rid of what they have, so the incumbent vendors will see significant recurring revenue from their current customers for the foreseeable future.

I also get the impression that Exadata is a pretty impressive feat of engineering and, if you need to do what it's optimized for and are prepared to pay a few million per rack, it's a very good option.


Both, really. I work as a consultant for a company that provides consultancy for clients that use ETL products (software/'appliances' etc).

Your second comment is true; however, the DW industry has figured this out in the last year and started to embrace the "Big Data" movement. Informatica (the largest player in the DW space, according to Gartner) added HDFS connectors to its latest release, for instance.


(Related article that likely prompted this link, but which has since fallen off the HN front page just in time for the American audience:

http://news.ycombinator.com/item?id=3503866 )


How does this relate to Dremel? I thought Dremel had a SQL frontend to MapReduce that was already in wide use at Google.

http://research.google.com/pubs/pub36632.html


Dremel is mostly used for SQL-like queries in logs processing while Tenzing is largely used to run SQL-like queries on BigTable.


That's exactly the question I have. There seem to be a couple of hints about it, e.g. in section 4.8:

"Tenzing has read-only support for structured (nested and repeated) data formats such as complex protocol buffer struc- tures. <...> The engine itself can only deal with flat relational data, unlike Dremel [17]"

And from section 5.4 I assume that they currently use the Dremel query engine but are working on building another one.


Dremel, aka BigQuery, has a dedicated execution engine that is roughly an order of magnitude faster than MapReduce for typical SQL queries.


> Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day

So that's 10 queries/employee/day. That screams "experimental". Still, this would be very nice.


That doesn't say how big these queries are.

And this was quite a while ago.



