Presto: Interacting with petabytes of data at Facebook (facebook.com)
336 points by ternaryoperator on Nov 6, 2013 | 97 comments



>As mentioned above, development on Presto started in Fall 2012. We had our first production system up and running in early 2013. It was fully rolled out to the entire company by Spring 2013.

Wow. As somebody who skimmed through the original Google Dremel paper and thought for a while about how one would go about implementing such an interactive system, that strikes me as an amazingly impressive timeline.


Hi, I work on Presto at Facebook and would be happy to answer any questions about it.


1. Can you force it to do exact (not approximate) counts and aggregations? If so, what are the preconditions (size of tables)?

2. In data warehouse terms, would you fit it as a MOLAP, ROLAP or HOLAP kind of engine? I'm not sure whether all data is held in memory and whether aggregates are cached. Can you preprocess the dataset in batch mode (let's say overnight) and repopulate aggregate caches for faster retrieval later on? (I mean something similar to SQL Server Analysis Services in MOLAP mode.)

3. Can you compare it to Apache Drill?


1) Presto always returns exact answers, with the exception of the approximate aggregate functions approx_distinct() and approx_percentile() (see http://prestodb.io/docs/current/functions/aggregate.html). We do not currently support COUNT(DISTINCT), though if you really need that you can use a subquery with SELECT DISTINCT.
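For illustration, a minimal sketch of the two approaches (table and column names are made up):

    -- fast approximate distinct count using HyperLogLog
    SELECT approx_distinct(user_id) FROM page_views;

    -- exact distinct count without COUNT(DISTINCT), via a SELECT DISTINCT subquery
    SELECT COUNT(*)
    FROM (SELECT DISTINCT user_id FROM page_views) t;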

We have a pull request for partitioned hash aggregations, but currently all the groups must fit within the memory limit specified by the "task.max-memory" configuration parameter (see http://prestodb.io/docs/current/installation/deployment.html for details).

Regarding approximate queries, we are working with the author of BlinkDB (http://blinkdb.org/) to add it to Presto. BlinkDB allows very fast approximate queries with bounded errors (an important requirement for statisticians / data scientists).

2) Presto is a traditional SQL engine, so it would be ROLAP. We don't yet have any support for building cubes or other specialized structures (though full materialized view support with rewrite is on the roadmap).

The Presto query engine is actually agnostic to the data source. Data is queried via pluggable connectors. We don't currently have any caching ready for production.

There is a very alpha quality native store that we plan to use soon for query acceleration. The idea is you create a materialized view against a Hive table which loads the data into the native store. The view then gets used transparently when you query the Hive table, so the user doesn't have to rewrite their queries or even know about it. All they see is that their query is 100x faster. (All this code is there today in the open source release but needs work to productionize it.)

We have a dashboard system at Facebook that uses Presto. For large tables, users typically set up pipelines in Hive that run daily to compute summary tables, then write the dashboard queries against these summary tables. In the future, we would like to be able to handle all of this within Presto as materialized views.
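
For concreteness, a hypothetical sketch of that pattern (all table and column names invented): a daily Hive job rolls the large fact table up into a small summary table, and the dashboard queries only the summary.

    -- daily Hive pipeline step
    INSERT OVERWRITE TABLE daily_country_summary PARTITION (ds = '2013-11-06')
    SELECT country, COUNT(*) AS views, COUNT(DISTINCT user_id) AS unique_users
    FROM page_views
    WHERE ds = '2013-11-06'
    GROUP BY country;

    -- dashboard query against the small summary table
    SELECT country, views, unique_users
    FROM daily_country_summary
    WHERE ds = '2013-11-06'
    ORDER BY views DESC;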

3) We're excited about the Drill project. They have some interesting ideas about integrating unstructured data processing (like arbitrary JSON documents) with standard SQL. However, last I looked they were still in early development, whereas Presto is in production at Facebook and is usable today. Please also see this comment: https://news.ycombinator.com/item?id=6684785


Thanks a lot for the answers, and good luck with further development. I'm a regular SQL Server user and the MOLAP goodies can be very useful. From what I understand, it should be very easy to join a new data source to an existing Presto warehouse. So I suppose you could "upload" additional mappings from Excel/CSV into the warehouse and run new queries live (instead of waiting for a batch reprocess). This can be very beneficial to the user/analyst who often likes to define new data categorizations on the fly and use them to dig further into the data. Great job! I'm putting this project on my must-try shortlist right now!


If this is a hot-button item, seriously look into MicroStrategy >9.2.1. You build the project against some warehouse, and can import other data sources like Excel, Access, or even other ODBC sources. The imported data gets converted to an in-memory cube, which you can then combine with analysis against the larger warehouse. No reprocessing on the warehouse needed.


How complete is the SQL dialect? Are nested queries supported? I was surprised to see that the ANTLR grammar is only ~900 lines; a quick perusal didn't suggest to me that nested queries are supported, but I'm not a guru...


> It currently supports a large subset of ANSI SQL, including joins, left/right outer joins, subqueries, and most of the common aggregate and scalar functions, including approximate distinct counts (using HyperLogLog) and approximate percentiles (based on quantile digest). The main restrictions at this stage are a size limitation on the join tables and the cardinality of unique keys/groups. The system also lacks the ability to write output data back to tables (currently query results are streamed to the client).


2) Can't you directly read the RCFiles/etc. bypassing Hive? Why implement another storage format?


Presto does read Hive data directly out of HDFS, bypassing Hive/MapReduce. However, neither HDFS nor the file formats like RCFile are efficient. The entire design of HDFS makes it difficult to build a real column store. Having a native store for Presto allows us to have tight control over the exact data format and placement, enabling 10-100x performance increases over the 10x that Presto already has over Hive.

Additionally, in the long term, we want to enable Presto to be a completely standalone system that is not dependent on HDFS or the Hive metastore, while enabling next-generation features such as full transaction support, writable snapshots, tiered storage, etc.


HDFS now supports placement groups with one use case being columnar storage.


Any metrics you can share about the team size, any esoteric skill sets found to be crucial, or overall cost? Some sense for those who wonder "what would it take to create or recreate something like this?"


We started the project with 4 full-time engineers and added a fifth person recently. We began working on Presto in August last year and deployed our first working version in January, and we've been iterating on it ever since.

One big advantage we have that speeds up development is the Facebook culture of moving fast and shipping often. Facebook employees are used to working with software as it's being built and refined. This kept us focused on the key subset of features that matter to users and gave us close-to-realtime feedback from them. Our development cycle is typically one release and push to production per week.

Also, from the beginning Presto had to work with current Facebook infrastructure (100s of machines, 100s of petabytes), so we faced and solved all the associated scaling challenges up-front.


So, the obvious question is does it connect easily with Scuba? From the paper, Scuba doesn't support joins, so would that logic be handled in Presto?


We are working on a Scuba connector.


Sorry, didn't answer the core question. We are working on a Scuba connector specifically to bring joins and other powerful SQL features to Scuba. These features are handled by the Presto query engine, so data in existing non-relational systems can be easily analyzed. Even better, Presto can join data across connectors to formerly isolated systems.


Hi!

I'm missing something very basic here. The core idea of Presto seems to be to scan data from dumber systems -- Hive, non-relational stores, whatever -- and do SQL processing on it. So isn't its speed bounded by the speed of the underlying systems? I know you're working on various solutions to that, but what parts are actually in production today?


Can you point me to any documentation on the airlift framework that Presto uses? It seems to be a REST server and service discovery framework. I can see the framework on github but there is very little documentation.


Airlift is a toolkit for building distributed services and it is the foundation of Presto. For the most part, it's a convenience layer on top of some well-known libraries (Jackson, Jersey, Jetty, Guice, etc.), but it also has support for various things you'd typically need to build a deployable server, such as packaging, configuration, logging, instrumentation (stats), etc.

Unfortunately, we don't have any docs, so you'll just have to peruse the code. There's a minimal sample server in the codebase that demonstrates usage of some of its features: https://github.com/airlift/airlift/tree/master/sample-server


Thanks! It reminds me a bit of Dropwizard at first glance. Doing REST services in Java has been really easy and enjoyable lately. The Jersey, Jackson, Jetty combo is great.


But why did you use this particular one? And how is it related to https://github.com/proofpoint/platform ? Thanks in advance.


Can you give a pointer to some sample code implementing a custom connector to a file containing flat binary records? If not, sample code for a connector to HDFS...


We don't have a connector for flat binary records or plain HDFS, but you can take a look at the Hive connector here: https://github.com/facebook/presto/tree/master/presto-hive/s...

We're also working on an example connector that can read from files/URLs. We should have that code up soon.


Congrats on open sourcing this. It sounds really great and I'm excited to get a chance to play with it.

Do you think this would be a decent candidate for click stream analysis?


Thanks! Presto supports arbitrary SQL including some basic window functions (full support is on our roadmap). I've used window functions at previous companies with commercial data warehouses to do some basic click stream analysis: http://prestodb.io/docs/current/functions/window.html
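
As a rough sketch (hypothetical clicks table with user_id, url, ts and a ds partition column), a window function answers questions like "what was the first page each user landed on each day":

    SELECT user_id, ds, url AS landing_page
    FROM (
        SELECT user_id, ds, url,
               row_number() OVER (PARTITION BY user_id, ds ORDER BY ts) AS rn
        FROM clicks
    ) t
    WHERE rn = 1;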

We have over a thousand employees at Facebook using it daily for anything you can imagine, so I recommend trying it and letting us know what you find useful and how it can be improved.


Is there any benefit to use Presto on top of Riak or ElasticSearch (assuming someone writes the required connectors)? What are the main differences there?


The benefits would be that you could query them using full ANSI SQL and could join them with other data sources such as Hive. One of the benefits of Presto's connector architecture is that it allows queries across multiple disparate data sources (after someone writes the connectors).


What are the access paths that Presto uses (full scan, range scan, skip scan, ...)? Does it understand indexes?


Presto supports full scans right now, but pushes down key-value equality pairs to the connectors for additional connector-side pruning (e.g. partitions in Hive). The engine itself will also do some additional partition pruning at planning time using the partitions returned by the connector.

We currently have a pull request open that will allow range predicates to be pushed down to the connectors. This will let connectors implement range/skip scans.

The core Presto engine does not take advantage of indexes right now.
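
A small example of what that pushdown buys you today (hypothetical Hive table partitioned by ds): the equality predicate on the partition key is handed to the Hive connector, which prunes the scan to a single partition, while the remaining predicate is evaluated by Presto itself.

    SELECT COUNT(*)
    FROM page_views
    WHERE ds = '2013-11-06'     -- pushed down: prunes to one partition
      AND browser = 'Chrome';   -- evaluated by the Presto engine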


Is Presto meant mainly for ad hoc internal analysis or is it fast enough for user-facing features as well?


Presto is designed for OLAP-type workloads and is thus used heavily inside Facebook for data analysis and reporting. It won't be replacing traditional databases (like MySQL) that are typically used for user-facing features, but it may be suitable for serving user-facing traffic doing OLAP-type queries such as generating dashboards. We are actually looking into doing something like that right now for Facebook.


Can it be installed and played around with on the virtual machine Cloudera provides for Hadoop?


Wow, I didn't realize they had a VM image available. We haven't tried it, but there's no reason why it wouldn't work. Try it and let us know!


Are you looking for interns?


Facebook is always looking for interns and has a great intern program. See https://www.facebook.com/careers/department?dept=interns&req...


Thanks for the link, I'll try that. At the career fairs, the recruiters have indicated they filter out people like me who aren't juniors in college, so I'm looking for someone on the inside to help me get past this barrier.


I do a bunch of on-campus recruiting at Facebook.

For freshmen we recommend FBU: https://www.facebook.com/careers/university/fbu

Everyone else should check out regular internships or new grad positions: https://www.facebook.com/careers/university


Is the problem that you're younger than that or older than that?

In the case that you're younger, at Google, we offer something called Engineering Practicum internships. In this program, you get paired up with another freshman or sophomore, and work on an intern project together.

https://www.google.com/about/jobs/search/?src=Online/TOPs/NA...

Feel free to email me if you're interested in this. I'm doing some heavy data analysis and would be more than happy to host some interns next summer.


As an Eng Practicum alum, I can't recommend this internship enough. You'll write code that matters, learn a ton about 'real-world' engineering at scale, and meet some amazing people.

If you're a freshman or sophomore, apply!


We (I'm another eng at FB) definitely hire interns who are not juniors in college. You can be younger or older (or you could be a junior - I did my internship here after junior year).

As I understand it you just need to be going back to school at the end of your internship.


1. Which GraphDB do you use at Facebook?

2. In which DB do you store the text and media?


Facebook uses TAO for graphs: https://www.facebook.com/notes/facebook-engineering/tao-the-...

MySQL is used for storing textual user content like comments, etc. See https://www.facebook.com/MySQLatFacebook

Photos are stored using specialized systems: https://www.facebook.com/note.php?note_id=76191543919 http://www.stanford.edu/class/cs240/readings/haystack.pdf

And you can see some pictures of our new data center used for cold storage of older photos: http://readwrite.com/2013/10/16/facebook-prineville-cold-sto...


Wow, thank you! Is there an open-source alternative comparable to TAO for graphs? Neo4j is dead slow.


How does Presto compare to Impala?



It will be very interesting if someone does a benchmark comparison of Presto with Cloudera Impala, Amazon RedShift and Apache Drill.

Also, very curious to know (from any Googlers browsing HN) if Dremel is still the state-of-the-art within Google, or if there is already a newer replacement.


The state of the art at Google appears to be F1 [0], although that takes a pretty different approach than Presto/Dremel and may be complementary. This area (interactive SQL on big data) has become very active lately. In addition to those you list, there's Shark [1] and BlinkDB [2] from Berkeley's AMP Lab.

[0] http://research.google.com/pubs/pub41344.html [1] http://shark.cs.berkeley.edu [2] http://blinkdb.org/


As sprizzle says, F1 and Dremel are not competitors; Dremel is a read-only system supporting large analysis queries, while F1 is a read-write system designed for large numbers of smaller queries. Last year Google published a paper on PowerDrill (http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf), which sounds somewhat like Dremel but is designed to use large amounts of RAM, which in turn enables some very powerful optimizations.


I believe F1 is the SQL DB that powers ads (AdWords, AdSense, etc.) and is used in production for consumer-facing apps. Dremel is more of a data/log analysis tool, but doesn't typically interact directly with consumer-facing apps. Dremel (externally: Google BigQuery) is still widely used at Google across all product areas; F1 is used in a few products but not many.


I can at least say that I've used both a large Hadoop+Hive farm and a moderately sized RedShift cluster on the same large data set, and holy smoke, RedShift is orders of magnitude faster. Results vary with how big the nodes you allocate to RedShift are, and it's not cheap at all, but very impressive.


This is exactly what you should expect in almost all cases because you are kinda comparing apples and oranges.

Hadoop/Hive was not focused on speed but on proving it was even possible to do queries reliably on such large datasets. Once this was achieved, it immediately became clear that the long waits for Hadoop/Hive batch jobs to finish made it impractical for many uses. Presto, Impala, Drill, RedShift, etc. were all designed primarily to address this problem and be much, much faster than Hadoop/Hive so that the data could be queried interactively.

All these new projects/products are in a very active competition to find the best way or ways to do this. You should compare RedShift to these other projects rather than Hadoop if speed is an issue for you. Each has its pluses and minuses depending on the situation.


Wait, so RedShift (Drill, Impala, Presto) do the same job as Hadoop/Hive, only faster? You only mentioned that they addressed the speed, but what's the downside? At what cost did they achieve their speed?


Basically, Hive is incredibly inefficient. Hive works by taking apart the HiveQL query and turning it into a series of Hadoop MR steps. A complex hive query may have many such steps, and for each one data must be read from HDFS, processed, and written back to HDFS. For most queries, the vast majority of your time will be spent just doing IO. Hadoop is also very, very slow to start up tasks (> 1 minute), so when you have a lot of them that can come to dominate your total run time.
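
As a hedged illustration (made-up tables), a query like the one below typically compiles to a chain of MapReduce jobs in Hive, roughly one for the join, one for the aggregation and one for the ordering, with intermediate results written to and re-read from HDFS between jobs:

    SELECT u.country, COUNT(*) AS views
    FROM page_views v
    JOIN users u ON v.user_id = u.user_id
    GROUP BY u.country
    ORDER BY views DESC
    LIMIT 10;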

Impala, Drill, etc. avoid all those unnecessary reads and writes by implementing the querying logic directly, rather than by compiling to Map Reduce.

Shark [0] is an interesting counterpoint. It takes essentially the same approach as Hive but on Spark instead of Hadoop, and achieves performance similar to or better than the more "direct" implementations.

[0] http://shark.cs.berkeley.edu


The tradeoffs are both complex and evolving and are different for each project/product. In general Hadoop is more scalable and more flexible but much slower.

For example, as of today RedShift can hold a maximum of 256 terabytes of compressed data, while Facebook's Hadoop cluster was over 200 petabytes in late 2012. RedShift only supports limited query and data types and a single index, while Hadoop can theoretically handle arbitrary data processing. But if these constraints are acceptable, then RedShift will likely be orders of magnitude faster in most cases.

Other projects/products will have different tradeoffs but they are almost always faster as this was almost always the primary goal.


Where do you get the limit of 256TB compressed from?

     Amazon Redshift enables you to start with as little 
     as a single 2TB XL node and scale up all the way to
     a hundred 16TB 8XL nodes for 1.6PB of compressed user data.
from http://aws.amazon.com/redshift/features-and-benefits/


My mistake, thanks for the correction.

The 16 node limit I was familiar with is not a hard limit. You can request more nodes.

http://aws.amazon.com/redshift/faqs/#0080

It would be interesting to see performance comparisons on these huge datasets. I would expect to see new and interesting problems at that scale.


I am working on a related product. Is there some way I could contact you to pick your brain about some things?


Although a little out of date, there is a website dedicated to this:

https://amplab.cs.berkeley.edu/benchmark/


This is awesome, thanks for sharing!

RedShift looks to be an order of magnitude faster than Impala or Shark in all the tests. Does this mean that once RedShift supports user-defined functions, there is no competing solution that is any match? (Unless you want to avoid using the cloud.)


The benchmark above is testing Impala with SequenceFiles compressed with GZIP, against RedShift, which is not a fair comparison.

In the "What's next?" section, they say they want to re-do the Impala tests using Parquet, which is a columnar format based on the Dremel whitepaper (http://parquet.io/).


Ah that makes sense! Looking at their results and how RedShift was so much faster in every scenario, it looked like something was amiss. Is Parquet Cloudera-only like Impala or is it available with vanilla Hadoop?


Impala isn't technically Cloudera-only; it's open source (https://github.com/cloudera/impala), and other people have gotten it to run on their Hadoop distributions, but since it's developed by Cloudera, it was developed to run on the CDH platform (Hadoop).

Parquet was a joint effort between Cloudera and Twitter, and now it's being developed by many other companies. You can use it with Hive, Pig, MapReduce, Cascading, Crunch and I think Apache Drill's first milestone has adopted it as a columnar format as well. Parquet also allows you to use your Avro or Thrift schema (soon Protobuffs) to write Parquet data, too.

It's a separate project in the ecosystem and has its own roadmap (https://github.com/Parquet/parquet-mr).


Note that the suggested benchmark (https://amplab.cs.berkeley.edu/benchmark/) is a slightly modified version of the Hive Benchmark. Both of these are just 3-4 tables total and 4 very basic queries. I recommend looking at something more realistic (e.g. TPC-DS, TPC-H, etc).


The file format used for the Hadoop tools is SequenceFile. For most queries this is the second-slowest format, after text.

RCFile or Parquet would be a more interesting benchmark.


From the website:

"Here from HackerNews? This was originally posted several months ago. Check back in two weeks for an updated benchmark including newer versions of Hive, Impala, and Shark."


Would also be curious to see how the execution model and language relates to/differs from Microsoft Dryad[0]/SCOPE[1].

[0] http://research.microsoft.com/en-us/projects/dryad/ [1] http://research.microsoft.com/en-us/um/people/jrzhou/pub/Sco...


For those of you interested in more information on Impala and its performance characteristics, see rideimpala.com and https://speakerdeck.com/grahn/practical-performance-analysis...


There is also Shard-Query, which is built on top of a cluster of MySQL servers. According to its developer, who works at Percona, running Shard-Query with MySQL using a column store (Infobright community edition) gives performance similar to RedShift.

http://shardquery.com/


For interactive workflows, there is powerdrill: http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf


You might want to add Apache Tez to the list as well. Series of posts on Tez: http://hortonworks.com/hadoop/tez/


I'm curious what these thousands of queries by Facebook employees are for. What data/information does Facebook mine daily?


I'm a recent user of Presto and Hive at Facebook, coming up on three years here. I've been using them primarily to understand ways in which we might be sending people's traffic to non-optimal locations, or over non-optimal paths between our CDN and the source content, or to help with understanding current and extrapolating future capacity needs on our CDN (total requests, changes in cache hit object/byte rates, &c.)


I imagine that they could use it to track how many cats walk across people's keyboards daily, and what peak "cat walking on keyboard" hours are. I'd be more than mildly interested in the results.


> (In a later post, we will share some tips and tricks for writing high-performance Java system code and the lessons learned while building Presto.)

I'm waiting in anticipation for that article!


Rounding down is not a good solution to the rounding problem. As a software engineer, I also used rounding in my contracts when I was working as a contractor - and here's why:

If I have 5 clients at a time, and each one has a concept/idea every day that takes 20 seconds to explain to me (and isn't on the current iteration) then they are distracting me by not managing their time properly and calling me a dozen times a day to tell me about their thoughts. If it is a change to the current iteration, then it should have been discussed when we agreed on the current iteration's feature set. Either way, the call is not a result of well thought out time management.

As much as I like hearing new concepts and ideas, I also have to take attention away from a project that I'm working on in order to provide my full attention to the client calling me.

After the call is done, I also have to come back to the project at hand, and hopefully I'm not working on something that requires that I retain a super complicated chain of thought, which may or may not have been lost in discussion with another client - especially considering that I'm not going to bill another client's project for the time I've spent being sidetracked and/or getting back to where I was before the call was made.

So, charging in $15 increments causes the client to actually manage their time with the same effectiveness that they would hope that I am managing mine.

Rounding down makes this more of a problem for me, not less of one. Now a 90 second call is at no charge, and I can get more than one of those in one hour - still at no charge - based on the suggested agreement.


Posted on the wrong story?



I really hope one day Google can follow Facebook in open-sourcing some of their core infrastructure technologies (even the legacy ones, e.g. the 1st generation of GFS) to the open-source community :)


It would be extremely kind if someone could explain the major differences, or the features that Presto offers that are different than Impala or other similar products.


I can't speak to what other products like Impala, Apache Drill, etc. offer, but Presto supports the following:

- Standard ANSI SQL syntax, including all the basic features you'd expect from a SQL engine (aggregations, joins, etc) and other more advanced features like analytic window functions, common table expressions (WITH), approximate distinct counts and percentiles.

- It's extensible. The open source code base includes a connector for Hive, but we also have some custom connectors for internal data stores at Facebook. We're working on a connector for HBase, too.

- In comparison to Hive, it's very fast and efficient. For our workloads it's at least 10x more CPU-efficient. Airbnb is using it and has had a similar experience.

- Most importantly, Presto has been battle-tested. It's been in production at Facebook since January and it's used by 1,000 employees every day running 30,000 queries daily. We've hit every edge case you can imagine.


Any best practices or gotchas about query authoring in comparison to a typical relational database? What part of the traditional relational db mentality needs to be changed or thrown out as I write queries. For example, should I avoid using this aggregate function or that join in a way I'm familiar with in the context of something like Postgres.


When running against Hive data, Presto is similar to many analytic databases in that queries typically perform a full table or partition scan, so "point lookups" that search for one or a few records will be much less efficient than they would be in an OLTP system like PostgreSQL that has a precomputed index for that query. (This is actually a property of the data source and not the Presto query engine. For example, we are writing an HBase connector that can take advantage of HBase's indexes for very efficient queries.)

In general, you should be able to write your query in the simplest and most readable way, and Presto should execute it efficiently. We already have the start of an advanced optimizer that supports equality inference, full predicate move-around, etc. This means that you don't need to write redundant predicates everywhere as is required with some query engines.
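
For example (hypothetical tables), with equality inference the date filter below only needs to be written once; the optimizer infers b.ds = '2013-11-06' through the join condition, so partition pruning applies to both sides without repeating the predicate by hand:

    SELECT COUNT(*)
    FROM impressions a
    JOIN clicks b
      ON a.impression_id = b.impression_id
     AND a.ds = b.ds
    WHERE a.ds = '2013-11-06';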

Also, if you are familiar with PostgreSQL, you should feel right at home using Presto. When making decisions for things not covered by ANSI SQL, the first thing we look at is "what does PostgreSQL do".


What about connectors to other relational databases: PostgreSQL, MySQL, etc?


Connecting Presto to a relational database is a tricky question. If you just want to have Presto scan the table and perform all computation in Presto, it is pretty easy, but for this to perform well you would want to push down some of the computation to the database. The other problem is that if you only have one database, you would have hundreds of cores hammering that single database for data.

That said, earlier this year, during a hackathon, we built a prototype connector that could split a query and push down the relevant parts to a distributed database that supports simple aggregations. It would be more work to clean this up and integrate it, so if a lot of people are interested in this we can prioritize it.


Makes sense. I was thinking for the case where you have data in multiple databases/servers and want to do aggregation or joins without first doing some ETL step to bring the data into another format. Unless there is something Presto can do that a relational database can't, I would assume you just use normal SQL if you have a single database.


If you did that, wouldn't you just be re-inventing DATAllegro or early versions of Greenplum or early versions of Aster? I.e., better than nothing, but still far short of a modern analytic relational DBMS?


Are those things open-source?


Actually, there was an early version of Greenplum that was open source. Nobody seemed to care much.


The data still lives in HDFS, so Sqoop is the preferred way to export to RDBMS as far as I know.


Interesting use of optimizing SQL query plans down to JVM bytecode. I don't think anybody has done that before - but it sounds like it would work really well. Would be really interesting to hear more on that, as it might be applicable to a lot of other areas as well.


Maybe with SQL, but compiling a DML to bytecode isn't a new idea:

http://xml.coverpages.org/saxon42Ann.html


How can I install and use this on a Windows 2012 server with SQL Server?


Really nice to see posts like that from Facebook.


What is the data format FB stores? (XML/JSON/?)


Most of the data in our data warehouse is stored as relational records using Hive's RCFile format, though we are actively working on converting it to the new ORCFile format.

The Hive data model supports complex types including arrays, maps and structs (nested tables), and those are used liberally. We also have a fair amount of data stored as JSON inside of string columns in Hive tables, though much of this is actually structured and could be better modeled using Hive's complex types.

Presto currently has limited support for Hive complex types. All complex types are converted to JSON at query time and can be accessed using the JSON functions: http://prestodb.io/docs/current/functions/json.html
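
A quick hedged example (invented table and key names): a Hive map/struct column surfaces in Presto as a JSON string, so you query it with the JSON functions, e.g.

    SELECT json_extract_scalar(properties, '$.browser') AS browser,
           COUNT(*) AS views
    FROM page_views
    GROUP BY json_extract_scalar(properties, '$.browser');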


Java -_-


“Beautiful. Unethical. Dangerous.”

— Lucius Fox, The Dark Knight



