Apache Drill's Future (apache.org)
132 points by kermatt on May 20, 2020 | 75 comments



Apache Drill is an interesting project. Of all the MPP engines that appeared a few years ago, it was the most similar to BigQuery (the first public version) and the most flexible.

However, the competition was fierce, and each Big Data vendor (MapR, Cloudera, and Hortonworks) was pushing its own solution: Drill, Impala, and Hive on Tez. Competition is always a good thing, but it fragmented the user base too much, so no clear winner emerged.

At the same time, Spark SQL became good enough to replace these tools in most use cases, and Presto (from Facebook) gained the traction and user base that none of these projects had by being vendor agnostic (its adoption by AWS in Athena and EMR also helped boost its popularity).


There is a common belief that SparkSQL is better than Hive because SparkSQL uses in-memory computing while Hive is disk-based. Another common belief is that Presto is better than Hive because it is based on an MPP design and was invented to overcome the slow speed of Hive by the very company (Facebook) that created Hive in the early 2010s.

The reality is that nowadays both SparkSQL and Presto are way behind Hive, in terms of both speed and maturity. Hive has made tremendous progress since 2015 (with the introduction of LLAP), while SparkSQL still has stability issues with fault tolerance and shuffling. (Presto does not support fault tolerance.) So, IMO, SparkSQL is nowhere near ready to replace Hive.

If you are curious about the performance of these systems, see [1] and [2], which compare Hive, SparkSQL, and Presto. Disclaimer: we are developing MR3, mentioned in the articles. However, we tried to make a fair comparison in the performance evaluation.

[1] https://mr3.postech.ac.kr/blog/2019/11/07/sparksql2.3.2-0.10... [2] https://mr3.postech.ac.kr/blog/2019/08/22/comparison-presto3...


We have never been able to make Hive LLAP run reliably on our HDP cluster; queries sometimes just hang for no apparent reason.

On the other hand, our Presto cluster runs pretty much anything we throw at it, and when it fails, the failures are easier to anticipate and mitigate. It's also quite simple to deploy and operate.


Could you expand more on the reasons why Hive is faster than Spark? Aren't Hive joins also achieved via a MapReduce shuffle?


Query plans are heavily optimized, and map-side joins are used extensively. These memory-exploiting optimizations make the so-called in-memory computing of Spark no longer a differentiator, because Hive also uses memory efficiently. The Hive community is actively working toward Hive 4, so I guess the future version will be even faster.


Spark also has a query plan optimizer, and uses map-side joins (referred to as broadcast joins) whenever it makes sense. I'm just curious what other architectural differences, in your opinion, can result in a performance discrepancy?
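For reference, this is the kind of broadcast join I mean (a minimal PySpark sketch; the table names and paths are made up):

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import broadcast

  spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

  # Hypothetical inputs: a large fact table and a small dimension table
  orders = spark.read.parquet("s3://bucket/orders")
  countries = spark.read.parquet("s3://bucket/countries")

  # Ship the small table to every executor so the join happens map-side,
  # without shuffling the large table
  joined = orders.join(broadcast(countries), on="country_code")
  joined.explain()  # physical plan should show a BroadcastHashJoin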


While I cannot give a definitive answer because I am not an expert on Spark internals, my opinion is that the discrepancy results mainly from query and runtime optimization.

Apart from adding new features (e.g., ACID support), a lot of effort is still put into optimizing queries and the runtime. In essence, Hive is a tool specialized for SQL, so it tries to implement all the optimizations you can think of in the context of executing SQL queries. For example, Hive implements vectorized execution, whose counterpart in Spark was implemented only in a later version (with the introduction of Tungsten, IIRC). Hive even supports query re-execution: if a query fails with a fatal error like OOM, Hive generates a new query plan after analyzing the runtime statistics collected up to that point. The second attempt usually runs much faster, and you can also update the column statistics in the Metastore.
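For anyone curious, here is roughly what toggling vectorization looks like from a client, sketched with PyHive (the endpoint, username, and table are hypothetical; the config key is the standard one, but check the docs for your Hive version):

  from pyhive import hive

  # Hypothetical HiveServer2 / LLAP endpoint
  conn = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst")
  cur = conn.cursor()

  # Vectorized execution processes batches of ~1024 rows instead of row-by-row
  cur.execute("SET hive.vectorized.execution.enabled=true")
  cur.execute("SELECT dt, count(*) FROM web_logs GROUP BY dt")
  print(cur.fetchall())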

In contrast, Spark is a general-purpose execution engine where SparkSQL is just an application. I remember someone comparing Spark to a Swiss army knife, which lets you do a lot of things easily but is no match for specialized tools developed for a particular task. My (opinionated) guess is that SparkSQL will be replaced by Hive and Presto, and Spark Streaming will be replaced by Flink.


Was SparkSQL ever intended to replace Hive? My impression was that it was supposed to supplement Spark for times when it was convenient. I kind of suspected at one point they got caught up in the SQL-on-Hadoop race, but I always felt it was best to do SQL elsewhere and save Spark for things that couldn't be easily expressed in SQL.


The original SparkSQL was pretty much modelled after the Hive flavour of SQL, down to the available UDFs. The compatibility was never complete and has somewhat diverged again with the frameworks' respective releases, but for the most part, Hive was the big data framework to beat at the time (2015-2016), and not everyone wanted to write Scala.

I think that now, maintaining that compatibility is less of a need for Spark, and Hive has introduced a lot of goodies in the meantime, so the SQL flavors might not need to be in lockstep anymore.


In Spark, a SQL query can be used as a DataFrame, or registered as a Hive temp view that can be called from other SQL. That gives you the flexibility to mix and match SQL and programmatic logic within the same Spark app.
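Roughly like this (a minimal PySpark sketch; the paths and names are made up):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # DataFrame in, temp view for the SQL side
  events = spark.read.parquet("s3://bucket/events")
  events.createOrReplaceTempView("events")

  # SQL where it's convenient...
  daily = spark.sql("SELECT dt, count(*) AS n FROM events GROUP BY dt")

  # ...programmatic logic where it's not
  daily.filter(daily.n > 1000).write.parquet("s3://bucket/busy_days")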


IIRC, earlier in the project, a differentiator for Drill was supposed to be the ability to run drillbit processes across servers and run distributed queries from any one of them, with ZooKeeper as the coordinator. This would have been a simple approach to distributed queries where a secondary extract and load into a distributed filesystem or Parquet files was not desired [1].

Unfortunately, to date, distributed queries will fail if the paths _and_ file names are not symmetric: all file paths and names must exist on all nodes, so the "in situ" approach is not available. It appears the project focused on querying distributed file systems like HDFS and S3 and therefore had a lot of competition.

I hope some group picks up Drill where HPE orphaned it after the MapR acquisition and pivots to a pure distributed-worker approach. Running a drillbit on the nodes where the data originates could be useful; the original example was SQL over HTTP logs directly from web servers.

[1] https://mapr.com/blog/drill-your-big-data-today-apache-drill...


I think it's important to add notes about the "Hadoop eras" in which some of these were first developed and evolved.

Hadoop 1.x (i.e. "MapReduce" execution engine):

* Apache Pig

* Apache Hive

* Apache Drill

* Cloudera Impala

If I recall correctly, neither Drill nor Impala actually used Hadoop 1.x MapReduce as the execution engine; they were mostly bundled to read data commonly stored in the HDFS cluster.

Hadoop 2.x (i.e. the MR2 / "YARN" era):

* Apache Pig

* Apache Hive

* Apache Tez (technically a substitute execution engine for MR2), built to allow containers to persist and to coalesce operation/task stages, amongst other things, to reduce overall job latency

* Apache Spark (technically a substitute execution engine for MR2)

Spark entered the Hadoop ecosystem, as many people were storing their data in HDFS, and the Hadoop 2 YARN resource model/containers provided the compute resources to run Spark as an execution engine, in lieu of MR2. You could and can also run a separate Spark-dedicated cluster, but many people were already running Hadoop and storing their data in HDFS clusters.

"Shark" became SparkSQL somewhere around Spark 1.3-1.4x? and Schema-ed RDDs evolved to DataFrames and better enabled people to reason and interface with their data in a table-like manner. Python/PySpark performance also rapidly improved from things like Project Tungsten and DataFrames.

https://databricks.com/blog/2015/02/17/introducing-dataframe...

Post-Hadoop / MR2:

* Hive

* Spark

* Presto

Tez was very much backed by Hortonworks as part of their HDP Hadoop distribution, and was motivated by improving the performance of the existing Apache Pig and Hive tools (with major contributors from Yahoo, Microsoft, and Hortonworks). Hortonworks later incorporated Spark as part of their distribution.

Spark was adopted by Cloudera as part of their CDH Hadoop distribution, and coexisted with Impala.

In the post-Hadoop / post-Spark world, Hortonworks and Cloudera merged as well:

https://www.cloudera.com/about/news-and-blogs/press-releases...

Also, since we're talking MPP with SQL/SQL-like dialects, we may as well mention that Greenplum and ParAccel/Redshift coexisted with all of these.


Great post :) This reads like a History Channel documentary!


I've not spent much time, but I've never exactly understood what Presto is. Is it just map reduce across databases?


It's a distributed SQL engine that can query data from various database engines (via connectors or JDBC drivers), as well as structured file formats like CSV or Parquet (using the Hive metastore).

Presto does not manage storage itself, but instead focuses on fronting those data sources with a single access point, with the option to federate (join) different sources in a single query.
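Roughly what that looks like from a client, e.g. with PyHive's Presto DBAPI (the coordinator host, catalogs, and tables here are made up):

  from pyhive import presto

  conn = presto.connect(host="presto-coordinator", port=8080, username="analyst")
  cur = conn.cursor()

  # One query spanning a Hive table on HDFS/S3 and a MySQL table,
  # addressed as catalog.schema.table
  cur.execute("""
      SELECT c.name, count(*) AS page_views
      FROM hive.web.access_logs l
      JOIN mysql.crm.customers c ON l.customer_id = c.id
      GROUP BY c.name
  """)
  print(cur.fetchall())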


"Presto is an open-source distributed SQL query engine optimized for low-latency, ad-hoc analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources including the Hadoop Distributed File System (HDFS) and Amazon S3"

TIL that Presto is available in EMR.


Not only that, but AWS Athena is basically serverless Presto. It's an extremely handy tool, particularly if you've got structured or semi-structured data being dumped into S3 and you want a near-zero-maintenance (you only have to create schemas) way to explore it.
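For example, from boto3 it's just a query string plus an S3 output location (the database, table, and bucket names below are made up):

  import time
  import boto3

  athena = boto3.client("athena", region_name="us-east-1")

  qid = athena.start_query_execution(
      QueryString="SELECT status, count(*) AS hits FROM web_logs GROUP BY status",
      QueryExecutionContext={"Database": "analytics"},
      ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
  )["QueryExecutionId"]

  # Poll until the query finishes, then pull the rows
  while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
      time.sleep(1)
  rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]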


It's basically a compute engine that maintains all state in memory and does distributed computations similarly to Spark.

The big thing it adds is that it isn't stuck to any one storage format. It has connectors that let you load data into it from basically anything, from MySQL DBs to HDFS files to whatever. So you can do cross-database joins and just not care about where the data lives. You can also output to almost any database.


Not being snarky, but is that basically a SQL interface for map reduce across databases?


It's basically federated SQL. Nothing to do with map/reduce, really.


It loads everything from the data sources (presumably pushing as much as possible down to the underlying database) and then does SQL in RAM. Pretty much the definition of map reduce. Federated SQL wouldn't give a speedup against a single database, but Presto does.


Loading from data sources and doing SQL in RAM is very much not the definition of MapReduce.


I don't think it's a crazy comparison, but AFAIK Presto doesn't have a shuffle stage, which is fundamental in traditional MapReduce. They're both distributed computations, that's the comparison I would make.

Edit: thinking about it more, how could Presto accomplish joins and other SQL operations without a shuffle? Seems very similar to Spark SQL, which is just syntactic sugar for multi-stage MapReduces.


Presto can do shuffles when needed, but it typically has a smaller maximum shuffle size than Spark because everything has to stay in RAM.


It is basically federated SQL to get all the data you need into memory.

Then it applies a bunch of techniques, one of which is map reduce, to do the computation.

It's like you took a distributed SQL server and made it never have its own storage outside of memory. Instead it operates on wherever the data is living.


Is there any form of distributed query engine in use today that doesn't fit the definition of the MapReduce pattern? Is describing a distributed query engine as MapReduce still a meaningful distinction from some other, non-MapReduce approach?


> the definition of the MapReduce pattern?

Impala, Presto etc don't fit that model at all - they follow the Volcano model.

In this model, they are not purely functional: if a task fails, there is no way to reproduce the output of that task.

In MapReduce, the function within map() was guaranteed to produce the same output for the same input across multiple attempts, whether after a failure or concurrently (for speculation).

Because of this, they can be faster as they do not wait for a task to be complete to run a subsequent stage & can pipeline better, but at the cost of failing all queries running on a node during a crash.

There are no retries for anything. This was deemed acceptable, if your hardware is reliable and the response to a failed query is just to "run it again", rather than per-node query recovery.

The reason Spark/Tez/Flink etc. have proper node failure tolerance is that they follow the functional model as closely as possible, with exceptions for non-deterministic functions (say, UUID() in a SQL call).

The advantage of the fault tolerance is that these tools can push the whole cluster towards single-query performance when it is otherwise idle, because preemption can recover capacity from a running system if a higher-priority query enters the system later.


Sort of. The biggest difference is that it can be a pseudo-data-warehouse for analysts and data scientists over an object store (e.g. S3) without needing to manage a complicated ETL process. AWS Athena goes even further by not needing provisioned compute, so queries run on ephemeral VMs over the object store.

Hive makes a terrible data warehouse no matter how SQL compatible it is.


Please correct me if I have this wrong, but my vague understanding is that the data representation heart of Apache Drill lives on in the rather active Apache Arrow project.

https://stackoverflow.com/questions/53533506/what-is-the-dif...

https://github.com/apache/arrow/commit/e6905effbb9383afd2423...

And the platform/tools side of Drill now lives on as Dremio, which uses Apache Arrow.

https://github.com/dremio/dremio-oss

So the essence of Drill still lives, but it became half Apache project and half vendor controlled and supported, and the root of that split is now orphaned.


For those interested, the relevant rules seem to be here: https://www.apache.org/foundation/voting.html

As far as I can tell, the implication is that there are now fewer than three people interested enough to participate in code reviews, and ASF rules require at least three +1 votes for basically anything to happen.


That's basically the idea though the details are subtly different.

What the ASF won't let you do if you can't muster the votes is actually make a release — that takes 3 votes from people on the Drill PMC (Project Management Committee). If you can't get 3 PMC votes, the project cannot even make security releases and must be retired.

As to who can commit code, from the ASF's standpoint any person with commit rights can do so at any time. However, the project may impose additional constraints, such as requiring a code review.


> As to who can commit code, from the ASF's standpoint any person with commit rights can do so at any time. However, the project may impose additional constraints, such as requiring a code review.

Projects can amend their bylaws around this, but that also goes to the PMC - lazy consensus can apply, but yeah you can't go through the source release process without 3 binding votes.

Eventually, you realize that this sort of software is not anybody's hobby project and takes real dollars to pay for development.

Doesn't help that Dremio would profit from Drill collapsing and leaving that space open.


Some context

Drill: https://en.wikipedia.org/wiki/Apache_Drill

MapR sold to Hewlett-Packard Enterprise (HPE): https://en.wikipedia.org/wiki/MapR


Sad to see this happen as I really like the idea!

If you're interested in Drill, check out OctoSQL[0]. It shares the same vision of querying multiple datasources using pure SQL, and it pushes down as many operations as possible to the underlying datasource.

Moreover, there's been a huge rewrite under way this past year, ready to use on the master branch, though still unreleased (it will be available soon).

It adds Kafka and Parquet support and, most importantly, first-class unbounded stream support, including temporal SQL extensions for working with event-time metadata (instead of system time), so you can use things like live-updated time window aggregations on incoming Kafka streams. It also now uses on-disk Badger storage as the primary way to store its state, so you can do group-bys / joins with lots of keys, and restarts of OctoSQL won't alter the final result (exactly-once semantics).

Make sure to check it out, it's also very simple to get going locally!

Disclosure: I'm one of the main contributors.

[0]: https://github.com/cube2222/octosql


Does OctoSQL support reading columnar compressed formats (e.g. Parquet) from distributed storage (e.g. S3)?


No, we don't support Excel, JSON, CSV and Parquet datasources other than local files yet, though that's definitely planned and would be very easy to add.


Drill does ;-) Drill also has a streaming Excel reader that works very well with large Excel files...


How does it compare to prestosql?


Much less mature.

There's no distributed execution yet.

Easy to get going locally.

I'm not sure, but I think Presto is in-memory? We're optimizing for SSDs to easily support large state, and we achieve durability this way. SSDs are still plenty fast.

Definitely better streaming support. We're very much concentrating on good streaming ergonomics. OctoSQL isn't a batch execution engine at its heart, it's a streaming one.


As the author of the above post, I thought I'd respond to the comments here. First off, I'm glad to see that people noticed... :-)

Firstly, to paraphrase Monty Python, "Drill's not dead yet." There are some efforts to get corporate sponsorship, but it will take time. HPE's withdrawal was expected, but disappointing nonetheless. With that said, we are still gearing up to release Drill 1.18, which will have a considerable number of enhancements, including new formats (SPSS, HDF5, possibly SAS) as well as new storage plugins to enable Drill to connect to Druid and to REST APIs.

Personally, I've always felt that Drill was marketed to the wrong audience. As everyone notes, there are many competitors in the big data analytics space: Spark, ES, Splunk, Presto, Impala, to name a few. Where I see Drill filling a rather unique niche in the market is small-to-medium-size analytics on complex data.

For instance, if you have one CSV file, you use Excel. If you have 100 CSVs, you have to code. Or what if you have an Excel spreadsheet and you want to pull data from your corporate reference API, which happens to return JSON and uses OAuth authentication? Drill is the kind of tool that can bridge this gap and allow analysts to rapidly get value out of these situations.

As an example, here's a demo of me building a COVID dashboard from a REST API and spreadsheet in about 15 min with zero data prep: https://youtu.be/oEOhFWm3D9A

Another example: incident response. Let's say you have a PCAP file: use Wireshark. What if you have 10GB of PCAP files? Most likely you'll have to code up some solution. With Drill, however, you can query that w/o coding, which means you can get value out of the data faster.

I know there are people using Drill. I know that when I demo Drill to analysts, they love it. (OK, I'm biased here, but I would say that's an accurate representation of the response to my presentations.) What Drill lost was an active developer community. I hope that over the next few months Drill's users will step up a bit and contribute code reviews and/or actual code. I've been thinking about creating a security-focused fork of Drill as well, so we'll see what happens.

If you have ideas/comments/questions, you can email me at cgivre@apache.org.


Your introduction is exactly how I feel about Drill. However, the official docs are not as good as the book (Learning Apache Drill).


How many Apache projects are there in the "Big Data" space? It seems every time I look around I see a new one.


49, according to the project list page. Some of them are in the Attic, some in incubation.

https://projects.apache.org/projects.html?category#big-data

Were you looking for a single project for all your big-data needs?


Looking for some continuity after those splashy Strata presentations.


Strata is very corporate. If you want what you see in Strata presentations, buy it from the vendors (Cloudera, Microsoft, etc.)

What you see on apache.org is what gets put in presentations at say, the Hadoop Summit.

edit: On a more helpful note, Apache Spark is probably as close as you can get to a single project for all your big data needs, if that is what one wants out of the box from an open-source project. It includes a SQL framework and a streaming framework, and either bundles or improves upon more general work done in Hadoop, etc. It can be pretty vendor-controlled at times, but its birth was in academia, making it pretty different from the other projects, which were mostly born as components of already established commercial platforms. There are pros and cons to that, of course.


> Were you looking for a single project for all your big-data needs?

Isn't awk enough? :)


It's why you need a Zookeeper ;)


This is unfortunate.

For anyone curious about Apache Drill, it was inspired by Dremel, which was used internally at Google and was once the engine beneath GCP's BigQuery.

https://static.googleusercontent.com/media/research.google.c...

In the era of Hadoop, it was most closely aligned with the MapR distribution, while Hortonworks aligned with Hive, and Cloudera offered Impala as their solution.

History of Apache Drill: http://radar.oreilly.com/2015/09/apache-drill-tracking-its-h...


> Dremel which was [...] once the engine beneath GCP's BigQuery

Is there a white paper describing what replaced Dremel in BigQuery's architecture?


Clarification -- There's the Dremel paper, the Google-internal implementation, and BigQuery, a public-facing implementation that evolved from Dremel :)

Here's some interesting reads on BigQuery: https://cloud.google.com/files/BigQueryTechnicalWP.pdf

Also, Happy 10th Birthday, BigQuery! https://cloud.google.com/blog/products/data-analytics/bigque...


Check out Dremio. The interface and speed were awesome when I tested on some pretty large datasets.


I was thinking of using Apache Drill for converting CSVs to Parquet. What else do people use for that?


I've also been looking for that. In an ideal world there would be a small, fast, standalone cli tool that can convert csv to parquet. There is a (sadly, unfinished) parquet writer Rust library in the Arrow repository that looks promising. All approaches I've tried so far (spark, pyarrow, drill, ...) require everything and the kitchen sink. So far I've settled on a java cli tool that uses jackson + org.apache.parquet internally, but it's cpu bound and has a huge amount of maven dependencies.


pandas + fastparquet is fairly lightweight. But yes, I would love to see a simple C++/Golang binary that's just a csv2parq call.


Newer versions of Pandas don't even need fastparquet anymore (to_parquet will use pyarrow if it's installed). This code works:

import pandas as pd

df = pd.read_csv('data/us_presidents.csv')

df.to_parquet('tmp/us_presidents.parquet')


Nice! Does that work alongside reading in via chunks and writing via row_groups? If I have a 500GB CSV will it work?


This article explains how to convert CSVs to Parquet with Pandas, PySpark, and Koalas: https://mungingdata.com/python/writing-parquet-pandas-pyspar...

This article explains how to convert CSVs to Parquet with Go and Scala: https://mungingdata.com/go/csv-to-parquet/

If you're converting hundreds/thousands of files, Spark/PySpark is probably the best tool for the job. For fewer files, Python or Go is just fine.
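For the chunked case specifically, here's a minimal pandas + pyarrow sketch (file names and chunk size are made up; each chunk becomes its own row group, and watch out for dtypes drifting between chunks):

  import pandas as pd
  import pyarrow as pa
  import pyarrow.parquet as pq

  writer = None
  # Stream the CSV in manageable chunks so the 500GB file never sits in memory
  for chunk in pd.read_csv("huge.csv", chunksize=1_000_000):
      table = pa.Table.from_pandas(chunk, preserve_index=False)
      if writer is None:
          writer = pq.ParquetWriter("huge.parquet", table.schema)
      writer.write_table(table)  # each chunk is written as its own row group(s)
  if writer is not None:
      writer.close()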



+1, in my experience the Pandas CSV parser seems very robust and more than sufficiently fast.

The only faster/better CSV parser I've used with any frequency is the fread function in the R package data.table. When I used R in the past, Parquet was much less popular, but I think now the arrow package supports writing to Parquet.


Just use Apache NiFi; it's fast and does that in one step.


Spark


The Spark community put a lot of effort into a feature-rich CSV data source. Oftentimes the most challenging part of ingesting CSV is parsing it; there are many flavors, each with different corner cases for NULLs, comments, headers, etc.
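A minimal PySpark sketch of the conversion (paths and parser options are made up; tune the options to your particular CSV flavor):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

  df = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("nullValue", "\\N")   # one of those per-flavor corner cases
        .csv("s3://bucket/raw/*.csv"))

  df.write.mode("overwrite").parquet("s3://bucket/parquet/")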


Maybe Presto/Athena


pyarrow
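e.g., just two calls (file names are placeholders):

  import pyarrow.csv as pv
  import pyarrow.parquet as pq

  table = pv.read_csv("data.csv")
  pq.write_table(table, "data.parquet")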


I'm really impressed with the mission of Drill - write SQL over disparate data sources - but I've actually never installed it. When I have a bunch of parquet/csv/... files sitting around, I can normally slurp them in with pandas.


Try it out! I'd agree that if you have one CSV you don't need Drill. If you have more than one, that's where it starts to get interesting. Also, there is PyDrill, which enables you to execute a Drill query and put the results directly into a pandas data frame. If you're an R person, there's sergeant, which does the same.
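A minimal PyDrill sketch of that workflow (the drillbit host/port and the data path are whatever yours actually are):

  from pydrill.client import PyDrill

  drill = PyDrill(host="localhost", port=8047)  # drillbit's HTTP port
  if not drill.is_active():
      raise RuntimeError("Drill is not running")

  # Query a directory of headerless CSVs in place via the dfs plugin;
  # Drill exposes the fields as the `columns` array
  result = drill.query(
      "SELECT columns[0] AS name, columns[1] AS amount "
      "FROM dfs.`/data/csv/` LIMIT 10"
  )
  df = result.to_dataframe()
  print(df.head())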


Are you Charles Givre? I have your book!


I am!! Glad to hear you liked the book!


Sad to hear this, since I've been following Drill for a while. From what I understand, Drill was based on the Google Dremel paper, hence the name.

https://research.google/pubs/pub36632/

Wondering if maybe Spark REPL or Apache Zeppelin might be a decent replacement for Drill.


One thing I like about Spark is the number of data sources it supports, including its ability to write out to a table or whatever. If you want to set up a JDBC or ODBC server, you can also access Spark SQL:

https://spark.apache.org/docs/latest/sql-distributed-sql-eng...

From an analytics perspective, a lot of people like the ability to connect to a JDBC or ODBC source.


I really like Apache Drill. I used it about 4 years ago as an MPP SQL-on-anything engine. Presto took a lot of that momentum away, and MapR's failure to adapt to the new age kind of meant EOL for the project. I hope someone else picks it up.


I wonder how many apache projects actually survive this sort of event?


Just use bigquery.


Still waiting for Apache dril.


For what?



