Apache Drill's Future (apache.org)
132 points by kermatt on May 20, 2020 | 75 comments



Apache Drill is an interesting project. Of all the MPP engines that appeared a few years ago, it was the most similar to BigQuery (the first public version) and the most flexible.

However, the competition was fierce, and each Big Data vendor (MapR, Cloudera, and Hortonworks) was pushing its own solution: Drill, Impala, and Hive on Tez. Competition is always a good thing, but it fragmented the user base too much, so no clear winner emerged.

At the same time, Spark SQL became good enough to replace these tools in most use cases, and Presto (from Facebook) gained the traction and user base that none of these projects had by being vendor agnostic (its adoption by AWS in Athena and EMR also helped boost its popularity).


There is a common belief that SparkSQL is better than Hive because SparkSQL uses in-memory computing while Hive is disk-based. Another common belief is that Presto is better than Hive because it is based on an MPP design and was invented to overcome the slow speed of Hive by the very company (Facebook) that created Hive in the early 2010s.

The reality is that nowadays both SparkSQL and Presto are way behind Hive, in terms of both speed and maturity. Hive has made tremendous progress since 2015 (with the introduction of LLAP), while SparkSQL still has stability issues with fault tolerance and shuffling. (Presto does not support fault tolerance.) So, IMO, SparkSQL is nowhere near ready to replace Hive.

If you are curious about the performance of these systems, see [1] and [2], which compare Hive, SparkSQL, and Presto. Disclaimer: we are developing MR3, mentioned in the articles. However, we tried to make a fair comparison in the performance evaluation.

[1] https://mr3.postech.ac.kr/blog/2019/11/07/sparksql2.3.2-0.10... [2] https://mr3.postech.ac.kr/blog/2019/08/22/comparison-presto3...


We have never been able to make Hive LLAP run reliably on our HDP cluster; queries sometimes just hang for no apparent reason.

On the other hand, our Presto cluster runs pretty much anything we throw at it, and when it fails, the failures are easier to anticipate and mitigate. It's also quite simple to deploy and operate.


Could you expand more on the reasons why Hive is faster than Spark? Aren't Hive joins also achieved via a MapReduce shuffle?


Query plans are heavily optimized, and map-side joins are used extensively. These memory-exploiting optimizations make the so-called in-memory computing of Spark no longer a differentiator, because Hive also uses memory efficiently. The Hive community is actively working toward Hive 4, so I guess the future version will be even faster.


Spark also has a query plan optimizer, and uses map-side joins (referred to as broadcast joins) whenever it makes sense. I'm just curious what other architectural differences, in your opinion, can result in a performance discrepancy?
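For reference, this is the kind of broadcast join I mean (a minimal PySpark sketch; the table names and paths are made up):

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import broadcast

  spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

  # Hypothetical inputs: a large fact table and a small dimension table
  orders = spark.read.parquet("s3://bucket/orders")
  countries = spark.read.parquet("s3://bucket/countries")

  # Ship the small table to every executor so the join happens map-side,
  # without shuffling the large table
  joined = orders.join(broadcast(countries), on="country_code")
  joined.explain()  # physical plan should show a BroadcastHashJoin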


While I cannot give a definitive answer because I am not an expert on Spark internals, my opinion is that the discrepancy results mainly from query and runtime optimization.

Apart from adding new features (e.g., ACID support), a lot of effort is still put into optimizing queries and the runtime. In essence, Hive is a tool specialized for SQL, so it tries to implement all the optimizations you can think of in the context of executing SQL queries. For example, Hive implements vectorized execution, whose counterpart in Spark was implemented only in a later version (with the introduction of Tungsten, IIRC). Hive even supports query re-execution: if a query fails with a fatal error like OOM, Hive generates a new query plan after analyzing the runtime statistics collected up to that point. The second attempt usually runs much faster, and you can also update the column statistics in the Metastore.
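For anyone curious, here is roughly what toggling vectorization looks like from a client, sketched with PyHive (the endpoint, username, and table are hypothetical; the config key is the standard one, but check the docs for your Hive version):

  from pyhive import hive

  # Hypothetical HiveServer2 / LLAP endpoint
  conn = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst")
  cur = conn.cursor()

  # Vectorized execution processes batches of ~1024 rows instead of row-by-row
  cur.execute("SET hive.vectorized.execution.enabled=true")
  cur.execute("SELECT dt, count(*) FROM web_logs GROUP BY dt")
  print(cur.fetchall())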

In contrast, Spark is a general-purpose execution engine where SparkSQL is just an application. I remember someone comparing Spark to a Swiss army knife, which lets you do a lot of things easily but is no match for specialized tools developed for a particular task. My (opinionated) guess is that SparkSQL will be replaced by Hive and Presto, and Spark Streaming will be replaced by Flink.


Was SparkSQL ever intended to replace Hive? My impression was that it was supposed to supplement Spark for times when it was convenient. I kind of suspected at one point they got caught up in the SQL-on-Hadoop race, but I always felt it was best to do SQL elsewhere and save Spark for things that couldn't be easily expressed in SQL.


The original SparkSQL was pretty much modelled after the Hive flavour of SQL, down to the available UDFs. The compatibility was never complete and has somewhat diverged again with the frameworks' respective releases, but for the most part, Hive was the big data framework to beat at the time (2015-2016), and not everyone wanted to write Scala.

I think that now, maintaining that compatibility is less of a need for Spark, and Hive has introduced a lot of goodies in the meantime, so the SQL flavors might not need to be in lockstep anymore.


In Spark, a SQL query can be used as a DataFrame, or registered as a Hive temp view that can be called from other SQL. That gives you the flexibility to mix and match SQL and programmatic logic within the same Spark app.
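Roughly like this (a minimal PySpark sketch; the paths and names are made up):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # DataFrame in, temp view for the SQL side
  events = spark.read.parquet("s3://bucket/events")
  events.createOrReplaceTempView("events")

  # SQL where it's convenient...
  daily = spark.sql("SELECT dt, count(*) AS n FROM events GROUP BY dt")

  # ...programmatic logic where it's not
  daily.filter(daily.n > 1000).write.parquet("s3://bucket/busy_days")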


IIRC, earlier in the project, a differentiator for Drill was supposed to be the ability to run drillbit processes across servers and run distributed queries from any one of them, with ZooKeeper as the coordinator. This would have been a simple approach to distributed queries where a secondary extract and load into a distributed filesystem or Parquet files was not desired [1].

Unfortunately, to date, distributed queries will fail if the paths _and_ file names are not symmetric: all file paths and names must exist on all nodes, so the "in situ" approach is not available. It appears the project focused on querying distributed file systems like HDFS and S3 and therefore had a lot of competition.

I hope some group picks up Drill where HPE orphaned it after the MapR acquisition and pivots to a pure distributed-worker approach. Running a drillbit on the nodes where the data originates could be useful; the original example was SQL over HTTP logs directly from web servers.

[1] https://mapr.com/blog/drill-your-big-data-today-apache-drill...


I think it's important to add notes about the "Hadoop eras" in which some of these were first developed and evolved.

Hadoop 1.x (i.e. "MapReduce" execution engine):

* Apache Pig

* Apache Hive

* Apache Drill

* Cloudera Impala

If I recall correctly, neither Drill nor Impala actually used Hadoop 1.x MapReduce as the execution engine; they were mostly bundled to read data commonly stored in the HDFS cluster.

Hadoop 2.x (i.e. the MR2 / "YARN" era):

* Apache Pig

* Apache Hive

* Apache Tez (technically a substitute execution engine for MR2), built to allow containers to persist and to coalesce operation/task stages, amongst other things, to reduce overall job latency

* Apache Spark (technically a substitute execution engine for MR2)

Spark entered the Hadoop ecosystem, as many people were storing their data in HDFS, and the Hadoop 2 YARN resource model/containers provided the compute resources to run Spark as an execution engine, in lieu of MR2. You could and can also run a separate Spark-dedicated cluster, but many people were already running Hadoop and storing their data in HDFS clusters.

"Shark" became SparkSQL somewhere around Spark 1.3-1.4x? and Schema-ed RDDs evolved to DataFrames and better enabled people to reason and interface with their data in a table-like manner. Python/PySpark performance also rapidly improved from things like Project Tungsten and DataFrames.

https://databricks.com/blog/2015/02/17/introducing-dataframe...

Post-Hadoop / MR2:

* Hive

* Spark

* Presto

Tez was very much backed by Hortonworks as part of their HDP Hadoop distribution, and was motivated by improving the performance of the existing Apache Pig and Hive tools (with major contributors from Yahoo, Microsoft, and Hortonworks). Hortonworks later incorporated Spark as part of their distribution.

Spark was adopted by Cloudera as part of their CDH Hadoop distribution, and coexisted with Impala.

In the post-Hadoop / post-Spark world, Hortonworks and Cloudera merged as well:

https://www.cloudera.com/about/news-and-blogs/press-releases...

Also, since we're talking MPP with SQL/SQL-like dialects, we may as well mention that Greenplum and ParAccel/Redshift coexisted with all of these.


Great post :) This reads like a History Channel documentary!


I've not spent much time, but I've never exactly understood what Presto is. Is it just map reduce across databases?


It's a distributed SQL engine that can query data from various database engines (via connectors or JDBC drivers), as well as structured file formats like CSV or Parquet (using the Hive metastore).

Presto does not manage storage itself, but instead focuses on fronting those data sources with a single access point, with the option to federate (join) different sources in a single query.
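Roughly what that looks like from a client, e.g. with PyHive's Presto DBAPI (the coordinator host, catalogs, and tables here are made up):

  from pyhive import presto

  conn = presto.connect(host="presto-coordinator", port=8080, username="analyst")
  cur = conn.cursor()

  # One query spanning a Hive table on HDFS/S3 and a MySQL table,
  # addressed as catalog.schema.table
  cur.execute("""
      SELECT c.name, count(*) AS page_views
      FROM hive.web.access_logs l
      JOIN mysql.crm.customers c ON l.customer_id = c.id
      GROUP BY c.name
  """)
  print(cur.fetchall())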


"Presto is an open-source distributed SQL query engine optimized for low-latency, ad-hoc analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources including the Hadoop Distributed File System (HDFS) and Amazon S3"

TIL that Presto is available in EMR.


Not only that, but AWS Athena is basically serverless Presto. It's an extremely handy tool, particularly if you've got structured or semi-structured data being dumped into S3 and you want a near-zero-maintenance (you only have to create schemas) way to explore it.
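For example, from boto3 it's just a query string plus an S3 output location (the database, table, and bucket names below are made up):

  import time
  import boto3

  athena = boto3.client("athena", region_name="us-east-1")

  qid = athena.start_query_execution(
      QueryString="SELECT status, count(*) AS hits FROM web_logs GROUP BY status",
      QueryExecutionContext={"Database": "analytics"},
      ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
  )["QueryExecutionId"]

  # Poll until the query finishes, then pull the rows
  while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
      time.sleep(1)
  rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]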


It's basically a compute engine that maintains all state in memory and does distributed computations similarly to Spark.

The big thing it adds is that it isn't stuck to any one storage format. It has connectors that let you load data into it from basically anything, from MySQL DBs to HDFS files to whatever. So you can do cross-database joins and just not care about where the data lives. You can also output to almost any database.


Not being snarky, but is that basically a SQL interface for map reduce across databases?


It's basically federated SQL. Nothing to do with map/reduce, really.


It loads everything from the data sources (presumably pushing as much as possible down to the underlying database) and then does SQL in RAM. Pretty much the definition of map reduce. Federated SQL wouldn't give a speedup against a single database, but Presto does.


Loading from data sources and doing SQL in RAM is very much not the definition of MapReduce.


I don't think it's a crazy comparison, but AFAIK Presto doesn't have a shuffle stage, which is fundamental in traditional MapReduce. They're both distributed computations, that's the comparison I would make.

Edit: thinking about it more, how could Presto accomplish joins and other SQL operations without a shuffle? Seems very similar to Spark SQL, which is just syntactic sugar for multi-stage MapReduces.


Presto can do shuffles when needed, but it typically has a smaller maximum shuffle size than Spark because everything has to stay in RAM.


It is basically federated SQL to get all the data you need into memory.

Then it applies a bunch of techniques, one of which is map reduce, to do the computation.

It's like you took a distributed SQL server and made it never have its own storage outside of memory. Instead it operates on wherever the data is living.


Is there any form of distributed query engine in use today that doesn't fit the definition of the MapReduce pattern? Is describing a distributed query engine as MapReduce still a meaningful distinction from some other, non-MapReduce approach?


> the definition of the MapReduce pattern?

Impala, Presto etc don't fit that model at all - they follow the Volcano model.

In this model, they are not purely functional: if a task fails, there is no way to reproduce the output of that task.

In MapReduce, the function within map() was guaranteed to produce the same output for the same input across multiple attempts, whether after a failure or concurrently (for speculation).

Because of this, they can be faster as they do not wait for a task to be complete to run a subsequent stage & can pipeline better, but at the cost of failing all queries running on a node during a crash.

There are no retries for anything. This was deemed acceptable, if your hardware is reliable and the response to a failed query is just to "run it again", rather than per-node query recovery.

The reason Spark/Tez/Flink etc. have proper node failure tolerance is that they follow the functional model as closely as possible, with exceptions for non-deterministic functions (say, UUID() in a SQL call).

The advantage of the fault tolerance is that these tools can push the whole cluster towards single-query performance when it is otherwise idle, because preemption can recover capacity from a running system if a higher-priority query enters the system later.


Sort of. The biggest difference is that it can be a pseudo-data-warehouse for analysts and data scientists over an object store (e.g. S3) without needing to manage a complicated ETL process. AWS Athena goes even further by not needing provisioned compute, so queries run on ephemeral VMs over the object store.

Hive makes a terrible data warehouse no matter how SQL compatible it is.


Please correct me if I have this wrong, but my vague understanding is that the data representation heart of Apache Drill lives on in the rather active Apache Arrow project.

https://stackoverflow.com/questions/53533506/what-is-the-dif...

https://github.com/apache/arrow/commit/e6905effbb9383afd2423...

And the platform/tools side of Drill now lives on as Dremio, which uses Apache Arrow.

https://github.com/dremio/dremio-oss

So the essence of Drill still lives, but it became half Apache project and half vendor controlled and supported, and the root of that split is now orphaned.


For those interested, the relevant rules seem to be here: https://www.apache.org/foundation/voting.html

As far as I can tell, the implication is that there are now fewer than three people interested enough to participate in code reviews, and ASF rules require at least three +1 votes for basically anything to happen.


That's basically the idea though the details are subtly different.

What the ASF won't let you do if you can't muster the votes is actually make a release — that takes 3 votes from people on the Drill PMC (Project Management Committee). If you can't get 3 PMC votes, the project cannot even make security releases and must be retired.

As to who can commit code, from the ASF's standpoint any person with commit rights can do so at any time. However, the project may impose additional constraints, such as requiring a code review.


> As to who can commit code, from the ASF's standpoint any person with commit rights can do so at any time. However, the project may impose additional constraints, such as requiring a code review.

Projects can amend their bylaws around this, but that also goes to the PMC - lazy consensus can apply, but yeah you can't go through the source release process without 3 binding votes.

Eventually, you realize that this sort of software is not anybody's hobby project and takes real dollars to pay for development.

Doesn't help that Dremio would profit from Drill collapsing and leaving that space open.


Some context

Drill: https://en.wikipedia.org/wiki/Apache_Drill

MapR sold to Hewlett-Packard Enterprise (HPE): https://en.wikipedia.org/wiki/MapR


Sad to see this happen as I really like the idea!

If you're interested in Drill, check out OctoSQL[0]. It shares the same vision of querying multiple datasources using pure SQL, and it pushes down as many operations as possible to the underlying datasource.

Moreover, there's been a huge rewrite under way this past year, ready to use on the master branch, though still unreleased (it will be available soon).

It adds Kafka and Parquet support and, most importantly, first-class unbounded stream support, including temporal SQL extensions for working with event-time metadata (instead of system time), so you can use things like live-updated time window aggregations on incoming Kafka streams. It also now uses on-disk Badger storage as the primary way to store its state, so you can do group-bys / joins with lots of keys, and restarts of OctoSQL won't alter the final result (exactly-once semantics).

Make sure to check it out, it's also very simple to get going locally!

Disclosure: I'm one of the main contributors.

[0]: https://github.com/cube2222/octosql


Does OctoSQL support reading columnar compressed formats (e.g. Parquet) from distributed storage (e.g. S3)?


No, we don't support Excel, JSON, CSV and Parquet datasources other than local files yet, though that's definitely planned and would be very easy to add.


Drill does ;-) Drill also has a streaming Excel reader that works very well with large Excel files...


How does it compare to prestosql?


Much less mature.

There's no distributed execution yet.

Easy to get going locally.

I'm not sure, but I think Presto is in-memory? We're optimizing for SSDs to easily support large state, and we achieve durability this way. SSDs are still plenty fast.

Definitely better streaming support. We're very much concentrating on good streaming ergonomics. OctoSQL isn't a batch execution engine at its heart, it's a streaming one.


As the author of the above post, I thought I'd respond to the comments here. First off, I'm glad to see that people noticed... :-)

Firstly, to paraphrase Monty Python, "Drill's not dead yet." There are some efforts to get corporate sponsorship, but it will take time. HPE's withdrawal was expected, but disappointing nonetheless. With that said, we are still gearing up to release Drill 1.18, which will have a considerable number of enhancements, including new formats (SPSS, HDF5, possibly SAS) as well as new storage plugins to enable Drill to connect to Druid and to REST APIs.

Personally, I've always felt that Drill was marketed to the wrong audience. As everyone notes, there are many competitors in the big data analytics space: Spark, ES, Splunk, Presto, Impala, to name a few. Where I see Drill filling a rather unique niche in the market is small-to-medium-size analytics on complex data.

For instance, if you have one CSV file, you use Excel. If you have 100 CSVs, you have to code. Or what if you have an Excel spreadsheet and you want to pull data from your corporate reference API, which happens to return JSON and uses OAuth authentication? Drill is the kind of tool that can bridge this gap and allow analysts to rapidly get value out of these situations.

As an example, here's a demo of me building a COVID dashboard from a REST API and spreadsheet in about 15 min with zero data prep: https://youtu.be/oEOhFWm3D9A

Another example: incident response. Let's say you have a PCAP file: use Wireshark. What if you have 10GB of PCAP files? Most likely you'll have to code up some solution. With Drill, however, you can query that w/o coding, which means you can get value out of the data faster.

I know there are people using Drill. I know that when I demo Drill to analysts, they love it. (OK, I'm biased here, but I would say that's an accurate representation of the response to my presentations.) What Drill lost was an active developer community. I hope that over the next few months Drill's users will step up a bit and contribute code reviews and/or actual code. I've been thinking about creating a security-focused fork of Drill as well, so we'll see what happens.

If you have ideas/comments/questions, you can email me at cgivre@apache.org.


Your introduction is exactly how I feel about Drill. However, the official docs are not as good as the book (Learning Apache Drill).


How many Apache projects are there in the "Big Data" space? It seems every time I look around I see a new one.


49, according to the project list page. Some of them are in the Attic, some in incubation.

https://projects.apache.org/projects.html?category#big-data

Were you looking for a single project for all your big-data needs?


Looking for some continuity after those splashy Strata presentations.


Strata is very corporate. If you want what you see in Strata presentations, buy it from the vendors (Cloudera, Microsoft, etc.)

What you see on apache.org is what gets put in presentations at say, the Hadoop Summit.

edit: On a more helpful note, Apache Spark is probably as close as you can get to a single project for all your big data needs, if that is what one wants out of the box from an open-source project. It includes a SQL framework and a streaming framework, and either bundles or improves upon more general work done in Hadoop, etc. It can be pretty vendor-controlled at times, but its birth was in academia, making it pretty different from the other projects, which were mostly born as components of already established commercial platforms. There are pros and cons to that, of course.


> Were you looking for a single project for all your big-data needs?

Isn't awk enough? :)


It's why you need a Zookeeper ;)


This is unfortunate.

For anyone curious about Apache Drill, it was inspired by Dremel, which was used internally at Google and was once the engine beneath GCP's BigQuery.

https://static.googleusercontent.com/media/research.google.c...

In the era of Hadoop, it was most closely aligned with the MapR distribution, while Hortonworks aligned with Hive, and Cloudera offered Impala as their solution.

History of Apache Drill: http://radar.oreilly.com/2015/09/apache-drill-tracking-its-h...


> Dremel which was [...] once the engine beneath GCP's BigQuery

Is there a white paper describing what replaced Dremel in BigQuery's architecture?


Clarification -- There's the Dremel paper, the Google-internal implementation, and BigQuery, a public-facing implementation that evolved from Dremel :)

Here's some interesting reads on BigQuery: https://cloud.google.com/files/BigQueryTechnicalWP.pdf

Also, Happy 10th Birthday, BigQuery! https://cloud.google.com/blog/products/data-analytics/bigque...


Check out Dremio. The interface and speed were awesome when I tested on some pretty large datasets.


I was thinking of using Apache Drill for converting CSVs to Parquet. What else do people use for that?


I've also been looking for that. In an ideal world there would be a small, fast, standalone cli tool that can convert csv to parquet. There is a (sadly, unfinished) parquet writer Rust library in the Arrow repository that looks promising. All approaches I've tried so far (spark, pyarrow, drill, ...) require everything and the kitchen sink. So far I've settled on a java cli tool that uses jackson + org.apache.parquet internally, but it's cpu bound and has a huge amount of maven dependencies.


pandas + fastparquet is fairly lightweight. But yes, I would love to see a simple C++/Golang binary that's just a csv2parq call.


Newer versions of Pandas don't even need fastparquet anymore (to_parquet will use pyarrow if it's installed). This code works:

import pandas as pd

df = pd.read_csv('data/us_presidents.csv')

df.to_parquet('tmp/us_presidents.parquet')


Nice! Does that work alongside reading in via chunks and writing via row_groups? If I have a 500GB CSV will it work?


This article explains how to convert CSVs to Parquet with Pandas, PySpark, and Koalas: https://mungingdata.com/python/writing-parquet-pandas-pyspar...

This article explains how to convert CSVs to Parquet with Go and Scala: https://mungingdata.com/go/csv-to-parquet/

If you're converting hundreds/thousands of files, Spark/PySpark is probably the best tool for the job. For fewer files, Python or Go is just fine.
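For the chunked case specifically, here's a minimal pandas + pyarrow sketch (file names and chunk size are made up; each chunk becomes its own row group, and watch out for dtypes drifting between chunks):

  import pandas as pd
  import pyarrow as pa
  import pyarrow.parquet as pq

  writer = None
  # Stream the CSV in manageable chunks so the 500GB file never sits in memory
  for chunk in pd.read_csv("huge.csv", chunksize=1_000_000):
      table = pa.Table.from_pandas(chunk, preserve_index=False)
      if writer is None:
          writer = pq.ParquetWriter("huge.parquet", table.schema)
      writer.write_table(table)  # each chunk is written as its own row group(s)
  if writer is not None:
      writer.close()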



+1, in my experience the Pandas CSV parser seems very robust and more than sufficiently fast.

The only faster/better CSV parser I've used with any frequency is the fread function in the R package data.table. When I used R in the past, Parquet was much less popular, but I think now the arrow package supports writing to Parquet.


Just use Apache NiFi; it's fast and does that in one step.


Spark


The Spark community put a lot of effort into a feature-rich CSV data source. Oftentimes the most challenging part of ingesting CSV is parsing it; there are many flavors, each with different corner cases for NULLs, comments, headers, etc.
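A minimal PySpark sketch of the conversion (paths and parser options are made up; tune the options to your particular CSV flavor):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

  df = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("nullValue", "\\N")   # one of those per-flavor corner cases
        .csv("s3://bucket/raw/*.csv"))

  df.write.mode("overwrite").parquet("s3://bucket/parquet/")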


Maybe Presto/Athena


pyarrow
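e.g., just two calls (file names are placeholders):

  import pyarrow.csv as pv
  import pyarrow.parquet as pq

  table = pv.read_csv("data.csv")
  pq.write_table(table, "data.parquet")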


I'm really impressed with the mission of Drill - write SQL over disparate data sources - but I've actually never installed it. When I have a bunch of parquet/csv/... files sitting around, I can normally slurp them in with pandas.


Try it out! I'd agree that if you have one CSV you don't need Drill. If you have more than one, that's where it starts to get interesting. Also, there is PyDrill, which enables you to execute a Drill query and put the results directly into a pandas data frame. If you're an R person, there's sergeant, which does the same.
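A minimal PyDrill sketch of that workflow (the drillbit host/port and the data path are whatever yours actually are):

  from pydrill.client import PyDrill

  drill = PyDrill(host="localhost", port=8047)  # drillbit's HTTP port
  if not drill.is_active():
      raise RuntimeError("Drill is not running")

  # Query a directory of headerless CSVs in place via the dfs plugin;
  # Drill exposes the fields as the `columns` array
  result = drill.query(
      "SELECT columns[0] AS name, columns[1] AS amount "
      "FROM dfs.`/data/csv/` LIMIT 10"
  )
  df = result.to_dataframe()
  print(df.head())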


Are you Charles Givre? I have your book!


I am!! Glad to hear you liked the book!


Sad to hear this, since I've been following Drill for a while. From what I understand, Drill was based on the Google Dremel paper, hence the name.

https://research.google/pubs/pub36632/

Wondering if maybe Spark REPL or Apache Zeppelin might be a decent replacement for Drill.


One thing I like about Spark is the number of data sources it supports, including its ability to write out to a table or whatever. If you want to set up a JDBC or ODBC server, you can also access Spark SQL:

https://spark.apache.org/docs/latest/sql-distributed-sql-eng...

From an analytics perspective, a lot of people like the ability to connect to a JDBC or ODBC source.


I really like Apache Drill. I used it about 4 years ago as an MPP SQL-on-anything engine. Presto took a lot of that momentum away, and MapR's failure to adapt to the new age kind of meant EOL for the project. I hope someone else picks it up.


I wonder how many apache projects actually survive this sort of event?


Just use bigquery.


Still waiting for Apache dril.


For what?



