How to Become a Data Engineer in 2021 (khashtamov.com)
264 points by adilkhash on Jan 11, 2021 | 137 comments



(source for everything following: I recently hired entry-level data engineers)

The experience required differs dramatically between [semi]structured transactional data moving into data warehouses and highly unstructured data that the data engineer has to do a lot of munging on.

If you're working in an environment where the data is mostly structured, you will be primarily working in SQL. A LOT of SQL. You'll also need to know a lot about a particular database stack and how to squeeze it. In this scenario, you're probably going to be thinking a lot about job-scheduling workflows, query optimization, and data quality. It is a very operations-heavy workflow. There are a lot of tools available to help make this process easier.

If you're working in a highly unstructured data environment, you're going to be munging a lot of this data yourself. The "operations" focus is still useful, but as an entry-level data engineer you're going to be spending a lot more time thinking about writing parsers and basic jobs. If you're focusing your practice time on writing scripts that move data in Structure A in Place X to Structure B in Place Y, you're setting yourself up for success.
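
A toy sketch of what I mean (paths and field names are entirely made up) - read newline-delimited JSON out of a landing area, reshape each record, write CSV somewhere else:

    import csv
    import json
    from pathlib import Path

    SOURCE = Path("landing/events.jsonl")   # hypothetical "Place X"
    TARGET = Path("curated/events.csv")     # hypothetical "Place Y"

    FIELDS = ["event_id", "user_id", "event_type", "occurred_at"]

    def reshape(record: dict) -> dict:
        # "Structure A" -> "Structure B": keep only what downstream consumers need
        return {
            "event_id": record.get("id"),
            "user_id": (record.get("user") or {}).get("id"),
            "event_type": record.get("type", "unknown"),
            "occurred_at": record.get("timestamp"),
        }

    TARGET.parent.mkdir(parents=True, exist_ok=True)
    with SOURCE.open() as src, TARGET.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for line in src:
            if line.strip():                # tolerate blank lines in the feed
                writer.writerow(reshape(json.loads(line)))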

I agree with a few other commentators here that Hadoop/Spark isn't being used a lot in their production environments - but - there are a lot of useful concepts in Hadoop/Spark that are helpful for data engineers to be familiar with. While you might not be using those tools on a day-to-day basis, chances are your hiring manager used them when she was in your position and it will give you an opportunity to show you know a few tools at a deeper level.


Agree 100% with this comment.

Old stack: Hadoop, Spark, Hive, HDFS.

New stack: Kafka/Kinesis, Fivetran/Stitch/Singer, Airflow/Dagster, dbt/Dataform, Snowflake/Redshift


Huh, what replaces Spark in those lists?

For my money, it's the best distributed ML system out there, so I'd be interested to know what new hotness I'm missing.


distributed ML != Distributed DWH.

Distributed ML is tough because you have very little control over the training loop. I personally prefer using single-server training even on large datasets, or switching to online learning algos that do train/inference/retrain at the same time.

As for Snowflake, I haven't heard of people using it to train ML, but Snowflake is a killer managed distributed DWH that you don't have to tinker with and tune.


> Snowflake is a killer managed distributed DWH that you don't have to tinker with and tune

How do Snowflake (and Redshift, mentioned above) compare with CitusDB? I really like the PostgreSQL experience offered by Citus. I've been bitten by too many commercial databases where the sales brochure promises the product does X, Y, and Z, only to discover later that you can't do any of them together because reasons.


So do I, theoretically at least.

But Spark is super cool and actually has algorithms which complete in a reasonable time frame on hardware I can get access to.

Like, I understand that the SQL portion is pretty commoditised (though even there, the Spark SQL Python and R APIs are super nice), but I'm not aware of any other frameworks for doing distributed training of ML models.

Have all the hipsters moved to GPUs or something? \s

> Snowflake is a killer managed distributed DWH that you don't have to tinker with and tune

It's so very expensive though, and their pricing model is frustratingly annoying (why the hell do I need tickets?).

That being said, tuning Spark/Presto or any of the non-managed alternatives is no fun either, so I wonder if it's the right tradeoff.

One thing I really, really like about Spark is the ability to write Python/R/Scala code to solve the problems that cannot be usefully expressed in SQL.
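
For a concrete (made-up) example: applying arbitrary Python per row inside Spark, which is the kind of thing that's clumsy in pure SQL:

    from urllib.parse import urlparse

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()

    df = spark.createDataFrame(
        [("https://example.com/a/b?x=1",), ("https://example.org/c",)],
        ["url"],
    )

    # arbitrary Python logic applied per row
    extract_host = F.udf(lambda u: urlparse(u).netloc if u else None, StringType())

    df.withColumn("host", extract_host("url")).show()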

All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.


>I'm not aware of any other frameworks for doing distributed training of ML models.

TensorFlow, PyTorch (not sure if Ray is needed) and MXNet all support distributed training across CPUs/GPUs in a single machine or multiple machines. So does XGBoost if you don't want deep learning. You can then run them with KubeFlow or on whatever platform your SaaS provider has (GCP AI Platform, AWS Sagemaker, etc.).
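
For the PyTorch case, the entry point is DistributedDataParallel; a minimal toy sketch (CPU-only, single machine, two processes - real setups obviously differ) looks roughly like:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train(rank, world_size):
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        model = torch.nn.Linear(20, 1)       # toy model
        ddp_model = DDP(model)               # gradients get all-reduced across ranks
        opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = torch.nn.MSELoss()

        for _ in range(10):
            x, y = torch.randn(64, 20), torch.randn(64, 1)  # fake data per rank
            opt.zero_grad()
            loss_fn(ddp_model(x), y).backward()             # backward() syncs grads
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 2
        mp.spawn(train, args=(world_size,), nprocs=world_size)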

edit:

>All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.

Snowflake has support for custom Javascript UDFs and a lot of built in features (you can do absurd things with window functions). I also found it much faster than Spark.


> Snowflake has support for custom Javascript UDFs and a lot of built in features (you can do absurd things with window functions). I also found it much faster than Spark.

UDF support isn't really the same, to be honest. You're still a prisoner of the SELECT ... FROM pattern. Don't get me wrong, SQL is wonderful where it works, but it doesn't work for everything that I need.

I completely agree that it's faster than Spark, but it's also super-expensive and more limited. I suspect it would probably be cheaper to run a managed Spark cluster vs Snowflake and just eat the performance hit by scaling up.

> TensorFlow, PyTorch (not sure if Ray is needed) and MXNet all support distributed training across CPUs/GPUs in a single machine or multiple machines. So does XGBoost if you don't want deep learning.

I forgot about Xgboost, but I'm a big fan of unsupervised methods (as input to supervised methods, mostly) and Spark has a bunch of these. I haven't ever tried to do it, but based on my experience of running deep learning frameworks and distributed ML, I suspect the combination of both would be exponentially more annoying ;) (And I deal mostly with structured data, so it doesn't buy me as much).

> You can then run them with KubeFlow or on whatever platform your SaaS provider has (GCP AI Platform, AWS Sagemaker, etc.).

Do people really find these tools useful? Again, I'm not really sure what SageMaker (for example) buys me on AWS, and their pricing structure is so opaque that I'm hesitant to even invest time in it.


>UDF support isn't really the same, to be honest. You're still a prisoner of the SELECT ... FROM pattern. Don't get me wrong, SQL is wonderful where it works, but it doesn't work for everything that I need.

Not sure how it's different from what you can do in Spark in terms of data transformations. Taking a list of objects as an argument basically allows your UDF to do arbitrary computations on tabular data.

> I forgot about Xgboost, but I'm a big fan of unsupervised methods (as input to supervised methods, mostly) and Spark has a bunch of these.

That's true, distributed unsupervised methods aren't done in most other places I know of. I'm guessing there are ways to do that with neural networks although I haven't looked into it. The datasets I deal with have structure in them between events even if they're unlabeled.

>I completely agree that it's faster than Spark, but it's also super-expensive and more limited. I suspect it would probably be cheaper to run a managed Spark cluster vs Snowflake and just eat the performance hit by scaling up.

I used to do that on AWS. For our use case, Athena ate its lunch in terms of performance, latency and cost by an order of magnitude. Snowflake is priced based on demand so I suspect it'd do likewise.


Spark has a superset of the functionality Athena has. Athena is faster, but it's also very limited. They're not designed to do the same thing.


I'd put Spark in both lists. Old is Spark SQL, new is the programming language interface.


Snowflake I suppose for the average ML use case. Not for your high-performance ML, but for your average data scientist, maybe?

Edit: I may be wrong[1], would be curious to know what users who've used Spark AND Snowflake would add to the conversation.

[1] https://www.snowflake.com/blog/snowflake-and-spark-part-1-wh...


Snowflake hits its limits with complex transformations I feel. Not just due to using SQL. Its "type system" is simpler than Spark's which makes certain operations annoying. There's a lack of UDFs for working with complex types (lists, structs, etc.). Having to write UDFs in Javascript is also not the greatest experience.


> There's a lack of UDFs for working with complex types (lists, structs, etc.). Having to write UDFs in Javascript is also not the greatest experience.

We load our data into SF as JSON and do plenty of list/struct manipulation using their inbuilt functions[1]. I guess you might have to write a UDF if you are doing something super weird but inbuilt functions should get you pretty far 90% of the time.

https://docs.snowflake.com/en/sql-reference/functions-semist...


> best distributed ML system out there

I was comparing it to the "traditional" data engineering stack that used Spark for data munging, transformations, etc.

I don't have much insight into ML systems or how Spark fits there. Not all data teams are building 'ML systems' though. The parent comment wasn't referring to any 'ML systems', so I'm not sure why that would be automatically inferred when someone mentions a data stack.


Yeah, I suppose. I kinda think that distributed SQL is a mostly commoditised space, and wondered what replaced Spark for distributed training.

For context, I'm a DS who's spent far too much time not being able to run useful models because of hardware limitations, and a Spark cluster is incredibly good for that.

Additionally, I'd argue in favour of Spark even for ETL, as the ability to write (and test!) complicated SQL queries in R, Python and Scala was super, super transformative.

We don't really use Spark at my current place, and every time I write Snowflake (which is great, to be fair), I'm reminded of the inherent limitations of SQL and how wonderful Spark SQL was.

I'm weird though, to be fair.


I agree with this.

Along with ML, it is also a very high-performance extract and transformation engine.

Would love to hear what other tech is being used to replace Spark.


Can you elaborate more on the "roles" of the "new stack"? To me dbt/Dataform and Airflow/Dagster are quite similar, so why do you need one of each? Fivetran/Stitch/Singer are all new to me.


I've used all of these so I might be able to offer some perspective here

In an ELT/ETL pipeline:

Airflow maps to the "extract" portion of the pipeline and is great for scheduling tasks; it provides the high-level view for understanding state changes and the status of a given system. I'll typically use Airflow to schedule a job that will get raw data from xyz source(s), do something else with it, then drop it into S3. This can then trigger other tasks/workflows/slack notifications as necessary.
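
A stripped-down sketch of that kind of DAG (Airflow 2.x imports; the source, bucket and notification bits are all placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_to_s3():
        # placeholder: pull raw data from the source(s) and stage it in S3,
        # e.g. via boto3 put_object against a bucket you own
        ...

    def notify():
        # placeholder: kick off downstream work or a Slack notification
        ...

    with DAG(
        dag_id="raw_ingest_example",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
        downstream = PythonOperator(task_id="notify", python_callable=notify)
        extract >> downstream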

You can think of dbt as the "transform" part. It really shines with how it enables data teams to write modular, testable, and version-controlled SQL - similar to how a more traditional developer writes code. For example, when modeling a schema in a data warehouse, all of the various source tables, transformation and aggregation logic, as well as materialization methods are able to live in their own files and be referenced elsewhere through templating. All of the table/view dependencies are handled under the hood by dbt. For my organization, it helped untangle the web of views building views building views and made it simpler to grok exactly what and where might be changing and how something may affect something else downstream. Airflow could do this too in theory, but given you write SQL to interface with dbt, it makes it far more accessible for a wider audience to contribute.

Fivetran/Stitch/Singer can serve as both the "extract" and "load" parts of the equation. Fivetran "does it for you" more or less with their range of connectors for various sources and destinations. Singer simply defines a spec for sources (taps) and destinations (targets) to be used as a standard when writing a pipeline. I think the way Singer drew a line in the sand and approached defining a way of doing things is pretty cool - however, active development on it really took a hit when the company was acquired. Stitch came up with the Singer spec, and their offered service manages and schedules the various taps and targets for you.


Airflow allows for more complex transformations of data that SQL may not be suited for. dbt is largely stuck utilizing the SQL capabilities of the warehouse it sits on, so, for instance, with Redshift you have a really bad time working with JSON-based data in dbt; Airflow can solve this problem. That's one example, but last I was working with it we found dbt was great for analytical-modeling-type transformations, but for getting whatever munged-up data into a usable format in the first place, Airflow was king.

We also trained our analysts to write the more analytical dbt transformations, which was nice; it shifted that work onto them.

Don't get me wrong though, you can get really far with just dbt + Fivetran; in fact, it removes like 80% of the really tedious but trivial ETL work. Airflow is just there for the last 20%.

(Plus you can then utilize airflow as a general job scheduler)


Just a thought: what about Dremio?


> I agree with a few other commentators here that Hadoop/Spark isn't being used a lot in their production environments

I guess I'm the odd-man out because that's all I've used for this kind of work. Spark, Hive, Hadoop, Scala, Kafka, etc.


I should have specified more thoroughly.

I am not seeing Spark being chosen for new data eng roll-outs. It is still very prevalent in existing environments because it still works well. (used at $lastjob myself)

However - I am still seeing a lot of Spark for machine-learning work by data scientists. Distributed ML feels like it is getting split into a different toolkit than distributed DE.


I guess it depends on what jobs you're looking for. There's a lot of existing companies/teams (like mine) looking to hire people but we're on the "old stack" using Kafka, Scala, Spark, etc. We don't do any ML stuff but I'm on the pipeline side of it. The data scientists down the line tend to use Hive/SparkSQL/Athena for a lot of work but I'm much less involved with that.

Not all jobs are new pasture and I think that's forgotten very frequently.


I'd love to do some Kafka, Scala and Spark. What kind of exp are you looking for?


I'm also the odd one out; so many enterprises are moving to Spark on Databricks.


It does rather depend on what sort of data. I bet a data engineer at CERN or JPL has quite a different set of required skills to, say, Google or a company playing at data science because it's the next big thing.

I should imagine at CERN etc. knowing which end of a soldering iron gets hot might still be required in some cases.

I recall back in the mumble extracting data from b&w film shot with a high-speed camera, by projecting it onto graph paper taped to the wall and manually marking the position of the "object".


When I was there, almost 20 years ago, it was all about C++, Python and Fortran, with GUIs being done in Java Swing.

I bet it is still mostly the same, just using Web GUIs nowadays.


We have yet to see the proper sweet spot: a language that allows SQL-like handling without the restrictions of SQL.

As many advantages as SQL has, in many cases it gets in the way. The closer you move to moving data (instead of doing analysis), the more it becomes annoying.

On the other hand, current languages (such as python) lack support when it comes to data transformations. Even Scala, which is one of the better languages for this, has severe drawbacks compared to SQL.

Hopefully better type-systems will help us out in the long term, in particular those with dependent types or similar power to describe data relations.


What's your opinion of LINQ in C#? It's been a while since I've used it but to me it seems like one of the most powerful ways to manipulate data inside of a language.


I haven't used LINQ but I have a good idea about how it works. It is certainly great to write concrete SQL-like transformations (be it based on a database, a list, ...).

Where it lacks is abstraction. To make that more concrete, let me ask: can you write LINQ that takes an arbitrary structure, selects every numeric field and calculates the sum over it? And if that is not possible, it should not compile.

I.e. can you define a function "numericsSum(...)" and call it with "List(User(salary: Number, deductions: Number, name: String))" and have it calculate the sum (ignoring the name field) but have it fail to compile when called with "List(User(name: String, date_of_birth: Date))"?

Another example: is it possible to create your own special "join" function, that joins two data structures if they have exactly one common field (which you don't have to specify)?

In both examples, the LINQ compiler must be able to inspect the structure of a type (such as User) at compile time (not runtime) and do pretty arbitrary calculations. Most languages don't support that and I think even LINQ only works with concrete types in that sense. Which, by the way, is already better than what most languages offer - don't get me wrong here. But it is not as powerful as what SQL "compilers" offer - however those are then limited to SQL only, lacking the ability to do general purpose programming.


Great points. It depends on where the business is at, the scale of their data, how processed their data is, and the timeliness/accuracy requirements of that data.


I think it's missing resources for one of the hardest sections: data modelling, like Kimball and Data Vault. That, and maybe a section on modern data infrastructure. I'd put a link to [1] and [2] for a quick overview and probably [3] for more detail.

[1] https://www.holistics.io/books/setup-analytics/ [2] https://a16z.com/2020/10/15/the-emerging-architectures-for-m... [3] https://awesomedataengineering.com/


This. I also think modern columnar databases and other techniques somehow make Kimball obsolete, or at least relax it somewhat, but I could be very wrong.

For example we use Vertica and DBA told us that Vertica loves wide tables with many columns, which doesn't look very Kimball to me. This gives me some trouble as I'm not really sure how to model data properly.


> For example we use Vertica and DBA told us that Vertica loves wide tables with many columns, which doesn't look very Kimball to me.

I have heard advice like this from colleagues and frankly I don't buy it. It certainly isn't gospel. I think it's an oversimplification.

Columnar stores love star schemas. You can get away with a single table model too but you still need some kind of dimensional or at least domain-based thinking. Your single table is going to basically be a Kimball model but already joined together.

No database is going to be happy with joining orders and billing. The single table is still just going to be a single fact table, you just degenerate all the dimensions.

Personally I think you can gain a lot of benefit from doing proper stars because you get more sorting options but I'm a Redshift guy so maybe I'm stuck in that headspace.

I'm still waiting for someone to come along and propose something different but honestly Kimball's dimensional mental model still resonates with me. Are there compromises, can you relax the model more? Of course, but you're still going to realize huge benefits from starting with that approach. I don't think there is some "new" way of thinking that really changed the data space. All the innovation is on the compute side.

I have precisely zero Vertica experience so maybe I'm totally missing something. I'd be happy for someone to tell me I'm wrong.


I think Kimball has one benefit that single table doesn't have, i.e. it's much more business-intuitive. For example it's really difficult for me now to explain to our new hire why two completely unrelated fields are stuck in one table and one is NULL for 95% of the time.

The wide table thing is also kind of certified by someone inside of Vertica so I guess it does make sense. I'm not actually working in a data team so I don't have the CS knowledge to prove or disprove it. Fortunately our new data lead is going to convert everything into Kimball so we are going to have solid proof eventually.


The CEO of Chartio recently gave a talk advocating for wide tables [1], but I don't fully agree.

Having a wide table can have some performance benefits, but I think the primary benefit of dimensional modelling isn't related to performance, but having standardized definitions, or "conformed" dimensions. For example, having a "calendar" dimension with attributes you can pull into your analysis such as fiscal quarter or reporting week.

In a data model with wide tables, if you need to change the definition of a dimension, it's going to be painful to track down all the different places where that old dimension is used.

[1] https://www.youtube.com/watch?v=3OcS2TMXELU


Those common definitions are priceless. On top of not repeating yourself you also avoid questions like “why does carlineng’s report say it is week 43 but mulmen’s report shows week 44?” This is even worse if we both calculated revenue but I did SUM(payment) and you used SUM(price*qty).

I do think there’s a potential for performance benefits too because you can sort the fact and dimensions in ways that make sense for each one. Calendar is sorted by week number, product is sorted by a category. Fact is sorted by a date and locale, etc. Since you don’t materialize the full projection, more expensive but flexible sorting like interleaving becomes possible.

It’s probably academic but it’s not as hard as people make it sound and I think there are real benefits over these wide tables with simple sorting.

Plus if you use a tool like Looker it will write the reporting SQL for you! Generating star schema queries is easy once the tables are in place.


Yeah, I think this is a common misconception with columnar stores: if they like wide tables (they enable wide tables), that must mean the wider the better.

Sorting (or partitioning) is one of the most powerful optimizations in your toolbox. But only when optimized for some kind of access pattern. When you combine domains to get more width you have to make a compromise on the sorting. Then the wheels come off.

You still need different tables for clicks and orders and payments, even if they are very wide. You may or may not physically conform your dimensions in pure Kimball style but logically you (should) still start there.


Yeah I really wish I could understand all these. There are too many words. Vertica, Kafka, Spark, and we use all of them. Figured I've got to at least know their fundamentals to make good choices.


Yeah I don’t have experience with all the tools. I’m sure they are great and have their strengths. My current setup uses both EMR and Redshift but the data model is the same on both.


SQL is easy, data is hard.


SQL proficiency is important but I wouldn't say it supersedes programming experience. To me, Data Engineering is a specialization of software engineering, and not something like an analyst who writes SQL all day.

As DE has evolved, the role has transitioned away from traditional low-code ETL tools towards code-heavy tools: Airflow, Dagster, dbt, to name a few.

I work on a small DE team. We don't have the human power to grind out SQL queries for analysts and other teams. Our solutions are platforms and tools we build on top of more fundamental tools that allow other people to get the data themselves. Think tables-as-a-service.


SQL proficiency is something I've seen developers of all sorts neglect, which I think is a huge mistake. And relegating SQL to something that just an "analyst" does is an even bigger mistake.

Several times over my career I've been brought in on a project where the team was considering replacing their RDBMS entirely with a no-SQL data store (a huge undertaking!) because they were having "performance problems". In many cases the solution is as simple as adding an index or modifying a query to use an index, but the devs regard it as some kind of wizardry to read a query plan.
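
A self-contained toy illustration of how little wizardry it takes (SQLite here just because it's easy to run; the idea carries over to any RDBMS):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 1000, i * 1.0) for i in range(100_000)],
    )

    query = "SELECT SUM(total) FROM orders WHERE customer_id = 42"

    # before: the planner has to scan the whole table
    print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

    # after: the planner searches the index instead of scanning every row
    print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())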


I spent way too large a portion of my last position teaching developers about indexes, query plans and underlying join types and their impact on performance and memory consumption.


Just curious, do you have such knowledge for columnar databases such as Vertica?


Not Vertica, though that looks very interesting. I do have a lot of experience with Redshift though. The difficulty is most implementations of data warehouses are fairly bespoke, even down to query plan and execution, so knowledge of Redshift may not completely transfer to Vertica, for instance.


Thanks. But how does one approach learning the internals of these things? It's not like MySQL or SQL Server or PostgreSQL where we have tons of books and very detailed documentation. For Vertica we only have a doc, no books, just provided as is.

It seems to be the norm for everything that took flight around 2010. Of course many are open sourced so those are OK I guess.


The thing is finding the terminology; in the case of Redshift that is Sort Key, Distribution Key and primary key (though these aren't true primary keys, they do influence the query planner).

It took me a few minutes but I found indexes are called projections in Vertica and are more like materialized views than true indexes. Here are the docs with a breakdown: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Ad...

And here is a general walk through of the architecture including key concepts such as projections. https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/C...

In a few cases I have had to go to published white papers on the technologies as well.

But honestly it's all searching for the right words and then crawling through docs and papers.


I wholeheartedly agree with the "specialization" comment.

Unless you are in a position where you can entirely rely on managed tools that do the work for you and all effort is centered around managing the data, rather than the holistic view of your data pipelines (Talend ETL, Informatica - the "pre-Hadoop" world, if you will, and maybe some modern tools like Snowflake), a good Data Engineer needs a deep understanding of programming languages, networking, some sysadmin stuff, distributed systems, containerization, statistics, and of course a good "architect" view on the ever-growing zoo of tools and languages with different pros and cons.

Given that at the end of the day, most "Data Pipelines" run on distributed Linux machines, I've seen and solved endless issues with Kernel and OS configurations (noexec flags, ulimits, permissions, keyring limits ...), network bottlenecks, hotspotting (both in networks and databases), overflowing partitions, odd issues on odd file systems, bad partition schemes, a myriad of network issues, JVM flags, needs for auditing and other compliance topics, heavily multi-threaded custom implementations that don't use "standard" tools and rely on language features (goroutines, multiprocessing in Python, Threadpools in Java ...), encoding problems, various TLS and other security challenges, and of course endless use of GNU tools and other CLI fun that I would not necessarily expect for a pure SQL use case (not discounting the fact that SQL is, in fact, very important).

Not to mention that a lot of jobs / workflows Data Engineers design and write tend to be very, very expensive, especially on managed Clouds - generally a good idea to make sure everything works and your engineers understand what they are doing.


TL;DR: SQL is still a DE's best and most ubiquitous tool/language in the space--hands down.

I've led DE teams for the last decade. I have lived through shifts in toolsets, languages, etc. Regardless of platform, languages, model types, etc, etc, etc, etc, the one constant has been SQL with some sort of scripting around it.

Right now, it seems Python is the big wrapper language, whether it's via a DAG or some other means, but that's just the preferred method TODAY. Considering SQL has been around for decades and has outlasted just about every other language and system, many of which have opted for a SQL-like interface on top of their system, I would highly recommend DEs be very strong there.


2021? More like 2010. Hadoop is getting deprecated rapidly and more companies split their write and read workloads. Separated storage and compute is also popular. Scala is not used that much; I think it is not worth the time investment. More and more companies go for Kotlin instead of Java when they want to tap into the Java ecosystem.


Hadoop is still widely used in enterprises (especially in banks), if you have experience working with Hadoop ecosystems it is a big plus anyway.


Yes, that is the status quo.

There are also some trends:

https://trends.google.com/trends/explore?date=today%205-y&ge...


Kotlin's future is tied to Android; on the JVM it will be another Scala in 5 years' time.

If JetBrains gets lucky, they might manage to create a cross-platform Kotlin ecosystem as they are trying hard to push, as a means to sell IntelliJ licenses.

Let's see if it doesn't end up like Typesafe.


Are you working in this field? I am looking for a consultant to set up a modern data processing pipeline for a data-driven hardware product I am building.


Yes, I am moving companies to their next data pipeline, that is my specialty. I have added my email to my profile, you can reach out to me.


What does your typical stack look like?


Depends. Just some random mixture of stacks: PrestoDB, S3, Airflow, Luigi, Dremio, Athena, Hive LLAP, EMC Isilon, Kafka.

My favorite so far is S3 + PrestoDB with either ORC or Parquet files. It is a solid DWH solution for most enterprises on the cloud. (Cloud or not is a different discussion). It works for small scale (50TB) to really high scale (50PB). There are some (very few) gotchas and moving parts as opposed to Hadoop + co. You can combine it with Kafka for streaming data and you got yourself a pretty solid data solution.


I think learning Scala is a bit of a waste of time, but I don’t know everyone’s stack. Maybe it’s a west coast bubble, but serverless seems to be the most popular choice for new ETL stacks even if the rest of the cloud tech stack isn’t serverless. AWS tools like Kinesis, Glue (PySpark), Step Functions, pipelines, Lambdas, etc.

If you are working in that domain, being able to use the CDK in TypeScript becomes way more important than being able to build a Hadoop cluster from scratch using Scala.


Glue is both more of a pain in the butt than regular old Spark with PySpark and way more expensive; from my experience, I would seriously question someone suggesting to use it.

We could have been using it wrong, but porting our Glue scripts to standard EMR after our initial POC saved us over 10x the cost and it was substantially faster.


Both pricing and start-up times are significantly better in Glue 2.0 (assuming one can migrate). But even on Glue 1.0, orchestrating an ETL process with several dozen jobs involves a non-trivial amount of configuration and labor (job failures, job restarts, paging, job run history, CloudWatch logs, re-usable infrastructure as code when creating new jobs, permissions and security, etc.), so the increased cost is more than worth it for us.

https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featur...


We're crawling and processing TBs of web data; we just use some Python workers, Airflow, SQS and trigger a few scheduled EMR jobs, easy peasy. Restarts and whatnot are handled by Kubernetes at the container level and by Airflow at the code level. Airflow bakes in permissions and managing jobs. Glue left a lot to be desired in that area, and $400-600 per ingest can't beat $30 for the time the EMR cluster is up, and since we use Kube for everything already it wasn't much of a hassle to continue using it here. I'm sure in your case it makes sense, and in ours it didn't, and this is why technology is crazy :P


...and nothing of basic statistics? Data Science people want to know about your data pipeline and have some quantification of the quality of that data. Also, monitoring data pipelines for data integrity often relies upon a statistical test. You don't need to go as far as Bayesian but you do need to understand when a median goes way off or if it's bi-modal, etc.


That should be assumed in the "engineer" part of the role.


Yeah I would definitely expect an engineer to have a grasp of basic statistics such as mean, median, mode and be able to interpret statistical graphs on a basic level (Modality, skewness, shape, etc) .


a good catch, thanks!


I've been in this space for the last 6 yrs or so and my Scala usage has gone down to zero. Not worth learning Scala.


This is an anecdote - plenty of firms are using Scala in their data engineering stacks and it's a great tool for the job.

While maybe not strictly necessary per se, it's a great way to get a foot in the door, and a good way to pick up advanced type systems and functional programming (I personally find it to be a really fun language to write in, to boot).


> it's a great tool for the job.

What job can this do that can't be done via SQL? Dealing with unstructured data?


> plenty of firms are using Scala in their data engineering stacks

Isn't that just a result of everyone being into Spark a few years ago?


Regardless of the source I think the main point folks are missing is that a lot of DE jobs will require you to know Scala so it's a good tool to have if you want to be a DE somewhere.

SQL is also an amazing tool and you should definitely learn it, but there are a lot of DE jobs out there with Scala in the "Requirements" section of the job listing. Parts of the industry might be moving away from it, but if you're looking to make a jump into DE I think you're hamstringing yourself by avoiding Scala.


Scala, when it's not used because it's just what someone learned the ropes with, is the Haskell of data science and machine learning: it's what people use when they want to inflate their credentials and/or egos.


What languages are worth learning?


SQL has taken over the space completely. 90% of data munging and transforms happen via SQL.

I would learn Python. It's the number one language outside SQL.


What do you use if you need to process hundreds of GBs of data?


PySpark or Dask
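
E.g. a minimal PySpark job over a pile of Parquet (paths and columns are made up) - the same code runs whether it's tens of GBs on one box or TBs on a cluster:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("example").getOrCreate()

    events = spark.read.parquet("s3://example-bucket/events/")   # hypothetical path
    daily = (
        events
        .withColumn("day", F.to_date("event_ts"))
        .groupBy("day", "event_type")
        .agg(F.count("*").alias("n_events"))
    )
    daily.write.mode("overwrite").parquet("s3://example-bucket/daily_counts/")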


sql on snowflake


Assuming the data is relatively clean in the first place - there are a lot of data sources where SQL would be an absolute swine to work with.


SQL, Python, Terraform, maybe some basic Java. Airflow is pretty common. Whatever the company is migrating away from. As long as you're good at one of those and can pick up the rest on the fly you should be fine to start out.

edit: Guess this was pretty much in the post.


I've been approached about various data engineering jobs over the last couple years and the job descriptions have varied wildly. It has been everything from:

1. SQL/analytics wizard, capable of building out dashboards and quickly finding insights in structured data. Oracle/MSSQL/PostGres etc. Maybe even capable of FE development.

2. Pipeline expert, capable of building out data pipelines for transforming data, Flink, Spark, Beam on top of Kafka/Kinesis/Pubsub run from an orchestration engine like Airflow. Even this could span from using mostly pre-built tools, wiring together things with a bit of Python to move data from A to B, to the other extreme of a full-fledged Scala engineer writing complex applications that run on these pipelines.

3. Writing infrastructure software for big data pipelines, customizing Spark/Beam/Flink/Kafka and/or writing custom big data tools when out of the box solutions don't work or scale. Some overlap with 2, but really distinguished by it being a full fledged software engineer specializing in the big data ecosystem.

So, are all three of these appropriate to call Data Engineer? Is it mainly #1 and people are getting confused? I would certainly fall into the #3, so I'm always surprised when people approach me about 'SQL transform' type jobs.


I’d call 2 and 3 data engineers and 1 either a data analyst or BI developer/engineer depending on technical proficiency.


"In order to undestand how these systems work I would recommend to know the language in which they are written. The biggest concern with Python is its poor performance hence the knowledge of a more efficient language will be a big plus to your skillset."

What? The Apache stack that's written in Scala recompiles all your code into JVM bytecode, regardless of what language you've written it in. Yes, that includes Scala. Spark isn't actually firing up a python interpreter and running your python code on the data.


> In order to understand how these systems work I would recommend to know the language in which they are written. The biggest concern with Python is its poor performance hence the knowledge of a more efficient language will be a big plus to your skillset.

I think these two sentences are sort of orthogonal to one another. The first, I interpret as saying that it's useful to understand Scala if you're using Spark, essentially because of the law of leaky abstractions [1]. I think you're responding to the second sentence and in that case I agree.

[1] https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...


Nice article. From experience I would say the SQL knowledge should be advanced though. Not intermediate.


Somewhat outdated view. This may be the current stack, but it's outdated now and is slowly being replaced. The new view is not big data pipelines and ETL jobs; it's lambda architecture, live aggregations/materialized views and simple SQL queries on large data warehouses that hide the underlying details. The batch model may still apply to ML I guess, but I'm no expert there.


This is true for only a very limited subset of data producers that need real-time or near real-time data included in ML models. For 99% of the rest, batch processing is just fine and considerably more economical.


Any resources/guides you'd recommend?


Great resource, thanks for sharing it! I will dig deeper into the resources linked here as there's a lot I have never seen before. The main topics are more or less exactly what I've found to be key in this space in the last 2 months trying to wrap my head around data engineering in my new job.

What I'm still trying to grasp is first how to assess the big data tools (Spark/Flink/Synapse/BigQuery et al.) for my use cases (mostly ETL). It just seems like Spark wins because it's most used, but I have no idea how to differentiate these tools beyond the general streaming/batch/real-time taglines. Secondly, assessing the "pipeline orchestrator" for our use cases, where, like Spark, Airflow usually comes out on top because of usage. Would love to read more about this.

Currently I'm reading Designing Data-Intensive Applications by Kleppmann, which is great. I hope this will teach me the fundamentals of this space so it becomes easier to reason about different tools.


In my experience, teams operating under "The AI Hierarchy of Needs" principles are optimized for generating white papers.


We want all these skills, yet, we'll give you a separate title and pay you less than a software engineer. Meanwhile front end software engineers are still software engineers and get high pay.


I just got the 2020 stats from one of the biggest tech recruiters in my country. On every level DEs outperform SWEs on salary.


I don't think data engineers are paid less than software engineers.


They are. I should know. I've worked as one for years, including at big tech companies. E.g. FB has lower pay than SWE, lower RSUs, etc., and you can only get SWE pay if you transition into one, and that requires you to go through an internal interview process.


That's totally contrary to my experience where Data Engineers are considered specialized SWE and paid more.

But I've never worked for FB.


You are right in the sense that if you look at average SWE salaries and data engineering salaries, the average salary is higher for data engineers. Because the starting salaries for data engineers tend to be higher because of all the skills that are needed and there's plenty of SWE positions that require more than just a degree in CS. But if you start comparing salaries at maybe a senior level (4-5 YOE+), the salaries for SWEs start becoming a lot higher than for DEs. And again, I've worked in different companies, big and small; this holds true for all companies that have "Data Engineer" titles. There's of course companies like Netflix where you are a data engineer but still get a SWE title and get paid the same.


Experienced data engineers should graduate to data architect/ML engineer roles and this way they can get on par with SWEs; please correct me if I am wrong.


You're right, but most companies do not have those positions formalized and therefore you're expected to do those as part of your job, but not gain the financial benefits. Also, there is a big disparity in these titles and what the duties entail, which inherently again feeds into the problem.


> You're right, but most companies do not have those positions formalized and therefore you're expected

To move on to somewhere that does, until the laggards get the message.


AFAIK a "data engineer" at FB specifically is not quite equivalent to "software engineer specializing in distributed systems" which this article refers to--i.e. there is a totally separate job track for SWEs in general. The "data engineer" job title is definitely not standardized between companies.


In my experience, they get paid more.


Same here


For GCP, our stacks tend to be Composer (Airflow), BigQuery, Cloud Functions, and Tensorflow.

There's the occasional Hadoop/Spark platform out there, but clients using those tend to have older platforms.


What is your product? I am looking for a consultant to help me set up a good process for a data-driven hardware product.


The work I do is almost entirely Google Analytics/Ads related. So probably not what you're looking for, but if so, leave your email and I'll reach out!


Incidentally does anyone have resources for SMALL data? E.g. a few MB at a time, but requires the same ETL, scheduling, traceability. I'd love some lite versions of big-data tools but needs to be simple, small and cheap.


Most orgs working with small data I've seen will just fall back to the tutorial version of some big data tools (often times just eating the unused infrastructure cost for something like Hadoop when they're generating biweekly reports). Most project managers have a dream of their project scaling up and want to be prepared should a dream become a reality. And if you "under-engineer" (by which I mean specifically engineer for the problem the company is facing), you'll get called out by every armchair developer for not going with the "obvious, best solution."

I'm not bitter; you're bitter. /s


Yeah I'm bitter. I want to bring structure, tools and discipline to small scale data gathering, but the big tools are just too time-consuming to get up and keep running with just a few hours.


Depending on what your sources and sinks are:

* Microsoft SSIS is still there, kind of a granddaddy tool but perfectly capable of single-machine ETL

* Trifacta's Wrangler has a free version with limits

* Talend's Open Studio is free, a little clunky but works fine

* Some new players that I've played around with are Airbyte (immature but evolving quick) and Fivetran (consumption-based pricing model, fairly extensible, but kind of biased about the sources/sinks they're interested in supporting)

* I haven't tried Streamsets or Stitch yet, but I've watched a few videos, again, a little more focused on cloud and streaming data sources than traditional batch ETL, but seem fair enough for those use cases as well

* If you want to roll your own SQL/Python/etc ETL, Airflow and Luigi are good and simple orchestrators/schedulers

The cloud services have pretty cheap consumption-based ETL PaaS offerings, too: Azure Data Factory, Amazon Glue, GCP Cloud Data Fusion

Unless what you're doing is highly bespoke ETL, I'd recommend trying out the new kids on the block and seeing if you can build pipelines that suit your needs from those, because they're at the forefront of a lot of evolving data architecture patterns that are about to dominate the 2020s.


Take a look at our https://www.easydatatransform.com tool. It is a drag and drop data munging tool for datasets up to a few million rows. It runs locally on Windows or Mac. You should be able to install it and start transforming your data within a few minutes. It doesn't have a built in scheduler (yet), but you can run it from the command line.

Excel Power Query is also quite lightweight. But is pretty klunky in my (biased) opinion.


Thanks, there are enough workstation tools, but I want an automated tool that runs on a server.


I have been working with small data for a few years. We built an internal library to move data in/out of systems and then schedule this into jobs. We mostly leverage S3 and Spectrum. The major complexities we found were in scheduling, proxies and fetching raw data from legacy applications.


No reason you can't still use Airflow for the orchestration/dependency management/scheduling. And for the processing and storage - Pandas and SQLite.
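
At that scale a whole "pipeline" can be as small as this (file and table names made up):

    import sqlite3

    import pandas as pd

    # hypothetical CSV drop from an upstream system
    df = pd.read_csv("exports/orders_2021-01-11.csv", parse_dates=["order_date"])

    # light cleanup / typing
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df = df.dropna(subset=["order_id", "amount"])

    # load into a local SQLite "warehouse"; replace-per-run keeps it idempotent
    with sqlite3.connect("warehouse.db") as conn:
        df.to_sql("orders", conn, if_exists="replace", index=False)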

Also would highly highly recommend looking into kedro (which has airflow integration, or you could just run your pipelines with crontab)


Take a look at Luigi, which is a lightweight task orchestrator with minimalistic dependencies.

[1] https://github.com/spotify/luigi


In AWS, Lambdas and Step Functions.

https://aws.amazon.com/step-functions/


Thanks, presumably this is similar to AWS Airflow too.


IMHO first you need to become a programmer, then you can become a data engineer. So if you need to start by learning data structures, you are doing something wrong. Also the topics suggested in "Algorithms & Data Structures" could easily be skipped; the information is drastically misleading. We should seriously have some fact checker, otherwise this kind of bullshit article keeps trending on HN and people keep wasting their time on learning LSM trees (what the fuck is that in the first place).


Can you recommend something else? I'm preparing to go through a full CS education.


Coding and building teach you more than taking a course or watching a video. If you don't have any programming background, you can enroll in some Coursera or Udacity courses to start with. Then go through this course http://web.stanford.edu/class/cs106x/; the course reader is really good. After that, for data engineering, read this book https://www.amazon.com/Designing-Data-Intensive-Applications.... Also learn some SQL. Take some data, feed it into a SQLite DB, ask questions and convert the questions into queries. Becoming good at this takes some time. Be patient. The learning curve is like a hockey stick: the initial phase might have a dip but it accelerates in the later phase. BY ANY CHANCE DO NOT JOIN A BOOTCAMP.


Thanks. I'm actually in the middle of CS106 and indeed the reader is pretty good. I also have the data-intensive book, which I read a bit, but I think I need more technical muscle for it.


In data engineering your goal is "standardization". You can't afford every team using their unique tech stack, their own databases, coding styles, etc. People leave the company all the time and you as a data engineer always end up with their mess which now becomes your responsibility to maintain. You'd at least be grateful if those people had used the same methods to code stuff as you and your team so that you wouldn't have to become a Bletchley Park decoding expert any time someone leaves. Or you'd hope the tech stack was powerful and flexible enough that other people other than engineer types could pick it up and maintain themselves. They mostly cannot do that, because there is no such powerful system out there. Even when some modern ELT systems get you 80% there, you, data engineer, are still needed to bridge the gap for the 20% of the cases.

Data Engineering really comes down to being a set of hacks and workarounds, because there is no data processing system which you could use in a standardized systematic way that data analysts, engineers, scientists and anyone else could use. It's kind of a blue-collar "dirty job" of the software world, which nobody really wants to do, but which pays the highest.

There are of course other parts to it, such as managing multiple data products in a systematic way, which engineering minds seem to be best suited for. But the core of data engineering in 2020, I believe, is still implementing hacks and gluing several systems together so as to have a standardized processing system.

Snowflake or Databricks Spark bring you closest to the ideal unified system despite all their shortcomings. But still, you sometimes need to process unstructured jsons, extract stuff from html and xml files, unzip a bunch of zip archives and put them into something that these systems recognize and only then you can run sql on it. It is much better than the ETL of the past, where you really had to hack and glue 50% of the system yourself, but it is still nowhere near the ideal system in which you'd simply tell your data analysts: you can do it all yourself, I'm going to show you how. And I won't have to run and maintain a preprocessing job to munge some data into something spark recognizable for you.

It is not that difficult to imagine a world where such a system exists and data engineering is not even needed. But you can be damn sure that, before this happens, this position will be here to stay, and will be paying high, when 90% of ML and data science is data engineering and cleaning and all these companies hired a shitton of data science and ML people who are now trying to justify their salaries by desperately trying to do data engineers' jobs.


> It's kind of a blue-collar "dirty job" of the software world, which nobody really wants to do, but which pays the highest.

Love this quote. It hits the nail on the head. Not sure why it's paid so well though...


Why would no one want to do it though? It's way more interesting than my current BA job. I managed to squeeze a bit into the role by opting to do some ETL and it's already very rewarding.

I'd definitely do it even if it's the same pay as BA.


Amazon introduced Step Functions, which are very nice to dig into and a helpful skill for Data Engineering.


With direct integrations to EMR, Lambda, and Athena, it's a great tool for building pipelines and effectively costs nothing on its own and is completely headache-free.


This is exactly how I use it. I left out that important point.


Your data is only as good as your instrumentation and you usually only get one chance to grab that data, but can have many goes at processing it, so I would argue the bit not covered is the most important.


I see a lot of "Spark is dead" talk here. What replaces it for transforms in between something like Kafka and Redshift/BigQuery?


I agree with both sides here. Spark was the in thing for a while so a lot of places are using it but probably don't need it and could have been better off running various SQL scripts to do some transformations. I worked on a project exactly like this where we should have used SQL scripts instead of Spark.

But I also think that a lot of enterprise pipelines went all in on spark and so now moving to something else (SQL scripts, Snowflake, etc.) just isn't worth it. So Spark is dead, long live Spark.


Shameless plug to my much simpler (simplistic?) view of things. In this case, I think Data Engineers are the people building systems that solely focus on the data, all the data and nothing but the data.

https://www.linkedin.com/pulse/mapping-data-science-professi...


Advanced proficiency in SQL and in any scripting language of your choice (C#/PowerShell, Python) is enough to be a data engineer on any technical stack: Windows/Linux, on-prem/cloud, vendor-specific/open source, literally anything.


I disagree. That's not enough these days.

If you want to build anything mildly interesting, you need to have a solid background in software engineering (building data pipelines in Spark, Flink, etc. goes way beyond knowing SQL), you need to really understand your runtime (e.g. the JVM, and how to tune it when working with massive amounts of data), and you need a bit of knowledge about infrastructure, because some of the most specialized and powerful tools do not yet have an established "way of doing things", and their stateful nature makes them different from your typical web app deployment.

Maybe if you want to become a data analyst you only need SQL, and I would still doubt it. But data engineering is a bit different.


I believe what you described is the job of a Platform Engineer/Systems Engineer/Data Lake Architect, especially the JVM aspect of it. The interesting work is in the beginning when you build the cluster initially, or do a major extension; after that the ops/maintenance is usually outsourced to cheap labor offshore - so this kind of job is personally not for me.

Spark has a DataFrame API which is similar to the pandas API and can be learned in one day, especially if you know Python.
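
To illustrate with toy data, the two really do rhyme:

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    pdf = pd.DataFrame({"country": ["DE", "DE", "FR"], "amount": [10, 20, 5]})
    sdf = spark.createDataFrame(pdf)

    # pandas
    print(pdf.groupby("country")["amount"].sum())

    # PySpark - same idea, slightly different spelling
    sdf.groupBy("country").agg(F.sum("amount").alias("amount")).show()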

Same for Airflow and other frameworks; it's just a fancy scheduler that anyone can pick up in a couple of days.


> building data pipelines in Spark, Flink, etc. goes way beyond knowing SQL

What if you build your data pipelines in SQL? Curious if you have an example of a data pipeline that needs Spark.


Is anyone here using kdb+/q for data engineering, specifically outside of finance?


Where I work, our stack is all about GCP/Airflow/Python/BigQuery ML, for recommender systems. I'm now playing around with Turi Create (Apple) to compare with BQML.


I don't recommend it



