Data engineering learning path with recommended resources (awesomedataengineering.com)
207 points by snird on Oct 18, 2020 | 26 comments



First, this is a nice resource, so good job!

As someone who has worked for the past several years in this space, I'd say the biggest problems in data engineering are holistic in nature. Sure, you need to know Python, SQL, data warehouses, data modeling, etc., but to me the biggest problems by far have to do with the entire architecture: How do you extract data from potentially unreliable sources, pull it into some staging area, and build further workflows on top of that raw data to reliably update or create data warehouses/marts or deploy ML models? How do you allow everyone in your company to access and work with the data in a compliant and secure way? How do you test any of this? How can distributed teams, sometimes technical, sometimes more business-oriented, interact with the architecture, add/control data, and release it into the overall company data stream? Has anyone found a reliable and maintainable way to set up CI/CD for company data architecture/pipelines/projects?

To me these are the big problems. And if anyone has any resources for any of these topics I would be super interested, since I deal with these problems daily :)
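On the testing question specifically, one low-tech starting point is to keep pipeline transforms as pure functions so they can be unit-tested without any infrastructure. A minimal sketch (`normalize_users` is a made-up transform, not from the linked site):

```python
# Sketch: keep pipeline transforms as pure functions so they can be
# unit-tested with plain fixtures, no warehouse or orchestrator needed.
# `normalize_users` is a hypothetical transform for illustration.

def normalize_users(rows):
    """Lowercase/trim emails and drop rows missing a required id."""
    return [
        {**row, "email": row["email"].strip().lower()}
        for row in rows
        if row.get("id") is not None
    ]

def test_normalize_users():
    rows = [
        {"id": 1, "email": "  Alice@Example.COM "},
        {"id": None, "email": "bob@example.com"},
    ]
    assert normalize_users(rows) == [{"id": 1, "email": "alice@example.com"}]
```

It doesn't answer the CI/CD question, but it at least makes individual pipeline steps testable in isolation.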


Not only that,

- How do you make sure users are not writing badly optimized tables/pipelines that end up consuming too many resources?

- How do you facilitate data discoverability so they don't end up creating a new table where 90%+ of the data is already present in an existing one?

- How do you make sure they are mindful of how they model their tables, so that there are not too many files due to bad partitioning/bucketing, compression is leveraged, and good datatypes are picked?


I work in this space and the questions outlined in your and your parent's comments are absolutely spot-on.

The worst part is, these things tend to grow organically where the original ancestor of everything is engineer #3 of the 5-person startup who decided to write a cronjob to dump the prod db every night to feather files on a samba share. Then one cron job becomes three. Then ten. Then you're using S3. Then there's dependencies. Then you're using Luigi/Airflow. Then you're using Spark. Then you're constantly messing with partitioning and performance. Then you're using Hadoop and YARN and configuring queues and capacity scaling. At this point the "infra" team is 10 people and there's 20 data scientists. Then you triple in size a couple dozen times and now it's time to figure out how to retroactively slap security controls on top of everything.

That turned into a bit of a rant. Honestly I love this space but I agree, as hard as the software bits are, the hard part is the entire picture as a whole.


Thank you for the kind feedback!

I absolutely agree. The hard problems are both organisational (how do you communicate with analysts and agree on a working method with the business?) and about dealing with unreliable third-party resources.

I feel these things can only be learned through experience; no written resource can reliably transfer this knowledge.

I think data engineering in particular is something that requires at least an apprenticeship to get into, both for juniors and for senior developers transferring to a data engineering position.


See if the table of contents of these books address some of your requirements:

* https://nostarch.com/seriouspython

* https://www.manning.com/books/practices-of-the-python-pro


Those resources won't really help OP. What they're talking about is better handled by bespoke ETL architecture alongside workflow orchestration tooling (like Airflow or Prefect) to handle versioning and deployment of modeling and ingestion services in production.

The orchestration part handles the workflows that comprise your ingestion and ETL processes. These are like managed cron jobs specific to data engineering lifecycles. The bespoke part of the architecture is what you'd compose together to handle all of the other requirements; for example, what applications do you build, and how do you design your data warehouse, such that the architecture can be used by both data science and marketing teams?
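Stripped of any particular tool, the core of what an orchestrator manages is a DAG of named tasks executed in dependency order; Airflow/Prefect add scheduling, retries, backfills, and monitoring on top. A toy sketch of just the ordering part, using only the stdlib (Python 3.9+ for `graphlib`; task names are invented):

```python
# Toy sketch of the heart of workflow orchestration: tasks plus a
# dependency graph, executed in topological order. Real orchestrators
# (Airflow, Prefect) add scheduling, retries, and monitoring on top.
from graphlib import TopologicalSorter

def extract():
    return "raw rows from source"

def transform():
    return "cleaned rows"

def load():
    return "rows loaded into warehouse"

TASKS = {"extract": extract, "transform": transform, "load": load}
# Each task maps to the set of tasks that must run before it.
DEPS = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

def run_pipeline():
    order = TopologicalSorter(DEPS).static_order()
    return [(name, TASKS[name]()) for name in order]
```

Everything bespoke (staging layout, warehouse design, access for non-engineering teams) then hangs off the edges of a graph like this.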


These are largely solved by data lake vendors: Snowflake, Databricks, and the like.

An experienced Databricks/Snowflake architect (or a couple of them) can easily set up and maintain a data lake that supports most of what was mentioned.

Overall I absolutely agree that, rather than learning Python or SQL, it is a much better use of one's time to learn/get certified as a data lake architect and be able to create a large data lake from scratch, set up pipelines, and maintain them.


>but to me by far the biggest problems have to do with the entire architecture

Data engineers are dependent on software engineers, and the software engineering is the more difficult part IME.


You seem to badly misunderstand what "data engineer" means. IME the role is basically infrastructure up to custom ETL jobs. It is a specialized strain of software engineering.


Here are some more awesome, free Python learning resources:

* https://greenteapress.com/wp/think-python-2e/

* https://automatetheboringstuff.com/2e/

* https://dabeaz-course.github.io/practical-python/Notes/Conte...

Also, I'd highly discourage tutorialspoint as a resource. Here's an example of them rewording another tutorial as their own: https://twitter.com/nixcraft/status/998248317661335552


I’m currently working as a data engineer; before that I was a DBA for 5 years. I’m coming to think the data engineer role combines what were three separate roles at my first job: data warehouse engineer, DBA, and software engineer. It’s really the best of many worlds, and I really enjoy it. I get to write the Python I’m good at (I was never really good at general software engineering and feature development), gatekeep a bit to keep my DBA chops up (SQL code quality, query tuning, access control, etc., without needing to be intimately versed in any particular database), and spend my time creating new ETL processes and maintaining various EDWs and data lakes.

It’s my favorite role I’ve had to date and I’m really happy in it.


This is a bit of a Hail Mary, but I also really enjoy the data engineer role. I came into it from the software engineering side when I took a job at a data-first startup. I was recently offered a data engineer position at Amazon, but while the SE reviews there are good, the DE reviews are not. Any experience with Amazon or other FAANG-type DE jobs?


It was a Hail Mary for me too. Turns out I am not SW engineer material. I went DBA, then EDW developer, then Python developer at a startup, then Python dev at a fintech company, then SRE, and then data engineer.

As for SW at the big firms, I don't know. I was hoping to end up at an Apple or Amazon or Google, so hearing that the culture or stress is not so great is disconcerting.


Is it just me or does anybody feel overwhelmed with lists like these?

I really appreciate the effort, but as an anxious person I always feel paralyzed or disheartened by the road ahead.

For instance, one of the recommendations is Learning Python, 5th Edition by Mark Lutz. This book alone is a tome.

But anyway, it looks very well presented. Much better than plain bullet points. Well done!


Not only is `Learning Python, 5th Edition` by Mark Lutz a tome, standing at 1500+ pages -- it is hopelessly outdated. Published in 2013, it covers only Python 2.7 and Python 3.3, and by now both of those versions are end-of-life.


Is it really hopelessly outdated? I'm aware that new features like type hints and dataclasses are enabling people to do some very cool things, but you'd hope the majority of the language skills and patterns are still transferable?


Nice job! Perhaps an interesting resource to add: I'm maintaining an "open-source data engineering" awesome list: https://github.com/gunnarmorling/awesome-opensource-data-eng....


This isn't really a valid Show HN, so I've taken that out of the title. It's maybe a borderline case because the website has some interactivity, but it's ultimately a list, and those are explicitly ruled out: https://news.ycombinator.com/showhn.html


Looks nice! I am the maintainer of https://roadmap.sh which is a similar list of roadmaps and learning plans. I am currently in the process of making the roadmaps interactive and this gave me some ideas for improving the format that I was preparing. Thank you for sharing!


Is anyone aware of a similar list but for systems programmer learning path?


Security and privacy are last on the list and marked with an "essentiality" score of 1/3. I think as an industry, we need to do a better job of emphasizing and prioritizing those topics early and often throughout the educational process, or else the perpetual cycle of data misuse, leaks, and breaches is bound to continue.


Very nice job. As a data engineer / data warehouse architect with 15 years of experience, I can say that this learning path is laid out almost exactly right.

Interesting rendering as far as the webpage is concerned; what framework did you use? A tutorial on rendering JSON data to a webpage like this would be really helpful.


Here's a worthwhile guide that shows a learning path, includes more skills, and seems easier to comprehend: https://github.com/datastacktv/data-engineer-roadmap


Under Pipelines Management (Workflow Management), the second link is wrong (copy-paste error?).

Good resource.


Fixed. Thank you!


There’s one link I saw on GDPR.

I’d encourage folks to think about their data retention policies early. Build your data architecture with privacy in mind. Regulations like GDPR can require you to serve customers a copy of their data or delete their data from your systems.

Don’t store things you don’t need. Keep retention policies. Please protect customer data.
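One concrete way to keep a retention policy honest is to make enforcement a scheduled job rather than an afterthought. A sketch against sqlite (the table/column names are invented; in practice this would run against your warehouse):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Sketch: a retention job that deletes event rows older than the policy
# window. Table/column names are invented; timestamps are stored as
# ISO-8601 UTC strings so they compare correctly as text.
RETENTION_DAYS = 90

def purge_old_events(conn):
    cutoff = (datetime.now(timezone.utc)
              - timedelta(days=RETENTION_DAYS)).isoformat()
    cur = conn.execute("DELETE FROM events WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # rows deleted, worth logging/alerting on
```

Running this on a schedule, and alerting when it deletes nothing for a suspiciously long time, is a cheap guard against retention policies that exist only on paper.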



