> New languages like Rust/Ocaml/Nim.. if yes then which?
Completely irrelevant. DE is SQL, Python, sometimes Scala/Java.
Get really good at SQL. Learn relational fundamentals. Learn row- and column-store internals. Understand why databases are the way they are. Familiarize yourself with the differences between systems and why you need to care.
Get familiar with at least minimal Hadoop shit. Set up Spark from scratch on your machine, mess it up, do it again. Grapple with how absurdly complex it is.
Understand system-level architecture. Analytics systems tend to organize into ingest -> transform -> serving (BI tools etc)... why? There are good reasons! Non-analytics data systems have different patterns, but you will see echoes of this one.
Above all, spend time understanding the data itself. DE isn't just machinery. Semantics matter. What does the data "know"? What can it never know? Why? What mistakes do we make at the modeling step (often skipped) that result in major data shortcomings, or even permanently corrupt data that can never be salvaged? Talk to people building dashboards. Talk to data end-users. Build some reports end-to-end: understand the business problem, define the metrics, collect the data, compute the metrics, present them. The ultimate teacher.
(who am I? no one, just been doing data engineering since 2005, long before it was even a term in the industry)
The most important DE skill you should learn is how to fix data problems when they happen. Data problems include: duplicate rows/events, upstream data errors that need to be corrected, errors in your ETL logic, etc. You need to build tools to reason about the graph of data dependencies, i.e. downstream/upstream dependencies of datasets, and create a plan for repair. Datasets need to be repaired in layers, starting from the ones closest to the source and working downstream to the leaves. Some datasets are self-healing (i.e. they're rebuilt from scratch when their inputs change), so you just let those rebuild. Incremental datasets are the worst to repair because they don't restate history, so you have to fork them from the point when a problem happened and rebuild from that point onwards.
Without tools to help you repair data, sooner or later you'll run into a problem that will take you days (even weeks) to fix, and your stakeholders will be breathing down your neck.
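The repair-plan idea above can be sketched as a dependency graph plus a topological order. A minimal illustration with hypothetical dataset names (Python's stdlib `graphlib` does the ordering):

```python
from graphlib import TopologicalSorter

# hypothetical dependency graph: dataset -> its upstream inputs
deps = {
    "raw_events": set(),
    "clean_events": {"raw_events"},
    "daily_agg": {"clean_events"},
    "dashboard_ext": {"daily_agg", "clean_events"},
}

def repair_plan(corrupted, deps):
    """Return the datasets to repair, upstream-first."""
    # expand to everything downstream of the corrupted datasets
    affected = set(corrupted)
    changed = True
    while changed:
        changed = False
        for ds, ups in deps.items():
            if ds not in affected and ups & affected:
                affected.add(ds)
                changed = True
    # order the affected subgraph so sources are rebuilt before leaves
    order = TopologicalSorter({d: deps[d] & affected for d in affected})
    return list(order.static_order())

print(repair_plan({"clean_events"}, deps))  # clean_events, then its descendants
```

A real version of this would walk your orchestrator's metadata rather than a hand-written dict, but the layering logic is the same.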
You're implying it here, but I want to state it even more explicitly: create systems that allow you and your team to know when data problems arise!
These can be processes or programmatic but if nothing is looking for problems, then they often go unnoticed until they cause an issue big enough that can't be ignored.
This is going to be a bigger problem when pipelines span multiple teams across multiple organizations.
The only way is to 1) consolidate pipeline design into fewer teams, 2) build monitoring tools to monitor upstream tasks, 3) build gatekeeper tools to double-check outputs of tasks owned by my team.
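As a minimal illustration of the "gate keeper" idea, here's a sketch that fails loudly on a few common data problems (hypothetical checks and column names, pure stdlib):

```python
def gate_check(rows, key):
    """Minimal output gate: collect problems instead of silently publishing."""
    problems = []
    if not rows:
        problems.append("output is empty")
    keys = [r[key] for r in rows]
    if len(keys) != len(set(keys)):
        problems.append("duplicate keys detected")
    nulls = sum(1 for r in rows if any(v is None for v in r.values()))
    if nulls:
        problems.append(f"{nulls} rows contain nulls")
    return problems

batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": 10.0},  # duplicated event
    {"order_id": 2, "amount": None},  # upstream error
]
print(gate_check(batch, "order_id"))
```

In practice you'd run checks like these as a task between "compute output" and "publish output", and page someone when the list is non-empty.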
I would focus on theory first, then tools. I recommend the following, in order:
1) Data modeling: Fully read Kimball's book, The Data Warehouse Toolkit, 3rd edition. It's outdated and boring but it's an excellent foundation, so don't skip it.
2) Data modeling: After #1 above, spend at least 30 minutes learning about Inmon's approach, the ER Model, 3NF, etc. You can find some decent YouTube videos. You don't need to learn this deeply, but you should understand how it's different than Kimball's approach.
3) Data warehousing & data lakes: Read the academic paper titled "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics." For an academic paper, it's surprisingly approachable.
4) If you don't know this already, do 15 minutes of googling to understand ETL vs ELT and the pros and cons of each approach.
5) Putting it all together: Once you've done the things above, then read Kleppmann's book, Designing Data-Intensive Applications.
6) Focus on the future: Spend a little time learning about streaming architectures, including batch/microbatch/streaming and pros and cons for each. You may also want to learn about some of the popular tools, but you can easily google those.
the way I think about upskilling is it is essentially a high stakes bet on megatrends in the industry - you want to be early + right as that is how you get paid the most.
Is anyone familiar with any Rust language data libraries for data engineering? My experience with Rust is it's too low level and verbose to be productive at reshaping and cleaning data.
A couple of years ago when checking out Rust for scientific work I started looking for familiar ground, e.g. Jupyter notebook support. I was happy to find google/evcxr's Jupyter kernel [1] and started writing an early access book on Rust for data analysis [2].
I still check _Are we learning yet_ [3] to see their recommendation on the state of the ML stack (including data processing and structures) in Rust. It still lists "the ecosystem isn't very complete yet.", which is how I felt at the time!
Requirements largely involve selecting from and aggregating across some subset of the whole universe of data, which is on the order of hundreds of GB of parquet. We use partitioning across files (and folder structure in s3) to lazily scan only the bits of data we need, then predicate and projection pushdown (thanks to polars) to reduce that further.
Variable, but a single partitioned file goes anywhere from single-digit kb to maybe a little under a meg. A query can touch just one or dozens (all lazily concatenated into a single frame).
I'm a Python dev, learning Rust out of interest, but also considering its viability for Data Engineering. It would be helpful to know the equivalents of the following libraries: Pandas (Polars), Prefect/Airflow/Orchestration, Flask, SqlAlchemy, psycopg2, requests, BeautifulSoup.
Searching for packages on lib.rs, crates.io or github usually works.
There will be differences of course. The ecosystem is not as complete. On the other hand, writing the code and having it work the first time is easier once you know the language.
We’re talking about hundreds of gigabytes, pushing towards many terabytes of data with extremely sparse access patterns. Loading into anything (including an rdbms) wouldn’t be cost-effective or practical.
Using Parquet files is more scalable because the amount of data can be vast. It can be impractical to use a normal RDBMS to hold billions of rows, but is no problem for a bunch of files sitting in S3 or file storage. It can also be radically cheaper because you're just paying for file storage. The Parquet file format is also designed to be both compressed but also skippable to find the rows you're looking for. Also, Parquet files can be mounted in a filesystem for usage with things like Spark.
The big pro is that the only data that ever has to move is that which is actually accessed. It’s extremely cost-effective to produce, store and query statically vs moving it into something else of sufficient scale to handle the entire universe of data.
The downside is that unless you cache data, there is a latency penalty on the first request for any new data. One would expect performance to be poor when querying parquet directly using these new lazy dataframe libraries, but it's actually amazingly fast. Very surprising.
I don't know a whole lot about Rust. I think Rust is a great language for writing Database components and systems. For example plugin for PostgreSQL. Previously this would be done in C. You could still do this in C today too. Data Engineers are heavy users of databases and different data stores, but they normally don't design database systems.
Generalized data architecture, understanding the roles and challenges of turning data lakes into data warehouse, etc. including the user personas that use each and their needs and limitations. Different types of databases and data storage types and what they excel at. How to save money on storage. Considerations of idempotency, replay-ability, rollback-ability.
You say all workflow engines are the same but even just reading the Pachyderm docs will give an idea of modern data engineering best practices - data versioning and data lineage, incremental computation, etc.
Temporal also has a very cool, modern approach (distributed, robust, event-driven) to generalized workflow management (non big data specific)- if you’re used to stuff like Airflow, Temporal is a jump 10 years forward.
Reading Kleppmann's book and finding projects that are trying to implement some of the concepts there is a good idea, regardless the language and tool.
I'm just doing Python/Spark/AWS related tools. Most of my time is trying to break the bureaucratic ice between multiple layers from insane devops requirements to missing documentation on several "hidden requirements" (the ones that you find during the development process). So not much different than a lead role.
It's definitely different from a pure dev experience because it's expected that you lead changes inside the organization to make the pipelines work consistently. Without that (just plumbing) you can rely on "regular" backend devs.
This is obviously a high level data engineering perspective, not a low level like DB hacking or hyper optimizing existing pipelines and data transformations.
The amount of people management and organizational change needed to do data warehousing in an organization is amazing. Good luck when you get to the point where time is better spent on process change.
We are 5 days from the new year, so in 2023, a data engineer ought to take a serious look at Elixir.
I manage Elixir systems for data engineering related work, so I can attest to its use in this domain within a production environment. I also have used Rust for comparable data-engineering related systems.
Elixir is a phenomenal language for data engineering. I've written message processing pipelines in Rust and didn't get anywhere near the level of design considerations that Broadway / GenStage have. Some day, there may be robust open source offerings in Rust as there are in Elixir, but the ecosystem hasn't reached this state, yet. Rust async is also in a minimum-viable-product condition, lacking the sweet structured concurrency that Erlang/OTP solved long ago and Elixir benefits from. Data pipeline processing in Elixir, utilizing all available cores exposed to the runtime, is straightforward and manageable. Telemetry patterns have been standardized across the Elixir ecosystem. There are background worker processing libraries like Oban that help with audit trail/transparency. Smart, helpful developer communities.
Elixir is not going to beat Rust on performance. CPU-bound work is going to take orders of magnitude longer to complete with Elixir than Rust. You could extend Elixir with Rust in CPU-intensive situations using what is known as NIFs but you'll need to become familiar with the tradeoffs associated with using Rust NIFs.
Writing in Rust involves waiting for compilation. When a developer waits for compilation, they switch to something else and lose focus. You can use partial compilation for local development, and that speeds things up. You also need a very modern workstation for development, preferably an M1 laptop or a 16-core Ryzen, with at least 32GB of RAM and an SSD. Elixir, however, has quick compile times, as it doesn't do anywhere near the level of work that the Rust compiler does. There is a tradeoff for that, though. Elixir is a dynamic language and consequently has all the problems that dynamic languages do, problems that are automatically solved by a strongly typed, compiled language such as Rust. You also discover problems at runtime in Elixir that would often be caught by the Rust compiler.
One final mention is Elixir livebooks. Elixir has thrown down the gauntlet with livebooks. Coupling adhoc data scientist workflows in notebooks with robust data engineering backends makes a lot of sense. There are high-performance livebook dataframes backed by Rust. Elixir backed by Rust is beating Python backed by Rust from developer experience down.
i'll be the first to ack that popularity doesnt matter if you personally feel productive in it; but it does matter when talking about jobs and leveraging an ecosystem (which i dont know elixir's ecosystem so i cant comment on, i had not heard of Broadway / Genstage before this comment). but career-wise, i do observe that developers going after a theoretically better language routinely sacrifice a degree of industry relevance they never get back since they voluntarily enter an extremely niche subculture that for whatever reason never catches on. seen this time and time and time again and i think HN disregards this for the love of the new shiny without a balanced analysis of the career risk/reward
Consider that HN is tightly coupled to YC and that over half of the total market cap of YC companies is from those that started with Ruby on the back-end despite Ruby never having the number of users and “jobs” as PHP, Java, .Net, Python, C/C++, etc.
Were all those devs picking Ruby instead of Java in 2010 ruining their careers? I doubt it. Were the devs picking Java instead of Ruby for their startups? Very possibly, many of them.
Once your business comes into contact with the market, tool productivity matters more than popularity. Popularity can affect productivity, but at the end of the day, it’s the productivity that matters.
>Were all those devs picking Ruby instead of Java in 2010 ruining their careers? I doubt it. Were the devs picking Java instead of Ruby for their startups? Very possibly, many of them.
but is working for yc startups some good thing? like desired?
Your warning about niche technology is fair, but a Stack Overflow survey is NOT representative of what's going on in the real world. Better, more objective approximations can be gained by reviewing download stats of centralized package managers. Elixir uses hex. Rust uses crates.io. Python uses pypi. Node uses npm. Review download stats and then decide how popular libraries/languages are.
It takes time for technologies to catch on. Elixir and Rust are both relative drops in the bucket with respect to industry adoption, making it difficult for practitioners of these languages to continue using them beyond their current employment. Web development often acts as the foot in the door, and Elixir's Phoenix is doing superbly well in this domain. However, Elixir has an even more impressive position for data engineering related work.
"Chris Grainger (Amplified) delivers the Friday Keynote at ElixirConf US 2022.
We switched our machine learning research, production inference, and ETL pipelines to Elixir. This is the story of how we made that decision and how it's going...."
Most of my learning was by studying existing source and building. Elixir doesn't demand book reading in order to become proficient with it. Just dive in and build something. You'll learn as you go.
This is really interesting. I've spent eight years in the JVM ecosystem (mostly Scala) augmented by Python for data engineering but I really dig Elixir. I had just assumed that Elixir was still too young to have spawned a data engineering ecosystem. Now I have something to look into!
if you describe some of the existing systems you'd like to see ported, explain what they do in abstract terms and maybe someone here can present ways to solve them with Elixir
It was luck. I was the beneficiary of greenfield projects built with Elixir that proved their worth to the organization before I started working with them. Onboarding with Elixir as a senior engineer was fun and interesting, most recently coming from prior years of full time experience working with Rust.
There aren't many data engineering related jobs in either Elixir or Rust, yet, but as someone with depth in both I can use either and lead a team to use them successfully in short time.
That's awesome! Are you comfortable stating your employer? It's always neat to find out who is working differently and what industries they are in. If not that's totally cool. The internet is a weird place and I'm a total stranger. (Not to mention the whole this is publicly available forum.)
Actually, to even the playing field, a bald guy in a blue shirt just looked at your LinkedIn profile. ;)
We're starting to see an unbundling of data engineering into analytics engineering (BI developer on steroids), ML engineering (AutoML is good enough that if you can do good feature engineering and pipelines, the marginal value of adding a data scientist is not worth the cost), and data platform engineering (K8s. K8s everywhere).
Analytics Engineering are folks that transform data in an ELT model. The most popular tool is DBT.
I think the work will continue as companies embrace the idea of not writing custom ETL code anymore. The tools are super immature though. DBT is a huge improvement over the past but far from a final solution to the problem.
First think about your target company type/industry - the majority of corps will be using either azure, gcp, or aws - learn one (or more) of those and how they deploy using code etc.
Then think about that tools they use, in azure it will likely be ADF, Databricks, maybe Synapse.
Languages, python, sql, python, python, sql, some scala, more python, more sql, then get a general understanding of a few different languages (c, typescript, c#, Java, kotlin).
I’ve never seen a data engineering role asking for Rust/OCaml/Nim - I’m not saying they don’t exist but I’ve not seen them and I’ve rarely seen a data engineering role not asking for either python or scala.
FWIW I've found a lot of value in learning Typescript this year and interacting with the AWS CDK to build infrastructure as code. Upskilling my AWS knowledge also massively paid off, especially networking which I never truly understood before.
For short-term career growth, $YOUR_COMPANY's current preferred ETL tool will have the biggest ROI. Focus on design patterns: while APIs will come and go, the concepts, as you rightly say, are transferrable.
If you're looking to land a new role: the market says dbt, databricks and snowflake are pretty strong bets.
If it's personal interest, or a high-risk, high-reward long term play, take your pick from any of the new hotness!
I'd second the notion that the question is far too open.
I'd add that dbt, databricks and snowflake are pretty strong bets still, but you have to acknowledge that they're becoming mainstream with an ever accelerating pace as the companies behind them churn out upskilling courses, meetups and acquire an ever larger share of the market.
If you like to be a specialist, going deep into either of those still holds career value.
If you're taking a more generalist view of where things are headed, the best prediction I heard someone say to set themselves apart is for Data Engineers to optimize for operationalizing data. Focusing much more on reverse ETL, becoming knowledgeable in building data web apps. The no-code or low-code movement around data apps will make the barrier of entry to set something up nonexistent, and I see how that will drive demand.
Pairing (big) data query/ frontend performance and web apps is another beast though.
For all my initial scepticism, I see the Data Mesh concept picking up pace in the years to come. It's vendor independent, couples well with Team Topologies and effective, decoupled, functional SWE teams. There still will be a big need for standards and conventions set by a small enabling core DE team, as of now, the knowledge gap between the baseline DE and your average SWE or Product Owner is just way too big in my experience.
Last but not least, I'd throw data lake out there.
Apache Iceberg is getting a lot of attention and rightfully so.
TCO of a query engine on top of files is so much better than any DWH, and any org able to optimize compute on data for its current needs will be able to save massively while the "convenience" gap steadily closes. Again, pretty generic, but there's much to learn around Athena, Trino and the like.
I'm personally not a fan of learning a new language except maybe for Rust.
There is an ever-increasing stack of standard "low-code" tools for the typical ETL schtick, and Python won't go anywhere. Again, potential to differentiate will be low and get ever lower in many contexts outside of proper big data.
This is only me though and this view is highly context dependent, so YMMV of course.
About snowflake, I am really curious. What do you mean by learn snowflake. The way I was told about snowflake is that it's a cloud based data warehouse. Are there advanced properties in snowflake which one has to learn? Or do you mean optimized queries?
Snowflake at its most basic is SQL on cloud VMs; anyone comfortable with SQL should feel at home there. That said, there are many Snowflake-specific features that may take a bit to become familiar with. Just off the top of my head:
- hybrid RBAC, DAC, ABAC security model
- column, row level, and tag based access policies
- multi-account organizations
- cross-account and region data replication
- data shares
- external tables and specialized formats (iceberg, delta)
- pipes and streams
- snowpark API
- streamlit integration
The nice thing about Snowflake is that for many use cases it requires little management.
Things you can learn regarding Snowflake, other than the obvious (SQL, and Snowflake-specific language extensions to SQL): proper table partitioning, Snowpipe (and the associated cloud messaging pipelines), and query performance tuning. (Complex queries can become a bear; identifying when it's your query/partitioning or when it's something on the Snowflake back end is challenging.)
There are always new additions to the Snowflake tooling ecosystem since the company is in competition with Databricks and others (e.g., Snowpark with Python).
I found myself specialized with data eng + platform skills in Clojure + Kafka + K8s + Node, and it's a liability if you want to work outside of Big Tech, because few can afford to use tools like that. Node is relatively expensive these days; believe me, I've used it a lot and bought into JS-everywhere. But it's a really hard ecosystem compared to Python.
Might be an out there take, but being able to develop a shoe string web app that can be maintained by a solopreneur might be a good skill. Should translate to rapid prototyping concepts for big corporate managers as well. I'd argue Python could eat PHP's old niche in these regards, because it's way easier to get rolling from scratch with Flask and Pip than PHP was as recently as 2016.
Just fyi, OCaml is about as old as Java (circa 1995) and its direct ancestors (Caml and ML) are about as old as C (1970s to '80s). Not exactly new ;-)
If you're at a huge company, you can up-level as a DE within that company in many ways: (1) understand the hundreds of common ubiquitous datasets in great detail (what they are, their business purpose, how to access them, permissions, team ownership, etc.), (2) move to a control-plane team that works on infrastructure to better understand the underlying platform, (3) experiment with the broad array of ways to solve the same problem (e.g. think through how to optimize for cost instead of maintainability, or implementation speed vs. scalability). #3 especially is important because you really level up when you recognize the right pattern to fit the problem you're trying to solve and have direct experience to draw from in solutioning.
Personally, I'm planning to learn about the internal implementation of databases, starting with the book Designing Data-Intensive Applications.
This is so that I learn how data is stored today.
We've used both DataFusion and Polars for very similar use cases, though recently we've really begun to coalesce around Polars and most further development will use it, barring any complication that would require something else (DataFusion included).
Basically what we do amounts to bringing in a ton of json, pulling objects out of that into jsonl, applying a bunch of business logic/transformation etc and storing that as parquet on s3, then querying that directly using a thin HTTP API wrapper around Polars. Everything is written in Rust.
Experience thus far has been great. Initially all of the primary steps (json -> jsonl) were done in Python, but this proved an order of magnitude or two too slow. Rust (with a lot of optimizations... explicit SIMD etc) allowed for huge improvements.
As far as querying the data, the biggest surprise thus far has been how performant Polars can be with thoughtful application. We're able to do a lot more with a lot less than initially expected.
The cost savings for some of the ETL-type stuff vs doing it with a Spark cluster have also been very significant.
DataFusion has better support for the object_store API (you can just register the store and pass it s3 URIs). With Polars, you have to have the file on the local file system (though I've found caching files locally to be a huge performance win anyway).
Like the ETL tools, computer languages are basically the same. It's (usually) more important to learn how to do something new with a programming language than it is to learn how to do all your old things in a new programming language.
(The exception to that is that a professional programmer is frequently a maintenance programmer or problem solver and it may well be you have to learn a bit of language X in a hurry to solve a specific problem in a system written in language X.)
At my workplace once you’re decent with Terraform, a cloud ecosystem or two, SQL and Python the next best thing to do to advance would be to hone soft skills and become a better communicator and manager of people. If you can unlock teams to accomplish things, set vision such that stuff gets done and on time that’s how you become truly valuable.
The skill I value most in a DE is the ability to build effective model deployments and project layouts that are SIMPLE.
If I were to switch functions and find some protected time, I’d go off into the woods and build an example deployment compatible with company infra that is as light weight as possible. Then evangelize my team.
Math. A surprising number of people in 'data science' lack good understanding of basic statistics, leading to poor data quality or incorrect conclusions from the data.
Disagree, at least in this instance. Math isn't a bad thing to learn but it's not helpful for data engineering. Data engineering is focused on data integration (data pipelines), not on any type of analysis or data science.
FYI - I worked as a post-sales consultant for a NoSQL vendor that did a lot of search. This is what seemed the sweet spot for data ingest. ETL was part of what we did, along with search and other applications. I would never have called myself a “data engineer,” but I got to see how customers managed data at a practical level. So take the advice as thou wilt.
If all you are talking about is being a programmer that works with ETL/databases, and no analysis of the data, then probably Spring Batch is a good foundation. Java connects to just about everything, performs well, and has good observability.
Maybe something like Apache Camel/Talend for building ingest/enrichment pipelines. For example, cleaning up addresses, redacting data, etc. They have a lot of off-the-shelf tools for getting something in quick, but are also easily extended for weird corner cases.
Some knowledge of RDF/Ontologies for provenance issues/simple inference. This was especially helpful if we were using a service to disambiguate references to something like Washington, where it could mean George Washington, Washington State, or Washington DC.
Python, but the downside of Python is it gets its performance from essentially stitching together native libraries in a friendlier way. However, like Java, Python can basically talk to anything. Although I built a Python library to interact with our DB for ETL purposes, Java was my go-to.
How to do it in the cloud is something I would also recommend. For a lot of places it makes sense to store large data sets in the cloud, but pricing can kill you. Understanding how to navigate both the options and the pricing, and how your workload affects pricing, is key. This includes things like whether it's cheaper to create a function to apply some transform during load or just run a batch update later. Also, some of the options today are hybrid, and that can also impact pricing when you're dealing with a lot of data. As a practical example, we had one very, very, very large customer that went from 60 bare-metal on-prem servers on their really sub-par storage array to about a dozen very large servers on AWS, which had much better storage performance. For reasons that had nothing to do with engineering, but had to do with CapEx and budgeting, they weren't going to upgrade their sub-par array.
I’m not the first person in the discussion to suggest this, but I strongly encourage you to develop an understanding of vocabulary of your users/customers. For example, in a previous life we had people that worked just with life sciences customers. I got a taste of it when I worked on a POC to ingest PDF submissions for devices at the FDA, extract information from tables and diagrams, enrich them, make them searchable, and track changes between submissions. I had the mechanical end of doing the work. I was shielded by a life-science literate team and was just a ‘hired gun.’
As cliche as it may sound, getting close to the "business value" of data might be a good investment. Learning and building use cases, like a product recommendation pipeline for marketing OR a churn model for support to helping marketing ops migrate from GA (GA4) to an in-house analytics stack etc would help bring the data teams to the (much deserved) front row seat in the organization.
What’s a good authoritative source that tracks use-cases across industries for data? I often find resources that are either too shallow and high-level (literally one-line) or a super deep dive on a singular use case.
Well, most practical ML algorithms (linear/logistic regression, etc.) should be in the wheelhouse of most engineers. I read a comment somewhere: "It's much easier to teach data science to a data engineer than vice versa" :)
I think, if we data-engineers can step up the conversation from frameworks/databases to end-to-end use case, we will get more respect (and everything that comes with it like budget) from the rest of the org.
I think a good data engineer should incorporate some basic data analysis into their toolkit. This is what analytics/BI developers do all day everyday. Data Engineers need to do this whenever they create a new dataset.
It's not hard to pick up this skillset, and this sort of analysis doesn't take a lot of time, and it will make you a really good Data Engineer.
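A sketch of what that quick pass over a new dataset might look like (pure stdlib, hypothetical rows; in practice you'd run the same idea through pandas or SQL):

```python
from collections import Counter
from pprint import pprint

def profile(rows):
    """Cheap sanity profile to run whenever you create a new dataset."""
    n = len(rows)
    report = {}
    for col in (rows[0].keys() if rows else []):
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        report[col] = {
            "null_pct": round(100 * (n - len(non_null)) / n, 1),
            "distinct": len(set(non_null)),
            "top": Counter(non_null).most_common(1),
        }
    return report

rows = [
    {"country": "US", "amount": 10},
    {"country": "US", "amount": None},
    {"country": "DE", "amount": 5},
]
pprint(profile(rows))
```

Null rates, cardinality, and the most common value per column catch a surprising share of modeling mistakes before a BI developer ever sees the data.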