Disagree, at least in this instance. Math isn't a bad thing to learn but it's no...

phoehne · on Dec 27, 2022

FYI - I worked as a post-sales consultant for a NoSQL vendor that did a lot of search. This is what seemed the sweet spot for data ingest. ETL was part of what we did, along with search and other applications. I would never have called myself a “data engineer,” but I got to see how customers managed data at a practical level. So take the advice as thou wilt.

If all you are talking about is being a programmer that works with ETL/databases, and no analysis of the data, then probably Spring Batch is a good foundation. Java connects to just about everything, performs well, and has good observability.

Maybe something like Apache Camel/Talend for building ingest/enrichment pipelines, for example cleaning up addresses, redacting data, etc. They have a lot of off-the-shelf tools for getting something in quickly, but are also easily extended for weird corner cases.
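
To make the "chain of small enrichment steps" idea concrete, here is a minimal sketch in plain Python rather than Camel or Talend. The record fields (`address`, `notes`), the step names, and the SSN-style pattern are all assumptions made up for illustration; a real pipeline would use the tool's own components and much more careful address handling.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
Step = Callable[[Record], Record]

# Hypothetical pattern: US SSN-style numbers such as 123-45-6789.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def clean_address(record: Record) -> Record:
    """Collapse repeated whitespace and title-case the address field."""
    out = dict(record)
    out["address"] = " ".join(out.get("address", "").split()).title()
    return out


def redact_ssn(record: Record) -> Record:
    """Replace anything that looks like an SSN in the notes field."""
    out = dict(record)
    out["notes"] = SSN_RE.sub("[REDACTED]", out.get("notes", ""))
    return out


def run_pipeline(records: Iterable[Record], steps: List[Step]) -> Iterator[Record]:
    """Apply each step in order to every record, one record at a time."""
    for record in records:
        for step in steps:
            record = step(record)
        yield record
```

Each step takes a record and returns a new one, so handling a weird corner case means adding one more function to the list rather than touching the existing steps.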

Some knowledge of RDF/Ontologies for provenance issues/simple inference. This was especially helpful if we were using a service to disambiguate references to something like Washington, where it could mean George Washington, Washington State, or Washington DC.
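
A toy sketch of what "provenance plus simple inference" can look like, using plain tuples instead of a real triple store. Everything here is an assumption for illustration: the `ex:` identifiers, the tiny class hierarchy, and the source labels are invented, and a real system would use an actual RDF library or database.

```python
from collections import namedtuple

# A triple plus a fourth field recording who asserted it (provenance).
Quad = namedtuple("Quad", "subject predicate obj source")

# Hypothetical data: three candidate meanings of "Washington".
QUADS = [
    Quad("ex:GeorgeWashington", "rdf:type", "ex:Person", "hypothetical-service"),
    Quad("ex:WashingtonState", "rdf:type", "ex:USState", "hypothetical-service"),
    Quad("ex:WashingtonDC", "rdf:type", "ex:City", "hypothetical-service"),
    Quad("ex:USState", "rdfs:subClassOf", "ex:Place", "hand-written-ontology"),
    Quad("ex:City", "rdfs:subClassOf", "ex:Place", "hand-written-ontology"),
]


def types_of(entity, quads):
    """Direct types of an entity plus every superclass reachable from them."""
    found = {q.obj for q in quads
             if q.subject == entity and q.predicate == "rdf:type"}
    frontier = set(found)
    while frontier:
        parents = {q.obj for q in quads
                   if q.predicate == "rdfs:subClassOf" and q.subject in frontier}
        frontier = parents - found
        found |= parents
    return found


def candidates_of_type(candidates, wanted, quads):
    """Keep only the candidates that are (or can be inferred to be) `wanted`."""
    return [c for c in candidates if wanted in types_of(c, quads)]


def sources_for(entity, quads):
    """Who asserted something about this entity?"""
    return {q.source for q in quads if q.subject == entity}
```

If the surrounding text says "moved to Washington", filtering the candidates by `ex:Place` drops the person, and the `source` field tells you which service or ontology each surviving fact came from.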

Python, but the downside of Python is that it gets its performance from essentially stitching together native libraries in a friendlier way. However, like Java, Python can basically talk to anything. Although I built a Python library to interact with our DB for ETL purposes, Java was my go-to.

How to do it in the cloud is something I would also recommend. For a lot of places it makes sense to store large data sets in the cloud, but pricing can kill you. Understanding how to navigate both the options and the pricing, and how your workload affects pricing, is key. This includes things like whether it’s cheaper to create a function to apply some transform during load or just run a batch update later. Also, some of the options today are hybrid, and that can also impact pricing when you’re dealing with a lot of data. As a practical example, we had one very, very, very large customer that went from 60 bare-metal on-prem servers on their really sub-par storage array to about a dozen very large servers on AWS, which had much better storage performance. For reasons that had nothing to do with engineering, but had to do with CapEx and budgeting, they weren’t going to upgrade their sub-par array.
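
The "function during load vs. batch update later" question is mostly arithmetic once you know your provider's rates. Here is a rough back-of-the-envelope sketch. The cost model (per-invocation fee plus memory-time for the function, whole node-hours for the batch job) and every parameter name are assumptions, and none of the numbers you plug in below should be read as real prices from any provider.

```python
import math


def transform_on_load_cost(n_records, price_per_million_invocations,
                           gb_seconds_per_record, price_per_gb_second):
    """Cost of calling a serverless-style function once per record at load time."""
    invocation_cost = n_records / 1_000_000 * price_per_million_invocations
    compute_cost = n_records * gb_seconds_per_record * price_per_gb_second
    return invocation_cost + compute_cost


def batch_update_cost(n_records, records_per_node_hour, price_per_node_hour):
    """Cost of a later batch job, assuming billing in whole node-hours."""
    node_hours = math.ceil(n_records / records_per_node_hour)
    return node_hours * price_per_node_hour


def cheaper_option(on_load, batch):
    """Name the cheaper approach given the two estimated costs."""
    return "transform on load" if on_load < batch else "batch update later"
```

Usage would look something like `cheaper_option(transform_on_load_cost(...), batch_update_cost(...))` with your own measured throughput and your provider's current rates. The interesting part is how quickly the answer flips as record counts or per-record work change.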

I’m not the first person in the discussion to suggest this, but I strongly encourage you to develop an understanding of the vocabulary of your users/customers. For example, in a previous life we had people who worked just with life sciences customers. I got a taste of it when I worked on a POC to ingest PDF submissions for devices at the FDA, extract information from tables and diagrams, enrich them, make them searchable, and track changes between submissions. I had the mechanical end of doing the work. I was shielded by a life-science literate team and was just a ‘hired gun.’