This was something that surprised me after I did my PhD as well. I thought that employers would focus on my specialized skills and "someone else" would somehow pick up the pieces and make something out of what I did. Turns out this is completely wrong, and I now see how frustrating it is to work with people that have this kind of attitude.
Most of most jobs is a bunch of mundane stuff. I've seen it in software development, and I've seen it in management consulting. The best people, typically, are those that will happily do both, understanding that the fun stuff comes with a lot of baggage.
The "someone else is better at the stuff I don't want to do than me" argument rarely holds up either. The friction that comes from dividing the work along lines like modeling and production and trying to hand off is rarely worth it when one person can do both.
Anyway, I've been where the author is, but personally I think it's wishful thinking, unless maybe you want to start your own shop and structure it around yourself that way.
I dunno... I am a software/data engineer who partners with data scientists. I think that comparative advantage here is real and important. Don't get me wrong, I'm happy when my data scientist partners write good code or show interest in getting better, but I'm more than happy to take their janky code and make it production ready. It often needs to be optimized for scale or refactored for reusability, and a lot of that falls very solidly in the engineering discipline.
I'll often tell my DS partners, "Don't worry about the code; you get the math right then give it to me. You can go about doing more math, I'll do the engineering." In my experience, this is often a really pragmatic division of labor.
This is also important because often, the data scientists are never on-call, so if something breaks in production, engineering needs to know what is going on.
> You can go about doing more math, I'll do the engineering
I'd say most of the teams I supported as a backend/DevOps/infrastructure engineer followed that pattern.
Generally the evolution I saw was: start with a handful of sciency folks working in R or Python, get more grant money as the work becomes more important, then have some SEs thrown in the mix to re-write in Python and/or C/C++.
A lot of those SE's would float between teams with one of us infra folks to performance tune as they scaled.
> the data scientists are never on-call,
Yeah, but I'll keep my on-call rotation over their constant 11th hour shenanigans trying to get a paper out the door.
I mean, this is probably how things should be done.
Having the specialists of each field appropriately resourced, focused and aligned streamlines everyone’s life. The problem is that most places get the scientists to pull double duty, so we end up in a scenario where scientists push messy, subpar code out the door that nobody else wants to touch, because they’re unsupported and operating out of their depth, and devs don’t want to touch it because they’re not keen to fix someone else’s hacky code, they probably have other priorities, and the science stuff likely isn’t well integrated into the rest of the code-delivery and monitoring systems.
The big thing is prestige: it takes some pretty big brains to write good, clean production infrastructure that doesn't fall over every five minutes.
Those big brains are absolutely capable of learning the math and doing the creative work of modeling.
They want to do the creative work.
Which means that if you create a prestige gradient and don't let your engineers do interesting things they themselves might get a paper out of, you lose good engineers and get bad engineers instead.
This is also a constant problem. In DevOps and SRE, the very best people are incredibly mercurial and mercenary - because they've been fed total lies about building systems again and again, when the net job description is "hey ops person, this application I wrote is misbehaving, can you do advanced troubleshooting on this server for me so I can go do leetcode".
This. There is no clean application of comparative advantage here. The great individual that will rewrite the code and knows all the details to get the model scalable, robust and production ready, can also do the data science "creative" work.
If you are not willing to grow to become that person, you are only damaging yourself long term. And if you don't attribute that individual's work: a) this is unethical, b) they will, and should, leave.
> often needs to be optimized for scale or refactored for reusability
It's so easy to make subtle assumptions when you redo their code that completely invalidate the work they've done. ML completely collapses on extremely small errors. Handing something off to someone else to refactor is a dangerous step in the process that risks everyone wasting their time.
> It's so easy to make subtle assumptions when you redo their code that completely invalidates the work they've done.
It's equally easy to test whether those assumptions actually break anything. Especially when minor errors can be catastrophic.
I'm part of a team that was tasked with producing a web app from something that was originally a piece of Matlab code.
We considered just running the Matlab code in a container but ultimately IEEE 754 is IEEE 754 regardless of platform/language, so creating a 1:1 implementation in C++ proved possible.
The catastrophic errors aren't the problem; the problem is when things are degraded by 20%, which results in a small loss when it could have been a big win. You just assume it didn't work when in reality it wasn't implemented correctly.
Personally, I'd like to do both the data science and the engineering. I'd also like to be in a team where each member does both. I at least "feel" like these two skills are close enough spiritually to one another that you can find a team composed entirely of people that do both.
Interestingly, I'm a physical scientist in a business that makes hardware and software. I'm up front with people, that I'm not an engineer, and nobody expects me to be one. I write code to support prototyping and testing, but don't expect it to go into production. Nobody wants my janky prototype. I produce a theory of operation that covers a proposed design, as well as an outline of the basic manufacturing and service processes needed to make it work. The prototype sometimes helps confirm that the technical requirements for the product can be satisfied at an early stage of a project.
When I do share code and things that I've designed, it's usually for tooling not product, e.g., a script that helps calculate non-obvious design parameters and tolerances. Often, my code is used to test hardware components before the official software is ready.
In my case, I'm on call, though not 24/7, because the business as a whole isn't. For instance I'm available to diagnose supplier and production problems, and deal with weird issues that emerge in the field.
Why do you want them to write code at all, then? Why not just task them with writing user stories around the parameterized functions they need, and just let you figure out how to implement it all?
The idea that creating a model can be a “user story written in math” seems to me a variation on a common misunderstanding about what model creation is, and particularly about the role coding plays in it. Data scientists, statisticians, modellers don’t go in knowing what the model is and just coding or specifying it. They use code as an exploratory tool to test several hypotheses until they find one that seems to hold - the code, the data, the algorithm, the statistical test, the exploration and the reasoning are all tools in the process of discovering the right model. It simply can’t be specified in advance; it has to be run and tested.
Having some spaghetti code is a natural consequence of this exploratory, iterative process. Skipping good SW engineering practices in this exploratory endeavour is just a natural consequence of not knowing whether your code will be of any use before you finish running and checking the test results. Why would you bother modularising and doing test coverage on something that is very likely to be thrown away after one or two runs?
I say this from 28 years of experience both as a data engineer and data scientist. I am a good python developer, and I can write production grade code. But I won’t refactor my code into that until I know that is the code that generates the right model. And I certainly can’t specify this particular code before writing some dirty version of it, testing it and confirming that the model it trains passes some statistical tests, at least.
Basically, the code is not the product - the product is the result of applying some transformations on data and running that through some ML or statistical algorithm to generate a model. Transformations and algorithm being unknown to be useful until tested, hence specification being unknown until coded, run and tested.
Assuming you're writing actually detailed and complete user stories and those user stories are for functionality that basically amounts to "just math" (i.e. no UI or API standards or whatever to worry about alongside the business logic) I imagine it's not much more difficult or time consuming to just write the code rather than describing it in a user story.
Having one person write user stories and a second person, who understands the domain less than the first, write the actual code would be twice as much work for a worse result.
What's the best way to write those user stories? Certainly not plain English; you'd probably want to use some kind of pseudocode. Ah, but it would be good if you could run that pseudocode and see what the results are in the happy path, without worrying too much about what happens when something goes wrong. We have that, it's called Python, so that tends to be how these user stories are expressed.
I expect that it makes sense to have the modelers do at least enough coding to test the common use cases/expected inputs for the model to validate it before handing it off to be "properly" implemented.
Right but they still need to know some code and talk programming language.
The "here is some shitty code that does the job, make it good and production-ready" requires far less communication and domain knowledge than them telling you how exactly it should work and you translating that into code
That’s an unusual style of operation. It may be a really good one, but we don’t see a lot of teams where people have complementary skills, although really we should.
That’s what makes technical interviews even more frustrating. Most jobs day to day are like 5-10% solving deep technical puzzles and 90-95% fiddling with tooling and automating things that are incredibly frustrating or time consuming.
So ya I’m not sure how to solve your silly brain teaser, but I have written custom test frameworks to automate the tedium away to save the team hundreds of man hours. It took a deep understanding of VERY specific tools (shout out to ASIC EDA tools). But that doesn’t matter on a technical interview. So you hire somebody who can Leetcode, but can’t figure out how to fit all the pieces of the actual job together.
The way I think of it is that in a business you often need to get a certain fixed number of things right before you could even start to make money.
Indeed, 95% of those things are mundane and tractable (maybe you have to be fast and careful), but the remaining 5% are challenges that can only be solved by specialized knowledge or some innovation.
If you hire someone good at solving mundane problems, they can contribute to the 95%. But if 95% of the company is hired like that, the pool of people you can draw on to do the remaining 5% is <5% (maybe more, if the company is lucky). There's a good chance that business is blocked by the remaining 5% because of the high uncertainty in delivery, fundamental to the nature of the problem.
Another approach is to hire, say, 50% to do mundane work and 50% to do both mundane work and more special work. This costs more, but if done right it can make the company move faster and have a better chance of surviving than in the former case.
First of all, it's often (but not always) the case that people who can do special work can be trained to do the mundane things better. Second, having many people who can solve one-of-a-kind problems is especially important if for each project the 5% can depend on a different set of things; it's often also the case that solutions to the innovative work are closely related to experience dealing with the more mundane parts, so this structure basically recognizes that innovation can come from the rank and file rather than a specialized "ideas guy" who just flies by to solve problems. So what happens in certain small but profitable companies is that they try to find people who are happy to do dirty work, have a business mindset, but at the same time show signs of being innovative.
This is not to say I agree leetcoding is a good test. I think Google and Facebook don't test enough whether someone is capable of or interested in identifying technical priorities with a business mind. Although Google has some data to connect leetcode performance to job performance, I'm skeptical of their performance evaluation methods.
Great way to paraphrase it. I’ve found all jobs to be plumbing. The actual task of writing the code to do xyz was always straightforward or 5 minutes of googling. What no Googling is going to solve for you though is how to fit A B and C of your company’s process together into a manageable solution. Manageable being another keyword here. It’s easy to write a lot of code. It’s hard to keep all the pieces of it organized and flexible to future use cases/needs.
> So you hire somebody who can Leetcode, but can’t figure out how to fit all the pieces of the actual job together.
Yep - 2021 was a bad time for me, mostly because I was stuck working with "smart" people on 2 different projects who had no idea how to add value. All 3 of us have PhDs, but while they had coasted from there in positions where being "the genius" was enough, I had taken a couple of career tangents. They were constantly mesmerized by my Bash skills and infrastructure knowledge, and in the end my parts of the project got done and theirs were written off due to lack of progress.
> It took a deep understanding of VERY specific tools (shout out to ASIC EDA tools). But that doesn’t matter on a technical interview. So you hire somebody who can Leetcode, but can’t figure out how to fit all the pieces of the actual job together.
What would you suggest testing? If you ask about the specifics of a given framework, you'll get someone who's memorized that framework but can't actually think. If you get someone who can code and has a decent level of general intelligence, they'll generally be able to learn a new API.
I totally agree. I work in the private sector, coming from a research position too. I was also focused on the "interesting" side of the problem: the modeling, integrating domain knowledge into the analysis, drawing all sorts of plots... But there were other unavoidable and "uninteresting" needs for the research project, like building a data gathering system with its API and everything. This required my best software engineering abilities. Needless to say, my best weren't precisely THE best, so as the project got bigger, the not-so-temporary fixes increased, as well as poor design choices (if any design at all). This finally led to a complete restructure and an almost fresh start.
I feel some of it could have been avoided, so I learned the hard way that the whole modelling + software engineering process is a subtle craft. It is important to pay attention to the implications of your code and, especially, to how it's done, since it may fall back onto you eventually. This reconciled me with the more technical stuff (my tools) and eventually let me put up good work in a more satisfying way.
I think there is value in both and it sort of depends on the organization.
With the commoditization of models, we are seeing the rise of MLEs over data scientists. Engineers that understand enough DS to make things work are wildly proficient in this space.
However, not all models have been commoditized, and there is still a need for new math in many places, and that’s where the division of labor makes sense. You can’t be an all-star engineer and an all-star data scientist; it’s just too much for one human.
I have a name for the sort of mundane-yet-employable programming tasks. "Plumbing work". You're not doing the clever problem solving that once sucked you into programming, you're welding pipes together that other people made.
>The "someone else is better at the stuff I don't want to do than me" argument rarely holds up either. The friction that comes from dividing the work along lines like modeling and production and trying to hand off is rarely worth it when one person can do both.
This was always my attitude. Every time you split something you add coordination overhead. This overhead gets worse the more times you split.
Of course there is specialization that can make someone else sufficiently more effective that you shouldn't do everything.
But every gain in specialization has to be weighed against increased communication costs.
Add to that a lot of problems don't need a lot of specialization but touch on many different disciplines.
This is why I think, contrary to the trend of specialization, we need generalists that can cover most bases at once and decrease the communication overhead considerably.
It's often still good to have specialists, but you should mostly employ generalists and only a few specialists in key technologies that set your company apart from others.
Well, in purely software shops, there's often people dreaming up the 'what to do' and a different group of people actually writing the code. Same with systems design, we have 'architects.'
This should be no different for statistical modelers or other disciplines. Employers are just cheap.
I've never worked on more painful codebases than when the "architects" don't have to bother writing actual code and so are ignorant of all the edge cases and special-case business rules that turn their pretty pictures into a horrific ball of mud.
The "architects" I've known also had a habit of giving you their grand perfect plan in a meeting that lasts no more than an hour or so, disappearing for the next 6 months without communicating with the people building the thing at all, and then being surprised when the final codebase looks nothing like the perfect system they'd designed in their head.
Architects have to be involved in building the thing they're architecting.
(It feels like the blueprint analogy fits here - actual building architects write blueprints, and then builders go off and build the house based on those blueprints. The code we write is effectively the blueprint, not the house. So what the hell are our architects making?
We need to treat software "architects" more like building site foremen than actual architects)
Having sometimes taken architect-like roles, I think the big problem is that architect is a very loaded term, like everything in IT.
You have proper architects, those that design nice diagrams, have proper technical knowledge, and also code, even if only small portions compared with the rest of the team. These I would call proper architects, and I tend to explicitly say Technical Architect to make the point.
Then you have the "architects" that do diagrams, spend the time in meetings with customers, plan features per sprint, delegate activities, and so forth. This ones I call managers and always double check if the company isn't using architec as synomim for managers.
At least some shops are more honest, by using business analyst or solution architect as a synonym for high-level management work.
It is not that surprising because software engineers are usually more expensive than scientists. That leads business people to ask serious questions about which staff they really need, how much of that work can be done by lower-paid specialists, and whether they really need the professional code quality when the stuck-together python that scientists tend to write usually also works.
To all the specialists out there: if you value job security, become a specialist that can produce. When the company I'm in did layoffs, the areas with the biggest layoffs were the groups that did not actively produce the product we were selling.
I love boring stuff because often it allows me to do something really well. Hard problems are fun but hard to solve well by their nature. Programming is fun because I get to do both of these things.
Having done both sides of the work, I totally understand the author’s perspective. Modelling and coding production are certainly two different skillsets, and many people will prefer one or the other, but not both. Now, there is one thing I guess the author doesn’t get (or he gets but doesn’t like) and a thing that most companies don’t get.
The former is that companies don’t care enough about what the individuals they hire prefer - especially not if it doesn’t address their need. They don’t need a beautifully crafted model that can’t be run - they want actionable results that hit the bottom line, and in the energy forecasting model that means generating new forecasts at every cycle (months, days, hours, minutes - whatever). A great model that can’t be put in production and run efficiently has about the same value as no model, and an average model in production will have much more impact.
The latter is that companies don’t usually understand that ML is not software development. Putting ML in production is, but finding the right model is research work, and code is mostly a discardable tool for research. Its goal is not to go live, it is to validate a hypothesis (in this case, that algorithm X, when presented with data Y, generates a model with enough predicting power to be useful to the business). This validation requires code to gather Y, clean/join/analyze/reshape/featurize it into a more informative and clean representation (clean from the algorithm’s perspective, not necessarily a human’s), run X, run inference with the generated model and run some test of the results against additional data.
If this test is negative, some or all of the code written above is useless, and we go back to the drawing board. The process is also very coupled: a new data source has to be joined to the rest - coupling; a new data transformation changing a feature changes the data schema downstream - coupling; a new algorithm needs a different data input - coupling, and so it goes. If you have experience like mine, you may actually be able to write the code in a way that lets you reuse some of it, but I have 28 years of experience with data; there are simply not enough people in the market with that level of background or the interest in learning all this. Companies must accept that they will not always get the perfect candidate with all the skills they want, and start thinking about pairing the right people in teams.
Some who have been around for a while may remember the Venn diagram of the perfect Data Scientists - it usually was an intersection of business, math and programming skills (also often communication skills and a few others). My thoughts since I first saw this diagram were: “Even if there are people out there with all these skills, why would they want to work for others?”
This, more than anything else, is my guess at the core reason why so few companies are successful in putting ML in production.
Everybody wants to work on the fun stuff. Turns out in software, getting something to actually deliver value (meaning convert it from prototype or proof of concept to something running in production) is 90% not fun stuff. Most people learn that in their first job.
As a software engineer the only way I’d become a “data scientist/trader’s servant” is if I’m getting paid exorbitantly. Otherwise it’s the worst kind of work because someone else is going to get credit for everything that goes right while I take the flak for all the hard stuff.
The converse is that to get a software engineer to be your servant you better also be really good, or else you’re probably going to end up paired with someone who doesn’t have the luxury of taking other jobs and maybe won’t be particularly good at getting your stuff to actually run.
One of the first lessons that a junior developer learns is:
>My job is not to write code. My job is to solve business problems efficiently, using computers as a medium.
Sounds like data scientists have the same problem. Sometimes, to solve business problems efficiently, you have to step out of your comfort zone and learn to do things that seem ridiculous.
Why just the other day, I had to write a report for humans to read. Using words and pictures! What am I, a technical writer?
I strongly disagree, most software engineers will not have excellent knowledge of modelling, and someone who is mainly a data scientist will not be able to compete with a dedicated engineer that studied distributed systems, kubernetes, and knows undefined C behaviour by heart.
When building a house, the guys that lay concrete and the welders are different people; anesthesiologists and surgeons are not interchangeable, etc.
Many industries will produce code, just like many people often need to nail two wooden bits together, but that does not make them qualified structural engineers.
Everyone differs on what the "fun stuff" is. I would find doing data modelling and statistics unbearably tedious, but quite enjoy taking working code and refactoring it to be performant, maintainable and well tested.
The modeler who can write code unlike you will get hired first.
Okay, now that I've said the provocative thing to drive home how real and serious this point is, I will say that I am not a statistician; I run computational physics simulations, so less statistical modelling and more modelling of experiments and systems based on PDEs. The one thing I observe is that this patience for not understanding your own tools really only exists for theorists. Experimentalists can fix their own tools; they can open up the casing and resolder the boards if they need to, and heck, most of them can fix their own cars. But you have no idea how many computational scientists just load up Lumerical or Ansys and click around, with no real concept of how it works under the hood beyond the things shown on intro slides in talks. Some know how to script, say, Meep if they're good, but they've never implemented a DE solver themselves, unless it was in a class in college or first-year grad school and then they forgot it all.
You really only see this disconnect from your own tools with theorists. Programming is your breadboard, your substrate. Code is the material you use to do your work. I don't understand why it is okay for theorists of all stripes to slide on never really understanding how their own research actually works on a computer, whereas every experimentalist I've ever known could recreate their entire experimental apparatus from scratch if they were paid to do so. But that's okay, because as long as too many theorists can only write equations and then need someone to hand-hold them so they can actually do the things they've written down, I will be valuable and have job opportunities. It would, however, make my life easier and lessen the many headaches I've been subject to, and heck, maybe science could move forward a little better, yadda yadda.
I probably shouldn't encourage my competition like that especially when they're injuring themselves but that "move science forward and lessen my headaches" vibe does make me want to share the sentiment so that theorists at least understood on some level how the libraries they import work sometimes.
Computation is not theory. Theorists use paper and pencil and sometimes Mathematica, not simulation. As for not being able to rebuild a computer, that’s truly ridiculous.
I loved Python. It was easy to learn, very powerful, has libraries for everything...
Then I started supporting researchers and scientists who wrote "python code" to run simulations etc.
Most of it's pretty basic, install some scientific code published by some research group. They chuck their data in and run it.
But then they started abusing virtual environments, writing their own code, cutting and pasting, commenting out random lines because they saw someone else "fix" something that way... and they all want Jupyter notebooks.
Now it's like an eternal September plus I get to deal with annoyingly slow package managers like Conda and rough academic projects with poor documentation and little testing.
Having been in the same situation as you: you have to bring some technical leadership if you want to change things. And some empathy - these people are stuck in a local maximum; they try their best, probably with limited tools and knowledge.
Conda is a shitshow for sure: teach them how things could be better.
Notebooks are a reproducibility nightmare: show them how nice e.g. PyCharm is
People are unqualified to write O(n) code in less than O(n^3) time: identify skill gaps and get their management to sign off on some software development / python training.
If you can make a convincing case why your approach is better, you can make everyone's life better.
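On the complexity point, here is a hypothetical before/after of the kind of fix that this sort of training tends to unlock; the column names and sizes are made up purely for illustration:

    import numpy as np
    import pandas as pd

    # Hypothetical data: group labels and measurements.
    df = pd.DataFrame({
        "group": np.random.randint(0, 100, size=100_000),
        "value": np.random.rand(100_000),
    })

    # A pattern that shows up in a lot of research code: rescanning the whole
    # frame once per group, which blows up as the data grows.
    slow = {}
    for g in df["group"].unique():
        slow[g] = df.loc[df["group"] == g, "value"].mean()

    # Same result in a single pass, with the library doing the work.
    fast = df.groupby("group")["value"].mean().to_dict()

    assert all(np.isclose(slow[g], fast[g]) for g in slow)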
Conda may suck in some ways, but it solved a big problem: installing software without root access. It effectively removed OS dependencies, allowing for operating system changes. HPC operating systems were effectively ossified prior to that.
Notebooks on the other hand are a god damned abomination. Obligatory past discussion;
> Notebooks on the other hand are a god damned abomination.
I completely disagree. Notebooks are an excellent way of doing literate programming, because they go beyond plain text, and integrate richly formatted text, tables, figures, videos with code. This is excellent where the human is following a line of reasoning, and offloading the computation to the computer. The reasoning is explained via the non-code, made concrete via the code cells, and its effects shown via the code cells.
Of course, people can abuse the conveniences of the medium to write really bad code, and follow some really bad practices. But that doesn't mean the medium is bad. In fact, I think if you forced people to stop using notebooks, their productivity would drop by quite a bit. Notebooks are a tool for thought [1].
The solution is to teach them better practices, and provide better tools within the notebook interface that will help people write better code.
No, conda is for more than the Python dependencies of your environment. You need Python itself, shell utilities, and some dependencies that are dynamically linked. Think of it as a package manager for the `--prefix` compilation flag AND `LD_LIBRARY_PATH`.
That sounds similar to my experience with the Python ecosystem (of which I have a profound dislike at the moment). If you haven't tried Mamba yet I encourage you to give it a shot; for me it has had a pretty significant impact on the time I have to spend sitting around waiting for Conda packages to install.
To make this more clear as I have had the same experience and want more people to enjoy it: Mamba is more or less a drop-in replacement for the conda tool with the guts rewritten in C++ and as a result is much faster to resolve dependencies -- I want to say an order of magnitude faster or more. It's even noticeably faster at downloading and installing. You just
conda install -c conda-forge mamba
in your base environment and then you can start running mamba instead of conda.
Situations such as improving package management in Python are a problem, but fundamentally, in any software stack you will eventually find bad code, especially as the code base gets larger and people move on.
Instead you can try to fight it by teaching the rest of the group how to write more idiomatic code but that only goes so far. Also training and documentation can go out the window when the objective is to ship something that works once.
> every one of them wants Python. I haven’t seen a single one where they’re looking for R or even C++; Python rules this roost.
Tried putting R into production recently? It’s a frustrating and brittle experience. Don’t get me wrong, R is fantastic at what it does - analysis, research, statistics - and arguably the APIs of the R data frame packages are a lot saner than Pandas’.
C++ is out for different reasons I suspect. As this touches on, “modellers” (and data scientists/engineers) ought to be decent developers, but a lot of them are not, and in my experience actively refuse to learn any of these skills, the comfort zone is “jupyter notebooks” and that’s it. Getting them to write C++ (disregarding language debates), a language that is unequivocally more difficult and fraught with complexity than Python is basically a non-starter.
Do I wish it was different? Yep. Do I wish there was some more variety in the “language ecosystem” so it’s more than just the “lowest common denominator Python” dominance? Absolutely.
>Tried putting R into production recently? It’s a frustrating and brittle experience.
Oh man, can I sign onto this rant. While Python has spent a decade+ trying-and-failing to standardize on one of a dozen tools to properly manage dependencies, at least it is trying. R is still global-namespace, no-pinning by default. Sorta-kinda you can squint where renv is going, but still needs a lot of development.
I had some hopes of Julia stealing the R mindshare and righting some of the more egregious wrongs in that ecosystem, but (as an outsider) it feels like Julia has lost a lot of steam.
If the cure for R is that I need to jump fully into the Nix ecosystem, I am not sure that is a solution.
Disappointing about Julia. Only played with it a small bit, but I had assumed it had identified dependency management as a huge problem that needed to be addressed.
Julia’s package management is miles ahead of R’s, and bests Python’s on a large number of factors. However, the last time I used it, it still required you to issue commands into the REPL to set up your packages - though that might have been resolved by now? The language and tooling clip along at a pretty good pace.
I find the flexibility of being able to setup and switch environments from within the REPL very enabling, and the Python approach of using the command-line shell feels kinda messy and cobbled together to me. (I felt this way about the Python approach before I even came across Julia, by the way.)
That said, there is a command-line utility jlpkg [1] that makes package management available from the shell. It's not very widely used, but maybe it suits your needs.
Although, the way you phrased
> it still required you to issue commands into the repl to setup your packages
makes me think maybe you didn't know about the Pkg mode in the REPL, and assumed you had to do everything with commands like `Pkg.add("DataFrames"); Pkg.update()` and such? If so, there's a package management mode to the REPL you can access with the ] key, which is kind of its own subshell within the REPL where you can do `add DataFrames`, `update`, and such instead.
I think my issue with so many of these tools is that they’re fundamentally meant to be used interactively. R (renv/etc.) doesn’t have a package manager separate from the runtime, so you can’t just “yarn/pip/cargo/etc install” and have your dependencies set up prior to your script running; you have to invoke manual R commands and hope everything goes off OK, that it doesn’t spontaneously pick a different directory to install your stuff in, or throw a tantrum about permissions because it wants to install into some root dir.
As I alluded to, renv exists, but it requires a lot of development work before it is a comparably robust option for the ecosystem. Basic things like a command-line interface [0], working with non-CRAN repos [1], using an existing DESCRIPTION file [2], etc. There are many use cases where renv does not work in a corporate environment (i.e., not open-source, all-public-code scenarios). Some of those issues have been open for years.
I do not believe the situation is unsolvable, but there is significant work to be done. Renv provides value today, and I will encourage everyone to use it. However, it has significant blind spots which continue to make R deployments challenging.
I believe that is one of the points the author of the article is trying to make: that the data scientists ought to be able to work in whatever language lets them develop the model most quickly and then hand it over to a software developer to make it work well in code, so that each can focus on doing what they do best and both aspects can be done very well.
I work at an industrial smelter (I'm a Materials/Chem Eng; I look at the outputs of models and don't have a lot to do with developing them). The way the data pipelines are set up here, the models are deployed as something called a "pickle file". I think this is something very specific to Python - maybe the equivalent would be a .dll file in the C++ world.
From the point of view of the integration process, the pickle file is just a black box: input data flows in, a response flows out of it. If something changes with the model (it's trained on a new dataset, etc.) then it's just a matter of dropping in a new pickle file and it all keeps working.
I don't know if you can do the equivalent by packaging R code like this. All of the tooling is set up around Python, and I think this is pretty industry standard; we use Azure and I'm pretty sure this is all out-of-the-box stuff.
Pickles and .dlls serve completely different purposes. A .dll, which stands for "dynamic link library", is code - it's how you package parts of your implementation in certain environments (Windows). A pickle file is data - it's a way of serializing out the state of a data structure, using some magic built into Python reflection, and then loading it back later.
Pickling is very convenient for its purpose. It's also... very bad in production. The problem is that the serialization format is derived from the implementation and not very stable, so if the implementation changes - sometimes in subtle ways like upgrading your Python version - the serialized data isn't readable anymore. You're almost always better off using an explicit data serialization format (JSON with schemas, protocol buffers, etc.) and writing your own logic for saving and restoring to/from it.
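To make the contrast concrete, a minimal sketch with a toy "model" that is just a dict of learned parameters (nothing to do with the actual Azure setup described above):

    import json
    import pickle

    # Toy stand-in for a real trained model: just its learned parameters.
    params = {"coef": [0.3, -1.2, 4.0], "intercept": 0.7, "version": 1}

    # Pickle: convenient, but the bytes depend on the Python objects involved;
    # unpickling needs compatible versions of any custom/third-party classes.
    with open("model.pkl", "wb") as f:
        pickle.dump(params, f)

    # Explicit format: the schema is under your control, readable from any
    # language, and survives library upgrades as long as you version it yourself.
    with open("model.json", "w") as f:
        json.dump(params, f)

    with open("model.json") as f:
        restored = json.load(f)
    assert restored["coef"] == params["coef"]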
> the pickle file is just a black box input data flows in, response flows out of it
There's 100% more to it than that - there's got to be something that's evaluating the data in the pickle file and actually using it to compute the response. That substrate layer probably doesn't change often, but it should have its own versioning and deployment story.
>the serialization format is derived from the implementation and not very stable, so if the implementation changes - sometimes in subtle ways like upgrading your Python version - the serialized data isn't readable anymore.
Could be wrong, but I do not believe this is true anymore. There are pickle versions, but I believe the format has been standardized such that you should be able to freely move between Python releases.
The base serialization is stable and compatible across Python versions. However pickle files can serialize entire Python objects, including from custom and third party code. Compatibility of serialization then depends on the versions of the code that implements those objects, which is often not well managed.
> There's 100% more to it than that - there's got to be something that's evaluating the data in the pickle file and actually using it to compute the response.
I suspect that is the Microsoft Azure ML part I'm not sure. As I said I don't write the models, I just come along afterwards and review the results. I don't know anything about Python.
I think the R equivalent of a pickle file is just an object(s) that's been saved with "save" or "saveRDS" to a file. These can then be loaded and could contain a trained model.
Kind of ignorant... given that Kotlin is JVM-compatible, it works with libraries even more than 20 years old.
Do I care if it is at parity with Python? Hell no I don't, as long as it allows me to accomplish the tasks I need -- and it does, faster, easier, with fewer errors -- then I'm very happy.
This is an interesting observation. Some of the data scientists I work with have been using Kotlin to define "analytic grammars", and it's the first example (in my limited Kotlin awareness) I've seen of Kotlin outside of Android development.
It seems that Roman Elizarov (Kotlin Project Lead) has identified the opportunity for a better language ecosystem to enter the data science space.
> I’d much rather get something working and then hand it off to someone else who can refactor it for speed and clarity, and have it conform to the desired style conventions, etc. etc.
I've been on the coding end of this - when everyone actually has those fixed roles and goes into it eyes wide open, it goes pretty well! When instead the PhD is assigned to go do the feature, and then the programmer is called in later when it's not implemented well, it tends to go quite poorly and nobody is happy, as everyone's time is wasted.
I'm a modeler and best described as a lifelong scientific programmer. I'm much, much better at doing the specialty science I was trained to do and have done than write unit tests (for Jupyter notebooks? why?) or struggle through big-O questions (again, why? I write ~100 line programs that are never production code). Like the author of the post, I am not a professional developer and don't pretend to be or want to be. There are people out there way better than me for doing those jobs.
Recently, I interviewed for a role as a computational chemist - this is an ideal fit for a person like me with the domain knowledge, advanced degree in the subject area, passion, and a proven (if dated) track record of publishing in the domain. The idea is to use software, amend what's there if/when needed, and apply my knowledge to the process and what comes out of it.
What did the interview start with? A surprise interactive coding challenge that I wasn't prepared for, and thankfully the interviewer was kind and professional enough to understand that my value-add is in, well, computational chemistry and not in the details covered by a CS major.
I thought Jesus, I bet not a single software engineer at this company got asked to do a simple organic synthesis or even a redox problem during their interview process.
> I bet not a single software engineer at this company got asked to do a simple organic synthesis or even a redox problem during their interview process.
Well they should have! One of the most important traits for a good software engineer is to be able to pick up a working understanding of the domain they're programming in. If they can't do that, they're more likely to make negative contributions than any positive ones.
I might be reading too much into your story, but if your CV indicates that you're "much better at doing the specialty science I was trained to do and have done than" CS stuff, a good interviewer might be concerned whether you had the necessary CS knowledge / programming skills.
To me at least, an interview is not to flatter the candidate by asking questions they're most comfortable with, but to make sure they have the requisite skills expected for the job. If the job expects X, Y and Z, and they seem like they're good at X and Y, you spend the time checking whether they're reasonably competent at Z as well.
In my case, the ~30 years of experience I've had as a software developer, data scientist, assistant professor, and quantitative researcher - not to mention my refereed publications in several computation-heavy areas - should be proof enough that I possess the minimal amount of programming skills for the job.
Either that or I've been pulling the wool over everyone's eyes for 3 decades.
In my present and past experience, the asymmetry of what is required for a computational scientist (the science, the programming, the CS whiteboard stuff) vs. software engineer (software engineering) is quite pronounced. YMMV.
If OP wants to make models but never worry about them after doing the fun bits, it sounds like they might enjoy academia. The industry premium salary is in part from doing all the work around the “fun part”. As others have mentioned that’s where a lot of the value is, and nobody wants to be your servant.
Though even in academics, you have to write the paper yourself after doing the fun bits.
> If a company says that they need excellent Python skills, and they mean it, then I’m not the right person for that job.
Eh, let the company decide that. That's one of the biggest things I tell my buddy who thinks he's bad at programming and really doesn't want to have to look for a job. He's convinced he's horrible but I don't think he is. He just doesn't want to get rejected, so he just avoids interviews as much as possible and tries to stay at the company he's at.
I would imagine that most companies are okay with python code that is rough and ready as long as it can be integrated by the rest of the team into the production code. Most data programmers are probably expected to be better at the data than the programming.
He should find a SWE and partner up. As more clients enter the ML space, there are more less-sophisticated ones who have less of a concept of cross-functional teams. Rather than educate themselves, these clients have unrealistic demands.
While I am a huge advocate of DSs writing better code, going full-stack is unrealistic for them. What does it even mean to integrate? FastAPI? Docker? Helm charts? Monitoring and observability? SRE? The list is endless.
This is a classic case of "the client doesn't care, they want business value".
If, as a DS, you want to get better at writing code, join our Code Quality for Data Science (CQ4DS) discord:
As someone who was naively hired as a "data scientist" without my employer checking if I actually knew anything about modeling, I ended up falling into the "person who fixes up the code, maintains it, tests it, etc." role, while my seniors, who were much less enthused about software engineering, were the ones pushing out models that were admittedly very impressive but complete disasters in terms of code.
We worked very well off of each other, it was interesting to pick at a model from a software engineering perspective, how the code could be structured and improved, where some tradeoffs would need to be made and how we would test and verify if it actually worked for our users. I eventually left because the company was more concerned with getting new models out as soon as possible regardless of their actual performance, but it did ignite my passion for software engineering and devops.
> Python doesn’t yet have anything remotely close to ggplot for rapidly making exploratory graphics, for example.
Plug for plotnine (https://plotnine.readthedocs.io/en/stable/). I don't know R but use ggplot indirectly through this library for exploratory data analysis, and comparing the experience to any other python plotting library, I understand why R folks are usually so sad to be using Python.
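For anyone curious, a minimal sketch of what that looks like; the data frame and column names here are invented purely for illustration:

    import pandas as pd
    from plotnine import aes, geom_point, ggplot, labs

    # Invented toy data, just to show the grammar-of-graphics layering.
    df = pd.DataFrame({
        "x": range(20),
        "y": [0.5 * i + (i % 3) for i in range(20)],
        "group": ["a", "b"] * 10,
    })

    p = (
        ggplot(df, aes(x="x", y="y", color="group"))  # ggplot2-style mapping
        + geom_point()
        + labs(title="Toy scatter")
    )
    p.save("scatter.png")  # or just display `p` in a notebook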
I think nobody here has really pointed out a very relevant issue that's completely widespread, at least in the tech job market: companies don't want to pay people.
You don't see this kind of problem in other established professions: you don't expect an accountant to be able to perform the job of a lawyer, nor do you expect a nurse to be able to wear the hat of a nutritionist.
Now with the technological professions (let's use the term knowledge professions as an umbrella term), companies take advantage of the fact that these professions have not been around for that long and are not that established to keep expanding their set of responsibilities.
We see that all the time with tech companies. It's not rare that you're supposed to know frontend, backend, testing, devops, sometimes even the domain, and the list keeps expanding even though these entail different sets of skills. The salary, not surprisingly, doesn't grow proportionally to the list of requirements. Companies don't want specialized people anymore; they want someone who will quickly pick up other people's jobs when/if they finish theirs.
That's what I believe the author's rant was about. He has been looking for a job in his field; he is not a software engineer. Yet people expect him to be a professional developer on top of being a professional data modeler.
> you don't expect an accountant to be able to perform the job of a lawyer neither you expect a nurse to be able wear the hat of a nutritionist.
You'd be surprised. I mean, they're not competent to do so professionally, so they can't formally give professional advice outside of their field of competence, but you can be sure people do ask them such questions, and get disappointed if the answer is "that's not my job".
Not sure that I would recommend actually hiring that guy. He does not seem to understand that modelling is only one part of the equation. If you model something non-trivial, usually you cannot just hand over your R scripts to someone else and say: please implement that in Python/C. You have implementation constraints which feed back into the modelling itself, like latency or scalability. Furthermore, good luck letting someone translate your non-trivial math into another language and hoping that they won't break it in some subtle or non-subtle way. It's just far more efficient to have someone who can actually do the prototype directly in Python or C, and then let a pro developer optimize specific parts of the code.
People who are experts at their domain are not always good at explaining things to others. This means that if a data scientist does not know how to code, they need to partner with a programmer and be able to explain things to them. Same goes for a bunch of other professions. For some people, it is easier to learn to code than it is to learn to communicate their ideas with people.
If you are not going to be able to implement something, whether in code, or with a saw and hammer, you must be able to explain things really well. If you can not do either, you will have limited ability to apply your craft.
modeling is 1-2% of the overall work involved in having a model serve customers something useful. There's just so much other work to do that it rarely makes sense to have people who can only model. At some level of scale, building out infrastructure to support people to model full time and constantly run a/b tests make sense but that is not the vast majority of use cases for ML in the real world and building out that support infrastructure is a huge investment
My dad got me into computers, but he's not a programmer by trade. He's a mechanical engineer, and he used BASIC on a TRS-80 to do engine simulations to characterize the mechanical forces on the crankshaft for different crank types. Back then, BASIC was basically what MATLAB is today for engineers. They just slammed some code in there to do the math they needed to do and ran it. They didn't care about things we care about: testability, maintainability, observability.
Similarly, Judge Alsup of Oracle v. Google fame writes astronomy software in QBasic. He doesn't give a shit about best practices, if it helps him aim his telescope correctly it's all good.
Welcome to a world with citizen programmers. A world of terrible code that does the job. I frickin' love it.
Yes, that’s expected of every scientist I am working with. The reason is quite simple and has little to do with engineering work being more expensive than scientific work (it’s actually the opposite): give someone a nice problem to solve or a great model to build, and very soon you have an ivory tower completely disconnected from the original objective.
To be fair, I don’t ask scientists to understand concurrency issues in programming. Only the basic stuff that is required for delivering a functional program. Yes, Pandas and Scikit-learn belong to the basic stuff.
To be honest I don't believe this to be isolated to just modelling and production code; I feel as though this issue presents itself in most of the job ads I see today: backend roles with requirements for React, Python roles with requirements for JavaScript, backend roles with requirements for DevOps/SRE.
I don't necessarily have an issue with widely skilled engineers, but I would prefer it be for the right reasons, and I largely believe that it's an exercise in laziness on most companies' behalf. They just want to hire fewer people and have more of their workers do tasks that are outside of their remit.
I have zero interest in writing Javascript, absolutely none, I don't want to do it, and I have pushed my career in directions that mean that, for the most part, I don't have to. I'm happy with this decision and have made it willingly.
It's the same with a lot of "DevOps" tasks. Having previously been a DevOps engineer, I now just want to write code, real code, but it feels as though most places are just not hiring DevOps/infra people and are instead telling their other engineers to do it. I understand why, but it results in a far worse experience for both sides. I have to regularly force my hand down from volunteering for things that I have the technical experience to do, and do properly, versus colleagues who don't have that experience, because I'm tired of being shoehorned back into a role that I intentionally left. All of this is because the idea of "cross functional teams" no longer means hiring specialised engineers to do specialised roles; it means getting everyone to do everything, and then being surprised when the context-switch penalty actually exists and the work is done to a worse standard than by someone who is skilled in that role.
I know exactly what you feel like and I used to be exactly like you. Even down to preferring R to Python.
All I can say is: Bite the bullet. Learn Python, forget about R. R is nice and all but there is nothing that couldn't also be done in Python.
What helped me is: Learn to appreciate the beauty in actual coding, in deployments in environments in well structured, maintainable code. In scaling issues, databases etc.. There is an endless world out there which is extremely fascinating as soon as you get over the "all I want to do is modeling" mindset.
Good luck. You can definitely do it, because I did it as well.
I'm a data engineer who works in an R-centric engineering team, which is quite unusual. Our experience has been that R works well for our use cases and lets us work closely with our analysts and data scientists who are all R users. There are no silos due to different teams speaking different languages. That said, am I still writing Python whenever possible? Damn straight. I'm well aware of how peculiar our team is and, like the post says about modelling, Python (+ SQL) is the default language of the data engineering world. If I want another job, I need Python.
Tensorflow.js is as good as the Python version, Node.js is a better production platform, and models can run in browsers. Python is easy, though; hiring based on a language is a sign management doesn't know enough about programming to trust its own judgement.
d) most people I have seen that have claimed they are good at modeling, but not good at writing that for production, have actually not reached their "I am good at modeling" state yet.
e) (Most) People don't write code to write code. Like fiction writers do not write lines of text to write text. Writing is a means to an end. It is part of the idea birth process.
TL;DR: Write code for production; you will be great at it relatively soon.
P.S. Curious if people have had the opposite experience or counterexamples.
Edit: stylistic.
Since you can basically use numpy like a large calculator, it seems like a potentially useful tool to have under your belt. And matplotlib is good for making graphs. Python/numpy/etc. seems like a reasonable alternative to matlab (etc.) in many cases.
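A minimal sketch of that "large calculator" style; the function evaluated here is made up, nothing domain-specific:

    import numpy as np
    import matplotlib.pyplot as plt

    # Vectorized "calculator" use of numpy: evaluate a damped oscillation on a grid.
    t = np.linspace(0, 10, 500)
    y = np.exp(-0.3 * t) * np.cos(2 * np.pi * t)

    plt.plot(t, y)
    plt.xlabel("t")
    plt.ylabel("amplitude")
    plt.show()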
Symbolic math tools like Mathematica are useful as well.
It's certainly helpful if modelers can understand the code that implements the model and spot obvious errors in the code as well as in the results. It's also extremely beneficial if whoever is writing, testing, and using the model code has a very good understanding of the model itself.
A potential step toward this is implementing the model as a standalone library of very straightforward code that everyone on the project can understand.
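As a hypothetical sketch of what such a standalone library might look like (the module name, functions, and the linear model are invented purely for illustration):

    # model_lib.py - deliberately boring, dependency-light code that both the
    # modelers and the engineers on a project can read and review.
    import numpy as np

    def featurize(raw: np.ndarray) -> np.ndarray:
        """Turn raw measurements into model features (illustrative transform only)."""
        return np.column_stack([raw, raw ** 2])

    def predict(features: np.ndarray, coef: np.ndarray, intercept: float) -> np.ndarray:
        """Apply a plain linear model; no framework, no hidden state."""
        return features @ coef + intercept

Keeping it to plain functions over arrays means the modelers can check the math line by line, while the engineers can wrap it in whatever serving layer they like.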
Isn’t this because they are hiring to solve the problem, like, a function that predicts X, and if the model engineer is also the production engineer this is one hire that solves the business problem?
The alternative is hiring two engineers and possibly additional PM/management workflow to make sure they mesh, the prod engineer is not blocked on the model engineer, and the model engineer delivers things that are usable. It's a bit like when we used to have a "webmaster" - or now I suppose they could be a "full stack consultant" - who was in charge of making sure the right pixels appeared on thecompanywebsite.com, more or less by any means necessary, because that was the business need.
In Silicon Valley back in 2014, when the seeds of ML/DS started to get traction, all the people with Data Science titles knew how to write Java webapps or Hadoop MR jobs to ingest, clean, transform, model/analyze, and serve results that went into production.
The specialization has definitely narrowed in scope over the last ~10 years, but those were the original roots: Java + stats + database know-how.
So yes, learn some production-level skills. Having far too specialized people also runs the risk of lost-in-translation models that only work in the original implementation, until edge cases show up and the model is out of date.
2014 is pretty late for this, more like 2011. I joined FB in 2023 and data people were mostly writing hive at that point. Agreed on the stats and db part, that's never gone away.
What stood out to me even more was the "machine learning" buzzword - even though it doesn't seem like there's any guarantee that training a neural network would actually improve the modeling (and it's just another tool that the modeler should be able to decide on their own to use or not).
Specifically, my advisor just suggested that I ignore this bit, and send in my resumé to those job offers anyway (this being something that we should be able to learn on the fly at our level anyway I guess...)
Isn't what they mean just "be able to export the model in a form that a) is portable enough to be taken and integrated by someone else, b) has a reasonable runtime performance, and then c) explain how to use it to an application developer"? Those are high demands already. There are companies with excellent modelers who never actually get anything deployed because there is no one to bridge the technological gap between what they're doing and production services.
My hypothesis is that few companies are big enough to be able to dedicate many people to _exclusively_ modeling.
If a modeler has 4 weeks to spend on learning new things, odds are most companies would benefit more from the modeler learning how to do basic parts of the operational (python) part of their job (which needs to happen for models to be useful work) vs spending 4 weeks diving deeper on some aspect of modeling (which may or may not yield percentage points of improvement on some problem).
Is there anything like Blender Nodes, but for data modeling? It's an amazingly powerful system. [1] The learning curve is still fairly steep in terms of having to learn all the nodes and how they interact, but you can't make a syntax error.
RapidMiner is one. In the physics and electronics/physics domain there is also Simulink.
It can often be hard to integrate these into production code though, as they are designed primarily as interactive desktop applications, with per-seat licensing.
Reading the data is hardly the problem. Most R libraries seem to be inherently single-threaded, unlike sklearn, where just about everything has an n_jobs parameter. Even stuff like xgboost is insanely parallelizable. Try fitting a really big mixed effects model with LMER - you will cry (I have submitted multiple fixes and performance improvements to LMER and Mertools). Cool, I have a model. Now I want to serve predictions in real time to downstream APIs. Hello single-threaded R runtime, my old friend.
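For anyone who hasn't seen it, this is roughly the knob being referred to; a minimal sketch where the dataset and model choice are arbitrary:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

    # n_jobs=-1 builds the trees across all available cores...
    model = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)

    # ...and cross-validation can parallelize over folds the same way.
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
    print(scores.mean())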
I was talking the other day to somebody at the bus stop who I met back in my physics days and who still teaches scientific computing. We talked about how I got dragged kicking and screaming into Python (people just kept showing up with work to be done), how I’d almost like to drop Python from my practice because there is no way I can quit Java or JavaScript but libraries keep me in the ecosystem, and how we are both shocked that anybody uses R.
As a manager, I have no interest in model-based solution unless we have a good plan to refit / update and test the model on an on-going basis. And for that, Python >> R. Not my line, but I really like it: R is for Research, Python is for Production.
I support a research environment for biostatisticians and other researchers and we have Python and R offerings, and R is the overwhelming favorite, or so product tells me. As someone who isn’t a modeler, it’s interesting how this varies by industry.