Most libraries load entire notebooks from top to bottom when executing, and I be...

zwaps · on Jan 15, 2021

Is rewriting your functions from a notebook into a .py file really something a research scientist cannot do?

Or is it infeasible for some other reason?

I'd imagine many data scientists want to publish their work as python packages or libraries during their PhD, so they should be familiar with writing classes or functions that work at a bare minimum.

proverbialbunny · on Jan 15, 2021

>Or is it infeasible for some other reason?

I've had projects where the model doesn't perform as intended. Because one person was making the model and another was productionizing it, it was hard to identify where the performance difference was coming from. Was the bug in the model or in the productionization process? Figuring that out took longer than writing the model or productionizing it in the first place.

It takes so long to deal with these bugs because the model gets changed, so then prod gets changed to match it. Changing prod (rewriting functions) has the potential to create a new bug, so you solved one but added another, and still can't identify if it is in the initial model or from prod. This continues over and over again, problem after problem.

It's worth noting that if one person does both the model building and the conversion to production, this problem is significantly reduced, but it is still a problem. The problem is exacerbated by the lack of domain knowledge, since each person is in the dark about the other person's process.

Furthermore, what if you need to update the model? Do you rewrite prod, doubling or tripling your work? Do you take the risk of introducing another potentially hard-to-diagnose bug, even if you're the one doing both roles?

Or do you automate the process, so the code being developed is the same code running on the server at the end of the day? No more bugs, and a half to a third of the work. Why not do it this way? It's so much easier to debug a problem in prod this way. You can take the log data, feed it into your local machine, and know that what you're seeing is what the user saw. No more guessing where the problem is.
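
A minimal sketch of that idea, under stated assumptions: `score`, `handle_request`, and `replay` are hypothetical names made up for illustration, the "model" is a toy formula, and the log is just a list of JSON lines. The point is only that the notebook and the server import the same function, so replaying logged inputs locally should reproduce what prod returned.

```python
import json


def score(features):
    """Hypothetical model function, imported by both the notebook and the server."""
    return round(0.5 * features["x"] + 2.0 * features["y"], 6)


def handle_request(features, log):
    """What the server would do: call the shared function, log input and output."""
    result = score(features)
    log.append(json.dumps({"input": features, "output": result}))
    return result


def replay(log_lines):
    """Re-run logged inputs locally and return entries whose output differs."""
    mismatches = []
    for line in log_lines:
        entry = json.loads(line)
        if score(entry["input"]) != entry["output"]:
            mismatches.append(entry)
    return mismatches


# Simulated prod traffic (made-up inputs), then a local replay of the log.
prod_log = []
handle_request({"x": 1.0, "y": 2.0}, prod_log)
handle_request({"x": 3.0, "y": 0.5}, prod_log)
print(replay(prod_log))
```

If the replay comes back empty, local and prod agree on those inputs, and any remaining problem is in the model or the data rather than in a rewrite step.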

One way to think of it: software engineers would find it absurd to write their code, then hand it off to someone who doesn't completely understand it to rewrite it in another language and put it up on a server. "Why would you ever want to do that?" they would think, and I agree with this sentiment. It is absurd to have someone (even you) rewrite your work unless you have no other option, and you do have other options. Transpilers are a thing if prod needs to be in another language. I've written models that have to go onto embedded environments. I know these challenges all too well.

>I'd imagine many data scientists want to publish their work as python packages or libraries during their PhD, so they should be familiar with writing classes or functions that work at a bare minimum.

It depends on whether you're writing a library, i.e. doing ML / machine learning engineer type work, or you're solving a domain challenge, writing an end-to-end solution for that problem, and using standard cookie-cutter ML for your PhD, aka data science type work.

One leads to an engineering role, and not surprisingly, writing a library for it is ideal, so other people can use it. The other leads to a data science type role, and not surprisingly, showing code snippets in your paper with plots / EDA and all, the same way you'd write a notebook at work, is ideal.

I'm a data scientist, not an ML specialist (though I have invented a new form of ML for work once, but that was just once and not my primary thing). I specialize in solving end-to-end domain problems. I'll write a notebook to solve it, not that I have to. I've been in the industry since before notebooks were a thing, so I'm fine doing it the old-fashioned way. What I am not is an MLE. I don't need to write libraries for other users to use. I don't need to write custom ML. I don't need to do that engineering bit. To be fair, I have, and I know when it's the right tool for the job. On Stack Overflow all of my points come from helping people with the glue parts between C++ and R, so they too can write libraries for R. I'm proficient in modern C++ too. I can do the library ML type work, and I have enjoyed it, but I really do enjoy solving domain problems more, so it's what I'm doing, and it's what the previous comments in this chain you're responding to are all about.