Most libraries load entire notebooks from top to bottom when executing, and I believe papermill does too. (Please correct me if I'm wrong, as I've not used papermill.)
This is great for making a dashboard, a report, or some other kind of analytics, but when it comes to a service the customer uses, you typically never want to load the whole notebook. This is where the industry-standard approach of executing the entire notebook tends to fall on its face.
What we do is write the cells that will end up in prod as functions inside the notebook. This reduces globals while prototyping, which is good form anyway, but it also lets prod call just those functions instead of running the entire notebook.
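As a rough sketch (the function and column names here are made up for illustration, not from any real project), a cell becomes something like:

    # A notebook cell written as a function instead of loose top-level code.
    import pandas as pd

    def clean_features(df: pd.DataFrame) -> pd.DataFrame:
        """Drop rows with missing values and normalize column names."""
        df = df.dropna()
        df.columns = [c.lower().strip() for c in df.columns]
        return df

    # The next cell calls it while prototyping; prod imports
    # clean_features() directly and never runs the notebook top to bottom.
    # train_df = clean_features(pd.read_csv("train.csv"))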
You will probably want to write your own library to do this, but in the meantime there is one that works for this purpose: https://github.com/grst/nbimporter (Ironically, the author doesn't recognize this use case.)
Using nbimporter you can import a notebook without executing it top to bottom. You can then call functions defined in that notebook, and only those functions get loaded and run.
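A minimal sketch of what that looks like (feature_pipeline.ipynb and its process() function are hypothetical names; the notebook is assumed to sit next to this script):

    import nbimporter  # registers an import hook so .ipynb files become importable

    # Imports the notebook as a module instead of executing it like a script.
    from feature_pipeline import process

    if __name__ == "__main__":
        record = {"age": 42, "income": 50000}  # stand-in input record
        print(process(record))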
In my notebooks I have a process function, which is like main() but for feature engineering. On the prod side, the process function is called from the notebook. process calls all of the necessary cells/functions for me in the correct order. This way the .py wrapper only has to call one function, then the ML predict function gets called, so the .py wrapper side stays pretty small. Tests, IO functions, and whatnot are written on the .py side too.
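Inside the notebook, the shape is roughly this (every name below is a hypothetical stand-in, not my actual pipeline):

    import pandas as pd

    def load_raw(record: dict) -> pd.DataFrame:
        return pd.DataFrame([record])

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        return df.dropna()

    def engineer(df: pd.DataFrame) -> pd.DataFrame:
        df["income_per_year_of_age"] = df["income"] / df["age"]
        return df

    def process(record: dict) -> pd.DataFrame:
        """main()-style entry point: calls the cells above in the correct order."""
        return engineer(clean(load_raw(record)))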
Data engineers love their classes, so it's easy to write a class that calls the notebook, and best of all, calling a single function this way does not load globals, so the data engineers are happy. It's a nice library, because otherwise you'd have to write your own (which you may end up wanting to do anyway).
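On the .py side, that wrapper class can be as small as this sketch (again, model.pkl and all names are hypothetical, and I'm assuming a pickled scikit-learn-style model with a predict() method):

    import pickle

    import nbimporter  # import hook so the notebook below is importable
    from feature_pipeline import process  # only this function gets pulled in

    class NotebookModel:
        """Wraps the notebook's process() behind a conventional class interface."""

        def __init__(self, model_path: str = "model.pkl"):
            with open(model_path, "rb") as f:
                self.model = pickle.load(f)

        def predict(self, record: dict) -> float:
            features = process(record)              # notebook feature engineering
            return self.model.predict(features)[0]  # the ML predict call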
This way if the model doesn't work as intended in production, it's my fault. We log everything, so I can replay the exact instance prod caught on my local machine, figure out what is going on, update the model, and then it can be deployed instantly.
Version numbers on the engineering side I can't comment on, as they have their own method. But on my end, the second the model writes to a database, I push hard for a version-number column or a version-number metadata table in that database, so it's easy for me to access for future analysis.
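For instance (a hedged sketch: the table, columns, and sqlite usage here are hypothetical, not a real schema):

    import sqlite3

    MODEL_VERSION = "2.3.1"  # bumped on every deploy

    conn = sqlite3.connect("predictions.db")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS predictions (
               record_id TEXT,
               prediction REAL,
               model_version TEXT  -- which model produced this row
           )"""
    )
    conn.execute(
        "INSERT INTO predictions VALUES (?, ?, ?)",
        ("abc-123", 0.87, MODEL_VERSION),
    )
    conn.commit()
    conn.close()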
Is rewriting your functions from a notebook to a .py file really something a research scientist cannot do?
Or is it infeasible for some other reason?
I'd imagine many data scientists want to publish their work as Python packages or libraries during their PhD, so at a bare minimum they should be familiar with writing classes or functions that work.
I've had projects where the model doesn't perform as intended. Because one person was making the model and another was productionizing it, it was hard to identify where the performance difference was coming from. Was the bug in the model itself or in the productionization process? It took longer to figure out than writing the model or productionizing it had taken in the first place.
It takes so long to deal with these bugs because the model gets changed, so then prod gets changed to match it. Changing prod (rewriting functions) has the potential to create a new bug, so you solved one but added another, and you still can't tell whether the issue is in the initial model or in prod. This continues over and over again, problem after problem.
It's worth noting that if one person does both the model building and the conversion to production, this problem is significantly reduced, but it is still a problem. The problem is exacerbated by a lack of domain knowledge, since each person is in the dark about the other's process.
Furthermore, what if you need to update the model? Do you rewrite prod, doubling or tripling your work? Do you take the risk of introducing another potentially hard-to-diagnose bug, even if you're the one doing both roles?
Or do you automate the process, so the code being developed is the same code running on the server at the end of the day? No more translation bugs, and half to a third of the work. Why not do it this way? It's so much easier to debug a problem in prod this way: you can feed the log data into your local machine and know that what you're seeing is what the user saw. No more guessing where the problem is.
One way to think of it: software engineers would find it absurd to write their code, then hand it off to someone who doesn't completely understand it, to rewrite it in another language and put it up on a server. "Why would you ever want to do that?" they would think, and I agree with this sentiment. It is absurd to have someone (even you) rewrite your work unless you have no other option, and you do have other options. Transpilers are a thing if prod needs to be in another language. I've written models that had to run in embedded environments, so I know these challenges all too well.
> I'd imagine many data scientists want to publish their work as Python packages or libraries during their PhD, so at a bare minimum they should be familiar with writing classes or functions that work.
It depends on whether you're writing a library (doing ML / machine-learning-engineer type work), or you're solving a domain challenge and writing an end-to-end solution for that problem, using standard cookie-cutter ML for your PhD (aka data-science type work).
One leads to an engineering role, and unsurprisingly writing a library is ideal there, so other people can use it. The other leads to a data-science type role, and unsurprisingly showing code snippets in your paper with plots / EDA and all, the same way you'd write a notebook at work, is ideal there.
I'm a data scientist, not an ML specialist (I did invent a new form of ML for work once, but that was a one-off, not my primary thing). I specialize in end-to-end domain problems. I'll write a notebook to solve one, not that I have to; I've been in the industry since before notebooks were a thing, so I'm fine doing it the old-fashioned way. What I am not is an MLE. I don't need to write libraries for other users, I don't need to write custom ML, and I don't need to do that engineering bit. To be fair, I have, and I know when it's the right tool for the job. On Stack Overflow all of my points come from helping people with the glue parts between C++ and R, so they too can write libraries for R, and I'm proficient in modern C++. I can do the library-ML type work, and I have enjoyed it, but I really do enjoy solving domain problems more, so that's what I'm doing, and that's what the previous comments in this chain you're responding to are all about.