
I just wanted to say thank you. Many of the points in your study strike a nerve. Part of my responsibility at my last job was to introduce good software engineering practices. What happens? The data scientists go rogue and start running notebooks left and right. How do they productionize their work? Well, they don't. They were academics. All they know is that the models ran fine in their notebooks on their laptops. Meanwhile, we didn't have anyone who was devoted full time to model productionization.

Sharing data? They had enough problems sharing their notebooks.




I just happened to be reading Peter Naur's "Programming as theory building" recently. It strikes me that taking its theme even a little seriously helps understand why notebooks are so popular. Notebooks happen to be convenient tools for exploring a new domain (interactively). Irrespective of how much software purists might complain, conventional software engineering provides very few tools/solutions/practices for that process. The wretched state of interactive debugging (in most languages) is a simple example.

As someone who spends a substantial amount of time working in both modes (writing research code in Jupyter notebooks, and writing production code as Python modules), I find that notebooks scratch certain itches that IDEs typically don't even come close to. (Some recent progress on add-ons in JavaScript-based editors is potentially interesting, because that might help marry the strengths of the two.)

In my experience, in the evolution of code from Jupyter notebooks to repositories of production code as part of any project, there comes a "right time" to switch from the former to the latter. And this can typically only be learned with experience.


I just refactor into a module that I import into my notebook as I go along. This lets me use the notebook for quick prototyping, but also productionize faster if need be.


That only works after the code in the module is largely "frozen". It doesn't work well if you're experimenting with ideas inside the module. OTOH, if the algorithm is largely frozen, and you're trying to experiment with its performance on a bunch of examples, the workflow of putting the algorithm in a module and using a notebook to interface with data and visualize results is quite useful.

That is basically what I meant by knowing when to transition from one mode to the other.

Here's a concrete example (maybe somebody considers this an inspiring challenge?), to illustrate how notebooks are infuriating in their primitiveness, but still better compared to using an editor on source files: Imagine a beginner trying to write/learn a sorting algorithm, and who would like to keep experimenting with their code and observing what happens on examples, possibly profiling space/time complexity along the way.
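A minimal sketch of what one notebook cell in that sorting scenario might look like, assuming insertion sort as the algorithm under study and a comparison count as a crude stand-in for time profiling. The function name and the chosen inputs are illustrative only.

```python
def insertion_sort(items):
    """Return a sorted copy of items and the number of comparisons made."""
    result = list(items)
    comparisons = 0
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift larger elements one slot right until current fits.
        while j >= 0:
            comparisons += 1
            if result[j] <= current:
                break
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result, comparisons


# Re-run this part after every tweak to watch how the cost grows.
for n in (4, 8, 16):
    _, best = insertion_sort(range(n))          # already sorted input
    _, worst = insertion_sort(range(n, 0, -1))  # reversed input
    print(n, best, worst)
```

On sorted input the count grows linearly (n - 1), and on reversed input quadratically (n(n - 1)/2), which is the kind of observation the beginner would want to make interactively.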

To expand on my point above, there are actually three distinct computational use cases, not just two: Interactive learning -> Sharing insights with others -> Productionizing code.


Why doesn't it work for experimenting with the module? In Jupyter, if you're using autoreload, the module will refresh every time you use it.
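For context, in a notebook this is usually enabled with the IPython magics `%load_ext autoreload` and `%autoreload 2`. Those are not plain Python, so the sketch below shows the underlying idea with the standard library's `importlib.reload` instead. The module name `scratch_algo` and its `score` function are hypothetical, created in a temporary directory purely for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so the quick rewrite below is never masked by a stale cache.
sys.dont_write_bytecode = True

# Write a throwaway module to a temp directory (hypothetical name and contents).
workdir = Path(tempfile.mkdtemp())
module_path = workdir / "scratch_algo.py"
module_path.write_text("def score(x):\n    return x + 1\n")

sys.path.insert(0, str(workdir))
importlib.invalidate_caches()
import scratch_algo

first = scratch_algo.score(10)

# Edit the module on disk, as you would in an editor next to the notebook.
module_path.write_text("def score(x):\n    return x * 2\n")

# Manual reload; autoreload performs an equivalent step before each cell runs.
importlib.reload(scratch_algo)
second = scratch_algo.score(10)

print(first, second)
```

Note that reloading replaces the module's functions but does not update objects that were already created from the old code, which is one reason experimenting inside a module can still feel less interactive.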


I guess the objection is that if what you are experimenting with is inside a module, you've moved the "active" code out of the notebook and given up the interactivity.


>to introduce good software engineering practices. What happens? The data scientists go rogue and start running notebooks left and right. How do they productionize their work? Well, they don't. They were academics.

My background is programming (instead of data analysis & modeling) so I'm sympathetic to your idealistic "software engineering" view... but I'm also sympathetic to the academics' side as explained by Yihui Xie's blog post:

https://yihui.org/en/2018/09/notebook-war/

He's convinced me that criticizing non-programmers for using (or over-using) computational notebooks when it should be a "proper" programming language and deployment is like telling financial analysts who over-use Excel to learn how to program VB or Python and rewrite their spreadsheets into a "proper database" like Oracle or MySQL. That's just not reality. This divide between "end user tools" and "proper programmer tools" will always exist because there is no perfect tool in existence that serves the needs of both skill sets. Therefore, the programmers will always be able to say the data scientists or financial analysts are "doing it wrong".


> He's convinced me that criticizing non-programmers for using (or over-using) computational notebooks when it should be a "proper" programming language and deployment is like telling financial analysts who over-use Excel to learn how to program VB or Python and rewrite their spreadsheets into a "proper database" like Oracle or MySQL.

I think this is very much off the mark. For sure plenty of scientists are poor programmers, but that isn't the reason they use notebooks. It is because:

They are not attempting to write something that will run everywhere, and often. They are either analyzing some data or doing rapid prototyping. For the latter, it's like criticizing someone who uses a REPL. It's just that the notebook is so much more powerful than a simple REPL that one can safely stick with it. Imagine you will do 40-50 prototypes and only one of those may end up worthy enough to make a product out of, and you don't know which one that will be. If you used a non-notebook environment, you'd give up in frustration by the time you hit the 15th one.

As you said: At the moment, there simply isn't an alternative that allows for rapid prototyping and is production ready. It's a hard problem to solve - there's a reason no one had solved it for decades (well before notebooks were a thing).

Had notebooks not been invented, you would have the same people handing you MATLAB code asking you to productize it.

Claiming they are beginners/novice programmers is off the mark. Peter Norvig started using notebooks for a reason, and no one would call him a novice. I do SW for a living, but when I need to analyze data and visualize it, I'll pick a notebook over "proper" SW tools any day.


We shouldn't assume it will always exist. It exists because programming languages and tools are not as usable as they can be. That is something we can and should expect to change.


Notebooks are like training wheels. They serve multiple purposes, one of the most important being signaling ineptitude to others. Code smells are useful signals, and a notebook is one too.


The tone of what you are saying strikes a nerve with me - we had exactly the same issues with Excel in the front office in investment banking.

Unknowable ad-hoc, unversioned spreadsheets running much of the capital of the company.


That’s a really good comparison. Excel is often used for storing data and doing analysis because it just plain works. And anyone can use it.

Notebooks tend to be the same way. It’s a simple GUI-ish way to do many complex analyses in a quick and dirty way.

And many of the arguments for not using Excel are the same as those for not using notebooks. Each is good at the initial data exploration stage, but both are often abused and used in production when everyone knows it is a bad idea. But it still “works” so it is unlikely to be replaced.

(Especially when those that are working with the data don’t always have the skill set to build out a full production workflow.)


Think of all the damage caused by Excel. We traded one set of avoidable catastrophes for another. But this time there’s no shame.


I'm a computational biologist and Excel has been the bane of my existence for 20 years. We've "known better" for all of that time, but I still deal with people passing around Excel files of data or having common spreadsheets on shared drives (or now Dropbox shared). We all "know better", but Excel is often the first thing that people try to keep track of data, and once a system works, there is just too much inertia to change.

(For what it's worth, I feel the same way about people who try to send me RDS files with dataframes stored as R objects).

However, I think that whoever decided to name genes "OCT4" and "SEPT7" has to share some of the blame here too...


Are you hiring for summer positions?


At my last job I spent 80+% of my time productionizing models and notebooks. It was an absolute nightmare. Everyone had slightly different preprocessing hacks for different stages, and things were always working fine locally, but I couldn't replicate the results in Docker containers.

I am very happy to be out of that business.


are you me?


>Sharing data? They had enough problems sharing their notebooks.

We just store the data tables in the project's database on a Postgres server. Then it's just a matter of pd.read_sql_query()
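A rough sketch of that workflow, with loud assumptions: an in-memory SQLite database stands in for the Postgres server so the example runs anywhere, and the `measurements` table, its columns, and the `load_table` helper are made up for illustration. With Postgres you would typically pass a SQLAlchemy engine or a driver connection instead, and the parameter placeholder style depends on the driver.

```python
import sqlite3

import pandas as pd

# Assumption: SQLite in memory stands in for the shared Postgres server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sample_id TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("c", 4.0)],
)
conn.commit()


def load_table(connection, min_value):
    """Load rows at or above min_value into a DataFrame (hypothetical helper)."""
    query = (
        "SELECT sample_id, value FROM measurements "
        "WHERE value >= ? ORDER BY sample_id"
    )
    # Passing params keeps user-supplied values out of the SQL string itself.
    return pd.read_sql_query(query, connection, params=(min_value,))


df = load_table(conn, 2.0)
print(df)
```

Every notebook that runs the same query sees the same table, which sidesteps passing data files around.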



