I just wanted to say thank you. Many of the points in your study strike a nerve. Part of my responsibility at my last job was to introduce good software engineering practices. What happens? The data scientists go rogue and start running notebooks left and right. How do they productionize their work? Well, they don't. They were academics. All they knew was that the models ran fine in their notebooks on their laptops. Meanwhile, we didn't have anyone devoted full-time to model productionization.
Sharing data? They had enough problems sharing their notebooks.
I just happened to be reading Peter Naur's "Programming as theory building" recently. It strikes me that taking its theme even a little seriously helps understand why notebooks are so popular. Notebooks happen to be convenient tools for exploring a new domain (interactively). Irrespective of how much software purists might complain, conventional software engineering provides very few tools/solutions/practices for that process. The wretched state of interactive debugging (in most languages) is a simple example.
As someone who spends a substantial amount of time working with both modes (writing research code in Jupyter notebooks, and writing production code as python modules), notebooks scratch certain itches that IDEs typically don't even come close to. (Some recent progress on add-ons in Javascript-based editors is potentially interesting, because that might help marry the strengths of the two)
In my experience, in the evolution of code from Jupyter notebooks to repositories of production code as part of any project, there comes a "right time" to switch from the former to the latter. And this can typically only be learned with experience.
I just refactor into a module that I import into my notebook as I go along. This lets me use the notebook for quick prototyping, but also productionize faster if need be.
That only works after the code in the module is largely "frozen". It doesn't work well if you're experimenting with ideas inside the module. OTOH, if the algorithm is largely frozen, and you're trying to experiment with its performance on a bunch of examples, the workflow of putting the algorithm in a module and using a notebook to interface with data and visualize results is quite useful.
That is basically what I meant by knowing when to transition from one mode to the other.
Here's a concrete example (maybe somebody considers this an inspiring challenge?), to illustrate how notebooks are infuriating in their primitiveness, but still better compared to using an editor on source files: Imagine a beginner trying to write/learn a sorting algorithm, and who would like to keep experimenting with their code and observing what happens on examples, possibly profiling space/time complexity along the way.
To expand on my point above, there are actually three distinct computational use cases, not just two: Interactive learning -> Sharing insights with others -> Productionizing code.
I guess the objection is that if what you are experimenting with is inside a module, you've moved the "active" code out of the notebook, and given up the interactivity.
>to introduce good software engineering practices. What happens? The data scientists go rogue and start running notebooks left and right. How do they productionize their work? Well, they don't. They were academics.
My background is programming (instead of data analysis & modeling) so I'm sympathetic to your idealistic "software engineering" view... but I'm also sympathetic to the academics' side as explained by Yihui Xie's blog post:
He's convinced me that criticizing non-programmers for using (or over-using) computational notebooks when it should be a "proper" programming language and deployment is like criticizing financial analysts over-using Excel to learn how to program VB or Python and re-write their spreadsheets into a "proper database" like Oracle or MySQL. That's just not reality. This divide between "end user tools" and "proper programmer tools" will always exist because there is no perfect tool in existence that serves the needs of both skill sets. Therefore, the programmers will always be able to say the data scientists or financial analysts are "doing it wrong".
> He's convinced me that criticizing non-programmers for using (or over-using) computational notebooks when it should be a "proper" programming language and deployment is like criticizing financial analysts over-using Excel to learn how to program VB or Python and re-write their spreadsheets into a "proper database" like Oracle or MySQL.
I think this is very much off the mark. For sure plenty of scientists are poor programmers, but that isn't the reason they use notebooks. It is because:
They are not attempting to write something that will run everywhere, and often. They are either analyzing some data or doing rapid prototyping. For the latter, it's like criticizing someone who uses a REPL. It's just that the notebook is so much more powerful than a simple REPL that one can comfortably stick with it. Imagine you will do 40-50 prototypes and only one of those may end up worthy enough to make a product out of, and you don't know which one that will be. If you used a non-notebook environment, you'd give up in frustration by the time you hit the 15th one.
As you said: at the moment, there simply isn't an alternative that allows for rapid prototyping and is production-ready. It's a hard problem to solve - there's a reason no one solved it in the decades before notebooks were a thing.
Had notebooks not been invented, you would have the same people handing you MATLAB code asking you to productize it.
Claiming they are beginners/novice programmers is off the mark. Peter Norvig started using notebooks for a reason, and no one would call him a novice. I do SW for a living, but when I need to analyze data and visualize it, I'll pick a notebook over "proper" SW tools any day.
We shouldn't assume it will always exist. It exists because programming languages and tools are not as usable as they can be. That is something we can and should expect to change.
Notebooks are like training wheels. They serve multiple purposes, one of the most important being signaling ineptitude to others. Code smells are useful, and a notebook is one too.
That’s a really good comparison. Excel is often used for storing data and doing analysis because it just plain works. And anyone can use it.
Notebooks tend to be the same way. They're a simple, GUI-ish way to do many complex analyses, quick and dirty.
And many of the arguments against using Excel are the same as those against using notebooks. Each is good at the initial data exploration stage, but both are often abused and used in production when everyone knows it is a bad idea. But it still "works", so it is unlikely to be replaced.
(Especially when those that are working with the data don’t always have the skill set to build out a full production workflow.)
I'm a computational biologist and Excel has been the bane of my existence for 20 years. We've "known better" for all of that time, but I still deal with people passing around Excel files of data or having common spreadsheets on shared drives (or now Dropbox shared). We all "know better", but Excel is often the first thing that people try to keep track of data, and once a system works, there is just too much inertia to change.
(For what it's worth, I feel the same way about people who try to send me RDS files with dataframes stored as R objects).
However, I think that whoever decided to name genes "OCT4" and "SEPT7" has to share some of the blame here too...
My last job I spent 80+% of my time productionizing models and notebooks. It was an absolute nightmare. Everyone had slightly different preprocessing hacks for different stages and things were always working fine locally, but I couldn't replicate the results in docker containers.
Have you looked into the domain of "research data management"? Concerns such as "archival", "security" or "share & collaborate" are core to this research domain:
In academics, there's a trend to prepare a "data management plan" up front that creates awareness about these concerns. They are even a requirement in order to get funding:
So, it's a bit odd to see a study that's focused on a single technical tool yield the same concerns... yet not make that jump to a larger, existing framework for information management.
Looking at the authors, it seems you are located at Oregon State University. A quick DuckDuckGo search yields this service from your colleagues at the University Library:
Within the context of notebooks themselves, I think the study reflects "if you have a hammer, every problem looks like a nail." Notebooks aren't the only powerful tool for working with data. I think many of the same concerns could be raised about Google Sheets or Excel with heavy VBA scripting. Like others said, this is not a new problem.
Notebooks do have a place in the bigger process of doing iterative research based on data mining techniques. They can help to formulate more accurate questions and perform quick tests without the friction of having to set up complex environments. Moving on from initial data exploration, it's up to the researcher to use a formal method and tools that do mitigate those concerns. RDM is all about providing tools and mitigating (legal) liabilities as far as "what do you do with your data?" is concerned.
In my experience, the best approach is to treat the notebook as the frontend. So widgets, graphs, annotations are generally ok. Anything compute intensive should be relegated to the backend.
I think adding feedback for marking cells as dependent on each other might be a good idea.
I'd also love code completion in notebooks.
I think the cleaning and code reuse problems can easily be mitigated by putting functions into libraries and using auto reload.
My normal workflow is hack something in a notebook until it runs, then refactor and put in a library I import with auto reload. I work on production ML and I use this for both software development and research.
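For anyone unfamiliar with the mechanics, here is a rough simulation of that workflow outside a notebook (the `mylib`/`clean` names are made up for illustration; inside IPython, the `%autoreload 2` magic performs the reload step for you automatically before each cell runs):

```python
import importlib
import os
import pathlib
import sys
import tempfile
import time

# Create a tiny "library" module, the target of the refactoring.
tmp = tempfile.mkdtemp()
mod = pathlib.Path(tmp) / "mylib.py"
mod.write_text("def clean(x):\n    return x.strip()\n")
sys.path.insert(0, tmp)

importlib.invalidate_caches()  # the file was created after startup
import mylib
assert mylib.clean("  a  ") == "a"

# "Refactor" the library while the session is live...
mod.write_text("def clean(x):\n    return x.strip().lower()\n")
os.utime(mod, (time.time() + 10,) * 2)  # bump mtime so the stale .pyc is ignored
importlib.reload(mylib)  # inside IPython, %autoreload 2 does this for you
print(mylib.clean("  A  "))  # a
```

The mtime bump is only needed because this script rewrites the file within the same clock tick; in a real editing session the file's timestamp changes naturally.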
> Co-author of the study here. Let me know if you have any questions or how you overcome some of the problems we identified!
It's not clear who the audience is. It sounds like most people who complain about them are software people and not researchers/scientists.
For someone like me, who once did computational research using MATLAB, and later analyzed data for my job, Jupyter is not worse, and is in most ways superior. Let's take your points one by one:
> Participants stated they often downloaded data outside of the notebook from various data sources since interfacing with them programmatically was too much hassle.
This was the norm with MATLAB, Excel and JMP as well, unless someone wrote code to autodownload (extremely rare - less than 1% of people did that). And if you are going to write code to get the data from somewhere, it's much nicer in Jupyter than in these other tools.
> Not only that, but notebooks often crash with large data sets (possibly due to the notebooks running in a web browser).
I honestly have not seen this, and the reason makes no sense. Your browser is not handling the data. The kernel is. I mean yes, if you try to load several GB of data in pandas, it's possible you will have problems if you run out of RAM, but this has nothing to do with notebooks.
> Once the data is loaded, it then has to be cleaned, which participants complained is a repetitive and time consuming task
This was as much a problem prior to notebooks as it is now. Notebooks did not make this any worse.
> Explore and analyze. Modeling and visualizing data are common tasks but can become frustrating. For example, we observed one participant tweak the parameters of a plot more than 20 times in less than 5 minutes.
It was even worse with MATLAB. Ditto for Excel. JMP is a bit nicer for visualization, though.
> Notebooks do not have all of the features of an IDE, like integrated documentation or sophisticated autocomplete, so participants often switch back and forth between an IDE (e.g., VS Code) and their notebook.
It may be better now, but this was a problem in MATLAB as well.
> While it is easy to share the notebook file, it is often not easy to share the data.
This is as true with MATLAB, JMP, etc. A lot of the complaints about it being hard to reuse notebooks is because notebooks at least attempt to be reproducible, and thus many more people attempt it. Prior to notebooks, I know almost no one who tried to share MATLAB analyses, because it was such a pain to do so.
> Notebooks as products. If a large data set is used, as one might expect in production, then the notebook will lose the interactivity while it is executing. Also, notebooks encourage "quick and dirty" code that may require rewriting before it is production quality.
I suppose some people are trying to make products out of notebooks, and this is where all the recent grief I see is coming from. I do not think it was the primary goal of notebooks, though. They were meant for data analyses and prototyping, not for production use.
Much of your comment could be summarised as "it's no worse than prior tools". That doesn't invalidate the authors' points, though: just because notebooks are better than previous tools, which have the same or worse problems, doesn't mean they don't have problems worth tackling or talking about. That it's an improvement over what existed before doesn't mean you can't be critical of the flaws it still has, and a study like this (looking at real people) and a discussion like we're having here are a necessary start to finding out how to improve on this.
People who design and implement features in notebooks. The conclusions in the blog post and research paper are clear that improving these identified problems could improve user experience.
> I honestly have not seen this, and the reason makes no sense. Your browser is not handling the data. The kernel is. I mean yes, if you try to load several GB of data in pandas, it's possible you will have problems if you run out of RAM, but this has nothing to do with notebooks.
This is true nowadays with Jupyter, because it is smart about truncating output. But it used to be possible to OOM the browser by, e.g., printing in a long-running loop or displaying too long a list/table.
Maybe because of the negative sentiment/bad taste that statements like "What's wrong with..." leave. It could say "What can be improved with ... in 2020" instead. Probably not the intention of the author, but it comes across as non-constructive criticism/"not recommended to use" a bit too much. Some observations seem not directly related to notebooks per se. Others feel like they could just be entries in a FAQ/best-practices section of the documentation.
Kudos for looking at real people's work and surveying it.
I wonder how much workflow could be improved if researchers would be temporarily paired with developers - who are generally better at modularising and removing friction in their work.
Personally I believe that a bit of clean-code discipline and following known best practices could solve a couple of those pain points.
It's also true some could be improved by rethinking how notebooks work; e.g., being able to specify the input/output of a notebook so it can be used as a library; detaching runtime data from the code so it plays better with version control/publishing; maybe even more radical ideas like adding a visual/flow view that helps with linking elements; adding built-in Excel-like sheets that can be queried/manipulated could also be interesting; built-in, first-class support for a relational database (SQLite) could also be a big win.
There are many interesting developments happening in this space and there seem to be some unexplored ideas waiting to be tested out.
Yes! Not only that, but also all of the end-user programming research. I did some studies on LabVIEW programmers before, and I noticed a lot of the same phenomena with data scientists. They have a lot of domain knowledge and some programming experience, but usually do not use software engineering best practices or tools (e.g., unit testing, code reviews, automated refactoring). All of this is very understandable, but it reveals a lot of potential for tools to better support them.
See Yestercode [1] and CodeDeviant [2], two tools that I specifically designed for LabVIEW programmers to refactor and test their code without expecting them to behave like traditional software engineers.
Interesting study! I'm curious what shadowing 15 R data scientists would look like, since it seems to resolve some of the pain points around caching results, debugging, and scaling.
This is a very minor question (and I am not concerned about risk to participants)--when you say they signed consent "in accordance with our institutional ethics board", are you talking about Microsoft, one of the two universities, or all?
I want a notebook where causality can only flow forward through the cells. I hate notebook time-loops where a variable from a deleted cell can still be in scope.
1. Checkpoint the interpreter state after every cell execution.
2. If I edit a cell, roll back to the previous checkpoint and let execution follow from there.
I can't tell you how many times I've seen accidental persistence of dead state waste hours of people's time.
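A minimal sketch of the checkpoint/rollback idea, simulating the kernel namespace with a plain dict and snapshotting it via pickle (a real kernel would of course need to handle unpicklable objects and much larger state):

```python
import pickle

checkpoints = []  # one snapshot per executed cell
ns = {}           # stand-in for the kernel's global namespace

def run_cell(src):
    exec(src, ns)
    # checkpoint the (picklable) interpreter state after every cell
    state = {k: v for k, v in ns.items() if not k.startswith("__")}
    checkpoints.append(pickle.dumps(state))

def rollback(cell_index):
    """Restore the state as it was just after cell `cell_index` ran."""
    ns.clear()
    ns.update(pickle.loads(checkpoints[cell_index]))
    del checkpoints[cell_index + 1:]

run_cell("x = 1")        # cell 0
run_cell("y = x + 1")    # cell 1
run_cell("x = 99")       # cell 2 clobbers x
rollback(1)              # editing cell 2? rewind to just after cell 1
print(ns["x"], ns["y"])  # 1 2
```

With this model, re-running an edited cell can never see state from cells that "no longer exist", because the rollback discards every later checkpoint.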
My problem with notebooks is that I feel like the natural mental model for them is a spreadsheet mental model, not a REPL mental model. Under that assumption, changing a calculation in the middle means that all of the cells that depend on that calculation would be updated, but instead you need to go and manually re-run the cells after it that depend on that calculation (or re-run the entire notebook) to see the effect on later things. Keeping track of the internal state of the REPL environment is tricky, and my notebooks have usually just ended up being convenient REPL blocks rather than a useful notebook since that's the workflow it emphasizes.
Yep, the real complaint is “dead state”, not out of order execution. Worrying about linear flow per se turns out to be misguided based on lack of imagination for/experience with a better model: reactive re-rendering of dependent cells. Observable entirely solves the dead state problem, in a much more effective way than just guaranteeing linear flow would do.
* * *
More generally, Observable solves or at least ameliorates every item in the linked article’s list of complaints. (In 2020, any survey about modern notebook environments really should be discussing it.)
I found the article quite superficial. More like “water cooler gripes from notebook users we polled” than fundamental problems with or opportunities for notebooks as a cognitive tool. I think you could have learned more or less the same thing from going to whatever online forum Jupyter users write their complaints at and skimming the discussion for a couple weeks.
I guess this might be the best we can hope for from the results of a questionnaire like this. But it seems crazy having an article about notebook UI which makes no mention of spreadsheets, literate programming, Mathematica, REPLs, Bret Victor’s work, etc.
From the title I was hoping for something more thoughtful and insightful.
You can get a jupyter extension[1] that allows you to add tags and dependencies and this way construct the dependency graph as you go along. Of course, you have to do it manually and the interface is a bit clunky, but it does what it says.
In practice I think taking care not to accidentally shadow variables is much more important: this dependency business only makes sense once you have a clear idea of what you need and by that point you are mostly done anyway.
I don’t understand what you are trying to say in your second paragraph, but I highly recommend you spend a few weeks playing with http://observablehq.com instead of speculating about the differences.
In practice, I find it to be dramatically better than previous notebook environments for data analysis, exploratory programming / computational research, prototyping, data visualization, and writing/reading interactive documents (blog posts, software library documentation, expository papers...). It has a lower barrier to starting new projects and a lower-friction flow throughout.
I find it better at every stage of my thinking process from blank page up through final code/document, and would recommend it vs. Jupyter or Matlab or Mathematica in every case unless some specific software library is needed which is unavailable in Javascript. The only other tool I really need is pen and paper, though I also use http://desmos.com/calculator and Photoshop a fair bit.
This falls apart when computation is a factor, though. You can't recompute the whole notebook on every commit when there are 30 cells that each take 2-8 seconds to complete.
In Jupyter I approach this by structuring my exploratory analysis in sections, with the minimum of variables reused between sections.
Typically the time-intensive data prep stage is section 1.
The remaining sections are designed essentially like function blocks: data inputs listed in the first cell and data outputs/visualizations towards the end.
Once I decide the exploratory analysis in a section is more-or-less right, I bundle up the code cells into a standalone function, ready for reuse later in my analysis.
Jupyter notebooks can easily get disorganised with out-of-order state. However, that is their strength too: exploratory analysis and trying different code approaches is inherently a creative rather than a linear activity.
Maybe I'm missing a joke here, but if that's your workflow then there's absolutely no advantage to notebooks over something like Spyder or even VS Code.
No, that's not the workflow. You work in the notebook as normal but from time to time (say every two hours) rerun the whole thing.
One advantage of this is that it forces you to name your variables such that they don't overwrite each other. Further down the line this enables sophisticated comparisons of states (e.g. dataframes) before and after (something data scientists need)
If you have a few long data loading and preprocessing steps it's a pain to wait for them to run again, people try to avoid it.
When something odd begins to happen, they don't immediately consider the possibility that it's not their bug and waste time trying to 'debug' the problem instead of just rerunning the notebook.
Would it be a solution to store intermediate computations to an in-memory or disk database like Redis, SQLite? It is a matter of few minutes to run a docker instance and write simple read / write + serialize Python util functions?
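Something along these lines with the stdlib `sqlite3` + `pickle`, no Docker needed (the helper names are my own; pass a file path instead of `:memory:` so results survive a kernel restart):

```python
import pickle
import sqlite3

def make_cache(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, val BLOB)")
    return db

def put(db, key, obj):
    # serialize any picklable intermediate result into a BLOB column
    db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)",
               (key, pickle.dumps(obj)))
    db.commit()

def get(db, key):
    row = db.execute("SELECT val FROM cache WHERE key = ?", (key,)).fetchone()
    return pickle.loads(row[0]) if row else None

db = make_cache()
put(db, "cleaned_rows", [1, 2, 3])   # e.g. an expensive preprocessing result
print(get(db, "cleaned_rows"))       # [1, 2, 3]
```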
You don't reload every time you write a line of code. Nobody's insane like that. You reload every two hours or so. This is good enough for all but the most extreme data sets.
Well, if block 1 takes ages but everything after it is dependent (2 -> 3 -> 4, etc.), obviously it would be nice to just re-run block two and have those changes cascade.
That's what I'd always do. On more complex notebooks, though, is it possible that isn't a solution? I wouldn't think so, but I am happy to be surprised. Then again, I use notebooks only at the end of a project to present work in "executable presentation" style. Restart and Run All has always been sufficient for me.
More generally, I took a look at notebooks, thought, “Why develop with all the extra baggage” and left it at that until ready to experiment with presentation methods for (tight) core ideas.
> One advantage of this is that it forces you to name your variables such that they don't overwrite each other. Further down the line this enables sophisticated comparisons of states (e.g. dataframes) before and after (something data scientists need)
Also, not sure about you, but I like seeing all of my outputs on a single browser page without having to write any glue code whatsoever.
A couple years out of college we finally took a hard look at the credit cards and realized we had fucked up.
We were gonna buckle down, pay the cards down hard for a while, 'color' our money so we both had discretionary spending separate from, say, the power bill. She had much more Excel experience than I did so she worked up a spreadsheet.
It was bad. We had worked up some 'fair' notion of proportionality and she basically had no spending money and mine was pretty bleak. So I redid the numbers from scratch with split that was better for her. In the new spreadsheet she has much more spending money and... hold on, I've got a bit more too? I looked at her spreadsheet repeatedly and I never did figure out where a couple hundred bucks got lost. I went back to sanity checking mine instead to make sure I wasn't wrong. It checked out.
I wonder sometimes how often small companies discover they've been running in the red instead of the black, because some cell got zeroed out, a sum didn't cover an entire column, or embezzlement is encoded straight into the spreadsheet without anyone noticing.
The entire accounting department at any company exists to make sure their numbers are spot on. If your wife had an entire accounting department scrutinizing her numbers, they'd find the discrepancy. These are people willing to sacrifice their professional careers (and, during busy season, their lives) to do nothing but tinker with Excel for 40+ years; always trust a masochist verging on the insane.
> I wonder sometimes how often small companies discover they've been running in the red instead of the black, because some cell got zeroed out, a sum didn't cover an entire column
This is a really interesting insight (actually obvious when you think about it). I'm currently working on a spreadsheet app and these kinds of observations are very interesting to me. I guess things like named cells/variables will help (instead of using $A$4 etc.). Range selection could also be more intelligent (it could actively warn you if a range selection seems to be missing a few cells of the same data type). Do you have any other insights here?
One company I used to work for had this happen: there was a magic spreadsheet in the accounting system. It was one factor in the massive restructuring of the company - ICAN was the other.
Going back to even some of the earliest literate programming exercises by Knuth, there's a lot of demonstrable usefulness in being able to write the code "out of order", or to at least demonstrate it in such form. It's not entirely out of the question that setup requirements aren't interesting to the main narrative flow, and maybe even distract from it, such that the "natural" place to put stuff in say a textbook is in the back as an Appendix.
A good notebook (again, similar to early literate programming tools) should help you piece the final execution flow back into the procedural flow needed for the given compiler/interpreter, but it probably should still let you rearrange it to best fit your narrative/thesis/arc.
This is how runkit does it for nodejs and I think it’s working quite well for them.
We (at Nextjournal) tried doing the same for other languages (Python, Julia, R) and felt that it didn't work nearly as well. You often want to change a cell at the top, e.g. to add a new import, and it can be quite annoying when long-running dependent cells re-execute automatically. I now think that automatic execution of dependent cells works great when your use case is fast-executing cells (see observablehq), but we need to figure out something else for longer-running cells. One idea that I haven't tried yet is to automatically run only those cells whose last execution finished within a given time threshold.
I hear a lot of complaints about hidden state but I think it’s less of a problem in reality. It’s just a lot faster than always rerunning things from a clean slate. Clojure's live programming model [1] works incredibly well by giving the user full control over what should be evaluated. But Clojure's focus on immutability also makes this work really well. I rarely run into issues where I'm still depending on a var that's been removed and then there's still the reloaded workflow [2].
Overall I think notebooks are currently a great improvement for people that would otherwise create plain scripts – working on it is a lot quicker when you have an easy way to just execute parts of it. Plus there's the obvious benefit of interleaving prose and results. That doesn't mean we should not be thinking about addressing the hidden state problem but I think notebooks do add a lot of value nevertheless.
If people are wondering about cases that can cause this - a common one (for me) is a mis-spelled variable name. If you go back and change it, the old one is still there and if you make the same mistake twice you will have code that runs but doesn't work. It's then really not obvious why it doesn't work.
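You can reproduce the failure mode outside a notebook by simulating the kernel's namespace with a dict:

```python
# Simulate a notebook kernel's global namespace with a plain dict.
ns = {}

exec("theshold = 0.5", ns)   # cell 1, first run: note the typo
exec("threshold = 0.5", ns)  # cell 1 re-run after fixing the typo

# The misspelled name is still alive in the kernel:
assert "theshold" in ns and "threshold" in ns

# A later cell that repeats the typo runs without error,
# silently reading the stale value:
exec("result = theshold * 2", ns)
print(ns["result"])  # 1.0
```

Nothing ever removes `theshold`, so the notebook keeps "working" until a restart suddenly breaks it.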
But, strangely, Jupyter doesn't also give you a REPL (like, say, RStudio does). I'm always making new cells in the middle to output the column names of my spreadsheet, and then I have to delete them. I used to just always have an ipython REPL running and test things out in there as I wrote. You can start an ipython instance on the same kernel, but I found that messed up my plots when I did that, IIRC.
You can get a REPL attached to a notebook in jupyter. When you open a console in jupyter-lab you have the option of attaching it to an already running kernel. Using the notebook interface you can connect a console using `jupyter console --existing`. By default this connects to the most recent session, but you can also specify a session by passing a token.
It's an easy thing to miss though, because you also then need to delete the line of code you used to delete the old object/name so you have no record of cleaning up after yourself.
Hot reloads can become very expensive. Especially when it comes to computationally heavy tasks that notebooks are built for.
If you decide you want hot reloads by default, it'd mean each time you click on a cell and then click on another you'd be restarting the whole notebook.
If you had massive datasets you were loading or other args that you were parsing manually or at prompt, you'd have to go back and do all that. Don't even get me started on the operations you'd have done with those dataframes prior.
I think it is a good thing that notebooks separate instructions and re-execute manually by default. The cost of the alternative is just too high
> I think it is a good thing that notebooks separate instructions and re-execute manually by default. The cost of the alternative is just too high.
Maybe add a "lock" toggle so a user can block a cell from being automatically executed? The heavy numeric setup tasks could then be gathered in a few cells and locked, leaving the lighter plotting & summary stats cells free to update reactively.
Toggle a whole environment and interpreter's behavior? Do you know how much architecture that would involve? That's like trying to tell IDLE to be able to either delete or keep your variables on exit, or the JVM to have a toggle switch for memory and garbage management.
Why doesn't the developer make themselves useful and simply write a save function that freezes their buffer variable values to a text, json or SQLite file that they can read from or stream rather than trying to set back a whole community years of progress in an effort to accommodate perhaps entitled or lazy devs.
Can you even imagine the architectural costs of trying to accommodate streaming data and timestamped data as opposed to you just writing your own stuff to file?
I think adding a toggle to run or not run a cell would be a trivial change, something like adding a property to the cell (https://raw.githubusercontent.com/jupyter/notebook/master/no...) and checking whether that is set or not before running the contents.
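Roughly this kind of check, I'd imagine (a sketch of the idea only, not actual Jupyter internals; the `locked` metadata key is hypothetical):

```python
# A runner consults a per-cell metadata flag before executing it.
def should_execute(cell):
    return not cell.get("metadata", {}).get("locked", False)

cells = [
    {"source": "df = load_huge_dataset()", "metadata": {"locked": True}},
    {"source": "plot(df)", "metadata": {}},
]
print([should_execute(c) for c in cells])  # [False, True]
```

The heavy setup cells stay locked and keep their state, while the light plotting cells remain free to re-run reactively.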
I don't think re-executing by default would be beneficial. I just don't want state present in the interpreter that isn't in any live cell, and I only want time to flow in one direction. Other than that, I think the ergonomics of notebooks are fine.
If the interpreter state contains large variables, checkpointing might not be viable (e.g. I have dataframes that are hundreds of GB, large fractions of total available memory; reading/writing to disk all the time would be relatively slow. If you can save deltas, I guess it wouldn't be too space-inefficient, but I imagine still slow).
At the same time, I do like the idea of an append only notebook where you can:
1. Only run cells in sequential order
2. Only edit cells that are below the most recently run cell.
Thankfully you can enforce this through coding practice, and the notebook is then relatively guaranteed to be "run all"-able. You will need to refactor it after the initial dirty run, but at least it's easy to reason about.
I want a notebook situation where the platform understands sampling, so that while I'm doing my EDA and initial development, and generally doing the kinds of work that are appropriate in a notebook, I'm never working with 100GB data frames.
I suspect that a big part of my annoyance about the current state of the data space is that parts of the ecosystem were designed with the needs of data scientists in mind, and other parts of the ecosystem were designed with the needs of data engineers in mind, and it's all been jammed together in a way that makes sure nobody can ever be happy.
You can already sample data if you want (or sequentially load partial data, which is what I usually do if I just want to test basic transformations), but if you need to worry about rare occurrences (and don't know their rate) then sampling can be dangerous. For example, when validating data there are edge cases that are very rare (sometimes I catch issues affecting fewer than one record per billion), and it can be hard to catch them without looking at all of the data.
Assuming the data isn't changed, thanks to CoW forking wouldn't cause any extra memory usage. If only a subset of data is changed, same thing - only the changed cells will take extra space. The problem only occurs when the whole variable changes - in which case yeah, you're SOL. I wonder what the usage patterns are for such datasets?
Personal experience: when first looking at the data I often do lots of map /reduce style operations which might transform large portions of the dataframe.
Question, if you use CoW then presumably your variable blocks are no longer contiguous, wouldn't this really slow down vector operations?
> Question, if you use CoW then presumably your variable blocks are no longer contiguous, wouldn't this really slow down vector operations?
I don't think so. CoW works at the granularity of virtual-memory pages: the forked process sees the same contiguous virtual address space as the parent, and only the underlying physical pages diverge when written to. From the program's point of view the arrays are still contiguous and still aligned, so vector operations should be unaffected. At least that's my understanding.
I've played with prototypes of this by calling fork on IPython to take snapshots of interpreter state https://github.com/thomasballinger/rlundo/blob/master/readme... but if you can't serialize state fully, rerunning from the top (bpython's approach) can work, or rerunning as a dependency dag shows is necessary (my current employer Observable's approach) works nicely.
Check out our reproducibility focused notebook Vizier (https://vizierdb.info). In Vizier, inter-cell communication happens through spark dataframes (and we're working on other datatypes too. This makes it possible for Vizier to track inter-cell dependencies and automatically reschedules execution of dependent cells. (It also makes Vizier a polyglot notebook :) )
* Information only flows downwards.
* Computations are cached.
* Things are automatically recomputed when the values they depend on change.
* Things with effects are instead cleared and await confirmation before running.
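That one-directional, dependency-driven model can be sketched with a toy DAG evaluator (illustrative only, not any particular notebook's implementation):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each "cell" declares what it depends on; values only flow downward.
deps = {"raw": set(), "clean": {"raw"}, "total": {"clean"}}
cells = {
    "raw":   lambda v: [1, 2, 3],
    "clean": lambda v: [x for x in v["raw"] if x > 1],
    "total": lambda v: sum(v["clean"]),
}

# Recompute in dependency order; changing "raw" would re-run everything
# downstream of it, and nothing upstream.
values = {}
for name in TopologicalSorter(deps).static_order():
    values[name] = cells[name](values)
print(values["total"])  # → 5
```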
I haven't dug into it myself, but Netflix makes something called Polynote that is supposed to add some awareness of the sequence of cells to combat this.
I used Mathematica’s notebook interface quite heavily 15-20 years ago; Jupyter’s interface is a clone of that in many ways.
At the time, my workflow was to use two different notebooks for everything: foo.nb and foo-scratch.nb. I’d get things working a piece at a time in foo-scratch.nb, not caring at all how it looked, not having to worry about leaving extra output or dead ends of explorations lying around; then the refined cells would be copied over to foo.nb, which would get pristine presentation, and which I could run top-to-bottom.
This workflow worked pretty well for me: very clean reproducible output, with the ability to easily refer back to all the steps of how I’d derived something, along with copious detailed private notes.
I never had to use it but I’m pretty sure each cell even had its modification time stored in the metadata in case I wanted to view a chronological history.
I make a "scratch pad" section of my notebook and work on ideas there. Then once I've pieced together a function line by line and tested it a bit I move it up to where it should be in the chronological order of the notebook. Kind of like your two notebook system but makes copying easier in Jupyter.
I do the same, though it feels dangerous because both the good-copy and scratch sections share the same kernel. JupyterLab works on .ipynb files, and makes it way easier to copy (or drag and drop) cells between different notebooks. One of these days, I plan to switch to JupyterLab to get a sense of what else it offers above Jupyter Notebook.
Superb talk! It's worth noting that a lot of the issues he brings up ultimately stem from the format in which Jupyter notebooks are stored. R notebooks, with their plain-text storage format and code-chunk parameters, solve some, but not all, of these problems.
Yea, when I first found that extension, I was pretty excited about it. But it ultimately is not a first class citizen in the way it is for R notebooks, so I simply don't feel as comfortable using this as I might otherwise have been.
As someone who has been programming for a very long time and using notebooks for a reasonably long time (I almost always start projects with them), my feeling is that they are a bit like C: they make it easy to accidentally shoot yourself in the foot if you aren't careful. I always strive to end up with a notebook that can be "Run All" from a fresh clone, and I'd say that I'm successful maybe 60-70% of the time, and close enough in the remainder that I can fix it.
As the article (and the many others like it that have frequently cropped up as soon as IPython Notebooks first started ramping up in popularity) points out though, a lot of newer users don't have the discipline to ensure that they're not jumping around too much. It's not a problem for them in the immediate term since they know how the state ought to work, but then it becomes a mess when they try to share it with someone else (or to run it themselves again 3 months later).
The challenge though is that the data analysis workflows that it allows are unbeatable by any other tools I've tried. In the end, it may just be that it's the worst form of data programming except for all of the others that have been tried.
I interned @ Google AI last summer; used notebooks nearly everyday. Estimated productivity gain is 3-5x.
Biggest tip I have is to turn auto reload on, then write the bulk of your code as modular functions and call functions within your notebooks. Keeps the notebook tidy and it’s easier to push your code this way.
It’s also easier for sharing since most people viewing your notebooks (mentors, people outside your team) are interested in results/artifacts such as metrics, generated text, images, audio, which notebooks display well (not your code).
(A frequent Jupyter Notebook user here. For data exploration, and teaching deep learning - then Colab is indispensable.)
The main question is: what are the alternatives, for data exploration (and sharing its results). Similarly, for data science tool demos, Notebooks shine.
IMHO the problem is not in the notebooks, but in how they are being used (i.e. the workflow). By writing scripts in py files, and using notebooks only to show their results (processed data, charts, etc) we get the best of both worlds.
The only built-in problem with Jupyter Notebooks is the JSON format, mixing input and output (and making it a pain to work with version control). But here RMarkdown (and a few other alternatives) work well.
Yes, the article mentions users copy pasting snippets from their personal "library". Well, that could just be made into an actual library of functions to call.
I'm currently at uni enrolled in an AI/ML degree, and there are a lot of people with no previous exposure to programming. It's just that most people don't know that these things are possible, don't want to learn another tool (IDE) and are not interested in longevity of the code, just in the results. This shouldn't sound like me complaining, I totally understand. I think a lot of the stuff could be solved with just better tooling, but a familiarity with software development is definitely helpful.
Also a while back streamlit (https://www.streamlit.io/) was here on HN and since then I've been meaning to try it. I think this could be a good approach to bring together the best of both worlds.
most people don't know that these things are possible, don't want to learn another tool (IDE) and are not interested in longevity of the code, just in the results.
This approach is arguably more effective than wasting time trying to refactor everything into a library of functions.
Programming-as-crafting needs to be more of a thing. Not everything is written to be long-lasting. Even HN ushered the ugly code into hook functions that weren't shipped with the main codebase.
> The only built-in problem with Jupyter Notebooks is the JSON format, mixing input and output (and making it a pain to work with version control)
Absolutely. JSON notebook format makes it very hard to do code reviews, merge in remote changes etc. After being frustrated with lack of solutions, I built ReviewNB[1] specifically to do notebook code reviews on GitHub. Alternatively, Nbdime[2] is also a nice open source library to see diffs locally & merge in changes.
It eases most of the pain regarding version control. You can use it as a 'git filter', so only inputs would be shown in diffs and committed (and also works with interactive adding!), while keeping outputs in your working tree.
I don't get why anyone who knows how to use an IDE would ever use a notebook; the coding experience is garbage in comparison. I understand they started as a way to get STEM kids coding quickly, but now they are like a standard in data analysis and data science, with those people needing experienced devs to translate notebooks into production code. This just drives the silo walls higher.
Doing data science in an IDE would be terrible. With a notebook, you get the chance to load the data, view it, clean it where needed, view it again, analyze it, model it and do anything else you need to it. An IDE means that you can't use the previous output to guide your next operation in a direct fashion like you can with a notebook.
> With a notebook, you get the chance to load the data, view it, clean it where needed, view it again, analyze it, model it and do anything else you need to it.
In a good data-oriented IDE like RStudio you get to do all of those things and write code which can be saved as plain text and can be version controlled well under git which you can't do well with Jupyter.
R folks may be the best indicator here, because they have access to a good IDE and good support for Jupyter. Their use is overwhelmingly plain text files in RStudio, a small portion of RMarkdown notebooks, and pretty much no one uses R in Jupyter.
Yes! RStudio is the one thing I miss most when doing data science in Python.
Notebooks give me some of the interactivity but the experience degrades significantly.
The Spyder IDE seems like an okay-ish replacement, but some of the libraries I use expect you to have HTML display (within a notebook) to give you full functionality, which is not yet available in Spyder.
No, but looking at some screenshots and descriptions, it seems to take me further from the code, which does not seem like what I am looking for.
RStudio gives you the experience of a classical IDE plus easy data exploration, which I found productive from the exploratory stages (where I need to see my data and the effect of my code) to the clean-up phase (where I refactor my file).
It is interesting to see this discussion about notebooks while I'm thinking about all the RStudio users who do all their work inside the IDE and are pretty happy. Notebooks seem like such an inferior tool to me. I'm also extremely biased.
That kind of depends on your process. In many cases pdb (or the debugging interface in your IDE of choice) works just fine for that. It's certainly not "terrible".
After the exploration and preprocessing stage I personally don't see much benefit of the notebook model, training/evaluation and any meaningful visualization takes forever anyway, that means I need to cache and persist intermittent results. With that it doesn't really matter all too much if I work on it in vim&pdb, an IDE, or Jupyter.
'Thinking about the data' most often requires looking at the data from hundreds of different angles, quickly investigating its properties and statistics, maybe plotting or fitting a few things, checking some hypotheses etc (all of the above code you will most likely throw out after the initial stage).
Same with the results - once you've coded something (perhaps outside of a notebook environment) and obtained results, verifying that they are what you expect is much more efficient to do in a notebook.
Maybe you use a notebook I'm completely unfamiliar with, but my experience is that they allow you to write code, run it, and save the results in cells. My IDE does all of that except saving partial results, but that can be done easily by just dumping your precomputed data to disk if you can't recompute it easily. In either case, an IDE gives you an actual debugger, plus with IntelliJ great data visualization plugins, a database viewer, great autocompletion, VCS integration, etc. What do you do when you need an actual debugger, or need to profile your code? What about documentation for the function you are calling? In my IDE this is a popup; in every notebook I've used, this is a Google search.
I use both PyCharm and JupyterLab on a daily basis, typically dealing with multi-GB datasets.
If I'm writing a library or adding new features to one, or writing tests I'll use PyCharm sure thing, otherwise the notebook is a quicker way to sketch prototypes and always have a kernel with preloaded datasets and pre-imported stuff ready at hand. I don't want to wait 10 minutes to just load the data every time I want to check if my new function works well on it at big scale. That's one of the most important bits.
PyCharm is a clear winner at actually writing code that you won't throw in the bin 10 min later, and once you know what to write.
Debugging? I don't remember ever using PyCharm's debugger despite the fact that it exists... either pudb or python-devtools or something else. I'd just write tests, and things start working in the process. And btw, you have a pdb debugger (some weak version of it) in Jupyter if you really need it. Docstrings? Press Tab twice in the notebook. Or keep PyCharm open on the side so you can cmd-B. Profiling? Never a PyCharm builtin; maybe something like flamegraph, but an external tool anyway.
> I don't get why anyone who knows how to use an IDE would ever use a notebook,
The Python IDEs for data science are mostly garbage - if you have any recommendations, I'm all ears because I really don't like notebooks but still keep switching between jupyter and vscode depending on what I'm working on.
I use IntelliJ for all my work, data or normal dev stuff, and it works great (all is python). Maybe there is just a workflow issue here where people are used to saving their data as they go in cells. I just write my algorithms all the way through, get a subset of data to debug against, then use the debugger to help me see what mistakes I made. I always run my code all the way through and only stop at the step I'm debugging. I like this better than saving the data from previous computations because I tend to refactor a lot and would need to rerun most of the notebook anyway. Also, rerunning it all the way through a lot makes me notice slow spots more than if I only ran that area a few times and saved the results. For me, this has the effect that those areas get more attention and my code is closer to production grade than if I had used a notebook workflow. My two cents, but give IntelliJ a try if you want a good python IDE.
I have found PyCharm to offer a good trade-off between data exploration and productionizing your code. It has the best Python debugger that I've used. You can also run Jupyter notebooks in PyCharm when that makes sense for you.
I don't know what this document is meant to do but you will have to take my Jupyter-lab instance from my dead cold hands.
I love notebooks, I work fast, line by line I execute commands and I immediately see the output (dataframes or graphs). For complex code I have an editor open (in jupyter-lab or vscode) for some functions and classes. But the main developing is done in the notebook, anything that ends in a module start in my notebooks.
As a biologist who learned to program after 30, I just don't understand how you can develop data-processing code without such a close handle on dataframes, and without checking in graphs/visualizations that your code does what you expect. I don't see how I would do that in pure VSCode or other IDEs.
I also don't understand this sentence: "Once the data is loaded, it then has to be cleaned, which participants complained is a repetitive and time consuming task that involves copying and pasting code from their personal "library" of commonly used functions." What is the alternative? Not cleaning the data? And why copy and paste when you can perfectly well have your own shareable module on the side? I guess most notebook users do some kind of hybrid development.
Good point on the last one; I think we have to 'educate' researchers on the fact that they can also write their own libraries and frameworks, and they should. Even basic data manipulation utilities can be made into Python modules and distributed at ease. If something is tedious, there definitely is a way to make it less so.
Came here to post the same thing. Nbdev helps fill in the strengths that IDEs are traditionally good at. Even if you don't use the full nbdev library and templates, the work flow makes sense. Write code in Jupyter, export to a python library, and you can use it everywhere else after that.
I suspect that a lot of the conventions I describe help mitigate problems described here, some of which should be strictly or optionally enforced by the notebook instead of the user.
(The site’s very much a work in progress, so expect to see odd and broken things if you go poking around.)
Thanks for this! Didn't know about the watermark extension, that looks useful.
I just started working with the Guix kernel for more easily reproducible and reusable notebooks. I suppose that's an alternative to using a conda environment.
No mention of https://observablehq.com notebooks? They’re the best I’ve found in the “Share and collaborate” and “As products” category. JupyterLab is still pretty great for exploratory stuff, but visualization possibilities in observable are incredible.
Problem is that Javascript doesn't have the scientific computing ecosystem that Python, R and Julia have. Jupyter supports those languages and any others that people write kernels for. And you can also execute bash, JS, CSS and HTML directly in python notebooks with magic commands.
Agreed, Observable fixes a lot of the problems I've had with other notebooks. It can still be fiddly for code over a certain size/ complexity but the ability to import from npm modules goes a long way to fixing the problem. The user base seems to be predominantly drawn from the visualisation side of things + the fact that it's javascript may limit its uptake in science/maths areas. Aside: I've felt for a while that JS is really missing decent maths/stats libraries, any suggestions?
The problem of notebooks has been solved by the Python extension in Visual Studio Code (and some other editors too, although VS Code is the one I'm most familiar with).
Editing an ordinary Python file, if you insert the comment "# %%", everything between that comment and the next "# %%" (or the end of the file) becomes a code cell that can be submitted to the IPython kernel, just as in a Jupyter notebook. The editor splits into two halves: the left half is your Python file, the right half the Jupyter window with submitted code and formatted output (e.g., DataFrames look pretty, plots display normally, etc.). When you're done running everything, you can export the result as a Jupyter notebook. Because you're editing an ordinary Python file, standard features like version control and importing the file into other modules (you cannot normally import .ipynb files IIRC) work normally.
And of course since VS Code is a real editor/IDE, you can double click a file and have it open right up (no resorting to a Terminal to start your Jupyter session) and you get syntax themes, a built in Terminal, a git UI, code snippets, documentation on hover, vim mode if that's your thing, etc.
The only downside I've found is that the Python extension doesn't incorporate ipython's autocomplete in its own autocompletion, but that's a small price to pay for getting to treat .py files as notebooks.
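A sketch of what such a file looks like (the cell contents are just an illustration):

```python
# analysis.py - an ordinary .py file; the "# %%" comments mark cell
# boundaries that the editor sends to the IPython kernel one at a time.

# %% Load and prepare (cell 1)
data = list(range(10))

# %% Quick summary (cell 2) - re-runnable on its own against live state
total = sum(data)
print(total)  # → 45
```

Since it is plain Python, the whole file also runs top to bottom as a normal script, which is exactly the reproducibility property notebooks struggle with.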
So literally all of these complaints are about their particular implementations of notebooks, not the concept of computational notebooks in general, or are all computational notebooks destined to have unstable kernels?
In my mind, notebooks should be married to a functional style of programming, where you use the notebook's markup to thoroughly explain and document your functions. Below your "function definition" section, you keep a "trying things out" section where you actually plug the data into your functions for debugging/visualizations. You can't shoot yourself in the foot with variables because all the work is done in your function's lexical scope. You can shoot yourself in the foot with stale function definitions, but a good notebook interface gives you the ability to clear function definitions and run groups of cells, so you can make sure you always run your functions in a group that starts with a "clear function definitions" cell.
When you are done, you just cut the "trying things out" section into a second notebook which references the functions in the first and viola, you've got a very well documented library of functions, and a new work notebook where you can freely polish your visualizations/whatever.
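As a sketch, the two sections might look like this (toy functions, not a real analysis):

```python
# --- "function definitions" section: documented, pure functions ---

def clean(records):
    """Drop missing values; all work stays in the function's scope."""
    return [r for r in records if r is not None]

def summarize(records):
    """Basic descriptive stats for a list of numbers."""
    return {"n": len(records), "total": sum(records)}

# --- "trying things out" section: plug data into the functions ---
raw = [3, None, 4, None, 5]
summary = summarize(clean(raw))
print(summary)  # → {'n': 3, 'total': 12}
```

Because no intermediate state leaks out of the functions, re-running the "trying things out" cells in any order always gives the same answers.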
This is a solid list. It will be even better if juxtaposed with current efforts to solve each of these problems - every DS I know is addressing at least 2-3 of these with some pet tools in their own environment. For example, we use Panel and Holoviews to make data exploration much easier. I have a feeling the ecosystem would improve faster if we had an index of (partial) solutions aligned with this problem set.
One category left out of the list: testing of data pipelines (c.f. great expectations).
I use Jupyter Lab with Python every day. It's where I do my initial data exploration and cleaning. Jupyter Lab is not perfect, but most of these findings seem like they are more issues of inexperience with technology and programming, not computational notebooks.
I have been heads down in jupyter for the past couple of weeks and I finally realized I just DO NOT LIKE IT AT ALL! Cracks started appearing and then suddenly there was an avalanche of disappointment.
The first crack -- it's almost impossible to build a nice presentation in Jupyter, because you always have to show your code and its stderr. I imported all the TeX goodness, and it looked pretty nice, but I couldn't show the output without showing the TeX code. Importing the TeX interpreter is quite non-standard and means that my notebook doesn't play well with the public servers. I also got burned by some kind of permissions issue, so that all my charts ended up being invisible to read-only users.
The second crack -- I can only look at the code from within my own jupyter server. The source is buried in a very noisy json format.
The third crack -- Who wants to write code in the impoverished browser based editor provided? How many times have I deleted a closing brace that was automatically inserted incorrectly? How can I do a global search and replace?
The fourth crack -- I can't test my code unless I include all the tests in the notebook!
I'm complaining. I realize that I don't have anything constructive to offer, and I'm really a beginner. However, I think some of my disappointment is justified, as I think it was reasonable to assume that I could build my notebooks to be next level presentations.
So once your code gets large enough that it doesn't fit neatly within a jupyter notebook, it's time to split the code out into another package, and then import it into your notebook.
The benefit here is that your code can now be used inside the Jupyter notebook, and also inside, say, a webserver.
I've started to use jupyter with kdb on the back end for analysis. For me I have some hope it will hit the sweet spot because:
1. kdb is "too obtuse" for many and python glue makes it more amenable
2. I can still have kdb functions in source code and call them from jupyter with pyq
3. I can do most of my "editing" in emacs to the kdb back-end, write python "libs" for parsing results, and just use jupyter as a fairly thin presentation layer
4. I can share notebooks with analysts who run the same jupyter server virtual env, so finally we can share notes
Will this add more value than just using kdb? Time will tell, hard to know right now.
I agree that the "ide" experience absolutely sucks.
This all sounds very familiar to me. I'm at a robotics company; we had some experimental infrastructure built up around processing ROS bag files via notebooks, and it just eventually became like pulling teeth. Stuff would get cut and pasted between notebooks, or moved out to helper modules which then had versioning and permissions chaos. Each bag needed its own notebook/interpreter instance because there's no way to rerun a notebook on new data, but then the server would explode because of these massive Python processes hanging around with half-processed data state still in them.
In the end we dumped it all and turned the good parts into a sane CLI tool which ingests data and dumps out Bokeh plots. At some point we'll throw a Jenkins front end on it, but the current approach seems to be working fine.
To offer some help for 2 and 4, you can get a script out of a notebook with jupyter nbconvert <notebook.ipynb> --to python, which you can even include as a cell that starts with ! since that will run shell commands.
For part of 3, there is global search and replace accessible either via Edit -> Find and Replace or Esc-F, and it includes case sensitivity toggles, regex search, and the option to change the current cell or globally.
For 1, I think there are some plugins that can make that tidier, but I've generally just accepted that if the people I'm presenting to don't want to see code then I'll just need to make a set of slides out of the whole thing.
>it's almost impossible to build a nice presentation in Jupyter, because you always have to show your code and its stderr.
You should be able to see a blue vertical bar to the left of every code cell. Click that bar to collapse the cell. You can do this for both the code and the output. I know this works for Jupyter Lab, but I don't know about legacy notebooks.
I think Atom’s hydrogen and VSCode’s python are best-in-class Jupyter clients that achieve everything Jupyter Lab set out to do with more and better features. I develop scripts that function top to bottom with a notebook side-by-side that on a keyboard stroke executes code blocks from my script in the notebook.
I think Computational Notebooks are a great idea, and yet I have the feeling that we are in the process of seeing them overapplied. They are wonderful for certain situations, and teaching or demonstrating code to others is right in its sweet spot.
I get the impression that people are creeping in the direction of trying to do everything with one tool, which sounds like it would end up in the same swamp that Eclipse went into. Sometimes, you need to use different tools for different tasks, and not everything should integrate. Just my opinion.
I think that all the pain points of the article are a result of not using notebooks for their purpose. In my opinion, notebooks are good for:
1. POC/MVP: Showing that what you want to do will work before making a full structure.
2. Creating PDF/HTML documents with code and output.
3. Exploratory data analysis and visualization.
I think many of the data scientists in the article go well beyond what a notebook is for. A notebook is where you start; it should never be a production tool.
Jupyter notebooks are great for many purposes. They have, however, two really tragic shortcomings:
1. They are stored by default in stupid json files instead of plain source code with comments.
2. The text editing interface inside the browser is horrific and very difficult to normalize (e.g., disabling "smart" closing of parentheses, disabling the capture of classic Unix copy-pasting, etc.).
This is a great list, and totally matches my experience. I also agree this is solvable with tooling.
A) VS Code / IDE needs to be the primary editor
B) Results are not stored with source
C) Export (build) allows packaging for whatever platform.
Python notebooks especially also use some crazy mutable APIs. In general notebooks align with other code written by people who aren’t usually software engineers building production systems. They’re much more about getting things done, APIs and tools are less questioned, a lot of pain is swallowed because PhDs have plenty of time to write a few lines of code. I don’t want to sound disparaging towards these people, it’s just a different set of tradeoffs from writing production grade software.
Logging, monitoring, security, versioning, etc. These are things that most often get ignored due to ignorance or inexperience, but are required for production grade software.
As a computer scientist/software engineer, please allow me the question: why would I prefer a notebook over, e.g., equivalent Python script(s) in a git repository?
I first saw Jupyter notebooks when my sister (physicist, non-programmer) used one to analyze economic data with pandas. Run-time for the full data set was half a day (and IMHO for that analysis SQL would have been better suited). I understand that to a non-programmer it looks alluring, but once language proficiency is built up, why not use an IDE and run the code in a shell?
If step A takes 5 minutes (and 5 minutes is a very short time) and I want to experiment on step B, then I don't want to rerun step A each time while I'm writing and running code that helps me understand what step B is going to be; I'd want that to be interactive and immediate, not have each rerun take 5 minutes.
Storing/loading to disk is not a good option because all the data that needs to be stored is not yet determined until the exploration is finished; If I write code to save/load A, then I need to change (and test) it after I'm done with B and now want to experiment with C, and it all becomes even more complicated when I need to add an extra step and data field to step A and rerun everything. Deciding what data should be stored in what format is something that you can do in 'productionizing' the code after you've done the exploratory analysis.
REPL is not a good option because it's not convenient to save and replicate the code that got you to the current REPL state.
The other aspect is that visually 'debugging' intermediate data through various plots is not conveniently possible in IDEs. I could generate some picture files in a folder, or possibly an HTML 'dashboard', to see the results of my most recent run, but that takes extra code and effort, and the results aren't immediately in my face like in a notebook.
Ah, I think I have a hugely different approach to data processing: For my work I often have a very good idea what the output should look like, and what transformations are required on the input to get there. E.g. when processing log files to generate an overview page, or (as I'm doing right now) adding a target to binutils (assembler, linker,...). (Obviously I'm not a data scientist ;-)
With what you describe, intuitively I would use a library that allows me to store & load the data per step (verifying that the structure matches), or pass it in memory. Think JSON (yeah, slow) or something like protocol buffers. That way I could persist intermediates per step during development (or in case B is in another language than A), and in production just
> B(A(read(some_other_input)))
But yeah, that's just my intuition of course. Maybe I'd be a bad data scientist.
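That per-step store/load intuition can be sketched in a few lines. This is only an illustration, not anyone's actual pipeline: `A`, `B`, and the field names are invented stand-ins, and JSON is used as the (slow) interchange format mentioned above.

```python
import json
from pathlib import Path

def A(records):
    # hypothetical step A: derive a field per record
    return [{**r, "double": r["n"] * 2} for r in records]

def B(records):
    # hypothetical step B: aggregate the derived field
    return sum(r["double"] for r in records)

def store(records, path):
    Path(path).write_text(json.dumps(records))

def load(path):
    return json.loads(Path(path).read_text())

# During development, persist the intermediate so B can be rerun alone:
#   store(A(load("input.json")), "step_a.json")
#   result = B(load("step_a.json"))
# In production, just pass it in memory:
#   result = B(A(load("input.json")))
```

The structure-verification part (e.g. with a schema library or protocol buffers) is omitted here; plain JSON round-trips only check that the data serializes at all.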
However, can't you just experiment with smaller data sets? That's what I usually do if processing is slow (e.g. instead of parsing 10GB of log files, I'll just do 50MB to verify the processing pipeline works, and once it does, run it on the full 10GB and grab a coffee while it runs). Is that not an option for data science?
Sure, if you know what the output should look like and if it's possible to e.g. write tests to verify that it's correct, then jupyter notebooks would not be the proper tool to use.
The intended use case of these notebooks is scenarios where the main output is not the code and not a particular set of transformed data, but the knowledge gained during a 'computational exploration' of that data. With that knowledge in hand, you can then build 'productionized' code with different methodologies (possibly, but not necessarily, using or adapting large parts of the code in your notebook), if that's needed - and in such data analysis scenarios it often happens that it's never needed.
Sampling a subset of the data sometimes works. Sometimes it would alter the results substantially and drive the exploration in a wrong direction; questioning and verifying assumptions is important, and it can make a big difference whether all A's are also B or only 99% of them are.
The main reason is that one has to fiddle with the code a lot and re-running the whole thing is much too slow.
A common example is that a huge text file containing experimental data gets parsed in the beginning. Then you have to explore the data step-by-step using all kinds of visualization and analysis such as Fourier transforms, curve fitting, etc.
If you simply put everything into one giant python script, for every step you have to re-run the entire thing which takes forever. Of course you can speed things up by writing intermediate results to disk, but this adds tons of boilerplate code and is quite error prone.
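That save-to-disk boilerplate might look roughly like the sketch below (names like `parse_logs` and the cache path are invented for illustration). The error-prone part is visible in the comments: the cache silently goes stale whenever the parsing code or the input file changes.

```python
import pickle
from pathlib import Path

def parse_logs(path):
    # stand-in for the slow parsing step at the top of the analysis
    return [line.upper() for line in Path(path).read_text().splitlines()]

def cached_parse(path, cache="step_a.pkl"):
    # load the cached result if it exists, otherwise compute and save it.
    # NOTE: nothing invalidates this cache when parse_logs or the input
    # file changes - exactly the error-prone behaviour described above.
    cache = Path(cache)
    if cache.exists():
        return pickle.loads(cache.read_bytes())
    result = parse_logs(path)
    cache.write_bytes(pickle.dumps(result))
    return result
```

Multiply this by every intermediate step and it's easy to see why keeping everything live in a notebook kernel feels simpler.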
One alternative would be to write individual scripts for each step and read them into an interactive REPL shell. However, then you still have to somehow record the proper execution order if you ever want to repeat the analysis.
https://datalore.io has
(1) a reactive Datalore kernel that solves the reproducibility problem. It recalculates the code automatically when something is changed, and recalculates only the changed and the dependent code;
(2) good completion;
(3) online collaboration;
(4) read-only sharing;
(5) publishing;
(6) sensitive data can be saved in a .private directory that is not exposed when the notebook is shared with read-only access.
it seems it's cloud-based. Fun for playing around but not suitable for real work (at least for me).
I can't just upload random data to some cloud service to work with it, also I can't upload data if it's too big. Often the data that's valuable is very sensitive.
0. They try to be "be-all, end-all" proprietary container documents, so they lack generality, compatibility and embeddability. It would be better if live-code try-out snippets were self-contained and embeddable in other documents: HTML, other software, maybe PDF, LaTeX or literate programming formats. Maybe there should be standard, versioned interpreters for each kind of programming language in WebAssembly, cached for offline usage by the browser, for inclusion in documentation, papers, etc.?
1. For prototyping, it is better to have try-out live code (and/or REPLs with undo), like what Xcode/iOS Playgrounds are for Swift or what ReInteract was for Python.
2. The computational notebook software that I've seen is terrible: complex, fragile and messy to install. The ones I've seen make TeXLive look effortless by comparison.
3. Beyond replicability, what goals are CNs really trying to solve?
3.0. For replicability itself, why not have a GitLab/BitBucket/GitHub repo for code and a Docker/Vagrant container one-liner that grabs the latest source when built? Without a clear, consistent and simple build process, there is no replicability, only wasted time, headaches and fragile/messy results.
3.1. Are CNs "hammers" for "nails" that don't exist?
> Maybe there should be standard, versioned interpreters for each kind of programming language in WebAssembly and cached for offline usage by the browser for inclusion in documentation, papers, etc.
This would be incredible. Even better, the output from the code (like graphs) should be able to be embedded in the paper. You have no idea how many papers have errors in the code that generated the graphs/statistics/etc. and nobody can tell because the authors rarely release the data, let alone the source
For WASM, there ought to be a package-management/registry mechanism for installation (unless there is one already? It might get complicated, but it would seem a good idea to reuse code/plugins)... or, as below, there ought to be some caching-priority mechanism.
Then for HTML assets (and CSS ones too), perhaps asset-linking tags (a, script, link, img, audio, video, etc.) ought to have an offline-priority attribute to help the browser decide what to throw away when clearing the cache the regular way or evicting items, while leaving things deemed vital when not nuking the entire cache. Yes, websites could be goofy and game caching mechanisms, marking everything "vital" like 0-pixel image cookies, but I'm sure someone would make an "RBL" (real-time blackhole list) system of which priorities on which websites to ignore.
Related aside: There's a lot of common frameworks, libraries and bits that could be cached user-side, with the trick either to a) herding web devs to de-fragment their CDNs, which could create SPoF's or b) changing the standard allowing multiple SRCs or HREFs for high-availability/less bitrot to preserve both choice and encourage de-duplication of common assets. [0]
Here's an idea - what if you could put a `hash` attribute on a script tag? After downloading whatever it links to, the browser checks that the hash matches the one you provided. Then it could also cache the result and reuse it whenever the hash matches, even if the link pointed elsewhere.
Their observations bring to mind the benefits of watching people program on YouTube or video where you learn a style of working you may not even have considered.
However, there is one other issue that is not on the list: because a notebook is meant to be read or shared, I always feel like my work is public and feel less inclined to play around and just take a look at things. When I do "transfer" my work to a notebook, it's only the surprising or interesting things, which suppresses the discovery process.
One thing that I find to be incredibly useful is the keyboard shortcut `00` (press zero twice while focus is outside of a cell), which will restart the kernel, clear all output and re-run the whole notebook.
This way I'm sure that the "library code" I'm editing in parallel in a real text editor is up to date in the notebook, and it also limits the confusion due to out-of-order execution problems.
The overall workflow is something like this:
1. explore using thing.<TAB>, thing?, and %psource thing
2. edit draft code chunk or function
3. when a chunk is 80% done, move it to a module and replace it with an import statement
4. press 00 to re-run everything, then GOTO step 1
The key to preserving sanity is step 3—as soon as the exploration phase is done, move to a real text editor (and start adding tests). Don't try to do big chunks of software development in the notebook. You wouldn't write an entire program in the REPL, would you?
Sometimes I keep the notebook around as a record of failed explorations or as a "test harness" for the code, but most of the time it's disposable, since all the useful bits have moved into a normal Python module/script under version control.
Another useful tip which doesn't require always doing '00': when editing the library code, import things like this:
import importlib; import mylib; importlib.reload(mylib); from mylib import foo
Then, in most cases except some very entangled ones, you can simply rerun this cell without having to restart the kernel (which is especially valuable when a restart would mean reloading all the data).
The reality is — notebooks are and need to be developed as an app platform ...
In order to do notebooks properly — you need:
1. discovery (ideally static discovery) of all the state the notebook needs, and the bulk of the state the notebook will or could manipulate during its execution. Your container needs to intercept the filesystem and networking APIs that will be invoked, so that the state resulting from these operations can be observed by the runtime and shimmed appropriately for reproducibility and performance optimization
2. The notebook (and the runtime-inferred model of all the required inputs) needs to be repo-stable - I should be able to write a notebook app that reads from the file system on my development host, deploy it somewhere, and the runtime should ensure that however that post-deployment file system read is implemented, it matches my local development semantics
3. A platform-level dependency graph needs to exist to model re-execution requirements automatically, incorporating code changes and external state
Apple could build this, and a "notebook OS" would be the correct conceptual framework for it... anything less is always going to leave us severely wanting.
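The dependency graph in point 3 can be illustrated with a toy sketch: given which cells read which other cells' outputs, compute the minimal set that must rerun after an edit. The cell names here are invented for illustration.

```python
from collections import defaultdict

# toy cell dependency graph: each cell lists the cells it depends on
deps = {
    "load":  [],
    "clean": ["load"],
    "plot":  ["clean"],
    "stats": ["clean"],
}

def downstream(changed):
    """Return the changed cells plus everything that transitively
    depends on them - i.e. the minimal re-execution set."""
    rev = defaultdict(list)
    for cell, parents in deps.items():
        for p in parents:
            rev[p].append(cell)
    out, stack = set(changed), list(changed)
    while stack:
        for child in rev[stack.pop()]:
            if child not in out:
                out.add(child)
                stack.append(child)
    return out
```

A reactive kernel (like the one Datalore describes upthread) is essentially this plus change detection: edit `clean`, and only `clean`, `plot`, and `stats` rerun, while `load` keeps its cached result.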
My main use for notebooks is a simple way to constantly hold a whole large dataset in memory. That way if I want to try some feature reduction or remove some bad result, I can just do that and not wait 10 minutes for my slow PC to rerun my import code. I feel like an easy way to do that in base python would draw me away from notebooks.
Obviously, it's pretty hard to make general criticism of the notebook GUI, especially without comparing it to a specific other user interface for data scientists, such as a traditional REPL terminal or other command-line tools.
The Python world gives a good example of the sheer complexity of notebook infrastructure. There is IPython, there is Jupyter Notebook, there is JupyterLab. There is even stuff like SageMathCloud (nowadays called CoCalc), which is basically a web GUI to a VPS combining command lines and various notebooks. And hell, most of these web-based interfaces try to make sharing easy.
Maybe we should start comparing these (mostly OSS) tools to the traditional notebook GUIs of Matlab and Mathematica, something we used in the 90s and 2000s. From my feeling, they were more robust and could handle large data better, but they lack all the tooling we get for free on the web.
Jupyter Notebook/Jupyter Lab has replaced IPython as the notebook front end.
I suspect 90%+ of Python notebook work is done in Jupyter/JupyterLab (or things built on it like Google Colab or Kaggle Kernels).
> traditional notebook GUIs of Matlab and Mathematica, something we used in the 90s and 2000s. From my feeling, they were more robust, could handle large data better
I've done tens-of-terabytes analyses on Jupyter (Spark backend), and I personally know people doing petabyte-scale work on it, so this seems doubtful.
You probably were careful enough to understand the limits of the Jupyter server and client (frontend).
It's easy to screw up a terminal application in data science by dumping a large array. Many REPLs cannot handle this properly (and Ctrl+C won't work). It's easy to test: what does your favourite notebook do when you run something like (Python 3 here)
print(list(range(int(1e7))))  # or 1e8
In this particular example, the python CLI seems to handle keyboard interrupts fine when the terminal (or RAM) is flooded.
I'm not sure I understand the issue about the user repeatedly tweaking parameters for their data visualization. If anything, that is a reason notebooks are so nice. The repeated tweaks aren't due to the notebook format; they happen because tweaking is an inherent part of the data visualization process, where it's hard to predict how the result of a particular parameter choice will look with a given data set. The same process would occur whether one was using a notebook or a script, but with a script it becomes much more cumbersome to actually see the result. In a notebook, the parameter tweak is immediately followed by the result.
I definitely agree with most of the other points though.
I think it's easy to do notebooks wrong, but possible to do them right. I try and do quick prototyping in notebook cells before moving it off to a separate .py file, and avoid keeping any code that does anything other than visualisation or parameter setting inside a cell long-term. That way, if you need to run something "in production" (whatever that means in your context), you don't end up having to pick apart and re-write your code -- you just import the .py file you wrote along the way.
For me, notebooks are a super handy way of visualising and sharing results during meetings, and it's difficult to imagine a more convenient alternative.
I tried to encourage our team to use notebooks, but everyone prefers using PyCharm and git for sharing code. We don't have much visualization, which might be the reason, but I was surprised how many people just hated it.
Notebooks are not so much for writing programs or collections of functions; they are better suited to a style of "code plus explanation", with flexible inline charting of the data itself.
Are you using OO? I'm still not sure how to "explain" an OO system once it's sophisticated enough - it's better than go-tos everywhere, but not by much. Of course, trigger-based systems (GUIs, event systems) have the same issue.
Interesting study; I like the mixed-method approach. A quick glance at the industries of the participants suggests there might be a bias towards structured data (which I think is actually acceptable, as that makes up a huge chunk of non-academic ML notebook work).
Edit: The authors acknowledge this in the "Limitations".
This is akin to reviewing how well a screwdriver drives nails. Yes, it has problems. That doesn't mean it's a bad tool - you're just not using it right. Does it require discipline? Yes, but so does the screwdriver.
That being said, I think jupyter specifically has some legacy issues around format, and I prefer R markdown. As much as I love pycharm, it's never going to do more than replicate the notebook experience.
IMHO, given that the main author publishes on code UI/UX, the title seems more like clickbait. Not sure why it's so upvoted.
I quite like the Spyder approach: Pure python code that is segmented into cells by inserting a special comment line.
The cells can then be individually executed in an IPython shell, or the entire script can be run with the regular Python interpreter. This makes it easy to tweak individual parts without having to re-run everything. In contrast to Jupyter notebooks, you still end up with a valid Python script that can be easily version-controlled.
I just wish that I could use vim instead of the Spyder IDE.
It lets you do linear execution of blocks like in Jupyter, but in a normal .py file. Obviously more lightweight than Jupyter and you get to use your regular editor.
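A minimal sketch of that cell-comment convention (the `# %%` separator is what Spyder and similar editors recognize; the analysis itself is invented for illustration):

```python
# analysis.py - a plain script segmented into cells with "# %%"
# comment markers. Each cell can be sent to an IPython console on
# its own, and the whole file still runs as an ordinary script.

# %% Load (stand-in for a slow parsing step)
records = [{"x": i, "y": i * i} for i in range(5)]

# %% Transform (tweak and rerun this cell alone while exploring)
total = sum(r["y"] for r in records)

# %% Report
summary = {"n": len(records), "total": total}
print(summary)
```

Because it's just comments, the file diffs cleanly in git and needs no special runtime, which is the versioning advantage over `.ipynb` JSON.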
I work at https://www.deepnote.com/, we are trying to tackle some of the pains mentioned in the article (setup, collaboration, IDE features like auto-complete or linting).
We are still early access, but if you are interested in an invite just let me know. My email is filip at deepnote dot com.
Deepnote seems quite interesting, but as a cheapskate grad student, I'm compelled to ask.
If this information isn't private, what sort of business model do you use? I take it you'll have a SaaS subscription model? I see it's free to use now, but how does your company plan to make money (especially taking into account the cost of the cloud hosting Deepnote requires)?
Our goal right now is to build the most amazing data science notebook. We need a lot of feedback to get there, that's why we are keeping it free. But since the servers also cost us something, we haven't opened up Deepnote to the public just yet.
Once in GA, we know we can support students on a free tier almost indefinitely (it doesn't really cost that much) while offering more advanced features on a subscription model for teams and enterprises.
I'm curious: how do people keep a record of their experiments? When I started, I used to just keep the cells, but that led to very long and impossible-to-parse Jupyter notebooks. I have since opted for keeping a journal.txt file in Atom where I write down hyperparameter configurations, epochs run, and results (for ML).
But that feels a bit awkward as well.
At Gigantum, we're trying to solve some of these issues too. A Gigantum Project lets you run Jupyter or RStudio in a container that is managed for you. Everything is automatically versioned, so you can sort out exactly what was run, by whom, and when.
One idea for a pain point not mentioned: better variable persistence. If I declare a variable, then delete the cell I declared it in, the variable persists. I've had this cause issues because if I use the deleted variable by accident, it will work fine right up until a kernel restart.
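A toy illustration of that hazard, simulating the kernel namespace with a plain dict (the variable names are invented):

```python
# "Deleting the cell" removes the source, not the variable it defined.
ns = {}
exec("threshold = 0.5", ns)        # cell 1 (later deleted by the user)
exec("ok = threshold < 1.0", ns)   # cell 2 still works from kernel memory

fresh = {}                         # kernel restart: a clean namespace
failed = False
try:
    exec("ok = threshold < 1.0", fresh)
except NameError:
    failed = True                  # now it breaks, long after the delete
```

The deferred failure is the nasty part: the notebook keeps "working" until the next restart, which may be days after the cell was deleted.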
I’m surprised no one has mentioned what I see as the biggest failings of notebooks: poor handling of connection loss / re-connection. The kernel will continue to run, but a connection hiccup will often make the notebook UI stop updating (and lose any kernel output).
Notebooks are bad and unreliable. You end up repeating your code all the time, and you are limited to working with smaller datasets. If you are into visual data analysis, use Orange or other similar data mining tools.
We allow usage of notebooks only for presentation purposes.
It is interesting to me how this talks about "computational notebooks" but it seems to be about Jupyter and derivatives thereof -- RMarkdown notebooks run inside of the RStudio IDE, and they don't use the term 'kernels' like Jupyter does.
What are the currently available CI options for notebooks? You'd think this would be one of the first tools people would need to make sure notebooks are reproducible, but there seems to be little sign of CI usage.
Streamlit is imo the best alternative. I was a beta tester and I found that it encouraged good coding practice without sacrificing too much functionality. I highly recommend that other data scientists check it out.
"Free software" is about freedom, not price. Putting the research that is your life's work inside of proprietary software that can be taken away from you, forever, at any time — that seems foolish.
Mathematica is great for symbolic mathematics and terrible for anything else.
The awful control flow syntax makes reading longer scripts pretty much impossible. Plotting is very clunky and by default produces output files that are essentially unreadable.
Of course one can somehow work around these issues, but it is much easier (and free) to just use python.
Why not teach data scientists how to write software effectively? Those are smart people, it’s not like using version control, writing unit tests and extracting common code into libraries is rocket science.
What is wrong with it? Many things, but let us appreciate how to use it better instead of merely criticizing it.
The world is so much better with you alive, and so is the computational notebook. I'm not sure the study covered R notebooks, which are really good for sharing information and analysis; I just wonder how to use them better.
Of course they can always improve it. But I would promote more expansion: how about a Lisp notebook, a Clojure notebook, a JS notebook, or a Forth notebook?
The real problem is whether you can have an OO notebook. Notebooks are more "serial", oriented around graphics and data, not around "messy" classes or trigger-based systems. Hence, if I may, the real problem is scoping: it is so hard to visualize a live OO system, unlike a live functional or even stack-based system.
It is not the tool itself that is the problem. Even an imperfect tool has its uses, as long as it is in use. But where it cannot reach, an alternative may have to be considered - just as, where we cannot go ourselves, we send our Voyagers outside the solar system.