Comment by mcpar-land 2 days ago

What is the benefit of parameterizing a jupyter notebook over just writing python that's not in a jupyter notebook? I like jupyter notebooks for rapid prototyping but once I want to dial some logic in, I switch to just writing a .py file.

singhrac 2 days ago

We use papermill extensively, and our team is all good programmers. The difference is plots. It is a lot easier to write a notebook (or modify our existing template) to create a plot of X vs Y than it is to build and test a script that outputs e.g. a PDF.

For example, if your notebook runs into a bug, you can just run all the cells and then examine the locals after it breaks. This is extremely common when working with data (e.g. "data is missing on date X for column Y... why?").

I think most of the "real" use cases for notebooks are data analysis of various kinds, which is why a lot of people dismiss them. I wrote a blog post about this a while ago: https://rachitsingh.com/collaborating-jupyter/
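
To make the pattern concrete, here is a minimal sketch of driving a plotting template with papermill (the file names and parameters are illustrative, not our actual setup). The template's first code cell is tagged `parameters` and holds defaults; papermill injects the values passed at run time:

```python
import papermill as pm

# The template notebook's first code cell (tagged "parameters") might hold defaults like:
#     column = "price"
#     start_date = "2024-01-01"
# papermill records the injected values in each executed copy, so every output
# notebook documents exactly which plot it produced.
for column in ["price", "volume", "spread"]:
    pm.execute_notebook(
        "plot_template.ipynb",                # shared template (illustrative path)
        f"reports/{column}_report.ipynb",     # executed copy with rendered plots
        parameters={"column": column, "start_date": "2024-01-01"},
    )
```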

gnulinux 2 days ago

It's a literate programming tool. If you find literate programming useful (as in Donald Knuth's TeX, itself written as a literate program), then you can write a Jupyter notebook, add text, LaTeX, titles, paragraphs, explanations, stories, and attach code too. Then you can just run it. I know this sounds pretty rare, but it is mostly how I write code (not in a Jupyter notebook; I use Markdown instead and write code in a combination of Obsidian and Emacs). To me, code is just writing; there is no difference between prose, poetry, musical notation, or computer programming. They're just different languages that mean something to human beings, and I think they're done best when they're treated like writing.

  • zelphirkalt 2 days ago

    Does it support more literate programming features than the small set that a normal Jupyter notebook supports?

    I always wish they would take a hint from Emacs org mode and make notebooks more useful for development.

    • gnulinux 2 days ago

      No, it supports less, actually. Obsidian is only a Markdown editor; it does let you edit code fragments like code (so there is basic syntax highlighting, auto-indentation, etc.), but that's it. I personally find this a lot easier in some cases. I find that if the code is so complicated that you need anything more than just "seeing" it, you probably need to break it down further into its atomic elements. For certain kinds of development I do need to be in a "programming groove", and then I use Emacs. But other times I accompany the code with a story and/or technical description, so it feels like the end goal is to write the document, not the code. Executable code is just an artifact that comes with it. It's definitely a niche application as far as e.g. the industry goes.

  • crabbone 2 days ago

    I have to disagree... Literate programming is still programming: it produces programs (but with the extra effort of writing documentation up front).

    Jupyter is a tool for exploratory, interactive programming. Most notebooks I've seen in my life (probably thousands at this point) are worthless as complete programs. They are more akin to shell sessions, which, for the most part, I wouldn't care to store for later.

    Of course, Jupyter notebooks aren't the same as shell sessions, and there's value in being able to re-run a notebook, but they are so bad at being programs that there's probably a number N in the low two digits where, if you expect to run a notebook more than N times, you are better off writing an actual program instead.

    • gnulinux 2 days ago

      Literate programming is not just "documentation + code" any more than a Calculus textbook is "documentation + CalculusCode" or a novel is "documentation + plot". It goes way beyond that: with literate programming you attach arbitrary text that accompanies the code, such that fragments of your code are simply one part of the whole text. Literate programming is not just commenting (or super-commenting); if it were, you could use comments. It is the practice of embedding fragments of code in a separate text such that you can later use that text the same way you use code. When you write a literate program, your end goal is the text and the program, not just the program. You can write a literate program and publish it as-is as a textbook, poem, blog post, documentation, website, fiction, musical notation, etc. Unless you think that all human writing is documentation, literate programming is not just documentation.

      • crabbone a day ago

        Yes. I tried it. And, eh... it's documentation + code (you can publish code + documentation as a textbook, poem, blog post, or website just as well). No need to exaggerate. It's also very inconvenient to write, for zero benefit. It's kind of like writing prose in one language and then translating individual pieces of it into another language, while hoping that somehow the sum will still come out OK.

        Some people like a challenge in their lives... and I don't blame them. For sport, I would also rewrite some silly programs in languages I never intend to use, or do some code golfing, etc. Literate programming belongs in this general area of making extra effort to accomplish something that would've been trivial to do in a much simpler way.

    • abdullahkhalids 2 days ago

      > Don’t get discouraged because there’s a lot of mechanical work to writing. There is, and you can’t get out of it. I rewrote A Farewell to Arms at least fifty times. You’ve got to work it over. The first draft of anything is shit. (Ernest Hemingway)

      This is how all intellectual work proceeds. Most of the stuff you write is crap. After many iterations you produce one that is good enough for others. Should we take away the typewriter from the novel writers too, along with Jupyter notebooks from scientists, because most typed pages are crap?

      • crabbone a day ago

        I think you completely missed the point... I compared Jupyter notebooks to shell sessions: that doesn't make them bad (they are, but for a different reason). I don't think that shell sessions are bad. The point I'm making is that Jupyter notebooks aren't suitable for being independent modules inside a larger program (and neither are shell sessions). The alternative is obvious: just write the program.

        Can you possibly make a Jupyter notebook act like a module in a program? -- with a lot of effort and determination, yes. Should you be doing this, especially since the alternative is very accessible and produces far superior results? -- Of course not.

        Using your metaphor, I'm not arguing for taking the typewriter away from the not-so-good writers. I'm arguing that maybe they can use a computer with a word processor, so that they don't waste so much paper.

reeboo 2 days ago

As an MLE who comes from backend web dev, I have flip-flopped on notebooks. I initially felt that everything should be in a python script. But I see the utility in notebooks now.

For notebooks in an ML pipeline, I find that data issues are usually where things fail. Being able to run code "up to" a certain cell and create plots is invaluable. Building reports by creating a data frame and displaying it in a cell is also super handy.

You say, "dial some logic in", which is asking the wrong question (in my experience, at least). The logic in ML is usually very straightforward. It's about the data coming into your process and how your models are interacting with it.

  • jamesblonde 2 days ago

    I agree completely with this. Papermill output is a notebook - that is the log file. You can double-click it; it opens in 1-2 seconds, and you can see visually how far your notebook progressed, plus any plots you added for debugging.

jdiez17 2 days ago

There are a lot of people who are not expert Python programmers, but know enough to pull data from various sources and make plots. Jupyter{Notebook,Lab} is great for that.

As you say, from a programmer's point of view the logical thing to do is to convert the notebook to a Python module. But that's an extra step that may not be necessary in some cases.

FWIW I used papermill in my Master's thesis to analyze a whole bunch of calibration data from IMUs. This gave me a nicely readable document with the test report, conclusions etc. for each device pretty easily.

kremi 2 days ago

Some of the replies here are pretty good, I basically agree with “if it works for your data scientists then why not”.

I’m actually a software developer with 10 years’ experience and also happen to do data science, and I’ve found myself in situations where I parametrized a notebook to run in production. So it’s not that I can’t turn it into plain Python. The main reasons are:

1. I prototype in a notebook. Translating to python code requires extra work. In this case there’s no extra dev involved, it’s just me. Still it’s extra work.

2. You can isolate the code out of the notebook, and in theory you’ve just turned your notebook into plain Python. You could even log every cell output to your standard logging system. But you lose the context of every log. Some cells might output graphs. The notebook just gives you a fast and complete picture that might be tedious to put together otherwise.

3. The saved notebook also acts as versioning. In DS work you can end up with lots of parameters or small variations of the same thing. In the end, whatever has few variations I put in plain Python code; what’s more experimental and subject to change I put in the notebook. In certain cases it’s easier than going through commit logs.

4. I’ve never done this, but a notebook is just JSON, so in theory you could further process the output with prestodb or similar (rough sketch below).
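
A minimal sketch of what reading that JSON back looks like, assuming an executed notebook on disk (the path and the idea of feeding rows onward are illustrative):

```python
import json

# Load an executed notebook produced by papermill (hypothetical path).
with open("runs/output_notebook.ipynb") as f:
    nb = json.load(f)

# nbformat v4: top-level "cells", each code cell carries its recorded "outputs".
for i, cell in enumerate(nb.get("cells", [])):
    if cell.get("cell_type") != "code":
        continue
    for out in cell.get("outputs", []):
        if out.get("output_type") == "stream":
            print(f"cell {i} stdout:", "".join(out.get("text", [])).strip())
        elif out.get("output_type") in ("execute_result", "display_data"):
            text = out.get("data", {}).get("text/plain")
            if text:  # image-only outputs (plots) have no text/plain
                print(f"cell {i} result:", "".join(text).strip())
```

From there, the extracted values could be written to wherever your query engine reads from.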

mooreds 2 days ago

It's the same tradeoff as turning an Excel spreadsheet into a proper program.

If you do so, you gain:

* the rigor of the SDLC

* reusability by other developers

* more flexible deployment

But you lose the ability for a non-programmer to make significant changes. Every change needs to go through the programmer now.

That is fine if the code is worth it, but not every bit of code is.

  • fifilura 2 days ago

    It also implies that an engineer has a better understanding of what is supposed to be done and can discover all the error modes.

    In my experience, most of the time the problem is in the input and interpretation of the data. Not fixable by a unit test.

crystal_revenge 2 days ago

I agree. I was at a company where some DS was really excited about Papermill, and I was trying to explain that this is an excellent time to stop working in a notebook and start writing reusable code.

I was aghast to learn that this person had never written non-notebook based code.

Code notebooks are great as notebooks, but should in no way replace libraries and well structured Python projects. Papermill to me is a huge anti-pattern and a sign that your team is using notebooks wrong.

  • jdiez17 2 days ago

    So you think it was a good move to scoff at someone for using a computer for their work in a way that is different from your preferences?

    • crystal_revenge 2 days ago

      Notebooks are great as notebooks, but it's very well established, even in the DS community, that they are a terrible way to write maintainable, sharable, scalable code.

      It's not about preference, it's objectively a terrible idea to build complex workflows with notebooks.

      The "scoff" was in my head, the action that came out of my mouth was to help them understand how to create reusable Python modules to help them organize their code.

      The answer is to help these teams build an understanding of how to properly translate their notebook work into reusable packages. There is really no need for data scientists to follow terrible practices, and I've worked on plenty of teams that have successfully onboarded DS as functioning software engineers. You just need a process and a culture in which notebooks are not the last stage of a project.
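
      In practice that translation can start as small as lifting one cell into a module the notebook then imports (everything below is hypothetical, just to show the shape of the move):

      ```python
      # features/cleaning.py -- a hypothetical shared module extracted from a notebook cell
      import pandas as pd

      def drop_incomplete_rows(df: pd.DataFrame, required_columns: list[str]) -> pd.DataFrame:
          """Drop rows missing any required column; the notebook previously did this inline."""
          return df.dropna(subset=required_columns)
      ```

      The notebook cell then shrinks to `from features.cleaning import drop_incomplete_rows`, and the logic picks up tests, review, and reuse outside the notebook.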

      • fifilura 2 days ago

        The thing with data pipelines is they have a linear execution. You start from the top and work your way down.

        Notebooks do that, and even leave a trace while doing it. Table outputs, plots, etc.

        It is not like a Python backend that listens to events and handles them as they come, sometimes even in parallel.

        For data flow, the code has an inherent direction.

        • crystal_revenge 2 days ago

          > Notebooks do that, and even leave a trace while doing it.

          Perhaps the largest critique against notebooks is that they don't enforce a linear execution of cells. Every data scientist I know has been bitten by this at least once (not realizing they're in a stale cell that should have been updated).

          Sure you could solve this by automating the entire notebook ensuring top-down execution order but then why in the world are you using a notebook like this? There is no case I can think of where this would be remotely better than just pulling out the code into shared libraries.

          I've worked on a wide range of data science teams in my career and by far the most productive ones are the ones that have large shared libraries and have a process in place for getting code out of notebooks and into a proper production pipeline.

          Normally I'm the person defending notebooks since there's a growing number of people who outright don't want to see them used ever. But they do have their place, as notebooks. I can't believe I'm getting down voted for suggesting one shouldn't build complex workflows using notebooks.

jsemrau 2 days ago

I used papermill a while ago to automate a long-running Python-based data aggregation task. Airflow would log in remotely to the server, kick off papermill, and track its progress. Initially I wanted to use pure Python, but the connection dropped frequently, which kept me from tracking progress, and Jupyter also made it quick to debug where something went wrong.

Not one of my proudest moments, but it got the job done.
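
For reference, a minimal sketch of one way this kind of job can be wired up with Airflow's papermill provider (this assumes the apache-airflow-providers-papermill package and made-up paths; it is not necessarily how the setup above looked):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="nightly_aggregation",        # illustrative DAG
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PapermillOperator(
        task_id="run_aggregation_notebook",
        input_nb="notebooks/aggregate.ipynb",
        # The executed copy doubles as the run log: progress, tracebacks, and plots.
        output_nb="notebooks/runs/aggregate_{{ ds }}.ipynb",
        parameters={"run_date": "{{ ds }}"},
    )
```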

swalsh 2 days ago

My experience is more with Databricks, and their workflow system... but the concept is exactly the same.

    It lets data scientists work in the environment they work best in, and it makes it easier to productionize work. If you separate them, then there's a translation process to move the code into whatever the production format is, which means extra testing and extra development.

__MatrixMan__ 2 days ago

I think there are places where the figure-it-out-in-a-notebook part is one person's job, and then including it in a pipeline is another person's job.

If they can call the notebook like a function, the second person's job becomes much easier.

  • crabbone 2 days ago

    I've been that person, and no, it doesn't. It makes my life suck if I have to include a notebook instead of an actual program in a larger program. Notebooks don't compose well, they are too dependent on the specifics of the environment in which they were launched, and they carry excessive, machine-generated source that is hard for humans to work with.

    As a stop-gap solution, for cases like a single presentation / proof-of-concept that doesn't need to live on and be reused -- it would work. Anything that doesn't match this description will accumulate technical debt very quickly.

    • __MatrixMan__ 2 days ago

      I sort of suspected that adding parameters was not the end of the story. My experience with this was just "make it work with papermill", so the notebooks I tested with were nice and self contained.

      Although it does seem like packaging dependencies and handling parameters are separate problems, so I'm not sure if papermill is to be blamed for the fact that most notebooks are not ready to be handled like a black box, even after they're parameter-ready. Something like jupyenv is needed also.

      • crabbone a day ago

        Jupyter is not the end of the story here. There are plenty of "extensions", and they generally go down two different paths: kernels and magics.

        It's not very common for Jupyter magics to be added ad hoc by users, but they typically create a huge dependency on the environment, so no jupyenv is going to help (e.g. all the workload-manager-related magics for launching jobs in Slurm / OpenPBS).

        Kernels... well, they can do all sorts of things... beyond your wildest dreams and imagination. And, unlike magic, they are readily available for the end-user to mess with. And, of course, there are a bunch of pre-packaged ones, supplied by all sorts of vendors who want, in this way, to promote their tech. Say, stuff like running Jupyter over Kubernetes with Ceph volumes exposed to the notebook. There's no easy way of making this into a "module" / "black box" that can be combined with some other Python code. It needs a ton of infra code to support this, if it's meant to be somewhat stand-alone.

        • __MatrixMan__ 7 hours ago

          Are we talking about the same https://github.com/tweag/jupyenv ?

          It encapsulates the kernel, which encapsulates pretty much everything for the notebook, right? I haven't worked with Slurm or OpenPBS, but if you let nix build the images your tasks run in, then I think you're covered for pretty much everything except things that only exist at runtime, like database connections. Not a perfect black box, but close.

zhoujing204 2 days ago

It might be a pretty useful tool for education. College courses related to Python and AI on Coursera have heavily used Jupyter Notebook for assignments and labs.

z3c0 2 days ago

Parameterizing notebooks is a feature common to modern data platforms, and most of its usefulness comes from saving the output. That makes it easier to debug ML pipelines and the like, because the code, documentation, and last output are all in one place. However, I don't see any mention of what happens to the outputs with this tool.