jdiez17 2 days ago

So you think it was a good move to scoff at someone for using a computer for their work in a way that is different from your preferences?

crystal_revenge 2 days ago

Notebooks are great as notebooks, but it's very well established, even in the DS community, that they are a terrible way to write maintainable, shareable, scalable code.

It's not about preference; it's objectively a terrible idea to build complex workflows with notebooks.

The "scoff" was in my head, the action that came out of my mouth was to help them understand how to create reusable Python modules to help them organize their code.

The answer is to help these teams build an understanding of how to properly translate their notebook work into reusable packages. There is really no need for data scientists to follow terrible practices, and I've worked on plenty of teams that have successfully onboarded data scientists as functioning software engineers. You just need a process and a culture in which notebooks are not the last stage of a project.
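
In practice the translation is mostly mechanical. A minimal sketch (module, function, and column names here are just placeholders): the exploratory code moves into a plain module, and the notebook shrinks to a couple of imports.

    # features.py -- extracted from the exploration notebook
    import pandas as pd

    def load_events(path: str) -> pd.DataFrame:
        """Read the raw event dump."""
        return pd.read_parquet(path)

    def add_session_features(df: pd.DataFrame) -> pd.DataFrame:
        """Derive the per-session columns the model needs."""
        out = df.copy()
        out["duration_s"] = (out["end"] - out["start"]).dt.total_seconds()
        return out

    # back in the notebook, the cell becomes:
    #   from features import load_events, add_session_features
    #   df = add_session_features(load_events("events.parquet"))

From there it's a versioned package, unit tests, and code review like any other code.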

  • fifilura 2 days ago

    The thing with data pipelines is they have a linear execution. You start from the top and work your way down.

    Notebooks do that, and even leave a trace while doing it. Table outputs, plots, etc.

    It is not like a Python backend that listens for events and handles them as they come, sometimes even in parallel.

    For data flow, the code has an inherent direction.
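
    Concretely, a pipeline is usually just a straight line of steps. A toy example (file and column names are made up):

        import pandas as pd

        # each step feeds the next; no event loop, no callbacks
        raw = pd.read_csv("events.csv", parse_dates=["ts"])    # extract
        clean = raw.drop_duplicates(subset="event_id")         # transform
        daily = clean.groupby(clean["ts"].dt.date).size()      # aggregate
        daily.to_csv("daily_counts.csv")                       # load

    That top-to-bottom shape is exactly what a notebook gives you for free.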

    • crystal_revenge 2 days ago

      > Notebooks do that, and even leave a trace while doing it.

      Perhaps the biggest criticism of notebooks is that they don't enforce linear execution of cells. Every data scientist I know has been bitten by this at least once (not realizing some state came from a stale cell that should have been re-run).

      Sure, you could solve this by automating the entire notebook to ensure top-down execution order, but then why in the world are you using a notebook like this? There is no case I can think of where that would be remotely better than just pulling the code out into shared libraries.
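
      That "automation" usually ends up looking something like this (papermill is one common option; the paths and parameter are placeholders):

          import papermill as pm

          # execute every cell top to bottom; fail loudly if any cell errors
          pm.execute_notebook(
              "feature_pipeline.ipynb",
              "feature_pipeline.out.ipynb",
              parameters={"run_date": "2024-06-01"},
          )

      At which point you're maintaining a script with extra steps.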

      I've worked on a wide range of data science teams in my career and by far the most productive ones are the ones that have large shared libraries and have a process in place for getting code out of notebooks and into a proper production pipeline.

      Normally I'm the person defending notebooks, since there's a growing number of people who outright don't want to see them used ever. But they do have their place, as notebooks. I can't believe I'm getting downvoted for suggesting one shouldn't build complex workflows using notebooks.