fifilura 2 days ago

The thing with data pipelines is that they have a linear execution: you start from the top and work your way down.

Notebooks do that, and even leave a trace while doing it. Table outputs, plots, etc.

It is not like a Python backend that listens to events and handles them as they come, sometimes even in parallel.

For data flow, the code has an inherent direction.
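
A toy sketch of that direction, with made-up data and steps:

    # A toy linear pipeline: each step consumes the previous step's output,
    # so the script reads (and runs) strictly top to bottom.
    rows = [{"user": "a", "amount": 3}, {"user": "b", "amount": -1}]   # load
    valid = [r for r in rows if r["amount"] > 0]                       # clean
    total = sum(r["amount"] for r in valid)                            # aggregate
    print(total)                                                       # report
    # An event-driven backend has no such order: handlers fire whenever
    # (and in whatever order) events arrive, sometimes concurrently.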

crystal_revenge 2 days ago

> Notebooks do that, and even leave a trace while doing it.

Perhaps the biggest criticism of notebooks is that they don't enforce a linear execution order for cells. Every data scientist I know has been bitten by this at least once (not realizing a cell is stale and should have been re-run).
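
A minimal sketch of that failure mode (the cells and values here are made up):

    # Cell 1
    data = [0.2, 0.6, 0.8]
    threshold = 0.5

    # Cell 2
    filtered = [x for x in data if x > threshold]   # -> [0.6, 0.8]

    # Cell 1 is later edited to threshold = 0.9 but Cell 2 is never re-run,
    # so `filtered` silently keeps the result computed with the old threshold.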

Sure, you could solve this by automating the entire notebook to ensure top-down execution order, but then why in the world are you using a notebook like that? There is no case I can think of where this would be remotely better than just pulling the code out into shared libraries.
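
As a rough sketch of what pulling the code out can look like (the module, function, and column names here are hypothetical):

    # shared library: mylib/cleaning.py
    import pandas as pd

    def drop_outliers(df: pd.DataFrame, col: str, z: float = 3.0) -> pd.DataFrame:
        """Drop rows more than `z` standard deviations from the column mean."""
        mean, std = df[col].mean(), df[col].std()
        return df[(df[col] - mean).abs() <= z * std]

    # notebook cell: a thin, re-runnable wrapper around the tested library code
    # from mylib.cleaning import drop_outliers
    # clean = drop_outliers(raw, "revenue")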

I've worked on a wide range of data science teams in my career, and by far the most productive ones are those with large shared libraries and a process in place for getting code out of notebooks and into a proper production pipeline.

Normally I'm the person defending notebooks, since there's a growing number of people who outright never want to see them used. But they do have their place, as notebooks. I can't believe I'm getting downvoted for suggesting one shouldn't build complex workflows in notebooks.