Comment by pid-1 11 hours ago

10 replies

Pandas is cancer. Please stop teaching it to people.

Everything it does can be done reasonably well with list comprehensions and objects that support type annotations and runtime type checking (if needed).

Pandas code is untestable, unreadable, hard to refactor and impossible to reuse.

Trillions of dollars are wasted every year by people having to rewrite pandas code.

jononor 4 hours ago

Code using pandas is testable and reusable in much the same way as any other code: write functions that take and return data.
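
A minimal sketch of that "take and return data" style, assuming a hypothetical `add_total` helper and made-up `price`/`qty` columns (neither comes from the thread). It only illustrates that a pandas transformation can be a plain function checked with `pandas.testing.assert_frame_equal`:

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def add_total(orders: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `orders` with a `total` column (price * qty)."""
    # assign() returns a new frame, so the caller's input is left untouched.
    return orders.assign(total=orders["price"] * orders["qty"])


def test_add_total() -> None:
    orders = pd.DataFrame({"price": [2.0, 3.0], "qty": [1, 4]})
    expected = pd.DataFrame(
        {"price": [2.0, 3.0], "qty": [1, 4], "total": [2.0, 12.0]}
    )
    assert_frame_equal(add_total(orders), expected)
    # The input frame was not mutated.
    assert "total" not in orders.columns


test_add_total()
```

Because the function neither mutates its argument nor reads global state, it can be reused and tested like any other pure function.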

That said, the polars/narwhals-style API is definitely better than the pandas API: more readable and composable, simpler (no index), and a bit less weird overall.

  • jmpeax 4 hours ago

    Polars made the mistake of not maintaining row order for all operations, via the False-by-default argument of maintain_order. This is basically the billion-dollar null mistake for data frames.

isolatedsystem 7 hours ago

I've recently had to migrate over to Python from Matlab. Pandas has been doing my head in. The syntax is so unintuitive. In Matlab, everything begins with a `for` loop. Inelegant and slow, yes, but easy to reason about. Easy to see the scope and domain of the problem, to visualise the data wrangling.

Pandas insists you never use a for loop. So I feel guilty if I ever need a throwaway variable on the way to creating a new column. Sometimes methods are attached to objects, other times they aren't. And if you need to use a function that isn't vectorised, you've got to do df.apply anyway. You have to remember to change the 'axis' too. Plotting is another thing that I can't get my head around. Am I supposed to use Pandas' helpers like df.plot() all the time? Or ditch them and use the low-level matplotlib directly? What is idiomatic? I cannot find answers to much of it, even with ChatGPT. Worse, I can't seem to create a mental model of what Pandas expects me to do in a given situation.
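
To make the options in the paragraph above concrete, here is a rough sketch of the same new column built three ways: an explicit loop, a vectorised expression, and `df.apply` with `axis=1` for a function that is not vectorised. The column names, the `label_pace` helper, and the 10.2 threshold are all invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"distance_km": [10.0, 42.0], "time_h": [1.0, 4.0]})

# 1. Explicit loop, Matlab-style: easy to follow, slow on large frames.
speed_loop = []
for _, row in df.iterrows():
    speed_loop.append(row["distance_km"] / row["time_h"])

# 2. Vectorised: arithmetic on whole columns at once.
#    A throwaway variable on the way to a new column is perfectly fine.
speed_vec = df["distance_km"] / df["time_h"]
df["speed_kmh"] = speed_vec


# 3. A function that works on one row at a time (hypothetical helper).
def label_pace(row: pd.Series) -> str:
    return "fast" if row["speed_kmh"] > 10.2 else "slow"


# axis=1 passes each row to the function; the default axis=0 passes each column.
df["pace"] = df.apply(label_pace, axis=1)
```

All three routes produce the same numbers. The loop is not forbidden, it is just the slowest option once the frame gets large.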

Pandas has disabused me of the notion that Python syntax is self-explanatory, executable pseudocode. I find it terrible to look at. Matlab was infinitely more enjoyable.

  • kelipso an hour ago

    Yeah, pandas is truly awful. After working with things like R, ggplot, data.table, you soon realize pandas is the worst dataframe analysis and plotting library out there.

    I pretty much consider anyone who likes it to have Stockholm syndrome.

  • radus 6 hours ago

    Polars has a much more consistent API, give it a shot.

    Regarding your plotting question: use seaborn when you can, but you’ll still need to know matplotlib.

mttpgn 10 hours ago

> Pandas code is untestable

The thousand-plus data integrity tests I've written in pandas tell a different story...

mulmboy 7 hours ago

> Everything it does can be done reasonably well with list comprehensions and objects that support type annotations and runtime type checking (if needed).

I see this take somewhat often, and usually with a similar lack of nuance. How do you come to this? In other cases where I've seen it, it has come from people who haven't worked in any context where performance or interoperability with the scientific computing ecosystem matters, which means they are missing a massive part of the picture. I've struggled to get through to them before. Genuine question.

physicsguy 7 hours ago

I found Pandera quite good for wrapping input/output expectations over Pandas. At the end of the day, the vectorised operations in pandas and other table-based formats mean they're not easy to replace without losing performance.

globular-toast 5 hours ago

Can you write more about this? A lot of people use pandas where I work, whereas I'm completely fluent in list comprehensions and dataclasses etc. I had the impression it was doing something "more" like using numpy arrays/matrices for columns.
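
A small sketch of that "more", under the assumption of a toy frame with made-up `price` and `qty` columns: a numeric pandas column can be viewed as a NumPy array, and arithmetic on it runs over the whole array at once instead of element by element in Python. The `Row` dataclass is a hypothetical stand-in for the list-comprehension approach:

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [1.5, 2.0, 4.0], "qty": [2, 3, 1]})

# A numeric column can be handed back as a NumPy array.
prices = df["price"].to_numpy()
print(type(prices).__name__, prices.dtype)

# Column-wise arithmetic: one expression over whole arrays.
total_vectorised = (df["price"] * df["qty"]).sum()


# The same calculation with dataclasses and a generator expression.
@dataclass
class Row:
    price: float
    qty: int


rows = [Row(p, q) for p, q in zip(df["price"], df["qty"])]
total_loop = sum(r.price * r.qty for r in rows)
```

Both totals agree. The difference is where the loop runs: inside compiled array code for the column version, in the Python interpreter for the dataclass version, which is where the performance gap on large data comes from.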