clickety_clack 21 hours ago

I probably wouldn’t rewrite an entire data science stack that used pandas, but most people would use polars if starting a new project today.

  • biofox 20 hours ago

    R and Matlab workflows have been fairly stable for the past decade. Why is the Python ecosystem so... unstable? It puts me off investing any time in it.

    • clickety_clack 20 hours ago

      The R ecosystem went through a similar evolution with the tidyverse; it just happened longer ago. As for Matlab, I initially learned statistical programming with it a long time ago, but I’m not sure I’ve ever seen it in the wild. I don’t know what’s going on there.

      I’m actually quite partial to R myself, and I used to use it extensively back when quick analysis was more valuable to my career. Things have probably progressed, but I dropped it in favor of python because python can integrate into production systems whereas R was (and maybe still is) geared towards writing reports. One of the best things to happen recently in data science is the plotnine library, bringing the grammar of graphics to python imho.

      The fact is that today, if you want career opportunities as a data scientist, you need to be fluent in python.

      • fluidcruft 12 minutes ago

        Mostly what's going on with Matlab in the wild is that it costs at least $10k a seat as soon as you are no longer at an academic institution.

        Yes, there is Octave but often the toolboxes aren't available or compatible so you're rewriting everything anyway. And when you start rewriting things for Octave you learn/remember what trash Matlab actually is as a language or how big a pain doing anything that isn't what Mathworks expects actually is.

        To be fair: Octave has extended Matlab's syntax with amazing improvements (many inspired by numpy and R). It really makes me angry that Mathworks hasn't stolen Octave's innovations. Whenever I have to touch actual Matlab, I hate every minute of not being able to broadcast and of having to manually create temp variables because you can't chain indexing. So to be clear, Octave is somewhat pleasant and, for pure numerical syntax, superior to numpy.

        But the siren call of Python is significant. Python is not the perfect language (for anything really) but it is a better-than-good language for almost everything and it's old enough and used by so many people that someone has usually scratched what's itching already. Matlab's toolboxes can't compete with that.

    • jononor 4 hours ago

      The pandas workflows have also been stable for the last decade. That there is a new kid on the block (polars) does not make the existing stuff any less stable. And one can just continue writing pandas for the next decade too.

    • crystal_revenge 15 hours ago

      I love R, but how can you make that claim when R uses three distinct object-oriented systems all at the same time? R might seem stable only because it carries along with it 50 years of programming language history (part of its charm: where else can you see the generic function approach to OOP in a language that's still evolving?).

      Finally, as someone who wrote a lot of R pre-tidyverse, I've seen the entire ecosystem radically change over my career.

    • rbartelme 20 hours ago

      Outside Bioconductor or the tidyverse, R can be just as unstable due to CRAN's package requirements.

crystal_revenge 15 hours ago

Pandas is generally awful unless you're just living in a notebook (and even then it's probably my least favorite implementation of the 'data frame' concept).

Since Pandas lacks Polars' concept of an Expression, it's actually quite challenging to programmatically interact with non-trivial Pandas queries. In Polars the query logic can be entirely independent of the data frame while still referencing specific columns of the data frame. This makes Polars data frames work much more naturally with typical programming abstractions.

Pandas multi-index is a bad idea in nearly all contexts other than its original use case: financial time series (and I'll admit, if you're working with purely financial time series, then Pandas feels much better). Sufficiently large Pandas code bases are littered with seemingly arbitrary uses of 'reset_index', there are many times where multi-index will create bugs, and, most importantly, I've never seen any non-financial scenario where anyone has used multi-index to their advantage.
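For anyone who hasn't run into this, here is a small sketch of how a multi-index tends to appear uninvited and why 'reset_index' follows it around. The `sales` data and its column names are invented for the example:

```python
import pandas as pd

# Illustrative data only.
sales = pd.DataFrame(
    {
        "region": ["east", "east", "west"],
        "product": ["a", "b", "a"],
        "units": [3, 5, 2],
    }
)

# Grouping by two keys moves both keys out of the columns and into
# a MultiIndex on the result.
totals = sales.groupby(["region", "product"])["units"].sum()
is_multi = isinstance(totals.index, pd.MultiIndex)

# reset_index turns the index levels back into ordinary columns,
# which is what most downstream code expects.
flat = totals.reset_index()
print(flat)
```

Passing `as_index=False` to `groupby` avoids the round trip in this particular case, but the pattern above is the one that tends to accumulate in larger code bases.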

Finally, Pandas is slow, which is honestly the lowest priority for me personally, but using Polars is so refreshing.

What other data frames have you used? I've used R's native data frames extensively (the way they make use of indexing is so much nicer) in addition to Polars, and both are drastically preferable to Pandas. My experience is that most people use Pandas because it has been the only data frame implementation in Python. But personally I'd rather just not use data frames if I'm forced to use Pandas. Could you expand on what you like about Pandas over other data frame models you've worked with?

amelius 20 hours ago

Pandas turns 10x developers with a lust for life into 0.1x developers with grey hairs.

  • cbare 15 hours ago

    Ha, I think that happens regardless of the tech you use. Just blame time.

wesleywt 5 hours ago

Nothing, it gets the job done for most people. If you don't like it, make a better tool. Polars is not it.