Comment by agentcoops

Comment by agentcoops 3 days ago

2 replies

I don't know what it looks like on the ground now, but Scala was the de facto language of data infrastructure across the post-Twitter world of SV late-stage/growth startups. In large part, this was because these companies were populated by former members of the Twitter data team, so it was familiar, but also because there was so much open source tooling at that point. ML teams largely wrote/write Python, product teams in JS/whatever backend language, but data teams -- outside of Google and the pre-Twitter firms -- usually wrote Scala for Spark, Scalding, etc. in the 2012-2022ish era.

I worked in Scala for most of my career and it was never hard to get a job on a growth stage data team or, indeed, in finance/data-intensive industries. I'm still shocked at how the language/community managed to throw away the position they had achieved. At first I was equally shocked to see the Saudi Sovereign Wealth fund investing in the language, but then realized it was just 300k from the EU and everything made sense.

It's still my favorite language for getting things done so I wouldn't be upset with a comeback for the language, but I certainly don't expect it at this point.

ForHackernews 2 days ago

Did the Scala 3 changeover blunt its momentum, you think? Or did Python just win out in a sort of worse-is-better way?

  • hocuspocus 2 days ago

    Mostly the latter. Scala 3 is almost completely irrelevant to the big data space so far. Databricks took six years before upgrading their proprietary Spark runtime to Scala 2.13. Flink dropped the Scala API before even moving to 2.13. I don't know if Scio will seriously attempt the move to Scala 3. All of them suffer from Twitter libraries being abandoned, which isn't insurmountable, but an annoyance still.

    And I don't think it matters anymore. I predict that the JVM will eventually be out of the equation. We're already seeing query engines being replaced by proprietary or open source equivalents in C++ or Rust. Large scale distribution is less of a selling point with modern cloud computing. Do you really need 100 executors when you can get a bare metal instance with 192, 256 or 384 cores?

    People want a dataframe API in Python because that's what the ML/DS/AI crowd knows. Queries and processing will be done in C++ or Rust, with little or even zero need for a distributed runtime. The JVM and Scala solve a problem that simply won't exist anymore.