Comment by uxcolumbo 4 days ago
Are there any cheaper alternatives to a Databricks, EC2, DynamoDB, S3 solution? Something where cost is more predictable and controlled?
What's a good roll-your-own solution? DB storage doesn't need to be dynamic like with DynamoDB. At most 1TB - maybe double that in the future.
Could this be done on a mid-size VPS (32GB RAM) hosting Apache Spark etc. - or is it better to have a couple?
P.S. total beginner in this space, hence the (naive) question.
Depends on how you define cheaper - you could set up Apache Iceberg, Spark, MLflow, Airflow, JupyterLab, etc. and create an abomination that sort of looks like Databricks if you squint, but then you have to deal with setup, maintenance, support, etc.
Computationally speaking, it again depends on what your company does. Collect a lot of data? You need a lot of storage.
Train ML models? You will need GPUs - and you need to think about how to utilise those GPUs.
Or... you could pay Databricks, log in and start working.
I worked at a company that tried to roll their own. They wasted about a year on it, and it was flaky as hell and fell apart. Self-hosting makes sense if you have the people to manage it, but the vast majority of medium-sized companies will have engineers who think they can manage this, try it, fail, and move on to another company.