clickety_clack 14 hours ago

My default is basically YAGNI. You should use as few services as possible, and only add something new when there are issues. If everything is possible in Postgres, great! If not, at least I’ll know exactly what I need from the New Thing.

Fripplebubby 14 hours ago

The post is a clear example of when YAGNI backfires: you think YAGNI, but then you actually do need it. I had this experience, the author had this experience, and you might too - the things you think you AGN are actually pretty basic expectations, not luxuries: being able to write vectors in real time without running other processes out of band to keep recall from degrading over time, and being able to write a single query that combines normal SQL filter predicates with similarity search for retrieval. These things matter, and you won't notice that they don't actually work at scale until later on!
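
For concreteness, the second of those looks something like this - a minimal sketch assuming pgvector, with a hypothetical items table and a toy 3-dimensional embedding:

```sql
-- Hypothetical table with a pgvector embedding column
-- (3 dimensions only to keep the example short)
CREATE TABLE items (
    id        bigserial PRIMARY KEY,
    category  text,
    feature   text,
    embedding vector(3)
);

-- Ordinary SQL predicates and similarity ranking in one query:
SELECT id
FROM items
WHERE category = 'x'
  AND feature  = 'y'
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'  -- L2 distance to the query vector
LIMIT 5;
```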

  • simonw 12 hours ago

    That's not YAGNI backfiring.

    The point of YAGNI is that you shouldn't over-engineer up front until you've proven that you need the added complexity.

    If you need vector search against 100,000 vectors and you already have PostgreSQL then pgvector is a great YAGNI solution.

    10 million vectors that are changing constantly? Do a bit more research into alternative solutions.

    But don't go integrating a separate vector database for 100,000 vectors on the assumption that you'll need it later.
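
    And at 100,000 vectors the whole pgvector setup is a few statements - a minimal sketch, assuming the extension is available; the table and dimensionality are illustrative:

    ```sql
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE documents (
        id        bigserial PRIMARY KEY,
        body      text,
        embedding vector(3)  -- toy dimensionality, purely illustrative
    );

    -- An approximate (HNSW) index keeps search fast at this scale
    CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops);
    ```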

    • Fripplebubby 10 hours ago

      I think the tricky thing here is that the specific things I referred to (real-time writes and pushing SQL predicates into your similarity search) work fine at small scale, in such a way that you might not notice they're going to stop working at scale. When you have 100,000 vectors, you can write those SQL predicates (return the top 5 hits where category = x and feature = y) and they'll work fine - up until one day it doesn't work fine anymore, because the vector space has gotten large. So I suppose it's fair to say this isn't YAGNI backfiring; this is me not recognizing the shape of the problem to come, and not recognizing that I do, in fact, need it. (To me that feels a lot like YAGNI backfiring, because I didn't think I needed it, but suddenly I do.)
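
      As I understand pgvector's behavior, the thing that bites you is that an HNSW index scan collects a bounded candidate set first and applies the WHERE clause afterwards, so a selective filter over a big table leaves few (or zero) survivors and recall quietly drops. A sketch of the knob involved (hnsw.ef_search, default 40 if I remember right):

      ```sql
      -- Raise the per-scan candidate budget, trading speed for recall
      SET hnsw.ef_search = 200;

      SELECT id
      FROM items
      WHERE category = 'x' AND feature = 'y'  -- applied AFTER the candidate scan
      ORDER BY embedding <-> '[0.1, 0.2, 0.3]'
      LIMIT 5;
      ```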

      • morshu9001 9 hours ago

        If the consequence of being wrong about the scalability is that you just have to migrate later instead of sooner, that's a win for YAGNI. It's only a loss if hitting this limit later causes service disruption or makes the migration way harder than if you'd done it sooner.

      • hobofan 9 hours ago

        > When you have 100,000 vectors [...] and they'll work fine

        So 95% of use-cases.

  • throwway120385 11 hours ago

    Many of the concerns in the article could be addressed by standing up a separate PG database that's used exclusively for vector ops, and keeping your relational data out of it. Then your vector use cases get served from your vector DB and your relational use cases get served from your relational DB. Separating concerns like that doesn't solve the underlying problem, but it limits the blast radius, so you can operate in a degraded state instead of falling over completely.
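
    A rough sketch of that split - the hostnames and schema are hypothetical, and note that foreign keys can't be enforced across databases, so the link is by ID only:

    ```sql
    -- On the vector database (e.g. postgres://app@vector-db.internal/vectors):
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE item_embeddings (
        item_id   bigint PRIMARY KEY,  -- logical reference into the relational DB
        embedding vector(3)            -- toy dimensionality
    );

    -- The relational database (e.g. postgres://app@main-db.internal/app) keeps
    -- items, orders, etc.; the application joins on item_id itself.
    ```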

    • SoftTalker 9 hours ago

      I've always tried to separate transactional databases from those supporting analytical queries whenever there's any chance of contention. The latter often don't need to be real-time, or even near real-time.

esafak 14 hours ago

Databases are hard to swap out when you realize you need a different one.

  • morshu9001 9 hours ago

    That's true when you're talking about a general-purpose RDBMS, but if this is an isolated set of tables for embeddings or something, and you don't entangle it with everything else, it can be fine. See also: using Postgres as a KV store.
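
    The KV version of that isolation is one self-contained table - a minimal sketch, with a made-up key scheme:

    ```sql
    CREATE TABLE kv (
        key   text PRIMARY KEY,
        value jsonb NOT NULL
    );

    -- Upsert is the core KV operation:
    INSERT INTO kv (key, value)
    VALUES ('user:42', '{"name": "Ada"}')
    ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value;

    SELECT value FROM kv WHERE key = 'user:42';
    ```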