Comment by chatmasta 6 hours ago

I would like to see a “DataFusion for vector databases,” i.e. an embeddable library that Does One Thing Well – fast embedding generation, index builds, retrieval, etc. – so that different systems can glue it into their engines without reinventing the core vector capabilities every time. Call it a generic “vector engine” (or maybe “embedding engine,” to avoid confusion with “vectorized query engine”).

Currently, every new solution is either baked into an existing database (Elastic, pgvector, Mongo, etc.) or an entirely separate system (Milvus, now Vectroid, etc.).

There is a clear argument in favor of the pgvector approach, since it simply brings new capabilities to 30 years of battle-tested database tech. That’s more compelling than something like Milvus that has to re-invent “the rest of the database.” And Milvus is also a second system that needs to be kept in sync with the source database.

But pgvector is still _just for Postgres_. It’s nice that it’s an extension, but in the same way Milvus has to reinvent the database, pgvector needs to reinvent the vector engine. I can’t load pgvector into DuckDB as an extension.

Is there any effort to make a pure, Unix-style, batteries not included, “vector engine?” A library with best-in-class index building, retrieval, storage… that can be glued into a Postgres extension just as easily as it can be glued into a DuckDB extension?
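To make the ask concrete, here is a hypothetical sketch of the kind of minimal, embeddable interface such a "vector engine" might expose. Everything here is invented for illustration (`VectorIndex`, `add`, `search` are not from any real library), and a brute-force cosine scan stands in for a real ANN index like HNSW or IVF:

```python
import math

class VectorIndex:
    """Hypothetical minimal vector-engine interface: add vectors by key, search by similarity."""

    def __init__(self, dim):
        self.dim = dim
        self.keys = []
        self.vectors = []

    def add(self, key, vector):
        assert len(vector) == self.dim
        self.keys.append(key)
        self.vectors.append(vector)

    def search(self, query, k=10):
        # Exact cosine-similarity scan; a real engine would swap in HNSW/IVF here.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)

        scored = sorted(((cosine(query, v), key)
                         for key, v in zip(self.keys, self.vectors)),
                        reverse=True)
        return [key for _, key in scored[:k]]

idx = VectorIndex(dim=3)
idx.add("a", [1.0, 0.0, 0.0])
idx.add("b", [0.0, 1.0, 0.0])
idx.add("c", [0.9, 0.1, 0.0])
print(idx.search([1.0, 0.0, 0.0], k=2))  # → ['a', 'c']
```

The point of the sketch is the surface area: the host database owns storage, transactions, and filtering, and the engine only owns index building and retrieval behind a few calls like these.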

talipozturk 5 hours ago

I think we have plenty of those nice open source libraries, but the problem is not the library or the algorithm (HNSW or IVF derivatives). The problem is figuring out the right distributed architecture to balance cost, accuracy (recall), and speed (latency). I believe no single library will give you all that. For instance, if you don't separate writes (indexing) from reads (queries) and scale them independently, then either your indexing will suck or your indexing will destroy your read latency. You won't be able to scale as easily either. I believe that is why AWS created Aurora and Google Cloud created AlloyDB to scale relational databases (MySQL/PostgreSQL): by separating reads from writes, implementing a scalable storage backend, and offloading a lot of shared work (replication, compaction, indexing) to a cluster of machines.
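The read/write split described above can be sketched in miniature (all names invented, not from any real system): writes land in a small mutable buffer, reads scan immutable sealed segments plus a cheap buffer scan, and the expensive index build happens only when a buffer is sealed, so it never blocks queries:

```python
class SegmentedIndex:
    """Toy sketch of separating indexing (writes) from queries (reads)."""

    def __init__(self, buffer_limit=2):
        self.buffer = []     # mutable, absorbs writes
        self.segments = []   # immutable, serve reads
        self.buffer_limit = buffer_limit

    def write(self, key, vector):
        self.buffer.append((key, vector))
        if len(self.buffer) >= self.buffer_limit:
            self.seal()

    def seal(self):
        # In a real system this is where the expensive index build (HNSW/IVF)
        # runs, ideally offloaded to separate machines, Aurora/AlloyDB-style.
        self.segments.append(tuple(self.buffer))
        self.buffer = []

    def read_keys(self):
        # Queries only touch sealed segments plus a cheap scan of the buffer.
        keys = [k for seg in self.segments for k, _ in seg]
        keys += [k for k, _ in self.buffer]
        return keys

si = SegmentedIndex(buffer_limit=2)
for key, vec in [("x", [1.0]), ("y", [2.0]), ("z", [3.0])]:
    si.write(key, vec)
print(len(si.segments), si.read_keys())  # → 1 ['x', 'y', 'z']
```

Scaling reads and writes independently then means running `write`/`seal` and `read_keys` on different sets of machines, which is exactly the architectural decision a single in-process library can't make for you.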

  • chatmasta 5 hours ago

    Yeah, I feel like these libraries are all one level lower than what I’m asking for. We need something that makes more assumptions (e.g. “I’m running as a component of some kind of database”) but… makes fewer decisions? Is more flexible? Idk. This is the hard part.

    DataFusion nailed this balance between an embedded query engine and a standalone database system. It brings just enough batteries that it’s not a super generic thing that does nothing useful out of the box, but not so many that it needs to compete with full database systems.

    I believe the maintainers refer to it as “the IR of databases” and I’ve always liked that analogy. That’s what I’d like to see for vector engines.

    Maybe what we need as a pre-requisite is the equivalent of arrow/parquet ecosystem for vectors. DataFusion really leverages those standards for interoperability and performance. This also goes a long way toward the architectural decisions you reference — Arrow and Parquet are a solid, “good enough” choice for in-memory and storage formats that are efficient and flexible and well-supported. Is there something similar for vector storage?

  • whakim 4 hours ago

    I couldn't agree with this more. I don't think the majority of problems with vector search at scale are vector search problems (although filtering + ANN is definitely interesting), they're search-problems-at-scale problems.
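As a rough illustration of the Arrow-style layout raised a couple of comments up: Arrow's fixed-size-list type amounts to one flat, contiguous float buffer plus a known dimension, so vector `i` lives at `[i*dim : (i+1)*dim]` with no per-row pointers. The stdlib `array` module stands in for an Arrow buffer here; this is a conceptual sketch, not Arrow's actual API:

```python
from array import array

dim = 4
flat = array("f")  # contiguous float32 buffer, like an Arrow FixedSizeList child
for vec in ([1, 2, 3, 4], [5, 6, 7, 8]):
    flat.extend(vec)

def get_vector(i):
    # Zero-copy-style access: slice the flat buffer at a fixed stride.
    return flat[i * dim:(i + 1) * dim].tolist()

print(get_vector(1))  # → [5.0, 6.0, 7.0, 8.0]
```

A layout like this is what makes interchange cheap: any engine that agrees on "flat float32 buffer + dimension" can mmap or hand off vectors without conversion, which is the role Arrow/Parquet play for tabular data.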

maxxen 4 hours ago

Soo… usearch? It’s literally one header file (of what used to be strict C++11). Funnily enough, that is what is used in the official duckdb-vss extension.

Disclaimer: I wrote duckdb-vss