Comment by laxmansharma 3 days ago
Really like the direction: hot data in memory for dashboards, cold data in Parquet so you can use normal lake/OLAP tools and avoid lock-in. The streaming rollups + open formats story makes sense for cost and flexibility.
A few focused questions to understand how it behaves in the real world:
Core design & reliability
What protects recent/hot in-memory data if a node dies? Is there a write-ahead log (on disk or an external log like Kafka) or replication, or do we accept some loss between snapshots? (A rough sketch of the write path I have in mind is below this list.)
How do sharding and failover work? If a shard is down, can reads fan out to replicas, and how are writes handled?
When memory gets tight, what's the backpressure plan: slow down senders, drop data, or something smarter?
How are late or out-of-order samples handled after a rollup/export? Can Parquet be backfilled/compacted to fix history, or is that planned?
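To make the durability question concrete, here's the rough write path I have in mind, purely as a hedged sketch (the `Ingester`/`Sample` names and the fsync-per-write are my assumptions, not Okapi's design): log first, then apply to memory, so a crash loses at most the unsynced tail of the log.

```go
package ingest

import (
	"encoding/binary"
	"math"
	"os"
	"sync"
)

// Sample is a hypothetical ingest record; Okapi's real layout will differ.
type Sample struct {
	SeriesID  uint64
	Timestamp int64
	Value     float64
}

// Ingester appends each sample to a write-ahead log before applying it to the
// in-memory store, so a crash loses at most the un-synced tail of the log.
type Ingester struct {
	mu  sync.Mutex
	wal *os.File
	hot map[uint64][]Sample // stand-in for the real hot store
}

func NewIngester(walPath string) (*Ingester, error) {
	f, err := os.OpenFile(walPath, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, err
	}
	return &Ingester{wal: f, hot: make(map[uint64][]Sample)}, nil
}

// Append durably logs the sample, then applies it to the hot store. A real
// implementation would batch writes and group-commit fsyncs to keep ingest fast.
func (in *Ingester) Append(s Sample) error {
	in.mu.Lock()
	defer in.mu.Unlock()

	var buf [24]byte
	binary.LittleEndian.PutUint64(buf[0:], s.SeriesID)
	binary.LittleEndian.PutUint64(buf[8:], uint64(s.Timestamp))
	binary.LittleEndian.PutUint64(buf[16:], math.Float64bits(s.Value))
	if _, err := in.wal.Write(buf[:]); err != nil {
		return err
	}
	if err := in.wal.Sync(); err != nil { // durability point: ack the sender only after this
		return err
	}

	in.hot[s.SeriesID] = append(in.hot[s.SeriesID], s)
	return nil
}
```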
Queries
Are there plans for a richer data model in the hot in-memory store, with distinct metric types such as gauges, counters, etc.?
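For context, by "metric types" I mean something along these lines (purely illustrative, not a proposal for Okapi's API), since rollup and rate semantics differ between a monotonic counter and a point-in-time gauge:

```go
package metrics

// Illustrative only: one way a hot store could tag series with a metric type,
// so rollups know whether to compute a rate (counters) or an average (gauges).
type MetricType int

const (
	Gauge     MetricType = iota // point-in-time value, e.g. memory in use
	Counter                     // cumulative, monotonically increasing, e.g. requests served
	Histogram                   // bucketed distribution, e.g. request latency
)

type SeriesMeta struct {
	Name   string
	Labels map[string]string
	Type   MetricType
}
```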
Performance & sizing
The sub-ms reads are great. Is there a Linux version of the performance reports, so it's easier to compare with other products?
Along with the throughput/latency numbers I found on GitHub, could you share the memory footprint, CPU overhead, GC behavior, etc. for the benchmarks?
What's the rough recommended RAM/CPU sizing for different ingest rates, e.g. in bytes per sample or per unit of traffic?
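This is the kind of back-of-envelope estimate I'd like to be able to make from published numbers; every figure below is made up just to show the shape of the calculation:

```go
package main

import "fmt"

// Back-of-envelope RAM sizing; every number here is an assumption for
// illustration, not a measured Okapi figure.
func main() {
	const (
		samplesPerSec     = 200_000  // ingest rate
		bytesPerSample    = 16.0     // assumed in-memory cost per sample
		hotRetentionSec   = 2 * 3600 // keep 2h of hot data in RAM
		activeSeries      = 500_000  // distinct series
		perSeriesOverhead = 4096.0   // labels, index entries, etc. (assumed)
	)

	hotData := samplesPerSec * bytesPerSample * hotRetentionSec
	seriesOverhead := activeSeries * perSeriesOverhead

	fmt.Printf("hot samples:      %.1f GiB\n", hotData/(1<<30))
	fmt.Printf("series overhead:  %.1f GiB\n", seriesOverhead/(1<<30))
	fmt.Printf("rough RAM budget: %.1f GiB (plus headroom for GC and snapshots)\n",
		(hotData+seriesOverhead)/(1<<30))
}
```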
Lake/Parquet details
Considering most people already run something like Prometheus at this point, will Okapi offer an easier migration path?
Will Okapi be able to serve a single query seamlessly across hot (memory) and cold (Parquet) data, or should older ranges be pushed entirely to the lake side and analyzed through OLAP systems?
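On the "single query across hot + cold" point, this is the behavior I'm picturing, sketched with hypothetical interfaces (none of this is Okapi's API): split the time range at the export watermark, answer the older part from Parquet and the newer part from memory, then merge.

```go
package query

// Conceptual sketch of serving one query across hot (memory) and cold (Parquet)
// storage; the Store interface and watermark-based split are assumptions.
type Point struct {
	Timestamp int64
	Value     float64
}

type Store interface {
	Query(series string, start, end int64) ([]Point, error)
}

// QuerySplit sends the portion of the range older than the export watermark to
// the Parquet-backed store, the rest to the in-memory store, and concatenates
// the results in time order.
func QuerySplit(hot, cold Store, watermark int64, series string, start, end int64) ([]Point, error) {
	var out []Point
	if start < watermark {
		pts, err := cold.Query(series, start, min(end, watermark))
		if err != nil {
			return nil, err
		}
		out = append(out, pts...)
	}
	if end > watermark {
		pts, err := hot.Query(series, max(start, watermark), end)
		if err != nil {
			return nil, err
		}
		out = append(out, pts...)
	}
	return out, nil
}
```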
Ops & security
Snapshots can slow ingest—are those pauses tunable and bounded? Any metrics/alerts for export lag, memory pressure, or cardinality spikes?
A couple of end-to-end examples (for queries) and a Helm chart/Terraform module would make trials much easier.
Is there any built-in monitoring/observability for Okapi itself, or plans to add it?
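By self-observability I mean roughly this kind of thing, sketched with the standard Prometheus Go client; the okapi_* metric names are ones I invented for illustration, not existing metrics:

```go
package selfmetrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical self-metrics I'd want Okapi to expose, so operators can alert
// on export lag, memory pressure, and cardinality spikes.
var (
	exportLagSeconds = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "okapi_parquet_export_lag_seconds",
		Help: "Age of the oldest in-memory data not yet exported to Parquet.",
	})
	memoryPressureRatio = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "okapi_hot_store_memory_used_ratio",
		Help: "Fraction of the configured hot-store memory budget in use.",
	})
	activeSeries = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "okapi_active_series",
		Help: "Number of distinct series currently tracked (cardinality).",
	})
)

// Serve exposes the metrics endpoint for scraping.
func Serve(addr string) error {
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(addr, nil)
}
```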
Overall: promising approach with a strong cost/flexibility angle. If you share Linux+concurrency benchmarks, ingest compatibility, and lake table format plans (Iceberg/Delta), I think a lot of folks here will try it out.
I realize some of the questions I raised may not be fully addressed yet given how early the project is. My goal isn’t to nitpick but to understand the direction the Okapi core team is planning to take and how they’re thinking about these areas over time. Really appreciate the work so far and looking forward to seeing how it evolves.