Show HN: Zeekstd – Rust Implementation of the ZSTD Seekable Format

(github.com)

86 points by rorosen 17 hours ago

12 comments

Hello,

I would like to share a Rust implementation of the Zstandard seekable format I've been working on.

Regular zstd compressed files consist of a single frame, meaning you have to start decompression at the beginning. The seekable format splits compressed data into a series of independent frames, each compressed individually, so that decompression of a section in the middle of an archive only requires zstd to decompress at most a frame's worth of extra data, instead of the entire archive.
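To make the frame arithmetic concrete, here is a minimal sketch of how a reader could map a decompressed byte range onto frames. It is not zeekstd's API: `FrameIndex`, `frame_at` and `frames_for` are hypothetical names, and the per-frame decompressed sizes are assumed to have already been read from the archive's seek table.

```rust
/// Hypothetical index over the frames of a seekable archive.
/// `starts[i]` is the decompressed offset at which frame `i` begins;
/// the final element is the total decompressed size.
pub struct FrameIndex {
    starts: Vec<u64>,
}

impl FrameIndex {
    /// Builds the index from the decompressed size of each frame, in order.
    pub fn new(decompressed_sizes: &[u64]) -> Self {
        let mut starts = Vec::with_capacity(decompressed_sizes.len() + 1);
        let mut total = 0u64;
        starts.push(total);
        for &size in decompressed_sizes {
            total += size;
            starts.push(total);
        }
        Self { starts }
    }

    /// Total decompressed size of the archive.
    pub fn total_size(&self) -> u64 {
        // `starts` always holds at least the leading 0.
        *self.starts.last().unwrap()
    }

    /// Index of the frame that contains decompressed byte `offset`,
    /// or `None` if the offset lies past the end of the data.
    pub fn frame_at(&self, offset: u64) -> Option<usize> {
        if offset >= self.total_size() {
            return None;
        }
        // Binary search: count the frame starts that are <= offset.
        Some(self.starts.partition_point(|&s| s <= offset) - 1)
    }

    /// First and last frame (inclusive) needed to read `len` bytes at `offset`.
    pub fn frames_for(&self, offset: u64, len: u64) -> Option<(usize, usize)> {
        if len == 0 {
            return None;
        }
        let end = offset.checked_add(len)?;
        if end > self.total_size() {
            return None;
        }
        Some((self.frame_at(offset)?, self.frame_at(end - 1)?))
    }
}
```

Only the frames in the returned range have to be decompressed, which is where the "at most a frame's worth of extra data" bound for a short read comes from.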

I started working with the seekable format because I wanted to resume downloads of big zstd compressed files that are decompressed and written to disk on the fly. At first I created and used bindings to the C functions that are available upstream[1]. However, I stumbled over the first segfault rather quickly (it's now fixed) and found out that the functions only allow basic things. After looking closer at the upstream implementation, I noticed that it uses functions of the core API that are now deprecated, and that it doesn't allow access to low-level (de)compression contexts. To me it looks like a PoC/demo implementation that isn't maintained the same way as the zstd core API, which is probably also the reason it's in the contrib directory.

My use-case seemed to require a complete rewrite of the seekable format, so I decided to implement it from scratch in Rust using bindings to the advanced zstd compression API, available from zstd 1.4.0.

The result is a single-dependency library crate[2], and a CLI crate[3] for the seekable format that feels similar to the regular zstd tool.

Any feedback is highly appreciated!

[1]: https://github.com/facebook/zstd/tree/dev/contrib/seekable_f...
[2]: https://crates.io/crates/zeekstd
[3]: https://github.com/rorosen/zeekstd/tree/main/cli

rwmj an hour ago

Seekable formats also allow random reads which lets you do trickery like booting qemu VMs from remotely hosted, compressed files (over HTTPS). We do this already for xz: https://libguestfs.org/nbdkit-xz-filter.1.html https://rwmj.wordpress.com/2018/11/23/nbdkit-xz-curl/

Has zstd actually standardized the seekable version? Last I checked (which was quite a while ago) it had not been declared a standard, so I was reluctant to write a filter for nbdkit, even though it's very much a requested feature.

simeonmiteff 2 hours ago

This is very cool. Nice work! At my day job, I have been using a Go library[1] to build tools that require seekable zstd, but felt a bit uncomfortable with the lack of broader support for the format.

Why zeek, BTW? Is it a play on "zstd" and "seek"? My employer is also the custodian of the zeek project (https://zeek.org), so I was confused for a second.

[1] https://github.com/SaveTheRbtz/zstd-seekable-format-go

  • rorosen an hour ago

    Thanks! I was also surprised that there are very few tools to work with the seekable format. I could imagine that at least some people have a use-case for it.

    Yes, the name is a combination of zstd and seek. Funnily enough, I wanted to name it just zeek first before I knew that it already exists, so I switched to zeekstd. You're not the first person asking me if there is any relation to zeek and I understand how that is misleading. In hindsight the name is a little unfortunate.

    • etyp 21 minutes ago

      Zeek is well known in "security" spaces, but not as much in "developer" spaces. It did get me a bit excited to see Zeek here until I realized it was unrelated, though :)

ncruces 34 minutes ago

How's tool support these days for compressing a file with seekable zstd?

Given existing libraries, it should be really simple to create an SQLite VFS for my Go driver that reads (not writes) compressed databases transparently, but tool support was kinda lacking.

Will the zstd CLI ever support it? https://github.com/facebook/zstd/issues/2121

stu2010 2 hours ago

This is cool, I'd say that the most common tool in this space is bgzip[1]. Have you thought about training a dictionary on the first few chunks of each file and embedding the dictionary in a skippable frame at the start? Likely makes less difference if your chunk size is 2MB, but at smaller chunk sizes that could have significant benefit.

[1] https://www.htslib.org/doc/bgzip.html

  • jeroenhd an hour ago

    Looking at the spec (https://github.com/facebook/zstd/blob/dev/contrib/seekable_f...), I don't see any mention of custom dictionaries like you describe.

    The spec does mention:

    > While only Checksum_Flag currently exists, there are 7 other bits in this field that can be used for future changes to the format, for example the addition of inline dictionaries.

    so I don't think seekable zstd supports these dictionaries just yet.
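For readers who have not looked at the spec, the descriptor byte quoted above sits in a 9-byte footer at the very end of the archive. The sketch below parses that footer as I understand the layout (a 4-byte little-endian frame count, the 1-byte descriptor with `Checksum_Flag` in the top bit, then the 4-byte magic number `0x8F92EAB1`). The type and function names are made up for illustration, and the offsets and bit positions should be verified against the spec before relying on them.

```rust
/// Magic number that terminates a seekable archive (value as given in the
/// seekable format spec; treat it as an assumption until checked).
const SEEKABLE_MAGIC: u32 = 0x8F92_EAB1;
/// Assumed footer size: 4 (frame count) + 1 (descriptor) + 4 (magic) bytes.
const FOOTER_LEN: usize = 9;

/// Hypothetical parsed form of the seek table footer.
#[derive(Debug, PartialEq)]
pub struct SeekTableFooter {
    pub number_of_frames: u32,
    /// Top bit of the descriptor: entries carry a per-frame checksum.
    pub checksum_flag: bool,
    /// The 7 remaining descriptor bits, kept raw for future format changes.
    pub other_bits: u8,
}

impl SeekTableFooter {
    /// Size of one seek table entry: compressed size and decompressed size
    /// (4 bytes each), plus an optional 4-byte checksum.
    pub fn entry_size(&self) -> usize {
        if self.checksum_flag { 12 } else { 8 }
    }
}

/// Parses the footer from the last bytes of an archive.
/// Returns `None` if the input is too short or the magic does not match.
pub fn parse_footer(tail: &[u8]) -> Option<SeekTableFooter> {
    if tail.len() < FOOTER_LEN {
        return None;
    }
    let f = &tail[tail.len() - FOOTER_LEN..];
    let magic = u32::from_le_bytes([f[5], f[6], f[7], f[8]]);
    if magic != SEEKABLE_MAGIC {
        return None;
    }
    let descriptor = f[4];
    Some(SeekTableFooter {
        number_of_frames: u32::from_le_bytes([f[0], f[1], f[2], f[3]]),
        checksum_flag: (descriptor & 0x80) != 0,
        other_bits: descriptor & 0x7F,
    })
}
```

A reader that wants to stay forward compatible could refuse archives where `other_bits` is non-zero, since an inline-dictionary extension would presumably be signalled there.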

    With multiple inline dictionaries, one could detect when new chunks compress badly with the previous dictionary and train new ones on the fly. Could be useful for compressing formats with headers and mixed data (e.g. game files, which can contain a mix of text + audio + video, or just regular old .tar files I suppose).

tyilo an hour ago

I already use zstd_seekable (https://docs.rs/zstd-seekable/) in a project. Could you compare the APIs of this crate and yours?

  • tyilo an hour ago

    Correct me if I'm wrong, but it doesn't seem like you provide the equivalent of Seekable::decompress in zstd_seekable which decompresses at a specific offset, without having to calculate which frame(s) to decompress.

    This is basically the only function I use from zstd_seekable, so it would be nice to have that in zeekstd as well.
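For illustration, an offset-based read can be layered on top of any frame-level decompressor. The sketch below is not code from zeekstd or zstd_seekable: `read_at` is a hypothetical helper, `frame_sizes` stands for the decompressed frame sizes taken from the seek table, and `decompress_frame` is an assumed callback that returns the full decompressed contents of frame `i`.

```rust
/// Returns up to `len` decompressed bytes starting at decompressed `offset`.
///
/// `frame_sizes[i]` is the decompressed size of frame `i`.
/// `decompress_frame(i)` is a caller-supplied (hypothetical) callback that
/// must return exactly `frame_sizes[i]` bytes; a shorter result would make
/// the slicing below panic.
/// Reads that run past the end of the data return fewer bytes.
pub fn read_at<F>(
    frame_sizes: &[usize],
    offset: usize,
    len: usize,
    mut decompress_frame: F,
) -> Vec<u8>
where
    F: FnMut(usize) -> Vec<u8>,
{
    let mut out = Vec::new();
    if len == 0 {
        return out;
    }
    let end = offset.saturating_add(len);
    let mut frame_start = 0usize;
    // A linear scan keeps the sketch short; a real implementation would
    // likely binary-search precomputed cumulative offsets instead.
    for (i, &size) in frame_sizes.iter().enumerate() {
        let frame_end = frame_start + size;
        // Decompress only frames that overlap the requested range.
        if frame_end > offset && frame_start < end {
            let data = decompress_frame(i);
            let from = offset.max(frame_start) - frame_start;
            let to = end.min(frame_end) - frame_start;
            out.extend_from_slice(&data[from..to]);
        }
        if frame_end >= end {
            break;
        }
        frame_start = frame_end;
    }
    out
}
```

The caller never has to work out frame boundaries itself; the price is that each partially covered frame is decompressed in full and then trimmed.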

77pt77 37 minutes ago

BTW, something similar can be done with zlib/gzip.