Streaming partial file content via range requests in Python.

While trying to profile the entire conda package repository (for a tool, impscan), I found I needed only specific byte ranges within each file. I soon discovered HTTP range requests, which the servers I was requesting data from supported.

Zip archives have a 'contents list' of sorts (the central directory), but it sits at the end of the binary file. If you know which files you're looking for, you can 'look them up' in this 'directory' within the ZIP archive, and then use range requests to inspect just those files (and perhaps only part of each).
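To make that concrete, here's a stdlib-only sketch of locating the central directory from a zip's final bytes. Over HTTP you'd fetch this tail with a `Range: bytes=-N` request header; here the zip is built in memory and sliced instead, but the parsing is identical:

```python
import io
import struct
import zipfile

# Build a tiny zip in memory to stand in for a remote archive.
data = io.BytesIO()
with zipfile.ZipFile(data, "w") as zf:
    zf.writestr("hello.txt", "hello world")
blob = data.getvalue()

# What a range request for the end of the file would return:
tail = blob[-1024:]

# The End of Central Directory record (signature PK\x05\x06) lives in
# the final bytes; 16 bytes after the signature is a 4-byte little-endian
# offset pointing at the start of the central directory.
eocd = tail.rfind(b"PK\x05\x06")
cd_offset, = struct.unpack("<I", tail[eocd + 16 : eocd + 20])
print(cd_offset)  # where a second range request would start reading
```

From there, a second range request starting at `cd_offset` retrieves the directory entries, each of which records where its file's data begins.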

This would clearly save a lot of downloading, and therefore (I hoped) speed up the process of scanning the many large ZIP archives I was profiling.

Next I came across an approach I'd never seen before: a class to handle a binary HTTP stream as a file-like object.
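The gist of that approach can be sketched as follows. This is a hypothetical minimal version, not the original class: it wraps an iterator of binary chunks (such as httpx's `response.iter_bytes()`) so that code expecting `.read()` can consume the stream:

```python
import io

class StreamFile(io.RawIOBase):
    """Wrap an iterator of byte chunks as a file-like object (sketch)."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buf = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Refill the internal buffer until we can satisfy the request
        # (or the chunk iterator is exhausted, signalling EOF).
        while len(self._buf) < len(b):
            chunk = next(self._chunks, None)
            if chunk is None:
                break
            self._buf += chunk
        n = min(len(b), len(self._buf))
        b[:n] = self._buf[:n]
        self._buf = self._buf[n:]
        return n

# Simulated response chunks standing in for a real HTTP stream:
f = io.BufferedReader(StreamFile([b"PNG ", b"data ", b"stream"]))
print(f.read(8))  # b'PNG data'
```

Subclassing `io.RawIOBase` means `io.BufferedReader` (and anything else expecting a file) works on top of it for free.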

Problem: this wouldn't work for the aforementioned ZIP archives. A streaming HTTP response must be read linearly, so skipping to the 'directory' of archive contents at the end requires downloading all the preceding bytes, making it little better than downloading the file in its entirety.

Solution: I extended the class to structure range-based requests as objects in a range-based data structure.

The structure I used was a RangeDict from the python-ranges package.

I later added a second RangeDict to handle 'windows' on a stream, for other file formats where (unlike ZIPs, whose layout is thoroughly unsuited to linear reads) it was actually optimal to use a single request.

Since a newly added range on a stream may overlap an existing one, I left this behaviour configurable, but the default is simply to 'trim' any pre-existing range so that it no longer overlaps. A trimmed range can no longer be read to its untrimmed end, and if trimming leaves a range empty, it gets deleted.
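That trimming policy can be sketched in plain Python (this is an illustration under my reading of the rule above, not the library's actual implementation), with half-open `(start, end)` tuples standing in for ranges:

```python
def add_range(ranges, new):
    """Add `new` to a list of (start, end) ranges, trimming overlaps."""
    new_start, new_end = new
    kept = []
    for start, end in ranges:
        if start < new_end and new_start < end:  # overlaps the new range
            end = min(end, new_start)  # trimmed: can't read to the old end
            if end <= start:
                continue  # empty after trimming: deleted
        kept.append((start, end))
    kept.append(new)
    return kept

ranges = add_range([(0, 100)], (50, 150))
print(ranges)  # [(0, 50), (50, 150)]
```

A pre-existing range entirely inside the new one trims down to nothing and so disappears, as described above.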

This felt harder to explain in my own notes than it was to code; sketching it out helped.

Codecs supported at the time of writing are ZIP (regular and conda), PNG, and zstd (Facebook's compression format). A tarball codec turned out to be moot: solid compression is incompatible with range streaming (the entire tarball is compressed as one stream, so it's not possible to extract a single file the way it is for zip archives, where files are compressed individually).
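The zip side of that contrast is easy to demonstrate with the stdlib: each member has its own local header offset and compressed size recorded in the central directory, so a range request covering just that slice is enough to extract it. (A `.tar.gz` has no such per-member boundaries; the gzip stream must be decompressed from the start.)

```python
import io
import zipfile

# A small two-member zip built in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "alpha " * 100)
    zf.writestr("b.txt", "bravo " * 100)

with zipfile.ZipFile(buf) as zf:
    # Each member's compressed data sits at its own offset:
    for info in zf.infolist():
        print(info.filename, info.header_offset, info.compress_size)
    # Reading one member needs only that member's bytes:
    assert zf.read("b.txt").startswith(b"bravo")
```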

I went on to use the PngStream codec in wikitransp, which motivated the addition of async support and a fetcher class that can be constructed with a list of URLs and then fetch them asynchronously. This greatly sped up scraping PNG files en masse, with control over exactly what was read from each stream to minimise request overhead.
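The shape of such a fetcher can be sketched with asyncio. This is a hypothetical stand-in, not the library's actual class: `fetch_one` simulates a request with a sleep rather than using httpx, but the concurrency pattern (gathering one coroutine per URL) is the same:

```python
import asyncio

class AsyncFetcher:
    """Hypothetical sketch: construct with URLs, fetch them concurrently."""

    def __init__(self, urls):
        self.urls = urls

    async def fetch_one(self, url):
        await asyncio.sleep(0.01)  # stand-in for network latency
        return f"<data from {url}>"

    async def fetch_all(self):
        # Launch every fetch at once and await them all.
        return await asyncio.gather(*(self.fetch_one(u) for u in self.urls))

urls = [f"https://example.com/img{i}.png" for i in range(3)]
results = asyncio.run(AsyncFetcher(urls).fetch_all())
print(results)
```

Because the fetches overlap in time, total wall-clock time approaches that of the slowest single request rather than the sum of all of them.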

Notes on packaging

This library was a pleasure to write with types but later became a pain to document with them, due to an incompatibility between Sphinx's sphinx-autodoc-typehints extension and the httpx library (which was doing all of the HTTP range requests).

I did manage to get it to build, and now that I know the trick it's not difficult to cater to Sphinx, but I dislike 'fighting with my tools' like this.

Besides that, type checking caught many, many bugs before runtime. Any additional overhead from having to declare the types involved in each function was (I feel) recouped in a greater ability to intuit what the code is doing from its annotations, particularly when coming back to it later. Much like tests, they help to 're-load your short-term memory' with what's going on in a particular zone of the program.

Speaking of tests, this package was also a pleasure to test. Initially I reached 100% test coverage, which was a big motivator, and one that didn't fade much after coverage slipped to 86%. I chose to let it sink rather than pepper the code with test-skip comments, as certain things simply don't need testing when you rely on other packages that should do what they say on the tin.

Async tests seem to wreak havoc in the CI pipeline (though not on my machine), which is now prone to hanging until it times out. I don't think that's a good thing, and it may mean I run through my allotted monthly CI time on GitHub Actions.

The docs are very thorough, and I hope the library will find the users who need it.