Fix DecompressMemMapReader.read returning b'' before EOF (#120)#135

mvanhorn

Summary

Closes #120

DecompressMemMapReader.read(N) could return b'' before _br == _s when N was small enough that the underlying zlib.decompress(indata, N) consumed input bytes without producing output (e.g. while ingesting the zlib header / dictionary frames). Callers using the standard while chunk := stream.read(N) idiom terminated at the first empty chunk -- before reaching the actual end of the uncompressed object. Reproduced with chunk sizes 1, 4, 16 against a 13 KB stream of compressible data.

Why the original guard existed

The previous if dcompdat and ... guard at the recursion site was added so that compressed_bytes_read() could drive read() in a scrub-loop that intentionally manipulates _br = 0 while the inner zlib object is already past EOF. Without that guard, the scrub loop would recurse forever because _br < _s stays true.

Fix

Convert the recursive "refill to size" branch into an iterative loop around the existing decompress block. The loop terminates when:

len(dcompdat) >= size (caller's request satisfied), OR
_br >= _s (uncompressed object fully read), OR
Inner zlib produced an empty chunk AND (zlib has hit EOF, OR the input window has run off self._m, OR no compressed bytes were consumed on this turn).

Condition 3 preserves the compressed_bytes_read() scrub safety: once the inner zip is at EOF, the loop breaks instead of looping forever. It also bounds the truncated-stream case (zip.eof never becomes true) by the input-exhaustion guard, and bounds the crafted "many empty deflate blocks" attack -- previously this could blow Python's recursion depth at ~1500+ stored-block headers because each recursion only consumed a handful of input bytes; the iterative form walks the stream forward without growing the stack.

Tests

New: test_decompress_reader_chunked_read_does_not_terminate_early reads a 13 KB highly-compressible stream with chunk sizes 1, 4, 16, 64 and asserts the full payload is returned and _br == _s.
Existing: gitdb/test/ (24 tests) all pass, including test_decompress_reader_special_case and test_pack -- both exercise the compressed_bytes_read scrub path.

$ pytest gitdb/test ======================== 24 passed, 1 skipped in 7.10s ========================

Manual smoke

import zlib
from gitdb import DecompressMemMapReader

data = b"hello world! " * 1000
zdata = zlib.compress(data)
for chunk_size in (1, 4, 16, 64, 4096):
    r = DecompressMemMapReader(zdata, close_on_deletion=False, size=len(data))
    out = b""
    while chunk := r.read(chunk_size):
        out += chunk
    assert out == data, chunk_size

(Pre-fix this fails for chunk_size in (1, 4, 16).)

Truncated input is handled the same as before: read() returns whatever was decoded so far, then b'', instead of looping or raising. The crafted recursion-depth attack (~5000 empty deflate blocks ahead of one valid block) now decodes correctly instead of raising RecursionError.

Byron

Thanks a lot.

This is interesting because nobody should be using this pure Python object database. But I guess for those who are, having this fixed is valuable.

Let's see if Copilot finds anything, but if not I think this should be merged, even though I only did an armchair review and admittedly don't understand a thing, trusting the test-suite fully here.
The test of course makes sense, and I agree that it should be working.

Copilot

Pull request overview

Fixes DecompressMemMapReader.read(N) returning b'' before true EOF for small N, which could prematurely terminate standard chunked read loops (issue #120). The change replaces a recursive refill strategy with an iterative decompression loop that continues as long as zlib is making forward progress, while still preserving the compressed_bytes_read() “scrub” behavior and avoiding recursion-depth failures on pathological inputs.

Changes:

Refactor DecompressMemMapReader.read() from recursion to an iterative loop that tolerates zlib consuming input without producing output.
Add a regression test ensuring chunked reads (1/4/16/64 bytes) return the full payload and reach _br == _s.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description

gitdb/stream.py	Reworks decompression refill logic to avoid empty reads before EOF and removes recursion to prevent stack overflows on adversarial streams.
gitdb/test/test_stream.py	Adds a regression test covering chunked reads that previously could terminate early.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Byron requested a review from Copilot May 8, 2026 23:58

Copilot started reviewing on behalf of Byron May 8, 2026 23:59 View session

Copilot AI reviewed May 9, 2026

View reviewed changes

Comment thread gitdb/stream.py Outdated Show resolved Hide resolved

Comment thread gitdb/stream.py Outdated Show resolved Hide resolved

Comment thread gitdb/test/test_stream.py Outdated Show resolved Hide resolved

Apply suggestions from code review …

ed41d2c

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix DecompressMemMapReader.read returning b'' before EOF (#120)#135