Fix UnicodeDecodeError when reading packed-refs with non-UTF8 characters by MirrorDNA-Reflection-Protocol · Pull Request #2091 · gitpython-developers/GitPython · GitHub
Fix UnicodeDecodeError when reading packed-refs with non-UTF8 characters #2091

Draft
MirrorDNA-Reflection-Protocol wants to merge 3 commits into gitpython-developers:main from MirrorDNA-Reflection-Protocol:fix-packed-refs-encoding
Summary

Fixes #2064

The packed-refs file can contain ref names that are not valid UTF-8 (e.g., Latin-1 encoded tag names created by older Git versions or systems with different locale settings). Previously, GitPython would fail with UnicodeDecodeError when reading such files.

Reproduction

As described in #2064:

    git clone https://github.com/ACRA/acra
    cd acra
    python -c 'import git; print(git.Repo(".").tags)'

Before fix:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 6216: invalid continuation byte

After fix: Successfully reads all 101 tags.

Changes

  • Add errors='surrogateescape' to the open() call in _iter_packed_refs()
  • This allows reading files with arbitrary byte sequences while preserving valid UTF-8 as text
  • Add test that verifies non-UTF8 packed-refs can be read successfully
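The change described above can be illustrated with a minimal sketch. This is not GitPython's actual `_iter_packed_refs()` code: the function name `iter_packed_refs_sketch`, the simplified line parsing, and the sample file contents are all assumptions made for illustration. It only shows how passing `errors='surrogateescape'` to `open()` lets a packed-refs style file with a Latin-1 byte in a tag name be read as text.

```python
import os
import tempfile


def iter_packed_refs_sketch(path):
    """Yield (sha, ref_name) pairs from a packed-refs style file.

    Hypothetical helper, not GitPython's implementation. Comment lines
    ('#') and peeled-tag lines ('^') are skipped for simplicity.
    """
    # errors="surrogateescape" maps undecodable bytes to lone surrogates
    # instead of raising UnicodeDecodeError.
    with open(path, "rt", encoding="UTF-8", errors="surrogateescape") as fp:
        for line in fp:
            line = line.strip()
            if not line or line.startswith("#") or line.startswith("^"):
                continue
            sha, name = line.split(" ", 1)
            yield sha, name


# Build a small sample file; the tag name "caf\xe9" is Latin-1, not UTF-8.
sha = "a" * 40
raw = (
    b"# pack-refs with: peeled fully-peeled sorted \n"
    + sha.encode("ascii") + b" refs/tags/caf\xe9\n"
    + sha.encode("ascii") + b" refs/tags/v1.0\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "packed-refs")
    with open(path, "wb") as f:
        f.write(raw)
    refs = list(iter_packed_refs_sketch(path))

for _, name in refs:
    # repr() is used because a lone surrogate may not be printable as-is.
    print(repr(name))
```

With the default strict error handling, the same `open()` call would raise `UnicodeDecodeError` as soon as the line containing the `0xe9` byte is read.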

Technical Details

The surrogateescape error handler is Python's standard approach for handling potentially non-UTF8 data in filesystem operations. It:

  • Passes through valid UTF-8 unchanged
  • Converts invalid byte sequences to Unicode surrogate characters (\uDC80-\uDCFF)
  • Preserves the original bytes in a reversible way (can be re-encoded back to original bytes)

This is the same error handler that Python's os.fsdecode() applies on POSIX systems, and it is the usual choice for filesystem data whose encoding may be unknown or mixed.
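The three properties listed above can be checked directly on byte strings. The sketch below uses only the built-in `bytes.decode` and `str.encode` methods; the sample ref name is made up for illustration. It also shows one caveat worth keeping in mind: a string that contains escaped bytes cannot be encoded with the default strict handler.

```python
# A ref name whose last byte is Latin-1 'é' (0xE9), which is invalid UTF-8 here.
raw = b"refs/tags/caf\xe9"

# Strict decoding fails, which is the error the PR addresses.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape maps the bad byte 0xE9 to the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")
last_char = text[-1]

# Re-encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# Valid UTF-8 is unaffected by the handler.
valid = "refs/tags/café".encode("utf-8")
valid_text = valid.decode("utf-8", errors="surrogateescape")

# Caveat: strict encoding of the escaped string raises UnicodeEncodeError.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```

Code that later writes such ref names to a file, a terminal, or a subprocess would need to use `errors='surrogateescape'` on the encoding side as well.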

Fixes gitpython-developers#2064

The packed-refs file can contain ref names that are not valid UTF-8 (e.g., Latin-1 encoded tag names created by older Git versions or non-UTF8 systems). Previously, opening the file with encoding='UTF-8' would raise UnicodeDecodeError.

Changes:
- Add errors='surrogateescape' to the open() call in _iter_packed_refs()
- This allows reading files with arbitrary byte sequences while still treating valid UTF-8 as text
- Add test that verifies non-UTF8 packed-refs can be read successfully

The 'surrogateescape' error handler is the standard Python approach for handling potentially non-UTF8 data in filesystem operations, as it preserves the original bytes in a reversible way.

Byron requested a review from Copilot December 7, 2025 15:52
Byron marked this pull request as draft December 7, 2025 15:53
Copilot AI left a comment

Pull request overview

This PR fixes a UnicodeDecodeError that occurred when GitPython attempted to read packed-refs files containing ref names encoded with non-UTF-8 character encodings (e.g., Latin-1 encoded tag names from older Git versions). The fix uses Python's surrogateescape error handler, which is the standard approach for handling filesystem operations with potentially mixed or unknown encodings.

Key changes:

  • Adds errors='surrogateescape' parameter to file reading in _iter_packed_refs() method
  • Adds comprehensive test that reproduces and verifies the fix for the Unicode decoding issue

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
git/refs/symbolic.py | Adds errors='surrogateescape' to the packed-refs file reader to handle non-UTF8 encoded ref names gracefully
test/test_refs.py | Adds test case that creates a packed-refs file with Latin-1 encoded ref name and verifies it can be read without errors


Development

Successfully merging this pull request may close these issues.

Encoding issue with tags in packed-refs file
