Sorry, something went wrong.
There was a problem hiding this comment.
Thanks so much for this wonderful PR! It's very well thought out and a delight. Particularly the fuzzing/README.md is a great introduction to how it all works. >[!TIP] is also new to me, and something I hope to be using more often in future.
About Fuzzing-FixturesOverall, the extra-files are just about 2MB in total, and I wonder how stable they will be. If stable, I could imagine including them in this repo.
If there is any concerns about this, gitpython-developers could fork the repo and make you an admin on it.
I created a new archive from the branch of this PR with python -m build --sdist --wheel and didn't see any trace of fuzzing in the produced archive.
It seems good.
Regarding the OSS fuzz PRI didn't review it for correctness, but approved it in case it helps getting it merge.
Regarding my reviewI just glanced at the shell scripts, and basically ignored the google-provided fuzz targets.
All that's left for me is to figure out where to put the ~2MB of corpus files.
And, thanks again for picking this up :)!
Sorry, something went wrong.
|
Thanks, I appreciate the kind words 🙂 And yes, >[!TIP] is a handy feature for sure. There are a few others as well. About Fuzzing-FixturesOverall, the extra-files are just about 2MB in total, and I wonder how stable they will be. If stable, I could imagine including them in this repo. If there is any concerns about this, gitpython-developers could fork the repo and make you an admin on it. I think a dedicated repo under the gitpython-developers org would be the best option. No need to fork mine though, a new one with whatever license you prefer would be fine. I have no preference as far my own permissions though. Ideally, a seed corpus would be added for each new fuzz target. Obviously that means more files but it doesn't necessarily mean "10 fuzz targets = 10mb". The size & number of interesting inputs will change over time and depending on the complexity/number of code paths in the functionality under test, and it's hard to know what that equates to in terms of per target corpus size at this point. The two zips in my repo right now actually demonstrate that pretty nicely: both of them basically top out their respective coverage quickly but one zip is ~450kb while the other is ~1.5mb. I like how Bitcoin Core handles in https://github.com/bitcoin-core/qa-assets. From what I've seen, that seems like the cleanest and most flexible. About archives for publishing on PyPyI created a new archive from the branch of this PR with python -m build --sdist --wheel and didn't see any trace of fuzzing in the produced archive. Regarding the OSS fuzz PRI didn't review it for correctness, but approved it in case it helps getting it merge. Regarding my reviewI just glanced at the shell scripts, and basically ignored the google-provided fuzz targets. Awesome thanks! |
Sorry, something went wrong.
Sorry, something went wrong.
|
Thanks @Byron! I've updated the URL to unblock this PR and I'll follow up with a PR to cleanup the new qa-assets repo following the pattern of the Bitcoin one. |
Sorry, something went wrong.
There was a problem hiding this comment.
At least one and possibly two changes will be needed to comply with the terms of the Apache License. That license imposes requirements for distributing the licensed work, including, in subsection 4(a):
You must give any other recipients of the Work or Derivative Works a copy of this License
This is to say that merely preserving the license headers (even with other acknowledgement) is not sufficient, and the license text itself needs to be included somewhere in the GitPython repository.
It seems to me that the most hassle-free way to achieve this is to copy the license file from the repository the Apache 2.0 licensed Python scripts originally came from (so long as there is nothing unusual about it) into the fuzzing/ directory, and name it LICENSE-APACHE.
The reason to put in the fuzzing/ directory rather than in the fuzzing/fuzz-targets/ subdirectory that contains those scripts, is that new files not covered by that license may end up in there later. So putting it there wouldn't really make it more specific. There is also a benefit to having it in a directory that also contains a README.md that covers, among other code, the code it pertains to--especially if the details about what it covers are also moved from the main readme into that one, as I suggest in a review comment below. But it could go in the fuzzing/fuzz-targets/ subdirectory if you prefer.
It shouldn't go at the top level of the repository, because that could lead to confusion, and also because (at least without further changes) it would end up in built sdist and wheel packages. Currently nothing from the fuzzing/ directory is included in such packages; I have verified this (as has Byron).
The slightly trickier part is that the Apache License also requires that changes be noted. Per 4(b):
You must cause any modified files to carry prominent notices stating that You changed the files
I believe that does apply here. The subtle aspect of this is that this is often not followed, and I think is not expected to be followed, when proposing changes for inclusion in an existing project that the files came from. For example, pull requests on a project that uses the Apache License often do not, and I think are not expected to, add any self-attribution, if the contributors decide they don't need to be attributed individually.
So I think anything that has formerly been published under this license in the OSS-Fuzz repository or other such places, if it included your modifications there, should not require that anything be added to put it here.
But it looks like your changes to the Apache licensed files were not ever included there. If this is where they are first published, then I think a note has to be added. If you want it to be a copyright notice naming you, I think that is fine. If not, then it could just say "This file has been modified by" followed by your name, or "GitPython contributors," or "contributors to GitPython," or something like that.
Please definitely do say if you think I am mistaken about any of that, or even if you are not sure but are worried I may be mistaken.
Besides the above, I think there are no other blockers for merging this. I've posted a number of comments below, some recommending other changes.
Sorry, something went wrong.
| shutil.rmtree(git_dir) | ||
|
|
||
| os.mkdir(git_dir) | ||
| with open(head_file, "w") as f: |
There was a problem hiding this comment.
Unless there is a specific reason to do otherwise, I recommend specifying an encoding explicitly, for example, encoding="utf-8". Otherwise it is unclear if the reason it is omitted is just because it is expected not to make a practical difference across different platforms, or because a difference is actually desired. (I am guessing the reason is the former, because I don't think the fuzz tests run on Windows, which is where these issues usually come into play.) If omitting it is intentional, then it might be worthwhile to include a comment to state the reason.
Sorry, something went wrong.
| python3 -m pip install . | ||
|
|
||
| # Directory to look in for dictionaries, options files, and seed corpa: | ||
| SEED_DATA_DIR="$SRC/seed_data" |
There was a problem hiding this comment.
I think SRC comes from outside this script and is set in such a way that it is very unlikely to start with -. I have refrained from mentioning places where -- should be added before arguments that have $SRC as a prefix if the value of SRC can begin, even accidentally, with -. Most commands accept --, but not all, and I have no objection to omitting it on the occasion it is truly known not to be needed. I'll assume such comments may not be wanted, unless you request them. The same goes for container-environment-boostrap.sh.
Sorry, something went wrong.
There was a problem hiding this comment.
I think SRC comes from outside this script and is set in such a way that it is very unlikely to start with -.
You are correct. I explained the OSS-Fuzz provided env vars a bit more in this comment: #1901 (comment). These variables are certainly a case where -- is truly known not to be needed, because if the omission of it results in these scripts breaking then most (more likely, all) of the other OSS-Fuzz projects would break as well.
That said, I wasn't particularly happy with having the "$SRC/seed_data" directory specified in two separate scripts, where one implicitly depends on the other creating it. If nothing else, that feels typo prone.
Initially I considered having a third script that just defined the shared variable and sourceing it in each, but that felt like overkill.
Ultimately, I think the best thing to do would be to remove the need for that directory all together. When I get around to the changes I described in a reply to you here: #1903 (comment), I should be able to do all of the dictionary & corpus zip creation inside container-environment-bootstrap.sh (i.e., during the Docker image build step) and put everything in $OUT/ directly from there. That would allow the removal of the "$SRC/seed_data" directory completely, and also make the fuzz harness build step more efficient because it wouldn't need to produce the fixture artifacts every time.
I'm open to alternative suggestions though 🙂
I'll leave this comment unresolved for now for my own reference
Sorry, something went wrong.
|
Thanks for the thorough review @EliahKagan! It's very helpful, and I'm particularly appreciative of the thoughtful explanations and suggestions you shared. I'll push some changes shortly to address your feedback, prioritizing the updates related to the license, so we can get this merged and unblock the downstream OSS-Fuzz PR. As for anything else that you explicitly noted as non-blocking: I'll either address them in a follow-up, or reply to in the |
Sorry, something went wrong.
|
@EliahKagan, commit 945a767 addresses the blocking concerns regarding the licensing and documentation of Apache licensed files. Please see the diff of that commit for specifics. Sorry about the force-push; I wanted to make sure the OSS-Fuzz commit at the time I copied the Apache license file from was captured in the commit message.1 Regarding: It seems to me that the most hassle-free way to achieve this is to copy the license file from the repository the Apache 2.0 licensed Python scripts originally came from (so long as there is nothing unusual about it) into the fuzzing/ directory, and name it LICENSE-APACHE. I did exactly that. I also checked the diff of the contents of the file included here against https://www.apache.org/licenses/LICENSE-2.0.txt just to be sure; it looks good. Diff Results--- fuzzing/LICENSE-APACHE
+++ fuzzing/LICENSE-APACHE-from-website-src
@@ -1,3 +1,4 @@
+
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
@@ -178,7 +179,7 @@
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
- boilerplate notice, with the fields enclosed by brackets "{}"
+ boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
@@ -186,7 +187,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.
- Copyright {yyyy} {name of copyright owner}
+ Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -198,4 +199,4 @@
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
- limitations under the License.
+ limitations under the License.
\ No newline at end of file
CC @Byron Footnotes |
Sorry, something went wrong.
There was a problem hiding this comment.
Thanks for sorting this out!
It looks good to me, but I will wait for @EliahKagan to double-check.
Sorry, something went wrong.
There was a problem hiding this comment.
Looks great, thanks!
Sorry, something went wrong.
|
FYI, I'm hesitant to push more commits here, including via the suggestion comments, because I don't want to keep making the review approvals stale. I'll reply to the remaining as necessary of course, but I'm going to start a new branch for them unless someone suggests otherwise. Lmk if either of you would prefer them on this PR directly. |
Sorry, something went wrong.
|
FYI, I'm hesitant to push more commits here, including via the suggestion comments, because I don't want to keep making the review approvals stale. I'll reply to the remaining as necessary of course, but I'm going to start a new branch for them unless someone suggests otherwise. Lmk if either of you would prefer them on this PR directly. I agree. I think this PR is in good shape to be merged, and further refinements can be made in one or more subsequent PRs. This is the case even for future adjustments, if any, that would affect how the licensing situation is described. Everything that was blocking has been fully resolved. |
Sorry, something went wrong.
|
FWIW @EliahKagan I tackled most of the simpler changes and put up a Draft PR: #1903 |
Sorry, something went wrong.
There was a problem hiding this comment.
Apologies, I read the last two comments too late.
However, what follows is still something I consider feasible, nothing is going to be stale. Instead, I prefer to have all the existing comments in one place and resolve conversations.
Never mind :D - see #1903 .
Thanks a lot for keeping working on the refinements. I thought I'd wait until the suggestions are committed (they can be batched), and a few more things are cleaned up.
Regarding the hardcoded /tmp path that then expects a /tmp/.git, I agree that this under no circumstances should be running in an unisolated environment. Since the documentation clearly states that docker is needed, and if it's unlikely to impossible to run outside (or to get to that line), then I think it's OK to leave it.
It seems to be common practice in the OSS-fuzz project as well, and portability is not their concern.
Thanks for bearing with us, this PR is already fantastic, and so is working with you on this!
Sorry, something went wrong.
|
Thanks both! I updated the branch for #1903 and marked it ready for review. I agree the state of fuzz_tree.py is not ideal and appreciate the feedback you each left about it. I've given it some thought as well but I hadn't considered some of the points raised. I still plan to share in a reply on the unresolved thread here, I just haven't had a chance to yet. I should be able to in a few hours. |
Sorry, something went wrong.
|
Regarding the hardcoded /tmp path that then expects a /tmp/.git, I agree that this under no circumstances should be running in an unisolated environment. Since the documentation clearly states that docker is needed, and if it's unlikely to impossible to run outside (or to get to that line), then I think it's OK to leave it. To be clear, Docker is not needed, so this can be the real, unisolated /tmp, and the documentation presents that as a reasonable thing to do under some circumstances. It was with awareness of this that I recommended this be merged even before that is fixed. However, I realize I omitted a piece of my reasoning: I think it's easier to fix this after both this PR and google/oss-fuzz#11803 are merged (the latter is not merged yet but looks like it will be merged soon), because that will make it easier to test the fix both outside and inside Docker. (The Docker case is, I believe, more important, since that's how the tests are automatically run on the OSS-Fuzz infrastructure, and running individual tests outside of Docker is, if I understand correctly, mostly of interest when working on the fuzz tests themselves.) For the same reason, I think it is justified for #1903 not to include a fix for this, since the changes in #1903 otherwise could be merged at any time and do not need to wait on google/oss-fuzz#11803. |
Sorry, something went wrong.
|
Ok, sorry for the delay. Below are my thoughts regarding fuzz_tree.py, but also, a little extra context on the OSS-Fuzz execution environment because it may be helpful for thinking about #1901 (comment). About the OSS-Fuzz EnvironmentYou both touched on it, but for the sake of shared understanding, the scripts and tests look like they do in part because of the OSS-Fuzz container environment implementation details. The key points to understand are:
Build Stage Execution Environmentbuild.sh script environment
Everything else is read-only. Fuzzer Execution Stage EnvironmentFile system Regarding the Fuzz TestsWhen I initially started on this, my primary objective was really only about fixing the build. I did try to clean them up a bit in the process and apply some best practices, but those changes were mostly incidental and I intentionally avoided any significant refactors. After we decided to bring them here, my focus shifted to lowering the barrier to entry for contributors because of the nature of the project. In particular, I tried to distill the useful information from various place that I haven't seen documented in one place yet. One such tip is the direct execution section, but in hindsight I overlooked the environment assumptions in fuzz_tree.py. On fuzz_tree.py SpecificallyThis test is less than ideal for several reasons.
In addition to what has been mentioned, modifying global state and I/O operations (e.g., writing to the temp dir) are problematic for fuzzing in general. Further, @EliahKagan's remark in #1901 (comment) about "implicating other things in /tmp" made me realize this may already be causing issues inside the OSS-Fuzz execution environment. A non-obvious implementation detail specific to Python projects in OSS-Fuzz that the compile_python_fuzzer function in the build.sh script uses Pyinstaller to bundle the fuzz harness and all related dependencies into a single file archive that gets transferred to the fuzzer execution stage, where it is unpacked into a subdirectory of /tmp for execution. Which, I believe means, the test execution is actually using the test executable as a fixture 🥴 I have yet to confirm this, but I wouldn't be surprised if that is the cause of the issue I described as a possible " circular dependency " in google/oss-fuzz#11763
With an average execs per second of ~4 after the seed corpus loads on a short local run, it is far, far, below what is hoped for. In contrast, I'm seeing about 1432 exec/s on fuzz_config.py in a similar local run. fuzz_config.py creates a config fake in-memory but never writes it to disk. I haven't dug in yet, but I think this is partly related to point 1 above and partly related to some of the implementation details of the legacy code in GitPython. This is less of a concern/priority than the /tmp dir issue though because even if it's slow, it's still at least testing the API. Thoughts on Next StepsI'm fully in agreement that the /tmp issue and documentation related to it must be fixed before other updates are considered. I think it can be broken down into smaller problems in the following prioritized order:
Possible quick fix options:
I think the most robust solution would be some combination of both points above, but I don't know enough to comment on the potential issues that we may face goinf the TemporaryDirectory route inside OSS-Fuzz. The Dockerfile solution on the other hand is something I've already been toying with locally. I also looked at options like MemoryFS from https://github.com/PyFilesystem/pyfilesystem2 because an in-memory solution would be ideal, but I think the git executable invocation may cause issues with that approach. What do you both think? |
Sorry, something went wrong.
|
Thanks so much for the follow-up - I feel well-informed now! As OSS-fuzz strongly relies on docker, maybe it would be easiest to 'go with the flow' and provide a dockerfile as part of the fuzzing directory. That way, the docs can refer exclusively to that and thanks to docker, everything, including dependencies, would work just as if they would run on the OSS-fuzz infrastructure. As second step, one could see if improving the tempdir usage would fix certain issues that you have been seeing, including to make using OSS fuzz safe for running it directly. gitoxide can do that without issues, by the way, and maybe being able to do that helps with the local development workflow. With that said, I also think that… I'd really want you to do more with gitoxide 😁, and even though Rust OSS fuzzing is currently broken across the ecosystem, maybe there are some parsers that still don't have enough fuzzerage (fuzzing coverage). Maybe it would be more fun, too, as 'the system' won't be in your way anymore. But I digress 😅. |
Sorry, something went wrong.
|
As OSS-fuzz strongly relies on docker, maybe it would be easiest to 'go with the flow' and provide a dockerfile as part of the fuzzing directory. That way, the docs can refer exclusively to that and thanks to docker, everything, including dependencies, would work just as if they would run on the OSS-fuzz infrastructure. I've added a simple Dockerfile and I instructions for using it and a warning about not using Docker in #1904. Let me know what you think! As second step, one could see if improving the tempdir usage would fix certain issues that you have been seeing, including to make using OSS fuzz safe for running it directly. I'm planning to take a deeper at this. Even without the /tmp concerns, there is something that is blocking Fuzz Introspector from working properly, which would be nice to resolve. With that said, I also think that… I'd really want you to do more with gitoxide 😁 I'm definitely interested in taking you up on that! I've been interested in learning more about Rust for a while and I've found working on fizzers a great way to get familiar with new projects and languages, so you can likely expect to see me pop up in that project sooner or later 😁 In other news, the downstream OSS-Fuzz PR was merged and the Gitpython builds are succeeding! https://introspector.oss-fuzz.com/project-profile?project=gitpython The Coverage build is still reporting failures, but that is expected until ClusterFuzz generates enough corpora to run the analysis which I expect will happen in the next day or two. Exciting to see! |
Sorry, something went wrong.
As discussed in #1889, this PR introduces the changes necessary to migrate the OSS-Fuzz fuzz tests and configuration scripts from the OSS-Fuzz repository into GitPython's repo.
Summary of Changes
Most of the changes proposed here are the same as described in the discussion thread linked above. Beyond that, there are a few minor differences and potentially non-obvious details that I want to call out here:
Directory Structure
Initially, I proposed a single, flat "fuzzing/" directory at the top level, but when I moved the files, I felt that a few sub-directories would help with organization.
The fuzzing/READEME.md file introduced in this PR has a section describing them and some additional documentation.
Updates to Fuzz Targets
The fuzz test files in this PR include the changes that were originally proposed in google/oss-fuzz#11763.
Seed Data
Part of efficiency improvements I proposed in google/oss-fuzz#11763 included the addition of dictionary files (described in the new README) and seed corpus zip files.
I stored these in a repo that I maintain: https://github.com/DaveLak/oss-fuzz-inputs/tree/main/gitpython, because OSS-Fuzz has understandable concerns about seed data bloating the size of their repo.
In this PR:
Next Steps
@Byron & @EliahKagan, please:
Thanks! I'll address feedback any feedback as quickly as possible.