Python extraction happens in two phases:
The rule for pack_zip('python-extractor') in build defines what files are included in a distribution and in the CodeQL CLI. After building the CodeQL CLI locally, the files are in target/intree/codeql/python/tools.
This project uses
You can install both tools with pipx, like so:
Once you've installed poetry, you can do this:
To install multiple python versions locally, we recommend you use pyenv
(Don't try to use tox run-parallel; our tests are not set up for this to work 😅)
Currently we distribute our code in an obfuscated way, by including the code in the subfolders in a zip file that is imported at run-time (by the python files in the top level of this directory).
The one exception is the data directory (used for stubs) which is included directly in the tools folder.
The zip creation is managed by make_zips.py. Currently we make one zip file for Python 2 (which is byte-compiled), and one for Python 3 (which contains source files that are stripped of comments and docstrings).
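To illustrate the general mechanism of importing code from a zip file at run-time, here is a minimal sketch using only the standard library. It is not how make_zips.py or the top-level launcher files work; the module name, archive name, and helper functions are hypothetical and exist only for this example.

```python
import os
import sys
import tempfile
import zipfile

# Hypothetical module name and contents, used only for this illustration.
MODULE_NAME = "demo_zipped_module"
MODULE_SOURCE = "def greet():\n    return 'hello from the zip'\n"


def build_zip(directory):
    """Write a zip archive containing one Python module and return its path."""
    zip_path = os.path.join(directory, "demo_package.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr(MODULE_NAME + ".py", MODULE_SOURCE)
    return zip_path


def import_from_zip(zip_path, module_name):
    """Put the archive on sys.path so the standard zip importer can find the module."""
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    return __import__(module_name)


workdir = tempfile.mkdtemp()
module = import_from_zip(build_zip(workdir), MODULE_NAME)
greeting = module.greet()
```

Python treats a zip archive on sys.path much like a directory, so a small launcher file outside the archive can import everything else from inside it.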
We expect to be able to run our tools (setup phase) with either Python 2 or Python 3, and after determining which version to analyze the code as, we run the extractor with that version. So we must support:
For extraction with the CodeQL CLI locally (codeql database create --language python)
The representation of the code in the figure below has in some cases been altered slightly, but is accurate as of 2020-03-20.
The entrypoint of the actual Python extractor is python_tracer.py.
The usual way to invoke the extractor is to pass a directory of Python files to the launcher. The extractor extracts code from those files and their dependencies, producing TRAP files, and copies the source code to a source archive. Alternatively, for highly distributed systems, it is possible to pass a single file per extractor invocation and invoke the extractor many times. The extractor recognizes Python source code files and Thrift IDL files. Other types of file can be added to the database by passing the --filter option to the extractor, but they'll be stored as text blobs.
The extractor expects the CODEQL_EXTRACTOR_PYTHON_TRAP_DIR and CODEQL_EXTRACTOR_PYTHON_SOURCE_ARCHIVE_DIR environment variables to be set (which determine, respectively, where it puts TRAP files and the source archive). However, the location of the TRAP folder and source archive can be specified on the command-line instead.
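The precedence described above (command line first, then the environment) could be sketched roughly as follows. The two environment variable names come from this README; the function and its parameter names are hypothetical and do not reflect the extractor's real option handling.

```python
import os

TRAP_DIR_VAR = "CODEQL_EXTRACTOR_PYTHON_TRAP_DIR"
SOURCE_ARCHIVE_VAR = "CODEQL_EXTRACTOR_PYTHON_SOURCE_ARCHIVE_DIR"


def resolve_output_dirs(cli_trap_dir=None, cli_archive_dir=None, environ=None):
    """Return (trap_dir, archive_dir), preferring explicit values over the environment.

    The parameter names are hypothetical; they stand in for whatever
    command-line options the real extractor accepts.
    """
    environ = os.environ if environ is None else environ
    trap_dir = cli_trap_dir or environ.get(TRAP_DIR_VAR)
    archive_dir = cli_archive_dir or environ.get(SOURCE_ARCHIVE_VAR)
    missing = [
        name
        for name, value in ((TRAP_DIR_VAR, trap_dir), (SOURCE_ARCHIVE_VAR, archive_dir))
        if not value
    ]
    if missing:
        # Neither the command line nor the environment supplied a location.
        raise ValueError("no location given for: " + ", ".join(missing))
    return trap_dir, archive_dir
```

Passing the environment as a parameter keeps the function easy to exercise without touching the real process environment.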
The extractor outputs the following information as TRAP files:
Once started, the extractor consists of three sets of communicating processes.
The front-end -> worker message queue has quite limited capacity (2 per process) to ensure rapid shutdown when interrupted. The capacity of the worker -> front-end message queue must be at least twice that size to prevent deadlock, and is in fact much larger to prevent workers being blocked on the queue.
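The queue sizing can be illustrated with a small sketch. To keep it self-contained it uses threads and queue.Queue rather than separate processes, so it shows only the capacity relationship, not the extractor's actual process model. The function names, the sentinel, and the factor of 100 for the reply queue are assumptions made for this example.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit


def worker(requests, replies):
    """Take file names off the request queue and report each one back."""
    while True:
        item = requests.get()
        if item is STOP:
            return
        replies.put(item)  # stand-in for real extraction work


def drain(replies, results):
    """Move every reply that is currently available into results."""
    while True:
        try:
            results.append(replies.get_nowait())
        except queue.Empty:
            return


def run(paths, num_workers=3):
    # Small request queue: 2 slots per worker, so little work is
    # buffered if the run is interrupted.
    requests = queue.Queue(maxsize=2 * num_workers)
    # Reply queue: far larger than twice the request capacity
    # (the factor 100 is an arbitrary choice for this sketch).
    replies = queue.Queue(maxsize=100 * num_workers)
    threads = [
        threading.Thread(target=worker, args=(requests, replies))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    results = []
    for path in paths:
        requests.put(path)  # blocks while the request queue is full
        drain(replies, results)
    for _ in threads:
        requests.put(STOP)
    for thread in threads:
        thread.join()
    drain(replies, results)
    return results


processed = run(["file%d.py" % i for i in range(20)])
```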
Experiments suggest that the extractor scales almost linearly to at least 20 processes (on Linux).
The component that walks the file system is known as the "traverser" and is designed to be pluggable. Its interface is simply an iterable of file descriptions. See semmle/traverser.py.
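As a rough illustration of what "an iterable of file descriptions" can look like, here is a minimal generator-based sketch. It is not the code in semmle/traverser.py: the function name, the suffix filter, and the use of plain path strings as the "file description" are all assumptions.

```python
import os
import tempfile


def traverse(root, suffixes=(".py",)):
    """Yield the path of every file under root whose name ends with one of suffixes."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            if name.endswith(suffixes):
                yield os.path.join(dirpath, name)


# Build a tiny throwaway tree to walk.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "pkg"))
for relative in ("a.py", "notes.txt", os.path.join("pkg", "b.py")):
    with open(os.path.join(root, relative), "w") as handle:
        handle.write("")

found = [os.path.relpath(path, root) for path in traverse(root)]
```

Because the consumer only needs something it can iterate over, a different traverser (for example one reading paths from a list) could be swapped in without changing the consumer.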
An important consequence of local extraction is that, except for the file path information, the contents of the TRAP file are functionally determined by:
Caching of TRAP files can reduce the time to extract a large project with few changes by an order of magnitude.
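A content-addressed cache is one common way to exploit this kind of determinism. The sketch below is not the extractor's cache. It assumes, purely for illustration, that the cache key is derived from the source bytes and an extractor version string; the class, function names, and file layout are hypothetical.

```python
import hashlib
import os
import tempfile


def cache_key(source_bytes, extractor_version):
    """Hash the inputs that (by assumption) determine the TRAP output."""
    digest = hashlib.sha256()
    digest.update(extractor_version.encode("utf-8"))
    digest.update(b"\0")  # separator between the two inputs
    digest.update(source_bytes)
    return digest.hexdigest()


class TrapCache:
    """Store extraction results on disk, keyed by a hash of their inputs."""

    def __init__(self, directory):
        self.directory = directory

    def get_or_create(self, source_bytes, extractor_version, extract):
        key = cache_key(source_bytes, extractor_version)
        path = os.path.join(self.directory, key + ".trap")
        if os.path.exists(path):
            # Cache hit: reuse the stored output and skip extraction.
            with open(path, "rb") as handle:
                return handle.read()
        trap = extract(source_bytes)
        with open(path, "wb") as handle:
            handle.write(trap)
        return trap


calls = []


def fake_extract(source_bytes):
    """Stand-in for real extraction; records each call."""
    calls.append(source_bytes)
    return b"trap for " + source_bytes


cache = TrapCache(tempfile.mkdtemp())
first = cache.get_or_create(b"x = 1\n", "v1", fake_extract)
second = cache.get_or_create(b"x = 1\n", "v1", fake_extract)
```

In this sketch the second call is served from disk, so unchanged files cost a hash and a file read rather than a full extraction.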
Each extractor process runs a loop which extracts files or modules from the queue, one at a time. Each file or module description is passed, in turn, to one of the extractor objects which will either extract it or reject it for the next extractor object to try. Currently the default extractors are:
The Python extractor is the most interesting of the processes mentioned above. The Python extractor takes a path to a Python file. It emits TRAP to the specified folder and a UTF-8 encoded version of the source to the source archive. It consists of the following passes:
Most Python template languages work by either translating the template into Python or by fairly closely mimicking the behavior of Python. This means that we can extract template files by converting them to the same AST used internally by the Python extractor and then passing that AST to the backend of the Python extractor to determine imports, and generate TRAP files including control-flow information.
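To make the idea concrete, here is a toy sketch that turns a template into a Python AST using the standard ast module. The `{{ expression }}` placeholder syntax, the function names, and the choice to wrap each placeholder in an expression statement are all assumptions for this example; the extractor's own AST and template handling are not shown here.

```python
import ast
import re

# Toy placeholder syntax, assumed for this example only.
PLACEHOLDER = re.compile(r"\{\{(.*?)\}\}")


def template_to_ast(template_text):
    """Build a Python module AST with one expression statement per placeholder."""
    statements = []
    for match in PLACEHOLDER.finditer(template_text):
        expression = ast.parse(match.group(1).strip(), mode="eval").body
        statements.append(ast.Expr(value=expression))
    module = ast.Module(body=statements, type_ignores=[])
    return ast.fix_missing_locations(module)


def names_used(tree):
    """Collect every plain name referenced anywhere in the tree."""
    return sorted(
        {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    )


tree = template_to_ast(
    "Hello {{ user.name }}, you have {{ len(messages) }} messages."
)
used = names_used(tree)
```

Once the template is an ordinary AST, any analysis written against Python ASTs (such as the name collection here) can run on it unchanged, which is the benefit the paragraph above describes.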