codeql/python/extractor/tsg-python at codeql-cli/v2.25.2 · github/codeql · GitHub

NameName

Last commit message

Last commit date

parent directory ..
src		src
tsp		tsp
.gitignore		.gitignore
BUILD.bazel		BUILD.bazel
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
README.md		README.md
python.tsg		python.tsg
tsg_to_dot.py		tsg_to_dot.py
View all files

README.md

tsg-python

Run tree-sitter-graph queries against Python source files.

How to build

Run cargo build --release. The resulting binary can be found in the target/release directory.

How to invoke

tsg-python tsg-file.tsg python-file.py

Output is emitted on stdout.

If you're impatient, you can also build and run using cargo run followed by the arguments given above.

How to use

To use tsg-python, you must have an appropriate .tsg file containing the directions for how to construct a Python AST from the output of tree-sitter-python.

A quick primer on tree-sitter-graph syntax

A file consists of a sequence of stanzas. Each stanza consists of a query (using the tree-sitter query syntax) and a sequence of nodes and edges to define for each query match in the source file. Queries will (almost always) include captures like @foo, which means any occurrence of @foo in the corresponding stanza will refer to a particular syntax node in the bit that the query matches.

Stanzas are executed in order, and a stanza is only run when all possible matches have been exhausted for all preceding stanzas. (Since the syntax tree that is matched against never changes, execution never jumps back to an earlier stanza.)

Inside stanzas, scoped variables have the form @foo.bar where @foo is a capture in the associated query, and bar is an identifier. This should be thought of as a variable that is "attached" to the tree-sitter node that @foo refers to. If @baz is another reference to the same node as @foo (perhaps even in a different stanza), then @baz.bar will be a reference to the same scoped variable. This permits information to be linked across different stanzas.

Assigning a value to a scoped variable is done using the syntax let @foo.bar = some-expr (let for immutable variables, var for mutable variables, which may be mutated using set). Note that scoped variables only exist during the execution of the stack graph, and are not immediately part of the output graph.

To actually produce output, we must specify some nodes or edges and possibly attributes thereof.

To produce a node, we declare node @foo.bar (which is equivalent to let @foo.bar = (node), the right hand side being a function that creates a new node). In the output, nodes are simply integers.

To assign an attribute to a node, we write attr (@foo.bar) identifier = expr, for some suitable choice of identifier and expr. In the output, attributes are given alongside nodes in a key: value notation.

For edges and their attributes, the syntax is similar:

edge @foo.bar -> @baz.quux

and

attr (@foo.bar -> @baz.quux) identifier = expr.

Note that it is an error to declare the same node, edge, (or attribute of either of these) twice.

The general scheme:

For fields that point to some literal value

<some capture involving @nd> { attr (@nd.node) field_name = some_value }

For fields that point directly to an AST node:

<some capture involving @parent and @child> { attr (@parent.node) field_name = @child.node }

For fields that point to lists of AST nodes:

<some capture involving @parent and @child> { edge @parent.node -> @child.node attr (@parent.node -> @child.node) field_name = <index of @child in the resulting list> }

Scoped variables of the form @foo.node are used to tie the AST together, and so it's important that this is set for nodes that map directly onto tree-sitter-python nodes. Thus, for instance for binary operators, the stanza could look as follows:

(binary_operator left: (_) @left right: (_) @right ) @bin { attr (@bin.node) left = @left.node attr (@bin.node) right = @right.node }

Note in particular the @left.node and @right.node references. In order for the above stanza to work, these scoped variables must exist and point to suitable graph nodes.

In practice, the setting up of all of these scoped variables (and creation of output graph nodes) will happen at the very top of the .tsg file, to ensure that these scoped variables are defined for the remainder of the file.

To ease the creation of these variables, we have the ast-node convenience function. For binary operators, it would take the following form:

(binary_operator) @bin { let @bin.node = (ast-node @bin "BinOp") }

Here, the two arguments are respectively

a tree-sitter node (which is used to set the location of @bin.node), and
a string (which is used to set the "kind" of @bin.node)

In effect, the call

let @bin.node = (ast-node @bin "BinOp")

is exactly equivalent to the more verbose

node @bin.node ; or equivalently `let @bin.node = (node)` attr (@bin.node) _location = (location @bin) attr (@bin.node) _kind = "BinOp"

As the above suggests, attributes that start with an underscore are interpreted in a special way when reconstructing the AST.

Special attributes

The _kind attribute (mandatory)

Should be set to a string consisting of the name of the corresponding Python AST class. This information will be used to build the AST, and so it is an error if this is left out.

Generally, this (and _location) will be set using the ast-node function.

The _skip_to attribute (optional)

This is used to indicate that the present graph node should not be turned into an AST node, but that the graph node contained in this attribute should be used instead. That graph node may also contain a _skip_to field, in which case the entire chain is followed until a node is encountered that does not have a _skip_to field. (Please ensure that there are no cycles of _skip_to pointers.)

Example:

In tree-sitter-python, assignment statements are a form of expression_statement, and this node type also encompasses things like expressions (e.g. 2+2) appearing at the level of statements. In the internal Python AST, we need to separate the assignment from such expressions. The assignment should be present as an Assign node, but 2+2 should be wrapped in an Expr node. To solve this, we create an Expr for each expression_statement, and then explicitly skip this node in the AST if it contains an assignment. This is implemented as follows:

(expression_statement (assignment) @inner) @outer { attr (@outer.node) _skip_to = @inner.node }

The _location attribute (optional)

This attribute is used to indicate the location of the corresponding AST node. As with _kind it should be set using the ast-node function.

The _location_start and _location_end attributes (optional)

These attributes are used to indicate the start or end of the location of the AST node. They can be used for nodes where _location has already been set, in which case they override the relevant part of that location. For an example of this see the worked example on if statements below.

The _start_line, _start_column, _end_line, and _end_column attributes (optional)

These can be used to set the start or end position of an AST node with even greater detail than the preceding attributes. As with the _location_start and _location_end attributes, these will override the values of the corresponding part of the location.

In general, these attributes should be used sparingly, as they are quite verbose.

Built-in functions

(source-text tree-sitter-node) (built-in)

This function returns the source text of the tree-sitter node it receives as an argument.

Example:

Extracting the operator from a binary expression:

(binary_operator operator: _ @op ) @bin { attr (@bin.node) op = (source-text @op) }

(ast-node tree-sitter-node string) (tsg-python only)

Creates a new graph node with the given _kind and sets the _location attribute to the location of the given tree-sitter node.

(child-index tree-sitter-node) (built-in)

Returns the index of the given tree-sitter node in its parent.

(location tree-sitter-node) (tsg-python only)

Returns the location of the given tree-sitter node as a list containing four integers corresponding to the start row and column, followed by the end row and column.

(location-start tree-sitter-node) and (location-end tree-sitter-node) (tsg-python only)

Returns the start or end position (row followed by column) of the given tree-sitter node (as a list containing two integers).

start-row, start-column, end-row, and end-column (built-in)

(All of these take a tree-sitter-node as an argument.)

Returns an integer corresponding to the appropriate part of the location of the given tree-sitter node.

A worked example: if statements

The way the current parser handles if statements means we cannot do a straight mapping from the tree-sitter grammar to the AST. In particular, a block of code such as

if x: do_x
elif y: do_y
elif z: do_z
else: do_else

is unrolled into the following form by the current parser:

if x: do_x
else:
    if y: do_y
    else:
        if z: do_z
        else: do_else

This means we have to synthesise nodes for the inner if statements.

However, this should be straightforward -- we simply have to make sure that elif_clauses also produce the appropriate kind of node, and that everything is linked up correctly.

For references, here are the productions for if_statement, else_clause and elif_clause in tree-sitter-python

    if_statement: $ => seq(
      'if',
      field('condition', $.expression),
      ':',
      field('consequence', $._suite),
      repeat(field('alternative', $.elif_clause)),
      optional(field('alternative', $.else_clause))
    ),

    elif_clause: $ => seq(
      'elif',
      field('condition', $.expression),
      ':',
      field('consequence', $._suite)
    ),

    else_clause: $ => seq(
      'else',
      ':',
      field('body', $._suite)
    ),

First, we'll set up all of the relevant nodes with corresponding nodes in the AST:

(if_statement) @tree_sitter_node { let @tree_sitter_node.node = (ast-node @tree_sitter_node "If") }

This ensures that we can reference the .node scoped variable on the above nodes.

(We named the capture @tree_sitter_node above to make it more clear, but in general something like @if would be more appropriate.)

In particular, since we want elifs to be turned into nested ifs, it makes sense to apply the If kind to elif_clauses as well:

(elif_clause) @elif { let @elif.node = (ast-node @elif "If") }

Whenever we refer to a node, we must ensure that it has first been defined, however there is no need to do this separately for each node.

Next, for both ifs and elifs, we want to record the test and the body. The test we do as follows:

[ (if_statement condition: (_) @test) @if (elif_clause condition: (_) @test) @if ] { attr (@if.node) test = @test.node }

For body, in the Python AST this is simply a list of nodes, whereas for the tree-sitter parse tree, it will contain a block node. Because there is no Python AST equivalent for block, we skip over this node when linking the if-statement to its body:

[ (if_statement consequence: (block (_) @stmt)) @parent (elif_clause consequence: (block (_) @stmt)) @parent ] { edge @parent.node -> @stmt.node attr (@parent.node -> @stmt.node) body = (child-index @stmt) }

The above shows how we handle fields containing lists of items: we add an edge from the parent node to each child node, and put an attribute on that edge. The name of the attribute will be the name of the field, and the value will be the index of this node among the children of its tree-sitter parent.

Now we can begin unwinding the nesting. First of all, the first elif should be the orelse of the initial if_statement:

(if_statement consequence: (_) . (elif_clause) @elif ) @if { edge @if.node -> @elif.node attr (@if.node -> @elif.node) orelse = 0 }

(The . acts as an anchor, forcing its two neighbours to be adjancent in the tree. So in this case, we get the first elif after the body of the if)

Next, whenever we have two adjacent elifs, we want the orelse of the first one to be the second one:

( (elif_clause) @elif1 . (elif_clause) @elif2 ) { edge @elif1.node -> @elif2.node attr (@elif1.node -> @elif2.node) orelse = 0 }

Finally, the else branch of the outermost if should be the orelse of the last elif:

(if_statement (elif_clause) @elif . alternative: (else_clause body: (block (_) @orelse)) ) { edge @elif.node -> @orelse.node attr (@elif.node -> @orelse.node) orelse = (child-index @orelse) }

The above gives us the correct tree structure, but we're still missing a few bits (such as locations). To capture location information we use the following stanza:

[ (if_statement condition: (_) ":" @colon) @if (elif_clause condition: (_) ":" @colon) @if ] { attr (@if.node) _location_end = (location-end @colon) }

Because tree-sitter-python disagrees with the Python AST about the location of the If node, we have to adjust it. We do this by setting the _location_end attribute to the end of the : token. (Note that the start of this location was set when we called ast-node above. As we don't have to change this part of the location, we simply leave it as is.)

Synthesizing nodes

In many cases it will be sufficient to hook up AST nodes to the corresponding tree-sitter nodes, but occasionally we want the tree structure to be different. One example of this would be the class statement. For instance, a class declaration such as

class Foo(int, object, metaclass=type):
    x = 5

has a tree-sitter-python parse tree that looks like this:

module [0, 0] - [2, 0] class_definition [0, 0] - [1, 9] name: identifier [0, 6] - [0, 9] superclasses: argument_list [0, 9] - [0, 38] identifier [0, 10] - [0, 13] identifier [0, 15] - [0, 21] keyword_argument [0, 23] - [0, 37] name: identifier [0, 23] - [0, 32] value: identifier [0, 33] - [0, 37] body: block [1, 4] - [1, 9] expression_statement [1, 4] - [1, 9] assignment [1, 4] - [1, 9] left: identifier [1, 4] - [1, 5] right: integer [1, 8] - [1, 9]

but the Python AST looks like this:

Module: [1, 0] - [3, 0] body: [ Assign: [1, 0] - [1, 39] targets: [ Name: [1, 6] - [1, 9] variable: Variable('Foo', None) ctx: Store ] value: ClassExpr: [1, 0] - [1, 39] name: 'Foo' bases: [ Name: [1, 10] - [1, 13] variable: Variable('int', None) ctx: Load Name: [1, 15] - [1, 21] variable: Variable('object', None) ctx: Load ] keywords: [ keyword: [1, 23] - [1, 37] arg: 'metaclass' value: Name: [1, 33] - [1, 37] variable: Variable('type', None) ctx: Load ] inner_scope: Class: [1, 0] - [1, 39] name: 'Foo' body: [ Assign: [2, 4] - [2, 9] targets: [ Name: [2, 4] - [2, 5] variable: Variable('x', None) ctx: Store ] value: Num: [2, 8] - [2, 9] n: 5 text: '5' ] ]

In particular, we unroll the class statement into an explicit assignment (which is the top node for this statement in the AST) of a synthetic ClassExpr, which in turn contains a Class node (which holds things like the body of the class). This requires too many nodes to simply reuse what's given to us by tree-sitter-python, and so we must synthesize additional nodes.

First of all, let us set up the outer node to be an Assign node:

(class_definition) @class { let @class.node = (ast-node @class "Assign") }

Next, we can do most of the work in a single stanza:

(class_definition name: (identifier) @name ":" @colon ) @class { ; To make it clearer that the outer node is an assignment, we create an alias for it. let @class.assign = @class.node ; Synthesized nodes: the left-hand side of the assignment, the class_expr node, and the class ; node. let @class.assign_lhs = (ast-node @name "Name") let @class.class_expr = (ast-node @class "ClassExpr") let @class.inner_scope = (ast-node @class "Class") edge @class.assign -> @class.assign_lhs attr (@class.assign -> @class.assign_lhs) targets = 0 attr (@class.assign) value = @class.class_expr attr (@class.assign) _location_end = (location-end @colon) let class_name = (source-text @name) ; The left-hand side of the assignment, a `Name`. attr (@class.assign_lhs) variable = class_name attr (@class.assign_lhs) ctx = "store" ; The right hand side of the assignment, a `ClassExpr`. attr (@class.class_expr) name = class_name attr (@class.class_expr) inner_scope = @class.inner_scope ; `bases` will be set elsewhere ; `keywords` will be set elsewhere attr (@class.class_expr) _location_end = (location-end @colon) ; The inner scope of the class_expr, a `Class`. attr (@class.inner_scope) name = class_name ; body will be set in a separate stanza. attr (@class.inner_scope) _location_end = (location-end @colon) }

Let's go over these lines bit by bit. First, we create an alias for the outermost node (which will become an assignment node) in order to make it clearer that it's an assignment. Next, we create new nodes for the inner synthesized nodes. Note that we can't assign these to @class.node as that already points to the node that will become the assignment node. Instead, we create new scoped variables (with suitable names), and assign them nodes (with appropriate kinds and locations using ast-node).

; To make it clearer that the outer node is an assignment, we create an alias for it. let @class.assign = @class.node ; Synthesized nodes: the left-hand side of the assignment, the class_expr node, and the class ; node. let @class.assign_lhs = (ast-node @name "Name") let @class.class_expr = (ast-node @class "ClassExpr") let @class.inner_scope = (ast-node @class "Class")

Next, we set up the outer assignment:

edge @class.assign -> @class.assign_lhs attr (@class.assign -> @class.assign_lhs) targets = 0 attr (@class.assign) value = @class.class_expr attr (@class.assign) _location_end = (location-end @colon)

The remaining nodes all contain a field that refers to the name of the class, so put this in a local variable for convenience:

let class_name = (source-text @name)

We set up the left hand side of the assignment:

; The left-hand side of the assignment, a `Name`. attr (@class.assign_lhs) variable = class_name attr (@class.assign_lhs) ctx = "store"

The ClassExpr:

; The right hand side of the assignment, a `ClassExpr`. attr (@class.class_expr) name = class_name attr (@class.class_expr) inner_scope = @class.inner_scope ; `bases` will be set elsewhere ; `keywords` will be set elsewhere attr (@class.class_expr) _location_end = (location-end @colon)

The Class:

; The inner scope of the class_expr, a `Class`. attr (@class.inner_scope) name = class_name ; body will be set elsewhere attr (@class.inner_scope) _location_end = (location-end @colon)

The remaining stanzas take care of setting up the fields that contain lists of nodes, and these follow the same scheme as before.

; Class.body (class_definition body: (block (_) @stmt) ) @class { edge @class.inner_scope -> @stmt.node attr (@class.inner_scope -> @stmt.node) body = (child-index @stmt) } ; Class.bases (class_definition superclasses: (argument_list (identifier) @arg) ) @class { edge @class.class_expr -> @arg.node attr (@class.class_expr -> @arg.node) bases = (child-index @arg) attr (@arg.node) ctx = "load" } ; Class.keywords (class_definition superclasses: (argument_list (keyword_argument) @arg) ) @class { edge @class.class_expr -> @arg.node attr (@class.class_expr -> @arg.node) keywords = (child-index @arg) }

Terms
Privacy
Security
Status
Community
Docs
Contact
Manage cookies
Do not share my personal information

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Files

README.md

tsg-python

How to build

How to invoke

How to use

A quick primer on tree-sitter-graph syntax

The general scheme:

Special attributes

The _kind attribute (mandatory)

The _skip_to attribute (optional)

The _location attribute (optional)

The _location_start and _location_end attributes (optional)

The _start_line, _start_column, _end_line, and _end_column attributes (optional)

Built-in functions

(source-text tree-sitter-node) (built-in)

(ast-node tree-sitter-node string) (tsg-python only)

(child-index tree-sitter-node) (built-in)

(location tree-sitter-node) (tsg-python only)

(location-start tree-sitter-node) and (location-end tree-sitter-node) (tsg-python only)

start-row, start-column, end-row, and end-column (built-in)

A worked example: if statements

Synthesizing nodes

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

Breadcrumbs

tsg-python

Directory actions

More options

Directory actions

More options

Latest commit

History

Breadcrumbs

tsg-python

Folders and files

parent directory

README.md

tsg-python

How to build

How to invoke

How to use

A quick primer on tree-sitter-graph syntax

The general scheme:

Special attributes

The _kind attribute (mandatory)

The _skip_to attribute (optional)

The _location attribute (optional)

The _location_start and _location_end attributes (optional)

The _start_line, _start_column, _end_line, and _end_column attributes (optional)

Built-in functions

(source-text tree-sitter-node) (built-in)

(ast-node tree-sitter-node string) (tsg-python only)

(child-index tree-sitter-node) (built-in)

(location tree-sitter-node) (tsg-python only)

(location-start tree-sitter-node) and (location-end tree-sitter-node) (tsg-python only)

start-row, start-column, end-row, and end-column (built-in)

A worked example: if statements

Synthesizing nodes

Footer

Footer navigation

Files