parent directory.. | ||||
View all files | ||||
Run tree-sitter-graph queries against Python source files.
Run cargo build --release. The resulting binary can be found in the target/release directory.
tsg-python tsg-file.tsg python-file.py
Output is emitted on stdout.
If you're impatient, you can also build and run using cargo run followed by the arguments given above.
To use tsg-python, you must have an appropriate .tsg file containing the directions for how to construct a Python AST from the output of tree-sitter-python.
A file consists of a sequence of stanzas. Each stanza consists of a query (using the tree-sitter query syntax) and a sequence of nodes and edges to define for each query match in the source file. Queries will (almost always) include captures like @foo, which means any occurrence of @foo in the corresponding stanza will refer to a particular syntax node in the bit that the query matches.
Stanzas are executed in order, and a stanza is only run when all possible matches have been exhausted for all preceding stanzas. (Since the syntax tree that is matched against never changes, execution never jumps back to an earlier stanza.)
Inside stanzas, scoped variables have the form @foo.bar where @foo is a capture in the associated query, and bar is an identifier. This should be thought of as a variable that is "attached" to the tree-sitter node that @foo refers to. If @baz is another reference to the same node as @foo (perhaps even in a different stanza), then @baz.bar will be a reference to the same scoped variable. This permits information to be linked across different stanzas.
Assigning a value to a scoped variable is done using the syntax let @foo.bar = some-expr (let for immutable variables, var for mutable variables, which may be mutated using set). Note that scoped variables only exist during the execution of the stack graph, and are not immediately part of the output graph.
To actually produce output, we must specify some nodes or edges and possibly attributes thereof.
To produce a node, we declare node @foo.bar (which is equivalent to let @foo.bar = (node), the right hand side being a function that creates a new node). In the output, nodes are simply integers.
To assign an attribute to a node, we write attr (@foo.bar) identifier = expr, for some suitable choice of identifier and expr. In the output, attributes are given alongside nodes in a key: value notation.
For edges and their attributes, the syntax is similar:
edge @foo.bar -> @baz.quux
and
attr (@foo.bar -> @baz.quux) identifier = expr.
Note that it is an error to declare the same node, edge, (or attribute of either of these) twice.
For fields that point to some literal value
For fields that point directly to an AST node:
For fields that point to lists of AST nodes:
Scoped variables of the form @foo.node are used to tie the AST together, and so it's important that this is set for nodes that map directly onto tree-sitter-python nodes. Thus, for instance for binary operators, the stanza could look as follows:
Note in particular the @left.node and @right.node references. In order for the above stanza to work, these scoped variables must exist and point to suitable graph nodes.
In practice, the setting up of all of these scoped variables (and creation of output graph nodes) will happen at the very top of the .tsg file, to ensure that these scoped variables are defined for the remainder of the file.
To ease the creation of these variables, we have the ast-node convenience function. For binary operators, it would take the following form:
Here, the two arguments are respectively
In effect, the call
is exactly equivalent to the more verbose
As the above suggests, attributes that start with an underscore are interpreted in a special way when reconstructing the AST.
Should be set to a string consisting of the name of the corresponding Python AST class. This information will be used to build the AST, and so it is an error if this is left out.
Generally, this (and _location) will be set using the ast-node function.
This is used to indicate that the present graph node should not be turned into an AST node, but that the graph node contained in this attribute should be used instead. That graph node may also contain a _skip_to field, in which case the entire chain is followed until a node is encountered that does not have a _skip_to field. (Please ensure that there are no cycles of _skip_to pointers.)
Example:
In tree-sitter-python, assignment statements are a form of expression_statement, and this node type also encompasses things like expressions (e.g. 2+2) appearing at the level of statements. In the internal Python AST, we need to separate the assignment from such expressions. The assignment should be present as an Assign node, but 2+2 should be wrapped in an Expr node. To solve this, we create an Expr for each expression_statement, and then explicitly skip this node in the AST if it contains an assignment. This is implemented as follows:
This attribute is used to indicate the location of the corresponding AST node. As with _kind it should be set using the ast-node function.
These attributes are used to indicate the start or end of the location of the AST node. They can be used for nodes where _location has already been set, in which case they override the relevant part of that location. For an example of this see the worked example on if statements below.
These can be used to set the start or end position of an AST node with even greater detail than the preceding attributes. As with the _location_start and _location_end attributes, these will override the values of the corresponding part of the location.
In general, these attributes should be used sparingly, as they are quite verbose.
This function returns the source text of the tree-sitter node it receives as an argument.
Example:
Extracting the operator from a binary expression:
Creates a new graph node with the given _kind and sets the _location attribute to the location of the given tree-sitter node.
Returns the index of the given tree-sitter node in its parent.
Returns the location of the given tree-sitter node as a list containing four integers corresponding to the start row and column, followed by the end row and column.
Returns the start or end position (row followed by column) of the given tree-sitter node (as a list containing two integers).
(All of these take a tree-sitter-node as an argument.)
Returns an integer corresponding to the appropriate part of the location of the given tree-sitter node.
The way the current parser handles if statements means we cannot do a straight mapping from the tree-sitter grammar to the AST. In particular, a block of code such as
is unrolled into the following form by the current parser:
This means we have to synthesise nodes for the inner if statements.
However, this should be straightforward -- we simply have to make sure that elif_clauses also produce the appropriate kind of node, and that everything is linked up correctly.
For references, here are the productions for if_statement, else_clause and elif_clause in tree-sitter-python
First, we'll set up all of the relevant nodes with corresponding nodes in the AST:
This ensures that we can reference the .node scoped variable on the above nodes.
(We named the capture @tree_sitter_node above to make it more clear, but in general something like @if would be more appropriate.)
In particular, since we want elifs to be turned into nested ifs, it makes sense to apply the If kind to elif_clauses as well:
Whenever we refer to a node, we must ensure that it has first been defined, however there is no need to do this separately for each node.
Next, for both ifs and elifs, we want to record the test and the body. The test we do as follows:
For body, in the Python AST this is simply a list of nodes, whereas for the tree-sitter parse tree, it will contain a block node. Because there is no Python AST equivalent for block, we skip over this node when linking the if-statement to its body:
The above shows how we handle fields containing lists of items: we add an edge from the parent node to each child node, and put an attribute on that edge. The name of the attribute will be the name of the field, and the value will be the index of this node among the children of its tree-sitter parent.
Now we can begin unwinding the nesting. First of all, the first elif should be the orelse of the initial if_statement:
(The . acts as an anchor, forcing its two neighbours to be adjancent in the tree. So in this case, we get the first elif after the body of the if)
Next, whenever we have two adjacent elifs, we want the orelse of the first one to be the second one:
Finally, the else branch of the outermost if should be the orelse of the last elif:
The above gives us the correct tree structure, but we're still missing a few bits (such as locations). To capture location information we use the following stanza:
Because tree-sitter-python disagrees with the Python AST about the location of the If node, we have to adjust it. We do this by setting the _location_end attribute to the end of the : token. (Note that the start of this location was set when we called ast-node above. As we don't have to change this part of the location, we simply leave it as is.)
In many cases it will be sufficient to hook up AST nodes to the corresponding tree-sitter nodes, but occasionally we want the tree structure to be different. One example of this would be the class statement. For instance, a class declaration such as
has a tree-sitter-python parse tree that looks like this:
but the Python AST looks like this:
In particular, we unroll the class statement into an explicit assignment (which is the top node for this statement in the AST) of a synthetic ClassExpr, which in turn contains a Class node (which holds things like the body of the class). This requires too many nodes to simply reuse what's given to us by tree-sitter-python, and so we must synthesize additional nodes.
First of all, let us set up the outer node to be an Assign node:
Next, we can do most of the work in a single stanza:
Let's go over these lines bit by bit. First, we create an alias for the outermost node (which will become an assignment node) in order to make it clearer that it's an assignment. Next, we create new nodes for the inner synthesized nodes. Note that we can't assign these to @class.node as that already points to the node that will become the assignment node. Instead, we create new scoped variables (with suitable names), and assign them nodes (with appropriate kinds and locations using ast-node).
Next, we set up the outer assignment:
The remaining nodes all contain a field that refers to the name of the class, so put this in a local variable for convenience:
We set up the left hand side of the assignment:
The ClassExpr:
The Class:
The remaining stanzas take care of setting up the fields that contain lists of nodes, and these follow the same scheme as before.