
Customizing library models for C and C++

You can model the methods and callables that control data flow in any framework or library. This is especially useful for custom frameworks or niche libraries that are not supported by the standard CodeQL libraries.

Beta Notice - Unstable API

Library customization using data extensions is currently in beta and subject to change.

Breaking changes to this format may occur while in beta.

About this article

This article contains reference material about how to define custom models for sources, sinks, and flow summaries for C and C++ dependencies in data extension files.

About data extensions

You can customize analysis by defining models (summaries, sinks, and sources) of your code’s C and C++ dependencies in data extension files. Each model defines the behavior of one or more elements of your library or framework, such as callables. When you run dataflow analysis, these models expand the set of sources and sinks that it tracks and improve the precision of results.

Many of the security queries search for paths from a source of untrusted input to a sink that represents a vulnerability. This is known as taint tracking. Each source is a starting point for dataflow analysis to track tainted data and each sink is an end point.

Taint tracking queries also need to know how data can flow through elements that are not included in the source code. These are modeled as summaries. A summary model enables queries to synthesize the flow behavior through elements in dependency code that is not stored in your repository.

Syntax used to define an element in an extension file

Each model of an element is a tuple in a data extension. A data extension file that extends the standard C and C++ queries included with CodeQL is a YAML file of the form:

extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: <name of extensible predicate>
    data:
      - <tuple1>
      - <tuple2>
      - ...

Each YAML file may contain one or more top-level extensions.

addsTo defines the CodeQL pack name and extensible predicate that the extension is injected into.
data defines one or more rows of tuples that are injected as values into the extensible predicate. The number of columns and their types must match the definition of the extensible predicate.

Data extensions use union semantics, which means that the tuples of all extensions for a single extensible predicate are combined, duplicates are removed, and all of the remaining tuples are queryable by referencing the extensible predicate.
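As a minimal sketch of these rules, the file below carries two top-level extensions that target different extensible predicates. The tuples are the boost::asio models discussed later in this article. The repeated sourceModel row is there only to illustrate union semantics: it collapses into a single tuple when the extensions are combined.

```yaml
extensions:
  # First extension: adds rows to the sourceModel predicate.
  - addsTo:
      pack: codeql/cpp-all
      extensible: sourceModel
    data:
      - ["boost::asio", "", False, "read_until", "", "", "Argument[*1]", "remote", "manual"]
      # Duplicate row: removed when the tuples of all extensions are combined.
      - ["boost::asio", "", False, "read_until", "", "", "Argument[*1]", "remote", "manual"]
  # Second extension in the same file: adds rows to the sinkModel predicate.
  - addsTo:
      pack: codeql/cpp-all
      extensible: sinkModel
    data:
      - ["boost::asio", "", False, "write", "", "", "Argument[*1]", "remote-sink", "manual"]
```

Each row must have as many columns as its target predicate: nine for sourceModel and sinkModel.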

Publish data extension files in a CodeQL model pack to share

You can group one or more data extension files into a CodeQL model pack and publish it to the GitHub Container Registry. This makes it easy for anyone to download the model pack and use it to extend their analysis. For more information, see Creating a CodeQL model pack and Publishing and using CodeQL packs in the CodeQL CLI documentation.
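A model pack is described by a qlpack.yml file. The following is only a sketch: the pack name, version, and models/ directory layout are hypothetical, and the keys shown should be checked against Creating a CodeQL model pack for the CLI version you use.

```yaml
# qlpack.yml for a hypothetical model pack
name: my-org/cpp-library-models   # hypothetical scope/name
version: 0.0.1
library: true
extensionTargets:
  codeql/cpp-all: "*"             # the pack whose extensible predicates are extended
dataExtensions:
  - models/**/*.yml               # hypothetical location of the data extension files
```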

Extensible predicates used to create custom models in C and C++

The CodeQL library for CPP analysis exposes the following extensible predicates:

sourceModel(namespace, type, subtypes, name, signature, ext, output, kind, provenance). This is used to model sources of potentially tainted data. The kind of a source defined using this predicate determines which threat model it is associated with. Different threat models can be used to customize the sources used in an analysis. For more information, see “Threat models.”
sinkModel(namespace, type, subtypes, name, signature, ext, input, kind, provenance). This is used to model sinks where tainted data may be used in a way that makes the code vulnerable.
summaryModel(namespace, type, subtypes, name, signature, ext, input, output, kind, provenance). This is used to model flow through elements.
barrierModel(namespace, type, subtypes, name, signature, ext, output, kind, provenance). This is used to model barriers, which are elements that stop the flow of taint.
barrierGuardModel(namespace, type, subtypes, name, signature, ext, input, acceptingValue, kind, provenance). This is used to model barrier guards, which are elements that can stop the flow of taint depending on a conditional check.

The extensible predicates are populated using the models defined in data extension files.

Example of custom model definitions

The examples in this section are taken from the standard CodeQL CPP query pack published by GitHub. They demonstrate how to add tuples to extend extensible predicates that are used by the standard queries.

Example: Taint source from the boost::asio namespace

This example shows how the CPP query pack models the data that the read_until function writes into its buffer argument as a remote source.

boost::asio::read_until(socket, recv_buffer, '\0', error);

We need to add a tuple to the sourceModel(namespace, type, subtypes, name, signature, ext, output, kind, provenance) extensible predicate by updating a data extension file.

extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: sourceModel
    data:
      - ["boost::asio", "", False, "read_until", "", "", "Argument[*1]", "remote", "manual"]

The first five values identify the callable (in this case a free function) to be modeled as a source.

The first value "boost::asio" is the namespace name.
The second value "" is the name of the type (class) that contains the method. Because we’re modeling a free function, the type is left blank.
The third value False is a flag that indicates whether or not the model also applies to all overrides of the method. For a free function, this should be False.
The fourth value "read_until" is the function name.
The fifth value is the function input type signature, which can be used to narrow down between functions that have the same name. In this case, we want the model to include all functions in boost::asio called read_until.

The sixth value should be left empty and is out of scope for this documentation. The remaining values are used to define the output specification, the kind, and the provenance (origin) of the source.

The seventh value "Argument[*1]" is the output specification, which means in this case that the source is the first indirection (or pointed-to value, *) of the second argument (Argument[1]) passed to the function.
The eighth value "remote" is the kind of the source. The source kind is used to define the threat model where the source is in scope. remote applies to many of the security related queries as it means a remote source of untrusted data. For more information, see “Threat models.”
The ninth value "manual" is the provenance of the source, which is used to identify the origin of the source model.
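The same columns can describe a member function. As a sketch, assume a hypothetical class acme::net::Connection with a method receive(char *buf, size_t len) that fills buf with data from the network (none of these names come from a real library). The type column then names the class, and the subtypes flag is set to True so that overrides in derived classes are covered as well:

```yaml
extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: sourceModel
    data:
      # Columns: namespace, type, subtypes, name, signature, ext, output, kind, provenance
      # Hypothetical method acme::net::Connection::receive(char *buf, size_t len).
      # "Argument[*0]" is the data that the first argument (buf) points to.
      - ["acme::net", "Connection", True, "receive", "", "", "Argument[*0]", "remote", "manual"]
```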

Example: Taint sink in the boost::asio namespace

This example shows how the CPP query pack models the second argument of the boost::asio::write function as a remote flow sink. A remote flow sink is where data is transmitted to other machines across a network, which is used for example by the “Cleartext transmission of sensitive information” (cpp/cleartext-transmission) query.

boost::asio::write(socket, send_buffer, error);

We need to add a tuple to the sinkModel(namespace, type, subtypes, name, signature, ext, input, kind, provenance) extensible predicate by updating a data extension file.

extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: sinkModel
    data:
      - ["boost::asio", "", False, "write", "", "", "Argument[*1]", "remote-sink", "manual"]

The first five values identify the callable (in this case a free function) to be modeled as a sink.

The first value "boost::asio" is the namespace name.
The second value "" is the name of the type (class) that contains the method. Because we’re modeling a free function, the type is left blank.
The third value False is a flag that indicates whether or not the model also applies to all overrides of the method. For a free function, this should be False.
The fourth value "write" is the function name.
The fifth value is the function input type signature, which can be used to narrow down between functions that have the same name. In this case, we want the model to include all functions in boost::asio called write.

The sixth value should be left empty and is out of scope for this documentation. The remaining values are used to define the input specification, the kind, and the provenance (origin) of the sink.

The seventh value "Argument[*1]" is the input specification, which means in this case that the sink is the first indirection (or pointed-to value, *) of the second argument (Argument[1]) passed to the function.
The eighth value "remote-sink" is the kind of the sink. The sink kind is used to define the queries where the sink is in scope.
The ninth value "manual" is the provenance of the sink, which is used to identify the origin of the sink model.

Example: Add flow through the boost::asio::buffer function

This example shows how the CPP query pack models flow through a function for a simple case.

boost::asio::write(socket, boost::asio::buffer(send_str), error);

We need to add tuples to the summaryModel(namespace, type, subtypes, name, signature, ext, input, output, kind, provenance) extensible predicate by updating a data extension file:

extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: summaryModel
    data:
      - ["boost::asio", "", False, "buffer", "", "", "Argument[*0]", "ReturnValue", "taint", "manual"]

The first five values identify the callable (in this case a free function) to be modeled as a summary.

The first value "boost::asio" is the namespace name.
The second value "" is the name of the type (class) that contains the method. Because we’re modeling a free function, the type is left blank.
The third value False is a flag that indicates whether or not the model also applies to all overrides of the method. For a free function, this should be False.
The fourth value "buffer" is the function name.
The fifth value is the function input type signature, which can be used to narrow down between functions that have the same name. In this case, we want the model to include all functions in boost::asio called buffer.

The sixth value should be left empty and is out of scope for this documentation. The remaining values are used to define the input and output specifications, the kind, and the provenance (origin) of the summary.

The seventh value is the input specification (where data flows from). Argument[*0] specifies the first indirection (or pointed-to value, *) of the first argument (Argument[0]) passed to the function.
The eighth value "ReturnValue" is the output specification (where data flows to), in this case the return value.
The ninth value "taint" is the kind of the flow. taint means that taint is propagated through the call.
The tenth value "manual" is the provenance of the summary, which is used to identify the origin of the summary model.
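Flow does not have to end at the return value. As a sketch, assume a hypothetical free function acme_copy(char *dst, const char *src) in the global namespace that copies the string at src into dst (the function is invented for illustration). The input is then the data pointed to by the second argument, and the output is the data pointed to by the first:

```yaml
extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: summaryModel
    data:
      # Columns: namespace, type, subtypes, name, signature, ext, input, output, kind, provenance
      # Hypothetical function acme_copy(char *dst, const char *src).
      # Taint flows from *src ("Argument[*1]") to *dst ("Argument[*0]").
      - ["", "", False, "acme_copy", "", "", "Argument[*1]", "Argument[*0]", "taint", "manual"]
```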

Example: Taint barrier using the mysql_real_escape_string function

This example shows how the CPP query pack models the mysql_real_escape_string function as a barrier for SQL injection. This function escapes special characters in a string for use in an SQL statement, which prevents SQL injection attacks.

const char *query = "SELECT * FROM users WHERE name = '%s'";
char *name = get_untrusted_input();
char *escaped_name = new char[2 * strlen(name) + 1];
mysql_real_escape_string(mysql, escaped_name, name, strlen(name)); // escaped_name is now safe to embed in the SQL string.
sprintf(query_buffer, query, escaped_name);

We need to add a tuple to the barrierModel(namespace, type, subtypes, name, signature, ext, output, kind, provenance) extensible predicate by updating a data extension file.

extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: barrierModel
    data:
      - ["", "", False, "mysql_real_escape_string", "", "", "Argument[*1]", "sql-injection", "manual"]

The first five values identify the callable (in this case a free function) to be modeled as a barrier.

The first value "" is the namespace name.
The second value "" is the name of the type (class) that contains the method. Because we’re modeling a free function, the type is left blank.
The third value False is a flag that indicates whether or not the model also applies to all overrides of the method. For a free function, this should be False.
The fourth value "mysql_real_escape_string" is the function name.
The fifth value is the function input type signature, which can be used to narrow down between functions that have the same name.

The sixth value should be left empty and is out of scope for this documentation. The remaining values are used to define the output specification, the kind, and the provenance (origin) of the barrier.

The seventh value "Argument[*1]" is the output specification, which means in this case that the barrier is the first indirection (or pointed-to value, *) of the second argument (Argument[1]) passed to the function.
The eighth value "sql-injection" is the kind of the barrier. The barrier kind is used to define the queries where the barrier is in scope.
The ninth value "manual" is the provenance of the barrier, which is used to identify the origin of the barrier model.

Example: Add a barrier guard

This example shows how to model a barrier guard that stops the flow of taint when a conditional check is performed on data. A barrier guard model is used when a function returns a boolean that indicates whether the data is safe to use. Consider a function called is_safe which returns true when the data is considered safe.

if (is_safe(user_input)) { // The check guards the use, so the input is safe.
    mysql_query(mysql, user_input); // This is safe.
}

We need to add a tuple to the barrierGuardModel(namespace, type, subtypes, name, signature, ext, input, acceptingValue, kind, provenance) extensible predicate by updating a data extension file.

extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: barrierGuardModel
    data:
      - ["", "", False, "is_safe", "", "", "Argument[*0]", "true", "sql-injection", "manual"]

The first five values identify the callable (in this case a free function) to be modeled as a barrier guard.

The first value "" is the namespace name.
The second value "" is the name of the type (class) that contains the method. Because we’re modeling a free function, the type is left blank.
The third value False is a flag that indicates whether or not the barrier guard also applies to all overrides of the method. For a free function, this should be False.
The fourth value "is_safe" is the function name.
The fifth value is the function input type signature, which can be used to narrow down between functions that have the same name.

The sixth value should be left empty and is out of scope for this documentation. The remaining values are used to define the input specification, the accepting-value, the kind, and the provenance (origin) of the barrier guard.

The seventh value "Argument[*0]" is the input specification (the value being validated). In this case, it is the first indirection (or pointed-to value, *) of the first argument (Argument[0]) passed to the function.
The eighth value "true" is the accepting value of the barrier guard. This is the value that the conditional check must return for the barrier to apply.
The ninth value "sql-injection" is the kind of the barrier guard. The barrier guard kind is used to define the queries where the barrier guard is in scope.
The tenth value "manual" is the provenance of the barrier guard, which is used to identify the origin of the barrier guard.
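Some checks report danger rather than safety. As a sketch, assume a hypothetical function has_sql_metachars(const char *s) that returns true when s contains characters that are dangerous in an SQL statement. Assuming the accepting value column takes "false" in the same way it takes "true", the data is safe on the branch where the call returns false:

```yaml
extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: barrierGuardModel
    data:
      # Columns: namespace, type, subtypes, name, signature, ext, input, acceptingValue, kind, provenance
      # Hypothetical function has_sql_metachars(const char *s).
      # The barrier applies where the check returns false (assumed to be a valid accepting value).
      - ["", "", False, "has_sql_metachars", "", "", "Argument[*0]", "false", "sql-injection", "manual"]
```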

Threat models

Note

Threat models are currently in beta and subject to change. During the beta, threat models are supported only by Java, C#, Python and JavaScript/TypeScript analysis.

A threat model is a named class of dataflow sources that can be enabled or disabled independently. Threat models allow you to control the set of dataflow sources that you want to consider unsafe. For example, one codebase may only consider remote HTTP requests to be tainted, whereas another may also consider data from local files to be unsafe. You can use threat models to ensure that the relevant taint sources are used in a CodeQL analysis.

The kind property of the sourceModel determines which threat model a source is associated with. There are two main categories:

remote which represents requests and responses from the network.
local which represents data from local files (file), command-line arguments (commandargs), database reads (database), environment variables (environment), standard input (stdin), and Windows registry values (windows-registry). Currently, Windows registry values are used by C# only.

Note that subcategories can be included or excluded separately, so you can specify local without database, or just commandargs and environment without the rest of local.

The less commonly used categories are:

android which represents reads from external files in Android (android-external-storage-dir) and parameters of entry-point methods declared in a ContentProvider class (contentprovider). Currently only used by Java/Kotlin.
database-access-result which represents a database access. Currently only used by JavaScript.
file-write which represents opening a file in write mode. Currently only used in C#.
reverse-dns which represents reverse DNS lookups. Currently only used in Java.
view-component-input which represents inputs to a React, Vue, or Angular component (also known as “props”). Currently only used by JavaScript/TypeScript.

When running a CodeQL analysis, the remote threat model is included by default. You can optionally include other threat models as appropriate when using the CodeQL CLI and in GitHub code scanning. For more information, see Analyzing your code with CodeQL queries and Customizing your advanced setup for code scanning.
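As a sketch of what enabling an extra threat model can look like in a code scanning configuration file, the fragment below adds the local threat model on top of the default remote one. The key name reflects the configuration format described in Customizing your advanced setup for code scanning and should be verified there. Whether it affects C and C++ analysis depends on the language support described in the note above.

```yaml
# CodeQL configuration file for code scanning (sketch)
# "remote" is included by default; this line additionally enables "local".
threat-models: local
```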
