Blue-teaming For Exiv2: Adding Custom CodeQL Queries To Code Scanning

>>> Shared from Original Post The GitHub Blog

This blog is the third in a four-part series about how I am helping to harden the security of the Exiv2 project. This post is about how the Exiv2 team tightened our security by enabling GitHub’s code scanning feature, and in particular how we added some custom queries that are tailored to the Exiv2 code base.

As I mentioned at the start of this series, I often find that the easiest way to learn is by example. I hope that these examples from Exiv2 will help you to improve the security of your own software projects.

Enabling code scanning with CodeQL

Code scanning automatically scans your GitHub repository for vulnerabilities and errors. It’s packaged as a GitHub Actions workflow and is very easy to enable for your repository. The configuration is stored in a YAML file in your .github subdirectory. In its default configuration, code scanning runs a suite of security checks implemented in CodeQL, which is a query language that lets you query code as though it were data. You can also use CodeQL to develop your own custom checks. The CodeQL queries used by code scanning are open source, so there are plenty of examples to learn from if you want to develop your own query.

By default, code scanning only runs a curated suite of the most accurate CodeQL queries, to avoid overwhelming developers with large numbers of alerts. The CodeQL repository contains many more queries that are not enabled by default. It’s quite likely that there are queries that are not enabled that would find additional useful results on your project. You can add more queries by editing your code scanning configuration. We have done that in Exiv2 by adding this line to our YAML file:

config-file: .github/codeql/codeql-config.yml

The linked YAML file, codeql-config.yml, specifies the queries that we want to run.

Adding custom CodeQL queries

In the Exiv2 project, we have added some custom CodeQL queries, which will automatically detect variants of bugs that have caused security problems in the past. In this blog post, I am going to take a deep dive into the most sophisticated query that we have added so far: a query to detect unsafe uses of std::vector::operator[]. Although I have chosen the most complicated query as my main example, I don’t want to give you the impression that CodeQL is always this hard. For example, one of the other custom queries is a very simple query for detecting signed shifts. Yet I think unsafe_vector_access.ql is interesting, primarily because it uses a few advanced CodeQL features, and secondly because it will allow me to highlight some of the trade-offs that are possible when you’re designing a query for a single codebase, rather than trying to design a general purpose query.

Safer std::vector indexing with CodeQL

One of the Exiv2 bugs that I discussed in my previous blog post was #1706, which was caused by an out-of-bounds index in std::vector::operator[]. The simplest solution to that type of bug is to use the at() method rather than operator[], because at() checks that the index is within bounds and throws an exception if it isn’t. It would be very simple to write a CodeQL query that finds every single use of operator[] and suggests that you replace it with at(). But most of those results would be false positives, and implementing all of the suggestions would cause a large amount of unnecessary code churn. Here’s a typical example, from actions.cpp:

for (num = 0; num < pvList.size(); ++num) {
    writePreviewFile(pvMgr.getPreviewImage(pvList[num]), static_cast<int>(num + 1));
}

The simplistic query would flag pvList[num], but it is clearly perfectly safe due to the loop condition.

My goal, for the rest of this blog post, is to refine the simplistic query so that it stops complaining about code that is clearly safe. I have a big advantage compared to somebody who is trying to develop a general purpose query for safe std::vector indexing, which is that my query only needs to be good enough for the Exiv2 codebase. Firstly, my query only needs to be able to handle coding idioms that are common in Exiv2. Secondly, if I want to, I can change Exiv2’s code to make it easier to analyze. As a simple example, these two conditions are equivalent:

if (x.empty()) { … }
if (x.size() == 0) { … }

If one of those two idioms is rare in the codebase, then it might be quicker to change those few occurrences, rather than adding more logic to the query so that it can handle both. I have the luxury that I can take the path of least resistance.

Note: for presentation purposes, the version of the query that I describe in this blog post is a slightly simplified version of the real thing. I have only omitted minor details. The most important parts of the query are the same.

Installing CodeQL

I recommend using the CodeQL for Visual Studio Code extension to develop new queries. You will also need a database for the codebase that you are analyzing. I have created a database for version 0.27.4 of Exiv2, which you can download here. If you would like to build your own database, then see the appendix for instructions.

The simplistic query

This is the most simplistic version of the query:

/**
 * @name Unsafe vector access
 * @description std::vector::operator[] does not do any runtime
 *              bounds-checking, so it is safer to use std::vector::at()
 * @kind problem
 * @problem.severity warning
 * @id cpp/unsafe-vector-access
 * @tags security
 *       external/cwe/cwe-125
 */

import cpp

// A call to `operator[]`.
class ArrayIndexCall extends FunctionCall {
  ClassTemplateInstantiation ti;
  TemplateClass tc;

  ArrayIndexCall() {
    this.getTarget().getName() = "operator[]" and
    ti = this.getQualifier().getType().getUnspecifiedType() and
    tc = ti.getTemplate() and
    tc.getSimpleName() != "map"
  }

  ClassTemplateInstantiation getClassTemplateInstantiation() { result = ti }
}

from ArrayIndexCall call
select call, "Unsafe use of operator[]. Use the at() method instead."

The class ArrayIndexCall is used to find all uses of operator[]. It’s a subclass of FunctionCall, because we are interested in overloaded operators, rather than the built-in array indexing operator. It is more general than just std::vector, also matching similar types such as std::string. Note that std::map has been explicitly excluded from the results, because it does not crash if the key does exist.

This version of the query has 402 results (on Exiv2 version 0.27.4), most of which are false positives. In the final version of the query, several clauses have been added to eliminate most of those false positives:

from ArrayIndexCall call
where
  // Ignore results in the xmpsdk directory.
  not call.getLocation().getFile().getRelativePath().matches("xmpsdk/%") and
  // Ignore accesses like this: `vsnprintf(&buffer[0], buffer.size(), format, args)`
  // That's pointer arithmetic, not a deref, so it's usually a false positive.
  not exists(AddressOfExpr addrExpr | addrExpr.getOperand() = call) and
  not indexK_with_check(_, call) and
  not indexI_with_check(_, call) and
  not index_last_with_check(_, call)
select call, "Unsafe use of operator[]. Use the at() method instead."

I’ll explain those clauses in the next few sections.

Excluding the xmpsdk directory

Let’s start with the simplest (and crudest) clause that I added to reduce the number of results:

// Ignore results in the xmpsdk directory.
not call.getLocation().getFile().getRelativePath().matches("xmpsdk/%")

This clause says that I’m not interested in any results in the xmpsdk subdirectory. That may seem a little surprising, but there are reasons for this. The xmpsdk directory is an unfortunate situation in the Exiv2 codebase. It’s a copy of the Adobe XMP SDK, which is used to parse snippets of XML that are sometimes embedded in image metadata. It was added to the Exiv2 codebase many years ago, when the SDK was only published as a tarball. More recently, the SDK has been published on GitHub, so it may now be possible to replace the xmpsdk subdirectory with a git submodule, but that work has not started yet. Meanwhile, we try to make as few changes to xmpsdk source code as possible, since it’s third-party code. 🙈

Excluding AddressOfExpr

The next clause uses a heuristic to eliminate false positives like this one:

c = vsnprintf(&buffer[0], buffer.size(), format, args);

The address is taken, which means that the element isn’t immediately dereferenced. So even if the index is out-of-bounds, it doesn’t necessarily mean that it will lead to an out-of-bounds read. Analyzing what happens with the address next is beyond the scope of the query that I’m trying to write here, so it’s easier to just ignore those kinds of results.

The clause eliminates results where the parent of the operator[] is an AddressOfExpr:

// Ignore accesses like this: `vsnprintf(&buffer[0], buffer.size(), format, args)`
// That's pointer arithmetic, not a deref, so it's usually a false positive.
not exists(AddressOfExpr addrExpr | addrExpr.getOperand() = call) and

Constant valued index

The next clause is much more principled than the previous two. It handles examples like this:

if(numbers.size()>=10)
{
    //year                                                                                                                                                                       
    long l=(numbers[0]-48)*10+(numbers[1]-48);
    if(l<70)
    {
        l+=2000;
    }
    else
    {
        l+=1900;
    };
    os << l << ":";
    // month, day, hour, minutes                                                                                                                                                 
    os << numbers[2] << numbers[3] << ":" << numbers[4] << numbers[5] << " " << numbers[6] << numbers[7] << ":" << numbers[8] << numbers[9];
}

An access like numbers[9] is safe, because it is guarded by a bounds check:

if(numbers.size()>=10)

These false positives are eliminated by this clause in the query:

not indexK_with_check(_, call) and

Which uses this predicate:

// Array accesses like these are safe:
// `if (!x.empty()) { ... x[0] ... }`
// `if (x.size() > 2) { ... x[2] ... }`
predicate indexK_with_check(GuardCondition guard, ArrayIndexCall call) {
  exists(Expr arrayExpr, BasicBlock block, int i, int minsize, boolean branch |
    minimum_size_cond(guard, arrayExpr, minsize, branch) and
    globalValueNumber(arrayExpr) = globalValueNumber(call.getQualifier()) and
    guard.controls(block, branch) and
    block.contains(call) and
    i = call.getArgument(0).getValue().toInt() and
    0 <= i and
    i < minsize
  )
}

That predicate, in turn, uses a predicate named minimum_size_cond, which I will not discuss in detail here, but you can see in the Exiv2 repo. minimum_size_cond finds conditions like x.size()>2 or !x.empty(), which imply a lower bound on the size of the vector.

The most important clause in indexK_with_check is this one:

guard.controls(block, branch) and

It uses the Guards library to check that the minimum size condition “controls” the block of code containing the indexing expression, which means that we know that condition is true when the indexing happens.

Also important in this predicate is the use of the GlobalValueNumbering library to check that the vector in the condition is the same as the vector being indexed. It is used to prevent the query from accidentally classifying code like this as safe:

if (x.size() > 2) { ... y[2] ... }

Index with bounds check

The next clause handles code like this, which is probably the most common safe indexing idiom:

for (size_t i = 0; i < stringValue.length(); ++i) {
    if (stringValue[i] == 'T') stringValue[i] = ' ';
    if (stringValue[i] == '-') stringValue[i] = ':';
}

This is the clause:

not indexI_with_check(_, call) and

That uses this predicate, which is quite similar to indexK_with_check from the previous section:

// Array accesses like this are safe:
// `if (i < x.size()) { ... x[i] ... }`
predicate indexI_with_check(GuardCondition guard, ArrayIndexCall call) {
  exists(Expr idx, SizeCall sizeCall, BasicBlock block, boolean branch |
    relOpWithSwapAndNegate(guard, idx, sizeCall, Lesser(), Strict(), branch) and
    globalValueNumber(sizeCall.getQualifier()) = globalValueNumber(call.getQualifier()) and
    globalValueNumber(idx) = globalValueNumber(call.getArgument(0)) and
    guard.controls(block, branch) and
    block.contains(call)
  )
}

The main difference is that the index is expected to be a symbolic expression, like i, so the GlobalValueNumbering library is used a second time to check that the index expression matches the expression in the condition.

Indexing the last element

The final clause handles an indexing idiom that occurs quite frequently in Exiv2, like this example:

while (   p.length() > 1
       && (p[p.length()-1] == '\\' || p[p.length()-1] == '/')) {
    p = p.substr(0, p.length()-1);
}

It is safe to index the last element of the vector, provided that we know that the vector has at least one element.

This is the clause:

not index_last_with_check(_, call)

And this is the predicate:

// Array accesses like this are safe:
// `if (!x.empty()) { ... x[x.size() - 1] ... }`
predicate index_last_with_check(GuardCondition guard, ArrayIndexCall call) {
  exists(Expr arrayExpr, SubExpr minusExpr, SizeCall sizeCall, BasicBlock block, boolean branch |
    minimum_size_cond(guard, arrayExpr, _, branch) and
    globalValueNumber(arrayExpr) = globalValueNumber(call.getQualifier()) and
    guard.controls(block, branch) and
    block.contains(call) and
    minusExpr = call.getArgument(0) and
    minusExpr.getRightOperand().getValue().toInt() = 1 and
    DataFlow::localExprFlow(sizeCall, minusExpr.getLeftOperand()) and
    globalValueNumber(sizeCall.getQualifier()) = globalValueNumber(call.getQualifier())
  )
}

It has some extra logic to match the x.size()-1 expression, but it’s otherwise quite similar to the previous predicates. The only new technique is the use of DataFlow::localExprFlow, which I added to handle cases like this, where the size is stored in a local variable:

// pop trailing ':' on a namespace
if ( bNS && !out.empty() ) {
    std::size_t length = out.length();
    if ( out[length-1] == ':' ) out = out.substr(0,length-1);
}

Conclusion

Code scanning helps to prevent bugs from being introduced into your codebase. It is easy to enable for your repository, and it runs as an automated check on pull requests so that problems are caught before the pull request is merged. By default, code scanning runs a relatively small suite of the most accurate CodeQL queries, so you can get more value out of it by adding more queries. In the Exiv2 project, we have added some custom queries that will automatically detect variants of bugs that have caused security problems in the past. In this blog post, I discussed a fairly sophisticated example: a query to detect unsafe uses of std::vector::operator[]. The benefit of writing custom queries is that they can be tailored to recognize coding idioms that are common in your specific codebase, which means that you can make the rule quite strict without also producing a lot of annoying false positives. Custom queries don’t always need to be as complicated as the example in this blog post though. For example, I recently added a very simple query to detect signed shifts, which can cause undefined behavior, like this bug.

Follow GitHub Security Lab on Twitter for the latest in security research.


Appendix: building a CodeQL database for Exiv2

Here are complete instructions for downloading the CodeQL CLI and creating a CodeQL database for Exiv2:

# Get the CodeQL CLI
export CODEQL_VERSION=v2.6.1
export EXIV2_VERSION=v0.27.4
mkdir codeql-home
cd codeql-home
curl -L -O https://github.com/github/codeql-cli-binaries/releases/download/$CODEQL_VERSION/codeql-linux64.zip
unzip codeql-linux64.zip 
# Get the CodeQL queries
git clone https://github.com/github/codeql.git codeql-queries
cd ..
# Build a database for Exiv2
git clone https://github.com/Exiv2/exiv2.git
cd exiv2/
git checkout $EXIV2_VERSION
../codeql-home/codeql/codeql database create exiv2_$EXIV2_VERSION --language=cpp
zip -r exiv2_$EXIV2_VERSION.zip exiv2_$EXIV2_VERSION/

Those instructions are for Linux but should be easy to adapt for MacOS or Windows. You may also need to update the GitHub CLI version number: the latest release is available here. But don’t change the Exiv2 version number. Version 0.27.4 works the best for the query presented in this blog post, because most of the query results have been fixed in more recent versions.

>>> Read the Full Story at The GitHub Blog