Three Rules Of Bug Fixing For Better OSS Security

>>> Shared from Original Post The GitHub Blog

This blog is the second in a four-part series about how I am helping to harden the security of the Exiv2 project. Today, I want to talk about the three things that I believe you should always do when you are fixing a bug, especially when the bug is a security vulnerability:

  1. Add a regression test
  2. Fix the bug
  3. Find variants and fix them too

As an example, this pull request, which fixes GHSA-pvjp-m4f6-q984, has three commits corresponding to those three steps.

Step 2 is obvious, but why are steps 1 and 3 important, and why should you do them in that order?

Regression testing

The purpose of regression testing is to make sure that you never make the same mistake twice. When you fix a bug, you should add a regression test, which is designed to fail if the same bug is ever reintroduced. I don’t think regression testing is controversial, but it often feels like a waste of time to write a test for a bug that you already fixed. Realistically, how likely is it that somebody will ever reintroduce the exact same bug? It’s a tempting step to skip if you’re in a hurry. These are the reasons why I think it’s important:

  • Proof that the bug is fixed. I like to add the regression test as the first commit, before the commit that fixes the bug, so that I can easily confirm that the test fails without the bug fix and succeeds with it.
  • Code coverage. Bug fixes often add a new if-condition to the code. The regression test ensures that the new code has coverage. If a future developer is tempted to remove the if-condition because they don’t understand what it’s for, then the test will fail.
  • Improving the fuzzing corpus. Fuzzers work best with a corpus of sample inputs. In the case of Exiv2, the files in the test/data subdirectory are also useful as a corpus. Every time we add a new regression test, we are improving the quality of the corpus.

Finding variants and fixing them

Not everybody agrees with this approach. In previous software development jobs, I have worked with people who believe that you should only fix what’s verifiably broken and nothing else, especially when you’re fixing the bug in a released version of the software rather than on the development branch. They’re scared that by changing code that you don’t need to, you might accidentally introduce a new bug. But I think that’s the wrong risk to worry about. The reality is that if you find one bug, there are almost always others nearby. We have so much evidence of that from the miserable history of computer security, that it should no longer be in dispute. However, it’s clear that many software vendors persist in following the minimal fix strategy; in the worst case only fixing the symptoms of the bug rather than the root cause to the great frustration of security researchers. With this in mind, I am a firm believer that every bug is an opportunity to toss in a few extra defensive fixes.

For the rest of this blog post, I want to show you a couple of examples from Exiv2 of defensive fixes that I have added. In both examples, I’ll also show you how I used a simple CodeQL query to find variants in other parts of the codebase.

Example 1: Divide by zero (GHSA-pvjp-m4f6-q984, CVE-2021-34335)

This vulnerability is a crash caused by an integer divide by zero in minoltamn_int.cpp:

long    focalLength = getKeyLong  ("Exif.Photo.FocalLength"      ,metadata);
long    focalL35mm  = getKeyLong  ("Exif.Photo.FocalLengthIn35mmFilm",metadata);
long    focalRatio  = (focalL35mm*100)/focalLength;  <===== Divide by zero

On line 2174, focalLength is read from the image metadata, which means that it’s an attacker-controlled value. If the value is zero then it causes a crash on line 2176.

As you can see in the pull request, I fixed an almost identical bug on line 2183. But I also expanded the search to look for integer divisions that could potentially cause a crash due to INT_MIN/-1, which is another less well known gotcha of integer division.

This is the CodeQL query that I used:

/**
 * @kind: problem
 */
import cpp
import semmle.code.cpp.rangeanalysis.SimpleRangeAnalysis

from DivExpr div, Expr rhs
where
  rhs = div.getRightOperand() and
  div.getType() instanceof IntegralType and
  not lowerBound(rhs) > 0
select rhs, "Possible integer divide by zero. Type: " + rhs.getType().toString()

The query is very simple. It looks for an integer division operator whose right hand operand isn’t obviously greater than zero. (Exiv2 is unlikely to ever divide by a negative number on purpose, so any division operator with a possibly negative right-hand operand is suspect. Dividing by a negative number might enable an attacker to trigger the INT_MIN/-1 crash that I mentioned earlier.) I used the SimpleRangeAnalysis library to implement the greater-than-zero condition. Range analysis computes upper and lower bounds for expressions in the program. For example, it might deduce that the index variable of a simple for-loop is in the range 0..9. However, as its name suggests, it is only able to deduce accurate ranges in relatively simple cases1, so it often defaults to the range INT_MIN..INT_MAX (with values depending on the type of the expression). As a result, the query produces quite a few false positive results. You can see the results of the query on a recent version of Exiv2 here. There are 24 results, all of which I believe are false positives. (I already fixed everything that looked like a genuine bug.) Here’s an example of one of the false positives in value.hpp:

ok_ = (value_.at(n).second > 0 && value_.at(n).first < LARGE_INT);
if (!ok_) return 0;
return value_.at(n).first / value_.at(n).second;

It’s clear that this division is safe, but the logic is a bit too complicated for SimpleRangeAnalysis to analyze.

There’s a trade-off between spending more time improving the accuracy of the CodeQL versus just manually auditing the results. In this particular case, I didn’t think it was worth spending any more time on the query, since integer divisions are relatively rare in the Exiv2 codebase, so they’re easy to audit manually. Some queries are worth spending more time on though—as I will discuss in the next blog post in this series.

Example 2: Use vector::at() rather than operator[] (#1706)

std::vector provides two indexing methods: at() and operator[]. The former throws an exception if the index is out-of-bounds, but the latter does not. Therefore, it is safer to use at() unless you are very sure that the index is in bounds.

The crash (#1706) was caused by an out-of-bounds access in value.hpp:

template
long ValueType::toLong(long n) const
{
    ok_ = true;
    return static_cast(value_[n]);
}

toLong() is a utility function that is used in many places. Any one of those call-sites might accidentally call toLong() with an out-of-bounds index, so it would be a substantial auditing task to check them all. Instead, I decided that it would be much safer to use at() here.

As you can see in the pull request, I replaced quite a few instances of operator[] with at(), mostly in the same header file. This is the CodeQL query that I used to search for variants:

/**
 * @kind: problem
 */
import cpp
import semmle.code.cpp.rangeanalysis.SimpleRangeAnalysis

from FunctionCall call, ClassTemplateInstantiation t, TemplateClass c
where
  call.getQualifier().(VariableAccess).getTarget().getName() = "value_" and
  call.getTarget().getName() = "operator[]" and
  t = call.getQualifier().getType().getUnspecifiedType() and
  c = t.getTemplate() and
  c.getSimpleName() != "map"
select call, "Unsafe use of " + c.getSimpleName() + "::operator[]."

This query is very simplistic because it restricts the results to vectors named value_. If that clause is commented out, then the query has 356 results, which is too many.

Unfortunately, my simplistic query was not good enough to catch all the variants, because another crash caused by operator[] was discovered a few weeks later (GHSA-v5g7-46xf-h728). So I decided to spend some time improving the accuracy of this query, as I will discuss in more detail in my next blog post.

Other CodeQL examples

On several other occasions, I used simplistic CodeQL queries to quickly find variants. Here’s a short list in case you are interested in looking at other CodeQL examples:

  • Non-determinism due to uninitialized local variable: #1737
  • SIGSEGV due to dereferencing an end iterator: #1758
  • Infinite loop due to integer overflow in loop counter: #1766

Conclusion

Security bugs are rarely unique. Usually, there are similar bugs elsewhere in the code. That’s why I believe that every security bug is an opportunity to toss in a few extra defensive fixes. If you’re already going through the hassle of doing a new release to fix a security problem, then why not take the opportunity to do some extra security hardening?

In this blog post, I showed you a couple of examples of security bugs that I fixed in Exiv2. In both cases, I did three things:

  1. Add a regression test
  2. Fix the bug
  3. Fix variants of the bug

CodeQL is a very useful tool for finding variants, as demonstrated through both examples in this blog post. However, the CodeQL queries that I showed in this blog post were a bit simplistic, which led to a higher percentage of false positives (which is fine when you’re searching for other bugs to fix, but not fine if you’re trying to turn it into a formal rule). In the next blog post, I will show how I took one of the unpolished queries from this blog post and improved it so that it could be added to Exiv2’s continuous integration test suite—showing the true power of CodeQL.

Follow GitHub Security Lab on Twitter for the latest in security research.


1 I was the original author of SimpleRangeAnalysis a few years ago, so I am responsible for its limitations. 😳

>>> Read the Full Story at The GitHub Blog