Learn to use CodeQL, a query language that helps find bugs in source code. Find 9 remote code execution vulnerabilities in the open-source project Das U-Boot, and join the growing community of security researchers using CodeQL.
Tags: none
Quickly learn CodeQL, an expressive language for code analysis, which helps you explore source code to find bugs and vulnerabilities. During this beginner-level course, you will learn to write queries in CodeQL and find critical security vulnerabilities that were identified in Das U-Boot, a popular open-source project.
Upon completion of the course, you'll be able to:
- Understand the basic syntax of CodeQL queries
- Use the standard CodeQL libraries to write queries and explore code written in C/C++
- Use predicates and classes, the building blocks of CodeQL queries, to make your queries more expressive and reusable
- Use the CodeQL data flow and taint tracking libraries to write queries that find real security vulnerabilities
You will walk in the steps of our security researchers, and create:
- Several CodeQL queries that look for interesting patterns in C/C++ code.
- A CodeQL security query that finds 9 critical security vulnerabilities in the Das U-Boot codebase from 2019 (before it was patched!) and can be reused to audit other open-source projects of your choice.
- Some knowledge of the C language and standard library.
- A basic knowledge of secure coding practices is useful to understand the context of this course, and all the consequences of the bugs we'll find, but is not mandatory to learn CodeQL.
- This is a beginner course. No prior knowledge of CodeQL is required.
- Security researchers
- Developers
We created this course to help you quickly learn CodeQL, our query language and engine for code analysis. The goal is to find several remote code execution (RCE) vulnerabilities in the open-source software known as U-Boot, using CodeQL and its libraries for analyzing C/C++ code. To find the real vulnerabilities, you'll need to write a sequence of queries, making them more precise at each step of the course.
The goal is to find a set of 9 remote-code-execution vulnerabilities in the U-Boot boot loader. These vulnerabilities were originally discovered by GitHub Security Lab researchers and have since been fixed. An attacker with positioning on the local network, or control of a malicious NFS server, could potentially achieve remote code execution on the U-Boot powered device. This was possible because the code read data from the network (that could be attacker-controlled) and passed it to the length parameter of a call to the memcpy
function. When such a length parameter is not properly validated before use, it may lead to exploitable memory corruption vulnerabilities.
U-Boot contains hundreds of calls to both memcpy
and libc
functions that read data from the network. You can often recognize network data being acted upon through use of the ntohs
(network to host short) and ntohl
(network to host long) functions or macros. These swap the byte ordering for integer values that are received in network ordering to the host's native byte ordering (which is architecture dependent).
In this course, you will use CodeQL to find such calls. Many of those calls may actually be safe, so throughout this course you will refine your query to reduce the number of false positives, and finally track down the unsafe calls to memcpy
that are influenced by remote input.
Upon completion of the course, you will have created a CodeQL query that is able to find variants of this common vulnerability pattern.
Bookmark these useful documentation links:
If you get stuck during this course and need some help, the best place to ask for help is on the GitHub Security Lab Slack. Request an invitation from the Security Lab Get Involved page and ask in the channel #codeql-writing
. There are also sample solutions in the course repository, but please try to solve the tasks on your own first!
Hope this is exciting! Please close this issue now, then wait for the next set of instructions to appear in a comment below.
We will use the CodeQL extension for Visual Studio Code. You will take advantage of IDE features like auto-complete, contextual help and jump-to-definition.
Don't worry, you'll do this setup only once, and you'll be able to use it for future CodeQL development.
Follow the instructions below.
- This course is using GitHub actions. Please make sure that GitHub actions is enabled on this repository: Enable actions
- Install the Visual Studio Code IDE.
- Go to the CodeQL starter workspace repository, and follow the instructions in that repository's README. When you are done, you should have the CodeQL extension installed and the
vscode-codeql-starter
workspace open in Visual Studio Code. - Download and unzip this U-Boot CodeQL database, which corresponds to revision
d0d07ba
. - Import the database into Visual Studio Code (see documentation). This is the database that we'll be running queries on for the duration of this course.
- Clone this course repository on your local machine.
- Add the course repository to your Visual Studio Code starter workspace, by navigating to
File -> Add Folder to Workspace...
. If you open this folder you should see several files in there, includingqlpack.yml
andREADME.md
.
When you're done setting up, please close this issue, then wait for the next set of instructions to appear in a comment below.
Let's continue to the next step.
You will now run a simple CodeQL query, to understand its basic concepts and get familiar with your IDE.
-
Edit the file
3_function_definitions.ql
with the following contents:import cpp from Function f where f.getName() = "strlen" select f, "a function named strlen"
Don't copy / paste this code, but instead type it slowly. You will see the CodeQL auto-complete suggestions in your IDE as you type.
- After typing
from
and the first letters ofFunction
, the IDE will propose a list of available classes from the CodeQL library for C/C++. This is a good way to discover what classes are available to represent standard patterns in the source code. - After typing
where f.
the IDE will propose a list of available predicates that you can call on the variablef
. - Type the first letters of
getName()
to narrow down the list. - Move your cursor to a predicate name in the list to see its documentation. This is a good way to discover what predicates are available and what they mean.
- After typing
-
Run this query: Right-click on the query editor, then click CodeQL: Run Query.
-
Inspect the results appearing in the results panel. Click on the result hyperlinks to navigate to the corresponding locations in the U-Boot code. Do you understand what this query does? You probably guessed it! This query finds all functions with the name
strlen
.
Now it's time to submit your query. You will have 2 choices to do that, and we'll explain both of them in the comments below. Once you have chosen your method, submit your answer!
Read carefully: you will need to follow the same steps to submit your answers to later steps. You can always come back to this issue later to check the submission instructions.
The first method to submit your query is via a Pull Request. Using a Pull request has several advantages:
- If you are following this course with a mentor, or with co-learners, you may want them to interact with your PR via review, additional commits, etc.
- Creating a PR is good practice when contributing to shared code.
- You will be able to track the execution of the query checker directly in the PR.
However this workflow is bit more involved than just directly committing to main for the purposes of this course.
To submit this query via Pull Request, you can follow the following workflow:
- First, refresh your main branch, commit your changes to a new branch, and push them:
git checkout main git pull git checkout -b step-3 git add . git commit -a -m "First Query" git push -u origin step-3
- Then open a pull request.
- Wait for the course to check your query. It will display a status on your pull request!
- Once the check is completed, refresh your browser to get the next set of instructions.
- If the status is green, merge your PR and follow these instructions.
This method is simpler. You won't have to juggle between branches, rebase onto main, or create Pull Requests. However, merging directly to main is not a good practice when you are contributing to a shared code base, so if you choose this method, please don't take this bad habit home with you!
To submit this query via a direct commit to main, you can follow this workflow:
-
Commit your updated query file to your course repo:
git add . git commit -m "Any message here - why not step 3" git push origin main
-
Wait for your work to be checked, and for the results to appear as a comment below. The checks shouldn't take more than 5 minutes.
If the checks are successful, the course will close this issue and create a comment pointing you to the next step. If the checks are unsuccessful, the course will comment on your latest commit with more information, so that you can fix your query and try again.
To track the execution of the query checker, you can follow along in the Actions panel if you like.
Let's continue to the next step.
Ooops! The query you submitted in {{commit}} didn't find the right results. Please take a look at the comment and try again.
To submit a new iteration of your query, you just have to push a new commit to the same branch (main
or the PR branch).
Now let's analyze what you have written. A CodeQL query has the following basic structure:
import /* ... path to some CodeQL libraries ... */
from /* ... variable declarations ... */
where /* ... logical formulas that say something about the variables ... */
select /* ... expressions to output ... */
The from
/where
/select
part is the query clause: it describes what we are trying to find in the source code.
Let's look closer at the query we wrote in the previous step.
Show the query
import cpp
from Function f
where f.getName() = "strlen"
select f, "a function named strlen"
At the top of the query is import cpp
. This is an import statement. It brings into scope the standard CodeQL library that models C/C++ code, allowing us to use its features in our query. We'll use this library in every query, and in later steps we'll also use some more specialized libraries.
In the from
section, there is a declaration Function f
. Here we declare a variable named f
which has the type Function
. Function
is a class declared in the standard library (you can jump to the definition using F12). A class represents a collection of values, in this case the collection of all C/C++ functions in the source code.
Now look at the expression f.getName()
in the where
section. Here we call the predicate getName
on the variable f
of type Function
. Predicates are the building blocks of a query: they express logical properties that we want to hold. Some predicates return results (like getName
), and some predicates do not (they just assert that a property must be true).
So far your query finds all functions with the name strlen
. It does this by asserting that the result of f.getName()
is equal to the string "strlen"
.
- Edit the file
4_memcpy_definitions.ql
- Copy the query you wrote in step 3 into this file, and modify the
where
clause so that the query finds all definitions of functions namedmemcpy
instead. - Run your query on the U-Boot codebase to verify the results.
- Submit your solution as explained previously.
Congratulations, looks like the query you introduced in {{commit}} finds the correct results!
If you created a pull request, merge it.
Let's continue to the next step.
Ooops! The query you submitted in {{commit}} didn't find the right results. Please take a look at the comment and try again.
To submit a new iteration of your query, you just have to push a new commit to the same branch (main
or the PR branch).
We want to identify integer values that are supplied from network data. A good way to spot those is to look for use of network ordering conversion macros such as ntohl
, ntohll
, and ntohs
.
In the from
section of the query, you declare some variables, and state the types of those variables. The type tells us what the possible values are for the variable.
In the previous query you were querying for values in the class Function
to find functions in the source code. We have to query a different type to find macros in the source code instead. Can you guess its name?
NOTE: These Network ordering conversion utilities can be macros or functions depending on the platform. In this course, we are looking at a Linux database, where they are macros.
- Edit the file
5_macro_definitions.ql
- Write a query that finds the definitions of the macros named
ntohs
,ntohl
orntohll
. Use the auto-completion in the Visual Studio Code extension to guide you:- Wait a moment after typing
from
to get a list of available classes in the CodeQL standard library for C/C++. Which class in this list represents macros? Create a variable with this class as its type. - In the
where
section, type<your_variable_name>
followed by a dot.
, and wait a moment to get the list of predicates available for a value in the variable's type. Hover over each predicate to see the inline documentation. - Which predicate will give us the name of a macro?
- Use the
or
keyword to combine multiple conditions where you want at least one condition to be met. Here we are interested in three possible macro names.
- Wait a moment after typing
- To write a more compact query that searches for all three macros at once, instead of using three cases combined by
or
you have 2 choices:- Use a regular expression. Check out the predicate
string::regexpMatch
in the built-in predicates for string. CodeQL uses thejava.util.Pattern
regexp conventions. - Use a set literal expression to equate
<your_variable_name>
to a list of choices<your_variable_name> in ["bar", "baz", "quux"]
- Use a regular expression. Check out the predicate
- Once you're happy with the results, submit your solution.
Congratulations, looks like the query you introduced in {{commit}} finds the correct results!
If you created a pull request, merge it.
Let's continue to the next step.
Ooops! The query you submitted in {{commit}} didn't find the right results. Please take a look at the comment and try again.
To submit a new iteration of your query, you just have to push a new commit to the same branch (main
or the PR branch).
In step 4, you wrote a query that finds the definitions of functions named memcpy
in the codebase. Now, we want to find all the calls to memcpy
in the codebase.
One way to do this is to declare two variables: one to represent functions, and one to represent function calls. Then you will have to create a relationship between these variables in the where
section, so that they are restricted to only functions that are named memcpy
, and calls to exactly those functions.
- Edit the file
6_memcpy_calls.ql
- Use the auto-completion feature to find the class that represents function calls, and declare a variable that belongs to this class.
- Use auto-completion again on your function call variable to guess the predicate that tells us the target function that is being called.
- Combine this with your logic from step 4 to make sure the target function is named
memcpy
. - Once you're happy with the results, submit your solution.
Tip: You can have a look at the following C++ example. Note that your query will be simpler as you won't need to consider the declaringType
.
Finds calls to
std::map<...>::find()
import cpp from FunctionCall call, Function fcn where call.getTarget() = fcn and fcn.getDeclaringType().getSimpleName() = "map" and fcn.getDeclaringType().getNamespace().getName() = "std" and fcn.hasName("find") select call
Note: Once you have good results, you can try to make your query more compact by omitting the intermediate Function
variable. The 2 queries below are equivalent:
from Class1 c1, Class2 c2
where
c1.getClass2() = c2 and
c2.getProp() = "something"
select c1
from Class1 c1
where c1.getClass2().getProp() = "something"
select c1
Congratulations, looks like the query you introduced in {{commit}} finds the correct results!
If you created a pull request, merge it.
Let's continue to the next step.
Ooops! The query you submitted in {{commit}} didn't find the right results. Please take a look at the comment and try again.
To submit a new iteration of your query, you just have to push a new commit to the same branch (main
or the PR branch).
In step 5, you wrote a query that finds the definitions of macros named ntohs
, ntohl
and ntohll
in the codebase. Now, we want to find all the invocations of these macros in the codebase.
This will be similar to what you did in step 6, where you created variables for functions and function calls, and restricted them to look for a particular function and its calls.
Note: A macro invocation is a place in the source code that calls a particular macro. This is comparable to how a function call is a place in the source code that calls a particular function.
This query will look like the previous one, but with macros instead of functions.
- Edit the file
7_macro_invocations.ql
- Use the auto-completion to find the class that represents macro invocations, and declare a variable that belongs to this class.
- Use auto-completion again on your macro invocation variable, to find the predicate that tells us the target macro being invoked.
- Combine this with your logic from step 5 to make sure the target is one of the
ntoh*
macros. - As in the previous step, you can make your query more compact by omitting superfluous variable declarations.
- Once you're happy with the results, submit your solution.
Congratulations, looks like the query you introduced in {{commit}} finds the correct results!
If you created a pull request, merge it.
Let's continue to the next step.
Ooops! The query you submitted in {{commit}} didn't find the right results. Please take a look at the comment and try again.
To submit a new iteration of your query, you just have to push a new commit to the same branch (main
or the PR branch).
In the previous step, you found invocations of the macros we are interested in. Modify your query to find the top-level expressions these macro invocations expand to.
Note: An expression is a source code element that can have a value at runtime. Invoking a macro can bring various source code elements into scope, including expressions.
As before, if you don't know how a piece of source code is represented in the library, you can use the auto-completion and contextual help to discover the classes and predicates you need.
- Edit the file
8_macro_expressions.ql
with the previous query - Use the
getExpr()
predicate in theselect
section, to return the wanted expressions. - Once you're happy with the results, submit your solution.
Congratulations, looks like the query you introduced in {{commit}} finds the correct results!
If you created a pull request, merge it.
Let's continue to the next step.
Ooops! The query you submitted in {{commit}} didn't find the right results. Please take a look at the comment and try again.
To submit a new iteration of your query, you just have to push a new commit to the same branch (main
or the PR branch).
In this step we will learn how to write our own CodeQL classes. This will help us make the logic of our query more readable, easier to reuse, and easier to refine.
We'd like to find the same results as in the previous step, i.e. the top level expressions that correspond to the ntohl
, ntohs
and ntohll
macro invocations. It would be useful if we could refer to all such expressions directly, just like we can use MacroInvocation
from the standard library to refer to all macro invocations.
We will define a class to describe exactly this set of expressions, and use it in the last step of this course.
The Expr
class is the set of all expressions, and we are interested in a more specific set of expressions, so the class we write will be a subclass of Expr
.
So far, we have declared variables in the from
section of a query clause. Sometimes we need temporary variables in other parts of the query, and don't want to expose them in the query clause. The exists
keyword helps us do this. It is a quantifier: it introduces temporary variables and checks if they satisfy a particular condition.
To understand how exists
works, visit the documentation.
Then look at this example from the “Find the thief” CodeQL tutorial:
from Person t
where exists(string c | t.getHairColor() = c)
select t
This query selects all persons with a hair color that is a string
. So we'll get all persons that are not bald, since we are able to find a c
that defines their hair color. We don't really need c
in the query except to know that it exists.
-
We recommend that you first read the documentation on CodeQL classes.
-
Edit the file
9_class_network_byteswap.ql
with the template below:import cpp class NetworkByteSwap extends Expr { NetworkByteSwap () { // TODO: replace <class> and <var> exists(<class> <var> | // TODO: <condition> ) } } from NetworkByteSwap n select n, "Network byte swap"
-
This class
extends Expr
, which means it is a subclass ofExpr
, and it begins by taking all values fromExpr
. Now you need to restrict it to only the expressions we are interested in, which satisfy the condition of step 8.- You can do this by editing the characteristic predicate
NetworkByteSwap() { ... }
. The template includes theexists
quantifier, which will help. - Declare a temporary variable in the
exists
that refers to a macro invocation. - Constrain this macro invocation in the condition section of the
exists
. Use the same logic from thewhere
section of your query in step 8. - How is the macro invocation related to the expression? Use the same logic from the
select
section of your query in step 8. You can refer to the macro invocation using the name of the variable you created, and you can refer to the expression using thethis
variable.
- You can do this by editing the characteristic predicate
-
Once you're happy with the results, submit your solution.
Congratulations, looks like the query you introduced in {{commit}} finds the correct results!
If you created a pull request, merge it.
Let's continue to the next step.
Ooops! The query you submitted in {{commit}} didn't find the right results. Please take a look at the comment and try again.
To submit a new iteration of your query, you just have to push a new commit to the same branch (main
or the PR branch).
Great! You made it to the final step!
In step 9 we found expressions in the source code that are likely to have integers supplied from remote input, because they are being processed with invocations of ntoh
, ntohll
, or ntohs
. These can be considered sources of remote input.
In step 6 we found calls to memcpy
. These calls can be unsafe when their length arguments are controlled by a remote user. Their length arguments can be considered sinks: they should not receive user-controlled values without further validation.
Combining these pieces of information,
we know that code is vulnerable if tainted data flows from a network integer source to a sink in the length argument of a memcpy
call.
However, how do we know whether data from a particular source might reach a particular sink? This is known as data flow or taint tracking analysis. Given the number of results (hundreds of memcpy
calls and a large number of macro invocations), it would be quite a lot of work to triage all these cases manually.
To make our triaging job easier, we will have CodeQL do this analysis for us.
You will now write a query to track the flow of tainted data from network-controlled integers to the memcpy
length argument. As a result you will find 9 real vulnerabilities!
To achieve this, we’ll use the CodeQL taint tracking library. This library allows you to describe sources and sinks, and its predicate hasFlowPath
holds true when tainted data from a given source flows to a sink.
- Edit the file
10_taint_tracking.ql
with the template below. Note the annotationpath-problem
and the pattern used in theselect
section. This pattern allows CodeQL to interpret these results as a "path" through the code, and display the path in your IDE. - Copy and paste your definition of the
NetworkByteSwap
class from step 9. - Write the
isSource
predicate. This should recognize an expression in an invocation ofntohl
,ntohs
orntohll
.- You already described these expressions in the
NetworkByteSwap
class from step 9. Here we need to check that the source corresponds to a value that belongs to this class. - To check if a value belongs to CodeQL class, use the
<value> instanceof <myclass>
construct. - Note that the
source
variable is of typeDataFlow::Node
, while yourNetworkByteSwap
class is a subclass ofExpr
, so we cannot just writesource instanceof NetworkByteSwap
. (Try this and the compiler will give you an error.) Use auto-completion onsource
to discover the predicate that lets us view it as anExpr
.
- You already described these expressions in the
- Write the
isSink
predicate: The sink should be the size argument of calls tomemcpy
.- Use auto-completion to find the predicate that returns the
n
th argument of a function call. - Use the predicate you discovered when writing
isSource
to view thesink
as anExpr
.
- Use auto-completion to find the predicate that returns the
- Run your query. Note that the first run will take a little longer than the previous queries, since data flow analysis is more complex.
Submit your query when you're happy with the results.
Tip: For a complete example, read this article.
/**
* @kind path-problem
*/
import cpp
import semmle.code.cpp.dataflow.TaintTracking
import DataFlow::PathGraph
class NetworkByteSwap extends Expr {
// TODO: copy from previous step
}
class Config extends TaintTracking::Configuration {
Config() { this = "NetworkToMemFuncLength" }
override predicate isSource(DataFlow::Node source) {
// TODO
}
override predicate isSink(DataFlow::Node sink) {
// TODO
}
}
from Config cfg, DataFlow::PathNode source, DataFlow::PathNode sink
where cfg.hasFlowPath(source, sink)
select sink, source, sink, "Network byte swap flows to memcpy"
Congratulations, looks like the query you introduced in {{commit}} finds the correct results!
If you created a pull request, merge it.
Let's continue to the next step.
Ooops! The query you submitted in {{commit}} didn't find the right results. Please take a look at the comment and try again.
To submit a new iteration of your query, you just have to push a new commit to the same branch (main
or the PR branch).
Congratulations, you have finished the course! You can merge your last outstanding Pull Request if you have one. Don't hesitate to give us feedback; find us at https://securitylab.github.com/get-involved. And recommend this course to your friends if it was useful!