| position | tool | category | severity | redundancy_level | category_frequency | tool_fp_rate | neighbors | positive? |
|----------|------|----------|----------|------------------|--------------------|--------------|-----------|-----------|
| foo.c:47 | cppcheck | buffer overflow | critical | 1 | 10 | 0.3 | ? | true |
- `neighbors` would be a feature capturing the other warnings reported around the same location as the warning in question
- since we are collecting regular expressions that match the warning messages in order to label the warnings, we can use these regexes to cluster the warnings into specific categories (buffer, div0, pointer, etc.)
We might also benefit from a binary feature for each static analyzer, set to true when that analyzer reports the same bug and false otherwise.
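A minimal Python sketch of what one row of this feature set could look like (the field names mirror the proposed columns above; the example values are illustrative only, not taken from a real report, and the neighbors value is still the open question noted above):

```python
from dataclasses import dataclass

@dataclass
class WarningFeatures:
    """One candidate training row, mirroring the proposed columns above."""
    position: str            # file:line of the warning, e.g. "foo.c:47"
    tool: str                # analyzer that emitted the warning
    category: str            # regex-derived category (buffer, div0, pointer, ...)
    severity: str
    redundancy_level: int    # how many analyzers reported the same bug
    category_frequency: int  # warnings of this category in the whole run
    tool_fp_rate: float      # known false-positive rate of the tool
    neighbors: int           # other warnings around the same location (open question)
    positive: bool           # label: True for a true positive

# Illustrative row only, matching the example line above.
example = WarningFeatures("foo.c:47", "cppcheck", "buffer overflow",
                          "critical", 1, 10, 0.3, 0, True)
```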
- There are regular expressions to identify GOOD and BAD functions (see the sketch after this list)
- There are makefiles for some CWEs, where a binary is built for the files without Windows dependencies
- Some test cases are bad-only test cases. They should not be used to determine the number of false positives generated by a tool (although I do believe they may be useful for this experiment). These cases are listed in Appendix D of the Juliet user guide, under `juliet/doc`
- Accidental flaws (i.e. non-intentional bugs in Juliet) may exist, and they should be ignored.
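As mentioned in the first item of this list, the GOOD/BAD naming can be matched with regular expressions. The sketch below is only a rough Python approximation of that idea; the authoritative patterns are the ones given in the Juliet user guide:

```python
import re

# Hedged approximation only -- the authoritative regular expressions are the
# ones in the Juliet user guide, not these.
BAD_FUNC_RE = re.compile(r'(^|_)bad')    # bad, badSink, CWE..._01_bad, ...
GOOD_FUNC_RE = re.compile(r'(^|_)good')  # good1, goodG2B, goodB2GSink, ...

def classify_function(name):
    """Roughly classify a Juliet function name as 'bad', 'good' or 'unknown'."""
    if BAD_FUNC_RE.search(name):
        return "bad"
    if GOOD_FUNC_RE.search(name):
        return "good"
    return "unknown"

print(classify_function("CWE369_Divide_by_Zero__int_zero_divide_01_bad"))  # bad
print(classify_function("goodG2B"))                                        # good
```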
To run this experiment, you need the following software installed:
- python >= 3.6
- RPM
- firehose
- ctags
- RPM packages for
- cppcheck
- flawfinder
- frama-c
- scan-build (clang-analyzer)
Just run `make` to download and prepare the test suite and start running the analyzers. The results will be stored under the `reports` directory.
- some entries in the function scope list end with ':'. They seem to belong to C++ test cases; this needs further investigation
- there will be duplicate class names when trying to determine function scopes; in these cases, the largest range should be considered, in the hope that it covers the whole class (see the sketch after this list)
- for confirmation of the latest script, check the file scope of `s01/CWE690_NULL_Deref_From_Return__int64_t_realloc_83_bad.cpp`
- note that for the C++ cases, we can just check for the bad|good string in the file names
- Added a new feature with the number of warnings in a file, reaching 69% precision
- when I add the file name (we do not want it), precision goes up to 71%
- You can check that there are no repeated file names in the set of test cases used for this experiment with the following command:
test `cat c_testcases.list cpp_testcases.list | sed 's/.*\/\([^/]*\.c[p]*\).*/\1/' | sort -u | wc -l` == `cat c_testcases.list cpp_testcases.list | wc -l` && echo 'There are no repeated file names in the used set of the test suite'
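For the duplicated class names mentioned above, a minimal sketch of the "largest range wins" rule; the (name, start_line, end_line) tuple format is an assumption for illustration, not the actual output format of get_functions_info.sh:

```python
def widest_scopes(entries):
    """Keep, for each function/class name, the entry spanning the most lines.

    `entries` is assumed to be an iterable of (name, start_line, end_line)
    tuples; the actual output format of get_functions_info.sh may differ.
    """
    best = {}
    for name, start, end in entries:
        kept = best.get(name)
        if kept is None or (end - start) > (kept[1] - kept[0]):
            best[name] = (start, end)
    return best

# Example: two entries for the same class name; the wider range wins.
scopes = [("CWE690_bad", 10, 40), ("CWE690_bad", 10, 120)]
print(widest_scopes(scopes))  # {'CWE690_bad': (10, 120)}
```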
Static analysis reports were collected by running 4 static analysis tools on Juliet 1.2:
- Frama-C
- flawfinder
- cppcheck
- scan-build (clang analyzer)
The parameters used to run each tool can be seen in the `run_analyses.sh` script.
We ran the analyzers on a subset of Juliet, removing the test cases that call types or functions specific to Windows systems. A complete list of the files analyzed is generated by the `bootstrap.sh` script.
It is worth mentioning that for Frama-C, we also had to ignore the C++ test cases, since this tool can only analyze C programs.
After generating the reports, we need to pre-process them before we are able to use them to train our model. We need to:
- Label warnings as true/false positives
- Remove, for each test case, warnings not related to the CWE being tested (this is needed because accidental flaws may exist)
- Collect potential features for the training set
To aid this task, we first convert all the reports to a common report syntax (firehose).
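A rough sketch of reading such a report with the Python standard library, assuming the usual firehose XML layout (an `<analysis>` root whose `<results>` contain `<issue>` elements, each with a `<message>` and a `<location>` holding `<file given-path>`, an optional `<function name>` and a `<point line=...>`); verify this against the reports actually produced before relying on it:

```python
import xml.etree.ElementTree as ET

def iter_warnings(firehose_xml_path):
    """Yield (file, line, function, message) tuples from a firehose report.

    Assumes the firehose XML layout described above; adjust if the real
    reports differ.
    """
    root = ET.parse(firehose_xml_path).getroot()
    for issue in root.iter("issue"):
        message = issue.findtext("message", default="")
        location = issue.find("location")
        if location is None:
            continue
        path = location.find("file").get("given-path")
        func_el = location.find("function")
        function = func_el.get("name") if func_el is not None else None
        point = location.find("point")
        line = int(point.get("line")) if point is not None else None
        yield path, line, function, message
```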
We want to use the regular expressions provided by the Juliet documentation to match the bad functions. We also need to map the warning messages from each tool to the CWE in question (a sketch of such a mapping follows the list below). After this:
- Warnings related to the CWE, in bad functions, will be considered true positives
- Warnings related to the CWE, in good functions, will be considered false positives
- All the other warnings will be ignored and will not be used in our training set
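A minimal sketch of what the per-tool message-to-CWE mapping could look like; the patterns and CWE assignments below are made-up placeholders for illustration, not the mapping actually used (that mapping still has to be generated, see the TODO below):

```python
import re

# Placeholder patterns only -- the real mapping is built by manually
# inspecting every distinct message string per tool.
MESSAGE_TO_CWE = {
    "cppcheck": {
        re.compile(r"buffer .* overrun|out of bounds", re.I): "CWE-121",
        re.compile(r"division by zero", re.I): "CWE-369",
    },
    "clang-analyzer": {
        re.compile(r"null pointer dereference", re.I): "CWE-476",
    },
}

def cwe_for_message(tool, message):
    """Return the CWE a warning message maps to, or None if unmapped."""
    for pattern, cwe in MESSAGE_TO_CWE.get(tool, {}).items():
        if pattern.search(message):
            return cwe
    return None
```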
TODO: GENERATE TABLE WITH MAPPINGS WARNING-MESSAGE=>CWE FOR EACH TOOL
TODO: GENERATE TABLES OR CHARTS WITH DATA CONTAINING NUMBERS OF WARNINGS GENERATED BY EACH TOOL FOR EACH TEST CASE. HOW MANY WERE USED? HOW MANY WERE IGNORED? (FOR EACH CASE)
| cwe | tool | total_warnings | warnings_inside_good_or_bad_functions | warnings_inside_AND_related_to_cwe |
|-----|------|----------------|---------------------------------------|------------------------------------|
| total |  |  |  |  |
To label a warning, we first need to find which function it belongs to; the `get_functions_info.sh` script outputs a file with all function and class locations.
As per the Juliet 1.2 documentation:
- Warnings in a function with the word "bad" in its name are TRUE POSITIVES
- Warnings in a class with the word "bad" in the FILE NAME are TRUE POSITIVES
- Warnings in a function with the word "good" in its name are FALSE POSITIVES
- Warnings in a class with the word "good" in the FILE NAME are FALSE POSITIVES
The warnings must match the CWE flaw category to fit either of the above classifications. That is, if a warning is triggered in a function with the word bad in its name for a division-by-zero test case, but the warning message says a null pointer dereference was found, the warning must be ignored and not included in our training set. This matching was done manually, by checking every distinct message string of every triggered warning against every distinct CWE.

It is important that the warnings match the CWE precisely, so that our training set is labeled correctly (less is more). This is VERY important, since there will often exist a similar case with a false positive in the test case, which we will label as a false positive. If we accepted message strings for flaws merely related to a test case's CWE, and such a related flaw showed up in the fix for that CWE, we would assign wrong labels to some warnings; hence, it is better not to include those warnings at all. When in doubt, we did not consider a warning category for the test cases.
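Putting the rules above together, a hedged sketch of the labeling decision; `warning_cwe` is assumed to be the CWE the warning message was mapped to, and for the C++ class cases the good/bad marker comes from the file name, as noted below:

```python
def label_warning(cwe_under_test, warning_cwe, scope_name, file_name,
                  is_cpp_class_scope=False):
    """Apply the Juliet labeling rules to a single warning.

    Returns True (true positive), False (false positive) or None (ignore).
    `scope_name` is the enclosing function name; for the C++ cases where the
    marker lives in the file name, `is_cpp_class_scope` selects `file_name`.
    """
    if warning_cwe != cwe_under_test:
        return None                       # unrelated warning: do not use it
    marker_source = file_name if is_cpp_class_scope else scope_name
    if "bad" in marker_source.lower():
        return True
    if "good" in marker_source.lower():
        return False
    return None                           # neither good nor bad: ignore

print(label_warning("CWE-369", "CWE-369",
                    "CWE369_Divide_by_Zero__int_zero_divide_01_bad", "x.c"))  # True
```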
The file used as a base for the manual inspections in this repository is `raw_cwe_versus_warning_msg.txt`. Note that although the `firehose_report_parser.py` file does output this raw list, the one in the repository has already been sorted and had repeated entries removed. One can do that with `cat raw_cwe_versus_warning_msg.txt | sort -u`.
TODO: SORT THE raw file automatically (in the make file?)
Note that for the C++ cases, we can just check for the bad|good string in the file names (not just names ending in good.cpp or bad.cpp, since there may be suffixes or prefixes, like goodG2B.cpp)
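A tiny sketch of that file-name check in Python (it simply looks for the good/bad substring, as the note says, so suffixed variants like goodG2B.cpp are covered):

```python
import re

GOOD_BAD_RE = re.compile(r'(good|bad)')

def cpp_file_marker(file_name):
    """Return 'good' or 'bad' if either string occurs in the file name."""
    match = GOOD_BAD_RE.search(file_name)
    return match.group(1) if match else None

print(cpp_file_marker("CWE690_NULL_Deref_From_Return__int64_t_realloc_83_goodG2B.cpp"))  # good
print(cpp_file_marker("CWE690_NULL_Deref_From_Return__int64_t_realloc_83_bad.cpp"))      # bad
```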
With this data, we should start generating a CSV file with information about the tool, file, line, and label.
For each of the desired features, a new entry (column) is added to the CSV file generated in the step above.
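A minimal sketch of that CSV generation step using Python's standard library; the output file name training_set.csv is an assumption, and additional feature columns can be appended to FIELDS as they are implemented:

```python
import csv

FIELDS = ["tool", "file", "line", "label"]  # feature columns get appended here

def write_training_csv(rows, out_path="training_set.csv"):
    """Write labeled warnings to a CSV file, one row per warning."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)

# Illustrative usage with a single labeled warning.
write_training_csv([
    {"tool": "cppcheck", "file": "foo.c", "line": 47, "label": True},
])
```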