Add prime path coverage to gcc/gcov
This patch adds prime path coverage to gcc/gcov. First, a quick
introduction to path coverage, before I explain a bit on the pieces of
the patch.
PRIME PATHS
Path coverage is recording the paths taken through the program. Here is
a simple example:
if (cond1) BB 1
then1 () BB 2
else
else1 () BB 3
if (cond2) BB 4
then2 () BB 5
else
else2 () BB 6
_ BB 7
To cover all paths you must run {then1 then2}, {then1 else2}, {else1
then1}, {else1 else2}. This is in contrast with line/statement coverage
where it is sufficient to execute then2, and it does not matter if it
was reached through then1 or else1.
1 2 4 5 7
1 2 4 6 7
1 3 4 5 7
1 3 4 6 7
This gets more complicated with loops, because 0, 1, 2, ..., N
iterations are all different paths. There are different ways of
addressing this, a promising one being prime paths. A prime path is a
maximal simple path (a path with no repeated vertices) or simple cycle
(no repeated vertices except for the first/last) Prime paths strike a
decent balance between number of tests, path growth, and loop coverage,
requiring loops to be both taken and skipped. Of course, the number of
paths still grows very fast with program complexity - for example, this
program has 14 prime paths:
while (a)
{
if (b)
return;
while (c--)
a++;
}
--
ALGORITHM
Since the numbers of paths grows so fast, we need a good algorithm. The
naive approach of generating all paths and discarding redundancies (see
reference_prime_paths in the diff) simply doesn't complete for even
pretty simple functions with a few ten thousand paths (granted, the
implementation is also poor, but only serves as a reference). Fazli &
Afsharchi in their paper "Time and Space-Efficient Compositional Method
for Prime and Test Paths Generation" describe a neat algorithm which
drastically improves on for most programs, and brings complexity down to
something managable. This patch implements that algorithm with a few
minor tweaks.
The algorithm first finds the strongly connected components (SCC) of the
graph and creates a new graph where the vertices are the SCCs of the
CFG. Within these vertices different paths are found - regular prime
paths, paths that start in the SCCs entries, and paths that end in the
SCCs exits. These per-SCC paths are combined with paths through the CFG
which greatly reduces of paths needed to be evaluated just to be thrown
away.
Using this algorithm we can find the prime paths for somewhat
complicated functions in a reasonable time. Please note that some
programs don't benefit from this at all. We need to find the prime paths
within a SCC, so if a single SCC is very large the function degenerates
to the naive implementation. This can probably be much improved on, but
is an exercise for later.
--
OVERALL ARCHITECTURE
Like the other coverages in gcc, this operates on the CFG in the
profiling phase, just after branch and condition coverage, in phases:
1. All prime paths are generated, counted, and enumerated from the CFG
2. The paths are evaluted and counter instructions and accumulators are
emitted
3. gcov reads the CFG and computes the prime paths (same as step 1)
4. gcov prints a report
Simply writing out all the paths in the .gcno file is not really viable,
the files would be too big. Additionally, there are limits to the
practicality of measuring (and reporting) on millions of paths, so for
most programs where coverage is feasible, computing paths should be
plenty fast. As a result, path coverage really only adds 1 bit to the
counter, rounded up to nearest 64 ("bucket"), so 64 paths takes up 8
bytes, 65 paths take up 16 bytes.
Recording paths is really just massaging large bitsets. Per function,
ceil(paths/64 or 32) buckets (gcov_type) are allocated. Paths are
sorted, so the first path maps to the lowest bit, the second path to the
second lowest bit, and so on. On taking an edge and entering a basic
block, a few bitmasks are applied to unset the bits corresponding to the
paths outside the block and set the bits of the paths that start in that
block. Finally, the right buckets are masked and written to the global
accumulators for the paths that end in the block. Full coverage is
achieved when all bits are set.
gcc does not really inform gcov of abnormal paths, so paths with
abnormal edges are ignored. This probably possible, but requires some
changes to the graph gcc writes to the .gcno file.
--
IMPLEMENTATION
In order to remove non-prime paths (subpaths) we use a suffix tree.
Fazli & Afsharchi do not discuss how duplicates or subpaths are removed,
and using the suffix works really well -- insertion time is a function
of the length of the (longest) paths, not the number of paths. The paths
are usually quite short, but there are many of them. The same
prime_paths function is used both in gcc and in gcov.
As for speed, I would say that it is acceptable. Path coverage is a
problem that is exponential in its very nature, so if you enable this
feature you can reasonably expect it to take a while. To combat the
effects of path explosion there is a limit at which point gcc will give
up and refuse to instrument a function, set with -fpath-coverage-limit.
Since estimating the number of prime paths is pretty much is counting
them, gcc maintains a pessimistic running count which slightly
overestimates the number of paths found so far. When that count exceeds
the limit, the function aborts and gcc prints a warning. This is a
blunt instrument meant to not get stuck on the occasional large
function, not fine-grained control over when to skip instrumentation.
My main benchmark has been tree-2.1.3/tree.c which generates approx 2M
paths across the 20 functions or so in it. Most functions have less than
1500 paths, and 2 around a million each. Finding the paths takes
3.5-4s, but the instrumentation phase takes approx. 2.5 minutes and
generates a 32M binary. Not bad for a 1429 line source file.
There are some selftests which deconstruct the algorithm, so it can be
easily referenced with the Fazli & Afsharchi. I hope that including them
both help to catch regression, clarify the assumptions, and help
understanding the algorithm by breaking up the phases.
DEMO
This is the denser line-aware (grep-friendlier) output. Every missing
path is summarized as the lines you need to run in what order, annotated
with the true/false/throw decision.
$ gcc -fpath-coverage --coverage bs.c -c -o bs
$ gcov -e --prime-paths-lines bs.o
bs.gcda:cannot open data file, assuming not executed
-: 0:Source:bs.c
-: 0:Graph:bs.gcno
-: 0:Data:-
-: 0:Runs:0
paths covered 0 of 17
path 0 not covered: lines 6 6(true) 11(true) 12
path 1 not covered: lines 6 6(true) 11(false) 13(true) 14
path 2 not covered: lines 6 6(true) 11(false) 13(false) 16
path 3 not covered: lines 6 6(false) 18
path 4 not covered: lines 11(true) 12 6(true) 11
path 5 not covered: lines 11(true) 12 6(false) 18
path 6 not covered: lines 11(false) 13(true) 14 6(true) 11
path 7 not covered: lines 11(false) 13(true) 14 6(false) 18
path 8 not covered: lines 12 6(true) 11(true) 12
path 9 not covered: lines 12 6(true) 11(false) 13(true) 14
path 10 not covered: lines 12 6(true) 11(false) 13(false) 16
path 11 not covered: lines 13(true) 14 6(true) 11(true) 12
path 12 not covered: lines 13(true) 14 6(true) 11(false) 13
path 13 not covered: lines 14 6(true) 11(false) 13(true) 14
path 14 not covered: lines 14 6(true) 11(false) 13(false) 16
path 15 not covered: lines 6(true) 11(true) 12 6
path 16 not covered: lines 6(true) 11(false) 13(true) 14 6
#####: 1:int binary_search(int a[], int len, int from, int to, int key)
-: 2:{
#####: 3: int low = from;
#####: 4: int high = to - 1;
-: 5:
#####: 6: while (low <= high)
-: 7: {
#####: 8: int mid = (low + high) >> 1;
#####: 9: long midVal = a[mid];
-: 10:
#####: 11: if (midVal < key)
#####: 12: low = mid + 1;
#####: 13: else if (midVal > key)
#####: 14: high = mid - 1;
-: 15: else
#####: 16: return mid; // key found
-: 17: }
#####: 18: return -1;
-: 19:}
Then there's the human-oriented source mode. Because it is so verbose I
have limited the demo to 2 paths. In this mode gcov will print the
sequence of *lines* through the program and in what order to cover the
path, including what basic block the line is a part of. Like its denser
sibling, this also prints the true/false/throw decision, if there is
one.
$ gcov -t --prime-paths-source bs.o
bs.gcda:cannot open data file, assuming not executed
-: 0:Source:bs.c
-: 0:Graph:bs.gcno
-: 0:Data:-
-: 0:Runs:0
paths covered 0 of 17
path 0:
BB 2: 1:int binary_search(int a[], int len, int from, int to, int key)
BB 2: 3: int low = from;
BB 2: 4: int high = to - 1;
BB 2: 6: while (low <= high)
BB 8: (true) 6: while (low <= high)
BB 3: 8: int mid = (low + high) >> 1;
BB 3: 9: long midVal = a[mid];
BB 3: (true) 11: if (midVal < key)
BB 4: 12: low = mid + 1;
path 1:
BB 2: 1:int binary_search(int a[], int len, int from, int to, int key)
BB 2: 3: int low = from;
BB 2: 4: int high = to - 1;
BB 2: 6: while (low <= high)
BB 8: (true) 6: while (low <= high)
BB 3: 8: int mid = (low + high) >> 1;
BB 3: 9: long midVal = a[mid];
BB 3: (false) 11: if (midVal < key)
BB 5: (true) 13: else if (midVal > key)
BB 6: 14: high = mid - 1;
The listing is also aware of inlining:
hello.c:
#include <stdio.h>
#include "hello.h"
int notmain(const char *entity)
{
return hello (entity);
}
#include <stdio.h>
inline __attribute__((always_inline))
int hello (const char *s)
{
if (s)
printf ("hello, %s!\n", s);
else
printf ("hello, world!\n");
return 0;
}
$ gcov -t --prime-paths-source hello
paths covered 0 of 2
path 0:
BB 2: (true) 4:int notmain(const char *entity)
== inlined from hello.h ==
BB 2: (true) 6: if (s)
BB 3: 7: printf ("hello, %s!\n", s);
BB 5: 10: return 0;
-------------------------
BB 7: 6: return hello (entity);
BB 8: 6: return hello (entity);
path 1:
BB 2: (false) 4:int notmain(const char *entity)
== inlined from hello.h ==
BB 2: (false) 6: if (s)
BB 4: 9: printf ("hello, world!\n");
BB 5: 10: return 0;
-------------------------
BB 7: 6: return hello (entity);
BB 8: 6: return hello (entity);
--prime-paths-{lines,source} take an optional argument type, which can
be 'covered', 'uncovered', or 'both', which defaults to 'uncovered'.
The flag controls if the covered or uncovered paths are printed, and
while uncovered is generally the most useful one, it is sometimes nice
to be able to see only the covered paths.
And finally, JSON (abbreviated). It is quite sparse and very nested, but
is mostly a JSON version of the source listing. It has to be this nested
in order to consistently capture multiple locations. It is always
includes the file name per location for consistency, even though this is
very much redundant in almost all cases. This format is in no way set in
stone, and without targeting it with other tooling I am not sure if it
does the job well.
"gcc_version": "15.0.0
20240704 (experimental)",
"current_working_directory": "dir",
"data_file": "hello.o",
"files": [
{
"file": "hello.c",
"functions": [
{
"name": "notmain",
"demangled_name": "notmain",
"start_line": 4,
"start_column": 5,
"end_line": 7,
"end_column": 1,
"blocks": 7,
"blocks_executed": 0,
"execution_count": 0,
"total_prime_paths": 2,
"covered_prime_paths": 0,
"prime_path_coverage": [
{
"id": 0,
"sequence": [
{
"block_id": 2,
"locations": [
{
"file": "hello.c",
"line_numbers": [
4
]
},
{
"file": "hello.h",
"line_numbers": [
6
]
}
],
"edge_kind": "fallthru"
},
...
gcc/ChangeLog:
* Makefile.in (OBJS): Add prime-paths.o, path-coverage.o.
(GTFILES): Add prime-paths.cc, path-coverage.cc
(GCOV_OBJS): Add graphds.o, prime-paths.o, bitmap.o
* builtins.cc (expand_builtin_fork_or_exec): Check
path_coverage_flag.
* collect2.cc (main): Add -fno-path-coverage to OBSTACK.
* common.opt: Add new options -fpath-coverage,
-fpath-coverage-limit, -Wcoverage-too-many-paths
* doc/gcov.texi: Add --prime-paths, --prime-paths-lines,
--prime-paths-source documentation.
* doc/invoke.texi: Add -fpath-coverage, -fpath-coverage-limit,
-Wcoverage-too-many-paths documentation.
* gcc.cc: Link gcov on -fpath-coverage.
* gcov-counter.def (GCOV_COUNTER_PATHS): New.
* gcov-io.h (GCOV_TAG_PATHS): New.
(GCOV_TAG_PATHS_LENGTH): New.
(GCOV_TAG_PATHS_NUM): New.
* gcov.cc (class path_info): New.
(struct coverage_info): Add paths, paths_covered.
(find_prime_paths): New.
(add_path_counts): New.
(find_arc): New.
(print_usage): Add -e, --prime-paths, --prime-paths-lines,
--prime-paths-source.
(process_args): Likewise.
(json_set_prime_path_coverage): New.
(output_json_intermediate_file): Call
json_set_prime_path_coverage.
(process_all_functions): Call find_prime_paths.
(generate_results): Call add_path_counts.
(read_graph_file): Read path counters.
(read_count_file): Likewise.
(function_summary): Print path counts.
(file_summary): Likewise.
(print_source_line): New.
(print_prime_path_lines): New.
(print_inlined_separator): New.
(print_prime_path_source): New.
(output_path_coverage): New.
(output_lines): Print path coverage.
* ipa-inline.cc (can_early_inline_edge_p): Check
path_coverage_flag.
* passes.cc (finish_optimization_passes): Likewise.
* profile.cc (branch_prob): Likewise.
* selftest-run-tests.cc (selftest::run_tests): Run path coverage
tests.
* selftest.h (path_coverage_cc_tests): New declaration.
* tree-profile.cc (tree_profiling): Check path_coverage_flag.
(pass_ipa_tree_profile::gate): Likewise.
* path-coverage.cc: New file.
* prime-paths.cc: New file.
gcc/testsuite/ChangeLog:
* lib/gcov.exp: Add prime paths test function.
* g++.dg/gcov/gcov-22.C: New test.
* g++.dg/gcov/gcov-23-1.h: New test.
* g++.dg/gcov/gcov-23-2.h: New test.
* g++.dg/gcov/gcov-23.C: New test.
* gcc.misc-tests/gcov-29.c: New test.
* gcc.misc-tests/gcov-30.c: New test.
* gcc.misc-tests/gcov-31.c: New test.
* gcc.misc-tests/gcov-32.c: New test.
* gcc.misc-tests/gcov-33.c: New test.
* gcc.misc-tests/gcov-34.c: New test.