So, peak dinfo memory decreases by 21Mb, and final by 36Mb.
The memory decrease is obtained by:
* using a dedup pool to store the machine dependent part (cfsi_m)
of the cfsi information as this information is highly duplicated.
For x86 and arm64, the duplication factor of cfsi machine dependent
part is very high (up to a factor 60).
For arm64, it is more like a factor 3.
A 'variable size' (1, 2 or 4 bytes) is automatically used to identify
the cfsi_m, if there is less than or more than 255/64K different cfsi_m.
* not storing explicitely the length of a range for which a cfsi_m
is to be used: in a large majority of the cases, ranges are
consecutive, and so the end of a range is just one byte before
the start of the next range.
So, we do not store the length of the ranges.
If there is a hole between 2 ranges, the hole is stored explicitely
as a range in which we have no cfsi_m information.
On x86 and amd64, we have quite some holes (something like one hole
every 7 cfsi). On arm64, we have very few holes (less than one hole
every 50 cfsi).
Even with the nr of holes on x86/amd64, it is more memory efficient
to store the holes rather than to store the length of each cfsi.
* Merging consecutive ranges that have the same cfsi_m info:
Many cfsi are "mergeable": there is no hole between 2 cfsi, and their
machine dependent part is identical
(I guess the unwind info needed by valgrind is subset of the full
unwind info, and so, the cfsi entries are not merged by the compiler,
but can be merged for simple unwind). Depending on the platform
(x86, amd64, arm64) and of the library/object file, we can have a
significant nr of mergeable entries.
The patch is not very small, but a lot is mechanical changes.
The patch has been compiled and tested on x86/amd64/ppc32/ppc64
(but ppc does not use cfsi so that just verifies it compiles).
It has been compiled on arm64, and "tested" by launching valgrind on
one executable.
It has not been compiled on s390 and mips.
With some luck, maybe it will compile on these platforms.
And if that uses the whole provision of luck for 2014, it might even work
on these platforms :).
If it does not compile, the fix should be straightforward.
Runtime problems might be more tricky (but arm64 "worked out of the box"
once x86/amd64 were ok).
This has also be tested in an outer/inner setup, to verify no memory leak/bugs.
Shadow registers wronly shown by gdb on avx machine
For an unclear reason, the orig_rax register and its shadows are described in the
xml file using a register number.
This register number is correct on non avx machine, but is wrong on
avx machine, as these have more registers, which means that orig_raxs1 and s2
should have different numbers.
As no reason was found to have a register number explicitely give, remove this
regnnr from the xml file, and let gdb calculate it.
Fix a bug in the "numbering" dedup pool: as indicated in
pub_tool_deduppoolalloc.h, for "numbering" pool, there is no guarantee
that the address of an element is stable if a new element is inserted.
But m_deduppoolalloc.c was itself not taking this 'no guarantee' into account.
So, when the addresses of the elements are changed due to reallocation
of the only pool, apply an offset to the element addresses stored in
the dedup hash table.
Implement VG_(arena_realloc_shrink) similar to realloc, but can
only decrease the size of a block, does not change the address,
does not need to alloc another block and copy the memory,
and (if big enough) makes the excess memory available for other
allocations.
VG_(arena_realloc_shrink) is then used for debuginfo storage.c
(replacing an allocation + copy).
Also use it in the dedup pool, to recuperate the unused
memory of the last pool.
This also allows to re-increase the string pool size to the original
3.9.0 value of 64Kb. All this slightly decrease the peak and in use
memory of dinfo.
VG_(arena_realloc_shrink) will also be used to implement (in another patch)
a dedup pool which "numbers" the allocated elements.
Activate --read-inline-info=yes for the outer/inner setup regtest run
as this makes the inner stacktraces easier to understand
and also it exercises the inline unwinding somewhat already,
waiting for a (possible) activation by default of --read-inline-info
2 execontexts in an hash table chain are not necessarily the same size.
So, ensure that when size differs, we do not start to compare them,
as this could otherwise cause a read buffer overrun
Fix a regression in supp matching with obj: entries
Suppression matching logic was changed to understand inlined function calls.
A regression was introduced while doing this. This regression could
cause false positive supp matches or false negative supp matches, when
obj: lines are used.
This patch fixes the regression, and adds 2 tests (one that was failing
with false positive, one that was failing with false negative).
The fix is relatively small (3 places where there was an "off or excess by one").
However, a lot more tracing was added in the supp matching logic, as this
logic is quite complex (for performance reasons mostly).
We might need more tests to properly cover supp matching logic.
So, giving -d -d -d -d produces a trace showing how a stacktrace was expanded
by the input completer and which suppression (if any) it matched.
Below is an example of trace. It shows a begin/end marker. The end marker
indicates if a supp matched. Then it shows the stack trace, and the state
of the lazy "input completer" used for the matching.
In the below, the trace shows that there are 3 IPs in the stacktrace
(n_ips 3) : Two are not shown (below main), and one IP corresponds
to main calling 4 inlined functions (so we have only one IP for 5 entries
in the stacktrace).
The state of the input completer shows that 2 IPs were expanded, resulting
in 6 expanded fun: or obj: lines.
The offset shows that ips0 corresponds to the entries [0,4] in ip2fo->funoffset
or ip2fo->objoffset.
This tracing should make it more clear what was used to match a stacktrace
with the suppression entries.
--10314-- errormgr matching begin
--10314-- errormgr matching end suppression main_a_b_c_d ./memcheck/tests/inlinfosupp.supp:2 matched:
==10314== at 0x8048667: fun_d (inlinfo.c:7)
==10314== by 0x8048667: fun_c (inlinfo.c:15)
==10314== by 0x8048667: fun_b (inlinfo.c:21)
==10314== by 0x8048667: fun_a (inlinfo.c:27)
==10314== by 0x8048667: main (inlinfo.c:66)
n_ips 3 n_ips_expanded 2 resulting in n_expanded 6
ips 0 0x088048667 offset [0,4] fun:fun_d obj:/home/philippe/valgrind/objcompl/memcheck/tests/inlinfo
fun:fun_c obj:/home/philippe/valgrind/objcompl/memcheck/tests/inlinfo
fun:fun_b obj:/home/philippe/valgrind/objcompl/memcheck/tests/inlinfo
fun:fun_a obj:/home/philippe/valgrind/objcompl/memcheck/tests/inlinfo
fun:main obj:/home/philippe/valgrind/objcompl/memcheck/tests/inlinfo
ips 1 0x0822abb5 offset [5,5] fun:(below main) obj:<not expanded>
Complete tracing (including individual pattern matching) can be activated
by recompiling m_errormgr.c after changing
#define DEBUG_ERRORMGR 0
to
#define DEBUG_ERRORMGR 1
This detailed tracing will be shown between the begin/end marker.
Florian Krohm [Tue, 24 Jun 2014 15:33:53 +0000 (15:33 +0000)]
According to Julian initialising the previously uninitialised variable
won't affect the test adversely -- so let's do this and get rid of
the special compilation again. Also guard against future compiler smartness
tricking the compiler into believing the variable is actually used.
So the loop won't get optimised away.
Julian Seward [Mon, 23 Jun 2014 16:02:04 +0000 (16:02 +0000)]
* Move the new 10.9 syscalls into their own section.
* Add a POST_MEM_WRITE for kernelrpc_mach_vm_map_trap
* fix a compiler complaint caused by lack of a cast
Florian Krohm [Sat, 21 Jun 2014 20:25:30 +0000 (20:25 +0000)]
ms_print ought to create temporary files in a proper directory as
specified with --with-tmpdir at configuration time or with TMPDIR
at runtime. Doing so fixes the symptom reported in BZ #332765.
Also fix an incorrect error message.
Find the name of the inlined function through a DW_AT_specification
The name is not necessarily found in the abstract origin, it can be
in a referred to specification.
If both a name and a DW_AT_specification is found in the abstract origin,
the name will have priority over the name of the specification.
(unclear if that can happen)
This optimisation divides by 2.5 the time (user+sys) needed to read
the inlined info of a big executable.
On a slow pentium, reading the inline info now takes 5.5 seconds.
The optimisation consists in having per dw3 abbreviation a structure
allowing to skip efficiently the non interesting DIEs (i.e. the DIEs
the parse_inl_DIE is not interested in).
Mostly, the idea is to avoid calling the image abstraction, and replace
this by just advancing the cursor (i.e. addition rather than a bunch
of function calls to read the data).
Julian Seward [Fri, 20 Jun 2014 13:38:04 +0000 (13:38 +0000)]
Mac OS X 10.9 improvements. Bug 326724 comment 27 patch name
"0005-darwin-try-to-improve-support-for-mach_msg-on-extern.patch"
(Frederic Germain, frederic.germain@gmail.com)
Julian Seward [Fri, 20 Jun 2014 13:29:31 +0000 (13:29 +0000)]
Mac OS X 10.9 improvements. Bug 326724 comment 27 patch name
"0004-wqthread_hijack-fix-magic_delta-on-darwin-10.9.patch"
(Frederic Germain, frederic.germain@gmail.com)
Julian Seward [Fri, 20 Jun 2014 13:22:57 +0000 (13:22 +0000)]
Mac OS X 10.9 improvements. Bug 326724 comment 27 patch name
"0003-darwin-remove-warnings-in-logs-related-to-Char-HChar.patch"
(Frederic Germain, frederic.germain@gmail.com)
Julian Seward [Fri, 20 Jun 2014 13:13:57 +0000 (13:13 +0000)]
Mac OS X 10.9 improvements. Bug 326724 comment 27 patch name
"0002-thread_state_from_vex-adding-support-for-x86_THREAD_.patch"
(Frederic Germain, frederic.germain@gmail.com)
Julian Seward [Fri, 20 Jun 2014 12:35:00 +0000 (12:35 +0000)]
Mac OS X 10.9 improvements. Bug 326724 comment 27 patch name
"0001-adding-support-for-loads-of-new-syscall-in-darwin-10.patch"
(Frederic Germain, frederic.germain@gmail.com)
Julian Seward [Tue, 17 Jun 2014 20:37:08 +0000 (20:37 +0000)]
When printing "REDIR:" lines at -v, print not only the name of the
function being redirected but also the soname of the object that it is
in. This makes it a bit easier to diagnose redirection problems.
restructure dwarf3 DIE tracing
* add a trace_DIE function
* use it to trace a bad DIE
and to trace all DIEs that are (maybe) read
(due to the "avoid read twice" optimisation, the tracing was not
so easy to read anymore => add an explicit trace_DIE call at the beginning
of read_DIE)
optimisation : avoid double reading of a DIE when the DIE will be parsed
by a DIE parser
Instead of pre-reading the DIE, first let the parser(s) possibly
parse the DIE. Read (to skip) the DIE data if no parser has parsed it.
OTherwise, just jump to the end of the DIE as established by the parser
that has read the DIE.
This slightly improves the reading of inlined info.
Add helgrind intercepts to have helgrind understanding Ada tasks terination rules
A recent gnatpro version is needed for this to work.
Thanks to these intercepts, some false positive errors are avoided,
and helgrind properly recuperates some internal memory associated
to the terminated task.
Add a comment to document a possible optimisation (avoid double reading
of DIEs when one or more parsers will read them also)
+ add the name of the parser in the barf output.
This patch implements the support needed for stacktraces
showing inlined function calls.
See 278972 valgrind stacktraces and suppression do not handle inlined function call debuginfo
Reading the inlined dwarf call info is activated using the new clo
--read-inline-info=yes
Default is currently no but an objective is to optimise the performance
and memory in order to possibly set it on by default.
(see below discussion about performances).
Basically, the patch provides the following pieces:
1. Implement a new dwarf3 reader that reads the inlined call info
2. Some performance improvements done for this new parser, and
on some common code between the new parser and the var info parser.
3. Use the parsed inlined info to produce stacktrace showing inlined calls
4. Use the parsed inlined info in the suppression matching and suppression generation
5. and of course, some reg tests
1. new dwarf3 reader:
---------------------
Two options were possible: add the reading of the inlined info
in the current var info dwarf reader, or add a 2nd reader.
The 2nd approach was preferred, for the following reasons:
The var info reader is slow, memory hungry and quite complex.
Having a separate parsing phase for the inlined information
is simpler/faster when just reading the inlined info.
Possibly, a single parser would be faster when using both
--read-var-info=yes and --read-inline-info=yes.
However, var-info being extremely memory/cpu hungry, it is unlikely
to be used often, and having a separate parsing for inlined info
does in any case make not much difference.
(--read-var-info=yes is also now less interesting thanks to commit
r13991, which provides a fast and low memory "reasonable" location
for an address).
The inlined info parser reads the dwarf info to make calls
to priv_storage.h ML_(addInlInfo).
2. performance optimisations
----------------------------
* the abbrev cache has been improved in revision r14035.
* The new parser skips the non interesting DIEs
(the var-info parser has no logic to skip uninteresting DIEs).
* Some other minor perf optimisation here and there.
In total now, on a big executable, 15 seconds CPU are needed to
create the inlined info (on my slow x86 pentium).
With regards to memory, the dinfo arena:
with inlined info: 172281856/121085952 max/curr mmap'd
without : 157892608/106721280 max/curr mmap'd,
So, basically, inlined information costs about 15Mb of memory for
my big executable (compared to first version of the patch, this is
already using less memory, thanks to the strpool deduppoolalloc.
The needed memory can probably be decreased somewhat more.
3. produce better stack traces
------------------------------
VG_(describe_IP) has a new argument InlIPCursor *iipc which allows
to describe inlined function calls by doing repetitive calls
to describe_IP. See pub_tool_debuginfo.h for a description.
4. suppression generation and matching
--------------------------------------
* suppression generation now also uses an InlIPCursor *iipc
to generate a line for each inlined fn call.
* suppression matching: to allow suppression matching to
match one IP to several function calls in a suppression entry,
the 'inputCompleter' object (that allows to lazily generate
function or object names for a stacktrace when matching
an error with a suppression) has been generalised a little bit
more to also lazily generate the input sequence.
VG_(generic_match) has been updated so as to be more generic
with respect to the input completer : when providing an
input completer, VG_(generic_match) does not need anymore
to produce/compute any input itself : this is all delegated
to the input completer.
5. various regtests
-------------------
to test stack traces with inlined calls, and suppressions
of (some of) these errors using inlined fn calls matching.
Work still to do:
-----------------
* improve parsing performance
* improve the memory overhead.
* handling the directory name for files of the inlined function calls is not yet done.
(probably implies to refactor some code)
* see if m_errormgr.c *offsets arrays cannot be managed via xarray
Improve performance of dwarf3 reader using a hashtable of parsed abbreviations
For each DIE, the dwarf3 reader must know which data elements to read.
These elements are described by an abbreviation.
Re-reading these abbreviations for each DIE is costly as
the location of the needed abbreviation is found by scanning the full
abbv section, which is very costly.
(A small cache of 32 abbv offsets in the abbv section somewhat decreases
the cost, but reading the abbvs is still a hot spot, in particular for
big debug informations).
This patch:
* adds an hash table of parsed abbreviations
* all abbreviations for a CU are read in one single scan of the abbv
section, when the CU header is read
So, with the patch, the di image is not accessed anymore for reading the abbvs
after the CU header parsing.
On a big executable, --read-var-info=yes user cpu changes from
trunk: 320 seconds
to
abbv cache: 270 seconds
This further improves on a previous (not committed) abbv cache that
was just caching up to 513 entries in the abbv pos cache and populating
the cache with an initial scan. The user cpu for this version was 285 seconds.
NB: this is some work in anticipation of a following patch that
will add reading dwarf3 inlined information, with the hope to make
this reading fast enough to activate it by default.
Note: on the examples I looked at, all abbreviations were numbered starting
from 1, with no holes. If that would always be the case, then one could use
an xarray of parsed abbreviations rather than an hash table. However,
I found nothing in the dwarf standard that guarantees that abbreviations
are numbered from 1. So, the hash table.
Do not destroy the strpool if NULL
It is possible that a debug info contains no string (and so strpool
is never allocated).
A protection to avoid accessing strpool was already necessary
in ML_(canonicaliseTables) :
if (di->strpool)
VG_(freezeDedupPA) (di->strpool);
So, if a similar debug info is released, we need the same protection
to avoid accessing a NULL strpool.
Detect by Julian on arm64, but not (at least easily) reproduced on amd64.