Add helgrind intercepts to have helgrind understanding Ada tasks terination rules
A recent gnatpro version is needed for this to work.
Thanks to these intercepts, some false positive errors are avoided,
and helgrind properly recuperates some internal memory associated
to the terminated task.
Add a comment to document a possible optimisation (avoid double reading
of DIEs when one or more parsers will read them also)
+ add the name of the parser in the barf output.
This patch implements the support needed for stacktraces
showing inlined function calls.
See 278972 valgrind stacktraces and suppression do not handle inlined function call debuginfo
Reading the inlined dwarf call info is activated using the new clo
--read-inline-info=yes
Default is currently no but an objective is to optimise the performance
and memory in order to possibly set it on by default.
(see below discussion about performances).
Basically, the patch provides the following pieces:
1. Implement a new dwarf3 reader that reads the inlined call info
2. Some performance improvements done for this new parser, and
on some common code between the new parser and the var info parser.
3. Use the parsed inlined info to produce stacktrace showing inlined calls
4. Use the parsed inlined info in the suppression matching and suppression generation
5. and of course, some reg tests
1. new dwarf3 reader:
---------------------
Two options were possible: add the reading of the inlined info
in the current var info dwarf reader, or add a 2nd reader.
The 2nd approach was preferred, for the following reasons:
The var info reader is slow, memory hungry and quite complex.
Having a separate parsing phase for the inlined information
is simpler/faster when just reading the inlined info.
Possibly, a single parser would be faster when using both
--read-var-info=yes and --read-inline-info=yes.
However, var-info being extremely memory/cpu hungry, it is unlikely
to be used often, and having a separate parsing for inlined info
does in any case make not much difference.
(--read-var-info=yes is also now less interesting thanks to commit
r13991, which provides a fast and low memory "reasonable" location
for an address).
The inlined info parser reads the dwarf info to make calls
to priv_storage.h ML_(addInlInfo).
2. performance optimisations
----------------------------
* the abbrev cache has been improved in revision r14035.
* The new parser skips the non interesting DIEs
(the var-info parser has no logic to skip uninteresting DIEs).
* Some other minor perf optimisation here and there.
In total now, on a big executable, 15 seconds CPU are needed to
create the inlined info (on my slow x86 pentium).
With regards to memory, the dinfo arena:
with inlined info: 172281856/121085952 max/curr mmap'd
without : 157892608/106721280 max/curr mmap'd,
So, basically, inlined information costs about 15Mb of memory for
my big executable (compared to first version of the patch, this is
already using less memory, thanks to the strpool deduppoolalloc.
The needed memory can probably be decreased somewhat more.
3. produce better stack traces
------------------------------
VG_(describe_IP) has a new argument InlIPCursor *iipc which allows
to describe inlined function calls by doing repetitive calls
to describe_IP. See pub_tool_debuginfo.h for a description.
4. suppression generation and matching
--------------------------------------
* suppression generation now also uses an InlIPCursor *iipc
to generate a line for each inlined fn call.
* suppression matching: to allow suppression matching to
match one IP to several function calls in a suppression entry,
the 'inputCompleter' object (that allows to lazily generate
function or object names for a stacktrace when matching
an error with a suppression) has been generalised a little bit
more to also lazily generate the input sequence.
VG_(generic_match) has been updated so as to be more generic
with respect to the input completer : when providing an
input completer, VG_(generic_match) does not need anymore
to produce/compute any input itself : this is all delegated
to the input completer.
5. various regtests
-------------------
to test stack traces with inlined calls, and suppressions
of (some of) these errors using inlined fn calls matching.
Work still to do:
-----------------
* improve parsing performance
* improve the memory overhead.
* handling the directory name for files of the inlined function calls is not yet done.
(probably implies to refactor some code)
* see if m_errormgr.c *offsets arrays cannot be managed via xarray
Improve performance of dwarf3 reader using a hashtable of parsed abbreviations
For each DIE, the dwarf3 reader must know which data elements to read.
These elements are described by an abbreviation.
Re-reading these abbreviations for each DIE is costly as
the location of the needed abbreviation is found by scanning the full
abbv section, which is very costly.
(A small cache of 32 abbv offsets in the abbv section somewhat decreases
the cost, but reading the abbvs is still a hot spot, in particular for
big debug informations).
This patch:
* adds an hash table of parsed abbreviations
* all abbreviations for a CU are read in one single scan of the abbv
section, when the CU header is read
So, with the patch, the di image is not accessed anymore for reading the abbvs
after the CU header parsing.
On a big executable, --read-var-info=yes user cpu changes from
trunk: 320 seconds
to
abbv cache: 270 seconds
This further improves on a previous (not committed) abbv cache that
was just caching up to 513 entries in the abbv pos cache and populating
the cache with an initial scan. The user cpu for this version was 285 seconds.
NB: this is some work in anticipation of a following patch that
will add reading dwarf3 inlined information, with the hope to make
this reading fast enough to activate it by default.
Note: on the examples I looked at, all abbreviations were numbered starting
from 1, with no holes. If that would always be the case, then one could use
an xarray of parsed abbreviations rather than an hash table. However,
I found nothing in the dwarf standard that guarantees that abbreviations
are numbered from 1. So, the hash table.
Do not destroy the strpool if NULL
It is possible that a debug info contains no string (and so strpool
is never allocated).
A protection to avoid accessing strpool was already necessary
in ML_(canonicaliseTables) :
if (di->strpool)
VG_(freezeDedupPA) (di->strpool);
So, if a similar debug info is released, we need the same protection
to avoid accessing a NULL strpool.
Detect by Julian on arm64, but not (at least easily) reproduced on amd64.
This patch adds a 'de-duplicating memory pool allocator':
include/pub_tool_deduppoolalloc.h
coregrind/pub_core_deduppoolalloc.h
coregrind/m_deduppoolalloc.c
and uses it (currently only) for the strings in m_debuginfo/storage.c
The idea is that such ddup pool allocator will also be used for other
highly duplicated information (e.g. the DiCFSI information), where
significant gains can also be achieved.
The dedup pool for strings also decreases significantly the memory
needed by the read inline information (patch still to be committed,
see bug 278972).
(so 3Mb less mmap-ed once debug info is read, 1Mb less mmap-ed in peak,
6Mb less allocated once debug info is read).
This is all gained due to the string which changes from:
trunk: 17,434,704 in 266: di.storage.addStr.1
to
ddup: 10,966,608 in 750: di.storage.addStr.1
(6.5Mb less memory used by strings)
The gain in mmap-ed memory is smaller due to fragmentation.
Probably one could decrease the fragmentation by using bigger
size for the dedup pool, but then we would lose memory on the last
allocated pool (and for small libraries, we often do not use much
of a big pool block).
Solution might be to increase the pool size but have a "shrink_block"
operation. To be looked at in the future.
In terms of performance, startup of a big executable (on an old pentium)
is not influenced significantly (something like 0.1 seconds on 15 seconds
startup for a big executable, on a slow pentium).
The dedup pool uses a hash table. The hash function used currently
is the VG_(adler32) check sum. It is reported (and visible also here)
that this checksum is not a very good hash function (many collisions).
To have statistics about collisions, use --stats -v -v -v
As an example of the collisions, on the strings in debug info of memcheck tool on x86,
one obtain:
--4789-- dedupPA:di.storage.addStr.1 9983 allocs (8174 uniq) 11 pools (4820 bytes free in last pool)
--4789-- nr occurences of chains of len N, N-plicated keys, N-plicated elts
--4789-- N: 0 : nr chain 6975, nr keys 0, nr elts 0
--4789-- N: 1 : nr chain 3670, nr keys 6410, nr elts 8174
--4789-- N: 2 : nr chain 1070, nr keys 226, nr elts 0
--4789-- N: 3 : nr chain 304, nr keys 100, nr elts 0
--4789-- N: 4 : nr chain 104, nr keys 84, nr elts 0
--4789-- N: 5 : nr chain 72, nr keys 42, nr elts 0
--4789-- N: 6 : nr chain 44, nr keys 34, nr elts 0
--4789-- N: 7 : nr chain 18, nr keys 13, nr elts 0
--4789-- N: 8 : nr chain 17, nr keys 8, nr elts 0
--4789-- N: 9 : nr chain 4, nr keys 6, nr elts 0
--4789-- N:10 : nr chain 9, nr keys 4, nr elts 0
--4789-- N:11 : nr chain 1, nr keys 0, nr elts 0
--4789-- N:13 : nr chain 1, nr keys 1, nr elts 0
--4789-- total nr of unique chains: 12289, keys 6928, elts 8174
which shows that on 8174 different strings, we have only 6410 strings which have
a unique hash value. As other examples, N:13 line shows we have 13 strings
mapping to the same key. N:14 line shows we have 4 groups of 10 strings mapping to the
same key, etc.
So, adler32 is definitely a bad hash function.
Trials have been done with another hash function, giving a much lower
collision rate. So, a better (but still fast) hash function would probably
be beneficial. To be looked at ...
Improve address description for address in the stack.
--read-var-info=yes is very memory and cpu intensive.
This patch ensures that even witout --read-var-info=yes that
the frame where the address point is reported in the address
description.
Julian Seward [Wed, 21 May 2014 14:43:11 +0000 (14:43 +0000)]
Add test cases for PCMPxSTRx cases 0x0E, 0x34, 0x14, and reformat some
of the associated switch statements. Pertains to #326469, #327639,
#328878 respectively.
on ppc64, pthread_create_WRK is not (always) produced in the stacktrace
showing where a thread was created.
This makes many tests fail => use sed to delete pthread_create_WRK
from the stacktrace to let tests succeed on ppc64.
With this change, on ppc64 gcc110 (fedora 18), helgrind failures
goes from 28 tests failing to 4, with following reasons:
helgrind/tests/pth_cond_destroy_busy (stderr)
(6 errors instead of 3 in the summary line ???)
helgrind/tests/tc06_two_races_xml (stderr)
similar change needed in filter_xml to del pthread_create_WRK
helgrind/tests/tc18_semabuse (stderr)
- with error code 22 (EINVAL: Invalid argument)
+ with error code 38 (ENOSYS: Function not implemented)
helgrind/tests/tc20_verifywrap (stderr)
- with error code 22 (EINVAL: Invalid argument)
+ with error code 38 (ENOSYS: Function not implemented)
More details about the stacktrace not containing pthread_create_WRK:
--------------------------------------------------------------------
Here is the stacktrace obtained by GDB+vgdb:
(gdb) bt
#0 0x0000008074f7ac5c in .__clone () from /lib64/libc.so.6
#1 0x000000807517b1ec in do_clone (pd=0x4c6f200, attr=0x8075189c90 <default_attr>, stackaddr=<optimized out>, stopped=<optimized out>,
fct=@0x80751a01e0: 0x807517c500 <start_thread>, clone_flags=4001536) at ../nptl/sysdeps/pthread/createthread.c:74
#2 0x000000000403ed0c in pthread_create_WRK (thread=<error reading variable: value has been optimized out>,
attr=<error reading variable: value has been optimized out>, start=<error reading variable: value has been optimized out>,
arg=0xfff00ee18) at hg_intercepts.c:269
#3 0x000000000403ef1c in _vgw00000ZZ_libpthreadZdsoZd0_pthreadZucreateZAZa (thread=<optimized out>, attr=<optimized out>,
start=<optimized out>, arg=<optimized out>) at hg_intercepts.c:300
#4 0x000000003806f1d8 in ?? ()
#5 0x0000008074e9fb94 in generic_start_main (main=@0x100200d8: 0x100013a0 <main>, argc=<optimized out>, ubp_av=0xfff00f2d8,
auxvec=0xfff00f408, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>)
at ../csu/libc-start.c:225
#6 0x0000008074e9fd90 in __libc_start_main (argc=<optimized out>, ubp_av=<optimized out>, ubp_ev=<optimized out>,
auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>)
at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:91
#7 0x0000000000000000 in ?? ()
(gdb)
and here is the stacktrace produced by Valgrind unwinder:
Thread 1: status = VgTs_Runnable
==41687== at 0x8074F7AC5C: clone (in /usr/lib64/libc-2.16.so)
==41687== by 0x807517B1EB: do_clone.constprop.3 (createthread.c:74)
==41687== by 0x403EF1B: pthread_create@* (hg_intercepts.c:300)
==41687== by 0x10001597: main (tc19_shadowmem.c:172)
valgrind stack top usage: 15328 of 1048576
When the 2nd clone break is encountered (in the child thread), here is
the GDB stacktraces:
Thread 2 (Thread 6028):
#0 0x0000008074f75fb0 in .madvise () from /lib64/libc.so.6
#1 0x000000807517c700 in start_thread (arg=0x4c6f200) at pthread_create.c:402
#2 0x0000008074f7acf0 in .__clone () from /lib64/libc.so.6
Thread 1 (Thread 41687):
#0 pthread_create_WRK (thread=0xfff00ee10, attr=0x0, start=@0x100200e8: 0x10001dd0 <steer>, arg=0xfff00ee18) at hg_intercepts.c:248
#1 0x000000000403ef1c in _vgw00000ZZ_libpthreadZdsoZd0_pthreadZucreateZAZa (thread=<optimized out>, attr=<optimized out>,
start=<optimized out>, arg=<optimized out>) at hg_intercepts.c:300
#2 0x000000003806f1d8 in ?? ()
#3 0x0000008074e9fb94 in generic_start_main (main=@0x100200d8: 0x100013a0 <main>, argc=<optimized out>, ubp_av=0xfff00f2d8,
auxvec=0xfff00f408, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>)
at ../csu/libc-start.c:225
#4 0x0000008074e9fd90 in __libc_start_main (argc=<optimized out>, ubp_av=<optimized out>, ubp_ev=<optimized out>,
auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>)
at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:91
#5 0x0000000000000000 in ?? ()
(gdb)
Here are the valgrind stacktraces:
Thread 1: status = VgTs_Runnable
==41687== at 0x403EBE0: pthread_create_WRK (hg_intercepts.c:248)
==41687== by 0x403EF1B: pthread_create@* (hg_intercepts.c:300)
==41687== by 0x8074E9FB93: generic_start_main.isra.0 (libc-start.c:225)
==41687== by 0x8074E9FD8F: (below main) (libc-start.c:91)
valgrind stack top usage: 15328 of 1048576
Thread 2: status = VgTs_WaitSys
==41687== at 0x8074F75FB0: madvise (in /usr/lib64/libc-2.16.so)
==41687== by 0x807517C6FF: start_thread (pthread_create.c:402)
valgrind stack top usage: 10320 of 1048576
And then after a few more next/breaks:
Thread 1: status = VgTs_Runnable
==41687== at 0x8074F7AC5C: clone (in /usr/lib64/libc-2.16.so)
==41687== by 0x807517B1EB: do_clone.constprop.3 (createthread.c:74)
==41687== by 0x403EF1B: pthread_create@* (hg_intercepts.c:300)
==41687== by 0x100015BB: main (tc19_shadowmem.c:173)
valgrind stack top usage: 15328 of 1048576
Thread 2: status = VgTs_WaitSys
==41687== at 0x8074F75FB0: madvise (in /usr/lib64/libc-2.16.so)
==41687== by 0x807517C6FF: start_thread (pthread_create.c:402)
valgrind stack top usage: 10320 of 1048576
So, pthread_create_WRK is not in the stacktrace anymore.
This only works in non-bi arch mode. If ever aarch64+arm
are compiled bi-arch, then some more work is needed to have
a 64 bits vgdb able to ptrace invoke a 32 bits valgrind.
Note also that PTRACE_GETREGSET is defined on other platforms
(e.g. ppc64 fedora 18 defines it), but it is not used on
these platforms, as again, PTRACE_GETREGSET implies some
work for bi-arch to work properly.
So, on all platforms except arm64, we use PTRACE_GETREGS
or PTRACE_PEEKUSER.
Bart Van Assche [Sat, 17 May 2014 10:44:00 +0000 (10:44 +0000)]
drd/tests/atomic_var: Revert r13876.
r13876 was a workaround for false ordering introduced by platform-specific
(Solaris) code. The conclusion of an off-list discussion was that this has
to be solved in the drd tool itself and not by modifying test programs. Hence
this revert.
On old kernel, poll syscall being ptraced (vgdb+ptrace) is not necessarily
properly restarted. Instead, it can fail with EINTR, even if no signal was
effectively received.
Handle such case by retrying the poll syscall when the poll syscall
is failing due to EINTR
Mark Wielaard [Fri, 16 May 2014 22:28:48 +0000 (22:28 +0000)]
Revert configure support for adding -Werror=format-security.
This reverts part of valgrind svn r13962. There was a typo in the configure
check that caused failures when -Werror=format-security wasn't supported
and the flag interfered badly with -Wno-format-zero-length. So remove
for now and only add back when properly tested on various (older) setups.
Julian Seward [Thu, 15 May 2014 13:50:47 +0000 (13:50 +0000)]
Make the PLAT_ identification work properly for mingw-win64. Problem was
that mingw64 also defines __MINGW32__, which led to the 32-bit definitions
being used in the 64-bit case. n-i-bz. (Bernhard.Loos@ruecker.de)
Factorises the address code description and printing
of memcheck and helgrind in a common module:
pub_tool_addrinfo.h pub_core_addrinfo.h m_addrinfo.c
At the same time, the factorised code is made usable by other
tools also (and is used by the gdbserver command 'v.info location'
which replaces the helgrind 'describe addr' introduced 1 week ago
and which is now callable by all tools).
The new address description code can describe more addresses
(e.g. for memcheck, if the block is not on the free list anymore,
but is in an arena free list, this will also be described).
Similarly, helgrind address description can now describe more addresses
when --read-var-info=no is given (e.g. global symbols are
described, or addresses on the stack are described as
being on the stack, freed blocks in the arena free list are
described, ...).
See e.g. the change in helgrind/tests/annotate_rwlock.stderr.exp
or locked_vs_unlocked2.stderr.exp
The patch touches many files, but is basically a lot of improvements
in helgrind output files.
The code changes are mostly refactorisation of existing code.
Julian Seward [Tue, 13 May 2014 16:15:56 +0000 (16:15 +0000)]
Followup to r13958: add reg-trash lists to inline assembly in
TESTINSTPCMISALIGNED TESTINSTPCMISALIGNED_DWORDOUT
TESTINSTPCMISALIGNED_2OUT and nice up the the indentation a bit.
In case gdbsrv poll syscall fails, produces more information
gdbsrv poll syscall seems to very infrequently (1 on 100000 vgdb invocations§)
to be EINTR-upted.
So, when poll syscall fails, output stacktrace + sigmask status
to capture more info about the problem.
This is a follow-up/extension of r13748, which showed poll was interrupted
but it is not clear why. In particular, all async signals are supposed
to be masked at the time vgdb has forced an invocation
Julian Seward [Fri, 9 May 2014 16:13:21 +0000 (16:13 +0000)]
3_9_BUGSTATUS.txt: looked at all bugs in the file. Moved fixed ones
to NEWS (if not already there). Put the rest of them into a set of
categories depending on which part of the code base is affected, which
divides them up into -- IMO -- much more managable groups.
Mark Wielaard [Fri, 9 May 2014 13:34:13 +0000 (13:34 +0000)]
Out of tree build. Partial fix for Bug 333628.
Patch by Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>.
Partial fix. make && make check now works with builddir != srcdir.
But make regtest doesn't yet.