Improve performance of dwarf3 reader using a hashtable of parsed abbreviations
For each DIE, the dwarf3 reader must know which data elements to read.
These elements are described by an abbreviation.
Re-reading these abbreviations for each DIE is costly as
the location of the needed abbreviation is found by scanning the full
abbv section, which is very costly.
(A small cache of 32 abbv offsets in the abbv section somewhat decreases
the cost, but reading the abbvs is still a hot spot, in particular for
big debug informations).
This patch:
* adds an hash table of parsed abbreviations
* all abbreviations for a CU are read in one single scan of the abbv
section, when the CU header is read
So, with the patch, the di image is not accessed anymore for reading the abbvs
after the CU header parsing.
On a big executable, --read-var-info=yes user cpu changes from
trunk: 320 seconds
to
abbv cache: 270 seconds
This further improves on a previous (not committed) abbv cache that
was just caching up to 513 entries in the abbv pos cache and populating
the cache with an initial scan. The user cpu for this version was 285 seconds.
NB: this is some work in anticipation of a following patch that
will add reading dwarf3 inlined information, with the hope to make
this reading fast enough to activate it by default.
Note: on the examples I looked at, all abbreviations were numbered starting
from 1, with no holes. If that would always be the case, then one could use
an xarray of parsed abbreviations rather than an hash table. However,
I found nothing in the dwarf standard that guarantees that abbreviations
are numbered from 1. So, the hash table.
Do not destroy the strpool if NULL
It is possible that a debug info contains no string (and so strpool
is never allocated).
A protection to avoid accessing strpool was already necessary
in ML_(canonicaliseTables) :
if (di->strpool)
VG_(freezeDedupPA) (di->strpool);
So, if a similar debug info is released, we need the same protection
to avoid accessing a NULL strpool.
Detect by Julian on arm64, but not (at least easily) reproduced on amd64.
This patch adds a 'de-duplicating memory pool allocator':
include/pub_tool_deduppoolalloc.h
coregrind/pub_core_deduppoolalloc.h
coregrind/m_deduppoolalloc.c
and uses it (currently only) for the strings in m_debuginfo/storage.c
The idea is that such ddup pool allocator will also be used for other
highly duplicated information (e.g. the DiCFSI information), where
significant gains can also be achieved.
The dedup pool for strings also decreases significantly the memory
needed by the read inline information (patch still to be committed,
see bug 278972).
(so 3Mb less mmap-ed once debug info is read, 1Mb less mmap-ed in peak,
6Mb less allocated once debug info is read).
This is all gained due to the string which changes from:
trunk: 17,434,704 in 266: di.storage.addStr.1
to
ddup: 10,966,608 in 750: di.storage.addStr.1
(6.5Mb less memory used by strings)
The gain in mmap-ed memory is smaller due to fragmentation.
Probably one could decrease the fragmentation by using bigger
size for the dedup pool, but then we would lose memory on the last
allocated pool (and for small libraries, we often do not use much
of a big pool block).
Solution might be to increase the pool size but have a "shrink_block"
operation. To be looked at in the future.
In terms of performance, startup of a big executable (on an old pentium)
is not influenced significantly (something like 0.1 seconds on 15 seconds
startup for a big executable, on a slow pentium).
The dedup pool uses a hash table. The hash function used currently
is the VG_(adler32) check sum. It is reported (and visible also here)
that this checksum is not a very good hash function (many collisions).
To have statistics about collisions, use --stats -v -v -v
As an example of the collisions, on the strings in debug info of memcheck tool on x86,
one obtain:
--4789-- dedupPA:di.storage.addStr.1 9983 allocs (8174 uniq) 11 pools (4820 bytes free in last pool)
--4789-- nr occurences of chains of len N, N-plicated keys, N-plicated elts
--4789-- N: 0 : nr chain 6975, nr keys 0, nr elts 0
--4789-- N: 1 : nr chain 3670, nr keys 6410, nr elts 8174
--4789-- N: 2 : nr chain 1070, nr keys 226, nr elts 0
--4789-- N: 3 : nr chain 304, nr keys 100, nr elts 0
--4789-- N: 4 : nr chain 104, nr keys 84, nr elts 0
--4789-- N: 5 : nr chain 72, nr keys 42, nr elts 0
--4789-- N: 6 : nr chain 44, nr keys 34, nr elts 0
--4789-- N: 7 : nr chain 18, nr keys 13, nr elts 0
--4789-- N: 8 : nr chain 17, nr keys 8, nr elts 0
--4789-- N: 9 : nr chain 4, nr keys 6, nr elts 0
--4789-- N:10 : nr chain 9, nr keys 4, nr elts 0
--4789-- N:11 : nr chain 1, nr keys 0, nr elts 0
--4789-- N:13 : nr chain 1, nr keys 1, nr elts 0
--4789-- total nr of unique chains: 12289, keys 6928, elts 8174
which shows that on 8174 different strings, we have only 6410 strings which have
a unique hash value. As other examples, N:13 line shows we have 13 strings
mapping to the same key. N:14 line shows we have 4 groups of 10 strings mapping to the
same key, etc.
So, adler32 is definitely a bad hash function.
Trials have been done with another hash function, giving a much lower
collision rate. So, a better (but still fast) hash function would probably
be beneficial. To be looked at ...
Improve address description for address in the stack.
--read-var-info=yes is very memory and cpu intensive.
This patch ensures that even witout --read-var-info=yes that
the frame where the address point is reported in the address
description.
Julian Seward [Wed, 21 May 2014 14:43:11 +0000 (14:43 +0000)]
Add test cases for PCMPxSTRx cases 0x0E, 0x34, 0x14, and reformat some
of the associated switch statements. Pertains to #326469, #327639,
#328878 respectively.
on ppc64, pthread_create_WRK is not (always) produced in the stacktrace
showing where a thread was created.
This makes many tests fail => use sed to delete pthread_create_WRK
from the stacktrace to let tests succeed on ppc64.
With this change, on ppc64 gcc110 (fedora 18), helgrind failures
goes from 28 tests failing to 4, with following reasons:
helgrind/tests/pth_cond_destroy_busy (stderr)
(6 errors instead of 3 in the summary line ???)
helgrind/tests/tc06_two_races_xml (stderr)
similar change needed in filter_xml to del pthread_create_WRK
helgrind/tests/tc18_semabuse (stderr)
- with error code 22 (EINVAL: Invalid argument)
+ with error code 38 (ENOSYS: Function not implemented)
helgrind/tests/tc20_verifywrap (stderr)
- with error code 22 (EINVAL: Invalid argument)
+ with error code 38 (ENOSYS: Function not implemented)
More details about the stacktrace not containing pthread_create_WRK:
--------------------------------------------------------------------
Here is the stacktrace obtained by GDB+vgdb:
(gdb) bt
#0 0x0000008074f7ac5c in .__clone () from /lib64/libc.so.6
#1 0x000000807517b1ec in do_clone (pd=0x4c6f200, attr=0x8075189c90 <default_attr>, stackaddr=<optimized out>, stopped=<optimized out>,
fct=@0x80751a01e0: 0x807517c500 <start_thread>, clone_flags=4001536) at ../nptl/sysdeps/pthread/createthread.c:74
#2 0x000000000403ed0c in pthread_create_WRK (thread=<error reading variable: value has been optimized out>,
attr=<error reading variable: value has been optimized out>, start=<error reading variable: value has been optimized out>,
arg=0xfff00ee18) at hg_intercepts.c:269
#3 0x000000000403ef1c in _vgw00000ZZ_libpthreadZdsoZd0_pthreadZucreateZAZa (thread=<optimized out>, attr=<optimized out>,
start=<optimized out>, arg=<optimized out>) at hg_intercepts.c:300
#4 0x000000003806f1d8 in ?? ()
#5 0x0000008074e9fb94 in generic_start_main (main=@0x100200d8: 0x100013a0 <main>, argc=<optimized out>, ubp_av=0xfff00f2d8,
auxvec=0xfff00f408, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>)
at ../csu/libc-start.c:225
#6 0x0000008074e9fd90 in __libc_start_main (argc=<optimized out>, ubp_av=<optimized out>, ubp_ev=<optimized out>,
auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>)
at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:91
#7 0x0000000000000000 in ?? ()
(gdb)
and here is the stacktrace produced by Valgrind unwinder:
Thread 1: status = VgTs_Runnable
==41687== at 0x8074F7AC5C: clone (in /usr/lib64/libc-2.16.so)
==41687== by 0x807517B1EB: do_clone.constprop.3 (createthread.c:74)
==41687== by 0x403EF1B: pthread_create@* (hg_intercepts.c:300)
==41687== by 0x10001597: main (tc19_shadowmem.c:172)
valgrind stack top usage: 15328 of 1048576
When the 2nd clone break is encountered (in the child thread), here is
the GDB stacktraces:
Thread 2 (Thread 6028):
#0 0x0000008074f75fb0 in .madvise () from /lib64/libc.so.6
#1 0x000000807517c700 in start_thread (arg=0x4c6f200) at pthread_create.c:402
#2 0x0000008074f7acf0 in .__clone () from /lib64/libc.so.6
Thread 1 (Thread 41687):
#0 pthread_create_WRK (thread=0xfff00ee10, attr=0x0, start=@0x100200e8: 0x10001dd0 <steer>, arg=0xfff00ee18) at hg_intercepts.c:248
#1 0x000000000403ef1c in _vgw00000ZZ_libpthreadZdsoZd0_pthreadZucreateZAZa (thread=<optimized out>, attr=<optimized out>,
start=<optimized out>, arg=<optimized out>) at hg_intercepts.c:300
#2 0x000000003806f1d8 in ?? ()
#3 0x0000008074e9fb94 in generic_start_main (main=@0x100200d8: 0x100013a0 <main>, argc=<optimized out>, ubp_av=0xfff00f2d8,
auxvec=0xfff00f408, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>)
at ../csu/libc-start.c:225
#4 0x0000008074e9fd90 in __libc_start_main (argc=<optimized out>, ubp_av=<optimized out>, ubp_ev=<optimized out>,
auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>)
at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:91
#5 0x0000000000000000 in ?? ()
(gdb)
Here are the valgrind stacktraces:
Thread 1: status = VgTs_Runnable
==41687== at 0x403EBE0: pthread_create_WRK (hg_intercepts.c:248)
==41687== by 0x403EF1B: pthread_create@* (hg_intercepts.c:300)
==41687== by 0x8074E9FB93: generic_start_main.isra.0 (libc-start.c:225)
==41687== by 0x8074E9FD8F: (below main) (libc-start.c:91)
valgrind stack top usage: 15328 of 1048576
Thread 2: status = VgTs_WaitSys
==41687== at 0x8074F75FB0: madvise (in /usr/lib64/libc-2.16.so)
==41687== by 0x807517C6FF: start_thread (pthread_create.c:402)
valgrind stack top usage: 10320 of 1048576
And then after a few more next/breaks:
Thread 1: status = VgTs_Runnable
==41687== at 0x8074F7AC5C: clone (in /usr/lib64/libc-2.16.so)
==41687== by 0x807517B1EB: do_clone.constprop.3 (createthread.c:74)
==41687== by 0x403EF1B: pthread_create@* (hg_intercepts.c:300)
==41687== by 0x100015BB: main (tc19_shadowmem.c:173)
valgrind stack top usage: 15328 of 1048576
Thread 2: status = VgTs_WaitSys
==41687== at 0x8074F75FB0: madvise (in /usr/lib64/libc-2.16.so)
==41687== by 0x807517C6FF: start_thread (pthread_create.c:402)
valgrind stack top usage: 10320 of 1048576
So, pthread_create_WRK is not in the stacktrace anymore.
This only works in non-bi arch mode. If ever aarch64+arm
are compiled bi-arch, then some more work is needed to have
a 64 bits vgdb able to ptrace invoke a 32 bits valgrind.
Note also that PTRACE_GETREGSET is defined on other platforms
(e.g. ppc64 fedora 18 defines it), but it is not used on
these platforms, as again, PTRACE_GETREGSET implies some
work for bi-arch to work properly.
So, on all platforms except arm64, we use PTRACE_GETREGS
or PTRACE_PEEKUSER.
Bart Van Assche [Sat, 17 May 2014 10:44:00 +0000 (10:44 +0000)]
drd/tests/atomic_var: Revert r13876.
r13876 was a workaround for false ordering introduced by platform-specific
(Solaris) code. The conclusion of an off-list discussion was that this has
to be solved in the drd tool itself and not by modifying test programs. Hence
this revert.
On old kernel, poll syscall being ptraced (vgdb+ptrace) is not necessarily
properly restarted. Instead, it can fail with EINTR, even if no signal was
effectively received.
Handle such case by retrying the poll syscall when the poll syscall
is failing due to EINTR
Mark Wielaard [Fri, 16 May 2014 22:28:48 +0000 (22:28 +0000)]
Revert configure support for adding -Werror=format-security.
This reverts part of valgrind svn r13962. There was a typo in the configure
check that caused failures when -Werror=format-security wasn't supported
and the flag interfered badly with -Wno-format-zero-length. So remove
for now and only add back when properly tested on various (older) setups.
Julian Seward [Thu, 15 May 2014 13:50:47 +0000 (13:50 +0000)]
Make the PLAT_ identification work properly for mingw-win64. Problem was
that mingw64 also defines __MINGW32__, which led to the 32-bit definitions
being used in the 64-bit case. n-i-bz. (Bernhard.Loos@ruecker.de)
Factorises the address code description and printing
of memcheck and helgrind in a common module:
pub_tool_addrinfo.h pub_core_addrinfo.h m_addrinfo.c
At the same time, the factorised code is made usable by other
tools also (and is used by the gdbserver command 'v.info location'
which replaces the helgrind 'describe addr' introduced 1 week ago
and which is now callable by all tools).
The new address description code can describe more addresses
(e.g. for memcheck, if the block is not on the free list anymore,
but is in an arena free list, this will also be described).
Similarly, helgrind address description can now describe more addresses
when --read-var-info=no is given (e.g. global symbols are
described, or addresses on the stack are described as
being on the stack, freed blocks in the arena free list are
described, ...).
See e.g. the change in helgrind/tests/annotate_rwlock.stderr.exp
or locked_vs_unlocked2.stderr.exp
The patch touches many files, but is basically a lot of improvements
in helgrind output files.
The code changes are mostly refactorisation of existing code.
Julian Seward [Tue, 13 May 2014 16:15:56 +0000 (16:15 +0000)]
Followup to r13958: add reg-trash lists to inline assembly in
TESTINSTPCMISALIGNED TESTINSTPCMISALIGNED_DWORDOUT
TESTINSTPCMISALIGNED_2OUT and nice up the the indentation a bit.
In case gdbsrv poll syscall fails, produces more information
gdbsrv poll syscall seems to very infrequently (1 on 100000 vgdb invocations§)
to be EINTR-upted.
So, when poll syscall fails, output stacktrace + sigmask status
to capture more info about the problem.
This is a follow-up/extension of r13748, which showed poll was interrupted
but it is not clear why. In particular, all async signals are supposed
to be masked at the time vgdb has forced an invocation
Julian Seward [Fri, 9 May 2014 16:13:21 +0000 (16:13 +0000)]
3_9_BUGSTATUS.txt: looked at all bugs in the file. Moved fixed ones
to NEWS (if not already there). Put the rest of them into a set of
categories depending on which part of the code base is affected, which
divides them up into -- IMO -- much more managable groups.
Mark Wielaard [Fri, 9 May 2014 13:34:13 +0000 (13:34 +0000)]
Out of tree build. Partial fix for Bug 333628.
Patch by Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>.
Partial fix. make && make check now works with builddir != srcdir.
But make regtest doesn't yet.
Keep the reading side of the FIFO opened after a vgdb disconnect
Otherwise, if the valgrind process has just forked before
the vgdb connect, and the child has not yet closed the FIFO it inheritates
from its parent, and a 2nd vgdb is started before the parent has
re-opened the reading side FIFO, the 2nd vgdb can connect to the child,
which will not handle the wakeup character/connection request, and
vgdb will have a write error and/or block.
Julian Seward [Thu, 8 May 2014 15:14:48 +0000 (15:14 +0000)]
When constructing a signal frame, tell the tool that the registers
that carry the arguments (signo, siginfo, mcontext) for the handler,
have been written. In particular this makes Memcheck think
(correctly) they are defined and so removes a bunch of false positives
that can happen in the signal handler, should the registers have been
marked undefined before the signal.
The same fix needs to be applied to various other of these
sigframe-*.c files.
On a big application linking with gtk, using the compilation options
-ffunction-sections -fdata-sections and the linker option
-Wl,--gc-sections, --read-var-info=yes gives the following:
host stacktrace:
==18521== at 0x38057C54: show_sched_status_wrk (m_libcassert.c:308)
==18521== by 0x38057F50: report_and_quit (m_libcassert.c:367)
==18521== by 0x38058151: vgPlain_assert_fail (m_libcassert.c:432)
==18521== by 0x3813F084: vgModuleLocal_evaluate_GX (d3basics.c:973)
==18521== by 0x38098300: data_address_is_in_var (debuginfo.c:2769)
==18521== by 0x38099E26: vgPlain_get_data_description (debuginfo.c:3298)
...
The problem is that -Wl,--gc-sections eliminates the unused functions
but keeps some debug info for the functions or their compilation units.
The dwarf entry has low and high pc, but both are equal to 0.
The dwarf reader of Valgrind is confused by this, as the varstack becomes
empty, while it should not. This then causes local (eliminated) variables
to be put in the global scope, leading afterwards to evaluation errors
when describing any other variables.
The fix is to also push something on the varstack when
a CU that has low and high pc given but with 0 value.
This is similar to the varstack_push done for a CU that has
no low pc, no high pc and no range.
Despite considerable effort to make a small reproducer, the problem
could only be produced with a big executable.
After the fix, everything was working properly.
The wrong behaviour for dwarf entries produce the following trace:
<2><2ff291a>: Abbrev Number: 23 (DW_TAG_formal_parameter)
DW_AT_name : AET
DW_AT_decl_file : 1
DW_AT_decl_line : 243
DW_AT_type : <2ff2811>
DW_AT_location : 18288554
Recording this variable, with 1 PC range(s)
....
<2ff291a> addVar: level 0: AET :: EdgeTableEntry*
Loc=GX(final){[0x0,0x8]=50,[0x9,0x1d]=53,[0x1e,0x26]=51,[0x27,0x29]=53,[0x2a,0x2f]=51,[0x44,0x4a]=53,[0x4d,0x5e]=51,[0x5f,0x62]=53}
FrB=none
declared at: gdkpolyreg-generic.c:243
ACQUIRE for range(s) [0x0,0xffffffff]
The AET is a formal parameter of a function, but is wrongly added
at level 0, with a PC range covering the full space. It has a Loc GX
which uses non biased program counters (e.g. 0x0,0x8).
This dwarf entry will require a FrB (and registers when evaluating)
but no such things are available (or given) when evaluating a variable
in the global scope.
The fix is to handle compilation units with lo and hi pc == 0x0
similarly to a CU that has no lo and hi pc.
With this fix, valgrind --read-var-info=yes could properly
handle a big application with plenty of eliminated functions.
This is because that dwarf entry defines a struct with no size.
This happens when the struct has a VLA array in the middle
of a struct. This is a C gcc extension, and is a standard
feature of Ada.
The proper solution would be to have the size calculated at runtime,
using the gnat extensions or dwarf entries (to be generated by
the compiler).
The patch fixes this problem by defining the size of such structure
as 1 byte.
Another approach tried was to put the max possible size.
This had the disadvantage that any address on the stack was seen
as belonging to this variable.
This allows the description to work for the 1st byte of the variable
but cannot properly describe the 2nd and following bytes :
(gdb) p &n
$9 = (namet *) 0xbefbc070
(gdb) mo c d 0xbefbc070
Address 0xBEFBC070 len 1 not defined:
Uninitialised value at 0xBEFBC070
==1396== Location 0xbefbc070 is 0 bytes inside n.before_name,
==1396== declared at crec.c:10, in frame #0 of thread 1
(gdb) mo c d 0xbefbc071
Address 0xBEFBC071 len 1 not defined:
Uninitialised value at 0xBEFBC071
==1396== Address 0xbefbc071 is on thread 1's stack
(gdb)
A possible refinement would be to use a huge size but have the
logic of variable description understanding this and describing
all between this var and hte next var on the stack as being
in the VLA variable.
In the meantime, the size 1 avoids --read-var-info=yes to fail.
Also, the 'goto bad_DIE' have been replaced by a macro
goto_bad_DIE that ensures the line nr at which the bad DIE has
been detected is reported in the error msg.
This makes it easier to understand what is the problem.