Carl Love [Thu, 14 May 2015 21:52:59 +0000 (21:52 +0000)]
Patch 4 in a revised series of cleanup patches from Will Schmidt
Add a suppression to handle a "Jump to the invalid address..." message
that gets generated on power. This is a variation of the existing
suppressions.
While here, I also updated the "prog:" line in the vgtest file to reference
the supp_unknown executable, versus the badjump executable. They share the
same source code, so I think this is effectively cosmetic.
Add the lwpid to the scheduler status information.
E.g. we now have:
Thread 1: status = VgTs_Runnable (lwpid 15782)
==15782== at 0x8048EB5: main (sleepers.c:188)
client stack range: [0xBE836000 0xBE839FFF] client SP: 0xBE838F80
valgrind stack top usage: 10264 of 1048576
Thread 2: status = VgTs_WaitSys (lwpid 15828)
==15782== at 0x2E9451: ??? (syscall-template.S:82)
==15782== by 0x8048AD3: sleeper_or_burner (sleepers.c:84)
==15782== by 0x39B924: start_thread (pthread_create.c:297)
==15782== by 0x2F107D: clone (clone.S:130)
client stack range: [0x442F000 0x4E2EFFF] client SP: 0x4E2E338
valgrind stack top usage: 2288 of 1048576
This allows attaching GDB to the right lwpid in case
you want to examine the Valgrind state rather than the guest state.
(Attaching to the specific lwpid is needed because Valgrind is not
linked with libpthread, so GDB cannot discover the threads
of the process.)
Implement 'qXfer:exec-file:read' packet in Valgrind gdbserver.
Thanks to this packet, with recent GDB (>= 7.9.50.20150514-cvs), the
command 'target remote' will automatically load the executable file of
the process running under Valgrind. This means you do not need to
specify the executable file yourself; GDB will discover it on its own.
See GDB documentation about 'qXfer:exec-file:read' packet for more
info.
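On Linux, a gdbserver can answer this packet by resolving the /proc/&lt;pid&gt;/exe symlink. The sketch below is a hypothetical illustration of that lookup, not the actual Valgrind gdbserver code:

```c
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: resolve the executable path of a process, the way
   a gdbserver answering 'qXfer:exec-file:read' could do on Linux, by
   reading the /proc/<pid>/exe symlink.  Not the actual Valgrind code. */
static ssize_t exec_file_of_pid(pid_t pid, char *buf, size_t bufsiz)
{
   char link[64];
   snprintf(link, sizeof link, "/proc/%ld/exe", (long)pid);
   ssize_t n = readlink(link, buf, bufsiz - 1);
   if (n >= 0)
      buf[n] = '\0';          /* readlink does not NUL-terminate */
   return n;
}
```

GDB then reads this path over the remote protocol and loads the binary's symbols without the user naming it on the command line.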
Carl Love [Wed, 13 May 2015 21:46:47 +0000 (21:46 +0000)]
Patch 2 in a revised series of cleanup patches from Will Schmidt
Add .exp values for the deep-D test on ppc64.
Depending on the system and its endianness, there are variations
in the library reference and in the specific line number within the library.
I was able to add and modify existing filters to cover most of the variations,
but did need to add a .exp to cover the additional call stack entry as seen
on power.
This change allows the ppc64 targets to pass the massif/deep-D test.
Carl Love [Wed, 13 May 2015 21:10:12 +0000 (21:10 +0000)]
Patch 1 in a revised series of cleanup patches from Will Schmidt
Update the massif/big-alloc test for ppc64*.
In comparison to the existing .exp files, the time,total,extra-heap
values generated on ppc64* differ from those on other architectures.
This .exp allows the ppc64 targets to pass the test.
* avoid indirection via function pointers to call SVal__rcinc and SVal__rcdec
* declare these functions inlined
* turn 2 asserts on the hot path into checks conditionally compiled
under CHECK_ZSM
This slightly improves some helgrind perf tests.
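A minimal sketch of the pattern (the SVal type and layout here are hypothetical; the real helgrind code differs):

```c
#include <assert.h>

/* Sketch of the optimisation: call the reference-count helpers directly
   as inline functions instead of through function pointers, and compile
   hot-path asserts only when CHECK_ZSM is defined.  SVal is a stand-in
   type, not the real helgrind one. */
typedef struct { int rc; } SVal;

static inline void SVal__rcinc(SVal *s) { s->rc++; }

static inline void SVal__rcdec(SVal *s)
{
#ifdef CHECK_ZSM
   assert(s->rc > 0);        /* hot-path check, only in checking builds */
#endif
   s->rc--;
}

static int touch(SVal *s)
{
   SVal__rcinc(s);           /* direct call: the compiler can inline it */
   SVal__rcdec(s);
   return s->rc;
}
```

With direct inline calls the compiler avoids the indirect-branch cost that a function-pointer call would incur on this hot path.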
Improves the way arena statistics are shown
The mmap'd max/curr and max/curr nr of bytes will be shown e.g. as
11,440,408/ 4,508,968
instead of 11440656/ 4509200
This uses more space, but is more readable (in particular when the
number exceeds the field width and the columns are no longer aligned).
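Thousands separators of this kind can be produced with a small formatter; the sketch below is illustrative and not the actual Valgrind statistics code:

```c
#include <stdio.h>
#include <string.h>

/* Sketch (not the actual Valgrind formatter): render an unsigned number
   with ',' as thousands separator, e.g. 11440408 -> "11,440,408".
   buf must be large enough for the digits plus separators (>= 27). */
static char *commify(unsigned long n, char *buf)
{
   char raw[32];
   int len = snprintf(raw, sizeof raw, "%lu", n);
   size_t out = 0;
   for (int i = 0; i < len; i++) {
      /* insert a comma before every group of 3 remaining digits */
      if (i > 0 && (len - i) % 3 == 0)
         buf[out++] = ',';
      buf[out++] = raw[i];
   }
   buf[out] = '\0';
   return buf;
}
```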
This patch decreases the memory used by the helgrind SecMap,
by implementing a Garbage Collection for the SecMap.
The basic change is that freed memory is marked as noaccess
(before, it kept its previous marking, on the basis that
non-buggy applications do not access freed memory anyway).
Keeping the previous marking avoids the CPU/memory work needed
to mark noaccess.
However, marking freed memory noaccess and GC-ing the SecMap reduces
the memory used by big applications.
For example, a firefox test needs 220 MB less (out of about 2.06 GB).
Similar reduction for a libreoffice batch (260 MB less out of 1.09 GB).
On such applications, the performance with the patch is similar to the trunk.
There is a performance decrease for applications that are doing
a lot of malloc/free repetitively: e.g. on some perf tests, an increase
in cpu of up to 15% has been observed.
Several performance optimisations can be done afterwards to avoid
losing too much performance. The decrease in memory is expected to produce
in any case significant benefit in memory constrained environments
(e.g. android phones).
So, after discussion with Julian, it was decided to commit as-is
and (re-)gain (part of) performance in follow-up commits.
Add some CFI directives to the code doing syscalls (in Valgrind).
This allows attaching to Valgrind when it is blocked in a syscall
and having GDB produce a stack trace, rather than being unable
to unwind.
I.e. instead of having:
(gdb) bt
#0 0x380460f2 in do_syscall_WRK ()
(gdb)
with the directives, we obtain:
(gdb) bt
#0 vgPlain_mk_SysRes_x86_linux (val=1) at m_syscall.c:65
#1 vgPlain_do_syscall (sysno=168, a1=944907996, a2=1, a3=4294967295, a4=0, a5=0, a6=0, a7=0, a8=0) at m_syscall.c:791
#2 0x38031986 in vgPlain_poll (fds=0x385226dc <remote_desc_pollfdread_activity>, nfds=1, timeout=-1) at m_libcfile.c:535
#3 0x3807479f in vgPlain_poll_no_eintr (fds=0x385226dc <remote_desc_pollfdread_activity>, nfds=1, timeout=-1)
at m_gdbserver/remote-utils.c:86
#4 0x380752f0 in readchar (single=4096) at m_gdbserver/remote-utils.c:938
#5 0x38075ae3 in getpkt (buf=0x61f35020 "") at m_gdbserver/remote-utils.c:997
#6 0x38076fcb in server_main () at m_gdbserver/server.c:1048
#7 0x38072af2 in call_gdbserver (tid=1, reason=init_reason) at m_gdbserver/m_gdbserver.c:721
#8 0x380735ba in vgPlain_gdbserver (tid=1) at m_gdbserver/m_gdbserver.c:788
#9 0x3802c6ef in do_actions_on_error (allow_db_attach=<optimized out>, err=<optimized out>) at m_errormgr.c:532
#10 pp_Error (err=0x61f580e0, allow_db_attach=1 '\001', xml=1 '\001') at m_errormgr.c:644
#11 0x3802cc34 in vgPlain_maybe_record_error (tid=1643479264, ekind=8, a=2271560481, s=0x0, extra=0x62937f1c)
at m_errormgr.c:851
#12 0x38028821 in vgMemCheck_record_free_error (tid=1, a=2271560481) at mc_errors.c:836
#13 0x38007b65 in vgMemCheck_free (tid=1, p=0x87654321) at mc_malloc_wrappers.c:496
#14 0x3807e261 in do_client_request (tid=1) at m_scheduler/scheduler.c:1840
#15 vgPlain_scheduler (tid=1) at m_scheduler/scheduler.c:1406
#16 0x3808b6b2 in thread_wrapper (tidW=<optimized out>) at m_syswrap/syswrap-linux.c:102
#17 run_a_thread_NORETURN (tidW=1) at m_syswrap/syswrap-linux.c:155
#18 0x00000000 in ?? ()
(gdb)
Carl Love [Wed, 6 May 2015 21:11:35 +0000 (21:11 +0000)]
Patch 8 in a series of cleanup patches from Will Schmidt
Add a helper script to determine if the platform is ppc64le.
This is specifically used to help exclude the 32-bit tests from being
run on a ppc64LE (ABIV2) platform. The 32-bit targets, specifically ppc32/*,
are not built on LE.
Carl Love [Wed, 6 May 2015 19:44:14 +0000 (19:44 +0000)]
Patch 2 in a series of cleanup patches from Will Schmidt
Adjust the badjump2 test for ppc64le/ABIV2. Under the ABIV2 there
is no function descriptor, so the fn[] setup does not apply.
This fixes the badjump2 test failure as seen on ppc64le.
* The out-of-memory message said 'bytes have already been allocated.'
while this number is in fact the total anonymously mmap-ed.
Change the message to reflect the number shown.
* Also show the total anonymously mmap-ed in the non-OOM memory statistics.
This patch reduces the memory needed for a VtsTE by 25% (one word)
on 32-bit platforms. There is no memory reduction on 64-bit platforms,
due to alignment.
The patch also shows the vts stats when showing the helgrind stats.
The perf/memrw.c perf test also gains some new features,
e.g. allowing control of the size of the read or written blocks.
This patch adds a function that allows directly sizing an xarray
properly when the size is known in advance.
Three places were identified where this function can be used trivially.
The result is a reduction of 'realloc' operations in the core
arena, and a small reduction in the ttaux arena
(it is the nr of operations that decreases; the memory usage itself
stays the same, ignoring some 'rounding' effects).
E.g. for perf/bigcode 0, we change from
core 1085742/ 216745904 totalloc-blocks/bytes, 1085733 searches
ttaux 5348/ 6732560 totalloc-blocks/bytes, 5326 searches
to
core 712666/ 190998592 totalloc-blocks/bytes, 712657 searches
ttaux 5319/ 6731808 totalloc-blocks/bytes, 5296 searches
For bz2, we switch from
core 50285/ 32383664 totalloc-blocks/bytes, 50256 searches
ttaux 670/ 245160 totalloc-blocks/bytes, 669 searches
to
core 32564/ 29971984 totalloc-blocks/bytes, 32535 searches
ttaux 605/ 243280 totalloc-blocks/bytes, 604 searches
Performance wise, on amd64, this improves memcheck performance
on perf tests by 0.0, 0.1 or 0.2 seconds depending on the test.
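The effect of pre-sizing can be sketched with a toy growable array (hypothetical API, not the actual Valgrind xarray interface):

```c
#include <stdlib.h>

/* Sketch of the idea: a growable array that can be created with a known
   initial capacity, so that no realloc happens while filling it.
   Hypothetical API, not the real VG_(newXA) interface. */
typedef struct {
   int    *data;
   size_t  used, cap;
   size_t  nreallocs;        /* counts grow operations, for illustration */
} XArray;

static XArray *xa_new_sized(size_t cap)
{
   XArray *xa = malloc(sizeof *xa);
   xa->cap  = cap ? cap : 1;
   xa->data = malloc(xa->cap * sizeof *xa->data);
   xa->used = 0;
   xa->nreallocs = 0;
   return xa;
}

static void xa_add(XArray *xa, int v)
{
   if (xa->used == xa->cap) {         /* grow by doubling when full */
      xa->cap *= 2;
      xa->data = realloc(xa->data, xa->cap * sizeof *xa->data);
      xa->nreallocs++;
   }
   xa->data[xa->used++] = v;
}
```

Creating the array with its final capacity means xa_add never has to realloc, which is the reduction in totalloc-blocks shown in the stats.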
Rename write variable to avoid a warning:
memrw.c:37: warning: declaration of ‘write’ shadows a global declaration
/usr/include/unistd.h:333: warning: shadowed declaration is here
DW_CFA_def_cfa_expression: don't push the CFA on the stack before
evaluation starts. For DW_CFA_val_expression and DW_CFA_expression
doing so is correct, but not for DW_CFA_def_cfa_expression.
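The distinction can be sketched as follows (illustrative only, not the Valgrind CFI reader). Per the DWARF rules, the expression attached to DW_CFA_expression or DW_CFA_val_expression is evaluated with the current CFA pushed as the initial stack element, while DW_CFA_def_cfa_expression, which computes the CFA itself, starts with an empty stack:

```c
#include <stddef.h>

/* Illustrative sketch of setting up the DWARF expression evaluation
   stack for the three expression-based CFI rules. */
typedef enum { CFI_EXPR, CFI_VAL_EXPR, CFI_DEF_CFA_EXPR } CfiKind;

static size_t init_eval_stack(CfiKind kind, unsigned long cfa,
                              unsigned long *stack)
{
   size_t sp = 0;
   if (kind == CFI_EXPR || kind == CFI_VAL_EXPR)
      stack[sp++] = cfa;     /* the expression may refer to the CFA */
   /* for CFI_DEF_CFA_EXPR the stack starts empty: there is no CFA yet */
   return sp;                /* nr of elements initially on the stack */
}
```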
Back out most of r15145 which reports bug fixes for various altivec insns.
Either those bugs were fixed a long time ago, or the reporter ran
on a host without altivec capabilities, or those insns were actually
e500 insns which are not supported at all at this point.
Follow up on VEX r3144 and remove VexGuestTILEGXStateAlignment.
Also fix the alignment check which should be mod 16 not mod 8.
Well, actually, it should be mod LibVEX_GUEST_STATE_ALIGN but
that is another patch.
Fix BZ #342683. Based on patch by Ivo Raisr.
What this does is to make sure that the initial client data segment
is marked as unaddressable. This is consistent with the behaviour of
brk when the data segment is shrunk. The "freed" memory is marked
as unaddressable.
Special tweaks were needed for s390, which was returning early from
the function to avoid sloppy register definedness initialisation.
Replace adler32 by sdbm_hash in m_deduppoolalloc.c
adler32 is not very good as a hash function.
sdbm_hash gives more distinct keys than adler32,
and in a large majority of cases, shorter chains.
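For reference, the classic sdbm hash is shown below; the version used in m_deduppoolalloc.c may differ in details such as seed or width:

```c
/* The classic sdbm string hash: h = c + (h << 6) + (h << 16) - h,
   i.e. h * 65599 + c, a cheap multiplicative hash that spreads keys
   much better than adler32 (which was designed as a checksum). */
static unsigned int sdbm_hash(const unsigned char *str, unsigned int len)
{
   unsigned int h = 0;
   for (unsigned int i = 0; i < len; i++)
      h = str[i] + (h << 6) + (h << 16) - h;
   return h;
}
```

Because each byte is folded in via a multiplication, the hash is order-sensitive: permuting the input bytes changes the key, which adler32 handles poorly.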
Fix an assertion in the address space manager. BZ #345887.
The VG_(extend_stack) call needs to be properly guarded because the
passed-in address is not necessarily part of an extensible stack
segment. And an extensible stack segment is the only thing that
function should have to deal with.
Previously, the function VG_(am_addr_is_in_extensible_client_stack)
was introduced to guard VG_(extend_stack) but it was not added in all
places it should have been.
Also, extending the client stack during signal delivery (in sigframe-common.c)
was simply calling VG_(extend_stack) hoping it would do the right thing.
But that was not always the case. The new testcase
none/tests/linux/pthread-stack.c exercises this (3.10.1 errors out on it).
Rename ML_(sf_extend_stack) to ML_(sf_maybe_extend_stack) and add
proper guard logic for VG_(extend_stack).
Testcases none/tests/{amd64|x86}-linux/bug345887.c by Ivo Raisr.
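The guard logic can be modelled as follows (hypothetical types and names; the real aspacemgr code differs):

```c
#include <stdbool.h>

/* Simplified model of the fix: VG_(extend_stack) must only be called for
   addresses that really lie in an extensible client stack segment, so
   every call site goes through a guard first.  StackSeg is a stand-in
   for the real segment type. */
typedef struct { unsigned long lo, hi; bool extensible; } StackSeg;

static bool addr_is_in_extensible_client_stack(const StackSeg *seg,
                                               unsigned long addr)
{
   return seg->extensible && addr >= seg->lo && addr <= seg->hi;
}

/* Mirrors the renamed ML_(sf_maybe_extend_stack): extend only if the
   guard allows it, instead of hoping extend_stack does the right thing. */
static bool maybe_extend_stack(StackSeg *seg, unsigned long addr)
{
   if (!addr_is_in_extensible_client_stack(seg, addr))
      return false;           /* not ours to extend */
   /* ... the real code would grow the segment down to addr here ... */
   return true;
}
```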
Carl Love [Wed, 22 Apr 2015 21:17:48 +0000 (21:17 +0000)]
With GCC 5.0 there is an ABI change in how the PPC64 GCC compiler
aligns 128-bit arguments. The compiler generates a "note" about this starting
with GCC 4.9. To avoid generating the "note", the passing of the arguments
was changed to use a pointer, making it pass by reference rather than pass by
value.
Carl Love [Wed, 22 Apr 2015 16:17:06 +0000 (16:17 +0000)]
Add support for the TEXASRU register. This register contains summary
information on transactional memory instructions: the upper 32 bits of
the transaction information. Note, the Valgrind
implementation of transactional memory instructions is limited. Currently, the
contents of the TEXASRU register will always read as 0. The lower 64 bits of
the transaction information, in the TEXASR register, will contain the failure
information as set up by Valgrind.
The vex commit 3143 contains the changes needed to support the TEXASRU
register on PPC64.
The support requires changing the value of MAX_REG_WRITE_SIZE in
memcheck/mc_main.c from 1696 to 1712. The change is made in this
valgrind commit.
Add some stats to helgrind stats:
* nr of client malloc-ed blocks
* how many OldRef helgrind has, and the distribution
of these OldRef according to the nr of accs they have
Do RCEC_GC when approaching the max nr of RCEC, not when reaching it.
Otherwise, long running applications still see the max nr of RCEC
slowly growing, which increases the memory usage and
makes the (fixed) contextTab hash table slower to search.
Without this margin, the max could increase as the GC code
is not called at exactly the moment we reach the previous max,
but rather when a thread has run a bunch of basic blocks.
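The margin idea can be sketched as follows (illustrative constant; the real helgrind threshold differs):

```c
#include <stdbool.h>

/* Sketch of the policy: since the GC only runs when a thread has
   executed a bunch of basic blocks, trigger it when the live RCEC count
   approaches the previously reached maximum, leaving a safety margin
   so the max does not slowly creep upward between GC opportunities. */
#define RCEC_GC_MARGIN 1000UL   /* hypothetical margin value */

static bool should_do_RCEC_GC(unsigned long n_live, unsigned long prev_max)
{
   return n_live + RCEC_GC_MARGIN >= prev_max;
}
```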
increase function size even more (see r15095). On s390 this testcase
might use a relative load (e.g. via load address relative long(larl)
for the address) into the literal pool for some constants. 1280 seems
to be enough that the r/o data is copied along with the function.
Carl Love [Mon, 20 Apr 2015 23:38:33 +0000 (23:38 +0000)]
Add support for the lbarx, lharx, stbcx. and sthcx. instructions.
One of the expect files was missing. Also found that there
was a bug in the stq, stqcx, lq and lqarx instructions for LE.
The VEX commit for the instruction fix was 3138.
This commit updates the expect files for the corrected instructions
and adds the missing expect files.
The bugzilla for the original issue of the missing instructions
is 346324.
This patch changes the policy that does the GC of OldRef and RCEC
conflict cache size.
The current policy is:
A 'more or less' LRU policy is implemented by giving
to each OldRef a generation nr in which it was last touched.
A new generation is created every 50000 new accesses.
GC is done when the nr of OldRef reaches --conflict-cache-size.
The GC consists of removing enough generations to free
half of the entries.
After GC of OldRef, the RCEC (Ref Counted Exe Contexts)
not referenced anymore are GC-ed.
The new policy is:
An exact LRU policy is implemented using a doubly linked list
of OldRef.
When reaching --conflict-cache-size, the LRU entry is re-used.
The unreferenced RCEC are GC-ed when less than 75% of the RCEC
are referenced, and the nr of RCEC is 'big' (at least half the
size of the contextTab, and at least the max nr of RCEC reached
previously).
(Note: we tried to directly recover an unreferenced RCEC when recovering
the LRU OldRef, but that gave a lot of re-creation of RCEC.)
The new policy has the following advantages/disadvantages:
1. It is faster (at least for big applications)
On a firefox startup/exit, we gain about 1m30s out of 11m.
Similar 5..10% speed up encountered on other big applications
or on the new perf/memrw test.
The speed increase depends on the amount of memory
touched by the application. For applications with a
working set fitting in conflict-cache-size, the new policy
might be marginally slower than the previous policy on platforms
having a small cache: the current policy only sets a generation
nr when an address is re-accessed, while the new policy
has to unchain and rechain the OldRef access in the LRU
doubly linked list.
2. It uses less memory (at least for big applications)
Firefox startup/exit "core" arena max use decreases from
1175MB mmap-ed/1060MB alloc-ed
to
994MB mmap-ed/913MB alloc-ed
The decrease in memory is the result of having a lot fewer RCEC:
the current policy lets the nr of RCEC grow until the conflict
cache is GC-ed.
The new policy limits the nr of RCEC to 133% of the RCEC
really referenced. So, we end up with a much smaller max nr of RCEC
with the new policy: max RCEC 191000
versus 1317000, for a total nr of RCEC discard operations
that is almost the same: 33M versus 32M.
Also, the current policy allocates a big temporary array
to do the GC of OldRef.
With the new policy, size of an OldRef increases because
we need 2 pointers for the LRU doubly linked list, and
we need the accessed address.
In total, the OldRef increase is limited to one Word,
as we no longer need the gen, and the 'magic'
for sanity checks was removed (the check becomes somewhat
less needed, because an OldRef is never freed
anymore; also, we added a new cross-check between
the ga in the OldRef and the sparseWA key).
For applications using small memory and having
a small nr of different stack traces accessing memory,
the new policy causes an increase in memory (one Word
per OldRef).
3. Functionally, the new policy gives better past information:
once the steady state is reached (i.e. the conflict cache
is full), the new policy has always --conflict-cache-size
entries of past information.
The current policy has a nr of past information varying
between --conflict-cache-size/2 and --conflict-cache-size
(so in average, 75% of conflict-cache-size).
4. The new code is a little bit smaller/simpler:
The generation based GC is replaced by a simpler LRU policy.
So, in summary, this patch should allow big applications
to use less cpu/memory, while having very little
or no impact on memory/cpu of small applications.
Note that the OldRef data structure LRU policy
is not really explicitly tested by a regtest:
it is not easy at first sight to make such a test portable
across platforms/OSes/compilers/...
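The LRU mechanics described above can be modelled with an intrusive doubly linked list (hypothetical layout, not the actual helgrind OldRef structure):

```c
#include <stddef.h>

/* Minimal model of the new OldRef policy: an intrusive doubly linked
   list kept in LRU order.  Touching an entry unchains it and rechains
   it at the MRU end; when the cache is full, the entry at the LRU end
   is re-used.  Field layout is illustrative only. */
typedef struct OldRef {
   struct OldRef *prev, *next;
   unsigned long  ga;          /* guest address this entry describes */
} OldRef;

typedef struct { OldRef *lru, *mru; } OldRefList;

static void lru_unchain(OldRefList *l, OldRef *r)
{
   if (r->prev) r->prev->next = r->next; else l->lru = r->next;
   if (r->next) r->next->prev = r->prev; else l->mru = r->prev;
   r->prev = r->next = NULL;
}

static void lru_chain_mru(OldRefList *l, OldRef *r)
{
   r->prev = l->mru;
   r->next = NULL;
   if (l->mru) l->mru->next = r; else l->lru = r;
   l->mru = r;
}

/* On re-access: move the entry to the MRU end.  On overflow, the
   caller re-uses l->lru instead of allocating a new OldRef. */
static void lru_touch(OldRefList *l, OldRef *r)
{
   lru_unchain(l, r);
   lru_chain_mru(l, r);
}
```

This is the unchain/rechain cost mentioned in point 1: each access pays two pointer splices, where the old policy merely stamped a generation number.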
For ppc64, use the endianness of the running program, rather
than a hardcoded endianness
(this is because ppc64 supports 2 endiannesses, decided at runtime).
For mips, use BE if running on a non-mips system, otherwise
use the endianness of the running program
(this is because mips supports 2 endiannesses, but decided at compile time).
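A runtime endianness probe of the kind the ppc64 case needs can be sketched as follows (not the actual Valgrind code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a runtime endianness probe: store a known 32-bit value and
   look at its first byte in memory.  On a little-endian machine the
   least significant byte comes first. */
static bool running_little_endian(void)
{
   const uint32_t probe = 1;
   return *(const unsigned char *)&probe == 1;
}
```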
Fix 346307 - fuse filesystem syscall deadlocks.
Mark 2 additional syscalls as 'mayblock' when the fuse-compatible hint
is given.
Patch from aozgovde@ralota.com