Carl Love [Tue, 19 May 2015 16:08:05 +0000 (16:08 +0000)]
Fix for the HWCAP2 aux vector.
The support assumed that if HWCAP2 is present that the system also supports
ISA2.07. That assumption is not correct as we have found a few systems (OS)
where the HWCAP2 entry is present but the ISA2.07 bit is not set. This patch
fixes the assertion test to specifically check the ISA2.07 support bit setting
in the HWCAP2 and vex_archinfo->hwcaps variable. The setting for the
ISA2.07 support must be the same in both variables if the HWCAP2 entry exists.
This patch reduces the memory needed for the linesF.
Currently, each SecMap has an array of linesF, referenced by the linesZ
of the secmap that needs a lineF, via an index stored in dict[1].
When the array is full, its size is doubled.
The linesF array of a secmap is freed when the SecMap is GC-ed.
The above strategy has the following consequences:
A. in average, 25% of the LinesF are unused.
B. if a SecMap has 'temporarily' a need for linesF, but afterwards,
these linesF are converted to normal lineZ representation, the linesF
will not be recuperated unless the SecMap is GC-ed (i.e. fully marked
no access).
The patch replaces the linesF array private per SecMap
by a pool allocator of LinesF shared between all SecMap.
A lineZ that needs a lineF will directly point to its lineF (using a pointer
stored in dict[1]), instead of having in dict[1] the index in the SecMap
linesF array.
When a lineZ needs a lineF, it is allocated from the pool allocator.
When a lineZ does not need anymore a lineF, it is returned back to the
pool allocator.
On a firefox startup, the above strategy reduces the memory for linesF
by about 42Mb. It seems that the more firefox is used (e.g. to visit
a few websites), the bigger the memory gain.
After opening the home page of valgrind, wikipedia and google, the memory
gain is about 94Mb:
trunk:
linesF: 392,181 allocd ( 203,934,120 bytes occupied) ( 173,279 used)
patch:
linesF: 212,966 allocd ( 109,038,592 bytes occupied) ( 170,252 used)
There is also less alloc/free operations in core arena with the patch:
trunk:
core : 810,680,320/ 802,291,712 max/curr mmap'd, 17/19 unsplit/split sb unmmap'd, 759,441,224/ 703,191,896 max/curr, 40631760/16376828248 totalloc-blocks/bytes, 188015696 searches 8 rzB
patch:
core : 701,628,416/ 690,753,536 max/curr mmap'd, 12/29 unsplit/split sb unmmap'd, 643,041,944/ 577,793,712 max/curr, 32050040/14056017712 totalloc-blocks/bytes, 174097728 searches 8 rzB
In terms of performance, no CPU impact detected on Firefox startup.
Note we have no representative reproducible (and preferrably small)
perf test that uses extensively linesF. Firefox is a good heavy lineF
user but is far to be reproducible, and is very far to be small.
Theoretically, in terms of CPU performance, the patch might have some
small benefits here and there for read operations, as the lineF pointer
is directly retrieved from the lineZ, rather than retrieved via an indirection
in the linesF array.
For write operations, the patch might need a little bit more CPU,
as we replace an
assignment to lineF inUse boolean to False (and then probably back to True
when the cacheline is written back)
by
a call to pool allocator VG_(freeEltPA) (and then probably a call to
VG_(allocEltPA) when the cacheline is written back).
These PA functions are small, so cost should be ok.
We might however still maintain in clear_LineF_of_Z the last cleared lineF
and re-use it in alloc_LineF_for_Z. Not sure how many calls to the PA functions
would be avoided by this '1 elt cache' (and the needed 'if elt == NULL'
check in both clear_LineF_of_Z and alloc_LineF_for_Z.
This possible optimisationwill be looked at later.
Carl Love [Fri, 15 May 2015 20:09:05 +0000 (20:09 +0000)]
Patch 5 in a revised series of cleanup patches from Will Schmidt
Add a .exp for the pth_cond_destroy_busy for PPC64 big endian.
This is specifically to cover the last line of output as
seen on ppc64BE, which is "ERROR SUMMARY: X errors from 3 contexts",
where X is 6, versus 3 as seen on other architectures.
The additional errors show up on BE during the "Thread #1: pthread_cond
_destroy: destruction of condition variable being waited upon."
Signed-off-by: Will Schmidt <will_schmidt@vnet.ibm.com>
This patch fixes Vagrind bugzilla 347686
Carl Love [Fri, 15 May 2015 16:50:06 +0000 (16:50 +0000)]
Patch 6 in a revised series of cleanup patches from Will Schmidt
Fix multipleinheritance heuristic for ppc64LE (leak_cpp_interior test).
Adjust the PPC64 #ifdiffery to indicate that ppc64BE uses a thunk table,
but ppc64LE (in particular, the ELF ABIV2) does not. In this case, thunk
table == function descriptors.
Signed-off-by: Will Schmidt <will_schmidt@vnet.ibm.com>
--
This patch replaces the previously posted "[6/7] add leak_cpp_interior
test .exp results ....."
This patch (re-)gains performance in helgrind, following revision 15207, that
reduced memory use doing SecMap GC, but was slowing down some workloads
(typically, workloads doing a lot of malloc/free).
A significant part of the slowdown came from the clear of the filter,
that was not optimised for big ranges : the filter was working byte
per byte till an 8 alignment. Then working per 8 bytes at a time.
With the patch, the filter clear is done the following way:
* all the bytes till 8 alignement are done together
* then 8 bytes at a time till filter_line alignment (32 bytes)
* then 32 bytes at a time.
Moreover, as the filter cache is small (1024 lines of 32 bytes),
clearing filter for ranges bigger than 32Kb was uselessly checking
several times the same entry. This is now avoided by using a range
check rather than a tag equality check.
As the new filter clear is significanly more complex than the previous simple
algorithm, the old algorithm is kept and used to check the new algorithm
when CHECK_ZSM is defined as 1.
The patch also contains a few micro optimisations and
disables
// VG_(track_die_mem_stack) ( evh__die_mem );
as this had no effect and was somewhat costly.
With this patch, we have almost reached for all perf tests the same
performance as we had before revision 15207. Some tests are still
slightly slower than before the SecMap GC (max 2% difference).
Some tests are now significantly faster (e.g. sarp).
For almost all tests, we are now faster than valgrind 3.10.1.
Details below.
Regtested on x86/amd64/ppc64 (and regtested with all compile time
checks set).
I have also regtested with libreoffice and firefox.
(with firefox, also with CHECK_ZSM set to 1).
Details about performance:
hgtrace = this patch
trunk_untouched = trunk
base_secmap = trunk before secmap GC
valgrind 3.10.1 included for comparison
Measured on core i5 2.53GHz
Carl Love [Thu, 14 May 2015 21:52:59 +0000 (21:52 +0000)]
Patch 4 in a revised series of cleanup patches from Will Schmidt
Add a suppression to handle a "Jump to the invalid address..." message
that gets generated on power. This is a variation of the existing
suppressions.
While here, I also updated the "prog:" line in the vgtest file to reference
the supp_unknown executable, versus the badjump executable. They share the
same source code, so I think this is effectively cosmetic.
Add the lwpid in the scheduler status information
E.g. we now have:
Thread 1: status = VgTs_Runnable (lwpid 15782)
==15782== at 0x8048EB5: main (sleepers.c:188)
client stack range: [0xBE836000 0xBE839FFF] client SP: 0xBE838F80
valgrind stack top usage: 10264 of 1048576
Thread 2: status = VgTs_WaitSys (lwpid 15828)
==15782== at 0x2E9451: ??? (syscall-template.S:82)
==15782== by 0x8048AD3: sleeper_or_burner (sleepers.c:84)
==15782== by 0x39B924: start_thread (pthread_create.c:297)
==15782== by 0x2F107D: clone (clone.S:130)
client stack range: [0x442F000 0x4E2EFFF] client SP: 0x4E2E338
valgrind stack top usage: 2288 of 1048576
This allows to attach with GDB to the good lwpid in case
you want to examine the valgrind state rather than the guest state.
(it is needed to attach to the specific lwpid as valgrind is not
linked with lib pthread, so GDB cannot discover the threads
of the process).
Implement 'qXfer:exec-file:read' packet in Valgrind gdbserver.
Thanks to this packet, with recent GDB (>= 7.9.50.20150514-cvs), the
command 'target remote' will automatically load the executable file of
the process running under Valgrind. This means you do not need to
specify the executable file yourself, GDB will discover it itself.
See GDB documentation about 'qXfer:exec-file:read' packet for more
info.
Carl Love [Wed, 13 May 2015 21:46:47 +0000 (21:46 +0000)]
Patch 2 in a revised series of cleanup patches from Will Schmidt
Add a deep-D test .exp values for ppc64.
Depending on the system and the systems endianness, there are variances
in the library reference, and to the specific line number in the library.
I was able to add and modify existing filters to cover most of the variations,
but did need to add a .exp to cover the additional call stack entry as seen
on power.
This change allows the ppc64 targets to pass the massif/deep-D test.
Carl Love [Wed, 13 May 2015 21:10:12 +0000 (21:10 +0000)]
Patch 1 in a revised series of cleanup patches from Will Schmidt
Update the massif/big-alloc test for ppc64*.
In comparison to the existing .exp files, the time,total,extra-heap
values generated on ppc64* vary from the other architectures.
This .exp allows the ppc64 targets to pass the test.
* avoid indirection via function pointers to call SVal__rcinc and SVal__rcdec
* declare these functions inlined
* transform 2 asserts on hot path in conditionally compiled checks
on CHECK_ZSM
This slightly optimises some perf tests with helgrind
Improves the way arena statistics are shown
The mmap'd max/curr and max/curr nr of bytes will be shown e.g. as
11,440,408/ 4,508,968
instead of 11440656/ 4509200
So, using more space, but more readable (in particular when the
nr exceeds the width, and so are not aligned anymore)
This patch decreases the memory used by the helgrind SecMap,
by implementing a Garbage Collection for the SecMap.
The basic change is that freed memory is marked as noaccess
(while before, it kept the previous marking, on the basis that
non buggy applications are not accessing freed memory in any case).
Keeping the previous marking avoids the CPU/memory changes needed
to mark noaccess.
However, marking freed memory noaccess and GC the secmap reduces
the memory on big apps.
For example, a firefox test needs 220Mb less (on about 2.06 Gb).
Similar reduction for libreoffice batch (260 MB less on 1.09 Gb).
On such applications, the performance with the patch is similar to the trunk.
There is a performance decrease for applications that are doing
a lot of malloc/free repetitively: e.g. on some perf tests, an increase
in cpu of up to 15% has been observed.
Several performance optimisations can be done afterwards to not loose
too much performance. The decrease of memory is expected to produce
in any case significant benefit in memory constrained environments
(e.g. android phones).
So, after discussion with Julian, it was decided to commit as-is
and (re-)gain (part of) performance in follow-up commits.
Add some cfi directives in the code doing syscall (by Valgrind).
This allows to attach to Valgrind when VAlgrind is blocked in a syscall
and have GDB producing a stacktrace, rather than being unable
to unwind.
I.e. instead of having:
(gdb) bt
#0 0x380460f2 in do_syscall_WRK ()
(gdb)
with the directives, we obtain:
(gdb) bt
#0 vgPlain_mk_SysRes_x86_linux (val=1) at m_syscall.c:65
#1 vgPlain_do_syscall (sysno=168, a1=944907996, a2=1, a3=4294967295, a4=0, a5=0, a6=0, a7=0, a8=0) at m_syscall.c:791
#2 0x38031986 in vgPlain_poll (fds=0x385226dc <remote_desc_pollfdread_activity>, nfds=1, timeout=-1) at m_libcfile.c:535
#3 0x3807479f in vgPlain_poll_no_eintr (fds=0x385226dc <remote_desc_pollfdread_activity>, nfds=1, timeout=-1)
at m_gdbserver/remote-utils.c:86
#4 0x380752f0 in readchar (single=4096) at m_gdbserver/remote-utils.c:938
#5 0x38075ae3 in getpkt (buf=0x61f35020 "") at m_gdbserver/remote-utils.c:997
#6 0x38076fcb in server_main () at m_gdbserver/server.c:1048
#7 0x38072af2 in call_gdbserver (tid=1, reason=init_reason) at m_gdbserver/m_gdbserver.c:721
#8 0x380735ba in vgPlain_gdbserver (tid=1) at m_gdbserver/m_gdbserver.c:788
#9 0x3802c6ef in do_actions_on_error (allow_db_attach=<optimized out>, err=<optimized out>) at m_errormgr.c:532
#10 pp_Error (err=0x61f580e0, allow_db_attach=1 '\001', xml=1 '\001') at m_errormgr.c:644
#11 0x3802cc34 in vgPlain_maybe_record_error (tid=1643479264, ekind=8, a=2271560481, s=0x0, extra=0x62937f1c)
at m_errormgr.c:851
#12 0x38028821 in vgMemCheck_record_free_error (tid=1, a=2271560481) at mc_errors.c:836
#13 0x38007b65 in vgMemCheck_free (tid=1, p=0x87654321) at mc_malloc_wrappers.c:496
#14 0x3807e261 in do_client_request (tid=1) at m_scheduler/scheduler.c:1840
#15 vgPlain_scheduler (tid=1) at m_scheduler/scheduler.c:1406
#16 0x3808b6b2 in thread_wrapper (tidW=<optimized out>) at m_syswrap/syswrap-linux.c:102
#17 run_a_thread_NORETURN (tidW=1) at m_syswrap/syswrap-linux.c:155
#18 0x00000000 in ?? ()
(gdb)
Carl Love [Wed, 6 May 2015 21:11:35 +0000 (21:11 +0000)]
Patch 8 in a series of cleanup patches from Will Schmidt
Add a helper script to determine if the platform is ppc64le.
This is specifically used to help exclude the 32-bit tests from being
run on a ppc64LE (ABIV2) platform. The 32-bit targets, specifically ppc32/*
is not built on LE.
Carl Love [Wed, 6 May 2015 19:44:14 +0000 (19:44 +0000)]
Patch 2 in a series of cleanup patches from Will Schmidt
Adjust the badjump2 test for ppc64le/ABIV2. Under the ABIV2 there
is no function descriptor, so the fn[] setup does not apply.
This fixes the badjump2 test failure as seen on ppc64le.
* Out of memory message was using 'bytes have already been allocated.'
while this nr is in fact the total anonymously mmap-ed.
Change the message so as to reflect the shown number.
* Show also the total anonymous mmaped in non OOM memory statistics
This patch reduces the memory needed for a VtsTE by 25% (one word)
on 32 bits platforms. No memory reduction on 64 bits platforms,
due to alignment.
The patch also shows the vts stats when showing the helgrind stats.
The perf/memrw.c perf test gets also some new additional features
allowing e.g. to control the size of the read or written blocks.
This patch adds a function that allows to directly properly size an xarray
when the size is known in advance.
3 places identified where this function can be used trivially.
The result is a reduction of 'realloc' operations in core
arena, and a small reduction in ttaux arena
(it is the nr of operations that decreases, the memory usage itself
stays the same (ignoring some 'rounding' effects).
E.g. for perf/bigcode 0, we change from
core 1085742/ 216745904 totalloc-blocks/bytes, 1085733 searches
ttaux 5348/ 6732560 totalloc-blocks/bytes, 5326 searches
to
core 712666/ 190998592 totalloc-blocks/bytes, 712657 searches
ttaux 5319/ 6731808 totalloc-blocks/bytes, 5296 searches
For bz2, we switch from
core 50285/ 32383664 totalloc-blocks/bytes, 50256 searches
ttaux 670/ 245160 totalloc-blocks/bytes, 669 searches
to
core 32564/ 29971984 totalloc-blocks/bytes, 32535 searches
ttaux 605/ 243280 totalloc-blocks/bytes, 604 searches
Performance wise, on amd64, this improves memcheck performance
on perf tests by 0.0, 0.1 or 0.2 seconds depending on the test.
Rename write variable to avoid a warning:
memrw.c:37: warning: declaration of ‘write’ shadows a global declaration
/usr/include/unistd.h:333: warning: shadowed declaration is here
DW_CFA_def_cfa_expression: don't push the CFA on the stack before
evaluation starts. For DW_CFA_val_expression and DW_CFA_expression
doing so is correct, but not for DW_CFA_def_cfa_expression.
Back out most of r15145 which reports bug fixes for various altivec insns.
Either those bugs have been fixed looong time ago, or the reporter ran
on a host without altivec capabilities, or those insns were actually
e500 insns which are not supported at all at this point.
Follow up on VEX r3144 and remove VexGuestTILEGXStateAlignment.
Also fix the alignment check which should be mod 16 not mod 8.
Well, actually, it should be mod LibVEX_GUEST_STATE_ALIGN but
that is another patch.
Fix BZ #342683. Based on patch by Ivo Raisr.
What this does is to make sure that the initial client data segment
is marked as unaddressable. This is consistent with the behaviour of
brk when the data segment is shrunk. The "freed" memory is marked
as unaddressable.
Special tweaks were needed for s390 which was returning early from
the funtion to avoid sloppy register definedness initialisation.
Replace adler32 by sdbm_hash in m_deduppoolalloc.c
adler32 is not very good as a hash function.
sdbm_hash gives more different keys that adler32,
and in a large majority of the cases, shorter chains.
Fix an assertion in the address space manager. BZ #345887.
The VG_(extend_stack) call needs to be properly guarded because the
passed-in address is not necessarily part of an extensible stack
segment. And an extensible stack segment is the only thing that
function should have to deal with.
Previously, the function VG_(am_addr_is_in_extensible_client_stack)
was introduced to guard VG_(extend_stack) but it was not added in all
places it should have been.
Also, extending the client stack during signal delivery (in sigframe-common.c)
was simply calling VG_(extend_stack) hoping it would do the right thing.
But that was not always the case. The new testcase
none/tests/linux/pthread-stack.c exercises this (3.10.1 errors out on it).
Renamed ML_(sf_extend_stack) to ML_(sf_maybe_extend_stack) and add
proper guard logic for VG_(extend_stack).
Testcases none/tests/{amd64|x86}-linux/bug345887.c by Ivo Raisr.
Carl Love [Wed, 22 Apr 2015 21:17:48 +0000 (21:17 +0000)]
There is an ABI change in how the PPC64 gcc compiler handles 128 bit arguments
are aligned with GCC 5.0. The compiler generates a "note" about this starting
with GCC 4.9. To avoid generating the "note", the passing of the arguments
were changed to a pointer to make it pass by reference rather then pass by
value.
Carl Love [Wed, 22 Apr 2015 16:17:06 +0000 (16:17 +0000)]
Add support for the TEXASRU register. This register contains information on
transactional memory instruction summary information. This register contains
the upper 32-bits of the transaction information. Note, the valgrind
implementation of transactional memory instructions is limited. Currently, the
contents of the TEXASRU register will always return 0. The lower 64-bits of
the trasnaction information in the TEXASR register will contain the failure
information as setup by Valgrind.
The vex commit 3143 contains the changes needed to support the TEXASRU
register on PPC64.
The support requires changing the value of MAX_REG_WRITE_SIZE in
memcheck/mc_main.c from 1696 to 1712. The change is made in this
valgrind commit.