Change heap_max_size() to improve performance of arena_for_chunk().
Instead of a complex calculation, using a simple mask operation to get the
arena base pointer. HEAP_MAX_SIZE should be larger than the huge page size,
otherwise heaps will use not huge pages.
On AArch64 this removes 6 instructions from arena_for_chunk(), and
bench-malloc-thread improves by 1.1% - 1.8%.