BUG/MEDIUM: debug: only dump Lua state when panicking
For a long time, we've tried to show the Lua state and backtrace when
dumping threads so as to be able to figure out if (and which) Lua code
was misbehaving, e.g. by performing expensive library calls. Since 3.1,
with commit 365ee28510 ("BUG/MINOR: hlua: prevent LJMP in
hlua_traceback()"), it appears that the approach is more fragile (though
that fix addressed a real out-of-memory issue), and it's possible to
occasionally observe crashes or CPU loops with "show threads" while
running Lua heavily. While users of "show threads" are rare, the
watchdog warnings, which were also enabled in 3.1, trigger these issues
as well, which is even more of a concern.
This patch takes the simple approach to address this for now: since the
purpose of the Lua backtrace was to help locate Lua call places upon a
panic, let's only emit the backtrace on panic and not in other
situations. After a panic we obviously don't care that the Lua stack
might be corrupted since it's never going to be resumed anyway. This may
be relaxed in the future if a solution is found to reliably produce
harmless Lua backtraces.
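The gist of the change can be sketched as follows. This is a standalone
illustration of the gating logic, not the actual haproxy code: apart from
TAINTED_PANIC mentioned below, the helpers, flag value and output here
are stand-ins chosen for the example.

    /* Standalone sketch: the thread dump only descends into the Lua
     * backtrace once the process is already panicking, i.e. when the
     * TAINTED_PANIC flag has been raised. get_tainted(), the flag value
     * and dump_lua_backtrace() are illustrative stand-ins, not the exact
     * haproxy symbols.
     */
    #include <stdio.h>

    #define TAINTED_PANIC 0x200u            /* illustrative value */

    static unsigned int tainted;            /* raised once, on panic */

    static unsigned int get_tainted(void)
    {
        return tainted;
    }

    static void dump_lua_backtrace(void)
    {
        /* may be unsafe if the Lua state is in an inconsistent state */
        printf("    Lua backtrace: ...\n");
    }

    static void dump_thread_state(void)
    {
        printf("Thread 1: ...\n");

        /* Only walk the Lua stack when already panicking: the state will
         * never be resumed, so a corrupted stack cannot cause more harm.
         * "show threads" and watchdog warnings skip it entirely.
         */
        if (get_tainted() & TAINTED_PANIC)
            dump_lua_backtrace();
    }

    int main(void)
    {
        dump_thread_state();        /* warning / "show threads" path */

        tainted |= TAINTED_PANIC;   /* the watchdog panics... */
        dump_thread_state();        /* ...and the dump now includes Lua */
        return 0;
    }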
The commit above was backported to all stable branches, so this patch
will be needed everywhere. However, TAINTED_PANIC only appeared in 2.8,
and given the rarity of this bug before 3.1, it's probably not worth
making any extra effort to go beyond 2.8.
It's easy enough to check whether a given version is subject to this
issue by running the following Lua code:
    local function stress(txn)
        for _, backend in pairs(core.backends) do
            for _, server in pairs(backend.servers) do
                local stats = server:get_stats()
            end
        end
    end

    core.register_fetches("stress", stress)
in the following config file:
    global
        stats socket /tmp/haproxy.stat level admin mode 666
        tune.lua.bool-sample-conversion normal
        lua-load-per-thread "stress.lua"

    listen stress
        bind :8001
        mode http
        timeout client 5s
        timeout server 5s
        timeout connect 5s
        http-request return status 200 content-type text/plain lf-string %[lua.stress()]
        server s1 127.0.0.1:8000
and stressing port 8001 with 100+ connections requesting / in a loop,
then issuing "show threads" on the CLI in a loop as well, using socat
(e.g. echo "show threads" | socat stdio unix-connect:/tmp/haproxy.stat).
It normally segfaults instantly (sometimes during the first "show").