A trained ANN could become unreachable to workers even though
training succeeded: NEURAL_SPAM/NEURAL_HAM stopped firing while the
controller kept logging "ann ... is changed, our version = N, remote
version = M" indefinitely.
The root cause is a version regression, not a missing zset
registration. The new version was seeded from the in-memory set.ann,
and fill_set_ann resets set.ann.version to 0 whenever the worker has
not loaded an ANN (after a restart, or because the selected profile's
blob was missing). A worker that trained from the _4 profile
therefore saved its result as version 1, below the version it was
meant to supersede. process_existing_ann selects the highest version
among compatible profiles, so the live version-1 blob was shadowed by
the stale version-4 zset entry, whose blob key was missing. The
profile zset had no TTL, so the dead high-version tombstone never
expired, and because the _4 blob was never rewritten the condition
perpetuated itself.
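
The shadowing can be reproduced in a few lines. The sketch below is a
Python model of the selection rule described above, not the plugin's
Lua code; the Profile type, the select_highest helper and the key
names ann_4 and ann_1 are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    redis_key: str  # hypothetical blob key; the trailing _<n> is the version
    version: int


def select_highest(profiles):
    """Pick the highest-version profile without checking that its blob exists.

    This models the pre-fix behaviour as described in this message.
    """
    return max(profiles, key=lambda p: p.version)


# Profile zset contents: the stale version-4 entry plus the fresh version-1 entry.
profiles = [Profile("ann_4", 4), Profile("ann_1", 1)]

# Only the version-1 blob still exists; the _4 blob is gone.
live_blobs = {"ann_1"}

chosen = select_highest(profiles)
usable = chosen.redis_key in live_blobs
```

Here chosen is the version-4 entry and usable is False: the live
version-1 blob is never considered because a higher-numbered entry
without a blob always wins.
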
Three fixes:
1. Version monotonicity (lualib/plugins/neural.lua): seed the new
version from the profile actually trained from (the trained-from
key encodes its version as the trailing _<n>), taking the maximum
with the training_profile and set.ann versions, so the new entry
always outranks the profile it supersedes.
2. Liveness-aware selection (src/plugins/lua/neural.lua,
neural_maybe_invalidate.lua): when the selected profile's blob is
missing, fall back to the next compatible profile with a live blob
instead of going dark, and emit a throttled warning (previously a
silent debug line). The invalidate script also garbage-collects
profile entries that have no blob, have no training data, and are
older than a grace window.
3. Lifetime coupling (neural_save_unlock.lua,
src/plugins/lua/neural.lua): give the profile zset a TTL refreshed
each check_anns cycle, and refresh the blob TTL on every reload,
so an actively used ANN never expires out from under its entry.
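
The first two fixes can be sketched as follows. This is a Python
model under stated assumptions rather than the plugin's Lua
implementation: the helper names (version_from_key, next_version,
select_live) and the key names are hypothetical, and the +1 increment
on save is an assumption inferred from the version-0-to-version-1
behaviour described in the root cause.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class Profile:
    redis_key: str  # hypothetical blob key ending in _<n>
    version: int


def version_from_key(key: str) -> int:
    """Read the version from a trailing _<n>; return 0 if there is none."""
    match = re.search(r"_(\d+)$", key)
    return int(match.group(1)) if match else 0


def next_version(trained_from_key: str,
                 training_profile_version: int,
                 in_memory_version: int) -> int:
    """Fix 1: seed from the highest known version, then increment.

    The increment itself is an assumption; the message only states that
    the new entry must outrank the profile it supersedes.
    """
    base = max(version_from_key(trained_from_key),
               training_profile_version,
               in_memory_version)
    return base + 1


def select_live(profiles: Iterable[Profile],
                live_blobs: Set[str]) -> Optional[Profile]:
    """Fix 2: walk profiles from highest version down, skipping dead blobs."""
    for profile in sorted(profiles, key=lambda p: p.version, reverse=True):
        if profile.redis_key in live_blobs:
            return profile
    return None  # nothing usable; the caller would warn here


# A worker whose in-memory version was reset to 0 trains from the _4 profile.
new_version = next_version("ann_4", 0, 0)

# A version-4 tombstone no longer hides the live version-1 blob.
fallback = select_live([Profile("ann_4", 4), Profile("ann_1", 1)], {"ann_1"})
```

With these inputs new_version is 5 and fallback is the version-1
profile. Either fix alone breaks the loop described above: the first
stops new tombstone-shadowed entries from being created, the second
lets workers recover from ones that already exist.
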
Adds 330_neural/005_stale_version.robot, which injects a
higher-version tombstone and asserts inference recovers.