Eric Wong [Thu, 26 Oct 2023 08:20:06 +0000 (08:20 +0000)]
git: cleanup un-associated coderepo processes
It's possible to have many coderepos with no inbox association
that never see git->cleanup. So instead of tying git->cleanup
to inboxes, ensure it gets armed when ->watch_async is called
(since it's only called in our -netd or -httpd servers).
Eric Wong [Wed, 25 Oct 2023 15:33:49 +0000 (15:33 +0000)]
cindex: fix large prunes
When comm(1) has a lot of data to output, we must ensure we
explicitly close FDs of processes in previous stages of the
pipeline so that comm(1) terminates properly.
This is difficult to test automatically with small test repos...
Fixes: 17b06aa32aac (cindex: start using run_await to simplify code)
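The underlying rule can be shown with a single pipe: a reader only sees EOF once every copy of the write end is closed, including the copy the parent still holds after fork. The following is a minimal sketch of that rule, not the cindex pipeline itself; `count_child_lines' is a hypothetical helper and the forked writer merely stands in for an earlier pipeline stage.

```perl
use strict;
use warnings;
use autodie qw(pipe fork close);
use POSIX ();

# Hypothetical helper: count the lines produced by a forked writer.
sub count_child_lines {
	my ($nr) = @_;
	pipe(my $r, my $w);
	my $pid = fork;
	if ($pid == 0) { # child: stands in for a previous pipeline stage
		close $r;
		print $w "line $_\n" for (1..$nr);
		close $w;
		POSIX::_exit(0);
	}
	# Parent: without this close, the loop below would never see EOF
	# since a write end of the pipe would remain open in this process.
	close $w;
	my $seen = 0;
	while (defined(my $line = <$r>)) { $seen++ }
	close $r;
	waitpid($pid, 0);
	$seen;
}
```

With enough output to exceed the pipe buffer (e.g. `count_child_lines(50000)'), the writer blocks until the parent reads, which is the situation where a forgotten close is most likely to turn into a hang.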
While uncommon, some git repos have hundreds of thousands of
refs and slurping that output into memory can bloat the heap.
Introduce a sha_all sub in PublicInbox::SHA to loop until EOF
and rely on autodie for checking sysread errors.
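A loop of this shape might look roughly like the sketch below. It is not the actual PublicInbox::SHA code: `sha_all_sketch', the use of Digest::SHA directly, and the 64K buffer size are all assumptions made for illustration.

```perl
use strict;
use warnings;
use autodie qw(sysread);
use Digest::SHA;

# Hypothetical stand-in for sha_all: feed a handle to a digest in
# fixed-size chunks instead of slurping everything into memory.
sub sha_all_sketch {
	my ($nbits, $fh) = @_;
	my $dig = Digest::SHA->new($nbits);
	my $buf;
	# sysread returns 0 at EOF, ending the loop; autodie throws if it
	# returns undef (a read error), so no explicit check is needed.
	while (sysread($fh, $buf, 65536)) {
		$dig->add($buf);
	}
	$dig;
}
```

Memory use stays bounded by the buffer size no matter how many refs are in the input, and `sha_all_sketch(256, $fh)->hexdigest' yields the same digest as hashing the whole string at once.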
Eric Wong [Wed, 25 Oct 2023 00:29:49 +0000 (00:29 +0000)]
cindex: use sysread for generating fingerprint
We use sysseek for this file handle elsewhere (since it's passed
to `git rev-list --stdin' multiple times), and sysread ensures
we can use a larger read buffer than the tiny 8K BUFSIZ Perl +
glibc is constrained to.
This also ensures we autodie on sysread failures, since the
autodie import for `read' was missing and we don't call `read'
anywhere else in this file.
Eric Wong [Wed, 25 Oct 2023 00:29:46 +0000 (00:29 +0000)]
cindex: use run_await to read extensions.objectFormat
This saves us the trouble of seeking ourselves by using existing
run_await functionality. We'll also be more robust to ensure we
only handle the result if the `git config' process exited without
a signal.
Eric Wong [Wed, 25 Oct 2023 00:29:45 +0000 (00:29 +0000)]
cindex: start using run_await to simplify code
This saves us some awaitpid calls. We can also start passing
hashref redirect elements directly to pipe and open perlops,
saving us the trouble of naming some variables.
Eric Wong [Wed, 25 Oct 2023 00:29:44 +0000 (00:29 +0000)]
cindex: use timer for inits
We'll need to be in the event loop to use run_await in parallel,
so we can't start processes outside of it. This change isn't
ideal, but it likely keeps the rest of our (hotter) code simpler.
Eric Wong [Wed, 25 Oct 2023 00:29:43 +0000 (00:29 +0000)]
cindex: avoid awaitpid for popen
We can use popen_rd to pass command and callbacks to a
callback sub. This is another step which may allow us
to get rid of the wantarray forms of popen_rd/popen_wr
in the future.
Eric Wong [Wed, 25 Oct 2023 00:29:41 +0000 (00:29 +0000)]
qspawn: simplify internal argument passing
Now that psgi_return is gone, we can further simplify our
internals to support only psgi_qx and psgi_yield. Internal
argument passing is reduced and we keep the command env and
redirects in the Qspawn object for as long as it's alive.
I wanted to get rid of finalize() entirely, but it seems
trickier to do when having to support generic PSGI.
Eric Wong [Wed, 25 Oct 2023 00:29:39 +0000 (00:29 +0000)]
drop psgi_return, httpd/async and GetlineBody
Now that psgi_yield is used everywhere, the more complex
psgi_return and its helper bits can be removed. We'll also fix
some outdated comments now that everything on psgi_return has
switched to psgi_yield. GetlineResponse replaces GetlineBody
and does a better job of isolating generic PSGI-only code.
Eric Wong [Wed, 25 Oct 2023 00:29:32 +0000 (00:29 +0000)]
qspawn: introduce new psgi_yield API
This is intended to replace psgi_return and HTTPD/Async
entirely, hopefully making our code less convoluted while
maintaining the ability to handle slow clients on
memory-constrained systems.
This was made possible by the philosophy shift in commit 21a539a2df0c
(httpd/async: switch to buffering-as-fast-as-possible, 2019-06-28).
We'll still support generic PSGI via the `pull' model with a
GetlineResponse class which is similar to the old GetlineBody.
Eric Wong [Wed, 25 Oct 2023 00:29:30 +0000 (00:29 +0000)]
httpd/async: require IO arg
Callers that want to requeue can call PublicInbox::DS::requeue
directly and not go through the convoluted argument handling
via PublicInbox::HTTPD::Async->new.
Eric Wong [Wed, 25 Oct 2023 00:29:26 +0000 (00:29 +0000)]
psgi_qx: use a temporary file rather than pipe
A pipe requires more context switches, syscalls, and code to
deal with unpredictable pipe EOF vs waitpid ordering. So just
use the new spawn/aspawn features to automatically handle
slurping output into a string.
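The general idea can be sketched as follows, without using any of the spawn/aspawn internals: point the child's stdout at an unlinked temporary file, wait for it, then read the file. `qx_via_tmpfile' is a hypothetical helper written for this illustration and is not the psgi_qx implementation.

```perl
use strict;
use warnings;
use autodie qw(fork seek read);
use POSIX ();

# Hypothetical helper: run @cmd and return (output, wait status).
sub qx_via_tmpfile {
	my (@cmd) = @_;
	# literal undef as the third arg requests an anonymous temp file
	CORE::open(my $tmp, '+>', undef) or die "open: $!";
	my $pid = fork;
	if ($pid == 0) { # child: stdout goes to the temporary file
		CORE::open(STDOUT, '>&', $tmp) or POSIX::_exit(126);
		exec { $cmd[0] } @cmd or POSIX::_exit(127);
	}
	waitpid($pid, 0);
	my $status = $?;
	# The child has exited, so all of its output is already in the
	# file; there is no pipe EOF vs. waitpid ordering to handle.
	seek($tmp, 0, 0);
	my $out = '';
	my $size = -s $tmp;
	read($tmp, $out, $size) if $size;
	($out, $status);
}
```

For example, `qx_via_tmpfile($^X, '-e', 'print "hello\n"')' returns the child's output and its wait status after a single waitpid, with no read loop running concurrently with the child.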
Eric Wong [Wed, 25 Oct 2023 00:29:25 +0000 (00:29 +0000)]
spawn: support synchronous run_qx
This is similar to `backtick` but supports all our existing spawn
functionality (chdir, env, rlimit, redirects, etc.). It also
supports SCALAR ref redirects like run_script in our test suite
for std{in,out,err}.
We can probably use :utf8 by default for these redirects, even.
Eric Wong [Wed, 25 Oct 2023 00:29:24 +0000 (00:29 +0000)]
limiter: split out from qspawn
It's slightly better organized this way, especially since
`publicinboxLimiter' has its own user-facing config section
and knobs. I may use it in LeiMirror and CodeSearchIdx for
process management.
Eric Wong [Thu, 19 Oct 2023 01:14:31 +0000 (01:14 +0000)]
lei: simplify startq/au_done wakeup notifications
We only need to write one byte at MUA start instead of a byte
for every LeiXSearch worker. Also, make sure it succeeds by
enabling autodie for syswrite.
When reading, we can rely on `:perlio' layer `read' semantics
to retry on EINTR to avoid looping and other error checking.
Eric Wong [Tue, 17 Oct 2023 23:38:05 +0000 (23:38 +0000)]
test_common: only hide TCP port in messages
v2:// lei outputs are on the filesystem, so putting $HOST:$PORT
is nonsensical. We'll also keep `127.0.0.1' or `[::1]' since
it's harmless and can point out obvious errors in system
configuration when testing with old Perls or libraries.
Eric Wong [Tue, 17 Oct 2023 23:37:58 +0000 (23:37 +0000)]
xt/git-http-backend: remove Net::HTTP usage
HTTP::Tiny has been part of the Perl standard library since Perl 5.14
while Net::HTTP has never been (unlike Net::NNTP or Net::POP3).
For the test which forces server-side buffering, we'll just use
a regular socket handle.
Eric Wong [Tue, 17 Oct 2023 23:37:54 +0000 (23:37 +0000)]
xap_helper: die more easily in both implementations
We don't need to tolerate bad requests since it's only handling
requests from the parent process. So simplify error management
and just die||exit if we get a bad request.
Eric Wong [Tue, 17 Oct 2023 23:37:52 +0000 (23:37 +0000)]
use read_all in more places to improve safety
`readline' ops may not detect errors on partial reads.
This saves us some code to reduce cognitive overhead for
readers. We'll also support reusing a destination buffer so it
can work more nicely with existing code.
Eric Wong [Tue, 17 Oct 2023 23:37:49 +0000 (23:37 +0000)]
git: introduce read_all function
This makes it easier to improve error checking, since the
`do { local $/; readline(FH) }' construct does not detect
errors (autodie does not cover `readline' or `<FH>').
I'm not sure exactly where this should be, but PublicInbox::Git
is used nearly everywhere in our code base and it's probably
not worth creating a new package for it.
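To illustrate why an explicit loop helps, here is a minimal sketch of a slurp built on `read', which autodie does cover. It is not the PublicInbox::Git::read_all code; the name `read_all_sketch', its single-argument signature, and the chunk size are assumptions.

```perl
use strict;
use warnings;
use autodie qw(read);

# Hypothetical slurp helper: unlike `do { local $/; readline(FH) }',
# every underlying read is checked because autodie wraps `read'.
sub read_all_sketch {
	my ($fh) = @_;
	my ($buf, $off) = ('', 0);
	# append each chunk at $off; read returns 0 at EOF
	while (my $r = read($fh, $buf, 65536, $off)) {
		$off += $r;
	}
	$buf;
}
```

A failed read now raises an exception instead of silently yielding a short string.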
Eric Wong [Tue, 17 Oct 2023 23:37:46 +0000 (23:37 +0000)]
lei_mirror: start converting to autodie
This code is too noisy and not critical for startup performance;
so autodie provides a nice noise reduction while improving error
reporting in most cases.
For places where failures are expected, the `CORE::' prefix
gives us an easy escape hatch to fall back to normal error
checking.
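As a small illustration of that escape hatch (a sketch only, not code from LeiMirror), a hypothetical `unlink_if_exists' helper can keep autodie enabled for the file while tolerating ENOENT in one place:

```perl
use strict;
use warnings;
use autodie qw(unlink);
use Errno qw(ENOENT);

# Hypothetical helper: returns 1 if the file was removed and 0 if it
# was already gone; any other failure is fatal.
sub unlink_if_exists {
	my ($path) = @_;
	# CORE::unlink bypasses autodie, so the expected ENOENT case
	# can be handled with ordinary error checking
	return 1 if CORE::unlink($path);
	return 0 if $! == ENOENT;
	die "unlink($path): $!";
}
```

Plain `unlink' calls elsewhere in the same file still die on failure.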
Eric Wong [Tue, 17 Oct 2023 10:11:06 +0000 (10:11 +0000)]
input_pipe: handle noncanonical TTY
lei could get a TTY in noncanonical mode for stdin, so rely on
VMIN+VTIME to get the desired non-blocking semantics we'd expect
from a pipe or socket. This ought to prevent read(2) (Perl sysread)
from returning zero when we really want to hit EAGAIN.
Eric Wong [Tue, 17 Oct 2023 10:11:05 +0000 (10:11 +0000)]
input_pipe: improve error handling
Ensure the callback is always guarded by `eval' to catch
exceptions and to force a ->close (EPOLL_CTL_DEL).
We also don't want to blindly set O_NONBLOCK on TTYs since their
O_NONBLOCK semantics aren't well-defined by POSIX. We can also
drop EPOLLET (edge-triggered) use to reduce the need to make
->requeue calls on our end.
Eric Wong [Tue, 17 Oct 2023 10:11:04 +0000 (10:11 +0000)]
lei: consolidate stdin slurp, fix warnings
We can share more code amongst stdin slurper (not streaming)
commands. This also fixes uninitialized variable warnings when
feeding an empty stdin to these commands.
Eric Wong [Sun, 15 Oct 2023 08:16:28 +0000 (08:16 +0000)]
learn: respect indexlevel for v1 inboxes
v2 never suffered from this bug, apparently, but -learn didn't
seem able to handle indexlevel=basic (nor respect `medium')
for v1 inboxes. I only noticed this bug because I converted
some ancient v1 inboxes to `basic' to save space.
Eric Wong [Fri, 13 Oct 2023 06:12:29 +0000 (06:12 +0000)]
xap_helper_cxx: allow sharing XDG_CACHE_HOME across ABIs
For users sharing home directories (or just XDG_CACHE_HOME)
across hosts of different architectures, we must use a compiler
and architecture-specific destination directory for storing the
binary result. Even on the same OS and architecture, different
C++ compilers may have different ABIs, so we must account for
that.
Eric Wong [Thu, 12 Oct 2023 00:21:00 +0000 (00:21 +0000)]
lei: quiet excessive write/seen messages
We don't want to end up dumping nr_seen/nr_write when progress
is disabled, nor do we want forked-off `lei note-event' workers to
dump them when DS->Reset is called on fork.
Eric Wong [Wed, 11 Oct 2023 07:20:57 +0000 (07:20 +0000)]
lei import|tag|rm: support --commit-delay=SECONDS
Delayed commits allow users to trade off immediate safety for
throughput and reduced storage wear when running multiple
discrete commands.
This feature is currently useful for providing a way to make
t/lei-store-fail.t reliable and for ensuring `lei blob' can
retrieve messages which have not yet been committed.
In the future, it'll also be useful for the FUSE layer to batch
git activity.
Eric Wong [Wed, 11 Oct 2023 07:20:55 +0000 (07:20 +0000)]
import: cat_blob is a no-op w/o live fast-import
cat_blob is a fallback for handling files which haven't made it
onto disk to be readable by `git cat-file'. Thus spawning a new
fast-import process to retrieve a blob is pointless, as cat_blob
is only used as a last resort when `git cat-file' fails.
Eric Wong [Wed, 11 Oct 2023 07:20:54 +0000 (07:20 +0000)]
import: switch to Unix stream socket for fast-import
We use fewer file descriptors and fewer lines of code this way.
I'm not aware of any place we rely on POSIX pipe semantics with
`git fast-import', and sockets have bigger buffers by default
in most cases (even if Linux allows larger pipe buffers).
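The descriptor savings come from a socket being bidirectional: one socketpair replaces the two pipes otherwise needed for a child's stdin and stdout. The sketch below shows the arrangement with a stand-in child command; `roundtrip_via_socket' is a hypothetical helper, `git fast-import' is not run here, and the write-then-read sequence is only suitable for small inputs.

```perl
use strict;
use warnings;
use autodie qw(socketpair fork close shutdown sysread);
use Socket qw(AF_UNIX SOCK_STREAM);
use IO::Handle; # for ->autoflush
use POSIX ();

# Hypothetical helper: send $input to @cmd and return all its output.
sub roundtrip_via_socket {
	my ($input, @cmd) = @_;
	socketpair(my $ours, my $theirs, AF_UNIX, SOCK_STREAM, 0);
	my $pid = fork;
	if ($pid == 0) { # child: one socket is both stdin and stdout
		CORE::open(STDIN, '<&', $theirs) or POSIX::_exit(126);
		CORE::open(STDOUT, '>&', $theirs) or POSIX::_exit(126);
		exec { $cmd[0] } @cmd or POSIX::_exit(127);
	}
	close $theirs; # parent keeps a single descriptor
	$ours->autoflush(1);
	print $ours $input or die "print: $!";
	shutdown($ours, 1); # 1 == SHUT_WR: the child sees EOF on stdin
	my ($reply, $buf) = ('');
	$reply .= $buf while sysread($ours, $buf, 65536);
	close $ours;
	waitpid($pid, 0);
	$reply;
}
```

The parent ends up holding one descriptor per child instead of two, and `shutdown' takes the place of closing the write pipe to signal EOF.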
Eric Wong [Wed, 11 Oct 2023 07:20:53 +0000 (07:20 +0000)]
treewide: consolidate "From " line removal
Aside from our prior import bugs (fixed in a0c07cba0e5d8b6a
(mda: drop leading "From " lines again, 2016-06-26)), we'll
always have to be dealing with mutt piping messages to us and
`git format-patch' output. So just share the regexp so we
can use it everywhere.
It may be desirable to allow importing messages with a leading
"From " line for FUSE, even.
Additionally, some instances of this regexp needlessly added
optional `\r?' (CR) checks ahead of the `\n' (LF) element; but
they're pointless anyways since [^\n]* is enough to exclude all
non-LF bytes.
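A small sketch shows why the `\r?' is redundant. The pattern below is a hypothetical stand-in written for this illustration (the shared regexp in the tree may differ), as is the `strip_from_line' helper:

```perl
use strict;
use warnings;

# Hypothetical pattern: [^\n]* already consumes a trailing CR, so
# LF and CRLF line endings are both handled without `\r?'.
my $from_line = qr/\AFrom [^\n]*\n/;

# Hypothetical helper: return $raw without a leading "From " line.
sub strip_from_line {
	my ($raw) = @_;
	$raw =~ s/$from_line//;
	$raw;
}
```

Since the pattern is anchored with `\A', "From " lines further into the message are left alone.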
Eric Wong [Wed, 11 Oct 2023 07:20:51 +0000 (07:20 +0000)]
msgtime: quiet warnings we can do nothing about
In retrospect, warning about bad times and dates is pointless
since there's nothing actionable about it. We'll also drop an
unnecessary capture in msg_received_at while we're at it and
favor using $eml as the input variable name to match
current usage.
The note to install Date::Parse as a fallback remains since it
can be helpful in some cases (and is actionable by the user).
Eric Wong [Wed, 11 Oct 2023 07:20:50 +0000 (07:20 +0000)]
lei_xsearch: improve curl progress reporting
Instead of having tail(1) follow a file when we're in verbose
mode, unconditionally pipe stderr to a Perl 2-liner which tees
its output to a regular file with line buffering.
POSIX tee(1) isn't suitable for this task since it's required
to be completely unbuffered while we want line-buffering when
running parallel processes. Fortunately, Perl makes this easy.
This also means we no longer leave curl-err.XXXX files around
on premature shutdown if we're hit by a SIGKILL or similar and
can't exit normally.
We do need to stop and respawn the Perl process if we hit a curl
error, though, since we need to be certain the output is
flushed.
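The line-buffered tee idea can be sketched as a short sub; this is only an illustration and not the actual 2-liner spawned by lei_xsearch. `tee_lines' is a hypothetical name, and enabling autoflush on each output is an assumption about how per-line flushing is achieved.

```perl
use strict;
use warnings;
use IO::Handle; # for ->autoflush

# Hypothetical helper: copy $in to every handle in @out, one line at
# a time, flushing after each line rather than after each byte.
sub tee_lines {
	my ($in, @out) = @_;
	$_->autoflush(1) for @out;
	while (defined(my $line = <$in>)) {
		for my $fh (@out) {
			print $fh $line or die "print: $!";
		}
	}
}
```

Writing whole lines at a time is what keeps the output of parallel processes from being interleaved mid-line.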
Eric Wong [Wed, 11 Oct 2023 07:20:49 +0000 (07:20 +0000)]
lei rediff: use ProcessIO for --drq support
This required fixing binmode support a few commits ago, along
with properly enabling autoflush in popen_wr instead of setting
it on the wrapper ProcessIO class.
Eric Wong [Tue, 10 Oct 2023 10:09:04 +0000 (10:09 +0000)]
ds: use a dummy poller during Reset
commit 1897c3be1ed644a05f96ed06cde4a9cc2ad0e5a4
(ds: Reset: replace Poller object early, 2023-10-04)
was not effective at eliminating the following message
at daemon shutdown:
Can't call method "FILENO" on an undefined value at
.../PublicInbox/Select.pm line 34 during global destruction.
This seems down to some tied objects having unpredictable
destruction order. So use a dummy class to ensure its ep_*
methods never call the tied `FILENO' method at all since
dropping the Poller object will release any resources it holds.
Eric Wong [Tue, 10 Oct 2023 10:07:56 +0000 (10:07 +0000)]
over*: avoid defined-or hash assignments with side-effects
These may've been causing strange errors[1] in t/imapd.t from
the -watch daemon, such as:
Cannot copy to HASH in scalar assignment ../PublicInbox/Over.pm
in the Over->dbh() sub. I've only noticed this failure on
FreeBSD 13.2 (Perl 5.32.1, DBD::SQLite 1.72 (bundled SQLite
3.39.4), DBI 1.643) so far, so it could also be something to do
with the versions used and/or memory layout differences
with libc or build toolchain.
Eric Wong [Tue, 10 Oct 2023 09:03:09 +0000 (09:03 +0000)]
t/nntp.t: attempt to track source of undefined vars
Occasionally, t/nntp.t spews undefined variable warnings under
`make check-run'. While the test doesn't fail, it's annoying
to see them and it could be a source of deeper problems.
Eric Wong [Mon, 9 Oct 2023 17:56:23 +0000 (17:56 +0000)]
www_coderepo: fix handling of non-UTF-8 git data
We can't assume git output is UTF-8, and we'll always have
legacy data in git coderepos. So attempt to display some
garbled text rather than nothing at all if Perl croaks
on it.
sox commit c38987e8d20505621b8d872863afa7d233ed1096
(Added raw inverse-bit u-law and A-law support. Updated *.txt files., 2001-12-13)
is an example of a commit which caused problems for me.
Eric Wong [Sun, 8 Oct 2023 22:11:48 +0000 (22:11 +0000)]
introduce ProcessIONBF for multiplexed non-blocking IO
This is required for reliable epoll/kevent/poll/select
wakeup notifications, since we have no visibility into
the buffer states used internally by Perl.
We can safely use sysread here since we never use the :utf8
nor any :encoding Perl IO layers for readable pipes.
I suspect this fixes occasional failures from t/solver_git.t
when retrieving the WwwCoderepo summary.
Eric Wong [Sun, 8 Oct 2023 22:11:47 +0000 (22:11 +0000)]
process_io: fix binmode and use it in lei_xsearch
The `binmode' perlop can only take two scalars, so passing
`@_' blindly won't work since prototypes are checked. This
means we can get IO::Uncompress::Gunzip working properly
with ProcessIO and use it for curl.
We'll also just autodie (instead of warn) on FS errors when
dealing with curl stderr; since the process will likely be
in bigger trouble soon, anyways.
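The prototype issue can be illustrated with a tiny tied-handle wrapper. This is a sketch and not the real ProcessIO class; `BinmodeSketch' and `wrap_handle' are hypothetical names.

```perl
use strict;
use warnings;
use Symbol qw(gensym);

# Hypothetical tied-handle class which forwards binmode to the
# real handle stored in $self->{fh}.
package BinmodeSketch;

sub TIEHANDLE {
	my ($class, $fh) = @_;
	bless { fh => $fh }, $class;
}

sub BINMODE {
	my ($self, @layer) = @_;
	# binmode($self->{fh}, @layer) would evaluate @layer in scalar
	# context and pass an element count instead of a layer name,
	# so the one- and two-argument forms are called explicitly.
	@layer ? binmode($self->{fh}, $layer[0]) : binmode($self->{fh});
}

package main;

# Hypothetical helper: return a glob tied to BinmodeSketch.
sub wrap_handle {
	my ($fh) = @_;
	my $sym = gensym;
	tie *$sym, 'BinmodeSketch', $fh;
	$sym;
}
```

Calling `binmode($tied, ':utf8')' on the wrapper then applies the layer to the underlying handle, which is the behavior a caller such as IO::Uncompress::Gunzip would rely on.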
Eric Wong [Sun, 8 Oct 2023 20:19:40 +0000 (20:19 +0000)]
overidx: use croak/confess instead of die
Unlike `die', `croak' can be expanded to `confess' to give a
full backtrace. We'll use `confess' on transaction failures
since those cause sporadic t/imapd.t failures on
FreeBSD (IO::Kqueue is installed, so signals are deferred).