Eric Wong [Wed, 5 Mar 2025 07:18:35 +0000 (07:18 +0000)]
xap_helper.h: share ThreadFieldProcessor across DBs
Unlike notmuch, we don't actually use the QueryParser from
initialization and can stop storing it in the class. Instead,
we rely on the global `cur_srch' at runtime for both which has
both the Xapian::Databaase and Xapian::QueryParser objects. So
stop carrying a copy of the ThreadFieldProcessor object for
every DB connection we have since some public non-extindex-users
may have many DBs and we can probably save some memory this way.
Eric Wong [Wed, 5 Mar 2025 07:18:34 +0000 (07:18 +0000)]
search: avoid `git rev-parse' if using C++ xap_helper
The C++ xap_helper can work with the Xapian query parser API to
run run git_date_parse(), so there's no need to do janky string
substitutions as we do for the Perl bindings.
Eric Wong [Wed, 5 Mar 2025 07:18:33 +0000 (07:18 +0000)]
xap_helper: use libgit2 git_date_parse in C++ impl
Using proper Xapian RangeProcessor and FieldProcessor subclasses
via the Xapian API means our `d:', `dt:' and `rt:' prefixes can
use git `approxidate' interpretation rules inside quoted
subqueries with the `thread:' prefix.
The only incompatibility is the `rt:' field which no longer
takes seconds with <5 characters (e.g. `rt:0..' as was in
t/xap_helper.t), but that'd be broken anyways in real-world use
which is run through `git rev-parse --since='.
Eric Wong [Wed, 5 Mar 2025 07:18:32 +0000 (07:18 +0000)]
search: make `d:' search prefix consistently date-only
While `d:' previously treated YYYYMMDD and YYYY-MM-DD as
low-precision by assuming 00:00:00 for the HH:MM:SS portion, it
would not do so for dates passed to git-rev-parse (e.g.
"last.year") since the HH:MM:SS of the current time would be
used by git. So stop remapping `d:' in queries to `dt:' and
continue using `d:' in the YYYYMMDD column.
This does break things like `d:2.hours.ago..' by causing too
many (or too few) results to be returned due to lack of
precision, but I expect small time ranges of less than one
day to be of limited use.
`dt:' remains a higher-precision field for searching on both
date and time from the Date: header, while `d:' is now always
date-only.
Eric Wong [Wed, 5 Mar 2025 00:45:36 +0000 (00:45 +0000)]
listener: don't set listen backlog on inherited sockets
By using the listen(2) backlog as-is when inheriting (from
systemd or similar), we can give the sysadmin more control on
controlling overload on a per-listener basis. For systemd
users, this means the `Backlog=' parameter in systemd.socket(5)
can be respected and configured to give certain sockets a
smaller backlog (perhaps combined with with per-listener
`multi-accept' parameter on sockets with the standard (huge)
backlog).
For sockets we create, continue to use INT_MAX and let the
kernel clamp it to whatever system-wide limit there is
(e.g. `net.core.somaxconn' sysctl on Linux).
Eric Wong [Thu, 27 Feb 2025 00:08:18 +0000 (00:08 +0000)]
imap: support external xap_helper process
Moving Xapian requests to an external process prevents expensive
searches from affecting non-search IMAP commands or any other
non-search requests within a -netd process if IMAP is sharing a
process with other public-facing protocols.
Eric Wong [Thu, 27 Feb 2025 00:08:17 +0000 (00:08 +0000)]
daemon: slightly simplify xap_helper spawning
Avoid typing the fully-qualified global $XHC variable twice in
spawn_xh. We'll also delay loading of XhcMset until successful
spawn and improve the error message in case of error. There's
also no need to conditionally local-ize
$PublicInbox::Search::XHC, so do it unconditionally to reduce
branches.
Eric Wong [Thu, 27 Feb 2025 00:08:16 +0000 (00:08 +0000)]
search: remove xhc_start_maybe
We start xap_helper processes early in both public-facing
daemons to ensure maximum sharing. lei may be different
since lingering processes likely sit idle for long periods
and we don't want to tie up RAM or swap space for idle
processes.
Eric Wong [Sun, 23 Feb 2025 13:13:35 +0000 (13:13 +0000)]
xh_thread_fp: optimize OR query generation
For Xapian >= 1.4.10 users, using `|=' will reduce allocations.
`|=' still works for Xapian 1.3.2 .. 1.4.9 users, and I'm not
sure if Xapian 1.2.x is relevant anymore, especially for our
new C++-only features.
While operator overloading is often confusing and frustrating to
me when reading someone else's code, the optimization seems worth
it since (AFAIK) there's no other way to get the allocation
reduction.
cf. Olly in xapian-discuss <20250222043050.GA17282@survex.com>
Eric Wong [Sun, 23 Feb 2025 13:13:34 +0000 (13:13 +0000)]
xap_helper: avoid temporary std::set in thread fp
Instead of creating a temporary std::set to store THREADIDs,
just inject the Xapian::Query::OP_OR operations directly into
the query we return.
Unlike notmuch(1), we can do this without risking redundant
query ops from repeated THREADIDs since use since we use a
column instead of a term for THREADID. Column use allowed
us to use .set_collapse_key to deduplicate the intermediate
mset, already.
Eric Wong [Sun, 23 Feb 2025 13:13:33 +0000 (13:13 +0000)]
xap_helper: enable FLAG_PURE_NOT in external process
Since public-facing WWW uses an external process with
a non-blocking socket, clients will only see 503 errors
if the search is overloaded. While allowing pure NOT
queries is expensive, there are already many possible
ways of triggering expensive queries so it's probably not
a problem to allow one more.
Eric Wong [Thu, 20 Feb 2025 22:14:30 +0000 (22:14 +0000)]
search: index References: for thread:GHOST-MSGID
To search for messages in a thread with a ghost msgid,
we need to be able to search against msgids in the
References: header since (by definition) ghosts don't
show up as any Message-ID: we've indexed.
This should make our implementation of `thread:MSGID'
queries equivalent in capability to `thread:THREADID'
of notmuch.
Eric Wong [Thu, 20 Feb 2025 22:14:29 +0000 (22:14 +0000)]
xap_helper: support thread:{SUBQUERY} via C++
Stealing the idea and code from notmuch to perform subqueries
within search. One major internal difference from notmuch is we
store THREADID as numeric column in Xapian whereas notmuch
stores a boolean term. The use of a column lets us use
set_collapse_key to deduplicate results within Xapian itself.
The other difference from notmuch is we avoid exposing the
numeric THREADID since they're unstable and not reproducible in
mirrors, thus we also support `thread:MSGID' instead of
`thread:THREADID' in brace-less queries.
Eric Wong [Thu, 20 Feb 2025 22:14:28 +0000 (22:14 +0000)]
xap_helper: switch C++ implementation to AGPL-3
GPL-2 approxidate code won't work with the XS/SWIG version, so
it looks like we'll keep calling `git rev-parse' in both
versions for the time being. Meanwhile, it's more valuable to
be able to take GPL-3+ code from notmuch for thread:{} query
parsing.
Eric Wong [Wed, 19 Feb 2025 10:10:32 +0000 (10:10 +0000)]
searchidx: don't index Base-85 w/ CRLF endings
I encountered a false positive search result from a CRLF message
with a Base-85 patch in it. It turns out our Base-85 filtering
code didn't account for the possibility of "\r" showing up in
patch messages, so just ignore all trailing spaces (not just
horizontal spaces) in index_diff().
While we're at it, exclude horizontal whitespace and CR
consistently from Base-85-looking quoted text in
index_body_text(), too, since I'm sure there's messages with
CRCRLF in the wild, too...
Eric Wong [Wed, 19 Feb 2025 12:39:26 +0000 (12:39 +0000)]
cfgwr: fix non-libgit2 case
libgit2 isn't installed on all my test machines, and the
addition of libgit2 support was careless in that it failed to
test the non-libgit2 case thoroughly :x
Fixes: a8073f6c (lg2: use cfgwr_commit to write to configs using libgit2, 2025-02-15)
Eric Wong [Sat, 15 Feb 2025 11:10:11 +0000 (11:10 +0000)]
lg2: use cfgwr_commit to write to configs using libgit2
libgit2 allows us to avoid process spawning overhead when
writing to the config file. While the performance difference is
unlikely to matter in real-world use, it should improve
maintainability of tests by allowing us to write tests with
`public-inbox-init' with a smaller performance penalty (as
opposed to using `PublicInbox::IO::write_file' to setup
configs).
Eric Wong [Sat, 15 Feb 2025 11:10:10 +0000 (11:10 +0000)]
rename Gcf2 -> Lg2 for general libgit2 use
We'll be exploring libgit2 for writing (and possibly reading) config
files, so the `gcf2' (git-cat-file-2) name isn't appropriate
anymore for the package name. We'll still keep `gcf2_' for
subroutine names related to git-cat-file-like behavior.
For testing with TEST_RUN_MODE=0 (which spawns new processes
instead of reusing them), we can't rely on packages being loaded
by subprocesses influencing the main Perl process. Thus we must
load PublicInbox::Config explicitly to prevent failures with
this run mode.
Eric Wong [Sat, 15 Feb 2025 11:10:06 +0000 (11:10 +0000)]
import: clobber $? if global init.defaultBranch fails
We don't want $? influencing further error checking downstream,
such as thw _wq_done_wait call in public-inbox-clone when
git-config(1) invocations which set $?=0 are replaced with
in-process libgit2 config function calls.
Eric Wong [Sat, 15 Feb 2025 11:10:04 +0000 (11:10 +0000)]
favor run_wait() over CORE::system()
run_wait() can use vfork(2) which saves memory in large
processes, such as public-inbox-extindex when dealing with giant
messages. While vfork is unlikely to matter for real-world uses
of public-inbox-edit, PublicInbox::Spawn is a sunk cost treewide
our `make check-run' test target avoids spawning new Perl
processes for most things in script/*, so there can be a small
savings for testing.
Eric Wong [Fri, 14 Feb 2025 09:38:55 +0000 (09:38 +0000)]
imap_searchqp: avoid occasional P::RD hint spew
It turns out Inline::C::Parser::RecDescent increments $::RD_HINT
when Inline::C rebuilds are necessary, thus causing verbose log
spew on bogus IMAP search queries in our own use of
Parse::RecDescent. So, local-ize $::RD_HINT as we do with other
$::RD_* vars to avoid Inline::C influencing our IMAP query
parsing.
Triggering the increment of $::RD_HINT via
Inline::C::Parser::RecDescent may be achieved by forcing an
Inline::C rebuild, such as clearing the contents of
~/.cache/public-inbox/inline-c/* or
$ENV{PERL_INLINE_DIRECTORY}/* (but leaving one of those
top-level directories to trigger Inline::C rebuilds. This bug
mainly manifested in t/imap_searchqp.t under the `make
check-run' test target which reuses existing processes to speed
up tests, though it could manifest in real-world -imapd (and
-netd) usage if the process spawning the daemon built Inline::C
code.
Eric Wong [Tue, 11 Feb 2025 03:55:36 +0000 (03:55 +0000)]
daemon: make xap_helper socket non-blocking
A non-blocking socket for public-facing searches is ideal on
overloaded servers which receive more searches than it can
handle in a short time span. A non-blocking socket here
prevents a sendmsg(2) call from blocking the main event loop
and ensures non-search requests can still be processed even
if all xap_helper subprocesses are overwhelmed.
$search_xh_pid is hoisted out to TestCommon for use with the new
slow-xh-search test and renamed $find_xh_pid to avoid confusion
with Xapian search functionality. We'll also drop a redundant
call to find the xap_helper PID in psgi_v2.t while transitioning
to $find_xh_pid.
Eric Wong [Tue, 11 Feb 2025 03:55:34 +0000 (03:55 +0000)]
search: async_mset: always run callback on exceptions
To produce consistent error behavior, ensure we always run
the user-supplied callback on exceptions when using in-process
Xapian (as opposed to the external xap_helper process).
Eric Wong [Mon, 10 Feb 2025 21:09:34 +0000 (21:09 +0000)]
git_http_backend: input_prepare: die on unrecoverable errors
We shouldn't be guarding against errors at this level to return
500 errors, but rather we can rely on existing eval wrappers in
the stack (e.g. Plack::Util::run_app and Qspawn->parse_hdr_done)
to capture errors and return the HTTP 500 status.
Eric Wong [Mon, 10 Feb 2025 21:09:32 +0000 (21:09 +0000)]
msgmap: use v5.12
No unicode_string dependencies, here, so we can use v5.12 and
perhaps to speed up loading one day if everything drops `strict'
in favor of v5.12 (or higher).
Eric Wong [Mon, 10 Feb 2025 21:09:30 +0000 (21:09 +0000)]
multi_git: remove redundant ops
read_all already returns an array in array context so the
`split' op is unnecessary, and we don't need to use
`join("", ...)' on arrays that are destined for `print'
or `say'.
Eric Wong [Mon, 10 Feb 2025 21:09:29 +0000 (21:09 +0000)]
devel/sysdefs-list: explain why we don't autodie, here
Some distros split out core modules like autodie, and
sysdefs-list has higher portability requirements than the rest
of our code since it's used for uncommon platforms.
Eric Wong [Sat, 8 Feb 2025 03:26:38 +0000 (03:26 +0000)]
qspawn: drop redundant check against limiter->{max}
We only need to guard against excessive parallelism on entry
in one place during Qspawn->start. Redundant guards are
confusing and make bugs harder-to-spot.
Eric Wong [Sat, 8 Feb 2025 03:26:37 +0000 (03:26 +0000)]
viewvcs: -codeblob limiter w/ depth for solver
By adding the `publicinboxlimiter.<name>.depth' parameter for
all limiters, we can have queue overflow detection and reject
requests with HTTP 503 error codes for all limiter uses, not
just the custom queue in ViewVCS as before. Then we can just
use a normal limiter in ViewVCS and get rid of the old custom
queue.
Eric Wong [Sat, 8 Feb 2025 03:26:36 +0000 (03:26 +0000)]
qspawn: use limiter->new default
No need to hardcode `32' here since PublicInbox::Limiter already
uses that value already (and includes a helpful comment about
using the same value as git-daemon).
Eric Wong [Sat, 8 Feb 2025 03:26:35 +0000 (03:26 +0000)]
daemon: check connections before solving codeblobs
The /$MSGID/s/ codeblob reconstruction endpoint is extremely
expensive and one of the few which do a large amount of upfront
processing before being able to write to an HTTP client and
detect if they're still alive (unlike other expensive
endpoints).
For TCP connections, use the TCP_INFO via getsockopt(2) if
supported or attempt to use poll(2) + the FIONREAD ioctl(2) to
check the connection before possibly invoking
git-(apply|ls-files|update-index) dozens or even hundreds of
times to reconstruct a blob.
Eric Wong [Sat, 8 Feb 2025 03:26:34 +0000 (03:26 +0000)]
git_http_backend: fix 32 default connection limit
Using any configurable limiter actually defaults to `1', which
isn't intended for `-httpbackend'. So add a $max_default
parameter to ensure we can set a per-limiter default value for
default limiters such as `-httpbackend'.
Fixes: 0a083241 (git_http_backend: use default limiter from Qspawn, 2025-01-28)
Eric Wong [Sat, 8 Feb 2025 03:26:33 +0000 (03:26 +0000)]
syscall: `use bytes' throughout
All *nix syscalls operate on the byte-level with no knowledge of
encodings, so ensure all `substr' and `length' calculations use
the bytes(3perl) pragma.
Eric Wong [Tue, 28 Jan 2025 08:31:12 +0000 (08:31 +0000)]
www: configurable "-httpbackend" named limiter
The default limiter being hard-coded to 32 is excessive for
smaller systems. With dozens or hundreds of inboxes,
configuring all inboxes to point to a user-defined limiter
is tedious, as well. So give WWW admins the opportunity
to configure the limiter for all git-http-backend(1) processes
with one `publicinbox.-httpbackend.max' knob.
Eric Wong [Tue, 28 Jan 2025 08:31:11 +0000 (08:31 +0000)]
limiter: ignore unparseable rlimit
Instead of blindly accepting whatever bogus values the config
gave us, ignore it and use the `W:' prefix to be consistent
with the rest of our codebase.
Eric Wong [Fri, 24 Jan 2025 09:18:56 +0000 (09:18 +0000)]
viewvcs: handle exceptions in on_destroy cb
We need to account for File::Temp->newdir or autodie::open
croaking on us in case the system is out of space or memory.
These resource errors are already handled in our normal code
path. However, this on_destroy codepath can be triggered
when the solver request needs to be queued due to business.
Thus we explicitly call the {-wcb} to ensure the socket
isn't forgotten.
Eric Wong [Sat, 18 Jan 2025 01:26:15 +0000 (01:26 +0000)]
mda: use read_all for error handling
We already imported PublicInbox::IO for this script and read_all
provides more safety in case default PerlIO semantics change for
read-in-full behavior.
Eric Wong [Sat, 18 Jan 2025 01:26:13 +0000 (01:26 +0000)]
config: config_fh_parse: hardcode FS/RS
We no longer need the flexibility to handle non-NUL-delimited
output from `git config -l', so simplify callers and allow
a theoretically sufficiently-advanced Perl implementation
optimize more easily.
Followup-to: 21146412 (config: drop scalar ref support from internal API, 2023-09-24)
Eric Wong [Sat, 18 Jan 2025 01:26:11 +0000 (01:26 +0000)]
treewide: use autodie for seek+sysseek
The underlying lseek(2) syscall won't fail due to HW errors,
only due to usage errors (ESPIPE, EINVAL); so don't waste
code on error checking ourselves and just let autodie check
things during development.
Eric Wong [Sat, 18 Jan 2025 01:26:07 +0000 (01:26 +0000)]
init: move --skip-epoch handling to {-creat_opt}
Epoch skipping only makes sense at inbox creation, thus we'll
bundle inbox creation options together to reduce potentionally
confusing positional parameters.
Eric Wong [Sat, 18 Jan 2025 01:26:06 +0000 (01:26 +0000)]
init: move --skip-artnum handling to {-creat_opt}
It makes more sense to have inbox creation options bundled
together and reduces the amount of potentially confusing
positional parameters we must pass around.
Eric Wong [Sat, 18 Jan 2025 01:50:18 +0000 (01:50 +0000)]
doc: add an example of acceptable marketing
Presenting specific solutions to random comments is probably
acceptable since it's potentially helpful to some people;
but disclaimers must be present to lower expectations.
Eric Wong [Sun, 19 Jan 2025 04:40:58 +0000 (04:40 +0000)]
search_query: fix warnings on bogus `o=' query param
Users may specify non-numeric values for the `o' parameter
which won't be captured. Avoid uninitialized variable warnings
by assuming `0' when no numeric value is specified.
Eric Wong [Thu, 16 Jan 2025 03:28:51 +0000 (03:28 +0000)]
config: force absolute path when `-C /' (chdir) is used
For rare cases where we rely on `-c' to override temporary
options, we also chdir(2) to `/' to avoid hitting a users'
$worktree/.git/config. Thus, the config file must be given as
as an absolute path. While the majority of or code assumes
absolute paths are used for config files, it's possible we end
up on some strange setups where $ENV{HOME} is a relative path.
Eric Wong [Fri, 10 Jan 2025 23:20:59 +0000 (23:20 +0000)]
(ext)index: eliminate $sync->{ibx}
Perhaps the trickiest field to eliminate from $sync {ibx}
due to the way -extindex handles cross-posted messages. We
now eliminate the per-request $req->{ibx} for non-crossposted
messages and use local-ized $self->{ibx} directly.
We already had $self->{epoch_max}, and it's easy enough to
`local'-ize it to ensure it doesn't outlive its usefulness when
-extindex handles multiple inboxes.
Eric Wong [Fri, 10 Jan 2025 23:20:05 +0000 (23:20 +0000)]
v2writable: eliminate $sync->{art_end}
Instead of creating a long-lived {art_end} entry or even
local-izing it in $self, pass the `$art_end' value on stack to
achieve symmetry with the existing `$arg_beg' arg in calls to
index_xap_step().
Eric Wong [Fri, 10 Jan 2025 23:19:03 +0000 (23:19 +0000)]
(ext)index: eliminate most uses of `$sync->{ibx}'
Yet another step towards eliminating the $sync structure.
There's still a few remaining dependencies due to cross-inbox
linkage but the goal is to reduce refcount traffic for the
majority of Inbox object references. This field in $sync will
be completely eliminated in the next few commits.
Eric Wong [Fri, 10 Jan 2025 23:18:58 +0000 (23:18 +0000)]
(ext)index: move {max_size} and related bits to $self
`--max-size' is persistent for a process, so having it in $self
is more appropriate anyways than the shorter-lived $sync. The
{index_oid} function pointer is tied to max-size handling, so
we'll move that over to $self as well.
Eric Wong [Fri, 10 Jan 2025 23:18:57 +0000 (23:18 +0000)]
searchidx: rename {sidx} to {self}
While this is v1-only code, {self} is more consistent with the
rest of the code and the eventual goal is to eliminate $sync
entirely in favor of $self for temporary state.
Eric Wong [Fri, 10 Jan 2025 23:18:55 +0000 (23:18 +0000)]
extindex: move {id2pos} to $self
{id2pos} doesn't need a different hash table entry for every
request, instead it is tied to the inboxes associated with the
ExtSearchIdx ($self) object so we can just store it and give it
the same lifetime as the associated inbox references.
Eric Wong [Fri, 10 Jan 2025 23:18:16 +0000 (23:18 +0000)]
extindex: {boost_in_use} field to $self
There's no need to repeatedly propagate this field to every
cat-file blob request since this field is tied to the attached
config and inboxes associated with an inbox.
Eric Wong [Fri, 10 Jan 2025 23:18:13 +0000 (23:18 +0000)]
searchidx: prefix v1 code with `v1_'
Since there's no or v1-specific modules such as `V1Writable',
prefixing v1-specific subs seems like a prudent way to demarcate
subs used for dealing with the old v1 storage format.
We'll also take this opportunity to rearrange the order of subs
to take advantage of prototypes and compile-time checking with
`perl -c'.
Eric Wong [Fri, 10 Jan 2025 23:18:12 +0000 (23:18 +0000)]
smsg->populate: rename $sync to $cmt_info
We only use this hashref to propagate author and commit time
from the commit, so give it a more descriptive name rather than
overloading the to-be-eliminated `$sync' struct. There's no
need to explicitly vivify into a hashref, either.
Eric Wong [Wed, 8 Jan 2025 10:12:05 +0000 (10:12 +0000)]
ExtMsg, NewsWWW: account for publicinbox.nameIsUrl
As with WwwListing, these generate full URLs which go across
multiple inboxes, so support the new publicinbox.nameIsUrl
feature which allows users with many inboxes on the same domain
to avoid setting `publicinbox.*.url' for each one.
Alyssa Ross [Tue, 7 Jan 2025 17:46:23 +0000 (18:46 +0100)]
t/lei-sigpipe.t: use correct pipe size
According to fcntl(2), Linux will round up pipe sizes lower than a
page to a whole page. Additionally, the kernel may use a larger size
in general, "if that is convenient for the implementation". The
actual size of the pipe is returned from fcntl, so the test should use
that return value, rather than assuming the size requested was
accepted exactly. This fixes the test on my system with 16K pages.
Eric Wong [Thu, 2 Jan 2025 22:50:31 +0000 (22:50 +0000)]
extindex: use nproc_shards directly from IPC
There's no reason to use it from the V2Writable namespace
since it was moved to PublicInbox::IPC in commit 9225145d (ipc: move nproc_shards from v2writable, 2023-03-21)
Import it early via `use' so we can take advantage of prototype
checking while we're at it.