Eric Wong [Mon, 3 Nov 2025 02:50:10 +0000 (02:50 +0000)]
cindex: clarify yet-to-be-documented switches
We need to clarify that the --all, --include=, and --only= switches
were only intended for inboxes, especially since the --exclude=
switch exists for excluding coderepo pathnames. So suffix them
appropriately with `-inbox(es)'.
I'm not sure if these switches are even useful, so we'll leave
them undocumented for now.
Eric Wong [Mon, 3 Nov 2025 02:50:07 +0000 (02:50 +0000)]
cindex: use `--project-list=' default with empty string
Matching public-inbox-clone(1) behavior, using `--project-list='
with an empty argument to mean the same thing as
`--project-list=projects.list' allows users to reduce typing and
visual noise on their displays.
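A rough Python sketch of the defaulting rule described above (the project itself is Perl). The function name and the use of None for "switch not given" are assumptions for illustration only:

```python
DEFAULT_PROJECT_LIST = "projects.list"


def resolve_project_list(value):
    """Map the raw --project-list= argument to a file name.

    None  -> switch not given at all (assumed representation)
    ""    -> switch given with an empty argument: use the default name
    other -> use the argument as-is
    """
    if value is None:
        return None
    return value or DEFAULT_PROJECT_LIST
```

With this rule, `--project-list=' and `--project-list=projects.list' resolve to the same name.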
Eric Wong [Fri, 31 Oct 2025 20:35:06 +0000 (20:35 +0000)]
http: skip and improve warning on undef header values
In case we see undefined header values again, add some
additional information about the request to help track down the
problem. This ought to help admins (such as myself) who lack
space for more verbose access logs but keep syslog (stderr) logs
around longer.
Eric Wong [Fri, 31 Oct 2025 20:35:05 +0000 (20:35 +0000)]
www: reduce likelihood of undefined HTTP header values
I'm not seeing how it's possible, but it appears HTTP.pm
occasionally warns on undefined HTTP headers in its responses
and is generating invalid HTTP headers.
Eric Wong [Fri, 31 Oct 2025 02:45:02 +0000 (02:45 +0000)]
xap_helper: drop unnecessary check on split2argv result
Our split2argv will already abort on excessive or zero
arguments, and it's impossible for a `size_t' to evaluate to
a negative value, anyways. While we're at it, explain it's
safe to cast its return value to a signed int (typically
32-bit) since `req.argv' is a relatively small fixed value.
Eric Wong [Thu, 30 Oct 2025 04:26:58 +0000 (04:26 +0000)]
searchview: fix uninitialized var on bogus `o='
When somebody enters an out-of-bounds `o=' (offset) query parameter
for a query which otherwise returns some results, we should avoid
triggering uninitialized variable warnings since we were unable
to extract min/max relevance percentages. So just make up some
{min,max}_pct numbers for now if somebody tries that.
Aside from causing noise in stderr (often syslog), these were
otherwise harmless warnings. AFAIK, this could only be triggered
by someone entering URL parameters manually to view HTML, and
not in any generated URLs.
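A minimal Python sketch of the fallback idea, not the actual Perl change. The function name and the placeholder (0, 100) range are assumptions; the commit only says some numbers are made up when the page of results is empty:

```python
def relevance_range(percents, fallback=(0, 100)):
    """Return (min_pct, max_pct) for the current page of results.

    `percents` is empty when the `o=' offset points past the last
    result, so there is nothing to take a min/max of.  Returning a
    placeholder range (an assumption here) avoids passing undefined
    values on to the code which formats the page.
    """
    if not percents:
        return fallback
    return min(percents), max(percents)
```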
Eric Wong [Mon, 27 Oct 2025 17:56:14 +0000 (17:56 +0000)]
searchidx: split shards at 100000 docs by default
Testing on a busy btrfs system with indexlevel=medium reveals
another ~15% speedup compared to the previous 450000 value since
shards are smaller and less prone to slowdown. The smaller
splits should also work better with indexlevel=full (the
default) since full indexing with positions takes up the bulk of
the space.
Eric Wong [Sun, 26 Oct 2025 20:59:00 +0000 (20:59 +0000)]
t/cindex: fix broken size checks
Xapian size checks aren't accurate unless the shards are compacted.
Unfortunately, I commented out the -compact calls while working on
--split-shards but forgot to re-add them :x So re-enable the
-compact calls and add an extra flag so we can skip size checks
when xapian-compact(1) is missing.
Fixes: 622e8c89 (*search: introduce open.lock for reader safety, 2025-10-08)
Eric Wong [Sun, 26 Oct 2025 00:16:44 +0000 (00:16 +0000)]
t/xap_helper: avoid confusing skip messages
Stop misleading users about not having C++ even if we're
explicitly testing C++ and Perl+(XS|SWIG) implementations
separately. The previous message was especially confusing
since it was one of the last diagnostics messages printed
and most likely to be seen by those running the test.
Eric Wong [Sun, 26 Oct 2025 00:16:43 +0000 (00:16 +0000)]
xh_cidx: include khashl.h to placate cppcheck(1)
cppcheck(1) will complain about things normal compilers don't.
While this #include is unnecessary to us since we only have a
single compilation unit and don't intend to reuse this header
elsewhere, it seems mostly harmless since GNU and clang
preprocessors should be able to optimize out redundant headers,
anyways, even w/o `#pragma once'.
Eric Wong [Sun, 26 Oct 2025 00:04:12 +0000 (00:04 +0000)]
rproxy: remove line context on backend disconnects
It's worthless, especially when using `die' instead of `croak'.
In any case, there's not much we can do if upstream decides to
shut down a partially written stream for whatever reason (e.g.
OOM) so stop giving line information about it. Adding request
information may be helpful at some point, but that's for another
time.
Eric Wong [Fri, 24 Oct 2025 17:34:09 +0000 (17:34 +0000)]
contrib/reject_bots: drop persistent connection requirement
Like many measures against aggressive scraper bots which ignore
or insufficiently support robots.txt, requiring persistent
connections no longer seems effective. The
$env->{'pi-httpd.request_nr'} field remains for logging,
but will probably be removed, soon.
Eric Wong [Fri, 24 Oct 2025 19:28:34 +0000 (19:28 +0000)]
doc: tuning: add note about 64-bit OpenSSL speedup
Since switching HTTPS termination to 64-bit, I've noticed a
significant CPU usage reduction on public-inbox.org. IMAPS,
NNTPS, and POP3S remain 32-bit for the moment since that doesn't
go through varnish and I haven't gotten around to dealing with
a 64-bit Xapian (or SQLite) install.
Eric Wong [Thu, 23 Oct 2025 23:23:07 +0000 (23:23 +0000)]
treewide: don't change `use VERSION' in the same scope
Perl v5.40 deprecates and warns about mismatching `use VERSION'
statements until v5.44, at which point it'll become fatal and
unsupported. Multiple `package' statements in the same file are
considered the same scope by Perl, including string `eval' where
packages are declared within the string.
Furthermore, downgrading from v5.11+ to earlier versions is
already fatal in v5.42, at least. It's explained somewhat
unsatisfactorily in perl5360delta(1)...
It appears Perl takes another step towards the path of Python
and Ruby of introducing incompatibilities for minor reasons :<
Eric Wong [Wed, 22 Oct 2025 19:36:13 +0000 (19:36 +0000)]
t/pop3d: add diagnostics on APOP after STLS failure
Calling `APOP' after `STLS' to start TLS seems to fail
occasionally, but I haven't been able to figure out why.
Add some diagnostics to hopefully explain what's going
on...
Eric Wong [Thu, 23 Oct 2025 10:21:00 +0000 (10:21 +0000)]
psgi_rproxy: fix uploads with small output buffers
Attempting to call pass_res_hdr directly fails when the previous
DS->write to the upstream hit EAGAIN. Ensure correct ordering
by relying on DS->write to call the pass_res_hdr subroutine
after the previous DS->write is complete. In other words, we
must not try reading the upstream response until our request is
fully sent to the upstream.
I noticed this after installing HTTP::Parser::XS
(p5-HTTP-Parser-XS) on my FreeBSD machine since our PsgiRproxy
module depends on the XS package for parsing HTTP/1.x responses.
Eric Wong [Tue, 21 Oct 2025 03:03:46 +0000 (03:03 +0000)]
extmsg: fix Message-ID search for non-xap_helper users
commit c3d0295bd was intended to avoid head-of-line blocking for
--xapian-helpers (-X) deployments. Unfortunately, that broke
searches for deployments not using --xapian-helpers by causing
the ExtMsg to get queued redundantly into the event loop. So
perform the `requeue' procedure exactly once and simplify
ExtMsg->event_step by eliminating the $sync parameter.
Eric Wong [Wed, 8 Oct 2025 21:24:22 +0000 (21:24 +0000)]
*index: --split-shards to speed up initial indexing
When indexing millions of messages, Xapian has a tendency to
slow down as each shard gets bigger. --split-shards allows
Xapian to work on temporary shards (more akin to "epochs")
and uses xapian-compact(1) to commit the finalized changes
for readers. The result is roughly 2x faster for millions
of messages.
The downside of this switch is temporary space use increases by
2-3x and incremental changes are not visible to readers until
all indexing is complete. It has no useful effect on
--reindex, but --reindex is typically faster than initial
indexing anyways since space is already allocated.
It still takes days to create a new extindex of lore, but
fewer days than before.
Another beneficial side effect of this switch is it also
tends to reduce the effect of fragmentation for --cow users
on btrfs.
Eric Wong [Wed, 8 Oct 2025 21:24:21 +0000 (21:24 +0000)]
*search: introduce open.lock for reader safety
public-inbox-compact and -xcpdb (reshard) both use a series of
rename(2) operations to replace Xapian shard directories quickly
to minimize downtime. While a single rename(2) is atomic,
chaining two or more atomic operations is not.
Readers now acquire a shared lock via LOCK_SH of flock(2) on
open.lock if it exists, but tolerate ENOENT for backwards compatibility
with indices that haven't been written by the current version.
The open.lock only protects against parallel open(2) calls used
by readers while short rename(2) operations are taking place on
the writers. In other words, it allows parallel readers but
only a single writer process to do renames.
This open.lock will become more important with --split-shards
in the next commit.
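The reader side of this scheme could look roughly like the following Python sketch. It is not the project's Perl code; the helper name and the assumption that open.lock lives directly under the index directory are illustrative:

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def open_lock_shared(index_dir):
    """Hold LOCK_SH on index_dir/open.lock while opening shards.

    A missing lock file (ENOENT) is tolerated so indices written by
    older versions remain readable; in that case None is yielded.
    """
    path = os.path.join(index_dir, "open.lock")
    try:
        fh = open(path, "rb")
    except FileNotFoundError:
        fh = None  # older index without open.lock: proceed unlocked
    try:
        if fh is not None:
            fcntl.flock(fh, fcntl.LOCK_SH)  # many readers may hold this
        yield fh
    finally:
        if fh is not None:
            fh.close()  # closing the descriptor drops the lock
```

A caller would wrap only the open(2) calls on the shard directories in `with open_lock_shared(dir):`. A writer doing the rename(2) sequence would presumably hold an exclusive lock on the same file for the duration of the renames.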
Eric Wong [Wed, 8 Oct 2025 21:24:20 +0000 (21:24 +0000)]
codesearchidx: use {topdir} for consistency
There's no reason for the read-only CodeSearch and read-write
CodeSearchIdx to use different field names for referring to the
same thing. Consistently use {topdir} since it's inspired by
ExtSearch{,Idx} and also matches the configuration knob. This
will make sharing methods between CodeSearch and CodeSearchIdx
easier.
Eric Wong [Thu, 2 Oct 2025 19:29:39 +0000 (19:29 +0000)]
imapd|nntpd|pop3d: output IP + port on new connections
Dumping the IP address will help admins use the (standard)
output of these servers to track abusive IMAP and POP3 scanners
looking for private mail. NNTP doesn't seem affected at
the moment, but it's easier to keep the common code
across all three of these stateful protocols.
Eric Wong [Sat, 20 Sep 2025 00:04:50 +0000 (00:04 +0000)]
treewide: disable warning about NoCOW on btrfs
It's too noisy for tests and breaks some tests which check for
warnings when TMPDIR is btrfs. I didn't notice this earlier
since I had TMPDIR pointed at /dev/shm on my btrfs machine.
We'll probably just make CoW the default for btrfs because the
CoW overhead with --split-shards on initial index is "only" 100%
or so...
Eric Wong [Tue, 16 Sep 2025 21:42:58 +0000 (21:42 +0000)]
compact: support --cow/--no-cow switch
Users on btrfs may prefer to sacrifice performance for data
safety with RAID. Allow them to do so by supporting the --cow
switch. I missed this when I added --cow support to the other
commands.
Fixes: db671788 (support --cow switch to preserve CoW on btrfs, 2025-08-26)
Eric Wong [Mon, 15 Sep 2025 21:18:15 +0000 (21:18 +0000)]
view: display small and invalid time ranges
Instead of falling back to displaying the most recent messages
when there are too few (<200) or no messages in a time range,
show the small subset or return a 404 when there are no messages
at all. This ought to be less confusing for users who want to
focus on a small subset of messages within a given timeframe.
Eric Wong [Tue, 16 Sep 2025 10:25:05 +0000 (10:25 +0000)]
extindex: fix --reindex when blobs go missing
Blobs in v2 repos may be purged; ensure we don't try to parse
non-existent blobs for List-IDs to remove from Xapian when they
go missing. The invalid List-IDs will be removed in a later
stage of the reindex process.
Eric Wong [Sat, 13 Sep 2025 07:40:14 +0000 (07:40 +0000)]
searchidxshard: drop unused `echo' sub
We no longer need a separate `echo' command to check the
doneness of a previous ->ipc_do command since we've made
->ipc_do async and introduced ->ipc_wait_all in commit f0b9f90a
(*index: propagate exceptions from shard processes, 2025-09-04).
Eric Wong [Sat, 13 Sep 2025 12:21:50 +0000 (12:21 +0000)]
searchidx: use warn for excessively long terms
Filename and linenumber context information from Carp::carp is
unlikely to be useful here on (likely) worthless data from bad
MUAs. carp adds needless noise for users tracking down actual
code bugs with PERL5OPT=-MCarp=verbose, which changes Carp::carp
into Carp::cluck.
Eric Wong [Fri, 12 Sep 2025 21:28:49 +0000 (21:28 +0000)]
t/httpd-corner: fix test on missing curl-config(1)
We were misusing our internal context-dependent `require_cmd',
which requires the caller to call `skip' explicitly when a
return value is desired. So add a new `skip' call for testers
without curl-config(1). Furthermore, the `SKIP' label was
incorrectly placed for Test::More::skip to use, so create an
explicit block for it.
Eric Wong [Fri, 12 Sep 2025 23:28:18 +0000 (23:28 +0000)]
extindex: fix --reindex
`public-inbox-extindex --reindex' deprioritizes itself for
public-inbox-extindex invocations without --reindex by shutting
down shard processes to let other processes acquire the lock and
process new messages, first.
Restarting shard processes during --reindex was causing new
Xapian shards to be written to v2 inboxes instead of the
extindex itself. This bug was introduced with the
simplifications to internal data structures to eliminate the
ad-hoc $sync structure.
The local-ized use of ExtSearchIdx->{ibx} tricked
PublicInbox::SearchIdxShard::new into using the standard v2 code
path. So make SearchIdxShard->new check the `$v2w' object for
the ability to call `eidx_sync' rather than the existence of the
{ibx} field.
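The difference between checking for a field and checking for a capability can be sketched in Python as follows. The class names are hypothetical stand-ins, and `hasattr'-style checks only approximate Perl's `->can':

```python
def shard_mode(writer):
    """Choose the code path from what the writer can do, not its fields."""
    if callable(getattr(writer, "eidx_sync", None)):
        return "extindex"
    return "v2"


class FakeV2Writable:  # hypothetical stand-in for a v2 writer
    def __init__(self):
        self.ibx = object()


class FakeExtSearchIdx:  # hypothetical stand-in for the extindex writer
    def __init__(self):
        self.ibx = object()  # temporarily set, as with the local-ized {ibx}

    def eidx_sync(self):
        pass
```

A field-existence check would treat both objects as v2 writers since both carry `ibx`; the capability check tells them apart.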
I only noticed this bug while working on the --split-shards
feature for performance.
Fixes: 922b765d ((ext)index: move {max_size} and related bits to $self, 2025-01-10)
Eric Wong [Sun, 7 Sep 2025 23:41:44 +0000 (23:41 +0000)]
ipc: improve exception handling
When an exception triggers a teardown of the worker process
(ipc_worker_stop), we need to combine subsequent exceptions and
show the original one, first. In other words, we must not lose
the original exception if new exceptions are thrown during
teardown.
So rely on `wantarray' to grab caller contexts to allow
returning exceptions as a list rather than throwing them
immediately.
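The ordering requirement can be illustrated with a small Python sketch. Perl's `wantarray' has no direct Python equivalent, so this version always returns the list; the function names are assumptions:

```python
def stop_worker(original_exc, teardown_steps):
    """Run every teardown step and collect failures.

    The original exception always stays first in the returned list so
    it is never lost to errors raised while tearing down.
    """
    errors = [original_exc]
    for step in teardown_steps:
        try:
            step()
        except Exception as exc:  # keep going, but remember the failure
            errors.append(exc)
    return errors


def combined_message(errors):
    """Join all collected errors into one report, original first."""
    return "\n".join(str(e) for e in errors)
```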
Eric Wong [Sun, 7 Sep 2025 23:41:43 +0000 (23:41 +0000)]
ipc: avoid context line in generated exception
The filename and line number of the "aborted" message is
needless noise and confusing when dealing with errors
which already triggered ipc_fail. So add a newline to
ensure Perl doesn't add context information if it needs
to `die' or `warn' on that message.
Eric Wong [Sat, 6 Sep 2025 19:45:06 +0000 (19:45 +0000)]
*index: increase default --commit-interval to 15s
The DBD::SQLite(3pm) sqlite_busy_timeout is 30s and we don't
currently have a way to override this. Thus 15s should be
adequate assuming we can keep SQLite commit times <10s.
Currently, the worst case commit times can exceed even 30s for
Xapian, but that doesn't affect over.sqlite3 which commits
fairly quickly.
Eric Wong [Tue, 2 Sep 2025 20:30:14 +0000 (20:30 +0000)]
reject_bots: avoid download prompts in Firefox
Apparently, some versions of Firefox will open a download prompt
when attempting to open the page without a Content-Type. So set
a Content-Type and keep those installations and users of Firefox
happy.
Eric Wong [Thu, 4 Sep 2025 19:22:27 +0000 (19:22 +0000)]
*index: propagate exceptions from shard processes
We'll introduce a new ->ipc_async internal API and use it for
->ipc_do calls where the return value is ignored. This new
API is modeled after our async API for accessing
`git cat-file --batch*' and remains compatible with synchronous
callers who want the return value of ->ipc_do.
Processes no longer spin and burn CPU after hitting ENOSPC or
dealing with database or FS corruption. Instead, they should
now die properly in most cases. However, hangs and crashes may
still be possible since Xapian may abort(3) in some ENOSPC cases
and SQLite may trigger SIGBUS via mmap(2) (if using WAL or
forcing mmap use).
Eric Wong [Thu, 4 Sep 2025 19:22:26 +0000 (19:22 +0000)]
ipc: add comment on pipe usage (vs socketpair)
Using pipes here isn't ideal for developer ergonomics, but the
increased buffer size for non-privileged processes on Linux is
still worth the extra performance with expensive indexing done
in each shard.
Eric Wong [Fri, 29 Aug 2025 20:38:54 +0000 (20:38 +0000)]
ipc: remove {-ipc_pid} field
The fork generation is already attached to the response pipe,
so there's no need to keep the PID around and do comparisons
against the getpid(2) result.
Eric Wong [Fri, 29 Aug 2025 19:04:01 +0000 (19:04 +0000)]
*index: support --commit=$SECONDS to adjust the commit interval
The default 5s interval is too low for initial indexing with
giant inboxes or extindices which aren't ready for public
consumption, yet. It's also too low for --wal users since
WAL mode of SQLite (by default) allows readers to proceed
without waiting on a writer.
I'm not sure what the defaults should be, so allow users to
change it, for now.
Eric Wong [Tue, 26 Aug 2025 19:50:52 +0000 (19:50 +0000)]
support --cow switch to preserve CoW on btrfs
We currently unconditionally disable CoW on btrfs to reduce
fragmentation. Unfortunately, disabling CoW may cause data
corruption on all btrfs RAID levels, so provide an option to
keep it enabled. In the future, CoW may become the default on
btrfs (matching the FS default) even if fragmentation is awful.
Eric Wong [Tue, 26 Aug 2025 19:50:49 +0000 (19:50 +0000)]
use Getopt::Long hashref for --no-fsync and --dangerous
We can eliminate the {-no_fsync} and {-dangerous} fields by
using the Getopt::Long hashref directly. This will make it
easier to support additional CLI options (e.g. --cow).
Our internal APIs now default to disabling fsync; however, the
CLI tools still override that internal default to enable fsync.
Having our internals default to disabling fsync can slightly
improve test performance, since they're the main users of our
unstable internal API.
Eric Wong [Tue, 26 Aug 2025 19:50:48 +0000 (19:50 +0000)]
overidx: take Getopt::Long options hashref directly
Taking the Getopt::Long options hashref directly will allow us
to support future options more easily and avoid copying/mapping
fields (e.g. {-no_fsync}, {journal_mode}) across different
object types.
Eric Wong [Tue, 26 Aug 2025 19:50:47 +0000 (19:50 +0000)]
t/lei_store: ensure over.sqlite3 uses WAL
As lei is a single user store, it has always used WAL since it
should improve parallelism and I/O patterns. An additional test
here will ensure bugs don't slip in where we forget to enable
WAL.
Eric Wong [Tue, 26 Aug 2025 19:50:45 +0000 (19:50 +0000)]
searchidx: take Getopt::Long options hashref on create
Relying more on the hashref populated by Getopt::Long should
reduce cognitive overhead. Future commits will allow us to
eliminate the {-no_fsync} and {-dangerous} fields and allow
us to more easily support new switches for toggling CoW/No_COW,
and other options for SQLite and Xapian tuning.
Eric Wong [Tue, 26 Aug 2025 19:50:44 +0000 (19:50 +0000)]
t/ipc: improve test reliability
->wq_close works asynchronously, nowadays, so processes may
not be completely done writing warnings when the parent process
reads the file. We'll test some untested methods after test_die
to ensure the worker has had time to write the exception to the
warning log. Finally, we'll explain warning contents on failure
in case it happens again.
Eric Wong [Tue, 26 Aug 2025 19:50:43 +0000 (19:50 +0000)]
lei_saved_search: avoid //= on complex hashref assignment
Setting $self->{oidx} via `//=' may be unsafe due to
->lock_for_scope setting $self->{lock_fh}. While this is not
known to cause problems at the moment, it may be problematic in
future as we've had to deal with subtle bugs in similar code in
the past.
Eric Wong [Tue, 26 Aug 2025 19:50:42 +0000 (19:50 +0000)]
init: store Getopt::Long options in hashref
We will pass this options hashref freely across internal
backends, so using a hashref is more consistent with the rest
of our codebase and allows eliminating some local variables.
Eric Wong [Tue, 26 Aug 2025 19:50:41 +0000 (19:50 +0000)]
extindex|v2: defrag SQLite and Xapian DBs on btrfs
Doing periodic defrags ought to improve performance and perhaps
allow CoW to be usable with btrfs. The autodefrag mount option
of btrfs(5) doesn't seem recommended by btrfs developers since
it's too aggressive, defragments too often, and wears out devices.
Performing defrag on our end should allow users to tune a more
ideal defrag interval to maintain performance while avoiding
excessive device wear.
Eric Wong [Tue, 26 Aug 2025 19:50:40 +0000 (19:50 +0000)]
extindex: reduce IPC and Xapian updates on reindex
Instead of updating the document and re-adding eidx keys +
List-IDs repeatedly, we can do it at once. Doing so reduces
IPC traffic and ought to reduce FS traffic on the Xapian DB.
Eric Wong [Tue, 26 Aug 2025 19:50:39 +0000 (19:50 +0000)]
v2writable: checkpoint: clarify $dbh is for msgmap
We overload the checkpoint sub for -extindex, however only v2 has
msgmap.sqlite3 so we'll avoid confusing ourselves with a
generic `$dbh' name in favor of `$mm_dbh'.
Eric Wong [Fri, 22 Aug 2025 20:40:25 +0000 (20:40 +0000)]
*index: ignore misordered References if In-Reply-To exists
Fix our indexers to favor In-Reply-To during the Message-ID
unique-fication phase. Some MUAs will generate an incorrect
References header ordering and put its direct In-Reply-To at the
head of the References list instead of at the tail.
Having the same Message-ID in both References and In-Reply-To
inadvertently caused the In-Reply-To value to get dropped by the
`uniq_mids' subroutine. To fix this, we reverse the values
before `uniq_mids' and reverse the result again since
PublicInbox::SearchThread (and Mail::Thread before) depends on
the In-Reply-To being last.
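The reverse/dedupe/reverse trick can be sketched in Python as follows. This is not the project's Perl; the first-occurrence-wins behavior of `uniq_mids' is inferred from the description above, and `thread_refs' is a hypothetical helper name:

```python
def uniq_mids(mids):
    """Drop duplicate Message-IDs, keeping the first occurrence."""
    seen = set()
    out = []
    for mid in mids:
        if mid not in seen:
            seen.add(mid)
            out.append(mid)
    return out


def thread_refs(references, in_reply_to):
    """Deduplicate References + In-Reply-To, keeping In-Reply-To last."""
    mids = list(references)
    if in_reply_to:
        mids.append(in_reply_to)
    # Reversing first makes the trailing In-Reply-To the "first
    # occurrence"; reversing again restores the original direction.
    return uniq_mids(mids[::-1])[::-1]
```

For a misordered header such as References: <irt> <a> <b> with In-Reply-To: <irt>, plain `uniq_mids' on the combined list gives [irt, a, b], whereas `thread_refs' gives [a, b, irt].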
An expensive(*) --rethread --reindex will be required to fix
this on the existing data set. I've only tested this change
with a three message inbox consisting of the following
Message-IDs:
Eric Wong [Tue, 19 Aug 2025 00:33:40 +0000 (00:33 +0000)]
v2+extindex: show commit time and indexing rate in progress
With the `-v' switch, we'll display these rates to track total
indexing rate and commit speeds throughout the indexing phase.
These numbers will help us monitor for slowdowns throughout the
entirety of a large indexing job taking several days. This
change may help us decide whether or not to start implementing
autodefrag for btrfs and similar CoW FSes prone to performance
degradation from fragmentation.
Eric Wong [Tue, 19 Aug 2025 00:33:39 +0000 (00:33 +0000)]
extsearchidx: distinguish between binary and hex OIDs
Meaningful variables make it easier for hackers to understand
the code, since this is one of the few places where we freely
mix human-readable hex and binary representations of OIDs.
Eric Wong [Tue, 19 Aug 2025 00:33:37 +0000 (00:33 +0000)]
t/ipc.t: streamline dependency check
We should not be checking for Socket::Msghdr or Inline::C
presence since there is now a 3rd way to support SCM_RIGHTS
passing with the syscall, pack, and unpack perlops.
Eric Wong [Tue, 19 Aug 2025 00:33:36 +0000 (00:33 +0000)]
spawn: add note about dropping SCM_RIGHTS in Inline::C
Inline::C seems more difficult to support in a hypothetical
alternative implementation of Perl which maintains and
distributes implementation-specific ports of XS modules.
Eric Wong [Tue, 19 Aug 2025 00:33:34 +0000 (00:33 +0000)]
ipc: get rid of -ipc_ppid field in favor of fork_gen
We own all the `fork' calls in our codebase, so rely on the
existing $PublicInbox::OnDestroy::fork_gen instead of using `$$'
to call getpid(2) everywhere. getpid(2) is slower nowadays
since it's always a syscall on modern glibc whereas it was
cached (somewhat unsafely) in the past.
Eric Wong [Tue, 19 Aug 2025 00:33:33 +0000 (00:33 +0000)]
ipc: use PublicInbox::IO::attach_pid instead of awaitpid
Blessing and attaching PIDs to PublicInbox::IO objects will give
us access to my_readline and my_bufread subs to allow for
asynchronous callbacks for error handling (similar to
PublicInbox::Git->cat_async).
Eric Wong [Fri, 15 Aug 2025 13:41:36 +0000 (13:41 +0000)]
extindex: preserve indexlevel=basic on incremental update
-extindex needs to preserve indexlevel=basic when doing
incremental updates if the extindex was originally created with
indexlevel=basic. Otherwise blindly upgrading somebody to
indexlevel=full would waste disk space and likely result in
inconsistent indexing on the Xapian side.
Fixes: bf2360b31 (extindex: support `-L basic' to avoid most Xapian space, 2025-08-13)
Eric Wong [Fri, 15 Aug 2025 13:41:35 +0000 (13:41 +0000)]
init|*index|convert: --wal enables Write-Ahead Log in SQLite
SQLite WAL <https://sqlite.org/wal.html> ought to improve
concurrency for readers when writers are active. While users
always had the option of enabling it with the sqlite3(1) tool,
having a `--wal' switch may encourage its use.
It's already on by default for lei/store and mail_sync.sqlite3
where readers and writers are the same user, but enabling it by
default for public-facing daemons would cause permissions
problems so it remains optional.
The key downside of WAL is readers (-(netd|httpd|imapd|nntpd|pop3d))
require write access to the directories containing over.sqlite3
and msgmap.sqlite3. In the past, public-facing read-only
daemons were not expected to ever have write permissions to
inbox and extindex directories, but WAL requires writability
since SQLite||DBD::SQLite currently doesn't provide a way to
guarantee the persistence of -wal and -shm files, especially
when accessed by sqlite3(1) or 3rd-party scripts.
The main advantage of supporting WAL is so the writers won't
block readers, avoiding busy timeouts on the read side and
reducing the need for writers to commit transactions
periodically to keep readers happy (currently a ridiculously
short 5s). We can investigate higher commit intervals (or
eliminating them entirely) for users of WAL.
WAL should also reduce fragmentation in btrfs and similar CoW
filesystems by reducing fsync(2) calls, but this remains to be
proven as measuring such changes takes a while and I don't
want to put more wear on my SSD.
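For reference, switching a SQLite database to WAL takes a single pragma. A minimal sketch using Python's sqlite3 module rather than DBD::SQLite; the helper name is an assumption:

```python
import sqlite3


def enable_wal(path):
    """Open `path' and switch it to write-ahead logging.

    Returns the connection and the journal mode SQLite reports back,
    which is "wal" on success.  The setting is persistent for
    on-disk databases.
    """
    conn = sqlite3.connect(path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode
```

Once enabled, SQLite keeps `-wal' and `-shm' files next to the database, which is why readers need write access to the containing directory.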
Eric Wong [Wed, 13 Aug 2025 21:42:48 +0000 (21:42 +0000)]
extindex: support `-L basic' to avoid most Xapian space
For users uninterested in being able to search message contents,
the `basic' indexlevel is now available.
The /misc/ Xapian index for data on individual inboxes (e.g. names,
descriptions, email addresses) remains even with `basic'; but
individual messages are not indexed.
Eric Wong [Wed, 13 Aug 2025 21:42:46 +0000 (21:42 +0000)]
extindex: reduce IPC for cross-posted messages
When dealing with cross-posted messages, we can reduce IPC
traffic by only sending the List-ID header(s) across the pipe
instead of all headers of a message. List-ID headers only take
up a small portion of all message headers so the IPC traffic
reduction can be helpful in saving memory bandwidth to improve
performance.
This change will also make it easier to journal work for Xapian
and perform all work for cross posted messages in the same
transaction. This ought to reduce doing updates for a given
document across transactions and hopefully reduce storage device
wear while improving performance.
Eric Wong [Sat, 9 Aug 2025 01:02:11 +0000 (01:02 +0000)]
lei: input: warn on `L:' and `t:' use consistently
When using `--in-format/-F', we need to bail out on the /\A(?:L|kw):/
check as early as possible before checking for `$urlish_scheme:' vs.
`--in-format' conflicts. Otherwise, we'd get a confusing error
message such as: --in-format=mboxrd and `l:' conflict
Eric Wong [Thu, 7 Aug 2025 00:50:05 +0000 (00:50 +0000)]
nodatacow: warn about CoW being disabled on btrfs
While Xapian and SQLite performance is untenable with CoW on
btrfs, disabling CoW is not without caveats on btrfs. So warn
users about the possibility of data corruption.
Eric Wong [Thu, 7 Aug 2025 00:49:31 +0000 (00:49 +0000)]
*index: don't try to index boolean terms >245 bytes long
Xapian's flint, chert, and glass backends only support a maximum
term length of 245 bytes including the upper-case term prefix.
Thus we can't index things like insanely long Message-IDs in
References: or overly long pathnames or newsgroup names.
The work-in-progress honey backend still doesn't support
updates, yet, (only created from an existing glass DB), and I
doubt the limit will increase since excessive term lengths are
usually a mistake of mangled whitespace or a broken/spammy
client somewhere.
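The length check amounts to counting bytes of the prefixed term. A Python sketch under the assumption that terms are UTF-8 encoded; the function name and the `Q' prefix in the usage note are illustrative only:

```python
MAX_TERM_BYTES = 245  # limit stated above, including the term prefix


def indexable_term(prefix, value):
    """True if prefix + value fits within the term length limit.

    The limit applies to bytes, not characters, so multi-byte UTF-8
    sequences count in full.
    """
    return len((prefix + value).encode("utf-8")) <= MAX_TERM_BYTES
```

For example, a one-byte prefix such as `Q' leaves 244 bytes for the value, so an overly long Message-ID would be skipped rather than indexed.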
Eric Wong [Fri, 18 Jul 2025 01:39:42 +0000 (01:39 +0000)]
psgi_rproxy: standardize on 64K I/O size
Much of our bulk I/O in DS and IO.pm already uses 64K buffers
(matching the Linux default pipe size) instead of 16K buffers.
While larger I/O sizes can result in more work for the malloc
implementation, they can also reduce syscall and Perl function
call overhead to reduce CPU usage and improve throughput on long
fat networks (LFNs). Furthermore, standardizing on larger sizes
ought to reduce fragmentation since malloc can avoid splitting
existing buffers used for bulk I/O we do in other places.
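The effect of the buffer size on call counts can be seen with a simple copy loop. This is a Python sketch, not the DS or IO.pm code; the names are assumptions:

```python
IO_SIZE = 65536  # 64K, matching the Linux default pipe size


def copy_stream(src, dst, bufsize=IO_SIZE):
    """Copy src to dst in bufsize chunks.

    Returns (bytes_copied, reads_with_data) so the number of read
    calls can be compared across buffer sizes.
    """
    total = 0
    reads = 0
    while True:
        chunk = src.read(bufsize)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
        reads += 1
    return total, reads
```

Copying 200000 bytes from a source which fills every read takes 4 reads at 64K versus 13 at 16K.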
Eric Wong [Sun, 13 Jul 2025 00:08:36 +0000 (00:08 +0000)]
contrib: PSGI RejectBots middleware
RejectBots rejects certain user-agents and forces meta-refresh
while requiring the use of persistent connections. While the
first two techniques are relatively well-known, the persistent
connection requirement requires the public-inbox-(netd|httpd) to
be directly connected with the remote client without something
like (haproxy|nginx) in front of it.
This middleware is used by public-inbox.org and the
yhbt.net/lore mirror.
Eric Wong [Sun, 13 Jul 2025 00:08:35 +0000 (00:08 +0000)]
http: introduce a reverse proxy PSGI app
While nginx is (and is likely to remain) a popular reverse
proxy, it is a commercial open core project nowadays which I
haven't used in production in over a decade (since I used a Ruby
reverse proxy, instead). Furthermore, nginx|haproxy|etc. is
another dependency sysadmins need to configure and maintain in
addition to our daemons (and varnish).
Since our Perl code already needs to deal with IMAP scraper
bots, it might as well deal with HTTP(S) ones as well :P
Notable advantages of PublicInbox::PsgiRproxy over nginx for
HTTPS termination include:
* Response buffering is lazy by default. In contrast, (last
I checked) nginx either (a) buffers a response body in full
before sending it or (b) doesn't buffer at all. (a) is safe
for reducing backend contention but increases latency, whereas
(b) allows slow clients to bottleneck fast backends.
PublicInbox::PsgiRproxy defaults to writing to the client
immediately (unbuffered), but starts buffering when a client
can't keep up with the backend.
* Perl configuration allows easier development of custom modules
("Plack middleware" in our case) for inspection of requests in
Perl. It gives the PSGI env a way to detect reused connections
and the $env->{'psgix.io'} handle can be used for TCP/TLS
fingerprinting of suspected bots.
* Trailers following chunked requests are supported (but mapped to
headers for the backend). Response trailers are not yet
supported since browsers (even bloated "modern" ones) can't
seem to handle them.
This PSGI reverse proxy has been fronting https://public-inbox.org
and the https://yhbt.net/lore/ mirror for over a month, now.
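The lazy buffering behavior described in the first bullet can be sketched as follows. This is a simplified Python model, not PublicInbox::PsgiRproxy itself; the class name and the `send' callable (returning how many bytes the client accepted) are assumptions:

```python
from collections import deque


class LazyBufferedWriter:
    """Write through immediately; buffer only once the client lags."""

    def __init__(self, send):
        self.send = send  # callable(bytes) -> bytes accepted (may be 0)
        self.pending = deque()

    def write(self, data):
        if self.pending:  # already behind: queue to preserve ordering
            self.pending.append(data)
            return
        n = self.send(data)  # fast path: no buffering at all
        if n < len(data):
            self.pending.append(data[n:])  # client can't keep up

    def flush(self):
        """Call when the client is writable again; True once drained."""
        while self.pending:
            chunk = self.pending[0]
            n = self.send(chunk)
            if n < len(chunk):
                self.pending[0] = chunk[n:]
                return False
            self.pending.popleft()
        return True

    def buffered(self):
        return sum(len(c) for c in self.pending)
```

A fast client never touches `pending', so latency matches the unbuffered case; a slow client causes the backend's output to accumulate in the queue instead of stalling the backend.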
Eric Wong [Sun, 13 Jul 2025 00:08:34 +0000 (00:08 +0000)]
http: expose reused connection counter in PSGI env
$env->{'pi-httpd.request_nr'} is now accessible via the PSGI
env for middlewares to use to identify reused connections.
This can be useful for middlewares to implement rules for
identifying and rejecting/allowing certain clients (e.g.
aggressive bots).
Eric Wong [Tue, 8 Jul 2025 03:54:07 +0000 (03:54 +0000)]
lei_mail_sync: fix size check for Maildir||MH files
`-s $fh' returns undef only when a file is non-existent and zero
when it's empty. Thus we must use `||' to skip empty files.
Furthermore, `-s FILEHANDLE' is never undef on open handles.
Fixes: 5aab49f3 (lei: support reading MH for convert+import+index, 2023-12-29)