Eric Wong [Tue, 9 Jan 2024 11:39:23 +0000 (11:39 +0000)]
git: workaround occasional -watch error message
I'm not sure how this happens (perl 5.34.1 on FreeBSD 13.2)
but it appears the {sock} check can succeed and then go undef
and become unable to call ->owner_pid.
This happens when libgit2 is in use, so perhaps that's a factor.
In any case, the rest of the tests succeed.
Eric Wong [Mon, 1 Jan 2024 01:07:49 +0000 (01:07 +0000)]
view: always show strict|loose note w/ multi-roots
For thread skeletons with multiple roots, it makes sense to
note the strict|loose delineation even when the first message
matches the desired Message-ID.
Eric Wong [Mon, 1 Jan 2024 01:07:48 +0000 (01:07 +0000)]
over: re-sort Subject matches for WWW /T/ endpoint
When retrieving loose (Subject) matches for a thread, we wanted
the most recent matches in reverse chronological order.
However, when displaying the /T/ endpoint generating the thread
skeleton, we prefer ascending chronological order to match the
flow of the conversation.
Eric Wong [Fri, 29 Dec 2023 18:05:14 +0000 (18:05 +0000)]
lei: support reading MH for convert+import+index
The MH format is widely-supported and used by various MUAs such
as mutt and sylpheed, and a MH-like format is used by mlmmj for
archives, as well. Locking implementations for writes are
inconsistent, so this commit doesn't support writes, yet.
inotify|EVFILT_VNODE watches aren't supported, yet, but that'll
have to come since MH allows packing unused integers and
renaming files.
Eric Wong [Thu, 28 Dec 2023 04:23:00 +0000 (04:23 +0000)]
pure Perl inotify support
This is a step towards improving the out-of-the-box experience
in achieving notifications without XS, extra downloads, and .so
loading + runtime mmap overhead.
This also fixes loongarch support of all Linux syscalls due to
a bad regexp :x
All the reachable Linux architectures listed at
<https://portal.cfarm.net/machines/list/> should be supported.
At the moment, there appears to be no reachable sparc* Linux
machines available to cfarm users.
Fixes: b0e5093aa3572a86 (syscall: add support for riscv64, 2022-08-11)
Eric Wong [Sat, 16 Dec 2023 11:13:15 +0000 (11:13 +0000)]
lei index: support +L: labels
`lei index' should be capable of indexing the same way
`lei import' does, but without the indexing. I only noticed
this omission while developing a new feature.
Eric Wong [Fri, 15 Dec 2023 20:22:46 +0000 (15:22 -0500)]
searchidx: quiet down old git patchid
CentOS 7.x ships with git 1.8.5, so unless a CentOS 7.x user
enables 3rd-party repos[1], they'll be stuck with a version
of git without `--stable' (though I'm becoming skeptical of
indexing patchids at all).
Eric Wong [Fri, 15 Dec 2023 20:22:45 +0000 (15:22 -0500)]
tests: quiet uninitialized warnings on CentOS 7.x
Test::More distributed with Perl 5.16.3 on CentOS 7.x expects
the `$how_many' argument for `skip' and warns when it's
uninitialized, so quiet that warning down.
Eric Wong [Wed, 13 Dec 2023 00:50:19 +0000 (00:50 +0000)]
t/lei-import: relax EIO regexp
musl uses "I/O error" while glibc uses "Input/output error".
I wish something like strerrorname_np(3) were portable
and built into Perl so we could just match on /EIO/.
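
For illustration only, a relaxed match covering both libc spellings
might look like the Python sketch below. The pattern and the
mentions_eio name are assumptions; the regexp actually used in
t/lei-import is not shown in this log.

```python
import errno
import re

# Accept both the glibc ("Input/output error") and musl ("I/O error")
# strerror(3) text for EIO.  This pattern is an illustrative guess,
# not the one from the test suite.
EIO_RE = re.compile(r"\b(?:Input/output|I/O) error\b")


def mentions_eio(text: str) -> bool:
    """Return True if text contains either libc's EIO message."""
    return EIO_RE.search(text) is not None


# Python happens to expose the symbolic errno name, which is the kind
# of lookup the commit message wishes Perl had built in.
EIO_NAME = errno.errorcode[errno.EIO]
```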
Eric Wong [Wed, 13 Dec 2023 00:50:18 +0000 (00:50 +0000)]
gzip_filter: use OO ->zflush dispatch
While it's not in a code path intended for WwwCoderepo and RepoAtom
(those classes provide their own ->zflush), this can future-proof
our code against future subclasses at a minor performance cost.
Eric Wong [Wed, 13 Dec 2023 00:50:17 +0000 (00:50 +0000)]
www_coderepo: fix read buffering
Our read buffering only worked well with the stdout buffering on
glibc and *BSD libc, but not musl. When reading the stdout of
git(1), we are likely to get smaller buffers and require more
reads on musl-based systems (tested Alpine Linux 3.19.0).
Thus we must prevent ->translate from being called with an empty
argument list (denoting EOF). We'll also avoid some local
variable assignments while at it and favor the non-OO ->zflush
dispatch inside RepoAtom and WwwCoderepo subclasses.
Eric Wong [Wed, 13 Dec 2023 00:50:16 +0000 (00:50 +0000)]
t/convert-compact: allow S_ISGID bit
My user home directory on Alpine has S_ISGID set on it and every
subdirectory inherits it. This includes my work tree and the
t/data-gen/* subdirectories. So just ignore the presence (or
non-presence) of the S_ISGID bit on directories descended from
the cached t/data-gen/* directories.
Now, public-inbox-convert may want to preserve S_ISGID on the
newly-created v2 inbox, but that's a separate discussion.
Eric Wong [Wed, 13 Dec 2023 00:50:14 +0000 (00:50 +0000)]
install: updates for Alpine Linux and apk
Somewhat surprising that BSD::Resource hasn't been packaged for
Alpine, but otherwise pretty straightforward mapping with some
dependencies filled in manually.
Eric Wong [Wed, 13 Dec 2023 00:50:13 +0000 (00:50 +0000)]
xap_helper_cxx: support clang w/o `c++' executable
This makes the C++ build work on Alpine Linux (tested 3.19.0)
without having to install g++ to get the `c++' executable.
I've tested this change with and without g++ on Alpine so it'll
continue to work if a user decides to install g++.
This should continue to work if the Xapian package on Alpine is
changed to link against libc++ instead of libstdc++, since we
only add `-lstdc++' as a fallback. For reference, Xapian is
already linked against libc++ and not libstdc++ on FreeBSD 13.x.
Eric Wong [Wed, 13 Dec 2023 00:50:11 +0000 (00:50 +0000)]
treewide: avoid strftime %k for portability
The musl strftime(3) implementation on AlpineLinux 3.19.0
doesn't support `%k' and `%k' isn't in POSIX, either. So we
fall back to using the `sprintf' perlop in the user-facing UI
since leading zeroes require needless overhead for my eyes and
brain to parse in the time.
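
A rough Python equivalent of the sprintf approach, assuming an
hour:minute display in UTC (the hour_minute name and the exact
output format are hypothetical, not taken from the patch):

```python
import time


def hour_minute(epoch: int) -> str:
    """Format a UTC time with a space-padded hour, without strftime %k."""
    tm = time.gmtime(epoch)
    # "%2d" pads the hour with a space, which is what the non-POSIX
    # "%k" conversion does; "%02d" keeps the minute zero-padded.
    return "%2d:%02d" % (tm.tm_hour, tm.tm_min)
```

Only integer formatting is used, so the result does not depend on
which strftime(3) conversions the libc supports.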
Eric Wong [Wed, 13 Dec 2023 00:50:09 +0000 (00:50 +0000)]
tests: attempt compatibility w/ busybox lsof
BusyBox lsof(1) ignores the `-p PID' argument and shows
the open files for every process it knows about. BusyBox
lsof also lacks the `NODE' column of the non-BusyBox
implementation, so we'll rely on /proc/PID/fd/ in those
cases since the deleted file checks are Linux-only and
it's common to have procfs mounted on /proc on Linux.
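
The /proc/PID/fd/ fallback can be sketched in Python as below. It
relies on Linux appending " (deleted)" to the link target of an FD
whose file has been unlinked. The function name and the proc_root
parameter are assumptions made so the sketch can be exercised
without a real procfs; the test suite's own helper is not shown here.

```python
import os


def deleted_open_files(pid, proc_root="/proc"):
    """List link targets of PID's open FDs which are marked deleted."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    deleted = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the FD may have been closed while scanning
        if target.endswith(" (deleted)"):
            deleted.append(target)
    return sorted(deleted)
```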
Eric Wong [Wed, 13 Dec 2023 00:50:08 +0000 (00:50 +0000)]
t/cindex*: skip --join when join(1) is missing
While join(1) is POSIX, busybox on Alpine 3.19.0 does not
provide its functionality. So just skip tests for now since
it's too much trouble to provide a workaround for an otherwise
common POSIX command.
Eric Wong [Wed, 13 Dec 2023 00:50:07 +0000 (00:50 +0000)]
tests: account for missing git-http-backend
Alpine Linux ships git-http-backend in the `git-daemon'
package separately from `git', so we must test for its
existence before attempting to test functionality which
depends on it.
Eric Wong [Sun, 10 Dec 2023 13:42:52 +0000 (13:42 +0000)]
imap: replace Mail::Address fallback with AddressPP
Our pure-Perl (PublicInbox::AddressPP) fallback is closer to the
preferred Email::Address::XS (EAX) behavior than Mail::Address
is for ->name support. EAX tends to be overkill with good spam
filtering, and using our own fallback means life is easier for
users with neither C/XS build tools nor a pre-built EAX package.
Eric Wong [Fri, 8 Dec 2023 03:54:38 +0000 (03:54 +0000)]
cindex: switch --join to use dfpost7 by default
Post-image blob OIDs are what solver already works with, and
longer OIDs may not be available in historical mail archives.
`patchid' turns out to be unsuitable since:
1) git's default diff algorithm has changed over time
2) users may use different diff options to improve readability
Of course, we could eventually run `lei rediff' during the index
phase to regenerate patchids, but that's out-of-scope for now
and likely to be too expensive.
Eric Wong [Fri, 8 Dec 2023 03:54:37 +0000 (03:54 +0000)]
xap_helper: support term length limit
This will allow us to use p2q-compatible specifications such as
"dfpost7" to only capture blob OIDs which are 7 characters in
length (the indexer will always index down to 7 characters).
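
One way to picture a length-limited specification is the Python
sketch below. It assumes the limit is written as trailing digits on
the prefix name, as "dfpost7" suggests; split_spec and filter_terms
are hypothetical names, and the real parsing in xap_helper may differ.

```python
import re


def split_spec(spec):
    """Split e.g. "dfpost7" into ("dfpost", 7); no digits means no limit."""
    m = re.fullmatch(r"([A-Za-z_]+)(\d+)?", spec)
    if m is None:
        raise ValueError("bad specification: %r" % (spec,))
    limit = int(m.group(2)) if m.group(2) else None
    return m.group(1), limit


def filter_terms(terms, limit):
    """Keep only terms exactly `limit` characters long (all if None)."""
    return [t for t in terms if limit is None or len(t) == limit]
```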
Eric Wong [Fri, 8 Dec 2023 03:54:35 +0000 (03:54 +0000)]
xap_helper_cxx: drop chdir usage in build
While chdir simplifies path manipulation on our end, its use
falls over when PERL5LIB/@INC contains relative paths which need
to be made absolute. It's fewer lines of code to eliminate
chdir usage than it is to keep using relative paths in most
places.
Eric Wong [Thu, 7 Dec 2023 23:32:14 +0000 (23:32 +0000)]
workaround --headers bug with spamc(1)
As of SpamAssassin 4.0.0, spamc(1) corrupts messages with NUL in
the body when the `--headers' switch is used. This increases
transport costs, but most spamc/spamd setups are via local
sockets, so it's unlikely to be significant.
Eric Wong [Wed, 6 Dec 2023 21:12:25 +0000 (21:12 +0000)]
cindex: avoid recursion on prune
There's no need to recurse and trigger deep recursion warnings
when we hit a coderepo with a known hash (SHA-1 vs SHA-256).
Noticed while pruning the 1200+ repos on a git.kernel.org
mirror.
Eric Wong [Wed, 6 Dec 2023 21:12:24 +0000 (21:12 +0000)]
t/cindex: fix test when worktree PWD is a symlink
Our code aims to respect $ENV{PWD} (and therefore symlinks) as
much as possible to ensure portability across devices when repos
and indices are on portable or shared storage. Thus we can't
rely on Cwd::abs_path and ought to favor File::Spec->rel2abs
whenever absolute paths are required.
I noticed this when working on a VM where my worktree is a
symlink to a more reliable device.
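
The distinction can be illustrated in Python, where os.path.realpath
plays the role of Cwd::abs_path. The logical_abs helper below is a
hypothetical stand-in that joins against $PWD without resolving
symlinks; it is not a port of File::Spec->rel2abs.

```python
import os


def logical_abs(path, pwd=None):
    """Make path absolute relative to $PWD, leaving symlinks intact."""
    if os.path.isabs(path):
        return path
    # Assumption: fall back to getcwd() only when PWD is unset.
    base = pwd if pwd is not None else (os.environ.get("PWD") or os.getcwd())
    return os.path.join(base, path)
```

os.path.realpath on the same input would instead return the path on
the underlying device, which stops being valid once the symlink is
pointed at new storage.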
Eric Wong [Tue, 5 Dec 2023 09:46:23 +0000 (09:46 +0000)]
cindex: index full (40/64 char) hex blob OIDs
This future proofs the index against git auto-abbreviation
needing more characters as the repo grows. It'll be useful for
joining against inboxes using dfpre.
As with emails, we'll continue indexing abbreviated blob OIDs
down to 7 hex characters so a SHA-1 git repo will have all
abbreviations of the OID from 7-39 hex characters in addition
to the 40 character unabbreviated form.
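
As a quick illustration of the term set this describes (a Python
sketch with a hypothetical oid_terms helper, not the indexer's code):

```python
def oid_terms(oid_hex, min_len=7):
    """Return every prefix of oid_hex from min_len chars up to the full OID."""
    return [oid_hex[:n] for n in range(min_len, len(oid_hex) + 1)]
```

For a 40 character SHA-1 OID this gives 34 terms (lengths 7 through
40); for a 64 character SHA-256 OID it gives 58.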
Eric Wong [Fri, 1 Dec 2023 02:07:02 +0000 (02:07 +0000)]
t/xap_helper: make sendmsg errors more obvious
By ignoring SIGPIPE, we hit our own error path and emit an informative
error message instead of dying abruptly and requiring somebody to run
`echo $?' to see the child status from their shell.
Eric Wong [Thu, 30 Nov 2023 21:40:47 +0000 (21:40 +0000)]
codesearch: use retry_reopen for WWW
As with mail search, a cindex may be updated while WWW is
serving requests. Thus we must reopen the Xapian DB when
the revision we're using becomes stale.
Eric Wong [Thu, 30 Nov 2023 11:41:07 +0000 (11:41 +0000)]
inbox: shrink data structures for publicinbox.*.hide
We no longer vivify the intermediate $ibx->{-hide} hashref,
instead we use $ibx->{-hide_$KEY} directly. This avoids
an intermediate hashref and extra hash table lookups.
Eric Wong [Thu, 30 Nov 2023 11:41:06 +0000 (11:41 +0000)]
www_listing: support publicInbox.nameIsUrl
This is a convenient (and slightly memory-saving) alternative to
specifying a `publicinbox.*.url' entry for every single inbox
when using publicinbox.wwwListing.
Eric Wong [Thu, 30 Nov 2023 11:41:05 +0000 (11:41 +0000)]
git_async_cat: use git from "all" extindex if possible
For inboxes associated with an extindex (currently only the
special "all" one), we can share the git process across
all those inboxes unambiguously when retrieving full SHA-1
blobs.
The comment for my proposed patch is also out-of-date as that
git speedup has been a part of git since 2.33.
Eric Wong [Thu, 30 Nov 2023 11:41:04 +0000 (11:41 +0000)]
inbox: expire resources more aggressively
We no longer trigger git cleanups from the Inbox package since
`git cat-file' users have their own cleanup to support git
coderepos not associated with any inbox.
This change means we unconditionally expire SQLite and Xapian
FDs and some internal caches regardless of git activity. The
old logic was irrelevant to Gcf2 (libgit2) users anyways since
we couldn't determine whether or not an inbox was active based
on {inflight} git requests, and upcoming changes will make it
inaccurate for all extindex/cindex users as well.
Opening SQLite and Xapian DBs is fairly cheap; so it's a small
price to pay to reduce memory use and fragmentation.
Eric Wong [Thu, 30 Nov 2023 11:41:03 +0000 (11:41 +0000)]
cindex: speed up initial scan setup phase
This brings a no-op -cindex scan of a git.kernel.org mirror
down from 70s to 10s with a hot cache on a busy machine.
CPU-intensive SHA-256 fingerprinting of the `git show-ref'
result can be parallelized on shard workers. Future changes can
move more of the initial scan setup phase into shard workers for
more parallelism.
But most of the performance for skipping unchanged repos is
gained from delaying the commit time reading until we've seen
the fingerprint is out-of-date, since reading commit times
requires a large amount of I/O compared to only reading refs
for fingerprints.
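
The ordering described above can be sketched in Python as follows.
The function names, the stored_fp argument and the read_commit_times
callback are all hypothetical; only the idea of hashing the
`git show-ref' output with SHA-256 and deferring the expensive read
comes from the message.

```python
import hashlib


def refs_fingerprint(show_ref_output: bytes) -> str:
    """SHA-256 hex digest over the raw `git show-ref' output."""
    return hashlib.sha256(show_ref_output).hexdigest()


def scan_repo(show_ref_output, stored_fp, read_commit_times):
    """Return (fingerprint, commit_times or None if the repo is unchanged).

    read_commit_times() stands in for the I/O-heavy step and is only
    called once the fingerprint is known to be out-of-date.
    """
    fp = refs_fingerprint(show_ref_output)
    if fp == stored_fp:
        return fp, None  # unchanged: skip reading commit times
    return fp, read_commit_times()
```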
Eric Wong [Thu, 30 Nov 2023 11:41:01 +0000 (11:41 +0000)]
cindex: skip getpid guard for most OnDestroy use
We no longer fork after cidx_init, so there's no need to spend
CPU cycles on the getpid() syscall, especially since it's no
longer cached on glibc while syscalls are also more expensive
these days due to CPU vulnerability mitigations.
Eric Wong [Thu, 30 Nov 2023 11:40:58 +0000 (11:40 +0000)]
cindex: keep batch pipe for pruning SHA-256 repos
This fixes the case where we're running both SHA-256 and SHA-1.
There are no tests for SHA-256, yet, but the bug is pretty obvious
upon reading the code.
Eric Wong [Thu, 30 Nov 2023 11:40:56 +0000 (11:40 +0000)]
config: reject newlines consistently in dir names
Explicitly drop support for "\n" in git coderepo pathnames as
we do other stuff. Gcf2 (our libgit2 helper) was always
broken with "\n" in pathnames, and I'm not sure if cgit config
files work with them, either. Dealing with newline characters
requires extra complexity that I'm not willing to deal with when
managing alternates files.
Eric Wong [Thu, 30 Nov 2023 11:40:54 +0000 (11:40 +0000)]
cindex: fix store_repo+repo_stored on no-op
It's possible to update the fingerprint for a given repo when we
have no commits to index on because they were already done for
another repo. Thus we'll always vivify $repo_ctx->{active}
before calling store_repo since $active may've been undef.
Eric Wong [Tue, 28 Nov 2023 17:36:59 +0000 (17:36 +0000)]
www: mail_diff: fix optional address obfuscation
We need to load the proper package and fully-qualify the sub
call since we shouldn't load Hval in lei. Some users use this
feature even if it's broken, oh well :<
Eric Wong [Tue, 28 Nov 2023 14:56:26 +0000 (14:56 +0000)]
cindex: extra quit checks
We don't want to be accessing uninitialized variables on
process teardown since much of our control flow revolves
around DESTROY for dependency handling.
Eric Wong [Tue, 28 Nov 2023 14:56:25 +0000 (14:56 +0000)]
admin: resolve_git_dir respects symlinks
Absolute pathnames of git coderepos are stored in the cindex,
but we should favor paths relative to $ENV{PWD} since it
respects symlinks in the hierarchy.
Respecting symlinks makes it easier to migrate cindex to
new storage as old storage wears out and to relocate the
storage device onto another machine.
Eric Wong [Tue, 28 Nov 2023 14:56:23 +0000 (14:56 +0000)]
cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT'
Accepting @ARGV without switches ends up being ambiguous with
optional parameters for --join and --show. Requiring users to
specify `--join=' or `--show=' is a bit awkward (as it is with
-clone --objstore= and the like), but that is historical baggage
we need to carry at this point...
Eric Wong [Tue, 28 Nov 2023 14:56:22 +0000 (14:56 +0000)]
git: speed up ->git_path for non-worktrees
Only worktrees need to use `git rev-parse --git-path', so avoid
the spawn overhead of a new process. With the SolverGit.pm
limit on coderepo scans disabled and scanning over 800 git repos
for git@vger matches, this reduces xt/solver.t times by
roughly 25%.
Eric Wong [Tue, 28 Nov 2023 14:56:21 +0000 (14:56 +0000)]
www: load and use cindex join data
This is a major step in solving the problem of having to
manually associate hundreds/thousands of coderepos with
hundreds/thousands of public-inboxes to power solver
(and more).
Eric Wong [Tue, 28 Nov 2023 14:56:20 +0000 (14:56 +0000)]
hval: use File::Spec to make relative paths for href
File::Spec->abs2rel doesn't touch the filesystem at all when
given an absolute base arg ($env->{PATH_INFO}), so we can rely
on it to generate relative links that work with the `mount'
from Plack::Builder and also people running `wget -r' mirrors.
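
Python's posixpath.relpath behaves similarly: when both arguments
are absolute it works purely on the strings. The rel_href wrapper
below is a hypothetical illustration, not how Hval chooses its base.

```python
import posixpath


def rel_href(target, base):
    """Compute a relative link from base to target without filesystem access.

    Both arguments are expected to be absolute URL paths, so relpath
    never needs to consult the current working directory.
    """
    return posixpath.relpath(target, start=base)
```

For example, rel_href("/a/b/c", "/a/d") yields "../b/c".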
Eric Wong [Tue, 28 Nov 2023 14:56:19 +0000 (14:56 +0000)]
xap_helper: implement mset endpoint for WWW, IMAP, etc...
The C++ version will allow us to take full advantage of Xapian's
APIs for better queries, and the Perl bindings version can still
be advantageous in the future since we'll be able to support
timeouts effectively.
Eric Wong [Tue, 28 Nov 2023 14:56:18 +0000 (14:56 +0000)]
xap_helper.h: move cindex endpoints to separate file
It ought to help a bit with organization since xap_helper.h
is getting somewhat large and we'll need new endpoints to
support WWW, lei, and whatever else that needs to come.
Eric Wong [Tue, 28 Nov 2023 14:56:15 +0000 (14:56 +0000)]
t/cindex*: require SCM_RIGHTS for these tests
Code search will require SCM_RIGHTS, and Inline::C on BSDs
probably isn't too onerous a dependency for new features as
all the ones I've tested have it packaged.
Furthermore, requiring SCM_RIGHTS isn't far off since OpenBSD's
Perl is patched to route the `syscall' perlop through libc[1],
while NetBSD[2] and FreeBSD[3] actually do strive for backwards
compatibility. We'd just need to use the numbers and not rely
on syscall.ph shipped with Perl since the macro names themselves
are unstable.
Eric Wong [Tue, 28 Nov 2023 14:56:14 +0000 (14:56 +0000)]
test_common: create_*: detect changes in all parameters
Data::Dumper+B::Deparse seems fast enough to generate cache keys
with, so this makes updating and developing tests easier (as
opposed to forcing the developer to change the identifier). The
main downside is we'll have to deal with cache expiration, but
"make clean" seems overly aggressive already (it keeps blowing
away the clones made by t/cindex-join.t :<)
Eric Wong [Mon, 27 Nov 2023 22:20:59 +0000 (22:20 +0000)]
disallow NUL characters in Message-ID and List-Id
While MTAs seem to stop '\0' from appearing in headers, users
fetching archives via git remain susceptible to having '\0' land
in archives. So we'll filter them out of Xapian and SQLite DBs
to avoid interoperability problems with CLI tools since there are no
known messages in lore or any of my archives which feature them.
Avoiding '\0' will ensure all indexed Message-IDs and List-Ids
can be specified from the command-line (although some characters
will still require $(printf) contortions).
As with Message-ID, List-Id fields with /\n\t\r/ characters will
also be stripped for indexing. I will assume whatever went wrong
with the References: header in
<https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw>
could also happen to the List-Id header.
Eric Wong [Mon, 27 Nov 2023 10:23:48 +0000 (10:23 +0000)]
www: qs_html: fix escaping of `q' param
Our use of MID_ESC characters was only intended for the pathname
component of URIs and not appropriate for the query string
component. So use a different $unsafe parameter list for
uri_escape to make the result appropriate for query strings by
disallowing [\&\'\+=] characters. Most notably, this change
also allows us to accept `/' (slash) unescaped to make dfn: queries
nicer to look at.
Finally, we'll also add an ascii_html call on the URI-escaped
result as an extra safety measure even though it's not really
needed.
As far as I can tell, the code without this fix didn't result in
an HTML injection since all our uses of uri_escape did escape
angle brackets.
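
A Python approximation of the two-step escaping is sketched below,
using urllib.parse.quote in place of uri_escape and html.escape in
place of ascii_html. The qs_value_html name is hypothetical and the
set of characters escaped is Python's default rather than the
$unsafe list in the patch (Python also escapes `:', for instance).

```python
from html import escape
from urllib.parse import quote


def qs_value_html(q: str) -> str:
    """Escape a query-string value for use inside an HTML attribute."""
    # Step 1: URI-escape, leaving "/" alone so dfn: queries stay readable
    # while "&", "'", "+" and "=" are percent-encoded.
    uri = quote(q, safe="/")
    # Step 2: HTML-escape the result as an extra safety measure.
    return escape(uri, quote=True)
```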
Eric Wong [Mon, 27 Nov 2023 07:26:28 +0000 (07:26 +0000)]
t/nntpd-tls: avoid test failure on OpenBSD 7.3
The LibreSSL 3.7.2 on my OpenBSD 7.3 VM seems to return 7 bytes of
junk data before EOF/ECONNRESET when a client attempts to write
plain-text to a TLS socket.
Eric Wong [Mon, 27 Nov 2023 04:05:47 +0000 (04:05 +0000)]
xap_helper.h: avoid some off_t vs size_t problems
We'll introduce a helper to cast off_t to size_t consistently
for mmap/munmap/calloc calls which require size_t. Also, an
extra check for multiplication overflow can be helpful just
in case we end up with a gigantic roots file.
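
The checks can be modelled in Python as below. This is only a model
of the C logic: SIZE_MAX assumes a 64-bit size_t, and off2size and
checked_mul are hypothetical names, not the helpers in xap_helper.h.

```python
SIZE_MAX = 2**64 - 1  # assumption: size_t is 64 bits wide


def off2size(off: int) -> int:
    """Validate a (signed) off_t value before treating it as size_t."""
    if off < 0 or off > SIZE_MAX:
        raise OverflowError("offset %d does not fit in size_t" % off)
    return off


def checked_mul(nmemb: int, size: int) -> int:
    """Return nmemb * size, refusing products which would wrap size_t."""
    # Dividing instead of multiplying avoids the overflow being tested
    # for, which is the usual way to write this check in C.
    if nmemb != 0 and size > SIZE_MAX // nmemb:
        raise OverflowError("nmemb * size overflows size_t")
    return nmemb * size
```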
Eric Wong [Sun, 26 Nov 2023 20:07:45 +0000 (20:07 +0000)]
xap_helper: avoid strerror(3) inside signal handler
It's not async-signal-safe and the glibc implementation uses
malloc via asnprintf. Practically it's not a problem unless the
kernel OOMs and the write(2) fails to the self-pipe.
Eric Wong [Sun, 26 Nov 2023 21:08:01 +0000 (21:08 +0000)]
drop redundant calls to DS->Reset
Reset gets called on END{} anyways to workaround DBI lifetime
problems, so there's no need to call it near exit. We can't
replace calls to POSIX::_exit with `exit' to force END{} to
run just yet, as there are still some lingering destruction
ordering problems on newer DBI and/or Perls.
Eric Wong [Sun, 26 Nov 2023 02:11:04 +0000 (02:11 +0000)]
git: improve coupling with {sock} and {inflight} fields
While the {inflight} array should be tied to the IO object even
more tightly, that's not an easy task with our current code. So
take some small steps by introducing a gcf_inflight helper to
validate the ownership of the process and to drain the inflight
array via the awaitpid callback.
This hopefully fixes problems with t/lei-q-save.t (still) hanging
occasionally on v2 outputs since git->cleanup/->DESTROY was getting
called in v2 shard workers.
Eric Wong [Sat, 25 Nov 2023 20:54:35 +0000 (20:54 +0000)]
ds: long_step: eliminate redundant fileno call
We already stash the associated FD for reporting at startup and
don't need to call `fileno' again. Found via manual code
inspection while considering the effort to make async {forward}
from PublicInbox::HTTP more like the generic long_response API
and {long_cb} field used by IMAP/NNTP/POP3.
Eric Wong [Sat, 25 Nov 2023 20:54:34 +0000 (20:54 +0000)]
select+poll: have caller retry on EINTR
We can't assume signals are blocked when neither signalfd nor
EVFILT_SIGNAL are in use. So just return an empty result so
the caller can recalculate the timeout.
I found this bug while making xt/httpd-async-stream.t
use our event loop to reap processes but have abandoned
that effort for now since it didn't save any code.
Eric Wong [Sat, 25 Nov 2023 20:54:33 +0000 (20:54 +0000)]
http: fix pipelining during long async requests
We must not attempt to read request bodies from the HTTP client
while processing a long request since that drains pipelined
requests. The NNTP/IMAP/POP3 event_step callbacks follow the
same behavior when {long_cb} is present from ->long_response.
This bug has little real-world consequence since HTTP/1.1
pipelining is not widely-used, especially when behind varnish
or other reverse proxies.
I found this bug while randomly strace-ing an active -netd
process to see the kind of traffic it was seeing.
Eric Wong [Sat, 25 Nov 2023 01:52:25 +0000 (01:52 +0000)]
examples/unsubscribe.milter: limit scope of munging
We don't want the milter to munge List-Unsubscribe headers from
external (incoming) mlmmj lists, only lists hosted on the server
running unsubscribe.milter.
Adding support for an allow_domains file should've been enough,
but this further restricts the milter to only operating on Postfix
connections from localhost.
Eric Wong [Fri, 24 Nov 2023 09:53:46 +0000 (09:53 +0000)]
cindex: fix --join=reset and speed up incremental joins
`reset' means we want to ignore existing join data, while
the default (non-reset) means we perform an incremental
join while taking into account existing (fuzzy) join data.