Eric Wong [Thu, 30 Mar 2023 11:29:51 +0000 (11:29 +0000)]
www: support POST /$INBOX/$MSGID/?x=m&q=
This allows filtering the contents of any existing thread using
a search query. It uses the existing THREADID column in Xapian
so we can internally add a Xapian OP_FILTER to the results.
This new functionality is orthogonal to the existing `t=1'
parameter which gives mairix-style thread expansion. It doesn't
make sense to use `t=1' with this functionality, but it's not
disallowed, either.
The indentation change in Over->next_by_mid is to ensure
DBI->prepare_cached can share across both ->next_by_mid
and ->mid2tid.
I also noticed the existing regex for `POST /$INBOX/?x=m&q=' was
allowing extra characters. With an added \z, it's now as strict
as originally intended, and AFAIK nothing was generating invalid
URLs for it.
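A minimal sketch of why the trailing anchor matters, written in Python rather than public-inbox's Perl. The route pattern below is hypothetical and is not the actual regex used by WWW; Python's `\Z` stands in for Perl's `\z`.

```python
import re

# Hypothetical route patterns, NOT the real public-inbox regexes.
LOOSE = re.compile(r"/(?P<inbox>[\w.-]+)/\?x=m&q=")
# \Z (Python) anchors at the very end of the string, like \z in Perl.
STRICT = re.compile(r"/(?P<inbox>[\w.-]+)/\?x=m&q=\Z")

def accepts(pattern, url):
    """Return True if `pattern` matches starting at the beginning of `url`."""
    return pattern.match(url) is not None

good = "/meta/?x=m&q="
bad = "/meta/?x=m&q=&junk=1"
```

Without the end anchor, the loose pattern accepts `bad` because a prefix match is enough; the strict pattern rejects it.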
Eric Wong [Wed, 29 Mar 2023 20:32:59 +0000 (20:32 +0000)]
cindex: interleave prune with indexing
We need to ensure we don't block indexing for too long while
pruning, since pruning coderepos seems more frequent and
necessary than inbox repos due to the prevalence of force
pushes with branches like `seen' (formerly `pu') in git.git.
Implement this via ->event_step and requeue mechanisms of DS so
we periodically flush our work and let indexing resume.
I originally wanted to implement this as a dedicated group
of workers, but the XS Search::Xapian bug[1] workaround
to handle uncaught C++ exceptions was expensive and complex
compared to the evented mechanism.
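The interleaving idea above can be sketched roughly as follows. This is a Python toy, not the DS ->event_step/requeue API: the `Prune` class, `index_step` helper, and batch size are all hypothetical names chosen for illustration.

```python
from collections import deque

class Prune:
    """Prune at most `batch` items per step, then requeue if work remains."""
    def __init__(self, items, batch):
        self.items = deque(items)
        self.batch = batch

    def __call__(self, queue, log):
        for _ in range(min(self.batch, len(self.items))):
            log.append("prune " + self.items.popleft())
        if self.items:
            queue.append(self)  # requeue so already-queued indexing runs first

def index_step(name):
    """Return a one-shot step that records indexing `name`."""
    def step(queue, log):
        log.append("index " + name)
    return step

def run(queue):
    """Run queued steps in FIFO order until the queue is empty."""
    log = []
    while queue:
        queue.popleft()(queue, log)
    return log

log = run(deque([Prune(["c1", "c2", "c3"], batch=2), index_step("repo.git")]))
```

Here the indexing step runs between the two prune batches instead of waiting for pruning to finish.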
Eric Wong [Tue, 28 Mar 2023 02:59:01 +0000 (02:59 +0000)]
cindex: simplify some internal data structures
We'll rely more on local-ized `our' globals rather than
hashref fields. The former is more resistant to typos
and can be checked at compile-time earlier via `perl -c'.
The {-internal} field is also renamed to {-cidx_internal}
to reduce confusion within a large code base.
Eric Wong [Tue, 28 Mar 2023 11:12:36 +0000 (11:12 +0000)]
inotify: wrap with informative error message
As encountered by Louis DeLosSantos, Linux inotify is capped by
a lesser-known limit than the standard RLIMIT_NOFILE (`ulimit -n`)
value. Give the user a hint about the fs.inotify.max_user_instances
sysctl knob on EMFILE, since EMFILE alone may mislead users into
thinking they've hit the (typically higher) RLIMIT_NOFILE limit.
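A rough Python sketch of the hint described above. The function name and message wording are hypothetical and do not reproduce the actual public-inbox error text; only the errno check and the sysctl name come from the description.

```python
import errno
import os

def inotify_error_hint(err):
    """Build a message for a failed inotify setup; `err` is an errno value."""
    msg = "inotify_init: " + os.strerror(err)
    if err == errno.EMFILE:
        # EMFILE here may come from the inotify instance limit,
        # not from RLIMIT_NOFILE, so point the user at the sysctl.
        msg += " (consider raising the fs.inotify.max_user_instances sysctl)"
    return msg
```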
Eric Wong [Sun, 26 Mar 2023 10:52:46 +0000 (10:52 +0000)]
watch: do not recreate signalfd on SIGHUP
The normal method by which PublicInbox::DS::event_loop sets up
signals once needs some coercing to work with -watch.
Otherwise, we'll end up wasting FDs every time somebody reloads
-watch via SIGHUP.
Eric Wong [Sun, 26 Mar 2023 09:35:43 +0000 (09:35 +0000)]
Merge branch 'cindex'
* cindex: (29 commits)
cindex: --prune checkpoints to avoid OOM
cindex: ignore SIGPIPE
cindex: respect existing permissions
cindex: squelch incompatible options
cindex: implement reindex
cindex: add support for --prune
cindex: filter out non-existent git directories
spawn: show failing directory for chdir failures
cindex: improve granularity of quit checks
cindex: attempt to give oldest commits lowest docids
cindex: truncate or drop body for over-sized commits
cindex: check for checkpoint before giant messages
cindex: implement --max-size=SIZE
sigfd: pass signal name rather than number to callback
cindex: handle graceful shutdown by default
cindex: drop `unchanged' progress message
cindex: show shard number in progress message
cindex: implement --exclude= like -clone
ds: @post_loop_do replaces SetPostLoopCallback
cindex: use DS and workqueues for parallelism
...
Eric Wong [Sat, 25 Mar 2023 11:11:05 +0000 (11:11 +0000)]
lei_store: avoid redundant work on no-op worker spawn
While ->wq_workers_start is idempotent, the pipe creation for
PublicInbox::LeiStoreErr was not and required several extra
syscalls and FD allocations. Check the correct field required
for SOCK_SEQPACKET workers rather than pipe-based workers.
Fixes: cbc2890cb89b81cb ("lei/store: use SOCK_SEQPACKET rather than pipe")
Eric Wong [Tue, 21 Mar 2023 23:07:42 +0000 (23:07 +0000)]
cindex: respect existing permissions
For internal ($GIT_DIR/public-inbox-cindex) Xapian DBs, we can
rely on core.sharedRepository. For external ones, we'll just
rely on existing permissions if the directory already exists.
Eric Wong [Tue, 21 Mar 2023 23:07:40 +0000 (23:07 +0000)]
cindex: implement reindex
For now, this allows changing --indexlevel. In the future, it will
let us fix yet-to-be-discovered bugs or make backwards-compatible
improvements.
Eric Wong [Tue, 21 Mar 2023 23:07:39 +0000 (23:07 +0000)]
cindex: add support for --prune
This gets rid of both inaccessible commits AND repositories.
It will only unindex commits which are pruned in git, first,
so repos with auto GC disabled will need GC to prune them.
Eric Wong [Tue, 21 Mar 2023 23:07:37 +0000 (23:07 +0000)]
spawn: show failing directory for chdir failures
Our use of `git rev-parse --git-dir' depends on our (v)fork+exec
wrapper doing chdir, so the error message is required to avoid
user confusion. I'm still avoiding `git -C $DIR' for now since
ancient versions of git did not support it.
Eric Wong [Tue, 21 Mar 2023 23:07:35 +0000 (23:07 +0000)]
cindex: attempt to give oldest commits lowest docids
Monotonically increasing docids may help us avoid sorting output
for the web and CLI, since recent commits are generally the most
desired search results.
`git log --reverse' incurs no extra overhead in this case, since
`--stdin' will mean git buffers the commit list in memory before
attempting to emit anything.
Eric Wong [Tue, 21 Mar 2023 23:07:30 +0000 (23:07 +0000)]
cindex: handle graceful shutdown by default
While individual Xapian shards are consistent due to the use of
Xapian transactions, the data across shards still needs to be
in a consistent state for our search to work.
Eric Wong [Tue, 21 Mar 2023 23:07:26 +0000 (23:07 +0000)]
ds: @post_loop_do replaces SetPostLoopCallback
This allows us to avoid repeatedly using memory-intensive
anonymous subs in CodeSearchIdx where the callback is assigned
frequently. Anonymous subs are known to leak memory in old
Perls (e.g. 5.16.3 in enterprise distros) and are still expensive in
newer Perls. So favor the (\&subroutine, @args) form which
allows us to eliminate anonymous subs going forward.
Only CodeSearchIdx takes advantage of the new API at the moment,
since it's the biggest repeat user of post-loop callback
changes.
Getting rid of the subroutine and relying on a global `our'
variable also has two advantages:
1) Perl warnings can detect typos at compile-time, whereas the
(now gone) method could only detect errors at run-time.
2) `our' variable assignment can be `local'-ized to a scope.
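As a loose analogy in Python (not the PublicInbox::DS API), the (\&subroutine, @args) form amounts to storing a named function together with its arguments instead of allocating a fresh closure on every assignment. All names below are hypothetical, and the "return true while work remains" semantics are illustrative only.

```python
# Hypothetical stand-in for a post-loop callback slot:
# either None or a tuple of (function, arg1, arg2, ...).
post_loop_do = None

def work_remains(pending, minimum):
    """Return True while more than `minimum` items are still pending."""
    return len(pending) > minimum

def run_post_loop():
    """Invoke the stored callback, if any, with its stored arguments."""
    if post_loop_do is None:
        return False
    func, *args = post_loop_do
    return func(*args)

pending = ["shard0", "shard1"]
# A named function plus arguments; no new closure is created here.
post_loop_do = (work_remains, pending, 0)
```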
Eric Wong [Tue, 21 Mar 2023 23:07:25 +0000 (23:07 +0000)]
cindex: use DS and workqueues for parallelism
This avoids forking new shard processes for each repo we scan,
but we can't avoid many excessive commits since we need to
ensure the `seen()' sub can avoid excessive work.
Eric Wong [Tue, 21 Mar 2023 23:07:23 +0000 (23:07 +0000)]
cindex: use read-only shards during prep phases
No need to open shards for read/write access when read-only
will do. Since we also control how a document gets sharded,
we'll also access the shard directly instead of letting Xapian
do the mappings.
--reindex didn't work properly before this change since it was
over-indexing. It is now broken in the opposite way in that it
doesn't do reindexing at all. --reindex will be implemented
properly in the future.
Eric Wong [Tue, 21 Mar 2023 23:07:22 +0000 (23:07 +0000)]
cindex: parallelize prep phases
Listing refs, fingerprinting and root scanning can all be
parallelized to reduce runtime on SMP systems.
We'll use DESTROY-based dependency management with
parallelization as in LeiMirror to handle ref listing and
fingerprinting before serializing Xapian DB access to check
against the existing fingerprint.
We'll also delay root listing until we get a fingerprint
mismatch to speed up no-op indexing.
Eric Wong [Tue, 21 Mar 2023 23:07:21 +0000 (23:07 +0000)]
codesearch: initial cut w/ -cindex tool
It seems relying on root commits is a reasonable way to
deduplicate and handle repositories with common history.
I initially wanted to shoehorn this into extindex, but decided a
separate Xapian index layout capable of being EITHER external to
handle many forks or internal (in $GIT_DIR/public-inbox-cindex)
for small projects is the right way to go.
Unlike most existing parts of public-inbox, this relies on
absolute paths of $GIT_DIR stored in the Xapian DB and does not
rely on the config file. We'll be relying on the config file to
map absolute paths to public URL paths for WWW.
Eric Wong [Sat, 25 Mar 2023 02:08:52 +0000 (02:08 +0000)]
ipc: retry sendmsg + recvmsg calls on EINTR
I'm not sure how this went undetected for so long, but EINTR
must be checked for when working with blocking sockets. EINTR
shouldn't happen for non-blocking sockets, but it's
easier to just use the new wrapper in most of those places.
I don't know what I was smoking when I left out EINTR checks :x
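The retry pattern can be sketched as below, in Python for illustration only. Note that CPython 3.5+ already retries most system calls on EINTR by itself (PEP 475), so this wrapper only shows the shape of what a Perl wrapper has to do; `retry_eintr` and `flaky_recv` are hypothetical names.

```python
import errno

def retry_eintr(func, *args):
    """Call func(*args) again whenever it is interrupted by a signal."""
    while True:
        try:
            return func(*args)
        except InterruptedError:  # raised for errno EINTR
            continue

# Fake receiver that is "interrupted" twice before succeeding.
calls = {"n": 0}

def flaky_recv():
    calls["n"] += 1
    if calls["n"] < 3:
        raise InterruptedError(errno.EINTR, "Interrupted system call")
    return b"ok"

result = retry_eintr(flaky_recv)
```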
Eric Wong [Wed, 15 Mar 2023 21:47:56 +0000 (21:47 +0000)]
ds: reap_pids: remove redundant signal blocking
Blocking signals when reaping was done when the lei pager was
spawned by the daemon in b90e8d6e02. Shortly afterwards in
7b79c918a5, the client script took over spawning of the pager
and made b90e8d6e02 redundant.
cf. b90e8d6e02 (ds: block signals when reaping, 2021-01-10)
    7b79c918a5 (lei: run pager in client script, 2021-01-10)
Eric Wong [Wed, 15 Feb 2023 22:20:23 +0000 (22:20 +0000)]
lei_mirror: use fetch.hideRefs to speed up connectivity check
`git fetch' runs an expensive connectivity check against all
refs, which is unnecessarily expensive for incremental fetches
on RAM-constrained systems.
This depends on the proposal to support `fetch.hideRefs' for `git fetch':
https://public-inbox.org/git/20230212090426.M558990@dcvr/
Eric Wong [Mon, 13 Mar 2023 12:00:23 +0000 (12:00 +0000)]
lei_mirror: handle UTF-8 from manifest.js.gz properly
This should ensure we display the "git config gitweb.owner
$OWNER" command invocation properly and also ensures we set the
description properly without triggering wide character warnings.
Also tested with a smallish iproute2 repo
(/pub/scm/linux/kernel/git/toke/iproute2.git) using my mirror:
Anyways, I'm fairly certain this change and its tests are
correct; but I still struggle to understand Perl's approach to
Unicode and its interactions with various JSON implementations.
Fixes: 0830817c132cb105 ("lei_mirror: show non-ASCII owner properly w/ --verbose")
Eric Wong [Mon, 13 Mar 2023 12:00:22 +0000 (12:00 +0000)]
lei_mirror: do not fetch to read-only directories
As with public-inbox-fetch, we shouldn't waste time fetching
into read-only directories, since --epoch= will make unwanted
epoch directories read-only placeholders.
Eric Wong [Wed, 8 Mar 2023 11:02:58 +0000 (11:02 +0000)]
lei_mirror: unlink FETCH_HEAD when fetching forkgroups
Apparently, --no-write-fetch-head is broken in current git[1].
It also wasn't in older git, at all. So just unlink FETCH_HEAD
as we see it, but keep using --no-write-fetch-head to avoid the
syscall and I/O overhead when we can.
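A minimal Python sketch of the unlink step, assuming only that FETCH_HEAD lives directly under the git directory; the helper name is hypothetical and this is not the LeiMirror code.

```python
import os

def unlink_fetch_head(git_dir):
    """Remove FETCH_HEAD under `git_dir`; return True if a file was removed."""
    try:
        os.unlink(os.path.join(git_dir, "FETCH_HEAD"))
        return True
    except FileNotFoundError:
        # Nothing to do when git honored --no-write-fetch-head.
        return False
```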
Eric Wong [Tue, 7 Mar 2023 08:47:15 +0000 (08:47 +0000)]
sha: fix compatibility with old OpenSSL + Net::SSLeay
In older OpenSSL, EVP_get_digestbyname() didn't work properly
without calling OpenSSL_add_all_digests(), first. However,
OpenSSL_add_all_digests() is deprecated by OpenSSL 1.1.0 in
favor of OPENSSL_init_crypto(). Of course, OPENSSL_init_crypto()
isn't available in OpenSSL 1.0.1k nor Net::SSLeay as of 1.93_02
(2023-02-22).
Thus, instead of relying on string lookups and conditional
subroutine calls, just call EVP_sha1() and EVP_sha256() which
work on both old and new systems.
Tested with Net::SSLeay 1.55 and OpenSSL 1.0.1k on CentOS 7.x.
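By way of analogy only, the same choice exists in Python's hashlib: calling the fixed constructors directly instead of looking a digest up by name. This sketch does not use Net::SSLeay or OpenSSL's EVP API; the helper names are hypothetical.

```python
import hashlib

def sha1_hex(data):
    """Hash with the dedicated SHA-1 constructor (no lookup by name)."""
    return hashlib.sha1(data).hexdigest()

def sha256_hex(data):
    """Hash with the dedicated SHA-256 constructor (no lookup by name)."""
    return hashlib.sha256(data).hexdigest()
```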
Eric Wong [Mon, 27 Feb 2023 10:21:05 +0000 (10:21 +0000)]
doc: update clone+fetch with 2.0+ switches
Because old versions will exist for a long time and our latest
documentation is visible on the web, we must document when a
switch appears to avoid confusing users of old versions.
Eric Wong [Fri, 24 Feb 2023 16:59:10 +0000 (16:59 +0000)]
ds: write: do not assume final wbuf entry is tmpio
The final entry of {wbuf} may be a CODE ref and not a
tmpio ARRAY ref, so we must ensure it's an ARRAY before
attempting to use `->[INDEX]' to access it.
This fixes:
forward ->close error: Not an ARRAY reference at PublicInbox/DS.pm line 544.
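The guard amounts to checking the type of the last queue entry before indexing into it. A Python analogue, where a list stands in for the tmpio ARRAY ref and a callable for the CODE ref (both stand-ins, and the helper name, are assumptions for illustration):

```python
def last_tmpio(wbuf):
    """Return the final entry only if it is a list-like record, else None."""
    if not wbuf:
        return None
    last = wbuf[-1]
    # A callable (the CODE ref analogue) must not be indexed.
    return last if isinstance(last, list) else None
```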
systemd (247.3-7+deb11u1 on Debian 11.x) considers them "obsolete" and
emits the following to my syslog:
Standard output type syslog is obsolete, automatically updating to journal.
Please update your unit file, and consider removing the setting altogether.
So we'll remove it altogether, as I'm sticking with rsyslog for now.
File::Path already accounts for the existence of directories,
handles races from redundant mkdir(2), and croaks on
unrecoverable errors. So there's no point in doing any
of that on our end.
Furthermore, avoiding the overhead of loading File::Path doesn't
seem worth it to save 20-60ms given the overhead of loading
our other code. Instead, try to reduce optree overhead in
our code, since File::Path gets used in a bunch of
places.
We'll also favor the newer make_path for multi-directory
invocations to avoid bloating our own optree to create an
arrayref, but mkpath is one fewer subroutine call within
File::Path itself, right now.
Eric Wong [Tue, 21 Feb 2023 12:17:44 +0000 (12:17 +0000)]
lei_mirror: support --remote-manifest=URL
Since PublicInbox::WWW already generates manifest.js.gz, I'm
using an alternate path with PublicInbox::WwwStatic to host the
manifest.js.gz for coderepos at an alternate location. The
following snippet lets me host
https://yhbt.net/lore/pub/manifest.js.gz for mirrored git
repositories, while https://yhbt.net/lore/manifest.js.gz
(no `pub') remains for inbox mirroring.
==> sample.psgi <==
use PublicInbox::WWW;
use PublicInbox::WwwStatic;
use Plack::Builder; # provides builder, enable, and mount
my $www = PublicInbox::WWW->new; # use default PI_CONFIG
my $st = PublicInbox::WwwStatic->new(docroot => '/path/to/code');
my $www_cb = sub {
    my ($env) = @_;
    # try the static docroot first for the coderepo manifest
    if ($env->{PATH_INFO} eq '/pub/manifest.js.gz') {
        local $env->{PATH_INFO} = '/manifest.js.gz';
        my $res = $st->call($env);
        return $res if $res->[0] != 404;
    }
    $www->call($env); # everything else goes to PublicInbox::WWW
};
builder {
    enable 'ReverseProxy';
    enable 'Head';
    mount '/lore' => $www_cb;
}
Eric Wong [Tue, 21 Feb 2023 11:17:58 +0000 (11:17 +0000)]
viewvcs: handle non-UTF-8 commit message
Back in the old days, git didn't store commit encodings
and allowed messages in various encodings to enter history.
Assuming such a commit is UTF-8 trips up s/// operations
on buffers read with the `:utf8' PerlIO layer. So clear
Perl's internal UTF-8 flag if we end up with something
which isn't valid UTF-8.
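Python has no per-string UTF-8 flag, so the closest analogue is to keep invalid input as raw bytes instead of pretending it is text. A minimal sketch under that assumption (the function name is hypothetical):

```python
def decode_commit_message(raw):
    """Return str for valid UTF-8 input, otherwise the untouched bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Legacy encoding of unknown origin: leave it as bytes.
        return raw
```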
Eric Wong [Mon, 20 Feb 2023 09:21:50 +0000 (09:21 +0000)]
searchidx: do not index quoted Base-85 patches
Base-85 binary patches were a source of false positives in results,
and we've filtered them out of non-quoted text since July 2022.
Unfortunately, people were quoting binary patch contents
in replies (*sigh*) and triggering false positives in search
results. So we must filter out base-85-looking contents from
quoted text, too.
Followup-to: 8fda04081acde705 (search: do not index base-85 binary patches, 2022-06-20)
Followup-to: 840785917bc74c8e (searchidx: skip "delta $N" sections for base-85, 2022-07-19)
Eric Wong [Mon, 20 Feb 2023 05:32:02 +0000 (05:32 +0000)]
multi_git: do not set include.path if already set
The epoch may already be read-only, and we don't need to cause
more I/O traffic and disk wear for no-op stuff. This fixes
idempotent use of public-inbox-clone to update multi-epoch
inboxes.
Eric Wong [Mon, 20 Feb 2023 08:19:43 +0000 (08:19 +0000)]
git_async_cat: don't mis-abort replaced process
When a git process gets replaced (e.g. due to new
epochs/alternates), we must be careful and not abort the wrong
one.
I suspect this fixes the problem exacerbated by --batch-command.
It was theoretically possible w/o --batch-command, but it seems
to have made it surface more readily.
This should fix "Failed to retrieve generated blob" errors from
PublicInbox/ViewVCS.pm appearing in syslog.
Eric Wong [Sun, 19 Feb 2023 08:18:14 +0000 (08:18 +0000)]
search: translate d: to dt: in query
dt: is higher resolution and the YYYYMMDD column will be dropped
if there's ever another SCHEMA_VERSION update. While the
upcoming code repo index is independent of the mail schemas,
it'll use similar query prefixes and likely use d:/dt: for
Author Date of git commits.
Eric Wong [Fri, 17 Feb 2023 10:36:14 +0000 (10:36 +0000)]
search: move query transform + enquire setup out of retry loop
The Xapian query transformation and Enquire object setup aren't
subject to MVCC and retries, so move it outside the retry loop
to save some cycles in case we need to retry on a busy DB.
Uwe Kleine-König [Fri, 17 Feb 2023 11:08:50 +0000 (12:08 +0100)]
public-inbox.cgi(1): Mention AllowEncodedSlashes for Apache setups
When AllowEncodedSlashes is Off (the default setting), requests for
URLs containing %2f are answered with a 404 error without calling the
CGI. To (maybe) save others from debugging this issue, add a hint with
the solution.
Eric Wong [Tue, 14 Feb 2023 13:17:39 +0000 (13:17 +0000)]
www_coderepo: handle unborn/dead branches in summary
We need to account for `git log' showing nothing for invalid
branches and continue to render properly. We'll also quiet down
`git log' to avoid cluttering stderr.
Eric Wong [Tue, 14 Feb 2023 02:42:32 +0000 (02:42 +0000)]
lei q: do not collapse threads with `-tt'
While having Xapian collapse threads is an easy way to reduce
the amount of deduplication work we need to do when writing
out threads, we can't rely on it when using `lei q -tt' since
that needs to flag all hits.