Eric Wong [Wed, 12 Apr 2023 00:12:58 +0000 (00:12 +0000)]
git: cat_async_step: reduce batch-command info checks
This improves readability for me. Instead of checking for `info'
requests of `--batch-command' in multiple places of every
common branch, do it once per-call and stash its result.
We'll also avoid storing `$bc' for now since the only other
check is in a cold path.
Eric Wong [Sun, 9 Apr 2023 22:30:13 +0000 (22:30 +0000)]
www_coderepo: use OnDestroy to render summary view
This lets us get rid of a /bin/sh process and allows us to
rely on Qspawn to parallelize git commands.
Special treatment of the OnDestroy object is necessary to keep
its scope limited for MockHTTP. Neither the generic `plackup'
HTTP server nor our -httpd/-netd needed this scope
limitation. As a result, summary() is now called inside an
anonymous sub to keep the memory overhead of the anonymous sub
itself as small as possible. Avoiding anonymous subs entirely
would be preferable for memory savings, but it's necessary for
PSGI.
Eric Wong [Sat, 8 Apr 2023 09:23:44 +0000 (09:23 +0000)]
v2writable: drop experimental DEBUG_DIFF support
I haven't used it in 5 years, and I doubt anybody else has,
either. In any case, we have both `lei mail-diff' and diff
support in the WWW UI, now, so more convenient options are
available.
Eric Wong [Fri, 7 Apr 2023 12:40:53 +0000 (12:40 +0000)]
switch git version comparisons to vstrings, too
There are too many require_git callsites in t/*.t to change,
but we can make the rest of the code more readable and reuse
PublicInbox::Git::version() in our test suite, too.
Eric Wong [Fri, 7 Apr 2023 12:40:52 +0000 (12:40 +0000)]
searchidx: use vstring to improve readability
Perl has native `vstring' encoding for vector (or version)
strings; make use of it instead of relying on difficult-to-read
hex versions and integer shifts.
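As a rough analogy only (vstrings are a Perl feature, and this is not code from public-inbox), the readability gain can be sketched in Python by comparing tuples of integers instead of hex-packed numbers. The function names and the 8-bits-per-component packing below are assumptions made for illustration.

```python
# Hypothetical helpers; names and packing layout are illustrative assumptions.

def packed_hex(major, minor, patch):
    """Older style: pack a version into one integer with shifts."""
    return (major << 16) | (minor << 8) | patch


def parse_version(text):
    """Readable style: '1.4.22' -> (1, 4, 22), compared element-wise."""
    return tuple(int(part) for part in text.split("."))


# The packed form needs the reader to decode hex by hand.
needs_upgrade_packed = packed_hex(1, 4, 9) < 0x010416

# The tuple form reads like the version number itself.
needs_upgrade_tuple = parse_version("1.4.9") < parse_version("1.4.22")
```

Both comparisons express the same check; only the second one can be read without decoding the constant.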
Eric Wong [Fri, 7 Apr 2023 12:40:50 +0000 (12:40 +0000)]
umask: hoist out of InboxWritable
Since CodeSearchIdx doesn't deal with inboxes, it makes sense
to split it out from inbox-specific code and start moving
towards using OnDestroy to restore the umask at the end of
scope and reducing extra functions.
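The idea of restoring the umask at the end of a scope can be sketched in Python with a context manager. This is an analogy to the OnDestroy approach, not the Perl implementation; `scoped_umask` is a hypothetical name.

```python
import os
from contextlib import contextmanager


@contextmanager
def scoped_umask(mask):
    """Set the process umask for the duration of a `with' block."""
    # os.umask() installs the new mask and returns the previous one.
    old = os.umask(mask)
    try:
        yield old
    finally:
        # Runs on normal exit and on exceptions, like a scope guard.
        os.umask(old)
```

A caller would wrap index creation in `with scoped_umask(0o077):` and needs no explicit restore call afterwards.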
Eric Wong [Fri, 7 Apr 2023 12:40:48 +0000 (12:40 +0000)]
cindex: improve progress display
Instead of displaying the total number of changes across all
repos next to the repo path ("$GIT_DIR: $TOTAL commits"), we'll
only show the number of changes made in that repo.
We'll also note when a prune is complete on a shard, since
prunes may often be expensive no-ops.
Eric Wong [Thu, 6 Apr 2023 12:39:53 +0000 (12:39 +0000)]
watch: close inotify FD on ->quit
For simplicity, we quit and recreate an entire watch instance
on SIGHUP. However, inotify (and signalfd) FDs are tied to
the DS event loop and stay pinned to existence that way.
Thus we explicitly close the FD in Watch->quit to prevent
leakage on SIGHUP.
Eric Wong [Thu, 6 Apr 2023 12:39:52 +0000 (12:39 +0000)]
watch: use detect_indexlevel for unconfigured inboxes
I favor leaving the publicinbox.<name>.indexlevel parameter
out of config files to make it easier to alter and reduce
sources of truth. It worked well in most cases, but
public-inbox-watch also needs to detect the indexlevel.
Moving the sub to InboxWritable (from Admin) probably makes
sense since it's a per-inbox attribute and allows -watch
to reuse it.
Eric Wong [Wed, 5 Apr 2023 11:26:56 +0000 (11:26 +0000)]
cindex: enter event loop once per run
This avoids needing to alter the sigmask for systems without
signalfd or EVFILT_SIGNAL. This will also make it easier to
work around FreeBSD (and likely *BSD) signal behavior in the
next commit.
Eric Wong [Wed, 5 Apr 2023 11:26:53 +0000 (11:26 +0000)]
cindex: do prune work while waiting for `git log -p'
`git log -p' can take several seconds to generate its initial output.
SMP systems can be processing prunes during this delay, so let
DS do a one-shot notification for us while prune is running. On
Linux, we'll also use the biggest pipe possible so git can do
more CPU-intensive work to generate diffs while our Perl
processes are indexing and likely hitting I/O wait.
Eric Wong [Wed, 5 Apr 2023 11:26:52 +0000 (11:26 +0000)]
ipc: support awaitpid in WQ workers
Using signalfd is necessary to get reliable signal wakeups w/o
polling on fixed intervals. This change will make it possible
to use awaitpid in cidx shard workers so they can perform prune
work while waiting on the initial output of `git log -p'.
Eric Wong [Thu, 30 Mar 2023 11:29:51 +0000 (11:29 +0000)]
www: support POST /$INBOX/$MSGID/?x=m&q=
This allows filtering the contents of any existing thread using
a search query. It uses the existing THREADID column in Xapian
so we can internally add a Xapian OP_FILTER to the results.
This new functionality is orthogonal to the existing `t=1'
parameter which gives mairix-style thread expansion. It doesn't
make sense to use `t=1' with this functionality, but it's not
disallowed, either.
The indentation change in Over->next_by_mid is to ensure
DBI->prepare_cached can share across both ->next_by_mid
and ->mid2tid.
I also noticed the existing regex for `POST /$INBOX/?x=m&q=' was
allowing extra characters. With an added \z, it's now as strict
as originally intended and AFAIK nothing was generating invalid
URLs for it.
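The effect of the missing end anchor can be sketched in Python, where `\Z` plays the role of Perl's `\z`. The patterns below are hypothetical stand-ins, not the regex used in public-inbox.

```python
import re

# Hypothetical inbox path patterns, for illustration only.
LOOSE = re.compile(r"/([\w.-]+)/")      # anchored at the start only
STRICT = re.compile(r"/([\w.-]+)/\Z")   # must also end right after the slash

# Without the end anchor, trailing characters are silently accepted.
loose_hit = LOOSE.match("/inbox/unexpected")

# With the end anchor, the same input is rejected.
strict_miss = STRICT.match("/inbox/unexpected")

# A well-formed path still matches and captures the inbox name.
strict_hit = STRICT.match("/inbox/")
```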
Eric Wong [Wed, 29 Mar 2023 20:32:59 +0000 (20:32 +0000)]
cindex: interleave prune with indexing
We need to ensure we don't block indexing for too long while
pruning, since pruning coderepos seems more frequent and
necessary than inbox repos due to the prevalence of force
pushes with branches like `seen' (formerly `pu') in git.git.
Implement this via ->event_step and requeue mechanisms of DS so
we periodically flush our work and let indexing resume.
I originally wanted to implement this as a dedicated group
of workers, but the XS Search::Xapian bug[1] workaround
to handle uncaught C++ exceptions was expensive and complex
compared to the evented mechanism.
Eric Wong [Tue, 28 Mar 2023 02:59:01 +0000 (02:59 +0000)]
cindex: simplify some internal data structures
We'll rely more on local-ized `our' globals rather than
hashref fields. The former is more resistant to typos
and can be checked at compile-time earlier via `perl -c'.
The {-internal} field is also renamed to {-cidx_internal}
to reduce confusion within a large code base.
Eric Wong [Tue, 28 Mar 2023 11:12:36 +0000 (11:12 +0000)]
inotify: wrap with informative error message
As encountered by Louis DeLosSantos, Linux inotify is capped by
a lesser-known limit than the standard RLIMIT_NOFILE (`ulimit -n`)
value. Give the user a hint about the fs.inotify.max_user_instances
sysctl knob on EMFILE, since EMFILE alone may mislead users into
thinking they've hit the (typically higher) RLIMIT_NOFILE limit.
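A minimal sketch of the kind of wrapper described, written in Python rather than Perl. The function name and the message wording are assumptions; only the EMFILE condition and the sysctl name come from the commit message.

```python
import errno


def describe_inotify_error(err):
    """Build an error message, adding a hint when the errno is EMFILE."""
    msg = f"inotify setup failed: {err.strerror}"
    if err.errno == errno.EMFILE:
        # EMFILE here may come from the per-user inotify instance limit,
        # not from RLIMIT_NOFILE.
        msg += " (consider increasing the fs.inotify.max_user_instances sysctl)"
    return msg
```

Other errno values pass through without the hint, so the message stays accurate for unrelated failures.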
Eric Wong [Sun, 26 Mar 2023 10:52:46 +0000 (10:52 +0000)]
watch: do not recreate signalfd on SIGHUP
The normal method by which PublicInbox::DS::event_loop sets up
signals once needs some coercing to work with -watch.
Otherwise, we'll end up wasting FDs every time somebody reloads
-watch via SIGHUP.
Eric Wong [Sun, 26 Mar 2023 09:35:43 +0000 (09:35 +0000)]
Merge branch 'cindex'
* cindex: (29 commits)
cindex: --prune checkpoints to avoid OOM
cindex: ignore SIGPIPE
cindex: respect existing permissions
cindex: squelch incompatible options
cindex: implement reindex
cindex: add support for --prune
cindex: filter out non-existent git directories
spawn: show failing directory for chdir failures
cindex: improve granularity of quit checks
cindex: attempt to give oldest commits lowest docids
cindex: truncate or drop body for over-sized commits
cindex: check for checkpoint before giant messages
cindex: implement --max-size=SIZE
sigfd: pass signal name rather than number to callback
cindex: handle graceful shutdown by default
cindex: drop `unchanged' progress message
cindex: show shard number in progress message
cindex: implement --exclude= like -clone
ds: @post_loop_do replaces SetPostLoopCallback
cindex: use DS and workqueues for parallelism
...
Eric Wong [Sat, 25 Mar 2023 11:11:05 +0000 (11:11 +0000)]
lei_store: avoid redundant work on no-op worker spawn
While ->wq_workers_start is idempotent, the pipe creation for
PublicInbox::LeiStoreErr was not and required several extra
syscalls and FD allocations. Check the correct field required
for SOCK_SEQPACKET workers rather than pipe-based workers.
Fixes: cbc2890cb89b81cb ("lei/store: use SOCK_SEQPACKET rather than pipe")
Eric Wong [Tue, 21 Mar 2023 23:07:42 +0000 (23:07 +0000)]
cindex: respect existing permissions
For internal ($GIT_DIR/public-inbox-cindex) Xapian DBs, we can
rely on core.sharedRepository. For external ones, we'll just
rely on existing permissions if the directory already exists.
Eric Wong [Tue, 21 Mar 2023 23:07:40 +0000 (23:07 +0000)]
cindex: implement reindex
This allows changing --indexlevel at the moment and will allow
us to fix yet-to-be-discovered bugs or make backwards-compatible
improvements in the future.
Eric Wong [Tue, 21 Mar 2023 23:07:39 +0000 (23:07 +0000)]
cindex: add support for --prune
This gets rid of both inaccessible commits AND repositories.
It will only unindex commits which are pruned in git, first,
so repos with auto GC disabled will need GC to prune them.
Eric Wong [Tue, 21 Mar 2023 23:07:37 +0000 (23:07 +0000)]
spawn: show failing directory for chdir failures
Our use of `git rev-parse --git-dir' depends on our (v)fork+exec
wrapper doing chdir, so the error message is required to avoid
user confusion. I'm still avoiding `git -C $DIR' for now since
ancient versions of git did not support it.
Eric Wong [Tue, 21 Mar 2023 23:07:35 +0000 (23:07 +0000)]
cindex: attempt to give oldest commits lowest docids
Monotonically increasing docids may help us avoid sorting output
for the web and CLI, since recent commits are generally the most
desired search results.
`git log --reverse' incurs no extra overhead in this case, since
`--stdin' will mean git buffers the commit list in memory before
attempting to emit anything.
Eric Wong [Tue, 21 Mar 2023 23:07:30 +0000 (23:07 +0000)]
cindex: handle graceful shutdown by default
While individual Xapian shards are consistent due to the use of
Xapian transactions, the data across shards still needs to be
in a consistent state for our search to work.
Eric Wong [Tue, 21 Mar 2023 23:07:26 +0000 (23:07 +0000)]
ds: @post_loop_do replaces SetPostLoopCallback
This allows us to avoid repeatedly using memory-intensive
anonymous subs in CodeSearchIdx where the callback is assigned
frequently. Anonymous subs are known to leak memory in old
Perls (e.g. 5.16.3 in enterprise distros) and still expensive in
newer Perls. So favor the (\&subroutine, @args) form which
allows us to eliminate anonymous subs going forward.
Only CodeSearchIdx takes advantage of the new API at the moment,
since it's the biggest repeat user of post-loop callback
changes.
Getting rid of the subroutine and relying on a global `our'
variable also has two advantages:
1) Perl warnings can detect typos at compile-time, whereas the
(now gone) method could only detect errors at run-time.
2) `our' variable assignment can be `local'-ized to a scope
Eric Wong [Tue, 21 Mar 2023 23:07:25 +0000 (23:07 +0000)]
cindex: use DS and workqueues for parallelism
This avoids forking new shard processes for each repo we scan,
but we can't avoid many excessive commits since we need to
ensure the `seen()' sub can avoid excessive work.
Eric Wong [Tue, 21 Mar 2023 23:07:23 +0000 (23:07 +0000)]
cindex: use read-only shards during prep phases
No need to open shards for read/write access when read-only
will do. Since we also control how a document gets sharded,
we'll also access the shard directly instead of letting Xapian
do the mappings.
--reindex didn't work properly before this change since it was
over-indexing. It is now broken in the opposite way in that it
doesn't do reindexing at all. --reindex will be implemented
properly in the future.
Eric Wong [Tue, 21 Mar 2023 23:07:22 +0000 (23:07 +0000)]
cindex: parallelize prep phases
Listing refs, fingerprinting and root scanning can all be
parallelized to reduce runtime on SMP systems.
We'll use DESTROY-based dependency management with
parallelization as in LeiMirror to handle ref listing and
fingerprinting before serializing Xapian DB access to check
against the existing fingerprint.
We'll also delay root listing until we get a fingerprint
mismatch to speed up no-op indexing.
Eric Wong [Tue, 21 Mar 2023 23:07:21 +0000 (23:07 +0000)]
codesearch: initial cut w/ -cindex tool
It seems relying on root commits is a reasonable way to
deduplicate and handle repositories with common history.
I initially wanted to shoehorn this into extindex, but decided a
separate Xapian index layout capable of being EITHER external to
handle many forks or internal (in $GIT_DIR/public-inbox-cindex)
for small projects is the right way to go.
Unlike most existing parts of public-inbox, this relies on
absolute paths of $GIT_DIR stored in the Xapian DB and does not
rely on the config file. We'll be relying on the config file to
map absolute paths to public URL paths for WWW.
Eric Wong [Sat, 25 Mar 2023 02:08:52 +0000 (02:08 +0000)]
ipc: retry sendmsg + recvmsg calls on EINTR
I'm not sure how this went undetected for so long, but EINTR
must be checked for when working with blocking sockets. EINTR
shouldn't happen for non-blocking sockets, but it's
easier to just use the new wrapper in most of those places.
I don't know what I was smoking when I left out EINTR checks :x
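The retry pattern itself is language-neutral and can be sketched as below. Note that Python (3.5 and later) already retries most interrupted system calls internally, so this is an illustration of the loop only, not something Python socket code normally needs; `retry_on_eintr` is a hypothetical name and not the wrapper added by this commit.

```python
def retry_on_eintr(func, *args):
    """Call func(*args) again whenever it is interrupted by a signal."""
    while True:
        try:
            return func(*args)
        except InterruptedError:
            # OSError with errno EINTR: nothing was transferred, so retry.
            continue
```

Any other exception propagates unchanged, so real failures are not masked by the loop.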
Eric Wong [Wed, 15 Mar 2023 21:47:56 +0000 (21:47 +0000)]
ds: reap_pids: remove redundant signal blocking
Blocking signals when reaping was done when the lei pager was
spawned by the daemon in b90e8d6e02. Shortly afterwards in
7b79c918a5, the client script took over spawning of the pager
and made b90e8d6e02 redundant.
cf. b90e8d6e02 (ds: block signals when reaping, 2021-01-10)
    7b79c918a5 (lei: run pager in client script, 2021-01-10)
Eric Wong [Wed, 15 Feb 2023 22:20:23 +0000 (22:20 +0000)]
lei_mirror: use fetch.hideRefs to speed up connectivity check
`git fetch' runs an expensive connectivity check against all
refs, which is unnecessarily expensive for incremental fetches
on RAM-constrained systems.
This depends on the proposal to support `fetch.hideRefs' for `git fetch':
https://public-inbox.org/git/20230212090426.M558990@dcvr/
Eric Wong [Mon, 13 Mar 2023 12:00:23 +0000 (12:00 +0000)]
lei_mirror: handle UTF-8 from manifest.js.gz properly
This should ensure we display the "git config gitweb.owner
$OWNER" command invocation properly and also ensures we set the
description properly without triggering wide character warnings.
Also tested with a smallish iproute2 repo
(/pub/scm/linux/kernel/git/toke/iproute2.git) using my mirror:
Anyways, I'm fairly certain this change and its tests are
correct; but I still struggle to understand Perl's approach to
Unicode and its interactions with various JSON implementations.
Fixes: 0830817c132cb105 ("lei_mirror: show non-ASCII owner properly w/ --verbose")
Eric Wong [Mon, 13 Mar 2023 12:00:22 +0000 (12:00 +0000)]
lei_mirror: do not fetch to read-only directories
As with public-inbox-fetch, we shouldn't waste time fetching
into read-only directories, since --epoch= will make unwanted
epoch directories read-only placeholders.
Eric Wong [Wed, 8 Mar 2023 11:02:58 +0000 (11:02 +0000)]
lei_mirror: unlink FETCH_HEAD when fetching forkgroups
Apparently, --no-write-fetch-head is broken in current git[1].
It also wasn't in older git, at all. So just unlink FETCH_HEAD
as we see it, but keep using --no-write-fetch-head to avoid the
syscall and I/O overhead when we can.
Eric Wong [Tue, 7 Mar 2023 08:47:15 +0000 (08:47 +0000)]
sha: fix compatibility with old OpenSSL + Net::SSLeay
In older OpenSSL, EVP_get_digestbyname() didn't work properly
without calling OpenSSL_add_all_digests(), first. However,
OpenSSL_add_all_digests() is deprecated by OpenSSL 1.1.0 in
favor of OPENSSL_init_crypto(). Of course, OPENSSL_init_crypto()
isn't available in OpenSSL 1.0.1k nor Net::SSLeay as of 1.93_02
(2023-02-22).
Thus, instead of relying on string lookups and conditional
subroutine calls, just call EVP_sha1() and EVP_sha256() which
work on both old and new systems.
Tested with Net::SSLeay 1.55 and OpenSSL 1.0.1k on CentOS 7.x
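The difference between looking up a digest by name and calling a dedicated constructor can be sketched with Python's hashlib. This is an analogy to the EVP_get_digestbyname() versus EVP_sha1()/EVP_sha256() choice, not the Net::SSLeay code; the helper names are hypothetical.

```python
import hashlib


def digest_by_name(name, data):
    """String lookup: depends on the name being registered at run time."""
    return hashlib.new(name, data).hexdigest()


def sha1_hex(data):
    """Dedicated constructor: no name lookup involved."""
    return hashlib.sha1(data).hexdigest()


def sha256_hex(data):
    """Dedicated constructor: no name lookup involved."""
    return hashlib.sha256(data).hexdigest()
```

Both routes yield the same digest when the lookup succeeds; the dedicated constructors just remove the lookup as a point of failure.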