Eric Wong [Tue, 25 Apr 2023 11:02:55 +0000 (11:02 +0000)]
cindex: simplify store_repo
It's easier to just create a new Xapian::Document and
replace it rather than to load and edit it. I don't
know if there's any performance difference one way or
the other, but fewer branches helps with maintainability
and keeps the optree smaller, lowering memory use and startup
time.
Eric Wong [Tue, 25 Apr 2023 11:02:54 +0000 (11:02 +0000)]
cindex: simplify tmpfile management for indexing
I considered making this a pipe, but we must avoid spawning
`git log --stdin --no-walk=unsorted' for the no-op case since that
still emits a commit if stdin is empty. So just get rid of an
unnecessary loop and do lseek(2) inside workers for parallelism.
Eric Wong [Tue, 25 Apr 2023 11:02:53 +0000 (11:02 +0000)]
cindex: drop unneeded module use
I initially thought I'd use the PublicInbox::Eml module and rely
on --pretty=mboxrd; but eventually decided against it since
it wasn't saving any code.
Eric Wong [Tue, 25 Apr 2023 10:50:52 +0000 (10:50 +0000)]
content_digest_dbg: improve display of To:/Cc: diffs
To: and Cc: headers can be long and differences in long lines
are easier to view when broken apart. Just split by /,/ since
Data::Dumper will delimit with "," anyways.
Eric Wong [Tue, 25 Apr 2023 10:50:51 +0000 (10:50 +0000)]
mail_diff: show headers differences in WWW /$MSGID/d/ view
Some messages only differ in the To/Cc headers because some
MTAs seem to normalize them. I was getting confused when I
saw some /d/ endpoints with no visible differences.
Eric Wong [Tue, 25 Apr 2023 10:50:50 +0000 (10:50 +0000)]
mail_diff: match ContentHash EOL and EOM behavior more closely
ContentHash currently doesn't convert CRCRLF to LF. Perhaps it
should, but for now, have diff behavior match the actual
comparison behavior used for dedupe and omit all trailing
whitespace for diff.
Eric Wong [Tue, 25 Apr 2023 10:50:49 +0000 (10:50 +0000)]
mid+contenthash: eliminate needless local variable captures
It's possible in theory that Perl could be smarter and free
memory a tad sooner this way. Regardless, fewer lines of code
is easier-to-navigate/read and can save optree size and reduce
parsing times.
Eric Wong [Sat, 22 Apr 2023 10:33:42 +0000 (10:33 +0000)]
cindex: rewrite prune (again) for speed
With my partial git.kernel.org mirror, this brings a full prune
down from ~75 minutes to under 5 minutes using git 2.19+. This
speedup even applies to users on slow storage (rotational HDD).
First off, xapian-delve(1) is nearly 10x faster for dumping
boolean terms by prefix than the equivalent Perl code with
Xapian bindings. This performance difference is critical since
we need to check over 5 million commits for pruning a partial
git.kernel.org mirror.
We can use sed(1) and sort(1) to massage delve output into
something suitable for the first comm(1) input.
For the second comm(1) input, the output of `git cat-file
--batch-check --batch-all-objects' against all indexed git repos
with awk(1) filtering provides the necessary output for
generating a list of indexed-but-no-longer accessible commits.
sed(1) and awk(1) are POSIX standard tools which can be roughly
2x faster than equivalent Perl for simple filters, while
sort(1) is designed to handle larger-than-memory datasets
efficiently (unlike the `sort' perlop).
With slow storage and git <2.19, the switch to --batch-all-objects
actually results in a performance regression since having git
perform sorting results in worse disk locality than the previous
sequential iteration by Xapian docid. git 2.19+ users with
`--unordered' support benefit from improved storage locality;
and speedups from storage locality dwarf the extra overhead of
an extra external sort(1) invocation.
Even with consumer-grade SATA-II SSDs, the combo of --unordered
and sort(1) provides a noticeable speedup since SSD latency
remains a factor for --batch-all-objects.
git <2.19 users must upgrade git to get acceptable performance
on slow storage and giant indexes, but git 2.19 was released
nearly 5 years ago so it's probably a reasonable requirement for
performance.
The only remaining downside of this change for all users is
the extra temporary disk space for sort(1) and comm(1);
but the speedup provided with git 2.19+ is well worth it.
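The comm(1) step described above amounts to a set difference over
two sorted streams. A minimal Python sketch of that idea follows;
it is only an illustration, since the change itself relies on
external xapian-delve, sed, sort, awk and comm processes. The
function name and the sample object IDs are hypothetical.

```python
def indexed_but_missing(indexed, existing):
    """Yield items of sorted `indexed` that are absent from sorted `existing`.

    Both inputs must be sorted iterables of strings (as sort(1) output
    would be in the C locale). This mirrors `comm -23 indexed existing`
    and never holds either full list in memory.
    """
    existing = iter(existing)
    cur = next(existing, None)
    for oid in indexed:
        # advance the second stream until it catches up with `oid`
        while cur is not None and cur < oid:
            cur = next(existing, None)
        if cur != oid:
            yield oid


# hypothetical, shortened object IDs for illustration only
indexed = ["0a1b", "3c4d", "7e8f", "9a0b"]   # commits known to the index
existing = ["0a1b", "5d6e", "7e8f"]          # objects still in the repos
print(list(indexed_but_missing(indexed, existing)))
```

Streaming over pre-sorted input is what lets sort(1) and comm(1)
cope with millions of commits without a large in-memory hash.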
Eric Wong [Thu, 20 Apr 2023 10:23:02 +0000 (10:23 +0000)]
lei_mail_sync: prepare to support SHA-256
I'm not sure how combining SHA-1 and SHA-256 in a single git
repo will work, eventually. But this is an obvious place to do
the right thing if we ever see a 64-byte hex string (unless git
adds support for another hash which uses 64-byte hex string
representations, which would break many assumptions elsewhere,
too...).
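As a rough illustration of the length-based check mentioned above,
here is a small Python sketch. It assumes only that SHA-1 object IDs
are 40 hex characters and SHA-256 object IDs are 64; the function
name is hypothetical and this is not the code in lei_mail_sync.

```python
import re

HEX_RE = re.compile(r"[0-9a-f]+")


def hash_algo_for(oidhex):
    """Guess the git hash algorithm from the length of a hex object ID."""
    if not HEX_RE.fullmatch(oidhex):
        raise ValueError(f"not a lowercase hex string: {oidhex!r}")
    if len(oidhex) == 40:
        return "sha1"
    if len(oidhex) == 64:
        return "sha256"
    # any other length is unknown to this sketch
    raise ValueError(f"unexpected object ID length: {len(oidhex)}")


print(hash_algo_for("a" * 40))
print(hash_algo_for("b" * 64))
```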
Eric Wong [Thu, 20 Apr 2023 00:53:30 +0000 (00:53 +0000)]
cindex: limit parallelism of extensions.objectFormat check
We can't safely spawn all `git config' processes of every
indexed git directory at once due to system resource limits
(RLIMIT_NPROC, RLIMIT_NOFILE). So queue them up and limit
parallelism that way.
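The queueing idea can be sketched in Python as below. This is an
assumption-laden illustration rather than the cindex implementation:
`run_limited` and `max_jobs` are hypothetical names, and the example
spawns the Python interpreter as a stand-in for `git config' so it
runs without any repository.

```python
import subprocess
import sys
from collections import deque


def run_limited(cmds, max_jobs=4):
    """Run each argv list in `cmds`, never more than `max_jobs` at once.

    Returns a list of (argv, stdout_bytes) in the order commands started.
    """
    pending = deque(cmds)
    running = deque()   # (argv, Popen) pairs, oldest first
    results = []
    while pending or running:
        # top up to the limit before waiting on anything
        while pending and len(running) < max_jobs:
            argv = pending.popleft()
            proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
            running.append((argv, proc))
        # reap the oldest process; communicate() drains and closes its pipe
        argv, proc = running.popleft()
        out, _ = proc.communicate()
        results.append((argv, out))
    return results


# stand-in commands; a real caller might pass `git config` argv lists
cmds = [[sys.executable, "-c", f"print({i})"] for i in range(5)]
for argv, out in run_limited(cmds, max_jobs=2):
    print(out.decode().strip())
```

Capping the number of live children bounds both the process count
and the number of open pipe FDs at any moment.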
Eric Wong [Wed, 19 Apr 2023 21:54:48 +0000 (21:54 +0000)]
cindex: support sha256 coderepos alongside sha1
This special support is only needed for --prune at the moment
since the indexing side works on a per-repo basis. There's no
automated tests, yet, but it seems to work well on my sha256
projects when sharing a cindex with sha1 projects.
Eric Wong [Tue, 18 Apr 2023 18:39:14 +0000 (18:39 +0000)]
www_coderepo: rescan cgit project-list for new coderepos
Coderepo changes are probably more common than inbox changes, so
it probably makes sense to rescan and look for new coderepos on
404s, especially since we serve mirrored manifest.js.gz as-is.
I noticed my git.kernel.org mirror was serving manifest.js.gz
pointing to irretrievable repositories. This should stop that.
We'll also drop the underscore ('_') and use `coderepo'
everywhere to be consistent with our documentation.
We may serve new inboxes in a similar way down the line, too;
but this change only affects coderepos for now since we can
guarantee the inbox manifest.js.gz never contains irretrievable
inboxes as it's dynamically generated.
Eric Wong [Wed, 12 Apr 2023 10:17:42 +0000 (10:17 +0000)]
listener: support multi-accept like nginx
While accepting a single connection at-a-time is likely best for
multi-worker and/or load-balanced deployments; accepting
multiple connections at once should be less bad on overloaded
single-worker systems.
We can't automatically pick the best value here since worker
counts are dynamic via SIGTTIN/SIGTTOU. Process managers
(e.g. systemd) can also spawn multiple instances sharing a
single listener with no knowledge sharing between listeners.
Eric Wong [Wed, 12 Apr 2023 06:19:10 +0000 (06:19 +0000)]
lei_mail_sync: cleanup stale/dangling fids if possible
I'm not sure how it happens or if/when it was fixed, but my
earliest lei installations have hit some
"E: fid=$fid for $oidhex unknown" messages on `lei import'
invocations.
This really should've enabled the foreign keys pragma to begin
with; but we'll probably start using that in the future. For
now, at least rely on a transaction to keep things consistent
in SQLite.
Eric Wong [Wed, 12 Apr 2023 00:13:02 +0000 (00:13 +0000)]
git: parallelize manifest_entry
This saves a few milliseconds per-epoch without incurring
any dependencies on the event loop. It can be parallelized
further, of course, but it may not be worth it for -extindex
users since it's already cached.
Eric Wong [Wed, 12 Apr 2023 00:12:58 +0000 (00:12 +0000)]
git: cat_async_step: reduce batch-command info checks
This improves readability for me. Instead of checking for `info '
requests of `--batch-command' in multiple places of every
common branch, do it once per-call and stash its result.
We'll also avoid storing `$bc' for now since the only other
check is in a cold path.
Eric Wong [Sun, 9 Apr 2023 22:30:13 +0000 (22:30 +0000)]
www_coderepo: use OnDestroy to render summary view
This lets us get rid of a /bin/sh process and allows us to
rely on Qspawn to parallelize git commands.
Special treatment of the OnDestroy object is necessary to keep
its scope limited for MockHTTP. Neither the generic `plackup'
HTTP server nor our -httpd/-netd needed this scope
limitation. As a result, summary() is now called inside an
anonymous sub to keep the memory overhead of the anonymous sub
itself as small as possible. Avoiding anonymous subs entirely
would be preferable for memory savings, but it's necessary for
PSGI.
Eric Wong [Sat, 8 Apr 2023 09:23:44 +0000 (09:23 +0000)]
v2writable: drop experimental DEBUG_DIFF support
I haven't used it in 5 years, and I doubt anybody else has,
either. In any case, we have both `lei mail-diff' and diff
support in the WWW UI, now, so more convenient options are
available.
Eric Wong [Fri, 7 Apr 2023 12:40:53 +0000 (12:40 +0000)]
switch git version comparisons to vstrings, too
There are too many require_git callsites in t/*.t to change,
but we can make the rest of the code more readable and reuse
PublicInbox::Git::version() in our test suite, too.
Eric Wong [Fri, 7 Apr 2023 12:40:52 +0000 (12:40 +0000)]
searchidx: use vstring to improve readability
Perl has a native `vstring' encoding for vector (or version)
strings; make use of it instead of relying on difficult-to-read
hex versions and integer shifts.
Eric Wong [Fri, 7 Apr 2023 12:40:50 +0000 (12:40 +0000)]
umask: hoist out of InboxWritable
Since CodeSearchIdx doesn't deal with inboxes, it makes sense
to split it out from inbox-specific code and start moving
towards using OnDestroy to restore the umask at the end of
scope and reducing extra functions.
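Restoring the umask at the end of a scope can be illustrated with a
Python context manager, which plays a role similar to a
destructor-based guard. This is only an analogy to the OnDestroy
approach; `with_umask` is a hypothetical name and not part of
public-inbox.

```python
import os
from contextlib import contextmanager


@contextmanager
def with_umask(mask):
    """Set the process umask to `mask`, restoring the old value on exit."""
    old = os.umask(mask)   # os.umask() sets the new mask and returns the old one
    try:
        yield old
    finally:
        os.umask(old)      # runs even if the body raises


with with_umask(0o077):
    # files created here would not be group- or world-accessible
    pass
```

Tying the restore to scope exit means early returns and exceptions
cannot leave the process with the wrong umask.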
Eric Wong [Fri, 7 Apr 2023 12:40:48 +0000 (12:40 +0000)]
cindex: improve progress display
Instead of displaying the total number of changes across all
repos next to the repo path ("$GIT_DIR: $TOTAL commits"), we'll
only show the number of changes made in that repo.
We'll also note when a prune is complete on a shard, since
prunes may often be expensive no-ops.
Eric Wong [Thu, 6 Apr 2023 12:39:53 +0000 (12:39 +0000)]
watch: close inotify FD on ->quit
For simplicity, we quit and recreate an entire watch instance
on SIGHUP. However, inotify (and signalfd) FDs are tied to
the DS event loop and stay pinned to existence that way.
Thus we explicitly close the FD in Watch->quit to prevent
leakage on SIGHUP.
Eric Wong [Thu, 6 Apr 2023 12:39:52 +0000 (12:39 +0000)]
watch: use detect_indexlevel for unconfigured inboxes
I favor leaving the publicinbox.<name>.indexlevel parameter
out of config files to make it easier to alter and reduce
sources of truth. It worked well in most cases, but
public-inbox-watch also needs to detect the indexlevel.
Moving the sub to InboxWritable (from Admin) probably makes
sense since it's a per-inbox attribute and allows -watch
to reuse it.
Eric Wong [Wed, 5 Apr 2023 11:26:56 +0000 (11:26 +0000)]
cindex: enter event loop once per run
This avoids needing to alter the sigmask for systems without
signalfd or EVFILT_SIGNAL. This will also make it easier to
workaround FreeBSD (and likely *BSD) signal behavior in the
next commit.
Eric Wong [Wed, 5 Apr 2023 11:26:53 +0000 (11:26 +0000)]
cindex: do prune work while waiting for `git log -p'
`git log -p' can take several seconds to generate its initial output.
SMP systems can be processing prunes during this delay, so let
DS do a one-shot notification for us while prune is running. On
Linux, we'll also use the biggest pipe possible so git can do
more CPU-intensive work to generate diffs while our Perl
processes are indexing and likely hitting I/O wait.
Eric Wong [Wed, 5 Apr 2023 11:26:52 +0000 (11:26 +0000)]
ipc: support awaitpid in WQ workers
Using signalfd is necessary to get reliable signal wakeups w/o
polling on fixed intervals. This change will make it possible
to use awaitpid in cidx shard workers so they can perform prune
work while waiting on the initial output of `git log -p'.
Eric Wong [Thu, 30 Mar 2023 11:29:51 +0000 (11:29 +0000)]
www: support POST /$INBOX/$MSGID/?x=m&q=
This allows filtering the contents of any existing thread using
a search query. It uses the existing THREADID column in Xapian
so we can internally add a Xapian OP_FILTER to the results.
This new functionality is orthogonal to the existing `t=1'
parameter which gives mairix-style thread expansion. It doesn't
make sense to use `t=1' with this functionality, but it's not
disallowed, either.
The indentation change in Over->next_by_mid is to ensure
DBI->prepare_cached can share across both ->next_by_mid
and ->mid2tid.
I also noticed the existing regex for `POST /$INBOX/?x=m&q=' was
allowing extra characters. With an added \z, it's now as strict
as originally intended and AFAIK nothing was generating invalid
URLs for it.
Eric Wong [Wed, 29 Mar 2023 20:32:59 +0000 (20:32 +0000)]
cindex: interleave prune with indexing
We need to ensure we don't block indexing for too long while
pruning, since pruning coderepos seems more frequent and
necessary than inbox repos due to the prevalence of force
pushes with branches like `seen' (formerly `pu') in git.git.
Implement this via ->event_step and requeue mechanisms of DS so
we periodically flush our work and let indexing resume.
I originally wanted to implement this as a dedicated group
of workers, but the XS Search::Xapian bug[1] workaround
to handle uncaught C++ exceptions was expensive and complex
compared to the evented mechanism.
Eric Wong [Tue, 28 Mar 2023 02:59:01 +0000 (02:59 +0000)]
cindex: simplify some internal data structures
We'll rely more on local-ized `our' globals rather than
hashref fields. The former is more resistant to typos
and can be checked at compile-time earlier via `perl -c'.
The {-internal} field is also renamed to {-cidx_internal}
to reduce confusion within a large code base.
Eric Wong [Tue, 28 Mar 2023 11:12:36 +0000 (11:12 +0000)]
inotify: wrap with informative error message
As encountered by Louis DeLosSantos, Linux inotify is capped by
a lesser-known limit than the standard RLIMIT_NOFILE (`ulimit -n`)
value. Give the user a hint about the fs.inotify.max_user_instances
sysctl knob on EMFILE, since EMFILE alone may mislead users into
thinking they've hit the (typically higher) RLIMIT_NOFILE limit.
Eric Wong [Sun, 26 Mar 2023 10:52:46 +0000 (10:52 +0000)]
watch: do not recreate signalfd on SIGHUP
The normal method by which PublicInbox::DS::event_loop sets up
signals once needs some coercing to work with -watch.
Otherwise, we'll end up wasting FDs every time somebody reloads
-watch via SIGHUP.
Eric Wong [Sun, 26 Mar 2023 09:35:43 +0000 (09:35 +0000)]
Merge branch 'cindex'
* cindex: (29 commits)
cindex: --prune checkpoints to avoid OOM
cindex: ignore SIGPIPE
cindex: respect existing permissions
cindex: squelch incompatible options
cindex: implement reindex
cindex: add support for --prune
cindex: filter out non-existent git directories
spawn: show failing directory for chdir failures
cindex: improve granularity of quit checks
cindex: attempt to give oldest commits lowest docids
cindex: truncate or drop body for over-sized commits
cindex: check for checkpoint before giant messages
cindex: implement --max-size=SIZE
sigfd: pass signal name rather than number to callback
cindex: handle graceful shutdown by default
cindex: drop `unchanged' progress message
cindex: show shard number in progress message
cindex: implement --exclude= like -clone
ds: @post_loop_do replaces SetPostLoopCallback
cindex: use DS and workqueues for parallelism
...
Eric Wong [Sat, 25 Mar 2023 11:11:05 +0000 (11:11 +0000)]
lei_store: avoid redundant work on no-op worker spawn
While ->wq_workers_start is idempotent, the pipe creation for
PublicInbox::LeiStoreErr was not and required several extra
syscalls and FD allocations. Check the correct field required
for SOCK_SEQPACKET workers rather than pipe-based workers.
Fixes: cbc2890cb89b81cb ("lei/store: use SOCK_SEQPACKET rather than pipe")
Eric Wong [Tue, 21 Mar 2023 23:07:42 +0000 (23:07 +0000)]
cindex: respect existing permissions
For internal ($GIT_DIR/public-inbox-cindex) Xapian DBs, we can
rely on core.sharedRepository. For external ones, we'll just
rely on existing permissions if the directory already exists.
Eric Wong [Tue, 21 Mar 2023 23:07:40 +0000 (23:07 +0000)]
cindex: implement reindex
This allows changing --indexlevel at the moment and will allow
us to fix yet-to-be-discovered bugs or make backwards-compatible
improvements in the future.
Eric Wong [Tue, 21 Mar 2023 23:07:39 +0000 (23:07 +0000)]
cindex: add support for --prune
This gets rid of both inaccessible commits AND repositories.
It will only unindex commits which are pruned in git, first,
so repos with auto GC disabled will need GC to prune them.
Eric Wong [Tue, 21 Mar 2023 23:07:37 +0000 (23:07 +0000)]
spawn: show failing directory for chdir failures
Our use of `git rev-parse --git-dir' depends on our (v)fork+exec
wrapper doing chdir, so the error message is required to avoid
user confusion. I'm still avoiding `git -C $DIR' for now since
ancient versions of git did not support it.
Eric Wong [Tue, 21 Mar 2023 23:07:35 +0000 (23:07 +0000)]
cindex: attempt to give oldest commits lowest docids
Monotonically increasing docids may help us avoid sorting output
for the web and CLI, since recent commits are generally the most
desired search results.
`git log --reverse' incurs no extra overhead in this case, since
`--stdin' will mean git buffers the commit list in memory before
attempting to emit anything.
Eric Wong [Tue, 21 Mar 2023 23:07:30 +0000 (23:07 +0000)]
cindex: handle graceful shutdown by default
While individual Xapian shards are consistent due to the use of
Xapian transactions, the data across shards still needs to be
in a consistent state for our search to work.
Eric Wong [Tue, 21 Mar 2023 23:07:26 +0000 (23:07 +0000)]
ds: @post_loop_do replaces SetPostLoopCallback
This allows us to avoid repeatedly using memory-intensive
anonymous subs in CodeSearchIdx where the callback is assigned
frequently. Anonymous subs are known to leak memory in old
Perls (e.g. 5.16.3 in enterprise distros) and are still expensive in
newer Perls. So favor the (\&subroutine, @args) form which
allows us to eliminate anonymous subs going forward.
Only CodeSearchIdx takes advantage of the new API at the moment,
since it's the biggest repeat user of post-loop callback
changes.
Getting rid of the subroutine and relying on a global `our'
variable also has two advantages:
1) Perl warnings can detect typos at compile-time, whereas the
(now gone) method could only detect errors at run-time.
2) `our' variable assignment can be `local'-ized to a scope.
Eric Wong [Tue, 21 Mar 2023 23:07:25 +0000 (23:07 +0000)]
cindex: use DS and workqueues for parallelism
This avoids forking new shard processes for each repo we scan,
but we can't avoid many excessive commits since we need to
ensure the `seen()' sub can avoid excessive work.
Eric Wong [Tue, 21 Mar 2023 23:07:23 +0000 (23:07 +0000)]
cindex: use read-only shards during prep phases
No need to open shards for read/write access when read-only
will do. Since we also control how a document gets sharded,
we'll also access the shard directly instead of letting Xapian
do the mappings.
--reindex didn't work properly before this change since it was
over-indexing. It is now broken in the opposite way in that it
doesn't do reindexing at all. --reindex will be implemented
properly in the future.
Eric Wong [Tue, 21 Mar 2023 23:07:22 +0000 (23:07 +0000)]
cindex: parallelize prep phases
Listing refs, fingerprinting and root scanning can all be
parallelized to reduce runtime on SMP systems.
We'll use DESTROY-based dependency management with
parallelization as in LeiMirror to handle ref listing and
fingerprinting before serializing Xapian DB access to check
against the existing fingerprint.
We'll also delay root listing until we get a fingerprint
mismatch to speed up no-op indexing.
Eric Wong [Tue, 21 Mar 2023 23:07:21 +0000 (23:07 +0000)]
codesearch: initial cut w/ -cindex tool
It seems relying on root commits is a reasonable way to
deduplicate and handle repositories with common history.
I initially wanted to shoehorn this into extindex, but decided a
separate Xapian index layout capable of being EITHER external to
handle many forks or internal (in $GIT_DIR/public-inbox-cindex)
for small projects is the right way to go.
Unlike most existing parts of public-inbox, this relies on
absolute paths of $GIT_DIR stored in the Xapian DB and does not
rely on the config file. We'll be relying on the config file to
map absolute paths to public URL paths for WWW.
Eric Wong [Sat, 25 Mar 2023 02:08:52 +0000 (02:08 +0000)]
ipc: retry sendmsg + recvmsg calls on EINTR
I'm not sure how this went undetected for so long, but EINTR
must be checked for when working with blocking sockets. EINTR
shouldn't happen for non-blocking sockets, but it's
easier to just use the new wrapper in most of those places.
I don't know what I was smoking when I left out EINTR checks :x
Eric Wong [Wed, 15 Mar 2023 21:47:56 +0000 (21:47 +0000)]
ds: reap_pids: remove redundant signal blocking
Blocking signals when reaping was done when the lei pager was
spawned by the daemon in b90e8d6e02. Shortly afterwards in
7b79c918a5, the client script took over spawning of the pager
and made b90e8d6e02 redundant.
cf. b90e8d6e02 (ds: block signals when reaping, 2021-01-10)
    7b79c918a5 (lei: run pager in client script, 2021-01-10)
Eric Wong [Wed, 15 Feb 2023 22:20:23 +0000 (22:20 +0000)]
lei_mirror: use fetch.hideRefs to speed up connectivity check
`git fetch' runs an expensive connectivity check against all
refs, which is unnecessarily expensive for incremental fetches
on RAM-constrained systems.
This depends on the proposal to support `fetch.hideRefs' for `git fetch':
https://public-inbox.org/git/20230212090426.M558990@dcvr/