Tom Lane [Thu, 2 Apr 2026 16:20:26 +0000 (12:20 -0400)]
Harden astreamer tar parsing logic against archives it can't handle.
Previously, there was essentially no verification in this code that
the input is a tar file at all, let alone that it fits into the
subset of valid tar files that we can handle. This was exposed by
the discovery that we couldn't handle files that FreeBSD's tar
makes, because it's fairly aggressive about converting sparse WAL
files into sparse tar entries. To fix:
* Bail out if we find a pax extension header. This covers the
sparse-file case, and also protects us against scenarios where
the pax header changes other file properties that we care about.
(Eventually we may extend the logic to actually handle such
headers, but that won't happen in time for v19.)
* Be more wary about tar file type codes in general: do not assume
that anything that's neither a directory nor a symlink must be a
regular file. Instead, we just ignore entries that are none of the
three supported types.
* Apply pg_dump's isValidTarHeader to verify that a purported
header block is actually in tar format. To make this possible,
move isValidTarHeader into src/port/tar.c, which is probably where
it should have been since that file was created.
I also took the opportunity to const-ify the arguments of
isValidTarHeader and tarChecksum, and to use symbols not hard-wired
constants inside tarChecksum.
Back-patch to v18 but not further. Although this code exists inside
pg_basebackup in older branches, it's not really exposed in that
usage to tar files that weren't generated by our own code, so it
doesn't seem worth back-porting these changes across 3c9056981
and f80b09bac. I did choose to include a back-patch of 5868372bb
into v18 though, to minimize cosmetic differences between these
two branches.
Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/3049460.1775067940@sss.pgh.pa.us>
Backpatch-through: 18
Remove redundant SetLatch() calls in interrupt handling functions
Interrupt handling functions (e.g., HandleCatchupInterrupt(),
HandleParallelApplyMessageInterrupt()) are called only by
procsignal_sigusr1_handler(), which already calls SetLatch()
for the current process at the end of its processing.
Therefore, these interrupt handling functions do not need to
call SetLatch() themselves.
However, previously, some of these functions redundantly
called SetLatch(). This commit removes those unnecessary
calls.
While duplicate SetLatch() calls are redundant, they are
harmless, so this change is not backpatched.
John Naylor [Thu, 2 Apr 2026 12:39:57 +0000 (19:39 +0700)]
Check for __cpuidex and __get_cpuid_count separately
Previously we would only check for the availability of __cpuidex if
the related __get_cpuid_count was not available on a platform.
Future commits will need to access hypervisor information about
the TSC frequency of x86 CPUs. For that case __cpuidex is the only
viable option for accessing a high leaf (e.g. 0x40000000), since
__get_cpuid_count does not allow that.
__cpuidex is defined in cpuid.h for gcc/clang, but in intrin.h
for MSVC, so adjust tests to suite. We also need to cast the array
of unsigned ints to signed, since gcc (with -Wall) and clang emit
warnings otherwise.
Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: John Naylor <john.naylor@postgresql.org>
Discussion: https://postgr.es/m/CAP53PkyooCeR8YV0BUD_xC7oTZESHz8OdA=tP7pBRHFVQ9xtKg@mail.gmail.com
Andrew Dunstan [Wed, 1 Apr 2026 17:55:21 +0000 (13:55 -0400)]
Use command_ok for pg_regress calls in 002_pg_upgrade and 027_stream_regress
Now that command_ok() captures and displays failure output, use it
instead of system() plus manual diff-dumping in these two tests. This
simplifies both scripts and produces consistent, truncated output on
failure.
Andrew Dunstan [Wed, 1 Apr 2026 17:55:13 +0000 (13:55 -0400)]
perl tap: Use croak instead of die in our helper modules
Replace die with croak throughout Cluster.pm and Utils.pm (except in
INIT blocks and signal handlers, where die is correct) so that error
messages report the test script's line number rather than the helper
module's.
Add @CARP_NOT in Utils.pm listing PostgreSQL::Test::Cluster, so that
when a Utils function is called through a Cluster.pm wrapper, croak
skips both packages and reports the actual test-script caller.
Author: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/DFYFWM053WHS.10K8ZPJ605UFK@jeltef.nl
Andrew Dunstan [Wed, 1 Apr 2026 17:54:41 +0000 (13:54 -0400)]
perl tap: Show die reason in TAP output
Install a $SIG{__DIE__} handler in the INIT block of Utils.pm that emits
the die message as a TAP diagnostic. Previously, an unexpected die
(e.g. from safe_psql) produced only "no plan was declared" with no
indication of the actual error. The handler also calls done_testing()
to suppress that confusing message.
Dies during compilation ($^S undefined) and inside eval ($^S == 1) are
left alone.
Author: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/DFYFWM053WHS.10K8ZPJ605UFK@jeltef.nl
Discussion: https://postgr.es/m/20220222181924.eehi7o4pmneeb4hm%40alap3.anarazel.de
Andrew Dunstan [Wed, 1 Apr 2026 17:54:29 +0000 (13:54 -0400)]
perl tap: Show failed command output
Capture stdout and stderr from command_ok() and command_fails() and emit
them as TAP diagnostics on failure. Output is truncated to the first
and last 30 lines per channel to avoid flooding.
A new helper _diag_command_output() is introduced in Utils.pm so
both functions share the same truncation and formatting logic.
Author: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/DFYFWM053WHS.10K8ZPJ605UFK@jeltef.nl
Andrew Dunstan [Wed, 1 Apr 2026 17:53:49 +0000 (13:53 -0400)]
pg_regress: Include diffs in TAP output
When pg_regress fails it is often tedious to find the actual diffs,
especially in CI where you must navigate a file browser. Emit the first
80 lines of the combined regression.diffs as TAP diagnostics so the
failure reason is visible directly in the test output.
The line limit is across all failing tests in a single pg_regress run to
avoid flooding when a crash causes every subsequent test to fail.
New DIAG_DETAIL / DIAG_END tap output types are added, mirroring the
existing NOTE_DETAIL / NOTE_END pair, so that long diff lines can be
emitted without spurious '#' prefixes on continuation lines.
Author: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/DFYFWM053WHS.10K8ZPJ605UFK@jeltef.nl
Tomas Vondra [Thu, 2 Apr 2026 10:53:18 +0000 (12:53 +0200)]
jit: Change the default to off.
While JIT can speed up large analytical queries, it can also cause
serious performance issues on otherwise very fast queries. Compiling
and optimizing the expressions may be so expensive, it completely
outweighs the JIT benefits for shorter queries.
Ideally, we'd address this in the cost model, but the part deciding
whether to enable JIT for a query is rather simple, partially because we
don't have any reliable estimates of how expensive the LLVM compilation
and optimization is.
Sometimes seemingly unrelated changes (for example a couple additional
INSERTs into a table) increase the cost just enough to enable JIT,
resulting in a performance cliff.
Because of these risks, most large-scale deployments already disable JIT
by default. Notably, this includes all hyperscalers.
This commit changes our default to align with that established practice.
If we improve the JIT (be it better costing or cheaper execution), we
can consider enabling it by default again.
Add 'pg_stat_statements' to the crash restart test, to test that
shared memory and LWLock initialization works across crash restart in
a library listed in shared_preload_libraries. We had no test coverage
for that.
pg_publication_rel.prrelid refers to sequences whereas stores information only of tables.
Author: Peter Smith <smithpb2250@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAHut+Pv1UKR_bxmN7wcCCpQveHoYprvH-hbdFq8gsaH1Ye7B_w@mail.gmail.com
Thomas Munro [Thu, 2 Apr 2026 02:24:44 +0000 (15:24 +1300)]
jit: Stop emitting lifetime.end for LLVM 22.
The lifetime.end intrinsic can now only be used for stack memory
allocated with alloca[1][2][3]. We use it to tell LLVM about the
lifetime of function arguments/isnull values that we keep in palloc'd
memory, so that it can avoid spilling registers to memory.
We might need to rearrange things and put them on the stack, but that'll
take some research. In the meantime, unbreak the build on LLVM 22.
David Rowley [Thu, 2 Apr 2026 01:11:17 +0000 (14:11 +1300)]
Fix nocachegetattr() so it again supports deforming cstrings
c456e3911 added various optimizations to the tuple deformation routines.
One optimization assumed that heap tuples would never contain cstrings.
That optimization also made its way into nocachegetattr(), which isn't
correct as ROW() types get formed into HeapTuples by ExecEvalRow() and
those can contain cstring Datums. nocachegetattr() gets used to extract
Datums from those tuples.
Here we remove the pg_assume(), which was there to instruct the compiler
to omit the attlen == -2 related code in att_addlength_pointer().
Author: David Rowley <dgrowleyml@gmail.com> Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/80aeac57-8f50-4732-a5b4-c2373c3f8149@gmail.com
Andres Freund [Thu, 2 Apr 2026 00:02:09 +0000 (20:02 -0400)]
pg_test_timing: Reduce per-loop overhead
The pg_test_timing program was previously using INSTR_TIME_GET_NANOSEC on an
absolute instr_time value in order to do a diff, which goes against the spirit
of how the GET_* macros are supposed to be used, and will cause overhead in a
future change that assumes these macros are typically used on intervals only.
Additionally the program was doing unnecessary work in the test loop by
measuring the time elapsed, instead of checking the existing current time
measurement against a target end time. To support that, introduce a new
INSTR_TIME_ADD_NANOSEC macro that allows adding user-defined nanoseconds
to an instr_time variable.
While modifying the relevant code anyway, simplify it by not handling
durations <= 0 in test_timing(), since duration is unsigned and 0 is
disallowed by the caller.
Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53Pkyxv3-3gX+aOxC5tX0p2v9RHU+XH0iyvb64+ZnBXj92vg@mail.gmail.com
Andres Freund [Wed, 1 Apr 2026 23:50:03 +0000 (19:50 -0400)]
read_stream: Prevent distance from decaying too quickly
Until now we reduced the look-ahead distance by 1 on every hit, and doubled it
on every miss. That is problematic because there are very common IO patterns
where this prevents us from ever reaching a sufficiently high distance (e.g. a
miss followed by a hit will never have the distance grow beyond 2). In many
such cases, if we had ever reached a sufficient look-ahead distance, things
would have been fine, because we grow the distance faster than we decrease it.
One might think that the most obvious answer to this problem would be to never
reduce the distance. However, that would not work well, as (particularly with
upcoming users of read streams), it is reasonably common to at first have a
lot of misses and then to transition to a fully cached workload, e.g. because
the same blocks are needed repeatedly within one stream. Doing unnecessarily
deep readahead can be costly, due to having to pin a lot more buffers, which
increases CPU overhead.
Because the cost of a synchronously handled miss can be very high (multiple
milliseconds for every IO with commonly used storage) compared to the CPU
overhead of keeping the distance too high, we want to err on the side of not
reducing the distance too early.
The insight that a decrease of the distance by 1 at ever hit may be ok at
large distances, but not at low distances, shows a way out: If we only allow
decreasing the distance once there were no misses for our maximum look-ahead
distance, we will keep the distance high as long as readahead has a chance to
do IO asynchronously, but not commonly when not.
Several folks have written variants of this patch, including at least Thomas
Munro, Melanie Plageman and I.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
Andres Freund [Wed, 1 Apr 2026 23:22:44 +0000 (19:22 -0400)]
read_stream: Issue IO synchronously while in fast path
While in fast-path, execute any IO that we might encounter synchronously.
Because we are, in that moment, not reading ahead, dispatching any occasional
IO to workers has the dispatch overhead, without any realistic chance of the
IO completing before we need it.
This helps io_method=worker performance for workloads that have only
occasional cache misses, but where those occasional misses still take long
enough to matter. It is likely this is only measurable with fast local
storage or workloads with the data in the kernel page cache, as with remote
storage the IO latency, not the dispatch-to-worker latency, is the determining
factor.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
This is an extension of the UPDATE and DELETE commands to do a
"temporal update/delete" based on a range or multirange column. The
user can say UPDATE t FOR PORTION OF valid_at FROM '2001-01-01' TO
'2002-01-01' SET ... (or likewise with DELETE) where valid_at is a
range or multirange column.
The command is automatically limited to rows overlapping the targeted
portion, and only history within those bounds is changed. If a row
represents history partly inside and partly outside the bounds, then
the command truncates the row's application time to fit within the
targeted portion, then it inserts one or more "temporal leftovers":
new rows containing all the original values, except with the
application-time column changed to only represent the untouched part
of history.
To compute the temporal leftovers that are required, we use the *_minus_multi
set-returning functions defined in 5eed8ce50c.
- Added bison support for FOR PORTION OF syntax. The bounds must be
constant, so we forbid column references, subqueries, etc. We do
accept functions like NOW().
- Added logic to executor to insert new rows for the "temporal
leftover" part of a record touched by a FOR PORTION OF query.
- Documented FOR PORTION OF.
- Added tests.
Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/ec498c3d-5f2b-48ec-b989-5561c8aa2024%40illuminatedcomputing.com
Fix vicinity of tuple_insert to use uint32, not int, for options
Oversight in commit 1bd6f22f43ac: I was way too optimistic about the
compiler letting me know what variables needed to be updated, and missed
a few of them. Clean it up.
Author: Álvaro Herrera <alvherre@kurilemu.de> Reported-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/40E570EE-5A60-49D8-B8F7-2F8F2B7C8DFA@gmail.com
Dean Rasheed [Wed, 1 Apr 2026 16:02:24 +0000 (17:02 +0100)]
Add support for extended statistics on virtual generated columns.
This allows both univariate and multivariate statistics to be built on
virtual generated columns and expressions that refer to virtual
generated columns. The restriction disallowing extended statistics on
a single column is lifted in the case of a single virtual generated
column, since it is treated as a single expression.
In the catalogs, references to virtual generated columns are stored
as-is. They are expanded at ANALYZE time to build the statistics, and
at planning time to allow the optimizer to make use of the statistics.
This allows the statistics to be correctly rebuilt using ANALYZE, if a
column's generation expression is altered (which causes any existing
statistics data to be deleted).
Author: Yugo Nagata <nagata@sraoss.co.jp> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/20250422181006.dd6f9d1d81299f5b2ad55e1a@sraoss.co.jp
Author: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHut%2BPv72haFerrCdYdmF6hu6o2jKcGzkXehom%2BsP-JBBmOVDg%40mail.gmail.com
Backpatch-through: 14
Andres Freund [Wed, 1 Apr 2026 13:26:43 +0000 (09:26 -0400)]
bufmgr: Return whether WaitReadBuffers() needed to wait
Thanks to the previous commit, pgaio_wref_check_done() will now detect whether
IO has completed even if userspace has not yet consumed the kernel completion.
This knowledge can be useful for callers of WaitReadBuffers() to know whether
it needed to wait or not, e.g. for adjusting read-ahead aggressiveness or for
instrumentation.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
Discussion: https://postgr.es/m/a177a6dd-240b-455a-8f25-aca0b1c08c6e@vondra.me
Andres Freund [Wed, 1 Apr 2026 13:26:43 +0000 (09:26 -0400)]
aio: io_uring: Allow IO methods to check if IO completed in the background
Until now pgaio_wref_check_done() with io_method=io_uring would not detect if
IOs are known to have completed to the kernel, but the completion has not yet
been consumed by userspace. This can lead to inferior performance and also
makes it harder to use smarter feedback logic in read_stream, because we
cannot use knowledge about whether an IO completed to control the readahead
distance.
This commit just adds the io_uring specific infrastructure. Later commits will
return whether a wait was needed from WaitReadBuffers() and then use that
knowledge.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
Amit Langote [Wed, 1 Apr 2026 09:43:40 +0000 (18:43 +0900)]
Make FastPathMeta self-contained by copying FmgrInfo structs
FastPathMeta stored pointers into ri_compare_cache entries via
compare_entries[], creating a dependency on that cache remaining
stable. If ri_compare_cache entries were invalidated after fpmeta
was populated, the pointers would dangle.
Replace compare_entries[] with inline copies of the two FmgrInfo
fields actually needed (cast_func_finfo and eq_opr_finfo), copied
at populate time via fmgr_info_copy(). fpmeta now depends only on
riinfo remaining valid, which is already handled by the invalidation
callback.
Introduced by commit 2da86c1ef9 ("Add fast path for foreign key
constraint checks"), noticed while reviewing code for robustness
under CLOBBER_CACHE_ALWAYS.
Amit Langote [Wed, 1 Apr 2026 08:29:33 +0000 (17:29 +0900)]
Fix two issues in fast-path FK check introduced by commit 2da86c1ef9
First, under CLOBBER_CACHE_ALWAYS, the RI_ConstraintInfo entry can
be invalidated by relcache callbacks triggered inside table_open()
or index_open(), leaving ri_FastPathCheck() calling
ri_populate_fastpath_metadata() with a stale entry whose valid flag
is false. Fix by moving the fpmeta initialization to after
ri_CheckPermissions(), reloading riinfo first to ensure it is
valid, then calling ri_ExtractValues() and build_index_scankeys()
immediately after before any further operations that could trigger
invalidation.
Second, fpmeta allocated in TopMemoryContext was not freed when the
entry was invalidated in InvalidateConstraintCacheCallBack(),
leaking memory each time the constraint cache entry was recycled.
Fix by freeing and NULLing fpmeta at invalidation time.
Noticed locally when testing with CLOBBER_CACHE_ALWAYS.
John Naylor [Wed, 1 Apr 2026 07:18:57 +0000 (14:18 +0700)]
Skip common prefixes during radix sort
During the counting step, keep track of the bits that are the same
for the entire input. If we counted only a single distinct byte,
the next recursion will start at the next byte position that has
more than one distinct byte in the input. This allows us to skip over
multiple passes where the byte is the same for the entire input.
This provides a significant speedup for integers that have some upper
bytes with all-zeros or all-ones, which is common.
Reviewed-by: Chengpeng Yan <chengpeng_yan@outlook.com> Reviewed-by: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/CANWCAZYpGMDSSwAa18fOxJGXaPzVdyPsWpOkfCX32DWh3Qznzw@mail.gmail.com
Reduce log level of some logical decoding messages from LOG to DEBUG1
Previously some logical decoding messages (e.g., "logical decoding found
consistent point") were logged at level LOG, even though they provided
low-level, developer-oriented information that DBAs were typically not
interested in.
Since these messages can occur routinely (for example, when keeping calling
pg_logical_slot_get_changes() to obtain the changes from logical decoding),
logging them at LOG can be overly verbose.
This commit reduces their log level to DEBUG1 to avoid unnecessary log noise.
This change applies to a small set of messages for now. Additional messages
may be adjusted similarly in the future.
Even with this change, if these messages from walsender still need to be
observed, enabling DEBUG1 logging selectively for walsender (e.g.,
log_min_messages = 'warning,walsender:debug1') would be helpful to avoid
increasing overall log volume.
Use the standard C23 and C++ attributes [[nodiscard]], [[noreturn]],
and [[maybe_unused]], if available.
This makes pg_nodiscard and pg_attribute_unused() available in
not-GCC-compatible compilers that support C23 as well as in C++.
For pg_noreturn, we can now drop the GCC-specific and MSVC-specific
fallbacks, because the C11 and the C++ implementation will now cover
all required cases.
Note, in a few places, we need to change the position of the attribute
because it's not valid in that place in C23.
The test_cplusplusext test module has so far been disabled on MSVC.
The only remaining problem now is that designated initializers, as
used in PG_MODULE_MAGIC, require C++20. (With GCC and Clang they work
in older C++ versions as well.)
This adds another test in the top-level meson.build to check that the
compiler supports C++20 designated initializers. This is not
required, we are just checking and recording the answer. If yes, we
can enable the test module.
Most current compilers likely won't be in C++20 mode by default. This
doesn't change that; we are not doing anything to try to switch the
compiler into that mode. This might be a separate project, but for
now we'll leave that for the user or the test scaffolding.
The VS task on Cirrus CI is changed to provide the required flag to
turn on C++20 mode.
There is no equivalent change in configure, since this change mainly
targets MSVC.
Amit Kapila [Wed, 1 Apr 2026 03:38:54 +0000 (09:08 +0530)]
Fix miscellaneous issues in EXCEPT publication clause.
Improve documentation regarding multiple publications and partition
hierarchies. Refine error reporting for excluded relations. Consolidate
docs by using table_object instead of expanded table syntax in publication
commands. Also includes minor test cleanup and naming fixes.
Reported-by: Peter Smith <smithpb2250@gmail.com>
Author: vignesh C <vignesh21@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CALDaNm1CiBYcteE_jjPA4BPHfX30dg9eTTTkJgkjY5tgE7t=bQ@mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
Thomas Munro [Wed, 1 Apr 2026 01:02:39 +0000 (14:02 +1300)]
Fix pg_waldump/t/001_basic.pl with BSD tar on ZFS.
The new test fails with an error about a missing WAL file on that
stack, because it is archived in GNU tar's --sparse --format=posix
format. BSD tar uses that format by default, unlike GNU tar itself, and
ZFS triggers it by implicitly creating sparse files when it sees a lot
of zeroes.
The problem will surely also affect real users of the new tar support in
pg_waldump (commit b15c1513) and pg_verifybackup (commit b3cf461b3) on
such systems. Ideas under discussion, but for now the test is made to
pass by disabling sparse file detection in BSD tar.
Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/1624716.1774736283%40sss.pgh.pa.us
Andres Freund [Tue, 31 Mar 2026 23:24:58 +0000 (19:24 -0400)]
bufmgr: Fix ordering of checks in PinBuffer()
The check for skip_if_not_valid added in 819dc118c0f6 was put at the start of
the loop. A CAS loop in theory does allow to make that check in a race free
manner. However, just after the check, there's a
old_buf_state = WaitBufHdrUnlocked(buf);
which introduces a race, because it would allow BM_VALID to be cleared, after
the skip_if_not_valid check.
Fix by restarting the loop after WaitBufHdrUnlocked().
Reported-by: Yura Sokolov <y.sokolov@postgrespro.ru>
Discussion: https://postgr.es/m/5bf667f3-5270-4b19-a08f-0facbecdff68@postgrespro.ru
Tom Lane [Tue, 31 Mar 2026 20:36:01 +0000 (16:36 -0400)]
Doc: warn that parallel pg_restore may fail if --no-schema was used.
If the archive file was made with --no-schema or related options
then it likely does not have enough dependency information to
ensure that parallel restore will choose a workable restore order.
Document this.
In passing, do some minor wordsmithing on new nearby text about
restoring from pg_dumpall archives.
Author: vaibhave postgres <postgresvaibhave@gmail.com> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAM_eQjzTLtt1X9WKvMV6Rew0UvxT3mmhimZa9WT-vqaPjmXk-g@mail.gmail.com
Melanie Plageman [Tue, 31 Mar 2026 19:02:52 +0000 (15:02 -0400)]
Fix test_aio read_buffers() to work without cassert
In a production build, StartReadBuffers() doesn't populate all fields
of a ReadBuffersOperation for a buffer hit because no callers use them
(they are populated in assert builds).
read_buffers() (a test-only function) relied on some of these fields, so
AIO tests failed on non-assert builds (discovered on the buildfarm after
commit 020c02bd908).
Fix by tracking the required information ourselves in read_buffers() and
avoiding reliance on the ReadBuffersOperation unless we know that we did
IO.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/9ce8f5d8-8ab2-4aa2-b062-c5d74161069c%40gmail.com
Jacob Champion [Tue, 31 Mar 2026 18:47:33 +0000 (11:47 -0700)]
oauth: Don't log discovery connections by default
Currently, when the client sends a parameter discovery request within
OAUTHBEARER, the server logs the attempt with
FATAL: OAuth bearer authentication failed for user
These log entries are difficult to distinguish from true authentication
failures, and by default, libpq sends a discovery request as part of
every OAuth connection, making them annoyingly noisy. Use the new
PG_SASL_EXCHANGE_ABANDONED status to suppress them.
Patch by Zsolt Parragi, with some additional comments added by me.
Jacob Champion [Tue, 31 Mar 2026 18:47:31 +0000 (11:47 -0700)]
sasl: Allow backend mechanisms to "abandon" exchanges
Introduce PG_SASL_EXCHANGE_ABANDONED, which allows CheckSASLAuth to
suppress the failing log entry for any SASL exchange that isn't actually
an authentication attempt. This is desirable for OAUTHBEARER's discovery
exchanges (and a subsequent commit will make use of it there).
This might have some overlap in the future with in-band aborts for SASL
exchanges, but it's intentionally not named _ABORTED to avoid confusion.
(We don't currently support clientside aborts in our SASL profile.)
Adapted from a patch by Zsolt Parragi.
Author: Zsolt Parragi <zsolt.parragi@percona.com> Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAN4CZFPim7hUiyb7daNKQPSZ8CvQRBGkVhbvED7yZi8VktSn4Q%40mail.gmail.com
Jacob Champion [Tue, 31 Mar 2026 18:47:29 +0000 (11:47 -0700)]
Add FATAL_CLIENT_ONLY to ereport/elog
SASL exchanges must end with either an AuthenticationOk or an
ErrorResponse from the server, and the standard way to produce an
ErrorResponse packet is for auth_failed() to call ereport(FATAL). This
means that there's no way for a SASL mechanism to suppress the server
log entry if the "authentication attempt" was really just a query for
authentication metadata, as is done with OAUTHBEARER.
Following the example of 1f9158ba4, add a FATAL_CLIENT_ONLY elevel. This
will allow ClientAuthentication() to choose not to log a particular
failure, while still correctly ending the authentication exchange before
process exit.
(The provenance of this patch is convoluted: since it's a mechanical
copy-paste of 1f9158ba4, both Zsolt Parragi and I produced nearly
identical versions independently, and Andrey Borodin reviewed Zsolt's
version. Tom Lane is the author of 1f9158ba4, but I don't want to imply
that he's signed off on this adaptation. See Discussion.)
Jacob Champion [Tue, 31 Mar 2026 18:47:26 +0000 (11:47 -0700)]
libpq: Allow developers to reimplement libpq-oauth
For PG19, since we won't have the ability to officially switch out flow
plugins, relax the flow-loading code to not require the internal init
function. Modules that don't have one will be treated as custom user
flows in error messages.
This will let bleeding-edge developers more easily test out the API and
provide feedback for PG20, by telling the runtime linker to find a
different libpq-oauth. It remains undocumented for end users.
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAOYmi%2BmrGg%2Bn_X2MOLgeWcj3v_M00gR8uz_D7mM8z%3DdX1JYVbg%40mail.gmail.com
Jacob Champion [Tue, 31 Mar 2026 18:47:23 +0000 (11:47 -0700)]
libpq: Poison the v2 part of a v1 Bearer request
The new PGoauthBearerRequestV2 API (which has similarities to the
"subclass" pointer architecture in use by the backend, for Nodes)
carries the risk of a developer ignoring the type of hook in use and
just casting directly to the V2 struct. This will appear to work fine in
19, but crash (or worse) when speaking to libpq 18.
However, we're in a unique position to catch this problem, because we
have tight control over the struct. Add poisoning code to the v1 path
which does the following:
- masks the v2 request->issuer pointer, to hopefully point at nonsense
memory
- abort()s if the v2 request->error is assigned by the hook
- attempts to cover both with VALGRIND_MAKE_MEM_NOACCESS for the
duration of the callback (a potential AddressSanitizer implementation
is left for future work)
The struct is unpoisoned after the call, so we can switch back to the v2
internal implementation when necessary.
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAOYmi%2BnCg5upBVOo_UCSjMfO%3DYMkZXcSEsgaADKXqerr5wahZQ%40mail.gmail.com
Nathan Bossart [Tue, 31 Mar 2026 17:43:52 +0000 (12:43 -0500)]
Avoid including vacuum.h in tableam.h and heapam.h.
Commit 2252fcd427 modified some function prototypes in tableam.h
and heapam.h to take a VacuumParams argument instead of a pointer,
which required including vacuum.h in those headers. vacuum.h has a
reasonably large dependency tree, and headers like tableam.h are
widely included, so this is not ideal. To fix, change the
functions in question to accept a "const VacuumParams *" argument
instead. That allows us to use a forward declaration for
VacuumParams and avoid including vacuum.h. Since vacuum_rel()
needs to scribble on the params argument, we still pass it by value
to that function so that the original struct is not modified.
Reported-by: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/rzxpxod4c4la62yvutyrvgoyilrl2fx55djaf2suidy7np5m6c%403l2ln476eadh
Tom Lane [Tue, 31 Mar 2026 15:49:54 +0000 (11:49 -0400)]
Doc: remove bogus claim that tsvectors can have up to 2^64 entries.
This is nonsense on its face, since the textsearch parsing logic
generally uses int32 to count words (see, eg, struct ParsedText).
Not to mention that we don't support input strings larger than
1GB.
The actual limitation of interest is documented nearby: a tsvector
can't be larger than 1MB, thanks to 20-bit offset fields within it
(see WordEntry.pos). That constrains us to well under 256K lexemes
per tsvector, depending on how many positions are stored per lexeme.
It seems sufficient therefore to just remove the bit about number
of lexemes.
Author: Dharin Shah <dharinshah95@gmail.com>
Discussion: https://postgr.es/m/CAOj6k6d0YO6AO-bhxkfUXPxUi-+YX9-doh2h5D5z0Bm8D2w=OA@mail.gmail.com
Tom Lane [Tue, 31 Mar 2026 15:23:20 +0000 (11:23 -0400)]
Doc: improve explanation of GiST compress/decompress methods.
The docs previously didn't explain that leaf and non-leaf keys
could be treated differently, even though many of our opclasses
do exactly that. It also wasn't explained how that relates to
the STORAGE option, particularly since only one storage type
can be specified for both leaf and non-leaf keys.
While here, reorganize the text slightly, rather than sticking
additional detail into what's supposed to be a brief summary
paragraph.
Author: Paul A Jungwirth <pj@illuminatedcomputing.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA+renyWs5Np+FLSYfL+eu20S4U671A3fQGb-+7e22HLrD1NbYw@mail.gmail.com
It's been unused forever. There's no urgency in removing it now, but
it was just something that caught my eye.
Aleksander Alekseev proposed this a long time ago [0], but Tom Lane
was worried about third-party extensions using it. I believe that's a
non-issue: I tried grepping through all extensions found on github and
didn't find any references to HASH_SEGMENT.
Peter Eisentraut [Tue, 31 Mar 2026 09:44:43 +0000 (11:44 +0200)]
Fix cross variable references in graph pattern causing segfault
When converting the WHERE clause in an element pattern,
generate_query_for_graph_path() calls replace_property_refs() to
replace the property references in it. Only the current graph element
pattern is passed as the context for replacement. If there are
references to variables from other element patterns, it causes a
segmentation fault (an assertion failure in an Assert enabled build)
since it does not find path_element object corresponding to those
variables.
We do not support forward and backward variable references within a
graph table clause. Hence prohibit all the cross references.
Peter Eisentraut [Tue, 31 Mar 2026 09:37:35 +0000 (11:37 +0200)]
Property references are preferred over regular column references
When a ColumnRef can be resolved as a graph table property reference
and a lateral table column reference prefer the graph table property
reference since element pattern variables in the GRAPH_TABLE clause
form the innermost namespace.
XLOG_CHECKPOINT_REDO only contains the wal_level copied straight in
without an encapsulating record structure. While it works, it makes
future uses of XLOG_CHECKPOINT_REDO hard as there is nowhere to put
new data items. This fix this was inspired by the online checksums
patch which adds data to this record, but this change has value on
its own.
Author: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/c92b5d8b-bc03-47bc-b209-2e4a719eee32@iki.fi
Peter Eisentraut [Tue, 31 Mar 2026 06:38:24 +0000 (08:38 +0200)]
Disable some C++ warnings in MSVC
Flexible array members, as used in many PostgreSQL header files, are
not a C++ feature. MSVC warns about these. Disable the
warning. (GCC and Clang accept them, but they would warn in -pedantic
mode.)
Amit Langote [Tue, 31 Mar 2026 04:49:21 +0000 (13:49 +0900)]
Add fast path for foreign key constraint checks
Add a fast-path optimization for foreign key checks that bypasses SPI
by directly probing the unique index on the referenced table.
Benchmarking shows ~1.8x speedup for bulk FK inserts (int PK/int FK,
1M rows, where PK table and index are cached).
The fast path applies when the referenced table is not partitioned and
the constraint does not involve temporal semantics. Otherwise, the
existing SPI path is used.
This optimization covers only the referential check trigger
(RI_FKey_check). The action triggers (CASCADE, SET NULL, SET DEFAULT,
RESTRICT, NO ACTION) must find rows on the FK side to modify, which
requires a table scan with no guaranteed index available, and then
execute DML against those rows through the full executor path including
any triggered actions. Replicating that without substantial code
duplication is not feasible, so those triggers remain on the SPI path.
Extending the fast path to action triggers remains possible as future
work if the necessary infrastructure is built.
The new ri_FastPathCheck() function extracts the FK values, builds scan
keys, performs an index scan, and locks the matching tuple with
LockTupleKeyShare via ri_LockPKTuple(), which handles the RI-specific
subset of table_tuple_lock() results.
If the locked tuple was reached by chasing an update chain
(tmfd.traversed), recheck_matched_pk_tuple() verifies that the key
is still the same, emulating EvalPlanQual.
The scan uses GetTransactionSnapshot(), matching what the SPI path
uses (via _SPI_execute_plan pushing GetTransactionSnapshot() as the
active snapshot). Under READ COMMITTED this is a fresh snapshot;
under REPEATABLE READ / SERIALIZABLE it is the frozen transaction-
start snapshot, so PK rows committed after the transaction started
are not visible.
The ri_CheckPermissions() function performs schema USAGE and table
SELECT checks, matching what the SPI path gets implicitly through
the executor's permission checks. The fast path also switches to
the PK table owner's security context (with SECURITY_NOFORCE_RLS)
before the index probe, matching the SPI path where the query runs
as the table owner.
ri_HashCompareOp() is adjusted to handle cross-type equality operators
(e.g. int48eq for int4 PK / int8 FK) which can appear in conpfeqop.
The existing code asserted same-type operators only, which was correct
for its existing callers (ri_KeysEqual compares same-type FK column
values via ff_eq_oprs), but the fast path is the first caller to pass
pf_eq_oprs, which can be cross-type.
Per-key metadata (compare entries, operator procedures, strategy
numbers) is cached in RI_ConstraintInfo via
ri_populate_fastpath_metadata() on first use, eliminating repeated
calls to ri_HashCompareOp() and get_op_opfamily_properties().
conindid and pk_is_partitioned are also cached at constraint load
time, avoiding per-invocation syscache lookups and the need to open
pk_rel before deciding whether the fast path applies.
New regression tests cover RLS bypass and ACL enforcement for the
fast-path permission checks. New isolation tests exercise concurrent
PK updates under both READ COMMITTED and REPEATABLE READ.
Author: Junwang Zhao <zhjwpku@gmail.com> Co-authored-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Haibo Yan <tristan.yim@gmail.com> Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA+HiwqF4C0ws3cO+z5cLkPuvwnAwkSp7sfvgGj3yQ=Li6KNMqA@mail.gmail.com
Amit Kapila [Tue, 31 Mar 2026 04:10:51 +0000 (09:40 +0530)]
Change syntax of EXCEPT TABLE clause in publication commands.
Adjust the syntax of the EXCEPT clause in CREATE/ALTER PUBLICATION
added in commits fd366065e0 and 493f8c6439 to move the TABLE keyword
inside the relation list.
Old syntax:
CREATE PUBLICATION ... FOR ALL TABLES EXCEPT TABLE (t1, ...);
ALTER PUBLICATION ... SET ALL TABLES EXCEPT TABLE (t1, ...);
New syntax:
CREATE PUBLICATION ... FOR ALL TABLES EXCEPT (TABLE t1, ...);
ALTER PUBLICATION ... SET ALL TABLES EXCEPT (TABLE t1, ...);
This is to ensure that inclusion and exclusion list can be specified in
a same way. Previously, the exclusion table list can be specified as
TABLE (t1, t2, t3) and inclusion list can be specified as TABLE t1, t2,
t3, or TABLE t1, TABLE t2, TABLE t3.
This change is purely syntactic and does not alter behavior.
Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
Author: vignesh C <vignesh21@gmail.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAD21AoCC8XuwfX62qKBSfHUAoww_XB3_84HjswgL9jxQy696yw@mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
Tom Lane [Mon, 30 Mar 2026 22:25:17 +0000 (18:25 -0400)]
Doc: update ddl.sgml's description of cmin and cmax.
We long ago folded these two tuple header fields into one field
to save space. However, nothing was done to the user-facing
documentation about them, perhaps with the idea that we'd add
code to emit something approximating the original definitions.
That never happened and presumably never will, so update the
text to reflect current reality.
Author: Paul A Jungwirth <pj@illuminatedcomputing.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA+renyVYYboiTayRRE0j1oKpeB+NjEBSUXfwgEu6O0JESSmauQ@mail.gmail.com
Peter Eisentraut [Mon, 30 Mar 2026 21:12:38 +0000 (23:12 +0200)]
Add warning option -Wold-style-declaration
This warning has been triggered a few times via the buildfarm (see
commits 8212625e53f, 2b7259f8557, afe86a9e73b), so we might as well
add it so that everyone sees it.
(This is completely separate from the recently added
-Wold-style-definition.)
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/aa73q1aT0A3/vke/%40ip-10-97-1-34.eu-west-3.compute.internal
Jacob Champion [Mon, 30 Mar 2026 21:14:45 +0000 (14:14 -0700)]
libpq: Add oauth_ca_file option to change CAs without debugging
PG18 hid the PGOAUTHCAFILE envvar behind PGOAUTHDEBUG=UNSAFE, because I
thought that any "real" production usage of private CA certificates
would have them added to the Curl system trust store. But there are use
cases, such as containerized environments, that prefer to manage custom
CA settings more granularly; some of them consider envvar configuration
of certificates to be standard practice.
Move PGOAUTHCAFILE out from under the debug flag, and add an
oauth_ca_file option to libpq to configure trusted CAs per connection.
Patch by Jonathan Gonzalez V., with some additional wordsmithing and
test organization by me.
Author: Jonathan Gonzalez V. <jonathan.abdiel@gmail.com> Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/16a91d02795cb991963326a902afa764e4d721db.camel%40gmail.com
Nathan Bossart [Mon, 30 Mar 2026 21:12:08 +0000 (16:12 -0500)]
Remove bits* typedefs.
In addition to removing the bits8, bits16, and bits32 typedefs,
this commit replaces all uses with uint8, uint16, or uint32. bits*
provided little benefit beyond establishing the intent of the
variable, and they were inconsistently used for that purpose.
Third-party code should instead use the corresponding uint*
typedef.
Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://postgr.es/m/absbX33E4eaA0Ity%40nathan
Melanie Plageman [Mon, 30 Mar 2026 20:07:11 +0000 (16:07 -0400)]
Set pd_prune_xid on insert
Now that on-access pruning can update the visibility map (VM) during
read-only queries, set the page’s pd_prune_xid hint during INSERT and on
the new page during UPDATE.
This allows heap_page_prune_and_freeze() to set the VM the first time a
page is read after being filled with tuples. This may avoid I/O
amplification by setting the page all-visible when it is still in shared
buffers and allowing later vacuums to skip scanning the page. It also
enables index-only scans of newly inserted data much sooner.
As a side benefit, this addresses a long-standing note in heap_insert()
and heap_multi_insert(): aborted inserts can now be pruned on-access
rather than lingering until the next VACUUM.
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
Melanie Plageman [Mon, 30 Mar 2026 19:47:07 +0000 (15:47 -0400)]
Allow on-access pruning to set pages all-visible
Many queries do not modify the underlying relation. For such queries, if
on-access pruning occurs during the scan, we can check whether the page
has become all-visible and update the visibility map accordingly.
Previously, only vacuum and COPY FREEZE marked pages as all-visible or
all-frozen.
This commit implements on-access VM setting for sequential scans, tid
range scans, sample scans, bitmap heap scans, and the underlying heap
relation in index scans.
Setting the visibility map on-access can avoid write amplification
caused by vacuum later needing to set the page all-visible, which could
trigger a write and potentially an FPI. It also allows more frequent
index-only scans, since they require pages to be marked all-visible in
the VM.
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
Peter Eisentraut [Mon, 30 Mar 2026 18:55:16 +0000 (20:55 +0200)]
configure: Apply -Werror=vla to C++ as well as C
The comment part of d9dd406fe281 mentioned that -Wvla is not applicable
for C++. That is not fully correct: it is true that VLAs are not part of the
C++ standard, and g++ with -pedantic will also warn about them as a non-standard
extension. However, -Wvla itself works fine in C++ and will catch VLA
usage just as in C.
Fix configure.ac to apply -Werror=vla to C++ as well. There is no need to
fix meson.build as it already includes it in common_warning_flags.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/7bf60ab1-2b5d-4a77-93ce-815072a0a014%40eisentraut.org
Tom Lane [Mon, 30 Mar 2026 17:59:54 +0000 (13:59 -0400)]
Be more careful to preserve consistency of a tuplestore.
Several places in tuplestore.c would leave the tuplestore data
structure effectively corrupt if some subroutine were to throw
an error. Notably, if WRITETUP() failed after some number of
successful calls within dumptuples(), the tuplestore would
contain some memtuples pointers that were apparently live
entries but in fact pointed to pfree'd chunks.
In most cases this sort of thing is fine because transaction
abort cleanup is not too picky about the contents of memory that
it's going to throw away anyway. There's at least one exception
though: if a Portal has a holdStore, we're going to call
tuplestore_end() on that, even during transaction abort.
So it's not cool if that tuplestore is corrupt, and that means
tuplestore.c has to be more careful.
This oversight demonstrably leads to crashes in v15 and before,
if a holdable cursor fails to persist its data due to an undersized
temp_file_limit setting. Very possibly the same thing can happen in
v16 and v17 as well, though the specific test case submitted failed
to fail there (cf. 095555daf). The failure is accidentally dodged
as of v18 because 590b045c3 got rid of tuplestore_end's retail tuple
deletion loop. Still, it seems unwise to permit tuplestores to become
internally inconsistent in any branch, so I've applied the same fix
across the board.
Since the known test case for this is rather expensive and doesn't
fail in recent branches, I've omitted it.
Bug: #19438 Reported-by: Dmitriy Kuzmin <kuzmin.db4@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/19438-9d37b179c56d43aa@postgresql.org
Backpatch-through: 14
Replace getopt() with our re-entrant variant in the backend
Some of these probably could continue using non-re-entrant getopt()
even if we start using threads in the future, but it seems better to
make them all anyway, so that we have a clear-cut rule of "no plain
getopt() in the postgres binary".
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/d1da5f0e-0d68-47c9-a882-eb22f462752f@iki.fi
The standard getopt(3) function is not re-entrant nor thread-safe.
That's OK for current usage, but it's one more little thing we need to
change in order to make the server multi-threaded.
There's no standard getopt_r() function on any platform, I presume
because command line arguments are usually parsed early when you start
a program, before launching any threads, so there isn't much need for
it. However, we call it at backend startup to parse options from the
startup packet. Because there's no standard, we're free to define our
own.
The pg_getopt_start/next() implementation is based on the old getopt
implementation, I just gathered all the state variables to a struct.
The non-re-entrant getopt() function is now a wrapper around the
re-entrant variant, on platforms that don't have getopt(3).
getopt_long() is not used in the server, so we don't need to provide a
re-entrant variant of that.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/d1da5f0e-0d68-47c9-a882-eb22f462752f@iki.fi
The function is supposed to look at the passed in 'arg' argument, but
peeks at the 'optarg' global variable that's part of getopt()
instead. It happened to work anyway, because all callers passed
'optarg' as the argument.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/d1da5f0e-0d68-47c9-a882-eb22f462752f@iki.fi
Melanie Plageman [Mon, 30 Mar 2026 17:27:34 +0000 (13:27 -0400)]
Pass down information on table modification to scan nodes
Pass down information to sequential scan, index [only] scan, bitmap
table scan, sample scan, and TID range scan nodes on whether or not the
query modifies the relation being scanned. A later commit will use this
information to update the VM during on-access pruning only if the
relation is not modified by the query.
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4379FDA3-9446-4E2C-9C15-32EFE8D4F31B%40yandex-team.ru
Melanie Plageman [Mon, 30 Mar 2026 16:27:24 +0000 (12:27 -0400)]
Thread flags through begin-scan APIs
Add an AM user-settable flags parameter to several of the table scan
functions, one table AM callback, and index_beginscan(). This allows
users to pass additional context to be used when building the scan
descriptors.
For index scans, a new flags field is added to IndexFetchTableData, and
the heap AM saves the caller-provided flags there.
This introduces an extension point for follow-up work to pass per-scan
information (such as whether the relation is read-only for the current
query) from the executor to the AM layer.
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2be31f17-5405-4de9-8d73-90ebc322f7d8%40vondra.me
Tom Lane [Mon, 30 Mar 2026 16:02:08 +0000 (12:02 -0400)]
Detect pfree or repalloc of a previously-freed memory chunk.
Before the major rewrite in commit c6e0fe1f2, AllocSetFree() would
typically crash when asked to free an already-free chunk. That was
an ugly but serviceable way of detecting coding errors that led to
double pfrees. But since that rewrite, double pfrees went through
just fine, because the "hdrmask" of a freed chunk isn't changed at all
when putting it on the freelist. We'd end with a corrupt freelist
that circularly links back to the doubly-freed chunk, which would
usually result in trouble later, far removed from the actual bug.
This situation is no good at all for debugging purposes. Fortunately,
we can fix it at low cost in MEMORY_CONTEXT_CHECKING builds by making
AllocSetFree() check for chunk->requested_size == InvalidAllocSize,
relying on the pre-existing code that sets it that way just below.
I investigated the alternative of changing a freed chunk's methodid
field, which would allow detection in non-MEMORY_CONTEXT_CHECKING
builds too. But that adds measurable overhead. Seeing that we didn't
notice this oversight for more than three years, it's hard to argue
that detecting this type of bug is worth any extra overhead in
production builds.
Likewise fix AllocSetRealloc() to detect repalloc() on a freed chunk,
and apply similar changes in generation.c and slab.c. (generation.c
would hit an Assert failure anyway, but it seems best to make it act
like aset.c.) bump.c doesn't need changes since it doesn't support
pfree in the first place. Ideally alignedalloc.c would receive
similar changes, but in debugging builds it's impossible to reach
AlignedAllocFree() or AlignedAllocRealloc() on a pfreed chunk, because
the underlying context's pfree would have wiped the chunk header of
the aligned chunk. But that means we should get an error of some
sort, so let's be content with that.
Per investigation of why the test case for bug #19438 didn't appear to
fail in v16 and up, even though the underlying bug was still present.
(This doesn't fix the underlying double-free bug, just cause it to
get detected.)
Bug: #19438 Reported-by: Dmitriy Kuzmin <kuzmin.db4@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/19438-9d37b179c56d43aa@postgresql.org
Backpatch-through: 16
Robert Haas [Mon, 30 Mar 2026 13:58:25 +0000 (09:58 -0400)]
pg_plan_advice: Avoid assertion failure with partitionwise aggregate.
An Append node that is part of a partitionwise aggregate has no
apprelids. If such a node was elided, the previous coding would
attempt to call unique_nonjoin_rtekind() on a NULL pointer, which
leads to an assertion failure. Insert a NULL check to prevent that.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: http://postgr.es/m/0afba1ce-c946-4131-972d-191d9a1c097c@gmail.com
Melanie Plageman [Mon, 30 Mar 2026 13:51:28 +0000 (09:51 -0400)]
Remove PlannedStmt->resultRelations in favor of resultRelationRelids
PlannedStmt->resultRelations was an integer list of range table indexes
because at the time it was added (to Query), the Bitmapset data type did
not yet exist in Postgres.
0f4c170cf3b added a Bitmapset of result relations, so remove the integer
list of RTIs and use the more compact resultRelationRelids.
Melanie Plageman [Mon, 30 Mar 2026 13:38:03 +0000 (09:38 -0400)]
Make it cheap to check if a relation is modified by a query
Save the range table indexes of result relations and row mark relations
in separate bitmapsets in the PlannedStmt. Precomputing them allows
cheap membership checks during execution. Together, these two groups
approximate all relations that will be modified by a query. This
includes relations targeted by INSERT, UPDATE, DELETE, and MERGE as well
as relations with any row mark (like SELECT FOR UPDATE).
Future work will use information on whether or not a relation is
modified by a query in a heuristic.
PlannedStmt->resultRelations is only used in a membership check, so it
will be removed in a separate commit.
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com
Peter Eisentraut [Mon, 30 Mar 2026 07:31:08 +0000 (09:31 +0200)]
headerscheck: Avoid mutual inclusion of pg_config.h and c.h
Headers that c.h includes early should not have another header
included before them in the headerscheck test file, especially not
c.h.
A particular instance of a problem is that pg_config.h defines some
symbols that c.h later undefines in some cases, such as in the code
added by commit cd083b54bd67, but there were also some before that.
This only works correctly if pg_config.h is included first.
pg_config_manual.h and pg_config_os.h are closely related to
pg_config.h and should be treated the same way.
postgres_ext.h is meant to be usable standalone, so testing it with
c.h included first defeats the point.
c.h also includes port.h, but this commit leaves that alone, since
port.h does need some of c.h to be processed first. (But because of
header guards, testing port.h separately is probably ineffective.)
Peter Eisentraut [Mon, 30 Mar 2026 07:06:27 +0000 (09:06 +0200)]
Make cast function from circle to polygon error safe
Previously, the function casting type circle to type polygon could not
be made error safe, because it is an SQL language function.
This refactors it as a C/internal function, by sharing code with the
C/internal function that the SQL function previously wrapped, and soft
error support is added.
Author: jian he <jian.universality@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
Fujii Masao [Mon, 30 Mar 2026 05:37:33 +0000 (14:37 +0900)]
Fix FK triggers losing DEFERRABLE/INITIALLY DEFERRED when marked ENFORCED again
Previously, a foreign key defined as DEFERRABLE INITIALLY DEFERRED could
behave as NOT DEFERRABLE after being set to NOT ENFORCED and then back
to ENFORCED.
This happened because recreating the FK triggers on re-enabling the constraint
forgot to restore the tgdeferrable and tginitdeferred fields in pg_trigger.
Fix this bug by properly setting those fields when the foreign key constraint
is marked ENFORCED again and its triggers are recreated, so the original
DEFERRABLE and INITIALLY DEFERRED properties are preserved.
Backpatch to v18, where NOT ENFORCED foreign keys were introduced.
David Rowley [Mon, 30 Mar 2026 03:14:34 +0000 (16:14 +1300)]
Fix datum_image_*()'s inability to detect sign-extension variations
Functions such as hash_numeric() are not careful to use the correct
PG_RETURN_*() macro according to the return type of that function as
defined in pg_proc. Because that function is meant to return int32,
when the hashed value exceeds 2^31, the 64-bit Datum value won't wrap to
a negative number, which means the Datum won't have the same value as it
would have had it been cast to int32 on a two's complement machine. This
isn't harmless as both datum_image_eq() and datum_image_hash() may receive
a Datum that's been formed and deformed from a tuple in some cases, and
not in other cases. When formed into a tuple, the Datum value will be
coerced into an integer according to the attlen as specified by the
TupleDesc. This can result in two Datums that should be equal being
classed as not equal, which could result in (but not limited to) an error
such as:
ERROR: could not find memoization table entry
Here we fix this by ensuring we cast the Datum value to a signed integer
according to the typLen specified in the datum_image_eq/datum_image_hash
function call before comparing or hashing.
Author: David Rowley <dgrowleyml@gmail.com> Reported-by: Tender Wang <tndrwang@gmail.com>
Backpatch-through: 14
Discussion: https://postgr.es/m/CAHewXNmcXVFdB9_WwA8Ez0P+m_TQy_KzYk5Ri5dvg+fuwjD_yw@mail.gmail.com
Fujii Masao [Mon, 30 Mar 2026 02:21:22 +0000 (11:21 +0900)]
psql: Make \d+ inheritance tables list formatting consistent with other objects
This followw up on the previous change (commit 7bff9f106a5) for partitions by
applying the same formatting to inheritance tables lists.
Previously, \d+ <table> displayed inheritance tables differently from other
object lists: the first inheritance table appeared on the same line as the
"Inherits" header. For example:
Inherits: test_like_5,
test_like_5x
This commit updates the output so that inheritance tables are listed
consistently with other objects, with each entry on its own line starting
below the header:
Inherits:
test_like_5
test_like_5x
Author: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Neil Chen <carpenter.nail.cz@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Soumya S Murali <soumyamurali.work@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAHut+Pu1puO00C-OhgLnAcECzww8MB3Q8DCsvx0cZWHRfs4gBQ@mail.gmail.com
Fujii Masao [Mon, 30 Mar 2026 02:06:42 +0000 (11:06 +0900)]
psql: Make \d+ partition list formatting consistent with other objects
Previously, \d+ <table> displayed partitions differently from other object
lists: the first partition appeared on the same line as the "Partitions"
header. For example:
Partitions: pt12 FOR VALUES IN (1, 2),
pt34 FOR VALUES IN (3, 4)
This commit updates the output so that partitions are listed consistently
with other objects, with each entry on its own line starting below the header:
Partitions:
pt12 FOR VALUES IN (1, 2)
pt34 FOR VALUES IN (3, 4)
Author: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Neil Chen <carpenter.nail.cz@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Soumya S Murali <soumyamurali.work@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAHut+Pu1puO00C-OhgLnAcECzww8MB3Q8DCsvx0cZWHRfs4gBQ@mail.gmail.com
Amit Langote [Mon, 30 Mar 2026 01:29:21 +0000 (10:29 +0900)]
Doc: fix stale text about partition locking with cached plans
Commit 121d774caea added text to master describing pruning-aware
locking behavior introduced by 525392d57. That behavior was
reverted in May 2025, making the text incorrect. Replace it with
the text used in back branches, which correctly describes current
behavior: pruned partitions are still locked at the beginning of
execution.
Amit Langote [Mon, 30 Mar 2026 01:10:17 +0000 (10:10 +0900)]
Add comment explaining fire_triggers=false in ri_PerformCheck()
The reason for passing fire_triggers=false to SPI_execute_snapshot()
in ri_PerformCheck() was not documented, making it unclear why it was
done that way. Add a comment explaining that it ensures AFTER triggers
on rows modified by the RI action are queued in the outer query's
after-trigger context and fire only after all RI updates on the same
row are complete.
Peter Eisentraut [Sun, 29 Mar 2026 18:40:50 +0000 (20:40 +0200)]
Make geometry cast functions error safe
This adjusts cast functions of the geometry types to support soft
errors. This requires refactoring of various helper functions to
support error contexts. Also make the float8 to float4 cast error
safe. It requires some of the same helper functions.
This is in preparation for a future feature where conversion errors in
casts can be caught.
(The function casting type circle to type polygon is not yet made error
safe, because it is an SQL language function.)
Author: jian he <jian.universality@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
Tom Lane [Sun, 29 Mar 2026 18:06:50 +0000 (14:06 -0400)]
Doc: document more incompatible pg_restore option pairs.
Most of the pairs of incompatible options (such as --file and --dbname)
are pretty obvious and need no explanation. But it may not be obvious
that --single-transaction cannot be used together with --create or
multiple jobs, so let's mention that in the documentation.
Tom Lane [Sun, 29 Mar 2026 17:53:17 +0000 (13:53 -0400)]
Doc: clarify introductory description of pg_dumpall.
Add a sentence that describes the parts of a cluster's state that are
*not* included in the output.
Also swap two sentences in the introductory paragraph. Without that,
it is not clear what the "it" at the beginning of the second sentence
is referring to. Also add a reference to pg_restore, since not all
output formats are restored with pg_dump.
Also clarify the recently-added text about where different output
formats go, and relocate it above the ancillary text about having
to run as superuser.
Reported-by: Dimitre Radoulov <cichomitiko@gmail.com>
Author: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAGJBphSX2oMPPu=VM4U8NP4+qffFH_483tFQCJ_s-mOcN3DLDw@mail.gmail.com
Andrew Dunstan [Mon, 23 Mar 2026 20:17:08 +0000 (16:17 -0400)]
Fix multiple bugs in astreamer pipeline code.
astreamer_tar_parser_content() sent the wrong data pointer when
forwarding MEMBER_TRAILER padding to the next streamer. After
astreamer_buffer_until() buffers the padding bytes, the 'data'
pointer has been advanced past them, but the code passed 'data'
instead of bbs_buffer.data. This caused the downstream consumer
to receive bytes from after the padding rather than the padding
itself, and could read past the end of the input buffer.
astreamer_gzip_decompressor_content() only checked for
Z_STREAM_ERROR from inflate(), silently ignoring Z_DATA_ERROR
(corrupted data) and Z_MEM_ERROR (out of memory). Fix by
treating any return other than Z_OK, Z_STREAM_END, and
Z_BUF_ERROR as fatal.
astreamer_gzip_decompressor_free() missed calling inflateEnd() to
release zlib's internal decompression state.
astreamer_tar_parser_free() neglected to pfree() the streamer
struct itself, leaking it.
astreamer_extractor_content() did not check the return value of
fclose() when closing an extracted file. A deferred write error
(e.g., disk full on buffered I/O) would be silently lost.
Peter Eisentraut [Sat, 28 Mar 2026 14:44:13 +0000 (15:44 +0100)]
Make cast functions from jsonb error safe
This adjusts cast functions from jsonb to other types to support soft
errors. This just involves some refactoring of the underlying helper
functions to use ereturn.
This is in preparation for a future feature where conversion errors in
casts can be caught.
Author: jian he <jian.universality@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
Andres Freund [Fri, 27 Mar 2026 23:51:53 +0000 (19:51 -0400)]
aio: Don't wait for already in-progress IO
When a backend attempts to start a read IO and finds the first buffer already
has I/O in progress, previously it waited for that I/O to complete before
initiating reads for any of the subsequent buffers.
Although it must wait for the I/O to finish when acquiring the buffer, there's
no reason for it to wait when setting up the read operation. Waiting at this
point prevents starting I/O on subsequent buffers and can significantly reduce
concurrency.
This matters in two workloads:
1) When multiple backends scan the same relation concurrently.
2) When a single backend requests the same block multiple times within the
readahead distance.
Waiting each time an in-progress read is encountered effectively degenerates
the access pattern into synchronous I/O.
To fix this, when encountering an already in-progress IO for the head buffer,
the wait reference is now recorded and waiting is deferred until
WaitReadBuffers(), when the buffer actually needs to be acquired.
In rare cases, a backend may still need to wait synchronously at IO
start time: If another backend has set BM_IO_IN_PROGRESS on the buffer
but has not yet set the wait reference. Such windows should be brief and
uncommon.
Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
Andres Freund [Fri, 27 Mar 2026 23:02:23 +0000 (19:02 -0400)]
bufmgr: Improve StartBufferIO interface
Until now StartBufferIO() had a few weaknesses:
- As it did not submit staged IOs, it was not safe to call StartBufferIO()
where there was a potential for unsubmitted IO, which required
AsyncReadBuffers() to use a wrapper (ReadBuffersCanStartIO()) around
StartBufferIO().
- With nowait = true, the boolean return value did not allow to distinguish
between no IO being necessary and having to wait, which would lead
ReadBuffersCanStartIO() to unnecessarily submit staged IO.
- Several callers needed to handle both local and shared buffers, requiring
the caller to differentiate between StartBufferIO() and StartLocalBufferIO()
- In a future commit some callers of StartBufferIO() want the BufferDesc's
io_wref to be returned, to asynchronously wait for in-progress IO
- Indicating whether to wait with the nowait parameter was somewhat confusing
compared to a wait parameter
Address these issues as follows:
- StartBufferIO() is renamed to StartSharedBufferIO()
- A new StartBufferIO() is introduced that supports both shared and local
buffers
- The boolean return value has been replaced with an enum, indicating whether
the IO is already done, already in progress or that the buffer has been
readied for IO
- A new PgAioWaitRef * argument allows the caller to get the wait reference is
desired. All current callers pass NULL, a user of this will be introduced
subsequently
- Instead of the nowait argument there now is wait
This probably would not have been worthwhile on its own, but since all these
lines needed to be touched anyway...
Author: Andres Freund <andres@anarazel.de>
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
PostmasterContext is not available in single-user mode, use
TopMemoryContext instead. Also make sure that we use the correct
memory context in the lappend().
Andres Freund [Fri, 27 Mar 2026 22:47:04 +0000 (18:47 -0400)]
test_aio: Add read_stream test infrastructure & tests
While we have a lot of indirect coverage of read streams, there are corner
cases that are hard to test when only indirectly controlling and observing the
read stream. This commit adds an SQL callable SRF interface for a read stream
and uses that in a few tests.
To make some of the tests possible, the injection point infrastructure in
test_aio had to be expanded to allow blocking IO completion.
While at it, fix a wrong debug message in inj_io_short_read_hook().
Author: Andres Freund <andres@anarazel.de> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
Tom Lane [Fri, 27 Mar 2026 21:41:00 +0000 (17:41 -0400)]
Doc: split functions-posix-regexp section into multiple subsections.
Create a <sect4> section for each function that the previous text
described in one long series of paragraphs. Also split the functions'
previously in-line syntax summaries into <synopsis> clauses, which is
more readable and allows us to sneak in an explicit mention of the
result data type.
This change gives us an opportunity to make cross-reference links
more specific, too, so do that.
Author: jian he <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACJufxFuk9P=P4=BZ=qCkgvo6im8aL8NnCkjxx2S2MQDWNdouw@mail.gmail.com
Andres Freund [Fri, 27 Mar 2026 19:27:04 +0000 (15:27 -0400)]
bufmgr: Make UnlockReleaseBuffer() more efficient
Now that the buffer content lock is implemented as part of BufferDesc.state,
releasing the lock and unpinning the buffer can be implemented as a single
atomic operation.
This improves workloads that have heavy contention on a small number of
buffers substantially, I e.g., see a ~20% improvement for pipelined readonly
pgbench on an older two socket machine.
Andres Freund [Fri, 27 Mar 2026 19:27:04 +0000 (15:27 -0400)]
Use UnlockReleaseBuffer() in more places
An upcoming commit will make UnlockReleaseBuffer() considerably faster and
more scalable than doing LockBuffer(BUFFER_LOCK_UNLOCK); ReleaseBuffer();. But
it's a small performance benefit even as-is.
Most of the callsites changed in this patch are not performance sensitive,
however some, like the nbtree ones, are in critical paths.
This patch changes all the easily convertible places over to
UnlockReleaseBuffer() mainly because I needed to check all of them anyway, and
reducing cases where the operations are done separately makes the checking
easier.