Fujii Masao [Mon, 30 Mar 2026 02:06:42 +0000 (11:06 +0900)]
psql: Make \d+ partition list formatting consistent with other objects
Previously, \d+ <table> displayed partitions differently from other object
lists: the first partition appeared on the same line as the "Partitions"
header. For example:
Partitions: pt12 FOR VALUES IN (1, 2),
            pt34 FOR VALUES IN (3, 4)
This commit updates the output so that partitions are listed consistently
with other objects, with each entry on its own line starting below the header:
Partitions:
    pt12 FOR VALUES IN (1, 2)
    pt34 FOR VALUES IN (3, 4)
Author: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Neil Chen <carpenter.nail.cz@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Soumya S Murali <soumyamurali.work@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAHut+Pu1puO00C-OhgLnAcECzww8MB3Q8DCsvx0cZWHRfs4gBQ@mail.gmail.com
Amit Langote [Mon, 30 Mar 2026 01:29:21 +0000 (10:29 +0900)]
Doc: fix stale text about partition locking with cached plans
Commit 121d774caea added text to master describing pruning-aware
locking behavior introduced by 525392d57. That behavior was
reverted in May 2025, making the text incorrect. Replace it with
the text used in back branches, which correctly describes current
behavior: pruned partitions are still locked at the beginning of
execution.
Amit Langote [Mon, 30 Mar 2026 01:10:17 +0000 (10:10 +0900)]
Add comment explaining fire_triggers=false in ri_PerformCheck()
The reason for passing fire_triggers=false to SPI_execute_snapshot()
in ri_PerformCheck() was not documented, making it unclear why it was
done that way. Add a comment explaining that it ensures AFTER triggers
on rows modified by the RI action are queued in the outer query's
after-trigger context and fire only after all RI updates on the same
row are complete.
Peter Eisentraut [Sun, 29 Mar 2026 18:40:50 +0000 (20:40 +0200)]
Make geometry cast functions error safe
This adjusts cast functions of the geometry types to support soft
errors. This requires refactoring of various helper functions to
support error contexts. Also make the float8 to float4 cast error
safe. It requires some of the same helper functions.
This is in preparation for a future feature where conversion errors in
casts can be caught.
(The function casting type circle to type polygon is not yet made error
safe, because it is an SQL language function.)
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
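The contract behind these soft-error-enabled cast functions can be sketched in miniature. The following self-contained C sketch uses hypothetical names (PostgreSQL's real mechanism is built on ErrorSaveContext with ereturn()/errsave()): when the caller supplies an error context, a conversion failure is recorded softly and signaled through the return value; when no context is supplied, the function fails hard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for PostgreSQL's ErrorSaveContext. */
typedef struct SoftErrorContext {
    bool error_occurred;
    char message[128];
} SoftErrorContext;

/* Parse a strictly positive int.  On bad input, either record the error
 * softly (escontext != NULL) or fail hard, mimicking ereturn()'s contract. */
static bool
parse_positive_int(const char *str, int *result, SoftErrorContext *escontext)
{
    char *end;
    long val = strtol(str, &end, 10);

    if (end == str || *end != '\0' || val <= 0) {
        if (escontext != NULL) {
            escontext->error_occurred = true;
            snprintf(escontext->message, sizeof(escontext->message),
                     "invalid positive integer: \"%s\"", str);
            return false;       /* soft error: caller inspects the context */
        }
        fprintf(stderr, "invalid positive integer: \"%s\"\n", str);
        abort();                /* no context given: fail hard */
    }
    *result = (int) val;
    return true;
}
```

Refactoring the helper functions to take such a context argument is what lets a future feature catch conversion errors in casts instead of aborting the query.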
Tom Lane [Sun, 29 Mar 2026 18:06:50 +0000 (14:06 -0400)]
Doc: document more incompatible pg_restore option pairs.
Most of the pairs of incompatible options (such as --file and --dbname)
are pretty obvious and need no explanation. But it may not be obvious
that --single-transaction cannot be used together with --create or
multiple jobs, so let's mention that in the documentation.
Tom Lane [Sun, 29 Mar 2026 17:53:17 +0000 (13:53 -0400)]
Doc: clarify introductory description of pg_dumpall.
Add a sentence that describes the parts of a cluster's state that are
*not* included in the output.
Also swap two sentences in the introductory paragraph. Without that,
it is not clear what the "it" at the beginning of the second sentence
is referring to. Also add a reference to pg_restore, since not all
output formats are restored with pg_dump.
Also clarify the recently-added text about where different output
formats go, and relocate it above the ancillary text about having
to run as superuser.
Reported-by: Dimitre Radoulov <cichomitiko@gmail.com>
Author: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAGJBphSX2oMPPu=VM4U8NP4+qffFH_483tFQCJ_s-mOcN3DLDw@mail.gmail.com
Andrew Dunstan [Mon, 23 Mar 2026 20:17:08 +0000 (16:17 -0400)]
Fix multiple bugs in astreamer pipeline code.
astreamer_tar_parser_content() sent the wrong data pointer when
forwarding MEMBER_TRAILER padding to the next streamer. After
astreamer_buffer_until() buffers the padding bytes, the 'data'
pointer has been advanced past them, but the code passed 'data'
instead of bbs_buffer.data. This caused the downstream consumer
to receive bytes from after the padding rather than the padding
itself, and could read past the end of the input buffer.
astreamer_gzip_decompressor_content() only checked for
Z_STREAM_ERROR from inflate(), silently ignoring Z_DATA_ERROR
(corrupted data) and Z_MEM_ERROR (out of memory). Fix by
treating any return other than Z_OK, Z_STREAM_END, and
Z_BUF_ERROR as fatal.
astreamer_gzip_decompressor_free() missed calling inflateEnd() to
release zlib's internal decompression state.
astreamer_tar_parser_free() neglected to pfree() the streamer
struct itself, leaking it.
astreamer_extractor_content() did not check the return value of
fclose() when closing an extracted file. A deferred write error
(e.g., disk full on buffered I/O) would be silently lost.
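Of these fixes, the fclose() one is easy to illustrate standalone. A minimal C sketch (hypothetical helper name) of propagating a deferred write error:

```c
#include <assert.h>
#include <stdio.h>

/* Write data to path, propagating deferred write errors.  fclose() flushes
 * userspace-buffered bytes, so a failure such as a full disk can surface
 * only there; ignoring its return value silently loses the error. */
static int
write_file_checked(const char *path, const char *data)
{
    FILE *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;
    if (fputs(data, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;      /* the deferred write error the fix stops losing */
    return 0;
}
```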
Peter Eisentraut [Sat, 28 Mar 2026 14:44:13 +0000 (15:44 +0100)]
Make cast functions from jsonb error safe
This adjusts cast functions from jsonb to other types to support soft
errors. This just involves some refactoring of the underlying helper
functions to use ereturn.
This is in preparation for a future feature where conversion errors in
casts can be caught.
Author: jian he <jian.universality@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
Andres Freund [Fri, 27 Mar 2026 23:51:53 +0000 (19:51 -0400)]
aio: Don't wait for already in-progress IO
When a backend attempts to start a read IO and finds the first buffer already
has I/O in progress, previously it waited for that I/O to complete before
initiating reads for any of the subsequent buffers.
Although it must wait for the I/O to finish when acquiring the buffer, there's
no reason for it to wait when setting up the read operation. Waiting at this
point prevents starting I/O on subsequent buffers and can significantly reduce
concurrency.
This matters in two workloads:
1) When multiple backends scan the same relation concurrently.
2) When a single backend requests the same block multiple times within the
readahead distance.
Waiting each time an in-progress read is encountered effectively degenerates
the access pattern into synchronous I/O.
To fix this, when encountering an already in-progress IO for the head buffer,
the wait reference is now recorded and waiting is deferred until
WaitReadBuffers(), when the buffer actually needs to be acquired.
In rare cases, a backend may still need to wait synchronously at IO
start time: If another backend has set BM_IO_IN_PROGRESS on the buffer
but has not yet set the wait reference. Such windows should be brief and
uncommon.
Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
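The deferred-wait idea can be sketched with a self-contained toy model (all types and names hypothetical, not the actual bufmgr code): instead of blocking on a buffer whose IO is already in progress, record a wait reference and keep issuing IO for the remaining buffers.

```c
#include <assert.h>
#include <stdbool.h>

#define NBUFS 4

typedef struct Buf {
    bool io_in_progress;   /* another backend's read is underway */
    bool valid;            /* buffer already contains the page */
} Buf;

typedef struct ReadOp {
    Buf *wait_ref[NBUFS];  /* waits deferred until the buffer is needed */
    int  nwaits;
    int  nstarted;
} ReadOp;

/* Start reads for a range of buffers.  When a buffer already has IO in
 * progress, record a wait reference and move on rather than blocking, so
 * IO on the subsequent buffers can still be issued concurrently. */
static void
start_reads(ReadOp *op, Buf *bufs, int n)
{
    op->nwaits = 0;
    op->nstarted = 0;
    for (int i = 0; i < n; i++) {
        if (bufs[i].io_in_progress) {
            op->wait_ref[op->nwaits++] = &bufs[i];  /* defer, don't block */
            continue;
        }
        if (!bufs[i].valid)
            op->nstarted++;                         /* issue our own read */
    }
}
```

In the real code the recorded references are waited on in WaitReadBuffers(), i.e. only at the point where the buffer contents are actually needed.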
Andres Freund [Fri, 27 Mar 2026 23:02:23 +0000 (19:02 -0400)]
bufmgr: Improve StartBufferIO interface
Until now StartBufferIO() had a few weaknesses:
- As it did not submit staged IOs, it was not safe to call StartBufferIO()
where there was a potential for unsubmitted IO, which required
AsyncReadBuffers() to use a wrapper (ReadBuffersCanStartIO()) around
StartBufferIO().
- With nowait = true, the boolean return value did not make it possible to
distinguish between no IO being necessary and having to wait, which would
lead ReadBuffersCanStartIO() to unnecessarily submit staged IO.
- Several callers needed to handle both local and shared buffers, requiring
the caller to differentiate between StartBufferIO() and StartLocalBufferIO()
- In a future commit some callers of StartBufferIO() want the BufferDesc's
io_wref to be returned, to asynchronously wait for in-progress IO
- Indicating whether to wait with the nowait parameter was somewhat confusing
compared to a wait parameter
Address these issues as follows:
- StartBufferIO() is renamed to StartSharedBufferIO()
- A new StartBufferIO() is introduced that supports both shared and local
buffers
- The boolean return value has been replaced with an enum, indicating whether
the IO is already done, already in progress or that the buffer has been
readied for IO
- A new PgAioWaitRef * argument allows the caller to get the wait reference
if desired. All current callers pass NULL; a user of this will be introduced
subsequently
- Instead of the nowait argument there is now a wait argument
This probably would not have been worthwhile on its own, but since all these
lines needed to be touched anyway...
Author: Andres Freund <andres@anarazel.de>
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
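The three-way result that replaces the boolean can be sketched standalone (hypothetical names; the real function operates on a BufferDesc, handles waiting and wait references, and covers local buffers too):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum BufferIOStart {
    BUFIO_ALREADY_DONE,    /* no IO necessary: buffer is already valid */
    BUFIO_IN_PROGRESS,     /* another backend's IO is underway */
    BUFIO_STARTED          /* buffer readied; caller performs the IO */
} BufferIOStart;

typedef struct FakeBufferDesc {
    bool valid;
    bool io_in_progress;
} FakeBufferDesc;

/* With a boolean return, BUFIO_ALREADY_DONE and BUFIO_IN_PROGRESS were
 * indistinguishable in the no-wait case; the enum separates them. */
static BufferIOStart
start_buffer_io(FakeBufferDesc *buf, bool wait)
{
    if (buf->valid)
        return BUFIO_ALREADY_DONE;
    if (buf->io_in_progress && !wait)
        return BUFIO_IN_PROGRESS;
    /* (real code would wait here when wait is true, then re-check) */
    buf->io_in_progress = true;
    return BUFIO_STARTED;
}
```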
PostmasterContext is not available in single-user mode, so use
TopMemoryContext instead. Also make sure that we use the correct
memory context in the lappend().
Andres Freund [Fri, 27 Mar 2026 22:47:04 +0000 (18:47 -0400)]
test_aio: Add read_stream test infrastructure & tests
While we have a lot of indirect coverage of read streams, there are corner
cases that are hard to test when only indirectly controlling and observing the
read stream. This commit adds an SQL callable SRF interface for a read stream
and uses that in a few tests.
To make some of the tests possible, the injection point infrastructure in
test_aio had to be expanded to allow blocking IO completion.
While at it, fix a wrong debug message in inj_io_short_read_hook().
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
Tom Lane [Fri, 27 Mar 2026 21:41:00 +0000 (17:41 -0400)]
Doc: split functions-posix-regexp section into multiple subsections.
Create a <sect4> section for each function that the previous text
described in one long series of paragraphs. Also split the functions'
previously in-line syntax summaries into <synopsis> clauses, which is
more readable and allows us to sneak in an explicit mention of the
result data type.
This change gives us an opportunity to make cross-reference links
more specific, too, so do that.
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACJufxFuk9P=P4=BZ=qCkgvo6im8aL8NnCkjxx2S2MQDWNdouw@mail.gmail.com
Andres Freund [Fri, 27 Mar 2026 19:27:04 +0000 (15:27 -0400)]
bufmgr: Make UnlockReleaseBuffer() more efficient
Now that the buffer content lock is implemented as part of BufferDesc.state,
releasing the lock and unpinning the buffer can be implemented as a single
atomic operation.
This substantially improves workloads that have heavy contention on a
small number of buffers; for example, I see a ~20% improvement for
pipelined read-only pgbench on an older two-socket machine.
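A toy model of the single-atomic-operation release (illustrative only; the real BufferDesc.state layout and lock protocol are considerably more involved):

```c
#include <assert.h>
#include <stdatomic.h>

/* One state word: an exclusive-lock flag in a high bit and the pin
 * (reference) count in the low bits, so releasing the content lock and
 * dropping the pin can be one atomic subtraction instead of two
 * separate atomic operations. */
#define LOCK_FLAG    (1u << 30)
#define REFCOUNT_ONE 1u

static _Atomic unsigned buf_state;

static void
lock_and_pin(void)
{
    /* (a real lock acquire needs a CAS loop; the release side is the point) */
    atomic_fetch_add(&buf_state, LOCK_FLAG + REFCOUNT_ONE);
}

/* Clear the lock bit and drop the pin together; returns the new state. */
static unsigned
unlock_and_unpin(void)
{
    return atomic_fetch_sub(&buf_state, LOCK_FLAG + REFCOUNT_ONE)
           - (LOCK_FLAG + REFCOUNT_ONE);
}
```

Collapsing two atomic read-modify-write operations into one halves the cache-line traffic on the state word, which is where the win on heavily contended buffers comes from.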
Andres Freund [Fri, 27 Mar 2026 19:27:04 +0000 (15:27 -0400)]
Use UnlockReleaseBuffer() in more places
An upcoming commit will make UnlockReleaseBuffer() considerably faster and
more scalable than doing LockBuffer(BUFFER_LOCK_UNLOCK); ReleaseBuffer();. But
it's a small performance benefit even as-is.
Most of the callsites changed in this patch are not performance sensitive,
however some, like the nbtree ones, are in critical paths.
This patch changes all the easily convertible places over to
UnlockReleaseBuffer() mainly because I needed to check all of them anyway, and
reducing cases where the operations are done separately makes the checking
easier.
Andres Freund [Fri, 27 Mar 2026 19:27:04 +0000 (15:27 -0400)]
bufmgr: Don't copy pages while writing out
After the series of preceding commits introducing and using
BufferBeginSetHintBits()/BufferSetHintBits16(), hint bits are not set anymore
while IO is going on. Therefore we do not need to copy pages while they are
being written out anymore.
For the same reason XLogSaveBufferForHint() now does not need to operate on a
copy of the page anymore, but can instead use the normal XLogRegisterBuffer()
mechanism. For that, the assertions and comments in XLogRegisterBuffer() had
to be updated to allow share-exclusive locked buffers to be registered.
Tom Lane [Fri, 27 Mar 2026 19:38:48 +0000 (15:38 -0400)]
pgindent: ensure all C files end with a newline.
Not only is this good style, but it dodges some obscure bugs within
pg_bsd_indent. We could try to fix said bugs, but the amount of
effort required seems far out of proportion to the benefit.
Reported-by: Akshay Joshi <akshay.joshi@enterprisedb.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CANxoLDfca8O5SkeDxB_j6SVNXd+pNKaDmVmEW+2yyicdU8fy0w@mail.gmail.com
Masahiko Sawada [Fri, 27 Mar 2026 19:13:29 +0000 (12:13 -0700)]
doc: Clarify collation requirements for base32hex sortability.
While fixing the base32hex UUID sortability test in commit 89210037a0a,
we found that the expected lexicographical order is only maintained under
the C collation (or an equivalent byte-wise collation). Natural language
collations may employ different rules, breaking the sortability.
This commit updates the documentation to explicitly state that
base32hex is "byte-wise sortable", ensuring users do not fall into the
trap of using natural language collations when querying their encoded
data.
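Why byte-wise comparison suffices can be seen with a tiny fixed-width encoder (illustrative only, not PostgreSQL's implementation): the base32hex alphabet is in ascending byte order, so memcmp/strcmp order of equal-length encodings matches numeric order, while a natural-language collation offers no such guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Encode a 32-bit value as 7 fixed-width base32hex digits.  The alphabet
 * '0'..'9','A'..'V' is strictly ascending in byte value, which is what
 * makes the output byte-wise (C-collation) sortable. */
static void
base32hex_u32(uint32_t v, char out[8])
{
    static const char alphabet[] = "0123456789ABCDEFGHIJKLMNOPQRSTUV";

    for (int i = 6; i >= 0; i--) {
        out[i] = alphabet[v & 31];
        v >>= 5;
    }
    out[7] = '\0';
}
```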
Nathan Bossart [Fri, 27 Mar 2026 15:17:05 +0000 (10:17 -0500)]
Add rudimentary table prioritization to autovacuum.
Autovacuum workers scan pg_class twice to collect the set of tables
to process. The first pass is for plain relations and materialized
views, and the second is for TOAST tables. When the worker finds a
table to process, it adds it to the end of a list. Later on, it
processes the tables in the same order as the list. This simple
strategy has worked surprisingly well for a long time, but there
have been many discussions over the years about trying to improve
it.
This commit introduces a scoring system that is used to sort the
aforementioned list of tables to process. The idea is to have
autovacuum workers prioritize tables that are furthest beyond their
thresholds (e.g., a table nearing transaction ID wraparound should
be vacuumed first). This prioritization scheme is certainly far
from perfect; there are simply too many possibilities for any
scoring technique to work across all workloads, and the situation
might change significantly between the time we calculate the score
and the time that autovacuum processes it. However, we have
attempted to develop something that is expected to work for a large
portion of workloads with reasonable parameter settings.
The score is calculated as the maximum of the ratios of each of the
table's relevant values to its threshold. For example, if the
number of inserted tuples is 100, and the insert threshold for the
table is 80, the insert score is 1.25. If all other scores are
below that value, the table's score will be 1.25. The other
criteria considered for the score are the table ages (both
relfrozenxid and relminmxid) compared to the corresponding
freeze-max-age setting, the number of updated/deleted tuples
compared to the vacuum threshold, and the number of
inserted/updated/deleted tuples compared to the analyze threshold.
One exception to the previous paragraph is for tables nearing
wraparound, i.e., those that have surpassed the effective failsafe
ages. In that case, the relfrozenxid/relminmxid-based score is
scaled aggressively so that the table has a decent chance of
sorting to the front of the list.
To control how strongly each component contributes to the score, the
following parameters can be adjusted from their default of 1.0 to
anywhere between 0.0 and 10.0 (inclusive). Setting all of these to
0.0 restores pre-v19 prioritization behavior:
This is intended to be a baby step towards smarter autovacuum
workers. Possible future improvements include, but are not limited
to, periodic reprioritization, automatic cost limit adjustments,
and better observability (e.g., a system view that shows current
scores). While we do not expect this commit to produce any
earth-shattering improvements, it is arguably a prerequisite for
the aforementioned follow-up changes.
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/aOaAuXREwnPZVISO%40nathan
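The max-of-ratios score from the example above (100 inserted tuples against a threshold of 80 gives 1.25) can be sketched as follows. All names are hypothetical, and this omits the relfrozenxid/relminmxid age components and the aggressive failsafe scaling described above:

```c
#include <assert.h>
#include <math.h>

/* Score one component: the ratio of the observed value to its threshold,
 * scaled by a user-settable weight in [0.0, 10.0]; 0.0 disables it. */
static double
component_score(double value, double threshold, double weight)
{
    return weight * (value / threshold);
}

/* A table's overall score is the maximum of its component scores. */
static double
table_score(double ins, double ins_thresh,
            double dead, double vac_thresh,
            double mod, double anl_thresh)
{
    double score = component_score(ins, ins_thresh, 1.0);
    double s;

    s = component_score(dead, vac_thresh, 1.0);
    if (s > score)
        score = s;
    s = component_score(mod, anl_thresh, 1.0);
    if (s > score)
        score = s;
    return score;
}
```

Workers would then sort their collected list of tables by this score in descending order before processing it.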
Peter Eisentraut [Fri, 27 Mar 2026 14:49:34 +0000 (15:49 +0100)]
Align tests for stored and virtual generated columns
These tests were intended to be aligned with each other, but
additional tests for virtual generated columns disrupted that
alignment. The test confirming that user-defined types are not
allowed in virtual generated columns has also been moved to the
generated_virtual.sql-specific section.
Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Mutaamba Maasha <maasha@gmail.com>
Reviewed-by: Surya Poondla <s_poondla@apple.com>
Discussion: https://www.postgresql.org/message-id/flat/20250808115142.e9ccb81f35466a9a131a4c55@sraoss.co.jp
Peter Eisentraut [Fri, 27 Mar 2026 13:24:13 +0000 (14:24 +0100)]
pgindent: Clean up temp files created by File::Temp on SIGINT
When pressing Ctrl+C while running pgindent, it would often leave around
files like pgtypedefAXUEEA. This slightly changes SIGINT handling so
those files are cleaned up.
Refactor PredicateLockShmemInit to not reuse var for different things
The PredicateLockShmemInit function is pretty complicated, and one
source of confusion is that it reuses the same local variable for
sizes of things. Replace the different uses with separate variables
for clarity.
Peter Eisentraut [Fri, 27 Mar 2026 09:49:49 +0000 (10:49 +0100)]
Add a graph pattern variable only once
An element pattern variable may be repeated in the path pattern.
GraphTableParseState maintains a list of all variable names used in
the graph pattern. Add a new variable name to that list only when it
is not present already. This isn't a problem right now, but it could
be in the future.
Peter Eisentraut [Fri, 27 Mar 2026 09:30:01 +0000 (10:30 +0100)]
Reject consecutive element patterns of same kind
Adding an implicit empty vertex pattern when a path pattern starts or
ends with an edge pattern or when two consecutive edge patterns appear
in the pattern is not supported right now. Prohibit such path
patterns.
Use ShmemInitStruct to allocate lwlock.c's shared memory
It's nice to have these structures show up in pg_shmem_allocations like
all other shmem areas. ShmemInitStruct() depends on ShmemIndexLock, but
only after postmaster startup.
This makes shmem.c independent of the main LWLock array. That makes it
possible to stop passing MainLWLockArray through BackendParameters in
the next commit.
Previously we reused the shmem allocator's ShmemLock to also protect
lwlock.c's shared memory structures. Introduce a separate spinlock for
lwlock.c for the sake of modularity. Now that lwlock.c has its own
shared memory struct (LWLockTranches), this is easy to do.
Refactor how user-defined LWLock tranches are stored in shmem
Merge the LWLockTranches and NamedLWLockTrancheRequest data structures
in shared memory into one array of user-defined tranches. The
NamedLWLockTrancheRequest list is now only used in postmaster, to hold
the requests until shared memory is initialized.
Introduce a C struct, LWLockTranches, to hold all the different fields
kept in shared memory. This gives an easier overview of what are all
the things kept in shared memory. Previously, we had separate pointers
for LWLockTrancheNames, LWLockCounter and the (shared memory copy of)
NamedLWLockTrancheRequestArray.
Rename MAX_NAMED_TRANCHES to MAX_USER_DEFINED_TRANCHES
The "named tranches" term is a little confusing. In most places it
refers to tranches requested with RequestNamedLWLockTranche(), even
though all built-in tranches and tranches allocated with
LWLockNewTrancheId() also have a name. But in MAX_NAMED_TRANCHES, it
refers to tranches requested with either RequestNamedLWLockTranche()
or LWLockNewTrancheId(), as it's the maximum of all of those in total.
The "user defined" term is already used in
LWTRANCHE_FIRST_USER_DEFINED, so let's standardize on that to mean
tranches allocated with either RequestNamedLWLockTranche() or
LWLockNewTrancheId().
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
Tom Lane [Thu, 26 Mar 2026 21:27:32 +0000 (17:27 -0400)]
Doc: declutter CREATE TABLE synopsis.
Factor out the "persistence mode" and storage/compression parts
of the syntax synopsis to reduce line lengths and increase
readability. Also add an introductory para about the persistence
modes so that the Description section still lines up with the
synopsis.
Author: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CAKFQuwYfMV-2SdrP-umr5SVNSqTn378BUvHsebetp5=DhT494w@mail.gmail.com
The premise of src/test/modules/test_plan_advice is that if we plan
a query once, generate plan advice, and then replan it using that
same advice, all of that advice should apply cleanly, since the
settings and everything else are the same. Unfortunately, that's
not the case: the test suite is the main regression tests, and
concurrent activity can change the statistics on tables involved
in the query, especially system catalogs. That's OK as long as it
only affects costing, but in a few cases, it affects which relations
appear in the final plan at all.
In the buildfarm failures observed to date, this happens because
we consider alternative subplans for the same portion of the query;
in theory, MinMaxAggPath is vulnerable to a similar hazard. In both
cases, the planner clones an entire subquery, and the clone has a
different plan name, and therefore different range table identifiers,
than the original. If a cost change results in flipping between one
of these plans and the other, the test_plan_advice tests will fail,
because the range table identifiers to which advice was applied won't
even be present in the output of the second planning cycle.
To fix, invent a new DO_NOT_SCAN advice tag. When generating advice,
emit it for relations that should not appear in the final plan at
all, because some alternative version of that relation was used
instead. When DO_NOT_SCAN is supplied, disable all scan methods for
that relation.
To make this work, we reuse a bunch of the machinery that previously
existed for the purpose of ensuring that we build the same set of
relation identifiers during planning as we do from the final
PlannedStmt. In the process, this commit slightly weakens the
cross-check mechanism: before this commit, it would fire whenever
the pg_plan_advice module was loaded, even if pg_plan_advice wasn't
actually doing anything; now, it will only engage when we have some
other reason to create a pgpa_planner_state. The old way was complex
and didn't add much useful test coverage, so this seems like an
acceptable sacrifice.
Robert Haas [Thu, 26 Mar 2026 20:45:17 +0000 (16:45 -0400)]
Add an alternative_plan_name field to PlannerInfo.
Typically, we have only one PlannerInfo for any given subquery, but
when we are considering a MinMaxAggPath or a hashed subplan, we end
up creating a second PlannerInfo for the same portion of the query,
with a clone of the original range table. In fact, in the MinMaxAggPath
case, we might end up creating several clones, one per aggregate.
At present, there's no easy way for a plugin, such as pg_plan_advice,
to understand the relationships between the original range table and
the copies of it that are created in these cases. To fix, add an
alternative_plan_name field to PlannerInfo. For a hashed subplan, this
is the plan name for the non-hashed alternative; for minmax aggregates,
this is the plan_name from the parent PlannerInfo; otherwise, it's the
same as plan_name.
Tom Lane [Thu, 26 Mar 2026 19:14:21 +0000 (15:14 -0400)]
Doc: commit performs rollback of aborted transactions.
The COMMIT command handles an aborted transaction in the same
manner as the ROLLBACK command, but this wasn't explained in
its official reference page. Also mention that behavior in
the tutorial's material on transactions.
Also add a comment mentioning that we don't raise an exception
for COMMIT within an aborted transaction, as the SQL standard
would have us do.
Hyperlink a couple of cross-references while we're at it.
Author: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Gurjeet Singh <gurjeet@singh.im>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAKFQuwYgYR3rWt6vFXw=ZWZ__bv7PqvdOnHujG+UyqE11f+3sg@mail.gmail.com
Andres Freund [Thu, 26 Mar 2026 14:51:52 +0000 (10:51 -0400)]
bufmgr: Restructure AsyncReadBuffers()
Restructure AsyncReadBuffers() to use early return when the head buffer is
already valid, instead of using a did_start_io flag and if/else branches. Also
move around a bit of the code to be located closer to where it is used. This
is a refactor only.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
Andres Freund [Thu, 26 Mar 2026 14:51:25 +0000 (10:51 -0400)]
bufmgr: Make buffer hit helper
Two places already count buffer hits, each requiring quite a few lines of
code since we do accounting in so many places. Future commits will add
more locations, so refactor the hit accounting into a helper.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
Andres Freund [Thu, 26 Mar 2026 14:50:44 +0000 (10:50 -0400)]
bufmgr: Pass io_object and io_context through to PinBufferForBlock()
PinBufferForBlock() is always_inline and called in a loop in
StartReadBuffersImpl(). Previously it computed io_context and io_object
internally, which required calling IOContextForStrategy() -- a non-inline
function the compiler cannot prove is side-effect-free. This could
potentially cause redundant function calls.
Compute io_context and io_object in the callers instead, allowing
StartReadBuffersImpl() to do so once before entering the loop.
Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
Robert Haas [Thu, 26 Mar 2026 15:57:33 +0000 (11:57 -0400)]
pg_plan_advice: Refactor to invent pgpa_planner_info
pg_plan_advice tracks two pieces of per-PlannerInfo data: (1) for each
RTI, the corresponding relation identifier, for purposes of
cross-checking those calculations against the final plan; and (2) the
set of semijoins seen during planning for which the strategy of making
one side unique was considered. The former is tracked using a hash
table that uses <plan_name, RTI> as the key, and the latter is
tracked using a List of <plan_name, relids>.
It seems better to track both of these things in the same way and
to try to reuse some code instead of having everything be completely
separate, so invent pgpa_planner_info; we'll create one every time we
see a new PlannerInfo and need to associate some data with it, and
we'll use the plan_name field to distinguish between PlannerInfo
objects, as it should always be unique. Then, refactor the two
systems mentioned above to use this new infrastructure.
(Note that the adjustment in pgpa_plan_walker is necessary in order
to avoid spuriously triggering the sanity check in that function,
in the case where a pgpa_planner_info is created for a purpose not
related to sj_unique_rels.)
Tom Lane [Thu, 26 Mar 2026 15:36:38 +0000 (11:36 -0400)]
Add labels to help make psql's hidden queries more understandable.
We recommend looking at psql's "-E" output to help understand the
system catalogs, but in some cases (particularly table displays)
there's a bunch of rather impenetrable SQL there. As a small
improvement, label each query issued by describe.c with a short
description of its purpose. The code is arranged so that the
labels also appear as SQL comments in the server log, if the
server is logging these commands.
We could expand this policy to every use of PSQLexec(), but most of
the ones outside describe.c are issuing simple commands like "BEGIN"
or "COMMIT", which don't seem to need such glosses. I did add
labels to the commands issued by \sf, \sv and friends.
Also, make the -E and log output for hidden queries say
"INTERNAL QUERY" not just "QUERY", to distinguish them from
user-written queries.
Author: Greg Sabino Mullane <htamfids@gmail.com>
Co-authored-by: David Christensen <david+pg@pgguru.net>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAKAnmmJz8Hh=8Ru8jgzySPWmLBhnv4=oc_0KRiz-UORJ0Dex+w@mail.gmail.com
Andres Freund [Thu, 26 Mar 2026 14:07:59 +0000 (10:07 -0400)]
Fix off-by-one error in read IO tracing
AsyncReadBuffers()'s no-IO-needed path passed
TRACE_POSTGRESQL_BUFFER_READ_DONE the wrong block number because it had
already incremented operation->nblocks_done. Fix by folding the
nblocks_done offset into the blocknum local variable at initialization.
Andres Freund [Thu, 26 Mar 2026 14:07:59 +0000 (10:07 -0400)]
aio: Refactor tests in preparation for more tests
In a future commit, more AIO-related tests are due to be introduced.
However, 001_aio.pl is already fairly large.
This commit introduces a new TestAio package with helpers for writing AIO
related tests. Then it uses the new helpers to simplify the existing
001_aio.pl by iterating over all supported io_methods. This will be
particularly helpful because additional methods already have been submitted.
Additionally this commit splits out testing of initdb using a non-default
method into its own test. While that test is somewhat important, it's fairly
slow and doesn't break that often. For development velocity it's helpful for
001_aio.pl to be faster.
While the latter in particular could benefit from being its own commit,
splitting it out seems to introduce more back-and-forth than it's worth.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
Robert Haas [Fri, 20 Mar 2026 18:04:41 +0000 (14:04 -0400)]
Respect disabled_nodes in fix_alternative_subplan.
When my commit e22253467942fdb100087787c3e1e3a8620c54b2 added the
concept of disabled_nodes, it failed to add a disabled_nodes field
to SubPlan. This is a regression: before that commit, when
fix_alternative_subplan compared the costs of two plans, the number
of disabled nodes affected the result, because it was just a
component of the total cost. After that commit, it no longer did,
making it possible for a disabled path to win on cost over one that
is not disabled. Fix that.
As usual for planner fixes that might destabilize plan choices,
no back-patch.
Peter Eisentraut [Thu, 26 Mar 2026 14:00:24 +0000 (15:00 +0100)]
Fix -Wcast-qual warning
This dials back a couple of the qualifiers added by commit 7724cb9935a. Specifically, in match_boolean_partition_clause() the
call to negate_clause() casts away the const, so we shouldn't make the
input argument const.
Fujii Masao [Thu, 26 Mar 2026 11:54:32 +0000 (20:54 +0900)]
Avoid sending duplicate WAL locations in standby status replies
Previously, when the startup process applied WAL and requested walreceiver
to send an apply notification to the primary, walreceiver sent a status reply
unconditionally, even if the WAL locations had not advanced since
the previous update.
As a result, the standby could send two consecutive status reply messages
with identical WAL locations even though wal_receiver_status_interval had
not yet elapsed. This could unexpectedly reset the reported replication lag,
making it difficult for users to monitor lag. The second message was also
unnecessary because it reported no progress.
This commit updates walreceiver to send a reply only when the apply location
has advanced since the last status update, even when the startup process
requests a notification.
Fujii Masao [Thu, 26 Mar 2026 11:49:31 +0000 (20:49 +0900)]
Fix premature NULL lag reporting in pg_stat_replication
pg_stat_replication is documented to keep the last measured lag values for
a short time after the standby catches up, and then set them to NULL when
there is no WAL activity. However, previously lag values could become NULL
prematurely even while WAL activity was ongoing, especially in logical
replication.
This happened because the code cleared lag when two consecutive reply messages
indicated that the apply location had caught up with the send location.
It did not verify that the reported positions were unchanged, so lag could be
cleared even when positions had advanced between messages. In logical
replication, where the apply location often quickly catches up, this issue was
more likely to occur.
This commit fixes the issue by clearing lag only when the standby reports that
it has fully replayed WAL (i.e., both flush and apply locations have caught up
with the send location) and the write/flush/apply positions remain unchanged
across two consecutive reply messages.
The second message with unchanged positions typically results from
wal_receiver_status_interval, so lag values are cleared after that interval
when there is no activity. This avoids showing stale lag data while preventing
premature NULL values.
Even with this fix, lag may rarely become NULL during activity if identical
position reports are sent repeatedly. Eliminating such duplicate messages
would address this fully, but that change is considered too invasive for
stable branches and will be handled later, in master only.
Peter Eisentraut [Thu, 26 Mar 2026 08:10:42 +0000 (09:10 +0100)]
MSVC: Remove unnecessary warning option
The MSVC warning option /w24777 added by commit 2307cfe3162 was a
typo; it should have been /w24477. But that option is already enabled
by default at warning level 1, so we don't need to add it explicitly.
Just remove it.
Peter Eisentraut [Thu, 26 Mar 2026 07:40:18 +0000 (08:40 +0100)]
Make fixed-length list building macros work in C++
Compound literals, as used in pg_list.h for list_makeN(), are not a
C++ feature. MSVC doesn't accept these. (GCC and Clang accept them,
but they would warn in -pedantic mode.) Replace with equivalent
inline functions. (These are the only instances of compound literals
used in PostgreSQL header files.)
Amit Kapila [Thu, 26 Mar 2026 03:45:25 +0000 (09:15 +0530)]
Refactor replorigin_session_setup() for better readability.
Reorder the validation checks in replorigin_session_setup() to provide a
more logical flow. This makes the function easier to follow and ensures
that basic state checks are performed consistently.
Additionally, update an error message to align its phrasing with similar
diagnostics in the replication origin subsystem, improving overall
consistency.
Author: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/e0508305-bc6a-417c-b969-36564d632f9e@iki.fi
Masahiko Sawada [Thu, 26 Mar 2026 03:12:26 +0000 (20:12 -0700)]
Fix UUID sortability tests in base32hex encoding.
Commit 497c1170cb1 added base32hex encoding support, but its
regression test for UUIDs failed on buildfarm members hippopotamus and
jay using natural language locales (such as cs_CZ). This happened
because those collations may sort characters differently, which breaks
the strict byte-wise lexicographical ordering expected by base32hex
encoding.
This commit fixes the regression tests by explicitly using the C
collation.
Per buildfarm members hippopotamus and jay.
Analyzed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/682417.1774482047@sss.pgh.pa.us
Michael Paquier [Thu, 26 Mar 2026 01:39:40 +0000 (10:39 +0900)]
Improve timeout handling of pg_promote()
Previously, pg_promote() looped a fixed number of times, calculated from
the specified timeout, and waited 100ms on a latch, once per iteration,
for the promotion of a standby to complete. However, unrelated signals
to the backend could set the latch and wake up the backend early,
consuming the loop iterations faster and causing the function's
execution time to fall short of the requested timeout.
This could be confusing for the function caller, especially if some
backend-side timeout is aggressive, because the function would return
much earlier than expected and report that the promotion request had not
completed within the requested time.
This commit refines the logic to track the time actually elapsed, by
looping until the requested duration has truly passed. The code
calculates the end time we expect, then uses it when looping.
Author: Robert Pang <robertpang@google.com> Reviewed-by: Tiancheng Ge <getiancheng_2012@163.com>
Discussion: https://postgr.es/m/CAJhEC07OK8J7tLUbyiccnuOXRE7UKxBNqD2-pLfeFXa=tBoWtw@mail.gmail.com
Tom Lane [Wed, 25 Mar 2026 23:15:52 +0000 (19:15 -0400)]
Remove a low-value, high-risk optimization in pg_waldump.
The code removed here deleted already-used data from a partially-read
WAL segment's hashtable entry. The intent was evidently to try to
keep the entry's memory consumption below the WAL segment's total
size, but we don't use WAL segments that are so large as to make that
a big win. The important memory-space optimization is to remove
hashtable entries altogether when done with them, and that's handled
elsewhere. To buy that, we must accept a substantially more complex
(and under-documented) logical invariant about what is in entry->buf,
as well as complex and under-documented interactions with the entry
spilling logic, various re-checking code paths in xlogreader.c,
and pg_waldump's overall data processing order. Any of those aspects
could have bugs lurking still, and are quite likely to be prone to
new bugs after future code changes.
Given the number of bugs we've already found in commit b15c15139,
I judge that simplifying anything we possibly can is a good decision.
While here, revise and extend some related comments.
Tom Lane [Wed, 25 Mar 2026 22:37:28 +0000 (18:37 -0400)]
Fix misuse of simplehash.h hash operations in pg_waldump.
Both ArchivedWAL_insert() and ArchivedWAL_delete_item() can cause
existing hashtable entries to move. The code didn't account for this
and could leave privateInfo->cur_file pointing at a dead or incorrect
entry, with hilarity ensuing. Likewise, read_archive_wal_page calls
read_archive_file which could result in movement of the hashtable
entry it is working with.
I believe these bugs explain some odd buildfarm failures, although
the amount of data we use in pg_waldump's TAP tests isn't enough to
trigger them reliably.
This code's all new as of commit b15c15139, so no need for back-patch.
Tom Lane [Wed, 25 Mar 2026 22:28:42 +0000 (18:28 -0400)]
Fix file descriptor leakages in pg_waldump.
TarWALDumpCloseSegment was of the opinion that it didn't need to
do anything. It was mistaken: it has to close the open file if
any, because nothing else will, leading to a descriptor leak.
In addition, we failed to ensure that any file being read by the
XLogReader machinery gets closed before the atexit callback tries to
cleanup the temporary directory holding spilled WAL files. While the
file would have been closed already in case of a success exit, this
doesn't happen in case of pg_fatal() exits. The least messy way
to fix that is to move the atexit function into pg_waldump.c,
where it has easier access to the XLogReaderState pointer and to
WALDumpCloseSegment.
These FD leakages are pretty insignificant on Unix-ish platforms,
but they're a bug on Windows, because they prevent successful cleanup
of the temporary directory for extracted WAL files. (Windows can't
delete a directory that holds a deleted-but-still-open file.)
This is visible in occasional buildfarm failures.
This code's all new as of commit b15c15139, so no need for back-patch.
Masahiko Sawada [Wed, 25 Mar 2026 18:35:19 +0000 (11:35 -0700)]
Add base32hex support to encode() and decode() functions.
This adds support for base32hex encoding and decoding, as defined in
RFC 4648 Section 7. Unlike standard base32, base32hex uses the
extended hex alphabet (0-9, A-V) which preserves the lexicographical
order of the encoded data.
This is particularly useful for representing UUIDv7 values in a
compact string format while maintaining their time-ordered sort
property.
The encode() function produces output padded with '=', while decode()
accepts both padded and unpadded input. Following the behavior of
other encoding types, decoding is case-insensitive.
Suggested-by: Sergey Prokhorenko <sergeyprokhorenko@yahoo.com.au>
Author: Andrey Borodin <x4mmm@yandex-team.ru> Co-authored-by: Aleksander Alekseev <aleksander@tigerdata.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Илья Чердаков <i.cherdakov.pg@gmail.com> Reviewed-by: Chengxi Sun <chengxisun92@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAJ7c6TOramr1UTLcyB128LWMqita1Y7%3Darq3KHaU%3Dqikf5yKOQ%40mail.gmail.com
Álvaro Herrera [Wed, 25 Mar 2026 13:04:33 +0000 (14:04 +0100)]
Remove unused autovac_table.at_sharedrel
The last use was removed by commit 38f7831d703b. After that, we compute
MyWorkerInfo->wi_sharedrel directly from the pg_class tuple of the table
being vacuumed rather than passing it around.
Masahiko Sawada [Wed, 25 Mar 2026 16:30:26 +0000 (09:30 -0700)]
psql: Fix tab completion for FOREIGN DATA WRAPPER and SUBSCRIPTION.
Commit 8185bb5347 extended the CREATE/ALTER SUBSCRIPTION and
CREATE/ALTER FOREIGN DATA WRAPPER commands, but missed the
corresponding tab-completion logic. This commit fixes that oversight
by adding completion support for:
- The CONNECTION keyword in CREATE/ALTER FOREIGN DATA WRAPPER.
- The list of foreign servers in CREATE/ALTER SUBSCRIPTION.
Peter Eisentraut [Wed, 25 Mar 2026 14:03:30 +0000 (15:03 +0100)]
Remove compiler warning option -Wendif-labels
This warning has always been on by default in GCC (and in Clang at
least going back to 3.1), so we don't need the option explicitly.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/aa73q1aT0A3/vke/%40ip-10-97-1-34.eu-west-3.compute.internal
Peter Eisentraut [Wed, 25 Mar 2026 14:03:30 +0000 (15:03 +0100)]
Disable warnings in system headers in MSVC
This is similar to the standard behavior in GCC. For MSVC, we set all
headers in angle brackets to be considered system headers. (GCC goes
by path, not include style.)
The required option is available since VS 2017. (Before VS 2019
version 16.10, the additional option /experimental:external is
required, but per discussion in [0], we effectively require 16.11, so
this shouldn't be a problem.)
Then, we can remove one workaround for avoiding a warning from a
system header. (And some warnings to be enabled in the future could
benefit from this.)
Amit Kapila [Wed, 25 Mar 2026 05:52:07 +0000 (11:22 +0530)]
pg_createsubscriber: Add -l/--logdir option to redirect output to files.
This commit introduces a -l (or --logdir) argument to pg_createsubscriber,
allowing users to specify a directory for log files.
When enabled, a timestamped subdirectory is created within the specified
log directory, containing:
pg_createsubscriber_server.log: Captures logs from the standby server
during its start/stop cycles.
pg_createsubscriber_internal.log: Captures the tool's own internal
diagnostic and progress messages.
This ensures that transient server and utility messages are preserved for
troubleshooting after the subscriber creation process completes or errors out.
Author: Gyan Sreejith <gyan.sreejith@gmail.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEqnbaUthOQARV1dscGvB_EsqC-YfxiM6rWkVDHc+G+f4oSUHw@mail.gmail.com
John Naylor [Wed, 25 Mar 2026 05:32:36 +0000 (12:32 +0700)]
Refactor handling of x86 CPUID instructions
Introduce two helpers for CPUID, pg_cpuid() and pg_cpuid_subleaf(), that
wrap the platform-specific __get_cpuid/__cpuid and __get_cpuid_count/__cpuidex
functions.
Additionally, use macros to specify register names (e.g. EAX) for clarity,
instead of numeric indexes into the result array.
Author: Lukas Fittl <lukas@fittl.com> Suggested-By: John Naylor <john.naylor@postgresql.org>
Discussion: https://postgr.es/m/CANWCAZZ+Crjt5za9YmFsURRMDW7M4T2mutDezd_3s1gTLnrzGQ@mail.gmail.com
Michael Paquier [Tue, 24 Mar 2026 23:48:15 +0000 (08:48 +0900)]
Remove isolation test lock-stats
This test is proving to be unstable in the CI for Windows, at least.
The origin of the issue is that the deadlock_timeout requests may not
be processed, causing the lock stats to not be updated. This could be
mitigated by making the hardcoded sleep longer, however this would cost
in runtime on fast machines. On slow machines, there is no guarantee
that an augmented sleep would be enough.
An isolation test may not be the best way to write this test
(a TAP test with an injection point and a NOTICE+wait_for_log before
processing the deadlock_timeout request should remove the need for a
sleep). As we are late in the release cycle, I am removing the test for
now to keep the CI and the buildfarm as stable as possible. Let's revisit
this part later.
Jeff Davis [Tue, 24 Mar 2026 22:10:03 +0000 (15:10 -0700)]
GetSubscription(): use per-object memory context.
Constructing a Subscription object uses a number of small or temporary
allocations. Use a per-object memory context for easy cleanup.
Get rid of FreeSubscription(), which did not free all the allocations
anyway. Also get rid of the PG_TRY()/PG_CATCH() logic in
ForeignServerConnectionString(), which was used to avoid leaks during
GetSubscription().
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de> Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/xvdjrdqnpap3uq7owbaox3r7p5gf7sv62aaqf2ju3vb6yglatr%40kvvwhoudrlxq
Discussion: https://postgr.es/m/CAA4eK1K=WjZ1maBCmj=5ZdO66AwPORK5ZBxVKedS0xdCcb621A@mail.gmail.com
Melanie Plageman [Tue, 24 Mar 2026 21:58:12 +0000 (17:58 -0400)]
Remove XLOG_HEAP2_VISIBLE entirely
There are no remaining users that emit XLOG_HEAP2_VISIBLE records, so it
can be removed. This includes deleting the xl_heap_visible struct and
all functions responsible for emitting or replaying XLOG_HEAP2_VISIBLE
records.
Bumps XLOG_PAGE_MAGIC because we removed a WAL record type.
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
Melanie Plageman [Tue, 24 Mar 2026 21:28:05 +0000 (17:28 -0400)]
WAL log VM setting for empty pages in XLOG_HEAP2_PRUNE_VACUUM_SCAN
As part of removing XLOG_HEAP2_VISIBLE records, phase I of VACUUM now
marks empty pages all-visible and all-frozen in a
XLOG_HEAP2_PRUNE_VACUUM_SCAN record.
This has no real independent benefit, but empty pages were the last user
of XLOG_HEAP2_VISIBLE, so by making this change we can next remove all
of the XLOG_HEAP2_VISIBLE code.
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Earlier version Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Melanie Plageman [Tue, 24 Mar 2026 20:49:46 +0000 (16:49 -0400)]
WAL log VM setting during vacuum phase I in XLOG_HEAP2_PRUNE_VACUUM_SCAN
Vacuum no longer emits a separate WAL record for each page set
all-visible or all-frozen during phase I. Instead, visibility map
updates are now included in the XLOG_HEAP2_PRUNE_VACUUM_SCAN record that
is already emitted for pruning and freezing.
Previously, heap_page_prune_and_freeze() determined whether a page was
all-visible, but the corresponding VM bits were only set later in
lazy_scan_prune(). Now the VM is updated immediately in
heap_page_prune_and_freeze(), at the same time as the heap
modifications. This reduces WAL volume produced by vacuum.
For now, vacuum is still the only user of heap_page_prune_and_freeze()
allowed to set the VM. On-access pruning is not yet able to set the VM.
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Earlier version Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
Robert Haas [Tue, 24 Mar 2026 20:17:26 +0000 (16:17 -0400)]
get_memoize_path: Don't exit quickly when PGS_NESTLOOP_PLAIN is unset.
This function exits early in the case where the number of inner rows
is estimated to be less than 2, on the theory that in that case a
Nested Loop with inner Memoize must lose to a plain Nested Loop.
But since commit 4020b370f214315b8c10430301898ac21658143f it's
possible for a plain Nested Loop to be disabled, while a Nested Loop
with inner Memoize is still enabled. In that case, this reasoning
is not valid, so adjust the code to skip the early exit.
This issue was revealed by a test_plan_advice failure on buildfarm
member skink, where NESTED_LOOP_MEMOIZE() couldn't be enforced on
replanning due to this early exit.
Melanie Plageman [Tue, 24 Mar 2026 19:36:34 +0000 (15:36 -0400)]
Keep newest live XID up-to-date even if page not all-visible
During pruning, we keep track of the newest xmin of live tuples on the
page visible to all running and future transactions so that we can use
it later as the snapshot conflict horizon when setting the VM if the
page turns out to be all-visible.
Previously, we stopped updating this value once we determined the page
was not all-visible. However, maintaining it even when the page is not
all-visible is inexpensive and makes the snapshot conflict horizon
calculation clearer. This guarantees it won't contain a stale value.
Since we'll keep it up to date all the time now anyway, there's no
reason not to maintain set_all_visible for on-access pruning. This will
allow us to set the VM on-access in the future.
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
Melanie Plageman [Tue, 24 Mar 2026 18:50:59 +0000 (14:50 -0400)]
Use GlobalVisState in vacuum to determine page level visibility
During vacuum's first and third phases, we examine tuples' visibility to
determine if we can set the page all-visible in the visibility map.
Previously, this check compared tuple xmins against a single XID chosen
at the start of vacuum (OldestXmin). We now use GlobalVisState, which
enables future work to set the VM during on-access pruning, since
ordinary queries have access to GlobalVisState but not OldestXmin.
This also benefits vacuum: in some cases, GlobalVisState may advance
during a vacuum, allowing more pages to be considered all-visible.
And, in the future, we could easily add a heuristic to update
GlobalVisState more frequently during vacuums of large tables.
OldestXmin is still used for freezing and as a backstop to ensure we
don't freeze a dead tuple that wasn't yet prunable according to
GlobalVisState in the rare occurrences where GlobalVisState moves
backwards.
Because comparing a transaction ID against GlobalVisState is more
expensive than comparing against a single XID, we defer this check until
after scanning all tuples on the page. Therefore, we perform the
GlobalVisState check only once per page. This is safe because
visibility_cutoff_xid records the newest live xmin on the page; if it is
globally visible, then the entire page is all-visible.
Using GlobalVisState means on-access pruning can also maintain
visibility_cutoff_xid, which is required to set the visibility map
on-access in the future.
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk#c755ef151507aba58471ffaca607e493
Álvaro Herrera [Tue, 24 Mar 2026 16:30:40 +0000 (17:30 +0100)]
Avoid including clog.h in proc.h
The number of .c files that must include access/clog.h can currently be
counted on one's fingers and miss only one (assuming one has the usual
number of hands). However, due to indirect inclusion via proc.h,
there are a lot of files that pointlessly include it. This is easy
to avoid with the simple trick implemented by this commit.
Tom Lane [Tue, 24 Mar 2026 16:17:04 +0000 (12:17 -0400)]
Fix poorly-sized buffers in astreamer compression modules.
astreamer_gzip.c and astreamer_lz4.c left their decompression
output buffers at StringInfo's default allocation, merely 1kB.
This results in a lot of ping-ponging between the decompressor
and the next astreamer filter. This patch increases these buffer
sizes to 256kB. In a simple test this had a small but measurable
effect (saving a few percent) on the overall runtime of pg_waldump
for the gzipped-data case; I didn't bother measuring for lz4.
astreamer_zstd.c used ZSTD_DStreamOutSize() to size its
compression output buffer, but the libzstd API says you should use
ZSTD_CStreamOutSize(); ZSTD_DStreamOutSize() is for decompression.
The two functions seem to produce the same value (256kB) here, so
this is just cosmetic, but nonetheless we should play by the rules.
While these issues are old, they don't seem significant enough to
warrant back-patching.
Tom Lane [Tue, 24 Mar 2026 16:06:08 +0000 (12:06 -0400)]
Remove read_archive_file()'s "count" parameter.
Instead, always try to fill the allocated buffer completely.
The previous coding apparently intended (though it's undocumented)
to read only small amounts of data until we are able to identify the
WAL segment size and begin filtering out unwanted segments. However
this extra complication has no measurable value according to simple
testing here, and it could easily be a net loss if there is a
substantial amount of non-WAL data in the archive file before the
first WAL file.
Álvaro Herrera [Tue, 24 Mar 2026 16:11:12 +0000 (17:11 +0100)]
Don't include storage/lock.h in so many headers
Since storage/locktags.h was added by commit 322bab79744d, many headers
can be made leaner by depending on that instead of on storage/lock.h,
which has many other dependencies.
(In fact, some of these changes were possible even before that.)
Álvaro Herrera [Tue, 24 Mar 2026 15:45:39 +0000 (16:45 +0100)]
Fix dereference in a couple of GUC check hooks
check_backtrace_functions() and check_archive_directory() were doing an
empty-string check this way:
*newval[0] == '\0'
which, because of operator precedence, is interpreted as *(newval[0])
instead of (*newval)[0] -- but these variables are pointers to C-strings
and we want to check the first character therein, rather than check the
first pointer of the array, so that interpretation is wrong. This would
be wrong for any index element other than 0, as evidenced by every other
dereference of the same variable in check_backtrace_functions, which uses
parentheses.
Add parentheses to make the intended dereference explicit.
This is just cosmetic at this stage, so no backpatch, although it's been
"wrong" for a long time.
Author: Zhang Hu <kongbaik228@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/CAB5m2QssN6UO+ckr6ZCcV0A71mKUB6WdiTw1nHo43v4DTW1Dfg@mail.gmail.com
Nathan Bossart [Tue, 24 Mar 2026 14:32:15 +0000 (09:32 -0500)]
test_bloomfilter: Fix error message.
The error message in question uses the wrong format specifier and
variable. This has been wrong for a while, but since it's in a
test module and wasn't noticed until just now, no back-patch.
Fujii Masao [Tue, 24 Mar 2026 13:33:09 +0000 (22:33 +0900)]
Report detailed errors from XLogFindNextRecord() failures.
Previously, XLogFindNextRecord() did not return detailed error information
when it failed to find a valid WAL record. As a result, callers such as
the WAL summarizer, pg_waldump, and pg_walinspect could only report generic
errors (e.g., "could not find a valid record after ..."), making
troubleshooting difficult.
This commit fixes the issue by extending XLogFindNextRecord() to return
detailed error information on failure, and by updating its callers to include
those details in their error messages.
For example, when pg_waldump is run on a WAL file with an invalid magic number,
it now reports not only the generic error but also the specific cause
(e.g., "invalid magic number").
Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Reviewed-by: Mircea Cadariu <cadariu.mircea@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAO6_XqoxJXddcT4wkd9Xd+cD6Sz-fyspRGuV4Bq-wbXG4pVNzA@mail.gmail.com
Robert Haas [Tue, 24 Mar 2026 12:58:50 +0000 (08:58 -0400)]
Bounds-check access to TupleDescAttr with an Assert.
The second argument to TupleDescAttr should always be at least zero
and less than natts; otherwise, we index outside of the attribute
array. Assert that this is the case.
Various violations, or possible violations, of this rule that are
currently in the tree are actually harmless, because while
we do call TupleDescAttr() before verifying that the argument is
within range, we don't actually dereference it unless the argument
was within range all along. Nonetheless, the Assert means we
should be more careful, so tidy up accordingly.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoacixUZVvi00hOjk_d9B4iYKswWP1gNqQ8Vfray-AcOCA@mail.gmail.com
Peter Eisentraut [Tue, 24 Mar 2026 11:01:05 +0000 (12:01 +0100)]
Make many cast functions error safe
This adjusts many C functions underlying casts to support soft errors.
This is in preparation for a future feature where conversion errors in
casts can be caught.
This patch covers cast functions that can be adjusted easily by
changing ereport to ereturn or making other light changes. The
underlying helper functions were already changed to support soft
errors some time ago as part of soft error support in type input
functions.
Other casts and types will require some more work and are being kept
as separate patches.
Author: jian he <jian.universality@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
Robert Haas [Tue, 24 Mar 2026 10:11:15 +0000 (06:11 -0400)]
Prevent spurious "indexes on virtual generated columns are not supported".
Both of the checks in DefineIndex() that can produce this error
message have a guard against negative attribute numbers, but lack a
guard to ensure that attno is non-zero. As a result, we can index
off the beginning of the TupleDesc and read a garbage byte for
attgenerated. If that byte happens to be 'v', we'll incorrectly
produce the error mentioned above.
The first call site is easy to hit: any attempt to create an
expression index does so. The second one is not currently hit in
the regression tests, but can be hit by something like
CREATE INDEX ON some_table ((some_function(some_table))).
Found by study of a test_plan_advice failure on buildfarm member
skink, though this issue has nothing to do with test_plan_advice
and seems to have only been revealed by happenstance.
Backpatch-through: 18 Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoacixUZVvi00hOjk_d9B4iYKswWP1gNqQ8Vfray-AcOCA@mail.gmail.com
John Naylor [Tue, 24 Mar 2026 09:40:33 +0000 (16:40 +0700)]
Fix copy-paste error in test_ginpostinglist
The check for a mismatch on the second decoded item pointer
was an exact copy of the first item pointer check, comparing
orig_itemptrs[0] with decoded_itemptrs[0] instead of orig_itemptrs[1]
with decoded_itemptrs[1]. The error message also reported (0, 1) as
the expected value instead of (blk, off). As a result, any decoding
error in the second item pointer (where the varbyte delta encoding
is exercised) would go undetected.
This has been wrong since commit bde7493d1, so backpatch to all
supported versions.
Author: Jianghua Yang <yjhjstz@gmail.com>
Discussion: https://postgr.es/m/CAAZLFmSOD8R7tZjRLZsmpKtJLoqjgawAaM-Pne1j8B_Q2aQK8w@mail.gmail.com
Backpatch-through: 14
Further improve commentary about ChangeVarNodesWalkExpression()
The updated comment explains why we use ChangeVarNodes_walker() instead of
expression_tree_walker(), and provides a bit more detail about the differences
in processing top-level Query and subqueries.
Author: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAPpHfdvbjq342WTQ705Wmqhe8794pcp7wospz%2BWUJ2qB7vuOqA%40mail.gmail.com
Backpatch-through: 18
Michael Paquier [Tue, 24 Mar 2026 06:32:09 +0000 (15:32 +0900)]
Add support for lock statistics in pgstats
This commit adds a new stats kind, called PGSTAT_KIND_LOCK, implementing
statistics for lock tags, as reported by pg_locks. The implementation
is fixed-size, as the data is capped based on the number of lock tags in
LockTagType.
The new statistics kind records the following fields, providing insight
regarding lock behavior, while avoiding impact on performance-critical
code paths (such as fast-path lock acquisition):
- waits and wait_time: respectively track the number of times a lock
required waiting and the total time spent acquiring it. These metrics
are only collected once a lock is successfully acquired and after
deadlock_timeout has been exceeded.
- fastpath_exceeded: counts how often a lock could not be acquired via
the fast path due to the max_locks_per_transaction slot limits.
A new view called pg_stat_lock can be used to access this data, coupled
with a SQL function called pg_stat_get_lock().
Bump stat file format PGSTAT_FILE_FORMAT_ID.
Bump catalog version.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aIyNxBWFCybgBZBS%40ip-10-97-1-34.eu-west-3.compute.internal
Michael Paquier [Tue, 24 Mar 2026 04:34:54 +0000 (13:34 +0900)]
Move some code blocks in lock.c and proc.c
This change will simplify an upcoming change that will introduce lock
statistics, reducing code churn.
Taken in isolation, this commit means that when log_lock_waits is off,
we begin calculating the time it took to acquire a lock only after the
deadlock check interrupt has run. This is not a performance-critical
code path, and note that log_lock_waits has been enabled by default since 2aac62be8cbb.
Extracted from a larger patch by the same author.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aIyNxBWFCybgBZBS@ip-10-97-1-34.eu-west-3.compute.internal
Michael Paquier [Mon, 23 Mar 2026 23:29:23 +0000 (08:29 +0900)]
Make implementation of SASLprep compliant for ASCII characters
This commit makes our implementation of SASLprep() compliant with RFC
3454 (Stringprep) and RFC 4013 (SASLprep). Originally, as introduced in
60f11b87a234, the operation treated a password made of only ASCII
characters as valid, performing an optimization for this case to skip
the internal NFKC transformation.
However, the RFCs listed above use a different definition, with the
following characters being prohibited:
- 0x00~0x1F (0~31), control characters.
- 0x7F (127, DEL).
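The prohibition the RFCs describe can be sketched as a simple predicate (Python, illustrative only; this is not PostgreSQL's pg_saslprep() code, and has_prohibited_ascii is a hypothetical name):

```python
def has_prohibited_ascii(password: str) -> bool:
    """Return True if the password contains ASCII characters that
    RFC 3454/RFC 4013 prohibit: control characters 0x00-0x1F and
    DEL (0x7F).  Printable ASCII, including space, is allowed."""
    return any(ord(c) <= 0x1F or ord(c) == 0x7F for c in password)

print(has_prohibited_ascii("secret"))       # False: printable ASCII only
print(has_prohibited_ascii("sec\tret"))     # True: TAB (0x09) is a control char
print(has_prohibited_ascii("sec\x7fret"))   # True: DEL (0x7F)
```

Under the all-ASCII optimization removed by this commit, such passwords were accepted as-is; applying SASLprep unconditionally now rejects them.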
In its SCRAM protocol, Postgres applies a password as-is if SASLprep()
does not succeed, so this change is safe on backward-compatibility
grounds:
- A libpq client with the compliant SASLprep can connect to a server
with a non-compliant SASLprep.
- A libpq client with the non-compliant SASLprep can connect to a server
with a compliant SASLprep.
This commit removes the all-ASCII optimization used in pg_saslprep() and
applies SASLprep even if a password is made only of ASCII characters,
making the operation compatible with the RFC. All the in-core callers
of pg_saslprep() are affected:
- pg_be_scram_build_secret() in auth-scram.c, when generating a
SCRAM verifier for rolpassword in the backend.
- scram_init() in fe-auth-scram.c, when starting the SASL exchange.
- pg_fe_scram_build_secret() in fe-auth-scram.c, when generating a SCRAM
verifier for the frontend with libpq, for example to generate it for an
ALTER/CREATE ROLE command.
The test module test_saslprep shows the difference this change makes.
Author: Michael Paquier <michael@paquier.xyz> Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/aaEJ-El2seZHeFcG@paquier.xyz
Tom Lane [Mon, 23 Mar 2026 21:25:12 +0000 (17:25 -0400)]
Silence compiler warning from older compilers.
Our RHEL7-vintage buildfarm animals are complaining about
"the comparison will always evaluate as true" for a usage of
SOFT_ERROR_OCCURRED() on a local variable. This is the same
issue addressed in 7bc88c3d6 and some earlier commits, so solve
it the same way: write "escontext.error_occurred" instead.
Problem dates to recent commit a0b6ef29a, no need for back-patch.
Tom Lane [Mon, 23 Mar 2026 19:33:51 +0000 (15:33 -0400)]
Doc: minor improvements to SNI documentation.
My attention was drawn to this new documentation by overlength-line
complaints in the PDF docs builds: the synopsis for hostname lines was
too wide. I initially thought of shortening the parameter names to
fit, but it turns out that adding <optional> markup is enough to
persuade DocBook to break the line, and that seems more helpful
anyway.
While here, I couldn't resist some copy-editing, mostly being
consistent about whether to use Oxford commas or not. The biggest
change was to re-order the entries in the hostname-values table to
match the running text.