Michael Paquier [Wed, 28 Jan 2026 02:48:45 +0000 (11:48 +0900)]
Fix pg_restore_extended_stats() with expressions
This commit fixes an issue with the restore of ndistinct and
dependencies statistics, causing the operation to fail when any of these
kinds included expressions.
In extended statistics, expressions use strictly negative attribute
numbers, decremented from -1. For example, let's imagine an object
defined as follows:
CREATE STATISTICS stats_obj (dependencies) ON lower(name), upper(name)
FROM tab_obj;
This object would generate dependencies stats using -1 and -2 as
attribute numbers, like this:
[{"attributes": [-1], "dependency": -2, "degree": 1.000000},
{"attributes": [-2], "dependency": -1, "degree": 1.000000}]
However, pg_restore_extended_stats() forgot to account for the number of
expressions defined in an extended statistics object. This would cause
the validation step of ndistinct and dependencies data to fail,
preventing a restore of their stats even if the input is valid.
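The check described above can be sketched in C as follows. This is only an illustration under stated assumptions: attnum_is_valid() and its parameters are hypothetical names, not the backend's actual validation routine. A positive attribute number has to match one of the object's columns, while a negative one has to fall in the range -1 down to -nexprs.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, for illustration only: check one attribute number
 * found in ndistinct or dependencies input against the definition of an
 * extended statistics object.
 *
 * attnums/nattnums: column attribute numbers of the object (all positive).
 * nexprs: number of expressions, numbered -1, -2, ..., -nexprs.
 */
bool
attnum_is_valid(int attnum, const int *attnums, int nattnums, int nexprs)
{
    assert(nattnums >= 0 && nexprs >= 0);

    if (attnum == 0)
        return false;               /* 0 never identifies a column or expression */

    if (attnum < 0)
        return attnum >= -nexprs;   /* expressions: -1 down to -nexprs */

    /* Regular columns: the number must be part of the object definition. */
    for (int i = 0; i < nattnums; i++)
    {
        if (attnums[i] == attnum)
            return true;
    }
    return false;
}
```

With nexprs forgotten (that is, treated as 0), the values -1 and -2 from the example above would both be rejected, which matches the failure described.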
This issue has come up due to an incorrect split of the patch set. Some
tests are included to cover this behavior.
Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aXl4bMfSTQUxM_yy@paquier.xyz
Michael Paquier [Tue, 27 Jan 2026 23:37:46 +0000 (08:37 +0900)]
Add output test for pg_dependencies statistics import
Commit 302879bd68d115 has added the ability to restore extended stats of
the type "dependencies", but it has forgotten the addition of a test to
verify that the value restored was actually set.
This test is the pg_dependencies equivalent of the test added for
pg_ndistinct in 0e80f3f88dea.
Jacob Champion [Tue, 27 Jan 2026 19:56:44 +0000 (11:56 -0800)]
oauth: Correct test dependency on oauth_hook_client
The oauth_validator tests missed the lessons of c89525d57 et al, so
certain combinations of command-line build order and `meson test`
options can result in
Command 'oauth_hook_client' not found in [...] at src/test/perl/PostgreSQL/Test/Utils.pm line 427.
Add the missing dependency on the test executable. This fixes, for
example,
Reported-by: Jonathan Gonzalez V. <jonathan.abdiel@gmail.com>
Author: Jonathan Gonzalez V. <jonathan.abdiel@gmail.com>
Discussion: https://postgr.es/m/6e8f4f7c23faf77c4b6564c4b7dc5d3de64aa491.camel@gmail.com
Discussion: https://postgr.es/m/qh4c5tvkgjef7jikjig56rclbcdrrotngnwpycukd2n3k25zi2%4044hxxvtwmgum
Backpatch-through: 18
Robert Haas [Tue, 27 Jan 2026 13:31:15 +0000 (08:31 -0500)]
pg_waldump: Remove file-level global WalSegSz.
It's better style to pass the value around to just the places that
need it. This makes it easier to determine whether the value is
always properly initialized before use.
Reviewed-by: Amul Sul <sulamul@gmail.com>
Discussion: http://postgr.es/m/CAAJ_b94+wObPn-z1VECipnSFhjMJ+R2cpTmKVYLjyQuVn+B5QA@mail.gmail.com
Amit Kapila [Tue, 27 Jan 2026 05:06:29 +0000 (05:06 +0000)]
Prevent invalidation of newly synced replication slots.
A race condition could cause a newly synced replication slot to become
invalidated between its initial sync and the checkpoint.
When syncing a replication slot to a standby, the slot's initial
restart_lsn is taken from the publisher's remote_restart_lsn. Because slot
sync happens asynchronously, this value can lag behind the standby's
current redo pointer. Without any interlocking between WAL reservation and
checkpoints, a checkpoint may remove WAL required by the newly synced
slot, causing the slot to be invalidated.
To fix this, we acquire ReplicationSlotAllocationLock before reserving WAL
for a newly synced slot, similar to commit 006dd4b2e5. This ensures that
if WAL reservation happens first, the checkpoint process must wait for
slotsync to update the slot's restart_lsn before it computes the minimum
required LSN.
However, unlike in ReplicationSlotReserveWal(), this lock alone cannot
protect a newly synced slot if a checkpoint has already run
CheckPointReplicationSlots() before slotsync updates the slot. In such
cases, the remote restart_lsn may be stale and earlier than the current
redo pointer. To prevent relying on an outdated LSN, we use the oldest
WAL location available if it is greater than the remote restart_lsn.
This ensures that newly synced slots always start with a safe, non-stale
restart_lsn and are not invalidated by concurrent checkpoints.
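The choice of the initial restart_lsn can be sketched as below. This is a simplified illustration, not the slotsync code: the XLogRecPtr typedef is a local stand-in, choose_initial_restart_lsn() is a hypothetical name, and the locking with ReplicationSlotAllocationLock is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the backend's 64-bit WAL position type. */
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical helper: pick the restart_lsn for a newly synced slot.
 * If the oldest WAL location still available is ahead of the value
 * received from the remote side, the remote value is stale and the
 * oldest available location is used instead.
 */
XLogRecPtr
choose_initial_restart_lsn(XLogRecPtr remote_restart_lsn,
                           XLogRecPtr oldest_wal_lsn)
{
    XLogRecPtr  result = remote_restart_lsn;

    if (oldest_wal_lsn > remote_restart_lsn)
        result = oldest_wal_lsn;

    /* The chosen position never goes backwards from the remote value. */
    assert(result >= remote_restart_lsn);
    return result;
}
```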
Michael Paquier [Tue, 27 Jan 2026 04:42:32 +0000 (13:42 +0900)]
Include extended statistics data in pg_dump
This commit integrates the new pg_restore_extended_stats() function into
pg_dump, so that extended statistics data is detected and included in
dumps when the --statistics switch is specified. Currently, the same
extended stats kinds as the ones supported by the SQL function can be
dumped: "n_distinct" and "dependencies".
The extended statistics data can be dumped down to PostgreSQL 10, with
the following changes depending on the backend version dealt with:
- In v19 and newer versions, the format of pg_ndistinct and
pg_dependencies has changed, so the catalogs can be queried directly.
- In v18 and older versions, the format is translated to the new format
supported by the backend.
- In v14 and older versions, inherited extended statistics are not
supported.
- In v11 and older versions, the data for ndistinct and dependencies
was stored in pg_statistic_ext. These have been moved to pg_stats_ext
in v12.
- Extended statistics were introduced in v10, so no support is needed
for versions older than that.
The extended statistics data is dumped if it can be found in the
catalogs. If the catalogs are empty, then no restore of the stats data
is attempted.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
Fujii Masao [Tue, 27 Jan 2026 02:55:32 +0000 (11:55 +0900)]
Remove unnecessary abort() from WalSndShutdown().
WalSndShutdown() previously called abort() after proc_exit(0) to
silence compiler warnings. This is no longer needed, because both
WalSndShutdown() and proc_exit() are declared pg_noreturn,
allowing the compiler to recognize that the function does not return.
Also there are already other functions, such as CheckpointerMain(),
that call proc_exit() without an abort(), and they do not produce warnings.
Therefore this abort() call in WalSndShutdown() is useless and
this commit removes it.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAHGQGwHPX1yoixq+YB5rF4zL90TMmSEa3FpHURtqW3Jc5+=oSA@mail.gmail.com
Tomas Vondra [Mon, 26 Jan 2026 21:20:18 +0000 (22:20 +0100)]
Handle ENOENT status when querying NUMA node
We've assumed that touching the memory is sufficient for a page to be
located on one of the NUMA nodes. But a page may be moved to swap
after we touch it, due to memory pressure.
We touch the memory before querying the status, but there is no
guarantee it won't be moved to swap in the meantime. The touching
happens only on the first call, so later calls are more likely to be
affected. And the batching increases the window too.
It's up to the kernel if/when pages get moved to swap. We have to accept
ENOENT (-2) as a valid result, and handle it without failing. This patch
simply treats it as an unknown node, and returns NULL in the two
affected views (pg_shmem_allocations_numa and pg_buffercache_numa).
Hugepages cannot be swapped out, so this affects only regular pages.
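A minimal sketch of the status handling described above, assuming a hypothetical helper numa_status_to_node() (the real code in the two views is organized differently): a non-negative status is a node number, -ENOENT means the node is unknown and would be shown as NULL, and any other negative value is still treated as an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: interpret one per-page status value from a NUMA
 * query.
 *
 * Returns true and sets *node when the page is located on a known node.
 * Returns false with *is_error == false when the node is unknown because
 * the page is not present (-ENOENT, e.g. swapped out); a caller would
 * report NULL in that case.
 * Returns false with *is_error == true for any other negative status.
 */
bool
numa_status_to_node(int status, int *node, bool *is_error)
{
    assert(node != NULL && is_error != NULL);

    *is_error = false;

    if (status >= 0)
    {
        *node = status;
        return true;
    }

    if (status == -ENOENT)
        return false;           /* unknown node, not a failure */

    *is_error = true;
    return false;
}
```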
Reported by Christoph Berg, investigation and fix by me. Backpatch to
18, where the two views were introduced.
Reported-by: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/aTq5Gt_n-oS_QSpL@msg.df7cb.de
Backpatch-through: 18
Michael Paquier [Mon, 26 Jan 2026 23:20:13 +0000 (08:20 +0900)]
Add support for "dependencies" in pg_restore_extended_stats()
This commit adds support for the restore of extended statistics of the
kind "dependencies", for the following input data:
[{"attributes": [2], "dependency": 3, "degree": 1.000000},
{"attributes": [3], "dependency": 2, "degree": 1.000000}]
This relies on the existing routines of "dependencies" to cross-check
the input data with the definition of the extended statistics objects
for the attribute numbers. An input argument of type "pg_dependencies"
is required for this new option.
Thanks to the work done in 0e80f3f88dea for the restore function and
e1405aa5e3ac for the input handling of data type pg_dependencies, this
addition is straightforward. It makes it possible to transfer these
statistics across dumps and upgrades, removing the need for a
post-operation ANALYZE for these kinds of statistics.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
Melanie Plageman [Mon, 26 Jan 2026 22:12:05 +0000 (17:12 -0500)]
Refactor lazy_scan_prune() VM clear logic into helper
Encapsulating the cases that clear the visibility map after vacuum phase
I, when corruption is detected, into a helper makes the code cleaner
and enables further refactoring in future commits.
Melanie Plageman [Mon, 26 Jan 2026 22:00:13 +0000 (17:00 -0500)]
Eliminate use of cached VM value in lazy_scan_prune()
lazy_scan_prune() takes a parameter from lazy_scan_heap() indicating
whether the page was marked all-visible in the VM at the time it was
last checked in find_next_unskippable_block(). This behavior is
historical, dating back to commit 608195a3a365, when we did not pin the
VM page until deciding we must read it. Now that the VM page is already
pinned, there is no meaningful benefit to relying on a cached VM status.
Removing this cached value simplifies the logic in both lazy_scan_heap()
and lazy_scan_prune(). It also clarifies future work that will set the
visibility map on-access: such paths will not have a cached value
available, which would make the logic harder to reason about. And
eliminating it enables us to detect and repair VM corruption on-access.
Along with removing the cached value and unconditionally checking the
visibility status of the heap page, this commit also moves the VM
corruption handling to occur first. This reordering should have no
performance impact, since the checks are inexpensive and performed only
once per page. It does, however, make the control flow easier to
understand. The new restructuring also makes it possible to set the VM
after fixing corruption (if pruning found the page all-visible).
Now that no callers of visibilitymap_set() use its return value, change
its (and visibilitymap_set_vmbits()) return type to void.
Melanie Plageman [Mon, 26 Jan 2026 20:57:51 +0000 (15:57 -0500)]
Combine visibilitymap_set() cases in lazy_scan_prune()
lazy_scan_prune() previously had two separate cases that called
visibilitymap_set() after pruning and freezing. These branches were
nearly identical except that one attempted to avoid dirtying the heap
buffer. However, that situation can never occur — the heap buffer cannot
be clean at that point (and we would hit an assertion if it were).
In lazy_scan_prune(), when we change a previously all-visible page to
all-frozen and the page was recorded as all-visible in the visibility
map by find_next_unskippable_block(), the heap buffer will always be
dirty. Either we have just frozen a tuple and already dirtied the
buffer, or the buffer was modified between find_next_unskippable_block()
and heap_page_prune_and_freeze() and then pruned in
heap_page_prune_and_freeze().
Additionally, XLogRegisterBuffer() asserts that the buffer is dirty, so
attempting to add a clean heap buffer to the WAL chain would assert out
anyway.
Since the “clean heap buffer with already set VM” case is impossible,
the two visibilitymap_set() branches in lazy_scan_prune() can be merged.
Doing so makes the intent clearer and emphasizes that the heap buffer
must always be marked dirty before being added to the WAL chain.
This commit also adds a test case for vacuuming when no heap
modifications are required. Currently this ensures that the heap buffer
is marked dirty before it is added to the WAL chain, but if we later
remove the heap buffer from the VM-set WAL chain or pass it with the
REGBUF_NO_CHANGES flag, this test would guard that behavior.
Tomas Vondra [Mon, 26 Jan 2026 17:54:12 +0000 (18:54 +0100)]
Exercise parallel GIN builds in regression tests
Modify two places creating GIN indexes in regression tests, so that the
build is parallel. This provides a basic test coverage, even if the
amounts of data are fairly small.
Tomas Vondra [Mon, 26 Jan 2026 17:52:16 +0000 (18:52 +0100)]
Lookup the correct ordering for parallel GIN builds
When building a tuplesort during parallel GIN builds, the function
incorrectly looked up the default B-Tree operator, not the function
associated with the GIN opclass (through GIN_COMPARE_PROC).
Fixed by using the same logic as initGinState(), and the other place
in parallel GIN builds.
This could cause two types of issues. First, a data type might not have
a B-Tree opclass, in which case the PrepareSortSupportFromOrderingOp()
fails with an ERROR. Second, a data type might have both B-Tree and GIN
opclasses, defining order/equality in different ways. This could lead to
logical corruption in the index.
Backpatch to 18, where parallel GIN builds were introduced.
Discussion: https://postgr.es/m/73a28b94-43d5-4f77-b26e-0d642f6de777@iki.fi
Reported-by: Heikki Linnakangas <hlinnaka@iki.fi>
Backpatch-through: 18
Robert Haas [Mon, 26 Jan 2026 17:43:52 +0000 (12:43 -0500)]
Reduce length of TAP test file name.
Buildfarm member fairywren hit the Windows limitation on the length of a
file path. While there may be other things we should also do to prevent
this from happening, it's certainly the case that the length of this
test file name is much longer than others in the same directory, so make
it shorter.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: http://postgr.es/m/274e0a1a-d7d2-4bc8-8b56-dd09f285715e@gmail.com
Backpatch-through: 17
Peter Eisentraut [Mon, 26 Jan 2026 15:02:31 +0000 (16:02 +0100)]
Fix accidentally cast away qualifiers
This fixes cases where a qualifier (const, in all cases here) was
dropped by a cast, but the cast was otherwise necessary or desirable,
so the straightforward fix is to add the qualifier into the cast.
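As an illustration of this kind of fix, here is a qsort()-style comparator written for this note (compare_ints() is a hypothetical example, not code touched by the commit). The cast is needed to reach the element type, and it keeps the const qualifier of the incoming pointer instead of dropping it.

```c
#include <assert.h>
#include <stddef.h>

/* Comparator in the style expected by qsort(): arguments are const void *. */
int
compare_ints(const void *a, const void *b)
{
    /*
     * A cast written as (int *) a would compile, but it silently casts
     * away const.  Adding the qualifier into the cast keeps the pointer
     * read-only.
     */
    const int  *ia = (const int *) a;
    const int  *ib = (const int *) b;

    assert(ia != NULL && ib != NULL);

    /* Returns -1, 0 or 1 without risking integer overflow. */
    return (*ia > *ib) - (*ia < *ib);
}
```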
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/b04f4d3a-5e70-4e73-9ef2-87f777ca4aac%40eisentraut.org
Fujii Masao [Mon, 26 Jan 2026 11:45:05 +0000 (20:45 +0900)]
doc: Clarify that \d and \d+ output lists are illustrative, not exhaustive.
The psql documentation for the \d and \d+ meta-commands lists objects
that may be shown, but previously the wording could be read as exhaustive
even though additional objects can also appear in the output.
This commit clarifies the description by adding phrasing such as "for example"
or "such as", making it clear that the listed objects are illustrative
rather than a complete list. While the change is small, it helps avoid
potential user confusion.
As this is a documentation clarification rather than a bug fix,
it is not backpatched.
Author: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAHut+Pt1DBtaUqfJftkkaQLJJJenYJBtb6Ec6s6vu82KEMh46A@mail.gmail.com
David Rowley [Mon, 26 Jan 2026 10:27:15 +0000 (23:27 +1300)]
Remove deduplication logic from find_window_functions
This code thought it was optimizing WindowAgg evaluation by getting rid
of duplicate WindowFuncs, but it turns out all it does today is lead to
cost-underestimations and makes it possible that optimize_window_clauses
could miss some of the WindowFuncs that must receive an updated winref.
The deduplication likely was useful when it was first added, but since
the projection code was changed in b8d7f053c, the list of WindowFuncs
gathered by find_window_functions isn't used during execution. Instead,
the expression evaluation code will process the node's targetlist to find
the WindowFuncs.
The reason the deduplication could cause issues for
optimize_window_clauses() is because if a WindowFunc is moved to another
WindowClause, the winref is adjusted to reference the new WindowClause.
If any duplicate WindowFuncs were discarded in find_window_functions()
then the WindowFuncLists may not include all the WindowFuncs that need
their winref adjusted. This could lead to an error message such as:
ERROR: WindowFunc with winref 2 assigned to WindowAgg with winref 1
The back-branches will receive a different fix so that the WindowAgg costs
are not affected.
Author: Meng Zhang <mza117jc@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAErYLFAuxmW0UVdgrz7iiuNrxGQnFK_OP9hBD5CUzRgjrVrz=Q@mail.gmail.com
Peter Eisentraut [Mon, 26 Jan 2026 09:23:14 +0000 (10:23 +0100)]
Disable extended alignment uses on older g++
Fix for commit a9bdb63bba8. The previous plan of redefining alignas
didn't work, because it interfered with other C++ header files (e.g.,
LLVM). So now the new workaround is to just disable the affected
typedefs under the affected compilers. These are not typically used
in extensions anyway.
Michael Paquier [Mon, 26 Jan 2026 07:32:33 +0000 (16:32 +0900)]
Add test for MAINTAIN permission with pg_restore_extended_stats()
Like its cousin functions for the restore of relation and attribute
stats, pg_restore_extended_stats() needs to be run by a user that is the
database owner or has MAINTAIN privileges on the table whose stats are
restored. This commit adds a regression test ensuring that MAINTAIN is
required when calling the function. This test also checks that a
ShareUpdateExclusive lock is taken on the table whose stats are
restored.
This has been split from the commit that has introduced
pg_restore_extended_stats(), for clarity.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
Michael Paquier [Mon, 26 Jan 2026 07:13:41 +0000 (16:13 +0900)]
Fix missing initialization in pg_restore_extended_stats()
The tuple data upserted into pg_statistic_ext_data was missing an
initialization for the nulls flag of stxoid and stxdinherit. This would
cause an incorrect handling of the stats data restored.
This issue has been spotted by CatalogTupleCheckConstraints(),
translating to a NOT NULL constraint inconsistency, while playing more
with the follow-up portions of the patch set.
Oversight in 0e80f3f88dea (mea culpa). Surprisingly, the buildfarm did
not complain yet.
Michael Paquier [Mon, 26 Jan 2026 06:08:15 +0000 (15:08 +0900)]
Add pg_restore_extended_stats()
This function closely mirrors its relation and attribute counterparts,
but for extended statistics (i.e. CREATE STATISTICS) objects: it
restores the statistics data of an extended stats object. Like
the other functions, the goal of this feature is to ease the dump or
upgrade of clusters so that ANALYZE is no longer required after
these operations, stats being directly loaded into the target cluster
without any post-dump/upgrade computation.
The caller of this function needs the following arguments for the
extended stats to restore:
- The name of the relation.
- The schema name of the relation.
- The name of the extended stats object.
- The schema name of the extended stats object.
- If the stats are inherited or not.
- One or more extended stats kind with its data.
This commit adds only support for the restore of the extended statistics
kind "n_distinct", building the basic infrastructure for the restore
of more extended statistics kinds in follow-up commits, including MCV
and dependencies.
The support for "n_distinct" is eased in this commit thanks to the
previous work done particularly in commits 1f927cce4498 and
44eba8f06e55, which added the input function for the type
pg_ndistinct, used as the input data type of this new restore function.
Bump catalog version.
Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
Michael Paquier [Mon, 26 Jan 2026 04:32:17 +0000 (13:32 +0900)]
Add test with multirange type for pg_restore_attribute_stats()
This commit adds a test for pg_restore_attribute_stats() with the
injection of statistics related to a multirange type. This case is
supported in statatt_get_type() since its introduction in ce207d2a7901,
but there was no test in the main regression test suite to check for the
case where attribute stats is restored for a multirange type, as done by
multirange_typanalyze().
Michael Paquier [Mon, 26 Jan 2026 01:52:02 +0000 (10:52 +0900)]
Remove PG_MMAP_FLAGS from mem.h
Based on the name of the macro, it was implied that it could be used for
all mmap() calls on portability grounds. However, its use is limited to
sysv_shmem.c, for CreateAnonymousSegment(). This commit removes the
declaration, reducing the confusion around it: it is not a general
portability tweak, being limited to SysV-style shared memory.
This macro has been introduced in b0fc0df9364d for sysv_shmem.c,
originally. It has been moved to mem.h in 0ac5e5a7e152 a bit later.
David Rowley [Mon, 26 Jan 2026 01:29:10 +0000 (14:29 +1300)]
Always inline SeqNext and SeqRecheck
The intention of the work done in fb9f95502 was that these functions are
inlined. I noticed my compiler isn't doing this on -O2 (gcc version
15.2.0). Also, clang version 20.1.8 isn't inlining either. Fix by
marking both of these functions as pg_attribute_always_inline to avoid
leaving this up to the compiler's heuristics.
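The idea can be sketched as follows. The macro demo_always_inline is a hypothetical stand-in written for this note, using the GCC/Clang attribute spelling; it is an assumption about how such a macro can be defined, and the functions are illustrative, not the executor code.

```c
#include <assert.h>

/* Hypothetical stand-in for an "always inline" macro (GCC/Clang spelling). */
#if defined(__GNUC__) || defined(__clang__)
#define demo_always_inline __attribute__((always_inline)) inline
#else
#define demo_always_inline inline
#endif

/*
 * Small per-row callback.  Forcing inlining removes the dependency on the
 * compiler's heuristics for this hot path.
 */
static demo_always_inline int
demo_row_matches(int value, int threshold)
{
    return value < threshold;
}

/* Count the rows passing the filter; the callback is inlined in the loop. */
int
demo_count_matches(const int *rows, int nrows, int threshold)
{
    int         count = 0;

    assert(nrows >= 0);

    for (int i = 0; i < nrows; i++)
        count += demo_row_matches(rows[i], threshold);

    return count;
}
```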
A quick test with a Seq Scan on a table with a single int column running
a query that filters all 1 million rows in the WHERE clause yields a
3.9% speedup on my Zen4 machine.
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrL7Q41B=gv+3wc8+AJGKZugGegUbBo8FPQ+3+NGTPb+w@mail.gmail.com
Michael Paquier [Mon, 26 Jan 2026 00:30:22 +0000 (09:30 +0900)]
Add more tests with clause STORAGE on table and TOAST interactions
This commit adds more tests to cover STORAGE MAIN and EXTENDED, checking
how these use TOAST tables. EXTENDED is already widely tested as the
default behavior, but there were no tests where the clause pattern is
directly specified. STORAGE MAIN and its interactions with TOAST were
not covered at all.
This hole in the tests has been noticed for STORAGE MAIN (inline
compressible varlenas), where I have managed to break the backend
without the tests being able to notice the breakage while playing with
the varlena structures.
Peter Eisentraut [Sun, 25 Jan 2026 10:16:58 +0000 (11:16 +0100)]
Work around buggy alignas in older g++
Older g++ (<9.3) mishandle the alignas specifier (raise warnings that
the alignment is too large), but the more or less equivalent attribute
works. So as a workaround, #define alignas to that attribute for
those versions.
see <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89357>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/3119480.1769189606%40sss.pgh.pa.us
Michael Paquier [Sun, 25 Jan 2026 10:01:23 +0000 (19:01 +0900)]
pg_stat_statements: Fix test instability with cache-clobbering builds
Builds with CLOBBER_CACHE_ALWAYS enabled are failing the new test
introduced in 1572ea96e657, checking the nesting level calculation in
the planner hook. The inner query of the function called twice is
registered as normalized, as such builds would register a PGSS entry in
the post-parse-analyse hook due to the cached plans requiring
revalidation.
A trick based on debug_discard_caches cannot work as far as I can tell,
a normalized query still being registered. This commit takes a different
approach with the addition of a DISCARD PLANS before the first function
call. This forces the use of a normalized query in the PGSS entry for
the inner query of the function with and without CLOBBER_CACHE_ALWAYS,
which should be enough to stabilize the test. Note that the test is
still checking what it should: when removing the nesting level
calculation in the planner hook of PGSS, one still gets a failure for
the PGSS entry of the inner query in the function, with "toplevel" being
flipped to true instead of false (it should be false, as a non-top-level
entry).
Per buildfarm members avocet and trilobite, at least.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/82dd02bb-4e0f-40ad-a60b-baa1763ff0bd@gmail.com
Dean Rasheed [Sat, 24 Jan 2026 11:30:48 +0000 (11:30 +0000)]
Fix trigger transition table capture for MERGE in CTE queries.
When executing a data-modifying CTE query containing MERGE and some
other DML operation on a table with statement-level AFTER triggers,
the transition tables passed to the triggers would fail to include the
rows affected by the MERGE.
The reason is that, when initializing a ModifyTable node for MERGE,
MakeTransitionCaptureState() would create a TransitionCaptureState
structure with a single "tcs_private" field pointing to an
AfterTriggersTableData structure with cmdType == CMD_MERGE. Tuples
captured there would then not be included in the sets of tuples
captured when executing INSERT/UPDATE/DELETE ModifyTable nodes in the
same query.
Since there are no MERGE triggers, we should only create
AfterTriggersTableData structures for INSERT/UPDATE/DELETE. Individual
MERGE actions should then use those, thereby sharing the same capture
tuplestores as any other DML commands executed in the same query.
This requires changing the TransitionCaptureState structure, replacing
"tcs_private" with 3 separate pointers to AfterTriggersTableData
structures, one for each of INSERT, UPDATE, and DELETE. Nominally,
this is an ABI break to a public structure in commands/trigger.h.
However, since this is a private field pointing to an opaque data
structure, the only way to create a valid TransitionCaptureState is by
calling MakeTransitionCaptureState(), and no extensions appear to be
doing that anyway, so it seems safe for back-patching.
Backpatch to v15, where MERGE was introduced.
Bug: #19380
Reported-by: Daniel Woelfel <dwwoelfel@gmail.com>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19380-4e293be2b4007248%40postgresql.org
Backpatch-through: 15
Jacob Champion [Fri, 23 Jan 2026 20:57:15 +0000 (12:57 -0800)]
pqcomm.h: Explicitly reserve protocol v3.1
Document this unused version alongside the other special protocol
numbers.
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAOYmi%2BkKyw%3Dh-5NKqqpc7HC5M30_QmzFx3kgq2AdipyNj47nUw%40mail.gmail.com
Nathan Bossart [Fri, 23 Jan 2026 16:46:49 +0000 (10:46 -0600)]
Fix some rounding code for shared memory.
InitializeShmemGUCs() always added 1 to the value calculated for
shared_memory_size_in_huge_pages, which is unnecessary if the
shared memory size is divisible by the huge page size.
CreateAnonymousSegment() neglected to check for overflow when
rounding up to a multiple of the huge page size.
These are arguably bugs, but they seem extremely unlikely to be
causing problems in practice, so no back-patch.
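Both calculations can be sketched in C as below. The function names are hypothetical and the code is a standalone illustration of the arithmetic, not the code of InitializeShmemGUCs() or CreateAnonymousSegment().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of huge pages needed to hold "size" bytes.  One extra page is
 * added only when the size is not an exact multiple of the page size.
 */
size_t
huge_pages_needed(size_t size, size_t huge_page_size)
{
    assert(huge_page_size > 0);

    return size / huge_page_size + (size % huge_page_size != 0 ? 1 : 0);
}

/*
 * Round "size" up to a multiple of the huge page size.  Returns false,
 * leaving *result untouched, if the rounded value would overflow size_t.
 */
bool
round_up_to_huge_page(size_t size, size_t huge_page_size, size_t *result)
{
    size_t      remainder;

    assert(huge_page_size > 0 && result != NULL);

    remainder = size % huge_page_size;
    if (remainder == 0)
    {
        *result = size;
        return true;
    }

    /* Check for overflow before adding the padding. */
    if (size > SIZE_MAX - (huge_page_size - remainder))
        return false;

    *result = size + (huge_page_size - remainder);
    return true;
}
```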
Michael Paquier [Fri, 23 Jan 2026 05:17:28 +0000 (14:17 +0900)]
Add WALRCV_CONNECTING state to the WAL receiver
Previously, a WAL receiver freshly started would set its state to
WALRCV_STREAMING immediately at startup, before actually establishing a
replication connection.
This commit introduces a new state called WALRCV_CONNECTING, which is
the state used when the WAL receiver freshly starts, or when a restart
is requested, with a switch to WALRCV_STREAMING once the connection to
the upstream server has been established with COPY_BOTH, meaning that
the WAL receiver is ready to stream changes. This change is useful for
monitoring purposes, especially in environments with a high latency
where a connection could take some time to be established, giving some
room between the [re]start phase and the streaming activity.
From the point of view of the startup process, that flips the shared
memory state of the WAL receiver when it needs to be stopped, the
existing WALRCV_STREAMING and the new WALRCV_CONNECTING states have the
same semantics: the WAL receiver has started and it can be stopped.
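The state handling described above can be sketched with a reduced state machine. The DEMO_ names are hypothetical and the enum is deliberately smaller than the real WAL receiver state list; only the relationship between the connecting and streaming states is shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced, hypothetical version of the WAL receiver states. */
typedef enum DemoWalRcvState
{
    DEMO_WALRCV_STOPPED,
    DEMO_WALRCV_CONNECTING,     /* started, connection not established yet */
    DEMO_WALRCV_STREAMING       /* connection established, ready to stream */
} DemoWalRcvState;

/*
 * View of the startup process: both the connecting and the streaming
 * states mean that the WAL receiver has started and can be stopped.
 */
bool
demo_walrcv_can_be_stopped(DemoWalRcvState state)
{
    return state == DEMO_WALRCV_CONNECTING ||
        state == DEMO_WALRCV_STREAMING;
}

/* Transition done once the connection to the upstream server is set up. */
DemoWalRcvState
demo_walrcv_on_connected(DemoWalRcvState state)
{
    assert(state == DEMO_WALRCV_CONNECTING);
    return DEMO_WALRCV_STREAMING;
}
```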
Based on an initial suggestion from Noah Misch, with some input from me
about the design.
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VQ5tGOSG5TS-Cg+Fb8gLCGFzxJ_eX4qg+WZ3ZPt=FtwQ@mail.gmail.com
Amit Langote [Fri, 23 Jan 2026 01:17:43 +0000 (10:17 +0900)]
Fix bogus ctid requirement for dummy-root partitioned targets
ExecInitModifyTable() unconditionally required a ctid junk column even
when the target was a partitioned table. This led to spurious "could
not find junk ctid column" errors when all children were excluded and
only the dummy root result relation remained.
A partitioned table only appears in the result relations list when all
leaf partitions have been pruned, leaving the dummy root as the sole
entry. Assert this invariant (nrels == 1) and skip the ctid requirement.
Also adjust ExecModifyTable() to tolerate invalid ri_RowIdAttNo for
partitioned tables, which is safe since no rows will be processed in
this case.
Bug: #19099
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19099-e05dcfa022fe553d%40postgresql.org
Backpatch-through: 14
Tom Lane [Thu, 22 Jan 2026 23:35:31 +0000 (18:35 -0500)]
Remove faulty Assert in partitioned INSERT...ON CONFLICT DO UPDATE.
Commit f16241bef mistakenly supposed that INSERT...ON CONFLICT DO
UPDATE rejects partitioned target tables. (This may have been
accurate when the patch was written, but it was already obsolete
when committed.) Hence, there's an assertion that we can't see
ItemPointerIndicatesMovedPartitions() in that path, but the assertion
is triggerable.
Some other places throw error if they see a moved-across-partitions
tuple, but there seems no need for that here, because if we just retry
then we get the same behavior as in the update-within-partition case,
as demonstrated by the new isolation test. So fix by deleting the
faulty Assert. (The fact that this is the fix doubtless explains
why we've heard no field complaints: the behavior of a non-assert
build is fine.)
The TM_Deleted case contains a cargo-culted copy of the same Assert,
which I also deleted to avoid confusion, although I believe that one
is actually not triggerable.
Per our code coverage report, neither the TM_Updated nor the
TM_Deleted case was reached at all by existing tests, so this
patch adds tests for both.
Reported-by: Dmitry Koval <d.koval@postgrespro.ru>
Author: Joseph Koshakow <koshy44@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/f5fffe4b-11b2-4557-a864-3587ff9b4c36@postgrespro.ru
Backpatch-through: 14
Álvaro Herrera [Thu, 22 Jan 2026 16:04:59 +0000 (17:04 +0100)]
Make some use of anonymous unions [reloptions]
In the spirit of commit 4b7e6c73b0df and following, which see for more
details; it appears to have been quite an uncontroversial C11 feature to
use and it makes the code nicer to read.
This commit changes the relopt_value struct.
Author: Peter Eisentraut <peter@eisentraut.org>
Author: Álvaro Herrera <alvherre@kurilemu.de>
Note: Yes, this was written twice independently.
Discussion: https://postgr.es/m/202601192106.zcdi3yu2gzti@alvherre.pgsql
Peter Eisentraut [Thu, 22 Jan 2026 14:17:12 +0000 (15:17 +0100)]
Record range constructor functions in pg_range
When a range type is created, several construction functions are also
created, two for the range type and three for the multirange type.
These have an internal dependency, so they "belong" to the range type.
But there was no way to identify those functions when given a range
type. An upcoming patch needs access to the two- or possibly the
three-argument range constructor function for a given range type. The
only way to do that would be with fragile workarounds like matching
names and argument types. The correct way to do that kind of thing is
to record the links in the system catalogs. This is what this
patch does: it records the OIDs of these five constructor functions in
the pg_range catalog. (Currently, there is no code that makes use of
this.)
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/7d63ddfa-c735-4dfe-8c7a-4f1e2a621058%40eisentraut.org
Peter Eisentraut [Thu, 22 Jan 2026 11:41:52 +0000 (12:41 +0100)]
Mark commented out code as unused
There were many PG_GETARG_* calls, mostly around gin, gist, spgist
code, that were commented out, presumably to indicate that the
argument was unused and to indicate that it wasn't forgotten or
miscounted. But keeping commented-out code updated with refactorings
and style changes is annoying. So this commit changes them to
#ifdef NOT_USED
blocks, which is a style already in use. That way, at least the
indentation and syntax highlighting work correctly, making some of
these blocks much easier to read.
An alternative would be to just delete that code, but there is some
value in making unused arguments explicit, and some of this arguably
serves as example code for index AM APIs.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/328e4371-9a4c-4196-9df9-1f23afc900df%40eisentraut.org
Peter Eisentraut [Thu, 22 Jan 2026 11:41:40 +0000 (12:41 +0100)]
Remove incorrect commented out code
These calls, if activated, are happening before null checks, so they
are not correct. Also, the "in" variable is shadowed later. Remove
them to avoid confusion and bad examples.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: David Geier <geidav.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/328e4371-9a4c-4196-9df9-1f23afc900df%40eisentraut.org
Peter Eisentraut [Thu, 22 Jan 2026 08:19:13 +0000 (09:19 +0100)]
Remove redundant AssertVariableIsOfType uses
The uses of AssertVariableIsOfType in pg_upgrade are unnecessary
because the calls to upgrade_task_add_step() already check the
compatibility of the callback functions.
These were apparently copied from a previous coding style, but similar
removals were already done in commit 30b789eafe.
Peter Eisentraut [Thu, 22 Jan 2026 07:39:39 +0000 (08:39 +0100)]
Detect if flags are needed for C++11 support
Just like we only support compiling with C11, we only support
compiling extensions with C++11 and up. Some compilers support C++11
but don't enable it by default. This detects if flags are needed to
enable C++11 support, in a similar way to how we check the same for
C11 support.
The C++ test extension module added by commit 476b35d4e31 confirmed
that C++11 is effectively required. (This was understood in mailing
list discussions but not recorded anywhere in the source code.)
Author: Jelte Fennema-Nio <postgres@jeltef.nl> Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/E1viDt1-001d7E-2I%40gemulon.postgresql.org
Michael Paquier [Thu, 22 Jan 2026 08:03:21 +0000 (17:03 +0900)]
doc: List all the possible values of pg_stat_wal_receiver.status
The possible values of pg_stat_wal_receiver.status have never been
documented. Note that the status "stopped" will never show up in this
view, hence there is no need to document it.
Issue noticed while discussing a patch that aims to add a new status to
WAL receiver.
Thomas Munro [Thu, 22 Jan 2026 02:43:13 +0000 (15:43 +1300)]
jit: Add missing inline pass for LLVM >= 17.
With LLVM >= 17, transform passes are provided as a string to
LLVMRunPasses. Only two strings were used: "default<O3>" and
"default<O0>,mem2reg".
With previous LLVM versions, an additional inline pass was added when
JIT inlining was enabled without optimization. With LLVM >= 17, the code
would go through llvm_inline, prepare the functions for inlining, but
the generated bitcode would be the same due to the missing inline pass.
This patch restores the previous behavior by adding an inline pass when
inlining is enabled but no optimization is done.
This fixes an oversight introduced by 76200e5e when support for LLVM 17
was added.
Backpatch-through: 14
Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Pierre Ducroquet <p.psql@pinaraf.info> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAO6_XqrNjJnbn15ctPv7o4yEAT9fWa-dK15RSyun6QNw9YDtKg%40mail.gmail.com
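A minimal sketch of the pass selection described in the jit commit above; it is not the actual llvmjit.c code. The function name, its two flag parameters, and the exact spelling of the third pass string are assumptions made for illustration. Only "default<O3>" and "default<O0>,mem2reg" are taken from the commit message.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: choose the pass pipeline string that would be
 * handed to LLVMRunPasses() on LLVM >= 17.
 */
const char *
choose_jit_passes(bool optimize, bool inline_enabled)
{
    if (optimize)
        return "default<O3>";

    /* Inlining requested without optimization: add an inline pass. */
    if (inline_enabled)
        return "default<O0>,mem2reg,inline";    /* assumed spelling */

    return "default<O0>,mem2reg";
}
```

The middle branch is the case the commit message says was missing: inlining enabled, optimization disabled.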
Fujii Masao [Thu, 22 Jan 2026 01:14:12 +0000 (10:14 +0900)]
file_fdw: Support multi-line HEADER option.
Commit bc2f348 introduced multi-line HEADER support for COPY. This commit
extends this capability to file_fdw, allowing multiple header lines to be
skipped.
Because CREATE/ALTER FOREIGN TABLE requires option values to be single-quoted,
this commit also updates defGetCopyHeaderOption() to accept integer values
specified as strings for HEADER option.
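As a rough illustration of accepting an integer written as a string for HEADER, here is a standalone sketch. It is not defGetCopyHeaderOption() itself: the function name, the two sentinel values, and the case-sensitive keyword matching are all assumptions chosen to keep the example short.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_HEADER_MATCH   (-1)  /* hypothetical sentinel for "match" */
#define SKETCH_HEADER_INVALID (-2)  /* hypothetical sentinel for bad input */

/*
 * Parse a HEADER value that arrives as a string, as it does for foreign
 * table options.  Returns the number of header lines to skip (0 for
 * "false"/"off", 1 for "true"/"on"), SKETCH_HEADER_MATCH, or
 * SKETCH_HEADER_INVALID.
 */
int
parse_header_option(const char *value)
{
    char *end;
    long  n;

    if (strcmp(value, "true") == 0 || strcmp(value, "on") == 0)
        return 1;
    if (strcmp(value, "false") == 0 || strcmp(value, "off") == 0)
        return 0;
    if (strcmp(value, "match") == 0)
        return SKETCH_HEADER_MATCH;

    /* Otherwise accept an integer written as a string, e.g. '3'. */
    errno = 0;
    n = strtol(value, &end, 10);
    if (errno != 0 || end == value || *end != '\0' || n < 0 || n > INT_MAX)
        return SKETCH_HEADER_INVALID;

    return (int) n;
}
```

Rejecting trailing characters and negative numbers keeps the accepted set to integers greater than or equal to zero.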
Fujii Masao [Thu, 22 Jan 2026 01:13:07 +0000 (10:13 +0900)]
Improve the error message in COPY with HEADER option.
The error message reported for invalid values of the HEADER option in COPY
command previously used the term "non-negative integer", which is
discouraged by the Error Message Style Guide because it is ambiguous about
whether zero is allowed.
This commit improves the error message by replacing "non-negative integer"
there with "an integer value greater than or equal to zero" to make
the accepted values explicit.
Nathan Bossart [Wed, 21 Jan 2026 20:21:00 +0000 (14:21 -0600)]
Refactor some SIMD and popcount macros.
This commit does the following:
* Removes TRY_POPCNT_X86_64. We now assume that the required CPUID
intrinsics are available when HAVE_X86_64_POPCNTQ is defined, as we
have done since v16 for meson builds when
USE_SSE42_CRC32C_WITH_RUNTIME_CHECK is defined and since v17 when
USE_AVX512_POPCNT_WITH_RUNTIME_CHECK is defined.
* Moves the MSVC check for HAVE_X86_64_POPCNTQ to configure-time.
This way, we set it for all relevant platforms in one place.
* Moves the #defines for USE_SSE2 and USE_NEON to c.h so that they
can be used elsewhere without including simd.h. Consequently, we
can remove the POPCNT_AARCH64 macro.
* Moves the #includes for pg_bitutils.h to below the system headers
in pg_popcount_{aarch64,x86}.c, since we no longer depend on macros
from pg_bitutils.h to decide which system headers to use.
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/aWf_InS1VrbeXAfP%40nathan
Nathan Bossart [Wed, 21 Jan 2026 20:21:00 +0000 (14:21 -0600)]
Rename "fast" and "slow" popcount functions.
Since we now have several implementations of the popcount
functions, let's give them more descriptive names. This commit
replaces "slow" with "portable" and "fast" with "sse42". While the
POPCNT instruction is technically not part of SSE4.2, this naming
scheme is close enough in practice and is arguably easier to
understand than using "popcnt" instead.
Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/aWf_InS1VrbeXAfP%40nathan
Tom Lane [Wed, 21 Jan 2026 20:08:38 +0000 (15:08 -0500)]
Force standard_conforming_strings to always be ON.
Continuing to support this backwards-compatibility feature has
nontrivial costs; in particular it is potentially a security hazard
if an application somehow gets confused about which setting the
server is using. We changed the default to ON fifteen years ago,
which seems like enough time for applications to have adapted.
Let's remove support for the legacy string syntax.
We should not remove the GUC altogether, since client-side code will
still test it, pg_dump scripts will attempt to set it to ON, etc.
Instead, just prevent it from being set to OFF. There is precedent
for this approach (see commit de66987ad).
This patch does remove the related GUC escape_string_warning, however.
That setting does nothing when standard_conforming_strings is on,
so it's now useless. We could leave it in place as a do-nothing
setting to avoid breaking clients that still set it, if there are any.
But it seems likely that any such client is also trying to turn off
standard_conforming_strings, so it'll need work anyway.
The client-side changes in this patch are pretty minimal, because even
though we are dropping the server's support, most of our clients still
need to be able to talk to older server versions. We could remove
dead client code only once we disclaim compatibility with pre-v19
servers, which is surely years away. One change of note is that
pg_dump/pg_dumpall now set standard_conforming_strings = on in their
source session, rather than accepting the source server's default.
This ensures that literals in view definitions and such will be
printed in a way that's acceptable to v19+. In particular,
pg_upgrade will work transparently even if the source installation has
standard_conforming_strings = off. (However, pg_restore will behave
the same as before if given an archive file containing
standard_conforming_strings = off. Such an archive will not be safely
restorable into v19+, but we shouldn't break the ability to extract
valid data from it for use with an older server.)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3279216.1767072538@sss.pgh.pa.us
Álvaro Herrera [Wed, 21 Jan 2026 19:06:01 +0000 (20:06 +0100)]
Allow Boolean reloptions to have ternary values
From the user's point of view these are just Boolean values; from the
implementation side we can now distinguish an option that hasn't been
set. Reimplement the vacuum_truncate reloption using this type.
This could also be used for reloptions vacuum_index_cleanup and
buffering, but those additionally need a per-option "alias" for the
state where the variable is unset (currently the value "auto").
Author: Nikolay Shaplov <dhyan@nataraj.su> Reviewed-by: Timur Magomedov <t.magomedov@postgrespro.ru> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/3474141.usfYGdeWWP@thinkpad-pgpro
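The idea of a Boolean option with a third "unset" state can be sketched as follows. The type name, enumerator names, and the resolve function are hypothetical and are not the identifiers used in the reloptions code; the fallback parameter stands in for whatever default applies when the option was never set.

```c
#include <stdbool.h>

/* Hypothetical three-state type; the names are illustrative only. */
typedef enum
{
    SKETCH_TERNARY_UNSET = -1,
    SKETCH_TERNARY_FALSE = 0,
    SKETCH_TERNARY_TRUE = 1
} sketch_ternary;

/*
 * Resolve a per-table setting against a fallback: an explicitly set
 * value wins, and only an unset option falls through to the fallback.
 */
bool
resolve_ternary(sketch_ternary table_value, bool fallback)
{
    if (table_value == SKETCH_TERNARY_UNSET)
        return fallback;

    return table_value == SKETCH_TERNARY_TRUE;
}
```

With a plain Boolean, an explicit "false" and "never set" would be indistinguishable, which is the distinction the commit message describes.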
Tom Lane [Wed, 21 Jan 2026 18:26:19 +0000 (13:26 -0500)]
Remove useless flag PVC_INCLUDE_CONVERTROWTYPES.
This was introduced in the SJE patch (fc069a3a6), but it doesn't
do anything because pull_var_clause() never tests it. Apparently
it snuck in from somebody's private fork. Remove it again, but
only in HEAD -- seems best to let it be in v18.
Author: Alexander Pyhalov <a.pyhalov@postgrespro.ru> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/70008c19d22e3dd1565ca57f8436c0ba@postgrespro.ru
Álvaro Herrera [Wed, 21 Jan 2026 17:55:43 +0000 (18:55 +0100)]
amcheck: Fix snapshot usage in bt_index_parent_check
We were using SnapshotAny to do some index checks, but that's wrong and
causes spurious errors when used on indexes created by CREATE INDEX
CONCURRENTLY. Fix it to use an MVCC snapshot, and add a test for it.
Backpatch of 6bd469d26aca to branches 14-16. I previously misidentified
the bug's origin: it came in with commit 7f563c09f890 (pg11-era, not
5ae2087202af as claimed previously), so all live branches are affected.
Also take the opportunity to fix some comments that we failed to update
in the original commits and apply pgperltidy. In branch 14, remove the
unnecessary test plan specification (which would have needed to be
changed anyway; cf. commit 549ec201d613.)
Diagnosed-by: Donghang Lin <donghanglin@gmail.com>
Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Backpatch-through: 17
Discussion: https://postgr.es/m/CANtu0ojmVd27fEhfpST7RG2KZvwkX=dMyKUqg0KM87FkOSdz8Q@mail.gmail.com
Peter Eisentraut [Wed, 21 Jan 2026 13:45:20 +0000 (14:45 +0100)]
Remove more leftovers of AIX support
The make variables MKLDEXPORT and POSTGRES_IMP were only used for AIX,
so they should have been removed with commit 0b16bb8776b.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/7a48b624-2236-4e11-9b9d-6a3c658d77a1%40eisentraut.org
Michael Paquier [Wed, 21 Jan 2026 09:18:15 +0000 (18:18 +0900)]
pg_stat_statements: Add more tests for level tracking
This commit adds tests to verify the computation of the nesting level
for two code paths: the planner hook and the ExecutorFinish() hook. The
nesting level is essential to save a correct "toplevel" status for the
added PGSS entries.
The author has noticed that removing the manipulations of nesting_level
in these two code paths did not cause the tests to complain, meaning
that we never had coverage for the assumptions taken by the code.
Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0uK1PSrgf52bWCtDpzaqbWt04o6ZA7zBm6UQyv7vyvf9w@mail.gmail.com
Peter Eisentraut [Wed, 21 Jan 2026 07:32:45 +0000 (08:32 +0100)]
Fix for C++ compatibility
After commit 476b35d4e31, some buildfarm members are complaining about
not recognizing _Noreturn when building the new C++ module
test_cplusplusext. This is not a C++ feature, but it was gated like
But apparently that was not sufficient. Some platforms define
__STDC_VERSION__ even in C++ mode. (In this particular case, it was
g++ on Solaris, but apparently this is also done by some other
platforms, and it is allowed by the C++ standard.) To fix, add a
... && !defined(__cplusplus)
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/CAGECzQR21OnnKiZO_1rLWO0-16kg1JBxnVq-wymYW0-_1cUNtg@mail.gmail.com
John Naylor [Wed, 21 Jan 2026 07:11:40 +0000 (14:11 +0700)]
Update some comments for fasthash
- Add advice about hashing multiple inputs with the incremental API
- Generalize statements that were specific to C strings to include
all variable length inputs, where applicable.
- Update comments about the standalone functions and make it easy to
find them.
Amit Kapila [Wed, 21 Jan 2026 04:58:03 +0000 (04:58 +0000)]
Improve errdetail for logical replication conflict messages.
This change enhances the clarity and usefulness of error detail messages
generated during logical replication conflicts. The following improvements
have been made:
1. Eliminate redundant output: Avoid printing duplicate remote row and
replica identity values for the multiple_unique_conflicts conflict type.
2. Improve message structure: Append tuple values directly to the main
error message, separated by a colon (:), for better readability.
3. Simplify local row terminology: Remove the word 'existing' when
referring to the local row, as this is already implied by context.
4. General code refinements: Apply miscellaneous code cleanups to improve
how conflict detail messages are constructed and formatted.
Michael Paquier [Tue, 20 Jan 2026 22:47:38 +0000 (07:47 +0900)]
pg_stat_statements: Rework test order
The test "squashing" was the last item of the REGRESS list, but
"cleanup" should be the second to last, dropping the extension.
"oldextversions" is the last item.
In passing, the REGRESS list is cleaned up to include one item per line,
so that diffs are minimized when adding new test files.
Noticed while playing with this area of the code.
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Man Zeng <zengman@halodbtech.com>
Discussion: https://postgr.es/m/aW6_Xc8auuu5iAPi@paquier.xyz
Peter Eisentraut [Tue, 20 Jan 2026 15:24:57 +0000 (16:24 +0100)]
tests: Add a test C++ extension module
While we already test that our headers are valid C++ using
headerscheck, it turns out that the macros we define might still
expand to invalid C++ code. This adds a minimal test extension that
is compiled using C++ to test that it's actually possible to build and
run extensions written in C++. Future commits will improve C++
compatibility of some of our macros and add usage of them to this
extension to make sure that they don't regress in the future.
The test module is for the moment disabled when using MSVC. In
particular, the use of designated initializers in PG_MODULE_MAGIC
would require C++20, for which we are currently not set up. (GCC and
Clang support it as extensions.) It is planned to fix this.
Álvaro Herrera [Tue, 20 Jan 2026 15:41:04 +0000 (16:41 +0100)]
Use integer backend type when exec'ing a postmaster child
This way we don't have to walk the entire process type array and
strcmp() the string with the names therein. The integer value can be
directly used as array index instead.
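The difference between the two lookup styles can be sketched with a toy table. The struct, the table contents, and both function names are made up for illustration and do not mirror the postmaster's real process type array.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical child process table; entries are illustrative. */
typedef struct
{
    const char *name;
} sketch_child_kind;

static const sketch_child_kind sketch_child_kinds[] = {
    {"backend"},
    {"autovacuum worker"},
    {"checkpointer"},
};

#define SKETCH_N_KINDS \
    (sizeof(sketch_child_kinds) / sizeof(sketch_child_kinds[0]))

/* Old style: walk the array and strcmp() each name.  Returns -1 if absent. */
int
lookup_by_name(const char *name)
{
    for (size_t i = 0; i < SKETCH_N_KINDS; i++)
    {
        if (strcmp(sketch_child_kinds[i].name, name) == 0)
            return (int) i;
    }
    return -1;
}

/* New style: the integer is the array index; only a bounds check is needed. */
const sketch_child_kind *
lookup_by_index(int kind)
{
    if (kind < 0 || (size_t) kind >= SKETCH_N_KINDS)
        return NULL;

    return &sketch_child_kinds[kind];
}
```

Passing the integer across exec makes the lookup constant-time and removes the string comparisons.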
Remove redundant pg_unreachable() after elog(ERROR) from ExecWaitStmt()
elog(ERROR) never returns. Compilers don't always understand this. So,
sometimes, we have to append pg_unreachable() to keep the compiler quiet
about returning from a non-void function without a value. But
pg_unreachable() is redundant for ExecWaitStmt(), which is void.
Amit Kapila [Tue, 20 Jan 2026 09:40:13 +0000 (09:40 +0000)]
Fix concurrent sequence drops during sequence synchronization.
A recent BF failure showed that commit 7a485bd641 did not handle the case
where a sequence is dropped concurrently during sequence synchronization
on the subscriber. Previously, pg_get_sequence_data() would ERROR out if
the sequence was dropped concurrently. After 7a485bd641, it instead
returns NULL, which leads to an assertion failure on the subscriber.
To handle this change, update sequence synchronization to skip sequences
for which pg_get_sequence_data() returns NULL.
Author: vignesh C <vignesh21@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CALDaNm0FoGdt+1mzua0t-=wYdup5_zmFrvfNf-L=MGBnj9HAcg@mail.gmail.com
Michael Paquier [Tue, 20 Jan 2026 04:13:47 +0000 (13:13 +0900)]
Add routine to free MCVList
This addition is in the same spirit as 32e27bd32082 for MVNDistinct and
MVDependencies, except that we were missing a free routine for the third
type of extended statistics, MCVList. I was not sure if we needed an
equivalent for MCVList, but after more review of the main patch set for
the import of extended statistics, it has become clear that we do.
This is introduced as its own change as this routine can be useful on
its own. This one is a piece that has not been written by Corey
Huinker, I have just noticed it by myself on the way.
Author: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
Bruce Momjian [Tue, 20 Jan 2026 03:59:10 +0000 (22:59 -0500)]
doc: revert "xreflabel" used for PL/Python & libpq chapters
This reverts d8aa21b74ff, which was added for the PG 18 release notes,
and adjusts the PG 18 release notes for this change. This is necessary
since the "xreflabel" affected other references to these chapters.
Michael Paquier [Mon, 19 Jan 2026 23:11:12 +0000 (08:11 +0900)]
pg_stat_statements: Fix crash in list squashing with Vars
When IN/ANY clauses contain both constants and variable expressions, the
optimizer transforms them into separate structures: constants become
an array expression while variables become individual OR conditions.
This transformation was creating an overlap with the token locations,
causing pg_stat_statements query normalization to crash because it
could not calculate the amount of bytes remaining to write for the
normalized query.
This commit disables squashing for mixed IN list expressions when
constructing a scalar array op, by setting list_start and list_end
to -1 when both variables and non-variables are present. Some
regression tests are added to PGSS to verify these patterns.
Robert Haas [Mon, 19 Jan 2026 17:02:08 +0000 (12:02 -0500)]
Don't set the truncation block length greater than RELSEG_SIZE.
When faced with a relation containing more than 1 physical segment
(i.e. >1GB, with normal settings), the previous code could compute a
truncation block length greater than RELSEG_SIZE, which could lead to
restore failures of this form:
file "%s" has truncation block length %u in excess of segment size %u
The fix is simply to clamp the maximum computed truncation_block_length
to RELSEG_SIZE. I have also added some comments to clarify the logic.
The test case was written by Oleg Tkachenko, but I have rewritten its
comments.
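The clamp itself is simple; a standalone sketch is shown here under stated assumptions. The function name and the SKETCH_RELSEG_SIZE constant are hypothetical stand-ins (131072 blocks corresponds to 1GB segments with 8kB blocks), and the input represents whatever length the surrounding code computed.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for RELSEG_SIZE: blocks per segment with default settings. */
#define SKETCH_RELSEG_SIZE 131072u

/*
 * Never report a truncation block length larger than one segment,
 * however large the relation as a whole is.
 */
uint32_t
clamp_truncation_block_length(uint64_t computed_length)
{
    uint32_t result;

    if (computed_length > SKETCH_RELSEG_SIZE)
        result = SKETCH_RELSEG_SIZE;
    else
        result = (uint32_t) computed_length;

    assert(result <= SKETCH_RELSEG_SIZE);
    return result;
}
```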
Richard Guo [Mon, 19 Jan 2026 02:13:23 +0000 (11:13 +0900)]
Fix unsafe pushdown of quals referencing grouping Vars
When checking a subquery's output expressions to see if it's safe to
push down an upper-level qual, check_output_expressions() previously
treated grouping Vars as opaque Vars. This implicitly assumed they
were stable and scalar.
However, a grouping Var's underlying expression corresponds to the
grouping clause, which may be volatile or set-returning. If an
upper-level qual references such an output column, pushing it down
into the subquery is unsafe. This can cause strange results due to
multiple evaluation of a volatile function, or introduce SRFs into
the subquery's WHERE/HAVING quals.
This patch teaches check_output_expressions() to look through grouping
Vars to their underlying expressions. This ensures that any
volatility or set-returning properties in the grouping clause are
detected, preventing the unsafe pushdown.
We do not need to recursively examine the Vars contained in these
underlying expressions. Even if they reference outputs from
lower-level subqueries (at any depth), those references are guaranteed
not to expand to volatile or set-returning functions, because
subqueries containing such functions in their targetlists are never
pulled up.
Backpatch to v18, where this issue was introduced.
Reported-by: Eric Ridge <eebbrr@gmail.com> Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/7900964C-F99E-481E-BEE5-4338774CEB9F@gmail.com
Backpatch-through: 18
Tom Lane [Sun, 18 Jan 2026 19:54:33 +0000 (14:54 -0500)]
Update time zone data files to tzdata release 2025c.
This is pretty pro-forma for our purposes, as the only change
is a historical correction for pre-1976 DST laws in
Baja California. (Upstream made this release mostly to update
their leap-second data, which we don't use.) But with minor
releases coming up, we should be up-to-date.
Michael Paquier [Sun, 18 Jan 2026 08:24:25 +0000 (17:24 +0900)]
Fix error message related to end TLI in backup manifest
The code adding the WAL information included in a backup manifest is
cross-checked with the contents of the timeline history file of the end
timeline. A check based on the end timeline, when it fails, reported
the value of the start timeline in the error message. This error is
fixed to show the correct timeline number in the report.
This error report would be confusing for users if seen, because it would
provide incorrect information, so backpatch all the way down.
Michael Paquier [Sun, 18 Jan 2026 07:11:46 +0000 (16:11 +0900)]
Remove useless asserts in report_namespace_conflict()
An assertion is used in this routine to check that a valid namespace OID
is given by the caller, but it was repeated: once at the top of
the routine and then multiple times in a switch/case. This
commit removes the assertions within the switch/case.
Peter Eisentraut [Fri, 16 Jan 2026 16:21:32 +0000 (17:21 +0100)]
Fix PL/Python build on MSVC with older Meson
Amendment for commit 2bc60f86219. With older Meson versions, we need
to specify the Python include directory directly to cc.check_header
instead of relying on the dependency to pass it through.
Author: Bryan Green <dbryan.green@gmail.com>
Discussion: https://www.postgresql.org/message-id/0de98c41-4145-44c1-aac5-087cf5b3e4a9%40gmail.com
Fix crash in test function on removable_cutoff(NULL)
The function is part of the injection_points test module and only used
in tests. None of the current tests call it with a NULL argument, but
it is supposed to work.
If auto-analyze kicks in at just the right moment, it can hold a
snapshot and prevent the VACUUM command in the test from removing the
deleted tuples. The test needs the tuples to be removed, otherwise no
half-dead page is generated. To fix, introduce a helper procedure to
wait for the removable cutoff to advance, like the one used in the
syscache-update-pruned test for similar purposes.
Thanks to Alexander Lakhin for reproducing and analyzing the test
failure, and Tom Lane for the report.
Andres Freund [Fri, 16 Jan 2026 11:58:35 +0000 (06:58 -0500)]
bufmgr: Avoid spurious compiler warning after fcb9c977aa5
Some compilers, e.g. gcc with -Og or -O1, warn about the wait_event in
BufferLockAcquire() possibly being uninitialized. That can't actually happen,
as the switch() covers all legal lock mode values, but we still need to
silence the warning. We could add a default:, but we'd like to get a warning
if we were to get a new lock mode in the future. So just initialize
wait_event to 0.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/934395.1768518154@sss.pgh.pa.us
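The pattern described in the bufmgr commit (initialize the variable, but keep the switch free of a default: label) looks roughly like this. The enum, its values, and the returned numbers are invented for the sketch and are not the real lock modes or wait events.

```c
#include <assert.h>

/* Hypothetical lock modes; not the real bufmgr definitions. */
typedef enum
{
    SKETCH_LOCK_SHARE,
    SKETCH_LOCK_SHARE_EXCLUSIVE,
    SKETCH_LOCK_EXCLUSIVE
} sketch_lock_mode;

unsigned int
wait_event_for_mode(sketch_lock_mode mode)
{
    /*
     * Initialized only to silence "may be used uninitialized" warnings;
     * every legal mode assigns a value below.
     */
    unsigned int wait_event = 0;

    /* No default: label, so the compiler can flag a newly added mode. */
    switch (mode)
    {
        case SKETCH_LOCK_SHARE:
            wait_event = 101;
            break;
        case SKETCH_LOCK_SHARE_EXCLUSIVE:
            wait_event = 102;
            break;
        case SKETCH_LOCK_EXCLUSIVE:
            wait_event = 103;
            break;
    }

    assert(wait_event != 0);
    return wait_event;
}
```

Adding a default: branch would also silence the warning, but it would hide the missing-case warning that compilers can emit when a new enumerator appears.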
Michael Paquier [Fri, 16 Jan 2026 06:24:59 +0000 (15:24 +0900)]
Improve pg_clear_extended_stats() with incorrect relation/stats combination
Issue fat-fingered in d756fa1019ff, noticed while doing more review of
the main patch set proposed. I have missed the fact that this can be
triggered by specifying an extended stats object that does not match
with the relation specified and already locked. Like the cases where
an object defined in input is missing, the code is changed to issue a
WARNING instead of a confusing cache lookup failure.
Amit Langote [Fri, 16 Jan 2026 05:53:50 +0000 (14:53 +0900)]
Fix rowmark handling for non-relation RTEs during executor init
Commit cbc127917e introduced tracking of unpruned relids to skip
processing of pruned partitions. PlannedStmt.unprunableRelids is
computed as the difference between PlannerGlobal.allRelids and
prunableRelids, but allRelids only contains RTE_RELATION entries.
This means non-relation RTEs (VALUES, subqueries, CTEs, etc.) are
never included in unprunableRelids, and consequently not in
es_unpruned_relids at runtime.
As a result, rowmarks attached to non-relation RTEs were incorrectly
skipped during executor initialization. This affects any DML statement
that has rowmarks on such RTEs, including MERGE with a VALUES or
subquery source, and UPDATE/DELETE with joins against subqueries or
CTEs. When a concurrent update triggers an EPQ recheck, the missing
rowmark leads to incorrect results.
Fix by restricting the es_unpruned_relids membership check to
RTE_RELATION entries only, since partition pruning only applies to
actual relations. Rowmarks for other RTE kinds are now always
processed.
Bug: #19355 Reported-by: Bihua Wang <wangbihua.cn@gmail.com> Diagnosed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Diagnosed-by: Tender Wang <tndrwang@gmail.com>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/19355-57d7d52ea4980dc6@postgresql.org
Backpatch-through: 18
Amit Langote [Fri, 16 Jan 2026 04:01:52 +0000 (13:01 +0900)]
Fix segfault from releasing locks in detached DSM segments
If a FATAL error occurs while holding a lock in a DSM segment (such
as a dshash lock) and the process is not in a transaction, a
segmentation fault can occur during process exit.
The problem sequence is:
1. Process acquires a lock in a DSM segment (e.g., via dshash)
2. FATAL error occurs outside transaction context
3. proc_exit() begins, calling before_shmem_exit callbacks
4. dsm_backend_shutdown() detaches all DSM segments
5. Later, on_shmem_exit callbacks run
6. ProcKill() calls LWLockReleaseAll()
7. Segfault: the lock being released is in unmapped memory
This only manifests outside transaction contexts because
AbortTransaction() calls LWLockReleaseAll() during transaction
abort, releasing locks before DSM cleanup. Background workers and
other non-transactional code paths are vulnerable.
Fix by calling LWLockReleaseAll() unconditionally at the start of
shmem_exit(), before any callbacks run. Releasing locks before
callbacks prevents the segfault - locks must be released before
dsm_backend_shutdown() detaches their memory. This is safe because
after an error, held locks are protecting potentially inconsistent
data anyway, and callbacks can acquire fresh locks if needed.
Also add a comment noting that LWLockReleaseAll() must be safe to
call before LWLock initialization (which it is, since
num_held_lwlocks will be 0), plus an Assert for the post-condition.
This fix aligns with the original design intent from commit 001a573a2,
which noted that backends must clean up shared memory
state (including releasing lwlocks) before unmapping dynamic shared
memory segments.
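The ordering problem in the DSM commit above can be modeled with a toy simulation. Everything here is hypothetical: the functions only stand in for LWLockReleaseAll(), dsm_backend_shutdown(), and shmem_exit(), and a "false" return represents what would be a segfault in the real system.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state standing in for one lock that lives inside a DSM segment. */
static bool segment_mapped = true;
static int  num_held_locks = 0;

void
reset_toy_state(void)
{
    segment_mapped = true;
    num_held_locks = 0;
}

void
acquire_lock_in_segment(void)
{
    assert(segment_mapped);
    num_held_locks++;
}

/* Stand-in for LWLockReleaseAll(): touching a lock needs its memory mapped. */
bool
release_all_locks(void)
{
    if (num_held_locks > 0 && !segment_mapped)
        return false;           /* would be the segfault */
    num_held_locks = 0;
    return true;
}

/* Stand-in for dsm_backend_shutdown(), a before_shmem_exit callback. */
void
detach_segments(void)
{
    segment_mapped = false;
}

/*
 * Toy shmem_exit(): with release_first set, locks are dropped before any
 * callback runs (the fixed ordering); otherwise only at the very end.
 * Returns true if the exit sequence completed safely.
 */
bool
toy_shmem_exit(bool release_first)
{
    bool ok = true;

    if (release_first)
        ok = release_all_locks();

    detach_segments();

    /* Later on_shmem_exit callback, as ProcKill() would do. */
    if (!release_all_locks())
        ok = false;

    return ok;
}
```

In the fixed ordering the late release call finds no held locks, so it never touches unmapped memory.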
Fujii Masao [Fri, 16 Jan 2026 03:37:05 +0000 (12:37 +0900)]
pg_recvlogical: remove unnecessary OutputFsync() return value checks.
Commit 1e2fddfa33d changed OutputFsync() so that it always returns true.
However, pg_recvlogical.c still contained checks of its boolean return
value, which are now redundant.
This commit removes those checks and changes the type of return value of
OutputFsync() to void, simplifying the code.
Suggested-by: Yilin Zhang <jiezhilove@126.com>
Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Mircea Cadariu <cadariu.mircea@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwFeTymZQ7RLvMU6WuDGar8bUQCazg=VOfA-9GeBkg-FzA@mail.gmail.com
Fujii Masao [Fri, 16 Jan 2026 03:36:34 +0000 (12:36 +0900)]
Add test for pg_recvlogical reconnection behavior.
This commit adds a test to verify that data already received and flushed by
pg_recvlogical is not streamed again even after the connection is lost,
reestablished, and logical replication is restarted.
Author: Mircea Cadariu <cadariu.mircea@gmail.com> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwFeTymZQ7RLvMU6WuDGar8bUQCazg=VOfA-9GeBkg-FzA@mail.gmail.com
Fujii Masao [Fri, 16 Jan 2026 03:35:56 +0000 (12:35 +0900)]
Add a new helper function wait_for_file() to Utils.pm.
wait_for_file() waits for the contents of a specified file, starting at an
optional offset, to match a given regular expression. If no offset is
provided, the entire file is checked. The function times out after
$PostgreSQL::Test::Utils::timeout_default seconds. It returns the total
file length on success.
The existing wait_for_log() function contains almost identical logic, but
is limited to reading the cluster's log file. This commit also refactors
wait_for_log() to call wait_for_file() instead, avoiding code duplication.
This helper will be used by upcoming changes.
Suggested-by: Mircea Cadariu <cadariu.mircea@gmail.com>
Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Mircea Cadariu <cadariu.mircea@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwFeTymZQ7RLvMU6WuDGar8bUQCazg=VOfA-9GeBkg-FzA@mail.gmail.com
Fujii Masao [Fri, 16 Jan 2026 03:35:26 +0000 (12:35 +0900)]
pg_recvlogical: Prevent flushed data from being re-sent.
Previously, when pg_recvlogical lost connection, reconnected, and restarted
replication, data that had already been flushed could be streamed again.
This happened because the replication start position used when restarting
replication was taken from the last standby status message, which could be
older than the position of the last flushed data. As a result, some flushed
data newer than the replication start position could exist and be re-sent.
This commit fixes the issue by ensuring all written data is flushed to disk
before restarting replication, and by using the last flushed position as
the replication start point. This prevents already flushed data from being
re-sent.
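The difference between the two restart positions can be sketched with a small Python model. It is not the pg_recvlogical code: the ReceiverState class, its fields, and the integer positions are hypothetical stand-ins for WAL locations.

```python
from dataclasses import dataclass


@dataclass
class ReceiverState:
    """Hypothetical model of the positions tracked by a receiver."""
    written: int = 0    # last position written to the output file
    flushed: int = 0    # last position known to be flushed to disk
    reported: int = 0   # flush position sent in the last status message


def restart_position_before(state):
    # Old behavior: restart from the last reported position, which may be
    # older than what was actually flushed.
    return state.reported


def restart_position_after(state):
    # New behavior: flush everything written so far, then restart from the
    # last flushed position.
    state.flushed = state.written
    return state.flushed
```

With written=300, flushed=200 and reported=100, the old rule restarts at 100, so the already-flushed range up to 200 is streamed again, while the new rule restarts at 300.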
Additionally, previously when the --no-loop option was used, pg_recvlogical
could exit without flushing written data, potentially losing data. To fix
this issue, this commit also ensures all data is flushed to disk before
exiting due to --no-loop.
Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Mircea Cadariu <cadariu.mircea@gmail.com> Reviewed-by: Yilin Zhang <jiezhilove@126.com> Reviewed-by: Dewei Dai <daidewei1970@163.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwFeTymZQ7RLvMU6WuDGar8bUQCazg=VOfA-9GeBkg-FzA@mail.gmail.com
Michael Paquier [Fri, 16 Jan 2026 03:12:26 +0000 (12:12 +0900)]
Fix stability issue with new TAP test of pg_createsubscriber
The test introduced in 639352d904c8 added a direct pg_ctl command to
start a node, a method that is incompatible with the teardown() routine
used at the end of the test as the PID saved in the Cluster object would
prevent the node from being shut down. This can ultimately prevent the
test from performing its cleanup, failing on timeout.
Like pg_ctl's 001_start_stop or ssl_passphrase_callback's 001_testfunc,
this commit changes the test so a direct pg_ctl command is used to stop
the rogue node. That should hopefully be enough to cool down the
buildfarm.
Per report from buildfarm member fairywren, which is the only animal
that is showing this issue.
Michael Paquier [Thu, 15 Jan 2026 23:13:30 +0000 (08:13 +0900)]
Add pg_clear_extended_stats()
This function clears the data associated with an extended statistics
object, so that the object looks newly created.
The caller of this function needs the following arguments for the
extended stats to clear:
- The name of the relation.
- The schema name of the relation.
- The name of the extended stats object.
- The schema name of the extended stats object.
- Whether the stats are inherited or not.
The first two parameters are especially important to ensure a consistent
lookup and ACL checks for the relation on which the extended stats object
to be cleared is based, relying first on a RangeVar lookup where
permissions are checked without locking a relation, which is critical to
prevent denial-of-service attacks when using this kind of function (see
also 688dc6299a5b for a similar concern). The third to fifth arguments
give a way to target the extended stats records to clear.
This has been extracted from a larger patch by the same author, for a
piece which is again useful on its own. I have rewritten large portions
of it. The tests have been extended while discussing this piece,
resulting in what this commit includes. The intention behind this
feature is to add support for the import of extended statistics across
dumps and upgrades, this change building one piece that we will be able
to rely on for the rest of the changes.
Bump catalog version.
Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
Support for disowned lwlocks was added for the benefit of AIO, to be able to
have content locks "owned" by the AIO subsystem. But as of commit fcb9c977aa5,
content locks do not use lwlocks anymore.
It does not seem particularly likely that we need this facility outside of the
AIO use-case, therefore remove the now unused functions.
I did choose to keep the comment added in the aforementioned commit about
lock->owner intentionally being left pointing to the last owner.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/cj5mcjdpucvw4a54hehslr3ctukavrbnxltvuzzhqnimvpju5e@cy3g3mnsefwz
Andres Freund [Thu, 15 Jan 2026 19:54:16 +0000 (14:54 -0500)]
lwlock: Remove ForEachLWLockHeldByMe
As of commit fcb9c977aa5, ForEachLWLockHeldByMe(), introduced in f4ece891fc2f,
is not used anymore, as content locks are now implemented in bufmgr.c. It
doesn't seem that likely that a new user of the functionality will appear all
that soon, making removal of the function seem like the most sensible path. It
can easily be added back if necessary.
Andres Freund [Thu, 15 Jan 2026 19:09:08 +0000 (14:09 -0500)]
bufmgr: Implement buffer content locks independently of lwlocks
Until now buffer content locks were implemented using lwlocks. That has the
obvious advantage of not needing a separate efficient implementation of
locks. However, the time for a dedicated buffer content lock implementation
has come:
1) Hint bits are currently set while holding only a share lock. This leads to
having to copy pages while they are being written out if checksums are
enabled, which is not cheap. We would like to add AIO writes, however once
many buffers can be written out at the same time, it gets a lot more
expensive to copy them, particularly because that copy needs to reside in
shared buffers (for worker mode to have access to the buffer).
In addition, modifying buffers while they are being written out can cause
issues with unbuffered/direct-IO, as some filesystems (like btrfs) do not
like that, due to filesystem internal checksums getting corrupted.
The solution to this is to require a new share-exclusive lock-level to set
hint bits and to write out buffers, making those operations mutually
exclusive. We could introduce such a lock-level into the generic lwlock
implementation, however it does not look like there would be other users,
and it does add some overhead into important code paths.
2) For AIO writes we need to be able to race-freely check whether a buffer is
undergoing IO and whether an exclusive lock on the page can be acquired. That
is rather hard to do efficiently when the buffer state and the lock state
are separate atomic variables. This is a major hindrance to allowing writes
to be done asynchronously.
3) Buffer locks are by far the most frequently taken locks. Optimizing them
specifically for their use case is worth the effort. E.g. by merging
content locks into buffer locks we will be able to release a buffer lock
and pin in one atomic operation.
4) There are more complicated optimizations, like long-lived "super pinned &
locked" pages, that cannot realistically be implemented with the generic
lwlock implementation.
Therefore implement content locks inside bufmgr.c. The lockstate is stored as
part of BufferDesc.state. The implementation of buffer content locks is fairly
similar to lwlocks, with a few important differences:
1) An additional lock-level share-exclusive has been added. This lock-level
conflicts with exclusive locks and itself, but not share locks.
2) Error recovery for content locks is implemented as part of the already
existing private-refcount tracking mechanism in combination with resowners,
instead of a bespoke mechanism as is the case for lwlocks. This means we do
not need to add dedicated error-recovery code paths to release all content
locks (like done with LWLockReleaseAll() for lwlocks).
3) The lock state is embedded in BufferDesc.state instead of having its own
struct.
4) The wakeup logic is a tad more complicated due to needing to support the
additional lock-level.
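The conflict rules for the three lock levels can be written down as a small table. The following Python sketch only models those rules; the LockMode names and the conflicts() helper are illustrative and do not correspond to identifiers in bufmgr.c.

```python
from enum import Enum


class LockMode(Enum):
    """Illustrative names for the three buffer content lock levels."""
    SHARE = "share"
    SHARE_EXCLUSIVE = "share-exclusive"
    EXCLUSIVE = "exclusive"


def conflicts(held, requested):
    """Return True if `requested` cannot be granted while `held` is held."""
    # An exclusive lock conflicts with every lock level.
    if LockMode.EXCLUSIVE in (held, requested):
        return True
    # Share-exclusive conflicts with itself, but not with share locks.
    return (held is LockMode.SHARE_EXCLUSIVE
            and requested is LockMode.SHARE_EXCLUSIVE)
```

Under these rules any number of share lockers can coexist with a single share-exclusive holder, which is what allows readers to keep going while hint-bit setting and buffer write-out exclude each other.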
This commit unfortunately introduces some code that is very similar to the
code in lwlock.c, however the code is not equivalent enough to easily merge
it. The future wins that this commit makes possible seem worth the cost.
As of this commit nothing uses the new share-exclusive lock mode. It will be
used in a future commit. It seemed too complicated to introduce the lock-level
in a separate commit.
It's worth calling out one wart in this commit: Despite content locks not
being lwlocks anymore, they continue to use PGPROC->lw* - that seemed better
than duplicating the relevant infrastructure.
Another thing worth pointing out is that, after this change, content locks are
not reported as LWLock wait events anymore, but as new wait events in the
"Buffer" wait event class (see also 6c5c393b740). The old BufferContent lwlock
tranche has been removed.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-by: Greg Burd <greg@burd.me> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Andres Freund [Thu, 15 Jan 2026 17:53:50 +0000 (12:53 -0500)]
bufmgr: Change BufferDesc.state to be a 64-bit atomic
This is motivated by wanting to merge buffer content locks into
BufferDesc.state in a future commit, rather than having a separate lwlock (see
commit c75ebc657ff for more details). As this change is rather mechanical, it
seems to make sense to split it out into a separate commit, for easier review.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Tom Lane [Thu, 15 Jan 2026 19:12:03 +0000 (14:12 -0500)]
Optimize LISTEN/NOTIFY via shared channel map and direct advancement.
This patch reworks LISTEN/NOTIFY to avoid waking backends that have
no need to process the notification messages we just sent.
The primary change is to create a shared hash table that tracks
which processes are listening to which channels (where a "channel" is
defined by a database OID and channel name). This allows a notifying
process to accurately determine which listeners are interested,
replacing the previous weak approximation that listeners in other
databases couldn't be interested.
Secondly, if a listener is known not to be interested and is
currently stopped at the old queue head, we avoid waking it at all
and just directly advance its queue pointer past the notifications
we inserted.
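The two ideas above, the channel map and direct advancement, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the async.c implementation: the dictionary layout, the integer queue positions, and the notify() helper are hypothetical, and the sketch leaves uninterested listeners that are behind the old queue head untouched because the text above does not describe their handling.

```python
def notify(channel_map, positions, queue_head, db_oid, channel, n_messages):
    """Model one NOTIFY that appends `n_messages` entries to the queue.

    channel_map: {(db_oid, channel_name): set of listener PIDs}
    positions:   {listener PID: queue position}, updated in place
    Returns (new_queue_head, sorted list of PIDs to wake).
    """
    new_head = queue_head + n_messages
    interested = channel_map.get((db_oid, channel), set())
    to_wake = []
    for pid in sorted(positions):
        if pid in interested:
            # Interested listeners must be woken to read the messages.
            to_wake.append(pid)
        elif positions[pid] == queue_head:
            # Uninterested and already caught up: skip the wakeup and
            # advance its pointer past the new messages directly.
            positions[pid] = new_head
    return new_head, to_wake
```

With many listeners and few of them interested in a given channel, most iterations take the second branch, which is where the reduction in wakeups comes from.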
These changes permit very significant improvements (integer multiples)
in NOTIFY throughput, as well as a noticeable reduction in latency,
when there are many listeners but only a few are interested in any
specific message. There is no improvement for the simplest case where
every listener reads every message, but any loss seems below the noise
level.
Author: Joel Jacobson <joel@compiler.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6899c044-4a82-49be-8117-e6f669765f7e@app.fastmail.com
Fix 'unexpected data beyond EOF' on replica restart
On restart, a replica can fail with an error like 'unexpected data
beyond EOF in block 200 of relation T/D/R'. These are the steps to
reproduce it:
- A relation has a size of 400 blocks.
- Blocks 201 to 400 are empty.
- Block 200 has two rows.
- Blocks 100 to 199 are empty.
- A restartpoint is done.
- Vacuum truncates the relation to 200 blocks.
- An FPW deletes a row in block 200.
- A checkpoint is done.
- An FPW deletes the last row in block 200.
- Vacuum truncates the relation to 100 blocks.
- The replica restarts.
When the replica restarts:
- The relation on disk starts at 100 blocks, because all the
truncations were applied before restart.
- The first truncate to 200 blocks is replayed. It silently fails, but
it will still (incorrectly!) update the cache size to 200 blocks.
- The first FPW on block 200 is applied. XLogReadBufferForRead relies
on the cached size and incorrectly assumes that the page already
exists in the file, and thus won't extend the relation.
- The online checkpoint record is replayed, calling smgrdestroyall
which causes the cached size to be discarded.
- The second FPW on block 200 is applied. This time, the detected size
is 100 blocks, so an extension is attempted. However, block 200 is
already present in the buffer cache due to the first FPW. This
triggers the 'unexpected data beyond EOF'.
To fix, update the cached size in SmgrRelation with the current size
rather than the requested new size, when the requested new size is
greater.
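The effect of the fix on the cached size can be sketched as follows. This is a simplified Python model, not the smgr code: both helper functions are hypothetical and represent a relation only by its block count.

```python
def replay_truncate_before(file_blocks, requested_blocks):
    """Old behavior: the cached size always becomes the requested size.

    Returns (blocks on disk, cached size).
    """
    # Truncating to a larger size silently does nothing to the file.
    on_disk = min(file_blocks, requested_blocks)
    return on_disk, requested_blocks


def replay_truncate_after(file_blocks, requested_blocks):
    """Fixed behavior: never cache a size larger than the file really has."""
    if requested_blocks > file_blocks:
        # Keep the current size in the cache instead of the requested one.
        return file_blocks, file_blocks
    return requested_blocks, requested_blocks
```

In the scenario above the file has 100 blocks when the truncation to 200 is replayed: the old rule caches 200 while the file stays at 100, and the fixed rule caches 100.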
Andres Freund [Thu, 15 Jan 2026 15:17:51 +0000 (10:17 -0500)]
aio: io_uring: Fix danger of completion getting reused before being read
We called io_uring_cqe_seen(..., cqe) before reading cqe->res. That allows the
completion to be reused, which in turn could lead to cqe->res being
overwritten. The window for that is very narrow and the likelihood of it
happening is very low, as we should never actually utilize all CQEs, but the
consequences would be bad.
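The hazard can be illustrated with a toy completion ring in Python. It models only the ordering problem and is not the liburing API: CompletionRing, its methods, and the after_seen callback used to simulate a completion arriving in the window are all hypothetical.

```python
class CompletionRing:
    """Toy ring whose slots are overwritten in place once marked as seen."""

    def __init__(self, size):
        self.slots = [{"res": None} for _ in range(size)]
        self.head = 0  # next entry to consume
        self.tail = 0  # next entry to produce

    def post(self, res):
        if self.tail - self.head == len(self.slots):
            raise OverflowError("no free completion slot")
        self.slots[self.tail % len(self.slots)]["res"] = res
        self.tail += 1

    def peek(self):
        return self.slots[self.head % len(self.slots)]

    def seen(self):
        self.head += 1  # the slot may be reused from now on


def consume_unsafe(ring, after_seen=lambda: None):
    cqe = ring.peek()
    ring.seen()
    after_seen()        # window in which the slot can be overwritten
    return cqe["res"]   # may read the result of a later completion


def consume_safe(ring, after_seen=lambda: None):
    cqe = ring.peek()
    res = cqe["res"]    # copy the result before releasing the slot
    ring.seen()
    after_seen()
    return res
```

With a one-slot ring and a second completion posted inside the window, consume_unsafe() returns the second result while consume_safe() returns the first.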
Wake up autovacuum launcher from postmaster when a worker exits
When an autovacuum worker exits, the launcher needs to be notified
with SIGUSR2, so that it can rebalance and possibly launch a new
worker. The launcher must be notified only after the worker has
finished ProcKill(), so that the worker slot is available for a new
worker. Before this commit, the autovacuum worker was responsible for
that, which required a slightly complicated dance to pass the
launcher's PID from FreeWorkerInfo() to ProcKill() in a global
variable.
Simplify that by moving the responsibility of the signaling to the
postmaster. The postmaster was already doing it when it failed to fork
a worker process, so it seems logical to make it responsible for
notifying the launcher on worker exit too. That's also how the
notification on background worker exit is done.
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: li carol <carol.li2025@outlook.com>
Discussion: https://www.postgresql.org/message-id/a5e27d25-c7e7-45d5-9bac-a17c8f462def@iki.fi
Add check for invalid offset at multixid truncation
If a multixid with zero offset is left behind after a crash, and that
multixid later becomes the oldest multixid, truncation might try to
look up its offset and read the zero value. In the worst case, we
might incorrectly use the zero offset to truncate valid SLRU segments
that are still needed. I'm not sure if that can happen in practice, or
if there are some other lower-level safeguards or incidental reasons
that prevent the caller from passing an unwritten multixid as the
oldest multi. But better safe than sorry, so let's add an explicit
check for it.
In stable branches, we should perhaps do the same check for
'oldestOffset', i.e. the offset of the old oldest multixid (in master,
'oldestOffset' is gone). But if the old oldest multixid has an invalid
offset, the damage has been done already, and we would never advance
past that point. It's not clear what we should do in that case. The
check that this commit adds will prevent such a multixid with invalid
offset from becoming the oldest multixid in the first place, which
seems enough for now.
Remove some unnecessary code from multixact truncation
With 64-bit multixact offsets, PerformMembersTruncation() doesn't need
the starting offset anymore. The 'oldestOffset' value that
TruncateMultiXact() calculates is no longer used for anything. Remove
it, and the code to calculate it.
'oldestOffset' was included in the WAL record as 'startTruncMemb',
which sounds nice if you e.g. look at the WAL with pg_waldump, but it
was also confusing because we didn't actually use the value for
determining what to truncate. Replaying the WAL would remove all
segments older than 'endTruncMemb', regardless of
'startTruncMemb'. The 'startTruncOff' stored in the WAL record was
similarly unnecessary even before 64-bit multixid offsets, it was
stored just for the sake of symmetry with 'startTruncMemb'. Remove
both from the WAL record, and rename the remaining 'endTruncOff' to
'oldestMulti' and 'endTruncMemb' to 'oldestOffset', for consistency
with the variable names used for them in other places.