Nathan Bossart [Wed, 13 May 2026 14:10:50 +0000 (09:10 -0500)]
pgindent: Fix spacing after != when member name matches typedef.
When a struct member name matches a registered typedef, pgindent
removes the space after "!=" (and some other operators), like so:
entry->dsh.dsa_handle !=DSA_HANDLE_INVALID
The problem is that the related code in lexi.c sets last_u_d to
true before jumping to found_typename, causing the next operator to
be classified as unary and suppressing the following space. This
is correct for type names, but not for struct members. For
example, "Datum *x" needs "*" to be unary to suppress the space
before "x". To fix, only set last_u_d before jumping to
found_typename if the typedef name doesn't appear after "." or
"->".
Note that this does not bump INDENT_VERSION. We'll do that just
once after some other changes to pg_bsd_indent are committed.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/aS9hkwnkWf3dZIA_%40nathan
Peter Eisentraut [Wed, 13 May 2026 11:34:09 +0000 (13:34 +0200)]
Fix FOR PORTION OF with non-updatable view columns
Both UPDATE and DELETE were failing to test that the application-time
column was updatable. The column is not part of
perminfo->updatedCols, because it should not be checked for
permissions. And it needs to be checked in the DELETE case as well,
since we might insert leftovers with a value for that column.
Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Co-authored-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACJufxFRqg8%3DgbZ-Q6ZS_UQ%2BYdwfZpk%2B9rf7jgWrk8m4RMUm%3DA%40mail.gmail.com
Michael Paquier [Wed, 13 May 2026 06:39:44 +0000 (15:39 +0900)]
pg_stat_statements: Set PlannedStmt to NULL after nested utility execution
As mentioned in 8268e41aca23, pgss_ProcessUtility() may free the
PlannedStmt after an internal ROLLBACK. This commit sets the
PlannedStmt "pstmt" to NULL once it is no longer safe to rely on it,
making bugs similar to the one fixed by the previous commit easier to
detect.
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/0A9A8DAC-BC3C-4C7A-9504-2C6050405544@anarazel.de
Michael Paquier [Wed, 13 May 2026 05:43:42 +0000 (14:43 +0900)]
Add more tests for corrupted data with pglz_decompress()
Two cases fixed by 2b5ba2a0a141 were not covered. To emulate the
handling of corrupted data, add tests for:
- a control bit set with a valid 2-byte match tag whose offset is 0.
- a control bit set with a valid 2-byte match tag whose offset exceeds
the output written.
Fujii Masao [Wed, 13 May 2026 02:44:31 +0000 (11:44 +0900)]
Fix stale COPY progress during logical replication table sync
Previously, pg_stat_progress_copy in the subscriber could continue to show
the initial COPY operation for logical replication table synchronization as
active even after the data copy had finished. The stale progress entry
remained visible until synchronization caught up with the publisher.
This happened because the table synchronization code called BeginCopyFrom()
and CopyFrom(), but failed to call EndCopyFrom() afterward.
This commit fixes the issue by adding the missing EndCopyFrom() call so that
the COPY progress state in the subscriber is cleared as soon as the initial
data copy completes.
Tom Lane [Tue, 12 May 2026 19:04:42 +0000 (15:04 -0400)]
De-obfuscate the comment in tsrank.c's calc_rank_or().
Oleg's original comment was intelligible only to him.
Aleksander has reverse-engineered what seems like a plausible
explanation of what the code is trying to do, so replace the
comment with that. (Also, re-order the final expression to
match the new comment.)
In passing, this makes the comment satisfy our usual formatting
conventions. pgindent has let it pass as-is so far, but planned
changes would mess it up without some sort of intervention.
Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAJ7c6TO0xvunpeOv89i1eKQBhKF9=GEETkTz+yAGs1xGYH25MQ@mail.gmail.com
Michael Paquier [Tue, 12 May 2026 04:36:38 +0000 (13:36 +0900)]
pg_stat_statements: Fix potential use-after-free of PlannedStmt
pgss_ProcessUtility() included a reference to a portion of a PlannedStmt
after the point where this data's structure could have been freed,
causing an incorrect memory access. There was a comment documenting
this requirement, missed in 3357471cf9f5.
This commit includes a test able to make valgrind complain with a
PlannedStmt freed by an internal ROLLBACK query. Similarly to what is
mentioned in 495e73c2079e, this can be triggered by using the extended
query protocol, something that can now be tested thanks to the recent
meta-command additions in psql. This commit mentions potential other
cases, but as far as I can see the extended protocol case with an
internal ROLLBACK is the only problematic pattern reachable in practice.
Issue introduced by 3357471cf9f5, gone unnoticed due to a lack of test
coverage. The fix is authored by Chao, my contribution being the new
test.
Author: Chao Li <li.evan.chao@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/2F91906A-F2B5-4A6B-9695-D136957D4545@gmail.com
Álvaro Herrera [Mon, 11 May 2026 16:17:46 +0000 (18:17 +0200)]
Fix REPACK with WITHOUT OVERLAPS replica identity indexes
REPACK replay builds scan keys for the replica identity index, but it
hard-coded BTEqualStrategyNumber when looking up the equality operator.
That is not correct for non-btree identity indexes, such as the GiST
indexes created for WITHOUT OVERLAPS primary keys. In addition,
find_target_tuple() accepted the first tuple returned by the identity
index scan, which is unsafe for lossy index scans because the index AM may
return false positives with xs_recheck set.
Fix this by using IndexAmTranslateCompareType() to translate COMPARE_EQ
to the equality strategy number for the index AM, and by continuing the
scan when recheck is required until a candidate tuple matches the locator
tuple on all replica identity key columns.
The recheck uses the same equality operator functions as the identity
index scan keys, preserving ScanKey argument ordering.
Tom Lane [Mon, 11 May 2026 16:12:03 +0000 (12:12 -0400)]
Remove test cases for field overflows in intarray and ltree.
These checks are failing in the buildfarm, reporting stack overflows
rather than the expected errors, though seemingly only on ppc64 and
s390x platforms. Perhaps there is something off about our tests
for stack depth on those architectures? But there's no time to
debug that right now, and surely these tests aren't too essential.
Revert for now and plan to revisit after the release dust settles.
Nathan Bossart [Mon, 11 May 2026 12:13:47 +0000 (05:13 -0700)]
refint: Fix SQL injection and buffer overruns.
Maliciously crafted key value updates could achieve SQL injection
within check_foreign_key(). To fix, ensure new key values are
properly quoted and escaped in the internally generated SQL
statements. While at it, avoid potential buffer overruns by
replacing the stack buffers for internally generated SQL statements
with StringInfo.
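The quoting principle behind the fix can be illustrated with a standalone sketch. This is not the actual refint.c code, which uses the server's own quoting infrastructure; it only shows why a value must have its embedded quotes doubled before being spliced into generated SQL:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative sketch of escaping a value for use as a SQL string
 * literal: every single quote is doubled and the result is wrapped in
 * quotes, so "O'Brien" becomes 'O''Brien' and an attacker-supplied
 * quote cannot terminate the literal early.  (Backslash handling under
 * non-standard string syntax is deliberately out of scope here.)
 */
static char *
quote_literal_sketch(const char *value)
{
	size_t		len = strlen(value);
	/* worst case: every char is a quote, plus 2 wrapping quotes and NUL */
	char	   *out = malloc(2 * len + 3);
	char	   *p = out;

	*p++ = '\'';
	for (const char *s = value; *s; s++)
	{
		if (*s == '\'')
			*p++ = '\'';		/* double embedded quotes */
		*p++ = *s;
	}
	*p++ = '\'';
	*p = '\0';
	return out;
}
```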
Reported-by: Nikolay Samokhvalov <nik@postgres.ai>
Author: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Security: CVE-2026-6637
Backpatch-through: 14
Nathan Bossart [Mon, 11 May 2026 12:13:47 +0000 (05:13 -0700)]
Mark PQfn() unsafe and fix overrun in frontend LO interface.
When result_is_int is set to 0, PQfn() cannot validate that the
result fits in result_buf, so it will write data beyond the end of
the buffer when the server returns more data than requested. Since
this function is insecurable and obsolete, add a warning to the top
of the pertinent documentation advising against its use.
The only in-tree caller of PQfn() is the frontend large object
interface. To fix that, add a buf_size parameter to
pqFunctionCall3() that is used to protect against overruns, and use
it in a private version of PQfn() that also accepts a buf_size
parameter.
Reported-by: Yu Kunpeng <yu443940816@live.com>
Reported-by: Martin Heistermann <martin.heistermann@unibe.ch>
Author: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Etsuro Fujita <etsuro.fujita@gmail.com>
Security: CVE-2026-6477
Backpatch-through: 14
Fix integer overflow in array_agg(), when the array grows too large
If you accumulate many arrays full of NULLs, you could overflow
'nitems' before reaching the MaxAllocSize limit on the allocations.
Add an explicit check that the number of items doesn't grow too large.
With more than MaxArraySize items, getting the final result with
makeArrayResultArr() would fail anyway, so better to error out early.
Reported-by: Xint Code
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 14
Security: CVE-2026-6473
Tom Lane [Mon, 11 May 2026 12:13:47 +0000 (05:13 -0700)]
Fix integer-overflow and alignment hazards in locale-related code.
pg_locale_icu.c was full of places where a very long input string
could cause integer overflow while calculating a buffer size,
leading to buffer overruns.
It also was cavalier about using char-type local arrays as buffers
holding arrays of UChar. The alignment of a char[] variable isn't
guaranteed, so that this risked failure on alignment-picky platforms.
The lack of complaints suggests that such platforms are very rare
nowadays; but it's likely that we are paying a performance price on
rather more platforms. Declare those arrays as UChar[] instead,
keeping their physical size the same.
pg_locale_libc.c's strncoll_libc_win32_utf8() also had the
disease of assuming it could double or quadruple the input
string length without concern for overflow.
Reported-by: Xint Code
Reported-by: Pavel Kohout <pavel.kohout@aisle.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 14
Security: CVE-2026-6473
Michael Paquier [Mon, 11 May 2026 12:13:47 +0000 (05:13 -0700)]
Prevent path traversal in pg_basebackup and pg_rewind
pg_rewind and pg_basebackup could be fed paths by rogue endpoints
that, when received, could be used to overwrite files on the client,
achieving path traversal.
There were two areas in the tree that were sensitive to this problem:
- pg_basebackup, through the astreamer code, where no validation was
performed before building an output path when streaming tar data. This
is an issue in v15 and newer versions.
- pg_rewind file operations for paths received through libpq, for all
the stable branches supported.
In order to address this problem, this commit adds a helper function in
path.c that reuses path_is_relative_and_below_cwd() after applying
canonicalize_path(). This can be used to validate the paths received
from a connection endpoint. A path is considered invalid if either of
the following conditions is satisfied:
- The path is absolute.
- The path includes a direct parent-directory reference.
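A hypothetical standalone sketch of that validation rule (the real helper lives alongside canonicalize_path() in src/port; the canonicalization step and Windows drive-letter handling are omitted here, and the function name is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the rule described above: a path received from a remote
 * endpoint is rejected if it is absolute or contains a parent-directory
 * ("..") component.  The real PostgreSQL code applies
 * canonicalize_path() first; that step is not reproduced here.
 */
static bool
path_is_safe(const char *path)
{
	const char *p = path;

	if (*p == '/')				/* absolute path */
		return false;

	while (*p)
	{
		/* does this component consist of exactly ".."? */
		if (p[0] == '.' && p[1] == '.' && (p[2] == '/' || p[2] == '\0'))
			return false;
		/* advance past this component and any slashes */
		while (*p && *p != '/')
			p++;
		while (*p == '/')
			p++;
	}
	return true;
}
```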
Reported-by: XlabAI Team of Tencent Xuanwu Lab
Reported-by: Valery Gubanov <valerygubanov95@gmail.com>
Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 14
Security: CVE-2026-6475
Nathan Bossart [Mon, 11 May 2026 12:13:47 +0000 (05:13 -0700)]
Avoid overflow in size calculations in formatting.c.
A few functions in this file were incautious about multiplying a
possibly large integer by a factor more than 1 and then using it as
an allocation size. This is harmless on 64-bit systems where we'd
compute a size exceeding MaxAllocSize and then fail, but on 32-bit
systems we could overflow size_t, leading to an undersized
allocation and buffer overrun. To fix, use palloc_array() or
mul_size() instead of handwritten multiplication.
Reported-by: Sven Klemm <sven@tigerdata.com>
Reported-by: Xint Code
Author: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Tatsuo Ishii <ishii@postgresql.org>
Security: CVE-2026-6473
Backpatch-through: 14
Nathan Bossart [Mon, 11 May 2026 12:13:47 +0000 (05:13 -0700)]
Check CREATE privilege on multirange type schema in CREATE TYPE.
This omission allowed roles to create multirange types in any
schema, potentially leading to privilege escalations. Note that
when a multirange type name is not specified in CREATE TYPE, it is
automatically placed in the range type's schema, which is checked
at the beginning of DefineRange().
Nathan Bossart [Mon, 11 May 2026 12:13:47 +0000 (05:13 -0700)]
pg_createsubscriber: Obstruct SQL injection via subscription names.
drop_existing_subscription() neglected to escape the subscription
name when generating its query string. To fix, use
PQescapeIdentifier() to construct a properly escaped name, and use
it in the ALTER SUBSCRIPTION and DROP SUBSCRIPTION commands.
Michael Paquier [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Fix MCV input array checks in statistics restore functions
The SQL functions for the restore of attribute and expression statistics
accept "most_common_vals" and "most_common_freqs" as independent arrays.
The planner assumes these have the same number of elements, but it was
possible to insert data into the catalogs that would cause an over-read
when the catalog data is loaded by the planner.
There were two holes in the stats restore logic:
- Both arrays should match in size.
- The input array must be one-dimensional, and it should match what is
delivered by pg_dump when scanning the pg_stats catalogs.
The multivariate extended statistics MCV path (import_mcv) already
validated these inputs via check_mcvlist_array(), and is not affected.
These problems exist in v18 and newer versions for the restore of
attribute statistics. These problems affect only HEAD for the restore
of the expression statistics.
Reported-by: Jeroen Gui <jeroen.gui1@proton.me>
Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Security: CVE-2026-6575
Backpatch-through: 18
Tom Lane [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Guard against unsafe conditions in usage of pg_strftime().
Although pg_strftime() has defined error conditions, no callers bother
to check for errors. This is problematic because the output string is
very likely not null-terminated if an error occurs, so that blindly
using it is unsafe. Rather than trusting that we can find and fix all
the callers, let's alter the function's API spec slightly: make it
guarantee a null-terminated result so long as maxsize > 0.
Furthermore, if we do get an error, let's make that null-terminated
result be an empty string. We could instead truncate at the buffer
length, but that risks producing mis-encoded output if the tz_name
string contains multibyte characters. It doesn't seem reasonable for
src/timezone/ to make use of our encoding-aware truncation logic.
Also, the only really likely source of a failure is a user-supplied
timezone name that is intentionally trying to overrun our buffers.
I don't feel a need to be particularly friendly about that case.
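The new API guarantee can be mirrored with a standalone wrapper; this uses the C library's strftime() for demonstration, whereas the actual change is to pg_strftime() in src/timezone/:

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/*
 * Illustrative wrapper matching the guarantee described above: as long
 * as maxsize > 0, the result buffer is always NUL-terminated, and on
 * failure it holds an empty string rather than undefined contents.
 */
static size_t
strftime_safe(char *buf, size_t maxsize, const char *fmt,
			  const struct tm *tm)
{
	size_t		len;

	if (maxsize == 0)
		return 0;
	len = strftime(buf, maxsize, fmt, tm);
	if (len == 0)
		buf[0] = '\0';			/* error or overflow: guarantee empty string */
	return len;
}
```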
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Backpatch-through: 14
Security: CVE-2026-6474
Tom Lane [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Avoid passing unintended format codes to snprintf().
timeofday() assumed that the output of pg_strftime() could not contain
% signs, other than the one it explicitly asks for with %%. However,
we don't have that guarantee with respect to the time zone name (%Z).
A crafted time zone setting could abuse the subsequent snprintf()
call, resulting in crashes or disclosure of server memory.
To fix, split the pg_strftime() call into two and then treat the
outputs as literal strings, not a snprintf format string. The
extra pg_strftime() call doesn't really cost anything, since the
bulk of the conversion work was done by pg_localtime().
Also, adjust buffer widths so that we're not risking string truncation
during the snprintf() step, as that would create a hazard of producing
mis-encoded output.
This also fixes a latent portability issue: the format string expects
an int, but tp.tv_usec is long int on many platforms.
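The safe pattern can be shown with a standalone sketch (the function name is illustrative, not the timeofday() code itself): any string that may contain '%' is passed as a data argument to a fixed format, never as the format itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative safe pattern for the fix described above: the two
 * strftime() outputs are passed as arguments to a fixed "%s %s" format,
 * so any '%' sequences a crafted time zone name smuggles in are copied
 * literally instead of being interpreted as format codes.
 */
static void
render_line(char *buf, size_t bufsize,
			const char *datepart, const char *zonepart)
{
	snprintf(buf, bufsize, "%s %s", datepart, zonepart);
}
```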
Reported-by: Xint Code
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Backpatch-through: 14
Security: CVE-2026-6474
Noah Misch [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Fix SQL injection in logical replication origin checks.
ALTER SUBSCRIPTION ... REFRESH PUBLICATION interpolates schema and
relation names into SQL without quoting them. A crafted subscriber
relation name can inject arbitrary SQL on the publisher. Test such a
name. Back-patch to v16, where commit 875693019053b8897ec3983e292acbb439b088c3 first appeared.
Reported-by: Pavel Kohout <pavel.kohout@aisle.com>
Author: Pavel Kohout <pavel.kohout@aisle.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Backpatch-through: 16
Security: CVE-2026-6638
Michael Paquier [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Apply timingsafe_bcmp() in authentication paths
This commit applies timingsafe_bcmp() to authentication paths that
handle attributes or data previously compared with memcmp() or strcmp(),
which are sensitive to timing attacks.
The following data is affected by this change, some of it in the
backend and some in the frontend:
- For a SCRAM or MD5 password, the computed key or the MD5 hash compared
with a password during a plain authentication.
- For a SCRAM exchange, the stored key, the client's final nonce and the
server nonce.
- For RADIUS (up to v18), the encrypted password.
- For MD5 authentication, the MD5(MD5()) hash.
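The property timingsafe_bcmp() provides can be sketched in a few lines; this mirrors the BSD function's behavior, though the real implementation lives in src/port/ (or comes from the OS):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Constant-time comparison in the style of timingsafe_bcmp(): the
 * running time depends only on len, never on where the buffers first
 * differ, which is what defeats the timing side channel described
 * above.  Returns 0 if equal, nonzero otherwise.
 */
static int
ct_bcmp(const void *a, const void *b, size_t len)
{
	const unsigned char *pa = a;
	const unsigned char *pb = b;
	unsigned char diff = 0;
	size_t		i;

	for (i = 0; i < len; i++)
		diff |= pa[i] ^ pb[i];	/* accumulate differences without branching */

	return diff != 0;
}
```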
Reported-by: Joe Conway <mail@joeconway.com>
Security: CVE-2026-6478
Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Backpatch-through: 14
Tom Lane [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Guard against overflow in "left" fields of query_int and ltxtquery.
contrib/intarray's query_int type uses an int16 field to hold the
offset from a binary operator node to its left operand. However, it
allows the number of nodes to be as much as will fit in MaxAllocSize,
so there is a risk of overflowing int16 depending on the precise shape
of the tree. Simple right-associative cases like "a | b | c | ..."
work fine, so we should not solve this by restricting the overall
number of nodes. Instead add a direct test of whether each individual
offset is too large.
contrib/ltree's ltxtquery type uses essentially the same logic and
has the same 16-bit restriction.
(The core backend's tsquery.c has a variant of this logic too, but
in that case the target field is 32 bits, so it is okay so long
as varlena datums are restricted to 1GB.)
In v16 and up, these types support soft error reporting, so we have
to complicate the recursive findoprnd function's API a bit to allow
the complaint to be reported softly. v14/v15 don't need that.
Undocumented and overcomplicated code like this makes my head hurt,
so add some comments and simplify while at it.
Reported-by: Xint Code
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Backpatch-through: 14
Security: CVE-2026-6473
Michael Paquier [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Fix unbounded recursive handling of SSL/GSS in ProcessStartupPacket()
The handling of SSL and GSS negotiation messages in
ProcessStartupPacket() could cause unbounded recursion in the backend,
ultimately crashing the server, as negotiation attempts were not
tracked across multiple calls processing startup packets.
A malicious client could therefore alternate rejected SSL and GSS
requests indefinitely, each adding a stack frame, until the backend
crashed with a stack overflow, taking down a server.
This commit addresses this issue by modifying ProcessStartupPacket() so
that already-processed negotiation attempts are tracked, preventing
unbounded recursion. A TAP test is added to check this problem,
stacking multiple SSL and GSS negotiation attempts.
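A minimal model of the fix, assuming the tracking is done with simple per-connection flags (the type and function names here are illustrative, not the actual ones used in the commit):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative negotiation-request codes. */
typedef enum { REQ_SSL, REQ_GSS, REQ_STARTUP } ReqType;

typedef struct
{
	bool		ssl_done;		/* already answered an SSL request? */
	bool		gss_done;		/* already answered a GSS request? */
} ConnState;

/*
 * Minimal model of the fix: rather than recursing for every rejected
 * negotiation request, remember which negotiations have already been
 * attempted on this connection and refuse repeats, bounding the work
 * regardless of how many requests a malicious client sends.
 */
static bool
accept_negotiation(ConnState *conn, ReqType req)
{
	switch (req)
	{
		case REQ_SSL:
			if (conn->ssl_done)
				return false;	/* repeated SSL request: reject */
			conn->ssl_done = true;
			return true;
		case REQ_GSS:
			if (conn->gss_done)
				return false;	/* repeated GSS request: reject */
			conn->gss_done = true;
			return true;
		default:
			return true;		/* normal startup packet */
	}
}
```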
Reported-by: Calif.io in collaboration with Claude and Anthropic Research
Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Security: CVE-2026-6479
Backpatch-through: 14
Tom Lane [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Fix assorted places that need to use palloc_array().
multirange_recv and BlockRefTableReaderNextRelation were incautious
about multiplying a possibly-large integer by a factor more than 1
and then using it as an allocation size. This is harmless on 64-bit
systems where we'd compute a size exceeding MaxAllocSize and then
fail, but on 32-bit systems we could overflow size_t leading to an
undersized allocation and buffer overrun.
Fix these places by using palloc_array() instead of a handwritten
multiplication. (In HEAD, some of them were fixed already, but
none of that work got back-patched at the time.)
In addition, BlockRefTableReaderNextRelation passes the same value
to BlockRefTableRead's "int length" parameter. If built for
64-bit frontend code, palloc_array() allows a larger array size
than it otherwise would, potentially allowing that parameter to
overflow. Add an explicit check to forestall that and keep the
behavior the same cross-platform.
Reported-by: Xint Code
Author: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 14
Security: CVE-2026-6473
Tom Lane [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Prevent buffer overrun in unicode_normalize().
Some UTF8 characters decompose to more than a dozen codepoints.
It is possible for an input string that fits into well under
1GB to produce more than 4G decomposed codepoints, causing
unicode_normalize()'s decomp_size variable to wrap around to a
small positive value. This results in a small output buffer
allocation and subsequent buffer overrun.
To fix, test after each addition to see if we've overrun MaxAllocSize,
and break out of the loop early if so. In frontend code we want to
just return NULL for this failure (treating it like OOM). In the
backend, we can rely on the following palloc() call to throw error.
I also tightened things up in the calling functions in varlena.c,
using size_t rather than int and allocating the input workspace
with palloc_array(). These changes are probably unnecessary
given the knowledge that the original input and the normalized
output_chars array must fit into 1GB, but it's a lot easier to
believe the code is safe with these changes.
Reported-by: Xint Code
Reported-by: Bruce Dang <bruce@calif.io>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Co-authored-by: Heikki Linnakangas <hlinnaka@iki.fi>
Backpatch-through: 14
Security: CVE-2026-6473
Tom Lane [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Harden our regex engine against integer overflow in size calculations.
The number of NFA states, number of NFA arcs, and number of colors
are all bounded to reasonably small values. However, there are
places where we try to allocate arrays sized by products of those
quantities, and those calculations could overflow, enabling
buffer-overrun attacks. In practice there's no problem on 64-bit
machines, but there are some live scenarios on 32-bit machines.
A related problem is that citerdissect() and creviterdissect()
allocate arrays based on the length of the input string, which
potentially could overflow.
To fix, invent MALLOC_ARRAY and REALLOC_ARRAY macros that rely on
palloc_array_extended and repalloc_array_extended with the NO_OOM
option, similarly to the existing MALLOC and REALLOC macros.
(Like those, they'll throw an error rather than return a NULL result
oversize requests. This doesn't really fit into the regex code's
view of error handling, but it'll do for now. We can consider
whether to change that behavior in a non-security follow-up patch.)
I installed similar defenses in the colormap construction code.
It's not entirely clear whether integer overflow is possible
there, but analyzing the behavior in detail seems not worth
the trouble, as the risky spots are not in hot code paths.
I left a bunch of calls as-is after verifying that they can't
overflow given reasonable limits on nstates and narcs. Those
limits were enforced already via REG_MAX_COMPILE_SPACE, but
add commentary to document the interactions.
In passing, also fix a related edge case, which is that the
special color numbers used in LACON carcs could overflow the
"color" data type, if ncolors is close to MAX_COLOR.
In v14 and v15, the regex engine calls malloc() directly instead
of using palloc(), so MALLOC_ARRAY and REALLOC_ARRAY do likewise.
Reported-by: Xint Code
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Backpatch-through: 14
Security: CVE-2026-6473
Tom Lane [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Make palloc_array() and friends safe against integer overflow.
Sufficiently large "count" arguments could result in undetected
overflow, causing the allocated memory chunk to be much smaller
than what the caller will subsequently write into it. This is
unlikely to be a hazard with 64-bit size_t but can sometimes
happen on 32-bit builds, primarily where a function allocates
workspace that's significantly larger than its input data.
Rather than trying to patch the at-risk callers piecemeal,
let's just redefine these macros so that they always check.
To do that, move the longstanding add_size() and mul_size() functions
into palloc.h and mcxt.c, and adjust them to not be specific to
shared-memory allocation. Then invent palloc_mul(), palloc0_mul(),
palloc_mul_extended() to use these functions. Actually, the latter
use inlined copies to save one function call. repalloc_array() gets
similar treatment. I didn't bother trying to inline the calls for
repalloc0_array() though.
In v14 and v15, this also adds repalloc_extended(), which previously
was only available in v16 and up.
We need copies of all this in fe_memutils.[hc] as well, since that
module also provides palloc_array() etc.
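The mechanism can be sketched with standalone stand-ins (mul_size_checked, alloc_array_checked, and MAX_ALLOC are illustrative names; the real functions are mul_size() and the palloc_array() family, and MaxAllocSize is PostgreSQL's limit):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ALLOC ((size_t) 0x3fffffff)	/* stand-in for MaxAllocSize */

/* Checked multiply in the spirit of mul_size(): fail instead of wrapping. */
static int
mul_size_checked(size_t a, size_t b, size_t *result)
{
	if (a != 0 && b > SIZE_MAX / a)
		return 0;				/* product would overflow size_t */
	*result = a * b;
	return 1;
}

/*
 * Illustrative palloc_array()-style wrapper: compute count * elem_size
 * with an overflow check before allocating, returning NULL on overflow
 * or when the request exceeds the allocation limit, instead of quietly
 * allocating an undersized chunk.
 */
static void *
alloc_array_checked(size_t count, size_t elem_size)
{
	size_t		total;

	if (!mul_size_checked(count, elem_size, &total) || total > MAX_ALLOC)
		return NULL;
	return malloc(total);
}
```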
Reported-by: Xint Code
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Backpatch-through: 14
Security: CVE-2026-6473
Michael Paquier [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
Fix overflows with ts_headline()
The options "StartSel", "StopSel" and "FragmentDelimiter" given by a
caller of the SQL function ts_headline() have their lengths stored as
int16. When providing values larger than PG_INT16_MAX, it was possible
to overflow the length values stored, leading to incorrect behaviors in
generateHeadline(), in most cases translating to a crash.
Attempting to use values for these options larger than PG_INT16_MAX is
now blocked. Some test cases are added to cover these scenarios.
Michael Paquier [Mon, 11 May 2026 12:13:46 +0000 (05:13 -0700)]
ltree: Fix overflows with lquery parsing
The lquery parser in contrib/ltree/ had two overflow problems:
- A single lquery level with many OR-separated variants (e.g.,
'label1|label2|...') could overflow totallen, which is stored as a
uint16 and hence capped at UINT16_MAX (65535). Each variant
contributes MAXALIGN(LVAR_HDRSIZE + len) bytes, so with enough long
variants the value would wrap around. This would corrupt the data
written by LQL_NEXT(), leading to stack corruption, most likely
translating into a crash, but also allowing incorrect memory access.
- numvar, also a uint16, counts the number of OR-variants in a single
level and is incremented without bounds checking. With more than
PG_UINT16_MAX (65535) variants in a single level, requiring a minimum
of 131kB of input data, it would wrap to 0. When a wildcard '*' is
used, this would silently change the query results.
For both issues, overflow checks are added to guard against these
problematic patterns.
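The guard pattern for a 16-bit accumulator can be sketched standalone (the function name is illustrative): check the remaining headroom before adding, since "totallen += add" silently wraps modulo 65536 once the sum no longer fits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative guard for the uint16 accumulators described above:
 * verify the headroom before the addition, because an unchecked
 * "totallen += add" wraps around silently once the sum exceeds
 * UINT16_MAX, corrupting every downstream length computation.
 */
static bool
add_uint16_checked(uint16_t *totallen, uint16_t add)
{
	if (add > UINT16_MAX - *totallen)
		return false;			/* would wrap around: reject */
	*totallen += add;
	return true;
}
```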
The first issue has been reported by the three people listed below,
affecting v16 and newer versions due to b1665bf01e5f. Its coding was
still unsafe in v14 and v15. The second issue affects all the stable
branches; I bumped into it while reviewing the code of the module.
Reported-by: Vergissmeinnicht <vergissmeinnichtzh@gmail.com>
Reported-by: A1ex <alex000young@gmail.com>
Reported-by: Jihe Wang <wangjihe.mail@gmail.com>
Author: Michael Paquier <michael@paquier.xyz>
Security: CVE-2026-6473
Backpatch-through: 14
John Naylor [Fri, 8 May 2026 09:44:25 +0000 (16:44 +0700)]
Fix universal builds on macOS
Commit 16743db06 assumed that the CPUID instruction was always
available when the usual x86 symbols were defined. That is not the
case, so zero out the info rather than error out.
Reported-by: Jakob Egger <jakob@eggerapps.at>
Reported-by: Tobias Bussmann <t.bussmann@gmx.net>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/223EA201-A0E8-4A13-B220-EB903E8DF817@eggerapps.at
Richard Guo [Fri, 8 May 2026 08:21:48 +0000 (17:21 +0900)]
Enforce RETURNING typmod for empty-set JSON_ARRAY(query)
Commit 8d829f5a0 introduced a COALESCE wrapper around the
JSON_ARRAYAGG subquery so that JSON_ARRAY(query) returns '[]' rather
than NULL when the subquery yields no rows, per the SQL/JSON standard.
The empty-array Const used as the COALESCE fallback was, however,
built with typmod -1 and the type input function was likewise invoked
with typmod -1. As a result, any length restriction from the
RETURNING clause was silently bypassed on the empty-set path, while
the non-empty path enforced it via the JSON_ARRAYAGG coercion.
Build the empty-array Const using the typmod of the COALESCE's
non-empty argument, and pass that typmod to OidInputFunctionCall as
well so the value is length-checked at parse time. This makes the
empty-set and non-empty-set paths behave consistently.
Reported-by: Ayush Tiwari <ayushtiwari.slg01@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAJTYsWXPYqa58YXrU+SQMVonsAhjLS46HNUMU=wO5zm9MgY3_g@mail.gmail.com
Amit Kapila [Fri, 8 May 2026 04:30:26 +0000 (10:00 +0530)]
Use schema-qualified names in EXCEPT clause error messages.
Error messages in check_publication_add_relation() previously reported
only the relation name when a table in an EXCEPT clause could not be
processed, which is ambiguous when the same name exists in multiple
schemas. Use schema-qualified names instead, consistent with other error
messages that reference relation names.
Author: Dilip Kumar <dilipbalaut@gmail.com>
Author: vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Discussion: https://postgr.es/m/CAFiTN-scG7b11Jsp+VoDRT8ZFE84eSKLcDsSB18dZ8AaP=R-mw@mail.gmail.com
Etsuro Fujita [Fri, 8 May 2026 04:15:00 +0000 (13:15 +0900)]
postgres_fdw: Fix syntax error in fetch_attstats().
When importing remote stats for a foreign table backed by a pre-v17
remote server, the query built/executed in this function has three NULL
placeholders for the range stats supported in v17 at the end of the
SELECT list. Previously, it included a trailing comma after the last
NULL, like "SELECT ..., NULL, NULL, NULL, FROM pg_catalog.pg_stats ...",
causing a syntax error on the remote server. Fix by removing the comma.
Richard Guo [Fri, 8 May 2026 03:45:51 +0000 (12:45 +0900)]
Consider opfamily and collation when removing redundant GROUP BY columns
remove_useless_groupby_columns() uses a relation's unique indexes to
prove that some GROUP BY columns are functionally dependent on others,
and so can be dropped from the GROUP BY clause. The match between
index columns and GROUP BY columns was done by attno alone, ignoring
two equality-relation issues.
A type may belong to multiple btree opfamilies whose notions of
equality differ. The record type, for instance, has record_ops
(per-field equality) and record_image_ops (bytewise equality). A
unique index under one opfamily does not prove uniqueness under the
equality used by GROUP BY when the SortGroupClause's eqop comes from a
different opfamily.
Likewise, since nondeterministic collations were introduced in PG 12,
two collations may disagree on equality, and a unique index under one
collation does not prove uniqueness under another.
In either case, rows that the index considers distinct can collapse
into a single GROUP BY group, taking ungrouped columns of differing
values with them, so the planner drops a column that is not in fact
functionally dependent and produces wrong results.
Fix by requiring, for each unique-index key column, that some GROUP BY
item on the same column has an eqop in the index's opfamily and a
collation that agrees on equality with the index's collation. This
mirrors the combined check relation_has_unique_index_for() applies to
join clauses.
This is a v18 regression: commit bd10ec529 extended
remove_useless_groupby_columns() from primary-key constraints to
arbitrary unique indexes. Before that, the function consulted only
primary keys, whose enforcement index is required by parse_utilcmd.c
to use the default opclass and the column's declared collation, so
neither mismatch could arise. Back-patch to v18 only.
Richard Guo [Fri, 8 May 2026 01:57:50 +0000 (10:57 +0900)]
Fix HAVING-to-WHERE pushdown for simple-CASE form
Commit f76686ce7 added a walker that detects when a HAVING clause uses
a collation that conflicts with the GROUP BY's nondeterministic
collation, keeping such clauses in HAVING. The walker uses
exprInputCollation() to identify each ancestor's comparison collation,
but missed the simple-CASE case: parse analysis builds each WHEN as
OpExpr(CaseTestExpr op val), where CaseTestExpr is a placeholder for
the arg, while the actual arg expression sits at cexpr->arg, outside
the OpExpr that carries the comparison's inputcollid. A GROUP Var at
cexpr->arg was therefore visited with the WHEN's inputcollid absent
from the ancestor stack, the conflict went undetected, and the clause
was wrongly pushed to WHERE.
Fix by handling simple CASE explicitly: before walking cexpr->arg,
push every WHEN's inputcollid onto the ancestor stack so a GROUP Var
at the arg is checked against the same collations the WHEN comparisons
would apply. Then walk the WHEN bodies and defresult under the
unchanged stack, where their own collation contexts are picked up by
the default path.
Back-patch to v18 only; this fix extends the walker added by commit f76686ce7 and inherits its dependency on the v18 RTE_GROUP mechanism.
Amit Langote [Thu, 7 May 2026 23:26:04 +0000 (08:26 +0900)]
Fix use-after-free of qs in AfterTriggerEndQuery.
afterTriggerInvokeEvents() may repalloc afterTriggers.query_stack
while firing trigger events, leaving any precomputed entry pointer
dangling. The loop body in AfterTriggerEndQuery() recomputes qs
after each afterTriggerInvokeEvents() call for that reason, but the
"all fired" break path exits without the recompute, and the
subsequent FireAfterTriggerBatchCallbacks(qs->batch_callbacks)
dereferences the freed pointer.
Fix by recomputing qs immediately before
FireAfterTriggerBatchCallbacks(), as the loop body already does
after each afterTriggerInvokeEvents() call.
The hazard was introduced in 34a30786293, which added the
qs->batch_callbacks dereference at this site.
Reported-by: Amul Sul <sulamul@gmail.com>
Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CAAJ_b95p6-qiVpE2Gpr=bUsNAqTcejD_rPgLnfjx9m=fo3Rf3Q@mail.gmail.com
Masahiko Sawada [Thu, 7 May 2026 17:09:42 +0000 (10:09 -0700)]
Fix race condition in XLogLogicalInfo and ProcSignal initialization.
Previously, InitializeProcessXLogLogicalInfo() was called before
ProcSignalInit(). This created a window where a process could miss a
signal barrier if it was issued between these two calls. As a result,
the process could fail to update its local XLogLogicalInfo cache,
leading to an inconsistent logical decoding state.
This commit fixes this by moving InitializeProcessXLogLogicalInfo()
after ProcSignalInit(). This ensures that the process is registered to
participate in signal barriers before its state is initialized,
preventing it from missing any state changes propagated during the
startup sequence.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAD21AoBzdeSyLSSPM5E6ysN1r8qzp8u_BRmnLvuAp_S8QxS_fQ@mail.gmail.com
Discussion: https://postgr.es/m/CAD21AoBj+zKvgw_Q8gjr4YbKccW_uMe3OFQ5+KT246FHUuNXSQ@mail.gmail.com
John Naylor [Thu, 7 May 2026 12:10:51 +0000 (19:10 +0700)]
Rationalize error comments in partition split/merge tests
The regression tests had a copy of the full error, detail, and hint
text in comments above each failing statement in the .sql files. This
is a maintenance hazard, so simplify to "-- ERROR", in line with
other tests.
Author: Ayush Tiwari <ayushtiwari.slg01@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CANWCAZap26BRLwtd+A7GFDSD6-+C3F0NVdUGUAu2LUfvpOTy=w@mail.gmail.com
John Naylor [Thu, 7 May 2026 12:10:35 +0000 (19:10 +0700)]
Message corrections for partition split/merge commands
Fix spelling and grammar, turn an accidental duplicate errmsg into
errdetail, and remove an errposition that was not pointing at anything
relevant to the error.
Author: Ayush Tiwari <ayushtiwari.slg01@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Yuchen Li <liyuchen_xyz@163.com> (earlier version)
Discussion: https://postgr.es/m/CAJTYsWUvMT5uKOasPnm6-o9CrdXbRONiAYHTKJb7wx66LB8S1A@mail.gmail.com
Michael Paquier [Thu, 7 May 2026 01:18:49 +0000 (10:18 +0900)]
Simplify code in objectaddress.c for some property graph objects
Property graph element labels and label properties relied on a direct
systable scan when retrieving their object descriptions. These can be
simplified with get_catalog_object_by_oid(), which has the benefit of
doing a direct syscache lookup, if one is available.
The same logic will be used in a follow-up patch when retrieving the
object identity parts, applying the same rule across the board for these
object types.
WAIT FOR LSN registers the current backend in shared memory before entering an
interruptible wait loop. Top-level abort and backend exit already call
WaitLSNCleanup(), but subtransaction abort did not. If an interrupt, such as
statement_timeout, occurred while waiting inside a savepoint, rolling back to
the savepoint left the backend marked as present in the WAIT FOR LSN heap.
Clean up WAIT FOR LSN state from AbortSubTransaction() as well, and add
a TAP test covering reuse of WAIT FOR LSN after a savepoint rollback.
Fix regex searching for page verification failures in tests
The test for finding page verification failures in the logfiles
was missing the /m modifier, which is needed to anchor the pattern
at every newline in the search buffer rather than only once.
Spotted while adding a test for the recently reported issue with
excessive WAL for unlogged relations.
The DataChecksumsWorker accepts cost_delay and cost_limit parameters
from pg_enable_data_checksums() so users can throttle the I/O caused
by enabling checksums. Because the API for setting the cost parameters
changed between when the code was written and when it was committed,
the call to the new cost-update function was omitted and thus the
parameters were silently ignored.
Fix by calling VacuumUpdateCosts() after assigning the parameters
(both during worker startup and on the runtime cost-update path), and
by leaving the page-cost weights at their GUC-controlled defaults.
Skip WAL for unlogged main fork during online checksum enable
ProcessSingleRelationFork() unconditionally generated an FPI WAL
record for every page of every relation when enabling checksums.
Unlogged relations, which by definition never generate WAL for
data changes, were not exempt, causing excessive WAL to be
emitted.
Fix by guarding the FPI WAL record call with RelationNeedsWAL()
to avoid emitting WAL for unlogged main forks. Unlogged pages
are still dirtied to ensure the checksum is written to disk at
the next checkpoint. The init fork remains WAL-logged even for
unlogged relations, as it's needed on the standby to materialize
the relation after promotion (see ResetUnloggedRelations()).
Skipping init-fork WAL would leave the standby with a stale init
fork that, once copied to the main fork on promotion, would fail
checksum verification on every read of the unlogged relation.
A test which creates an unlogged table with an index, enables
checksums, promotes the standby, and verifies that the unlogged
relation and its indexes are still readable post-promotion has
been added.
Document deprecated --wal-directory option for pg_verifybackup
Commit b3cf461b3cf renamed --wal-directory to --wal-path but retained
the former as a silent alias. Per project policy, all options,
including deprecated ones, should be documented to assist users
transitioning between versions.
This patch restores --wal-directory to the documentation and --help
output.
Author: Amul Sul <sulamul@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/E1w3fZp-000gje-31%40gemulon.postgresql.org
Álvaro Herrera [Tue, 5 May 2026 14:20:26 +0000 (16:20 +0200)]
Skip other sessions' temp tables in REPACK, CLUSTER, and VACUUM FULL
get_tables_to_repack() and get_all_vacuum_rels() were including other
sessions' temporary tables in their output work list, causing REPACK,
CLUSTER and VACUUM FULL (when executed without a table list) to attempt
to acquire AccessExclusiveLock on them, potentially blocking for an
extended time. Fix by skipping other-session temp tables early, before
they are added to the list.
This issue is ancient, but there have been no complaints about it that I
know of, so I'm opting for not backpatching at present.
Author: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/0b555318-2bf2-46df-9377-09629a2a59db@uni-muenster.de
Etsuro Fujita [Tue, 5 May 2026 09:55:00 +0000 (18:55 +0900)]
postgres_fdw: Fix handling of abort-cleanup-failed connections.
Since connections that failed abort cleanup can't safely be used further,
if a remote query tries to get such a connection, we reject it.
Previously, this rejection involved dropping the connection if it was
open, without accounting for the possibility of open cursors using it,
causing a server crash when such an open cursor tried to use an
already-dropped connection, as a cursor-handling function
(create_cursor, fetch_more_data, or close_cursor) was called on a freed
PGconn. To fix, delay dropping failed connections until abort cleanup
of the main transaction, to ensure open cursors using such a connection
can safely refer to the PGconn for it.
Álvaro Herrera [Tue, 5 May 2026 08:24:49 +0000 (10:24 +0200)]
Don't lose column values on REPACK
Commit 28d534e2ae0a introduced reform_tuple() with a fast path that
returns the source tuple verbatim when no dropped columns require fixing
up. I (Álvaro) failed to realize that this broke handling of columns
with a 'missingval' defined: after a VACUUM FULL, CLUSTER, or REPACK
operation, the catalogued missingval is thrown away, so the tuples are
no longer correct.
Fix by forcing the rewrite when the tuple is shorter than the tuple
descriptor.
Richard Guo [Tue, 5 May 2026 01:23:31 +0000 (10:23 +0900)]
Consider collation when proving subquery uniqueness
rel_is_distinct_for()'s RTE_SUBQUERY branch passed only the equality
operator from each join clause to query_is_distinct_for(), discarding
the operator's input collation. query_is_distinct_for() then verified
opfamily compatibility but never checked collations, so a DISTINCT /
GROUP BY / set-op operating under one collation was trusted to prove
uniqueness for a comparison performed under an unrelated collation.
As with the recent fix in relation_has_unique_index_for(), this is
unsound for nondeterministic collations and yields wrong query results
in any optimization that consumes the proof.
Fix by carrying each clause's operator input collation into
query_is_distinct_for() and validating it at every check-site against
the subquery target expression's collation.
Back-patch to all supported branches. query_is_distinct_for() is
declared in an installed header, so on stable branches the existing
two-list signature is retained as a thin wrapper that forwards to a
new collation-aware entry point; external callers continue to receive
the historical collation-blind answer.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMbWs4_XUUSTyzCaRjUeeahWNqi=8ZOA5Q4coi8zUVEDSBkM6A@mail.gmail.com
Backpatch-through: 14
Richard Guo [Tue, 5 May 2026 01:22:53 +0000 (10:22 +0900)]
Consider collation when proving uniqueness from unique indexes
relation_has_unique_index_for() has long had an XXX noting that it
doesn't check collations when matching a unique index's columns
against equality clauses. This was benign as long as all collations
in play reduced to the same notion of equality, but has been incorrect
since nondeterministic collations were introduced in PG 12: a unique
index under a deterministic collation does not prove uniqueness under
a nondeterministic collation, nor vice versa.
The consequence is wrong query results for any planner optimization
that consumes the faulty proof, including inner-unique join execution
(which stops the inner search after the first match per outer row),
useless-left-join removal, semijoin-to-innerjoin reduction, and
self-join elimination.
Fix by requiring the index's collation to agree on equality with the
clause's input collation. Two collations agree on equality if either
is InvalidOid (denoting a non-collation-sensitive operation, which
cannot conflict with the other side), if they have the same OID, or if
both are deterministic: by definition a deterministic collation treats
two strings as equal iff they are byte-wise equal (see CREATE
COLLATION), so any two deterministic collations share the same
equality relation and the uniqueness proof carries over. Any mismatch
involving a nondeterministic collation is rejected.
Back-patch to all supported branches; the bug has existed since
nondeterministic collations were introduced in PG 12.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMbWs4_XUUSTyzCaRjUeeahWNqi=8ZOA5Q4coi8zUVEDSBkM6A@mail.gmail.com
Backpatch-through: 14
Tom Lane [Mon, 4 May 2026 22:33:06 +0000 (18:33 -0400)]
Declare load_hosts() as returning HostsFileLoadResult.
This function returns some value of enum HostsFileLoadResult,
but for reasons lost in the development process was declared to
return "int". Fix that, for clarity and so that our typedefs
collection tooling sees the typedef as used. Also fix the
variable that the sole call assigns into. Move the typedef
to the header file that declares load_hosts() to avoid creating
header dependency problems.
Álvaro Herrera [Mon, 4 May 2026 18:01:19 +0000 (20:01 +0200)]
Fix off-by-one in repack index loop
A blunder of mine (Álvaro) in commit 28d534e2ae0a.
Author: Lakshmi N <lakshmin.jhs@gmail.com>
Reviewed-by: Xiaopeng Wang <wxp_728@163.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CA+3i_M9ytFufvD8Tm0rhpfxuC4XrpgQDBHxM7NJQYxv488JW7w@mail.gmail.com
Handle nodes that may appear in GraphPattern expression trees
expression_tree_mutator_impl() did not handle T_GraphPattern,
T_GraphElementPattern, and T_GraphPropertyRef. The corresponding
expression_tree_walker_impl() already handles all three node types.
This causes an "unrecognized node type" error whenever a GRAPH_TABLE
appeared in an expression tree.
While at it, also update raw_expression_tree_walker() and
expression_tree_walker() to handle missing nodes that may appear in
GraphPattern expression trees. When raw_expression_tree_walker() is
called, GraphElementPattern::labelexpr contains ColumnRefs instead of
GraphLabelRefs. Hence those are not handled in
raw_expression_tree_walker().
Even though a property graph is defined in pg_class, it does not
contain any rows by itself and need not have a type defined. Avoid
creating a type for it.
Amit Kapila [Mon, 4 May 2026 06:36:41 +0000 (12:06 +0530)]
Simplify translatable messages for tuple value details in conflict.c.
append_tuple_value_detail() constructed user-visible messages using
separately translated fragments such as ": ", ", ", and ".". This
makes correct translation difficult or impossible in some languages.
Refactor append_tuple_value_detail() to move all punctuation and
sentence construction to the callers, which now use a single
translatable string with a %s placeholder for the tuple data.
Reported-by: David Rowley <dgrowleyml@gmail.com>
Author: vignesh C <vignesh21@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Discussion: https://postgr.es/m/227279.1775956328%40sss.pgh.pa.us#8f3a5f50543556c60cc5a13270cb7ba4
Discussion: https://postgr.es/m/CAApHDvohYOdrvhVxXzCJNX_GYMSWBfjTTtB6hgDauEtZ8Nar2A@mail.gmail.com
Mark the modified FSM buffer as dirty during recovery
The XLogRecordPageWithFreeSpace function updates the free space map (FSM)
data while replaying data-level WAL records during recovery. If the FSM
block is updated, it needs to be marked as modified. Currently, this is
done with the MarkBufferDirtyHint call (as in all other cases of
modifying FSM data). However, in the recovery context, this function will
actually do nothing if checksums are enabled. It's assumed that pages
should not be dirtied during recovery while modifying hints, to protect
against torn pages, since no new WAL data can be generated at this point
to store an FPI.
Such logic does not seem fully aligned with the FSM case, as its blocks could
be simply zeroed if a checksum mismatch is detected. Currently, changes to an
FSM block could be lost if each change to that block occurs infrequently
enough to allow it to be evicted from the cache. To persist the change, the
modification needs to be performed while the FSM block is still kept in
buffers and marked as dirty after receiving its FPI. If the block has already
been cleaned, the change won't be persisted, so stored FSM blocks may remain
in an obsolete state.
If a large number of discrepancies between the data in leaf FSM blocks and the
actual data blocks accumulate on the replica server, this could cause
significant delays in insert operations after switchover. Such an insert
operation may need to visit many data blocks marked as having sufficient
space in the FSM, only to discover that the information is incorrect and the
FSM records need to be corrected. In a heavily trafficked insert-only table
with many concurrent clients performing inserts, this has been observed to
cause several-second stalls and visible application malfunction. The
desire to avoid such cases was the reason behind commit ab7dbd681, which
introduced an update of FSM data during the heap_xlog_visible invocation.
However, an update to the FSM data on the standby side could be lost due to a
missing 'dirty' flag, so there is still a possibility that a large number of
FSM records will contain incorrect data. Note that having a zeroed FSM page
in such a case (due to a checksum mismatch) is preferable, as a zero value
will be interpreted as an indication of full data blocks, and the inserter
will be routed to the next FSM block or to the end of the table.
Given that the FSM is ready to handle torn page writes and
XLogRecordPageWithFreeSpace is called only during recovery, there seems
to be no reason to use MarkBufferDirtyHint here instead of a regular
MarkBufferDirty call.
WAIT FOR LSN compares only the numeric LSN and has no notion of which
timeline a WAL record belongs to. There are many possible scenarios in
which a timeline switch can break read-your-writes consistency. Proper
analysis and timeline support are possible in the next major release.
For now, just document the current behaviour.
Reported-by: Xuneng Zhou <xunengzhou@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Add regression coverage for several WAIT FOR LSN edge cases.
First, cover fresh walreceiver shared-memory initialization after a
standby restart. Restart the standby while its upstream is down, so
RequestXLogStreaming() seeds writtenUpto/flushedUpto to the
segment-aligned receiveStart and the walreceiver cannot immediately
advance them. Verify that the seeded flush position is segment-aligned,
that replay can be ahead of it, and that standby_write/standby_flush
still succeed for an already-replayed LSN via the replay-position floor
in GetCurrentLSNForWaitType().
Second, add fencepost checks for the target <= currentLSN predicate.
With replay paused and walreceiver stopped, verify exact boundaries for
standby_replay using pg_last_wal_replay_lsn(), and for standby_flush
using pg_last_wal_receive_lsn(). Also verify that a waiter for
current + 1 sleeps while replay is paused and wakes with success once
new WAL is delivered and replay advances.
Finally, add a cascading-standby timeline-switch test. Start a waiter
on the downstream standby, promote its upstream, generate WAL on the new
timeline, and verify that the cascade follows the new timeline and the
wait completes successfully once replay reaches the target LSN.
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/1957514.1775526774%40sss.pgh.pa.us
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Wake standby_write/standby_flush waiters from the WAL replay loop
The startup process only woke STANDBY_REPLAY waiters after replaying
each WAL record. STANDBY_WRITE and STANDBY_FLUSH waiters depended only
on walreceiver write/flush callbacks. As a result, replay progress alone
did not wake those waiters, and in pure archive recovery (where no
walreceiver exists) they could sleep until timeout.
Fix by also calling WaitLSNWakeup() for STANDBY_WRITE and
STANDBY_FLUSH after each replay. For the replay-floor semantics used by
GetCurrentLSNForWaitType(), replay progress is a valid lower bound for
both modes: WAL cannot be replayed unless it has already been written
and flushed locally.
This works together with the replay-position floor in
GetCurrentLSNForWaitType(). The getter ensures that a waiter woken by
replay can recheck successfully; the replay-side wakeups ensure that a
waiter already asleep is notified when replay reaches its target.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1957514.1775526774%40sss.pgh.pa.us
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Use replay position as floor for WAIT FOR LSN standby_(write|flush)
GetCurrentLSNForWaitType() for standby_write and standby_flush modes
returned only the walreceiver position, which may lag behind WAL
already present on the standby from a base backup, archive restore,
or prior streaming. This could cause unnecessary blocking if the
target LSN falls between the walreceiver's tracked position and the
replay position.
Fix by returning the maximum of the walreceiver position and the
replay position. WAL up to the replay point is physically on disk
regardless of its origin, so there is no reason to wait for the
walreceiver to re-receive it.
This complements 29e7dbf5e4d, which seeded writtenUpto to
receiveStart in RequestXLogStreaming() to fix the most common
hang scenario. The getter-level floor handles the remaining edge
cases: targets between receiveStart and the replay position, and
standbys running with archive recovery only (no walreceiver).
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1957514.1775526774%40sss.pgh.pa.us
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Remove redundant WAIT FOR LSN caller-side pre-checks
All five wakeup call sites duplicate WaitLSNWakeup()'s internal
fast-path minWaitedLSN check and add an unnecessary NULL check
on waitLSNState.
Remove the inline pre-checks and call WaitLSNWakeup() directly.
The fast-path check inside WaitLSNWakeup() already returns early
when no waiter's target has been reached, so there is no
performance difference.
The waitLSNState NULL checks are also unnecessary: shared memory
is fully initialized before any backend or auxiliary process
starts, so waitLSNState is always non-NULL at these call sites.
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/jzq5shdewncpxc35r3s2mcfsmo4bjovkza5mnqf5bdfumhfi3g%40bglckf7dxmw5
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Fix memory ordering in WAIT FOR LSN wakeup mechanism
WAIT FOR LSN uses a Dekker-style handshake: the waker stores an LSN
position then reads minWaitedLSN; the waiter stores its target into
minWaitedLSN then reads the position. Without a barrier between each
side's store and load, a CPU may satisfy the load before the store
becomes globally visible, causing either side to miss a concurrent
update. The result is a missed wakeup: the waiter sleeps indefinitely
until the next unrelated event.
Fix by embedding the required barriers into the atomic operations on
minWaitedLSN:
- In updateMinWaitedLSN(), use pg_atomic_write_membarrier_u64() so the
waiter's preceding heap update is visible before the new minWaitedLSN
value is published.
- In WaitLSNWakeup(), use pg_atomic_read_membarrier_u64() in the
fast-path check so the waker's preceding position store is globally
visible before minWaitedLSN is read.
The waiter side is also covered by the barrier semantics already present
in GetCurrentLSNForWaitType(): GetWalRcvWriteRecPtr() uses an explicit
read barrier (from patch 0001), while the remaining getters acquire a
spinlock, which implies the same ordering.
Also call ResetLatch() unconditionally after WaitLatch(), following the
standard latch loop pattern. WaitLatch() does not guarantee that all
simultaneously true wake conditions are reported in one return, so a
timeout can race with SetLatch(). If we skip ResetLatch() on a timeout
return, the code performs further asynchronous-state checks before
consuming the latch, violating the latch API's required wait/reset
pattern. That can leave the latch set across loop exit and cause a
later unrelated WaitLatch() in the same backend to return immediately.
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/zqbppucpmkeqecfy4s5kscnru4tbk6khp3ozqz6ad2zijz354k%40w4bdf4z3wqoz
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Use barrier semantics when reading/writing writtenUpto
The walreceiver publishes its write position lock-free via writtenUpto.
On weakly-ordered architectures (ARM, PowerPC), both sides of this
handshake need explicit barriers so that the lock-less reader sees a
consistent state.
Use pg_atomic_write_membarrier_u64() at both write sites and
pg_atomic_read_membarrier_u64() in GetWalRcvWriteRecPtr(). This matches
the barrier semantics that GetWalRcvFlushRecPtr() and other LSN-position
functions get implicitly from their spinlock acquire/release, and
protects from bugs caused by expectations of similar barrier guarantees
from different LSN-position functions.
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/zqbppucpmkeqecfy4s5kscnru4tbk6khp3ozqz6ad2zijz354k%40w4bdf4z3wqoz
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Andrew Dunstan [Fri, 1 May 2026 19:12:28 +0000 (15:12 -0400)]
Add missing connection validation in ECPG
ECPGdeallocate_all(), ECPGprepared_statement(), ECPGget_desc(), and
ecpg_freeStmtCacheEntry() could crash with a SIGSEGV when called
without an established connection (for example, when EXEC SQL CONNECT
was forgotten or a non-existent connection name was used), because
they dereferenced the result of ecpg_get_connection() without first
checking it for NULL.
Each site is fixed in the style of the surrounding code.
Andrew Dunstan [Fri, 1 May 2026 15:52:14 +0000 (11:52 -0400)]
Only show signal-sender PID/UID detail in server log
The errdetail() added in 55890a91945 (and reworked in 3e2a1496bae)
exposed the operating-system PID and UID of whoever sent the
termination signal directly to the affected client.
Discussion suggested this should not be sent to the client, but only
recorded in the server log where the admin can use it for diagnosis.
Author: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Discussion: https://postgr.es/m/E5CA274C-74BD-4067-8B73-A3AD8C080EFA@gmail.com
The sequence subscription test switches regress_seq_sub to connect to the
publisher as regress_seq_repl (a non-superuser) when checking behavior
with insufficient sequence privileges, but forgot to set up pg_hba.conf
to allow connections from that role. The special setup is only needed on Windows
machines that don't use UNIX sockets.
Michael Paquier [Fri, 1 May 2026 04:10:35 +0000 (13:10 +0900)]
doc: Mention validation attempt during ALTER INDEX .. ATTACH PARTITION
Since 9d3e094f12, the command tries to validate the parent index of the
named index, if invalid. The documentation did not mention this
behavior, which could be confusing.
Author: Mohamed ALi <moali.pg@gmail.com>
Discussion: https://postgr.es/m/CAGnOmWpHu25_LpT=zv7KtetQhqV1QEZzFYLd_TDyOLu1Od9fpw@mail.gmail.com
Backpatch-through: 14
Fujii Masao [Fri, 1 May 2026 03:12:44 +0000 (12:12 +0900)]
Avoid blocking indefinitely while finishing walsender shutdown
When walsender finishes streaming during shutdown, it sends a
CommandComplete message to tell the receiver that WAL streaming is done.
Previously, that path used EndCommand() followed by pq_flush().
Those functions can block indefinitely waiting for the socket to become
writeable. As a result, even when wal_sender_shutdown_timeout is set,
walsender could remain stuck while sending the final completion message,
and the shutdown timeout would not be enforced.
Fix this by introducing EndCommandExtended(), which allows
CommandComplete to be queued with pq_putmessage_noblock(), and by
using the walsender nonblocking flush path instead of pq_flush(), so
the shutdown timeout continues to be checked while pending output is
flushed.
Per CI testing on FreeBSD.
Reported-by: Andres Freund <andres@anarazel.de>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/vwlugmsogfn36jhm56zwrgd7m6xe6ircltvfh3kzt6kldvbtht@f45dgow5uhnx
Richard Guo [Fri, 1 May 2026 02:13:50 +0000 (11:13 +0900)]
Fix HAVING-to-WHERE pushdown with nondeterministic collations
When GROUP BY uses a nondeterministic collation, the planner's
optimization of moving HAVING clauses to WHERE can produce incorrect
query results. The HAVING clause may apply a stricter collation that
distinguishes values the GROUP BY considers equal. Pushing such a
clause to WHERE causes it to filter individual rows before grouping,
potentially eliminating group members and changing aggregate results.
Fix this by detecting collation conflicts before flatten_group_exprs,
while the HAVING clause still contains GROUP Vars (Vars referencing
RTE_GROUP). At that point, each GROUP Var directly carries the GROUP
BY collation as its varcollid, making it straightforward to compare
against the operator's inputcollid. A mismatch where the GROUP BY
collation is nondeterministic means the clause is unsafe to push down.
RowCompareExpr is treated specially, since it carries per-column
inputcollids[] rather than a single inputcollid.
The conflicting clause indices are recorded in a Bitmapset and
consulted during the existing HAVING-to-WHERE loop, so that only
affected clauses are kept in HAVING; other safe clauses in the same
query are still pushed.
Back-patch to v18 only. The fix relies on the RTE_GROUP mechanism
introduced in v18 (commit 247dea89f), which is what lets us identify
grouping expressions and their resolved collations via GROUP Vars on
pre-flatten havingQual. Pre-v18 branches lack that machinery, so a
back-patch there would need a different approach. Given the absence
of field reports of this bug on back branches, the risk of carrying a
different fix on stable branches is not justified.
Amit Langote [Fri, 1 May 2026 00:56:10 +0000 (09:56 +0900)]
Use "concurrent delete" in serialization error for TM_Deleted cases
In ExecLockRows() and ri_LockPKTuple(), the TM_Deleted code path was
using the same "could not serialize access due to concurrent update"
message as the TM_Updated path. Use "concurrent delete" instead, since
the tuple was deleted, not updated. The ExecLockRows() instance was
likely a copy-paste error per Andres; the ri_LockPKTuple() instance
was carried over from the same pattern in commit 2da86c1ef9.
Update affected isolation test expected files accordingly and add
a new test to fk-concurrent-pk-upd.spec with concurrent delete of the
PK row.
The ExecLockRows() change is master-only for lack of user complaints
and to avoid breaking anything that might match on the error text.
Reported-by: Jian He <jian.universality@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CACJufxEG1JTCq4A1gnNAu-bGAq9Xn=Xkf7kC3TRWFz6iuUOuRA@mail.gmail.com
Richard Guo [Fri, 1 May 2026 00:42:00 +0000 (09:42 +0900)]
Fix JSON_ARRAY(query) empty set handling and view deparsing
According to the SQL/JSON standard, JSON_ARRAY(query) must return an
empty JSON array ('[]') when the subquery returns zero rows.
Previously, the parser rewrote JSON_ARRAY(query) into a JSON_ARRAYAGG
aggregate function. Because this aggregate evaluates to NULL over an
empty set without a GROUP BY clause, the constructor erroneously
returned NULL. Additionally, this premature rewrite baked physical
implementation details into the catalog, preventing ruleutils.c from
deparsing the original syntax for views.
This patch resolves both issues by introducing a new
JSCTOR_JSON_ARRAY_QUERY constructor type. The parser builds the
executable form --- a COALESCE-wrapped JSON_ARRAYAGG subquery --- from
raw parse nodes via transformExprRecurse, and stores it in the func
field. The original transformed Query is kept in a new orig_query
field so that ruleutils.c can deparse the original syntax for views.
During planning, eval_const_expressions replaces the node with the
pre-built func expression.
The deparsing issue was reported by Tom Lane.
Bump catalog version.
Bug: #19418
Reported-by: Lukas Eder <lukas.eder@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/19418-591ba1f29862ef5b@postgresql.org
REPACK CONCURRENTLY: fix processing of toasted tuples
In order to process tuples inserted or updated while REPACK executes, we
write those tuples to disk and later restore them; however, some forms
of toasted tuples were not being processed correctly. Fix that.
Andrew Dunstan [Thu, 30 Apr 2026 15:04:57 +0000 (11:04 -0400)]
Fix attnum remapping in generateClonedExtStatsStmt()
When cloning extended statistics via CREATE TABLE ... LIKE ... INCLUDING
STATISTICS, stxkeys holds attribute numbers from the source (parent)
table, but get_attname() was being called with the child relation's
OID. If the parent has dropped columns, the child's attribute numbers
are renumbered sequentially and no longer match, so the lookup either
returns the wrong column name (silent corruption) or errors out when
the attnum does not exist in the child.
Fix it by remapping the parent attnum through attmap before the lookup,
consistent with how expression statistics are already handled a few
lines below.
Add a regression test covering both manifestations: a 3-column parent
where the stale attnum refers to no child column (cache-lookup error),
and a 4-column parent where the stale attnum silently refers to the
wrong child column.
Andrew Dunstan [Thu, 30 Apr 2026 14:14:52 +0000 (10:14 -0400)]
Avoid SIGSEGV in pg_get_database_ddl() on NULL tablespace
There is a narrow race in which a concurrent ALTER DATABASE ... SET
TABLESPACE moves the database off the tablespace and a DROP TABLESPACE
removes it between the syscache lookup and the catalog scan. If that
happens, output an error.
Author: Chao Li <lic@highgo.com>
Reviewed-by: Jack Bonatakis <jack@bonatak.is>
Reviewed-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/573E45C1-31A4-4885-A00C-1A2171159A2A@gmail.com
Improve database detection logic in datachecksumsworker
The worker needs to know whether a database that failed checksum
processing still exists or has been dropped. This improves the
detection logic by also checking whether the database is in a
partially dropped state.
Author: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>
Reviewed-by: Ayush Tiwari <ayushtiwari.slg01@gmail.com>
Discussion: https://postgr.es/m/9197F930-DDEB-4CAC-82A2-16FEC715CCE8@yesql.se
When pg_{enable|disable}_data_checksums is called while checksums are
being enabled or disabled, the already running launcher is detected
and the new desired state is recorded. Processing will then pick up
the new state and change its operation to fulfill the new request.
If the same state is requested but with different cost values, the
new cost values will take effect on the next relation processed.
The previous coding used complex logic to start a new launcher for
this; that is now avoided by instead using the shared memory
structure to signal the current processing state.
This makes the logic more robust, and fixes a bug where the launcher
would erroneously revert back to the "off" state.
Access to the shared memory is also protected with LWLocks in all
cases. Since the shmem structure is used for signalling between
the worker and the launcher, and there can be only one of each,
no concurrency issues were detected, but it is better to stick to
a proper locking protocol should this ever be updated to handle
multiple workers.
Author: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>
Reviewed-by: Ayush Tiwari <ayushtiwari.slg01@gmail.com>
Discussion: https://postgr.es/m/9197F930-DDEB-4CAC-82A2-16FEC715CCE8@yesql.se