Andres Freund [Thu, 15 Jan 2026 19:09:08 +0000 (14:09 -0500)]
bufmgr: Implement buffer content locks independently of lwlocks
Until now buffer content locks were implemented using lwlocks. That has the
obvious advantage of not needing a separate efficient implementation of
locks. However, the time for a dedicated buffer content lock implementation
has come:
1) Hint bits are currently set while holding only a share lock. This leads to
having to copy pages while they are being written out if checksums are
enabled, which is not cheap. We would like to add AIO writes, however once
many buffers can be written out at the same time, it gets a lot more
expensive to copy them, particularly because that copy needs to reside in
shared buffers (for worker mode to have access to the buffer).
In addition, modifying buffers while they are being written out can cause
issues with unbuffered/direct-IO, as some filesystems (like btrfs) do not
like that, due to filesystem internal checksums getting corrupted.
The solution to this is to require a new share-exclusive lock-level to set
hint bits and to write out buffers, making those operations mutually
exclusive. We could introduce such a lock-level into the generic lwlock
implementation, however it does not look like there would be other users,
and it does add some overhead into important code paths.
2) For AIO writes we need to be able to race-freely check whether a buffer is
undergoing IO and whether an exclusive lock on the page can be acquired. That
is rather hard to do efficiently when the buffer state and the lock state
are separate atomic variables. This is a major hindrance to allowing writes
to be done asynchronously.
3) Buffer locks are by far the most frequently taken locks. Optimizing them
specifically for their use case is worth the effort. E.g. by merging
content locks into buffer locks we will be able to release a buffer lock
and pin in one atomic operation.
4) There are more complicated optimizations, like long-lived "super pinned &
locked" pages, that cannot realistically be implemented with the generic
lwlock implementation.
Therefore implement content locks inside bufmgr.c. The lockstate is stored as
part of BufferDesc.state. The implementation of buffer content locks is fairly
similar to lwlocks, with a few important differences:
1) An additional lock-level share-exclusive has been added. This lock-level
conflicts with exclusive locks and itself, but not share locks.
2) Error recovery for content locks is implemented as part of the already
existing private-refcount tracking mechanism in combination with resowners,
instead of a bespoke mechanism as the case for lwlocks. This means we do
not need to add dedicated error-recovery code paths to release all content
locks (like done with LWLockReleaseAll() for lwlocks).
3) The lock state is embedded in BufferDesc.state instead of having its own
struct.
4) The wakeup logic is a tad more complicated due to needing to support the
additional lock-level
This commit unfortunately introduces some code that is very similar to the
code in lwlock.c, however the code is not equivalent enough to easily merge
it. The future wins that this commit makes possible seem worth the cost.
As of this commit nothing uses the new share-exclusive lock mode. It will be
used in a future commit. It seemed too complicated to introduce the lock-level
in a separate commit.
It's worth calling out one wart in this commit: Despite content locks not
being lwlocks anymore, they continue to use PGPROC->lw* - that seemed better
than duplicating the relevant infrastructure.
Another thing worth pointing out is that, after this change, content locks are
not reported as LWLock wait events anymore, but as new wait events in the
"Buffer" wait event class (see also 6c5c393b740). The old BufferContent lwlock
tranche has been removed.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-by: Greg Burd <greg@burd.me> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Andres Freund [Thu, 15 Jan 2026 17:53:50 +0000 (12:53 -0500)]
bufmgr: Change BufferDesc.state to be a 64-bit atomic
This is motivated by wanting to merge buffer content locks into
BufferDesc.state in a future commit, rather than having a separate lwlock (see
commit c75ebc657ff for more details). As this change is rather mechanical, it
seems to make sense to split it out into a separate commit, for easier review.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Tom Lane [Thu, 15 Jan 2026 19:12:03 +0000 (14:12 -0500)]
Optimize LISTEN/NOTIFY via shared channel map and direct advancement.
This patch reworks LISTEN/NOTIFY to avoid waking backends that have
no need to process the notification messages we just sent.
The primary change is to create a shared hash table that tracks
which processes are listening to which channels (where a "channel" is
defined by a database OID and channel name). This allows a notifying
process to accurately determine which listeners are interested,
replacing the previous weak approximation that listeners in other
databases couldn't be interested.
Secondly, if a listener is known not to be interested and is
currently stopped at the old queue head, we avoid waking it at all
and just directly advance its queue pointer past the notifications
we inserted.
These changes permit very significant improvements (integer multiples)
in NOTIFY throughput, as well as a noticeable reduction in latency,
when there are many listeners but only a few are interested in any
specific message. There is no improvement for the simplest case where
every listener reads every message, but any loss seems below the noise
level.
Author: Joel Jacobson <joel@compiler.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6899c044-4a82-49be-8117-e6f669765f7e@app.fastmail.com
Fix 'unexpected data beyond EOF' on replica restart
On restart, a replica can fail with an error like 'unexpected data
beyond EOF in block 200 of relation T/D/R'. These are the steps to
reproduce it:
- A relation has a size of 400 blocks.
- Blocks 201 to 400 are empty.
- Block 200 has two rows.
- Blocks 100 to 199 are empty.
- A restartpoint is done
- Vacuum truncates the relation to 200 blocks
- A FPW deletes a row in block 200
- A checkpoint is done
- A FPW deletes the last row in block 200
- Vacuum truncates the relation to 100 blocks
- The replica restarts
When the replica restarts:
- The relation on disk starts at 100 blocks, because all the
truncations were applied before restart.
- The first truncate to 200 blocks is replayed. It silently fails, but
it will still (incorrectly!) update the cache size to 200 blocks
- The first FPW on block 200 is applied. XLogReadBufferForRead relies
on the cached size and incorrectly assumes that the page already
exists in the file, and thus won't extend the relation.
- The online checkpoint record is replayed, calling smgrdestroyall
which causes the cached size to be discarded
- The second FPW on block 200 is applied. This time, the detected size
is 100 blocks, an extend is attempted. However, the block 200 is
already present in the buffer cache due to the first FPW. This
triggers the 'unexpected data beyond EOF'.
To fix, update the cached size in SmgrRelation with the current size
rather than the requested new size, when the requested new size is
greater.
Andres Freund [Thu, 15 Jan 2026 15:17:51 +0000 (10:17 -0500)]
aio: io_uring: Fix danger of completion getting reused before being read
We called io_uring_cqe_seen(..., cqe) before reading cqe->res. That allows the
completion to be reused, which in turn could lead to cqe->res being
overwritten. The window for that is very narrow and the likelihood of it
happening is very low, as we should never actually utilize all CQEs, but the
consequences would be bad.
Wake up autovacuum launcher from postmaster when a worker exits
When an autovacuum worker exits, the launcher needs to be notified
with SIGUSR2, so that it can rebalance and possibly launch a new
worker. The launcher must be notified only after the worker has
finished ProcKill(), so that the worker slot is available for a new
worker. Before this commit, the autovacuum worker was responsible for
that, which required a slightly complicated dance to pass the
launcher's PID from FreeWorkerInfo() to ProcKill() in a global
variable.
Simplify that by moving the responsibility of the signaling to the
postmaster. The postmaster was already doing it when it failed to fork
a worker process, so it seems logical to make it responsible for
notifying the launcher on worker exit too. That's also how the
notification on background worker exit is done.
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: li carol <carol.li2025@outlook.com>
Discussion: https://www.postgresql.org/message-id/a5e27d25-c7e7-45d5-9bac-a17c8f462def@iki.fi
Add check for invalid offset at multixid truncation
If a multixid with zero offset is left behind after a crash, and that
multixid later becomes the oldest multixid, truncation might try to
look up its offset and read the zero value. In the worst case, we
might incorrectly use the zero offset to truncate valid SLRU segments
that are still needed. I'm not sure if that can happen in practice, or
if there are some other lower-level safeguards or incidental reasons
that prevent the caller from passing an unwritten multixid as the
oldest multi. But better safe than sorry, so let's add an explicit
check for it.
In stable branches, we should perhaps do the same check for
'oldestOffset', i.e. the offset of the old oldest multixid (in master,
'oldestOffset' is gone). But if the old oldest multixid has an invalid
offset, the damage has been done already, and we would never advance
past that point. It's not clear what we should do in that case. The
check that this commit adds will prevent such an multixid with invalid
offset from becoming the oldest multixid in the first place, which
seems enough for now.
Remove some unnecessary code from multixact truncation
With 64-bit multixact offsets, PerformMembersTruncation() doesn't need
the starting offset anymore. The 'oldestOffset' value that
TruncateMultiXact() calculates is no longer used for anything. Remove
it, and the code to calculate it.
'oldestOffset' was included in the WAL record as 'startTruncMemb',
which sounds nice if you e.g. look at the WAL with pg_waldump, but it
was also confusing because we didn't actually use the value for
determining what to truncate. Replaying the WAL would remove all
segments older than 'endTruncMemb', regardless of
'startTruncMemb'. The 'startTruncOff' stored in the WAL record was
similarly unnecessary even before 64-bit multixid offsets, it was
stored just for the sake of symmetry with 'startTruncMemb'. Remove
both from the WAL record, and rename the remaining 'endTruncOff' to
'oldestMulti' and 'endTruncMemb' to 'oldestOffset', for consistency
with the variable names used for them in other places.
Peter Eisentraut [Thu, 15 Jan 2026 11:11:52 +0000 (12:11 +0100)]
plpython: Streamline initialization
The initialization of PL/Python (the Python interpreter, the global
state, the plpy module) was arranged confusingly across different
functions with unclear and confusing boundaries. For example,
PLy_init_interp() said "Initialize the Python interpreter ..." but it
didn't actually do this, and PLy_init_plpy() said "initialize plpy
module" but it didn't do that either. After this change, all the
global initialization is called directly from _PG_init(), and the plpy
module initialization is all called from its registered initialization
function PyInit_plpy().
Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: li carol <carol.li2025@outlook.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/f31333f1-fbb7-4098-b209-bf2d71fbd4f3%40eisentraut.org
Peter Eisentraut [Thu, 15 Jan 2026 09:24:49 +0000 (10:24 +0100)]
plpython: Remove duplicate PyModule_Create()
This seems to have existed like this since Python 3 support was
added (commit dd4cd55c158), but it's unclear what this second call is
supposed to accomplish.
Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: li carol <carol.li2025@outlook.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/f31333f1-fbb7-4098-b209-bf2d71fbd4f3%40eisentraut.org
Peter Eisentraut [Thu, 15 Jan 2026 09:24:49 +0000 (10:24 +0100)]
plpython: Clean up PyModule_AddObject() uses
The comments "PyModule_AddObject does not add a refcount to the
object, for some odd reason" seem distracting. Arguably, this
behavior is expected, not odd. Also, the additional references
created by the existing code are apparently not necessary. But we
should clean up the reference in the error case, as suggested by the
Python documentation.
Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: li carol <carol.li2025@outlook.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/f31333f1-fbb7-4098-b209-bf2d71fbd4f3%40eisentraut.org
Michael Paquier [Thu, 15 Jan 2026 00:36:05 +0000 (09:36 +0900)]
Introduce routines to validate and free MVNDistinct and MVDependencies
These routines are useful to perform some basic validation checks on
each object structure, working currently on attribute numbers for
non-expression and expression attnums. These checks could be extended
in the future.
Note that this code is not used yet in the tree, and that these
functions will become handy for an upcoming patch for the import of
extended statistics data. However, they are worth their own independent
change as they are actually useful by themselves, with at least the
extension code argument in mind (or perhaps I am just feeling more
pedantic today).
Extracted from a larger patch by the same author, with many adjustments
and fixes by me.
Author: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
Jeff Davis [Wed, 14 Jan 2026 20:01:36 +0000 (12:01 -0800)]
Remove redundant assignment in CreateWorkExprContext
In CreateWorkExprContext(), maxBlockSize is initialized to
ALLOCSET_DEFAULT_MAXSIZE, and it then immediately reassigned,
thus the initialization is a redundant.
Author: Andreas Karlsson <andreas@proxel.se> Reported-by: Chao Li <lic@highgo.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/83a14f3c-f347-4769-9c01-30030b31f1eb@gmail.com
The original problem that led to the use of pg_restrict was that MSVC
couldn't handle plain restrict, and defining it to something else
would conflict with its __declspec(restrict) that is used in system
header files. In C11 mode, this is no longer a problem, as MSVC
handles plain restrict. This led to the commit to replace pg_restrict
with restrict. But this did not take C++ into account. Standard C++
does not have restrict, so we defined it as something else (for
example, MSVC supports __restrict). But this then again conflicts
with __declspec(restrict) in system header files. So we have to
revert this attempt. The comments are updated to clarify that the
reason for this is now C++ only.
Reported-by: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/CAGECzQRoD7chJP1-dneSrhxUJv%2BBRcigoGOO4UwGzaShLot2Yw%40mail.gmail.com
Peter Eisentraut [Wed, 14 Jan 2026 09:17:36 +0000 (10:17 +0100)]
Enable Python Limited API for PL/Python on MSVC
Previously, the Python Limited API was disabled on MSVC due to build
failures caused by Meson not knowing to link against python3.lib
instead of python3XX.lib when using the Limited API.
This commit works around the Meson limitation by explicitly finding
and linking against python3.lib on MSVC, and removes the preprocessor
guard that was disabling the Limited API on MSVC in plpython.h.
This requires python3.lib to be present in the Python installation,
which is included when Python is installed.
Author: Bryan Green <dbryan.green@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/ee410de1-1e0b-4770-b125-eeefd4726a24%40eisentraut.org
Michael Paquier [Wed, 14 Jan 2026 08:07:49 +0000 (17:07 +0900)]
Use more consistent *GetDatum() macros for some unsigned numbers
This patch switches some code paths to use GetDatum() macros more in
line with the data types of the variables they manipulate. This set of
changes does not fix a problem, but it is always nice to be more
consistent across the board.
Author: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Roman Khapov <rkhapov@yandex-team.ru> Reviewed-by: Yuan Li <carol.li2025@outlook.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Man Zeng <zengman@halodbtech.com>
Discussion: https://postgr.es/m/CALdSSPidtC7j3MwhkqRj0K2hyp36ztnnjSt6qzGxQtiePR1dzw@mail.gmail.com
Amit Kapila [Wed, 14 Jan 2026 07:13:35 +0000 (07:13 +0000)]
Prevent unintended dropping of active replication origins.
Commit 5b148706c5 exposed functionality that allows multiple processes to
use the same replication origin, enabling non-builtin logical replication
solutions to implement parallel apply for large transactions.
With this functionality, if two backends acquire the same replication
origin and one of them resets it first, the acquired_by flag is cleared
without acknowledging that another backend is still actively using the
origin. This can lead to the origin being unintentionally dropped. If the
shared memory for that dropped origin is later reused for a newly created
origin, the remaining backend that still holds a pointer to the old memory
may inadvertently advance the LSN of a completely different origin,
causing unpredictable behavior.
Although the underlying issue predates commit 5b148706c5, it did not
surface earlier because the internal parallel apply worker mechanism
correctly coordinated origin resets and drops.
This commit resolves the problem by introducing a reference counter for
replication origins. The reference count increases when a backend sets the
origin and decreases when it resets it. Additionally, the backend that
first acquires the origin will not release it until all other backends
using the origin have released it as well.
The patch also prevents dropping a replication origin when acquired_by is
zero but the reference counter is nonzero, covering the scenario where the
first session exits without properly releasing the origin.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/TY4PR01MB169077EE72ABE9E55BAF162D494B5A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/CAMPB6wfe4zLjJL8jiZV5kjjpwBM2=rTRme0UCL7Ra4L8MTVdOg@mail.gmail.com
Michael Paquier [Wed, 14 Jan 2026 07:02:30 +0000 (16:02 +0900)]
pg_waldump: Relax LSN comparison check in TAP test
The test 002_save_fullpage.pl, checking --save-fullpage fails with
wal_consistency_checking enabled, due to the fact that the block saved
in the file has the same LSN as the LSN used in the file name. The test
required that the block LSN is stritly lower than file LSN. This commit
relaxes the check a bit, by allowing the LSNs to match.
While on it, the test name is reworded to include some information about
the file and block LSNs, which is useful for debugging.
Andres Freund [Wed, 14 Jan 2026 00:38:29 +0000 (19:38 -0500)]
bufmgr: Make definitions related to buffer descriptor easier to modify
This is in preparation to widening the buffer state to 64 bits, which in turn
is preparation for implementing content locks in bufmgr. This commit aims to
make the subsequent commits a bit easier to review, by separating out
reformatting etc from the actual changes.
Andres Freund [Wed, 14 Jan 2026 00:38:29 +0000 (19:38 -0500)]
lwlock: Invert meaning of LW_FLAG_RELEASE_OK
Previously, a flag was set to indicate that a lock release should wake up
waiters. Since waking waiters is the default behavior in the majority of
cases, this logic has been inverted. The new LW_FLAG_WAKE_IN_PROGRESS flag is
now set iff wakeups are explicitly inhibited.
The motivation for this change is that in an upcoming commit, content locks
will be implemented independently of lwlocks, with the lock state stored as
part of BufferDesc.state. As all of a buffer's flags are cleared when the
buffer is invalidated, without this change we would have to re-add the
RELEASE_OK flag after clearing the flags; otherwise, the next lock release
would not wake waiters.
It seems good to keep the implementation of lwlocks and buffer content locks
as similar as reasonably possible.
Michael Paquier [Tue, 13 Jan 2026 23:44:12 +0000 (08:44 +0900)]
Fix query jumbling with GROUP BY clauses
RangeTblEntry.groupexprs was marked with the node attribute
query_jumble_ignore, causing a list of GROUP BY expressions to be
ignored during the query jumbling. For example, these two queries could
be grouped together within the same query ID:
SELECT count(*) FROM t GROUP BY a;
SELECT count(*) FROM t GROUP BY b;
However, as such queries use different GROUP BY clauses, they should be
split across multiple entries.
This fixes an oversight in 247dea89f761, that has introduced an RTE for
GROUP BY clauses. Query IDs are documented as being stable across minor
releases, but as this is a regression new to v18 and that we are still
early in its support cycle, a backpatch is exceptionally done as this
has broken a behavior that exists since query jumbling is supported in
core, since its introduction in pg_stat_statements.
The tests of pg_stat_statements are expanded to cover this area, with
patterns involving GROUP BY and GROUPING clauses.
Author: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxEy2W+tCqC7XuJ94r3ivWsM=onKJp94kRFx3hoARjBeFQ@mail.gmail.com
Backpatch-through: 18
Fujii Masao [Tue, 13 Jan 2026 13:54:45 +0000 (22:54 +0900)]
doc: Document DEFAULT option in file_fdw.
Commit 9f8377f7a introduced the DEFAULT option for file_fdw but did not
update the documentation. This commit adds the missing description of
the DEFAULT option to the file_fdw documentation.
Backpatch to v16, where the DEFAULT option was introduced.
Álvaro Herrera [Tue, 13 Jan 2026 09:03:33 +0000 (10:03 +0100)]
Fix test_misc/010_index_concurrently_upsert for cache-clobbering builds
The test script added by commit e1c971945d62 failed to handle the case
of cache-clobbering builds (CLOBBER_CACHE_ALWAYS and
CATCACHE_FORCE_RELEASE) properly -- it would only exit a loop on
timeout, which is slow, and unfortunate because I (Álvaro) increased the
timeout for that loop to the complete default TAP test timeout, causing
the buildfarm to report the whole test run as a timeout failure. We can
be much quicker: exit the loop as soon as the backend is seen as waiting
on the injection point.
In this commit we still reduce the timeout (of that loop and a nearby
one just to be safe) to half of the default.
I (Álvaro) had also changed Mihail's "sleep(1)" to "sleep(0.1)", which
apparently turns a 1s sleep into a 0s sleep, because Perl -- probably
making this a busy loop. Use Time::HiRes::usleep instead, like we do in
other tests.
Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CADzfLwWOVyJygX6BFuyuhTKkJ7uw2e8OcVCDnf6iqnOFhMPE%2BA%40mail.gmail.com
John Naylor [Tue, 13 Jan 2026 05:33:08 +0000 (12:33 +0700)]
Improve some comment wording and grammar in extension.c
Noted while looking at reports of grammatical errors.
Reported-by: albert tan <alterttan1223@gmail.com> Reported-by: Yuan Li(carol) <carol.li2025@outlook.com>
Discussion: https://postgr.es/m/CAEzortnJB7aue6miGT_xU2KLb3okoKgkBe4EzJ6yJ%3DY8LMB7gw%40mail.gmail.com
Andres Freund [Mon, 12 Jan 2026 18:14:58 +0000 (13:14 -0500)]
heapam: Add batch mode mvcc check and use it in page mode
There are two reasons for doing so:
1) It is generally faster to perform checks in a batched fashion and making
sequential scans faster is nice.
2) We would like to stop setting hint bits while pages are being written
out. The necessary locking becomes visible for page mode scans, if done for
every tuple. With batching, the overhead can be amortized to only happen
once per page.
There are substantial further optimization opportunities along these
lines:
- Right now HeapTupleSatisfiesMVCCBatch() simply uses the single-tuple
HeapTupleSatisfiesMVCC(), relying on the compiler to inline it. We could
instead write an explicitly optimized version that avoids repeated xid
tests.
- Introduce batched version of the serializability test
- Introduce batched version of HeapTupleSatisfiesVacuum
Andres Freund [Mon, 12 Jan 2026 16:50:05 +0000 (11:50 -0500)]
heapam: Use exclusive lock on old page in CLUSTER
To be able to guarantee that we can set the hint bit, acquire an exclusive
lock on the old buffer. This is required as a future commit will only allow
hint bits to be set with a new lock level, which is acquired as-needed in a
non-blocking fashion.
We need the hint bits, set in heapam_relation_copy_for_cluster() ->
HeapTupleSatisfiesVacuum(), to be set, as otherwise reform_and_rewrite_tuple()
-> rewrite_heap_tuple() will get confused. Specifically, rewrite_heap_tuple()
checks for HEAP_XMAX_INVALID in the old tuple to determine whether to check
the old-to-new mapping hash table.
It'd be better if we somehow could avoid setting hint bits on the old page. A
common reason to use VACUUM FULL is very bloated tables - rewriting most of
the old table during VACUUM FULL doesn't exactly help.
Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/4wggb7purufpto6x35fd2kwhasehnzfdy3zdcu47qryubs2hdz@fa5kannykekr
Andres Freund [Mon, 12 Jan 2026 16:32:17 +0000 (11:32 -0500)]
freespace: Don't modify page without any lock
Before this commit fsm_vacuum_page() modified the page without any lock on the
page. Historically that was kind of ok, as we didn't rely on the freespace to
really stay consistent and we did not have checksums. But these days pages are
checksummed and there are ways for FSM pages to be included in WAL records,
even if the FSM itself is still not WAL logged. If a FSM page ever were
modified while a WAL record referenced that page, we'd be in trouble, as the
WAL CRC could end up getting corrupted.
The reason to address this right now is a series of patches with the goal to
only allow modifications of pages with an appropriate lock level. Obviously
not having any lock is not appropriate :)
Álvaro Herrera [Mon, 12 Jan 2026 17:09:36 +0000 (18:09 +0100)]
Stop including {brin,gin}_tuple.h in tuplesort.h
Doing this meant that those two headers, which are supposed to be
internal to their corresponding index AMs, were being included pretty
much universally, because tuplesort.h is included by execnodes.h which
is very widely used. Stop that, and fix fallout.
We also change indexing.h to no longer include execnodes.h (tuptable.h
is sufficient), and relscan.h to no longer include buf.h (pointless
since c2fe139c201c).
Author: Mario González <gonzalemario@gmail.com>
Discussion: https://postgr.es/m/CAFsReFUcBFup=Ohv_xd7SNQ=e73TXi8YNEkTsFEE2BW7jS1noQ@mail.gmail.com
Álvaro Herrera [Mon, 12 Jan 2026 15:59:28 +0000 (16:59 +0100)]
Move instrumentation-related structs to instrument_node.h
Some structs and enums related to parallel query instrumentation had
organically grown scattered across various files, and were causing
header pollution especially through execnodes.h. Create a single file
where they can live together.
This only moves the structs to the new file; cleaning up the pollution
by removing no-longer-necessary cross-header inclusion will be done in
future commits.
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de> Co-authored-by: Mario González <gonzalemario@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/202510051642.wwmn4mj77wch@alvherre.pgsql
Discussion: https://postgr.es/m/CAFsReFUr4KrQ60z+ck9cRM4WuUw1TCghN7EFwvV0KvuncTRc2w@mail.gmail.com
Peter Eisentraut [Mon, 12 Jan 2026 15:12:56 +0000 (16:12 +0100)]
Avoid casting void * function arguments
In many cases, the cast would silently drop a const qualifier. To
fix, drop the unnecessary cast and let the compiler check the types
and qualifiers. Add const to read-only local variables, preserving
the const qualifiers from the function signatures.
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/aUQHy/MmWq7c97wK%40ip-10-97-1-34.eu-west-3.compute.internal
Peter Eisentraut [Mon, 12 Jan 2026 13:26:26 +0000 (14:26 +0100)]
Add const to read only TableInfo pointers in pg_dump
Functions that dump table data receive their parameters through const
void * but were casting away const. Add const qualifiers to functions
that only read the table information.
Peter Eisentraut [Mon, 12 Jan 2026 07:35:48 +0000 (08:35 +0100)]
Make dmetaphone collation-aware
The dmetaphone() SQL function internally upper-cases the argument
string. It did this using the toupper() function. That way, it has a
dependency on the global LC_CTYPE locale setting, which we want to get
rid of.
The "double metaphone" algorithm specifically supports the "C with
cedilla" letter, so just using ASCII case conversion wouldn't work.
To fix that, use the passed-in collation and use the str_toupper()
function, which has full awareness of collations and collation
providers.
Note that this does not change the fact that this function only works
correctly with single-byte encodings. The change to str_toupper()
makes the case conversion multibyte-enabled, but the rest of the
function is still not ready.
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://www.postgresql.org/message-id/108e07a2-0632-4f00-984d-fe0e0d0ec726%40eisentraut.org
Michael Paquier [Sun, 11 Jan 2026 06:24:02 +0000 (15:24 +0900)]
doc: Improve description of pg_restore --jobs
The parameter name used for the option value was named "number-of-jobs",
which was inconsistent with what all the other tools with an option
called --jobs use. This commit updates the parameter name to "njobs".
Andres Freund [Fri, 9 Jan 2026 17:10:53 +0000 (12:10 -0500)]
instrumentation: Keep time fields as instrtime, convert in callers
Previously the instrumentation logic always converted to seconds, only for
many of the callers to do unnecessary division to get to milliseconds. As an
upcoming refactoring will split the Instrumentation struct, utilize instrtime
always to keep things simpler. It's also a bit faster to not have to first
convert to a double in functions like InstrEndLoop(), InstrAggNode().
Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53PkzZ3UotnRrrnXWAv=F4avRq9MQ8zU+bxoN9tpovEu6fGQ@mail.gmail.com
Jacob Champion [Fri, 9 Jan 2026 17:59:15 +0000 (09:59 -0800)]
doc: Improve description of publish_via_partition_root
Reword publish_via_partition_root's opening paragraph. Describe its
behavior more clearly, and directly state that its default is false.
Per complaint by Peter Smith; final text of the patch made in
collaboration with Chao Li.
Author: Chao Li <li.evan.chao@gmail.com>
Author: Peter Smith <peter.b.smith@fujitsu.com> Reported-by: Peter Smith <peter.b.smith@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAHut%2BPu7SpK%2BctOYoqYR3V4w5LKc9sCs6c_qotk9uTQJQ4zp6g%40mail.gmail.com
Backpatch-through: 14
Tom Lane [Fri, 9 Jan 2026 17:59:35 +0000 (12:59 -0500)]
Improve "constraint must include all partitioning columns" message.
This formerly said "unique constraint must ...", which was accurate
enough when it only applied to UNIQUE and PRIMARY KEY constraints.
However, now we use it for exclusion constraints too, and in that
case it's a tad confusing. Do what we already did in the errdetail
message: print the constraint_type, so that it looks like "UNIQUE
constraint ...", "EXCLUDE constraint ...", etc.
Author: jian he <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACJufxH6VhAf65Vghg4T2q315gY=Rt4BUfMyunkfRj0n2S9n-g@mail.gmail.com
Nathan Bossart [Fri, 9 Jan 2026 16:12:54 +0000 (10:12 -0600)]
pg_dump: Fix gathering of sequence information.
Since commit bd15b7db48, pg_dump uses pg_get_sequence_data() (née
pg_sequence_read_tuple()) to gather all sequence data in a single
query as opposed to a query per sequence. Two related bugs have
been identified:
* If the user lacks appropriate privileges on the sequence, pg_dump
generates a setval() command with garbage values instead of
failing as expected.
* pg_dump can fail due to a concurrently dropped sequence, even if
the dropped sequence's data isn't part of the dump.
This commit fixes the above issues by 1) teaching
pg_get_sequence_data() to return nulls instead of erroring for a
missing sequence and 2) teaching pg_dump to fail if it tries to
dump the data of a sequence for which pg_get_sequence_data()
returned nulls. Note that pg_dump may still fail due to a
concurrently dropped sequence, but it should now only do so when
the sequence data is part of the dump. This matches the behavior
before commit bd15b7db48.
Bug: #19365 Reported-by: Paveł Tyślacki <pavel.tyslacki@gmail.com> Suggested-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19365-6245240d8b926327%40postgresql.org
Discussion: https://postgr.es/m/2885944.1767029161%40sss.pgh.pa.us
Backpatch-through: 18
Decouple C++ support in Meson's PGXS from LLVM enablement
This is important for Postgres extensions that are written in C++,
such as pg_duckdb, which uses PGXS as the build system currently. In
the autotools build, C++ is not coupled to LLVM. If the autotools
build is configured without --with-llvm, the C++ compiler and the
various flags get persisted into the Makefile.global.
In future commits, we'll start to make improvements to C++
infrastructure. This requires that for our 32-bit builds we also
build C++ code for the 32-bit architecture.
Since CPP is also used to mean C PreProcessor in our meson.build
files, it's confusing to use it to also mean C PlusPlus. This uses
the non-ambiguous cxx wherever possible.
David Rowley [Thu, 8 Jan 2026 22:01:36 +0000 (11:01 +1300)]
Fix possible incorrect column reference in ERROR message
When creating a partition for a RANGE partitioned table, the reporting
of errors relating to converting the specified range values into
constant values for the partition key's type could display the name of a
previous partition key column when an earlier range was specified as
MINVALUE or MAXVALUE.
This was caused by the code not correctly incrementing the index that
tracks which partition key the foreach loop was working on after
processing MINVALUE/MAXVALUE ranges.
Fix by using foreach_current_index() to ensure the index variable is
always set to the List element being worked on.
Tom Lane [Thu, 8 Jan 2026 19:09:58 +0000 (14:09 -0500)]
Remove now-useless btree_gist--1.2.sql script.
In the wake of the previous commit, this script will fail
if executed via CREATE EXTENSION, so it's useless. Remove it,
but keep the delta scripts, allowing old (even very old)
versions of the btree_gist SQL objects to be upgraded to 1.9
via ALTER EXTENSION UPDATE after a pg_upgrade.
Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/2483812.1754072263@sss.pgh.pa.us
Tom Lane [Thu, 8 Jan 2026 19:03:56 +0000 (14:03 -0500)]
Mark GiST inet_ops as opcdefault, and deal with ensuing fallout.
This patch completes the transition to making inet_ops be default for
inet/cidr columns, rather than btree_gist's opclasses. Once we do
that, though, pg_upgrade has a big problem. A dump from an older
version will see btree_gist's opclasses as being default, so it will
not mention the opclass explicitly in CREATE INDEX commands, which
would cause the restore to create the indexes using inet_ops. Since
that's not compatible with what's actually in the files, havoc would
ensue.
This isn't readily fixable, because the CREATE INDEX command strings
are built by the older server's pg_get_indexdef() function; pg_dump
hasn't nearly enough knowledge to modify those strings successfully.
Even if we cared to put in the work to make that happen in pg_dump,
it would be counterproductive because the end goal here is to get
people off of these opclasses. Allowing such indexes to persist
through pg_upgrade wouldn't advance that goal.
Therefore, this patch just adds code to pg_upgrade to detect indexes
that would be problematic and refuse to upgrade.
There's another issue too: even without any indexes to worry about,
pg_dump in binary-upgrade mode will reproduce the "CREATE OPERATOR
CLASS ... DEFAULT" commands for btree_gist's opclasses, and those
will fail because now we have a built-in opclass that provides a
conflicting default. We could ask users to drop the btree_gist
extension altogether before upgrading, but that would carry very
severe penalties. It would affect perfectly-valid indexes for other
data types, and it would drop operators that might be relied on in
views or other database objects. Instead, put a hack in DefineOpClass
to ignore the DEFAULT clauses for these opclasses when in
binary-upgrade mode. This will result in installing a version of
btree_gist that isn't quite the version it claims to be, but that can
be fixed by issuing ALTER EXTENSION UPDATE afterwards.
Since we don't apply that hack when not in binary-upgrade mode,
it is now impossible to install any version of btree_gist
less than 1.9 via CREATE EXTENSION.
Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/2483812.1754072263@sss.pgh.pa.us
Tom Lane [Thu, 8 Jan 2026 18:56:08 +0000 (13:56 -0500)]
Create btree_gist v1.9, in which inet/cidr opclasses aren't default.
btree_gist's gist_inet_ops and gist_cidr_ops opclasses are
fundamentally broken: they rely on an approximate representation of
the inet values and hence sometimes miss rows they should return.
We want to eventually get rid of them altogether, but as the first
step on that journey, we should mark them not-opcdefault.
To do that, roll up the preceding deltas since 1.2 into a new base
script btree_gist--1.9.sql. This will allow installing 1.9 without
going through a transient situation where gist_inet_ops and
gist_cidr_ops are marked as opcdefault; trying to create them that
way will fail if there's already a matching default opclass in the
core system. Additionally provide btree_gist--1.8--1.9.sql, so
that a database that's been pg_upgraded from an older version can
be migrated to 1.9.
I noted along the way that commit 57e3c5160 had missed marking the
gist_bool_ops support functions as PARALLEL SAFE. While that probably
has little harmful effect (since AFAIK we don't check that when
calling index support functions), this seems like a good time to make
things consistent.
Readers will also note that I removed the former habit of installing
some opclass operators/functions with ALTER OPERATOR FAMILY, instead
just rolling them all into the CREATE OPERATOR CLASS steps. The
comment in btree_gist--1.2.sql that it's necessary to use ALTER for
pg_upgrade reproducibility has been obsolete since we invented the
amadjustmembers infrastructure. Nowadays, gistadjustmembers will
force all operators and non-required support functions to have "soft"
opfamily dependencies, regardless of whether they are installed by
CREATE or ALTER.
Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/2483812.1754072263@sss.pgh.pa.us
The only user-visible change is the fix in the "malformed
pg_dependencies" error detail. That one is new in commit e1405aa5e3ac,
so no backpatching required.
rindex() has been removed from POSIX 2008. Replace the one remaining
use with the equivalent and more standard strrchr().
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/98ce805c-6103-421b-adc3-fcf8f3dddbe3%40eisentraut.org
Remove all configure checks and workarounds for strnlen() missing. It
is required by POSIX 2008.
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/98ce805c-6103-421b-adc3-fcf8f3dddbe3%40eisentraut.org
Michael Paquier [Thu, 8 Jan 2026 01:12:33 +0000 (10:12 +0900)]
pg_createsubscriber: Improve handling of automated recovery configuration
When repurposing a standby to a logical replica, pg_createsubscriber
uses for the new replica a set of configuration parameters saved into
postgresql.auto.conf, to force recovery patterns when the physical
replica is promoted.
While not wrong in practice, this approach can cause issues when forcing
again recovery on a logical replica or its base backup as the recovery
parameters are not reset on the target server once pg_createsubscriber
is done with the node.
This commit aims at improving the situation, by changing the way
recovery parameters are saved on the target node. Instead of writing
all the configuration to postgresql.auto.conf, this file now uses an
include_if_exists, that points to a pg_createsubscriber.conf. This new
file contains all the recovery configuration, and is renamed to
pg_createsubscriber.conf.disabled when pg_createsubscriber exits. This
approach resets the recovery parameters, and offers the benefit to keep
a trace of the setup used when the target node got promoted, for
debugging purposes. If pg_createsubscriber.conf cannot be renamed
(unlikely scenario), a warning is issued to inform users that a manual
intervention may be required to reset this configuration.
This commit includes a test case to demonstrate the problematic case: a
standby node created from a base backup of what was the target node of
pg_createsubscriber does not get confused when started. If removing
this new logic, the test fails with the standby not able to start due
to an incorrect recovery target setup, where the startup process fails
quickly with a FATAL.
I have provided the design idea for the patch, that Alyona has written
(with some code adjustments from me). This could be considered as a
bug, but after discussion this is put into the bucket for improvements.
Redesigning pg_createsubscriber would not be acceptable in the stable
branches anyway.
Author: Alyona Vinter <dlaaren8@gmail.com> Reviewed-by: Ilyasov Ian <ianilyasov@outlook.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andrey Rudometov <unlimitedhikari@gmail.com>
Discussion: https://postgr.es/m/CAGWv16K6L6Pzm99i1KiXLjFWx2bUS3DVsR6yV87-YR9QO7xb3A@mail.gmail.com
Masahiko Sawada [Thu, 8 Jan 2026 00:22:42 +0000 (16:22 -0800)]
psql: Add tab completion for pstdin and pstdout in \copy.
This commit adds tab completion support for the keywords "pstdin" and
"pstdout" in the \copy command. "pstdin" is now suggested after FROM,
and "pstdout" is suggested after TO, alongside filenames and other
keywords.
Álvaro Herrera [Wed, 7 Jan 2026 22:44:57 +0000 (19:44 -0300)]
Replace flaky CIC/RI isolation tests with a TAP test
The isolation tests for INSERT ON CONFLICT behavior during CREATE INDEX
CONCURRENTLY and REINDEX CONCURRENTLY (added by bc32a12e0db2, 2bc7e886fc1b, and 90eae926abbb) were disabled in 77038d6d0b49 due to
persistent CI flakiness, after several attempts at stabilization.
This commit removes them and introduces a TAP test in test_misc module
(010_index_concurrently_upsert.pl) that covers the same scenarios. This
new test should hopefully be more stable while providing assurance that
the fixes in all those commits (plus 81f72115cf18) continue to work.
Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reported-by: Andres Freund <andres@anarazel.de> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/ccssrhafzbp3a3beju3ptyc56a7gbfimj4vwkbokoldofckrc7@bso37rxskjtf
Discussion: https://postgr.es/m/CANtu0ogv+6wqRzPK241jik4U95s1pW3MCZ3rX5ZqbFdUysz7Qw@mail.gmail.com
Discussion: https://postgr.es/m/202512112014.icpomgc37zx4@alvherre.pgsql
Nathan Bossart [Wed, 7 Jan 2026 19:42:57 +0000 (13:42 -0600)]
MSVC: Support building for AArch64.
This commit does the following to get tests passing for
MSVC/AArch64:
* Implements spin_delay() with an ISB instruction (like we do for
gcc/clang on AArch64).
* Sets USE_ARMV8_CRC32C unconditionally. Vendor-supported versions
of Windows for AArch64 require at least ARMv8.1, which is where CRC
extension support became mandatory.
* Implements S_UNLOCK() with _InterlockedExchange(). The existing
implementation for MSVC uses _ReadWriteBarrier() (a compiler
barrier), which is insufficient for this purpose on non-TSO
architectures.
There are likely other changes required to take full advantage of
the hardware (e.g., atomics/arch-arm.h, simd.h,
pg_popcount_aarch64.c), but those can be dealt with later.
Author: Niyas Sait <niyas.sait@linaro.org> Co-authored-by: Greg Burd <greg@burd.me> Co-authored-by: Dave Cramer <davecramer@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Tested-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/A6152C7C-F5E3-4958-8F8E-7692D259FF2F%40greg.burd.me
Discussion: https://postgr.es/m/CAFPTBD-74%2BAEuN9n7caJ0YUnW5A0r-KBX8rYoEJWqFPgLKpzdg%40mail.gmail.com
Peter Geoghegan [Wed, 7 Jan 2026 17:53:07 +0000 (12:53 -0500)]
Fix nbtree skip array transformation comments.
Fix comments that incorrectly described transformations performed by the
"Avoid extra index searches through preprocessing" mechanism introduced
by commit b3f1a13f.
Author: Yugo Nagata <nagata@sraoss.co.jp> Reviewed-By: Chao Li <li.evan.chao@gmail.com> Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/20251230190145.c3c88c5eb0f88b136adda92f@sraoss.co.jp
Backpatch-through: 18
Fujii Masao [Wed, 7 Jan 2026 16:10:36 +0000 (01:10 +0900)]
doc: Remove deprecated clauses from CREATE USER/GROUP syntax synopsis.
The USER and IN GROUP clauses of CREATE ROLE are deprecated, and
commit 8e78f0a1 removed them from the CREATE ROLE syntax syntax
synopsis in the docs. However, previously CREATE USER and
CREATE GROUP docs still listed these clauses.
Since CREATE USER is equivalent to CREATE ROLE ... WITH LOGIN and
CREATE GROUP is equivalent to CREATE ROLE, their documented syntax
synopsis should match CREATE ROLE to avoid confusion.
Therefore this commit removes the deprecated USER and IN GROUP
clauses from the CREATE USER and CREATE GROUP syntax synopsis
in the docs.
Author: Japin Li <japinli@hotmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/MEAPR01MB3031C30E72EF16CFC08C8565B687A@MEAPR01MB3031.ausprd01.prod.outlook.com
Revert "Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()"
This reverts commit f30848cb05d4d63e1a5a2d6a9d72604f3b63370d. Due to
buildfarm failures related to recovery conflicts while executing the
WAIT FOR command. It appears that WAIT FOR still holds xmin for a while.
Reported-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BL0OQR8dQtsNBSaA3FNNyQeFabdeRVzurm0b7Xtp592g%40mail.gmail.com
John Naylor [Wed, 7 Jan 2026 09:02:19 +0000 (16:02 +0700)]
createuser: Update docs to reflect defaults
Commit c7eab0e97 changed the default password_encryption setting to
'scram-sha-256', so update the example for creating a user with an
assigned password.
In addition, commit 08951a7c9 added new options that in turn pass
default tokens NOBYPASSRLS and NOREPLICATION to the CREATE ROLE
command, so fix this omission as well for v16 and later.
Reported-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/cff1ea60-c67d-4320-9e33-094637c2c4fb%40iki.fi
Backpatch-through: 14
Michael Paquier [Wed, 7 Jan 2026 08:52:54 +0000 (17:52 +0900)]
Fix unexpected reversal of lists during catcache rehash
During catcache searches, the most-recently searched entries are kept at
the head of the list to speed up subsequent searches, keeping the
"freshest" entries at its beginning. A rehash of the catcache was doing
the opposite: fresh entries were moved to the tail of the newly-created
buckets, causing a rehash to slow down a bit.
When a rehash is done, this commit switches the code to use
dlist_push_tail() instead of dlist_push_head(), so as fresh entries are
kept at the head of the lists, not their tail.
Author: ChangAo Chen <cca5507@qq.com> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/tencent_9EA10D8512B5FE29E7323F780A0749768708@qq.com
Fujii Masao [Wed, 7 Jan 2026 04:58:07 +0000 (13:58 +0900)]
doc: Add glossary and index entries for GUC.
GUC is a commonly used term but previously appeared only
in the acronym documentation. This commit adds glossary and
documentation index entries for GUC to make it easier to find
and understand.
Author: Robert Treat <rob@xzilla.net> Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CABV9wwPQnkeo_G6-orMGnHPK9SXGVWm7ajJPzsbE6944tDx=hQ@mail.gmail.com
Michael Paquier [Wed, 7 Jan 2026 02:37:00 +0000 (11:37 +0900)]
Add data type oid8, 64-bit unsigned identifier
This new identifier type provides support for 64-bit unsigned values,
to be used in catalogs, like OIDs. An advantage of a new data type is
that it becomes easier to grep for it in the code when assigning this
type to a catalog attribute, linking it to dedicated APIs and internal
structures.
The following operators are added in this commit, with dedicated tests:
- Casts with integer types and OID.
- btree and hash operators
- min/max functions.
- C type with related macros and defines, named around "Oid8".
This has been mentioned as useful on its own on the thread to add
support for 64-bit TOAST values, so as it becomes possible to attach
this data type to the TOAST code and catalog definitions. However, as
this concept can apply to many more areas, it is implemented as its own
independent change. This is based on a discussion with Andres Freund
and Tom Lane.
Andres Freund [Wed, 7 Jan 2026 00:51:10 +0000 (19:51 -0500)]
Fix buggy interaction between array subscripts and subplan params
In a7f107df2 I changed subplan param evaluation to happen within the
containing expression. As part of that, ExecInitSubPlanExpr() was changed to
evaluate parameters via a new EEOP_PARAM_SET expression step. These parameters
were temporarily stored into ExprState->resvalue/resnull, with some reasoning
why that would be fine. Unfortunately, that analysis was wrong -
ExecInitSubscriptionRef() evaluates the input array into "resv"/"resnull",
which will often point to ExprState->resvalue/resnull. This means that the
EEOP_PARAM_SET, if inside an array subscript, would overwrite the input array
to array subscript.
The fix is fairly simple - instead of evaluating into
ExprState->resvalue/resnull, store the temporary result of the subplan in the
subplan's return value.
Bug: #19370 Reported-by: Zepeng Zhang <redraiment@gmail.com> Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Diagnosed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/19370-7fb7a5854b7618f1@postgresql.org
Backpatch-through: 18
Michael Paquier [Tue, 6 Jan 2026 11:24:17 +0000 (20:24 +0900)]
Improve portability of new worker_spi test
The new test 002_worker_terminate relies on the generation of a LOG
entry to check that a worker has been started, but missed the fact that
a node set with log_error_verbosity = verbose would add an error code.
The regexp used for the matching check did not take this case into
account, making the test fail on a timeout. The regexp is now fixed to
handle the verbose case correctly.
Per buildfarm member prion, that uses log_error_verbosity = verbose.
The error was reproducible by setting this GUC the same way in the test.
These tests cover nested arrays of composite data types,
single-argument functions, and casting using dot-notation, providing a
baseline for future enhancements to jsonb dot-notation support.
Author: Alexandra Wang <alexandra.wang.oss@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAK98qZ1JNNAx4QneJG+eX7iLesOhd6A68FNQVvvHP6Up_THf3A@mail.gmail.com
Michael Paquier [Tue, 6 Jan 2026 07:17:59 +0000 (16:17 +0900)]
Use relation_close() more consistently in contrib/
All the code paths updated here have been using index_close() to
close a relation that was opened with relation_open(), in pgstattuple
and pageinspect. index_close() does the same thing as relation_close(),
so there is no harm, but being inconsistent could lead to issues if the
internals of these close() functions begin to introduce some specific
logic in the future.
In passing, this commit adds some comments explaining why we are using
relation_open() instead of index_open() in a few places, which is due to
the fact that partitioned indexes are not allowed in these functions.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/aUKamYGiDKO6byp5@ip-10-97-1-34.eu-west-3.compute.internal
Michael Paquier [Tue, 6 Jan 2026 05:24:29 +0000 (14:24 +0900)]
Allow bgworkers to be terminated for database-related commands
Background workers gain a new flag, called BGWORKER_INTERRUPTIBLE, that
offers the possibility to terminate the workers when these are connected
to a database that is involved in one of the following commands:
ALTER DATABASE RENAME TO
ALTER DATABASE SET TABLESPACE
CREATE DATABASE
DROP DATABASE
This is useful to give background workers the same behavior as backends
and autovacuum workers, which are stopped when these commands are
executed. The default behavior, that exists since 9.3, is still to
never terminate bgworkers connected to the database involved in any of
these commands. The new flag has to be set to terminate the workers.
A couple of tests are added to worker_spi to track the commands that
impact the termination of the workers. There is a test case for a
non-interruptible worker, additionally, that relies on an injection
point to make the wait time in CountOtherDBBackends() reduced from 5s to
0.3s for faster test runs. The tests rely on the contents of the server
logs to check if a worker has been started or terminated:
- LOG generated by worker_spi_main() at startup, once connection to
database is done.
- FATAL in bgworker_die() when terminated.
A couple of tests run in the CI have showed that this method is stable
enough. The safe_psql() calls that scan pg_stat_activity could be
replaced with some poll_query_until() for more stability, if the current
method proves to be an issue in the buildfarm.
Author: Aya Iwata <iwata.aya@fujitsu.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Discussion: https://postgr.es/m/OS7PR01MB11964335F36BE41021B62EAE8EAE4A@OS7PR01MB11964.jpnprd01.prod.outlook.com
Amit Kapila [Tue, 6 Jan 2026 05:02:25 +0000 (05:02 +0000)]
Update comments atop ReplicationSlotCreate.
Since commit 1462aad2e4, which introduced the ability to modify the
two_phase property of a slot, the comments above ReplicationSlotCreate
have become outdated. We have now added a cautionary note in the comments
above ReplicationSlotAlter explaining when it is safe to modify the
two_phase property of a slot.
Author: Daniil Davydov <3danissimo@gmail.com>
Author: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/CAJDiXggZXQZ7bD0QcTizDt6us9aX6ZKK4dWxzgb5x3+TsVHjqQ@mail.gmail.com
David Rowley [Tue, 6 Jan 2026 04:29:12 +0000 (17:29 +1300)]
Fix issue with EVENT TRIGGERS and ALTER PUBLICATION
When processing the "publish" options of an ALTER PUBLICATION command,
we call SplitIdentifierString() to split the options into a List of
strings. Since SplitIdentifierString() modifies the delimiter
character and puts NULs in their place, this would overwrite the memory
of the AlterPublicationStmt. Later in AlterPublicationOptions(), the
modified AlterPublicationStmt is copied for event triggers, which would
result in the event trigger only seeing the first "publish" option
rather than all options that were specified in the command.
To fix this, make a copy of the string before passing to
SplitIdentifierString().
Here we also adjust a similar case in the pgoutput plugin. There's no
known issues caused by SplitIdentifierString() here, so this is being
done out of paranoia.
Thanks to Henson Choi for putting together an example case showing the
ALTER PUBLICATION issue.
Fujii Masao [Tue, 6 Jan 2026 02:57:12 +0000 (11:57 +0900)]
Add TAP test for GUC settings passed via CONNECTION in logical replication.
Commit d926462d819 restored the behavior of passing GUC settings from
the CONNECTION string to the publisher's walsender, allowing per-connection
configuration.
This commit adds a TAP test to verify that behavior works correctly.
Since commit d926462d819 was recently applied and backpatched to v15,
this follow-up commit is also backpatched accordingly.
Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Chao Li <lic@highgo.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/CAHGQGwGYV+-abbKwdrM2UHUe-JYOFWmsrs6=QicyJO-j+-Widw@mail.gmail.com
Backpatch-through: 15
Fujii Masao [Tue, 6 Jan 2026 02:52:22 +0000 (11:52 +0900)]
Honor GUC settings specified in CREATE SUBSCRIPTION CONNECTION.
Prior to v15, GUC settings supplied in the CONNECTION clause of
CREATE SUBSCRIPTION were correctly passed through to
the publisher's walsender. For example:
would cause wal_sender_timeout to take effect on the publisher's walsender.
However, commit f3d4019da5d changed the way logical replication
connections are established, forcing the publisher's relevant
GUC settings (datestyle, intervalstyle, extra_float_digits) to
override those provided in the CONNECTION string. As a result,
from v15 through v18, GUC settings in the CONNECTION string were
always ignored.
This regression prevented per-connection tuning of logical replication.
For example, using a shorter timeout for walsender connecting
to a nearby subscriber and a longer one for walsender connecting
to a remote subscriber.
This commit restores the intended behavior by ensuring that
GUC settings in the CONNECTION string are again passed through
and applied by the walsender, allowing per-connection configuration.
Backpatch to v15, where the regression was introduced.
Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Chao Li <lic@highgo.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/CAHGQGwGYV+-abbKwdrM2UHUe-JYOFWmsrs6=QicyJO-j+-Widw@mail.gmail.com
Backpatch-through: 15
David Rowley [Tue, 6 Jan 2026 02:25:13 +0000 (15:25 +1300)]
Simplify GetOperatorFromCompareType() code
The old code would set *opid = InvalidOid to determine if the
get_opclass_opfamily_and_input_type() worked or not. This means more
moving parts that what's really needed here. Let's just fail
immediately if the get_opclass_opfamily_and_input_type() lookup fails.
Author: Paul A Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Neil Chen <carpenter.nail.cz@gmail.com>
Discussion: https://postgr.es/m/CA+renyXOrjLacP_nhqEQUf2W+ZCoY2q5kpQCfG05vQVYzr8b9w@mail.gmail.com
David Rowley [Tue, 6 Jan 2026 02:16:14 +0000 (15:16 +1300)]
Fix misleading comment for GetOperatorFromCompareType
The comment claimed *strat got set to InvalidStrategy when the function
lookup fails. This isn't true; an ERROR is raised when that happens.
Author: Paul A Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Neil Chen <carpenter.nail.cz@gmail.com>
Discussion: https://postgr.es/m/CA+renyXOrjLacP_nhqEQUf2W+ZCoY2q5kpQCfG05vQVYzr8b9w@mail.gmail.com
Backpatch-through: 18
Tom Lane [Mon, 5 Jan 2026 21:51:36 +0000 (16:51 -0500)]
Fix meson build of snowball code.
include/snowball/libstemmer has to be in the -I search path,
as it is in the autoconf build. It's not apparent to me how
this ever worked before, nor why my recent commit made it
stop working.
Tom Lane [Mon, 5 Jan 2026 20:22:12 +0000 (15:22 -0500)]
Update to latest Snowball sources.
It's been almost a year since we last did this, and upstream has
been busy. They've added stemmers for Polish and Esperanto,
and also deprecated their old Dutch stemmer in favor of the
Kraaij-Pohlmann algorithm. (The "dutch" stemmer is now the
latter, and "dutch_porter" is the old algorithm.)
Upstream also decided to rename their internal header "header.h"
to something less generic: "snowball_runtime.h". Seems like a good
thing, but it complicates this patch a bit because we were relying on
interposing our own version of "header.h" to control system header
inclusion order. (We're partially failing at that now, because now the
generated stemmer files include <stddef.h> before snowball_runtime.h.
I think that'll be okay, but if the buildfarm complains then we'll
have to do more-extensive editing of the generated files.)
I realized that we weren't documenting the available stemmers in
any user-visible place, except indirectly through sample \dFd output.
That's incomplete because we only provide built-in dictionaries for
the recommended stemmers for each language, not alternative stemmers
such as dutch_porter. So I added a list to the documentation.
I did not do anything with the stopword lists. If those are still
available from snowballstem.org, they are mighty well hidden.
Andres Freund [Mon, 5 Jan 2026 18:09:03 +0000 (13:09 -0500)]
ci: Remove ulimit -p for netbsd/openbsd
Previously the ulimit -p 256 was needed to increase the limit on
openbsd. However, sometimes the limit actually was too low, causing
"could not fork new process for connection: Resource temporarily unavailable"
errors. Most commonly on netbsd, but also on openbsd.
The ulimit on openbsd couldn't trivially be increased with ulimit, because of
hitting the hard limit.
Instead of increasing the limit in the CI script, the CI image generation now
increases the limits: https://github.com/anarazel/pg-vm-images/pull/129