This view contains one row for each table in the current database,
showing the current autovacuum scores for that specific table. It
also shows whether autovacuum would vacuum or analyze the table.
Bumps catversion.
Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
Use PG_DATA_CHECKSUM_OFF instead of hardcoded value
For a long time, the online checksums patchset kept the "off" state as
literal zero without a label to be consistent with the previous coding
which only had a label for the "on" state. Later, when an "off" label
was made not all uses in the code got the memo. Fix by setting these
to PG_DATA_CHECKSUM_OFF.
While there, fix a duplicate word in a comment introduced by the same
commit.
Author: Aleksander Alekseev <aleksander@tigerdata.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CAJ7c6TPRTnQFXXX1CRcYoTLXw2swtDH==uSz1MYoMKdLrKZHjA@mail.gmail.com
When this flag is specified, REPACK no longer acquires access-exclusive
lock while the new copy of the table is being created; instead, it
creates the initial copy under share-update-exclusive lock only (same as
vacuum, etc), and it follows an MVCC snapshot; it sets up a replication
slot starting at that snapshot, and uses a concurrent background worker
to do logical decoding starting at the snapshot to populate a stash of
concurrent data changes. Those changes can then be re-applied to the
new copy of the table just before swapping the relfilenodes.
Applications can continue to access the original copy of the table
normally until just before the swap, which is the only point at which
the access-exclusive lock is needed.
There are some loose ends in this commit:
1. concurrent repack needs its own replication slot in order to apply
logical decoding, which are a scarce resource and easy to run out of.
2. due to the way the historic snapshot is initially set up, only one
REPACK process can be running at any one time on the whole system.
3. there's a danger of deadlocking (and thus abort) due to the lock
upgrade required at the final phase.
These issues will be addressed in upcoming commits.
The design and most of the code are by Antonin Houska, heavily based on
his own pg_squeeze third-party implementation.
Author: Antonin Houska <ah@cybertec.at> Co-authored-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Noriyoshi Shinoda <noriyoshi.shinoda@hpe.com> Reviewed-by: vignesh C <vignesh21@gmail.com>
Discussion: https://postgr.es/m/5186.1706694913@antos
Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql
Document that WAIT FOR may be interrupted by recovery conflicts
Add a note to the WAIT FOR documentation explaining that sessions
using this command on a standby server may be interrupted by recovery
conflicts. Some conflicts are unavoidable - for example, replaying
a tablespace drop terminates all backends unconditionally.
Discussion: https://postgr.es/m/CAPpHfds7oSCbZqob7ytT_Lso8fv-NW8LnedUTE4Krde%2B3rkJeA%40mail.gmail.com
Author: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()
When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes. This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.
The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
Rather than pre-checking pg_is_in_recovery() on the standby (which
would add an extra round-trip on every call), we issue WAIT FOR LSN
directly and handle the 'not in recovery' result as a signal to fall
back to polling.
For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), when the standby has been
promoted, or when WAIT FOR LSN is interrupted by a recovery conflict,
the function falls back to the original polling-based approach using
pg_stat_replication on the upstream. The recovery conflict fallback
is necessary because some conflicts are unavoidable - for example,
ResolveRecoveryConflictWithTablespace() kills all backends
unconditionally, regardless of what they are doing.
The recovery conflict detection matches the English error message
"conflict with recovery", which is reliable because the test suite
runs with LC_MESSAGES=C.
Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
Avoid syscache lookup while building a WAIT FOR tuple descriptor
Use TupleDescInitBuiltinEntry instead of TupleDescInitEntry when building
the result tuple descriptor for the WAIT FOR command. This avoids a syscache
access that could re-establish a catalog snapshot after we've explicitly
released all snapshots before the wait.
Discussion: https://postgr.es/m/CABPTF7U%2BSUnJX_woQYGe%3D%3DR9Oz%2B-V6X0VO2stBLPGfJmH_LEhw%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
This function is a thin wrapper around relation_needs_vacanalyze()
that handles fetching and freeing the pgstat entry for the table.
Since all callers of relation_needs_vacanalyze() do that anyway, we
can teach that function to fetch/free the pgstat entry and use it
instead.
Robert Haas [Mon, 6 Apr 2026 19:09:24 +0000 (15:09 -0400)]
auto_explain: Add new GUC, auto_explain.log_extension_options.
The associated value should look like something that could be
part of an EXPLAIN options list, but restricted to EXPLAIN options
added by extensions.
For example, if pg_overexplain is loaded, you could set
auto_explain.log_extension_options = 'DEBUG, RANGE_TABLE'.
You can also specify arguments to these options in the same manner
as normal e.g. 'DEBUG 1, RANGE_TABLE false'.
Tom Lane [Mon, 6 Apr 2026 19:16:21 +0000 (15:16 -0400)]
Support more object types within CREATE SCHEMA.
Having rejected the principle that we should know how to re-order
the sub-commands of CREATE SCHEMA, there is not really anything
except a little coding to stop us from supporting more object types.
This patch adds support for creating functions (including procedures
and aggregates), operators, types (including domains), collations,
and text search objects.
SQL:2021 specifies that we should allow functions, procedures,
types, domains, and collations, so this moves us a great deal
closer to full SQL compatibility of CREATE SCHEMA. What remains
missing from their list are casts, transforms, roles, and some
object types we don't support yet (e.g. CREATE CHARACTER SET).
Supporting casts or transforms would be problematic because
they don't have names at all, let alone schema-qualified names,
so it'd be quite a stretch to say that they belong to a schema.
Roles likewise are not schema-qualified, plus they are global
to a cluster, making it even less reasonable to consider them
as belonging to a schema. So I don't see us trying to complete
the list.
User-defined aggregates and operators are outside the spec's ken,
as are text search objects, so adding them does not do anything for
spec compatibility. But they go along with these other object types,
plus it takes no additional code to support them since they are
represented as DefineStmts like some variants of CREATE TYPE.
It would indeed take some effort to reject them.
Author: Kirill Reshke <reshkekirill@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CALdSSPh4jUSDsWu3K58hjO60wnTRR0DuO4CKRcwa8EVuOSfXxg@mail.gmail.com
Tom Lane [Mon, 6 Apr 2026 18:52:28 +0000 (14:52 -0400)]
Execute foreign key constraints in CREATE SCHEMA at the end.
The previous patch simplified CREATE SCHEMA's behavior to "execute all
subcommands in the order they are written". However, that's a bit too
simple, as the spec clearly requires forward references in foreign key
constraint clauses to work, see feature F311-01. (Most other SQL
implementations seem to read more into the spec than that, but it's
not clear that there's justification for more in the text, and this is
the only case that doesn't introduce unresolvable issues.) We never
implemented that before, but let's do so now.
To fix it, transform FOREIGN KEY clauses into ALTER TABLE ... ADD
FOREIGN KEY commands and append them to the end of the CREATE SCHEMA's
subcommand list. This works because the foreign key constraints are
independent and don't affect any other DDL that might be in CREATE
SCHEMA. For simplicity, we do this for all FOREIGN KEY clauses even
if they would have worked where they were.
Author: Jian He <jian.universality@gmail.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us
Tom Lane [Mon, 6 Apr 2026 18:42:55 +0000 (14:42 -0400)]
Don't try to re-order the subcommands of CREATE SCHEMA.
transformCreateSchemaStmtElements has always believed that it is
supposed to re-order the subcommands of CREATE SCHEMA into a safe
execution order. However, it is nowhere near being capable of doing
that correctly. Nor is there reason to think that it ever will be,
or that that is a well-defined requirement. (The SQL standard does
say that it should be possible to do foreign-key forward references
within CREATE SCHEMA, but it's not clear that the text requires
anything more than that.) Moreover, the problem will get worse as
we add more subcommand types. Let's just drop the whole idea and
execute the commands in the order given, which seems like a much
less astonishment-prone definition anyway. The foreign-key issue
will be handled in a follow-up patch.
This will result in a release-note-worthy incompatibility,
which is that forward references like
CREATE SCHEMA myschema
CREATE VIEW myview AS SELECT * FROM mytable
CREATE TABLE mytable (...);
used to work and no longer will. Considering how many closely
related variants never worked, this isn't much of a loss.
Along the way, pass down a ParseState so that we can provide an
error cursor for "wrong schema name" and related errors, and fix
transformCreateSchemaStmtElements so that it doesn't scribble
on the parsetree passed to it.
Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us
Previously, autovacuum always disabled parallel vacuum regardless of
the table's index count or configuration. This commit enables
autovacuum workers to use parallel index vacuuming and index cleanup,
using the same parallel vacuum infrastructure as manual VACUUM.
Two new configuration options control the feature. The GUC
autovacuum_max_parallel_workers sets the maximum number of parallel
workers a single autovacuum worker may launch; it defaults to 0,
preserving existing behavior unless explicitly enabled. The per-table
storage parameter autovacuum_parallel_workers provides per-table
limits. A value of 0 disables parallel vacuum for the table, a
positive value caps the worker count (still bounded by the GUC), and
-1 (the default) defers to the GUC.
To handle cases where autovacuum workers receive a SIGHUP and update
their cost-based vacuum delay parameters mid-operation, a new
propagation mechanism is added to vacuumparallel.c. The leader stores
its effective cost parameters in a DSM segment. Parallel vacuum
workers poll for changes in vacuum_delay_point(); if an update is
detected, they apply the new values locally via VacuumUpdateCosts().
A new test module, src/test/modules/test_autovacuum, is added to
verify that parallel autovacuum workers are correctly launched and
that cost-parameter updates are propagated as expected.
The patch was originally proposed by Maxim Orlov, but the
implementation has undergone significant architectural changes
since then during the review process.
Rename cluster.c to repack.c (and corresponding .h)
CLUSTER is no longer the favored way to invoke this functionality, and
the code is about to shift its focus to the REPACK more ambitiously.
Rename the file to avoid leaving an unnecessary historical artifact
around.
Tom Lane [Mon, 6 Apr 2026 18:05:01 +0000 (14:05 -0400)]
Disallow system columns in COPY FROM WHERE conditions.
These columns haven't been computed yet when the filtering happens
(since we've not written the candidate tuple into the table); so
any check on them is wrong or useless. Worse, since aa606b931 such a
reference results in an access off the end of a TupleDesc, potentially
causing a phony "generated columns are not supported in COPY FROM
WHERE conditions" error; and since c98ad086a it throws an Assert
instead.
Actually we could allow tableoid, which has been set to the OID of the
table named as the COPY target. However, plausible uses for tests of
tableoid would involve a partitioned target table, and the user would
wish it to read as the OID of the destination partition. There has
been some discussion of changing things to make it work like that,
but pending that happening we should just disallow tableoid along
with other system columns.
It seems best though to install this prohibition only in HEAD.
In the back branches we'll just guard the unsafe TupleDesc access,
and people will keep getting whatever semantics they got before.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6f435023-8ab6-47c2-ba07-035d0c4212f9@gmail.com
Tom Lane [Mon, 6 Apr 2026 17:14:50 +0000 (13:14 -0400)]
Fix null-bitmap combining in array_agg_array_combine().
This code missed the need to update the combined state's
nullbitmap if state1 already had a bitmap but state2 didn't.
We need to extend the existing bitmap with 1's but didn't.
This could result in wrong output from a parallelized
array_agg(anyarray) calculation, if the input has a mix of
null and non-null elements. The errors depended on timing
of the parallel workers, and therefore would vary from one
run to another.
Also install guards against integer overflow when calculating
the combined object's sizes, and make some trivial cosmetic
improvements.
Author: Dmytro Astapov <dastapov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFQUnFj2pQ1HbGp69+w2fKqARSfGhAi9UOb+JjyExp7kx3gsqA@mail.gmail.com
Backpatch-through: 16
Robert Haas [Mon, 6 Apr 2026 16:29:59 +0000 (12:29 -0400)]
Add a guc_check_handler to the EXPLAIN extension mechanism.
It would be useful to be able to tell auto_explain to set a custom
EXPLAIN option, but it would be bad if it tried to do so and the
option name or value wasn't valid, because then every query would fail
with a complaint about the EXPLAIN option. So add a guc_check_handler
that auto_explain will be able to use to only try to set option
name/value/type combinations that have been determined to be legal,
and to emit useful messages about ones that aren't.
The restructuring in commit 53b8ca6881 revealed an interesting
corner case: if a table needs vacuuming for wraparound prevention
and autovacuum is disabled for it, we might still choose to analyze
it. Research seems to indicate this was an accidental addition by
commit 48188e1621, and further discussion indicates there is
consensus that it is unnecessary and can be removed.
Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/adB9nSsm_S0D9708%40nathan
Robert Haas [Mon, 6 Apr 2026 15:13:25 +0000 (11:13 -0400)]
Expose helper functions scan_quoted_identifier and scan_identifier.
Previously, this logic was embedded within SplitIdentifierString,
SplitDirectoriesString, and SplitGUCList. Factoring it out saves
a bit of duplicated code, and also makes it available to extensions
that might want to do similar things without necessarily wanting to
do exactly the same thing.
This commit updates 011_lock_stats.pl to verify log_lock_waits behavior.
The tests check that messages are emitted both when a wait occurs and
when the lock is acquired, and that the "still waiting for" message is logged
exactly once per wait, even if the backend wakes up during the wait.
The latter covers the behavior introduced by commit fd6ecbfa75f.
Author: Hüseyin Demir <huseyin.d3r@gmail.com> Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAB5wL7YB1my9W5k5i=SY+=sTjeozyJ0YkvGXrVfeDNzuRkoTPg@mail.gmail.com
Release postmaster working memory context in slotsync worker
Child processes do not need the postmaster's working memory context and
normally release it at the start of their main entry point. However,
the slotsync worker forgot to do so.
This commit makes the slotsync worker release the postmaster's working
memory context at startup, preventing unintended use.
Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Tiancheng Ge <getiancheng_2012@163.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHO05JaUpgKF8FBDmPdBUJsK22axRRcgmAUc2Jyi8OK8g@mail.gmail.com
Robert Haas [Mon, 6 Apr 2026 11:41:28 +0000 (07:41 -0400)]
Add pg_stash_advice contrib module.
This module allows plan advice strings to be provided automatically
from an in-memory advice stash. Advice stashes are stored in dynamic
shared memory and must be recreated and repopulated after a server
restart. If pg_stash_advice.stash_name is set to the name of an advice
stash, and if query identifiers are enabled, the query identifier
for each query will be looked up in the advice stash and the
associated advice string, if any, will be used each time that query
is planned.
Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Discussion: http://postgr.es/m/CA+TgmoaeNuHXQ60P3ZZqJLrSjP3L1KYokW9kPfGbWDyt+1t=Ng@mail.gmail.com
Michael Paquier [Mon, 6 Apr 2026 05:01:04 +0000 (14:01 +0900)]
Use single LWLock for lock statistics in pgstats
Previously, one LWLock was used for each lock type, adding complexity
without an observable performance benefit as data is gathered only for
paths involving lock waits, at least currently. This commit replaces
the per-type set of LWLocks with a single LWLock protecting the stats
data of all the lock types, like the stats kinds for SLRU or WAL. A
good chunk of the callbacks get simpler thanks to this change.
The previous approach also had one bug in the flush callback when nowait
was called with "true": a backend iterating over all entries could
successfully flush some entries while skipping others due to contention,
then unconditionally reset the pending data. This would cause some
stats data loss.
Michael Paquier [Mon, 6 Apr 2026 04:23:28 +0000 (13:23 +0900)]
Improve more stability of worker_spi termination test
Alexander Lakhin has noticed that it can be possible on machines with
slow storage to have the spawned workers be stuck in
initialize_worker_spi(), before they reach their main loop. Waiting for
a flush to happen would block the interrupt attempts done by the
database commands, causing the test to fail on timeout once the number
of interrupt attempts is reached in CountOtherDBBackends().
This commit switches the test to wait for the spawned bgworkers to reach
their main loops before attempting the database commands that would
trigger the interrupts, napping for a time larger than the default, with
worker_spi.naptime set at 10 minutes. Another thing that could be
attempted is to enforce a larger number of tries in
CountOtherDBBackends(), if what is done here is not enough. Let's see
first if what this commit does is enough for the buildfarm members
widowbird and jay.
Analyzed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/f913fba1-da59-404c-9eb3-07c7304be637@gmail.com
Richard Guo [Mon, 6 Apr 2026 02:54:08 +0000 (11:54 +0900)]
Fix volatile function evaluation in eager aggregation
Pushing aggregates containing volatile functions below a join can
violate volatility semantics by changing the number of times the
function is executed.
Here we check the Aggref nodes in the targetlist and havingQual for
volatile functions and disable eager aggregation when such functions
are present.
Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com
Richard Guo [Mon, 6 Apr 2026 02:52:33 +0000 (11:52 +0900)]
Fix collation handling for grouping keys in eager aggregation
When determining if it is safe to use an expression as a grouping key
for partial aggregation, eager aggregation relies on the B-tree
equalimage support function to ensure that equality implies image
equality.
Previously, the code incorrectly passed the default collation of the
expression's data type to the equalimage procedure, rather than the
expression's actual collation. As a result, if a column used a
non-deterministic collation but the base type's default collation was
deterministic, eager aggregation would incorrectly assume that the
column was safe for byte-level grouping. This could cause rows to be
prematurely grouped and subsequently discarded by strict join
conditions, resulting in incorrect query results.
This patch fixes the issue by passing the expression's actual
collation to the equalimage procedure.
Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com
Add wal_sender_shutdown_timeout GUC to limit shutdown wait for replication
Previously, during shutdown, walsenders always waited until all pending data
was replicated to receivers. This ensures sender and receiver stay in sync
after shutdown, which is important for physical replication switchovers,
but it can significantly delay shutdown. For example, in logical replication,
if apply workers are blocked on locks, walsenders may wait until those locks
are released, preventing shutdown from completing for a long time.
This commit introduces a new GUC, wal_sender_shutdown_timeout,
which specifies the maximum time a walsender waits during shutdown for all
pending data to be replicated. When set, shutdown completes once all data is
replicated or the timeout expires. A value of -1 (the default) disables
the timeout.
This can reduce shutdown time when replication is slow or stalled. However,
if the timeout is reached, the sender and receiver may be left out of sync,
which can be problematic for physical replication switchovers.
John Naylor [Mon, 6 Apr 2026 02:30:01 +0000 (09:30 +0700)]
Fix unportable use of __builtin_constant_p
On MSVC Arm, USE_ARMV8_CRC32C is defined, but __builtin_constant_p
is not available. Use pg_integer_constant_p and add appropriate
guards. There is a similar potential hazard for the x86 path, but
for now let's get the buildfarm green.
Oversight in commit fbc57f2bc, per buildfarm member hoatzin.
Postcommit review and buildfarm/CI failures revealed a few issues in
the test code which this commit attempts to resolve. These failures
are verified using synthetic means.
* Wait for launcher exit in enable/disable checksum tests
When enabling or disabling data checksums in a test with waiting
for an end state (on or off), the test typically want to perform
more test against the cluster immediately. Make sure to wait for
the launcher to exit in these cases before returning in order to
know it can immediately be acted on. This is a more generic way
of implementating 0036232ba8f.
* Refactor injection point tests to use the injection_points test
extension. Two injection points added for online checksums were
better expressed using the injection_points extension with the
test code embedded in datachecksum_state.c.
* Make tests less timing dependent and allow transitions to "on"
and not just "inprogress-on" in case a test manages to finish
before it's checked for state.
* When waiting on a blocking background psql keeping a temporary
table open, the test first closed the background session abd
then the server. This could cause data checksums to manage to
get enabled in the brief window between dropping the temporary
table and closing the server. Fix by closing the server first
before the background session.
* Remove a few superfluous duplicate checks and general cleanup
of comments as well as making LSN logging consistent.
These issues were reported by Andres as well as spotted in the
buildfarm and on CI.
Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/92F25C14-801E-4198-994D-D83E31FEB0D8@yesql.se
If the background worker for processing databases manages to finish
before the launcher starts waiting for it, the launcher would treat
it erroneously as an error. Fix by ensureing to check result state
in this case. Identified on CI and synthetically reproduced during
local testing.
Also while, make sure to properly lock the shared memory structure
before updating tje result state.
Author: Daniel Gustafsson <daniel@yesql.seA Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/4fxw37ge47v5baeozla5phymi233hxbcjbwwsfwv3mpg3kyl2z@6jk4nkf6jp4
Michael Paquier [Sun, 5 Apr 2026 23:51:30 +0000 (08:51 +0900)]
Add tests for lock statistics, take two
Commit 7c64d56fd976 has removed the isolation test providing coverage
for lock statistics due to some instability in the CI, where the
deadlock timeout may not have enough time to process, preventing the
stats data to be updated. These also relied on a set of hardcoded
sleeps.
This commit switches the test suite to TAP, instead, that uses an
injection point with a wait to avoid the sleeps. The injection point is
added in ProcSleep(), once we know that the deadlock timeout has fired
and that the stats have been updated.
Multiple lock patterns are checked, all rely on the same workflow, with
two sessions:
- session 1 holds a given lock type.
- session 2 attaches to the new injection point with the wait action.
- session 2 attempts to acquire a lock conflicting with the lock of
session 1, waiting for the injection point to be reached.
- session 1 releases its lock, session 2 commits.
- pg_stat_lock is polled until the counters are updated for the lock
type.
Bertrand's version of the patch introduced a new routine to
BackgroundPsql() to detect the blocked background sessions. I have
tweaked the test so as we use the same method as some of the other tests
instead, based on some \echo commands. This test has been run multiple
times in the CI, all passing, so I'd like to think that this is more
stable than the first version attempted.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/acNTR1lLHwQJ0o+P@ip-10-97-1-34.eu-west-3.compute.internal
Convert buffer manager to use the new shmem allocation functions
This rectifies the initialization functions a little, making the
"buffer strategy" stuff in freelist.c and buffer mapping hash table in
buf_init.c top-level "subsystems" of their own, registered directly in
subsystemlist.h. Previously they were called indirectly from
BufferManagerShmemInit() and BufferManagerShmemSize()
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
The buffer blocks, converted to use ShmemRequestStruct() in the next
commit, are IO-aligned. This might come handy in other places too, so
make it an explicit feature of ShmemRequestStruct().
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
Convert AIO to use the new shmem allocation functions
This replaces the "shmem_size" and "shmem_init" callbacks in the IO
methods table with the same ShmemCallback struct that we now use in
other subsystems
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
This is in preparation to convert it to use the new shmem allocation
functions, making the next commit that does that smaller. This inlines
SerialInit() to the caller, and moves all the initialization steps
within PredicateLockShmemInit() to happen after all the
ShmemInit{Struct|Hash}() calls.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
Use the new shmem allocation functions in a few core subsystems
These subsystems have some complicating properties, making them
slightly harder to convert than most:
- The initialization callbacks of some of these subsystems have
dependencies, i.e. they need to be initialized in the right order.
- The ProcGlobal pointer still needs to be inherited by the
BackendParameters mechanism on EXEC_BACKEND builds, because
ProcGlobal is required by InitProcess() to get a PGPROC entry, and
the PGPROC entry is required to use LWLocks, and usually attaching
to shared memory areas requires the use of LWLocks.
- Similarly, ProcSignal pointer still needs to be handled by
BackendParameters, because query cancellation connections access it
without calling InitProcess
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
To add a new built-in subsystem, add it to subsystemslist.h. That
hooks up its shmem callbacks so that they get called at the right
times during postmaster startup. For now this is unused, but will
replace the current SubsystemShmemSize() and SubsystemShmemInit()
calls in the next commits.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
Convert pg_stat_statements to use the new shmem allocation functions
As part of this, embed the LWLock it needs in the shared memory struct
itself, so that we don't need to use RequestNamedLWLockTranche()
anymore. LWLockNewTrancheId() + LWLockInitialize() is more convenient
to use in extensions.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
Add a test module to test after-startup shmem allocations
The old ShmemInit{Struct/Hash}() functions could be used after
postmaster statup, as long as the allocation is small enough to fit in
spare shmem reserved at startup. I believe some extensions do that,
although we hadn't really documented it and had not coverage for it.
The new test module covers that after-startup usage with the new
ShmemRequestStruct() functions.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
Introduce a new mechanism for registering shared memory areas
This replaces the [Subsystem]ShmemSize() and [Subsystem]ShmemInit()
functions called at postmaster startup with a new set of callbacks.
The new mechanism is designed to be more ergonomic. Notably, the size
of each shmem area is specified in the same ShmemRequestStruct() call,
together with its name. The same mechanism is used in extensions,
replacing the shmem_{request/startup}_hooks.
ShmemInitStruct() and ShmemInitHash() become backwards-compatibility
wrappers around the new functions. In future commits, I will replace
all ShmemInitStruct() and ShmemInitHash() calls with the new
functions, although we'll still need to keep them around for
extensions.
Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
Andres Freund [Sun, 5 Apr 2026 21:18:00 +0000 (17:18 -0400)]
instrumentation: Separate per-node logic from other uses
Previously, different places (e.g. query "total time") were repurposing the
Instrumentation struct initially introduced for capturing per-node statistics
during execution. This overuse of the same struct is confusing, e.g. by
cluttering calls of InstrStartNode/InstrStopNode in unrelated code paths, and
prevents future refactorings.
Instead, simplify the Instrumentation struct to only track time and WAL/buffer
usage. Similarly, drop the use of InstrEndLoop outside of per-node
instrumentation - these calls were added without any apparent benefit since
the relevant fields were never read.
Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.
In passing, clarify that InstrAggNode is expected to only run after
InstrEndLoop (as it does in practice), and drop unused code.
This also fixes a consequence-less bug: Previously ->async_mode was only set
when a non-zero instrument_option was passed. That turns out to be harmless
right now, as ->async_mode only affects a timing related field.
Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
Andres Freund [Sun, 5 Apr 2026 20:47:12 +0000 (16:47 -0400)]
instrumentation: Separate trigger logic from other uses
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.
In passing, drop the "n" argument to InstrAlloc, as all remaining callers need
exactly one Instrumentation struct. The duplication between InstrAlloc() and
InstrInit(), as well as the conditional initialization of async_mode will be
addressed in a subsequent commit.
Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
Andres Freund [Sun, 5 Apr 2026 18:45:27 +0000 (14:45 -0400)]
Add tid_block() and tid_offset() accessor functions
The two new functions allow to extract the block number and offset from a tid.
There are existing ways to do so (e.g. by doing (ctid::text::point)[0]), but
they are hard to remember and not pretty.
tid_block() returns int8 (bigint) because BlockNumber is uint32, which exceeds
the range of int4. tid_offset() returns int4 (integer) because OffsetNumber is
uint16, which fits safely in int4.
Check that the tranche name is unique in RequestNamedLWLockTranche
You could request two tranches with same name, but things would get
confusing when you called GetNamedLWLockTranche() to get the LWLocks
allocated for them; it would always return the first tranche with the
name. That doesn't make sense, so forbid duplicates.
We still allow duplicates with LWLockNewTrancheId(). That works better
as you don't use the name to look up the tranche ID later. It's still
confusing in wait events, for example, but it's not dangerous in the
same way.
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/463a28db-0c0b-4af6-bac6-3891828bbbfe@iki.fi
While working on refactoring how shmem is allocated, I made a mistake
where the main LWLock array did not reserve space for the LWLocks
allocated with RequestNamedLWLockTranche(), and the test still
passed. Matthias van de Meent spotted that before it got committed,
but in order to catch such mistakes in the future, add checks in
test_lwlock_tranches that the locks allocated with
RequestNamedLWLockTranche() can be acquired and released.
Another change is to stop requesting multiple tranches with the same
name with RequestNamedLWLockTranche(). As soon as I started to test
using the locks I realized that's bogus, and the next commit will
forbid it. Keep test coverage for duplicates requested with
LWLockNewTrancheId() for now, but make it more clear that that's what
the test does.
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/463a28db-0c0b-4af6-bac6-3891828bbbfe@iki.fi
Discussion: https://www.postgresql.org/message-id/CAEze2WjgCROMMXY0+j8FFdm3iFcr7By-+6Mwiz=PgGSEydiW3A@mail.gmail.com
Andrew Dunstan [Thu, 19 Mar 2026 13:57:35 +0000 (09:57 -0400)]
Add pg_get_database_ddl() function
Add a new SQL-callable function that returns the DDL statements needed
to recreate a database. It takes a regdatabase argument and an optional
VARIADIC text argument for options that are specified as alternating
name/value pairs. The following options are supported: pretty (boolean)
for formatted output, owner (boolean) to include OWNER and tablespace
(boolean) to include TABLESPACE. The return is one or multiple rows
where the first row is a CREATE DATABASE statement and subsequent rows are
ALTER DATABASE statements to set some database properties.
The caller must have CONNECT privilege on the target database.
Author: Akshay Joshi <akshay.joshi@enterprisedb.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Co-authored-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Quan Zongliang <quanzongliang@yeah.net>
Discussion: https://postgr.es/m/CANxoLDc6FHBYJvcgOnZyS+jF0NUo3Lq_83-rttBuJgs9id_UDg@mail.gmail.com
Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net
Andrew Dunstan [Thu, 19 Mar 2026 13:55:16 +0000 (09:55 -0400)]
Add pg_get_tablespace_ddl() function
Add a new SQL-callable function that returns the DDL statements needed
to recreate a tablespace. It takes a tablespace name or OID and an
optional VARIADIC text argument for options that are specified as
alternating name/value pairs. The following options are supported: pretty
(boolean) for formatted output and owner (boolean) to include OWNER.
(It includes two variants because there is no regtablespace pseudotype.)
The return is one or multiple rows where the first row is a CREATE
TABLESPACE statement and subsequent rows are ALTER TABLESPACE statements
to set some tablespace properties.
The caller must have SELECT privilege on pg_tablespace.
get_reloptions() in ruleutils.c is made non-static so it can be called
from the new ddlutils.c file.
Author: Nishant Sharma <nishant.sharma@enterprisedb.com>
Author: Manni Wood <manni.wood@enterprisedb.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Co-authored-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAKWEB6rmnmGKUA87Zmq-s=b3Scsnj02C0kObQjnbL2ajfPWGEw@mail.gmail.com
Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net
Andrew Dunstan [Thu, 19 Mar 2026 13:52:25 +0000 (09:52 -0400)]
Add pg_get_role_ddl() function
Add a new SQL-callable function that returns the DDL statements needed
to recreate a role. It takes a regrole argument and an optional VARIADIC
text argument for options that are specified as alternating name/value
pairs. The following options are supported: pretty (boolean) for
formatted output and memberships (boolean) to include GRANT statements
for role memberships and membership options. The return is one or
multiple rows where the first row is a CREATE ROLE statement and
subsequent rows are ALTER ROLE statements to set some role properties.
Password information is never included in the output.
The caller must have SELECT privilege on pg_authid.
Author: Mario Gonzalez <gonzalemario@gmail.com>
Author: Bryan Green <dbryan.green@gmail.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Co-authored-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Quan Zongliang <quanzongliang@yeah.net> Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/4c5f895e-3281-48f8-b943-9228b7da6471@gmail.com
Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net
Andrew Dunstan [Thu, 19 Mar 2026 13:50:41 +0000 (09:50 -0400)]
Add infrastructure for pg_get_*_ddl functions
Add parse_ddl_options(), append_ddl_option(), and append_guc_value()
helper functions in a new ddlutils.c file that provide common option
parsing and output formatting for the pg_get_*_ddl family of functions
which will follow in later patches. These accept VARIADIC text
arguments as alternating name/value pairs.
Callers declare an array of DdlOption descriptors specifying the
accepted option names and their types (boolean, text, or integer).
parse_ddl_options() matches each supplied pair against the array,
validates the value, and fills in the result fields. This
descriptor-based scheme is based on an idea from Euler Taveira.
This is placed in a new ddlutils.c file which will contain the
pg_get_*_ddl functions.
Allow index_create to suppress index_build progress reporting
A future REPACK patch wants a way to suppress index_build doing its
progress reports when building an index, because that would interfere
with repack's own reporting; so add an INDEX_CREATE_SUPPRESS_PROGRESS
bit that enables this.
Furthermore, change the index_create_copy() API so that it takes flag
bits for index_create() and passes them unchanged. This gives its
callers more direct control, which eases the interface -- now its
callers can pass the INDEX_CREATE_SUPPRESS_PROGRESS bit directly. We
use it for the current caller in REINDEX CONCURRENTLY, since it's also
not interested in progress reporting, since it doesn't want
index_build() to be called at all in the first place.
One thing to keep in mind, pointed out by Mihail, is that we're not
suppressing the index-AM-specific progress report updates which happen
during ambuild(). At present this is not a problem, because the values
updated by those don't overlap with those used by commands other than
CREATE INDEX; but maybe in the future we'll want the ability to suppress
them also. (Alternatively we might want to display how each
index-build-subcommand progresses during REPACK and others.)
postgres_fdw: Inherit the local transaction's access/deferrable modes.
READ ONLY transactions should prevent modifications to foreign data as
well as local data, but postgres_fdw transactions declared as READ ONLY
that reference foreign tables mapped to a remote view executing volatile
functions would modify data on remote servers, as it would open remote
transactions in READ WRITE mode.
Similarly, DEFERRABLE transactions should not abort due to a
serialization failure even when accessing foreign data, but postgres_fdw
transactions declared as DEFERRABLE would abort due to that failure in a
remote server, as it would open remote transactions in NOT DEFERRABLE
mode.
To fix, modify postgres_fdw to open remote transactions in the same
access/deferrable modes as the local transaction. This commit also
modifies it to open remote subtransactions in the same access mode as
the local subtransaction.
This commit changes the behavior of READ ONLY/DEFERRABLE transactions
using postgres_fdw; in particular, it doesn't allow the READ ONLY
transactions to modify data on remote servers anymore, so such
transactions should be redeclared as READ WRITE or rewritten using other
tools like dblink. The release notes should note this as an
incompatibility.
These issues exist since the introduction of postgres_fdw, but to avoid
the incompatibility in the back branches, fix them in master only.
Author: Etsuro Fujita <etsuro.fujita@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAPmGK16n_hcUUWuOdmeUS%2Bw4Q6dZvTEDHb%3DOP%3D5JBzo-M3QmpQ%40mail.gmail.com
Discussion: https://postgr.es/m/E1uLe9X-000zsY-2g%40gemulon.postgresql.org
Thomas Munro [Sun, 5 Apr 2026 06:01:10 +0000 (18:01 +1200)]
aio: Simplify pgaio_worker_submit().
Merge pgaio_worker_submit_internal() and pgaio_worker_submit(). The
separation didn't serve any purpose.
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
Andres Freund [Sun, 5 Apr 2026 04:43:54 +0000 (00:43 -0400)]
read_stream: Only increase read-ahead distance when waiting for IO
This avoids increasing the distance to the maximum in cases where the I/O
subsystem is already keeping up. This turns out to be important for
performance for two reasons:
- Pinning a lot of buffers is not cheap. If additional pins allow us to avoid
IO waits, it's definitely worth it, but if we can already do all the
necessary readahead at a distance of 16, reading ahead 512 buffers can
increase the CPU overhead substantially. This is particularly noticeable
when the to-be-read blocks are already in the kernel page cache.
- If the read stream is read to completion, reading in data earlier than
needed is of limited consequences, leaving aside the CPU costs mentioned
above. But if the read stream will not be fully consumed, e.g. because it is
on the inner side of a nested loop join, the additional IO can be a serious
performance issue. This is not that commonly a problem for current read
stream users, but the upcoming work, to use a read stream to fetch table
pages as part of an index scan, frequently encounters this.
Note that this commit would have substantial performance downsides without
earlier commits:
- Commit 6e36930f9aa, which avoids decreasing the readahead distance when
there was recent IO, is crucial, as otherwise we very often would end up not
reading ahead aggressively enough anymore with this commit, due to
increasing the distance less often.
- "read stream: Split decision about look ahead for AIO and combining" is
important as we would otherwise not perform IO combining when the IO
subsystem can keep up.
- "aio: io_uring: Trigger async processing for large IOs" is important to
continue to benefit from memory copy parallelism when using fewer IOs.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
Andres Freund [Sun, 5 Apr 2026 04:43:54 +0000 (00:43 -0400)]
read stream: Split decision about look ahead for AIO and combining
In a subsequent commit the read-ahead distance will only be increased when
waiting for IO. Without further work that would cause a regression: As IO
combining and read-ahead are currently controlled by the same mechanism, we
would end up not allowing IO combining when never needing to wait for IO (as
the distance ends up too small to allow for full sized IOs), which can
increase CPU overhead. A typical reason to not have to wait for IO completion
at a low look-ahead distance is use of io_uring with the to-be-read data in
the page cache. But even with worker the IO submission rate may be low enough
for the worker to keep up.
One might think that we could just always perform IO combining, but doing so
at the start of a scan can cause performance regressions:
1) Performing a large IO commonly has a higher latency than smaller IOs. That
is not a problem once reading ahead far enough, but at the start of a stream
it can lead to longer waits for IO completion.
2) Sometimes read streams will not be read to completion. Immediately starting
with full sized IOs leads to more wasted effort. This is not commonly an
issue with existing read stream users, but the upcoming use of read streams
to fetch table pages as part of an index scan frequently encounters this.
Solve this issue by splitting ReadStream->distance into ->combine_distance and
->readahead_distance. Right now they are increased/decreased at the same time,
but that will change in the next commit.
One of the comments in read_stream_should_look_ahead() refers to a motivation
that only really exists as of the next commit, but without it the code doesn't
make sense on its own.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
Andres Freund [Sun, 5 Apr 2026 04:43:54 +0000 (00:43 -0400)]
read_stream: Move logic about IO combining & issuing to helpers
The long if statements were hard to read and hard to document. Splitting them
into inline helpers makes it much easier to explain each part separately.
This is done in preparation for making the logic more complicated...
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Andres Freund [Sun, 5 Apr 2026 04:43:54 +0000 (00:43 -0400)]
aio: io_uring: Trigger async processing for large IOs
io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
once the IO depth is a bit larger. That heuristic is important when doing
buffered IO from the kernel page cache, to allow parallelizing of the memory
copy, as otherwise io_method=io_uring would be a lot slower than
io_method=worker in that case.
An upcoming commit will make read_stream.c only increase the read-ahead
distance if we needed to wait for IO to complete. If to-be-read data is in the
kernel page cache, io_uring will synchronously execute IO, unless the IO is
flagged as async. Therefore the aforementioned change in read_stream.c
heuristic would lead to a substantial performance regression with io_uring
when data is in the page cache, as we would never reach a deep enough queue to
actually trigger the existing heuristic.
Parallelizing the copy from the page cache is mainly important when doing a
lot of IO, which commonly is only possible when doing largely sequential IO.
The reason we don't just mark all io_uring IOs as asynchronous is that the
dispatch to a kernel thread has overhead. This overhead is mostly noticeable
with small random IOs with a low queue depth, as in that case the gain from
parallelizing the memory copy is small and the latency cost high.
The facts from the two prior paragraphs show a way out: Use the size of the IO
in addition to the depth of the queue to trigger asynchronous processing.
One might think that just using the IO size might be enough, but
experimentation has shown that not to be the case - with deep look-ahead
distances being able to parallelize the memory copy is important even with
smaller IOs.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
Also rename it to index_create_copy. Add a 'boolean concurrent' option,
and make it work for both cases: in concurrent mode, just create the
catalog entries; caller is responsible for the actual building later.
In non-concurrent mode, the index is built right away.
This allows it to be reused for other purposes -- specifically, for
concurrent REPACK.
(With the CONCURRENTLY option, REPACK cannot simply swap the heap file and
rebuild its indexes. Instead, it needs to build a separate set of
indexes, including their system catalog entries, *before* the actual
swap, to reduce the time AccessExclusiveLock needs to be held for. This
approach is different from what CREATE INDEX CONCURRENTLY does.)
Peter Geoghegan [Sat, 4 Apr 2026 17:49:37 +0000 (13:49 -0400)]
heapam: Keep buffer pins across index scan resets.
Avoid dropping the heap page pin (xs_cbuf) and visibility map pin
(xs_vmbuffer) within heapam_index_fetch_reset. Retaining these pins
saves cycles during certain nested loop joins and merge joins that
frequently restore a saved mark: cases where the next tuple fetched
after a reset often falls on the same heap page will now avoid the cost
of repeated pinning and unpinning.
Avoiding dropping the scan's heap page buffer pin is preparation for an
upcoming patch that will add I/O prefetching to index scans. Testing of
that patch (which makes heapam tend to pin more buffers concurrently
than was typical before now) shows that the aforementioned cases get a
small but clearly measurable benefit from this optimization.
Upcoming work to add a slot-based table AM interface for index scans
(which is further preparation for prefetching) will move VM checks for
index-only scans out of the executor and into heapam. That will expand
the role of xs_vmbuffer to include VM lookups for index-only scans (the
field won't just be used for setting pages all-visible during on-access
pruning via the enhancement recently introduced by commit b46e1e54).
Avoiding dropping the xs_vmbuffer pin will preserve the historical
behavior of nodeIndexonlyscan.c, which always kept this pin on a rescan;
that aspect of this commit isn't really new.
Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
Peter Geoghegan [Sat, 4 Apr 2026 15:45:33 +0000 (11:45 -0400)]
heapam: Track heap block in IndexFetchHeapData.
Add an explicit BlockNumber field (xs_blk) to IndexFetchHeapData that
tracks which heap block is currently pinned in xs_cbuf.
heapam_index_fetch_tuple now uses xs_blk to determine when buffer
switching is needed, replacing the previous approach that compared
buffer identities via ReleaseAndReadBuffer on every non-HOT-chain call.
This is preparatory work for an upcoming commit that will add index
prefetching using a read stream. Delegating the release of a currently
pinned buffer to ReleaseAndReadBuffer won't work anymore -- at least not
when the next buffer that the scan needs to pin is one returned by
read_stream_next_buffer (not a buffer returned by ReadBuffer).
Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
Peter Geoghegan [Sat, 4 Apr 2026 15:30:41 +0000 (11:30 -0400)]
Move heapam_handler.c index scan code to new file.
Move the heapam index fetch callbacks (index_fetch_begin,
index_fetch_reset, index_fetch_end, and index_fetch_tuple) into a new
dedicated file. Also move heap_hot_search_buffer over. This is a
purely mechanical move with no functional impact.
Upcoming work to add a slot-based table AM interface for index scans
will substantially expand this code. Keeping it in heapam_handler.c
would clutter a file whose primary role is to wire up the TableAmRoutine
callbacks. Bitmap heap scans and sequential scans would benefit from
similar separation in the future.
Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh
Peter Geoghegan [Sat, 4 Apr 2026 15:30:05 +0000 (11:30 -0400)]
Rename heapam_index_fetch_tuple argument for clarity.
Rename heapam_index_fetch_tuple's call_again argument to heap_continue,
for consistency with the pointed-to variable name (IndexScanDescData's
xs_heap_continue field).
Preparation for an upcoming commit that will move index scan related
heapam functions into their own file.
Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh
John Naylor [Sat, 4 Apr 2026 13:47:01 +0000 (20:47 +0700)]
Compute CRC32C on ARM using the Crypto Extension where available
In similar vein to commit 3c6e8c123, the ARMv8 cryptography extension
has 64x64 -> 128-bit carryless multiplication instructions suitable
for computing CRC. This was tested to be around twice as fast as
scalar CRC instructions for longer inputs.
We now do a runtime check, even for builds that target "armv8-a+crc",
but those builds can still use a direct call for constant inputs,
which we assume are short.
As for x86, the MIT-licensed implementation was generated with the
"generate" program from
John Naylor [Sat, 4 Apr 2026 11:07:15 +0000 (18:07 +0700)]
Use AVX2 for calculating page checksums where available
We already rely on autovectorization for computing page checksums,
but on x86 we can get a further several-fold performance increase by
annotating pg_checksum_block() with a function target attribute for
the AVX2 instruction set extension. Not only does that use 256-bit
registers, it can also use vector multiplication rather than the
vector shifts and adds used in SSE2.
Similar to other hardware-specific paths, we set a function pointer
on first use. We don't bother to avoid this on platforms without AVX2
since the overhead of indirect calls doesn't matter for multi-kilobyte
inputs. However, we do arrange so that only core has the function
pointer mechanism. External programs will continue to build a normal
static function and don't need to be aware of this.
This matters most when using io_uring since in that case the checksum
computation is not done in parallel by IO workers.
Co-authored-by: Matthew Sterrett <matthewsterrett2@gmail.com> Co-authored-by: Andrew Kim <andrew.kim@intel.com> Reviewed-by: Oleg Tselebrovskiy <o.tselebrovskiy@postgrespro.ru> Tested-by: Ants Aasma <ants.aasma@cybertec.at> Tested-by: Stepan Neretin <slpmcf@gmail.com> (earlier version)
Discussion: https://postgr.es/m/CA+vA85_5GTu+HHniSbvvP+8k3=xZO=WE84NPwiKyxztqvpfZ3Q@mail.gmail.com
Discussion: https://postgr.es/m/20250911054220.3784-1-root%40ip-172-31-36-228.ec2.internal
Add missing shmem size estimate for fast-path locking struct
It's been missing ever since fast-path locking was introduced. It's a
small discrepancy, about 4 kB, but let's be tidy. This doesn't seem
worth backpatching, however; in stable branches we were less precise
about the estimates and e.g. added a 10% margin to the hash table
estimates, which is usually much bigger than this discrepancy.
Thomas Munro [Fri, 3 Apr 2026 22:13:18 +0000 (11:13 +1300)]
More tar portability adjustments.
For the three implementations that have caused problems so far:
* GNU and BSD (libarchive) tar both understand --format=ustar
* ustar doesn't support large UID/GID values, so set them to 0 to
avoid a hard error from at least GNU tar
* OpenBSD tar needs -F ustar, and it appears to warn but carry
on with "nobody" if a UID is too large
* -f /dev/null is a more portable way to throw away the output, since
the default destination might be a tape device depending on build
options that a distribution might change
* Windows ships BSD tar but lacks /dev/null, so ask perl for its name
Based on their manuals, the other two implementations the tests are
likely to encounter in the wild don't seem to need any special handling:
* Solaris/illumos tar uses ustar and replaces large UIDs with 60001
* AIX tar uses ustar (unless --format=pax) and truncates large UIDs
Backpatch-through: 18 Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Sami Imseih <samimseih@gmail.com> (large UIDs) Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (earlier version) Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> (OpenBSD) Reviewed-by: Andrew Dunstan <andrew@dunslane.net> (Windows)
Discussion: https://postgr.es/m/3676229.1775170250%40sss.pgh.pa.us
Discussion: https://postgr.es/m/CAA5RZ0tt89MgNi4-0F4onH%2B-TFSsysFjMM-tBc6aXbuQv5xBXw%40mail.gmail.com
Remove HASH_DIRSIZE, always use the default algorithm to select it
It's not very useful to specify a non-standard directory size. The
HASH_DIRSIZE option was only used for shared memory hash tables, and
those always used hash_select_dirsize() to choose the size, which in
turn just uses the default algorithm anyway. That assumption was
ingrained in hash_estimate_size(), too.
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
Allocate all parts of shmem hash table from a single contiguous area
Previously, the shared header (HASHHDR) and the directory were
allocated by the caller, and passed to hash_create(), while the actual
elements were allocated separately with ShmemAlloc(). After this
commit, all the memory needed by the header, the directory, and all
the elements is allocated using a single ShmemInitStruct() call, and
the different parts are carved out of that allocation. This way the
ShmemIndex entries (and thus pg_shmem_allocations) reflect the size of
the whole hash table, rather than just the directories.
Commit f5930f9a98 attempted this earlier, but it had to be reverted.
The new strategy is to let dynahash.c perform all the allocations with
the alloc function, but have the alloc function carve out the parts
from the one larger allocation. The shared header and the directory
are now also allocated with alloc calls, instead of passing the area
for those directly from the caller.
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
Prevent shared memory hash tables from growing beyond initial size
Set HASH_FIXED_SIZE on all shared memory hash tables, to prevent them
from growing after the initial allocation. It was always weirdly
indeterministic that if one hash table used up all the unused shared
memory, you could not use that space for other things anymore until
restart. We just got rid of that behavior for the LOCK and PROCLOCK
tables, but it's similarly weird for all other hash tables.
Increase SHMEM_INDEX_SIZE because we were already above the max size,
on that one, and it's now a hard limit.
Some callers of ShmemInitHash() still pass HASH_FIXED_SIZE, but that's
now unnecessary. They should perhaps now be removed, but it doesn't do
any harm either to pass it.
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
Merge init and max size options on shmem hash tables
Replace the separate init and max size options with a single size
option. We didn't make much use of the feature, all callers except the
ones in wait_event.c already used the same size for both, and the hash
tables in wait_event.c are small so there's little harm in just
allocating them to the max size.
The only reason why you might want to not reserve the max size upfront
is to make the memory available for other hash tables to grow beyond
their max size. Letting hash tables grow much beyond their max size is
bad for performance, however, because we cannot resize the directory,
and we never had very much "wiggle room" to grow to anyway so you
couldn't really rely on it. We recently marked the LOCK and PROCLOCK
tables with HAS_FIXED_SIZE, so there's nothing left in core that would
benefit from more unallocated shared memory.
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
Jacob Champion [Fri, 3 Apr 2026 23:05:33 +0000 (16:05 -0700)]
oauth: Let validators provide failure DETAILs
At the moment, the only way for a validator module to report error
details on failure is to log them separately before returning from
validate_cb. Independently of that problem, the ereport() calls that we
make during validation failure partially duplicate some of the work of
auth_failed().
The end result is overly verbose and confusing for readers of the logs:
[768233] LOG: [my_validator] bad signature in bearer token
[768233] LOG: OAuth bearer authentication failed for user "jacob"
[768233] DETAIL: Validator failed to authorize the provided token.
[768233] FATAL: OAuth bearer authentication failed for user "jacob"
[768233] DETAIL: Connection matched file ".../pg_hba.conf" line ...
Solve both problems by making use of the existing logdetail pointer
that's provided by ClientAuthentication. Validator modules may set
ValidatorModuleResult->error_detail to override our default generic
message.
The end result looks something like
[242284] FATAL: OAuth bearer authentication failed for user "jacob"
[242284] DETAIL: [my_validator] bad signature in bearer token
Connection matched file ".../pg_hba.conf" line ...
Make data checksum tests more resilient for slow machines
The test for re-running checksum enabling was only checking for the
data checksum state to transition to 'on', but didn't account for
the launcher process having had time to exit, thus getting an error
instead of the expected no-op. Adding a pg_stat_activity check for
the launcher exiting resolves the error, verified by inducing delay
in the launcher.
Also wrap a variable only used in injection point tests within the
correct USE macros to avoid warning for an unused variable.
All per the buildfarm.
Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Buildfarm
Discussion: https://postgr.es/m/1CB288C9-564B-4664-B096-C2F4377D17AB@yesql.se
Teach relation_needs_vacanalyze() to always compute scores.
Presently, this function only computes component scores when the
corresponding threshold is reached. A follow-up commit will add a
view that shows tables' autovacuum scores, and we anticipate that
users will want to use this view to discover tables that are
nearing autovacuum eligibility. This commit teaches this function
to always compute autovacuum scores, even when a threshold has not
been reached or autovacuum is disabled.
The restructuring in this commit revealed an interesting edge case.
If the table needs vacuuming for wraparound prevention and
autovacuum is disabled for it, we might still choose to analyze it.
It's not clear if this is intentional, but it has been this way for
nearly 20 years, so it seems best to avoid changing it without
further discussion.
Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
This allows data checksums to be enabled, or disabled, in a running
cluster without restricting access to the cluster during processing.
Data checksums could prior to this only be enabled during initdb or
when the cluster is offline using the pg_checksums app. This commit
introduce functionality to enable, or disable, data checksums while
the cluster is running regardless of how it was initialized.
A background worker launcher process is responsible for launching a
dynamic per-database background worker which will mark all buffers
dirty for all relation with storage in order for them to have data
checksums calculated on write. Once all relations in all databases
have been processed, the data_checksums state will be set to on and
the cluster will at that point be identical to one which had data
checksums enabled during initialization or via offline processing.
When data checksums are being enabled, concurrent I/O operations
from backends other than the data checksums worker will write the
checksums but not verify them on reading. Only when all backends
have absorbed the procsignalbarrier for setting data_checksums to
on will they also start verifying checksums on reading. The same
process is repeated during disabling; all backends write checksums
but do not verify them until the barrier for setting the state to
off has been absorbed by all. This in-progress state is used to
ensure there are no false negatives (or positives) due to reading
a checksum which is not in sync with the page.
A new testmodule, test_checksums, is introduced with an extensive
set of tests covering both online and offline data checksum mode
changes. The tests which run concurrent pgbdench during online
processing are gated behind the PG_TEST_EXTRA flag due to being
very expensive to run. Two levels of PG_TEST_EXTRA flags exist
to turn on a subset of the expensive tests, or the full suite of
multiple runs.
This work is based on an earlier version of this patch which was
reviewed by among others Heikki Linnakangas, Robert Haas, Andres
Freund, Tomas Vondra, Michael Banck and Andrey Borodin. During
the work on this new version, Tomas Vondra has given invaluable
assistance with not only coding and reviewing but very in-depth
testing.
Author: Daniel Gustafsson <daniel@yesql.se>
Author: Magnus Hagander <magnus@hagander.net> Co-authored-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CABUevExz9hUUOLnJVr2kpw9Cx=o4MCr1SVKwbupzuxP7ckNutA@mail.gmail.com
Discussion: https://postgr.es/m/20181030051643.elbxjww5jjgnjaxg@alap3.anarazel.de
Discussion: https://postgr.es/m/CABUevEwE3urLtwxxqdgd5O2oQz9J717ZzMbh+ziCSa5YLLU_BA@mail.gmail.com
This commit adds an early return to this function, allowing us to
remove a level of indentation on a decent chunk of code. This is
preparatory work for follow-up commits that will add a new system
view to show tables' autovacuum scores.
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
Change default of max_locks_per_transactions to 128
The previous commits reduced the amount of memory available for locks
by eliminating the "safety margins" and by settling the split between
LOCK and PROCLOCK tables at startup. The allocation is now more
deterministic, but it also means that you often hit one of the limits
sooner than before. To compensate for that, bump up
max_locks_per_transactions from 64 to 128. With that there is a little
more space in the both hash tables than what was the effective maximum
size for either table before the previous commits.
This only changes the default, so if you had changed
max_locks_per_transactions in postgresql.conf, you will still have
fewer locks available than before for the same setting value. This
should be noted in the release notes. A good rule of thumb is that if
you double max_locks_per_transactions, you should be able to get as
many locks as before.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
This prevents the LOCK table from "stealing" space that was originally
calculated for the PROLOCK table, and vice versa. That was weirdly
indeterministic so that if you e.g. took a lot of locks consuming all
the available shared memory for the LOCK table, subsequent
transactions that needed the more space for the PROCLOCK table would
fail, but if you restarted the system then the space would be
available for PROCLOCK again. Better to be strict and predictable,
even though that means that in many cases you can acquire far fewer
locks than before.
This also prevents the lock hash tables from using up the
general-purpose 100 kB reserve we set aside for "stuff that's too
small to bother estimating" in CalculateShmemSize(). We are pretty
good at accounting for everything nowadays, so we could probably make
that reservation smaller, but I'll leave that for another commit.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
Remove 10% safety margin from lock manager hash table estimates
As the comment says, the hash table sizes are just estimates, but that
doesn't mean we need a "safety margin" here. hash_estimate_size()
estimates the needed size in bytes pretty accurately for the given
number of elements, so if we wanted room for more elements in the
table, we should just use larger max_table_size in the
hash_estimate_size() call.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
Remove bogus "safety margin" from predicate.c shmem estimates
The 10% safety margin was copy-pasted from lock.c when the predicate
locking code was originally added. However, we later (commit 7c797e7194) added the HASH_FIXED_SIZE flag to the hash tables, which
means that they cannot actually use the safety margin that we're
calculating for them.
The extra memory was mainly used by the main lock manager, which is
the only shmem hash table of non-trivial size that does not use the
HASH_FIXED_SIZE flag. If we wanted to have more space for the lock
manager, we should reserve it directly in lock.c. After this commit,
the lock manager will just have less memory available than before.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
Amit Langote [Fri, 3 Apr 2026 05:33:53 +0000 (14:33 +0900)]
Optimize fast-path FK checks with batched index probes
Instead of probing the PK index on each trigger invocation, buffer
FK rows in a new per-constraint cache entry (RI_FastPathEntry) and
flush them as a batch.
On each trigger invocation, the new ri_FastPathBatchAdd() buffers
the FK row in RI_FastPathEntry. When the buffer fills (64 rows)
or the trigger-firing cycle ends, the new ri_FastPathBatchFlush()
probes the index for all buffered rows, sharing a single
CommandCounterIncrement, snapshot, permission check, and security
context switch across the batch, rather than repeating each per row
as the SPI path does. Per-flush CCI is safe because all AFTER
triggers for the buffered rows have already fired by flush time.
For single-column foreign keys, the new ri_FastPathFlushArray()
builds an ArrayType from the buffered FK values (casting to the
PK-side type if needed) and constructs a scan key with the
SK_SEARCHARRAY flag. The index AM sorts and deduplicates the array
internally, then walks matching leaf pages in one ordered traversal
instead of descending from the root once per row. A matched[] bitmap
tracks which batch items were satisfied; the first unmatched item is
reported as a violation. Multi-column foreign keys fall back to
per-row probing via the new ri_FastPathFlushLoop().
The fast path introduced in the previous commit (2da86c1ef9) yields
~1.8x speedup. This commit adds ~1.6x on top of that, for a combined
~2.9x speedup over the unpatched code (int PK / int FK, 1M rows, PK
table and index cached in memory).
FK tuples are materialized via ExecCopySlotHeapTuple() into a new
purpose-specific memory context (flush_cxt), child of
TopTransactionContext, which is also used for per-flush transient
work: cast results, the search array, and index scan allocations.
It is reset after each flush and deleted in teardown.
The PK relation, index, tuple slots, and fast-path metadata are
cached in RI_FastPathEntry across trigger invocations within a
trigger-firing batch, avoiding repeated open/close overhead. The
snapshot and IndexScanDesc are taken fresh per flush. The entry is
not subject to cache invalidation: cached relations are held with
locks for the transaction duration, and the entry's lifetime is
bounded by the trigger-firing cycle.
Lifecycle management for RI_FastPathEntry relies on three new
mechanisms:
- AfterTriggerBatchCallback: A new general-purpose callback
mechanism in trigger.c. Callbacks registered via
RegisterAfterTriggerBatchCallback() fire at the end of each
trigger-firing batch (AfterTriggerEndQuery for immediate
constraints, AfterTriggerFireDeferred at COMMIT, and
AfterTriggerSetState for SET CONSTRAINTS IMMEDIATE). The RI
code registers ri_FastPathEndBatch as a batch callback.
- Batch callbacks only fire at the outermost query level
(checked inside FireAfterTriggerBatchCallbacks), so nested
queries from SPI inside other AFTER triggers do not tear down
the cache mid-batch.
- XactCallback: ri_FastPathXactCallback NULLs the static cache
pointer at transaction end, handling the abort path where the
batch callback never fired.
- SubXactCallback: ri_FastPathSubXactCallback NULLs the static
cache pointer on subtransaction abort, preventing the batch
callback from accessing already-released resources.
- AfterTriggerBatchIsActive(): A new exported accessor that
returns true when afterTriggers.query_depth >= 0. During
ALTER TABLE ... ADD FOREIGN KEY validation, RI triggers are
called directly outside the after-trigger framework, so batch
callbacks would never fire. The fast-path code uses this to
fall back to the non-cached per-invocation path in that
context.
ri_FastPathEndBatch() flushes any partial batch before tearing
down cached resources. Since the FK relation may already be
closed by flush time (e.g. for deferred constraints at COMMIT),
it reopens the relation using entry->fk_relid if needed.
The existing ALTER TABLE validation path bypasses batching and
continues to call ri_FastPathCheck() directly per row, because
RI triggers are called outside the after-trigger framework there
and batch callbacks would never fire to flush the buffer.
Suggested-by: David Rowley <dgrowleyml@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com> Co-authored-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Haibo Yan <tristan.yim@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA+HiwqF4C0ws3cO+z5cLkPuvwnAwkSp7sfvgGj3yQ=Li6KNMqA@mail.gmail.com
Thomas Munro [Fri, 3 Apr 2026 01:48:54 +0000 (14:48 +1300)]
jit: No backport::SectionMemoryManager for LLVM 22.
LLVM 22 has the fix that we copied into our tree in commit 9044fc1d and
a new function to reach it[1][2], so we only need to use our copy for
Aarch64 + LLVM < 22. The only change to the final version that our copy
didn't get is a new LLVM_ABI macro, but that isn't appropriate for us.
Our copy is hopefully now frozen and would only need maintenance if bugs
are found in the upstream code.
Non-Aarch64 systems now also use the new API with LLVM 22. It allocates
all sections with one contiguous mmap() instead of one per
section. We could have done that earlier, but commit 9044fc1d wanted to
limit the blast radius to the affected systems. We might as well
benefit from that small improvement everywhere now that it is available
out of the box.
We can't delete our copy until LLVM 22 is our minimum supported version,
or we switch to the newer JITLink API for at least Aarch64.
Tom Lane [Thu, 2 Apr 2026 21:21:18 +0000 (17:21 -0400)]
Further harden tests that might use not-so-compatible tar versions.
Buildfarm testing shows that OpenSUSE (and perhaps related platforms?)
configures GNU tar in such a way that it'll archive sparse WAL files
by default, thus triggering the pax-extension detection code added by bc30c704a. Thus, we need something similar to 852de579a but for
GNU tar's option set. "--format=ustar" seems to do the trick.
Moreover, the buildfarm shows that pg_verifybackup's 003_corruption.pl
test script is also triggering creation of pax-format tar files on
that platform. We had not noticed because those test cases all fail
(intentionally) before getting to the point of trying to verify WAL
data.
Since that means two TAP scripts need this option-selection logic, and
plausibly more will do so in future, factor it out into a subroutine
in Test::Utils. We also need to back-patch the 003_corruption.pl fix
into v18, where it's also failing.
While at it, clean up some places where guards for $tar being empty
or undefined were incomplete or even outright backwards. Presumably,
we missed noticing because the set of machines that run TAP tests
and don't have tar installed is empty. But if we're going to try
to handle that scenario, we should do it correctly.
Reported-by: Tomas Vondra <tomas@vondra.me>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/02770bea-b3f3-4015-8a43-443ae345379c@vondra.me
Backpatch-through: 18
Each simply dispatches to the standard string processing functions.
These depend on the locale, but since it's set at `initdb`, they can be
considered immutable and therefore allowed in any jsonpath expression.
Author: Florents Tselai <florents.tselai@gmail.com> Co-authored-by: David E. Wheeler <david@justatheory.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CA+v5N40sJF39m0v7h=QN86zGp0CUf9F1WKasnZy9nNVj_VhCZQ@mail.gmail.com
Rename the `datetime_precision` tokens to `uint_arg`, as they represent
unsigned integers and will be useful for other methods in the future, as
follows:
Author: David E. Wheeler <david@justatheory.com> Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CA+v5N40sJF39m0v7h=QN86zGp0CUf9F1WKasnZy9nNVj_VhCZQ@mail.gmail.com
Add target_relid parameter to pg_get_publication_tables().
When a tablesync worker checks whether a specific table is published,
it previously issued a query to the publisher calling
pg_get_publication_tables() and filtering the result by relid via a
WHERE clause. Because the function itself was fully evaluated before
the filter was applied, this forced the publisher to enumerate all
tables in the publication. For publications covering a large number of
tables, this resulted in expensive catalog scans and unnecessary CPU
overhead on the publisher.
This commit adds a new overloaded form of pg_get_publication_tables()
that accepts an array of publication names and a target table
OID. Instead of enumerating all published tables, it evaluates
membership for the specified relation via syscache lookups, using the
new is_table_publishable_in_publication() helper. This helper
correctly accounts for publish_via_partition_root, ALL TABLES with
EXCEPT clauses, schema publications, and partition inheritance, while
avoiding the overhead of building the complete published table list.
The existing VARIADIC array form of pg_get_publication_tables() is
preserved for backward compatibility. Tablesync workers use the new
two-argument form when connected to a publisher running PostgreSQL 19
or later.
Bump catalog version.
Reported-by: Marcos Pegoraro <marcos@f10.com.br> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Haoyan Wang <wanghaoyan20@163.com>
Discussion: https://postgr.es/m/CAB-JLwbBFNuASyEnZWP0Tck9uNkthBZqi6WoXNevUT6+mV8XmA@mail.gmail.com
Tom Lane [Thu, 2 Apr 2026 16:20:26 +0000 (12:20 -0400)]
Harden astreamer tar parsing logic against archives it can't handle.
Previously, there was essentially no verification in this code that
the input is a tar file at all, let alone that it fits into the
subset of valid tar files that we can handle. This was exposed by
the discovery that we couldn't handle files that FreeBSD's tar
makes, because it's fairly aggressive about converting sparse WAL
files into sparse tar entries. To fix:
* Bail out if we find a pax extension header. This covers the
sparse-file case, and also protects us against scenarios where
the pax header changes other file properties that we care about.
(Eventually we may extend the logic to actually handle such
headers, but that won't happen in time for v19.)
* Be more wary about tar file type codes in general: do not assume
that anything that's neither a directory nor a symlink must be a
regular file. Instead, we just ignore entries that are none of the
three supported types.
* Apply pg_dump's isValidTarHeader to verify that a purported
header block is actually in tar format. To make this possible,
move isValidTarHeader into src/port/tar.c, which is probably where
it should have been since that file was created.
I also took the opportunity to const-ify the arguments of
isValidTarHeader and tarChecksum, and to use symbols not hard-wired
constants inside tarChecksum.
Back-patch to v18 but not further. Although this code exists inside
pg_basebackup in older branches, it's not really exposed in that
usage to tar files that weren't generated by our own code, so it
doesn't seem worth back-porting these changes across 3c9056981
and f80b09bac. I did choose to include a back-patch of 5868372bb
into v18 though, to minimize cosmetic differences between these
two branches.
Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/3049460.1775067940@sss.pgh.pa.us>
Backpatch-through: 18
Remove redundant SetLatch() calls in interrupt handling functions
Interrupt handling functions (e.g., HandleCatchupInterrupt(),
HandleParallelApplyMessageInterrupt()) are called only by
procsignal_sigusr1_handler(), which already calls SetLatch()
for the current process at the end of its processing.
Therefore, these interrupt handling functions do not need to
call SetLatch() themselves.
However, previously, some of these functions redundantly
called SetLatch(). This commit removes those unnecessary
calls.
While duplicate SetLatch() calls are redundant, they are
harmless, so this change is not backpatched.
John Naylor [Thu, 2 Apr 2026 12:39:57 +0000 (19:39 +0700)]
Check for __cpuidex and __get_cpuid_count separately
Previously we would only check for the availability of __cpuidex if
the related __get_cpuid_count was not available on a platform.
Future commits will need to access hypervisor information about
the TSC frequency of x86 CPUs. For that case __cpuidex is the only
viable option for accessing a high leaf (e.g. 0x40000000), since
__get_cpuid_count does not allow that.
__cpuidex is defined in cpuid.h for gcc/clang, but in intrin.h
for MSVC, so adjust tests to suite. We also need to cast the array
of unsigned ints to signed, since gcc (with -Wall) and clang emit
warnings otherwise.
Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: John Naylor <john.naylor@postgresql.org>
Discussion: https://postgr.es/m/CAP53PkyooCeR8YV0BUD_xC7oTZESHz8OdA=tP7pBRHFVQ9xtKg@mail.gmail.com