Nathan Bossart [Tue, 17 Mar 2026 16:32:40 +0000 (11:32 -0500)]
pg_dump: Simplify query for retrieving attribute statistics.
This query fetches information from pg_stats, which did not return
table OIDs until recent commit 3b88e50d6c. Because of this, we had
to cart around arrays of schema and table names, and we needed an
extra filter clause to hopefully convince the planner to use the
correct index. With the introduction of pg_stats.tableid, we can
instead just use an array of OIDs, and we no longer need the extra
filter clause hack.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CADkLM%3DcoCVy92QkVUUTLdo5eO2bMDtwMrzRn_8miAhX%2BuPaqXg%40mail.gmail.com
Peter Eisentraut [Tue, 17 Mar 2026 15:44:43 +0000 (16:44 +0100)]
Hardcode typeof_unqual to __typeof_unqual__ for clang
A new attempt was made in 63275ce84d2 to make typeof_unqual work on all
configurations of CC and CLANG. This re-introduced an old problem
though, where CLANG would only support __typeof_unqual__ but the
configure check for CC detected support for typeof_unqual.
This fixes that by always defining typeof_unqual as __typeof_unqual__
under clang.
Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org
Andrew Dunstan [Tue, 10 Mar 2026 06:25:29 +0000 (14:25 +0800)]
make immutability tests in to_json and to_jsonb complete
Complete the TODOs in to_json_is_immutable() and to_jsonb_is_immutable()
by recursing into container types (arrays, composites, ranges, multiranges,
domains) to check element/sub-type mutability, rather than conservatively
returning "mutable" for all arrays and composites.
The shared logic is factored into a single json_check_mutability() function
in jsonfuncs.c, with the existing exported functions as thin wrappers.
Composite type inspection uses lookup_rowtype_tupdesc() (typcache) instead
of relation_open() to avoid unnecessary lock acquisition in the optimizer.
Range and multirange types are now also checked recursively: if the
subtype's conversion is immutable, the range is considered immutable
for JSON purposes, even though range_out is generically marked STABLE.
This is a behavioral change: range types with immutable subtypes (e.g.,
int4range) can now appear in expression indexes via JSON_ARRAY/JSON_OBJECT,
whereas previously they were conservatively rejected.
Add regression tests for JSON_ARRAY and JSON_OBJECT mutability with
expression indexes and generated columns, covering arrays, composites,
domains, ranges, multiranges and combinations thereof.
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CACJufxFz=OsXQdsMJ-cqoqspD9aJrwntsQP-U2A-UaV_M+-S9g@mail.gmail.com
Commitfest: https://commitfest.postgresql.org/patch/5759
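The recursive walk described above can be sketched in miniature. This is a toy model with hypothetical types and names (json_check_mutability_sketch, ToyType), not jsonfuncs.c's actual data structures: a type's conversion to JSON is immutable only if its own output is immutable and, for container types, every element/sub-type is too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TYPE_SCALAR, TYPE_ARRAY, TYPE_COMPOSITE, TYPE_RANGE } TypeKind;

typedef struct ToyType
{
    TypeKind    kind;
    bool        scalar_immutable;       /* scalars only: cast-to-JSON immutable? */
    const struct ToyType *subtypes[4];  /* NULL-terminated element/field/sub-types */
} ToyType;

/* Recurse into containers instead of conservatively returning "mutable". */
bool
json_check_mutability_sketch(const ToyType *t)
{
    if (t->kind == TYPE_SCALAR)
        return t->scalar_immutable;
    for (int i = 0; i < 4 && t->subtypes[i] != NULL; i++)
        if (!json_check_mutability_sketch(t->subtypes[i]))
            return false;
    return true;
}

/* examples: int4-to-json is immutable; timestamptz output depends on TimeZone */
const ToyType toy_int4 = {TYPE_SCALAR, true, {NULL}};
const ToyType toy_timestamptz = {TYPE_SCALAR, false, {NULL}};
const ToyType toy_int4range = {TYPE_RANGE, false, {&toy_int4, NULL}};
const ToyType toy_mixed_composite = {TYPE_COMPOSITE, false, {&toy_int4, &toy_timestamptz, NULL}};
```

Under this model, toy_int4range comes out immutable while toy_mixed_composite does not, mirroring the behavioral change for ranges with immutable subtypes described above.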
Nathan Bossart [Tue, 17 Mar 2026 14:26:27 +0000 (09:26 -0500)]
Add more columns to pg_stats, pg_stats_ext, and pg_stats_ext_exprs.
This commit adds table OID and attribute number columns to
pg_stats, and it adds table OID and statistics object OID columns
to pg_stats_ext and pg_stats_ext_exprs. A proposed follow-up
commit would use pg_stats.tableid to simplify a query in pg_dump.
The others have no immediate purpose but may be useful later.
Bumps catversion.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM%3DcoCVy92QkVUUTLdo5eO2bMDtwMrzRn_8miAhX%2BuPaqXg%40mail.gmail.com
Peter Eisentraut [Tue, 17 Mar 2026 13:06:53 +0000 (14:06 +0100)]
Dump labels in reproducible order
In pg_get_propgraphdef(), sort the labels before writing out, for a
consistent dump order. Also, since we now have a list, we can get rid
of the separate table scan to get the count.
Don't leave behind files in src dir in 007_multixact_conversion.pl
pg_upgrade test 007_multixact_conversion.pl was leaving files like
delete_old_cluster.sh in the source directory for VPATH and meson
builds. To fix, change the tmp_check directory before running the
test, like in the other pg_upgrade tests.
Michael Paquier [Tue, 17 Mar 2026 08:38:55 +0000 (17:38 +0900)]
gen_guc_tables.pl: Improve detection of inconsistent data
This commit adds two improvements to gen_guc_tables.pl:
1) When finding two entries with the same name, the script complained
about them not being in alphabetical order, which was confusing.
Duplicated entries are now reported as their own error.
2) While the presence of the required fields is checked for all the
parameters, the script did not perform any checks on the non-required
fields. A check is added to verify that any field defined matches one
of the accepted field names. Previously, a typo in the name of a required
field would cause the field to be reported as missing. Non-mandatory
fields would be silently ignored, which was problematic as we could lose
some information.
Author: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAN4CZFP=3xUoXb9jpn5OWwicg+rbyrca8-tVmgJsQAa4+OExkw@mail.gmail.com
Michael Paquier [Tue, 17 Mar 2026 05:34:29 +0000 (14:34 +0900)]
Refactor some code around ALTER TABLE [NO] INHERIT
[NO] INHERIT is not supported for partitioned tables, but this portion
of tablecmds.c did not apply the same rules as the other sub-commands,
checking the relkind in the execution phase, not the preparation phase.
This commit refactors the code to centralize the relkind and other
checks in the preparation phase for both command patterns, getting rid
of one translatable string on the way. ATT_PARTITIONED_TABLE is
removed from ATSimplePermissions(), and the child relation is checked
the same way for both sub-commands. The ALTER TABLE patterns that now
fail at preparation failed already at execution, hence there should be
no changes from the user perspective, except that the generated error
messages are more consistent.
Some comments at the top of ATPrepAddInherit() were incorrect,
CreateInheritance() being the routine checking the columns and
constraints between the parent and its to-be-child.
Author: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAEoWx2kggo1N2kDH6OSfXHL_5gKg3DqQ0PdNuL4LH4XSTKJ3-g@mail.gmail.com
Michael Paquier [Tue, 17 Mar 2026 03:56:46 +0000 (12:56 +0900)]
Tweak TAP test for worker terminations in worker_spi
The test has been reported as having a race condition for the case of a
worker that should be terminated after a database rename. Based on the
report received from buildfarm member jay, the renamed database is
accessed by a different session, preventing the ALTER DATABASE from
completing, ultimately failing the test.
Honestly, I am not completely sure what the origin of this disturbance
is, but two possibilities are an autovacuum or parallel worker
(due to debug_parallel_query being used by the host). In order to
(hopefully) stabilize the test, autovacuum and debug_parallel_query are
now disabled in the configuration of the node used in the test.
The failure is hard to reproduce, so it will take a few weeks to make
sure that the test has become stable. Let's see where it goes.
Reported-by: Aya Iwata <iwata.aya@fujitsu.com>
Discussion: https://postgr.es/m/OS3PR01MB8889505E2F3E443CCA4BD72EEA45A@OS3PR01MB8889.jpnprd01.prod.outlook.com
David Rowley [Tue, 17 Mar 2026 02:06:31 +0000 (15:06 +1300)]
Reduce size of CompactAttribute struct to 8 bytes
Previously, this was 16 bytes. With the use of some bitflags and by
reducing the attcacheoff field size to a 16-bit type, we can halve the
size of the struct.
It's unlikely that caching offsets larger than what will fit in a
16-bit int would help much, as such a tuple is very likely to contain
some non-fixed-width types anyway, whose offsets we cannot cache.
Shrinking this down to 8 bytes helps by accessing fewer cachelines when
performing tuple deformation. The fields used there are all fully
fledged fields, whose values can be extracted without any bitmasking.
It also makes calculating the address of a compact_attrs[] element in
TupleDesc more efficient: the x86 LEA instruction can work with 8-byte
offsets, which allows the element address to be calculated from the
TupleDesc's address in a single instruction using LEA's combined shift
and add.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com
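A hedged sketch of this kind of repacking (field names and exact layout are illustrative, not the actual CompactAttribute definition): per-property bools become bits in a flags byte, and the cached offset narrows to 16 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Before: one-byte bool per property plus a 32-bit cached offset pads
 * the struct out to 16 bytes on typical 64-bit ABIs. */
typedef struct WideAttrSketch
{
    int32_t     attcacheoff;    /* cached offset, or -1 */
    int16_t     attlen;
    bool        attbyval;
    bool        attnotnull;
    bool        atthasmissing;
    bool        attisdropped;
    bool        attgenerated;
    bool        attispackable;
    uint8_t     attalignby;
} WideAttrSketch;

/* After: pack the bools into bitflags and narrow attcacheoff to 16 bits,
 * since an offset beyond INT16_MAX is rarely cacheable anyway (some
 * earlier attribute is almost certainly variable-width). */
#define ATT_BYVAL       0x01
#define ATT_NOTNULL     0x02
#define ATT_HASMISSING  0x04
#define ATT_ISDROPPED   0x08
#define ATT_GENERATED   0x10
#define ATT_ISPACKABLE  0x20

typedef struct CompactAttrSketch
{
    int16_t     attcacheoff;    /* cached offset, or -1 */
    int16_t     attlen;
    uint8_t     attflags;       /* ATT_* bits */
    uint8_t     attalignby;
    uint16_t    reserved;       /* explicit padding: total 8 bytes */
} CompactAttrSketch;
```

An 8-byte element size also lets the compiler compute &compact_attrs[i] with a single shift-and-add addressing mode, as the commit notes for x86's LEA.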
Fujii Masao [Mon, 16 Mar 2026 23:10:20 +0000 (08:10 +0900)]
Fix WAL flush LSN used by logical walsender during shutdown
Commit 6eedb2a5fd8 made the logical walsender call
XLogFlush(GetXLogInsertRecPtr()) to ensure that all pending WAL is flushed,
fixing a publisher shutdown hang. However, if the last WAL record ends at
a page boundary, GetXLogInsertRecPtr() can return an LSN pointing past
the page header, which can cause XLogFlush() to report an error.
A similar issue previously existed in the GiST code. Commit b1f14c96720
introduced GetXLogInsertEndRecPtr(), which returns a safe WAL insertion end
location (returning the start of the page when the last record ends at a page
boundary), and updated the GiST code to use it with XLogFlush().
This commit fixes the issue by making the logical walsender use
XLogFlush(GetXLogInsertEndRecPtr()) when flushing pending WAL during shutdown.
Jeff Davis [Mon, 16 Mar 2026 20:42:55 +0000 (13:42 -0700)]
Clean up postgres_fdw/t/010_subscription.pl.
The test was based on test/subscription/002_rep_changes.pl, but had
some leftover copy+paste problems that were useless and/or
distracting.
Reported-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1+=V_UFNHwcoMFqzy0F4AtS9_GyXhQDUzizgieQPWr=0A@mail.gmail.com
David Rowley [Mon, 16 Mar 2026 20:00:39 +0000 (09:00 +1300)]
Fix thinko in nocachegetattr() and nocache_index_getattr()
This code was recently adjusted by c456e3911, but that commit didn't get
the logic correct when finding the attnum from which to start walking
the tuple. If there is a NULL, we need to start walking the tuple from
before it.
Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNnb-s_=VdVUZ9h7dPA0u3hxV8x2aU3obZytnqQZ_MiROA@mail.gmail.com
Robert Haas [Mon, 16 Mar 2026 18:37:12 +0000 (14:37 -0400)]
pg_plan_advice: Fix failures to accept identifier keywords.
TOK_IDENT allows only non-keywords; identifier should be used
any place where either keywords or non-keywords should be accepted.
Hence, without this commit, any string that happens to be a keyword
can't be used as a partition schema, partition name, or plan name,
which is incorrect.
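The distinction can be sketched with a toy name-acceptance check (the keyword list here is hypothetical, for illustration only; pg_plan_advice's grammar has its own keyword set):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical keyword list, stand-in for the real grammar's keywords. */
static const char *const keywords[] = {"SEQSCAN", "INDEXSCAN", "JOINORDER", NULL};

static bool
is_keyword(const char *s)
{
    for (int i = 0; keywords[i] != NULL; i++)
        if (strcmp(s, keywords[i]) == 0)
            return true;
    return false;
}

/* A TOK_IDENT-style rule rejects keywords outright; an identifier-style
 * rule accepts them too, so a plan or partition whose name happens to
 * collide with a keyword still parses. */
bool
name_accepted(const char *s, bool identifier_rule)
{
    return identifier_rule || !is_keyword(s);
}
```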
Peter Eisentraut [Mon, 16 Mar 2026 17:56:30 +0000 (18:56 +0100)]
Hardcode override of typeof_unqual for clang-for-bitcode
The fundamental problem is that when we call clang to generate
bitcode, we might be using configure results from a different
compiler, which might not be fully compatible with the clang we are
using. In practice, clang supports most things other compilers
support, so this has apparently not been a problem in practice.
But commits 4cfce4e62c8, 0af05b5dbb4, and 59292f7aac7 have been
struggling to make typeof_unqual work in this situation. Clang added
support in version 19, GCC in version 14, so if you are using, say,
GCC 14 and Clang 16, the compilation with the latter will fail. Such
combinations are not very likely in practice, because GCC 14 and Clang
19 were released within a few months of each other, and so Linux
distributions are likely to have suitable combinations. But some
buildfarm members and some Fedora versions are affected, so this tries
to fix it.
The fully correct solution would be to run a separate set of configure
tests for that clang-for-bitcode, but that would be very difficult to
implement, and probably of limited use in practice. So the workaround
here is that we hardcodedly override the configure result under clang
based on the version number. As long as we only have a few of these
cases, this should be manageable.
Also swap the order of the tests of typeof_unqual: Commit 59292f7aac7
tested the underscore variant first, but the reasons for that are now
gone.
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org
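The version-gated override can be sketched as a preprocessor fragment (the macro name sketch_typeof_unqual is hypothetical; clang reports its feature level via __clang_major__, and typeof_unqual support arrived in clang 19):

```c
#include <assert.h>

/* When clang compiles this file (e.g. for bitcode), ignore the configure
 * probe, which may have run against a different compiler, and decide
 * based on clang's own version number instead. */
#if defined(__clang__) && __clang_major__ >= 19
#define sketch_typeof_unqual(x) __typeof_unqual__(x)
#else
/* best-effort fallback: GNU __typeof__ exists everywhere we care about,
 * but it keeps qualifiers */
#define sketch_typeof_unqual(x) __typeof__(x)
#endif

int
demo_unqual(void)
{
    int         i = 41;
    sketch_typeof_unqual(i) v = i + 1;  /* plain int under clang >= 19 */

    return v;
}
```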
Nathan Bossart [Mon, 16 Mar 2026 16:01:20 +0000 (11:01 -0500)]
pg_dumpall: Fix handling of incompatible options.
This commit teaches pg_dumpall to fail when both --clean and
--data-only are specified. Previously, it passed the options
through to pg_dump, which would fail after pg_dumpall had already
started producing output.
Like recent commits b2898baaf7 and 7c8280eeb5, no back-patch.
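The fix amounts to validating the option combination up front, before any output is produced (a sketch with hypothetical names, not pg_dumpall's actual option-handling code):

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if the option combination is valid.  pg_dumpall should
 * fail fast on an invalid combination instead of forwarding it to
 * pg_dump, which would only fail after output has already begun. */
bool
options_compatible(bool clean, bool data_only)
{
    if (clean && data_only)
        return false;   /* --clean needs schema to drop; --data-only emits none */
    return true;
}
```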
Álvaro Herrera [Mon, 16 Mar 2026 13:34:57 +0000 (14:34 +0100)]
Reduce header inclusions via execnodes.h
Remove a bunch of #include lines from execnodes.h. Most of these
require suitable typedefs to be added, so that it still compiles
standalone. In one case, the fix is to move a struct definition to the
one .c file where it is needed.
Also some light cleanup in plannodes.h and genam.h, though not as
extensive as in execnodes.h.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/202603131240.ihwqdxnj7w2o@alvherre.pgsql
Fujii Masao [Mon, 16 Mar 2026 12:05:13 +0000 (21:05 +0900)]
Remove unstable test for pg_statio_all_sequences stats reset
Commit 8fe315f18d4 added the stats_reset column to pg_statio_all_sequences and
included a regression test to verify that statistics in this view are reset
correctly. However, this test caused buildfarm member crake to report
a pg_upgradeCheck failure.
The failing test assumed that the blks_read and blks_hit counters
in pg_statio_all_sequences would be zero after calling
pg_stat_reset_single_table_counters(). On crake, however, either blks_read or
blks_hit sometimes appeared as 1 during the pg_upgradeCheck test, even right
after the reset.
Since these counters may change due to concurrent activity and the test is
unstable, this commit removes the checks for blks_read and blks_hit in
pg_statio_all_sequences from the regression test.
Peter Eisentraut [Mon, 16 Mar 2026 10:52:02 +0000 (11:52 +0100)]
Fix pg_upgrade failure when extension_control_path is used
When an extension is located via extension_control_path and it has a
hardcoded $libdir/ path, this is stripped by the
extension_control_path mechanism. But when pg_upgrade verifies the
extension using LOAD, this stripping does not happen, and so
pg_upgrade will fail because it cannot load the extension. To work
around that, change pg_upgrade to itself strip the prefix when it runs
its checks. A test case is also added.
Author: Jonathan Gonzalez V. <jonathan.abdiel@gmail.com>
Reviewed-by: Niccolò Fei <niccolo.fei@enterprisedb.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/43b3691c673a8b9158f5a09f06eacc3c63e2c02d.camel%40gmail.com
Peter Eisentraut [Mon, 16 Mar 2026 09:53:24 +0000 (10:53 +0100)]
Prevent -Wstrict-prototypes and -Wold-style-definition warnings
A following commit will enable -Wstrict-prototypes and -Wold-style-definition
by default. This commit fixes the warnings that those new flags will generate
before actually adding the new flags.
Peter Eisentraut [Mon, 16 Mar 2026 09:14:18 +0000 (10:14 +0100)]
SQL Property Graph Queries (SQL/PGQ)
Implementation of SQL property graph queries, according to SQL/PGQ
standard (ISO/IEC 9075-16:2023).
This adds:
- GRAPH_TABLE table function for graph pattern matching
- DDL commands CREATE/ALTER/DROP PROPERTY GRAPH
- several new system catalogs and information schema views
- psql \dG command
- pg_get_propgraphdef() function for pg_dump and psql
A property graph is a relation with a new relkind RELKIND_PROPGRAPH.
It acts like a view in many ways. It is rewritten to a standard
relational query in the rewriter. Access privileges act similarly to a
security invoker view. (The security definer variant is not currently
implemented.)
Starting documentation can be found in doc/src/sgml/ddl.sgml and
doc/src/sgml/queries.sgml.
Fujii Masao [Mon, 16 Mar 2026 09:10:57 +0000 (18:10 +0900)]
Ensure "still waiting on lock" message is logged only once per wait.
When log_lock_waits is enabled, the "still waiting on lock" message is normally
emitted only once while a session continues waiting. However, if the wait is
interrupted, for example by wakeups from client_connection_check_interval,
SIGHUP for configuration reloads, or similar events, the message could be
emitted again each time the wait resumes.
For example, with very small client_connection_check_interval values
(e.g., 100 ms), this behavior could flood the logs with repeated messages,
making them difficult to use.
To prevent this, this commit guards the "still waiting on lock" message so
it is reported at most once during a lock wait, even if the wait is interrupted.
This preserves the intended behavior when no interrupts occur.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Hüseyin Demir <huseyin.d3r@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHZUmg+r4kMcPYt_Z-txxVX+CJJhfra+qemxKXvAxYbpw@mail.gmail.com
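The guard can be sketched as a small simulation (helper names are hypothetical, not the actual proc.c code): a per-wait flag suppresses repeats when the sleep is interrupted and resumed, and is re-armed when the wait ends.

```c
#include <assert.h>
#include <stdbool.h>

static int  log_emissions = 0;
static bool logged_still_waiting = false;

/* One pass through the wait loop; the wait may be interrupted and
 * resumed many times (client_connection_check_interval, SIGHUP, ...). */
void
wait_step(bool deadlock_timeout_elapsed)
{
    if (deadlock_timeout_elapsed && !logged_still_waiting)
    {
        log_emissions++;                /* emit "still waiting on lock" */
        logged_still_waiting = true;    /* suppress repeats on resume */
    }
}

void
lock_wait_finished(void)
{
    logged_still_waiting = false;       /* re-arm for the next lock wait */
}

int
get_log_emissions(void)
{
    return log_emissions;
}
```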
Michael Paquier [Mon, 16 Mar 2026 08:48:39 +0000 (17:48 +0900)]
Reject ALTER TABLE .. CLUSTER earlier for partitioned tables
ALTER TABLE .. CLUSTER ON and SET WITHOUT CLUSTER are not supported for
partitioned tables and already fail with a check happening when the
sub-command is executed, not when it is prepared.
This commit moves the relkind check for partitioned tables to happen
when the sub-command is prepared in ATSimplePermissions(). This matches
with the practice of the other sub-commands of ALTER TABLE, shaving one
translatable string.
mark_index_clustered() can be a bit simplified, switching one
elog(ERROR) to an assertion. Note that mark_index_clustered() can also
be called through a CLUSTER command, but it cannot be reached for a
partitioned table, per the assertion based on the relkind in
cluster_rel(), and there is only one caller of rebuild_relation().
Author: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAEoWx2kggo1N2kDH6OSfXHL_5gKg3DqQ0PdNuL4LH4XSTKJ3-g@mail.gmail.com
Fujii Masao [Mon, 16 Mar 2026 08:24:08 +0000 (17:24 +0900)]
Add stats_reset column to pg_statio_all_sequences
pg_statio_all_sequences lacked a stats_reset column, unlike the other
pg_statio_* views that already expose it. This commit adds the column so
users can see when the statistics in this view were last reset.
Also this commit updates the documentation for
pg_stat_reset_single_table_counters() to clarify that it can reset statistics
for sequences and materialized views as well.
Amit Kapila [Mon, 16 Mar 2026 04:44:22 +0000 (10:14 +0530)]
Remove obsolete speculative insert cleanup in ReorderBuffer.
Commit 4daa140a2f introduced proper decoding for speculative aborts. As a
result, the internal state is guaranteed to be clean when a new
speculative insert is encountered. This patch removes the defensive
cleanup code that is no longer reachable.
Fujii Masao [Mon, 16 Mar 2026 03:13:11 +0000 (12:13 +0900)]
file_fdw: Add regression test for file_fdw with ON_ERROR='set_null'
Commit 2a525cc97e1 introduced the ON_ERROR = 'set_null' option for COPY,
allowing it to be used with foreign tables backed by file_fdw. However,
unlike ON_ERROR = 'ignore', no regression test was added to verify
this behavior for file_fdw.
This commit adds a regression test to ensure that foreign tables using
file_fdw work correctly with ON_ERROR = 'set_null', improving test coverage.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Yi Ding <dingyi_yale@163.com>
Discussion: https://postgr.es/m/CAHGQGwGmPc6aHpA5=WxKreiDePiOEitfOFsW2dSo5m81xWXgRA@mail.gmail.com
Michael Paquier [Mon, 16 Mar 2026 00:22:09 +0000 (09:22 +0900)]
Optimize hash index bulk-deletion with streaming read
This commit refactors hashbulkdelete() to use streaming reads, improving
the efficiency of the operation by prefetching upcoming buckets while
processing a current bucket. There are some specific changes required
to make sure that the cleanup work happens in accordance to the data
pushed to the stream read callback. When the cached metadata page is
refreshed to be able to process the next set of buckets, the stream is
reset and the data fed to the stream read callback has to be updated.
The reset needs to happen in two code paths, when _hash_getcachedmetap()
is called.
The author has seen better performance numbers than I have on this one
(with tweaks similar to 6c228755add8). The numbers are good enough for
both of us that this change is worth doing, in terms of IO and runtime.
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
Tom Lane [Sun, 15 Mar 2026 23:34:52 +0000 (19:34 -0400)]
Move -ffast-math defense to float.c and remove the configure check.
We had defenses against -ffast-math in timestamp-related files,
which is a pretty obsolete place for them since we've not supported
floating-point timestamps in a long time. Remove those and instead
put one in float.c, which is still broken by using this switch.
Add some commentary to put more color on why it's a bad idea.
Also remove the check from configure. That was just there to fail
faster, but it doesn't really seem necessary anymore, and besides
we have no corresponding check in meson.build.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/abFXfKC8zR0Oclon%40ip-10-97-1-34.eu-west-3.compute.internal
Tom Lane [Sun, 15 Mar 2026 22:55:37 +0000 (18:55 -0400)]
Be more careful about int vs. Oid in ecpglib.
Print an OID value inserted into a SQL query with %u not %d.
The existing code accidentally fails to malfunction when
given an OID above 2^31, but only accidentally; future changes
to our SQL parser could perhaps break it.
Declare the Oid values that ecpg_type_infocache_push() and
ecpg_is_type_an_array() work with as "Oid" not "int".
This doesn't have any functional effect, but it's clearer.
At the moment I don't see a need to back-patch this.
Bug: #19429
Author: fairyfar@msn.com
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19429-aead3b1874be1a99@postgresql.org
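The %d hazard is easy to demonstrate in a standalone sketch (format_oid is a hypothetical helper; Oid in PostgreSQL is an unsigned 32-bit type):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef unsigned int Oid;   /* as in PostgreSQL: unsigned 32-bit */

/* Formats an OID for interpolation into SQL text.  %u is correct for the
 * full 32-bit range; %d would render OIDs above 2^31-1 as negative
 * numbers, which only accidentally round-trips through the SQL parser. */
int
format_oid(char *buf, size_t buflen, Oid oid)
{
    return snprintf(buf, buflen, "%u", oid);
}
```

For Oid 3000000000 this yields "3000000000"; the same bit pattern printed with %d would typically come out as "-1294967296".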
David Rowley [Sun, 15 Mar 2026 22:46:00 +0000 (11:46 +1300)]
Optimize tuple deformation
This commit includes various optimizations to improve the performance of
tuple deformation.
We now precalculate CompactAttribute's attcacheoff, which allows us to
remove the code from the deform routines which was setting the
attcacheoff. Setting the attcacheoff is now handled by
TupleDescFinalize(), which must be called before the TupleDesc is used for
anything. Having TupleDescFinalize() means we can store the first
attribute in the TupleDesc which does not have an offset cached. That
allows us to add a dedicated deforming loop to deform all attributes up
to the final one with an attcacheoff set, or up to the first NULL
attribute, whichever comes first.
Here we also improve tuple deformation performance of tuples with NULLs.
Previously, if the HEAP_HASNULL bit was set in the tuple's t_infomask,
deforming would, one-by-one, check each and every bit in the NULL bitmap
to see if it was zero. Now, we process the NULL bitmap 1 byte at a time
rather than 1 bit at a time to find the attnum with the first NULL. We
can now deform the tuple without checking for NULLs up to just before that
attribute.
We also record the maximum attribute number which is guaranteed to exist
in the tuple, that is, has a NOT NULL constraint and isn't an
atthasmissing attribute. When deforming only attributes prior to the
guaranteed attnum, we've no need to access the tuple's natt count. As an
additional optimization, we only count fixed-width columns when
calculating the maximum guaranteed column, as this eliminates the need to
emit code to fetch byref types in the deformation loop for guaranteed
attributes.
Some locations in the code deform tuples that have yet to go through NOT
NULL constraint validation. We're unable to perform the guaranteed
attribute optimization when that's the case. This optimization is opt-in
via the TupleTableSlot using the TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS
flag.
This commit also adds a more efficient way of populating the isnull
array, using a bit-wise SWAR trick that multiplies the inverse of the
tuple's bitmap byte and masks out all but the low bit of each resulting
boolean byte. This produces much better code than determining the
NULLness of each attribute via att_isnull(). 8 isnull
elements are processed at once using this method, which means we need to
round the tts_isnull array size up to the next 8 bytes. The palloc code
does this anyway, but the round-up needed to be formalized so as not to
overwrite the sentinel byte in MEMORY_CONTEXT_CHECKING builds. Doing
this also allows the NULL-checking deforming loop to more efficiently
check the isnull array, rather than doing the bit-wise processing for each
attribute that att_isnull() does.
The level of performance improvement from these changes seems to vary
depending on the CPU architecture. Apple's M chips seem particularly
fond of the changes, with some of the tested deform-heavy queries going
over twice as fast as before. With x86-64, the speedups aren't quite as
large. With tables containing only a small number of columns, the
speedups will be less.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com
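The byte-at-a-time NULL-bitmap expansion can be illustrated with a SWAR sketch (an illustrative formulation, not necessarily the exact expression used in the commit): a heap NULL bitmap byte has a 1 bit for each present attribute, so inverting it and spreading its bits into bytes yields eight one-byte isnull flags at once.

```c
#include <assert.h>
#include <stdint.h>

/* Expand the 8 bits of b into 8 bytes, each holding 0x00 or 0x01:
 * byte k of the result is bit k of b.
 *
 *   1. replicate b into all 8 bytes of a uint64
 *   2. mask so byte k keeps only bit k
 *   3. smear each surviving bit down within its byte and keep only the
 *      low bit of each byte, giving a bool-compatible 0/1 per byte   */
uint64_t
spread_bits_to_bytes(uint8_t b)
{
    uint64_t    x = b * UINT64_C(0x0101010101010101);   /* replicate */

    x &= UINT64_C(0x8040201008040201);                  /* byte k keeps bit k */
    x |= x >> 4;
    x |= x >> 2;
    x |= x >> 1;
    return x & UINT64_C(0x0101010101010101);            /* low bit per byte */
}
```

Writing spread_bits_to_bytes(~bitmap_byte) over eight one-byte bool entries marks eight isnull slots in one step, which is also why the isnull array must be sized in multiples of 8 bytes.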
David Rowley [Sun, 15 Mar 2026 22:45:49 +0000 (11:45 +1300)]
Add all required calls to TupleDescFinalize()
As of this commit all TupleDescs must have TupleDescFinalize() called on
them once the TupleDesc is set up and before BlessTupleDesc() is called.
In this commit, TupleDescFinalize() does nothing. This change has only
been separated out from the commit that properly implements this function
to make the change more obvious. Any extension which makes its own
TupleDesc will need to be modified to call the new function.
The follow-up commit which properly implements TupleDescFinalize() will
cause any code which forgets to do this to fail in assert-enabled builds in
BlessTupleDesc(). It may still be worth mentioning this change in the
release notes so that extension authors update their code.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com
Tom Lane [Sun, 15 Mar 2026 22:05:38 +0000 (18:05 -0400)]
Save a few bytes per CatCTup.
CatalogCacheCreateEntry() computed the space needed for a CatCTup
as sizeof(CatCTup) + MAXIMUM_ALIGNOF. That's not our usual style,
and it wastes memory by allocating more padding than necessary.
On 64-bit machines sizeof(CatCTup) would be maxaligned already
since it contains pointer fields, therefore this code is wasting
8 bytes compared to the more usual MAXALIGN(sizeof(CatCTup)).
While at it, we don't really need to do MemoryContextSwitchTo()
when we're only allocating one block.
Author: ChangAo Chen <cca5507@qq.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/tencent_A42E0544C6184FE940CD8E3B14A3F0A39605@qq.com
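The difference is just the usual alignment idiom. MAXALIGN below reproduces PostgreSQL's standard round-up macro for illustration; the two sizing functions are hypothetical stand-ins for the old and new codings:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAXIMUM_ALIGNOF 8   /* typical 64-bit platform */
#define MAXALIGN(LEN) \
    (((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

/* old coding: unconditionally adds a full alignment quantum of padding
 * before the maxaligned tuple that follows the struct */
size_t
catctup_alloc_size_old(size_t struct_size, size_t tuple_len)
{
    return struct_size + MAXIMUM_ALIGNOF + tuple_len;
}

/* new coding: pads only when the struct size isn't already maxaligned */
size_t
catctup_alloc_size_new(size_t struct_size, size_t tuple_len)
{
    return MAXALIGN(struct_size) + tuple_len;
}
```

With a pointer-containing struct whose size is already a multiple of 8 (say 48 bytes), the old form reserves 56 bytes ahead of the tuple and the new form 48, saving 8 bytes per entry.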
Tom Lane [Sun, 15 Mar 2026 19:24:04 +0000 (15:24 -0400)]
Fix small memory leak in get_dbname_oid_list_from_mfile().
Coverity complained that this function leaked the dumpdirpath string,
which it did. But we don't need to make a copy at all, because
there's not really any point in trimming trailing slashes from the
directory name here. If that were needed, the initial
file_exists_in_directory() test would have failed, since it doesn't
bother with that (and neither does anyplace else in this file).
Moreover, if we did want that, reimplementing canonicalize_path()
poorly is not the way to proceed. Arguably, all of this code should
be reexamined with an eye to using src/port/path.c's facilities, but
for today I'll settle for getting rid of the memory leak.
Melanie Plageman [Sun, 15 Mar 2026 15:09:10 +0000 (11:09 -0400)]
Save vmbuffer in heap-specific scan descriptors for on-access pruning
Future commits will use the visibility map in on-access pruning to fix
VM corruption and set the VM if the page is all-visible.
Saving the vmbuffer in the scan descriptor reduces the number of times
it would need to be pinned and unpinned, making the overhead of doing so
negligible.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/C3AB3F5B-626E-4AAA-9529-23E9A20C727F%40gmail.com
Melanie Plageman [Sun, 15 Mar 2026 14:42:34 +0000 (10:42 -0400)]
Avoid BufferGetPage() calls in heap_update()
BufferGetPage() isn't cheap and heap_update() calls it multiple times
when it could just save the page from a single call. Do that.
While we are at it, make separate variables for old and new page in
heap_xlog_update(). It's confusing to reuse "page" for both pages.
Tom Lane [Sat, 14 Mar 2026 18:10:32 +0000 (14:10 -0400)]
Switch the semaphore API on Solaris to unnamed POSIX.
Solaris descendants (Illumos, OpenIndiana, OmniOS, etc.) hit System V
semaphore limits ("No space left on device" from semget) when running
many parallel test scripts under default system settings. We could
tell people to raise those settings, but there's a better answer.
Unnamed POSIX semaphores have been available on Solaris for decades
and work well, so prefer them, as was recently done for AIX.
This patch also updates the documentation to remove now-unnecessary
advice about raising project.max-sem-ids and project.max-msg-ids.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Greg Burd <greg@burd.me>
Discussion: https://postgr.es/m/470305.1772417108@sss.pgh.pa.us
Tom Lane [Sat, 14 Mar 2026 17:46:54 +0000 (13:46 -0400)]
Fix aclitemout() to work during early bootstrap.
"initdb -d" has been broken since commit f95d73ed4, because I changed
aclitemin to work in bootstrap mode but failed to consider aclitemout.
That routine isn't reached by default, but it is if the elog message
level is high enough, so it needs to work without catalog access too.
This patch just makes it use its existing code paths to print role
OIDs numerically. We could alternatively invent an inverse of
boot_get_role_oid() and print them symbolically, but that would take
more code and it's not apparent that it'd be any better for debugging
purposes.
Reported-by: Greg Burd <greg@burd.me>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/4416.1773328045@sss.pgh.pa.us
Tomas Vondra [Sat, 14 Mar 2026 14:24:37 +0000 (15:24 +0100)]
Tighten asserts on ParallelWorkerNumber
The comment about ParallelWorkerNumber in parallel.c says:
In parallel workers, it will be set to a value >= 0 and < the number
of workers before any user code is invoked; each parallel worker will
get a different parallel worker number.
However, asserts in various places that collect instrumentation allowed
(ParallelWorkerNumber == num_workers). That would be a bug, as the value
is used as an index into an array with num_workers entries.
Fixed by adjusting the asserts accordingly. Backpatch to all supported
versions.
Michael Paquier [Sat, 14 Mar 2026 06:06:13 +0000 (15:06 +0900)]
pgstattuple: Optimize pgstattuple_approx() with streaming read
This commit plugs the streaming read APIs into pgstattuple_approx(),
the SQL function that is faster than pgstattuple() because it returns
approximate results. A callback is used to skip all-visible pages via
VM lookup, matching the logic prior to this commit.
Under test conditions similar to 6c228755add8 (some dm_delay and
debug_io_direct=data), this can substantially improve the execution time
of the function, particularly for large relations.
Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
David Rowley [Sat, 14 Mar 2026 00:52:09 +0000 (13:52 +1300)]
Allow sibling call optimization in slot_getsomeattrs_int()
This changes the TupleTableSlotOps contract to make it so the
getsomeattrs() function is in charge of calling
slot_getmissingattrs().
Since this removes all code from slot_getsomeattrs_int() aside from the
getsomeattrs() call itself, we may as well adjust slot_getsomeattrs() so
that it calls getsomeattrs() directly. We leave slot_getsomeattrs_int()
intact as this is still called from the JIT code.
Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com
Peter Geoghegan [Sat, 14 Mar 2026 00:37:39 +0000 (20:37 -0400)]
Use fake LSNs to improve nbtree dropPin behavior.
Use fake LSNs in all nbtree critical sections that write a WAL record.
That way we can safely apply the _bt_killitems LSN trick with logged and
unlogged indexes alike. This brings the same benefits to plain scans of
unlogged relations that commit 2ed5b87f brought to plain scans of logged
relations: scans will drop their leaf page pin eagerly (by applying the
"dropPin" optimization), which avoids blocking progress by VACUUM. This
is particularly helpful with applications that allow a scrollable cursor
to remain idle for long periods.
Preparation for an upcoming commit that will add the amgetbatch
interface, and switch nbtree over to it (from amgettuple) to enable I/O
prefetching. The index prefetching read stream's effective prefetch
distance is adversely affected by any buffer pins held by the index AM.
At the same time, it can be useful for prefetching to read dozens of
leaf pages ahead of the scan to maintain an adequate prefetch distance.
The index prefetching patch avoids this tension by always eagerly
dropping index page pins of the kind traditionally held as an interlock
against unsafe concurrent TID recycling by VACUUM (essentially the same
way that amgetbitmap routines have always avoided holding onto pins).
The work from this commit makes that possible during scans of nbtree
unlogged indexes -- without our having to give up on setting LP_DEAD
bits on index tuples altogether.
Follow-up to commit d774072f, which moved the fake LSN infrastructure
out of GiST so that it could be used by other index AMs.
Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
Peter Geoghegan [Fri, 13 Mar 2026 23:38:17 +0000 (19:38 -0400)]
Move fake LSN infrastructure out of GiST.
Move utility functions used by GiST to generate fake LSNs into xlog.c
and xloginsert.c, so that other index AMs can also generate fake LSNs.
Preparation for an upcoming commit that will add support for fake LSNs
to nbtree, allowing its dropPin optimization to be used during scans of
unlogged relations. That commit is itself preparation for another
upcoming commit that will add a new amgetbatch/btgetbatch interface to
enable I/O prefetching.
Bump XLOG_PAGE_MAGIC due to XLOG_GIST_ASSIGN_LSN becoming
XLOG_ASSIGN_LSN.
Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
Tomas Vondra [Fri, 13 Mar 2026 21:42:29 +0000 (22:42 +0100)]
Use GetXLogInsertEndRecPtr in gistGetFakeLSN
The function used GetXLogInsertRecPtr() to generate the fake LSN. Most
of the time this is the same as what XLogInsert() would return, and so
it works fine with the XLogFlush() call. But if the last record ends at
a page boundary, GetXLogInsertRecPtr() returns LSN pointing after the
page header. In such case XLogFlush() fails with errors like this:
ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000
Such failures are very hard to trigger, particularly outside aggressive
test scenarios.
Fixed by introducing GetXLogInsertEndRecPtr(), returning the correct LSN
without skipping the header. This is the same as GetXLogInsertRecPtr(),
except that it calls XLogBytePosToEndRecPtr().
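The off-by-a-header behavior can be modelled with simplified byte-position arithmetic. The page and header sizes are hardcoded here purely for illustration; the real code uses XLOG_BLCKSZ and the actual page-header sizes, and additionally handles the long header at the start of each segment:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified WAL layout: fixed-size pages, each starting with a
 * short header.  (Real WAL also has a longer header on the first
 * page of each segment, ignored here.) */
#define PAGE_SIZE 8192u
#define HDR_SIZE  24u
#define USABLE    (PAGE_SIZE - HDR_SIZE)

/* Map a header-free byte position to a physical LSN pointing at the
 * *start* of the next record: a position exactly on a page boundary
 * lands just past the next page's header. */
uint64_t bytepos_to_recptr(uint64_t bytepos)
{
    uint64_t fullpages = bytepos / USABLE;
    uint64_t offset = bytepos % USABLE;

    return fullpages * PAGE_SIZE + HDR_SIZE + offset;
}

/* Same mapping, but pointing at the *end* of the previous record:
 * on a page boundary the LSN stays at the page start, a position
 * that a flush request can actually be satisfied up to. */
uint64_t bytepos_to_end_recptr(uint64_t bytepos)
{
    uint64_t fullpages = bytepos / USABLE;
    uint64_t offset = bytepos % USABLE;

    if (offset == 0)
        return fullpages * PAGE_SIZE;   /* no header skip */
    return fullpages * PAGE_SIZE + HDR_SIZE + offset;
}
```

In this model the two mappings agree everywhere except on a page boundary, where they differ by exactly the header size -- matching the 0x18 (24-byte) discrepancy in the error message above.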
Initial investigation by me, root cause identified by Andres Freund.
This is a long-standing bug in gistGetFakeLSN(), probably introduced by c6b92041d38 in PG13. Backpatch to all supported versions.
Reported-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/vf4hbwrotvhbgcnknrqmfbqlu75oyjkmausvy66ic7x7vuhafx@e4rvwavtjswo
Backpatch-through: 14
Free memory allocated for unrecognized_protocol_options
Since 4966bd3ed95e, Valgrind has warned about a small amount of memory
being leaked in ProcessStartupPacket(). This is not critical, but the
warnings may distract from real issues. Fix it by freeing the list
after use.
Author: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://www.postgresql.org/message-id/CAJ7c6TN3Hbb5p=UHx0SPVN+h_JwPAV6rxoqOm7gHBMFKfnGK-Q@mail.gmail.com
Nathan Bossart [Fri, 13 Mar 2026 20:04:10 +0000 (15:04 -0500)]
Add convenience view to stats import test.
Presently, many statements in stats_import.sql select all columns
from the pg_stats system view. A proposed follow-up commit would
add columns to this view (some of which are not stable across test
runs), breaking all of these tests. This commit introduces a
convenience view for those statements so that future changes are
minimally disruptive.
Author: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CADkLM%3DcoCVy92QkVUUTLdo5eO2bMDtwMrzRn_8miAhX%2BuPaqXg%40mail.gmail.com
Andres Freund [Fri, 13 Mar 2026 16:11:05 +0000 (12:11 -0400)]
Fix bug due to confusion about what IsMVCCSnapshot means
In 0b96e734c59 I (Andres) relied on page_collect_tuples() being called only
with an MVCC snapshot, and added assertions to that end, but did not realize
that IsMVCCSnapshot() allows both proper MVCC snapshots and historical
snapshots, which behave quite similarly to MVCC snapshots.
Unfortunately that can lead to incorrect visibility results during logical
decoding, as a historical snapshot is interpreted as a plain MVCC
snapshot. The only reason this wasn't noticed earlier is that it's hard to
reach as most of the time there are no sequential scans during logical
decoding.
To fix the bug and avoid issues like this in the future, split
IsMVCCSnapshot() into IsMVCCSnapshot() and IsMVCCLikeSnapshot(), where now
only the latter includes historic snapshots.
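The shape of the split can be sketched with a simplified enum (illustrative names; the real code inspects the snapshot's type field on the Snapshot struct):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the snapshot kinds involved. */
typedef enum
{
    SNAP_MVCC,            /* a proper MVCC snapshot */
    SNAP_HISTORIC_MVCC,   /* built during logical decoding */
    SNAP_ANY              /* one of the non-MVCC kinds */
} SnapKind;

/* After the split: true only for proper MVCC snapshots. */
bool is_mvcc(SnapKind k)
{
    return k == SNAP_MVCC;
}

/* MVCC-like: proper MVCC snapshots plus historic ones, which behave
 * similarly but must not take the page-at-a-time fast path. */
bool is_mvcc_like(SnapKind k)
{
    return k == SNAP_MVCC || k == SNAP_HISTORIC_MVCC;
}
```

Code like page_collect_tuples() that is only safe for proper MVCC snapshots now checks the narrow predicate, while callers that merely need MVCC-like visibility semantics use the wide one.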
One effect of this is that during logical decoding no page-at-a-time snapshots
are used, as otherwise runtime branches to handle historic snapshots would be
needed in some performance critical paths. Given how uncommon sequential scans
are during logical decoding, that seems acceptable.
Jacob Champion [Fri, 13 Mar 2026 17:34:03 +0000 (10:34 -0700)]
libpq-oauth: Fix Makefile dependencies
As part of 6225403f2, I'd removed the override for the `stlib` target,
since NAME no longer contains a major version number. But I forgot that
its dependencies are declared before Makefile.shlib is included; those
dependencies were then omitted entirely.
Per buildfarm member indri, which appears to be the only system so far
that's bothered by an empty archive.
Jacob Champion [Fri, 13 Mar 2026 16:38:04 +0000 (09:38 -0700)]
libpq-oauth: Never link against libpq's encoding functions
Now that libpq-oauth doesn't have to match the major version of libpq,
some things in pg_wchar.h are technically unsafe for us to use. (See b6c7cfac8 for a fuller discussion.) This is unlikely to be a problem --
we only care about UTF-8 in the context of OAuth right now -- but if
anyone did introduce a way to hit it, it'd be extremely difficult to
debug or reproduce, and it'd be a potential security vulnerability to
boot.
Define USE_PRIVATE_ENCODING_FUNCS so that anyone who tries to add a
dependency on the exported APIs will simply fail to link the shared
module.
Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAOYmi%2BmrGg%2Bn_X2MOLgeWcj3v_M00gR8uz_D7mM8z%3DdX1JYVbg%40mail.gmail.com
Jacob Champion [Fri, 13 Mar 2026 16:37:59 +0000 (09:37 -0700)]
libpq-oauth: Use the PGoauthBearerRequestV2 API
Switch the private libpq-oauth ABI to a public one, based on the new
PGoauthBearerRequestV2 API. A huge amount of glue code can be removed as
part of this, and several code paths can be deduplicated. Additionally,
the shared library no longer needs to change its name for every major
release; it's now just "libpq-oauth.so".
Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAOYmi%2BmrGg%2Bn_X2MOLgeWcj3v_M00gR8uz_D7mM8z%3DdX1JYVbg%40mail.gmail.com
Nathan Bossart [Fri, 13 Mar 2026 16:32:14 +0000 (11:32 -0500)]
Initialize variable to placate compiler.
Since commit 5883ff30b0, some compilers have been warning that the
rtekind variable in unique_nonjoin_rtekind() may be used
uninitialized. There doesn't appear to be any actual risk, so
let's just initialize it to something to silence the compiler
warnings.
Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0sieVNfniCKMDdDjuXGd1OuzMQfTS5%3D9vX3sa-iiujKUA%40mail.gmail.com
Nathan Bossart [Fri, 13 Mar 2026 16:07:32 +0000 (11:07 -0500)]
Optimize COPY FROM (FORMAT {text,csv}) using SIMD.
Presently, such commands scan the input buffer one byte at a time
looking for special characters. This commit adds a new path that
uses SIMD instructions to skip over chunks of data without any
special characters. This can be much faster.
To avoid regressions, SIMD processing is disabled for the remainder
of the COPY FROM command as soon as we encounter a short line or a
special character (except for end-of-line characters, else we'd
always disable it after the first line). This is perhaps too
conservative, but it could probably be made more lenient in the
future via fine-tuned heuristics.
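The chunked scan can be sketched portably with 64-bit SWAR arithmetic rather than real SIMD intrinsics (the committed code uses PostgreSQL's SIMD helpers instead; the names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any byte of x equals c (classic SWAR zero-byte trick). */
static uint64_t has_byte(uint64_t x, unsigned char c)
{
    uint64_t v = x ^ (0x0101010101010101ULL * c);

    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Find the first occurrence of delim, quote, or '\n' in buf, scanning
 * eight bytes at a time; returns len if none is present. */
size_t find_special(const char *buf, size_t len,
                    unsigned char delim, unsigned char quote)
{
    size_t i = 0;

    while (i + 8 <= len)
    {
        uint64_t chunk;

        memcpy(&chunk, buf + i, 8);
        if (has_byte(chunk, delim) || has_byte(chunk, quote) ||
            has_byte(chunk, '\n'))
            break;              /* fall back to a byte-wise scan */
        i += 8;
    }
    for (; i < len; i++)
        if (buf[i] == delim || buf[i] == quote ||
            buf[i] == '\n')
            return i;
    return len;
}
```

Chunks with no special bytes are skipped in one comparison each; only chunks that might contain one are rescanned byte by byte.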
Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Ayoub Kazar <ma_kazar@esi.dz> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Neil Conway <neil.conway@gmail.com> Reviewed-by: Greg Burd <greg@burd.me> Tested-by: Manni Wood <manni.wood@enterprisedb.com> Tested-by: Mark Wong <markwkm@gmail.com>
Discussion: https://postgr.es/m/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig%40mail.gmail.com
Historically, all SLRUs were addressed by transaction IDs, but that
hasn't been true for a long time. However, the error message on I/O
error still always talked about accessing a transaction ID.
This commit adds a callback that allows subsystems to construct their
own error messages, which can then correctly refer to a transaction
ID, multixid or whatever else is used to address the particular SLRU.
Fujii Masao [Fri, 13 Mar 2026 13:17:14 +0000 (22:17 +0900)]
Add stats_reset column to pg_stat_database_conflicts.
This commit adds a stats_reset column to pg_stat_database_conflicts,
allowing users to see when the statistics in this view were last reset.
This makes the view consistent with pg_stat_database and other statistics
views.
Check for interrupts during non-fast-update GIN insertion
ginExtractEntries() can produce a lot of entries for a single item.
During index build, we check for interrupts between entries, and the
fast-update codepath does it as part of vacuum_delay_point(), but the
non-fast update insertion codepath was uninterruptible. Add
CHECK_FOR_INTERRUPTS() between entries in the non-fast update codepath
too.
Rework ginScanToDelete() to pass Buffers instead of BlockNumbers.
Previously, ginScanToDelete() and ginDeletePage() passed BlockNumbers and
re-read pages that were already pinned and locked during the tree walk. The
caller, ginVacuumPostingTree(), held a cleanup-locked root buffer, yet
ginScanToDelete() re-read it by block number with special-case code to skip
re-locking.
First, this commit gives both functions more appropriate names,
ginScanPostingTreeToDelete() and ginDeletePostingPage(), indicating they deal
with posting trees/pages. This is more descriptive and similar to the way we
name other GIN functions, for instance, ginVacuumPostingTree() and
ginVacuumPostingTreeLeaves().
Then rework both functions to pass Buffers directly. DataPageDeleteStack now
carries buffer, myoff (downlink offset in parent), and isRoot per level,
so ginScanPostingTreeToDelete() takes only GinVacuumState and
DataPageDeleteStack pointers. Also, ginDeletePostingPage() receives the three
Buffers directly, and no longer reads or releases them itself. The caller
reads and locks child pages before recursing, and manages buffer lifecycle
afterward.
This eliminates the confusing isRoot special cases in buffer management,
including the apparent (but unreachable) double release of the root
buffer identified by Andres Freund.
Add comments explaining the locking protocol and the DataPageDeleteStack
structure.
Reported-by: Andres Freund <andres@anarazel.de> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Jinbinge <jinbinge@126.com>
Discussion: https://postgr.es/m/utrlxij43fbguzw4kldte2spc4btoldizutcqyrfakqnbrp3ir@ph3sphpj4asz
Michael Paquier [Fri, 13 Mar 2026 07:06:28 +0000 (16:06 +0900)]
xml2: Fix failure with xslt_process() under -fsanitize=undefined
The logic of xslt_process() has never considered the fact that
xsltSaveResultToString() would return NULL for an empty string (the
upstream code has always done so, with a string length of 0). This
would cause memcpy() to be called with a NULL pointer, something
forbidden by POSIX.
Like 46ab07ffda9d and similar fixes, this is backpatched down to all the
supported branches, with a test case to cover this scenario. An empty
string has always been returned by xml2 in this case, based on the
history of the module, so this is an old issue.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/c516a0d9-4406-47e3-9087-5ca5176ebcf9@gmail.com
Backpatch-through: 14
Peter Eisentraut [Fri, 13 Mar 2026 05:34:24 +0000 (06:34 +0100)]
Change copyObject() to use typeof_unqual
Currently, when the argument of copyObject() is const-qualified, the
return type is also, because the use of typeof carries over all the
qualifiers. This is incorrect, since the point of copyObject() is to
make a copy to mutate. But apparently no code ran into it.
The new implementation uses typeof_unqual, which drops the qualifiers,
making this work correctly.
typeof_unqual is standardized in C23, but all recent versions of all
the usual compilers support it even in non-C23 mode, at least as
__typeof_unqual__. We add a configure/meson test for typeof_unqual
and __typeof_unqual__ and use it if it's available, else we use the
existing fallback of just returning void *.
This is the second attempt, after the first attempt in commit 4cfce4e62c8 was reverted. The following two points address problems
with the earlier version:
We test the underscore variant first so that there is a higher chance
that clang used for bitcode also supports it, since we don't test that
separately.
Unlike the typeof test, the typeof_unqual test also tests with a void
pointer similar to how copyObject() would use it, because that is not
handled by MSVC, so we want the test to fail there.
Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org
Michael Paquier [Fri, 13 Mar 2026 01:48:45 +0000 (10:48 +0900)]
pgstattuple: Optimize btree and hash index functions with streaming read
This commit replaces the synchronous ReadBufferExtended() loops with the
streaming read routines, affecting pgstatindex() (for btree) and
pgstathashindex() (for hash indexes).
Under test conditions similar to 6c228755add8 (some dm_delay and
debug_io_direct=data), this can result in nice runtime and IO gains.
Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
Andrew Dunstan [Thu, 12 Mar 2026 21:53:09 +0000 (17:53 -0400)]
Enable fast default for domains with non-volatile constraints
Previously, ALTER TABLE ADD COLUMN always forced a table rewrite when
the column type was a domain with constraints (CHECK or NOT NULL), even
if the default value satisfied those constraints. This was because
contain_volatile_functions() considers CoerceToDomain immutable, so
the code conservatively assumed any constrained domain might fail.
Improve this by using soft error handling (ErrorSaveContext) to evaluate
the CoerceToDomain expression at ALTER TABLE time. If the default value
passes the domain's constraints, the value is stored as a "missing"
attribute default and no table rewrite is needed. If the constraint
check fails, we fall back to a table rewrite, preserving the historical
behavior that constraint violations are only raised when the table
actually contains rows.
Domains with volatile constraint expressions always require a table
rewrite since the constraint result could differ per evaluation and
cannot be cached.
Author: Jian He <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io>
Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com
Andrew Dunstan [Thu, 12 Mar 2026 21:52:33 +0000 (17:52 -0400)]
Extend DomainHasConstraints() to optionally check constraint volatility
Add an optional bool *has_volatile output parameter to
DomainHasConstraints(). When non-NULL, the function checks whether any
CHECK constraint contains a volatile expression. Callers that don't
need this information pass NULL and get the same behavior as before.
This is needed by a subsequent commit that enables the fast default
optimization for domains with non-volatile constraints: we can safely
evaluate such constraints once at ALTER TABLE time, but volatile
constraints require a full table rewrite.
Author: Jian He <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io>
Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com
Álvaro Herrera [Thu, 12 Mar 2026 18:19:23 +0000 (19:19 +0100)]
Document the 'command' column of pg_stat_progress_repack
Commit ac58465e0618 added and documented a new progress-report view for
REPACK, but neglected to list the 'command' column in the docs. This is
my (Álvaro's) fail, as I added the column in v23 of the patch and forgot
to document it.
In passing, add a note in the docs for pg_stat_progress_cluster that it
might contain rows for sessions running REPACK, with the command name
mapped to one of the older commands, and that it is kept for backwards
compatibility only. (Maybe we should just remove this older view.)
Peter Geoghegan [Thu, 12 Mar 2026 17:26:16 +0000 (13:26 -0400)]
Use simplehash for backend-private buffer pin refcounts.
Replace dynahash with simplehash for the per-backend PrivateRefCountHash
overflow table. Simplehash generates inlined, open-addressed lookup
code, avoiding the per-call overhead of dynahash that becomes noticeable
when many buffers are pinned with a CPU-bound workload.
Motivated by testing of the index prefetching patch, which pins many
more buffers concurrently than typical index scans.
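The gist of the change -- replacing a chaining hash with inlined open addressing -- can be sketched as a minimal linear-probing table. This is a fixed-size toy with no growth or deletion (simplehash generates all of that from macro templates), and the names are illustrative, not the real PrivateRefCount structures:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Tiny open-addressed table mapping buffer ids to refcounts.
 * Buffer id 0 is reserved to mean "empty slot". */
#define TBL_SIZE 64             /* power of two; must stay sparse */

typedef struct { int32_t buf; int32_t refcount; } Entry;
typedef struct { Entry slots[TBL_SIZE]; } RefTable;

static uint32_t hash_buf(int32_t buf)
{
    uint32_t h = (uint32_t) buf * 2654435761u;  /* Fibonacci hashing */

    return h & (TBL_SIZE - 1);
}

void reftable_init(RefTable *t)
{
    memset(t, 0, sizeof(*t));
}

/* Find the entry for buf, inserting it (refcount 0) if absent.
 * The probe loop compiles to a few inlined instructions, with no
 * per-call function-pointer or chain-walking overhead. */
Entry *reftable_lookup(RefTable *t, int32_t buf)
{
    uint32_t i = hash_buf(buf);

    for (;;)
    {
        Entry *e = &t->slots[i];

        if (e->buf == buf)
            return e;
        if (e->buf == 0)        /* empty slot: claim it */
        {
            e->buf = buf;
            return e;
        }
        i = (i + 1) & (TBL_SIZE - 1);   /* linear probe */
    }
}
```

Compared with a dynahash-style chained table, the hot lookup path is branch-light, cache-friendly, and fully inlinable, which is what matters when many buffers are pinned in a CPU-bound loop.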
Author: Peter Geoghegan <pg@bowt.ie> Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
Peter Geoghegan [Thu, 12 Mar 2026 17:22:36 +0000 (13:22 -0400)]
nbtree: Avoid allocating _bt_search stack.
Avoid allocating memory for an nbtree descent stack during index scans.
We only require a descent stack during inserts, when it is used to
determine where to insert a new pivot tuple/downlink into the target
leaf page's parent page in the event of a page split. (Page deletion's
first phase also performs a _bt_search that requires a descent stack.)
This optimization improves performance by minimizing palloc churn. It
speeds up index scans that call _bt_search frequently/descend the index
many times, especially when the cost of scanning the index dominates
(e.g., with index-only skip scans). Testing has shown that the
underlying issue causes performance problems for an upcoming patch that
will replace btgettuple with a new btgetbatch interface to enable I/O
prefetching.
Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com
Robert Haas [Thu, 12 Mar 2026 16:59:52 +0000 (12:59 -0400)]
Add pg_plan_advice contrib module.
Provide a facility that (1) can be used to stabilize certain plan choices
so that the planner cannot reverse course without authorization and
(2) can be used by knowledgeable users to insist on plan choices contrary
to what the planner believes best. In both cases, terrible outcomes are
possible: users should think twice and perhaps three times before
constraining the planner's ability to do as it thinks best; nevertheless,
there are problems that are much more easily solved with these facilities
than without them.
This patch takes the approach of analyzing a finished plan to produce
textual output, which we call "plan advice", that describes key
decisions made during planning; if that plan advice is provided during
future planning cycles, it will force those key decisions to be made in
the same way. Not all planner decisions can be controlled using advice;
for example, decisions about how to perform aggregation are currently
out of scope, as is choice of sort order. Plan advice can also be edited
by the user, or even written from scratch in simple cases, making it
possible to generate outcomes that the planner would not have produced.
Partial advice can be provided to control some planner outcomes but not
others.
Currently, plan advice is focused only on specific outcomes, such as
the choice to use a sequential scan for a particular relation, and not
on estimates that might contribute to those outcomes, such as a
possibly-incorrect selectivity estimate. While it would be useful to
users to be able to provide plan advice that affects selectivity
estimates or other aspects of costing, that is out of scope for this
commit.
Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Greg Burd <greg@burd.me> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Haibo Yan <tristan.yim@gmail.com> Reviewed-by: Dian Fay <di@nmfay.com> Reviewed-by: Ajay Pal <ajay.pal.k@gmail.com> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
Michael Paquier [Thu, 12 Mar 2026 07:35:58 +0000 (16:35 +0900)]
doc: Document variables for path substitution in SQL tests
Test suites driven by pg_regress can use the following environment
variables for path substitutions since d1029bb5a26c:
- PG_ABS_SRCDIR
- PG_ABS_BUILDDIR
- PG_DLSUFFIX
- PG_LIBDIR
These variables have never been documented, and they can be useful for
out-of-core code based on the options used by the pg_regress command
invoked by installcheck (or equivalent) to build paths to libraries for
various commands, like LOAD or CREATE FUNCTION.
Reviewed-by: Zhang Hu <kongbaik228@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/abDAWzHaesHLDFke@paquier.xyz
Michael Paquier [Thu, 12 Mar 2026 03:00:22 +0000 (12:00 +0900)]
bloom: Optimize VACUUM and bulk-deletion with streaming read
This commit replaces the synchronous ReadBufferExtended() loops done in
blbulkdelete() and blvacuumcleanup() with the streaming read equivalent,
to improve I/O efficiency during bloom index vacuum cleanup operations.
Under the same test conditions as 6c228755add8, the runtime improves by
around 30%, with most of the benefit coming from a large reduction in
I/O operations, based on the stats retrieved in the scenarios run.
Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
Michael Paquier [Thu, 12 Mar 2026 02:48:31 +0000 (11:48 +0900)]
Use streaming read for VACUUM cleanup of GIN
This commit replaces the synchronous ReadBufferExtended() loop done in
ginvacuumcleanup() with the streaming read equivalent, to improve I/O
efficiency during GIN index vacuum cleanup operations.
With dm_delay to emulate some latency and debug_io_direct=data to force
synchronous writes and exercise the read path, the author observed a 5x
improvement in runtime, with a substantial reduction in I/O stats. I
have reproduced similar numbers while running comparable tests, with the
improvements growing with the number of tuples and pages manipulated.
Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
Richard Guo [Thu, 12 Mar 2026 00:45:18 +0000 (09:45 +0900)]
Convert NOT IN sublinks to anti-joins when safe
The planner has historically been unable to convert "x NOT IN (SELECT
y ...)" sublinks into anti-joins. This is because standard SQL
semantics for NOT IN require that if the comparison "x = y" returns
NULL, the "NOT IN" expression evaluates to NULL (effectively false),
causing the row to be discarded. In contrast, an anti-join preserves
the row if no match is found. Due to this semantic mismatch regarding
NULL handling, the conversion was previously considered unsafe.
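The NULL hazard can be modelled with SQL's three-valued logic in plain C. This is a minimal sketch of the semantics, not the planner's actual proof machinery; TRI_NULL stands in for SQL NULL, and a NULL operand is represented by a NULL pointer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TRI_FALSE, TRI_TRUE, TRI_NULL } TriBool;

/* "x = y" under SQL semantics: NULL if either input is NULL. */
static TriBool sql_eq(const int *x, const int *y)
{
    if (x == NULL || y == NULL)
        return TRI_NULL;
    return (*x == *y) ? TRI_TRUE : TRI_FALSE;
}

/* "x NOT IN (set)": TRUE only if x matched nothing AND no comparison
 * came out NULL; a single NULL comparison poisons the result. */
TriBool sql_not_in(const int *x, const int **set, size_t n)
{
    bool saw_null = false;

    for (size_t i = 0; i < n; i++)
    {
        TriBool eq = sql_eq(x, set[i]);

        if (eq == TRI_TRUE)
            return TRI_FALSE;
        if (eq == TRI_NULL)
            saw_null = true;
    }
    return saw_null ? TRI_NULL : TRI_TRUE;
}

/* An anti-join keeps the row whenever no match is found -- so it
 * would wrongly keep rows that NOT IN discards when NULLs exist. */
bool anti_join_keeps(const int *x, const int **set, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (sql_eq(x, set[i]) == TRI_TRUE)
            return false;
    return true;
}
```

With no NULLs possible on either side, the two definitions agree on every row -- which is exactly the condition the patch proves before performing the conversion.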
However, if we can prove that neither side of the comparison can yield
NULL values, and further that the operator itself cannot return NULL
for non-null inputs, the behavior of NOT IN and anti-join becomes
identical. Enabling this conversion allows the planner to treat the
sublink as a first-class relation rather than an opaque SubPlan
filter. This unlocks global join ordering optimization and permits
the selection of the most efficient join algorithm based on cost,
often yielding significant performance improvements for large
datasets.
This patch verifies that neither side of the comparison can be NULL
and that the operator is safe regarding NULL results before performing
the conversion.
To verify operator safety, we require that the operator be a member of
a B-tree or Hash operator family. This serves as a proxy for standard
boolean behavior, ensuring the operator does not return NULL on valid
non-null inputs, as doing so would break index integrity.
For operand non-nullability, this patch makes use of several existing
mechanisms. It leverages the outer-join-aware-Var infrastructure to
verify that a Var does not come from the nullable side of an outer
join, and consults the NOT-NULL-attnums hash table to efficiently
verify schema-level NOT NULL constraints. Additionally, it employs
find_nonnullable_vars to identify Vars forced non-nullable by qual
clauses, and expr_is_nonnullable to deduce non-nullability for other
expression types.
The logic for verifying the non-nullability of the subquery outputs
was adapted from prior work by David Rowley and Tom Lane.
Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/CAMbWs495eF=-fSa5CwJS6B-BaEi3ARp0UNb4Lt3EkgUGZJwkAQ@mail.gmail.com
Andres Freund [Wed, 11 Mar 2026 21:26:38 +0000 (17:26 -0400)]
bufmgr: Fix use of wrong variable in GetPrivateRefCountEntrySlow()
Unfortunately, in 30df61990c67, I made GetPrivateRefCountEntrySlow() set a
wrong cache hint when moving entries from the hash table to the faster array.
There are no correctness concerns due to this, just an unnecessary loss of
performance.
Noticed while testing the index prefetching patch.
Andrew Dunstan [Wed, 11 Mar 2026 20:15:35 +0000 (16:15 -0400)]
Add support for altering CHECK constraint enforceability
This commit adds support for ALTER TABLE ALTER CONSTRAINT ... [NOT]
ENFORCED for CHECK constraints. Previously, only foreign key
constraints could have their enforceability altered.
When changing from NOT ENFORCED to ENFORCED, the operation not only
updates catalog information but also performs a full table scan in
Phase 3 to validate that existing data satisfies the constraint.
For partitioned tables and inheritance hierarchies, the operation
recurses to all child tables. When changing to NOT ENFORCED, we must
recurse even if the parent is already NOT ENFORCED, since child
constraints may still be ENFORCED.
Author: Jian He <jian.universality@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@cybertec.at> Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CACJufxHCh_FU-FsEwsCvg9mN6-5tzR6H9ntn+0KUgTCaerDOmg@mail.gmail.com
Andrew Dunstan [Wed, 11 Mar 2026 20:14:58 +0000 (16:14 -0400)]
rename alter constraint enforceability related functions
The functions AlterConstrEnforceabilityRecurse and
ATExecAlterConstrEnforceability are renamed to
AlterFKConstrEnforceabilityRecurse and ATExecAlterFKConstrEnforceability,
respectively.
The current alter-constraint functions only handle foreign key constraints.
Renaming them to be explicit about the constraint type avoids confusion
when we later introduce the ability to alter the enforceability of other
constraint types.
Author: Jian He <jian.universality@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net>
Discussion: https://postgr.es/m/CACJufxHCh_FU-FsEwsCvg9mN6-5tzR6H9ntn+0KUgTCaerDOmg@mail.gmail.com
Andres Freund [Wed, 11 Mar 2026 17:58:44 +0000 (13:58 -0400)]
bufmgr: Switch to standard order in MarkBufferDirtyHint()
When hint bits could be set with just a share lock, MarkBufferDirtyHint()
had to use a non-standard order of operations, i.e. WAL-log the buffer
before marking the buffer dirty. This was required because the lock level
used to set hints did not conflict with the lock level used to flush pages,
which would have allowed the page to be flushed out before the WAL record.
The non-standard order in turn required preventing a checkpoint from
starting between writing the WAL record and flushing out the page.
Now that setting hints and writing out buffers both use the share-exclusive
lock, we can revert to the normal order of operations.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
Andres Freund [Wed, 11 Mar 2026 17:53:37 +0000 (13:53 -0400)]
bufmgr: Remove the now-obsolete BM_JUST_DIRTIED
Due to the recent changes that use a share-exclusive lock for setting hint
bits and for flushing pages - instead of share mode, as before - a buffer
can no longer be dirtied while a flush is ongoing. BM_JUST_DIRTIED existed
to handle the case where the buffer was dirtied while I/O was in progress,
which is no longer possible.
Melanie Plageman [Wed, 11 Mar 2026 18:48:26 +0000 (14:48 -0400)]
Avoid WAL flush checks for unlogged buffers in GetVictimBuffer()
GetVictimBuffer() rejects a victim buffer if it is from a bulkread
strategy ring and reusing it would require flushing WAL. Unlogged table
buffers can have fake LSNs (e.g. unlogged GiST pages) and calling
XLogNeedsFlush() on a fake LSN is meaningless.
This is a bit of future-proofing because currently the bulkread strategy
is not used for relations with fake LSNs.
Author: Melanie Plageman <melanieplageman@gmail.com> Reported-by: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de>
Earlier version reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/fmkqmyeyy7bdpvcgkheb6yaqewemkik3ls6aaveyi5ibmvtxnd%40nu2kvy5rq3a6
Tomas Vondra [Wed, 11 Mar 2026 17:32:03 +0000 (18:32 +0100)]
Do not lock in BufferGetLSNAtomic() on archs with 8 byte atomic reads
On platforms where we can read or write the whole LSN atomically, we do
not need to lock the buffer header to prevent torn LSNs. We can do this
only on platforms with PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY, and when the
pd_lsn field is properly aligned.
For historical reasons, PageXLogRecPtr was defined as a struct with two
uint32 fields. This replaces it with a single uint64 value, to make the
intent clearer. To prevent issues with weak typedefs, the value is still
wrapped in a struct.
This also adjusts heapfuncs() in pageinspect, to ensure proper alignment
when reading the LSN from a page on alignment-sensitive hardware.
Idea by Andres Freund. Initial patch by Andreas Karlsson, improved by
Peter Geoghegan. Minor tweaks by me.
Author: Andreas Karlsson <andreas@proxel.se>
Author: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/b6610c3b-3f59-465a-bdbb-8e9259f0abc4@proxel.se
Tomas Vondra [Wed, 11 Mar 2026 11:11:04 +0000 (12:11 +0100)]
Conditional locking in pgaio_worker_submit_internal
With io_method=worker, there's a single I/O submission queue. With
enough workers, the backends and workers may end up spending a lot of
time competing for the AioWorkerSubmissionQueueLock lock. This can
happen with workloads that keep the queue full, in which case it's
impossible to add requests to the queue. Increasing the number of I/O
workers increases the pressure on the lock, worsening the issue.
This change improves the situation in two ways:
* If AioWorkerSubmissionQueueLock cannot be acquired without waiting,
the I/O is performed synchronously (as if the queue were full).
* When an entry cannot be added to a full queue, stop trying to add more
entries. All remaining entries are handled as synchronous I/O.
The regression was reported by Alexandre Felipe. Investigation and
patch by me, based on an idea by Andres Freund.
Reported-by: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAE8JnxOn4+xUAnce+M7LfZWOqfrMMxasMaEmSKwiKbQtZr65uA@mail.gmail.com
First, if we are using the fallback C++ implementation of typeof, then
we need to include the C++ header <type_traits> for
std::remove_reference_t. This header is also likely to be used for
other C++ implementations of type tricks, so we'll put it into the
global includes.
Second, for the case that the C compiler supports typeof in a spelling
that is not "typeof" (for example, __typeof__), then we need to #undef
typeof in the C++ section to avoid warnings about duplicate macro
definitions.
Peter Eisentraut [Wed, 11 Mar 2026 09:46:08 +0000 (10:46 +0100)]
Remove Int8GetDatum function
We have no uses of Int8GetDatum in our tree, and have not had any for a
long time (if ever); the inverse function does not exist either.
Author: Kirill Reshke <reshkekirill@gmail.com> Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/CALdSSPhFyb9qLSHee73XtZm1CBWJNo9+JzFNf-zUEWCRW5yEiQ@mail.gmail.com
Peter Eisentraut [Wed, 11 Mar 2026 08:22:11 +0000 (09:22 +0100)]
Sort out table_open vs. relation_open in rewriter
table_open() is a wrapper around relation_open() that checks that the
relkind is table-like and gives a user-facing error message if not.
It is best used in directly user-facing areas to check that the user
used the right kind of command for the relkind. In internal uses
where the relkind was previously checked from the user's perspective,
table_open() is not necessary and might even be confusing if it were
to give out-of-context error messages.
In rewriteHandler.c, there were several such table_open() calls, which
this changes to relation_open(). This currently doesn't make a
difference, but there are plans to have other relkinds that could
appear in the rewriter but that shouldn't be accessible via
table-specific commands, and this clears the way for that.
Andres Freund [Tue, 10 Mar 2026 23:32:13 +0000 (19:32 -0400)]
Require share-exclusive lock to set hint bits and to flush
At the moment hint bits can be set with just a share lock on a page (and,
until 45f658dacb9, in one case even without any lock). Because of this we need
to copy pages while writing them out, as otherwise the checksum could be
corrupted.
The need to copy the page is problematic to implement AIO writes:
1) Instead of just needing a single buffer for a copied page we need one for
each page that's potentially undergoing I/O
2) To be able to use the "worker" AIO implementation the copied page needs to
reside in shared memory
It also causes problems for using unbuffered/direct I/O, independent of
AIO: some filesystems, RAID implementations, ... do not tolerate the data
being written out changing during the write. E.g. they may compute
internal checksums that are invalidated by concurrent modifications,
leading to filesystem errors (as is the case with btrfs).
It also just is plain odd to allow modifications of buffers that are just
share locked.
To address these issues, this commit changes the rules so that modifications
to pages are not allowed anymore while holding a share lock. Instead the new
share-exclusive lock (introduced in fcb9c977aa5) allows at most one backend to
modify a buffer while other backends have the same page share locked. An
existing share lock can be upgraded to a share-exclusive lock if there are
no conflicting locks. For that, BufferBeginSetHintBits()/BufferFinishSetHintBits()
and BufferSetHintBits16() have been introduced.
To prevent hint bits from being set while the buffer is being written out,
writing out buffers now requires a share-exclusive lock.
The use of share-exclusive to gate setting hint bits means that from now on
only one backend can set hint bits at a time. To allow multiple backends to
set hint bits would require more complicated locking: For setting hint bits
we'd need to store the count of backends currently setting hint bits and we
would need another lock-level for I/O conflicting with the lock-level to set
hint bits. Given that the share-exclusive lock for setting hint bits is only
held for a short time, that backends would often just set the same hint bits
and that the cost of occasionally not setting hint bits in hotly accessed
pages is fairly low, this seems like an acceptable tradeoff.
The biggest change to adapt to this is in heapam. To avoid performance
regressions for sequential scans that need to set a lot of hint bits, we need
to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are
set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the
new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint
bits on a page have been set. Conversely, to avoid regressions in cases where
we can't set hint bits in bulk (because we're looking only at individual
tuples), use BufferSetHintBits16() when setting hint bits without batching.
Several other places also need to be adapted, but those changes are
comparatively simpler.
After this we do not need to copy buffers to write them out anymore. That
change is done separately however.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
Michael Paquier [Tue, 10 Mar 2026 22:36:10 +0000 (07:36 +0900)]
bloom: Optimize bitmap scan path with streaming read
This commit replaces the per-page buffer read loop in blgetbitmap() with
a read stream, to improve scan efficiency, which is particularly useful
for large bloom indexes. Benchmarking with a large number of rows has
shown a very nice improvement in runtime and IO read reduction with test
cases of up to 10M rows for a bloom index scan.
For the io_uring method, the author reported a 3x runtime improvement,
while I measured close to 7x. For the worker method with 3 workers, the
author reported better runtime numbers than I did, with the reduction in
IO stats being appealing in all the cases measured.
Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
Don't clear pendingRecoveryConflicts at end of transaction
Commit 17f51ea818 introduced a new pendingRecoveryConflicts field in
PGPROC to replace the various ProcSignals. The new field was cleared
in ProcArrayEndTransaction(), which makes sense for conflicts with
e.g. locks or buffer pins which are gone at end of transaction. But it
is not appropriate for conflicts on a database, or a logical slot.
Because of this, the 035_standby_logical_decoding.pl test was
occasionally getting stuck in the buildfarm. It happens if the startup
process signals recovery conflict with the logical slot just when the
walsender process using the slot calls ProcArrayEndTransaction().
To fix, don't clear pendingRecoveryConflicts in
ProcArrayEndTransaction(). We could still clear certain conflict
flags, like conflicts on locks, but we didn't try to do that before
commit 17f51ea818 either.
In passing, fix a misspelled comment, and make InitAuxiliaryProcess()
also clear pendingRecoveryConflicts. I don't think aux processes can have
recovery conflicts, but it seems best to initialize the field and keep
InitAuxiliaryProcess() as close to InitProcess() as possible.
Analyzed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://www.postgresql.org/message-id/3e07149d-060b-48a0-8f94-3d5e4946ae45@gmail.com
Melanie Plageman [Tue, 10 Mar 2026 19:24:39 +0000 (15:24 -0400)]
Use the newest to-be-frozen xid as the conflict horizon for freezing
Previously, WAL records that froze tuples used OldestXmin as the snapshot
conflict horizon, or the visibility cutoff if the page would become
all-frozen. Both are newer than (or equal to) the newest XID actually
frozen on the page.
Track the newest XID that will be frozen and use that as the snapshot
conflict horizon instead. This yields an older horizon resulting in
fewer query cancellations on standbys.
Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com