git.ipfire.org Git - thirdparty/postgresql.git/log

Add test case for same-type reordered FK columns

The test added in 980c1a85d819 covered reordered FK columns with
different types, which triggered an "operator not a member of opfamily"
error in the fast-path prior to that commit. Add a test for the
same-type case, which is also fixed by that commit but where the wrong
scan key ordering instead produced a spurious FK violation without any
internal error.

Reported-by: Fredrik Widlert <fredrik.widlert@digpro.se>
Discussion: https://postgr.es/m/CADfhSr8hYc-4Cz7vfXH_oV-Jq81pyK9W4phLrOGspovsg2W7Kw@mail.gmail.com

Move afterTriggerFiringDepth into AfterTriggersData

The static variable afterTriggerFiringDepth introduced by commit
5c54c3ed1b9 is logically part of the after-trigger state and in
hindsight should have been a field in AfterTriggersData alongside
query_depth and the other per-transaction after-trigger state.
Move it there as firing_depth. Also update its comment to
accurately reflect its sole remaining purpose: signaling to
AfterTriggerIsActive() that after-trigger firing is active.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqFt4NGTNk7BinOsHHM48E9zGAa852vCfGoSe1bbL=JNFQ@mail.gmail.com

Fix var_is_nonnullable() to account for varreturningtype

var_is_nonnullable() failed to consider varreturningtype, which meant
it could incorrectly claim a Var is non-nullable based on a column's
NOT NULL constraint even when the Var refers to a non-existent row.
Specifically, OLD.col is NULL for INSERT (no old row exists) and
NEW.col is NULL for DELETE (no new row exists), regardless of any NOT
NULL constraint on the column.

This caused the planner's constant folding in eval_const_expressions
to incorrectly simplify IS NULL / IS NOT NULL tests on such Vars. For
example, "old.a IS NULL" in an INSERT's RETURNING clause would be
folded to false when column "a" has a NOT NULL constraint, even though
the correct result is true.

Fix by returning false from var_is_nonnullable() when varreturningtype
is not VAR_RETURNING_DEFAULT, since such Vars can be NULL regardless
of table constraints.

Author: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAHg+QDfaAipL6YzOq2H=gAhKBbcUTYmfbAv+W1zueOfRKH43FQ@mail.gmail.com

Assert index_attnos[0] == 1 in ri_FastPathFlushArray()

ri_FastPathFlushArray() handles single-column FKs only, so
index_attnos[0] is always 1. Add an Assert to make this invariant
explicit, as a followup to 980c1a85d819.

Suggested-by: Junwang Zhao <zhjwpku@gmail.com> (offlist)
Discussion: https://www.postgresql.org/message-id/CADfhSr-pCkbDxmiOVYSAGE5QGjsQ48KKH_W424SPk%2BpwzKZFaQ%40mail.gmail.com

Fix FK fast-path scan key ordering for mismatched column order

The fast-path foreign key check introduced in 2da86c1ef9b assumed that
constraint key positions directly correspond to index column positions.
This is not always true as a FK constraint can reference PK columns in a
different order than they appear in the PK's unique index.

For example, if the PK is (a, b, c) and the FK references them as
(a, c, b), the constraint stores keys in the FK-specified order, but
the index has columns in PK order. The buggy code used the constraint
key index to access rd_opfamily[i], which retrieved the wrong operator
family when columns were reordered, causing "operator X is not a member
of opfamily Y" errors.

After fixing the opfamily lookup, a second issue started to happen:
btree index scans require scan keys to be ordered by attribute number.
The code was placing scan keys at array position i with attribute number
idx_attno, producing out-of-order keys when columns were swapped. This
caused "btree index keys must be ordered by attribute" errors.

The fix adds an index_attnos array to FastPathMeta that maps each
constraint key position to its corresponding index column position.
In ri_populate_fastpath_metadata(), we search indkey to find the actual
index column for each pk_attnums[i] and use that position for the
opfamily lookup. In build_index_scankeys(), we place each scan key at
the array position corresponding to its index column
(skeys[idx_attno-1]) rather than at the constraint key position,
ensuring scan keys are properly ordered by attribute number as btree
requires.

Reported-by: Fredrik Widlert <fredrik.widlert@digpro.se>
Author: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://www.postgresql.org/message-id/CADfhSr-pCkbDxmiOVYSAGE5QGjsQ48KKH_W424SPk%2BpwzKZFaQ%40mail.gmail.com

Fix typo left by 34a30786293

Reported-by: jie wang <jugierwang@gmail.com>
Reported-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAJnZyeDyaS=X-eYN=9rDYqK=6ma1gMLa0qDgfNbZKK0e0+q99Q@mail.gmail.com

Fix RI fast-path crash under nested C-level SPI

When a C-language function uses SPI_connect/SPI_execute/SPI_finish to
INSERT into a table with FK constraints, the FK AFTER triggers fire
and schedule ri_FastPathEndBatch via
RegisterAfterTriggerBatchCallback(), opening PK relations under
CurrentResourceOwner at the time of the SPI call.  The query_depth > 0
guard in FireAfterTriggerBatchCallbacks suppresses the callback at
that nesting level, deferring teardown to the outer query's
AfterTriggerEndQuery. By then the resource owner active during the SPI
call may have been released, decrementing the cached relations'
refcounts to zero. ri_FastPathTeardown, running under the outer
query's resource owner, then crashes in assert builds when it attempts
to close relations whose refcounts are already zero:

  TRAP: failed Assert("rel->rd_refcnt > 0")

Fix by storing batch callbacks at the level where they should fire:
in AfterTriggersQueryData.batch_callbacks for immediate constraints
(fired by AfterTriggerEndQuery) and in AfterTriggersData.batch_callbacks
for deferred constraints (fired by AfterTriggerFireDeferred and
AfterTriggerSetState).  RegisterAfterTriggerBatchCallback() routes the
callback to the current query-level list when query_depth >= 0, and to
the top-level list otherwise.  FireAfterTriggerBatchCallbacks() takes a
list parameter and simply iterates and invokes it; memory cleanup is
handled by the caller.  This replaces the query_depth > 0 guard with
list-level scoping.  Note that deferred constraints are unaffected by
this bug: their callbacks fire at commit via AfterTriggerFireDeferred,
under the outer transaction's resource owner, which remains valid
throughout.

Also add firing_batch_callbacks to AfterTriggersData to enforce that
callbacks do not register new callbacks during
FireAfterTriggerBatchCallbacks(), which would be unsafe as it could
modify the list being iterated.  An Assert in
RegisterAfterTriggerBatchCallback() enforces this discipline for
future callers.  The flag is reset at transaction and subtransaction
boundaries to handle cases where an error thrown by a callback is
caught and the subtransaction is rolled back.

While at it, ensure callbacks are properly accounted for at all
transaction boundaries, as cleanup of b7b27eb41a5c: discard any
remaining top-level callbacks on both commit and abort in
AfterTriggerEndXact(), and clean up query-level callbacks in
AfterTriggerFreeQuery().

Note that ri_PerformCheck() calls SPI with fire_triggers=false, which
skips AfterTriggerBeginQuery/EndQuery for that SPI command.  Any
triggers queued during that SPI command are not fired immediately but
deferred to the outer query level.  Since the fast-path check for
those triggers runs under the outer query's resource owner rather than
a nested SPI resource owner, and ri_PerformCheck() does not create
a dedicated child resource owner, the bug described above does not
apply.

Reported-by: Evan Montgomery-Recht <montge@mianetworks.net>
Reported-by: Sandro Santilli <strk@kbt.io>
Analyzed-by: Evan Montgomery-Recht <montge@mianetworks.net>
Author: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEg7pwcKf01FmDqFAf-Hzu_pYnMYScY_Otid-pe9uw3BJ6gq9g@mail.gmail.com

Document new catalog columns, missed in commit 8185bb5347.

Reported-by: "Shinoda, Noriyoshi (PSD Japan FSI)" <noriyoshi.shinoda@hpe.com>
Co-authored-by: "Shinoda, Noriyoshi (PSD Japan FSI)" <noriyoshi.shinoda@hpe.com>
Discussion: https://postgr.es/m/LV8PR84MB3787135EBDBF7747A05731F3EE592@LV8PR84MB3787.NAMPRD84.PROD.OUTLOOK.COM

Zero-fill private_data when attaching an injection point

InjectionPointAttach() did not initialize the private_data buffer of the
shared memory entry before (perhaps partially) overwriting it. When the
private data is set to NULL by the caler, the buffer was left
uninitialized. If set, it could have stale contents.

The buffer is initialized to zero, so as the contents recorded when a
point is attached are deterministic.

Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0tsGHu2h6YLnVu4HiK05q+gTE_9WVUAqihW2LSscAYS-g@mail.gmail.com
Backpatch-through: 17

Fix double-free in pg_stat_autovacuum_scores.

Presently, relation_needs_vacanalyze() unconditionally frees the
pgstat entry returned by pgstat_fetch_stat_tabentry_ext().  This
behavior was first added by commit 02502c1bca to avoid memory
leakage in autovacuum.  While this is fine for autovacuum since it
forces stats_fetch_consistency to "none", it is not okay for other
callers that use "cache" or "snapshot".  This manifests as a
double-free when pg_stat_autovacuum_scores is called multiple times
in the same transaction.

To fix, add a "bool *may_free" parameter to
pgstat_fetch_stat_tabentry_ext() that returns whether it is safe
for the caller to explicitly pfree() the result.  If a caller would
rather leave it to the memory context machinery to free the result,
it can pass NULL as the "may_free" argument (or just ignore its
value).

Oversight in commit 87f61f0c82.

Reported-by: Tender Wang <tndrwang@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAHewXNkJKdwb3D5OnksrdOqzqUnXUEMpDam1TPW0vfUkW%3D7jUw%40mail.gmail.com
Discussion: https://postgr.es/m/5684f479-858e-4c5d-b8f5-bcf05de1f909%40gmail.com

Remove an unstable wait from parallel autovacuum regression test.

The test 001_parallel_autovacuum.pl verified that vacuum delay
parameters are propagated to parallel vacuum workers by using
injection points. It previously waited for autovacuum to complete on
the test_autovac table. However, since injection points are
cluster-wide, an autovacuum worker could be triggered on tables in
other databases (e.g., template1) and get stuck at the same injection
point. This could lead to a timeout when the test waits for the
expected table's autovacuum to finish.

This commit removes the wait for autovacuum completion from this
specific test case. Since the primary goal is to verify the
propagation of parameter updates, which is already confirmed via log
messages, waiting for the entire vacuum process to finish is
unnecessary and prone to instability in concurrent test environments.

Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s+kZZRMSF4HW7tZ9W2jS1o4B+Fg8dr5a-T6mANX+mdQA@mail.gmail.com

instrumentation: Avoid CPUID 0x15/0x16 for Hypervisor TSC frequency

This restricts the retrieval of the TSC frequency whilst under a Hypervisor to
either Hypervisor-specific CPUID registers (0x40000010), or TSC
calibration. We previously allowed retrieving from the traditional CPUID
registers for TSC frequency (0x15/0x16) like on bare metal, but it turns out
that they are not trustworthy when virtualized and can report wildly incorrect
frequencies, like 7 kHz when the actual calibrated frequencty is 2.5 GHz.

Per report from buildfarm member drongo.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/jr4hk2sxhqcfpb67ftz5g4vw33nm67cgf7go3wwmqsafu5aclq%405m67ukuhyszz

Add LOG_NEVER error level code.

This logging level means not to emit the log, which is useful for
functions like relation_needs_vacanalyze(). This function accepts
a log level argument but not all callers want it to emit logs.

Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3101163.1775676098%40sss.pgh.pa.us

Fix integer overflow in nodeWindowAgg.c

In nodeWindowAgg.c, the calculations for frame start and end positions
in ROWS and GROUPS modes were performed using simple integer addition.
If a user-supplied offset was sufficiently large (close to INT64_MAX),
adding it to the current row or group index could cause a signed
integer overflow, wrapping the result to a negative number.

This led to incorrect behavior where frame boundaries that should have
extended indefinitely (or beyond the partition end) were treated as
falling at the first row, or where valid rows were incorrectly marked
as out-of-frame.  Depending on the specific query and data, these
overflows can result in incorrect query results, execution errors, or
assertion failures.

To fix, use overflow-aware integer addition (ie, pg_add_s64_overflow)
to check for overflows during these additions.  If an overflow is
detected, the boundary is now clamped to INT64_MAX.  This ensures the
logic correctly treats the boundary as extending to the end of the
partition.

Bug: #19405
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/19405-1ecf025dda171555@postgresql.org
Backpatch-through: 14

Update config.guess and config.sub

Strip PlaceHolderVars from partition pruning operands

When pulling up a subquery, its targetlist items may be wrapped in
PlaceHolderVars to enforce separate identity or as a result of outer
joins.  This causes any upper-level WHERE clauses referencing these
outputs to contain PlaceHolderVars, which prevents partprune.c from
recognizing that they match partition key columns, defeating partition
pruning.

To fix, strip PlaceHolderVars from operands before comparing them to
partition keys.  A PlaceHolderVar with empty phnullingrels appearing
in a relation-scan-level expression is effectively a no-op, so
stripping it is safe.  This parallels the existing treatment in
indxpath.c for index matching.

In passing, rename strip_phvs_in_index_operand() to strip_noop_phvs()
and move it from indxpath.c to placeholder.c, since it is now a
general-purpose utility used by both index matching and partition
pruning code.

Back-patch to v18.  Although this issue exists before that, changes in
that version made it common enough to notice.  Given the lack of field
reports for older versions, I am not back-patching further.  In the
v18 back-patch, strip_phvs_in_index_operand() is retained as a thin
wrapper around the new strip_noop_phvs() to avoid breaking third-party
extensions that may reference it.

Reported-by: Cándido Antonio Martínez Descalzo <candido@ninehq.com>
Diagnosed-by: David Rowley <dgrowleyml@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAH5YaUwVUWETTyVECTnhs7C=CVwi+uMSQH=cOkwAUqMdvXdwWA@mail.gmail.com
Backpatch-through: 18

Add nkeys parameter to recheck_matched_pk_tuple()

The function looped over ii_NumIndexKeyAttrs elements of the skeys
array, but one caller (ri_FastPathFlushArray) passes a one-element
array since it only handles single-column FKs. The function
signature did not communicate this constraint, which static analysis
flags as a potential out-of-bounds read.

Add an nkeys parameter and assert that it matches
ii_NumIndexKeyAttrs, then use it in the loop. The call sites
already know the key count.

Reported-by: Evan Montgomery-Recht <montge@mianetworks.net>
Discussion: https://postgr.es/m/CAEg7pwcKf01FmDqFAf-Hzu_pYnMYScY_Otid-pe9uw3BJ6gq9g@mail.gmail.com

Reduce presence of syscache.h in src/include/

ee642cccc43c has added syscache.h in inval.h and objectaddress.h,
enlarging by a lot the footprint of this header, particularly via
objectaddress.h. A change in syscache.h would cause a lot more files to
be recompiled.

This commit reduces the presence of syscache.h by switching to a direct
use of syscache_ids.h in inval.h and objectaddress.h, where the enum
SysCacheIdentifier is defined. genbki.pl gains an #ifndef block for
this header, so as its inclusion is more controlled.

Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/vlcexdcimsmvu3aplt2yxpfndkgtuvjsrms2fdl46rbw3k2kug@drspkoxlaije

Simplify declaration of memcpy target

The existing one is understandable failing on (some?) 32-bit platforms.

Reported-by: Tomas Vondra <tomas@vondra.me>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1c197f2d-49a2-4830-8dde-55867218b62d@vondra.me

doc: Fix data_checksums data type

Commit f19c0eccae96 changed the data_checksums GUC datatype from a
boolean to an enum. This updates the documentation to accurately
reflect its new type and document the new possible states: 'on',
'off', 'inprogress-on', and 'inprogress-off'.

Also update the xref for more information to point to the section
on data checksums rather than the initdb checksum option.

Author: Lakshmi N <lakshmin.jhs@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CA+3i_M-AtTnqTB2KLBTpu-c-jvnTuy7bGxyxs80rgiQLxWrRUQ@mail.gmail.com

Add a couple of commits to .git-blame-ignore-revs.

Add missing PGDLLIMPORT markings

Remove RADIUS support.

Our RADIUS implementation supported only the deprecated RADIUS/UDP
variant, without the recommended Message-Authenticator attribute to
mitigate against the Blast-RADIUS vulnerability. By now, popular RADIUS
servers are expected to generate loud warnings or reject our
authentication attempts outright.

Since there have been no user reports about this, it seems unlikely that
there are users.

Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Michael Banck <mbanck@gmx.net>
Discussion: https://postgr.es/m/CA%2BhUKG%2BSH309V8KECU5%3DxuLP9Dks0v9f9UVS2W74fPAE5O21dg%40mail.gmail.com

Add support for importing statistics from remote servers.

Add a new FDW callback routine that allows importing remote statistics
for a foreign table directly to the local server, instead of collecting
statistics locally.  The new callback routine is called at the beginning
of the ANALYZE operation on the table, and if the FDW failed to import
the statistics, the existing callback routine is called on the table to
collect statistics locally.

Also implement this for postgres_fdw.  It is enabled by "restore_stats"
option both at the server and table level.  Currently, it is the user's
responsibility to ensure remote statistics to import are up-to-date, so
the default is false.

Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Etsuro Fujita <etsuro.fujita@gmail.com>
Discussion: https://postgr.es/m/CADkLM%3DchrYAx%3DX2KUcDRST4RLaRLivYDohZrkW4LLBa0iBhb5w%40mail.gmail.com

aio: Adjust I/O worker pool automatically.

The size of the I/O worker pool used to implement io_method=worker was
previously controlled by the io_workers setting, defaulting to 3.  It
was hard to know how to tune it effectively.  That is replaced with:

  io_min_workers=2
  io_max_workers=8 (up to 32)
  io_worker_idle_timeout=60s
  io_worker_launch_interval=100ms

The pool is automatically sized within the configured range according to
recent variation in demand.  It grows when existing workers detect that
latency might be introduced by queuing, and shrinks when the
highest-numbered worker is idle for too long.  Work was already
concentrated into low-numbered workers in anticipation of this logic.

The logic for waking extra workers now also tries to measure and reduce
the number of spurious wakeups, though they are not entirely eliminated.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com

Exit early from pg_comp_crc32c_pmull for small inputs

The vectorized path in commit fbc57f2bc had a side effect of putting
more branches in the path taken for small inputs. To reduce risk
of regressions, only proceed with the vectorized path if we can
guarantee that the remaining input after the alignment preamble is
greater than 64 bytes. That also allows removing length checks in
the alignment preamble.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CANWCAZZ48GuLYhJCcTy8TXysjrMVJL6n1n7NP94=iG+t80YKPw@mail.gmail.com

pg_upgrade: Check for unsupported encodings.

Since we have dropped MULE_INTERNAL, add a check that all encodings used
in the source cluster are still supported according to
PG_ENCODING_BE_VALID(). This is done generically, in case we decide to
drop another encoding some day.

Suggested-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA%2BhUKGKXDXh-FdU0orjfv%2BF08f%3DD91BhV3Ra-4zL-q%2BJmGYqTA%40mail.gmail.com

Remove MULE_INTERNAL encoding.

This was useful before widespread Unicode adoption, and was based on the
internal encoding Emacs used to mix multiple sub-encodings.  Emacs
itself has stopped using it, and our implementation hadn't been updated
with modern underlying standards.  It is thought to be very unlikely
that anyone is still using it in the field.  Since such a complex
encoding comes with costs and risks, we agreed to drop support.

Any existing database using this encoding would need to be dumped and
restored with a new encoding to upgrade to PostgreSQL 19, most likely
UTF8, since pg_upgrade would fail.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/CA%2BhUKGKXDXh-FdU0orjfv%2BF08f%3DD91BhV3Ra-4zL-q%2BJmGYqTA%40mail.gmail.com

instrumentation: Allocate query level instrumentation in ExecutorStart

Until now extensions that wanted to measure overall query execution could
create QueryDesc->totaltime, which the core executor would then start and
stop. That's a bit odd and composes badly, e.g. extensions always had to use
INSTRUMENT_ALL, because otherwise another extension might not get what they
need.

Instead this introduces a new field, QueryDesc->query_instr_options, that
extensions can use to indicate whether they need query level instrumentation
populated, and with which instrumentation options. Extensions should take care
to only add options they need, instead of replacing the options of others.

The prior name of the field, totaltime, sounded like it would only measure
time, but these days the instrumentation infrastructure can track more
resources. The secondary benefit is that this will make it obvious to
extensions that they may not create the Instrumentation struct themselves
anymore (often extensions build only against a postgres build without
assertions).

Adjust pg_stat_statements and auto_explain to match, and lower the
requested instrumentation level for auto_explain to INSTRUMENT_TIMER,
since the summary instrumentation it needs is only runtime.

The reason to push this now, rather in the PG 20 cycle, is that 5a79e78501f
already required extensions using query level instrumentations to adjust their
code, and it seemed undesirable to require them to do so again for 20.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53Pkyqsht+exJQYRsjhSWYKu+vFGHhPub7m6PmFD6Or0=p1g@mail.gmail.com

Fix slotsync worker blocking promotion when stuck in wait

Previously, on standby promotion, the startup process sent SIGUSR1 to
the slotsync worker (or a backend performing slot synchronization) and
waited for it to exit. This worked in most cases, but if the process was
blocked waiting for a response from the primary (e.g., due to a network
failure), SIGUSR1 would not interrupt the wait. As a result, the process
could remain stuck, causing the startup process to wait for a long time
and delaying promotion.

This commit fixes the issue by introducing a new procsignal reason,
PROCSIG_SLOTSYNC_MESSAGE. On promotion, the startup process
sends this signal, and the handler sets interrupt flags so the process
exits (or errors out) promptly at CHECK_FOR_INTERRUPTS(), allowing
promotion to complete without delay.

Backpatch to v17, where slotsync was introduced.

Author: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com
Backpatch-through: 17

instrumentation: Move ExecProcNodeInstr to allow inlining

This moves the implementation of ExecProcNodeInstr, the ExecProcNode variant
that gets used when instrumentation is on, to be defined in instrument.c
instead of execProcNode.c, and marks functions it uses as inline.

This allows compilers to generate an optimized implementation, and shows a 4
to 12% reduction in instrumentation overhead for queries that move lots of
rows.

Author: Lukas Fittl <lukas@fittl.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com

Add EXPLAIN (IO) instrumentation for TidRangeScan

Adds support for EXPLAIN (IO) instrumentation for TidRange scans. This
requires adding shared instrumentation for parallel scans, using the
separate DSM approach introduced by dd78e69cfc33.

Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me

pg_test_timing: Also test RDTSC[P] timing, report time source, TSC frequency

This adds support to pg_test_timing for the different timing sources added by
294520c4448.

Author: Lukas Fittl <lukas@fittl.com>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> (in an earlier version)
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

Add EXPLAIN (IO) instrumentation for SeqScan

Adds support for EXPLAIN (IO) instrumentation for sequential scans. This
requires adding shared instrumentation, using the separate DSM approach
introduced by dd78e69cfc33.

Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me

Suppress unused-variable warning.

x86 machines lacking HAVE__CPUIDEX saw a complaint about
"unused variable 'reg'", per buildfarm as well as local
experience. Oversight in bcb2cf41f.

auto_explain: Add new GUC auto_explain.log_io

Allows enabling the new EXPLAIN "IO" option for auto_explain.

Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me

Add EXPLAIN (IO) infrastructure with BitmapHeapScan support

Allows collecting details about AIO / prefetch for scan nodes backed by
a ReadStream. This may be enabled by a new "IO" option in EXPLAIN, and
it shows information about the prefetch distance and I/O requests.

As of this commit this applies only to BitmapHeapScan, because that's
the only scan node using a ReadStream and collecting instrumentation
from workers in a parallel query. Support for SeqScan and TidRangeScan,
the other scan nodes using ReadStream, will be added in subsequent
commits.

The stats are collected only when required by EXPLAIN ANALYZE, with the
IO option (disabled by default). The amount of collected statistics is
very limited, but we don't want to clutter EXPLAIN with too much data.

The IOStats struct is stored in the scan descriptor as a field, next to
other fields used by table AMs. A pointer to the field is passed to the
ReadStream, and updated directly.

It's the responsibility of the table AM to allocate the struct (e.g. in
ambeginscan) whenever the flag SO_SCAN_INSTRUMENT flag is passed to the
scan, so that the executor and ReadStream has access to it.

The collected stats are designed for ReadStream, but are meant to be
reasonably generic in case a TAM manages I/Os in different ways.

Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me

Switch EXPLAIN to unaligned output for json/xml/yaml

Use unaligned output for multiple EXPLAIN queries using non-text format
in regression tests. With aligned output adding/removing explain fields
can be very disruptive, as it often modifies the whole block because of
padding. Unaligned output does not have this issue.

Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me

Fix WITHOUT OVERLAPS' interaction with domains.

UNIQUE/PRIMARY KEY ... WITHOUT OVERLAPS requires the no-overlap
column to be a range or multirange, but it should allow a domain
over such a type too. This requires minor adjustments in both
the parser and executor.

In passing, fix a nearby break-instead-of-continue thinko in
transformIndexConstraint. This had the effect of disabling
parse-time validation of the no-overlap column's type in the context
of ALTER TABLE ADD CONSTRAINT, if it follows a dropped column.
We'd still complain appropriately at runtime though.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACJufxGoAmN_0iJ=hjTG0vGpOSOyy-vYyfE+-q0AWxrq2_p5XQ@mail.gmail.com
Backpatch-through: 18

instrumentation: Use Time-Stamp Counter on x86-64 to lower overhead

This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
from the CPU using RDTSC/RDTSCP instructions, instead of APIs like
clock_gettime() on POSIX systems.

This reduces the overhead of EXPLAIN with ANALYZE and TIMING ON. Tests showed
that the overhead on top of actual runtime when instrumenting queries moving
lots of rows through the plan can be reduced from 2x as slow to 1.2x as slow
compared to the actual runtime. More complex workloads such as TPCH queries
have also shown ~20% gains when instrumented compared to before.

To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when reliable, for example
when running on modern Intel CPUs, or when running on Linux and the system
clocksource is reported as "tsc". The use of the operating system clock source
can be enforced by setting "system", or on x86-64 architectures the use of TSC
can be enforced by explicitly setting "tsc".

In order to use the TSC the frequency is first determined by use of CPUID, and
if not available, by running a short calibration loop at program start,
falling back to the system clock source if TSC values are not stable.

Note, that we split TSC usage into the RDTSC CPU instruction which does not
wait for out-of-order execution (faster, less precise) and the RDTSCP
instruction, which waits for outstanding instructions to retire. RDTSCP is
deemed to have little benefit in the typical InstrStartNode() /
InstrStopNode() use case of EXPLAIN, and can be up to twice as slow. To
separate these use cases, the new macro INSTR_TIME_SET_CURRENT_FAST() is
introduced, which uses RDTSC.

The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed to be
used when precision is more important than performance. When the system timing
clock source is used both of these macros instead utilize the system
APIs (clock_gettime / QueryPerformanceCounter) like before.

Additional users of interval timing, such as track_io_timing and
track_wal_io_timing could also benefit from being converted to use
INSTR_TIME_SET_CURRENT_FAST() but are left for future changes.

Author: Lukas Fittl <lukas@fittl.com>
Author: Andres Freund <andres@anarazel.de>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an earlier version)
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com> (in an earlier version)
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (in an earlier version)
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> (in an earlier version)
Discussion: https://postgr.es/m/20200612232810.f46nbqkdhbutzqdg@alap3.anarazel.de

Allow retrieving x86 TSC frequency/flags from CPUID

This adds additional x86 specific CPUID checks for flags needed for
determining whether the Time-Stamp Counter (TSC) is usable on a given system,
as well as a helper function to retrieve the TSC frequency from CPUID.

This is intended for a future patch that will utilize the TSC to lower the
overhead of timing instrumentation.

In passing, always make pg_cpuid_subleaf reset the variables used for its
result, to avoid accidentally using stale results if __get_cpuid_count errors
out and the caller doesn't check for it.

Author: Lukas Fittl <lukas@fittl.com>
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: John Naylor <john.naylor@postgresql.org>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> (in an earlier version)
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

instrumentation: Standardize ticks to nanosecond conversion method

The timing infrastructure (INSTR_* macros) measures time elapsed using
clock_gettime() on POSIX systems, which returns the time as nanoseconds,
and QueryPerformanceCounter() on Windows, which is a specialized timing
clock source that returns a tick counter that needs to be converted to
nanoseconds using the result of QueryPerformanceFrequency().

This conversion currently happens ad-hoc on Windows, e.g. when calling
INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
invocation, despite the frequency being stable after program start,
incurring unnecessary overhead. It also causes a fractured implementation
where macros are defined differently between platforms.

To ease code readability, and prepare for a future change that intends
to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
new pg_ticks_to_ns() / pg_ns_to_ticks() functions that get called from
INSTR_* macros on all platforms.

These functions rely on a separately initialized ticks_per_ns_scaled
value, that represents the conversion ratio. This value is initialized
from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
POSIX systems, which results in the ticks being treated as nanoseconds.
Other architectures always directly return the original ticks.

To support this, pg_initialize_timing() is introduced, and is now
mandatory for both the backend and any frontend programs to call before
utilizing INSTR_* macros.

In passing, fix variable names in comment documenting INSTR_TIME_ADD_NANOSEC().

Author: Lukas Fittl <lukas@fittl.com>
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

oauth: Allow validators to register custom HBA options

OAuth validators can already use custom GUCs to configure behavior
globally. But we currently provide no ability to adjust settings for
individual HBA entries, because the original design focused on a world
where a provider covered a "single audience" of users for one database
cluster. This assumption does not apply to multitenant use cases, where
a single validator may be controlling access for wildly different user
groups.

To improve this use case, add two new API calls for use by validator
callbacks: RegisterOAuthHBAOptions() and GetOAuthHBAOption().
Registering options "foo" and "bar" allows a user to set "validator.foo"
and "validator.bar" in an oauth HBA entry. These options are stringly
typed (syntax validation is solely the responsibility of the defining
module), and names are restricted to a subset of ASCII to avoid tying
our hands with future HBA syntax improvements.

Unfortunately, we can't check the custom option names during a reload of
the configuration, like we do with standard HBA options, without
requiring all validators to be loaded via shared_preload_libraries.
(I consider this to be a nonstarter: most validators should probably use
session_preload_libraries at most, since requiring a full restart just
to update authentication behavior will be unacceptable to many users.)
Instead, the new validator.* options are checked against the registered
list at connection time.

Multiple alternatives were proposed and/or prototyped, including
extending the GUC system to allow per-HBA overrides, joining forces with
recent refactoring work on the reloptions subsystem, and giving the
ability to customize HBA options to all PostgreSQL extensions. I
personally believe per-HBA GUC overrides are the best option, because
several existing GUCs like authentication_timeout and pre_auth_delay
would fit there usefully. But the recent addition of SNI per-host
settings in 4f433025f indicates that a more general solution is needed,
and I expect that to take multiple releases' worth of discussion.

This compromise patch, then, is intentionally designed to be an
architectural dead end: simple to describe, cheap to maintain, and
providing just enough functionality to let validators move forward for
PG19. The hope is that it will be replaced in the future by a solution
that can handle per-host, per-HBA, and other per-context configuration
with the same functionality that GUCs provide today. In the meantime,
the bulk of the code in this patch consists of strict guardrails on the
simple API, to try to ensure that we don't have any reason to regret its
existence during its unknown lifespan.

I owe particular thanks here to Zsolt Parragi, who prototyped several
approaches that guided the final design.

Suggested-by: Zsolt Parragi <zsolt.parragi@percona.com>
Suggested-by: VASUKI M <vasukianand0119@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAN4CZFM3b8u5uNNNsY6XCya257u%2BDofms3su9f11iMCxvCacag%40mail.gmail.com

libpq: Split PGOAUTHDEBUG=UNSAFE into multiple options

PGOAUTHDEBUG is a blunt instrument: you get all the debugging features,
or none of them. The most annoying consequence during manual use is the
Curl debug trace, which tends to obscure the device flow prompt
entirely. The promotion of PGOAUTHCAFILE into its own feature in
993368113 improved the situation somewhat, but there's still the
discomfort of knowing you have to opt into many dangerous behaviors just
to get the single debug feature you wanted.

Explode the PGOAUTHDEBUG syntax into a comma-separated list. The old
"UNSAFE" value enables everything, like before. Any individual unsafe
features still require the envvar to begin with an "UNSAFE:" prefix, to
try to interrupt the flow of someone who is about to do something they
should not.

So now, rather than

    PGOAUTHDEBUG=UNSAFE        # enable all the unsafe things

a developer can say

    PGOAUTHDEBUG=call-count    # only show me the call count. safe!
    PGOAUTHDEBUG=UNSAFE:trace  # print secrets, but don't allow HTTP

To avoid adding more build system scaffolding to libpq-oauth, implement
this entirely in a small private header. This unfortunately can't be
standalone, so it needs a headerscheck exception.

Author: Zsolt Parragi <zsolt.parragi@percona.com>
Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAOYmi%2B%3DfbZNJSkHVci%3DGpR8XPYObK%3DH%2B2ERRha0LDTS%2BifsWnw%40mail.gmail.com
Discussion: https://postgr.es/m/CAN4CZFMmDZMH56O9vb_g7vHqAk8ryWFxBMV19C39PFghENg8kA%40mail.gmail.com

Reserve replication slots specifically for REPACK

Add a new GUC max_repack_replication_slots, which lets the user reserve
some additional replication slots for concurrent repack (and only
concurrent repack). With this, the user doesn't have to worry about
changing the max_replication_slots in order to cater for use of
concurrent repack.

(We still use the same pool of bgworkers though, but that's less
commonly a problem than slots.)

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Discussion: https://postgr.es/m/202604012148.nnnmyxxrr6nh@alvherre.pgsql

Fix harmless leftover in _hash_kill_items()

Checking for 'havePin' is sufficient here. An earlier version of the
patch didn't have the 'havePin' variable and used
'so->hashso_bucket_buf == so->currPos.buf' as the condition when both
locking and unlocking the page. The havePin variable was added later
during development, but the unlocking condition wasn't fully
updated. Tidy it up.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/b9de8d05-3b02-4a27-9b0b-03972fa4bfd3@iki.fi

Add errdetail() with PID and UID about source of termination signal.

When a backend is terminated via pg_terminate_backend() or an external
SIGTERM, the error message now includes the sender's PID and UID as
errdetail, making it easier to identify the source of unexpected
terminations in multi-user environments.

On platforms that support SA_SIGINFO (Linux, FreeBSD, and most modern
Unix systems), the signal handler captures si_pid and si_uid from the
siginfo_t structure. On platforms without SA_SIGINFO, the detail is
simply omitted.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Chao Li <1356863904@qq.com>
Discussion: https://postgr.es/m/CAKZiRmyrOWovZSdixpLd3PGMQXuQL_zw2Ght5XhHCkQ1uDsxjw@mail.gmail.com

pg_stash_advice: Allow stashed advice to be persisted to disk.

If pg_stash_advice.persist = true, stashed advice will be written to
pg_stash_advice.tsv in the data directory, periodically and at
shutdown. On restart, stash modifications are locked out until this
file has been reloaded, but queries will not be, so there may be a
short window after startup during which previously-stashed advice is
not automatically applied.

Author: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://postgr.es/m/CA+Tgmob87qsWa-VugofU6epuV0H5XjWZGMbQas4Q-ADKmvSyBg@mail.gmail.com

Minimal fix for WAIT FOR ... MODE 'standby_flush'

The investigation into the negative test performance impact of 7e8aeb9e483
lead to discovering that there are a few issues with WAIT FOR.

This commit is just a minimal fix to prevent hangs in standby_flush mode, due
to WAIT FOR ... 'standby_flush' seeing a 0 LSN if a newly started walreceiver
does not receive any writes, because the stanby is already caught up.

There are several other issues and this is isn't necessarily the best fix. But
this way we get the hangs out of the way.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/zqbppucpmkeqecfy4s5kscnru4tbk6khp3ozqz6ad2zijz354k@w4bdf4z3wqoz

doc: Add an example of REPACK (CONCURRENTLY)

Suggested-by: vignesh C <vignesh21@gmail.com>
Discussion: https://postgr.es/m/CALDaNm3tiKhtegx5Cawi34UjbHmNGEDNAtScGM1RgWRtV-5_0Q@mail.gmail.com

Tidy up #ifdef USE_INJECTION_POINTS guards

Remove unnecessary #ifdef guard around the function prototypes; they
are already inside a larger #ifdef block. Move #include "subsystems.h"
inside the USE_INJECTION_POINTS guard; it's needed for
InjectionPointShmemCallbacks, which is a also inside the guard.

Reported-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://www.postgresql.org/message-id/87y0iz2c1v.fsf@wibble.ilmari.org

Fix tests under wal_level=minimal

Buildfarm members which have specifically configured to use
wal_level=minimal fail the repack regression tests, which require
wal_level=replica. Add a temp config file to fix that.

Modernize and optimize pg_buffercache_pages()

Refactor pg_buffercache_pages() to use SFRM_Materialize mode and
construct a tuplestore directly. That's simpler and more efficient
than collecting all the data to a custom array first.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Author: Palak Chaturvedi <chaturvedipalak1911@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5sMsaz1j+hrdhyo-DJp7JCgJx87=q2iJfOc_9mwYWyvmw@mail.gmail.com

Optimize sorting and deduplicating trigrams

Use templated qsort() so that the comparison function can be
inlined. To speed up qunique(), use a specialized comparison function
that only checks for equality.

Author: David Geier <geidav.pg@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/2a76b5ef-4b12-4023-93a1-eed6e64968f3@gmail.com

Use add_size/mul_size for index instrumentation size calculations

Use overflow-safe size arithmetic in the Index[Only]Scan and parallel
instrumentation functions, consistent with other executor nodes (Hash,
Sort, Agg, Memoize). This was an oversight in dd78e69cfc3.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me

Fix BitmapHeapScan non-parallel-aware EXPLAIN ANALYZE

Allocates shared bitmap table scan instrumentation for all parallel
scans. Previously, the instrumentation was only allocated for
parallel-aware scans, other bitmap heap scans in the parallel query had
no shared instrumentation and EXPLAIN didn't report exact/lossy pages.
This affected cases like scans on the outside of a parallel join or
queries run with debug_parallel_query=regress.

Fixed by allocating a separate DSM chunk for shared instrumentation and
doing so regardless of parallel-awareness. The instrumentation is
allocated in its own DSM chunk, separate from ParallelBitmapHeapState.

Report an initial patch by me. The approach with a separate DSM was
proposed and implemented by Melanie.

Not backpatched. The issue affects Postgres 18 (since 5a1e6df3b84c), but
having multiple DSM chunks is possible only since dd78e69cfc33. If we
decide to fix this in backbranches too, it will need to be done in a
less invasive way.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me

Allow logical replication snapshots to be database-specific

By default, the logical decoding assumes access to shared catalogs, so
the snapshot builder needs to consider cluster-wide XIDs during startup.
That in turn means that, if any transaction is already running (and has
XID assigned), the snapshot builder needs to wait for its completion, as
it does not know if that transaction performed catalog changes earlier.

A possible problem with this concept is that if REPACK (CONCURRENTLY) is
running in some database, backends running the same command in other
databases get stuck until the first one has committed. Thus only a
single backend in the cluster can run REPACK (CONCURRENTLY) at any time.
Likewise, REPACK (CONCURRENTLY) can block walsenders starting on behalf
of subscriptions throughout the cluster.

This patch adds a new option to logical replication output plugin, to
declare that it does not use shared catalogs (i.e. catalogs that can be
changed by transactions running in other databases in the cluster). In
that case, no snapshot the backend will use during the decoding needs to
contain information about transactions running in other databases. Thus
the snapshot builder only needs to wait for completion of transactions
in the current database.

Currently we only use this option in the REPACK background worker. It
could possibly be used in the plugin for logical replication too,
however that would need thorough analysis of that plugin.

Bump WAL version number, due to a new field in xl_running_xacts.

Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/90475.1775218118@localhost

Avoid different-size pointer-to-integer cast

Buildfarm member mamba is unhappy that I wrote "(Datum) NULL" in commit
28d534e2ae0a:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-07%2005%3A08%3A08
Use "(Datum) 0" which is what we do everywhere else.

Discussion: https://postgr.es/m/CANWCAZaOs_+WPH13ow33Q==+FwBwVZkqzm4vND=WEB4_NBmv1Q@mail.gmail.com

Optimize sort and deduplication in ginExtractEntries()

Remove NULLs from the array first, and use qsort to deduplicate only
the non-NULL items. This simplifies the comparison function. Also
replace qsort_arg() with a templated version so that the comparison
function can be inlined. These changes make ginExtractEntries() a
little faster especially for simple datatypes like integers.

Author: David Geier <geidav.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/6d16b6bd-a1ff-4469-aefb-a1c8274e561a@iki.fi

Add isolation tests for UPDATE/DELETE FOR PORTION OF

Add documentation about concurrency issues related to UPDATE/DELETE
FOR PORTION OF as well as supporting isolation tests.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/ec498c3d-5f2b-48ec-b989-5561c8aa2024%40illuminatedcomputing.com

Fix valgrind failure

Buildfarm member skink reports that the new REPACK code is trying to
write uninitialized bytes to disk, which correspond to padding space in
the SerializedSnapshotData struct. Silence that by initializing the
memory in SerializeSnapshot() to all zeroes.

Co-authored-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/1976915.1775537087@sss.pgh.pa.us

Use .h for the file containing the page checksum code fragment

Commit 5e13b0f24 used a .c file for a file containing a code fragment,
to avoid adding an exception to headerscheck. That turned out to be
too clever, since it meant installation didn't happen by the usual
mechanism. Make it look like a normal header and add the requisite
exception.

Bug: #19450
Reported-by: RekGRpth <rekgrpth@gmail.com>
Discussion: https://postgr.es/m/19450-bb0612c50c6786e5@postgresql.org

Simplify SortSupport for the macaddr data type

As of commit 6aebedc38 Datums are 64-bit values. Since MAC addresses
have only 6 bytes, the abbreviated key always contains the entire
MAC address and is thus authoritative (for practical purposes -- the
tuple sort machinery has no way of knowing that). Abbreviating this
datatype is cheap, and aborting abbreviation prevents optimizations
like radix sort, so remove cardinality estimation.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Suggested-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CAJ7c6TMk10rF_LiMz6j9rRy1rqk-5s+wBPuBefLix4cY+-4s1w@mail.gmail.com

Mark JumbleState as a const in the post_parse_analyze hook

This commit changes the post_parse_analyze_hook_type() hook to take a
const JumbleState, to tell external modules that they are not allowed to
touch the JumbleState that has been compiled by the core code.  This
fixes a pretty old problem with pg_stat_statements, that had always the
idea of modifying the lengths of the constants stored in the
JumbleState.  The previous state could confuse extensions that need to
look at a JumbleState depending on the loading order, if
pg_stat_statements is part of the stack loaded.

Another piece included in this commit is the move of the routine
fill_in_constant_lengths() to queryjumblefuncs.c, to give an option to
extensions to compile the lengths of the constants, if necessary.  I was
surprised by the number of external code that carries a copy of this
routine (see the thread for details).  Previously, this routine modified
JumbleState.  It now copies the set of LocationLens from JumbleState,
and fills the constant lengths for separate use.

pg_stat_statements is updated to use the new ComputeConstantLengths().
JumbleState is now marked with a const in the module, where relevant.

Author: Sami Imseih <samimseih@gmail.com>
Co-authored-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com

Split CREATE STATISTICS error reasons out into errdetails

Some errmsgs in statscmds.c were phrased as "...cannot be used
because...". Put the reasons into errdetails. While at it, switch
from passive voice to "cannot create..." for the errmsg.

Author: Yugo Nagata <nagata@sraoss.co.jp>
Suggested-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CANWCAZaZeX0omWNh_ZbD_JVujzYQdRUW8UZOQ4dWh9Sg7OcAow@mail.gmail.com

Fix injection point detach timing problem in TAP test for lock stats

injection_points_detach() could fail because of a concurrent cleanup
triggered by injection_points_set_local() when a session finishes.
This problem could be reproduced by adding a hardcoded sleep in
InjectionPointDetach(), and has been detected by the CI.

As the test is designed so as the injection point is detached before
being awaken, there is no need for it to be local, similarly to test
010_index_concurrently_upsert. This commit removes
injection_points_set_local(), replacing it with a confirmation that the
point has been attached in the session expected to block on a lock.
With this removal, the detach cannot happen concurrently anymore, only
before when the point is woken up.

Issue introduced by 557a9f1e3e62, where the test has been added.

Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/rp6wz4lnz5qn4zlh7uxtavzfrmqvycy2g42z4zasfss2gxi54f@zzcsjdvdflwp

Fix shmem allocation of fixed-sized custom stats kind

StatsShmemSize(), that computes the shmem size needed for pgstats,
includes the amount of shared memory wanted by all the custom stats
kinds registered. However, the shared memory allocation was done by
ShmemAlloc() in StatsShmemInit(), meaning that the space reserved was
not used, wasting some memory.

These extra allocations would show up under "<anonymous>" in
pg_shmem_allocations, as the allocations done by ShmemAlloc() are not
tracked by ShmemIndexEnt.

Issue introduced by 7949d9594582.

Author: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/04b04387-92f5-476c-90b0-4064e71c5f37@iki.fi
Backpatch-through: 18

Fix deferred FK check batching introduced by commit b7b27eb41a5

That commit introduced AfterTriggerIsActive() to detect whether
we are inside the after-trigger firing machinery, so that RI trigger
functions can take the batched fast path.  It was implemented using
query_depth >= 0, which correctly identified immediate trigger firing
but missed the deferred case where query_depth is -1 at COMMIT via
AfterTriggerFireDeferred().  This caused deferred FK checks to fall
back to the per-row fast path instead of the batched path.

The correct check is whether we are inside an after-trigger firing
loop specifically.  Introduce afterTriggerFiringDepth, a counter
incremented around the trigger-firing loops in AfterTriggerEndQuery,
AfterTriggerFireDeferred, and AfterTriggerSetState, and decremented
after FireAfterTriggerBatchCallbacks() returns.  AfterTriggerIsActive()
now returns afterTriggerFiringDepth > 0.

Reported-by: Chao Li <li.evan.chao@gmail.com>
Author: Chao Li <li.evan.chao@gmail.com>
Co-authored-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/C2133B47-79CD-40FF-B088-02D20D654806@gmail.com

Fix shared memory size of template code for custom fixed-sized pgstats

On HEAD, the template code for custom fixed-sized pgstats is in the test
module test_custom_stats.  On REL_18_STABLE, this code lives in the test
module injection_points.

Both cases were underestimating the size of the shared memory area
required for the storage of the stats data, using a single entry rather
than the whole area.  This underestimation meant that there was no
memory allocated for the LWLock required for the stats, and even more.
This problem would be also misleading for extension developers looking
at this code.

This issue has been noticed while digging into a different bug reported
by Heikki Linnakangas, showing that the underestimation was causing
failures in the TAP tests of the test modules for 32-bit builds.  The
other issue reported, related to the memory allocation of custom
fixed-sized pgstats, will be fixed in a follow-up commit.

Discussion: https://postgr.es/m/adMk_lWbnz3HDOA8@paquier.xyz
Backpatch-through: 18

Allocate separate DSM chunk for parallel Index[Only]Scan instrumentation

Previously, parallel index and index-only scans packed the parallel scan
descriptor and shared instrumentation (for EXPLAIN ANALYZE) into a
single DSM allocation. Since scans may be instrumented without being
parallel-aware, and vice versa, using separate DSM chunks -- each with
its own TOC key -- is cleaner. A future commit will extend this pattern
to other scan node types.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me

Assert no duplicate keys in shm_toc_insert()

shm_toc_insert() silently accepts duplicate keys. Since shm_toc_lookup()
returns the first matching entry, any later entry with the same key
would be unreachable. Add an assertion to catch this.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me

Add pg_stat_autovacuum_scores system view.

This view contains one row for each table in the current database,
showing the current autovacuum scores for that specific table. It
also shows whether autovacuum would vacuum or analyze the table.

Bumps catversion.

Author: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com

Use PG_DATA_CHECKSUM_OFF instead of hardcoded value

For a long time, the online checksums patchset kept the "off" state as
literal zero without a label to be consistent with the previous coding
which only had a label for the "on" state. Later, when an "off" label
was made not all uses in the code got the memo. Fix by setting these
to PG_DATA_CHECKSUM_OFF.

While there, fix a duplicate word in a comment introduced by the same
commit.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CAJ7c6TPRTnQFXXX1CRcYoTLXw2swtDH==uSz1MYoMKdLrKZHjA@mail.gmail.com

Add CONCURRENTLY option to REPACK

When this flag is specified, REPACK no longer acquires access-exclusive
lock while the new copy of the table is being created; instead, it
creates the initial copy under share-update-exclusive lock only (same as
vacuum, etc), and it follows an MVCC snapshot; it sets up a replication
slot starting at that snapshot, and uses a concurrent background worker
to do logical decoding starting at the snapshot to populate a stash of
concurrent data changes.  Those changes can then be re-applied to the
new copy of the table just before swapping the relfilenodes.
Applications can continue to access the original copy of the table
normally until just before the swap, which is the only point at which
the access-exclusive lock is needed.

There are some loose ends in this commit:
1. concurrent repack needs its own replication slot in order to apply
   logical decoding, which are a scarce resource and easy to run out of.
2. due to the way the historic snapshot is initially set up, only one
   REPACK process can be running at any one time on the whole system.
3. there's a danger of deadlocking (and thus abort) due to the lock
   upgrade required at the final phase.

These issues will be addressed in upcoming commits.

The design and most of the code are by Antonin Houska, heavily based on
his own pg_squeeze third-party implementation.

Author: Antonin Houska <ah@cybertec.at>
Co-authored-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Noriyoshi Shinoda <noriyoshi.shinoda@hpe.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Discussion: https://postgr.es/m/5186.1706694913@antos
Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql

Document that WAIT FOR may be interrupted by recovery conflicts

Add a note to the WAIT FOR documentation explaining that sessions
using this command on a standby server may be interrupted by recovery
conflicts. Some conflicts are unavoidable - for example, replaying
a tablespace drop terminates all backends unconditionally.

Discussion: https://postgr.es/m/CAPpHfds7oSCbZqob7ytT_Lso8fv-NW8LnedUTE4Krde%2B3rkJeA%40mail.gmail.com
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>

Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()

When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes. This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

Rather than pre-checking pg_is_in_recovery() on the standby (which
would add an extra round-trip on every call), we issue WAIT FOR LSN
directly and handle the 'not in recovery' result as a signal to fall
back to polling.

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), when the standby has been
promoted, or when WAIT FOR LSN is interrupted by a recovery conflict,
the function falls back to the original polling-based approach using
pg_stat_replication on the upstream. The recovery conflict fallback
is necessary because some conflicts are unavoidable - for example,
ResolveRecoveryConflictWithTablespace() kills all backends
unconditionally, regardless of what they are doing.

The recovery conflict detection matches the English error message
"conflict with recovery", which is reliable because the test suite
runs with LC_MESSAGES=C.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>

Avoid syscache lookup while building a WAIT FOR tuple descriptor

Use TupleDescInitBuiltinEntry instead of TupleDescInitEntry when building
the result tuple descriptor for the WAIT FOR command. This avoids a syscache
access that could re-establish a catalog snapshot after we've explicitly
released all snapshots before the wait.

Discussion: https://postgr.es/m/CABPTF7U%2BSUnJX_woQYGe%3D%3DR9Oz%2B-V6X0VO2stBLPGfJmH_LEhw%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>

Remove recheck_relation_needs_vacanalyze().

This function is a thin wrapper around relation_needs_vacanalyze()
that handles fetching and freeing the pgstat entry for the table.
Since all callers of relation_needs_vacanalyze() do that anyway, we
can teach that function to fetch/free the pgstat entry and use it
instead.

Suggested-by: Álvaro Herrera <alvherre@kurilemu.de>
Author: Sami Imseih <samimseih@gmail.com>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com

auto_explain: Add new GUC, auto_explain.log_extension_options.

The associated value should look like something that could be
part of an EXPLAIN options list, but restricted to EXPLAIN options
added by extensions.

For example, if pg_overexplain is loaded, you could set
auto_explain.log_extension_options = 'DEBUG, RANGE_TABLE'.
You can also specify arguments to these options in the same manner
as normal e.g. 'DEBUG 1, RANGE_TABLE false'.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com

Support more object types within CREATE SCHEMA.

Having rejected the principle that we should know how to re-order
the sub-commands of CREATE SCHEMA, there is not really anything
except a little coding to stop us from supporting more object types.
This patch adds support for creating functions (including procedures
and aggregates), operators, types (including domains), collations,
and text search objects.

SQL:2021 specifies that we should allow functions, procedures,
types, domains, and collations, so this moves us a great deal
closer to full SQL compatibility of CREATE SCHEMA.  What remains
missing from their list are casts, transforms, roles, and some
object types we don't support yet (e.g. CREATE CHARACTER SET).
Supporting casts or transforms would be problematic because
they don't have names at all, let alone schema-qualified names,
so it'd be quite a stretch to say that they belong to a schema.
Roles likewise are not schema-qualified, plus they are global
to a cluster, making it even less reasonable to consider them
as belonging to a schema.  So I don't see us trying to complete
the list.

User-defined aggregates and operators are outside the spec's ken,
as are text search objects, so adding them does not do anything for
spec compatibility.  But they go along with these other object types,
plus it takes no additional code to support them since they are
represented as DefineStmts like some variants of CREATE TYPE.
It would indeed take some effort to reject them.

Author: Kirill Reshke <reshkekirill@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CALdSSPh4jUSDsWu3K58hjO60wnTRR0DuO4CKRcwa8EVuOSfXxg@mail.gmail.com

Execute foreign key constraints in CREATE SCHEMA at the end.

The previous patch simplified CREATE SCHEMA's behavior to "execute all
subcommands in the order they are written".  However, that's a bit too
simple, as the spec clearly requires forward references in foreign key
constraint clauses to work, see feature F311-01.  (Most other SQL
implementations seem to read more into the spec than that, but it's
not clear that there's justification for more in the text, and this is
the only case that doesn't introduce unresolvable issues.)  We never
implemented that before, but let's do so now.

To fix it, transform FOREIGN KEY clauses into ALTER TABLE ... ADD
FOREIGN KEY commands and append them to the end of the CREATE SCHEMA's
subcommand list.  This works because the foreign key constraints are
independent and don't affect any other DDL that might be in CREATE
SCHEMA.  For simplicity, we do this for all FOREIGN KEY clauses even
if they would have worked where they were.

Author: Jian He <jian.universality@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us

Don't try to re-order the subcommands of CREATE SCHEMA.

transformCreateSchemaStmtElements has always believed that it is
supposed to re-order the subcommands of CREATE SCHEMA into a safe
execution order.  However, it is nowhere near being capable of doing
that correctly.  Nor is there reason to think that it ever will be,
or that that is a well-defined requirement.  (The SQL standard does
say that it should be possible to do foreign-key forward references
within CREATE SCHEMA, but it's not clear that the text requires
anything more than that.)  Moreover, the problem will get worse as
we add more subcommand types.  Let's just drop the whole idea and
execute the commands in the order given, which seems like a much
less astonishment-prone definition anyway.  The foreign-key issue
will be handled in a follow-up patch.

This will result in a release-note-worthy incompatibility,
which is that forward references like
CREATE SCHEMA myschema
    CREATE VIEW myview AS SELECT * FROM mytable
    CREATE TABLE mytable (...);
used to work and no longer will.  Considering how many closely
related variants never worked, this isn't much of a loss.

Along the way, pass down a ParseState so that we can provide an
error cursor for "wrong schema name" and related errors, and fix
transformCreateSchemaStmtElements so that it doesn't scribble
on the parsetree passed to it.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us

Allow autovacuum to use parallel vacuum workers.

Previously, autovacuum always disabled parallel vacuum regardless of
the table's index count or configuration. This commit enables
autovacuum workers to use parallel index vacuuming and index cleanup,
using the same parallel vacuum infrastructure as manual VACUUM.

Two new configuration options control the feature. The GUC
autovacuum_max_parallel_workers sets the maximum number of parallel
workers a single autovacuum worker may launch; it defaults to 0,
preserving existing behavior unless explicitly enabled. The per-table
storage parameter autovacuum_parallel_workers provides per-table
limits. A value of 0 disables parallel vacuum for the table, a
positive value caps the worker count (still bounded by the GUC), and
-1 (the default) defers to the GUC.

To handle cases where autovacuum workers receive a SIGHUP and update
their cost-based vacuum delay parameters mid-operation, a new
propagation mechanism is added to vacuumparallel.c. The leader stores
its effective cost parameters in a DSM segment. Parallel vacuum
workers poll for changes in vacuum_delay_point(); if an update is
detected, they apply the new values locally via VacuumUpdateCosts().

A new test module, src/test/modules/test_autovacuum, is added to
verify that parallel autovacuum workers are correctly launched and
that cost-parameter updates are propagated as expected.

The patch was originally proposed by Maxim Orlov, but the
implementation has undergone significant architectural changes
since then during the review process.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: zengman <zengman@halodbtech.com>
Discussion: https://postgr.es/m/CACG=ezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw=cw@mail.gmail.com

Rename cluster.c to repack.c (and corresponding .h)

CLUSTER is no longer the favored way to invoke this functionality, and
the code is about to shift its focus to the REPACK more ambitiously.
Rename the file to avoid leaving an unnecessary historical artifact
around.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/202603271635.owyhm7btgoic@alvherre.pgsql

Disallow system columns in COPY FROM WHERE conditions.

These columns haven't been computed yet when the filtering happens
(since we've not written the candidate tuple into the table); so
any check on them is wrong or useless.  Worse, since aa606b931 such a
reference results in an access off the end of a TupleDesc, potentially
causing a phony "generated columns are not supported in COPY FROM
WHERE conditions" error; and since c98ad086a it throws an Assert
instead.

Actually we could allow tableoid, which has been set to the OID of the
table named as the COPY target.  However, plausible uses for tests of
tableoid would involve a partitioned target table, and the user would
wish it to read as the OID of the destination partition.  There has
been some discussion of changing things to make it work like that,
but pending that happening we should just disallow tableoid along
with other system columns.

It seems best though to install this prohibition only in HEAD.
In the back branches we'll just guard the unsafe TupleDesc access,
and people will keep getting whatever semantics they got before.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6f435023-8ab6-47c2-ba07-035d0c4212f9@gmail.com

Add missing .gitignore files.

contrib/pg_stash_advice and src/test/modules/test_shmem
missed these, leading to complaints from git after an
in-tree check-world run.

Use our standard boilerplate list of ignorable subdirectories,
although the two modules presently create different subsets
of that.

Fix null-bitmap combining in array_agg_array_combine().

This code missed the need to update the combined state's
nullbitmap if state1 already had a bitmap but state2 didn't.
We need to extend the existing bitmap with 1's but didn't.
This could result in wrong output from a parallelized
array_agg(anyarray) calculation, if the input has a mix of
null and non-null elements. The errors depended on timing
of the parallel workers, and therefore would vary from one
run to another.

Also install guards against integer overflow when calculating
the combined object's sizes, and make some trivial cosmetic
improvements.

Author: Dmytro Astapov <dastapov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFQUnFj2pQ1HbGp69+w2fKqARSfGhAi9UOb+JjyExp7kx3gsqA@mail.gmail.com
Backpatch-through: 16

Add a guc_check_handler to the EXPLAIN extension mechanism.

It would be useful to be able to tell auto_explain to set a custom
EXPLAIN option, but it would be bad if it tried to do so and the
option name or value wasn't valid, because then every query would fail
with a complaint about the EXPLAIN option. So add a guc_check_handler
that auto_explain will be able to use to only try to set option
name/value/type combinations that have been determined to be legal,
and to emit useful messages about ones that aren't.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com

Remove autoanalyze corner case.

The restructuring in commit 53b8ca6881 revealed an interesting
corner case: if a table needs vacuuming for wraparound prevention
and autovacuum is disabled for it, we might still choose to analyze
it. Research seems to indicate this was an accidental addition by
commit 48188e1621, and further discussion indicates there is
consensus that it is unnecessary and can be removed.

Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/adB9nSsm_S0D9708%40nathan

Expose helper functions scan_quoted_identifier and scan_identifier.

Previously, this logic was embedded within SplitIdentifierString,
SplitDirectoriesString, and SplitGUCList. Factoring it out saves
a bit of duplicated code, and also makes it available to extensions
that might want to do similar things without necessarily wanting to
do exactly the same thing.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com

Add TAP tests for log_lock_waits

This commit updates 011_lock_stats.pl to verify log_lock_waits behavior.

The tests check that messages are emitted both when a wait occurs and
when the lock is acquired, and that the "still waiting for" message is logged
exactly once per wait, even if the backend wakes up during the wait.

The latter covers the behavior introduced by commit fd6ecbfa75f.

Author: Hüseyin Demir <huseyin.d3r@gmail.com>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAB5wL7YB1my9W5k5i=SY+=sTjeozyJ0YkvGXrVfeDNzuRkoTPg@mail.gmail.com

Release postmaster working memory context in slotsync worker

Child processes do not need the postmaster's working memory context and
normally release it at the start of their main entry point. However,
the slotsync worker forgot to do so.

This commit makes the slotsync worker release the postmaster's working
memory context at startup, preventing unintended use.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Tiancheng Ge <getiancheng_2012@163.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHO05JaUpgKF8FBDmPdBUJsK22axRRcgmAUc2Jyi8OK8g@mail.gmail.com

Fix memory leaks introduced by commit 283e823f9dcb

When freeing pending_shmem_requests we should also free the ->options.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://www.postgresql.org/message-id/CAJ7c6TN9tp8MTc0WXM0zfSWqjfBqU8gpe+o5KqHB1-cQ7409Kw@mail.gmail.com

Fix compilation without injection points with some compilers

Some compilers didn't like the empty initializer when compiled without
USE_INJECTION_POINTS. Per buildfarm member 'drongo', using Visual
Studio 2019.

Author: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/adNHcBVJO5gIOp1l@paquier.xyz

Add pg_stash_advice contrib module.

This module allows plan advice strings to be provided automatically
from an in-memory advice stash. Advice stashes are stored in dynamic
shared memory and must be recreated and repopulated after a server
restart. If pg_stash_advice.stash_name is set to the name of an advice
stash, and if query identifiers are enabled, the query identifier
for each query will be looked up in the advice stash and the
associated advice string, if any, will be used each time that query
is planned.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Discussion: http://postgr.es/m/CA+TgmoaeNuHXQ60P3ZZqJLrSjP3L1KYokW9kPfGbWDyt+1t=Ng@mail.gmail.com

Use single LWLock for lock statistics in pgstats

Previously, one LWLock was used for each lock type, adding complexity
without an observable performance benefit as data is gathered only for
paths involving lock waits, at least currently.  This commit replaces
the per-type set of LWLocks with a single LWLock protecting the stats
data of all the lock types, like the stats kinds for SLRU or WAL.  A
good chunk of the callbacks get simpler thanks to this change.

The previous approach also had one bug in the flush callback when nowait
was called with "true": a backend iterating over all entries could
successfully flush some entries while skipping others due to contention,
then unconditionally reset the pending data.  This would cause some
stats data loss.

Oversight in 4019f725f5d4.

Reported-by: Tomas Vondra <tomas@vondra.me>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/1af63e6d-16d5-4d5b-9b03-11472ef1adf9@vondra.me

Improve more stability of worker_spi termination test

Alexander Lakhin has noticed that it can be possible on machines with
slow storage to have the spawned workers be stuck in
initialize_worker_spi(), before they reach their main loop.  Waiting for
a flush to happen would block the interrupt attempts done by the
database commands, causing the test to fail on timeout once the number
of interrupt attempts is reached in CountOtherDBBackends().

This commit switches the test to wait for the spawned bgworkers to reach
their main loops before attempting the database commands that would
trigger the interrupts, napping for a time larger than the default, with
worker_spi.naptime set at 10 minutes.  Another thing that could be
attempted is to enforce a larger number of tries in
CountOtherDBBackends(), if what is done here is not enough.  Let's see
first if what this commit does is enough for the buildfarm members
widowbird and jay.

Analyzed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/f913fba1-da59-404c-9eb3-07c7304be637@gmail.com

Simplify redundant current_database() subqueries in stats.sql regression test

Previously the stats.sql regression test used conditions like
"datname = (SELECT current_database())" to check the current database name.

The subquery is unnecessary, so this commit simplifies these expressions to
"datname = current_database()".

Author: Chao Li <lic@highgo.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/A1535A8F-65AF-4C3D-ACBE-25891CB5D38B@gmail.com

Fix volatile function evaluation in eager aggregation

Pushing aggregates containing volatile functions below a join can
violate volatility semantics by changing the number of times the
function is executed.

Here we check the Aggref nodes in the targetlist and havingQual for
volatile functions and disable eager aggregation when such functions
are present.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com

Fix collation handling for grouping keys in eager aggregation

When determining if it is safe to use an expression as a grouping key
for partial aggregation, eager aggregation relies on the B-tree
equalimage support function to ensure that equality implies image
equality.

Previously, the code incorrectly passed the default collation of the
expression's data type to the equalimage procedure, rather than the
expression's actual collation. As a result, if a column used a
non-deterministic collation but the base type's default collation was
deterministic, eager aggregation would incorrectly assume that the
column was safe for byte-level grouping. This could cause rows to be
prematurely grouped and subsequently discarded by strict join
conditions, resulting in incorrect query results.

This patch fixes the issue by passing the expression's actual
collation to the equalimage procedure.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com