Dave Airlie [Thu, 7 Aug 2025 19:50:02 +0000 (05:50 +1000)]
Merge tag 'drm-xe-next-fixes-2025-08-06' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
- SRIOV: PF fixes and removal of need of module param (Michal)
- Fix driver unbind around Devcoredump (Bala)
- Mark xe driver as BROKEN if kernel page size is not 4kB (Simon)
Alex Deucher [Wed, 30 Jul 2025 15:16:05 +0000 (11:16 -0400)]
drm/amdgpu/discovery: fix fw based ip discovery
We only need the fw based discovery table for sysfs. No
need to parse it. Additionally, parsing some of the board
specific tables may result in incorrect data on some boards.
Just load the binary and don't parse it on those boards.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4441 Fixes: 80a0e8282933 ("drm/amdgpu/discovery: optionally use fw based ip discovery") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 62eedd150fa11aefc2d377fc746633fdb1baeb55) Cc: stable@vger.kernel.org
Amber Lin [Fri, 1 Aug 2025 00:45:00 +0000 (20:45 -0400)]
drm/amdkfd: Destroy KFD debugfs after destroy KFD wq
Since KFD proc content was moved to kernel debugfs, we can't destroy KFD
debugfs before kfd_process_destroy_wq. Move kfd_process_destroy_wq prior
to kfd_debugfs_fini to fix a kernel NULL pointer problem. It happens
when /sys/kernel/debug/kfd was already destroyed in kfd_debugfs_fini but
kfd_process_destroy_wq calls kfd_debugfs_remove_process. This line
debugfs_remove_recursive(entry->proc_dentry);
tries to remove /sys/kernel/debug/kfd/proc/<pid> while
/sys/kernel/debug/kfd is already gone. The kernel then hangs on a
NULL pointer dereference.
Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Eric Huang <jinhuieric.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0333052d90683d88531558dcfdbf2525cc37c233) Cc: stable@vger.kernel.org
Jesse.Zhang [Mon, 4 Aug 2025 00:43:15 +0000 (08:43 +0800)]
drm/amdgpu: Update SDMA firmware version check for user queue support
This commit fixes a firmware version check for enabling user queue
support in SDMA v7.0. The previous version check (7836028) was
incorrect and could lead to issues with PROTECTED_FENCE_SIGNAL
commands causing register conflicts between MCU_DBG0 and MCU_DBG1.
Fixes: 8c011408ed84 ("drm/amdgpu/sdma7: add ucode version checks for userq support") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 92e2449241516c95aab95eea91faecd0fa2b7ed5) Cc: stable@vger.kernel.org
This reverts commit 66abb996999de0d440a02583a6e70c2c24deab45.
This broke custom brightness curves but it wasn't obvious because
of other related changes. Custom brightness curves are always
from a 0-255 input signal. The correct fix was to fix the default
value which was done by [1].
Siyang Liu [Fri, 4 Jul 2025 03:16:22 +0000 (11:16 +0800)]
drm/amd/display: fix a Null pointer dereference vulnerability
[Why]
A null pointer dereference vulnerability exists in the AMD display driver's
(DC module) cleanup function dc_destruct().
When display control context (dc->ctx) construction fails
(due to memory allocation failure), this pointer remains NULL.
During subsequent error handling when dc_destruct() is called,
there's no NULL check before dereferencing the perf_trace member
(dc->ctx->perf_trace), causing a kernel null pointer dereference crash.
[How]
Check if dc->ctx is non-NULL before dereferencing.
Link: https://lore.kernel.org/r/tencent_54FF4252EDFB6533090A491A25EEF3EDBF06@qq.com Co-developed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
(Updated commit text and removed unnecessary error message) Signed-off-by: Siyang Liu <Security@tencent.com> Signed-off-by: Roman Li <roman.li@amd.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9dd8e2ba268c636c240a918e0a31e6feaee19404) Cc: stable@vger.kernel.org
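The guard described under [How] can be sketched in isolation as follows. This is a minimal userspace model, not the driver code: the struct layouts and the names sketch_ctx, sketch_dc and sketch_dc_destruct are hypothetical stand-ins for the DC context, struct dc and dc_destruct().

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver structures. */
struct sketch_ctx {
	void *perf_trace;
};

struct sketch_dc {
	struct sketch_ctx *ctx; /* stays NULL if context construction failed */
};

/*
 * Tear down a possibly half-constructed object. Returns 1 if the context
 * was released and 0 if there was nothing to release.
 */
static int sketch_dc_destruct(struct sketch_dc *dc)
{
	if (!dc->ctx)
		return 0; /* construction failed early: nothing to dereference */

	free(dc->ctx->perf_trace);
	dc->ctx->perf_trace = NULL;
	free(dc->ctx);
	dc->ctx = NULL;
	return 1;
}
```

Calling sketch_dc_destruct() on an object whose ctx was never allocated returns 0 without touching perf_trace, which is the behaviour the check is meant to guarantee on the error path.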
Michel Dänzer [Wed, 30 Jul 2025 08:09:02 +0000 (10:09 +0200)]
drm/amd/display: Add primary plane to commits for correct VRR handling
amdgpu_dm_commit_planes calls update_freesync_state_on_stream only for
the primary plane. If a commit affects a CRTC but not its primary plane,
it would previously not trigger a refresh cycle or affect LFC, violating
current UAPI semantics.
Fixes e.g. atomic commits affecting only the cursor plane being limited
to the minimum refresh rate.
Don't do this for the legacy cursor ioctls though, as it would break the
UAPI semantics for those.
Suggested-by: Xaver Hugl <xaver.hugl@kde.org> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3034 Signed-off-by: Michel Dänzer <mdaenzer@redhat.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit cc7bfba95966251b254cb970c21627124da3b7f4) Cc: stable@vger.kernel.org
drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job
The field job->vm is used in function amdgpu_job_run to get the page
table re-generation counter and decide whether the job should be skipped.
Specifically, function amdgpu_vm_generation checks if the VM is valid for this job to use.
For instance, if a gfx job depends on a cancelled sdma job from entity vm->delayed,
then the gfx job should be skipped.
Fixes: 26c95e838e63 ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare") Signed-off-by: YuanShang <YuanShang.Mao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ed76936c6b10b547c6df4ca75412331e9ef6d339) Cc: stable@vger.kernel.org
Timur Kristóf [Tue, 22 Jul 2025 15:58:30 +0000 (17:58 +0200)]
drm/amd/display: Fix DCE 6.0 and 6.4 PLL programming.
Apparently, both DCE 6.0 and 6.4 have 3 PLLs, but PLL0 can only
be used for DP. Make sure to initialize the correct number of PLLs
in DC for these DCE versions and use PLL0 only for DP.
Also, on DCE 6.0 and 6.4, the PLL0 needs to be powered on at
initialization as opposed to DCE 6.1 and 7.x which use a different
clock source for DFS.
The following functions were used as reference from the old
radeon driver implementation of DCE 6.x:
- radeon_atom_pick_pll
- atombios_crtc_set_disp_eng_pll
Reviewed-by: Rodrigo Siqueira <siqueira@igalia.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 35222b5934ec8d762473592ece98659baf6bc48e) Cc: stable@vger.kernel.org
Timur Kristóf [Tue, 22 Jul 2025 15:58:29 +0000 (17:58 +0200)]
drm/amd/display: Don't overwrite dce60_clk_mgr
dc_clk_mgr_create accidentally overwrites the dce60_clk_mgr
with the dce_clk_mgr, causing incorrect behaviour on DCE6.
Fix it by removing the extra dce_clk_mgr_construct.
Fixes: 62eab49faae7 ("drm/amd/display: hide VGH asic specific structs") Reviewed-by: Rodrigo Siqueira <siqueira@igalia.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit bbddcbe36a686af03e91341b9bbfcca94bd45fb6) Cc: stable@vger.kernel.org
David Yat Sin [Wed, 16 Jul 2025 22:04:28 +0000 (22:04 +0000)]
drm/amdkfd: Fix checkpoint-restore on multi-xcc
GPUs with multi-xcc have multiple MQDs per queue. This patch saves and
restores all the MQDs within the partition.
Signed-off-by: David Yat Sin <David.YatSin@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a578f2a58c3ab38f0643b1b6e7534af860233cb1) Cc: stable@vger.kernel.org
drm/amd: Restore cached manual clock settings during resume
If the SCLK limits have been set before S3 they will not
be restored. The limits are however cached in the driver and so
they can be restored by running a commit sequence during resume.
Simon Richter [Sat, 2 Aug 2025 02:40:36 +0000 (11:40 +0900)]
Mark xe driver as BROKEN if kernel page size is not 4kB
This driver, for the time being, assumes that the kernel page size is 4kB,
so it fails on loong64 and aarch64 with 16kB pages, and ppc64el with 64kB
pages.
Signed-off-by: Simon Richter <Simon.Richter@hogyros.de> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org # v6.8+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250802024152.3021-1-Simon.Richter@hogyros.de
(cherry picked from commit 0521a868222ffe636bf202b6e9d29292c1e19c62) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
The PF driver might be resumed just to configure VFs, but since
it is doing some asynchronous GuC reconfigurations after fresh
reset, we should wait until all pending works are completed.
This is especially important in case of LMEM provisioning, since
we also need to update the LMTT and send invalidation requests
to all GuCs, which are expected to be already in the VGT mode.
drm/xe/pf: Disable PF restart worker on device removal
We can't let the restart worker run once the device is removed, since
other data that it might want to access could already be released.
Explicitly disable the worker as part of the device cleanup action.
drm/xe/devcoredump: Defer devcoredump initialization during probe
Initializing devcoredump before the GT, though it looks harmless, leads
to a problem during driver unbind. Because of this order, the GT/Engine
release functions are called before the xe devcoredump release function
(xe_driver_devcoredump_fini), leading to the following kernel crash[1]
because the devcoredump functions might still use GT/Engine
data structures after those are freed.
The following crash is observed while running the IGT
xe_wedged@wedged-at-any-timeout. The test forces a wedged state by
submitting a workload which hangs. It then does an unbind/rebind of the
driver to recover from the wedged state.
The hung workload leads to a devcoredump. The following crash is
noticed when the devcoredump capture races with the driver unbind.
During driver unbind, the release function hw_engine_fini() will be
called which assigns NULL to hwe->gt. But the same data structure is
accessed during the coredump capture in the function
xe_engine_snapshot_print by reading snapshot->hwe->gt.
With this patch, we make sure the devcoredump is stopped before
deinitializing the core driver functions.
Michal Wajdeczko [Tue, 22 Jul 2025 18:26:16 +0000 (20:26 +0200)]
drm/xe/pf: Enable SR-IOV PF mode by default
We already claim official support for SR-IOV PF/VF modes on PTL
and BMG platforms, but by default we start the Xe driver on those
platforms in non-virtualized mode (native) since we still have
max_vfs modparam set to disable creation of the VFs.
It's time to let the Xe driver support SR-IOV PF mode by default.
We were already testing this on our CI, which was relying on the
patch that was enabling it for CONFIG_DRM_XE_DEBUG used by our CI.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250722182618.30811-3-michal.wajdeczko@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit a2b461bd6f3b36bded0a74178dec0e58e4714d3d) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
drm/i915/display: Write PHY_CMN1_CONTROL only when using AUXLess ALPM
We are seeing "dmesg-warn/abort - *ERROR* PHY * failed after 3 retries"
since we started configuring LFPS sending. According to Bspec, configuring
LFPS sending is needed only when using AUXLess ALPM. This patch avoids
these failures by configuring LFPS sending only when using AUXLess ALPM.
Ivan Lipski [Thu, 17 Jul 2025 17:58:35 +0000 (13:58 -0400)]
drm/amd/display: Allow DCN301 to clear update flags
[Why & How]
Not letting DCN301 clear the update flags after a surface/stream update
results in artifacts when switching between active overlay planes. The
issue is known and was solved initially. See below:
(https://gitlab.freedesktop.org/drm/amd/-/issues/3441)
Fixes: f354556e29f4 ("drm/amd/display: limit clear_update_flags t dcn32 and above") Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/display: Pass up errors for reset GPU that fails to init HW
[Why]
If a GPU is in reset and the hardware fails to initialize, the rest of the
resume sequence shouldn't be run.
[How]
Pass error code up to caller of dm_resume().
Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
drm/amd/display: Only finalize atomic_obj if it was initialized
[Why]
If amdgpu_dm failed to initialize before amdgpu_dm_initialize_drm_device()
completed, then freeing atomic_obj will lead to list corruption.
[How]
Check if atomic_obj state is initialized before trying to free.
Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/display: Avoid configuring PSR granularity if PSR-SU not supported
[Why]
If PSR-SU is disabled on the link, then configuring su_y granularity in
mod_power_calc_psr_configs() can lead to assertions in
psr_su_set_dsc_slice_height().
[How]
Check the PSR version in amdgpu_dm_link_setup_psr() to determine whether
or not to configure granularity.
Reviewed-by: Sun peng (Leo) Li <sunpeng.li@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Roman Li [Mon, 14 Jul 2025 18:37:33 +0000 (14:37 -0400)]
drm/amd/display: Disable dsc_power_gate for dcn314 by default
[Why]
"REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line"
warnings seen after resuming from s2idle.
DCN314 has issues with DSC power gating that cause REG_WAIT timeouts
when attempting to power down DSC blocks.
[How]
Disable dsc_power_gate for dcn314 by default.
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Signed-off-by: Roman Li <Roman.Li@amd.com> Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Frank Min <Frank.Min@amd.com> Reviewed-by: Gui Chengming <Jack.Gui@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
Yang Wang [Thu, 24 Jul 2025 07:16:18 +0000 (15:16 +0800)]
drm/amd/amdgpu: fix missing lock for cper.ring->rptr/wptr access
Add lock protection for 'ring->wptr'/'ring->rptr' to ensure the correct execution.
Fixes: 8652920d2c00 ("drm/amdgpu: add mutex lock for cper ring") Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
drm/amd/display: Fix misuse of /** to /* in 'dce_i2c_hw.c'
Fix the comment style before cntl_stuck_hw_workaround() by replacing
'/**' with '/*' since it is not a kdoc comment.
Fixes the below with gcc W=1:
display/dc/dce/dce_i2c_hw.c:380: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* If we boot without an HDMI display, the I2C engine does not get
initialized
Fixes: 04d57f4462a6 ("drm/amd/display: Workaround for stuck I2C arbitrage") Cc: Alvin Lee <alvin.lee2@amd.com> Cc: Dominik Kaszewski <dominik.kaszewski@amd.com> Cc: Ivan Lipski <ivan.lipski@amd.com> Cc: Harry Wentland <harry.wentland@amd.com> Cc: Tom Chung <chiahsuan.chung@amd.com> Cc: Roman Li <roman.li@amd.com> Cc: Alex Hung <alex.hung@amd.com> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
DIV_ROUND_CLOSEST(x, 100) returns either 0 or 1 if 0<x<=100, so the
division needs to be performed after the multiplication and not the
other way around, to properly scale the value.
Fixes: 8b5f3a229a70 ("drm/amd/display: Fix default DC and AC levels") Signed-off-by: Lauri Tirkkonen <lauri@hacktheplanet.fi> Cc: stable@vger.kernel.org Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/aH2Q_HJvxKbW74vU@hacktheplanet.fi Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
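The ordering problem can be reproduced with a few lines of arithmetic. The sketch below uses an unsigned-only simplification of DIV_ROUND_CLOSEST() and a hypothetical full-scale value of 255; the function names are illustrative and do not come from the driver.

```c
/* Unsigned-only simplification of the kernel's DIV_ROUND_CLOSEST(). */
#define SKETCH_DIV_ROUND_CLOSEST(x, d) (((x) + ((d) / 2)) / (d))

#define SKETCH_MAX_LEVEL 255u /* hypothetical full-scale value */

/* Divide first: the percentage collapses to 0 or 1 before scaling. */
static unsigned int scale_divide_first(unsigned int percent)
{
	return SKETCH_DIV_ROUND_CLOSEST(percent, 100u) * SKETCH_MAX_LEVEL;
}

/* Multiply first: the full range survives until the final rounding. */
static unsigned int scale_multiply_first(unsigned int percent)
{
	return SKETCH_DIV_ROUND_CLOSEST(percent * SKETCH_MAX_LEVEL, 100u);
}
```

With these definitions, an input of 12 yields 0 when dividing first and 31 when multiplying first, while an input of 70 yields 255 and 179 respectively.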
Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Gang Ba <Gang.Ba@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
Peter Shkenev [Thu, 17 Jul 2025 20:48:17 +0000 (23:48 +0300)]
drm/amdgpu: check if hubbub is NULL in debugfs/amdgpu_dm_capabilities
HUBBUB structure is not initialized on DCE hardware, so check if it is NULL
to avoid null dereference while accessing amdgpu_dm_capabilities file in
debugfs.
Signed-off-by: Peter Shkenev <mustela@erminea.space> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
drm/amdgpu: Initialize data to NULL in imu_v12_0_program_rlc_ram()
After a recent change in clang to expose uninitialized warnings from
const variables and pointers [1], there is a warning in
imu_v12_0_program_rlc_ram() because data is passed uninitialized to
program_imu_rlc_ram():
drivers/gpu/drm/amd/amdgpu/imu_v12_0.c:374:30: error: variable 'data' is uninitialized when used here [-Werror,-Wuninitialized]
374 | program_imu_rlc_ram(adev, data, (const u32)size);
| ^~~~
As this warning happens early in clang's frontend, it does not realize
that due to the assignment of r to -EINVAL, program_imu_rlc_ram() is
never actually called, and even if it were, data would not be
dereferenced because size is 0.
Just initialize data to NULL to silence the warning, as the commit that
added program_imu_rlc_ram() mentioned it would eventually be used over
the old method, at which point data can be properly initialized and
used.
Cc: stable@vger.kernel.org Closes: https://github.com/ClangBuiltLinux/linux/issues/2107 Fixes: 56159fffaab5 ("drm/amdgpu: use new method to program rlc ram") Link: https://github.com/llvm/llvm-project/commit/2464313eef01c5b1edf0eccf57a32cdee01472c7 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/display: Fix divide by zero when calculating min ODM factor
[WHY&HOW]
If the debug option is set to disable_dsc, the max slice width and/or
dispclk can be zero. This causes a divide by zero when calculating the
min ODM combine factor. Add a check to ensure they are valid first.
Reviewed-by: Wenjing Liu <wenjing.liu@amd.com> Signed-off-by: Dillon Varone <dillon.varone@amd.com> Signed-off-by: Wayne Lin <wayne.lin@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
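A guard of this kind might look like the sketch below. The formula, the parameter names and the choice to return 0 for invalid inputs are all assumptions made for illustration; the commit message only states that the divisors are validated before use.

```c
/* Ceiling division for unsigned values; the caller guarantees d != 0. */
static unsigned int sketch_div_ceil(unsigned int n, unsigned int d)
{
	return (n + d - 1) / d;
}

/*
 * Hypothetical minimum-factor calculation. Returns 0 (no constraint
 * computed) when either divisor is zero instead of dividing by it.
 */
static unsigned int sketch_min_odm_factor(unsigned int pic_width,
					  unsigned int max_slice_width,
					  unsigned int pix_clk_khz,
					  unsigned int max_dispclk_khz)
{
	unsigned int by_width, by_clk;

	if (max_slice_width == 0 || max_dispclk_khz == 0)
		return 0;

	by_width = sketch_div_ceil(pic_width, max_slice_width);
	by_clk = sketch_div_ceil(pix_clk_khz, max_dispclk_khz);
	return by_width > by_clk ? by_width : by_clk;
}
```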
Michal Wajdeczko [Tue, 22 Jul 2025 14:10:54 +0000 (16:10 +0200)]
drm/xe/configfs: Fix pci_dev reference leak
We are using pci_get_domain_bus_and_slot() function to verify if
the given config directory name matches any existing PCI device,
but we failed to call the matching pci_dev_put() to release the reference.
While at it, also change the error code in case of no device match,
to make it more specific than a generic formatting error.
Fixes: 16280ded45fb ("drm/xe: Add configfs to enable survivability mode") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://lore.kernel.org/r/20250722141059.30707-2-michal.wajdeczko@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 0bdd05c2a82bbf2419415d012fd4f5faeca7f1af) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
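The get/put pairing can be modelled in userspace as below. struct sketch_dev and the sketch_* helpers are hypothetical stand-ins for struct pci_dev, pci_get_domain_bus_and_slot() and pci_dev_put(); the -ENODEV return for a missing device is likewise an assumption chosen for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical refcounted device, standing in for struct pci_dev. */
struct sketch_dev {
	const char *name;
	int refcount;
};

static struct sketch_dev sketch_devices[] = {
	{ "0000:03:00.0", 1 },
};

/* Lookup that takes a reference on success. */
static struct sketch_dev *sketch_dev_get(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_devices) / sizeof(sketch_devices[0]); i++) {
		if (strcmp(sketch_devices[i].name, name) == 0) {
			sketch_devices[i].refcount++;
			return &sketch_devices[i];
		}
	}
	return NULL;
}

/* Drops a reference; NULL is accepted. */
static void sketch_dev_put(struct sketch_dev *dev)
{
	if (dev)
		dev->refcount--;
}

/* Existence check only: the reference is dropped before returning. */
static int sketch_validate_name(const char *name)
{
	struct sketch_dev *dev = sketch_dev_get(name);

	if (!dev)
		return -ENODEV;
	sketch_dev_put(dev);
	return 0;
}
```

After a successful sketch_validate_name() call the reference count is back at its starting value, which is the invariant the missing put violated.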
Shuicheng Lin [Thu, 24 Jul 2025 19:38:55 +0000 (19:38 +0000)]
drm/xe/hw_engine_group: Avoid call kfree() for drmm_kzalloc()
Memory allocated with drmm_kzalloc() should not be freed using
kfree(), as it is managed by the DRM subsystem. The memory will
be automatically freed when the associated drm_device is released.
These 3 group pointers are allocated using drmm_kzalloc() in
hw_engine_group_alloc(), so they don't require manual deallocation.
Fixes: 67979060740f ("drm/xe/hw_engine_group: Fix potential leak") Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250724193854.1124510-2-shuicheng.lin@intel.com
(cherry picked from commit f98de826b418885a21ece67f0f5b921ae759b7bf) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
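The ownership rule can be illustrated with a toy managed allocator. Everything here is hypothetical (struct sketch_device, sketch_managed_zalloc(), sketch_device_release()); it models the idea behind drmm_kzalloc() rather than its implementation.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_MAX_MANAGED 8

/* Hypothetical device that owns its managed allocations. */
struct sketch_device {
	void *managed[SKETCH_MAX_MANAGED];
	size_t count;
};

/* Allocate zeroed memory whose lifetime is tied to the device. */
static void *sketch_managed_zalloc(struct sketch_device *dev, size_t size)
{
	void *p;

	if (dev->count == SKETCH_MAX_MANAGED)
		return NULL;
	p = calloc(1, size);
	if (p)
		dev->managed[dev->count++] = p;
	return p;
}

/* Device release frees every tracked block; returns how many it freed. */
static size_t sketch_device_release(struct sketch_device *dev)
{
	size_t freed = dev->count;

	while (dev->count)
		free(dev->managed[--dev->count]);
	return freed;
}
```

Because the device still tracks each pointer, a caller that also frees one by hand would cause a double free at release time; in this model the only correct cleanup is to let sketch_device_release() run.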
Michal Wajdeczko [Wed, 23 Jul 2025 17:56:39 +0000 (19:56 +0200)]
drm/xe/guc: Clear whole g2h_fence during initialization
The struct g2h_fence must be explicitly initialized using the
g2h_fence_init() function to avoid trash values in its members,
but we failed to update this helper function with the new member.
To fix that and avoid any future mistakes, memset the whole struct
first, then update remaining non-zero members.
Fixes: 94de94d24ea8 ("drm/xe/guc: Cancel ongoing H2G requests when stopping CT") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Lukasz Laguna <lukasz.laguna@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250723175639.206875-1-michal.wajdeczko@intel.com
(cherry picked from commit 159afd92bae8153bdd8d8b34aea0d463fe19c978) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
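The pattern of clearing the whole struct first and then setting only the non-zero members can be sketched as follows. The member list and the SKETCH_SEQNO_UNSET default are hypothetical; the real struct g2h_fence differs.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical fence layout; the real struct g2h_fence has other members. */
struct sketch_fence {
	unsigned int *response_buffer;
	unsigned int seqno;
	bool done;
	bool cancel; /* imagine this member was added later */
};

#define SKETCH_SEQNO_UNSET 0xffffu /* hypothetical non-zero default */

static void sketch_fence_init(struct sketch_fence *f, unsigned int *buf)
{
	memset(f, 0, sizeof(*f));       /* every member, present and future */
	f->response_buffer = buf;       /* then only the non-zero members */
	f->seqno = SKETCH_SEQNO_UNSET;
}
```

A member added to the struct later starts out zeroed without any change to the init helper, which is what makes this form robust against the mistake described above.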
Ashutosh Dixit [Tue, 15 Jul 2025 18:14:22 +0000 (11:14 -0700)]
drm/xe/oa: Fix static checker warning about null gt
There is a static checker warning that the gt returned by xe_device_get_gt
can be NULL and is being dereferenced. Use xe_root_mmio_gt instead, which
is equivalent for gt 0 and cannot return NULL.
drm/xe: Don't fail probe on unsupported mailbox command
If the device is running older pcode firmware, it is possible that newer
mailbox commands are not supported by it. The sysfs attributes aren't
useful in that case, but we shouldn't fail driver probe because of it.
As of now, it is unknown if we can distinguish unsupported commands before
attempting them. But until we figure out a way to do that, fix the
regressions.
v2: Add debug message (Lucas)
Fixes: cdc36b66cd41 ("drm/xe: Expose fan control and voltage regulator version") Signed-off-by: Raag Jadav <raag.jadav@intel.com> Tested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://lore.kernel.org/r/20250714215503.2897748-1-raag.jadav@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit ed5461daa150b037e36b8202381da1ef85d6b16b) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
drm/tidss: oldi: convert to devm_drm_bridge_alloc() API
DRM bridges now use "devm_drm_bridge_alloc()" for allocation and
initialization. "devm_kzalloc()" is not allowed anymore and results
in a WARNING. So convert it.
Michael Walle [Wed, 16 Jul 2025 13:41:07 +0000 (15:41 +0200)]
drm/tidss: encoder: convert to devm_drm_bridge_alloc()
Convert the tidss encoder to use devm_drm_bridge_alloc(). Instead of
allocating the memory with drmm_simple_encoder_alloc(), use
devm_drm_bridge_alloc() and initialize the encoder afterwards.
Fixes: a7748dd127ea ("drm/bridge: get/put the bridge reference in drm_bridge_add/remove()") Signed-off-by: Michael Walle <mwalle@kernel.org> Link: https://lore.kernel.org/r/20250716134107.4084945-1-mwalle@kernel.org Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
Dave Airlie [Fri, 18 Jul 2025 09:48:13 +0000 (19:48 +1000)]
Merge tag 'drm-xe-next-2025-07-15' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
Driver Changes:
- Create and use XE_DEVICE_WA infrastructure (Atwood)
- SRIOV: Mark BMG as SR-IOV capable (Michal)
- Dont skip TLB invalidations on VF (Tejas)
- Fix migration copy direction in access_memory (Auld)
- General code clean-up (Lucas, Brost, Dr. David, Xin)
- More missing XeLP workarounds (Tvrtko)
- SRIOV: Relax VF/PF version negotiation (Michal)
- SRIOV: LMTT invalidation (Michal)
Alex Deucher [Thu, 29 May 2025 17:12:35 +0000 (13:12 -0400)]
drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 29 May 2025 17:11:54 +0000 (13:11 -0400)]
drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 26 Jun 2025 13:53:18 +0000 (09:53 -0400)]
drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 26 Jun 2025 13:52:55 +0000 (09:52 -0400)]
drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 28 May 2025 02:29:31 +0000 (22:29 -0400)]
drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Drop the soft_recovery callbacks as the queue reset replaces
them.
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 28 May 2025 02:05:13 +0000 (22:05 -0400)]
drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Drop the soft_recovery callbacks as the queue reset replaces
them.
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Fri, 23 May 2025 04:33:04 +0000 (00:33 -0400)]
drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Drop the soft_recovery callbacks as the queue reset replaces
them.
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 28 May 2025 03:23:53 +0000 (23:23 -0400)]
drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
Re-emit the unprocessed state after resetting the queue.
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 28 May 2025 03:19:29 +0000 (23:19 -0400)]
drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
Re-emit the unprocessed state after resetting the queue.
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amdgpu: Add WARN_ON to the resource clear function
Set the dirty bit when the memory resource is not cleared
during BO release.
v2(Christian):
- Drop the cleared flag set to false.
- Improve the amdgpu_vram_mgr_set_clear_state() function.
v3:
- Add back the resource clear flag set function call after
being cleared during eviction (Christian).
- Modified the patch subject name.
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cached metrics data validity is 1ms on SMUv13.0.6 SOCs. It's not
reasonable for any client to query gpu_metrics at a faster rate and
constantly interrupt PMFW.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amdgpu: Replace HQD terminology with slots naming
The term "HQD" is CP-specific and doesn't
accurately describe the queue resources for other IP blocks like SDMA,
VCN, or VPE. This change:
1. Renames `num_hqds` to `num_slots` in amdgpu_kms.c to better reflect
the generic nature of the resource counting
2. Updates the UAPI struct member from `userq_num_hqds` to `userq_num_slots`
3. Maintains the same functionality while using more appropriate terminology
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Marek Olšák <marek.olsak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jesse Zhang [Wed, 25 Jun 2025 07:29:45 +0000 (15:29 +0800)]
drm/amdgpu: Add user queue instance count in HW IP info
This change exposes the number of available user queue instances
for each hardware IP type (GFX, COMPUTE, SDMA) through the
drm_amdgpu_info_hw_ip interface.
Key changes:
1. Added userq_num_instance field to drm_amdgpu_info_hw_ip structure
2. Implemented counting of available HQD slots using:
- mes.gfx_hqd_mask for GFX queues
- mes.compute_hqd_mask for COMPUTE queues
- mes.sdma_hqd_mask for SDMA queues
3. Only counts available instances when user queues are enabled
(!disable_uq)
v2: using the adev->mes.gfx_hqd_mask[]/compute_hqd_mask[]/sdma_hqd_mask[] masks
to determine the number of queue slots available for each engine type (Alex)
v3: rename userq_num_instance to userq_num_hqds (Alex)
Suggested-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
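Counting slots from per-pipe masks amounts to summing set bits. The sketch below is a userspace approximation: sketch_popcount32() stands in for a kernel bit-weight helper, and the function names, the mask width and the userq_enabled flag are assumptions rather than the driver's actual interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable population count over a 32-bit mask. */
static unsigned int sketch_popcount32(uint32_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1; /* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Sum the available slots over all per-pipe masks; 0 when user queues are off. */
static unsigned int sketch_count_slots(const uint32_t *masks, size_t n_masks,
				       int userq_enabled)
{
	unsigned int total = 0;
	size_t i;

	if (!userq_enabled)
		return 0;
	for (i = 0; i < n_masks; i++)
		total += sketch_popcount32(masks[i]);
	return total;
}
```

For example, two pipe masks of 0x0e and 0x01 give four slots when user queues are enabled and zero when they are disabled.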
Pratap Nirujogi [Mon, 23 Jun 2025 22:44:50 +0000 (18:44 -0400)]
drm/amd/amdgpu: Add helper functions for isp buffers
Accessing amdgpu internal data structures "struct amdgpu_device"
and "struct amdgpu_bo" in ISP V4L2 driver to alloc/free GART
buffers is not recommended.
Add new amdgpu_isp helper functions that take opaque params
from the ISP V4L2 driver and call the amdgpu internal functions
amdgpu_bo_create_isp_user() and amdgpu_bo_create_kernel() to
alloc/free GART buffers.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Pratap Nirujogi [Tue, 24 Jun 2025 23:15:00 +0000 (19:15 -0400)]
drm/amd/amdgpu: Initialize swnode for ISP MFD device
Create the amd_isp_capture MFD device with swnode initialized to the
isp specific software_node that is part of the fwnode graph in the
amd_isp4 x86/platform driver. The isp driver uses this swnode handle
to retrieve the critical properties (data-lanes, mipi phyid,
link-frequencies etc.) required for the camera to work on AMD ISP4
based targets.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amdgpu/gfx8: reset compute ring wptr on the GPU on resume
Commit 42cdf6f687da ("drm/amdgpu/gfx8: always restore kcq MQDs") made the
ring pointer always be reset on resume from suspend. This caused compute
rings to fail since the reset was done without also resetting it for the
firmware. Reset wptr on the GPU to avoid a disconnect between the driver
and firmware wptr.
Writing a string without delimiters (' ', '\n', '\0') to the sysfs files
under gpu_od/fan_ctrl, or to pp_power_profile_mode for the CUSTOM profile,
results in a null pointer dereference.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4401 Signed-off-by: Umio Yasuno <coelacanth_dream@protonmail.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
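One way this class of bug can arise is with strsep()-style tokenizing: when the input holds no delimiter, strsep() returns the whole string and sets the cursor to NULL, so any later dereference of the cursor crashes. The sketch below is a userspace illustration under that assumption; parse_values() is a hypothetical function and is not taken from the driver.

```c
#define _DEFAULT_SOURCE /* expose strsep() on glibc */
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical parser: split buf on ' ' and '\n' and convert each token
 * with strtol(). Returns the number of values stored (at most max).
 * buf is modified in place, as with any strsep() based parser.
 */
static int parse_values(char *buf, long *out, int max)
{
	char *cursor = buf;
	char *tok;
	int n = 0;

	assert(buf != NULL && out != NULL);

	while (n < max && (tok = strsep(&cursor, " \n")) != NULL) {
		if (*tok == '\0')
			continue; /* empty token from repeated delimiters */

		out[n++] = strtol(tok, NULL, 10);

		/* No delimiter was found: cursor is NULL, do not touch it. */
		if (cursor == NULL)
			break;

		/* Safe only because cursor was checked above. */
		while (isspace((unsigned char)*cursor))
			cursor++;
	}
	return n;
}
```

Without the NULL check, an input such as "42" with no trailing delimiter would reach the isspace() loop with a NULL cursor.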
v2: use amdgpu_xgmi_same_hive() as suggested by Felix
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Mon, 16 Jun 2025 21:15:27 +0000 (17:15 -0400)]
drm/amdgpu/vcn3: implement ring reset
Use the new helpers to handle engine resets for VCN.
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Mon, 16 Jun 2025 21:07:22 +0000 (17:07 -0400)]
drm/amdgpu/vcn2.5: implement ring reset
Use the new helpers to handle engine resets for VCN.
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Mon, 16 Jun 2025 20:37:34 +0000 (16:37 -0400)]
drm/amdgpu/vcn2: implement ring reset
Use the new helpers to handle engine resets for VCN.
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Mon, 16 Jun 2025 20:01:25 +0000 (16:01 -0400)]
drm/amdgpu/vcn: add a helper framework for engine resets
With engine resets we reset all queues on the engine rather
than just a single queue. Add a framework to handle this
similar to SDMA.
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
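The difference between a queue reset and an engine reset can be sketched as below. This is a simplified model with hypothetical types and names (struct engine, engine_reset(), the fake_hw_reset_* callbacks); it only assumes what the commit message states, namely that an engine reset affects every queue on the engine, so all of them have to be paused and restarted together.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RINGS 4

/* Hypothetical per-ring bookkeeping. */
struct ring {
	int stopped;     /* 1 while submissions are paused */
	int reset_count; /* number of resets this ring went through */
};

/* Hypothetical engine that owns several rings. */
struct engine {
	struct ring rings[MAX_RINGS];
	size_t num_rings;
	int (*hw_reset)(struct engine *e); /* hardware-specific reset hook */
};

/* Stand-in callbacks for illustration; real hardware code would go here. */
static int fake_hw_reset_ok(struct engine *e)   { (void)e; return 0; }
static int fake_hw_reset_fail(struct engine *e) { (void)e; return -1; }

/*
 * Reset the whole engine: pause every ring, run the hardware reset once,
 * then restart every ring. On failure the rings are left stopped so the
 * caller can escalate to a heavier recovery path.
 */
static int engine_reset(struct engine *e)
{
	size_t i;
	int ret;

	assert(e != NULL && e->hw_reset != NULL && e->num_rings <= MAX_RINGS);

	for (i = 0; i < e->num_rings; i++)
		e->rings[i].stopped = 1;

	ret = e->hw_reset(e);
	if (ret)
		return ret;

	for (i = 0; i < e->num_rings; i++) {
		e->rings[i].reset_count++;
		e->rings[i].stopped = 0;
	}
	return 0;
}
```

Routing every per-ring reset request through one function like engine_reset() keeps the rings on the same engine consistent with each other, which is the point of a shared helper.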
Alex Deucher [Thu, 29 May 2025 19:26:37 +0000 (15:26 -0400)]
drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
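Re-emitting unprocessed state can be pictured with a small ring buffer model: save the entries between the read and write pointers, reset the ring, then write the saved entries back. The sketch below uses hypothetical names (struct ring, ring_backup(), ring_reset(), ring_reemit()) and a toy ring size; it illustrates the idea only and makes no claim about how the driver helpers are implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 8 /* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

/* Hypothetical ring: rptr and wptr are free-running counters. */
struct ring {
	uint32_t buf[RING_SIZE];
	uint32_t rptr; /* next entry the hardware would consume */
	uint32_t wptr; /* next entry the driver would write */
};

/* Copy the not yet consumed entries [rptr, wptr) into saved[]. */
static size_t ring_backup(const struct ring *r, uint32_t *saved, size_t max)
{
	size_t n = 0;
	uint32_t p;

	assert(r != NULL && saved != NULL);

	for (p = r->rptr; p != r->wptr && n < max; p++)
		saved[n++] = r->buf[p & RING_MASK];
	return n;
}

/* Model of a queue reset: contents and pointers are lost. */
static void ring_reset(struct ring *r)
{
	memset(r->buf, 0, sizeof(r->buf));
	r->rptr = 0;
	r->wptr = 0;
}

/* Write the saved entries back so the work is not dropped. */
static void ring_reemit(struct ring *r, const uint32_t *saved, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		r->buf[r->wptr & RING_MASK] = saved[i];
		r->wptr++;
	}
}
```

A caller would run ring_backup(), then ring_reset(), then ring_reemit() in that order, so work that was queued but not yet consumed survives the reset.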
Alex Deucher [Thu, 29 May 2025 19:26:08 +0000 (15:26 -0400)]
drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 29 May 2025 19:25:05 +0000 (15:25 -0400)]
drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 29 May 2025 17:05:35 +0000 (13:05 -0400)]
drm/amdgpu/jpeg5: add queue reset
Add queue reset support for jpeg 5.0.0.
Use the new helpers to re-emit the unprocessed state
after resetting the queue.
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 5 Jun 2025 22:11:11 +0000 (18:11 -0400)]
drm/amdgpu/jpeg4.0.5: add queue reset
Add queue reset support for jpeg 4.0.5.
Use the new helpers to re-emit the unprocessed state
after resetting the queue.
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 29 May 2025 19:22:36 +0000 (15:22 -0400)]
drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Thu, 29 May 2025 19:22:20 +0000 (15:22 -0400)]
drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>