summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
AgeCommit message (Collapse)AuthorFilesLines
2024-01-16drm/amdgpu: replace MCA macro with ACA for XGMIYang Wang1-12/+12
use new ACA macro to instead of MCA Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-16drm/amdgpu: add xgmi v6.4.0 ACA supportYang Wang1-1/+60
add xgmi v6.4.0 ACA driver support Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-19drm/amdgpu: MCA supports recording umc address informationYiPeng Chai1-2/+2
MCA supports recording umc address information. V2: Move err_addr variable from struct ras_err_node to struct ras_err_info. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-13drm/amdgpu: xgmi_fill_topology_infoVignesh Chander1-8/+50
1. Use the mirrored topology info to fill links for VF. The new solution is required to simplify and optimize host driver logic. Only use the new solution for VFs that support full duplex and extended_peer_link_info otherwise the info would be incomplete. 2. avoid calling extended_link_info on VF as its not supported Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com> Reviewed-by: Zhigang Luo <zhigang.luo@amd.com> Reviewed-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-17drm/amdgpu: expose the connected port num info through sysfsShiwu Zhang1-0/+44
By catting the xgmi_port_num sysfs node, it prints out the info in the format of <src node id>:<src port num> -> <dst node id>:<dst port num> for one xgmi link. For example, in case of 4 sockets fully and evenly connected setup, it would be like as below for the first node in the hive. 01:02 -> 02:03 01:03 -> 02:02 01:07 -> 03:04 01:04 -> 03:07 01:06 -> 04:05 01:05 -> 04:06 Based on the fact that there is two xgmi links between each socket pair, "01:02 -> 02:03" means that the current socket in question use the port 2 to connect with port 3 of the second node in the hive and so on. v2: print out the src/dst node id for each xgmi link (lijo) v3: replace the current_node++ with +1 to align with dst node (le) and use the dev_err instead of pr_err (lijo) v4: fix checkpatch warning (alex) Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com> Acked-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-10drm/amdgpu: add pcs xgmi v6.4.0 ras supportYang Wang1-3/+155
add pcs xgmi v6.4.0 ras support Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-10drm/amdgpu: Don't warn for unsupported set_xgmi_plpd_modeTao Zhou1-6/+8
set_xgmi_plpd_mode may be unsupported and this isn't error, no need to print warning for it. v2: add ret2 to save the status of psp_ras_trigger_error. Suggested-by: lijo.lazar@amd.com Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-07drm/amdgpu: add RAS reset/query operations for XGMI v6_4Tao Zhou1-3/+43
Reset/query RAS error status and count. v2: use XGMI IP version instead of WAFL version. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20drm/amdgpu: replace reset_error_count with amdgpu_ras_reset_error_countTao Zhou1-2/+2
Simplify the code. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-28drm/amd/pm: deprecate allow_xgmi_power_down interfaceLe Ma1-2/+2
Replace with set_plpd_mode uniformly for places to use. Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-27drm/amdgpu:Expose physical id of device in XGMI hiveMangesh Gadre1-0/+20
This identifies the physical ordering of devices in the hive v2: fix compilation issue Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20drm/amdgpu: Use function for IP version checkLijo Lazar1-1/+2
Use an inline function for version check. Gives more flexibility to handle any format changes. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-25drm/amdgpu: Add -ENOMEM error handling when there is no memorySrinivasan Shanmugam1-0/+1
Return -ENOMEM, when there is no sufficient dynamically allocated memory Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15drm/amdgpu: expose num_hops and num_links xgmi info through dev attrShiwu Zhang1-0/+46
Add these two dev attrs for xgmi info details which is helpful for developers checking the xgmi topology by catting the sys file directly. Take 4 cards with xgmi connection as an example, get the num_hops for each device or node through xmig_hive_info dir like, cat /sys/bus/pci/devices/0000:41:00.0/xgmi_hive_info/node1/num_hops will return "00 41 41 41" where "00" stands for the hops to node1 itself and "41" is the hops in hex format to every other node in the same hive. There are node1/node2/node3/node4 representing 4 cards in the hive. The same for num_links dev attr. Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com> Acked-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: add instance mask for RAS injectTao Zhou1-2/+3
User can specify injected instances by the mask. For backward compatibility, the mask value is incorporated into sub block index without interface change of RAS TA. User uses logical mask and driver should convert it to physical value before sending it to RAS TA. v2: update parameter name. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-31drm/amdgpu: Correct xgmi_wafl block nameHawking Zhang1-1/+1
Fix backward compatibility issue to stay with the old name of xgmi_wafl node. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-16drm/amdgpu: Rework xgmi_wafl_pcs ras sw_initHawking Zhang1-5/+23
To align with other IP blocks. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-02-24drm/amdgpu: make kobj_type structures constantThomas Weißschuh1-1/+1
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-18drm/amdgpu: correct query xgmi3x16 pcs error statusStanley.Yang1-3/+69
There is xgmi3x16 pcs error status for aldebaran, driver should check xgmi3x16 pcs error status field instead of gopx16 pcs error status field. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-18drm/amdgpu: support check xgmi/walf error mask bit for aldebaranStanley.Yang1-38/+64
The pcs error count should be determined by PCS ERROR status and PCS ERROR MASK registers, only PCS ERROR status register can not refect error counts accurately. Changed from V1: remove clean noncorrectable mask registers optimize query pcs error status Changed from V2: remove check mask_value bits correct set value corresponding bit Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-29drm/amdgpu: Fix potential double free and null pointer dereferenceLiang He1-2/+0
In amdgpu_get_xgmi_hive(), we should not call kfree() after kobject_put() as the PUT will call kfree(). In amdgpu_device_ip_init(), we need to check the returned *hive* which can be NULL before we dereference it. Signed-off-by: Liang He <windhl@126.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-19drm/amd/display: clean up some inconsistent indentingsYang Li1-8/+8
clean up some inconsistent indentings Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2178 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13drm/amdgpu: Use per device reset_domain for XGMI on sriov configurationshaoyunl1-14/+23
For SRIOV configuration, host driver control the reset method(either FLR or heavier chain reset). The host will notify the guest individually with FLR message if individual GPU within the hive need to be reset. So for guest side, no need to use hive->reset_domain to replace the original per device reset_domain Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-25drm/amdgpu: skip set_topology_info for VFVignesh Chander1-0/+3
Skip set_topology_info as xgmi TA will now block it and host needs to program it. Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com> Reviewed-By : Shaoyun Liu <Shaoyun.Liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-22drm/amdgpu: Move psp_xgmi_terminate call from amdgpu_xgmi_remove_device to ↵YiPeng Chai1-1/+1
psp_hw_fini V1: The amdgpu_xgmi_remove_device function will send unload command to psp through psp ring to terminate xgmi, but psp ring has been destroyed in psp_hw_fini. V2: 1. Change the commit title. 2. Restore amdgpu_xgmi_remove_device to its original calling location. Move psp_xgmi_terminate call from amdgpu_xgmi_remove_device to psp_hw_fini. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-15drm/amdgpu: drop xmgi23 error query/reset supportHawking Zhang1-22/+0
xgmi_ras is only initialized when host to GPU interface is PCIE. in such case, xgmi23 is disabled and protected by security firmware. Host access will results to security violation Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-03drm/amdgpu: Remove redundant .ras_fini initialization in some ras blocksyipechai1-1/+0
1. Define amdgpu_ras_block_late_fini_default in amdgpu_ras.c as .ras_fini common function, which is called when .ras_fini of ras block isn't initialized. 2. Remove the code of using amdgpu_ras_block_late_fini to initialize .ras_fini in ras blocks. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-03drm/amdgpu: Remove redundant calls of amdgpu_ras_block_late_fini in xgmi ras ↵yipechai1-8/+1
block Remove redundant calls of amdgpu_ras_block_late_fini in xgmi ras block. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-03drm/amdgpu: Optimize xxx_ras_fini function of each ras blockyipechai1-2/+2
1. Move the variables of ras block instance members from specific xxx_ras_fini to general ras_fini call. 2. Function calls inside the modules only use parameters passed from xxx_ras_fini instead of ras block instance members. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-03drm/amdgpu: Modify .ras_fini function pointer parameteryipechai1-1/+1
Modify .ras_fini function pointer parameter so that we can remove redundant intermediate calls in some ras blocks. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-24Merge tag 'drm-misc-next-2022-02-23' of ↵Dave Airlie1-1/+26
git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for v5.18: UAPI Changes: Cross-subsystem Changes: - Split out panel-lvds and lvds dt bindings . - Put yes/no on/off disabled/enabled strings in linux/string_helpers.h and use it in drivers and tomoyo. - Clarify dma_fence_chain and dma_fence_array should never include eachother. - Flatten chains in syncobj's. - Don't double add in fbdev/defio when page is already enlisted. - Don't sort deferred-I/O pages by default in fbdev. Core Changes: - Fix missing pm_runtime_put_sync in bridge. - Set modifier support to only linear fb modifier if drivers don't advertise support. - As a result, we remove allow_fb_modifiers. - Add missing clear for EDID Deep Color Modes in drm_reset_display_info. - Assorted documentation updates. - Warn once in drm_clflush if there is no arch support. - Add missing select for dp helper in drm_panel_edp. - Assorted small fixes. - Improve fb-helper's clipping handling. - Don't dump shmem mmaps in a core dump. - Add accounting to ttm resource manager, and use it in amdgpu. - Allow querying the detected eDP panel through debugfs. - Add helpers for xrgb8888 to 8 and 1 bits gray. - Improve drm's buddy allocator. - Add selftests for the buddy allocator. Driver Changes: - Add support for nomodeset to a lot of drm drivers. - Use drm_module_*_driver in a lot of drm drivers. - Assorted small fixes to bridge/lt9611, v3d, vc4, vmwgfx, mxsfb, nouveau, bridge/dw-hdmi, panfrost, lima, ingenic, sprd, bridge/anx7625, ti-sn65dsi86. - Add bridge/it6505. - Create DP and DVI-I connectors in ast. - Assorted nouveau backlight fixes. - Rework amdgpu reset handling. - Add dt bindings for ingenic,jz4780-dw-hdmi. - Support reading edid through aux channel in ingenic. - Add a drm driver for Solomon SSD130x OLED displays. - Add simple support for sharp LQ140M1JW46. - Add more panels to nt35560. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/686ec871-e77f-c230-22e5-9e3bb80f064a@linux.intel.com
2022-02-17drm/amdgpu: Optimize xxx_ras_late_init function of each ras blockyipechai1-1/+1
1. Move calling ras block instance members from module internal function to the top calling xxx_ras_late_init. 2. Module internal function calls can only use parameter variables of xxx_ras_late_init instead of ras block instance members. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-17drm/amdgpu: Modify .ras_late_init function pointer parameteryipechai1-1/+1
Modify .ras_late_init function pointer parameter so that it can remove redundant intermediate calls in some ras blocks. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14drm/amdgpu: Optimize amdgpu_xgmi_ras_late_init/amdgpu_xgmi_ras_fini function ↵yipechai1-35/+5
code Optimize amdgpu_xgmi_ras_late_init/amdgpu_xgmi_ras_fini function code. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14drm/amdgpu: Optimize xxx_ras_late_init/xxx_ras_late_fini for each ras blockyipechai1-2/+4
1. Define amdgpu_ras_block_late_init to create sysfs nodes and interrupt handles. 2. Define amdgpu_ras_block_late_fini to remove sysfs nodes and interrupt handles. 3. Replace ras block variable members in struct amdgpu_ras_block_object with struct ras_common_if, which can make it easy to associate each ras block instance with each ras block functional interface. 4. Add .ras_cb to struct amdgpu_ras_block_object. 5. Change each ras block to fit for the changement of struct amdgpu_ras_block_object. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09drm/amdgpu: Rework reset domain to be refcounted.Andrey Grodzovsky1-6/+23
The reset domain contains register access semaphor now and so needs to be present as long as each device in a hive needs it and so it cannot be binded to XGMI hive life cycle. Adress this by making reset domain refcounted and pointed by each member of the hive and the hive itself. v4: Fix crash on boot witrh XGMI hive by adding type to reset_domain. XGMI will only create a new reset_domain if prevoius was of single device type meaning it's first boot. Otherwsie it will take a refocunt to exsiting reset_domain from the amdgou device. Add a wrapper around reset_domain->refcount get/put and a wrapper around send to reset wq (Lijo) Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74121.html
2022-02-09drm/amdgpu: Drop hive->in_resetAndrey Grodzovsky1-1/+0
Since we serialize all resets no need to protect from concurrent resets. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74115.html
2022-02-09drm/amdgpu: Introduce reset domainAndrey Grodzovsky1-0/+9
Defined a reset_domain struct such that all the entities that go through reset together will be serialized one against another. Do it for both single device and XGMI hive cases. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch> Suggested-by: Christian König <ckoenig.leichtzumerken@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74111.html
2022-01-19drm/amdgpu: Fix the code style warnings in hdp xgmi mca and umcyipechai1-1/+2
drm/amdgpu: Fix the code style warnings in hdp xgmi mca and umc: 1. WARNING: missing space after struct definition. 2. WARNING: please, no space before tabs. 3. WARNING: line length of xxx exceeds 100 columns. 4. ERROR: "foo* bar" should be "foo *bar". 5. ERROR: space required before the open parenthesis '('. 6. ERROR: space prohibited after that open parenthesis '('. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-15drm/amdgpu: Adjust error inject function code style in amdgpu_ras.cyipechai1-0/+27
1. Move xgmi special error inject function from amdgpu_ras.c to xgmi block. 2. Support to use psp_ras_trigger_error as default error inject function in amdgpu_ras.c. If .ras_error_inject isn't defined in ras block, default error inject function will take effect. v2: squash in warning fix (Alex) Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-15drm/amdgpu: Modify xgmi block to fit for the unified ras block data and opsyipechai1-10/+16
1.Modify gmc block to fit for the unified ras block data and ops. 2.Change amdgpu_xgmi_ras_funcs to amdgpu_xgmi_ras, and the corresponding variable name remove _funcs suffix. 3.Remove the const flag of gmc ras variable so that gmc ras block can be able to be inserted into amdgpu device ras block link list. 4.Invoke amdgpu_ras_register_ras_block function to register gmc ras block into amdgpu device ras block link list. 5.Remove the redundant code about gmc in amdgpu_ras.c after using the unified ras block. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11drm/amdgpu: use default_groups in kobj_typeGreg Kroah-Hartman1-1/+2
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the amdgpu sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jonathan Kim <jonathan.kim@amd.com> Cc: Kevin Wang <kevin1.wang@amd.com> Cc: shaoyunl <shaoyun.liu@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-14drm/amdgpu: check df_funcs and its callback pointersHawking Zhang1-0/+5
in case they are not avaiable in early phase Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Le Ma <Le.Ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22drm/amd/amdgpu: fix potential memleakBernard Zhao1-0/+1
In function amdgpu_get_xgmi_hive, when kobject_init_and_add failed There is a potential memleak if not call kobject_put. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Bernard Zhao <bernard@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-05drm/amdgpu: correct xgmi ras error count resetTao Zhou1-2/+2
The error count reset for xgmi3x16 pcs is missed. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-26drm/amdgpu: Add support for RAS XGMI err queryJohn Clements1-0/+65
Update XGMI RAS to support error query on aldebaran Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-24drm/amdgpu: Update RAS XGMI Error QueryJohn Clements1-1/+3
Resolve bug querying error on unsupported ASIC Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-19drm/amdgpu: get extended xgmi topology dataJonathan Kim1-2/+56
The TA has a limit to the amount of data that can be retrieved from GET_TOPOLOGY. For setups that exceed this limit, the xGMI topology needs to be re-initialized and data needs to be re-fetched from the extended link records by setting a flag in the shared command buffer. The number of hops and the number of links must be accumulated by the driver. Other data points are all fetched from the first request. Because the TA has already exceeded its link record limit, it cannot hold bidirectional information. Otherwise the driver would have to do more than two fetches so the driver has to reflect the topology information in the opposite direction. v2: squashed with internal reviewed fix Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-16drm/amd/amdgpu: remove unnecessary RAS context fieldCandice Li1-1/+0
Delete ras_if->name in the RAS ctx structure and remove related lines. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23drm/amdkfd: report xgmi bandwidth between direct peers to the kfdJonathan Kim1-0/+12
Report the min/max bandwidth in megabytes to the kfd for direct xgmi connections only. Indirect peers will report 0 since indirect route is unknown. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>