summaryrefslogtreecommitdiff
path: root/drivers/nvme
AgeCommit message (Collapse)AuthorFilesLines
2022-04-27nvme-pci: disable namespace identifiers for Qemu controllersChristoph Hellwig1-1/+4
[ Upstream commit 66dd346b84d79fde20832ed691a54f4881eac20d ] Qemu unconditionally reports a UUID, which depending on the qemu version is either all-null (which is incorrect but harmless) or contains a single bit set for all controllers. In addition it can also optionally report a eui64 which needs to be manually set. Disable namespace identifiers for Qemu controlles entirely even if in some cases they could be set correctly through manual intervention. Reported-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-27nvme-pci: disable namespace identifiers for the MAXIO MAP1002/1202Christoph Hellwig1-0/+4
[ Upstream commit a98a945b80f8684121d477ae68ebc01da953da1f ] The MAXIO MAP1002/1202 controllers reports completely bogus Namespace identifiers that even change after suspend cycles. Disable using the Identifiers entirely. Reported-by: 金韬 <me@kingtous.cn> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Tested-by: 金韬 <me@kingtous.cn> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-27nvme: add a quirk to disable namespace identifiersChristoph Hellwig2-6/+23
[ Upstream commit 00ff400e6deee00f7b15e200205b2708b63b8cf6 ] Add a quirk to disable using and exporting namespace identifiers for controllers where they are broken beyond repair. The most directly visible problem with non-unique namespace identifiers is that they break the /dev/disk/by-id/ links, with the link for a supposedly unique identifier now pointing to one of multiple possible namespaces that share the same ID, and a somewhat random selection of which one actually shows up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08nvme: fix the read-only state for zoned namespaces with unsupposed featuresPankaj Raghav1-3/+5
commit 726be2c72efc0a64c206e854b8996ad3ab9c7507 upstream. commit 2f4c9ba23b88 ("nvme: export zoned namespaces without Zone Append support read-only") marks zoned namespaces without append support read-only. It does iso by setting NVME_NS_FORCE_RO in ns->flags in nvme_update_zone_info and checking for that flag later in nvme_update_disk_info to mark the disk as read-only. But commit 73d90386b559 ("nvme: cleanup zone information initialization") rearranged nvme_update_disk_info to be called before nvme_update_zone_info and thus not marking the disk as read-only. The call order cannot be just reverted because nvme_update_zone_info sets certain queue parameters such as zone_write_granularity that depend on the prior call to nvme_update_disk_info. Remove the call to set_disk_ro in nvme_update_disk_info. and call set_disk_ro after nvme_update_zone_info and nvme_update_disk_info to set the permission for ZNS drives correctly. The same applies to the multipath disk path. Fixes: 73d90386b559 ("nvme: cleanup zone information initialization") Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08nvme: allow duplicate NSIDs for private namespacesSungup Moon3-8/+33
commit 5974ea7ce0f9a5987fc8cf5e08ad6e3e70bb542e upstream. A NVMe subsystem with multiple controller can have private namespaces that use the same NSID under some conditions: "If Namespace Management, ANA Reporting, or NVM Sets are supported, the NSIDs shall be unique within the NVM subsystem. If the Namespace Management, ANA Reporting, and NVM Sets are not supported, then NSIDs: a) for shared namespace shall be unique; and b) for private namespace are not required to be unique." Reference: Section 6.1.6 NSID and Namespace Usage; NVM Express 1.4c spec. Make sure this specific setup is supported in Linux. Fixes: 9ad1927a3bc2 ("nvme: always search for namespace head") Signed-off-by: Sungup Moon <sungup.moon@samsung.com> [hch: refactored and fixed the controller vs subsystem based naming conflict] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08nvme-tcp: lockdep: annotate in-kernel socketsChris Leech1-0/+40
[ Upstream commit 841aee4d75f18fdfb53935080b03de0c65e9b92c ] Put NVMe/TCP sockets in their own class to avoid some lockdep warnings. Sockets created by nvme-tcp are not exposed to user-space, and will not trigger certain code paths that the general socket API exposes. Lockdep complains about a circular dependency between the socket and filesystem locks, because setsockopt can trigger a page fault with a socket lock held, but nvme-tcp sends requests on the socket while file system locks are held. ====================================================== WARNING: possible circular locking dependency detected 5.15.0-rc3 #1 Not tainted ------------------------------------------------------ fio/1496 is trying to acquire lock: (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendpage+0x23/0x80 but task is already holding lock: (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs] which lock already depends on the new lock. other info that might help us debug this: chain exists of: sk_lock-AF_INET --> sb_internal --> &xfs_dir_ilock_class/5 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&xfs_dir_ilock_class/5); lock(sb_internal); lock(&xfs_dir_ilock_class/5); lock(sk_lock-AF_INET); *** DEADLOCK *** 6 locks held by fio/1496: #0: (sb_writers#13){.+.+}-{0:0}, at: path_openat+0x9fc/0xa20 #1: (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{3:3}, at: path_openat+0x296/0xa20 #2: (sb_internal){.+.+}-{0:0}, at: xfs_trans_alloc_icreate+0x41/0xd0 [xfs] #3: (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs] #4: (hctx->srcu){....}-{0:0}, at: hctx_lock+0x51/0xd0 #5: (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0x33e/0x380 [nvme_tcp] This annotation lets lockdep analyze nvme-tcp controlled sockets independently of what the user-space sockets API does. Link: https://lore.kernel.org/linux-nvme/CAHj4cs9MDYLJ+q+2_GXUK9HxFizv2pxUryUR0toX974M040z7g@mail.gmail.com/ Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08nvme: fix the check for duplicate unique identifiersChristoph Hellwig1-8/+10
[ Upstream commit e2724cb9f0c406b8fb66efd3aa9e8b3edfd8d5c8 ] nvme_subsys_check_duplicate_ids should needs to return an error if any of the identifiers matches, not just if all of them match. But it does not need to and should not look at the CSI value for this sanity check. Rewrite the logic to be separate from nvme_ns_ids_equal and optimize it by reducing duplicate checks for non-present identifiers. Fixes: ed754e5deeb1 ("nvme: track shared namespaces") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08nvme: cleanup __nvme_check_idsChristoph Hellwig1-5/+4
[ Upstream commit fd8099e7918cd2df39ef306dd1d1af7178a15b81 ] Pass the actual nvme_ns_ids used for the comparison instead of the ns_head that isn't needed and use a more descriptive function name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-02nvme: also mark passthrough-only namespaces ready in nvme_update_ns_infoChristoph Hellwig1-3/+3
commit 602e57c9799c19f27e440639deed3ec45cfe1651 upstream. Commit e7d65803e2bb ("nvme-multipath: revalidate paths during rescan") introduced the NVME_NS_READY flag, which nvme_path_is_disabled() uses to check if a path can be used or not. We also need to set this flag for devices that fail the ZNS feature validation and which are available through passthrough devices only to that they can be used in multipathing setups. Fixes: e7d65803e2bb ("nvme-multipath: revalidate paths during rescan") Reported-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Daniel Wagner <dwagner@suse.de> Tested-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23block: fix surprise removal for drivers calling blk_set_queue_dyingChristoph Hellwig2-2/+2
commit 7a5428dcb7902700b830e912feee4e845df7c019 upstream. Various block drivers call blk_set_queue_dying to mark a disk as dead due to surprise removal events, but since commit 8e141f9eb803 that doesn't work given that the GD_DEAD flag needs to be set to stop I/O. Replace the driver calls to blk_set_queue_dying with a new (and properly documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the only remaining caller. Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk") Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23nvme-rdma: fix possible use-after-free in transport error_recovery workSagi Grimberg1-0/+1
[ Upstream commit b6bb1722f34bbdbabed27acdceaf585d300c5fd2 ] While nvme_rdma_submit_async_event_work is checking the ctrl and queue state before preparing the AER command and scheduling io_work, in order to fully prevent a race where this check is not reliable the error recovery work must flush async_event_work before continuing to destroy the admin queue after setting the ctrl state to RESETTING such that there is no race .submit_async_event and the error recovery handler itself changing the ctrl state. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-23nvme-tcp: fix possible use-after-free in transport error_recovery workSagi Grimberg1-0/+1
[ Upstream commit ff9fc7ebf5c06de1ef72a69f9b1ab40af8b07f9e ] While nvme_tcp_submit_async_event_work is checking the ctrl and queue state before preparing the AER command and scheduling io_work, in order to fully prevent a race where this check is not reliable the error recovery work must flush async_event_work before continuing to destroy the admin queue after setting the ctrl state to RESETTING such that there is no race .submit_async_event and the error recovery handler itself changing the ctrl state. Tested-by: Chris Leech <cleech@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-23nvme: fix a possible use-after-free in controller reset during loadSagi Grimberg1-1/+8
[ Upstream commit 0fa0f99fc84e41057cbdd2efbfe91c6b2f47dd9d ] Unlike .queue_rq, in .submit_async_event drivers may not check the ctrl readiness for AER submission. This may lead to a use-after-free condition that was observed with nvme-tcp. The race condition may happen in the following scenario: 1. driver executes its reset_ctrl_work 2. -> nvme_stop_ctrl - flushes ctrl async_event_work 3. ctrl sends AEN which is received by the host, which in turn schedules AEN handling 4. teardown admin queue (which releases the queue socket) 5. AEN processed, submits another AER, calling the driver to submit 6. driver attempts to send the cmd ==> use-after-free In order to fix that, add ctrl state check to validate the ctrl is actually able to accept the AER submission. This addresses the above race in controller resets because the driver during teardown should: 1. change ctrl state to RESETTING 2. flush async_event_work (as well as other async work elements) So after 1,2, any other AER command will find the ctrl state to be RESETTING and bail out without submitting the AER. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-16nvme-tcp: fix bogus request completion when failing to send AERSagi Grimberg1-1/+9
commit 63573807b27e0faf8065a28b1bbe1cbfb23c0130 upstream. AER is not backed by a real request, hence we should not incorrectly assume that when failing to send a nvme command, it is a normal request but rather check if this is an aer and if so complete the aer (similar to the normal completion path). Cc: stable@vger.kernel.org Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-16nvme-pci: add the IGNORE_DEV_SUBNQN quirk for Intel P4500/P4600 SSDsWu Zheng1-1/+2
[ Upstream commit 25e58af4be412d59e056da65cc1cefbd89185bd2 ] The Intel P4500/P4600 SSDs do not report a subsystem NQN despite claiming compliance to a standards version where reporting one is required. Add the IGNORE_DEV_SUBNQN quirk to not fail the initialization of a second such SSDs in a system. Signed-off-by: Zheng Wu <wu.zheng@intel.com> Signed-off-by: Ye Jinhe <jinhe.ye@intel.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-08nvme-fabrics: fix state check in nvmf_ctlr_matches_baseopts()Uday Shankar1-0/+1
commit 6a51abdeb259a56d95f13cc67e3a0838bcda0377 upstream. Controller deletion/reset, immediately followed by or concurrent with a reconnect, is hard failing the connect attempt resulting in a complete loss of connectivity to the controller. In the connect request, fabrics looks for an existing controller with the same address components and aborts the connect if a controller already exists and the duplicate connect option isn't set. The match routine filters out controllers that are dead or dying, so they don't interfere with the new connect request. When NVME_CTRL_DELETING_NOIO was added, it missed updating the state filters in the nvmf_ctlr_matches_baseopts() routine. Thus, when in this new state, it's seen as a live controller and fails the connect request. Correct by adding the DELETING_NIO state to the match checks. Fixes: ecca390e8056 ("nvme: fix deadlock in disconnect during scan_work and/or ana_work") Cc: <stable@vger.kernel.org> # v5.7+ Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-01nvmet: use IOCB_NOWAIT only if the filesystem supports itMaurizio Lombardi1-1/+3
[ Upstream commit c024b226a417c4eb9353ff500b1c823165d4d508 ] Submit I/O requests with the IOCB_NOWAIT flag set only if the underlying filesystem supports it. Fixes: 50a909db36f2 ("nvmet: use IOCB_NOWAIT for file-ns buffered I/O") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-01nvmet-tcp: fix incomplete data digest sendVarun Prakash1-1/+6
[ Upstream commit 102110efdff6beedece6ab9b51664c32ac01e2db ] Current nvmet_try_send_ddgst() code does not check whether all data digest bytes are transmitted, fix this by returning -EAGAIN if all data digest bytes are not transmitted. Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver") Signed-off-by: Varun Prakash <varun@chelsio.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18nvme-rdma: fix error code in nvme_rdma_setup_ctrlMax Gurtovoy1-0/+2
[ Upstream commit 09748122009aed7bfaa7acc33c10c083a4758322 ] In case that icdoff is not zero or mandatory keyed sgls are not supported by the NVMe/RDMA target, we'll go to error flow but we'll return 0 to the caller. Fix it by returning an appropriate error code. Fixes: c66e2998c8ca ("nvme-rdma: centralize controller setup sequence") Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18nvme: drop scan_lock and always kick requeue list when removing namespacesHannes Reinecke1-5/+4
[ Upstream commit 2b81a5f015199f3d585ce710190a9e87714d3c1e ] When reading the partition table on initial scan hits an I/O error the I/O will hang with the scan_mutex held: [<0>] do_read_cache_page+0x49b/0x790 [<0>] read_part_sector+0x39/0xe0 [<0>] read_lba+0xf9/0x1d0 [<0>] efi_partition+0xf1/0x7f0 [<0>] bdev_disk_changed+0x1ee/0x550 [<0>] blkdev_get_whole+0x81/0x90 [<0>] blkdev_get_by_dev+0x128/0x2e0 [<0>] device_add_disk+0x377/0x3c0 [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core] [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core] [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core] [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core] [<0>] nvme_scan_work+0x168/0x310 [nvme_core] [<0>] process_one_work+0x231/0x420 and trying to delete the controller will deadlock as it tries to grab the scan mutex: [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core] [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core] [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core] As we're now properly ordering the namespace list there is no need to hold the scan_mutex in nvme_mpath_clear_ctrl_paths() anymore. And we always need to kick the requeue list as the path will be marked as unusable and I/O will be requeued _without_ a current path. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18nvmet-tcp: fix use-after-free when a port is removedIsrael Rukshin1-0/+16
[ Upstream commit 2351ead99ce9164fb42555aee3f96af84c4839e9 ] When removing a port, all its controllers are being removed, but there are queues on the port that doesn't belong to any controller (during connection time). This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_alloc_ctrl). Those queues should be destroyed before freeing the port via configfs. Destroy the remaining queues after the accept_work was cancelled guarantees that no new queue will be created. Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18nvmet-rdma: fix use-after-free when a port is removedIsrael Rukshin1-0/+24
[ Upstream commit fcf73a804c7d6bbf0ea63531c6122aa363852e04 ] When removing a port, all its controllers are being removed, but there are queues on the port that doesn't belong to any controller (during connection time). This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_alloc_ctrl). Those queues should be destroyed before freeing the port via configfs. Destroy the remaining queues after the RDMA-CM was destroyed guarantees that no new queue will be created. Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18nvmet: fix use-after-free when a port is removedIsrael Rukshin1-0/+2
[ Upstream commit e3e19dcc4c416d65f99f13d55be2b787f8d0050e ] When a port is removed through configfs, any connected controllers are starting teardown flow asynchronously and can still send commands. This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_parse_io_cmd). To fix this, wait for all the teardown scheduled works to complete (like release_work at rdma/tcp drivers). This ensures there are no active controllers when the port is eventually removed. Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-27nvmet-tcp: fix header digest verificationAmit Engel1-1/+1
Pass the correct length to nvmet_tcp_verify_hdgst, which is the pdu header length. This fixes a wrong behaviour where header digest verification passes although the digest is wrong. Signed-off-by: Amit Engel <amit.engel@dell.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-27nvmet-tcp: fix data digest pointer calculationVarun Prakash1-1/+1
exp_ddgst is of type __le32, &cmd->exp_ddgst + cmd->offset increases &cmd->exp_ddgst by 4 * cmd->offset, fix this by type casting &cmd->exp_ddgst to u8 *. Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver") Signed-off-by: Varun Prakash <varun@chelsio.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-27nvme-tcp: fix data digest pointer calculationVarun Prakash1-1/+1
ddgst is of type __le32, &req->ddgst + req->offset increases &req->ddgst by 4 * req->offset, fix this by type casting &req->ddgst to u8 *. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Varun Prakash <varun@chelsio.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-27nvme-tcp: fix possible req->offset corruptionVarun Prakash1-1/+2
With commit db5ad6b7f8cd ("nvme-tcp: try to send request in queue_rq context") r2t and response PDU can get processed while send function is executing. Current data digest send code uses req->offset after kernel_sendmsg(), this creates a race condition where req->offset gets reset before it is used in send function. This can happen in two cases - 1. Target sends r2t PDU which resets req->offset. 2. Target send response PDU which completes the req and then req is used for a new command, nvme_tcp_setup_cmd_pdu() resets req->offset. Fix this by storing req->offset in a local variable and using this local variable after kernel_sendmsg(). Fixes: db5ad6b7f8cd ("nvme-tcp: try to send request in queue_rq context") Signed-off-by: Varun Prakash <varun@chelsio.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-26nvme-tcp: fix H2CData PDU send accounting (again)Sagi Grimberg1-1/+3
We should not access request members after the last send, even to determine if indeed it was the last data payload send. The reason is that a completion could have arrived and trigger a new execution of the request which overridden these members. This was fixed by commit 825619b09ad3 ("nvme-tcp: fix possible use-after-completion"). Commit e371af033c56 broke that assumption again to address cases where multiple r2t pdus are sent per request. To fix it, we need to record the request data_sent and data_len and after the payload network send we reference these counters to determine weather we should advance the request iterator. Fixes: e371af033c56 ("nvme-tcp: fix incorrect h2cdata pdu offset accounting") Reported-by: Keith Busch <kbusch@kernel.org> Cc: stable@vger.kernel.org # 5.10+ Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-26nvmet-tcp: fix a memory leak when releasing a queueMaurizio Lombardi1-0/+3
page_frag_free() won't completely release the memory allocated for the commands, the cache page must be explicitly freed by calling __page_frag_cache_drain(). This bug can be easily reproduced by repeatedly executing the following command on the initiator: $echo 1 > /sys/devices/virtual/nvme-fabrics/ctl/nvme0/reset_controller Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: John Meneghini <jmeneghi@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-14Merge tag 'nvme-5.15-2021-10-14' of git://git.infradead.org/nvme into block-5.15Jens Axboe3-12/+13
Pull NVMe fixes from Christoph: "nvme fixes for Linux 5.15: - fix the abort command id (Keith Busch) - nvme: fix per-namespace chardev deletion (Adam Manzanares)" * tag 'nvme-5.15-2021-10-14' of git://git.infradead.org/nvme: nvme: fix per-namespace chardev deletion nvme-pci: Fix abort command id
2021-10-14nvme: fix per-namespace chardev deletionAdam Manzanares2-11/+12
Decrease reference count of chardevice during char device deletion in order to fix a memory leak. Add a release callabck for the device associated chardev and move ida_simple_remove into the release function. Fixes: 2637baed7801 ("nvme: introduce generic per-namespace chardev") Reported-by: Yi Zhang <yi.zhang@redhat.com> Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Adam Manzanares <a.manzanares@samsung.com> Reviewed-by: Javier González <javier@javigon.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-07nvme-pci: Fix abort command idKeith Busch1-1/+1
The request tag is no longer the only component of the command id. Fixes: e7006de6c2380 ("nvme: code command_id with a genctr for use-after-free validation") Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2021-09-27nvme: add command id quirk for apple controllersKeith Busch3-2/+11
Some apple controllers use the command id as an index to implementation specific data structures and will fail if the value is out of bounds. The nvme driver's recently introduced command sequence number breaks this controller. Provide a quirk so these spec incompliant controllers can function as before. The driver will not have the ability to detect bad completions when this quirk is used, but we weren't previously checking this anyway. The quirk bit was selected so that it can readily apply to stable. Link: https://bugzilla.kernel.org/show_bug.cgi?id=214509 Cc: Sven Peter <sven@svenpeter.dev> Reported-by: Orlando Chamberlain <redecorating@protonmail.com> Reported-by: Aditya Garg <gargaditya08@live.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Sven Peter <sven@svenpeter.dev> Link: https://lore.kernel.org/r/20210927154306.387437-1-kbusch@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-21nvme: keep ctrl->namespaces orderedChristoph Hellwig1-16/+17
Various places in the nvme code that rely on ctrl->namespace to be ordered. Ensure that the namespae is inserted into the list at the right position from the start instead of sorting it after the fact. Fixes: 540c801c65eb ("NVMe: Implement namespace list scanning") Reported-by: Anton Eidelman <anton.eidelman@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-09-21nvme-tcp: fix incorrect h2cdata pdu offset accountingSagi Grimberg1-3/+10
When the controller sends us multiple r2t PDUs in a single request we need to account for it correctly as our send/recv context run concurrently (i.e. we get a new r2t with r2t_offset before we updated our iterator and req->data_sent marker). This can cause wrong offsets to be sent to the controller. To fix that, we will first know that this may happen only in the send sequence of the last page, hence we will take the r2t_offset to the h2c PDU data_offset, and in nvme_tcp_try_send_data loop, we make sure to increment the request markers also when we completed a PDU but we are expecting more r2t PDUs as we still did not send the entire data of the request. Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion") Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com> Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-21nvme-fc: remove freeze/unfreeze around update_nr_hw_queuesJames Smart1-2/+0
Remove the freeze/unfreeze around changes to the number of hardware queues. Study and retest has indicated there are no ios that can be active at this point so there is nothing to freeze. nvme-fc is draining the queues in the shutdown and error recovery path in __nvme_fc_abort_outstanding_ios. This patch primarily reverts 88e837ed0f1f "nvme-fc: wait for queues to freeze before calling update_hr_hw_queues". It's not an exact revert as it leaves the adjusting of hw queues only if the count changes. Signed-off-by: James Smart <jsmart2021@gmail.com> [dwagner: added explanation why no IO is pending] Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-21nvme-fc: avoid race between time out and tear downJames Smart1-0/+2
To avoid race between time out and tear down, in tear down process, first we quiesce the queue, and then delete the timer and cancel the time out work for the queue. This patch merges the admin and io sync ops into the queue teardown logic as shown in the RDMA patch 3017013dcc "nvme-rdma: avoid race between time out and tear down". There is no teardown_lock in nvme-fc. Signed-off-by: James Smart <jsmart2021@gmail.com> Tested-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-21nvme-fc: update hardware queues before using themDaniel Wagner1-8/+8
In case the number of hardware queues changes, we need to update the tagset and the mapping of ctx to hctx first. If we try to create and connect the I/O queues first, this operation will fail (target will reject the connect call due to the wrong number of queues) and hence we bail out of the recreate function. Then we will to try the very same operation again, thus we don't make any progress. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-15Merge tag 'nvme-5.15-2021-09-15' of git://git.infradead.org/nvme into block-5.15Jens Axboe5-34/+26
Pull NVMe fixes from Christoph: "nvme fixes for Linux 5.15 - fix ANA state updates when a namespace is not present (Anton Eidelman) - nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show (Dan Carpenter) - avoid race in shutdown namespace removal (Daniel Wagner) - fix io_work priority inversion in nvme-tcp (Keith Busch) - destroy cm id before destroy qp to avoid use after free (Ruozhu Li)" * tag 'nvme-5.15-2021-09-15' of git://git.infradead.org/nvme: nvme-tcp: fix io_work priority inversion nvme-rdma: destroy cm id before destroy qp to avoid use after free nvme-multipath: fix ANA state updates when a namespace is not present nvme: avoid race in shutdown namespace removal nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show()
2021-09-15nvme: remove the call to nvme_update_disk_info in nvme_ns_removeChristoph Hellwig1-2/+0
There is no need to explicitly unregister the integrity profile when deleting the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20210914070657.87677-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14nvme-tcp: fix io_work priority inversionKeith Busch1-10/+10
Dispatching requests inline with the .queue_rq() call may block while holding the send_mutex. If the tcp io_work also happens to schedule, it may see the req_list is non-empty, leaving "pending" true and remaining in TASK_RUNNING. Since io_work is of higher scheduling priority, the .queue_rq task may not get a chance to run, blocking forward progress and leading to io timeouts. Instead of checking for pending requests within io_work, let the queueing restart io_work outside the send_mutex lock if there is more work to be done. Fixes: a0fdd1418007f ("nvme-tcp: rerun io_work if req_list is not empty") Reported-by: Samuel Jones <sjones@kalrayinc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-14nvme-rdma: destroy cm id before destroy qp to avoid use after freeRuozhu Li1-13/+3
We should always destroy cm_id before destroy qp to avoid to get cma event after qp was destroyed, which may lead to use after free. In RDMA connection establishment error flow, don't destroy qp in cm event handler.Just report cm_error to upper level, qp will be destroy in nvme_rdma_alloc_queue() after destroy cm id. Signed-off-by: Ruozhu Li <liruozhu@huawei.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-14nvme-multipath: fix ANA state updates when a namespace is not presentAnton Eidelman1-2/+5
nvme_update_ana_state() has a deficiency that results in a failure to properly update the ana state for a namespace in the following case: NSIDs in ctrl->namespaces: 1, 3, 4 NSIDs in desc->nsids: 1, 2, 3, 4 Loop iteration 0: ns index = 0, n = 0, ns->head->ns_id = 1, nsid = 1, MATCH. Loop iteration 1: ns index = 1, n = 1, ns->head->ns_id = 3, nsid = 2, NO MATCH. Loop iteration 2: ns index = 2, n = 2, ns->head->ns_id = 4, nsid = 4, MATCH. Where the update to the ANA state of NSID 3 is missed. To fix this increment n and retry the update with the same ns when ns->head->ns_id is higher than nsid, Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-09-13nvme: avoid race in shutdown namespace removalDaniel Wagner1-8/+7
When we remove the siblings entry, we update ns->head->list, hence we can't separate the removal and test for being empty. They have to be in the same critical section to avoid a race. To avoid breaking the refcounting imbalance again, add a list empty check to nvme_find_ns_head. Fixes: 5396fdac56d8 ("nvme: fix refcounting imbalance when all paths are down") Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-13nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show()Dan Carpenter1-1/+1
This was intended to limit the number of characters printed from "subsys->serial" to NVMET_SN_MAX_SIZE. But accidentally the width specifier was used instead of the precision specifier so it only affects the alignment and not the number of characters printed. Fixes: f04064814c2a ("nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06nvme: add error handling support for add_disk()Luis Chamberlain1-1/+8
We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06nvme: only call synchronize_srcu when clearing current pathDaniel Wagner1-3/+6
The function nmve_mpath_clear_current_path returns true if the current path has changed. In this case we have to wait for all concurrent submissions to finish. But if we didn't change the current path, there is no point in waiting for another RCU period to finish. Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06nvme: update keep alive interval when kato is modifiedTatsuya Sasaki1-2/+40
Currently the connection between host and NVMe-oF target gets disconnected by keep-alive timeout when a user connects to a target with a relatively large kato value and then sets the smaller kato with a set features command (e.g. connects with 60 seconds kato value and then sets 10 seconds kato value). The cause is that keep alive command interval on the host, which is defined as unsigned int kato in nvme_ctrl structure, does not follow the kato value changes. This patch updates the keep alive interval in the following steps when the kato is modified by a set features command: stops the keep alive work queue, then sets the kato as new timer value and re-start the queue. Signed-off-by: Tatsuya Sasaki <tatsuya6.sasaki@kioxia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06nvme-tcp: Do not reset transport on data digest errorsDaniel Wagner1-4/+18
The spec says 7.4.6.1 Digest Error handling When a host detects a data digest error in a C2HData PDU, that host shall continue processing C2HData PDUs associated with the command and when the command processing has completed, if a successful status was returned by the controller, the host shall fail the command with a non-fatal transport error. Currently the transport is reseted when a data digest error is detected. Instead, when a digest error is detected, mark the final status as NVME_SC_DATA_XFER_ERROR and let the upper layer handle the error. In order to keep track of the final result maintain a status field in nvme_tcp_request object and use it to overwrite the completion queue status (which might be successful even though a digest error has been detected) when completing the request. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()Hannes Reinecke1-1/+2
The serial number is copied into the buffer via memcpy_and_pad() with the length NVMET_SN_MAX_SIZE. So when printing out we also need to take just that length as anything beyond that will be uninitialized. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>