summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2019-11-12md/raid10: prevent access of uninitialized resync_pages offsetJohn Pittman1-1/+1
Due to unneeded multiplication in the out_free_pages portion of r10buf_pool_alloc(), when using a 3-copy raid10 layout, it is possible to access a resync_pages offset that has not been initialized. This access translates into a crash of the system within resync_free_pages() while passing a bad pointer to put_page(). Remove the multiplication, preventing access to the uninitialized area. Fixes: f0250618361db ("md: raid10: don't use bio's vec table to manage resync pages") Cc: stable@vger.kernel.org # 4.12+ Signed-off-by: John Pittman <jpittman@redhat.com> Suggested-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-11-12md: avoid invalid memory access for array sb->dev_rolesYufen Yu1-31/+20
we need to gurantee 'desc_nr' valid before access array of sb->dev_roles. In addition, we should avoid .load_super always return '0' when level is LEVEL_MULTIPATH, which is not expected. Reported-by: coverity-bot <keescook+coverity-bot@chromium.org> Addresses-Coverity-ID: 1487373 ("Memory - illegal accesses") Fixes: 6a5cb53aaa4e ("md: no longer compare spare disk superblock events in super_load") Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-11-12md/raid1: avoid soft lockup under high loadHannes Reinecke1-0/+1
As all I/O is being pushed through a kernel thread the softlockup watchdog might be triggered under high load. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-11-07Merge branch 'for-linus' into for-5.5/blockJens Axboe4-46/+82
Pull on for-linus to resolve what otherwise would have been a conflict with the cgroups rstat patchset from Tejun. * for-linus: (942 commits) blkcg: make blkcg_print_stat() print stats only for online blkgs nvme: change nvme_passthru_cmd64 to explicitly mark rsvd nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths nvme-rdma: fix a segmentation fault during module unload iocost: don't nest spin_lock_irq in ioc_weight_write() io_uring: ensure we clear io_kiocb->result before each issue um-ubd: Entrust re-queue to the upper layers nvme-multipath: remove unused groups_only mode in ana log nvme-multipath: fix possible io hang after ctrl reconnect io_uring: don't touch ctx in setup after ring fd install io_uring: Fix leaked shadow_req Linux 5.4-rc5 riscv: cleanup do_trap_break nbd: verify socket is supported during setup ata: libahci_platform: Fix regulator_get_optional() misuse nbd: handle racing with error'ed out commands nbd: protect cmd->status with cmd->lock io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD io_uring: used cached copies of sq->dropped and cq->overflow ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157 ...
2019-11-07dm raid: Remove unnecessary negation of a shift in raid10_format_to_md_layoutNathan Chancellor1-1/+0
When building with Clang + -Wtautological-constant-compare: drivers/md/dm-raid.c:619:8: warning: converting the result of '<<' to a boolean always evaluates to true [-Wtautological-constant-compare] r = !RAID10_OFFSET; ^ drivers/md/dm-raid.c:517:28: note: expanded from macro 'RAID10_OFFSET' #define RAID10_OFFSET (1 << 16) /* stripes with data copies area adjacent on devices */ ^ 1 warning generated. Negating a non-zero number will always make it zero, which is the default value of r in this function so this statement is unnecessary; remove it so that clang no longer warns. Link: https://github.com/ClangBuiltLinux/linux/issues/753 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-07dm zoned: reduce overhead of backing device checksDmitry Fomichev4-32/+61
Commit 75d66ffb48efb3 added backing device health checks and as a part of these checks, check_events() block ops template call is invoked in dm-zoned mapping path as well as in reclaim and flush path. Calling check_events() with ATA or SCSI backing devices introduces a blocking scsi_test_unit_ready() call being made in sd_check_events(). Even though the overhead of calling scsi_test_unit_ready() is small for ATA zoned devices, it is much larger for SCSI and it affects performance in a very negative way. Fix this performance regression by executing check_events() only in case of any I/O errors. The function dmz_bdev_is_dying() is modified to call only blk_queue_dying(), while calls to check_events() are made in a new helper function, dmz_check_bdev(). Reported-by: zhangxiaoxu <zhangxiaoxu5@huawei.com> Fixes: 75d66ffb48efb3 ("dm zoned: properly handle backing device failure") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-07dm: add zone open, close and finish supportAjay Joshi3-7/+7
Implement REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH support to allow explicit control of zone states. Contains contributions from Matias Bjorling, Hans Holmberg and Damien Le Moal. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07Merge branch 'for-5.5/block' into for-5.5/driversJens Axboe1-3/+3
Pull in dependencies for the new zoned open/close/finish support. * for-5.5/block: (32 commits) block: add zone open, close and finish ioctl support block: add zone open, close and finish operations block: Simplify REQ_OP_ZONE_RESET_ALL handling block: Remove REQ_OP_ZONE_RESET plugging block: Warn if elevator= parameter is used block: avoid blk_bio_segment_split for small I/O operations blk-mq: make sure that line break can be printed block: sed-opal: Introduce Opal Datastore UID block: sed-opal: Add support to read/write opal tables generically block: sed-opal: Generalizing write data to any opal table bdev: Refresh bdev size for disks without partitioning bdev: Factor out bdev revalidation into a common helper blk-mq: avoid sysfs buffer overflow with too many CPU cores blk-mq: Make blk_mq_run_hw_queue() return void fcntl: fix typo in RWH_WRITE_LIFE_NOT_SET r/w hint name blk-mq: fill header with kernel-doc blk-mq: remove needless goto from blk_mq_get_driver_tag block: reorder bio::__bi_remaining for better packing block: Reduce the amount of memory used for tag sets block: Reduce the amount of memory required per request queue ...
2019-11-07block: add zone open, close and finish operationsAjay Joshi1-3/+3
Zoned block devices (ZBC and ZAC devices) allow an explicit control over the condition (state) of zones. The operations allowed are: * Open a zone: Transition to open condition to indicate that a zone will actively be written * Close a zone: Transition to closed condition to release the drive resources used for writing to a zone * Finish a zone: Transition an open or closed zone to the full condition to prevent write operations To enable this control for in-kernel zoned block device users, define the new request operations REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH as well as the generic function blkdev_zone_mgmt() for submitting these operations on a range of zones. This results in blkdev_reset_zones() removal and replacement with this new zone magement function. Users of blkdev_reset_zones() (f2fs and dm-zoned) are updated accordingly. Contains contributions from Matias Bjorling, Hans Holmberg, Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig. Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-05dm dust: add limited write failure modeBryan Gurney1-7/+46
Add a limited write failure mode which allows a write to a block to fail a specified amount of times, prior to remapping. The "addbadblock" message is extended to allow specifying the limited number of times a write fails. Example: add bad block on block 60, with 5 write failures: dmsetup message 0 dust1 addbadblock 60 5 The write failure counter will be printed for newly added bad blocks. Signed-off-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm dust: change ret to r in dust_map_read and dust_mapBryan Gurney1-7/+7
In the dust_map_read() and dust_map() functions, change the return code variable "ret" to "r", to match the convention of the other device-mapper targets. Signed-off-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm dust: change result vars to rBryan Gurney1-16/+16
Change the "result" variables to "r" in dust_status() and dust_message(). Signed-off-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm cache: replace spin_lock_irqsave with spin_lock_irqMikulas Patocka1-49/+28
If we are in a place where it is known that interrupts are enabled, functions spin_lock_irq/spin_unlock_irq should be used instead of spin_lock_irqsave/spin_unlock_irqrestore. spin_lock_irq and spin_unlock_irq are faster because they don't need to push and pop the flags register. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm bio prison: replace spin_lock_irqsave with spin_lock_irqMikulas Patocka2-33/+20
Replace spin_lock_irqsave/irqrestore with spin_lock_irq/spin_unlock_irq. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm thin: replace spin_lock_irqsave with spin_lock_irqMikulas Patocka1-67/+46
If we are in a place where it is known that interrupts are enabled, functions spin_lock_irq/spin_unlock_irq should be used instead of spin_lock_irqsave/spin_unlock_irqrestore. spin_lock_irq and spin_unlock_irq are faster because they don't need to push and pop the flags register. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm clone: add bucket_lock_irq/bucket_unlock_irq helpersNikos Tsironis1-15/+19
Introduce bucket_lock_irq() and bucket_unlock_irq() helpers and use them in places where it is known that interrupts are enabled. Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm clone: replace spin_lock_irqsave with spin_lock_irqMikulas Patocka3-34/+27
If we are in a place where it is known that interrupts are enabled, functions spin_lock_irq/spin_unlock_irq should be used instead of spin_lock_irqsave/spin_unlock_irqrestore. spin_lock_irq and spin_unlock_irq are faster because they don't need to push and pop the flags register. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm writecache: handle REQ_FUAMaged Mokhtar1-1/+2
Call writecache_flush() on REQ_FUA in writecache_map(). Cc: stable@vger.kernel.org # 4.18+ Signed-off-by: Maged Mokhtar <mmokhtar@petasan.org> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm writecache: fix uninitialized variable warningMikulas Patocka1-1/+1
This fixes coverity warning CID 1454301. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm stripe: use struct_size() in kmalloc()Gustavo A. R. Silva1-14/+1
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct stripe_c { ... struct stripe stripe[0]; }; In this case alloc_context() and dm_array_too_big() are removed and replaced by the direct use of the struct_size() helper in kmalloc(). Notice that open-coded form is prone to type mistakes. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm raid: streamline rs_get_progress() and its raid_status() caller sideHeinz Mauelshagen1-27/+20
Pass already deciphered state into rs_get_progress, simplify recovery offset definition and combine two st_resync, st_reshape conditionals into one as is already the case with st_check and st_repair. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm raid: simplify rs_setup_recovery call chainHeinz Mauelshagen1-21/+6
rs_setup_recovery() sets the starting recovery offset. Drop superfluous rs_setup_recovery() and replace with __rs_setup_recovery(). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm raid: to ensure resynchronization, perform raid set grow in preresumeHeinz Mauelshagen1-21/+60
This fixes a flaw causing raid set extensions not to be synchronized in case the MD bitmap resize required additional pages to be allocated. Also share resize code in the raid constructor between new size changes and those occuring during recovery. Bump the target version to define the change and document it in Documentation/admin-guide/device-mapper/dm-raid.rst. Reported-by: Steve D <steved424@gmail.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm raid: change rs_set_dev_and_array_sectors API and callersHeinz Mauelshagen1-9/+5
Add a size argument to rs_set_dev_and_array_sectors as prerequisite to fixing grown device resynchronization not occuring when new MD bitmap pages have to be allocated as a result of the extension in a follwup patch. Also avoid code duplication by using rs_set_rdev_sectors in the aforementioned function. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05dm table: do not allow request-based DM to stack on partitionsMike Snitzer1-19/+8
Partitioned request-based devices cannot be used as underlying devices for request-based DM because no partition offsets are added to each incoming request. As such, until now, stacking on partitioned devices would _always_ result in data corruption (e.g. wiping the partition table, writing to other partitions, etc). Fix this by disallowing request-based stacking on partitions. While at it, since all .request_fn support has been removed from block core, remove legacy dm-table code that differentiated between blk-mq and .request_fn request-based. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-25md: no longer compare spare disk superblock events in super_loadYufen Yu1-6/+51
We have a test case as follow: mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \ --assume-clean --bitmap=internal mdadm -S /dev/md1 mdadm -A /dev/md1 /dev/sd[b-c] --run --force mdadm --zero /dev/sda mdadm /dev/md1 -a /dev/sda echo offline > /sys/block/sdc/device/state echo offline > /sys/block/sdb/device/state sleep 5 mdadm -S /dev/md1 echo running > /sys/block/sdb/device/state echo running > /sys/block/sdc/device/state mdadm -A /dev/md1 /dev/sd[a-c] --run --force When we readd /dev/sda to the array, it started to do recovery. After offline the other two disks in md1, the recovery have been interrupted and superblock update info cannot be written to the offline disks. While the spare disk (/dev/sda) can continue to update superblock info. After stopping the array and assemble it, we found the array run fail, with the follow kernel message: [ 172.986064] md: kicking non-fresh sdb from array! [ 173.004210] md: kicking non-fresh sdc from array! [ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors [ 173.022406] md1: failed to create bitmap (-5) [ 173.023466] md: md1 stopped. Since both sdb and sdc have the value of 'sb->events' smaller than that in sda, they have been kicked from the array. However, the only remained disk sda is in 'spare' state before stop and it cannot be added to conf->mirrors[] array. In the end, raid array assemble and run fail. In fact, we can use the older disk sdb or sdc to assemble the array. That means we should not choose the 'spare' disk as the fresh disk in analyze_sbs(). To fix the problem, we do not compare superblock events when it is a spare disk, as same as validate_super. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-25md: improve handling of bio with REQ_PREFLUSH in md_flush_request()David Jeffery8-21/+23
If pers->make_request fails in md_flush_request(), the bio is lost. To fix this, pass back a bool to indicate if the original make_request call should continue to handle the I/O and instead of assuming the flush logic will push it to completion. Convert md_flush_request to return a bool and no longer calls the raid driver's make_request function. If the return is true, then the md flush logic has or will complete the bio and the md make_request call is done. If false, then the md make_request function needs to keep processing like it is a normal bio. Let the original call to md_handle_request handle any need to retry sending the bio to the raid driver's make_request function should it be needed. Also mark md_flush_request and the make_request function pointer as __must_check to issue warnings should these critical return values be ignored. Fixes: 2bc13b83e629 ("md: batch flush requests.") Cc: stable@vger.kernel.org # # v4.19+ Cc: NeilBrown <neilb@suse.com> Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-25md/bitmap: avoid race window between md_bitmap_resize and bitmap_file_clear_bitGuoqing Jiang1-1/+1
We need to move "spin_lock_irq(&bitmap->counts.lock)" before unmap previous storage, otherwise panic like belows could happen as follows. [ 902.353802] sdl: detected capacity change from 1077936128 to 3221225472 [ 902.616948] general protection fault: 0000 [#1] SMP [snip] [ 902.618588] CPU: 12 PID: 33698 Comm: md0_raid1 Tainted: G O 4.14.144-1-pserver #4.14.144-1.1~deb10 [ 902.618870] Hardware name: Supermicro SBA-7142G-T4/BHQGE, BIOS 3.00 10/24/2012 [ 902.619120] task: ffff9ae1860fc600 task.stack: ffffb52e4c704000 [ 902.619301] RIP: 0010:bitmap_file_clear_bit+0x90/0xd0 [md_mod] [ 902.619464] RSP: 0018:ffffb52e4c707d28 EFLAGS: 00010087 [ 902.619626] RAX: ffe8008b0d061000 RBX: ffff9ad078c87300 RCX: 0000000000000000 [ 902.619792] RDX: ffff9ad986341868 RSI: 0000000000000803 RDI: ffff9ad078c87300 [ 902.619986] RBP: ffff9ad0ed7a8000 R08: 0000000000000000 R09: 0000000000000000 [ 902.620154] R10: ffffb52e4c707ec0 R11: ffff9ad987d1ed44 R12: ffff9ad0ed7a8360 [ 902.620320] R13: 0000000000000003 R14: 0000000000060000 R15: 0000000000000800 [ 902.620487] FS: 0000000000000000(0000) GS:ffff9ad987d00000(0000) knlGS:0000000000000000 [ 902.620738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 902.620901] CR2: 000055ff12aecec0 CR3: 0000001005207000 CR4: 00000000000406e0 [ 902.621068] Call Trace: [ 902.621256] bitmap_daemon_work+0x2dd/0x360 [md_mod] [ 902.621429] ? find_pers+0x70/0x70 [md_mod] [ 902.621597] md_check_recovery+0x51/0x540 [md_mod] [ 902.621762] raid1d+0x5c/0xeb0 [raid1] [ 902.621939] ? try_to_del_timer_sync+0x4d/0x80 [ 902.622102] ? del_timer_sync+0x35/0x40 [ 902.622265] ? schedule_timeout+0x177/0x360 [ 902.622453] ? call_timer_fn+0x130/0x130 [ 902.622623] ? find_pers+0x70/0x70 [md_mod] [ 902.622794] ? md_thread+0x94/0x150 [md_mod] [ 902.622959] md_thread+0x94/0x150 [md_mod] [ 902.623121] ? wait_woken+0x80/0x80 [ 902.623280] kthread+0x119/0x130 [ 902.623437] ? kthread_create_on_node+0x60/0x60 [ 902.623600] ret_from_fork+0x22/0x40 [ 902.624225] RIP: bitmap_file_clear_bit+0x90/0xd0 [md_mod] RSP: ffffb52e4c707d28 Because mdadm was running on another cpu to do resize, so bitmap_resize was called to replace bitmap as below shows. PID: 38801 TASK: ffff9ad074a90e00 CPU: 0 COMMAND: "mdadm" [exception RIP: queued_spin_lock_slowpath+56] [snip] -- <NMI exception stack> -- #5 [ffffb52e60f17c58] queued_spin_lock_slowpath at ffffffff9c0b27b8 #6 [ffffb52e60f17c58] bitmap_resize at ffffffffc0399877 [md_mod] #7 [ffffb52e60f17d30] raid1_resize at ffffffffc0285bf9 [raid1] #8 [ffffb52e60f17d50] update_size at ffffffffc038a31a [md_mod] #9 [ffffb52e60f17d70] md_ioctl at ffffffffc0395ca4 [md_mod] And the procedure to keep resize bitmap safe is allocate new storage space, then quiesce, copy bits, replace bitmap, and re-start. However the daemon (bitmap_daemon_work) could happen even the array is quiesced, which means when bitmap_file_clear_bit is triggered by raid1d, then it thinks it should be fine to access store->filemap since counts->lock is held, but resize could change the storage without the protection of the lock. Cc: Jack Wang <jinpu.wang@cloud.ionos.com> Cc: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-25md/raid0: Fix an error message in raid0_make_request()Dan Carpenter1-1/+1
The first argument to WARN() is supposed to be a condition. The original code will just print the mdname() instead of the full warning message. Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-19Merge tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block fixes from Jens Axboe: - NVMe pull request from Keith that address deadlocks, double resets, memory leaks, and other regression. - Fixup elv_support_iosched() for bio based devices (Damien) - Fixup for the ahci PCS quirk (Dan) - Socket O_NONBLOCK handling fix for io_uring (me) - Timeout sequence io_uring fixes (yangerkun) - MD warning fix for parameter default_layout (Song) - blkcg activation fixes (Tejun) - blk-rq-qos node deletion fix (Tejun) * tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block: nvme-pci: Set the prp2 correctly when using more than 4k page io_uring: fix logic error in io_timeout io_uring: fix up O_NONBLOCK handling for sockets md/raid0: fix warning message for parameter default_layout libata/ahci: Fix PCS quirk application blk-rq-qos: fix first node deletion of rq_qos_del() blkcg: Fix multiple bugs in blkcg_activate_policy() io_uring: consider the overflow of sequence for timeout req nvme-tcp: fix possible leakage during error flow nvmet-loop: fix possible leakage during error flow block: Fix elv_support_iosched() nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL nvme: Wait for reset state when required nvme: Prevent resets during paused controller state nvme: Restart request timers in resetting state nvme: Remove ADMIN_ONLY state nvme-pci: Free tagset if no IO queues nvme: retain split access workaround for capability reads nvme: fix possible deadlock when nvme_update_formats fails
2019-10-17dm cache: fix bugs when a GFP_NOWAIT allocation failsMikulas Patocka1-26/+2
GFP_NOWAIT allocation can fail anytime - it doesn't wait for memory being available and it fails if the mempool is exhausted and there is not enough memory. If we go down this path: map_bio -> mg_start -> alloc_migration -> mempool_alloc(GFP_NOWAIT) we can see that map_bio() doesn't check the return value of mg_start(), and the bio is leaked. If we go down this path: map_bio -> mg_start -> mg_lock_writes -> alloc_prison_cell -> dm_bio_prison_alloc_cell_v2 -> mempool_alloc(GFP_NOWAIT) -> mg_lock_writes -> mg_complete the bio is ended with an error - it is unacceptable because it could cause filesystem corruption if the machine ran out of memory temporarily. Change GFP_NOWAIT to GFP_NOIO, so that the mempool code will properly wait until memory becomes available. mempool_alloc with GFP_NOIO can't fail, so remove the code paths that deal with allocation failure. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-16md/raid0: fix warning message for parameter default_layoutSong Liu1-1/+1
The message should match the parameter, i.e. raid0.default_layout. Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.") Cc: NeilBrown <neilb@suse.de> Reported-by: Ivan Topolsky <doktor.yak@gmail.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-10dm snapshot: rework COW throttling to fix deadlockMikulas Patocka1-14/+64
Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and workqueue stalls") introduced a semaphore to limit the maximum number of in-flight kcopyd (COW) jobs. The implementation of this throttling mechanism is prone to a deadlock: 1. One or more threads write to the origin device causing COW, which is performed by kcopyd. 2. At some point some of these threads might reach the s->cow_count semaphore limit and block in down(&s->cow_count), holding a read lock on _origins_lock. 3. Someone tries to acquire a write lock on _origins_lock, e.g., snapshot_ctr(), which blocks because the threads at step (2) already hold a read lock on it. 4. A COW operation completes and kcopyd runs dm-snapshot's completion callback, which ends up calling pending_complete(). pending_complete() tries to resubmit any deferred origin bios. This requires acquiring a read lock on _origins_lock, which blocks. This happens because the read-write semaphore implementation gives priority to writers, meaning that as soon as a writer tries to enter the critical section, no readers will be allowed in, until all writers have completed their work. So, pending_complete() waits for the writer at step (3) to acquire and release the lock. This writer waits for the readers at step (2) to release the read lock and those readers wait for pending_complete() (the kcopyd thread) to signal the s->cow_count semaphore: DEADLOCK. The above was thoroughly analyzed and documented by Nikos Tsironis as part of his initial proposal for fixing this deadlock, see: https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html Fix this deadlock by reworking COW throttling so that it waits without holding any locks. Add a variable 'in_progress' that counts how many kcopyd jobs are running. A function wait_for_in_progress() will sleep if 'in_progress' is over the limit. It drops _origins_lock in order to avoid the deadlock. Reported-by: Guruswamy Basavaiah <guru2018@gmail.com> Reported-by: Nikos Tsironis <ntsironis@arrikto.com> Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com> Tested-by: Nikos Tsironis <ntsironis@arrikto.com> Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue stalls") Cc: stable@vger.kernel.org # v5.0+ Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-10dm snapshot: introduce account_start_copy() and account_end_copy()Mikulas Patocka1-5/+15
This simple refactoring moves code for modifying the semaphore cow_count into separate functions to prepare for changes that will extend these methods to provide for a more sophisticated mechanism for COW throttling. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-08dm clone: Make __hash_find staticYueHaibing1-2/+2
drivers/md/dm-clone-target.c:594:34: warning: symbol '__hash_find' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-25Merge tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+10
Pull more block updates from Jens Axboe: "Some later additions that weren't quite done for the first pull request, and also a few fixes that have arrived since. This contains: - Kill silly pktcdvd warning on attempting to register a non-scsi passthrough device (me) - Use symbolic constants for the block t10 protection types, and switch to handling it in core rather than in the drivers (Max) - libahci platform missing node put fix (Nishka) - Small series of fixes for BFQ (Paolo) - Fix possible nbd crash (Xiubo)" * tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block: block: drop device references in bsg_queue_rq() block: t10-pi: fix -Wswitch warning pktcdvd: remove warning on attempting to register non-passthrough dev ata: libahci_platform: Add of_node_put() before loop exit nbd: fix possible page fault for nbd disk nbd: rename the runtime flags as NBD_RT_ prefixed block, bfq: push up injection only after setting service time block, bfq: increase update frequency of inject limit block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1 block, bfq: update inject limit only after injection occurred block: centralize PI remapping logic to the block layer block: use symbolic constants for t10_pi type
2019-09-21Merge tag 'for-5.4/dm-changes' of ↵Linus Torvalds21-396/+3828
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - crypto and DM crypt advances that allow the crypto API to reclaim implementation details that do not belong in DM crypt. The wrapper template for ESSIV generation that was factored out will also be used by fscrypt in the future. - Add root hash pkcs#7 signature verification to the DM verity target. - Add a new "clone" DM target that allows for efficient remote replication of a device. - Enhance DM bufio's cache to be tailored to each client based on use. Clients that make heavy use of the cache get more of it, and those that use less have reduced cache usage. - Add a new DM_GET_TARGET_VERSION ioctl to allow userspace to query the version number of a DM target (even if the associated module isn't yet loaded). - Fix invalid memory access in DM zoned target. - Fix the max_discard_sectors limit advertised by the DM raid target; it was mistakenly storing the limit in bytes rather than sectors. - Small optimizations and cleanups in DM writecache target. - Various fixes and cleanups in DM core, DM raid1 and space map portion of DM persistent data library. * tag 'for-5.4/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (22 commits) dm: introduce DM_GET_TARGET_VERSION dm bufio: introduce a global cache replacement dm bufio: remove old-style buffer cleanup dm bufio: introduce a global queue dm bufio: refactor adjust_total_allocated dm bufio: call adjust_total_allocated from __link_buffer and __unlink_buffer dm: add clone target dm raid: fix updating of max_discard_sectors limit dm writecache: skip writecache_wait for pmem mode dm stats: use struct_size() helper dm crypt: omit parsing of the encapsulated cipher dm crypt: switch to ESSIV crypto API template crypto: essiv - create wrapper template for ESSIV generation dm space map common: remove check for impossible sm_find_free() return value dm raid1: use struct_size() with kzalloc() dm writecache: optimize performance by sorting the blocks for writeback_all dm writecache: add unlikely for getting two block with same LBA dm writecache: remove unused member pointer in writeback_struct dm zoned: fix invalid memory access dm verity: add root hash pkcs#7 signature verification ...
2019-09-18block: centralize PI remapping logic to the block layerMax Gurtovoy1-0/+10
Currently t10_pi_prepare/t10_pi_complete functions are called during the NVMe and SCSi layers command preparetion/completion, but their actual place should be the block layer since T10-PI is a general data integrity feature that is used by block storage protocols. Introduce .prepare_fn and .complete_fn callbacks within the integrity profile that each type can implement according to its needs. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY isn't defined in the config. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-18Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-blockLinus Torvalds13-89/+259
Pull block updates from Jens Axboe: - Two NVMe pull requests: - ana log parse fix from Anton - nvme quirks support for Apple devices from Ben - fix missing bio completion tracing for multipath stack devices from Hannes and Mikhail - IP TOS settings for nvme rdma and tcp transports from Israel - rq_dma_dir cleanups from Israel - tracing for Get LBA Status command from Minwoo - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself - Some consolidation between the fabrics transports for handling the CAP register - reset race with ns scanning fix for fabrics (move fabrics commands to a dedicated request queue with a different lifetime from the admin request queue)." - controller reset and namespace scan races fixes - nvme discovery log change uevent support - naming improvements from Keith - multiple discovery controllers reject fix from James - some regular cleanups from various people - Series fixing (and re-fixing) null_blk debug printing and nr_devices checks (André) - A few pull requests from Song, with fixes from Andy, Guoqing, Guilherme, Neil, Nigel, and Yufen. - REQ_OP_ZONE_RESET_ALL support (Chaitanya) - Bio merge handling unification (Christoph) - Pick default elevator correctly for devices with special needs (Damien) - Block stats fixes (Hou) - Timeout and support devices nbd fixes (Mike) - Series fixing races around elevator switching and device add/remove (Ming) - sed-opal cleanups (Revanth) - Per device weight support for BFQ (Fam) - Support for blk-iocost, a new model that can properly account cost of IO workloads. (Tejun) - blk-cgroup writeback fixes (Tejun) - paride queue init fixes (zhengbin) - blk_set_runtime_active() cleanup (Stanley) - Block segment mapping optimizations (Bart) - lightnvm fixes (Hans/Minwoo/YueHaibing) - Various little fixes and cleanups * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits) null_blk: format pr_* logs with pr_fmt null_blk: match the type of parameter nr_devices null_blk: do not fail the module load with zero devices block: also check RQF_STATS in blk_mq_need_time_stamp() block: make rq sector size accessible for block stats bfq: Fix bfq linkage error raid5: use bio_end_sector in r5_next_bio raid5: remove STRIPE_OPS_REQ_PENDING md: add feature flag MD_FEATURE_RAID0_LAYOUT md/raid0: avoid RAID0 data corruption due to layout confusion. raid5: don't set STRIPE_HANDLE to stripe which is in batch list raid5: don't increment read_errors on EILSEQ return nvmet: fix a wrong error status returned in error log page nvme: send discovery log page change events to userspace nvme: add uevent variables for controller devices nvme: enable aen regardless of the presence of I/O queues nvme-fabrics: allow discovery subsystems accept a kato nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery() nvme: Remove redundant assignment of cq vector nvme: Assign subsys instance from first ctrl ...
2019-09-16dm: introduce DM_GET_TARGET_VERSIONMikulas Patocka1-3/+29
This commit introduces a new ioctl DM_GET_TARGET_VERSION. It will load a target that is specified in the "name" entry in the parameter structure and return its version. This functionality is intended to be used by cryptsetup, so that it can query kernel capabilities before activating the device. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-14dm bufio: introduce a global cache replacementMikulas Patocka1-7/+91
This commit introduces a global cache replacement (instead of per-client cleanup). If one bufio client uses the cache heavily and another client is not using it, we want to let the first client use most of the cache. The old algorithm would partition the cache equally betwen the clients and that is sub-optimal. For cache replacement, we use the clock algorithm because it doesn't require taking any lock when the buffer is accessed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13raid5: use bio_end_sector in r5_next_bioGuoqing Jiang1-3/+1
Actually, we calculate bio's end sector here, so use the common way for the purpose. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13raid5: remove STRIPE_OPS_REQ_PENDINGGuoqing Jiang2-2/+0
This stripe state is not used anymore after commit 51acbcec6c42b24 ("md: remove CONFIG_MULTICORE_RAID456"), so remove the obsoleted state. gjiang@nb01257:~/md$ grep STRIPE_OPS_REQ_PENDING drivers/md/ -r drivers/md/raid5.c: (1 << STRIPE_OPS_REQ_PENDING) | drivers/md/raid5.h: STRIPE_OPS_REQ_PENDING, Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13md: add feature flag MD_FEATURE_RAID0_LAYOUTNeilBrown2-0/+16
Due to a bug introduced in Linux 3.14 we cannot determine the correctly layout for a multi-zone RAID0 array - there are two possibilities. It is possible to tell the kernel which to chose using a module parameter, but this can be clumsy to use. It would be best if the choice were recorded in the metadata. So add a feature flag for this purpose. If it is set, then the 'layout' field of the superblock is used to determine which layout to use. If this flag is not set, then mddev->layout gets set to -1, which causes the module parameter to be required. Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13md/raid0: avoid RAID0 data corruption due to layout confusion.NeilBrown2-1/+45
If the drives in a RAID0 are not all the same size, the array is divided into zones. The first zone covers all drives, to the size of the smallest. The second zone covers all drives larger than the smallest, up to the size of the second smallest - etc. A change in Linux 3.14 unintentionally changed the layout for the second and subsequent zones. All the correct data is still stored, but each chunk may be assigned to a different device than in pre-3.14 kernels. This can lead to data corruption. It is not possible to determine what layout to use - it depends which kernel the data was written by. So we add a module parameter to allow the old (0) or new (1) layout to be specified, and refused to assemble an affected array if that parameter is not set. Fixes: 20d0189b1012 ("block: Introduce new bio_split()") cc: stable@vger.kernel.org (3.14+) Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13raid5: don't set STRIPE_HANDLE to stripe which is in batch listGuoqing Jiang1-1/+2
If stripe in batch list is set with STRIPE_HANDLE flag, then the stripe could be set with STRIPE_ACTIVE by the handle_stripe function. And if error happens to the batch_head at the same time, break_stripe_batch_list is called, then below warning could happen (the same report in [1]), it means a member of batch list was set with STRIPE_ACTIVE. [7028915.431770] stripe state: 2001 [7028915.431815] ------------[ cut here ]------------ [7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456] [...] [7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G O 4.14.86-1-storage #4.14.86-1.2~deb9 [7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018 [7028915.431888] Workqueue: raid5wq raid5_do_work [raid456] [7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000 [7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456] [7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286 [7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000 [7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458 [7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4 [7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00 [7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001 [7028915.431906] FS: 0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000 [7028915.431907] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0 [7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [7028915.431910] Call Trace: [7028915.431923] handle_stripe+0x8e7/0x2020 [raid456] [7028915.431930] ? __wake_up_common_lock+0x89/0xc0 [7028915.431935] handle_active_stripes.isra.58+0x35f/0x560 [raid456] [7028915.431939] raid5_do_work+0xc6/0x1f0 [raid456] Also commit 59fc630b8b5f9f ("RAID5: batch adjacent full stripe write") said "If a stripe is added to batch list, then only the first stripe of the list should be put to handle_list and run handle_stripe." So don't set STRIPE_HANDLE to stripe which is already in batch list, otherwise the stripe could be put to handle_list and run handle_stripe, then the above warning could be triggered. [1]. https://www.spinics.net/lists/raid/msg62552.html Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13raid5: don't increment read_errors on EILSEQ returnNigel Croxon1-1/+2
While MD continues to count read errors returned by the lower layer. If those errors are -EILSEQ, instead of -EIO, it should NOT increase the read_errors count. When RAID6 is set up on dm-integrity target that detects massive corruption, the leg will be ejected from the array. Even if the issue is correctable with a sector re-write and the array has necessary redundancy to correct it. The leg is ejected because it runs up the rdev->read_errors beyond conf->max_nr_stripes. The return status in dm-drypt when there is a data integrity error is -EILSEQ (BLK_STS_PROTECTION). Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13dm bufio: remove old-style buffer cleanupMikulas Patocka1-58/+3
Remove code that cleans up buffers if the cache size grows over the limit. The next commit will introduce a new global cleanup. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13dm bufio: introduce a global queueMikulas Patocka1-3/+12
Rename param_spinlock to global_spinlock and introduce a global queue of all used buffers. The queue will be used in the following commits. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13dm bufio: refactor adjust_total_allocatedMikulas Patocka1-3/+11
Refactor adjust_total_allocated() so that it takes a bool argument indicating if it should add or subtract the buffer size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>