summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-08-02fs: add CONFIG_BUFFER_HEADChristoph Hellwig37-29/+119
Add a new config option that controls building the buffer_head code, and select it from all file systems and stacking drivers that need it. For the block device nodes and alternative iomap based buffered I/O path is provided when buffer_head support is not enabled, and iomap needs a a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call into the buffer_head code when it doesn't exist. Otherwise this is just Kconfig and ifdef changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-02block: use iomap for writes to block devicesChristoph Hellwig2-2/+30
Use iomap in buffer_head compat mode to write to block devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230801172201.1923299-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-02block: stop setting ->direct_IOChristoph Hellwig1-2/+1
Direct I/O on block devices now nevers goes through aops->direct_IO. Stop setting it and set the FMODE_CAN_ODIRECT in ->open instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230801172201.1923299-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-02block: open code __generic_file_write_iter for blkdev writesChristoph Hellwig1-2/+43
Open code __generic_file_write_iter to remove the indirect call into ->direct_IO and to prepare using the iomap based write code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20230801172201.1923299-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-02fs: rename and move block_page_mkwrite_returnChristoph Hellwig8-25/+31
block_page_mkwrite_return is neither block nor mkwrite specific, and should not be under CONFIG_BLOCK. Move it to mm.h and rename it to vmf_fs_error. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230801172201.1923299-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-02fs: remove emergency_thaw_bdevChristoph Hellwig3-13/+3
Fold emergency_thaw_bdev into it's only caller, to prepare for buffer.c to be built only when buffer_head support is enabled. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230801172201.1923299-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-29Merge tag 'md-next-20230729' of ↵Jens Axboe15-394/+431
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.6/block Pull MD updates from Song: "1. Deprecate bitmap file support, by Christoph Hellwig; 2. Fix deadlock with md sync thread, by Yu Kuai; 3. Refactor md io accounting, by Yu Kuai; 4. Various non-urgent fixes by Li Nan, Yu Kuai, and Jack Wang." * tag 'md-next-20230729' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (36 commits) md/md-bitmap: hold 'reconfig_mutex' in backlog_store() md/md-bitmap: remove unnecessary local variable in backlog_store() md/raid10: use dereference_rdev_and_rrdev() to get devices md/raid10: factor out dereference_rdev_and_rrdev() md/raid10: check replacement and rdev to prevent submit the same io twice md/raid1: Avoid lock contention from wake_up() md: restore 'noio_flag' for the last mddev_resume() md: don't quiesce in mddev_suspend() md: remove redundant check in fix_read_error() md/raid10: optimize fix_read_error md/raid1: prioritize adding disk to 'removed' mirror md/md-faulty: enable io accounting md/md-linear: enable io accounting md/md-multipath: enable io accounting md/raid10: switch to use md_account_bio() for io accounting md/raid1: switch to use md_account_bio() for io accounting raid5: fix missing io accounting in raid5_align_endio() md: also clone new io if io accounting is disabled md: move initialization and destruction of 'io_acct_set' to md.c md: deprecate bitmap file support ...
2023-07-27md/md-bitmap: hold 'reconfig_mutex' in backlog_store()Yu Kuai1-0/+7
Several reasons why 'reconfig_mutex' should be held: 1) rdev_for_each() is not safe to be called without the lock, because rdev can be removed concurrently. 2) mddev_destroy_serial_pool() and mddev_create_serial_pool() should not be called concurrently. 3) mddev_suspend() from mddev_destroy/create_serial_pool() should be protected by the lock. Fixes: 10c92fca636e ("md-bitmap: create and destroy wb_info_pool with the change of backlog") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230706083727.608914-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/md-bitmap: remove unnecessary local variable in backlog_store()Yu Kuai1-2/+0
Local variable is definied first in the beginning of backlog_store(), there is no need to define it again. Fixes: 8c13ab115b57 ("md/bitmap: don't set max_write_behind if there is no write mostly device") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230706083727.608914-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: use dereference_rdev_and_rrdev() to get devicesLi Nan1-10/+5
Commit 2ae6aaf76912 ("md/raid10: fix io loss while replacement replace rdev") reads replacement first to prevent io loss. However, there are same issue in wait_blocked_dev() and raid10_handle_discard(), too. Fix it by using dereference_rdev_and_rrdev() to get devices. Fixes: d30588b2731f ("md/raid10: improve raid10 discard request") Fixes: f2e7e269a752 ("md/raid10: pull the code that wait for blocked dev into one function") Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20230701080529.2684932-4-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: factor out dereference_rdev_and_rrdev()Li Nan1-9/+20
Factor out a helper to get 'rdev' and 'replacement' from config->mirrors. Just to make code cleaner and prepare to fix the bug of io loss while 'replacement' replace 'rdev'. There is no functional change. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20230701080529.2684932-3-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: check replacement and rdev to prevent submit the same io twiceLi Nan1-0/+2
After commit 4ca40c2ce099 ("md/raid10: Allow replacement device to be replace old drive."), 'rdev' and 'replacement' could appear to be identical. There are already checks for that in wait_blocked_dev() and raid10_write_request(). Add check for raid10_handle_discard() now. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20230701080529.2684932-2-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid1: Avoid lock contention from wake_up()Jack Wang1-6/+12
wake_up is called unconditionally in a few paths such as make_request(), which cause lock contention under high concurrency workload like below raid1_end_write_request wake_up __wake_up_common_lock spin_lock_irqsave Improve performance by only call wake_up() if waitqueue is not empty Fio test script: [global] name=random reads and writes ioengine=libaio direct=1 readwrite=randrw rwmixread=70 iodepth=64 buffered=0 filename=/dev/md0 size=1G runtime=30 time_based randrepeat=0 norandommap refill_buffers ramp_time=10 bs=4k numjobs=400 group_reporting=1 [job1] Test result with 2 ramdisk in raid1 on a Intel Broadwell 56 cores server. Before this patch With this patch READ BW=4621MB/s BW=7337MB/s WRITE BW=1980MB/s BW=3144MB/s The patch is inspired by Yu Kuai's change for raid10: https://lore.kernel.org/r/20230621105728.1268542-1-yukuai1@huaweicloud.com Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230705113227.148494-1-jinpu.wang@ionos.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md: restore 'noio_flag' for the last mddev_resume()Yu Kuai1-2/+4
memalloc_noio_save() is called for the first mddev_suspend(), and repeated mddev_suspend() only increase 'suspended'. However, memalloc_noio_restore() is also called for the first mddev_resume(), which means that memory reclaim will be enabled before the last mddev_resume() is called, while the array is still suspended. Fix this problem by restore 'noio_flag' for the last mddev_resume(). Fixes: 78f57ef9d50a ("md: use memalloc scope APIs in mddev_suspend()/mddev_resume()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230628012931.88911-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md: don't quiesce in mddev_suspend()Yu Kuai1-2/+0
Some levels doesn't implement "pers->quiesce", for example raid0_quiesce() is empty, and now that all levels will drop 'active_io' until io is done, wait for 'active_io' to be 0 is enough to make sure all normal io is done, and percpu_ref_kill() for 'active_io' will make sure no new normal io can be dispatched. There is no need to call "pers->quiesce" anymore from mddev_suspend(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230628012931.88911-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md: remove redundant check in fix_read_error()Li Nan2-2/+2
In fix_read_error(), 'success' will be checked immediately after assigning it, if it is set to 1 then the loop will break. Checking it again in condition of loop is redundant. Clean it up. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230623173236.2513554-3-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: optimize fix_read_errorLi Nan1-10/+10
We dereference r10_bio->read_slot too many times in fix_read_error(). Optimize it by using a variable to store read_slot. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230623173236.2513554-2-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid1: prioritize adding disk to 'removed' mirrorLi Nan1-11/+15
New disk should be added to "removed" position first instead of to be a replacement. Commit 6090368abcb4 ("md/raid10: prioritize adding disk to 'removed' mirror") has fixed this issue for raid10. Fix it for raid1 now. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230627014332.3810102-1-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/md-faulty: enable io accountingYu Kuai1-0/+2
use md_account_bio() to enable io accounting, also make sure mddev_suspend() will wait for all io to be done. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-9-yukuai1@huaweicloud.com
2023-07-27md/md-linear: enable io accountingYu Kuai1-0/+1
use md_account_bio() to enable io accounting, also make sure mddev_suspend() will wait for all io to be done. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-8-yukuai1@huaweicloud.com
2023-07-27md/md-multipath: enable io accountingYu Kuai1-0/+1
use md_account_bio() to enable io accounting, also make sure mddev_suspend() will wait for all io to be done. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-7-yukuai1@huaweicloud.com
2023-07-27md/raid10: switch to use md_account_bio() for io accountingYu Kuai2-12/+9
Make sure that 'active_io' will represent inflight io instead of io that is dispatching, and io accounting from all levels will be consistent. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-6-yukuai1@huaweicloud.com
2023-07-27md/raid1: switch to use md_account_bio() for io accountingYu Kuai2-9/+6
Two problems can be fixed this way: 1) 'active_io' will represent inflight io instead of io that is dispatching. 2) If io accounting is enabled or disabled while io is still inflight, bio_start_io_acct() and bio_end_io_acct() is not balanced and io inflight counter will be leaked. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-5-yukuai1@huaweicloud.com
2023-07-27raid5: fix missing io accounting in raid5_align_endio()Yu Kuai1-21/+8
Io will only be accounted as done from raid5_align_endio() if the io succeeded, and io inflight counter will be leaked if such io failed. Fix this problem by switching to use md_account_bio() for io accounting. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-4-yukuai1@huaweicloud.com
2023-07-27md: also clone new io if io accounting is disabledYu Kuai3-43/+42
Currently, 'active_io' is grabbed before make_reqeust() is called, and it's dropped immediately make_reqeust() returns. Hence 'active_io' actually means io is dispatching, not io is inflight. For raid0 and raid456 that io accounting is enabled, 'active_io' will also be grabbed when bio is cloned for io accounting, and this 'active_io' is dropped until io is done. Always clone new bio so that 'active_io' will mean that io is inflight, raid1 and raid10 will switch to use this method in later patches. Now that bio will be cloned even if io accounting is disabled, also rename related structure from '*_acct_*' to '*_clone_*'. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-3-yukuai1@huaweicloud.com
2023-07-27md: move initialization and destruction of 'io_acct_set' to md.cYu Kuai4-63/+23
'io_acct_set' is only used for raid0 and raid456, prepare to use it for raid1 and raid10, so that io accounting from different levels can be consistent. By the way, follow up patches will also use this io clone mechanism to make sure 'active_io' represents in flight io, not io that is dispatching, so that mddev_suspend will wait for io to be done as designed. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-2-yukuai1@huaweicloud.com
2023-07-27md: deprecate bitmap file supportChristoph Hellwig2-1/+3
The support for bitmaps on files is a very bad idea abusing various kernel APIs, and fundamentally requires the file to not be on the actual array without a way to check that this is actually the case. Add a deprecation warning to see if we might be able to eventually drop it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-12-hch@lst.de
2023-07-27md: make bitmap file support optionalChristoph Hellwig3-0/+32
The support for write intent bitmaps in files on an external files in md is a hot mess that abuses ->bmap to map file offsets into physical device objects, and also abuses buffer_heads in a creative way. Make this code optional so that MD can be built into future kernels without buffer_head support, and so that we can eventually deprecate it. Note this does not affect the internal bitmap support, which has none of the problems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-11-hch@lst.de
2023-07-27md-bitmap: don't use ->index for pages backing the bitmap fileChristoph Hellwig2-27/+39
The md driver allocates pages for storing the bitmap file data, which are not page cache pages, and then stores the page granularity file offset in page->index, which is a field that isn't really valid except for page cache pages. Use a separate index for the superblock, and use the scheme used at read size to recalculate the index for the bitmap pages instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-10-hch@lst.de
2023-07-27md-bitmap: account for mddev->bitmap_info.offset in read_sb_pageChristoph Hellwig1-9/+8
Diretly apply mddev->bitmap_info.offset to the sector number to read instead of doing that in both callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-9-hch@lst.de
2023-07-27md-bitmap: cleanup read_sb_pageChristoph Hellwig1-12/+11
Convert read_sb_page to the normal kernel coding style, calculate the target sector only once, and add a local iosize variable to make the call to sync_page_io more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-8-hch@lst.de
2023-07-27md-bitmap: refactor md_bitmap_init_from_diskChristoph Hellwig1-71/+70
Split the confusing loop in md_bitmap_init_from_disk that iterates over all chunks but also needs to read and map the pages into three separate loops: one that iterates over the pages to read them, a second optional one to iterate over the pages to mark them invalid if the bitmaps are out of date, and a final one that actually iterates over the chunks. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202306160552.smw0qbmb-lkp@intel.com/ Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-7-hch@lst.de
2023-07-27md-bitmap: rename read_page to read_file_pageChristoph Hellwig1-6/+4
Make the difference to read_sb_page clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-6-hch@lst.de
2023-07-27md-bitmap: split file writes into a separate helperChristoph Hellwig1-24/+24
Split the file write code out of write_page into a separate helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-5-hch@lst.de
2023-07-27md-bitmap: use %pD to print the file name in md_bitmap_file_kickChristoph Hellwig1-10/+2
Don't bother allocating an extra buffer in the I/O failure handler and instead use the printk built-in format to print the last 4 path name components. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-4-hch@lst.de
2023-07-27md-bitmap: initialize variables at declaration time in md_bitmap_file_unmapChristoph Hellwig1-8/+4
Just a small tidyup to prepare for bigger changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-3-hch@lst.de
2023-07-27md-bitmap: set BITMAP_WRITE_ERROR in write_sb_pageChristoph Hellwig1-13/+8
Set BITMAP_WRITE_ERROR directly in write_sb_page instead of propagating the error to the caller and setting it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-2-hch@lst.de
2023-07-27md: enhance checking in md_check_recovery()Yu Kuai1-7/+15
For md_check_recovery(): 1) if 'MD_RECOVERY_RUNING' is not set, register new sync_thread. 2) if 'MD_RECOVERY_RUNING' is set: a) if 'MD_RECOVERY_DONE' is not set, don't do anything, wait for md_do_sync() to be done. b) if 'MD_RECOVERY_DONE' is set, unregister sync_thread. Current code expects that sync_thread is not NULL, otherwise new sync_thread will be registered, which will corrupt the array. Make sure md_check_recovery() won't register new sync_thread if 'MD_RECOVERY_RUNING' is still set, and a new WARN_ON_ONCE() is added for the above corruption, Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-7-yukuai1@huaweicloud.com
2023-07-27md: wake up 'resync_wait' at last in md_reap_sync_thread()Yu Kuai1-1/+1
md_reap_sync_thread() is just replaced with wait_event(resync_wait, ...) from action_store(), just make sure action_store() will still wait for everything to be done in md_reap_sync_thread(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewd-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-6-yukuai1@huaweicloud.com
2023-07-27md: refactor idle/frozen_sync_thread() to fix deadlockYu Kuai2-4/+21
Our test found a following deadlock in raid10: 1) Issue a normal write, and such write failed: raid10_end_write_request set_bit(R10BIO_WriteError, &r10_bio->state) one_write_done reschedule_retry // later from md thread raid10d handle_write_completed list_add(&r10_bio->retry_list, &conf->bio_end_io_list) // later from md thread raid10d if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) list_move(conf->bio_end_io_list.prev, &tmp) r10_bio = list_first_entry(&tmp, struct r10bio, retry_list) raid_end_bio_io(r10_bio) Dependency chain 1: normal io is waiting for updating superblock 2) Trigger a recovery: raid10_sync_request raise_barrier Dependency chain 2: sync thread is waiting for normal io 3) echo idle/frozen to sync_action: action_store mddev_lock md_unregister_thread kthread_stop Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread 4) md thread can't update superblock: raid10d md_check_recovery if (mddev_trylock(mddev)) md_update_sb Dependency chain 4: update superblock is waiting for 'reconfig_mutex' Hence cyclic dependency exist, in order to fix the problem, we must break one of them. Dependency 1 and 2 can't be broken because they are foundation design. Dependency 4 may be possible if it can be guaranteed that no io can be inflight, however, this requires a new mechanism which seems complex. Dependency 3 is a good choice, because idle/frozen only requires sync thread to finish, which can be done asynchronously that is already implemented, and 'reconfig_mutex' is not needed anymore. This patch switch 'idle' and 'frozen' to wait sync thread to be done asynchronously, and this patch also add a sequence counter to record how many times sync thread is done, so that 'idle' won't keep waiting on new started sync thread. Noted that raid456 has similiar deadlock([1]), and it's verified[2] this deadlock can be fixed by this patch as well. [1] https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t [2] https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com
2023-07-27md: add a mutex to synchronize idle and frozen in action_store()Yu Kuai2-0/+8
Currently, for idle and frozen, action_store will hold 'reconfig_mutex' and call md_reap_sync_thread() to stop sync thread, however, this will cause deadlock (explained in the next patch). In order to fix the problem, following patch will release 'reconfig_mutex' and wait on 'resync_wait', like md_set_readonly() and do_md_stop() does. Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN' unconditionally, which might cause unexpected problems, for example, frozen just set 'MD_RECOVERY_FROZEN' and is still in progress, while 'idle' clear 'MD_RECOVERY_FROZEN' and new sync thread is started, which might starve in progress frozen. A mutex is added to synchronize idle and frozen from action_store(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-4-yukuai1@huaweicloud.com
2023-07-27md: refactor action_store() for 'idle' and 'frozen'Yu Kuai1-16/+45
Prepare to handle 'idle' and 'frozen' differently to fix a deadlock, there are no functional changes except that MD_RECOVERY_RUNNING is checked again after 'reconfig_mutex' is held. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-3-yukuai1@huaweicloud.com
2023-07-27Revert "md: unlock mddev before reap sync_thread in action_store"Yu Kuai2-18/+2
This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1. Because it will introduce a defect that sync_thread can be running while MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems, for example: list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0). Call trace: __list_add_valid+0xfc/0x140 insert_work+0x78/0x1a0 __queue_work+0x500/0xcf4 queue_work_on+0xe8/0x12c md_check_recovery+0xa34/0xf30 raid10d+0xb8/0x900 [raid10] md_thread+0x16c/0x2cc kthread+0x1a4/0x1ec ret_from_fork+0x10/0x18 This is because work is requeued while it's still inside workqueue: t1: t2: action_store mddev_lock if (mddev->sync_thread) mddev_unlock md_unregister_thread // first sync_thread is done md_check_recovery mddev_try_lock /* * once MD_RECOVERY_DONE is set, new sync_thread * can start. */ set_bit(MD_RECOVERY_RUNNING, &mddev->recovery) INIT_WORK(&mddev->del_work, md_start_sync) queue_work(md_misc_wq, &mddev->del_work) test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...) // set pending bit insert_work list_add_tail mddev_unlock mddev_lock_nointr md_reap_sync_thread // MD_RECOVERY_RUNNING is cleared mddev_unlock t3: // before queued work started from t2 md_check_recovery // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started INIT_WORK(&mddev->del_work, md_start_sync) work->data = 0 // work pending bit is cleared queue_work(md_misc_wq, &mddev->del_work) insert_work list_add_tail // list is corrupted The above commit is reverted to fix the problem, the deadlock this commit tries to fix will be fixed in following patches. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529132037.2124527-2-yukuai1@huaweicloud.com
2023-07-26block: cleanup bio_integrity_prepJinyoung Choi1-5/+1
If a problem occurs in the process of creating an integrity payload, the status of bio is always BLK_STS_RESOURCE. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230725051839epcms2p8e4d20ad6c51326ad032e8406f59d0aaa@epcms2p8 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-25block: Improve performance for BLK_MQ_F_BLOCKING driversBart Van Assche2-7/+12
blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING has been set. This is suboptimal since running the queue asynchronously is slower than running the queue synchronously. This patch modifies blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set: - Run the queue synchronously if it is allowed to sleep. - Run the queue asynchronously if it is not allowed to sleep. Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller may be invoked from atomic context. The following caller chains have been reviewed: blk_mq_run_hw_queue(hctx, false) blk_mq_get_tag() /* may sleep, hence the functions it calls may also sleep */ blk_execute_rq() /* may sleep */ blk_mq_run_hw_queues(q, async=false) blk_freeze_queue_start() /* may sleep */ blk_mq_requeue_work() /* may sleep */ scsi_kick_queue() scsi_requeue_run_queue() /* may sleep */ scsi_run_host_queues() scsi_ioctl_reset() /* may sleep */ blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false) blk_mq_dispatch_plug_list(plug, from_sched=false) blk_mq_flush_plug_list(plug, from_schedule=false) __blk_flush_plug(plug, from_schedule=false) blk_add_rq_to_plug() blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */ blk_mq_plug_issue_direct() blk_mq_flush_plug_list() /* see above */ blk_mq_dispatch_plug_list(plug, from_sched=false) blk_mq_flush_plug_list() /* see above */ blk_mq_try_issue_directly() blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */ blk_mq_try_issue_list_directly(hctx, list) blk_mq_insert_requests() /* see above */ Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-25scsi: Remove a blk_mq_run_hw_queues() callBart Van Assche1-1/+1
blk_mq_kick_requeue_list() calls blk_mq_run_hw_queues() asynchronously. Leave out the direct blk_mq_run_hw_queues() call. This patch causes scsi_run_queue() to call blk_mq_run_hw_queues() asynchronously instead of synchronously. Since scsi_run_queue() is not called from the hot I/O submission path, this patch does not affect the hot path. This patch prepares for allowing blk_mq_run_hw_queue() to sleep if BLK_MQ_F_BLOCKING has been set. scsi_run_queue() may be called from atomic context and must not sleep. Hence the removal of the blk_mq_run_hw_queues(q, false) call. See also scsi_unblock_requests(). Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230721172731.955724-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-25scsi: Inline scsi_kick_queue()Bart Van Assche1-7/+2
Inline scsi_kick_queue() to prepare for modifying the second argument passed to blk_mq_run_hw_queues(). Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230721172731.955724-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-25block: don't pass a bio to bio_try_merge_hw_segChristoph Hellwig1-9/+7
There is no good reason to pass the bio to bio_try_merge_hw_seg. Just pass the current bvec and rename the function to bvec_try_merge_hw_page. This will allow reusing this function for supporting multi-page integrity payload bvecs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Link: https://lore.kernel.org/r/20230724165433.117645-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-25block: move the bi_size update out of __bio_try_merge_pageChristoph Hellwig1-37/+20
The update of bi_size is the only thing in __bio_try_merge_page that needs a bio. Move it to the callers, and merge __bio_try_merge_page and page_is_mergeable into a single bvec_try_merge_page that only takes the current bvec instead of a full bio. This will allow reusing this function for supporting multi-page integrity payload bvecs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Link: https://lore.kernel.org/r/20230724165433.117645-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-25block: downgrade a bio_full call in bio_add_pageChristoph Hellwig1-1/+1
bio_add_page already checks that there is space in bi_size a little earlier. So after we failed to add to an existing segment, just check that there is another one available instead of duplicating the bi_size check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Link: https://lore.kernel.org/r/20230724165433.117645-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>