Age | Commit message (Collapse) | Author | Files | Lines |
|
commit bd293d071ffe65e645b4d8104f9d8fe15ea13862 upstream.
When thin-volume is built on loop device, if available memory is low,
the following deadlock can be triggered:
One process P1 allocates memory with GFP_FS flag, direct alloc fails,
memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
IO to complete in __try_evict_buffer().
But this IO may never complete if issued to an underlying loop device
that forwards it using direct-IO, which allocates memory using
GFP_KERNEL (see: do_blockdev_direct_IO()). If allocation fails, memory
reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
will be invoked, and since the mutex is already held by P1 the loop
thread will hang, and IO will never complete. Resulting in ABBA
deadlock.
Cc: stable@vger.kernel.org
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3b8cafdd5436f9298b3bf6eb831df5eef5ee82b6 upstream.
dm-zoned uses the zone flag DMZ_ACTIVE to indicate that a zone of the
backend device is being actively read or written and so cannot be
reclaimed. This flag is set as long as the zone atomic reference
counter is not 0. When this atomic is decremented and reaches 0 (e.g.
on BIO completion), the active flag is cleared and set again whenever
the zone is reused and BIO issued with the atomic counter incremented.
These 2 operations (atomic inc/dec and flag set/clear) are however not
always executed atomically under the target metadata mutex lock and
this causes the warning:
WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));
in dmz_deactivate_zone() to be displayed. This problem is regularly
triggered with xfstests generic/209, generic/300, generic/451 and
xfs/077 with XFS being used as the file system on the dm-zoned target
device. Similarly, xfstests ext4/303, ext4/304, generic/209 and
generic/300 trigger the warning with ext4 use.
This problem can be easily fixed by simply removing the DMZ_ACTIVE flag
and managing the "ACTIVE" state by directly looking at the reference
counter value. To do so, the functions dmz_activate_zone() and
dmz_deactivate_zone() are changed to inline functions respectively
calling atomic_inc() and atomic_dec(), while the dmz_is_active() macro
is changed to an inline function calling atomic_read().
Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d9771f5ec46c282d518b453c793635dbdc3a2a94 upstream.
commit d5d885fd514f ("md: introduce new personality funciton start()")
splits the init job to two parts. The first part run() does the jobs that
do not require the md threads. The second part start() does the jobs that
require the md threads.
Now it just does run() in adding new journal device. It needs to do the
second part start() too.
Fixes: d5d885fd514f ("md: introduce new personality funciton start()")
Cc: stable@vger.kernel.org #v4.9+
Reported-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f54d801dda14942dbefa00541d10603015b7859c upstream.
Commit 9baf30972b55 ("bcache: fix for gc and write-back race") added a
new work queue dc->writeback_write_wq, but forgot to destroy it in the
error condition when creating dc->writeback_thread failed.
This patch destroys dc->writeback_write_wq if kthread_create() returns
error pointer to dc->writeback_thread, then a memory leak is avoided.
Fixes: 9baf30972b55 ("bcache: fix for gc and write-back race")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5461999848e0462c14f306a62923d22de820a59c upstream.
In bch_cached_dev_files[] from driver/md/bcache/sysfs.c, sysfs_errors is
incorrectly inserted in. The correct entry should be sysfs_io_errors.
This patch fixes the problem and now I/O errors of cached device can be
read from /sys/block/bcache<N>/bcache/io_errors.
Fixes: c7b7bd07404c5 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 578df99b1b0531d19af956530fe4da63d01a1604 upstream.
When md raid device (e.g. raid456) is used as backing device, read-ahead
requests on a degrading and recovering md raid device might be failured
immediately by md raid code, but indeed this md raid array can still be
read or write for normal I/O requests. Therefore such failed read-ahead
request are not real hardware failure. Further more, after degrading and
recovering accomplished, read-ahead requests will be handled by md raid
array again.
For such condition, I/O failures of read-ahead requests don't indicate
real health status (because normal I/O still be served), they should not
be counted into I/O error counter dc->io_errors.
Since there is no simple way to detect whether the backing divice is a
md raid device, this patch simply ignores I/O failures for read-ahead
bios on backing device, to avoid bogus backing device failure on a
degrading md raid array.
Suggested-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ba82c1ac1667d6efb91a268edb13fc9cdaecec9b upstream.
This reverts commit 6268dc2c4703aabfb0b35681be709acf4c2826c6.
This patch depends on commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal") which is reverted in previous patch. So
revert this one too.
Fixes: 6268dc2c4703 ("bcache: free heap cache_set->flush_btree in bch_journal_free")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 249a5f6da57c28a903c75d81505d58ec8c10030d upstream.
This reverts commit c4dc2497d50d9c6fb16aa0d07b6a14f3b2adb1e0.
This patch enlarges a race between normal btree flush code path and
flush_btree_write(), which causes deadlock when journal space is
exhausted. Reverts this patch makes the race window from 128 btree
nodes to only 1 btree nodes.
Fixes: c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Tang Junhui <tang.junhui.linux@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 695277f16b3a102fcc22c97fdf2de77c7b19f0b3 upstream.
This reverts commit 6147305c73e4511ca1a975b766b97a779d442567.
Although this patch helps the failed bcache device to stop faster when
too many I/O errors detected on corresponding cached device, setting
CACHE_SET_IO_DISABLE bit to cache set c->flags was not a good idea. This
operation will disable all I/Os on cache set, which means other attached
bcache devices won't work neither.
Without this patch, the failed bcache device can also be stopped
eventually if internal I/O accomplished (e.g. writeback). Therefore here
I revert it.
Fixes: 6147305c73e4 ("bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()")
Reported-by: Yong Li <mr.liyong@qq.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7e865eba00a3df2dc8c4746173a8ca1c1c7f042e ]
When enable lockdep and reboot system with a writeback mode bcache
device, the following potential deadlock warning is reported by lockdep
engine.
[ 101.536569][ T401] kworker/2:2/401 is trying to acquire lock:
[ 101.538575][ T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[ 101.542054][ T401]
[ 101.542054][ T401] but task is already holding lock:
[ 101.544587][ T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 101.548386][ T401]
[ 101.548386][ T401] which lock already depends on the new lock.
[ 101.548386][ T401]
[ 101.551874][ T401]
[ 101.551874][ T401] the existing dependency chain (in reverse order) is:
[ 101.555000][ T401]
[ 101.555000][ T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[ 101.557860][ T401] process_one_work+0x277/0x640
[ 101.559661][ T401] worker_thread+0x39/0x3f0
[ 101.561340][ T401] kthread+0x125/0x140
[ 101.562963][ T401] ret_from_fork+0x3a/0x50
[ 101.564718][ T401]
[ 101.564718][ T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[ 101.567701][ T401] lock_acquire+0xb4/0x1c0
[ 101.569651][ T401] flush_workqueue+0xae/0x4c0
[ 101.571494][ T401] drain_workqueue+0xa9/0x180
[ 101.573234][ T401] destroy_workqueue+0x17/0x250
[ 101.575109][ T401] cached_dev_free+0x44/0x120 [bcache]
[ 101.577304][ T401] process_one_work+0x2a4/0x640
[ 101.579357][ T401] worker_thread+0x39/0x3f0
[ 101.581055][ T401] kthread+0x125/0x140
[ 101.582709][ T401] ret_from_fork+0x3a/0x50
[ 101.584592][ T401]
[ 101.584592][ T401] other info that might help us debug this:
[ 101.584592][ T401]
[ 101.588355][ T401] Possible unsafe locking scenario:
[ 101.588355][ T401]
[ 101.590974][ T401] CPU0 CPU1
[ 101.592889][ T401] ---- ----
[ 101.594743][ T401] lock((work_completion)(&cl->work)#2);
[ 101.596785][ T401] lock((wq_completion)bcache_writeback_wq);
[ 101.600072][ T401] lock((work_completion)(&cl->work)#2);
[ 101.602971][ T401] lock((wq_completion)bcache_writeback_wq);
[ 101.605255][ T401]
[ 101.605255][ T401] *** DEADLOCK ***
[ 101.605255][ T401]
[ 101.608310][ T401] 2 locks held by kworker/2:2/401:
[ 101.610208][ T401] #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[ 101.613709][ T401] #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 101.617480][ T401]
[ 101.617480][ T401] stack backtrace:
[ 101.619539][ T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G W 5.2.0-rc4-lp151.20-default+ #1
[ 101.623225][ T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 101.627210][ T401] Workqueue: events cached_dev_free [bcache]
[ 101.629239][ T401] Call Trace:
[ 101.630360][ T401] dump_stack+0x85/0xcb
[ 101.631777][ T401] print_circular_bug+0x19a/0x1f0
[ 101.633485][ T401] __lock_acquire+0x16cd/0x1850
[ 101.635184][ T401] ? __lock_acquire+0x6a8/0x1850
[ 101.636863][ T401] ? lock_acquire+0xb4/0x1c0
[ 101.638421][ T401] ? find_held_lock+0x34/0xa0
[ 101.640015][ T401] lock_acquire+0xb4/0x1c0
[ 101.641513][ T401] ? flush_workqueue+0x87/0x4c0
[ 101.643248][ T401] flush_workqueue+0xae/0x4c0
[ 101.644832][ T401] ? flush_workqueue+0x87/0x4c0
[ 101.646476][ T401] ? drain_workqueue+0xa9/0x180
[ 101.648303][ T401] drain_workqueue+0xa9/0x180
[ 101.649867][ T401] destroy_workqueue+0x17/0x250
[ 101.651503][ T401] cached_dev_free+0x44/0x120 [bcache]
[ 101.653328][ T401] process_one_work+0x2a4/0x640
[ 101.655029][ T401] worker_thread+0x39/0x3f0
[ 101.656693][ T401] ? process_one_work+0x640/0x640
[ 101.658501][ T401] kthread+0x125/0x140
[ 101.660012][ T401] ? kthread_create_worker_on_cpu+0x70/0x70
[ 101.661985][ T401] ret_from_fork+0x3a/0x50
[ 101.691318][ T401] bcache: bcache_device_free() bcache0 stopped
Here is how the above potential deadlock may happen in reboot/shutdown
code path,
1) bcache_reboot() is called firstly in the reboot/shutdown code path,
then in bcache_reboot(), bcache_device_stop() is called.
2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->falgs, then call
closure_queue(&d->cl) to invoke cached_dev_flush(). And in turn
cached_dev_flush() calls cached_dev_free() via closure_at()
3) In cached_dev_free(), after stopped writebach kthread
dc->writeback_thread, the kwork dc->writeback_write_wq is stopping by
destroy_workqueue().
4) Inside destroy_workqueue(), drain_workqueue() is called. Inside
drain_workqueue(), flush_workqueue() is called. Then wq->lockdep_map
is acquired by lock_map_acquire() in flush_workqueue(). After the
lock acquired the rest part of flush_workqueue() just wait for the
workqueue to complete.
5) Now we look back at writeback thread routine bch_writeback_thread(),
in the main while-loop, write_dirty() is called via continue_at() in
read_dirty_submit(), which is called via continue_at() in while-loop
level called function read_dirty(). Inside write_dirty() it may be
re-called on workqueeu dc->writeback_write_wq via continue_at().
It means when the writeback kthread is stopped in cached_dev_free()
there might be still one kworker queued on dc->writeback_write_wq
to execute write_dirty() again.
6) Now this kworker is scheduled on dc->writeback_write_wq to run by
process_one_work() (which is called by worker_thread()). Before
calling the kwork routine, wq->lockdep_map is acquired.
7) But wq->lockdep_map is acquired already in step 4), so a A-A lock
(lockdep terminology) scenario happens.
Indeed on multiple cores syatem, the above deadlock is very rare to
happen, just as the code comments in process_one_work() says,
2263 * AFAICT there is no possible deadlock scenario between the
2264 * flush_work() and complete() primitives (except for
single-threaded
2265 * workqueues), so hiding them isn't a problem.
But it is still good to fix such lockdep warning, even no one running
bcache on single core system.
The fix is simple. This patch solves the above potential deadlock by,
- Do not destroy workqueue dc->writeback_write_wq in cached_dev_free().
- Flush and destroy dc->writeback_write_wq in writebach kthread routine
bch_writeback_thread(), where after quit the thread main while-loop
and before cached_dev_put() is called.
By this fix, dc->writeback_write_wq will be stopped and destroy before
the writeback kthread stopped, so the chance for a A-A locking on
wq->lockdep_map is disappeared, such A-A deadlock won't happen
any more.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b387e9b58679c60f5b1e4313939bd4878204fc37 ]
When system memory is in heavy pressure, bch_gc_thread_start() from
run_cache_set() may fail due to out of memory. In such condition,
c->gc_thread is assigned to -ENOMEM, not NULL pointer. Then in following
failure code path bch_cache_set_error(), when cache_set_flush() gets
called, the code piece to stop c->gc_thread is broken,
if (!IS_ERR_OR_NULL(c->gc_thread))
kthread_stop(c->gc_thread);
And KASAN catches such NULL pointer deference problem, with the warning
information:
[ 561.207881] ==================================================================
[ 561.207900] BUG: KASAN: null-ptr-deref in kthread_stop+0x3b/0x440
[ 561.207904] Write of size 4 at addr 000000000000001c by task kworker/15:1/313
[ 561.207913] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G W 5.0.0-vanilla+ #3
[ 561.207916] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
[ 561.207935] Workqueue: events cache_set_flush [bcache]
[ 561.207940] Call Trace:
[ 561.207948] dump_stack+0x9a/0xeb
[ 561.207955] ? kthread_stop+0x3b/0x440
[ 561.207960] ? kthread_stop+0x3b/0x440
[ 561.207965] kasan_report+0x176/0x192
[ 561.207973] ? kthread_stop+0x3b/0x440
[ 561.207981] kthread_stop+0x3b/0x440
[ 561.207995] cache_set_flush+0xd4/0x6d0 [bcache]
[ 561.208008] process_one_work+0x856/0x1620
[ 561.208015] ? find_held_lock+0x39/0x1d0
[ 561.208028] ? drain_workqueue+0x380/0x380
[ 561.208048] worker_thread+0x87/0xb80
[ 561.208058] ? __kthread_parkme+0xb6/0x180
[ 561.208067] ? process_one_work+0x1620/0x1620
[ 561.208072] kthread+0x326/0x3e0
[ 561.208079] ? kthread_create_worker_on_cpu+0xc0/0xc0
[ 561.208090] ret_from_fork+0x3a/0x50
[ 561.208110] ==================================================================
[ 561.208113] Disabling lock debugging due to kernel taint
[ 561.208115] irq event stamp: 11800231
[ 561.208126] hardirqs last enabled at (11800231): [<ffffffff83008538>] do_syscall_64+0x18/0x410
[ 561.208127] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[ 561.208129] #PF error: [WRITE]
[ 561.312253] hardirqs last disabled at (11800230): [<ffffffff830052ff>] trace_hardirqs_off_thunk+0x1a/0x1c
[ 561.312259] softirqs last enabled at (11799832): [<ffffffff850005c7>] __do_softirq+0x5c7/0x8c3
[ 561.405975] PGD 0 P4D 0
[ 561.442494] softirqs last disabled at (11799821): [<ffffffff831add2c>] irq_exit+0x1ac/0x1e0
[ 561.791359] Oops: 0002 [#1] SMP KASAN NOPTI
[ 561.791362] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G B W 5.0.0-vanilla+ #3
[ 561.791363] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
[ 561.791371] Workqueue: events cache_set_flush [bcache]
[ 561.791374] RIP: 0010:kthread_stop+0x3b/0x440
[ 561.791376] Code: 00 00 65 8b 05 26 d5 e0 7c 89 c0 48 0f a3 05 ec aa df 02 0f 82 dc 02 00 00 4c 8d 63 20 be 04 00 00 00 4c 89 e7 e8 65 c5 53 00 <f0> ff 43 20 48 8d 7b 24 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
[ 561.791377] RSP: 0018:ffff88872fc8fd10 EFLAGS: 00010286
[ 561.838895] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838916] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838934] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838948] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838966] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838979] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838996] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 563.067028] RAX: 0000000000000000 RBX: fffffffffffffffc RCX: ffffffff832dd314
[ 563.067030] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000297
[ 563.067032] RBP: ffff88872fc8fe88 R08: fffffbfff0b8213d R09: fffffbfff0b8213d
[ 563.067034] R10: 0000000000000001 R11: fffffbfff0b8213c R12: 000000000000001c
[ 563.408618] R13: ffff88dc61cc0f68 R14: ffff888102b94900 R15: ffff88dc61cc0f68
[ 563.408620] FS: 0000000000000000(0000) GS:ffff888f7dc00000(0000) knlGS:0000000000000000
[ 563.408622] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 563.408623] CR2: 000000000000001c CR3: 0000000f48a1a004 CR4: 00000000007606e0
[ 563.408625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 563.408627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 563.904795] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 563.915796] PKRU: 55555554
[ 563.915797] Call Trace:
[ 563.915807] cache_set_flush+0xd4/0x6d0 [bcache]
[ 563.915812] process_one_work+0x856/0x1620
[ 564.001226] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 564.033563] ? find_held_lock+0x39/0x1d0
[ 564.033567] ? drain_workqueue+0x380/0x380
[ 564.033574] worker_thread+0x87/0xb80
[ 564.062823] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 564.118042] ? __kthread_parkme+0xb6/0x180
[ 564.118046] ? process_one_work+0x1620/0x1620
[ 564.118048] kthread+0x326/0x3e0
[ 564.118050] ? kthread_create_worker_on_cpu+0xc0/0xc0
[ 564.167066] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 564.252441] ret_from_fork+0x3a/0x50
[ 564.252447] Modules linked in: msr rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib i40iw configfs iw_cm ib_cm libiscsi scsi_transport_iscsi mlx4_ib ib_uverbs mlx4_en ib_core nls_iso8859_1 nls_cp437 vfat fat intel_rapl skx_edac x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ses raid0 aesni_intel cdc_ether enclosure usbnet ipmi_ssif joydev aes_x86_64 i40e scsi_transport_sas mii bcache md_mod crypto_simd mei_me ioatdma crc64 ptp cryptd pcspkr i2c_i801 mlx4_core glue_helper pps_core mei lpc_ich dca wmi ipmi_si ipmi_devintf nd_pmem dax_pmem nd_btt ipmi_msghandler device_dax pcc_cpufreq button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect xhci_pci sysimgblt fb_sys_fops xhci_hcd ttm megaraid_sas drm usbcore nfit libnvdimm sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[ 564.299390] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 564.348360] CR2: 000000000000001c
[ 564.348362] ---[ end trace b7f0e5cc7b2103b0 ]---
Therefore, it is not enough to only check whether c->gc_thread is NULL,
we should use IS_ERR_OR_NULL() to check both NULL pointer and error
value.
This patch changes the above buggy code piece in this way,
if (!IS_ERR_OR_NULL(c->gc_thread))
kthread_stop(c->gc_thread);
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 80265d8dfd77792e133793cef44a21323aac2908 ]
When enable lockdep engine, a lockdep warning can be observed when
reboot or shutdown system,
[ 3142.764557][ T1] bcache: bcache_reboot() Stopping all devices:
[ 3142.776265][ T2649]
[ 3142.777159][ T2649] ======================================================
[ 3142.780039][ T2649] WARNING: possible circular locking dependency detected
[ 3142.782869][ T2649] 5.2.0-rc4-lp151.20-default+ #1 Tainted: G W
[ 3142.785684][ T2649] ------------------------------------------------------
[ 3142.788479][ T2649] kworker/3:67/2649 is trying to acquire lock:
[ 3142.790738][ T2649] 00000000aaf02291 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[ 3142.794678][ T2649]
[ 3142.794678][ T2649] but task is already holding lock:
[ 3142.797402][ T2649] 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.801462][ T2649]
[ 3142.801462][ T2649] which lock already depends on the new lock.
[ 3142.801462][ T2649]
[ 3142.805277][ T2649]
[ 3142.805277][ T2649] the existing dependency chain (in reverse order) is:
[ 3142.808902][ T2649]
[ 3142.808902][ T2649] -> #2 (&bch_register_lock){+.+.}:
[ 3142.812396][ T2649] __mutex_lock+0x7a/0x9d0
[ 3142.814184][ T2649] cached_dev_free+0x17/0x120 [bcache]
[ 3142.816415][ T2649] process_one_work+0x2a4/0x640
[ 3142.818413][ T2649] worker_thread+0x39/0x3f0
[ 3142.820276][ T2649] kthread+0x125/0x140
[ 3142.822061][ T2649] ret_from_fork+0x3a/0x50
[ 3142.823965][ T2649]
[ 3142.823965][ T2649] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[ 3142.827244][ T2649] process_one_work+0x277/0x640
[ 3142.829160][ T2649] worker_thread+0x39/0x3f0
[ 3142.830958][ T2649] kthread+0x125/0x140
[ 3142.832674][ T2649] ret_from_fork+0x3a/0x50
[ 3142.834915][ T2649]
[ 3142.834915][ T2649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[ 3142.838121][ T2649] lock_acquire+0xb4/0x1c0
[ 3142.840025][ T2649] flush_workqueue+0xae/0x4c0
[ 3142.842035][ T2649] drain_workqueue+0xa9/0x180
[ 3142.844042][ T2649] destroy_workqueue+0x17/0x250
[ 3142.846142][ T2649] cached_dev_free+0x52/0x120 [bcache]
[ 3142.848530][ T2649] process_one_work+0x2a4/0x640
[ 3142.850663][ T2649] worker_thread+0x39/0x3f0
[ 3142.852464][ T2649] kthread+0x125/0x140
[ 3142.854106][ T2649] ret_from_fork+0x3a/0x50
[ 3142.855880][ T2649]
[ 3142.855880][ T2649] other info that might help us debug this:
[ 3142.855880][ T2649]
[ 3142.859663][ T2649] Chain exists of:
[ 3142.859663][ T2649] (wq_completion)bcache_writeback_wq --> (work_completion)(&cl->work)#2 --> &bch_register_lock
[ 3142.859663][ T2649]
[ 3142.865424][ T2649] Possible unsafe locking scenario:
[ 3142.865424][ T2649]
[ 3142.868022][ T2649] CPU0 CPU1
[ 3142.869885][ T2649] ---- ----
[ 3142.871751][ T2649] lock(&bch_register_lock);
[ 3142.873379][ T2649] lock((work_completion)(&cl->work)#2);
[ 3142.876399][ T2649] lock(&bch_register_lock);
[ 3142.879727][ T2649] lock((wq_completion)bcache_writeback_wq);
[ 3142.882064][ T2649]
[ 3142.882064][ T2649] *** DEADLOCK ***
[ 3142.882064][ T2649]
[ 3142.885060][ T2649] 3 locks held by kworker/3:67/2649:
[ 3142.887245][ T2649] #0: 00000000e774cdd0 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.890815][ T2649] #1: 00000000f7df89da ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.894884][ T2649] #2: 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.898797][ T2649]
[ 3142.898797][ T2649] stack backtrace:
[ 3142.900961][ T2649] CPU: 3 PID: 2649 Comm: kworker/3:67 Tainted: G W 5.2.0-rc4-lp151.20-default+ #1
[ 3142.904789][ T2649] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 3142.909168][ T2649] Workqueue: events cached_dev_free [bcache]
[ 3142.911422][ T2649] Call Trace:
[ 3142.912656][ T2649] dump_stack+0x85/0xcb
[ 3142.914181][ T2649] print_circular_bug+0x19a/0x1f0
[ 3142.916193][ T2649] __lock_acquire+0x16cd/0x1850
[ 3142.917936][ T2649] ? __lock_acquire+0x6a8/0x1850
[ 3142.919704][ T2649] ? lock_acquire+0xb4/0x1c0
[ 3142.921335][ T2649] ? find_held_lock+0x34/0xa0
[ 3142.923052][ T2649] lock_acquire+0xb4/0x1c0
[ 3142.924635][ T2649] ? flush_workqueue+0x87/0x4c0
[ 3142.926375][ T2649] flush_workqueue+0xae/0x4c0
[ 3142.928047][ T2649] ? flush_workqueue+0x87/0x4c0
[ 3142.929824][ T2649] ? drain_workqueue+0xa9/0x180
[ 3142.931686][ T2649] drain_workqueue+0xa9/0x180
[ 3142.933534][ T2649] destroy_workqueue+0x17/0x250
[ 3142.935787][ T2649] cached_dev_free+0x52/0x120 [bcache]
[ 3142.937795][ T2649] process_one_work+0x2a4/0x640
[ 3142.939803][ T2649] worker_thread+0x39/0x3f0
[ 3142.941487][ T2649] ? process_one_work+0x640/0x640
[ 3142.943389][ T2649] kthread+0x125/0x140
[ 3142.944894][ T2649] ? kthread_create_worker_on_cpu+0x70/0x70
[ 3142.947744][ T2649] ret_from_fork+0x3a/0x50
[ 3142.970358][ T2649] bcache: bcache_device_free() bcache0 stopped
Here is how the deadlock happens.
1) bcache_reboot() calls bcache_device_stop(), then inside
bcache_device_stop() BCACHE_DEV_CLOSING bit is set on d->flags.
Then closure_queue(&d->cl) is called to invoke cached_dev_flush().
2) In cached_dev_flush(), cached_dev_free() is called by continu_at().
3) In cached_dev_free(), when stopping the writeback kthread of the
cached device by kthread_stop(), dc->writeback_thread will be waken
up to quite the kthread while-loop, then cached_dev_put() is called
in bch_writeback_thread().
4) Calling cached_dev_put() in writeback kthread may drop dc->count to
0, then dc->detach kworker is scheduled, which is initialized as
cached_dev_detach_finish().
5) Inside cached_dev_detach_finish(), the last line of code is to call
closure_put(&dc->disk.cl), which drops the last reference counter of
closrure dc->disk.cl, then the callback cached_dev_flush() gets
called.
Now cached_dev_flush() is called for second time in the code path, the
first time is in step 2). And again bch_register_lock will be acquired
again, and a A-A lock (lockdep terminology) is happening.
The root cause of the above A-A lock is in cached_dev_free(), mutex
bch_register_lock is held before stopping writeback kthread and other
kworkers. Fortunately now we have variable 'bcache_is_reboot', which may
prevent device registration or unregistration during reboot/shutdown
time, so it is unncessary to hold bch_register_lock such early now.
This is how this patch fixes the reboot/shutdown time A-A lock issue:
After moving mutex_lock(&bch_register_lock) to a later location where
before atomic_read(&dc->running) in cached_dev_free(), such A-A lock
problem can be solved without any reboot time registration race.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 383ff2183ad16a8842d1fbd9dd3e1cbd66813e64 ]
When too many I/O errors happen on cache set and CACHE_SET_IO_DISABLE
bit is set, bch_journal() may continue to work because the journaling
bkey might be still in write set yet. The caller of bch_journal() may
believe the journal still work but the truth is in-memory journal write
set won't be written into cache device any more. This behavior may
introduce potential inconsistent metadata status.
This patch checks CACHE_SET_IO_DISABLE bit at the head of bch_journal(),
if the bit is set, bch_journal() returns NULL immediately to notice
caller to know journal does not work.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e775339e1ae1205b47d94881db124c11385e597c ]
If CACHE_SET_IO_DISABLE of a cache set flag is set by too many I/O
errors, currently allocator routines can still continue allocate
space which may introduce inconsistent metadata state.
This patch checkes CACHE_SET_IO_DISABLE bit in following allocator
routines,
- bch_bucket_alloc()
- __bch_bucket_alloc_set()
Once CACHE_SET_IO_DISABLE is set on cache set, the allocator routines
may reject allocation request earlier to avoid potential inconsistent
metadata.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2eba4e640b2c4161e31ae20090a53ee02a518657 ]
DM verity should also use DMERR_LIMIT to limit repeat data block
corruption messages.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a0651926553cfe7992166432e418987760882652 ]
For the first call to realloc_argv() in dm_split_args(), old_argv is
NULL and size is zero. Then memcpy is called, with the NULL old_argv
as the source argument and a zero size argument. AFAIK, this is
undefined behavior and generates the following warning when compiled
with UBSAN on ppc64le:
In file included from ./arch/powerpc/include/asm/paca.h:19,
from ./arch/powerpc/include/asm/current.h:16,
from ./include/linux/sched.h:12,
from ./include/linux/kthread.h:6,
from drivers/md/dm-core.h:12,
from drivers/md/dm-table.c:8:
In function 'memcpy',
inlined from 'realloc_argv' at drivers/md/dm-table.c:565:3,
inlined from 'dm_split_args' at drivers/md/dm-table.c:588:9:
./include/linux/string.h:345:9: error: argument 2 null where non-null expected [-Werror=nonnull]
return __builtin_memcpy(p, q, size);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/dm-table.c: In function 'dm_split_args':
./include/linux/string.h:345:9: note: in a call to built-in function '__builtin_memcpy'
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9642fa73d073527b0cbc337cc17a47d545d82cd2 ]
Stopping external metadata arrays during resync/recovery causes
retries, loop of interrupting and starting reconstruction, until it
hit at good moment to stop completely. While these retries
curr_mark_cnt can be small- especially on HDD drives, so subtraction
result can be smaller than 0. However it is casted to uint without
checking. As a result of it the status bar in /proc/mdstat while stopping
is strange (it jumps between 0% and 99%).
The real problem occurs here after commit 72deb455b5ec ("block: remove
CONFIG_LBDAF"). Sector_div() macro has been changed, now the
divisor is casted to uint32. For db = -8 the divisior(db/32-1) becomes 0.
Check if db value can be really counted and replace these macro by
div64_u64() inline.
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 211ad4b733037f66f9be0a79eade3da7ab11cbb8 upstream.
Currently, although we submit super bios in order (and super.nr_entries
is incremented by each logged entry), submit_bio() is async so each
super sector may not be written to log device in order and then the
final nr_entries may be smaller than it should be.
This problem can be reproduced by the xfstests generic/455 with ext4:
QA output created by 455
-Silence is golden
+mark 'end' does not exist
Fix this by serializing submission of super sectors to make sure each
is written to the log disk in order.
Fixes: 0e9cebe724597 ("dm: add log writes target")
Cc: stable@vger.kernel.org
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dec7e6494e1aea6bf676223da3429cd17ce0af79 upstream.
Fix 2 kstrndup() calls with incorrect argument order.
Fixes: 6bbc923dfcf5 ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org # v5.1
Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1f0ffa67349c56ea54c03ccfd1e073c990e7411e upstream.
When people set a writeback percent via sysfs file,
/sys/block/bcache<N>/bcache/writeback_percent
current code directly sets BCACHE_DEV_WB_RUNNING to dc->disk.flags
and schedules kworker dc->writeback_rate_update.
If there is no cache set attached to, the writeback kernel thread is
not running indeed, running dc->writeback_rate_update does not make
sense and may cause NULL pointer deference when reference cache set
pointer inside update_writeback_rate().
This patch checks whether the cache set point (dc->disk.c) is NULL in
sysfs interface handler, and only set BCACHE_DEV_WB_RUNNING and
schedule dc->writeback_rate_update when dc->disk.c is not NULL (it
means the cache device is attached to a cache set).
This problem might be introduced from initial bcache commit, but
commit 3fd47bfe55b0 ("bcache: stop dc->writeback_rate_update properly")
changes part of the original code piece, so I add 'Fixes: 3fd47bfe55b0'
to indicate from which commit this patch can be applied.
Fixes: 3fd47bfe55b0 ("bcache: stop dc->writeback_rate_update properly")
Reported-by: Bjørn Forsman <bjorn.forsman@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Bjørn Forsman <bjorn.forsman@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 31b90956b124240aa8c63250243ae1a53585c5e2 upstream.
Recently people report bcache code compiled with gcc9 is broken, one of
the buggy behavior I observe is that two adjacent 4KB I/Os should merge
into one but they don't. Finally it turns out to be a stack corruption
caused by macro PRECEDING_KEY().
See how PRECEDING_KEY() is defined in bset.h,
437 #define PRECEDING_KEY(_k) \
438 ({ \
439 struct bkey *_ret = NULL; \
440 \
441 if (KEY_INODE(_k) || KEY_OFFSET(_k)) { \
442 _ret = &KEY(KEY_INODE(_k), KEY_OFFSET(_k), 0); \
443 \
444 if (!_ret->low) \
445 _ret->high--; \
446 _ret->low--; \
447 } \
448 \
449 _ret; \
450 })
At line 442, _ret points to address of a on-stack variable combined by
KEY(), the life range of this on-stack variable is in line 442-446,
once _ret is returned to bch_btree_insert_key(), the returned address
points to an invalid stack address and this address is overwritten in
the following called bch_btree_iter_init(). Then argument 'search' of
bch_btree_iter_init() points to some address inside stackframe of
bch_btree_iter_init(), exact address depends on how the compiler
allocates stack space. Now the stack is corrupted.
Fixes: 0eacac22034c ("bcache: PRECEDING_KEY()")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Rolf Fokkens <rolf@rolffokkens.nl>
Reviewed-by: Pierre JUHEN <pierre.juhen@orange.fr>
Tested-by: Shenghui Wang <shhuiw@foxmail.com>
Tested-by: Pierre JUHEN <pierre.juhen@orange.fr>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 78d4eb8ad9e1d413449d1b7a060f50b6efa81ebd ]
clang has identified a code path in which it thinks a
variable may be unused:
drivers/md/bcache/alloc.c:333:4: error: variable 'bucket' is used uninitialized whenever 'if' condition is false
[-Werror,-Wsometimes-uninitialized]
fifo_pop(&ca->free_inc, bucket);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
#define fifo_pop(fifo, i) fifo_pop_front(fifo, (i))
^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:189:6: note: expanded from macro 'fifo_pop_front'
if (_r) { \
^~
drivers/md/bcache/alloc.c:343:46: note: uninitialized use occurs here
allocator_wait(ca, bch_allocator_push(ca, bucket));
^~~~~~
drivers/md/bcache/alloc.c:287:7: note: expanded from macro 'allocator_wait'
if (cond) \
^~~~
drivers/md/bcache/alloc.c:333:4: note: remove the 'if' if its condition is always true
fifo_pop(&ca->free_inc, bucket);
^
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
#define fifo_pop(fifo, i) fifo_pop_front(fifo, (i))
^
drivers/md/bcache/util.h:189:2: note: expanded from macro 'fifo_pop_front'
if (_r) { \
^
drivers/md/bcache/alloc.c:331:15: note: initialize the variable 'bucket' to silence this warning
long bucket;
^
This cannot happen in practice because we only enter the loop
if there is at least one element in the list.
Slightly rearranging the code makes this clearer to both the
reader and the compiler, which avoids the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ce3e4cfb59cb382f8e5ce359238aa580d4ae7778 ]
Currently run_cache_set() has no return value, if there is failure in
bch_journal_replay(), the caller of run_cache_set() has no idea about
such failure and just continue to execute following code after
run_cache_set(). The internal failure is triggered inside
bch_journal_replay() and being handled in async way. This behavior is
inefficient, while failure handling inside bch_journal_replay(), cache
register code is still running to start the cache set. Registering and
unregistering code running as same time may introduce some rare race
condition, and make the code to be more hard to be understood.
This patch adds return value to run_cache_set(), and returns -EIO if
bch_journal_rreplay() fails. Then caller of run_cache_set() may detect
such failure and stop registering code flow immedidately inside
register_cache_set().
If journal replay fails, run_cache_set() can report error immediately
to register_cache_set(). This patch makes the failure handling for
bch_journal_replay() be in synchronized way, easier to understand and
debug, and avoid poetential race condition for register-and-unregister
in same time.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 631207314d88e9091be02fbdd1fdadb1ae2ed79a ]
journal replay failed with messages:
Sep 10 19:10:43 ceph kernel: bcache: error on
bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
2057493-2057567 missing! (replaying 2057493-2076601), disabling
caching
The reason is in journal_reclaim(), when discard is enabled, we send
discard command and reclaim those journal buckets whose seq is old
than the last_seq_now, but before we write a journal with last_seq_now,
the machine is restarted, so the journal with the last_seq_now is not
written to the journal bucket, and the last_seq_wrote in the newest
journal is old than last_seq_now which we expect to be, so when we doing
replay, journals from last_seq_wrote to last_seq_now are missing.
It's hard to write a journal immediately after journal_reclaim(),
and it harmless if those missed journal are caused by discarding
since those journals are already wrote to btree node. So, if miss
seqs are started from the beginning journal, we treat it as normal,
and only print a message to show the miss journal, and point out
it maybe caused by discarding.
Patch v2 add a judgement condition to ignore the missed journal
only when discard enabled as Coly suggested.
(Coly Li: rebase the patch with other changes in bch_journal_replay())
Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Tested-by: Dennis Schridde <devurandom@gmx.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 68d10e6979a3b59e3cd2e90bfcafed79c4cf180a ]
When failure happens inside bch_journal_replay(), calling
cache_set_err_on() and handling the failure in async way is not a good
idea. Because after bch_journal_replay() returns, registering code will
continue to execute following steps, and unregistering code triggered
by cache_set_err_on() is running in same time. First it is unnecessary
to handle failure and unregister cache set in an async way, second there
might be potential race condition to run register and unregister code
for same cache set.
So in this patch, if failure happens in bch_journal_replay(), we don't
call cache_set_err_on(), and just print out the same error message to
kernel message buffer, then return -EIO immediately caller. Then caller
can detect such failure and handle it in synchrnozied way.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
CACHE_SYNC branch of run_cache_set
[ Upstream commit 95f18c9d1310730d075499a75aaf13bcd60405a7 ]
In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used
to collect journal_replay(s) and filled by bch_journal_read().
If all goes well, bch_journal_replay() will release the list of
jounal_replay(s) at the end of the branch.
If something goes wrong, code flow will jump to the label "err:" and leave
the list unreleased.
This patch will release the list of journal_replay(s) in the case of
error detected.
v1 -> v2:
* Move the release code to the location after label 'err:' to
simply the change.
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 7bf7eac8d648057519adb6fce1e31458c902212c upstream.
Pankaj reports that starting with commit ad428cdb525a "dax: Check the
end of the block-device capacity with dax_direct_access()" device-mapper
no longer allows dax operation. This results from the stricter checks in
__bdev_dax_supported() that validate that the start and end of a
block-device map to the same 'pagemap' instance.
Teach the dax-core and device-mapper to validate the 'pagemap' on a
per-target basis. This is accomplished by refactoring the
bdev_dax_supported() internals into generic_fsdax_supported() which
takes a sector range to validate. Consequently generic_fsdax_supported()
is suitable to be used in a device-mapper ->iterate_devices() callback.
A new ->dax_supported() operation is added to allow composite devices to
split and route upper-level bdev_dax_supported() requests.
Fixes: ad428cdb525a ("dax: Check the end of the block-device...")
Cc: <stable@vger.kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b2176a1dfb518d870ee073445d27055fea64dfb8 upstream.
The problem is that any 'uptodate' vs 'disks' check is not precise
in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the
device that might try to kick off writes and then skip the action.
Better to prevent the raid driver from taking unexpected action *and* keep
the system alive vs killing the machine with BUG_ON.
Note: fixed warning reported by kbuild test robot <lkp@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a25d8c327bb41742dbd59f8c545f59f3b9c39983 upstream.
This reverts commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef.
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Nigel Croxon <ncroxon@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 51b86f9a8d1c4bb4e3862ee4b4c5f46072f7520d upstream.
Commit 61697a6abd24 ("dm: eliminate 'split_discard_bios' flag from DM
target interface") incorrectly removed code from
__send_changing_extent_only() that is required to impose a per-target IO
boundary on IO that exceeds max_io_len_target_boundary(). Otherwise
"special" IO (e.g. DISCARD, WRITE SAME, WRITE ZEROES) can write beyond
where allowed.
Fix this by restoring the max_io_len_target_boundary() limit in
__send_changing_extent_only()
Fixes: 61697a6abd24 ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Michael Lass <bevan@bi-co.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 940bc471780b004a5277c1931f52af363c2fc9da upstream.
Commit b592211c33f7 ("dm mpath: fix attached_handler_name leak and
dangling hw_handler_name pointer") fixed a memory leak for the case
where setup_scsi_dh() returns failure. But setup_scsi_dh may return
success and not "use" attached_handler_name if the
retain_attached_hwhandler flag is not set on the map. As setup_scsi_sh
properly "steals" the pointer by nullifying it, freeing it
unconditionally in parse_path() is safe.
Fixes: b592211c33f7 ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer")
Cc: stable@vger.kernel.org
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0f41fcf78849c902ddca564f99a8e23ccfc80333 upstream.
The dm_early_create() function (which deals with "dm-mod.create=" kernel
command line option) calls dm_hash_insert() who gets an extra reference
to the md object.
In case of failure, this reference wasn't being released, causing
dm_destroy() to hang, thus hanging the whole boot process.
Fix this by calling __hash_remove() in the error path.
Fixes: 6bbc923dfcf57d ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 30bba430ddf737978e40561198693ba91386dac1 upstream.
When we use separate devices for data and metadata, dm-integrity would
incorrectly calculate the size of the metadata device as if it had
512-byte block size - and it would refuse activation with larger block
size and smaller metadata device.
Fix this so that it takes actual block size into account, which fixes
the following reported issue:
https://gitlab.com/cryptsetup/cryptsetup/issues/450
Fixes: 356d9d52e122 ("dm integrity: allow separate metadata device")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7a1cd7238fde6ab367384a4a2998cba48330c398 upstream.
The information about tag size should not be printed without debug info
set. Also print device major:minor in the error message to identify the
device instance.
Also use rate limiting and debug level for info about used crypto API
implementaton. This is important because during online reencryption
the existing message saturates syslog (because we are moving hotzone
across the whole device).
Cc: stable@vger.kernel.org
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 81bc6d150ace6250503b825d9d0c10f7bbd24095 upstream.
When the target line contains an invalid device, delay_ctr() will call
delay_dtr() with NULL workqueue. Attempting to destroy the NULL
workqueue causes a crash.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8e890c1ab1b1e0f765cd8da82c4dee011698a5e8 upstream.
dm-init should allow up to DM_MAX_{DEVICES,TARGETS} for devices/targets,
and not DM_MAX_{DEVICES,TARGETS} - 1.
Fix the checks and also fix the error message when the number of devices
is surpassed.
Fixes: 6bbc923dfcf57d ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7aedf75ff740a98f3683439449cd91c8662d03b2 upstream.
The function blkdev_report_zones() returns success even if no zone
information is reported (empty report). Empty zone reports can only
happen if the report start sector passed exceeds the device capacity.
The conditions for this to happen are either a bug in the caller code,
or, a change in the device that forced the low level driver to change
the device capacity to a value that is lower than the report start
sector. This situation includes a failed disk revalidation resulting in
the disk capacity being changed to 0.
If this change happens while dm-zoned is in its initialization phase
executing dmz_init_zones(), this function may enter an infinite loop
and hang the system. To avoid this, add a check to disallow empty zone
reports and bail out early. Also fix the function dmz_update_zone() to
make sure that the report for the requested zone was correctly obtained.
Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Shaun Tancheff <shaun@tancheff.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e28adc3bf34e434b30e8d063df4823ba0f3e0529 upstream.
Add missing dm_bitset_cursor_next() to properly advance the bitset
cursor.
Otherwise, the discarded state of all blocks is set according to the
discarded state of the first block.
Fixes: ae4a46a1f6 ("dm cache metadata: use bitset cursor api to load discard bitset")
Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ed4d0a4ea11e19863952ac6a7cea3bbb27ccd452 upstream.
The on-disk value is little endian and we need to convert it to
native endian before storing the value in the in-core structure.
Fixes: 7564beda19b36 ("md-cluster/raid10: support add disk under grow mode")
Cc: <stable@vger.kernel.org> # 4.20+
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ee37e62191a59d253fc916b9fc763deb777211e2 upstream.
When doing re-add, we need to ensure rdev->mddev->pers is not NULL,
which can avoid potential NULL pointer derefence in fallowing
add_bound_rdev().
Fixes: a6da4ef85cef ("md: re-add a failed disk")
Cc: Xiao Ni <xni@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: <stable@vger.kernel.org> # 4.4+
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2bc13b83e6298486371761de503faeffd15b7534 upstream.
Currently if many flush requests are submitted to an md device is quick
succession, they are serialized and can take a long to process them all.
We don't really need to call flush all those times - a single flush call
can satisfy all requests submitted before it started.
So keep track of when the current flush started and when it finished,
allow any pending flush that was requested before the flush started
to complete without waiting any more.
Test results from Xiao:
Test is done on a raid10 device which is created by 4 SSDs. The tool is
dbench.
1. The latest linux stable kernel
Operation Count AvgLat MaxLat
--------------------------------------------------
Deltree 768 10.509 78.305
Flush 2078376 0.013 10.094
Close 21787697 0.019 18.821
LockX 96580 0.007 3.184
Mkdir 384 0.008 0.062
Rename 1255883 0.191 23.534
ReadX 46495589 0.020 14.230
WriteX 14790591 7.123 60.706
Unlink 5989118 0.440 54.551
UnlockX 96580 0.005 2.736
FIND_FIRST 10393845 0.042 12.079
SET_FILE_INFORMATION 2415558 0.129 10.088
QUERY_FILE_INFORMATION 4711725 0.005 8.462
QUERY_PATH_INFORMATION 26883327 0.032 21.715
QUERY_FS_INFORMATION 4929409 0.010 8.238
NTCreateX 29660080 0.100 53.268
Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs
max_latency=60.712 ms
2. With patch1 "Revert "MD: fix lock contention for flush bios""
Operation Count AvgLat MaxLat
--------------------------------------------------
Deltree 256 8.326 36.761
Flush 693291 3.974 180.269
Close 7266404 0.009 36.929
LockX 32160 0.006 0.840
Mkdir 128 0.008 0.021
Rename 418755 0.063 29.945
ReadX 15498708 0.007 7.216
WriteX 4932310 22.482 267.928
Unlink 1997557 0.109 47.553
UnlockX 32160 0.004 1.110
FIND_FIRST 3465791 0.036 7.320
SET_FILE_INFORMATION 805825 0.015 1.561
QUERY_FILE_INFORMATION 1570950 0.005 2.403
QUERY_PATH_INFORMATION 8965483 0.013 14.277
QUERY_FS_INFORMATION 1643626 0.009 3.314
NTCreateX 9892174 0.061 41.278
Throughput 345.009 MB/sec (sync open) 128 clients 128 procs
max_latency=267.939 m
3. With patch1 and patch2
Operation Count AvgLat MaxLat
--------------------------------------------------
Deltree 768 9.570 54.588
Flush 2061354 0.666 15.102
Close 21604811 0.012 25.697
LockX 95770 0.007 1.424
Mkdir 384 0.008 0.053
Rename 1245411 0.096 12.263
ReadX 46103198 0.011 12.116
WriteX 14667988 7.375 60.069
Unlink 5938936 0.173 30.905
UnlockX 95770 0.005 4.147
FIND_FIRST 10306407 0.041 11.715
SET_FILE_INFORMATION 2395987 0.048 7.640
QUERY_FILE_INFORMATION 4672371 0.005 9.291
QUERY_PATH_INFORMATION 26656735 0.018 19.719
QUERY_FS_INFORMATION 4887940 0.010 7.654
NTCreateX 29410811 0.059 28.551
Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs
max_latency=60.075 ms
Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4bc034d35377196c854236133b07730a777c4aba upstream.
This reverts commit 5a409b4f56d50b212334f338cb8465d65550cd85.
This patch has two problems.
1/ it make multiple calls to submit_bio() from inside a make_request_fn.
The bios thus submitted will be queued on current->bio_list and not
submitted immediately. As the bios are allocated from a mempool,
this can theoretically result in a deadlock - all the pool of requests
could be in various ->bio_list queues and a subsequent mempool_alloc
could block waiting for one of them to be released.
2/ It aims to handle a case when there are many concurrent flush requests.
It handles this by submitting many requests in parallel - all of which
are identical and so most of which do nothing useful.
It would be more efficient to just send one lower-level request, but
allow that to satisfy multiple upper-level requests.
Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1bee2addc0c8470c8aaa65ef0599eeae96dd88bc upstream.
In journal_reclaim() ja->cur_idx of each cache will be update to
reclaim available journal buckets. Variable 'int n' is used to count how
many cache is successfully reclaimed, then n is set to c->journal.key
by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
loop will write the jset data onto each cache.
The problem is, if all jouranl buckets on each cache is full, the
following code in journal_reclaim(),
529 for_each_cache(ca, c, iter) {
530 struct journal_device *ja = &ca->journal;
531 unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
532
533 /* No space available on this device */
534 if (next == ja->discard_idx)
535 continue;
536
537 ja->cur_idx = next;
538 k->ptr[n++] = MAKE_PTR(0,
539 bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
540 ca->sb.nr_this_dev);
541 }
542
543 bkey_init(k);
544 SET_KEY_PTRS(k, n);
If there is no available bucket to reclaim, the if() condition at line
534 will always true, and n remains 0. Then at line 544, SET_KEY_PTRS()
will set KEY_PTRS field of c->journal.key to 0.
Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in
journal_write_unlocked() the journal data is written in following loop,
649 for (i = 0; i < KEY_PTRS(k); i++) {
650-671 submit journal data to cache device
672 }
If KEY_PTRS field is set to 0 in jouranl_reclaim(), the journal data
won't be written to cache device here. If system crahed or rebooted
before bkeys of the lost journal entries written into btree nodes, data
corruption will be reported during bcache reload after rebooting the
system.
Indeed there is only one cache in a cache set, there is no need to set
KEY_PTRS field in journal_reclaim() at all. But in order to keep the
for_each_cache() logic consistent for now, this patch fixes the above
problem by not setting 0 KEY_PTRS of journal key, if there is no bucket
available to reclaim.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a4b732a248d12cbdb46999daf0bf288c011335eb upstream.
There is a race between cache device register and cache set unregister.
For an already registered cache device, register_bcache will call
bch_is_open to iterate through all cachesets and check every cache
there. The race occurs if cache_set_free executes at the same time and
clears the caches right before ca is dereferenced in bch_is_open_cache.
To close the race, let's make sure the clean up work is protected by
the bch_register_lock as well.
This issue can be reproduced as follows,
while true; do echo /dev/XXX> /sys/fs/bcache/register ; done&
while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done &
and results in the following oops,
[ +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998
[ +0.000457] #PF error: [normal kernel read fault]
[ +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0
[ +0.000388] Oops: 0000 [#1] SMP PTI
[ +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6
[ +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
[ +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache]
[ +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d
[ +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202
[ +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000
[ +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001
[ +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a
[ +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0
[ +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620
[ +0.000384] FS: 00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000
[ +0.000420] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 CR4: 00000000001406e0
[ +0.000837] Call Trace:
[ +0.000682] ? _cond_resched+0x10/0x20
[ +0.000691] ? __kmalloc+0x131/0x1b0
[ +0.000710] kernfs_fop_write+0xfa/0x170
[ +0.000733] __vfs_write+0x2e/0x190
[ +0.000688] ? inode_security+0x10/0x30
[ +0.000698] ? selinux_file_permission+0xd2/0x120
[ +0.000752] ? security_file_permission+0x2b/0x100
[ +0.000753] vfs_write+0xa8/0x1a0
[ +0.000676] ksys_write+0x4d/0xb0
[ +0.000699] do_syscall_64+0x3a/0xf0
[ +0.000692] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef upstream.
Changing state from check_state_check_result to
check_state_compute_result not only is unsafe but also doesn't
appear to serve a valid purpose. A raid6 check should only be
pushing out extra writes if doing repair and a mis-match occurs.
The stripe dev management will already try and do repair writes
for failing sectors.
This patch makes the raid6 check_state_check_result handling
work more like raid5's. If somehow too many failures for a
check, just quit the check operation for the stripe. When any
checks pass, don't try and use check_state_compute_result for
a purpose it isn't needed for and is unsafe for. Just mark the
stripe as in sync for passing its parity checks and let the
stripe dev read/write code and the bad blocks list do their
job handling I/O errors.
Repro steps from Xiao:
These are the steps to reproduce this problem:
1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c
2. insmod scsi_debug.ko dev_size_mb=11000 max_luns=1 num_tgts=1
3. mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
sde is the disk created by scsi_debug
4. echo "2" >/sys/module/scsi_debug/parameters/opts
5. raid-check
It panic:
[ 4854.730899] md: data-check of RAID array md127
[ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current]
[ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error
[ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00
[ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0
[ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current]
[ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error
[ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00
[ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000
[ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current]
[ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error
[ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00
[ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000
[ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current]
[ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error
[ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00
[ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0
[ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1).
[ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1).
[ 4854.906201] ------------[ cut here ]------------
[ 4854.907341] kernel BUG at drivers/md/raid5.c:4190!
raid5.c:4190 above is this BUG_ON:
handle_parity_checks6()
...
BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
Cc: <stable@vger.kernel.org> # v3.16+
OriginalAuthor: David Jeffery <djeffery@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
Tested-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: David Jeffy <djeffery@redhat.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
dm-integrity will deadlock if overlapping I/O is issued to it, the bug
was introduced by commit 724376a04d1a ("dm integrity: implement fair
range locks"). Users rarely use overlapping I/O so this bug went
undetected until now.
Fix this bug by correcting, likely cut-n-paste, typos in
ranges_overlap() and also remove a flawed ranges_overlap() check in
remove_range_unlocked(). This condition could leave unprocessed bios
hanging on wait_list forever.
Cc: stable@vger.kernel.org # v4.19+
Fixes: 724376a04d1a ("dm integrity: implement fair range locks")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Storage devices which report supporting discard commands like
WRITE_SAME_16 with unmap, but reject discard commands sent to the
storage device. This is a clear storage firmware bug but it doesn't
change the fact that should a program cause discards to be sent to a
multipath device layered on this buggy storage, all paths can end up
failed at the same time from the discards, causing possible I/O loss.
The first discard to a path will fail with Illegal Request, Invalid
field in cdb, e.g.:
kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
kernel: blk_update_request: critical target error, dev sdfn, sector 10487808
The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
device disables its support for discard on this path, and because of the
BLK_STS_TARGET error multipath fails the discard without failing any
path or retrying down a different path. But subsequent discards can
cause path failures. Any discards sent to the path which already failed
a discard ends up failing with EIO from blk_cloned_rq_check_limits with
an "over max size limit" error since the discard limit was set to 0 by
the sd driver for the path. As the error is EIO, this now fails the
path and multipath tries to send the discard down the next path. This
cycle continues as discards are sent until all paths fail.
Fix this by training DM core to disable DISCARD if the underlying
storage already did so.
Also, fix branching in dm_done() and clone_endio() to reflect the
mutually exclussive nature of the IO operations in question.
Cc: stable@vger.kernel.org
Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Some devices don't use blk_integrity but still want stable pages
because they do their own checksumming. Examples include rbd and iSCSI
when data digests are negotiated. Stacking DM (and thus LVM) on top of
these devices results in sporadic checksum errors.
Set BDI_CAP_STABLE_WRITES if any underlying device has it set.
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
PAGE_SIZE")
The limit was already incorporated to dm-crypt with commit 4e870e948fba
("dm crypt: fix error with too large bios"), so we don't need to apply
it globally to all targets. The quantity BIO_MAX_PAGES * PAGE_SIZE is
wrong anyway because the variable ti->max_io_len it is supposed to be in
the units of 512-byte sectors not in bytes.
Reduction of the limit to 1048576 sectors could even cause data
corruption in rare cases - suppose that we have a dm-striped device with
stripe size 768MiB. The target will call dm_set_target_max_io_len with
the value 1572864. The buggy code would reduce it to 1048576. Now, the
dm-core will errorneously split the bios on 1048576-sector boundary
insetad of 1572864-sector boundary and pass these stripe-crossing bios
to the striped target.
Cc: stable@vger.kernel.org # v4.16+
Fixes: 8f50e358153d ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
A non const pointer to const cannot be marked initconst.
Mark the array actually const.
Fixes: 6bbc923dfcf5 dm: add support to directly boot to a mapped device
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|