summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2017-01-27blk-mq-sched: fix starvation for multiple hardware queues and shared tagsJens Axboe5-7/+41
If we have both multiple hardware queues and shared tag map between devices, we need to ensure that we propagate the hardware queue restart bit higher up. This is because we can get into a situation where we don't have any IO pending on a hardware queue, yet we fail getting a tag to start new IO. If that happens, it's not enough to mark the hardware queue as needing a restart, we need to bubble that up to the higher level queue as well. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27blk-mq: release driver tag on a requeue eventJens Axboe1-0/+16
We don't want to hold on to this resource when we have a scheduler attached. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27blk-mq: fix potential race in queue restart and driver tag allocationJens Axboe1-1/+9
Once we mark the queue as needing a restart, re-check if we can get a driver tag. This fixes a theoretical issue where the needed IO completes _after_ blk_mq_get_driver_tag() fails, but before we manage to set the restart bit. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27blk-mq: improve scheduler queue sync/async runningJens Axboe1-2/+4
We'll use the same criteria for whether we need to run the queue sync or async when we have a scheduler, as we do without one. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27blk-mq: move hctx and ctx counters from sysfs to debugfsOmar Sandoval2-64/+181
These counters aren't as out-of-place in sysfs as the other stuff, but debugfs is a slightly better home for them. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27blk-mq: move hctx io_poll, stats, and dispatched from sysfs to debugfsOmar Sandoval2-92/+132
These statistics _might_ be useful to userspace, but it's better not to commit to an ABI for these yet. Also, the dispatched file in sysfs couldn't be cleared, so make it clearable like the others in debugfs. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27blk-mq: add tags and sched_tags bitmaps to debugfsOmar Sandoval1-0/+50
These can be used to debug issues like tag leaks and stuck requests. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27blk-mq: move tags and sched_tags info from sysfs to debugfsOmar Sandoval4-45/+86
These are very tied to the blk-mq tag implementation, so exposing them to sysfs isn't a great idea. Move the debugging information to debugfs and add basic entries for the number of tags and the number of reserved tags to sysfs. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27blk-mq: export software queue pending map to debugfsOmar Sandoval1-0/+21
This is useful for debugging problems where we've gotten stuck with requests in the software queues. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27sbitmap: add helpers for dumping to a seq_fileOmar Sandoval2-0/+121
This is useful debugging information that will be used in the blk-mq debugfs directory. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Changed 'weight' to 'busy'. Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27blk-mq: add extra request information to debugfsOmar Sandoval1-1/+3
The request pointers by themselves aren't super useful. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27blk-mq: move hctx->dispatch and ctx->rq_list from sysfs to debugfsOmar Sandoval2-57/+106
These lists are only useful for debugging; they definitely don't belong in sysfs. Putting them in debugfs also removes the limitation of a single page of output. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27blk-mq: add hctx->{state,flags} to debugfsOmar Sandoval1-0/+42
hctx->state could come in handy for bugs where the hardware queue gets stuck in the stopped state, and hctx->flags is just useful to know. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27blk-mq: create debugfs directory treeOmar Sandoval6-0/+201
In preparation for putting blk-mq debugging information in debugfs, create a directory tree mirroring the one in sysfs: # tree -d /sys/kernel/debug/block /sys/kernel/debug/block |-- nvme0n1 | `-- mq | |-- 0 | | `-- cpu0 | |-- 1 | | `-- cpu1 | |-- 2 | | `-- cpu2 | `-- 3 | `-- cpu3 `-- vda `-- mq `-- 0 |-- cpu0 |-- cpu1 |-- cpu2 `-- cpu3 Also add the scaffolding for the actual files that will go in here, either under the hardware queue or software queue directories. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27blk-mq-sched: check for successful allocation before assigning tagJens Axboe1-1/+2
We don't trigger this from the normal IO path, since we always use blocking allocations from there. But Bart saw it testing multipath dm, since that is a heavy user of atomic request allocations in the map and clone path. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-26blk-mq: don't lose flags passed in to blk_mq_alloc_request()Jens Axboe2-4/+4
If we come in from blk_mq_alloc_requst() with NOWAIT set in flags, we must ensure that we don't later overwrite that in blk_mq_sched_get_request(). Initialize alloc_data->flags before passing it in. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-25blk-mq: only apply active queue tag throttling for driver tagsJens Axboe2-10/+15
If we have a scheduler attached, we have two sets of tags. We don't want to apply our active queue throttling for the scheduler side of tags, that only applies to driver tags since that's the resource we need to dispatch an IO. Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-23cfq-iosched: Adjust one function call together with a variable assignmentMarkus Elfring1-2/+4
The script "checkpatch.pl" pointed information out like the following. ERROR: do not use assignment in if condition Thus fix the affected source code place. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-23blk-throttle: Adjust two function calls together with a variable assignmentMarkus Elfring1-2/+4
The script "checkpatch.pl" pointed information out like the following. ERROR: do not use assignment in if condition Thus fix the affected source code places. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-23block: Initialize cfqq->ioprio_class in cfq_get_queue()Alexander Potapenko1-0/+2
KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of uninitialized memory in cfq_init_cfqq(): ================================================================== BUG: KMSAN: use of unitialized memory ... Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff8202ac97>] dump_stack+0x157/0x1d0 lib/dump_stack.c:51 [<ffffffff813e9b65>] kmsan_report+0x205/0x360 ??:? [<ffffffff813eabbb>] __msan_warning+0x5b/0xb0 ??:? [< inline >] cfq_init_cfqq block/cfq-iosched.c:3754 [<ffffffff8201e110>] cfq_get_queue+0xc80/0x14d0 block/cfq-iosched.c:3857 ... origin: [<ffffffff8103ab37>] save_stack_trace+0x27/0x50 arch/x86/kernel/stacktrace.c:67 [<ffffffff813e836b>] kmsan_internal_poison_shadow+0xab/0x150 ??:? [<ffffffff813e88ab>] kmsan_poison_slab+0xbb/0x120 ??:? [< inline >] allocate_slab mm/slub.c:1627 [<ffffffff813e533f>] new_slab+0x3af/0x4b0 mm/slub.c:1641 [< inline >] new_slab_objects mm/slub.c:2407 [<ffffffff813e0ef3>] ___slab_alloc+0x323/0x4a0 mm/slub.c:2564 [< inline >] __slab_alloc mm/slub.c:2606 [< inline >] slab_alloc_node mm/slub.c:2669 [<ffffffff813dfb42>] kmem_cache_alloc_node+0x1d2/0x1f0 mm/slub.c:2746 [<ffffffff8201d90d>] cfq_get_queue+0x47d/0x14d0 block/cfq-iosched.c:3850 ... ================================================================== (the line numbers are relative to 4.8-rc6, but the bug persists upstream) The uninitialized struct cfq_queue is created by kmem_cache_alloc_node() and then passed to cfq_init_cfqq(), which accesses cfqq->ioprio_class before it's initialized. Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-20blk-mq: allow resize of scheduler requestsJens Axboe3-13/+61
Add support for growing the tags associated with a hardware queue, for the scheduler tags. Currently we only support resizing within the limits of the original depth, change that so we can grow it as well by allocating and replacing the existing scheduler tag set. This is similar to how we could increase the software queue depth with the legacy IO stack and schedulers. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-19blk-mq: stop hardware queue in blk_mq_delay_queue()Jens Axboe1-0/+1
The run handler we register for the delayed work requires that the queue be stopped, yet we leave that up to the caller. Let's move it into blk_mq_delay_queue() itself, so that the API is sane. This fixes a stall with SCSI, where it calls blk_mq_delay_queue() without having stopped the queue. Hence the queue is never run. Reported-by: Hannes Reinecke <hare@suse.com> Fixes: 70f4db639c5b ("blk-mq: add blk_mq_delay_queue") Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-19blk-mq-tag: remove redundant check for 'data->hctx' being non-NULLJens Axboe1-4/+2
We used to pass in NULL for hctx for reserved tags, but we don't do that anymore. Hence the check for whether hctx is NULL or not is now redundant, kill it. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: a642a158aec6 ("blk-mq-tag: cleanup the normal/reserved tag allocation") Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-19elevator: fix unnecessary put of elevator in failure caseJens Axboe1-4/+0
We already checked that e is NULL, so no point in calling elevator_put() to free it. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: dc877dbd088f ("blk-mq-sched: add framework for MQ capable IO schedulers") Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-19blk-cgroup: don't quiesce the queue on policy activate/deactivateJens Axboe1-12/+8
There's no potential harm in quiescing the queue, but it also doesn't buy us anything. And we can't run the queue async for policy deactivate, since we could be in the path of tearing the queue down. If we schedule an async run of the queue at that time, we're racing with queue teardown AFTER having we've already torn most of it down. Reported-by: Omar Sandoval <osandov@fb.com> Fixes: 4d199c6f1c84 ("blk-cgroup: ensure that we clear the stop bit on quiesced queues") Tested-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-18sbitmap: fix wakeup hang after sbq resizeOmar Sandoval1-5/+30
When we resize a struct sbitmap_queue, we update the wakeup batch size, but we don't update the wait count in the struct sbq_wait_states. If we resized down from a size which could use a bigger batch size, these counts could be too large and cause us to miss necessary wakeups. To fix this, update the wait counts when we resize (ensuring some careful memory ordering so that it's safe w.r.t. concurrent clears). This also fixes a theoretical issue where two threads could end up bumping the wait count up by the batch size, which could also potentially lead to hangs. Reported-by: Martin Raiber <martin@urbackup.org> Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs") Fixes: 2971c35f3588 ("blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-18sbitmap: use smp_mb__after_atomic() in sbq_wake_up()Omar Sandoval1-3/+10
We always do an atomic clear_bit() right before we call sbq_wake_up(), so we can use smp_mb__after_atomic(). While we're here, comment the memory barriers in here a little more. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-18blk-cgroup: ensure that we clear the stop bit on quiesced queuesJens Axboe1-4/+6
If we call blk_mq_quiesce_queue() on a queue, we must remember to pair that with something that clears the stopped by on the queues later on. Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-17blk-mq-sched: allow setting of default IO schedulerJens Axboe7-7/+89
Add Kconfig entries to manage what devices get assigned an MQ scheduler, and add a blk-mq flag for drivers to opt out of scheduling. The latter is useful for admin type queues that still allocate a blk-mq queue and tag set, but aren't use for normal IO. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17mq-deadline: add blk-mq adaptation of the deadline IO schedulerJens Axboe3-0/+560
This is basically identical to deadline-iosched, except it registers as a MQ capable scheduler. This is still a single queue design. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq-sched: add framework for MQ capable IO schedulersJens Axboe17-194/+984
This adds a set of hooks that intercepts the blk-mq path of allocating/inserting/issuing/completing requests, allowing us to develop a scheduler within that framework. We reuse the existing elevator scheduler API on the registration side, but augment that with the scheduler flagging support for the blk-mq interfce, and with a separate set of ops hooks for MQ devices. We split driver and scheduler tags, so we can run the scheduling independently of device queue depth. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq: split tag ->rqs[] into twoJens Axboe3-9/+26
This is in preparation for having two sets of tags available. For that we need a static index, and a dynamically assignable one. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq: add support for carrying internal tag information in blk_qc_tJens Axboe2-8/+25
No functional change in this patch, just in preparation for having two types of tags available to the block layer for a single request. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq: abstract out helpers for allocating/freeing tag mapsJens Axboe2-48/+83
Prep patch for adding an extra tag map for scheduler requests. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq-tag: cleanup the normal/reserved tag allocationJens Axboe4-61/+44
This is in preparation for having another tag set available. Cleanup the parameters, and allow passing in of tags for blk_mq_put_tag(). Signed-off-by: Jens Axboe <axboe@fb.com> [hch: even more cleanups] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq: export some helpers we need to the scheduling frameworkJens Axboe2-18/+46
Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17blk-mq: un-export blk_mq_free_hctx_request()Jens Axboe2-4/+2
It's only used in blk-mq, kill it from the main exported header and kill the symbol export as well. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17block: move rq_ioc() to blk.hJens Axboe2-16/+16
We want to use it outside of blk-core.c. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17block: move existing elevator ops to unionJens Axboe8-45/+47
Prep patch for adding MQ ops as well, since doing anon unions with named initializers doesn't work on older compilers. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17partitions/efi: Fix integer overflow in GPT size calculationAlden Tondettar1-5/+12
If a GUID Partition Table claims to have more than 2**25 entries, the calculation of the partition table size in alloc_read_gpt_entries() will overflow a 32-bit integer and not enough space will be allocated for the table. Nothing seems to get written out of bounds, but later efi_partition() will read up to 32768 bytes from a 128 byte buffer, possibly OOPSing or exposing information to /proc/partitions and uevents. The problem exists on both 64-bit and 32-bit platforms. Fix the overflow and also print a meaningful debug message if the table size is too large. Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-12MAINTAINERS: Update maintainer entry for NBDJosef Bacik1-2/+2
The old maintainers email is bouncing and I've rewritten most of this driver in the recent months. Also add linux-block to the mailinglist and remove the old tree, I will send patches through the linux-block tree. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-12blk-mq: make mq_ops a const pointerJens Axboe3-3/+3
We never change it, make that clear. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
2017-01-12block: relax check on sg gapMing Lei1-1/+21
If the last bvec of the 1st bio and the 1st bvec of the next bio are physically contigious, and the latter can be merged to last segment of the 1st bio, we should think they don't violate sg gap(or virt boundary) limit. Both Vitaly and Dexuan reported lots of unmergeable small bios are observed when running mkfs on Hyper-V virtual storage, and performance becomes quite low. This patch fixes that performance issue. The same issue should exist on NVMe, since it sets virt boundary too. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reported-by: Dexuan Cui <decui@microsoft.com> Tested-by: Dexuan Cui <decui@microsoft.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-12floppy: replace wrong kmalloc(GFP_USER) with GFP_KERNELVlastimil Babka1-1/+1
The raw_cmd_copyin() function does a kmalloc() with GFP_USER, although the allocated structure is obviously not mapped to userspace, just copied from/to. In this case GFP_KERNEL is more appropriate, so let's use it, although in the current implementation this does not manifest as any error. Reported-by: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-01-09Linux 4.10-rc3v4.10-rc3Linus Torvalds1-1/+1
2017-01-08Merge tag 'usb-4.10-rc3' of ↵Linus Torvalds46-316/+550
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are a bunch of USB fixes for 4.10-rc3. Yeah, it's a lot, an artifact of the holiday break I think. Lots of gadget and the usual XHCI fixups for reported issues (one day that driver will calm down...) Also included are a bunch of usb-serial driver fixes, and for good measure, a number of much-reported MUSB driver issues have finally been resolved. All of these have been in linux-next with no reported issues" * tag 'usb-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (72 commits) USB: fix problems with duplicate endpoint addresses usb: ohci-at91: use descriptor-based gpio APIs correctly usb: storage: unusual_uas: Add JMicron JMS56x to unusual device usb: hub: Move hub_port_disable() to fix warning if PM is disabled usb: musb: blackfin: add bfin_fifo_offset in bfin_ops usb: musb: fix compilation warning on unused function usb: musb: Fix trying to free already-free IRQ 4 usb: musb: dsps: implement clear_ep_rxintr() callback usb: musb: core: add clear_ep_rxintr() to musb_platform_ops USB: serial: ti_usb_3410_5052: fix NULL-deref at open USB: serial: spcp8x5: fix NULL-deref at open USB: serial: quatech2: fix sleep-while-atomic in close USB: serial: pl2303: fix NULL-deref at open USB: serial: oti6858: fix NULL-deref at open USB: serial: omninet: fix NULL-derefs at open and disconnect USB: serial: mos7840: fix misleading interrupt-URB comment USB: serial: mos7840: remove unused write URB USB: serial: mos7840: fix NULL-deref at open USB: serial: mos7720: remove obsolete port initialisation USB: serial: mos7720: fix parallel probe ...
2017-01-08Merge tag 'char-misc-4.10-rc3' of ↵Linus Torvalds6-19/+24
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc fixes from Greg KH: "Here are a few small char/misc driver fixes for 4.10-rc3. Two MEI driver fixes, and three NVMEM patches for reported issues, and a new Hyper-V driver MAINTAINER update. Nothing major at all, all have been in linux-next with no reported issues" * tag 'char-misc-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: hyper-v: Add myself as additional MAINTAINER nvmem: fix nvmem_cell_read() return type doc nvmem: imx-ocotp: Fix wrong register size nvmem: qfprom: Allow single byte accesses for read/write mei: move write cb to completion on credentials failures mei: bus: fix mei_cldev_enable KDoc
2017-01-08Merge tag 'staging-4.10-rc3' of ↵Linus Torvalds10-30/+56
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging/IIO fixes from Greg KH: "Here are some staging and IIO driver fixes for 4.10-rc3. Most of these are minor IIO fixes of reported issues, along with one network driver fix to resolve an issue. And a MAINTAINERS update with a new mailing list. All of these, except the MAINTAINERS file update, have been in linux-next with no reported issues (the MAINTAINERS patch happened on Friday...)" * tag 'staging-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: MAINTAINERS: add greybus subsystem mailing list staging: octeon: Call SET_NETDEV_DEV() iio: accel: st_accel: fix LIS3LV02 reading and scaling iio: common: st_sensors: fix channel data parsing iio: max44000: correct value in illuminance_integration_time_available iio: adc: TI_AM335X_ADC should depend on HAS_DMA iio: bmi160: Fix time needed to sleep after command execution iio: 104-quad-8: Fix active level mismatch for the preset enable option iio: 104-quad-8: Fix off-by-one errors when addressing IOR iio: 104-quad-8: Fix index control configuration
2017-01-08mm: workingset: fix use-after-free in shadow node shrinkerJohannes Weiner3-4/+14
Several people report seeing warnings about inconsistent radix tree nodes followed by crashes in the workingset code, which all looked like use-after-free access from the shadow node shrinker. Dave Jones managed to reproduce the issue with a debug patch applied, which confirmed that the radix tree shrinking indeed frees shadow nodes while they are still linked to the shadow LRU: WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200 CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3 Call Trace: delete_node+0x1e4/0x200 __radix_tree_delete_node+0xd/0x10 shadow_lru_isolate+0xe6/0x220 __list_lru_walk_one.isra.4+0x9b/0x190 list_lru_walk_one+0x23/0x30 scan_shadow_nodes+0x2e/0x40 shrink_slab.part.44+0x23d/0x5d0 shrink_node+0x22c/0x330 kswapd+0x392/0x8f0 This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the inlined radix_tree_shrink(). The problem is with 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking"), which passes an update callback into the radix tree to link and unlink shadow leaf nodes when tree entries change, but forgot to pass the callback when reclaiming a shadow node. While the reclaimed shadow node itself is unlinked by the shrinker, its deletion from the tree can cause the left-most leaf node in the tree to be shrunk. If that happens to be a shadow node as well, we don't unlink it from the LRU as we should. Consider this tree, where the s are shadow entries: root->rnode | [0 n] | | [s ] [sssss] Now the shadow node shrinker reclaims the rightmost leaf node through the shadow node LRU: root->rnode | [0 ] | [s ] Because the parent of the deleted node is the first level below the root and has only one child in the left-most slot, the intermediate level is shrunk and the node containing the single shadow is put in its place: root->rnode | [s ] The shrinker again sees a single left-most slot in a first level node and thus decides to store the shadow in root->rnode directly and free the node - which is a leaf node on the shadow node LRU. root->rnode | s Without the update callback, the freed node remains on the shadow LRU, where it causes later shrinker runs to crash. Pass the node updater callback into __radix_tree_delete_node() in case the deletion causes the left-most branch in the tree to collapse too. Also add warnings when linked nodes are freed right away, rather than wait for the use-after-free when the list is scanned much later. Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking") Reported-by: Dave Chinner <david@fromorbit.com> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Leech <cleech@redhat.com> Cc: Lee Duncan <lduncan@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-08mm: stop leaking PageTablesHugh Dickins1-27/+20
4.10-rc loadtest (even on x86, and even without THPCache) fails with "fork: Cannot allocate memory" or some such; and /proc/meminfo shows PageTables growing. Commit 953c66c2b22a ("mm: THP page cache support for ppc64") that got merged in rc1 removed the freeing of an unused preallocated pagetable after do_fault_around() has called map_pages(). This is usually a good optimization, so that the followup doesn't have to reallocate one; but it's not sufficient to shift the freeing into alloc_set_pte(), since there are failure cases (most commonly VM_FAULT_RETRY) which never reach finish_fault(). Check and free it at the outer level in do_fault(), then we don't need to worry in alloc_set_pte(), and can restore that to how it was (I cannot find any reason to pte_free() under lock as it was doing). And fix a separate pagetable leak, or crash, introduced by the same change, that could only show up on some ppc64: why does do_set_pmd()'s failure case attempt to withdraw a pagetable when it never deposited one, at the same time overwriting (so leaking) the vmf->prealloc_pte? Residue of an earlier implementation, perhaps? Delete it. Fixes: 953c66c2b22a ("mm: THP page cache support for ppc64") Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Neuling <mikey@neuling.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>