summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2020-10-10io_uring: keep a pointer ref_node in file_dataPavel Begunkov1-8/+6
->cur_refs of struct fixed_file_data always points to percpu_ref embedded into struct fixed_file_ref_node. Don't overuse container_of() and offsetting, and point directly to fixed_file_ref_node. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: refactor *files_register()'s error pathsPavel Begunkov1-46/+32
Don't keep repeating cleaning sequences in error paths, write it once in the and use labels. It's less error prone and looks cleaner. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: clean file_data access in files_registerPavel Begunkov1-36/+33
Keep file_data in a local var and replace with it complex references such as ctx->file_data. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: don't delay io_init_req() error checkPavel Begunkov1-2/+1
Don't postpone io_init_req() error checks and do that right after calling it. There is no control-flow statements or dependencies with sqe/submitted accounting, so do those earlier, that makes the code flow a bit more natural. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: clean leftovers after splitting issuePavel Begunkov1-8/+6
Kill extra if in io_issue_sqe() and place send/recv[msg] calls appropriately under switch's cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: remove timeout.list after hrtimer cancelPavel Begunkov1-9/+2
Remove timeouts from ctx->timeout_list after hrtimer_try_to_cancel() successfully cancels it. With this we don't need to care whether there was a race and it was removed in io_timeout_fn(), and that will be handy for following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: use a separate struct for timeout_removePavel Begunkov1-9/+9
Don't use struct io_timeout for both IORING_OP_TIMEOUT and IORING_OP_TIMEOUT_REMOVE, they're quite different. Split them in two, that allows to remove an unused field in struct io_timeout, and btw kill ->flags not used by either. This also easier to follow, especially for timeout remove. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: improve submit_state.ios_left accountingPavel Begunkov1-5/+5
state->ios_left isn't decremented for requests that don't need a file, so it might be larger than number of SQEs left. That in some circumstances makes us to grab more files that is needed so imposing extra put. Deaccount one ios_left for each request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: simplify io_file_get()Pavel Begunkov1-16/+14
Keep ->needs_file_no_error check out of io_file_get(), and let callers handle it. It makes it more straightforward. Also, as the only error it can hand back -EBADF, make it return a file or NULL. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: kill extra check in fixed io_file_get()Pavel Begunkov1-2/+1
ctx->nr_user_files == 0 IFF ctx->file_data == NULL and there fixed files are not used. Hence, verifying fds only against ctx->nr_user_files is enough. Remove the other check from hot path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: clean up ->files grabbingPavel Begunkov1-39/+13
Move work.files grabbing into io_prep_async_work() to all other work resources initialisation. We don't need to keep it separately now, as ->ring_fd/file are gone. It also allows to not grab it when a request is not going to io-wq. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10io_uring: don't io_prep_async_work() linked reqsPavel Begunkov1-3/+0
There is no real reason left for preparing io-wq work context for linked requests in advance, remove it as this might become a bottleneck in some cases. Reported-by: Roman Gershman <romger@amazon.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09io_uring: Convert advanced XArray uses to the normal APIMatthew Wilcox (Oracle)1-12/+2
There are no bugs here that I've spotted, it's just easier to use the normal API and there are no performance advantages to using the more verbose advanced API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09io_uring: Fix XArray usage in io_uring_add_task_fileMatthew Wilcox (Oracle)1-12/+9
The xas_store() wasn't paired with an xas_nomem() loop, so if it couldn't allocate memory using GFP_NOWAIT, it would leak the reference to the file descriptor. Also the node pointed to by the xas could be freed between the call to xas_load() under the rcu_read_lock() and the acquisition of the xa_lock. It's easier to just use the normal xa_load/xa_store interface here. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> [axboe: fix missing assign after alloc, cur_uring -> tctx rename] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09io_uring: Fix use of XArray in __io_uring_files_cancelMatthew Wilcox (Oracle)1-14/+5
We have to drop the lock during each iteration, so there's no advantage to using the advanced API. Convert this to a standard xa_for_each() loop. Reported-by: syzbot+27c12725d8ff0bfe1a13@syzkaller.appspotmail.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09io_uring: fix break condition for __io_uring_register() waitingJens Axboe1-3/+3
Colin reports that there's unreachable code, since we only ever break if ret == 0. This is correct, and is due to a reversed logic condition in when to break or not. Break out of the loop if we don't process any task work, in that case we do want to return -EINTR. Fixes: af9c1a44f8de ("io_uring: process task work in io_uring_register()") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08io_uring: no need to call xa_destroy() on empty xarrayJens Axboe1-1/+0
The kernel test robot reports this lockdep issue: [child1:659] mbind (274) returned ENOSYS, marking as inactive. [child1:659] mq_timedsend (279) returned ENOSYS, marking as inactive. [main] 10175 iterations. [F:7781 S:2344 HI:2397] [ 24.610601] [ 24.610743] ================================ [ 24.611083] WARNING: inconsistent lock state [ 24.611437] 5.9.0-rc7-00017-g0f2122045b9462 #5 Not tainted [ 24.611861] -------------------------------- [ 24.612193] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 24.612660] ksoftirqd/0/7 [HC0[0]:SC1[3]:HE0:SE0] takes: [ 24.613086] f00ed998 (&xa->xa_lock#4){+.?.}-{2:2}, at: xa_destroy+0x43/0xc1 [ 24.613642] {SOFTIRQ-ON-W} state was registered at: [ 24.614024] lock_acquire+0x20c/0x29b [ 24.614341] _raw_spin_lock+0x21/0x30 [ 24.614636] io_uring_add_task_file+0xe8/0x13a [ 24.614987] io_uring_create+0x535/0x6bd [ 24.615297] io_uring_setup+0x11d/0x136 [ 24.615606] __ia32_sys_io_uring_setup+0xd/0xf [ 24.615977] do_int80_syscall_32+0x53/0x6c [ 24.616306] restore_all_switch_stack+0x0/0xb1 [ 24.616677] irq event stamp: 939881 [ 24.616968] hardirqs last enabled at (939880): [<8105592d>] __local_bh_enable_ip+0x13c/0x145 [ 24.617642] hardirqs last disabled at (939881): [<81b6ace3>] _raw_spin_lock_irqsave+0x1b/0x4e [ 24.618321] softirqs last enabled at (939738): [<81b6c7c8>] __do_softirq+0x3f0/0x45a [ 24.618924] softirqs last disabled at (939743): [<81055741>] run_ksoftirqd+0x35/0x61 [ 24.619521] [ 24.619521] other info that might help us debug this: [ 24.620028] Possible unsafe locking scenario: [ 24.620028] [ 24.620492] CPU0 [ 24.620685] ---- [ 24.620894] lock(&xa->xa_lock#4); [ 24.621168] <Interrupt> [ 24.621381] lock(&xa->xa_lock#4); [ 24.621695] [ 24.621695] *** DEADLOCK *** [ 24.621695] [ 24.622154] 1 lock held by ksoftirqd/0/7: [ 24.622468] #0: 823bfb94 (rcu_callback){....}-{0:0}, at: rcu_process_callbacks+0xc0/0x155 [ 24.623106] [ 24.623106] stack backtrace: [ 24.623454] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 5.9.0-rc7-00017-g0f2122045b9462 #5 [ 24.624090] Call Trace: [ 24.624284] ? show_stack+0x40/0x46 [ 24.624551] dump_stack+0x1b/0x1d [ 24.624809] print_usage_bug+0x17a/0x185 [ 24.625142] mark_lock+0x11d/0x1db [ 24.625474] ? print_shortest_lock_dependencies+0x121/0x121 [ 24.625905] __lock_acquire+0x41e/0x7bf [ 24.626206] lock_acquire+0x20c/0x29b [ 24.626517] ? xa_destroy+0x43/0xc1 [ 24.626810] ? lock_acquire+0x20c/0x29b [ 24.627110] _raw_spin_lock_irqsave+0x3e/0x4e [ 24.627450] ? xa_destroy+0x43/0xc1 [ 24.627725] xa_destroy+0x43/0xc1 [ 24.627989] __io_uring_free+0x57/0x71 [ 24.628286] ? get_pid+0x22/0x22 [ 24.628544] __put_task_struct+0xf2/0x163 [ 24.628865] put_task_struct+0x1f/0x2a [ 24.629161] delayed_put_task_struct+0xe2/0xe9 [ 24.629509] rcu_process_callbacks+0x128/0x155 [ 24.629860] __do_softirq+0x1a3/0x45a [ 24.630151] run_ksoftirqd+0x35/0x61 [ 24.630443] smpboot_thread_fn+0x304/0x31a [ 24.630763] kthread+0x124/0x139 [ 24.631016] ? sort_range+0x18/0x18 [ 24.631290] ? kthread_create_worker_on_cpu+0x17/0x17 [ 24.631682] ret_from_fork+0x1c/0x28 which is complaining about xa_destroy() grabbing the xa lock in an IRQ disabling fashion, whereas the io_uring uses cases aren't interrupt safe. This is really an xarray issue, since it should not assume the lock type. But for our use case, since we know the xarray is empty at this point, there's no need to actually call xa_destroy(). So just get rid of it. Fixes: 0f2122045b94 ("io_uring: don't rely on weak ->files references") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-07io_uring: batch account ->req_issue and task struct referencesJens Axboe1-2/+5
Identical to how we handle the ctx reference counts, increase by the batch we're expecting to submit, and handle any slow path residual, if any. The request alloc-and-issue path is very hot, and this makes a noticeable difference by avoiding an two atomic incs for each individual request. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: kill callback_head argument for io_req_task_work_add()Jens Axboe1-6/+5
We always use &req->task_work anyway, no point in passing it in. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: move req preps out of io_issue_sqe()Pavel Begunkov1-17/+13
All request preparations are done only during submission, reflect it in the code by moving io_req_prep() much earlier into io_queue_sqe(). That's much cleaner, because it doen't expose bits to async code which it won't ever use. Also it makes the interface harder to misuse, and there are potential places for bugs. For instance, __io_queue() doesn't clear @sqe before proceeding to a next linked request, that could have been disastrous, but hopefully there are linked requests IFF sqe==NULL, so not actually a bug. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: decouple issuing and req preparationPavel Begunkov1-206/+57
io_issue_sqe() does two things at once, trying to prepare request and issuing them. Split it in two and deduplicate with io_defer_prep(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: remove nonblock arg from io_{rw}_prep()Pavel Begunkov1-13/+10
All io_*_prep() functions including io_{read,write}_prep() are called only during submission where @force_nonblock is always true. Don't keep propagating it and instead remove the @force_nonblock argument from prep() altogether. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: set/clear IOCB_NOWAIT into io_read/writePavel Begunkov1-8/+9
Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so it's set/cleared in a single place. Also remove @force_nonblock parameter from io_prep_rw(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: remove F_NEED_CLEANUP check in *prep()Pavel Begunkov1-16/+2
REQ_F_NEED_CLEANUP is set only by io_*_prep() and they're guaranteed to be called only once, so there is no one who may have set the flag before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: io_kiocb_ppos() style changePavel Begunkov1-1/+1
Put brackets around bitwise ops in a complex expression Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: simplify io_alloc_req()Pavel Begunkov1-9/+4
Extract common code from if/else branches. That is cleaner and optimised even better. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io-wq: kill unused IO_WORKER_F_EXITINGJens Axboe1-3/+2
This flag is no longer used, remove it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io-wq: fix use-after-free in io_wq_worker_runningHillf Danton1-58/+58
The smart syzbot has found a reproducer for the following issue: ================================================================== BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline] BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline] BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline] BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613 Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771 CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 check_memory_region_inline mm/kasan/generic.c:186 [inline] check_memory_region+0x13d/0x180 mm/kasan/generic.c:192 instrument_atomic_write include/linux/instrumented.h:71 [inline] atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline] io_wqe_inc_running fs/io-wq.c:301 [inline] io_wq_worker_running+0xde/0x110 fs/io-wq.c:613 schedule_timeout+0x148/0x250 kernel/time/timer.c:1879 io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Allocated by task 7768: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461 kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594 kmalloc_node include/linux/slab.h:572 [inline] kzalloc_node include/linux/slab.h:677 [inline] io_wq_create+0x57b/0xa10 fs/io-wq.c:1064 io_init_wq_offload fs/io_uring.c:7432 [inline] io_sq_offload_start fs/io_uring.c:7504 [inline] io_uring_create fs/io_uring.c:8625 [inline] io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 21: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track+0x1c/0x30 mm/kasan/common.c:56 kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422 __cache_free mm/slab.c:3418 [inline] kfree+0x10e/0x2b0 mm/slab.c:3756 __io_wq_destroy fs/io-wq.c:1138 [inline] io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146 io_finish_async fs/io_uring.c:6836 [inline] io_ring_ctx_free fs/io_uring.c:7870 [inline] io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 The buggy address belongs to the object at ffff8882183db000 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 140 bytes inside of 1024-byte region [ffff8882183db000, ffff8882183db400) The buggy address belongs to the page: page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db flags: 0x57ffe0000000200(slab) raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700 raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== which is down to the comment below, /* all workers gone, wq exit can proceed */ if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs)) complete(&wqe->wq->done); because there might be multiple cases of wqe in a wq and we would wait for every worker in every wqe to go home before releasing wq's resources on destroying. To that end, rework wq's refcount by making it independent of the tracking of workers because after all they are two different things, and keeping it balanced when workers come and go. Note the manager kthread, like other workers, now holds a grab to wq during its lifetime. Finally to help destroy wq, check IO_WQ_BIT_EXIT upon creating worker and do nothing for exiting wq. Cc: stable@vger.kernel.org # v5.5+ Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: show sqthread pid and cpu in fdinfoJoseph Qi1-0/+6
In most cases we'll specify IORING_SETUP_SQPOLL and run multiple io_uring instances in a host. Since all sqthreads are named "io_uring-sq", it's hard to distinguish the relations between application process and its io_uring sqthread. With this patch, application can get its corresponding sqthread pid and cpu through show_fdinfo. Steps: 1. Get io_uring fd first. $ ls -l /proc/<pid>/fd | grep -w io_uring 2. Then get io_uring instance related info, including corresponding sqthread pid and cpu. $ cat /proc/<pid>/fdinfo/<io_uring_fd> pos: 0 flags: 02000002 mnt_id: 13 SqThread: 6929 SqThreadCpu: 2 UserFiles: 1 0: testfile UserBufs: 0 PollList: Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> [axboe: fixed for new shared SQPOLL] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: process task work in io_uring_register()Jens Axboe1-12/+28
We do this for CQ ring wait, in case task_work completions come in. We should do the same in io_uring_register(), to avoid spurious -EINTR if the ring quiescing ends up having to process task_work to complete the operation Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: add blkcg accounting to offloaded operationsDennis Zhou3-0/+118
There are a few operations that are offloaded to the worker threads. In this case, we lose process context and end up in kthread context. This results in ios to be not accounted to the issuing cgroup and consequently end up as issued by root. Just like others, adopt the personality of the blkcg too when issuing via the workqueues. For the SQPOLL thread, it will live and attach in the inited cgroup's context. Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: improve registered buffer accounting for huge pagesJens Axboe1-10/+80
io_uring does account any registered buffer as pinned/locked memory, and checks limit and fails if the given user doesn't have a big enough limit to register the ranges specified. However, if huge pages are used, we are potentially under-accounting the memory in terms of what gets pinned on the vm side. This patch rectifies that, by ensuring that we account the full size of a compound page, regardless of how much of it is being registered. Huge pages are not accounted mulitple times - if multiple sections of a huge page is registered, then the page is only accounted once. Reported-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: remove unneeded semicolonZheng Bin1-1/+1
Fixes coccicheck warning: fs/io_uring.c:4242:13-14: Unneeded semicolon Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: cap SQ submit size for SQPOLL with multiple ringsJens Axboe1-2/+9
In the spirit of fairness, cap the max number of SQ entries we'll submit for SQPOLL if we have multiple rings. If we don't do that, we could be submitting tons of entries for one ring, while others are waiting to get service. The value of 8 is somewhat arbitrarily chosen as something that allows a fair bit of batching, without using an excessive time per ring. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: get rid of req->io/io_async_ctx unionJens Axboe1-94/+117
There's really no point in having this union, it just means that we're always allocating enough room to cater to any command. But that's pointless, as the ->io field is request type private anyway. This gets rid of the io_async_ctx structure, and fills in the required size in the io_op_defs[] instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: kill extra user_bufs checkPavel Begunkov1-7/+1
Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary, because in that case ctx->nr_user_bufs would be zero, and the following check would fail. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: fix overlapped memcpy in io_req_map_rw()Pavel Begunkov1-1/+4
When io_req_map_rw() is called from io_rw_prep_async(), it memcpy() iorw->iter into itself. Even though it doesn't lead to an error, such a memcpy()'s aliasing rules violation is considered to be a bad practise. Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any remapping there, so it's much simpler than the generic implementation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: refactor io_req_map_rw()Pavel Begunkov1-2/+1
Set rw->free_iovec to @iovec, that gives an identical result and stresses that @iovec param rw->free_iovec play the same role. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: simplify io_rw_prep_async()Pavel Begunkov1-4/+2
Don't touch iter->iov and iov in between __io_import_iovec() and io_req_map_rw(), the former function aleady sets it correctly, because it creates one more case with NULL'ed iov to consider in io_req_map_rw(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waitsJens Axboe1-3/+37
When using SQPOLL, applications can run into the issue of running out of SQ ring entries because the thread hasn't consumed them yet. The only option for dealing with that is checking later, or busy checking for the condition. Provide IORING_ENTER_SQ_WAIT if applications want to wait on this condition. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: mark io_uring_fops/io_op_defs as __read_mostlyJens Axboe1-2/+2
These structures are never written, move them appropriately. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread tooJens Axboe1-0/+33
We support using IORING_SETUP_ATTACH_WQ to share async backends between rings created by the same process, this now also allows the same to happen with SQPOLL. The setup procedure remains the same, the caller sets io_uring_params->wq_fd to the 'parent' context, and then the newly created ring will attach to that async backend. This means that multiple rings can share the same SQPOLL thread, saving resources. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: base SQPOLL handling off io_sq_dataJens Axboe1-25/+101
Remove the SQPOLL thread from the ctx, and use the io_sq_data as the data structure we pass in. io_sq_data has a list of ctx's that we can then iterate over and handle. As of now we're ready to handle multiple ctx's, though we're still just handling a single one after this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: split SQPOLL data into separate structureJens Axboe1-41/+86
Move all the necessary state out of io_ring_ctx, and into a new structure, io_sq_data. The latter now deals with any state or variables associated with the SQPOLL thread itself. In preparation for supporting more than one io_ring_ctx per SQPOLL thread. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: split work handling part of SQPOLL into helperJens Axboe1-84/+93
This is done in preparation for handling more than one ctx, but it also cleans up the code a bit since io_sq_thread() was a bit too unwieldy to get a get overview on. __io_sq_thread() is now the main handler, and it returns an enum sq_ret that tells io_sq_thread() what it ended up doing. The parent then makes a decision on idle, spinning, or work handling based on that. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handlerJens Axboe1-1/+18
We need to decouple the clearing on wakeup from the the inline schedule, as that is going to be required for handling multiple rings in one thread. Wrap our wakeup handler so we can clear it when we get the wakeup, by definition that is when we no longer need the flag set. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: use private ctx wait queue entries for SQPOLLJens Axboe1-15/+18
This is in preparation to sharing the poller thread between rings. For that we need per-ring wait_queue_entry storage, and we can't easily put that on the stack if one thread is managing multiple rings. We'll also be sharing the wait_queue_head across rings for the purposes of wakeups, provide the usual private ring wait_queue_head for now but make it a pointer so we can easily override it when sharing. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: io_sq_thread() doesn't need to flush signalsJens Axboe1-2/+0
We're not handling signals by default in kernel threads, and we never use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never have a signal pending, and we don't need to check for it (nor flush it). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_wq: Make io_wqe::lock a raw_spinlock_tSebastian Andrzej Siewior1-26/+26
During a context switch the scheduler invokes wq_worker_sleeping() with disabled preemption. Disabling preemption is needed because it protects access to `worker->sleeping'. As an optimisation it avoids invoking schedule() within the schedule path as part of possible wake up (thus preempt_enable_no_resched() afterwards). The io-wq has been added to the mix in the same section with disabled preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping() acquires a spinlock_t. Also within the schedule() the spinlock_t must be acquired after tsk_is_pi_blocked() otherwise it will block on the sleeping lock again while scheduling out. While playing with `io_uring-bench' I didn't notice a significant latency spike after converting io_wqe::lock to a raw_spinlock_t. The latency was more or less the same. In order to keep the spinlock_t it would have to be moved after the tsk_is_pi_blocked() check which would introduce a branch instruction into the hot path. The lock is used to maintain the `work_list' and wakes one task up at most. Should io_wqe_cancel_pending_work() cause latency spikes, while searching for a specific item, then it would need to drop the lock during iterations. revert_creds() is also invoked under the lock. According to debug cred::non_rcu is 0. Otherwise it should be moved outside of the locked section because put_cred_rcu()->free_uid() acquires a sleeping lock. Convert io_wqe::lock to a raw_spinlock_t.c Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01io_uring: allow disabling rings during the creationStefano Garzarella1-7/+54
This patch adds a new IORING_SETUP_R_DISABLED flag to start the rings disabled, allowing the user to register restrictions, buffers, files, before to start processing SQEs. When IORING_SETUP_R_DISABLED is set, SQE are not processed and SQPOLL kthread is not started. The restrictions registration are allowed only when the rings are disable to prevent concurrency issue while processing SQEs. The rings can be enabled using IORING_REGISTER_ENABLE_RINGS opcode with io_uring_register(2). Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>