summaryrefslogtreecommitdiff
path: root/fs/io_uring.c
AgeCommit message (Collapse)AuthorFilesLines
2020-06-12Merge tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-blockLinus Torvalds1-231/+193
Pull io_uring fixes from Jens Axboe: "A few late stragglers in here. In particular: - Validate full range for provided buffers (Bijan) - Fix bad use of kfree() in buffer registration failure (Denis) - Don't allow close of ring itself, it's not fully safe. Making it fully safe would require making the system call more expensive, which isn't worth it. - Buffer selection fix - Regression fix for O_NONBLOCK retry - Make IORING_OP_ACCEPT honor O_NONBLOCK (Jiufei) - Restrict opcode handling for SQ/IOPOLL (Pavel) - io-wq work handling cleanups and improvements (Pavel, Xiaoguang) - IOPOLL race fix (Xiaoguang)" * tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-block: io_uring: fix io_kiocb.flags modification race in IOPOLL mode io_uring: check file O_NONBLOCK state for accept io_uring: avoid unnecessary io_wq_work copy for fast poll feature io_uring: avoid whole io_wq_work copy for requests completed inline io_uring: allow O_NONBLOCK async retry io_wq: add per-wq work handler instead of per work io_uring: don't arm a timeout through work.func io_uring: remove custom ->func handlers io_uring: don't derive close state from ->func io_uring: use kvfree() in io_sqe_buffer_register() io_uring: validate the full range of provided buffers for access io_uring: re-set iov base/len for buffer select retry io_uring: move send/recv IOPOLL check into prep io_uring: deduplicate io_openat{,2}_prep() io_uring: do build_open_how() only once io_uring: fix {SQ,IO}POLL with unsupported opcodes io_uring: disallow close of ring itself
2020-06-11io_uring: fix io_kiocb.flags modification race in IOPOLL modeXiaoguang Wang1-6/+6
While testing io_uring in arm, we found sometimes io_sq_thread() keeps polling io requests even though there are not inflight io requests in block layer. After some investigations, found a possible race about io_kiocb.flags, see below race codes: 1) in the end of io_write() or io_read() req->flags &= ~REQ_F_NEED_CLEANUP; kfree(iovec); return ret; 2) in io_complete_rw_iopoll() if (res != -EAGAIN) req->flags |= REQ_F_IOPOLL_COMPLETED; In IOPOLL mode, io requests still maybe completed by interrupt, then above codes are not safe, concurrent modifications to req->flags, which is not protected by lock or is not atomic modifications. I also had disassemble io_complete_rw_iopoll() in arm: req->flags |= REQ_F_IOPOLL_COMPLETED; 0xffff000008387b18 <+76>: ldr w0, [x19,#104] 0xffff000008387b1c <+80>: orr w0, w0, #0x1000 0xffff000008387b20 <+84>: str w0, [x19,#104] Seems that the "req->flags |= REQ_F_IOPOLL_COMPLETED;" is load and modification, two instructions, which obviously is not atomic. To fix this issue, add a new iopoll_completed in io_kiocb to indicate whether io request is completed. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-11kernel: set USER_DS in kthread_use_mmChristoph Hellwig1-4/+0
Some architectures like arm64 and s390 require USER_DS to be set for kernel threads to access user address space, which is the whole purpose of kthread_use_mm, but other like x86 don't. That has lead to a huge mess where some callers are fixed up once they are tested on said architectures, while others linger around and yet other like io_uring try to do "clever" optimizations for what usually is just a trivial asignment to a member in the thread_struct for most architectures. Make kthread_use_mm set USER_DS, and kthread_unuse_mm restore to the previous value instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: http://lkml.kernel.org/r/20200404094101.672954-7-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-11kernel: better document the use_mm/unuse_mm API contractChristoph Hellwig1-2/+2
Switch the function documentation to kerneldoc comments, and add WARN_ON_ONCE asserts that the calling thread is a kernel thread and does not have ->mm set (or has ->mm set in the case of unuse_mm). Also give the functions a kthread_ prefix to better document the use case. [hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio] Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de [sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename] Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb] Acked-by: Haren Myneni <haren@linux.ibm.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-11kernel: move use_mm/unuse_mm to kthread.cChristoph Hellwig1-1/+0
Patch series "improve use_mm / unuse_mm", v2. This series improves the use_mm / unuse_mm interface by better documenting the assumptions, and my taking the set_fs manipulations spread over the callers into the core API. This patch (of 3): Use the proper API instead. Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de These helpers are only for use with kernel threads, and I will tie them more into the kthread infrastructure going forward. Also move the prototypes to kthread.h - mmu_context.h was a little weird to start with as it otherwise contains very low-level MM bits. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-11io_uring: check file O_NONBLOCK state for acceptJiufei Xue1-0/+3
If the socket is O_NONBLOCK, we should complete the accept request with -EAGAIN when data is not ready. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-11io_uring: avoid unnecessary io_wq_work copy for fast poll featureXiaoguang Wang1-4/+9
Basically IORING_OP_POLL_ADD command and async armed poll handlers for regular commands don't touch io_wq_work, so only REQ_F_WORK_INITIALIZED is set, can we do io_wq_work copy and restore. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-11io_uring: avoid whole io_wq_work copy for requests completed inlineXiaoguang Wang1-4/+36
If requests can be submitted and completed inline, we don't need to initialize whole io_wq_work in io_init_req(), which is an expensive operation, add a new 'REQ_F_WORK_INITIALIZED' to determine whether io_wq_work is initialized and add a helper io_req_init_async(), users must call io_req_init_async() for the first time touching any members of io_wq_work. I use /dev/nullb0 to evaluate performance improvement in my physical machine: modprobe null_blk nr_devices=1 completion_nsec=0 sudo taskset -c 60 fio -name=fiotest -filename=/dev/nullb0 -iodepth=128 -thread -rw=read -ioengine=io_uring -direct=1 -bs=4k -size=100G -numjobs=1 -time_based -runtime=120 before this patch: Run status group 0 (all jobs): READ: bw=724MiB/s (759MB/s), 724MiB/s-724MiB/s (759MB/s-759MB/s), io=84.8GiB (91.1GB), run=120001-120001msec With this patch: Run status group 0 (all jobs): READ: bw=761MiB/s (798MB/s), 761MiB/s-761MiB/s (798MB/s-798MB/s), io=89.2GiB (95.8GB), run=120001-120001msec About 5% improvement. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-10io_uring: allow O_NONBLOCK async retryJens Axboe1-3/+7
We can assume that O_NONBLOCK is always honored, even if we don't have a ->read/write_iter() for the file type. Also unify the read/write checking for allowing async punt, having the write side factoring in the REQ_F_NOWAIT flag as well. Cc: stable@vger.kernel.org Fixes: 490e89676a52 ("io_uring: only force async punt if poll based retry can't handle it") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-09mmap locking API: use coccinelle to convert mmap_sem rwsem call sitesMichel Lespinasse1-2/+2
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08io_wq: add per-wq work handler instead of per workPavel Begunkov1-1/+2
io_uring is the only user of io-wq, and now it uses only io-wq callback for all its requests, namely io_wq_submit_work(). Instead of storing work->runner callback in each instance of io_wq_work, keep it in io-wq itself. pros: - reduces io_wq_work size - more robust -- ->func won't be invalidated with mem{cpy,set}(req) - helps other work Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08io_uring: don't arm a timeout through work.funcPavel Begunkov1-11/+18
Remove io_link_work_cb() -- the last custom work.func. Not the prettiest thing, but works. Instead of queueing a linked timeout in io_link_work_cb() mark a request with REQ_F_QUEUE_TIMEOUT and do enqueueing based on the flag in io_wq_submit_work(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08io_uring: remove custom ->func handlersPavel Begunkov1-112/+27
In preparation of getting rid of work.func, this removes almost all custom instances of it, leaving only io_wq_submit_work() and io_link_work_cb(). And the last one will be dealt later. Nothing fancy, just routinely remove *_finish() function and inline what's left. E.g. remove io_fsync_finish() + inline __io_fsync() into io_fsync(). As no users of io_req_cancelled() are left, delete it as well. The patch adds extra switch lookup on cold-ish path, but that's overweighted by nice diffstat and other benefits of the following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08io_uring: don't derive close state from ->funcPavel Begunkov1-33/+17
Relying on having a specific work.func is dangerous, even if an opcode handler set it itself. E.g. io_wq_assign_next() can modify it. io_close() sets a custom work.func to indicate that __close_fd_get_file() was already called. Fortunately, there is no bugs with io_wq_assign_next() and close yet. Still, do it safe and always be prepared to be called through io_wq_submit_work(). Zero req->close.put_file in prep, and call __close_fd_get_file() IFF it's NULL. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08io_uring: use kvfree() in io_sqe_buffer_register()Denis Efremov1-2/+2
Use kvfree() to free the pages and vmas, since they are allocated by kvmalloc_array() in a loop. Fixes: d4ef647510b1 ("io_uring: avoid page allocation warnings") Signed-off-by: Denis Efremov <efremov@linux.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200605093203.40087-1-efremov@linux.com
2020-06-08io_uring: validate the full range of provided buffers for accessBijan Mottahedeh1-1/+1
Account for the number of provided buffers when validating the address range. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04io_uring: re-set iov base/len for buffer select retryJens Axboe1-1/+7
We already have the buffer selected, but we should set the iter list again. Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04io_uring: move send/recv IOPOLL check into prepPavel Begunkov1-12/+6
Fail recv/send in case of IORING_SETUP_IOPOLL earlier during prep, so it'd be done only once. Removes duplication as well Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04io_uring: deduplicate io_openat{,2}_prep()Pavel Begunkov1-36/+19
io_openat_prep() and io_openat2_prep() are identical except for how struct open_how is built. Deduplicate it with a helper. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04io_uring: do build_open_how() only oncePavel Begunkov1-6/+7
build_open_how() is just adjusting open_flags/mode. Do it once during prep. It looks better than storing raw values for the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04io_uring: fix {SQ,IO}POLL with unsupported opcodesPavel Begunkov1-0/+18
IORING_SETUP_IOPOLL is defined only for read/write, other opcodes should be disallowed, otherwise it'll get an error as below. Also refuse open/close with SQPOLL, as the polling thread wouldn't know which file table to use. RIP: 0010:io_iopoll_getevents+0x111/0x5a0 Call Trace: ? _raw_spin_unlock_irqrestore+0x24/0x40 ? do_send_sig_info+0x64/0x90 io_iopoll_reap_events.part.0+0x5e/0xa0 io_ring_ctx_wait_and_kill+0x132/0x1c0 io_uring_release+0x20/0x30 __fput+0xcd/0x230 ____fput+0xe/0x10 task_work_run+0x67/0xa0 do_exit+0x353/0xb10 ? handle_mm_fault+0xd4/0x200 ? syscall_trace_enter+0x18c/0x2c0 do_group_exit+0x43/0xa0 __x64_sys_exit_group+0x18/0x20 do_syscall_64+0x60/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: allow provide/remove buffers and files update] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-03io_uring: disallow close of ring itselfJens Axboe1-8/+17
A previous commit enabled this functionality, which also enabled O_PATH to work correctly with io_uring. But we can't safely close the ring itself, as the file handle isn't reference counted inside io_uring_enter(). Instead of jumping through hoops to enable ring closure, add a "soft" ->needs_file option, ->needs_file_no_error. This enables O_PATH file descriptors to work, but still catches the case of trying to close the ring itself. Reported-by: Jann Horn <jannh@google.com> Fixes: 904fbcb115c8 ("io_uring: remove 'fd is io_uring' from close path") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-03Merge tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-blockLinus Torvalds1-344/+408
Pull io_uring updates from Jens Axboe: "A relatively quiet round, mostly just fixes and code improvements. In particular: - Make statx just use the generic statx handler, instead of open coding it. We don't need that anymore, as we always call it async safe (Bijan) - Enable closing of the ring itself. Also fixes O_PATH closure (me) - Properly name completion members (me) - Batch reap of dead file registrations (me) - Allow IORING_OP_POLL with double waitqueues (me) - Add tee(2) support (Pavel) - Remove double off read (Pavel) - Fix overflow cancellations (Pavel) - Improve CQ timeouts (Pavel) - Async defer drain fixes (Pavel) - Add support for enabling/disabling notifications on a registered eventfd (Stefano) - Remove dead state parameter (Xiaoguang) - Disable SQPOLL submit on dying ctx (Xiaoguang) - Various code cleanups" * tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block: (29 commits) io_uring: fix overflowed reqs cancellation io_uring: off timeouts based only on completions io_uring: move timeouts flushing to a helper statx: hide interfaces no longer used by io_uring io_uring: call statx directly statx: allow system call to be invoked from io_uring io_uring: add io_statx structure io_uring: get rid of manual punting in io_close io_uring: separate DRAIN flushing into a cold path io_uring: don't re-read sqe->off in timeout_prep() io_uring: simplify io_timeout locking io_uring: fix flush req->refs underflow io_uring: don't submit sqes when ctx->refs is dying io_uring: async task poll trigger cleanup io_uring: add tee(2) support splice: export do_tee() io_uring: don't repeat valid flag list io_uring: rename io_file_put() io_uring: remove req->needs_fixed_files io_uring: cleanup io_poll_remove_one() logic ...
2020-05-30io_uring: fix overflowed reqs cancellationPavel Begunkov1-2/+3
Overflowed requests in io_uring_cancel_files() should be shed only of inflight and overflowed refs. All other left references are owned by someone else. If refcount_sub_and_test() fails, it will go further and put put extra ref, don't do that. Also, don't need to do io_wq_cancel_work() for overflowed reqs, they will be let go shortly anyway. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-30io_uring: off timeouts based only on completionsPavel Begunkov1-51/+14
Offset timeouts wait not for sqe->off non-timeout CQEs, but rather sqe->off + number of prior inflight requests. Wait exactly for sqe->off non-timeout completions Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-30io_uring: move timeouts flushing to a helperPavel Begunkov1-20/+14
Separate flushing offset timeouts io_commit_cqring() by moving it into a helper. Just a preparation, makes following patches clearer. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27io_uring: call statx directlyBijan Mottahedeh1-46/+4
Calling statx directly both simplifies the interface and avoids potential incompatibilities between sync and async invokations. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27io_uring: add io_statx structureBijan Mottahedeh1-16/+22
Separate statx data from open in io_kiocb. No functional changes. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26io_uring: get rid of manual punting in io_closePavel Begunkov1-15/+5
io_close() was punting async manually to skip grabbing files. Use REQ_F_NO_FILE_TABLE instead, and pass it through the generic path with -EAGAIN. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26io_uring: separate DRAIN flushing into a cold pathPavel Begunkov1-15/+15
io_commit_cqring() assembly doesn't look good with extra code handling drained requests. IOSQE_IO_DRAIN is slow and discouraged to be used in a hot path, so try to minimise its impact by putting it into a helper and doing a fast check. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26io_uring: don't re-read sqe->off in timeout_prep()Pavel Begunkov1-2/+3
SQEs are user writable, don't read sqe->off twice in io_timeout_prep() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26io_uring: simplify io_timeout lockingPavel Begunkov1-2/+1
Move spin_lock_irq() earlier to have only 1 call site of it in io_timeout(). It makes the flow easier. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26io_uring: fix flush req->refs underflowPavel Begunkov1-1/+1
In io_uring_cancel_files(), after refcount_sub_and_test() leaves 0 req->refs, it calls io_put_req(), which would also put a ref. Call io_free_req() instead. Cc: stable@vger.kernel.org Fixes: 2ca10259b418 ("io_uring: prune request from overflow list on flush") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-20io_uring: don't submit sqes when ctx->refs is dyingXiaoguang Wang1-11/+2
When IORING_SETUP_SQPOLL is enabled, io_ring_ctx_wait_and_kill() will wait for sq thread to idle by busy loop: while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait)) cond_resched(); Above loop isn't very CPU friendly, it may introduce a short cpu burst on the current cpu. If ctx->refs is dying, we forbid sq_thread from submitting any further SQEs. Instead they just get discarded when we exit. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-20io_uring: reset -EBUSY error when io sq thread is waken upXiaoguang Wang1-0/+1
In io_sq_thread(), currently if we get an -EBUSY error and go to sleep, we will won't clear it again, which will result in io_sq_thread() will never have a chance to submit sqes again. Below test program test.c can reveal this bug: int main(int argc, char *argv[]) { struct io_uring ring; int i, fd, ret; struct io_uring_sqe *sqe; struct io_uring_cqe *cqe; struct iovec *iovecs; void *buf; struct io_uring_params p; if (argc < 2) { printf("%s: file\n", argv[0]); return 1; } memset(&p, 0, sizeof(p)); p.flags = IORING_SETUP_SQPOLL; ret = io_uring_queue_init_params(4, &ring, &p); if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); return 1; } fd = open(argv[1], O_RDONLY | O_DIRECT); if (fd < 0) { perror("open"); return 1; } iovecs = calloc(10, sizeof(struct iovec)); for (i = 0; i < 10; i++) { if (posix_memalign(&buf, 4096, 4096)) return 1; iovecs[i].iov_base = buf; iovecs[i].iov_len = 4096; } ret = io_uring_register_files(&ring, &fd, 1); if (ret < 0) { fprintf(stderr, "%s: register %d\n", __FUNCTION__, ret); return ret; } for (i = 0; i < 10; i++) { sqe = io_uring_get_sqe(&ring); if (!sqe) break; io_uring_prep_readv(sqe, 0, &iovecs[i], 1, 0); sqe->flags |= IOSQE_FIXED_FILE; ret = io_uring_submit(&ring); sleep(1); printf("submit %d\n", i); } for (i = 0; i < 10; i++) { io_uring_wait_cqe(&ring, &cqe); printf("receive: %d\n", i); if (cqe->res != 4096) { fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res); ret = 1; } io_uring_cqe_seen(&ring, cqe); } close(fd); io_uring_queue_exit(&ring); return 0; } sudo ./test testfile above command will hang on the tenth request, to fix this bug, when io sq_thread is waken up, we reset the variable 'ret' to be zero. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-20io_uring: don't add non-IO requests to iopoll pending listJens Axboe1-1/+2
We normally disable any commands that aren't specifically poll commands for a ring that is setup for polling, but we do allow buffer provide and remove commands to support buffer selection for polled IO. Once a request is issued, we add it to the poll list to poll for completion. But we should not do that for non-IO commands, as those request complete inline immediately and aren't pollable. If we do, we can leave requests on the iopoll list after they are freed. Fixes: ddf0322db79c ("io_uring: add IORING_OP_PROVIDE_BUFFERS") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-20io_uring: don't use kiocb.private to store buf_indexBijan Mottahedeh1-8/+8
kiocb.private is used in iomap_dio_rw() so store buf_index separately. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Move 'buf_index' to a hole in io_kiocb. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-18io_uring: cancel work if task_work_add() failsJens Axboe1-3/+5
We currently move it to the io_wqe_manager for execution, but we cannot safely do so as we may lack some of the state to execute it out of context. As we cancel work anyway when the ring/task exits, just mark this request as canceled and io_async_task_func() will do the right thing. Fixes: aa96bf8a9ee3 ("io_uring: use io-wq manager as backup task if task is exiting") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-18io_uring: async task poll trigger cleanupJens Axboe1-17/+16
If the request is still hashed in io_async_task_func(), then it cannot have been canceled and it's pointless to check. So save that check. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17io_uring: remove dead check in io_splice()Jens Axboe1-4/+1
We checked for 'force_nonblock' higher up, so it's definitely false at this point. Kill the check, it's a remnant of when we tried to do inline splice without always punting to async context. Fixes: 2fb3e82284fc ("io_uring: punt splice async because of inode mutex") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17io_uring: add tee(2) supportPavel Begunkov1-3/+59
Add IORING_OP_TEE implementing tee(2) support. Almost identical to splice bits, but without offsets. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17io_uring: don't repeat valid flag listPavel Begunkov1-3/+1
req->flags stores all sqe->flags. After checking that sqe->flags are valid set if IOSQE* flags, no need to double check it, just forward them all. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17io_uring: rename io_file_put()Pavel Begunkov1-9/+13
io_file_put() deals with flushing state's file refs, adding "state" to its name makes it a bit clearer. Also, avoid double check of state->file in __io_file_get() in some cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17io_uring: remove req->needs_fixed_filesPavel Begunkov1-9/+12
A submission is "async" IIF it's done by SQPOLL thread. Instead of passing @async flag into io_submit_sqes(), deduce it from ctx->flags. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17io_uring: cleanup io_poll_remove_one() logicJens Axboe1-14/+13
We only need apoll in the one section, do the juggling with the work restoration there. This removes a special case further down as well. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17io_uring: fix FORCE_ASYNC req preparationPavel Begunkov1-3/+9
As for other not inlined requests, alloc req->io for FORCE_ASYNC reqs, so they can be prepared properly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17io_uring: don't prepare DRAIN reqs twicePavel Begunkov1-6/+7
If req->io is not NULL, it's already prepared. Don't do it again, it's dangerous. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17io_uring: initialize ctx->sqo_wait earlierJens Axboe1-1/+1
Ensure that ctx->sqo_wait is initialized as soon as the ctx is allocated, instead of deferring it to the offload setup. This fixes a syzbot reported lockdep complaint, which is really due to trying to wake_up on an uninitialized wait queue: RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441319 RDX: 0000000000000001 RSI: 0000000020000140 RDI: 000000000000047b RBP: 0000000000010475 R08: 0000000000000001 R09: 00000000004002c8 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402260 R13: 00000000004022f0 R14: 0000000000000000 R15: 0000000000000000 INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 1 PID: 7090 Comm: syz-executor222 Not tainted 5.7.0-rc1-next-20200415-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 assign_lock_key kernel/locking/lockdep.c:913 [inline] register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225 __lock_acquire+0x104/0x4c50 kernel/locking/lockdep.c:4234 lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4934 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159 __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:122 io_cqring_ev_posted+0xa5/0x1e0 fs/io_uring.c:1160 io_poll_remove_all fs/io_uring.c:4357 [inline] io_ring_ctx_wait_and_kill+0x2bc/0x5a0 fs/io_uring.c:7305 io_uring_create fs/io_uring.c:7843 [inline] io_uring_setup+0x115e/0x22b0 fs/io_uring.c:7870 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x441319 Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 Reported-by: syzbot+8c91f5d054e998721c57@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15io_uring: file registration list and lock optimizationJens Axboe1-14/+10
There's no point in using list_del_init() on entries that are going away, and the associated lock is always used in process context so let's not use the IRQ disabling+saving variant of the spinlock. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15io_uring: add IORING_CQ_EVENTFD_DISABLED to the CQ ring flagsStefano Garzarella1-0/+2
This new flag should be set/clear from the application to disable/enable eventfd notifications when a request is completed and queued to the CQ ring. Before this patch, notifications were always sent if an eventfd is registered, so IORING_CQ_EVENTFD_DISABLED is not set during the initialization. It will be up to the application to set the flag after initialization if no notifications are required at the beginning. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>