summaryrefslogtreecommitdiff
path: root/fs/io_uring.c
AgeCommit message (Collapse)AuthorFilesLines
2022-04-17io_uring: fix leaks on IOPOLL and CQE_SKIPPavel Begunkov1-2/+1
If all completed requests in io_do_iopoll() were marked with REQ_F_CQE_SKIP, we'll not only skip CQE posting but also io_free_batch_list() leaking memory and resources. Move @nr_events increment before REQ_F_CQE_SKIP check. We'll potentially return the value greater than the real one, but iopolling will deal with it and the userspace will re-iopoll if needed. In anyway, I don't think there are many use cases for REQ_F_CQE_SKIP + IOPOLL. Fixes: 83a13a4181b0e ("io_uring: tweak iopoll CQE_SKIP event counting") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5072fc8693fbfd595f89e5d4305bfcfd5d2f0a64.1650186611.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17io_uring: free iovec if file assignment failsJens Axboe1-2/+6
We just return failure in this case, but we need to release the iovec first. If we're doing IO with more than FAST_IOV segments, then the iovec is allocated and must be freed. Reported-by: syzbot+96b43810dfe9c3bb95ed@syzkaller.appspotmail.com Fixes: 584b0180f0f4 ("io_uring: move read/write file prep state into actual opcode handler") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-15io_uring: abort file assignment prior to assigning credsJens Axboe1-2/+3
We need to either restore creds properly if we fail on the file assignment, or just do the file assignment first instead. Let's do the latter as it's simpler, should make no difference here for file assignment. Link: https://lore.kernel.org/lkml/000000000000a7edb305dca75a50@google.com/ Reported-by: syzbot+60c52ca98513a8760a91@syzkaller.appspotmail.com Fixes: 6bf9c47a3989 ("io_uring: defer file assignment") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-13io_uring: fix poll error reportingPavel Begunkov1-3/+2
We should not return an error code in req->result in io_poll_check_events(), because it may get mangled and returned as success. Just return the error code directly, the callers will fail the request or proceed accordingly. Fixes: 6bf9c47a3989 ("io_uring: defer file assignment") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5f03514ee33324dc811fb93df84aee0f695fb044.1649862516.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-13io_uring: fix poll file assign deadlockPavel Begunkov1-1/+2
We pass "unlocked" into io_assign_file() in io_poll_check_events(), which can lead to double locking. Fixes: 6bf9c47a3989 ("io_uring: defer file assignment") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2476d4ae46554324b599ee4055447b105f20a75a.1649862516.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-13io_uring: use right issue_flags for splice/teePavel Begunkov1-2/+2
Pass right issue_flags into into io_file_get_fixed() instead of IO_URING_F_UNLOCKED. It's probably not a problem at the moment but let's do it safer. Fixes: 6bf9c47a3989 ("io_uring: defer file assignment") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7d242daa9df5d776907686977cd29fbceb4a2d8d.1649862516.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12io_uring: verify pad field is 0 in io_get_ext_argDylan Yudaken1-0/+2
Ensure that only 0 is passed for pad here. Fixes: c73ebb685fb6 ("io_uring: add timeout support for io_uring_enter()") Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220412163042.2788062-5-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12io_uring: verify resv is 0 in ringfd register/unregisterDylan Yudaken1-1/+6
Only allow resv field to be 0 in struct io_uring_rsrc_update user arguments. Fixes: e7a6c00dc77a ("io_uring: add support for registering ring file descriptors") Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220412163042.2788062-4-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12io_uring: verify that resv2 is 0 in io_uring_rsrc_update2Dylan Yudaken1-2/+3
Verify that the user does not pass in anything but 0 for this field. Fixes: 992da01aa932 ("io_uring: change registration/upd/rsrc tagging ABI") Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220412163042.2788062-3-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12io_uring: move io_uring_rsrc_update2 validationDylan Yudaken1-2/+2
Move validation to be more consistently straight after copy_from_user. This is already done in io_register_rsrc_update and so this removes that redundant check. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220412163042.2788062-2-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12io_uring: fix assign file locking issuePavel Begunkov1-4/+6
io-wq work cancellation path can't take uring_lock as how it's done on file assignment, we have to handle IO_WQ_WORK_CANCEL first, this fixes encountered hangs. Fixes: 6bf9c47a3989 ("io_uring: defer file assignment") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0d9b9f37841645518503f6a207e509d14a286aba.1649773463.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12io_uring: stop using io_wq_work as an fd placeholderJens Axboe1-4/+8
There are two reasons why this isn't the best idea: - It's an odd area to grab a bit of storage space, hence it's an odd area to grab storage from. - It puts the 3rd io_kiocb cacheline into the hot path, where normal hot path just needs the first two. Use 'cflags' for joint fd/cflags storage. We only need fd until we successfully issue, and we only need cflags once a request is done and is completed. Fixes: 6bf9c47a3989 ("io_uring: defer file assignment") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12io_uring: move apoll->events cacheJens Axboe1-9/+12
In preparation for fixing a regression with pulling in an extra cacheline for IO that doesn't usually touch the last cacheline of the io_kiocb, move the cached location of apoll->events to space shared with some other completion data. Like cflags, this isn't used until after the request has been completed, so we can piggy back on top of comp_list. Fixes: 81459350d581 ("io_uring: cache req->apoll->events in req->cflags") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12io_uring: io_kiocb_update_pos() should not touch file for non -1 offsetJens Axboe1-11/+10
-1 tells use to use the current position, but we check if the file is a stream regardless of that. Fix up io_kiocb_update_pos() to only dip into file if we need to. This is both more efficient and also drops 12 bytes of text on aarch64 and 64 bytes on x86-64. Fixes: b4aec4001595 ("io_uring: do not recalculate ppos unnecessarily") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-11io_uring: flag the fact that linked file assignment is saneJens Axboe1-1/+2
Give applications a way to tell if the kernel supports sane linked files, as in files being assigned at the right time to be able to reliably do <open file direct into slot X><read file from slot X> while using IOSQE_IO_LINK to order them. Not really a bug fix, but flag it as such so that it gets pulled in with backports of the deferred file assignment. Fixes: 6bf9c47a3989 ("io_uring: defer file assignment") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-08io_uring: fix race between timeout flush and removalJens Axboe1-4/+3
io_flush_timeouts() assumes the timeout isn't in progress of triggering or being removed/canceled, so it unconditionally removes it from the timeout list and attempts to cancel it. Leave it on the list and let the normal timeout cancelation take care of it. Cc: stable@vger.kernel.org # 5.5+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07io_uring: use nospec annotation for more indexesPavel Begunkov1-6/+5
There are still several places that using pre array_index_nospec() indexes, fix them up. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b01ef5ee83f72ed35ad525912370b729f5d145f4.1649336342.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07io_uring: zero tag on rsrc removalPavel Begunkov1-1/+3
Automatically default rsrc tag in io_queue_rsrc_removal(), it's safer than leaving it there and relying on the rest of the code to behave and not use it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1cf262a50df17478ea25b22494dcc19f3a80301f.1649336342.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07io_uring: don't touch scm_fp_list after queueing skbPavel Begunkov1-2/+6
It's safer to not touch scm_fp_list after we queued an skb to which it was assigned, there might be races lurking if we screw subtle sync guarantees on the io_uring side. Fixes: 6b06314c47e14 ("io_uring: add file set registration") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07io_uring: nospec index for tags on files updatePavel Begunkov1-1/+1
Don't forget to array_index_nospec() for indexes before updating rsrc tags in __io_sqe_files_update(), just use already safe and precalculated index @i. Fixes: c3bdad0271834 ("io_uring: add generic rsrc update with tags") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07io_uring: implement compat handling for IORING_REGISTER_IOWQ_AFFEugene Syromiatnikov1-1/+9
Similarly to the way it is done im mbind syscall. Cc: stable@vger.kernel.org # 5.14 Fixes: fe76421d1da1dcdb ("io_uring: allow user configurable IO thread CPU affinity") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07Revert "io_uring: Add support for napi_busy_poll"Jens Axboe1-231/+1
This reverts commit adc8682ec69012b68d5ab7123e246d2ad9a6f94b. There's some discussion on the API not being as good as it can be. Rather than ship something and be stuck with it forever, let's revert the NAPI support for now and work on getting something sorted out for the next kernel release instead. Link: https://lore.kernel.org/io-uring/b7bbc124-8502-0ee9-d4c8-7c41b4487264@kernel.dk/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07io_uring: drop the old style inflight file trackingJens Axboe1-58/+27
io_uring tracks requests that are referencing an io_uring descriptor to be able to cancel without worrying about loops in the references. Since we now assign the file at execution time, the easier approach is to drop a potentially problematic reference before we punt the request. This eliminates the need to special case these types of files beyond just marking them as such, and simplifies cancelation quite a bit. This also fixes a recent issue where an async punted tee operation would with the io_uring descriptor as the output file would crash when attempting to get a reference to the file from the io-wq worker. We could have worked around that, but this is the much cleaner fix. Fixes: 6bf9c47a3989 ("io_uring: defer file assignment") Reported-by: syzbot+c4b9303500a21750b250@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07io_uring: defer file assignmentJens Axboe1-10/+29
If an application uses direct open or accept, it knows in advance what direct descriptor value it will get as it picks it itself. This allows combined requests such as: sqe = io_uring_get_sqe(ring); io_uring_prep_openat_direct(sqe, ..., file_slot); sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS; sqe = io_uring_get_sqe(ring); io_uring_prep_read(sqe,file_slot, buf, buf_size, 0); sqe->flags |= IOSQE_FIXED_FILE; io_uring_submit(ring); where we prepare both a file open and read, and only get a completion event for the read when both have completed successfully. Currently links are fully prepared before the head is issued, but that fails if the dependent link needs a file assigned that isn't valid until the head has completed. Conversely, if the same chain is performed but the fixed file slot is already valid, then we would be unexpectedly returning data from the old file slot rather than the newly opened one. Make sure we're consistent here. Allow deferral of file setup, which makes this documented case work. Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07io_uring: propagate issue_flags state down to file assignmentJens Axboe1-35/+47
We'll need this in a future patch, when we could be assigning the file after the prep stage. While at it, get rid of the io_file_get() helper, it just makes the code harder to read. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-05io_uring: move read/write file prep state into actual opcode handlerJens Axboe1-48/+53
In preparation for not necessarily having a file assigned at prep time, defer any initialization associated with the file to when the opcode handler is run. Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-05io_uring: defer splice/tee file validity check until command issueJens Axboe1-28/+21
In preparation for not using the file at prep time, defer checking if this file refers to a valid io_uring instance until issue time. This also means we can get rid of the cleanup flag for splice and tee. Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-04io_uring: don't check req->file in io_fsync_prep()Jens Axboe1-3/+0
This is a leftover from the really old days where we weren't able to track and error early if we need a file and it wasn't assigned. Kill the check. Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-02Merge tag 'for-5.18/io_uring-2022-04-01' of git://git.kernel.dk/linux-blockLinus Torvalds1-23/+110
Pull io_uring fixes from Jens Axboe: "A little bit all over the map, some regression fixes for this merge window, and some general fixes that are stable bound. In detail: - Fix an SQPOLL memory ordering issue (Almog) - Accept fixes (Dylan) - Poll fixes (me) - Fixes for provided buffers and recycling (me) - Tweak to IORING_OP_MSG_RING command added in this merge window (me) - Memory leak fix (Pavel) - Misc fixes and tweaks (Pavel, me)" * tag 'for-5.18/io_uring-2022-04-01' of git://git.kernel.dk/linux-block: io_uring: defer msg-ring file validity check until command issue io_uring: fail links if msg-ring doesn't succeeed io_uring: fix memory leak of uid in files registration io_uring: fix put_kbuf without proper locking io_uring: fix invalid flags for io_put_kbuf() io_uring: improve req fields comments io_uring: enable EPOLLEXCLUSIVE for accept poll io_uring: improve task work cache utilization io_uring: fix async accept on O_NONBLOCK sockets io_uring: remove IORING_CQE_F_MSG io_uring: add flag for disabling provided buffer recycling io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly io_uring: don't recycle provided buffer if punted to async worker io_uring: fix assuming triggered poll waitqueue is the single poll io_uring: bump poll refs to full 31-bits io_uring: remove poll entry from list when canceling all io_uring: fix memory ordering when SQPOLL thread goes to sleep io_uring: ensure that fsnotify is always called io_uring: recycle provided before arming poll
2022-03-29io_uring: defer msg-ring file validity check until command issueJens Axboe1-4/+7
In preparation for not using the file at prep time, defer checking if this file refers to a valid io_uring instance until issue time. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-29io_uring: fail links if msg-ring doesn't succeeedJens Axboe1-0/+2
We must always call req_set_fail() if the request is failed, otherwise we won't sever links for dependent chains correctly. Fixes: 4f57f06ce218 ("io_uring: add support for IORING_OP_MSG_RING command") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-29Merge tag 'ptrace-cleanups-for-v5.18' of ↵Linus Torvalds1-5/+6
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ptrace cleanups from Eric Biederman: "This set of changes removes tracehook.h, moves modification of all of the ptrace fields inside of siglock to remove races, adds a missing permission check to ptrace.c The removal of tracehook.h is quite significant as it has been a major source of confusion in recent years. Much of that confusion was around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the semantics clearer). For people who don't know tracehook.h is a vestiage of an attempt to implement uprobes like functionality that was never fully merged, and was later superseeded by uprobes when uprobes was merged. For many years now we have been removing what tracehook functionaly a little bit at a time. To the point where anything left in tracehook.h was some weird strange thing that was difficult to understand" * tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ptrace: Remove duplicated include in ptrace.c ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE ptrace: Return the signal to continue with from ptrace_stop ptrace: Move setting/clearing ptrace_message into ptrace_stop tracehook: Remove tracehook.h resume_user_mode: Move to resume_user_mode.h resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume signal: Move set_notify_signal and clear_notify_signal into sched/signal.h task_work: Decouple TIF_NOTIFY_SIGNAL and task_work task_work: Call tracehook_notify_signal from get_signal on all architectures task_work: Introduce task_work_pending task_work: Remove unnecessary include from posix_timers.h ptrace: Remove tracehook_signal_handler ptrace: Remove arch_syscall_{enter,exit}_tracehook ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h ptrace/arm: Rename tracehook_report_syscall report_syscall ptrace: Move ptrace_report_syscall into ptrace.h
2022-03-26Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+0
Pull NVMe write streams removal from Jens Axboe: "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints. The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions, I only know of one prototype that I help do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all" * tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block: fs: remove fs.f_write_hint fs: remove kiocb.ki_hint block: remove the per-bio/request write hint nvme: remove support or stream based temperature hint
2022-03-26io_uring: fix memory leak of uid in files registrationPavel Begunkov1-0/+1
When there are no files for __io_sqe_files_scm() to process in the range, it'll free everything and return. However, it forgets to put uid. Fixes: 08a451739a9b5 ("io_uring: allow sparse fixed file sets") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/accee442376f33ce8aaebb099d04967533efde92.1648226048.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25io_uring: fix put_kbuf without proper lockingPavel Begunkov1-1/+12
io_put_kbuf_comp() should only be called while holding ->completion_lock, however there is no such assumption in io_clean_op() and thus it can corrupt ->io_buffer_comp. Take the lock there, and workaround the only user of io_clean_op() calling it with locks. Not the prettiest solution, but it's easier to refactor it for-next. Fixes: cc3cec8367cba ("io_uring: speedup provided buffer handling") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/743e2130b73ec6d48c4c5dd15db896c433431e6d.1648212967.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25io_uring: fix invalid flags for io_put_kbuf()Pavel Begunkov1-1/+3
io_req_complete_failed() doesn't require callers to hold ->uring_lock, use IO_URING_F_UNLOCKED version of io_put_kbuf(). The only affected place is the fail path of io_apoll_task_func(). Also add a lockdep annotation to catch such bugs in the future. Fixes: 3b2b78a8eb7cc ("io_uring: extend provided buf return to fails") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ccf602dbf8df3b6a8552a262d8ee0a13a086fbc7.1648212967.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25io_uring: improve req fields commentsPavel Begunkov1-1/+2
Move a misplaced comment about req->creds and add a line with assumptions about req->link. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1e51d1e6b1f3708c2d4127b4e371f9daa4c5f859.1648209006.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25io_uring: enable EPOLLEXCLUSIVE for accept pollDylan Yudaken1-1/+4
When polling sockets for accept, use EPOLLEXCLUSIVE. This is helpful when multiple accept SQEs are submitted. For O_NONBLOCK sockets multiple queued SQEs would previously have all completed at once, but most with -EAGAIN as the result. Now only one wakes up and completes. For sockets without O_NONBLOCK there is no user facing change, but internally the extra requests would previously be queued onto a worker thread as they would wake up with no connection waiting, and be punted. Now they do not wake up unnecessarily. Co-developed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220325093755.4123343-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25io_uring: improve task work cache utilizationJens Axboe1-1/+5
While profiling task_work intensive workloads, I noticed that most of the time in tctx_task_work() is spending stalled on loading 'req'. This is one of the unfortunate side effects of using linked lists, particularly when they end up being passe around. Prefetch the next request, if there is one. There's a sufficient amount of work in between that this makes it available for the next loop. While fiddling with the cache layout, move the link outside of the hot completion cacheline. It's rarely used in hot workloads, so better to bring in kbuf which is used for networked loads with provided buffers. This reduces tctx_task_work() overhead from ~3% to 1-1.5% in my testing. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25io_uring: fix async accept on O_NONBLOCK socketsDylan Yudaken1-3/+0
Do not set REQ_F_NOWAIT if the socket is non blocking. When enabled this causes the accept to immediately post a CQE with EAGAIN, which means you cannot perform an accept SQE on a NONBLOCK socket asynchronously. By removing the flag if there is no pending accept then poll is armed as usual and when a connection comes in the CQE is posted. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220324143435.2875844-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-24io_uring: remove IORING_CQE_F_MSGJens Axboe1-2/+1
This was introduced with the message ring opcode, but isn't strictly required for the request itself. The sender can encode what is needed in user_data, which is passed to the receiver. It's unclear if having a separate flag that essentially says "This CQE did not originate from an SQE on this ring" provides any real utility to applications. While we can always re-introduce a flag to provide this information, we cannot take it away at a later point in time. Remove the flag while we still can, before it's in a released kernel. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-24io_uring: add flag for disabling provided buffer recyclingJens Axboe1-0/+8
If we need to continue doing this IO, then we don't want a potentially selected buffer recycled. Add a flag for that. Set this for recv/recvmsg if they do partial IO. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-24io_uring: ensure recv and recvmsg handle MSG_WAITALL correctlyJens Axboe1-0/+28
We currently don't attempt to get the full asked for length even if MSG_WAITALL is set, if we get a partial receive. If we do see a partial receive, then just note how many bytes we did and return -EAGAIN to get it retried. The iov is advanced appropriately for the vector based case, and we manually bump the buffer and remainder for the non-vector case. Cc: stable@vger.kernel.org Reported-by: Constantine Gavrilov <constantine.gavrilov@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-23io_uring: don't recycle provided buffer if punted to async workerJens Axboe1-3/+8
We only really need to recycle the buffer when going async for a file type that has an indefinite reponse time (eg non-file/bdev). And for files that to arm poll, the async worker will arm poll anyway and the buffer will get recycled there. In that latter case, we're not holding ctx->uring_lock. Ensure we take the issue_flags into account and acquire it if we need to. Fixes: b1c62645758e ("io_uring: recycle provided buffers if request goes async") Reported-by: Stefan Roesch <shr@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-23io_uring: fix assuming triggered poll waitqueue is the single pollJens Axboe1-4/+12
syzbot reports a recent regression: BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650 kernel/sched/wait.c:101 Read of size 8 at addr ffff888011e8a130 by task syz-executor413/3618 CPU: 0 PID: 3618 Comm: syz-executor413 Tainted: G W 5.17.0-syzkaller-01402-g8565d64430f8 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0x8d/0x303 mm/kasan/report.c:255 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 __wake_up_common+0x637/0x650 kernel/sched/wait.c:101 __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:138 tty_release+0x657/0x1200 drivers/tty/tty_io.c:1781 __fput+0x286/0x9f0 fs/file_table.c:317 task_work_run+0xdd/0x1a0 kernel/task_work.c:164 exit_task_work include/linux/task_work.h:32 [inline] do_exit+0xaff/0x29d0 kernel/exit.c:806 do_group_exit+0xd2/0x2f0 kernel/exit.c:936 __do_sys_exit_group kernel/exit.c:947 [inline] __se_sys_exit_group kernel/exit.c:945 [inline] __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:945 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f439a1fac69 which is due to leaving the request on the waitqueue mistakenly. The reproducer is using a tty device, which means we end up arming the same poll queue twice (it uses the same poll waitqueue for both), but in io_poll_wake() we always just clear REQ_F_SINGLE_POLL regardless of which entry triggered. This leaves one waitqueue potentially armed after we're done, which then blows up in tty when the waitqueue is attempted removed. We have no room to store this information, so simply encode it in the wait_queue_entry->private where we store the io_kiocb request pointer. Fixes: 91eac1c69c20 ("io_uring: cache poll/double-poll state with a request flag") Reported-by: syzbot+09ad4050dd3a120bfccd@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-23io_uring: bump poll refs to full 31-bitsJens Axboe1-1/+1
The previous commit: 1bc84c40088 ("io_uring: remove poll entry from list when canceling all") removed a potential overflow condition for the poll references. They are currently limited to 20-bits, even if we have 31-bits available. The upper bit is used to mark for cancelation. Bump the poll ref space to 31-bits, making that kind of situation much harder to trigger in general. We'll separately add overflow checking and handling. Fixes: aa43477b0402 ("io_uring: poll rework") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-22io_uring: remove poll entry from list when canceling allJens Axboe1-0/+1
When the ring is exiting, as part of the shutdown, poll requests are removed. But io_poll_remove_all() does not remove entries when finding them, and since completions are done out-of-band, we can find and remove the same entry multiple times. We do guard the poll execution by poll ownership, but that does not exclude us from reissuing a new one once the previous removal ownership goes away. This can race with poll execution as well, where we then end up seeing req->apoll be NULL because a previous task_work requeue finished the request. Remove the poll entry when we find it and get ownership of it. This prevents multiple invocations from finding it. Fixes: aa43477b0402 ("io_uring: poll rework") Reported-by: Dylan Yudaken <dylany@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-22Merge tag 'for-5.18/io_uring-statx-2022-03-18' of ↵Linus Torvalds1-2/+20
git://git.kernel.dk/linux-block Pull io_uring statx fixes from Jens Axboe: "On top of the main io_uring branch, this is to ensure that the filename component of statx is stable after submit. That requires a few VFS related changes" * tag 'for-5.18/io_uring-statx-2022-03-18' of git://git.kernel.dk/linux-block: io-uring: Make statx API stable
2022-03-21io_uring: fix memory ordering when SQPOLL thread goes to sleepAlmog Khaikin1-0/+7
Without a full memory barrier between the store to the flags and the load of the SQ tail the two operations can be reordered and this can lead to a situation where the SQPOLL thread goes to sleep while the application writes to the SQ tail and doesn't see the wakeup flag. This memory barrier pairs with a full memory barrier in the application between its store to the SQ tail and its load of the flags. Signed-off-by: Almog Khaikin <almogkh@gmail.com> Link: https://lore.kernel.org/r/20220321090059.46313-1-almogkh@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-21io_uring: ensure that fsnotify is always calledJens Axboe1-1/+7
Ensure that we call fsnotify_modify() if we write a file, and that we do fsnotify_access() if we read it. This enables anyone using inotify on the file to get notified. Ditto for fallocate, ensure that fsnotify_modify() is called. Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>