summaryrefslogtreecommitdiff
path: root/io_uring/io_uring.c
AgeCommit message (Collapse)AuthorFilesLines
2023-08-08io_uring/parisc: Adjust pgoff in io_uring mmap() for pariscHelge Deller1-0/+3
The changes from commit 32832a407a71 ("io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()") to the parisc implementation of get_unmapped_area() broke glibc's locale-gen executable when running on parisc. This patch reverts those architecture-specific changes, and instead adjusts in io_uring_mmu_get_unmapped_area() the pgoff offset which is then given to parisc's get_unmapped_area() function. This is much cleaner than the previous approach, and we still will get a coherent addresss. This patch has no effect on other architectures (SHM_COLOUR is only defined on parisc), and the liburing testcase stil passes on parisc. Cc: stable@vger.kernel.org # 6.4 Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Fixes: 32832a407a71 ("io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()") Fixes: d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements") Link: https://lore.kernel.org/r/ZNEyGV0jyI8kOOfz@p100 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24io_uring: gate iowait schedule on having pending requestsJens Axboe1-6/+17
A previous commit made all cqring waits marked as iowait, as a way to improve performance for short schedules with pending IO. However, for use cases that have a special reaper thread that does nothing but wait on events on the ring, this causes a cosmetic issue where we know have one core marked as being "busy" with 100% iowait. While this isn't a grave issue, it is confusing to users. Rather than always mark us as being in iowait, gate setting of current->in_iowait to 1 by whether or not the waiting task has pending requests. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/CAMEGJJ2RxopfNQ7GNLhr7X9=bHXKo+G5OOe0LUq=+UgLXsv1Xg@mail.gmail.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=217699 Link: https://bugzilla.kernel.org/show_bug.cgi?id=217700 Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Phil Elwell <phil@raspberrypi.com> Tested-by: Andres Freund <andres@anarazel.de> Fixes: 8a796565cec3 ("io_uring: Use io_schedule* in cqring wait") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-21io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()Helge Deller1-25/+17
The io_uring testcase is broken on IA-64 since commit d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements"). The reason is, that this commit introduced an own architecture independend get_unmapped_area() search algorithm which finds on IA-64 a memory region which is outside of the regular memory region used for shared userspace mappings and which can't be used on that platform due to aliasing. To avoid similar problems on IA-64 and other platforms in the future, it's better to switch back to the architecture-provided get_unmapped_area() function and adjust the needed input parameters before the call. Beside fixing the issue, the function now becomes easier to understand and maintain. This patch has been successfully tested with the io_uring testcase on physical x86-64, ppc64le, IA-64 and PA-RISC machines. On PA-RISC the LTP mmmap testcases did not report any regressions. Cc: stable@vger.kernel.org # 6.4 Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: matoro <matoro_mailinglist_kernel@matoro.tk> Fixes: d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements") Link: https://lore.kernel.org/r/20230721152432.196382-2-deller@gmx.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-20io_uring: treat -EAGAIN for REQ_F_NOWAIT as final for io-wqJens Axboe1-0/+8
io-wq assumes that an issue is blocking, but it may not be if the request type has asked for a non-blocking attempt. If we get -EAGAIN for that case, then we need to treat it as a final result and not retry or arm poll for it. Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/issues/897 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-18io_uring: don't audit the capability check in io_uring_create()Ondrej Mosnacek1-1/+1
The check being unconditional may lead to unwanted denials reported by LSMs when a process has the capability granted by DAC, but denied by an LSM. In the case of SELinux such denials are a problem, since they can't be effectively filtered out via the policy and when not silenced, they produce noise that may hide a true problem or an attack. Since not having the capability merely means that the created io_uring context will be accounted against the current user's RLIMIT_MEMLOCK limit, we can disable auditing of denials for this check by using ns_capable_noaudit() instead of capable(). Fixes: 2b188cc1bb85 ("Add io_uring IO interface") Link: https://bugzilla.redhat.com/show_bug.cgi?id=2193317 Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/20230718115607.65652-1-omosnace@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-07io_uring: Use io_schedule* in cqring waitAndres Freund1-2/+13
I observed poor performance of io_uring compared to synchronous IO. That turns out to be caused by deeper CPU idle states entered with io_uring, due to io_uring using plain schedule(), whereas synchronous IO uses io_schedule(). The losses due to this are substantial. On my cascade lake workstation, t/io_uring from the fio repository e.g. yields regressions between 20% and 40% with the following command: ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0 This is repeatable with different filesystems, using raw block devices and using different block devices. Use io_schedule_prepare() / io_schedule_finish() in io_cqring_wait_schedule() to address the difference. After that using io_uring is on par or surpassing synchronous IO (using registered files etc makes it reliably win, but arguably is a less fair comparison). There are other calls to schedule() in io_uring/, but none immediately jump out to be similarly situated, so I did not touch them. Similarly, it's possible that mutex_lock_io() should be used, but it's not clear if there are cases where that matters. Cc: stable@vger.kernel.org # 5.10+ Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Andres Freund <andres@anarazel.de> Link: https://lore.kernel.org/r/20230707162007.194068-1-andres@anarazel.de [axboe: minor style fixup] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28io_uring: flush offloaded and delayed task_work on exitJens Axboe1-3/+19
io_uring offloads task_work for cancelation purposes when the task is exiting. This is conceptually fine, but we should be nicer and actually wait for that work to complete before returning. Add an argument to io_fallback_tw() telling it to flush the deferred work when it's all queued up, and have it flush a ctx behind whenever the ctx changes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28io_uring: remove io_fallback_tw() forward declarationJens Axboe1-15/+14
It's used just one function higher up, get rid of the declaration and just move it up a bit. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: merge conditional unlock flush helpersPavel Begunkov1-12/+1
There is no reason not to use __io_cq_unlock_post_flush for intermediate aux CQE flushing, all ->task_complete should apply there, i.e. if set it should be the submitter task. Combine them, get rid of of __io_cq_unlock_post() and rename the left function. This place was also taking a couple percents of CPU according to profiles for max throughput net benchmarks due to multishot recv flooding it with completions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bbed60734cbec2e833d9c7bdcf9741aada5d8aab.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: make io_cq_unlock_post staticPavel Begunkov1-1/+1
io_cq_unlock_post() is exclusively used in io_uring/io_uring.c, mark it static and don't expose to other files. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3dc8127dda4514e1dd24bb32035faac887c5fa37.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: inline __io_cq_unlockPavel Begunkov1-8/+4
__io_cq_unlock is not very helpful, and users should be calling flush variants anyway. Open code the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d875c4cfb69f38ccecb58a57111446c77a614caa.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: fix acquire/release annotationsPavel Begunkov1-3/+0
We do conditional locking, so __io_cq_lock() and friends not always actually grab/release the lock, so kill misleading annotations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2a098f9144c24cab622f8bf90b39f44da5d0401e.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: kill io_cq_unlock()Pavel Begunkov1-8/+2
We're abusing ->completion_lock helpers. io_cq_unlock() neither locking conditionally nor doing CQE flushing, which means that callers must have some side reason of taking the lock and should do it directly. Open code io_cq_unlock() into io_cqring_overflow_kill() and clean it up. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7dabb36856db2b562e78780480396c52c29b2bf4.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: remove IOU_F_TWQ_FORCE_NORMALPavel Begunkov1-11/+14
Extract a function for non-local task_work_add, and use it directly from io_move_task_work_from_local(). Now we don't use IOU_F_TWQ_FORCE_NORMAL and it can be killed. As a small positive side effect we don't grab task->io_uring in io_req_normal_work_add anymore, which is not needed for io_req_local_work_add(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2e55571e8ff2927ae3cc12da606d204e2485525b.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: don't batch task put on reqs freePavel Begunkov1-22/+10
We're trying to batch io_put_task() in io_free_batch_list(), but considering that the hot path is a simple inc, it's most cerainly and probably faster to just do io_put_task() instead of task tracking. We don't care about io_put_task_remote() as it's only for IOPOLL where polling/waiting is done by not the submitter task. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4a7ef7dce845fe2bd35507bf389d6bd2d5c1edf0.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: move io_clean_op()Pavel Begunkov1-34/+33
Move io_clean_op() up in the source file and remove the forward declaration, as the function doesn't have tricky dependencies anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1b7163b2ba7c3a8322d972c79c1b0a9301b3057e.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: inline io_dismantle_req()Pavel Begunkov1-12/+5
io_dismantle_req() is only used in __io_req_complete_post(), open code it there. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ba8f20cb2c914eefa2e7d120a104a198552050db.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: remove io_free_req_twPavel Begunkov1-18/+5
Request completion is a very hot path in general, but there are 3 places that can be doing it: io_free_batch_list(), io_req_complete_post() and io_free_req_tw(). io_free_req_tw() is used rather marginally and we don't care about it. Killing it can help to clean up and optimise the left two, do that by replacing it with io_req_task_complete(). There are two things to consider: 1) io_free_req() is called when all refs are put, so we need to reinit references. The easiest way to do that is to clear REQ_F_REFCOUNT. 2) We also don't need a cqe from it, so silence it with REQ_F_CQE_SKIP. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/434a2be8f33d474ad888ce1c17fe5ea7bbcb2a55.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: open code io_put_req_find_nextPavel Begunkov1-18/+7
There is only one user of io_put_req_find_next() and it doesn't make much sense to have it. Open code the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/38b5c5e48e4adc8e6a0cd16fdd5c1531d7ff81a9.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20io_uring: add helpers to decode the fixed file file_ptrChristoph Hellwig1-6/+4
Remove all the open coded magic on slot->file_ptr by introducing two helpers that return the file pointer and the flags instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20io_uring: return REQ_F_ flags from io_file_get_flagsChristoph Hellwig1-3/+3
Two of the three callers want them, so return the more usual format, and shift into the FFS_ form only for the fixed file table. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20io_uring: remove io_req_ffs_setChristoph Hellwig1-1/+1
Just checking the flag directly makes it a lot more obvious what is going on here. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20io_uring: remove a confusing comment above io_file_get_flagsChristoph Hellwig1-5/+0
The SCM inflight mechanism has nothing to do with the fact that a file might be a regular file or not and if it supports non-blocking operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20io_uring: remove the mode variable in io_file_get_flagsChristoph Hellwig1-2/+1
The variable is only once now, so don't bother with it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20io_uring: remove __io_file_supports_nowaitChristoph Hellwig1-14/+1
Now that this only checks O_NONBLOCK and FMODE_NOWAIT, the helper is complete overkilļ, and the comments are confusing bordering to wrong. Just inline the check into the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12io_uring: wait interruptibly for request completions on exitJens Axboe1-2/+18
WHen the ring exits, cleanup is done and the final cancelation and waiting on completions is done by io_ring_exit_work. That function is invoked by kworker, which doesn't take any signals. Because of that, it doesn't really matter if we wait for completions in TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state. However, it does matter to the hung task detection checker! Normally we expect cancelations and completions to happen rather quickly. Some test cases, however, will exit the ring and park the owning task stopped (eg via SIGSTOP). If the owning task needs to run task_work to complete requests, then io_ring_exit_work won't make any progress until the task is runnable again. Hence io_ring_exit_work can trigger the hung task detection, which is particularly problematic if panic-on-hung-task is enabled. As the ring exit doesn't take signals to begin with, have it wait interruptibly rather than uninterruptibly. io_uring has a separate stuck-exit warning that triggers independently anyway, so we're not really missing anything by making this switch. Cc: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/r/b0e4aaef-7088-56ce-244c-976edeac0e66@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-08io_uring: get rid of unnecessary 'length' variableJens Axboe1-4/+1
Just use the ARRAY_SIZE directly, we don't use length for anything else in this function. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-07io_uring: cleanup io_aux_cqe() APIJens Axboe1-1/+3
Everybody is passing in the request, so get rid of the io_ring_ctx and explicit user_data pass-in. Both the ctx and user_data can be deduced from the request at hand. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-02io_uring: avoid indirect function calls for the hottest task_workJens Axboe1-2/+7
We use task_work for a variety of reasons, but doing completions or triggering rety after poll are by far the hottest two. Use the indirect funtion call wrappers to avoid the indirect function call if CONFIG_RETPOLINE is set. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19io_uring: maintain ordering for DEFER_TASKRUN tw listJens Axboe1-1/+5
We use lockless lists for the local and deferred task_work, which means that when we queue up events for processing, we ultimately process them in reverse order to how they were received. This usually doesn't matter, but for some cases, it does seem to make a big difference. Do the right thing and reverse the list before processing it, so that we know it's processed in the same order in which it was received. This makes a rather big difference for some medium load network tests, where consistency of performance was a bit all over the place. Here's a case that has 4 connections each doing two sends and receives: io_uring port=10002: rps:161.13k Bps: 1.45M idle=256ms io_uring port=10002: rps:107.27k Bps: 0.97M idle=413ms io_uring port=10002: rps:136.98k Bps: 1.23M idle=321ms io_uring port=10002: rps:155.58k Bps: 1.40M idle=268ms and after the change: io_uring port=10002: rps:205.48k Bps: 1.85M idle=140ms user=40ms io_uring port=10002: rps:203.57k Bps: 1.83M idle=139ms user=20ms io_uring port=10002: rps:218.79k Bps: 1.97M idle=106ms user=30ms io_uring port=10002: rps:217.88k Bps: 1.96M idle=110ms user=20ms io_uring port=10002: rps:222.31k Bps: 2.00M idle=101ms user=0ms io_uring port=10002: rps:218.74k Bps: 1.97M idle=102ms user=20ms io_uring port=10002: rps:208.43k Bps: 1.88M idle=125ms user=40ms using more of the time to actually process work rather than sitting idle. No effects have been observed at the peak end of the spectrum, where performance is still the same even with deep batch depths (and hence more items to sort). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-16io_uring: Add io_uring_setup flag to pre-register ring fd and never install itJosh Triplett1-15/+22
With IORING_REGISTER_USE_REGISTERED_RING, an application can register the ring fd and use it via registered index rather than installed fd. This allows using a registered ring for everything *except* the initial mmap. With IORING_SETUP_NO_MMAP, io_uring_setup uses buffers allocated by the user, rather than requiring a subsequent mmap. The combination of the two allows a user to operate *entirely* via a registered ring fd, making it unnecessary to ever install the fd in the first place. So, add a flag IORING_SETUP_REGISTERED_FD_ONLY to make io_uring_setup register the fd and return a registered index, without installing the fd. This allows an application to avoid touching the fd table at all, and allows a library to never even momentarily install a file descriptor. This splits out an io_ring_add_registered_file helper from io_ring_add_registered_fd, for use by io_uring_setup. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Link: https://lore.kernel.org/r/bc8f431bada371c183b95a83399628b605e978a3.1682699803.git.josh@joshtriplett.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-16io_uring: support for user allocated memory for rings/sqesJens Axboe1-9/+97
Currently io_uring applications must call mmap(2) twice to map the rings themselves, and the sqes array. This works fine, but it does not support using huge pages to back the rings/sqes. Provide a way for the application to pass in pre-allocated memory for the rings/sqes, which can then suitably be allocated from shmfs or via mmap to get huge page support. Particularly for larger rings, this reduces the TLBs needed. If an application wishes to take advantage of that, it must pre-allocate the memory needed for the sq/cq ring, and the sqes. The former must be passed in via the io_uring_params->cq_off.user_data field, while the latter is passed in via the io_uring_params->sq_off.user_data field. Then it must set IORING_SETUP_NO_MMAP in the io_uring_params->flags field, and io_uring will then map the existing memory into the kernel for shared use. The application must not call mmap(2) to map rings as it otherwise would have, that will now fail with -EINVAL if this setup flag was used. The pages used for the rings and sqes must be contigious. The intent here is clearly that huge pages should be used, otherwise the normal setup procedure works fine as-is. The application may use one huge page for both the rings and sqes. Outside of those initialization changes, everything works like it did before. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-16io_uring: add ring freeing helperJens Axboe1-6/+11
We do rings and sqes separately, move them into a helper that does both the freeing and clearing of the memory. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-16io_uring: return error pointer from io_mem_alloc()Jens Axboe1-6/+12
In preparation for having more than one time of ring allocator, make the existing one return valid/error-pointer rather than just NULL. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-16io_uring: remove sq/cq_off memsetJens Axboe1-2/+4
We only have two reserved members we're not clearing, do so manually instead. This is in preparation for using one of these members for a new feature. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-15io_uring: rely solely on FMODE_NOWAITJens Axboe1-21/+0
Now that we have both sockets and block devices setting FMODE_NOWAIT appropriately, we can get rid of all the odd special casing in __io_file_supports_nowait() and rely soley on FMODE_NOWAIT and O_NONBLOCK rather than special case sockets and (in particular) bdevs. Link: https://lore.kernel.org/r/20230509151910.183637-4-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-26Merge tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linuxLinus Torvalds1-111/+237
Pull io_uring updates from Jens Axboe: - Cleanup of the io-wq per-node mapping, notably getting rid of it so we just have a single io_wq entry per ring (Breno) - Followup to the above, move accounting to io_wq as well and completely drop struct io_wqe (Gabriel) - Enable KASAN for the internal io_uring caches (Breno) - Add support for multishot timeouts. Some applications use timeouts to wake someone waiting on completion entries, and this makes it a bit easier to just have a recurring timer rather than needing to rearm it every time (David) - Support archs that have shared cache coloring between userspace and the kernel, and hence have strict address requirements for mmap'ing the ring into userspace. This should only be parisc/hppa. (Helge, me) - XFS has supported O_DIRECT writes without needing to lock the inode exclusively for a long time, and ext4 now supports it as well. This is true for the common cases of not extending the file size. Flag the fs as having that feature, and utilize that to avoid serializing those writes in io_uring (me) - Enable completion batching for uring commands (me) - Revert patch adding io_uring restriction to what can be GUP mapped or not. This does not belong in io_uring, as io_uring isn't really special in this regard. Since this is also getting in the way of cleanups and improvements to the GUP code, get rid of if (me) - A few series greatly reducing the complexity of registered resources, like buffers or files. Not only does this clean up the code a lot, the simplified code is also a LOT more efficient (Pavel) - Series optimizing how we wait for events and run task_work related to it (Pavel) - Fixes for file/buffer unregistration with DEFER_TASKRUN (Pavel) - Misc cleanups and improvements (Pavel, me) * tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux: (71 commits) Revert "io_uring/rsrc: disallow multi-source reg buffers" io_uring: add support for multishot timeouts io_uring/rsrc: disassociate nodes and rsrc_data io_uring/rsrc: devirtualise rsrc put callbacks io_uring/rsrc: pass node to io_rsrc_put_work() io_uring/rsrc: inline io_rsrc_put_work() io_uring/rsrc: add empty flag in rsrc_node io_uring/rsrc: merge nodes and io_rsrc_put io_uring/rsrc: infer node from ctx on io_queue_rsrc_removal io_uring/rsrc: remove unused io_rsrc_node::llist io_uring/rsrc: refactor io_queue_rsrc_removal io_uring/rsrc: simplify single file node switching io_uring/rsrc: clean up __io_sqe_buffers_update() io_uring/rsrc: inline switch_start fast path io_uring/rsrc: remove rsrc_data refs io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce io_uring/rsrc: use wq for quiescing io_uring/rsrc: refactor io_rsrc_ref_quiesce io_uring/rsrc: remove io_rsrc_node::done io_uring/rsrc: use nospec'ed indexes ...
2023-04-15io_uring/rsrc: remove rsrc_data refsPavel Begunkov1-2/+2
Instead of waiting for rsrc_data->refs to be downed to zero, check whether there are rsrc nodes queued for completion, that's easier then maintaining references. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8e33fd143d83e11af3e386aea28eb6d6c6a1be10.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-15io_uring/rsrc: use wq for quiescingPavel Begunkov1-0/+1
Replace completions with waitqueues for rsrc data quiesce, the main wakeup condition is when data refs hit zero. Note that data refs are only changes under ->uring_lock, so we prepare before mutex_unlock() reacquire it after taking the lock back. This change will be needed in the next patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1d0dbc74b3b4fd67c8f01819e680c5e0da252956.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-14io_uring: complete request via task work in case of DEFER_TASKRUNMing Lei1-1/+1
So far io_req_complete_post() only covers DEFER_TASKRUN by completing request via task work when the request is completed from IOWQ. However, uring command could be completed from any context, and if io uring is setup with DEFER_TASKRUN, the command is required to be completed from current context, otherwise wait on IORING_ENTER_GETEVENTS can't be wakeup, and may hang forever. The issue can be observed on removing ublk device, but turns out it is one generic issue for uring command & DEFER_TASKRUN, so solve it in io_uring core code. Fixes: e6aeb2721d3b ("io_uring: complete all requests in task context") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-block/b3fc9991-4c53-9218-a8cc-5b4dd3952108@kernel.dk/ Reported-by: Jens Axboe <axboe@kernel.dk> Cc: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-12io_uring/rsrc: refactor io_rsrc_node_switchPavel Begunkov1-3/+2
We use io_rsrc_node_switch() coupled with io_rsrc_node_switch_start() for a bunch of cases including initialising ctx->rsrc_node, i.e. by passing NULL instead of rsrc_data. Leave it to only deal with actual node changing. For that, first remove it from io_uring_create() and add a function allocating the first node. Then also remove all calls to io_rsrc_node_switch() from files/buffers register as we already have a node installed and it does essentially nothing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d146fe306ff98b1a5a60c997c252534f03d423d7.1681210788.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-12io_uring/rsrc: consolidate node cachingPavel Begunkov1-2/+0
We store one pre-allocated rsrc node in ->rsrc_backup_node, merge it with ->rsrc_node_cache. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6d5410e51ccd29be7a716be045b51d6b371baef6.1681210788.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-12io_uring: shut io_prep_async_work warningPavel Begunkov1-1/+1
io_uring/io_uring.c:432 io_prep_async_work() error: we previously assumed 'req->file' could be null (see line 425). Even though it's a false positive as there will not be REQ_F_ISREG set without a file, let's add a simple check to make the kernel test robot happy. We don't care about performance here, but assumingly it'll be optimised out by the compiler. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a6cfbe92c74b789c0b4f046f7f98d19b1ca2e5b7.1681210788.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-07io_uring: optimise io_req_local_work_addPavel Begunkov1-8/+16
Chains of memory accesses are never good for performance. The req->task->io_uring->in_cancel in io_req_local_work_add() is there so that when a task is exiting via io_uring_try_cancel_requests() and starts waiting for completions, it gets woken up by every new task_work item queued. Do a little trick by announcing waiting in io_uring_try_cancel_requests(), making io_req_local_work_add() wake us up. We also need to check for deferred tw items after prepare_to_wait(TASK_INTERRUPTIBLE); Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fb11597e9bbcb365901824f8c5c2cf0d6ee100d0.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-07io_uring: refactor __io_cq_unlock_post_flush()Pavel Begunkov1-10/+12
Separate ->task_complete path in __io_cq_unlock_post_flush(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/baa9b8d822f024e4ee01c40209dbbe38d9c8c11d.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-07io_uring: reduce scheduling due to twPavel Begunkov1-21/+47
Every task_work will try to wake the task to be executed, which causes excessive scheduling and additional overhead. For some tw it's justified, but others won't do much but post a single CQE. When a task waits for multiple cqes, every such task_work will wake it up. Instead, the task may give a hint about how many cqes it waits for, io_req_local_work_add() will compare against it and skip wake ups if #cqes + #tw is not enough to satisfy the waiting condition. Task_work that uses the optimisation should be simple enough and never post more than one CQE. It's also ignored for non DEFER_TASKRUN rings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d2b77e99d1e86624d8a69f7037d764b739dcd225.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-07io_uring: inline llist_add()Pavel Begunkov1-1/+8
We'll need to grab some information from the previous request in the tw list, inline llist_add(), it'll be used in the following patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f0165493af7b379943c792114b972f331e7d7d10.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-07io_uring: add tw add flagsPavel Begunkov1-3/+4
We pass 'allow_local' into io_req_task_work_add() but will need more flags. Replace it with a flags bit field and name this allow_local flag. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4c0f01e7ef4e6feebfb199093cc995af7a19befa.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-07io_uring: refactor io_cqring_wake()Pavel Begunkov1-4/+2
Instead of smp_mb() + __io_cqring_wake() in __io_cq_unlock_post_flush() use equivalent io_cqring_wake(). With that we can clean it up further and remove __io_cqring_wake(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/662ee5d898168ac206be06038525e97b64072a46.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-07io_uring: optimize local tw add ctx pinningPavel Begunkov1-2/+6
We currently pin the ctx for io_req_local_work_add() with percpu_ref_get/put, which implies two rcu_read_lock/unlock pairs and some extra overhead on top in the fast path. Replace it with a pure rcu read and let io_ring_exit_work() synchronise against it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cbdfcb6b232627f30e9e50ef91f13c4f05910247.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>