summaryrefslogtreecommitdiff
path: root/io_uring/io_uring.c
AgeCommit message (Collapse)AuthorFilesLines
2023-09-07Revert "io_uring: fix IO hang in io_wq_put_and_exit from do_exit()"Jens Axboe1-32/+0
This reverts commit b484a40dc1f16edb58e5430105a021e1916e6f27. This commit cancels all requests with io-wq, not just the ones from the originating task. This breaks use cases that have thread pools, or just multiple tasks issuing requests on the same ring. The liburing regression test for this also shows that problem: $ test/thread-exit.t cqe->res=-125, Expected 512 where an IO thread gets its request canceled rather than complete successfully. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-07io_uring: fix unprotected iopoll overflowPavel Begunkov1-2/+2
[ 71.490669] WARNING: CPU: 3 PID: 17070 at io_uring/io_uring.c:769 io_cqring_event_overflow+0x47b/0x6b0 [ 71.498381] Call Trace: [ 71.498590] <TASK> [ 71.501858] io_req_cqe_overflow+0x105/0x1e0 [ 71.502194] __io_submit_flush_completions+0x9f9/0x1090 [ 71.503537] io_submit_sqes+0xebd/0x1f00 [ 71.503879] __do_sys_io_uring_enter+0x8c5/0x2380 [ 71.507360] do_syscall_64+0x39/0x80 We decoupled CQ locking from ->task_complete but haven't fixed up places forcing locking for CQ overflows. Fixes: ec26c225f06f5 ("io_uring: merge iopoll and normal completion paths") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-07io_uring: break out of iowq iopoll on teardownPavel Begunkov1-0/+2
io-wq will retry iopoll even when it failed with -EAGAIN. If that races with task exit, which sets TIF_NOTIFY_SIGNAL for all its workers, such workers might potentially infinitely spin retrying iopoll again and again and each time failing on some allocation / waiting / etc. Don't keep spinning if io-wq is dying. Fixes: 561fb04a6a225 ("io_uring: replace workqueue usage with io-wq") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-05io_uring: add a sysctl to disable io_uring system-wideMatteo Rizzo1-0/+50
Introduce a new sysctl (io_uring_disabled) which can be either 0, 1, or 2. When 0 (the default), all processes are allowed to create io_uring instances, which is the current behavior. When 1, io_uring creation is disabled (io_uring_setup() will fail with -EPERM) for unprivileged processes not in the kernel.io_uring_group group. When 2, calls to io_uring_setup() fail with -EPERM regardless of privilege. Signed-off-by: Matteo Rizzo <matteorizzo@google.com> [JEM: modified to add io_uring_group] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/x49y1i42j1z.fsf@segfault.boston.devel.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-01io_uring: fix IO hang in io_wq_put_and_exit from do_exit()Ming Lei1-0/+32
io_wq_put_and_exit() is called from do_exit(), but all FIXED_FILE requests in io_wq aren't canceled in io_uring_cancel_generic() called from do_exit(). Meantime io_wq IO code path may share resource with normal iopoll code path. So if any HIPRI request is submittd via io_wq, this request may not get resouce for moving on, given iopoll isn't possible in io_wq_put_and_exit(). The issue can be triggered when terminating 't/io_uring -n4 /dev/nullb0' with default null_blk parameters. Fix it by always cancelling all requests in io_wq by adding helper of io_uring_cancel_wq(), and this way is reasonable because io_wq destroying follows canceling requests immediately. Closes: https://lore.kernel.org/linux-block/3893581.1691785261@warthog.procyon.org.uk/ Reported-by: David Howells <dhowells@redhat.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230901134916.2415386-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-30Merge tag 'for-6.6/io_uring-2023-08-28' of git://git.kernel.dk/linuxLinus Torvalds1-102/+123
Pull io_uring updates from Jens Axboe: "Fairly quiet round in terms of features, mostly just improvements all over the map for existing code. In detail: - Initial support for socket operations through io_uring. Latter half of this will likely land with the 6.7 kernel, then allowing things like get/setsockopt (Breno) - Cleanup of the cancel code, and then adding support for canceling requests with the opcode as the key (me) - Improvements for the io-wq locking (me) - Fix affinity setting for SQPOLL based io-wq (me) - Remove the io_uring userspace code. These were added initially as copies from liburing, but all of them have since bitrotted and are way out of date at this point. Rather than attempt to keep them in sync, just get rid of them. People will have liburing available anyway for these examples. (Pavel) - Series improving the CQ/SQ ring caching (Pavel) - Misc fixes and cleanups (Pavel, Yue, me)" * tag 'for-6.6/io_uring-2023-08-28' of git://git.kernel.dk/linux: (47 commits) io_uring: move iopoll ctx fields around io_uring: move multishot cqe cache in ctx io_uring: separate task_work/waiting cache line io_uring: banish non-hot data to end of io_ring_ctx io_uring: move non aligned field to the end io_uring: add option to remove SQ indirection io_uring: compact SQ/CQ heads/tails io_uring: force inline io_fill_cqe_req io_uring: merge iopoll and normal completion paths io_uring: reorder cqring_flush and wakeups io_uring: optimise extra io_get_cqe null check io_uring: refactor __io_get_cqe() io_uring: simplify big_cqe handling io_uring: cqe init hardening io_uring: improve cqe !tracing hot path io_uring/rsrc: Annotate struct io_mapped_ubuf with __counted_by io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used io_uring: simplify io_run_task_work_sig return io_uring/rsrc: keep one global dummy_ubuf io_uring: never overflow io_aux_cqe ...
2023-08-30Merge tag 'mm-stable-2023-08-28-18-26' of ↵Linus Torvalds1-5/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...
2023-08-25io_uring: move multishot cqe cache in ctxPavel Begunkov1-3/+3
We cache multishot CQEs before flushing them to the CQ in submit_state.cqe. It's a 16 entry cache totalling 256 bytes in the middle of the io_submit_state structure. Move it out of there, it should help with CPU caches for the submission state, and shouldn't affect cached CQEs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/dbe1f39c043ee23da918836be44fcec252ce6711.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-25io_uring: add option to remove SQ indirectionPavel Begunkov1-20/+32
Not many aware, but io_uring submission queue has two levels. The first level usually appears as sq_array and stores indexes into the actual SQ. To my knowledge, no one has ever seriously used it, nor liburing exposes it to users. Add IORING_SETUP_NO_SQARRAY, when set we don't bother creating and using the sq_array and SQ heads/tails will be pointing directly into the SQ. Improves memory footprint, in term of both allocations as well as cache usage, and also should make io_get_sqe() less branchy in the end. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0ffa3268a5ef61d326201ff43a233315c96312e0.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-25io_uring: merge iopoll and normal completion pathsPavel Begunkov1-6/+12
io_do_iopoll() and io_submit_flush_completions() are pretty similar, both filling CQEs and then free a list of requests. Don't duplicate it and make iopoll use __io_submit_flush_completions(), which also helps with inlining and other optimisations. For that, we need to first find all completed iopoll requests and splice them from the iopoll list and then pass it down. This adds one extra list traversal, which should be fine as requests will stay hot in cache. CQ locking is already conditional, introduce ->lockless_cq and skip locking for IOPOLL as it's protected by ->uring_lock. We also add a wakeup optimisation for IOPOLL to __io_cq_unlock_post(), so it works just like io_cqring_ev_posted_iopoll(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3840473f5e8a960de35b77292026691880f6bdbc.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-25io_uring: reorder cqring_flush and wakeupsPavel Begunkov1-11/+3
Unlike in the past, io_commit_cqring_flush() doesn't do anything that may need io_cqring_wake() to be issued after, all requests it completes will go via task_work. Do io_commit_cqring_flush() after io_cqring_wake() to clean up __io_cq_unlock_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ed32dcfeec47e6c97bd6b18c152ddce5b218403f.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-25io_uring: optimise extra io_get_cqe null checkPavel Begunkov1-4/+3
If the cached cqe check passes in io_get_cqe*() it already means that the cqe we return is valid and non-zero, however the compiler is unable to optimise null checks like in io_fill_cqe_req(). Do a bit of trickery, return success/fail boolean from io_get_cqe*() and store cqe in the cqe parameter. That makes it do the right thing, erasing the check together with the introduced indirection. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/322ea4d3377d3d4efd8ae90ab8ed28a99f518210.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-25io_uring: refactor __io_get_cqe()Pavel Begunkov1-9/+4
Make __io_get_cqe simpler by not grabbing the cqe from refilled cached, but letting io_get_cqe() do it for us. That's cleaner and removes some duplication. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/74dc8fdf2657e438b2e05e1d478a3596924604e9.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-25io_uring: simplify big_cqe handlingPavel Begunkov1-5/+3
Don't keep big_cqe bits of req in a union with hash_node, find a separate space for it. It's bit safer, but also if we keep it always initialised, we can get rid of ugly REQ_F_CQE32_INIT handling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/447aa1b2968978c99e655ba88db536e903df0fe9.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-25io_uring: cqe init hardeningPavel Begunkov1-1/+1
io_kiocb::cqe stores the completion info which we'll memcpy to userspace, and we rely on callbacks and other later steps to populate it with right values. We have never had problems with that, but it would still be safer to zero it on allocation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b16a3b64dde678686460d3c3792c3ba6d3d1bc7a.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-22io_uring: stop calling free_compound_page()Matthew Wilcox (Oracle)1-5/+1
Patch series "Remove _folio_dtor and _folio_order", v2. This patch (of 13): folio_put() is the standard way to write this, and it's not appreciably slower. This is an enabling patch for removing free_compound_page() entirely. Link: https://lkml.kernel.org/r/20230816151201.3655946-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230816151201.3655946-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-16io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is usedJens Axboe1-11/+18
If we setup the ring with SQPOLL, then that polling thread has its own io-wq setup. This means that if the application uses IORING_REGISTER_IOWQ_AFF to set the io-wq affinity, we should not be setting it for the invoking task, but rather the sqpoll task. Add an sqpoll helper that parks the thread and updates the affinity, and use that one if we're using SQPOLL. Fixes: fe76421d1da1 ("io_uring: allow user configurable IO thread CPU affinity") Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/discussions/884 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring: simplify io_run_task_work_sig returnPavel Begunkov1-2/+2
Nobody cares about io_run_task_work_sig returning 1, we only check for negative errors. Simplify by keeping to 0/-error returns. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3aec8a532c003d6e50739b969a82989402696170.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring/rsrc: keep one global dummy_ubufPavel Begunkov1-9/+0
We set empty registered buffers to dummy_ubuf as an optimisation. Currently, we allocate the dummy entry for each ring, whenever we can simply have one global instance. We're casting out const on assignment, it's fine as we're not going to change the content of the dummy, the constness gives us an extra layer of protection if sth ever goes wrong. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e4a96dda35ab755914bc43f6781bba0df97ac489.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring: never overflow io_aux_cqePavel Begunkov1-4/+7
Now all callers of io_aux_cqe() set allow_overflow to false, remove the parameter and not allow overflowing auxilary multishot cqes. When CQ is full the function callers and all multishot requests in general are expected to complete the request. That prevents indefinite in-background grows of the overflow list and let's the userspace to handle the backlog at its own pace. Resubmitting a request should also be faster than accounting a bunch of overflows, so it should be better for perf when it happens, but a well behaving userspace should be trying to avoid overflows in any case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bb20d14d708ea174721e58bb53786b0521e4dd6d.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring: remove return from io_req_cqe_overflow()Pavel Begunkov1-4/+4
Nobody checks io_req_cqe_overflow()'s return, make it return void. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8f2029ad0c22f73451664172d834372608ee0a77.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring: open code io_fill_cqe_req()Pavel Begunkov1-3/+5
io_fill_cqe_req() is only called from one place, open code it, and rename __io_fill_cqe_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f432ce75bb1c94cadf0bd2add4d6aa510bd1fb36.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring: remove unnecessary forward declarationJens Axboe1-1/+0
We never use io_move_task_work_from_local() before it's defined in the file anyway, so kill the forward declaration. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10io_uring: have io_file_put() take an io_kiocb rather than the fileJens Axboe1-4/+2
No functional changes in this patch, just a prep patch for needing the request in io_file_put(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: cleanup 'ret' handling in io_iopoll_check()Jens Axboe1-7/+10
We return 0 for success, or -error when there's an error. Move the 'ret' variable into the loop where we are actually using it, to make it clearer that we don't carry this variable forward for return outside of the loop. While at it, also move the need_resched() break condition out of the while check itself, keeping it with the signal pending check. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: break iopolling on signalPavel Begunkov1-0/+3
Don't keep spinning iopoll with a signal set. It'll eventually return back, e.g. by virtue of need_resched(), but it's not a nice user experience. Cc: stable@vger.kernel.org Fixes: def596e9557c9 ("io_uring: support for IO polling") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/eeba551e82cad12af30c3220125eb6cb244cc94c.1691594339.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: fix false positive KASAN warningsPavel Begunkov1-1/+0
io_req_local_work_add() peeks into the work list, which can be executed in the meanwhile. It's completely fine without KASAN as we're in an RCU read section and it's SLAB_TYPESAFE_BY_RCU. With KASAN though it may trigger a false positive warning because internal io_uring caches are sanitised. Remove sanitisation from the io_uring request cache for now. Cc: stable@vger.kernel.org Fixes: 8751d15426a31 ("io_uring: reduce scheduling due to tw") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c6fbf7a82a341e66a0007c76eefd9d57f2d3ba51.1691541473.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: fix drain stalls by invalid SQEPavel Begunkov1-0/+2
cq_extra is protected by ->completion_lock, which io_get_sqe() misses. The bug is harmless as it doesn't happen in real life, requires invalid SQ index array and racing with submission, and only messes up the userspace, i.e. stall requests execution but will be cleaned up on ring destruction. Fixes: 15641e427070f ("io_uring: don't cache number of dropped SQEs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66096d54651b1a60534bb2023f2947f09f50ef73.1691538547.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09io_uring: annotate the struct io_kiocb slab for appropriate user copyJens Axboe1-2/+14
When compiling the kernel with clang and having HARDENED_USERCOPY enabled, the liburing openat2.t test case fails during request setup: usercopy: Kernel memory overwrite attempt detected to SLUB object 'io_kiocb' (offset 24, size 24)! ------------[ cut here ]------------ kernel BUG at mm/usercopy.c:102! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 3 PID: 413 Comm: openat2.t Tainted: G N 6.4.3-g6995e2de6891-dirty #19 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 RIP: 0010:usercopy_abort+0x84/0x90 Code: ce 49 89 ce 48 c7 c3 68 48 98 82 48 0f 44 de 48 c7 c7 56 c6 94 82 4c 89 de 48 89 c1 41 52 41 56 53 e8 e0 51 c5 00 48 83 c4 18 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 41 57 41 56 RSP: 0018:ffffc900016b3da0 EFLAGS: 00010296 RAX: 0000000000000062 RBX: ffffffff82984868 RCX: 4e9b661ac6275b00 RDX: ffff8881b90ec580 RSI: ffffffff82949a64 RDI: 00000000ffffffff RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000 R10: ffffc900016b3c88 R11: ffffc900016b3c30 R12: 00007ffe549659e0 R13: ffff888119014000 R14: 0000000000000018 R15: 0000000000000018 FS: 00007f862e3ca680(0000) GS:ffff8881b90c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005571483542a8 CR3: 0000000118c11000 CR4: 00000000003506e0 Call Trace: <TASK> ? __die_body+0x63/0xb0 ? die+0x9d/0xc0 ? do_trap+0xa7/0x180 ? usercopy_abort+0x84/0x90 ? do_error_trap+0xc6/0x110 ? usercopy_abort+0x84/0x90 ? handle_invalid_op+0x2c/0x40 ? usercopy_abort+0x84/0x90 ? exc_invalid_op+0x2f/0x40 ? asm_exc_invalid_op+0x16/0x20 ? usercopy_abort+0x84/0x90 __check_heap_object+0xe2/0x110 __check_object_size+0x142/0x3d0 io_openat2_prep+0x68/0x140 io_submit_sqes+0x28a/0x680 __se_sys_io_uring_enter+0x120/0x580 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x55714834de26 Code: ca 01 0f b6 82 d0 00 00 00 8b ba cc 00 00 00 45 31 c0 31 d2 41 b9 08 00 00 00 83 e0 01 c1 e0 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> 66 0f 1f 84 00 00 00 00 00 89 30 eb 89 0f 1f 40 00 8b 00 a8 06 RSP: 002b:00007ffe549659c8 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa RAX: ffffffffffffffda RBX: 00007ffe54965a50 RCX: 000055714834de26 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008 R10: 0000000000000000 R11: 0000000000000246 R12: 000055714834f057 R13: 00007ffe54965a50 R14: 0000000000000001 R15: 0000557148351dd8 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- when it tries to copy struct open_how from userspace into the per-command space in the io_kiocb. There's nothing wrong with the copy, but we're missing the appropriate annotations for allowing user copies to/from the io_kiocb slab. Allow copies in the per-command area, which is from the 'file' pointer to when 'opcode' starts. We do have existing user copies there, but they are not all annotated like the one that openat2_prep() uses, copy_struct_from_user(). But in practice opcodes should be allowed to copy data into their per-command area in the io_kiocb. Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08io_uring/parisc: Adjust pgoff in io_uring mmap() for pariscHelge Deller1-0/+3
The changes from commit 32832a407a71 ("io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()") to the parisc implementation of get_unmapped_area() broke glibc's locale-gen executable when running on parisc. This patch reverts those architecture-specific changes, and instead adjusts in io_uring_mmu_get_unmapped_area() the pgoff offset which is then given to parisc's get_unmapped_area() function. This is much cleaner than the previous approach, and we still will get a coherent addresss. This patch has no effect on other architectures (SHM_COLOUR is only defined on parisc), and the liburing testcase stil passes on parisc. Cc: stable@vger.kernel.org # 6.4 Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Fixes: 32832a407a71 ("io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()") Fixes: d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements") Link: https://lore.kernel.org/r/ZNEyGV0jyI8kOOfz@p100 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24io_uring: gate iowait schedule on having pending requestsJens Axboe1-6/+17
A previous commit made all cqring waits marked as iowait, as a way to improve performance for short schedules with pending IO. However, for use cases that have a special reaper thread that does nothing but wait on events on the ring, this causes a cosmetic issue where we know have one core marked as being "busy" with 100% iowait. While this isn't a grave issue, it is confusing to users. Rather than always mark us as being in iowait, gate setting of current->in_iowait to 1 by whether or not the waiting task has pending requests. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/CAMEGJJ2RxopfNQ7GNLhr7X9=bHXKo+G5OOe0LUq=+UgLXsv1Xg@mail.gmail.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=217699 Link: https://bugzilla.kernel.org/show_bug.cgi?id=217700 Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Phil Elwell <phil@raspberrypi.com> Tested-by: Andres Freund <andres@anarazel.de> Fixes: 8a796565cec3 ("io_uring: Use io_schedule* in cqring wait") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-21io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()Helge Deller1-25/+17
The io_uring testcase is broken on IA-64 since commit d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements"). The reason is, that this commit introduced an own architecture independend get_unmapped_area() search algorithm which finds on IA-64 a memory region which is outside of the regular memory region used for shared userspace mappings and which can't be used on that platform due to aliasing. To avoid similar problems on IA-64 and other platforms in the future, it's better to switch back to the architecture-provided get_unmapped_area() function and adjust the needed input parameters before the call. Beside fixing the issue, the function now becomes easier to understand and maintain. This patch has been successfully tested with the io_uring testcase on physical x86-64, ppc64le, IA-64 and PA-RISC machines. On PA-RISC the LTP mmmap testcases did not report any regressions. Cc: stable@vger.kernel.org # 6.4 Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: matoro <matoro_mailinglist_kernel@matoro.tk> Fixes: d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements") Link: https://lore.kernel.org/r/20230721152432.196382-2-deller@gmx.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-20io_uring: treat -EAGAIN for REQ_F_NOWAIT as final for io-wqJens Axboe1-0/+8
io-wq assumes that an issue is blocking, but it may not be if the request type has asked for a non-blocking attempt. If we get -EAGAIN for that case, then we need to treat it as a final result and not retry or arm poll for it. Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/issues/897 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-18io_uring: don't audit the capability check in io_uring_create()Ondrej Mosnacek1-1/+1
The check being unconditional may lead to unwanted denials reported by LSMs when a process has the capability granted by DAC, but denied by an LSM. In the case of SELinux such denials are a problem, since they can't be effectively filtered out via the policy and when not silenced, they produce noise that may hide a true problem or an attack. Since not having the capability merely means that the created io_uring context will be accounted against the current user's RLIMIT_MEMLOCK limit, we can disable auditing of denials for this check by using ns_capable_noaudit() instead of capable(). Fixes: 2b188cc1bb85 ("Add io_uring IO interface") Link: https://bugzilla.redhat.com/show_bug.cgi?id=2193317 Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/20230718115607.65652-1-omosnace@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-07io_uring: Use io_schedule* in cqring waitAndres Freund1-2/+13
I observed poor performance of io_uring compared to synchronous IO. That turns out to be caused by deeper CPU idle states entered with io_uring, due to io_uring using plain schedule(), whereas synchronous IO uses io_schedule(). The losses due to this are substantial. On my cascade lake workstation, t/io_uring from the fio repository e.g. yields regressions between 20% and 40% with the following command: ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0 This is repeatable with different filesystems, using raw block devices and using different block devices. Use io_schedule_prepare() / io_schedule_finish() in io_cqring_wait_schedule() to address the difference. After that using io_uring is on par or surpassing synchronous IO (using registered files etc makes it reliably win, but arguably is a less fair comparison). There are other calls to schedule() in io_uring/, but none immediately jump out to be similarly situated, so I did not touch them. Similarly, it's possible that mutex_lock_io() should be used, but it's not clear if there are cases where that matters. Cc: stable@vger.kernel.org # 5.10+ Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Andres Freund <andres@anarazel.de> Link: https://lore.kernel.org/r/20230707162007.194068-1-andres@anarazel.de [axboe: minor style fixup] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28io_uring: flush offloaded and delayed task_work on exitJens Axboe1-3/+19
io_uring offloads task_work for cancelation purposes when the task is exiting. This is conceptually fine, but we should be nicer and actually wait for that work to complete before returning. Add an argument to io_fallback_tw() telling it to flush the deferred work when it's all queued up, and have it flush a ctx behind whenever the ctx changes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28io_uring: remove io_fallback_tw() forward declarationJens Axboe1-15/+14
It's used just one function higher up, get rid of the declaration and just move it up a bit. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: merge conditional unlock flush helpersPavel Begunkov1-12/+1
There is no reason not to use __io_cq_unlock_post_flush for intermediate aux CQE flushing, all ->task_complete should apply there, i.e. if set it should be the submitter task. Combine them, get rid of of __io_cq_unlock_post() and rename the left function. This place was also taking a couple percents of CPU according to profiles for max throughput net benchmarks due to multishot recv flooding it with completions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bbed60734cbec2e833d9c7bdcf9741aada5d8aab.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: make io_cq_unlock_post staticPavel Begunkov1-1/+1
io_cq_unlock_post() is exclusively used in io_uring/io_uring.c, mark it static and don't expose to other files. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3dc8127dda4514e1dd24bb32035faac887c5fa37.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: inline __io_cq_unlockPavel Begunkov1-8/+4
__io_cq_unlock is not very helpful, and users should be calling flush variants anyway. Open code the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d875c4cfb69f38ccecb58a57111446c77a614caa.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: fix acquire/release annotationsPavel Begunkov1-3/+0
We do conditional locking, so __io_cq_lock() and friends not always actually grab/release the lock, so kill misleading annotations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2a098f9144c24cab622f8bf90b39f44da5d0401e.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: kill io_cq_unlock()Pavel Begunkov1-8/+2
We're abusing ->completion_lock helpers. io_cq_unlock() neither locking conditionally nor doing CQE flushing, which means that callers must have some side reason of taking the lock and should do it directly. Open code io_cq_unlock() into io_cqring_overflow_kill() and clean it up. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7dabb36856db2b562e78780480396c52c29b2bf4.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: remove IOU_F_TWQ_FORCE_NORMALPavel Begunkov1-11/+14
Extract a function for non-local task_work_add, and use it directly from io_move_task_work_from_local(). Now we don't use IOU_F_TWQ_FORCE_NORMAL and it can be killed. As a small positive side effect we don't grab task->io_uring in io_req_normal_work_add anymore, which is not needed for io_req_local_work_add(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2e55571e8ff2927ae3cc12da606d204e2485525b.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: don't batch task put on reqs freePavel Begunkov1-22/+10
We're trying to batch io_put_task() in io_free_batch_list(), but considering that the hot path is a simple inc, it's most cerainly and probably faster to just do io_put_task() instead of task tracking. We don't care about io_put_task_remote() as it's only for IOPOLL where polling/waiting is done by not the submitter task. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4a7ef7dce845fe2bd35507bf389d6bd2d5c1edf0.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: move io_clean_op()Pavel Begunkov1-34/+33
Move io_clean_op() up in the source file and remove the forward declaration, as the function doesn't have tricky dependencies anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1b7163b2ba7c3a8322d972c79c1b0a9301b3057e.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: inline io_dismantle_req()Pavel Begunkov1-12/+5
io_dismantle_req() is only used in __io_req_complete_post(), open code it there. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ba8f20cb2c914eefa2e7d120a104a198552050db.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: remove io_free_req_twPavel Begunkov1-18/+5
Request completion is a very hot path in general, but there are 3 places that can be doing it: io_free_batch_list(), io_req_complete_post() and io_free_req_tw(). io_free_req_tw() is used rather marginally and we don't care about it. Killing it can help to clean up and optimise the left two, do that by replacing it with io_req_task_complete(). There are two things to consider: 1) io_free_req() is called when all refs are put, so we need to reinit references. The easiest way to do that is to clear REQ_F_REFCOUNT. 2) We also don't need a cqe from it, so silence it with REQ_F_CQE_SKIP. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/434a2be8f33d474ad888ce1c17fe5ea7bbcb2a55.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23io_uring: open code io_put_req_find_nextPavel Begunkov1-18/+7
There is only one user of io_put_req_find_next() and it doesn't make much sense to have it. Open code the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/38b5c5e48e4adc8e6a0cd16fdd5c1531d7ff81a9.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20io_uring: add helpers to decode the fixed file file_ptrChristoph Hellwig1-6/+4
Remove all the open coded magic on slot->file_ptr by introducing two helpers that return the file pointer and the flags instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20io_uring: return REQ_F_ flags from io_file_get_flagsChristoph Hellwig1-3/+3
Two of the three callers want them, so return the more usual format, and shift into the FFS_ form only for the fixed file table. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>