diff options
Diffstat (limited to 'Documentation/core-api')
-rw-r--r-- | Documentation/core-api/cachetlb.rst | 55 | ||||
-rw-r--r-- | Documentation/core-api/cpu_hotplug.rst | 41 | ||||
-rw-r--r-- | Documentation/core-api/genericirq.rst | 2 | ||||
-rw-r--r-- | Documentation/core-api/kernel-api.rst | 48 | ||||
-rw-r--r-- | Documentation/core-api/memory-allocation.rst | 17 | ||||
-rw-r--r-- | Documentation/core-api/mm-api.rst | 25 | ||||
-rw-r--r-- | Documentation/core-api/netlink.rst | 9 | ||||
-rw-r--r-- | Documentation/core-api/pin_user_pages.rst | 6 | ||||
-rw-r--r-- | Documentation/core-api/printk-formats.rst | 16 | ||||
-rw-r--r-- | Documentation/core-api/this_cpu_ops.rst | 2 | ||||
-rw-r--r-- | Documentation/core-api/workqueue.rst | 380 |
11 files changed, 521 insertions, 80 deletions
diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst index 5c0552e78c58..889fc84ccd1b 100644 --- a/Documentation/core-api/cachetlb.rst +++ b/Documentation/core-api/cachetlb.rst @@ -88,13 +88,17 @@ changes occur: This is used primarily during fault processing. -5) ``void update_mmu_cache(struct vm_area_struct *vma, - unsigned long address, pte_t *ptep)`` +5) ``void update_mmu_cache_range(struct vm_fault *vmf, + struct vm_area_struct *vma, unsigned long address, pte_t *ptep, + unsigned int nr)`` - At the end of every page fault, this routine is invoked to - tell the architecture specific code that a translation - now exists at virtual address "address" for address space - "vma->vm_mm", in the software page tables. + At the end of every page fault, this routine is invoked to tell + the architecture specific code that translations now exists + in the software page tables for address space "vma->vm_mm" + at virtual address "address" for "nr" consecutive pages. + + This routine is also invoked in various other places which pass + a NULL "vmf". A port may use this information in any way it so chooses. For example, it could use this event to pre-load TLB @@ -269,7 +273,7 @@ maps this page at its virtual address. If D-cache aliasing is not an issue, these two routines may simply call memcpy/memset directly and do nothing more. - ``void flush_dcache_page(struct page *page)`` + ``void flush_dcache_folio(struct folio *folio)`` This routines must be called when: @@ -277,7 +281,7 @@ maps this page at its virtual address. and / or in high memory b) the kernel is about to read from a page cache page and user space shared/writable mappings of this page potentially exist. Note - that {get,pin}_user_pages{_fast} already call flush_dcache_page + that {get,pin}_user_pages{_fast} already call flush_dcache_folio on any page found in the user address space and thus driver code rarely needs to take this into account. @@ -291,7 +295,7 @@ maps this page at its virtual address. The phrase "kernel writes to a page cache page" means, specifically, that the kernel executes store instructions that dirty data in that - page at the page->virtual mapping of that page. It is important to + page at the kernel virtual mapping of that page. It is important to flush here to handle D-cache aliasing, to make sure these kernel stores are visible to user space mappings of that page. @@ -302,21 +306,22 @@ maps this page at its virtual address. If D-cache aliasing is not an issue, this routine may simply be defined as a nop on that architecture. - There is a bit set aside in page->flags (PG_arch_1) as "architecture + There is a bit set aside in folio->flags (PG_arch_1) as "architecture private". The kernel guarantees that, for pagecache pages, it will clear this bit when such a page first enters the pagecache. - This allows these interfaces to be implemented much more efficiently. - It allows one to "defer" (perhaps indefinitely) the actual flush if - there are currently no user processes mapping this page. See sparc64's - flush_dcache_page and update_mmu_cache implementations for an example - of how to go about doing this. + This allows these interfaces to be implemented much more + efficiently. It allows one to "defer" (perhaps indefinitely) the + actual flush if there are currently no user processes mapping this + page. See sparc64's flush_dcache_folio and update_mmu_cache_range + implementations for an example of how to go about doing this. - The idea is, first at flush_dcache_page() time, if page_file_mapping() - returns a mapping, and mapping_mapped on that mapping returns %false, - just mark the architecture private page flag bit. Later, in - update_mmu_cache(), a check is made of this flag bit, and if set the - flush is done and the flag bit is cleared. + The idea is, first at flush_dcache_folio() time, if + folio_flush_mapping() returns a mapping, and mapping_mapped() on that + mapping returns %false, just mark the architecture private page + flag bit. Later, in update_mmu_cache_range(), a check is made + of this flag bit, and if set the flush is done and the flag bit + is cleared. .. important:: @@ -326,12 +331,6 @@ maps this page at its virtual address. dirty. Again, see sparc64 for examples of how to deal with this. - ``void flush_dcache_folio(struct folio *folio)`` - This function is called under the same circumstances as - flush_dcache_page(). It allows the architecture to - optimise for flushing the entire folio of pages instead - of flushing one page at a time. - ``void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long user_vaddr, void *dst, void *src, int len)`` ``void copy_from_user_page(struct vm_area_struct *vma, struct page *page, @@ -352,7 +351,7 @@ maps this page at its virtual address. When the kernel needs to access the contents of an anonymous page, it calls this function (currently only - get_user_pages()). Note: flush_dcache_page() deliberately + get_user_pages()). Note: flush_dcache_folio() deliberately doesn't work for an anonymous page. The default implementation is a nop (and should remain so for all coherent architectures). For incoherent architectures, it should flush @@ -369,7 +368,7 @@ maps this page at its virtual address. ``void flush_icache_page(struct vm_area_struct *vma, struct page *page)`` All the functionality of flush_icache_page can be implemented in - flush_dcache_page and update_mmu_cache. In the future, the hope + flush_dcache_folio and update_mmu_cache_range. In the future, the hope is to remove this interface completely. The final category of APIs is for I/O to deliberately aliased address diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst index f75778d37488..9511e405aabd 100644 --- a/Documentation/core-api/cpu_hotplug.rst +++ b/Documentation/core-api/cpu_hotplug.rst @@ -127,17 +127,8 @@ bring CPU4 back online:: $ echo 1 > /sys/devices/system/cpu/cpu4/online smpboot: Booting Node 0 Processor 4 APIC 0x1 -The CPU is usable again. This should work on all CPUs. CPU0 is often special -and excluded from CPU hotplug. On X86 the kernel option -*CONFIG_BOOTPARAM_HOTPLUG_CPU0* has to be enabled in order to be able to -shutdown CPU0. Alternatively the kernel command option *cpu0_hotplug* can be -used. Some known dependencies of CPU0: - -* Resume from hibernate/suspend. Hibernate/suspend will fail if CPU0 is offline. -* PIC interrupts. CPU0 can't be removed if a PIC interrupt is detected. - -Please let Fenghua Yu <fenghua.yu@intel.com> know if you find any dependencies -on CPU0. +The CPU is usable again. This should work on all CPUs, but CPU0 is often special +and excluded from CPU hotplug. The CPU hotplug coordination ============================ @@ -404,8 +395,8 @@ multi-instance state the following function is available: * cpuhp_setup_state_multi(state, name, startup, teardown) The @state argument is either a statically allocated state or one of the -constants for dynamically allocated states - CPUHP_PREPARE_DYN, -CPUHP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for +constants for dynamically allocated states - CPUHP_BP_PREPARE_DYN, +CPUHP_AP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for which a dynamic state should be allocated. The @name argument is used for sysfs output and for instrumentation. The @@ -597,7 +588,7 @@ notifications on online and offline operations:: Setup and teardown a dynamically allocated state in the ONLINE section for notifications on offline operations:: - state = cpuhp_setup_state(CPUHP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline); + state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline); if (state < 0) return state; .... @@ -606,7 +597,7 @@ for notifications on offline operations:: Setup and teardown a dynamically allocated state in the ONLINE section for notifications on online operations without invoking the callbacks:: - state = cpuhp_setup_state_nocalls(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, NULL); + state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "subsys:online", subsys_cpu_online, NULL); if (state < 0) return state; .... @@ -615,7 +606,7 @@ for notifications on online operations without invoking the callbacks:: Setup, use and teardown a dynamically allocated multi-instance state in the ONLINE section for notifications on online and offline operation:: - state = cpuhp_setup_state_multi(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline); + state = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline); if (state < 0) return state; .... @@ -750,6 +741,24 @@ will receive all events. A script like:: can process the event further. +When changes to the CPUs in the system occur, the sysfs file +/sys/devices/system/cpu/crash_hotplug contains '1' if the kernel +updates the kdump capture kernel list of CPUs itself (via elfcorehdr), +or '0' if userspace must update the kdump capture kernel list of CPUs. + +The availability depends on the CONFIG_HOTPLUG_CPU kernel configuration +option. + +To skip userspace processing of CPU hot un/plug events for kdump +(i.e. the unload-then-reload to obtain a current list of CPUs), this sysfs +file can be used in a udev rule as follows: + + SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" + +For a CPU hot un/plug event, if the architecture supports kernel updates +of the elfcorehdr (which contains the list of CPUs), then the rule skips +the unload-then-reload of the kdump capture kernel. + Kernel Inline Documentations Reference ====================================== diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst index f959c9b53f61..4a460639ab1c 100644 --- a/Documentation/core-api/genericirq.rst +++ b/Documentation/core-api/genericirq.rst @@ -264,7 +264,7 @@ The following control flow is implemented (simplified excerpt):: desc->irq_data.chip->irq_unmask(); desc->status &= ~pending; handle_irq_event(desc->action); - } while (status & pending); + } while (desc->status & pending); desc->status &= ~running; diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst index 62f961610773..ae92a2571388 100644 --- a/Documentation/core-api/kernel-api.rst +++ b/Documentation/core-api/kernel-api.rst @@ -96,6 +96,12 @@ Command-line Parsing .. kernel-doc:: lib/cmdline.c :export: +Error Pointers +-------------- + +.. kernel-doc:: include/linux/err.h + :internal: + Sorting ------- @@ -156,8 +162,10 @@ Base 2 log and power Functions .. kernel-doc:: include/linux/log2.h :internal: -Integer power Functions ------------------------ +Integer log and power Functions +------------------------------- + +.. kernel-doc:: include/linux/int_log.h .. kernel-doc:: lib/math/int_pow.c :export: @@ -220,12 +228,30 @@ relay interface Module Support ============== -Module Loading --------------- +Kernel module auto-loading +-------------------------- -.. kernel-doc:: kernel/kmod.c +.. kernel-doc:: kernel/module/kmod.c :export: +Module debugging +---------------- + +.. kernel-doc:: kernel/module/stats.c + :doc: module debugging statistics overview + +dup_failed_modules - tracks duplicate failed modules +**************************************************** + +.. kernel-doc:: kernel/module/stats.c + :doc: dup_failed_modules - tracks duplicate failed modules + +module statistics debugfs counters +********************************** + +.. kernel-doc:: kernel/module/stats.c + :doc: module statistics debugfs counters + Inter Module support -------------------- @@ -394,3 +420,15 @@ Read-Copy Update (RCU) .. kernel-doc:: include/linux/rcu_sync.h .. kernel-doc:: kernel/rcu/sync.c + +.. kernel-doc:: kernel/rcu/tasks.h + +.. kernel-doc:: kernel/rcu/tree_stall.h + +.. kernel-doc:: include/linux/rcupdate_trace.h + +.. kernel-doc:: include/linux/rcupdate_wait.h + +.. kernel-doc:: include/linux/rcuref.h + +.. kernel-doc:: include/linux/rcutree.h diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst index 5954ddf6ee13..1c58d883b273 100644 --- a/Documentation/core-api/memory-allocation.rst +++ b/Documentation/core-api/memory-allocation.rst @@ -170,7 +170,16 @@ should be used if a part of the cache might be copied to the userspace. After the cache is created kmem_cache_alloc() and its convenience wrappers can allocate memory from that cache. -When the allocated memory is no longer needed it must be freed. You can -use kvfree() for the memory allocated with `kmalloc`, `vmalloc` and -`kvmalloc`. The slab caches should be freed with kmem_cache_free(). And -don't forget to destroy the cache with kmem_cache_destroy(). +When the allocated memory is no longer needed it must be freed. + +Objects allocated by `kmalloc` can be freed by `kfree` or `kvfree`. Objects +allocated by `kmem_cache_alloc` can be freed with `kmem_cache_free`, `kfree` +or `kvfree`, where the latter two might be more convenient thanks to not +needing the kmem_cache pointer. + +The same rules apply to _bulk and _rcu flavors of freeing functions. + +Memory allocated by `vmalloc` can be freed with `vfree` or `kvfree`. +Memory allocated by `kvmalloc` can be freed with `kvfree`. +Caches created by `kmem_cache_create` should be freed with +`kmem_cache_destroy` only after freeing all the allocated objects first. diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst index f5dde5bceaea..2d091c873d1e 100644 --- a/Documentation/core-api/mm-api.rst +++ b/Documentation/core-api/mm-api.rst @@ -115,3 +115,28 @@ More Memory Management Functions .. kernel-doc:: include/linux/mmzone.h .. kernel-doc:: mm/util.c :functions: folio_mapping + +.. kernel-doc:: mm/rmap.c +.. kernel-doc:: mm/migrate.c +.. kernel-doc:: mm/mmap.c +.. kernel-doc:: mm/kmemleak.c +.. #kernel-doc:: mm/hmm.c (build warnings) +.. kernel-doc:: mm/memremap.c +.. kernel-doc:: mm/hugetlb.c +.. kernel-doc:: mm/swap.c +.. kernel-doc:: mm/zpool.c +.. kernel-doc:: mm/memcontrol.c +.. #kernel-doc:: mm/memory-tiers.c (build warnings) +.. kernel-doc:: mm/shmem.c +.. kernel-doc:: mm/migrate_device.c +.. #kernel-doc:: mm/nommu.c (duplicates kernel-doc from other files) +.. kernel-doc:: mm/mapping_dirty_helpers.c +.. #kernel-doc:: mm/memory-failure.c (build warnings) +.. kernel-doc:: mm/percpu.c +.. kernel-doc:: mm/maccess.c +.. kernel-doc:: mm/vmscan.c +.. kernel-doc:: mm/memory_hotplug.c +.. kernel-doc:: mm/mmu_notifier.c +.. kernel-doc:: mm/balloon_compaction.c +.. kernel-doc:: mm/huge_memory.c +.. kernel-doc:: mm/io-mapping.c diff --git a/Documentation/core-api/netlink.rst b/Documentation/core-api/netlink.rst index e4a938a05cc9..9f692b02bfe6 100644 --- a/Documentation/core-api/netlink.rst +++ b/Documentation/core-api/netlink.rst @@ -67,10 +67,11 @@ Globals kernel-policy ~~~~~~~~~~~~~ -Defines if the kernel validation policy is per operation (``per-op``) -or for the entire family (``global``). New families should use ``per-op`` -(default) to be able to narrow down the attributes accepted by a specific -command. +Defines whether the kernel validation policy is ``global`` i.e. the same for all +operations of the family, defined for each operation individually - ``per-op``, +or separately for each operation and operation type (do vs dump) - ``split``. +New families should use ``per-op`` (default) to be able to narrow down the +attributes accepted by a specific command. checks ------ diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 9fb0b1080d3b..d3c1f6d8c0e0 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -112,6 +112,12 @@ pages: This also leads to limitations: there are only 31-10==21 bits available for a counter that increments 10 bits at a time. +* Because of that limitation, special handling is applied to the zero pages + when using FOLL_PIN. We only pretend to pin a zero page - we don't alter its + refcount or pincount at all (it is permanent, so there's no need). The + unpinning functions also don't do anything to a zero page. This is + transparent to the caller. + * Callers must specifically request "dma-pinned tracking of pages". In other words, just calling get_user_pages() will not suffice; a new set of functions, pin_user_page() and related, must be used. diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index 0eed8a2dcae9..4451ef501936 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -576,20 +576,26 @@ The field width is passed by value, the bitmap is passed by reference. Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease printing cpumask and nodemask. -Flags bitfields such as page flags, gfp_flags ---------------------------------------------- +Flags bitfields such as page flags, page_type, gfp_flags +-------------------------------------------------------- :: %pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff) + %pGt 0xffffff7f(buddy) %pGg GFP_USER|GFP_DMA32|GFP_NOWARN %pGv read|exec|mayread|maywrite|mayexec|denywrite For printing flags bitfields as a collection of symbolic constants that would construct the value. The type of flags is given by the third -character. Currently supported are [p]age flags, [v]ma_flags (both -expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag -names and print order depends on the particular type. +character. Currently supported are: + + - p - [p]age flags, expects value of type (``unsigned long *``) + - t - page [t]ype, expects value of type (``unsigned int *``) + - v - [v]ma_flags, expects value of type (``unsigned long *``) + - g - [g]fp_flags, expects value of type (``gfp_t *``) + +The flag names and print order depends on the particular type. Note that this format should not be used directly in the :c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags() diff --git a/Documentation/core-api/this_cpu_ops.rst b/Documentation/core-api/this_cpu_ops.rst index 5cb8b883ae83..91acbcf30e9b 100644 --- a/Documentation/core-api/this_cpu_ops.rst +++ b/Documentation/core-api/this_cpu_ops.rst @@ -53,7 +53,6 @@ preemption and interrupts:: this_cpu_add_return(pcp, val) this_cpu_xchg(pcp, nval) this_cpu_cmpxchg(pcp, oval, nval) - this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) this_cpu_sub(pcp, val) this_cpu_inc(pcp) this_cpu_dec(pcp) @@ -242,7 +241,6 @@ safe:: __this_cpu_add_return(pcp, val) __this_cpu_xchg(pcp, nval) __this_cpu_cmpxchg(pcp, oval, nval) - __this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) __this_cpu_sub(pcp, val) __this_cpu_inc(pcp) __this_cpu_dec(pcp) diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst index 8ec4d6270b24..5d7b01aed1fe 100644 --- a/Documentation/core-api/workqueue.rst +++ b/Documentation/core-api/workqueue.rst @@ -1,6 +1,6 @@ -==================================== -Concurrency Managed Workqueue (cmwq) -==================================== +========= +Workqueue +========= :Date: September, 2010 :Author: Tejun Heo <tj@kernel.org> @@ -25,8 +25,8 @@ there is no work item left on the workqueue the worker becomes idle. When a new work item gets queued, the worker begins executing again. -Why cmwq? -========= +Why Concurrency Managed Workqueue? +================================== In the original wq implementation, a multi threaded (MT) wq had one worker thread per CPU and a single threaded (ST) wq had one worker @@ -220,17 +220,16 @@ resources, scheduled and executed. ``max_active`` -------------- -``@max_active`` determines the maximum number of execution contexts -per CPU which can be assigned to the work items of a wq. For example, -with ``@max_active`` of 16, at most 16 work items of the wq can be -executing at the same time per CPU. +``@max_active`` determines the maximum number of execution contexts per +CPU which can be assigned to the work items of a wq. For example, with +``@max_active`` of 16, at most 16 work items of the wq can be executing +at the same time per CPU. This is always a per-CPU attribute, even for +unbound workqueues. -Currently, for a bound wq, the maximum limit for ``@max_active`` is -512 and the default value used when 0 is specified is 256. For an -unbound wq, the limit is higher of 512 and 4 * -``num_possible_cpus()``. These values are chosen sufficiently high -such that they are not the limiting factor while providing protection -in runaway cases. +The maximum limit for ``@max_active`` is 512 and the default value used +when 0 is specified is 256. These values are chosen sufficiently high +such that they are not the limiting factor while providing protection in +runaway cases. The number of active work items of a wq is usually regulated by the users of the wq, more specifically, by how many work items the users @@ -348,6 +347,356 @@ Guidelines level of locality in wq operations and work item execution. +Affinity Scopes +=============== + +An unbound workqueue groups CPUs according to its affinity scope to improve +cache locality. For example, if a workqueue is using the default affinity +scope of "cache", it will group CPUs according to last level cache +boundaries. A work item queued on the workqueue will be assigned to a worker +on one of the CPUs which share the last level cache with the issuing CPU. +Once started, the worker may or may not be allowed to move outside the scope +depending on the ``affinity_strict`` setting of the scope. + +Workqueue currently supports the following affinity scopes. + +``default`` + Use the scope in module parameter ``workqueue.default_affinity_scope`` + which is always set to one of the scopes below. + +``cpu`` + CPUs are not grouped. A work item issued on one CPU is processed by a + worker on the same CPU. This makes unbound workqueues behave as per-cpu + workqueues without concurrency management. + +``smt`` + CPUs are grouped according to SMT boundaries. This usually means that the + logical threads of each physical CPU core are grouped together. + +``cache`` + CPUs are grouped according to cache boundaries. Which specific cache + boundary is used is determined by the arch code. L3 is used in a lot of + cases. This is the default affinity scope. + +``numa`` + CPUs are grouped according to NUMA bounaries. + +``system`` + All CPUs are put in the same group. Workqueue makes no effort to process a + work item on a CPU close to the issuing CPU. + +The default affinity scope can be changed with the module parameter +``workqueue.default_affinity_scope`` and a specific workqueue's affinity +scope can be changed using ``apply_workqueue_attrs()``. + +If ``WQ_SYSFS`` is set, the workqueue will have the following affinity scope +related interface files under its ``/sys/devices/virtual/WQ_NAME/`` +directory. + +``affinity_scope`` + Read to see the current affinity scope. Write to change. + + When default is the current scope, reading this file will also show the + current effective scope in parentheses, for example, ``default (cache)``. + +``affinity_strict`` + 0 by default indicating that affinity scopes are not strict. When a work + item starts execution, workqueue makes a best-effort attempt to ensure + that the worker is inside its affinity scope, which is called + repatriation. Once started, the scheduler is free to move the worker + anywhere in the system as it sees fit. This enables benefiting from scope + locality while still being able to utilize other CPUs if necessary and + available. + + If set to 1, all workers of the scope are guaranteed always to be in the + scope. This may be useful when crossing affinity scopes has other + implications, for example, in terms of power consumption or workload + isolation. Strict NUMA scope can also be used to match the workqueue + behavior of older kernels. + + +Affinity Scopes and Performance +=============================== + +It'd be ideal if an unbound workqueue's behavior is optimal for vast +majority of use cases without further tuning. Unfortunately, in the current +kernel, there exists a pronounced trade-off between locality and utilization +necessitating explicit configurations when workqueues are heavily used. + +Higher locality leads to higher efficiency where more work is performed for +the same number of consumed CPU cycles. However, higher locality may also +cause lower overall system utilization if the work items are not spread +enough across the affinity scopes by the issuers. The following performance +testing with dm-crypt clearly illustrates this trade-off. + +The tests are run on a CPU with 12-cores/24-threads split across four L3 +caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency. +``/dev/dm-0`` is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and +opened with ``cryptsetup`` with default settings. + + +Scenario 1: Enough issuers and work spread across the machine +------------------------------------------------------------- + +The command used: :: + + $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \ + --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \ + --name=iops-test-job --verify=sha512 + +There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512`` +makes ``fio`` generate and read back the content each time which makes +execution locality matter between the issuer and ``kcryptd``. The followings +are the read bandwidths and CPU utilizations depending on different affinity +scope settings on ``kcryptd`` measured over five runs. Bandwidths are in +MiBps, and CPU util in percents. + +.. list-table:: + :widths: 16 20 20 + :header-rows: 1 + + * - Affinity + - Bandwidth (MiBps) + - CPU util (%) + + * - system + - 1159.40 ±1.34 + - 99.31 ±0.02 + + * - cache + - 1166.40 ±0.89 + - 99.34 ±0.01 + + * - cache (strict) + - 1166.00 ±0.71 + - 99.35 ±0.01 + +With enough issuers spread across the system, there is no downside to +"cache", strict or otherwise. All three configurations saturate the whole +machine but the cache-affine ones outperform by 0.6% thanks to improved +locality. + + +Scenario 2: Fewer issuers, enough work for saturation +----------------------------------------------------- + +The command used: :: + + $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \ + --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \ + --time_based --group_reporting --name=iops-test-job --verify=sha512 + +The only difference from the previous scenario is ``--numjobs=8``. There are +a third of the issuers but is still enough total work to saturate the +system. + +.. list-table:: + :widths: 16 20 20 + :header-rows: 1 + + * - Affinity + - Bandwidth (MiBps) + - CPU util (%) + + * - system + - 1155.40 ±0.89 + - 97.41 ±0.05 + + * - cache + - 1154.40 ±1.14 + - 96.15 ±0.09 + + * - cache (strict) + - 1112.00 ±4.64 + - 93.26 ±0.35 + +This is more than enough work to saturate the system. Both "system" and +"cache" are nearly saturating the machine but not fully. "cache" is using +less CPU but the better efficiency puts it at the same bandwidth as +"system". + +Eight issuers moving around over four L3 cache scope still allow "cache +(strict)" to mostly saturate the machine but the loss of work conservation +is now starting to hurt with 3.7% bandwidth loss. + + +Scenario 3: Even fewer issuers, not enough work to saturate +----------------------------------------------------------- + +The command used: :: + + $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \ + --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \ + --time_based --group_reporting --name=iops-test-job --verify=sha512 + +Again, the only difference is ``--numjobs=4``. With the number of issuers +reduced to four, there now isn't enough work to saturate the whole system +and the bandwidth becomes dependent on completion latencies. + +.. list-table:: + :widths: 16 20 20 + :header-rows: 1 + + * - Affinity + - Bandwidth (MiBps) + - CPU util (%) + + * - system + - 993.60 ±1.82 + - 75.49 ±0.06 + + * - cache + - 973.40 ±1.52 + - 74.90 ±0.07 + + * - cache (strict) + - 828.20 ±4.49 + - 66.84 ±0.29 + +Now, the tradeoff between locality and utilization is clearer. "cache" shows +2% bandwidth loss compared to "system" and "cache (struct)" whopping 20%. + + +Conclusion and Recommendations +------------------------------ + +In the above experiments, the efficiency advantage of the "cache" affinity +scope over "system" is, while consistent and noticeable, small. However, the +impact is dependent on the distances between the scopes and may be more +pronounced in processors with more complex topologies. + +While the loss of work-conservation in certain scenarios hurts, it is a lot +better than "cache (strict)" and maximizing workqueue utilization is +unlikely to be the common case anyway. As such, "cache" is the default +affinity scope for unbound pools. + +* As there is no one option which is great for most cases, workqueue usages + that may consume a significant amount of CPU are recommended to configure + the workqueues using ``apply_workqueue_attrs()`` and/or enable + ``WQ_SYSFS``. + +* An unbound workqueue with strict "cpu" affinity scope behaves the same as + ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advanage to the + latter and an unbound workqueue provides a lot more flexibility. + +* Affinity scopes are introduced in Linux v6.5. To emulate the previous + behavior, use strict "numa" affinity scope. + +* The loss of work-conservation in non-strict affinity scopes is likely + originating from the scheduler. There is no theoretical reason why the + kernel wouldn't be able to do the right thing and maintain + work-conservation in most cases. As such, it is possible that future + scheduler improvements may make most of these tunables unnecessary. + + +Examining Configuration +======================= + +Use tools/workqueue/wq_dump.py to examine unbound CPU affinity +configuration, worker pools and how workqueues map to the pools: :: + + $ tools/workqueue/wq_dump.py + Affinity Scopes + =============== + wq_unbound_cpumask=0000000f + + CPU + nr_pods 4 + pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 + pod_node [0]=0 [1]=0 [2]=1 [3]=1 + cpu_pod [0]=0 [1]=1 [2]=2 [3]=3 + + SMT + nr_pods 4 + pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 + pod_node [0]=0 [1]=0 [2]=1 [3]=1 + cpu_pod [0]=0 [1]=1 [2]=2 [3]=3 + + CACHE (default) + nr_pods 2 + pod_cpus [0]=00000003 [1]=0000000c + pod_node [0]=0 [1]=1 + cpu_pod [0]=0 [1]=0 [2]=1 [3]=1 + + NUMA + nr_pods 2 + pod_cpus [0]=00000003 [1]=0000000c + pod_node [0]=0 [1]=1 + cpu_pod [0]=0 [1]=0 [2]=1 [3]=1 + + SYSTEM + nr_pods 1 + pod_cpus [0]=0000000f + pod_node [0]=-1 + cpu_pod [0]=0 [1]=0 [2]=0 [3]=0 + + Worker Pools + ============ + pool[00] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 0 + pool[01] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 0 + pool[02] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 1 + pool[03] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 1 + pool[04] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 2 + pool[05] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 2 + pool[06] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 3 + pool[07] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 3 + pool[08] ref=42 nice= 0 idle/workers= 6/ 6 cpus=0000000f + pool[09] ref=28 nice= 0 idle/workers= 3/ 3 cpus=00000003 + pool[10] ref=28 nice= 0 idle/workers= 17/ 17 cpus=0000000c + pool[11] ref= 1 nice=-20 idle/workers= 1/ 1 cpus=0000000f + pool[12] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=00000003 + pool[13] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=0000000c + + Workqueue CPU -> pool + ===================== + [ workqueue \ CPU 0 1 2 3 dfl] + events percpu 0 2 4 6 + events_highpri percpu 1 3 5 7 + events_long percpu 0 2 4 6 + events_unbound unbound 9 9 10 10 8 + events_freezable percpu 0 2 4 6 + events_power_efficient percpu 0 2 4 6 + events_freezable_power_ percpu 0 2 4 6 + rcu_gp percpu 0 2 4 6 + rcu_par_gp percpu 0 2 4 6 + slub_flushwq percpu 0 2 4 6 + netns ordered 8 8 8 8 8 + ... + +See the command's help message for more info. + + +Monitoring +========== + +Use tools/workqueue/wq_monitor.py to monitor workqueue operations: :: + + $ tools/workqueue/wq_monitor.py events + total infl CPUtime CPUhog CMW/RPR mayday rescued + events 18545 0 6.1 0 5 - - + events_highpri 8 0 0.0 0 0 - - + events_long 3 0 0.0 0 0 - - + events_unbound 38306 0 0.1 - 7 - - + events_freezable 0 0 0.0 0 0 - - + events_power_efficient 29598 0 0.2 0 0 - - + events_freezable_power_ 10 0 0.0 0 0 - - + sock_diag_events 0 0 0.0 0 0 - - + + total infl CPUtime CPUhog CMW/RPR mayday rescued + events 18548 0 6.1 0 5 - - + events_highpri 8 0 0.0 0 0 - - + events_long 3 0 0.0 0 0 - - + events_unbound 38322 0 0.1 - 7 - - + events_freezable 0 0 0.0 0 0 - - + events_power_efficient 29603 0 0.2 0 0 - - + events_freezable_power_ 10 0 0.0 0 0 - - + sock_diag_events 0 0 0.0 0 0 - - + + ... + +See the command's help message for more info. + + Debugging ========= @@ -387,6 +736,7 @@ the stack trace of the offending worker thread. :: The work item's function should be trivially visible in the stack trace. + Non-reentrance Conditions ========================= |