Diffstat (limited to 'Documentation/core-api')
 -rw-r--r--  Documentation/core-api/cachetlb.rst            55
 -rw-r--r--  Documentation/core-api/cpu_hotplug.rst         41
 -rw-r--r--  Documentation/core-api/genericirq.rst           2
 -rw-r--r--  Documentation/core-api/kernel-api.rst          48
 -rw-r--r--  Documentation/core-api/memory-allocation.rst   17
 -rw-r--r--  Documentation/core-api/mm-api.rst              25
 -rw-r--r--  Documentation/core-api/netlink.rst              9
 -rw-r--r--  Documentation/core-api/pin_user_pages.rst       6
 -rw-r--r--  Documentation/core-api/printk-formats.rst      16
 -rw-r--r--  Documentation/core-api/this_cpu_ops.rst         2
 -rw-r--r--  Documentation/core-api/workqueue.rst          380
11 files changed, 521 insertions, 80 deletions
diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
index 5c0552e78c58..889fc84ccd1b 100644
--- a/Documentation/core-api/cachetlb.rst
+++ b/Documentation/core-api/cachetlb.rst
@@ -88,13 +88,17 @@ changes occur:
This is used primarily during fault processing.
-5) ``void update_mmu_cache(struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep)``
+5) ``void update_mmu_cache_range(struct vm_fault *vmf,
+ struct vm_area_struct *vma, unsigned long address, pte_t *ptep,
+ unsigned int nr)``
- At the end of every page fault, this routine is invoked to
- tell the architecture specific code that a translation
- now exists at virtual address "address" for address space
- "vma->vm_mm", in the software page tables.
+ At the end of every page fault, this routine is invoked to tell
+	  the architecture specific code that translations now exist
+ in the software page tables for address space "vma->vm_mm"
+ at virtual address "address" for "nr" consecutive pages.
+
+ This routine is also invoked in various other places which pass
+ a NULL "vmf".
A port may use this information in any way it so chooses.
For example, it could use this event to pre-load TLB
@@ -269,7 +273,7 @@ maps this page at its virtual address.
If D-cache aliasing is not an issue, these two routines may
simply call memcpy/memset directly and do nothing more.
- ``void flush_dcache_page(struct page *page)``
+ ``void flush_dcache_folio(struct folio *folio)``
	This routine must be called when:
@@ -277,7 +281,7 @@ maps this page at its virtual address.
and / or in high memory
b) the kernel is about to read from a page cache page and user space
shared/writable mappings of this page potentially exist. Note
- that {get,pin}_user_pages{_fast} already call flush_dcache_page
+ that {get,pin}_user_pages{_fast} already call flush_dcache_folio
on any page found in the user address space and thus driver
code rarely needs to take this into account.
@@ -291,7 +295,7 @@ maps this page at its virtual address.
The phrase "kernel writes to a page cache page" means, specifically,
that the kernel executes store instructions that dirty data in that
- page at the page->virtual mapping of that page. It is important to
+ page at the kernel virtual mapping of that page. It is important to
flush here to handle D-cache aliasing, to make sure these kernel stores
are visible to user space mappings of that page.
@@ -302,21 +306,22 @@ maps this page at its virtual address.
If D-cache aliasing is not an issue, this routine may simply be defined
as a nop on that architecture.
- There is a bit set aside in page->flags (PG_arch_1) as "architecture
+ There is a bit set aside in folio->flags (PG_arch_1) as "architecture
private". The kernel guarantees that, for pagecache pages, it will
clear this bit when such a page first enters the pagecache.
- This allows these interfaces to be implemented much more efficiently.
- It allows one to "defer" (perhaps indefinitely) the actual flush if
- there are currently no user processes mapping this page. See sparc64's
- flush_dcache_page and update_mmu_cache implementations for an example
- of how to go about doing this.
+ This allows these interfaces to be implemented much more
+ efficiently. It allows one to "defer" (perhaps indefinitely) the
+ actual flush if there are currently no user processes mapping this
+ page. See sparc64's flush_dcache_folio and update_mmu_cache_range
+ implementations for an example of how to go about doing this.
- The idea is, first at flush_dcache_page() time, if page_file_mapping()
- returns a mapping, and mapping_mapped on that mapping returns %false,
- just mark the architecture private page flag bit. Later, in
- update_mmu_cache(), a check is made of this flag bit, and if set the
- flush is done and the flag bit is cleared.
+ The idea is, first at flush_dcache_folio() time, if
+ folio_flush_mapping() returns a mapping, and mapping_mapped() on that
+ mapping returns %false, just mark the architecture private page
+ flag bit. Later, in update_mmu_cache_range(), a check is made
+ of this flag bit, and if set the flush is done and the flag bit
+ is cleared.
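+
+	As a minimal sketch of that deferral pattern (``arch_flush_dcache_folio()``
+	is a hypothetical stand-in for the arch-specific flush code)::
+
+	    void flush_dcache_folio(struct folio *folio)
+	    {
+	            struct address_space *mapping = folio_flush_mapping(folio);
+
+	            if (mapping && !mapping_mapped(mapping)) {
+	                    /* No user mappings yet: mark and defer the flush. */
+	                    set_bit(PG_arch_1, &folio->flags);
+	                    return;
+	            }
+	            arch_flush_dcache_folio(folio);
+	    }
+
+	    void update_mmu_cache_range(struct vm_fault *vmf,
+	                    struct vm_area_struct *vma, unsigned long address,
+	                    pte_t *ptep, unsigned int nr)
+	    {
+	            struct folio *folio = page_folio(pte_page(*ptep));
+
+	            /* Perform the flush that flush_dcache_folio() deferred. */
+	            if (test_and_clear_bit(PG_arch_1, &folio->flags))
+	                    arch_flush_dcache_folio(folio);
+	    }
+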
.. important::
@@ -326,12 +331,6 @@ maps this page at its virtual address.
dirty. Again, see sparc64 for examples of how
to deal with this.
- ``void flush_dcache_folio(struct folio *folio)``
- This function is called under the same circumstances as
- flush_dcache_page(). It allows the architecture to
- optimise for flushing the entire folio of pages instead
- of flushing one page at a time.
-
``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``
``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
@@ -352,7 +351,7 @@ maps this page at its virtual address.
When the kernel needs to access the contents of an anonymous
page, it calls this function (currently only
- get_user_pages()). Note: flush_dcache_page() deliberately
+ get_user_pages()). Note: flush_dcache_folio() deliberately
doesn't work for an anonymous page. The default
implementation is a nop (and should remain so for all coherent
architectures). For incoherent architectures, it should flush
@@ -369,7 +368,7 @@ maps this page at its virtual address.
``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``
All the functionality of flush_icache_page can be implemented in
- flush_dcache_page and update_mmu_cache. In the future, the hope
+ flush_dcache_folio and update_mmu_cache_range. In the future, the hope
is to remove this interface completely.
The final category of APIs is for I/O to deliberately aliased address
diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst
index f75778d37488..9511e405aabd 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -127,17 +127,8 @@ bring CPU4 back online::
$ echo 1 > /sys/devices/system/cpu/cpu4/online
smpboot: Booting Node 0 Processor 4 APIC 0x1
-The CPU is usable again. This should work on all CPUs. CPU0 is often special
-and excluded from CPU hotplug. On X86 the kernel option
-*CONFIG_BOOTPARAM_HOTPLUG_CPU0* has to be enabled in order to be able to
-shutdown CPU0. Alternatively the kernel command option *cpu0_hotplug* can be
-used. Some known dependencies of CPU0:
-
-* Resume from hibernate/suspend. Hibernate/suspend will fail if CPU0 is offline.
-* PIC interrupts. CPU0 can't be removed if a PIC interrupt is detected.
-
-Please let Fenghua Yu <fenghua.yu@intel.com> know if you find any dependencies
-on CPU0.
+The CPU is usable again. This should work on all CPUs, but CPU0 is often special
+and excluded from CPU hotplug.
The CPU hotplug coordination
============================
@@ -404,8 +395,8 @@ multi-instance state the following function is available:
* cpuhp_setup_state_multi(state, name, startup, teardown)
The @state argument is either a statically allocated state or one of the
-constants for dynamically allocated states - CPUHP_PREPARE_DYN,
-CPUHP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for
+constants for dynamically allocated states - CPUHP_BP_PREPARE_DYN,
+CPUHP_AP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for
which a dynamic state should be allocated.
The @name argument is used for sysfs output and for instrumentation. The
@@ -597,7 +588,7 @@ notifications on online and offline operations::
Setup and teardown a dynamically allocated state in the ONLINE section
for notifications on offline operations::
- state = cpuhp_setup_state(CPUHP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline);
+ state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline);
if (state < 0)
return state;
....
@@ -606,7 +597,7 @@ for notifications on offline operations::
Setup and teardown a dynamically allocated state in the ONLINE section
for notifications on online operations without invoking the callbacks::
- state = cpuhp_setup_state_nocalls(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, NULL);
+ state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "subsys:online", subsys_cpu_online, NULL);
if (state < 0)
return state;
....
@@ -615,7 +606,7 @@ for notifications on online operations without invoking the callbacks::
Setup, use and teardown a dynamically allocated multi-instance state in the
ONLINE section for notifications on online and offline operation::
- state = cpuhp_setup_state_multi(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline);
+ state = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline);
if (state < 0)
return state;
....
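+
+For a fuller picture, here is a minimal sketch of how a (hypothetical) driver
+could attach per-device instances to such a multi-instance state; ``struct
+my_dev`` and the register/unregister helpers are made up, and ``state`` is the
+value returned by ``cpuhp_setup_state_multi()`` above::
+
+	struct my_dev {
+		struct hlist_node node;	/* linkage used by the hotplug core */
+		int id;
+	};
+
+	static int my_dev_register(struct my_dev *dev)
+	{
+		/* Runs the startup callback on all CPUs already at this state. */
+		return cpuhp_state_add_instance(state, &dev->node);
+	}
+
+	static void my_dev_unregister(struct my_dev *dev)
+	{
+		/* Runs the teardown callback, then removes the instance. */
+		cpuhp_state_remove_instance(state, &dev->node);
+	}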
@@ -750,6 +741,24 @@ will receive all events. A script like::
can process the event further.
+When changes to the CPUs in the system occur, the sysfs file
+/sys/devices/system/cpu/crash_hotplug contains '1' if the kernel
+updates the kdump capture kernel list of CPUs itself (via elfcorehdr),
+or '0' if userspace must update the kdump capture kernel list of CPUs.
+
+The availability depends on the CONFIG_HOTPLUG_CPU kernel configuration
+option.
+
+To skip userspace processing of CPU hot un/plug events for kdump
+(i.e. the unload-then-reload to obtain a current list of CPUs), this sysfs
+file can be used in a udev rule as follows:
+
+ SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
+
+For a CPU hot un/plug event, if the architecture supports kernel updates
+of the elfcorehdr (which contains the list of CPUs), then the rule skips
+the unload-then-reload of the kdump capture kernel.
+
Kernel Inline Documentations Reference
======================================
diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst
index f959c9b53f61..4a460639ab1c 100644
--- a/Documentation/core-api/genericirq.rst
+++ b/Documentation/core-api/genericirq.rst
@@ -264,7 +264,7 @@ The following control flow is implemented (simplified excerpt)::
desc->irq_data.chip->irq_unmask();
desc->status &= ~pending;
handle_irq_event(desc->action);
- } while (status & pending);
+ } while (desc->status & pending);
desc->status &= ~running;
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 62f961610773..ae92a2571388 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -96,6 +96,12 @@ Command-line Parsing
.. kernel-doc:: lib/cmdline.c
:export:
+Error Pointers
+--------------
+
+.. kernel-doc:: include/linux/err.h
+ :internal:
+
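+As a quick illustration of the pattern these helpers support (``struct foo``
+and ``foo_create()`` are made up)::
+
+	struct foo {
+		int id;
+	};
+
+	struct foo *foo_create(int id)
+	{
+		struct foo *foo = kzalloc(sizeof(*foo), GFP_KERNEL);
+
+		if (!foo)
+			return ERR_PTR(-ENOMEM);	/* encode errno in the pointer */
+		foo->id = id;
+		return foo;
+	}
+
+	/* Caller side: */
+	foo = foo_create(1);
+	if (IS_ERR(foo))
+		return PTR_ERR(foo);	/* recover the errno */
+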
Sorting
-------
@@ -156,8 +162,10 @@ Base 2 log and power Functions
.. kernel-doc:: include/linux/log2.h
:internal:
-Integer power Functions
------------------------
+Integer log and power Functions
+-------------------------------
+
+.. kernel-doc:: include/linux/int_log.h
.. kernel-doc:: lib/math/int_pow.c
:export:
@@ -220,12 +228,30 @@ relay interface
Module Support
==============
-Module Loading
---------------
+Kernel module auto-loading
+--------------------------
-.. kernel-doc:: kernel/kmod.c
+.. kernel-doc:: kernel/module/kmod.c
:export:
+Module debugging
+----------------
+
+.. kernel-doc:: kernel/module/stats.c
+ :doc: module debugging statistics overview
+
+dup_failed_modules - tracks duplicate failed modules
+****************************************************
+
+.. kernel-doc:: kernel/module/stats.c
+ :doc: dup_failed_modules - tracks duplicate failed modules
+
+module statistics debugfs counters
+**********************************
+
+.. kernel-doc:: kernel/module/stats.c
+ :doc: module statistics debugfs counters
+
Inter Module support
--------------------
@@ -394,3 +420,15 @@ Read-Copy Update (RCU)
.. kernel-doc:: include/linux/rcu_sync.h
.. kernel-doc:: kernel/rcu/sync.c
+
+.. kernel-doc:: kernel/rcu/tasks.h
+
+.. kernel-doc:: kernel/rcu/tree_stall.h
+
+.. kernel-doc:: include/linux/rcupdate_trace.h
+
+.. kernel-doc:: include/linux/rcupdate_wait.h
+
+.. kernel-doc:: include/linux/rcuref.h
+
+.. kernel-doc:: include/linux/rcutree.h
diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst
index 5954ddf6ee13..1c58d883b273 100644
--- a/Documentation/core-api/memory-allocation.rst
+++ b/Documentation/core-api/memory-allocation.rst
@@ -170,7 +170,16 @@ should be used if a part of the cache might be copied to the userspace.
After the cache is created kmem_cache_alloc() and its convenience
wrappers can allocate memory from that cache.
-When the allocated memory is no longer needed it must be freed. You can
-use kvfree() for the memory allocated with `kmalloc`, `vmalloc` and
-`kvmalloc`. The slab caches should be freed with kmem_cache_free(). And
-don't forget to destroy the cache with kmem_cache_destroy().
+When the allocated memory is no longer needed it must be freed.
+
+Objects allocated by `kmalloc` can be freed by `kfree` or `kvfree`. Objects
+allocated by `kmem_cache_alloc` can be freed with `kmem_cache_free`, `kfree`
+or `kvfree`, where the latter two might be more convenient thanks to not
+needing the kmem_cache pointer.
+
+The same rules apply to _bulk and _rcu flavors of freeing functions.
+
+Memory allocated by `vmalloc` can be freed with `vfree` or `kvfree`.
+Memory allocated by `kvmalloc` can be freed with `kvfree`.
+Caches created by `kmem_cache_create` should be freed with
+`kmem_cache_destroy` only after freeing all the allocated objects first.
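+
+A condensed sketch of these pairings (the object type and cache name are made
+up)::
+
+	struct obj { int x; };
+	static struct kmem_cache *obj_cache;
+
+	void alloc_free_example(void)
+	{
+		void *buf = kmalloc(64, GFP_KERNEL);
+		void *vbuf = vmalloc(4 * PAGE_SIZE);
+		void *kvbuf = kvmalloc(128 * 1024, GFP_KERNEL);
+
+		kfree(buf);	/* kvfree(buf) would also work */
+		vfree(vbuf);	/* kvfree(vbuf) would also work */
+		kvfree(kvbuf);	/* kvmalloc memory must be freed with kvfree */
+
+		obj_cache = kmem_cache_create("obj_cache", sizeof(struct obj),
+					      0, 0, NULL);
+		if (obj_cache) {
+			struct obj *o = kmem_cache_alloc(obj_cache, GFP_KERNEL);
+
+			if (o)
+				kmem_cache_free(obj_cache, o);	/* or kfree(o)/kvfree(o) */
+			/* Destroy only after all objects have been freed. */
+			kmem_cache_destroy(obj_cache);
+		}
+	}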
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index f5dde5bceaea..2d091c873d1e 100644
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -115,3 +115,28 @@ More Memory Management Functions
.. kernel-doc:: include/linux/mmzone.h
.. kernel-doc:: mm/util.c
:functions: folio_mapping
+
+.. kernel-doc:: mm/rmap.c
+.. kernel-doc:: mm/migrate.c
+.. kernel-doc:: mm/mmap.c
+.. kernel-doc:: mm/kmemleak.c
+.. #kernel-doc:: mm/hmm.c (build warnings)
+.. kernel-doc:: mm/memremap.c
+.. kernel-doc:: mm/hugetlb.c
+.. kernel-doc:: mm/swap.c
+.. kernel-doc:: mm/zpool.c
+.. kernel-doc:: mm/memcontrol.c
+.. #kernel-doc:: mm/memory-tiers.c (build warnings)
+.. kernel-doc:: mm/shmem.c
+.. kernel-doc:: mm/migrate_device.c
+.. #kernel-doc:: mm/nommu.c (duplicates kernel-doc from other files)
+.. kernel-doc:: mm/mapping_dirty_helpers.c
+.. #kernel-doc:: mm/memory-failure.c (build warnings)
+.. kernel-doc:: mm/percpu.c
+.. kernel-doc:: mm/maccess.c
+.. kernel-doc:: mm/vmscan.c
+.. kernel-doc:: mm/memory_hotplug.c
+.. kernel-doc:: mm/mmu_notifier.c
+.. kernel-doc:: mm/balloon_compaction.c
+.. kernel-doc:: mm/huge_memory.c
+.. kernel-doc:: mm/io-mapping.c
diff --git a/Documentation/core-api/netlink.rst b/Documentation/core-api/netlink.rst
index e4a938a05cc9..9f692b02bfe6 100644
--- a/Documentation/core-api/netlink.rst
+++ b/Documentation/core-api/netlink.rst
@@ -67,10 +67,11 @@ Globals
kernel-policy
~~~~~~~~~~~~~
-Defines if the kernel validation policy is per operation (``per-op``)
-or for the entire family (``global``). New families should use ``per-op``
-(default) to be able to narrow down the attributes accepted by a specific
-command.
+Defines whether the kernel validation policy is the same for all operations
+of the family (``global``), defined for each operation individually
+(``per-op``), or defined separately for each operation and operation type,
+i.e. do vs. dump (``split``).
+New families should use ``per-op`` (default) to be able to narrow down the
+attributes accepted by a specific command.
checks
------
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 9fb0b1080d3b..d3c1f6d8c0e0 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -112,6 +112,12 @@ pages:
This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time.
+* Because of that limitation, special handling is applied to the zero pages
+ when using FOLL_PIN. We only pretend to pin a zero page - we don't alter its
+ refcount or pincount at all (it is permanent, so there's no need). The
+ unpinning functions also don't do anything to a zero page. This is
+ transparent to the caller.
+
* Callers must specifically request "dma-pinned tracking of pages". In other
words, just calling get_user_pages() will not suffice; a new set of functions,
pin_user_page() and related, must be used.
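+
+An illustrative sketch of the resulting calling pattern (the surrounding
+function is made up; error handling is trimmed)::
+
+	int pin_for_dma(unsigned long user_addr)
+	{
+		struct page *pages[16];
+		int npinned;
+
+		/* The pin_user_pages*() variants set FOLL_PIN internally. */
+		npinned = pin_user_pages_fast(user_addr, ARRAY_SIZE(pages),
+					      FOLL_WRITE | FOLL_LONGTERM, pages);
+		if (npinned <= 0)
+			return npinned ? npinned : -EFAULT;
+
+		/* ... hand the pinned pages to the device for DMA ... */
+
+		unpin_user_pages_dirty_lock(pages, npinned, true);
+		return 0;
+	}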
diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst
index 0eed8a2dcae9..4451ef501936 100644
--- a/Documentation/core-api/printk-formats.rst
+++ b/Documentation/core-api/printk-formats.rst
@@ -576,20 +576,26 @@ The field width is passed by value, the bitmap is passed by reference.
Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease
printing cpumask and nodemask.
-Flags bitfields such as page flags, gfp_flags
----------------------------------------------
+Flags bitfields such as page flags, page_type, gfp_flags
+--------------------------------------------------------
::
%pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff)
+ %pGt 0xffffff7f(buddy)
%pGg GFP_USER|GFP_DMA32|GFP_NOWARN
%pGv read|exec|mayread|maywrite|mayexec|denywrite
For printing flags bitfields as a collection of symbolic constants that
would construct the value. The type of flags is given by the third
-character. Currently supported are [p]age flags, [v]ma_flags (both
-expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag
-names and print order depends on the particular type.
+character. Currently supported are:
+
+ - p - [p]age flags, expects value of type (``unsigned long *``)
+ - t - page [t]ype, expects value of type (``unsigned int *``)
+ - v - [v]ma_flags, expects value of type (``unsigned long *``)
+ - g - [g]fp_flags, expects value of type (``gfp_t *``)
+
+The flag names and print order depend on the particular type.
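+
+For example, a (hypothetical) debug helper could print a folio's flags and a
+gfp mask symbolically::
+
+	static void dump_alloc_state(struct folio *folio, gfp_t gfp)
+	{
+		/* %pGp takes unsigned long *, %pGg takes gfp_t *. */
+		pr_debug("flags: %pGp gfp: %pGg\n", &folio->flags, &gfp);
+	}
+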
Note that this format should not be used directly in the
:c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags()
diff --git a/Documentation/core-api/this_cpu_ops.rst b/Documentation/core-api/this_cpu_ops.rst
index 5cb8b883ae83..91acbcf30e9b 100644
--- a/Documentation/core-api/this_cpu_ops.rst
+++ b/Documentation/core-api/this_cpu_ops.rst
@@ -53,7 +53,6 @@ preemption and interrupts::
this_cpu_add_return(pcp, val)
this_cpu_xchg(pcp, nval)
this_cpu_cmpxchg(pcp, oval, nval)
- this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
this_cpu_sub(pcp, val)
this_cpu_inc(pcp)
this_cpu_dec(pcp)
@@ -242,7 +241,6 @@ safe::
__this_cpu_add_return(pcp, val)
__this_cpu_xchg(pcp, nval)
__this_cpu_cmpxchg(pcp, oval, nval)
- __this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
__this_cpu_sub(pcp, val)
__this_cpu_inc(pcp)
__this_cpu_dec(pcp)
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 8ec4d6270b24..5d7b01aed1fe 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -1,6 +1,6 @@
-====================================
-Concurrency Managed Workqueue (cmwq)
-====================================
+=========
+Workqueue
+=========
:Date: September, 2010
:Author: Tejun Heo <tj@kernel.org>
@@ -25,8 +25,8 @@ there is no work item left on the workqueue the worker becomes idle.
When a new work item gets queued, the worker begins executing again.
-Why cmwq?
-=========
+Why Concurrency Managed Workqueue?
+==================================
In the original wq implementation, a multi threaded (MT) wq had one
worker thread per CPU and a single threaded (ST) wq had one worker
@@ -220,17 +220,16 @@ resources, scheduled and executed.
``max_active``
--------------
-``@max_active`` determines the maximum number of execution contexts
-per CPU which can be assigned to the work items of a wq. For example,
-with ``@max_active`` of 16, at most 16 work items of the wq can be
-executing at the same time per CPU.
+``@max_active`` determines the maximum number of execution contexts per
+CPU which can be assigned to the work items of a wq. For example, with
+``@max_active`` of 16, at most 16 work items of the wq can be executing
+at the same time per CPU. This is always a per-CPU attribute, even for
+unbound workqueues.
-Currently, for a bound wq, the maximum limit for ``@max_active`` is
-512 and the default value used when 0 is specified is 256. For an
-unbound wq, the limit is higher of 512 and 4 *
-``num_possible_cpus()``. These values are chosen sufficiently high
-such that they are not the limiting factor while providing protection
-in runaway cases.
+The maximum limit for ``@max_active`` is 512 and the default value used
+when 0 is specified is 256. These values are chosen sufficiently high
+such that they are not the limiting factor while providing protection in
+runaway cases.
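+
+For example (a sketch; the workqueue name is made up), a workqueue created
+with a per-CPU cap of 16 in-flight work items could later be re-tuned::
+
+	struct workqueue_struct *wq;
+
+	/* Passing 0 instead of 16 would select the default of 256. */
+	wq = alloc_workqueue("my_wq", WQ_UNBOUND, 16);
+	if (wq)
+		workqueue_set_max_active(wq, 32);
+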
The number of active work items of a wq is usually regulated by the
users of the wq, more specifically, by how many work items the users
@@ -348,6 +347,356 @@ Guidelines
level of locality in wq operations and work item execution.
+Affinity Scopes
+===============
+
+An unbound workqueue groups CPUs according to its affinity scope to improve
+cache locality. For example, if a workqueue is using the default affinity
+scope of "cache", it will group CPUs according to last level cache
+boundaries. A work item queued on the workqueue will be assigned to a worker
+on one of the CPUs which share the last level cache with the issuing CPU.
+Once started, the worker may or may not be allowed to move outside the scope
+depending on the ``affinity_strict`` setting of the scope.
+
+Workqueue currently supports the following affinity scopes.
+
+``default``
+ Use the scope in module parameter ``workqueue.default_affinity_scope``
+ which is always set to one of the scopes below.
+
+``cpu``
+ CPUs are not grouped. A work item issued on one CPU is processed by a
+ worker on the same CPU. This makes unbound workqueues behave as per-cpu
+ workqueues without concurrency management.
+
+``smt``
+ CPUs are grouped according to SMT boundaries. This usually means that the
+ logical threads of each physical CPU core are grouped together.
+
+``cache``
+ CPUs are grouped according to cache boundaries. Which specific cache
+ boundary is used is determined by the arch code. L3 is used in a lot of
+ cases. This is the default affinity scope.
+
+``numa``
+  CPUs are grouped according to NUMA boundaries.
+
+``system``
+ All CPUs are put in the same group. Workqueue makes no effort to process a
+ work item on a CPU close to the issuing CPU.
+
+The default affinity scope can be changed with the module parameter
+``workqueue.default_affinity_scope`` and a specific workqueue's affinity
+scope can be changed using ``apply_workqueue_attrs()``.
+
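+A minimal sketch of the ``apply_workqueue_attrs()`` route for built-in code
+(``wq`` is assumed to be an existing unbound workqueue; the attribute field
+names reflect this release and should be treated as an illustration)::
+
+	struct workqueue_attrs *attrs = alloc_workqueue_attrs();
+
+	if (attrs) {
+		attrs->affn_scope = WQ_AFFN_CACHE;	/* group by last level cache */
+		attrs->affn_strict = true;		/* keep workers inside the scope */
+		if (apply_workqueue_attrs(wq, attrs))
+			pr_warn("failed to apply workqueue attrs\n");
+		free_workqueue_attrs(attrs);
+	}
+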
+If ``WQ_SYSFS`` is set, the workqueue will have the following affinity scope
+related interface files under its ``/sys/devices/virtual/WQ_NAME/``
+directory.
+
+``affinity_scope``
+ Read to see the current affinity scope. Write to change.
+
+ When default is the current scope, reading this file will also show the
+ current effective scope in parentheses, for example, ``default (cache)``.
+
+``affinity_strict``
+  0 by default, indicating that affinity scopes are not strict. When a work
+  item starts execution, workqueue makes a best-effort attempt to ensure
+  that the worker is inside its affinity scope, which is called
+  repatriation. Once the work item has started, the scheduler is free to
+  move the worker anywhere in the system as it sees fit. This enables
+  benefiting from scope locality while still being able to utilize other
+  CPUs if necessary and available.
+
+ If set to 1, all workers of the scope are guaranteed always to be in the
+ scope. This may be useful when crossing affinity scopes has other
+ implications, for example, in terms of power consumption or workload
+ isolation. Strict NUMA scope can also be used to match the workqueue
+ behavior of older kernels.
+
+
+Affinity Scopes and Performance
+===============================
+
+It'd be ideal if an unbound workqueue's behavior were optimal for the vast
+majority of use cases without further tuning. Unfortunately, in the current
+kernel, there exists a pronounced trade-off between locality and utilization
+necessitating explicit configurations when workqueues are heavily used.
+
+Higher locality leads to higher efficiency where more work is performed for
+the same number of consumed CPU cycles. However, higher locality may also
+cause lower overall system utilization if the work items are not spread
+enough across the affinity scopes by the issuers. The following performance
+testing with dm-crypt clearly illustrates this trade-off.
+
+The tests are run on a CPU with 12-cores/24-threads split across four L3
+caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency.
+``/dev/dm-0`` is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and
+opened with ``cryptsetup`` with default settings.
+
+
+Scenario 1: Enough issuers and work spread across the machine
+-------------------------------------------------------------
+
+The command used: ::
+
+ $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
+ --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
+ --name=iops-test-job --verify=sha512
+
+There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512``
+makes ``fio`` generate and read back the content each time which makes
+execution locality matter between the issuer and ``kcryptd``. The following
+are the read bandwidths and CPU utilizations for different affinity scope
+settings on ``kcryptd``, measured over five runs. Bandwidths are in MiBps,
+and CPU utilization in percent.
+
+.. list-table::
+ :widths: 16 20 20
+ :header-rows: 1
+
+ * - Affinity
+ - Bandwidth (MiBps)
+ - CPU util (%)
+
+ * - system
+ - 1159.40 ±1.34
+ - 99.31 ±0.02
+
+ * - cache
+ - 1166.40 ±0.89
+ - 99.34 ±0.01
+
+ * - cache (strict)
+ - 1166.00 ±0.71
+ - 99.35 ±0.01
+
+With enough issuers spread across the system, there is no downside to
+"cache", strict or otherwise. All three configurations saturate the whole
+machine but the cache-affine ones outperform by 0.6% thanks to improved
+locality.
+
+
+Scenario 2: Fewer issuers, enough work for saturation
+-----------------------------------------------------
+
+The command used: ::
+
+ $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+ --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
+ --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+The only difference from the previous scenario is ``--numjobs=8``. There are
+a third as many issuers, but there is still enough total work to saturate the
+system.
+
+.. list-table::
+ :widths: 16 20 20
+ :header-rows: 1
+
+ * - Affinity
+ - Bandwidth (MiBps)
+ - CPU util (%)
+
+ * - system
+ - 1155.40 ±0.89
+ - 97.41 ±0.05
+
+ * - cache
+ - 1154.40 ±1.14
+ - 96.15 ±0.09
+
+ * - cache (strict)
+ - 1112.00 ±4.64
+ - 93.26 ±0.35
+
+This is more than enough work to saturate the system. Both "system" and
+"cache" are nearly saturating the machine but not fully. "cache" is using
+less CPU but the better efficiency puts it at the same bandwidth as
+"system".
+
+Eight issuers moving around over four L3 cache scopes still allow "cache
+(strict)" to mostly saturate the machine, but the loss of work conservation
+is now starting to hurt with a 3.7% bandwidth loss.
+
+
+Scenario 3: Even fewer issuers, not enough work to saturate
+-----------------------------------------------------------
+
+The command used: ::
+
+ $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+ --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
+ --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+Again, the only difference is ``--numjobs=4``. With the number of issuers
+reduced to four, there now isn't enough work to saturate the whole system
+and the bandwidth becomes dependent on completion latencies.
+
+.. list-table::
+ :widths: 16 20 20
+ :header-rows: 1
+
+ * - Affinity
+ - Bandwidth (MiBps)
+ - CPU util (%)
+
+ * - system
+ - 993.60 ±1.82
+ - 75.49 ±0.06
+
+ * - cache
+ - 973.40 ±1.52
+ - 74.90 ±0.07
+
+ * - cache (strict)
+ - 828.20 ±4.49
+ - 66.84 ±0.29
+
+Now, the tradeoff between locality and utilization is clearer. "cache" shows
+a 2% bandwidth loss compared to "system", and "cache (strict)" a whopping 20%.
+
+
+Conclusion and Recommendations
+------------------------------
+
+In the above experiments, the efficiency advantage of the "cache" affinity
+scope over "system" is, while consistent and noticeable, small. However, the
+impact is dependent on the distances between the scopes and may be more
+pronounced in processors with more complex topologies.
+
+While the loss of work-conservation in certain scenarios hurts, it is a lot
+better than "cache (strict)" and maximizing workqueue utilization is
+unlikely to be the common case anyway. As such, "cache" is the default
+affinity scope for unbound pools.
+
+* As there is no one option which is great for most cases, workqueue usages
+ that may consume a significant amount of CPU are recommended to configure
+ the workqueues using ``apply_workqueue_attrs()`` and/or enable
+ ``WQ_SYSFS``.
+
+* An unbound workqueue with strict "cpu" affinity scope behaves the same as
+  ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advantage to the
+ latter and an unbound workqueue provides a lot more flexibility.
+
+* Affinity scopes are introduced in Linux v6.5. To emulate the previous
+ behavior, use strict "numa" affinity scope.
+
+* The loss of work-conservation in non-strict affinity scopes is likely
+ originating from the scheduler. There is no theoretical reason why the
+ kernel wouldn't be able to do the right thing and maintain
+ work-conservation in most cases. As such, it is possible that future
+ scheduler improvements may make most of these tunables unnecessary.
+
+
+Examining Configuration
+=======================
+
+Use tools/workqueue/wq_dump.py to examine unbound CPU affinity
+configuration, worker pools and how workqueues map to the pools: ::
+
+ $ tools/workqueue/wq_dump.py
+ Affinity Scopes
+ ===============
+ wq_unbound_cpumask=0000000f
+
+ CPU
+ nr_pods 4
+ pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
+ pod_node [0]=0 [1]=0 [2]=1 [3]=1
+ cpu_pod [0]=0 [1]=1 [2]=2 [3]=3
+
+ SMT
+ nr_pods 4
+ pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
+ pod_node [0]=0 [1]=0 [2]=1 [3]=1
+ cpu_pod [0]=0 [1]=1 [2]=2 [3]=3
+
+ CACHE (default)
+ nr_pods 2
+ pod_cpus [0]=00000003 [1]=0000000c
+ pod_node [0]=0 [1]=1
+ cpu_pod [0]=0 [1]=0 [2]=1 [3]=1
+
+ NUMA
+ nr_pods 2
+ pod_cpus [0]=00000003 [1]=0000000c
+ pod_node [0]=0 [1]=1
+ cpu_pod [0]=0 [1]=0 [2]=1 [3]=1
+
+ SYSTEM
+ nr_pods 1
+ pod_cpus [0]=0000000f
+ pod_node [0]=-1
+ cpu_pod [0]=0 [1]=0 [2]=0 [3]=0
+
+ Worker Pools
+ ============
+ pool[00] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 0
+ pool[01] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 0
+ pool[02] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 1
+ pool[03] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 1
+ pool[04] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 2
+ pool[05] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 2
+ pool[06] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 3
+ pool[07] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 3
+ pool[08] ref=42 nice= 0 idle/workers= 6/ 6 cpus=0000000f
+ pool[09] ref=28 nice= 0 idle/workers= 3/ 3 cpus=00000003
+ pool[10] ref=28 nice= 0 idle/workers= 17/ 17 cpus=0000000c
+ pool[11] ref= 1 nice=-20 idle/workers= 1/ 1 cpus=0000000f
+ pool[12] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=00000003
+ pool[13] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=0000000c
+
+ Workqueue CPU -> pool
+ =====================
+ [ workqueue \ CPU 0 1 2 3 dfl]
+ events percpu 0 2 4 6
+ events_highpri percpu 1 3 5 7
+ events_long percpu 0 2 4 6
+ events_unbound unbound 9 9 10 10 8
+ events_freezable percpu 0 2 4 6
+ events_power_efficient percpu 0 2 4 6
+ events_freezable_power_ percpu 0 2 4 6
+ rcu_gp percpu 0 2 4 6
+ rcu_par_gp percpu 0 2 4 6
+ slub_flushwq percpu 0 2 4 6
+ netns ordered 8 8 8 8 8
+ ...
+
+See the command's help message for more info.
+
+
+Monitoring
+==========
+
+Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::
+
+ $ tools/workqueue/wq_monitor.py events
+ total infl CPUtime CPUhog CMW/RPR mayday rescued
+ events 18545 0 6.1 0 5 - -
+ events_highpri 8 0 0.0 0 0 - -
+ events_long 3 0 0.0 0 0 - -
+ events_unbound 38306 0 0.1 - 7 - -
+ events_freezable 0 0 0.0 0 0 - -
+ events_power_efficient 29598 0 0.2 0 0 - -
+ events_freezable_power_ 10 0 0.0 0 0 - -
+ sock_diag_events 0 0 0.0 0 0 - -
+
+ total infl CPUtime CPUhog CMW/RPR mayday rescued
+ events 18548 0 6.1 0 5 - -
+ events_highpri 8 0 0.0 0 0 - -
+ events_long 3 0 0.0 0 0 - -
+ events_unbound 38322 0 0.1 - 7 - -
+ events_freezable 0 0 0.0 0 0 - -
+ events_power_efficient 29603 0 0.2 0 0 - -
+ events_freezable_power_ 10 0 0.0 0 0 - -
+ sock_diag_events 0 0 0.0 0 0 - -
+
+ ...
+
+See the command's help message for more info.
+
+
Debugging
=========
@@ -387,6 +736,7 @@ the stack trace of the offending worker thread. ::
The work item's function should be trivially visible in the stack
trace.
+
Non-reentrance Conditions
=========================