summaryrefslogtreecommitdiff
path: root/mm/page-writeback.c
AgeCommit message (Collapse)AuthorFilesLines
2021-10-18mm/writeback: Add folio_start_writeback()Matthew Wilcox (Oracle)1-19/+20
Rename set_page_writeback() to folio_start_writeback() to match folio_end_writeback(). Do not bother with wrappers that return void; callers are perfectly capable of ignoring return values. Add wrappers for set_page_writeback(), set_page_writeback_keepwrite() and test_set_page_writeback() for compatibililty with existing filesystems. The main advantage of this patch is getting the statistics right, although it does eliminate a couple of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
2021-10-18mm/writeback: Add __folio_end_writeback()Matthew Wilcox (Oracle)1-14/+15
test_clear_page_writeback() is actually an mm-internal function, although it's named as if it's a pagecache function. Move it to mm/internal.h, rename it to __folio_end_writeback() and change the return type to bool. The conversion from page to folio is mostly about accounting the number of pages being written back, although it does eliminate a couple of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
2021-10-18mm/writeback: Change __wb_writeout_inc() to __wb_writeout_add()Matthew Wilcox (Oracle)1-11/+11
Allow for accounting N pages at once instead of one page at a time. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
2021-10-18flex_proportions: Allow N events instead of 1Matthew Wilcox (Oracle)1-2/+2
When batching events (such as writing back N pages in a single I/O), it is better to do one flex_proportion operation instead of N. There is only one caller of __fprop_inc_percpu_max(), and it's the one we're going to change in the next patch, so rename it instead of adding a compatibility wrapper. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz>
2021-09-27mm/filemap: Add folio_wait_bit()Matthew Wilcox (Oracle)1-2/+2
Rename wait_on_page_bit() to folio_wait_bit(). We must always wait on the folio, otherwise we won't be woken up due to the tail page hashing to a different bucket from the head page. This commit shrinks the kernel by 770 bytes, mostly due to moving the page waitqueue lookup into folio_wait_bit_common(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Layton <jlayton@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com>
2021-09-27mm/writeback: Add folio_wait_stable()Matthew Wilcox (Oracle)1-10/+14
Move wait_for_stable_page() into the folio compatibility file. folio_wait_stable() avoids a call to compound_head() and is 14 bytes smaller than wait_for_stable_page() was. The net text size grows by 16 bytes as a result of this patch. We can also remove thp_head() as this was the last user. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Layton <jlayton@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: David Howells <dhowells@redhat.com>
2021-09-27mm/writeback: Add folio_wait_writeback()Matthew Wilcox (Oracle)1-15/+33
wait_on_page_writeback_killable() only has one caller, so convert it to call folio_wait_writeback_killable(). For the wait_on_page_writeback() callers, add a compatibility wrapper around folio_wait_writeback(). Turning PageWriteback() into folio_test_writeback() eliminates a call to compound_head() which saves 8 bytes and 15 bytes in the two functions. Unfortunately, that is more than offset by adding the wait_on_page_writeback compatibility wrapper for a net increase in text of 7 bytes. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Layton <jlayton@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Howells <dhowells@redhat.com>
2021-09-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-38/+83
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ...
2021-09-03writeback: use READ_ONCE for unlocked reads of writeback statsJan Kara1-12/+13
We do some unlocked reads of writeback statistics like avg_write_bandwidth, dirty_ratelimit, or bw_time_stamp. Generally we are fine with getting somewhat out-of-date values but actually getting different values in various parts of the functions because the compiler decided to reload value from original memory location could confuse calculations. Use READ_ONCE for these unlocked accesses and WRITE_ONCE for the updates to be on the safe side. Link: https://lkml.kernel.org/r/20210713104716.22868-5-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03writeback: rename domain_update_bandwidth()Jan Kara1-4/+4
Rename domain_update_bandwidth() to domain_update_dirty_limit(). The original name is a misnomer. The function has nothing to do with a bandwidth, it updates dirty limits. Link: https://lkml.kernel.org/r/20210713104716.22868-4-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03writeback: fix bandwidth estimate for spiky workloadJan Kara1-14/+25
Michael Stapelberg has reported that for workload with short big spikes of writes (GCC linker seem to trigger this frequently) the write throughput is heavily underestimated and tends to steadily sink until it reaches zero. This has rather bad impact on writeback throttling (causing stalls). The problem is that writeback throughput estimate gets updated at most once per 200 ms. One update happens early after we submit pages for writeback (at that point writeout of only small fraction of pages is completed and thus observed throughput is tiny). Next update happens only during the next write spike (updates happen only from inode writeback and dirty throttling code) and if that is more than 1s after previous spike, we decide system was idle and just ignore whatever was written until this moment. Fix the problem by making sure writeback throughput estimate is also updated shortly after writeback completes to get reasonable estimate of throughput for spiky workloads. [jack@suse.cz: avoid division by 0 in wb_update_dirty_ratelimit()] Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com Link: https://lkml.kernel.org/r/20210713104716.22868-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Michael Stapelberg <stapelberg+linux@google.com> Tested-by: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03writeback: reliably update bandwidth estimationJan Kara1-12/+27
Currently we trigger writeback bandwidth estimation from balance_dirty_pages() and from wb_writeback(). However neither of these need to trigger when the system is relatively idle and writeback is triggered e.g. from fsync(2). Make sure writeback estimates happen reliably by triggering them from do_writepages(). Link: https://lkml.kernel.org/r/20210713104716.22868-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03writeback: track number of inodes under writebackJan Kara1-2/+20
Patch series "writeback: Fix bandwidth estimates", v4. Fix estimate of writeback throughput when device is not fully busy doing writeback. Michael Stapelberg has reported that such workload (e.g. generated by linking) tends to push estimated throughput down to 0 and as a result writeback on the device is practically stalled. The first three patches fix the reported issue, the remaining two patches are unrelated cleanups of problems I've noticed when reading the code. This patch (of 4): Track number of inodes under writeback for each bdi_writeback structure. We will use this to decide whether wb does any IO and so we can estimate its writeback throughput. In principle we could use number of pages under writeback (WB_WRITEBACK counter) for this however normal percpu counter reads are too inaccurate for our purposes and summing the counter is too expensive. Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Michael Stapelberg <stapelberg+linux@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-09mm: hide laptop_mode_wb_timer entirely behind the BDI APIChristoph Hellwig1-2/+0
Don't leak the detaіls of the timer into the block layer, instead initialize the timer in bdi_alloc and delete it in bdi_unregister. Note that this means the timer is initialized (but not armed) for non-block queues as well now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-blockLinus Torvalds1-5/+0
Pull core block updates from Jens Axboe: - disk events cleanup (Christoph) - gendisk and request queue allocation simplifications (Christoph) - bdev_disk_changed cleanups (Christoph) - IO priority improvements (Bart) - Chained bio completion trace fix (Edward) - blk-wbt fixes (Jan) - blk-wbt enable/disable fix (Zhang) - Scheduler dispatch improvements (Jan, Ming) - Shared tagset scheduler improvements (John) - BFQ updates (Paolo, Luca, Pietro) - BFQ lock inversion fix (Jan) - Documentation improvements (Kir) - CLONE_IO block cgroup fix (Tejun) - Remove of ancient and deprecated block dump feature (zhangyi) - Discard merge fix (Ming) - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas, Yang) * tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits) block: fix discard request merge block/mq-deadline: Remove a WARN_ON_ONCE() call blk-mq: update hctx->dispatch_busy in case of real scheduler blk: Fix lock inversion between ioc lock and bfqd lock bfq: Remove merged request already in bfq_requests_merged() block: pass a gendisk to bdev_disk_changed block: move bdev_disk_changed block: add the events* attributes to disk_attrs block: move the disk events code to a separate file block: fix trace completion for chained bio block/partitions/msdos: Fix typo inidicator -> indicator block, bfq: reset waker pointer with shared queues block, bfq: check waker only for queues with no in-flight I/O block, bfq: avoid delayed merge of async queues block, bfq: boost throughput by extending queue-merging times block, bfq: consider also creation time in delayed stable merge block, bfq: fix delayed stable merge check block, bfq: let also stably merged queues enjoy weight raising blk-wbt: make sure throttle is enabled properly blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled() ...
2021-06-29fs: remove noop_set_page_dirty()Matthew Wilcox (Oracle)1-0/+1
Use __set_page_dirty_no_writeback() instead. This will set the dirty bit on the page, which will be used to avoid calling set_page_dirty() in the future. It will have no effect on actually writing the page back, as the pages are not on any LRU lists. [akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules] Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/writeback: use __set_page_dirty in __set_page_dirty_nobuffersMatthew Wilcox (Oracle)1-9/+1
This is fundamentally the same code, so just call it instead of duplicating it. Link: https://lkml.kernel.org/r/20210615162342.1669332-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/writeback: move __set_page_dirty() to core mmMatthew Wilcox (Oracle)1-1/+26
Patch series "Further set_page_dirty cleanups". Prompted by Christoph's recent patches, here are some more patches to improve the state of set_page_dirty(). They're all from the folio tree, so they've been tested to a certain extent. This patch (of 6): Nothing in __set_page_dirty() is specific to buffer_head, so move it to mm/page-writeback.c. That removes the only caller of account_page_dirtied() outside of page-writeback.c, so make it static. Link: https://lkml.kernel.org/r/20210615162342.1669332-1-willy@infradead.org Link: https://lkml.kernel.org/r/20210615162342.1669332-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: require ->set_page_dirty to be explicitly wired upChristoph Hellwig1-14/+4
Remove the CONFIG_BLOCK default to __set_page_dirty_buffers and just wire that method up for the missing instances. [hch@lst.de: ecryptfs: add a ->set_page_dirty cludge] Link: https://lkml.kernel.org/r/20210624125250.536369-1-hch@lst.de Link: https://lkml.kernel.org/r/20210614061512.3966143-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Tyler Hicks <code@tyhicks.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/page-writeback: use __this_cpu_inc() in account_page_dirtied()Chi Wu1-1/+1
As account_page_dirtied() was always protected by xa_lock_irqsave(), so using __this_cpu_inc() is better. Link: https://lkml.kernel.org/r/20210512144742.4764-1-wuchi.zero@gmail.com Signed-off-by: Chi Wu <wuchi.zero@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/page-writeback: update the comment of Dirty position controlChi Wu1-1/+1
As the value of pos_ratio_polynom() clamp between 0 and 2LL << RATELIMIT_CALC_SHIFT, the global control line should be consistent with it. Link: https://lkml.kernel.org/r/20210511103606.3732-1-wuchi.zero@gmail.com Signed-off-by: Chi Wu <wuchi.zero@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@fb.com> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/page-writeback: Fix performance when BDI's share of ratio is 0.Chi Wu1-4/+16
Fix performance when BDI's share of ratio is 0. The issue is similar to commit 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()"). Balance_dirty_pages and the writeback worker will also disagree on whether writeback when a BDI uses BDI_CAP_STRICTLIMIT and BDI's share of the thresh ratio is zero. For example, A thread on cpu0 writes 32 pages and then balance_dirty_pages, it will wake up background writeback and pauses because wb_dirty > wb->wb_thresh = 0 (share of thresh ratio is zero). A thread may runs on cpu0 again because scheduler prefers pre_cpu. Then writeback worker may runs on other cpus(1,2..) which causes the value of wb_stat(wb, WB_RECLAIMABLE) in wb_over_bg_thresh is 0 and does not writeback and returns. Thus, balance_dirty_pages keeps looping, sleeping and then waking up the worker who will do nothing. It remains stuck in this state until the writeback worker hit the right dirty cpu or the dirty pages expire. The fix that we should get the wb_stat_sum radically when thresh is low. Link: https://lkml.kernel.org/r/20210428225046.16301-1-wuchi.zero@gmail.com Signed-off-by: Chi Wu <wuchi.zero@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: page-writeback: kill get_writeback_state() commentsKefeng Wang1-6/+3
The get_writeback_state() has gone since 2006, kill related comments. Link: https://lkml.kernel.org/r/20210508125026.56600-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-24block_dump: remove block_dump featurezhangyi (F)1-5/+0
We have already delete block_dump feature in mark_inode_dirty() because it can be replaced by tracepoints, now we also remove the part in submit_bio() for the same reason. The part of block dump feature in submit_bio() dump the write process, write region and sectors on the target disk into kernel message. it can be replaced by block_bio_queue tracepoint in submit_bio_checks(), so we do not need block_dump anymore, remove the whole block_dump feature. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210313030146.2882027-3-yi.zhang@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-07mm: fix typos in commentsIngo Molnar1-2/+2
Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30mm: page-writeback: simplify memcg handling in test_clear_page_writeback()Johannes Weiner1-6/+3
Page writeback doesn't hold a page reference, which allows truncate to free a page the second PageWriteback is cleared. This used to require special attention in test_clear_page_writeback(), where we had to be careful not to rely on the unstable page->memcg binding and look up all the necessary information before clearing the writeback flag. Since commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)") test_clear_page_writeback() is called with an explicit reference on the page, and this dance is no longer needed. Use unlock_page_memcg() and dec_lruvec_page_state() directly. This removes the last user of the lock_page_memcg() return value, change it to void. Touch up the comments in there as well. This also removes the last extern user of __unlock_page_memcg(), make it static. Further, it removes the last user of dec_lruvec_state(), delete it, along with a few other unused helpers. Link: https://lkml.kernel.org/r/YCQbYAWg4nvBFL6h@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-23mm/writeback: Add wait_on_page_writeback_killableMatthew Wilcox (Oracle)1-0/+16
This is the killable version of wait_on_page_writeback. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: kafs-testing@auristor.com cc: linux-afs@lists.infradead.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20210320054104.1300774-3-willy@infradead.org
2021-01-05mm: make wait_on_page_writeback() wait for multiple pending writebacksLinus Torvalds1-1/+1
Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") we've had some very occasional reports of BUG_ON(PageWriteback) in write_cache_pages(), which we thought we already fixed in commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)"). But syzbot just reported another one, even with that commit in place. And it turns out that there's a simpler way to trigger the BUG_ON() than the one Hugh found with page re-use. It all boils down to the fact that the page writeback is ostensibly serialized by the page lock, but that isn't actually really true. Yes, the people _setting_ writeback all do so under the page lock, but the actual clearing of the bit - and waking up any waiters - happens without any page lock. This gives us this fairly simple race condition: CPU1 = end previous writeback CPU2 = start new writeback under page lock CPU3 = write_cache_pages() CPU1 CPU2 CPU3 ---- ---- ---- end_page_writeback() test_clear_page_writeback(page) ... delayed... lock_page(); set_page_writeback() unlock_page() lock_page() wait_on_page_writeback(); wake_up_page(page, PG_writeback); .. wakes up CPU3 .. BUG_ON(PageWriteback(page)); where the BUG_ON() happens because we woke up the PG_writeback bit becasue of the _previous_ writeback, but a new one had already been started because the clearing of the bit wasn't actually atomic wrt the actual wakeup or serialized by the page lock. The reason this didn't use to happen was that the old logic in waiting on a page bit would just loop if it ever saw the bit set again. The nice proper fix would probably be to get rid of the whole "wait for writeback to clear, and then set it" logic in the writeback path, and replace it with an atomic "wait-to-set" (ie the same as we have for page locking: we set the page lock bit with a single "lock_page()", not with "wait for lock bit to clear and then set it"). However, out current model for writeback is that the waiting for the writeback bit is done by the generic VFS code (ie write_cache_pages()), but the actual setting of the writeback bit is done much later by the filesystem ".writepages()" function. IOW, to make the writeback bit have that same kind of "wait-to-set" behavior as we have for page locking, we'd have to change our roughly ~50 different writeback functions. Painful. Instead, just make "wait_on_page_writeback()" loop on the very unlikely situation that the PG_writeback bit is still set, basically re-instating the old behavior. This is very non-optimal in case of contention, but since we only ever set the bit under the page lock, that situation is controlled. Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-25mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)Hugh Dickins1-6/+0
Twice now, when exercising ext4 looped on shmem huge pages, I have crashed on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling end_page_writeback() calling wake_up_page() on tail of a shmem huge page, no longer an ext4 page at all. The problem is that PageWriteback is not accompanied by a page reference (as the NOTE at the end of test_clear_page_writeback() acknowledges): as soon as TestClearPageWriteback has been done, that page could be removed from page cache, freed, and reused for something else by the time that wake_up_page() is reached. https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/ Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail check; but I'm paranoid about even looking at an unreferenced struct page, lest its memory might itself have already been reused or hotremoved (and wake_up_page_bit() may modify that memory with its ClearPageWaiters()). Then on crashing a second time, realized there's a stronger reason against that approach. If my testing just occasionally crashes on that check, when the page is reused for part of a compound page, wouldn't it be much more common for the page to get reused as an order-0 page before reaching wake_up_page()? And on rare occasions, might that reused page already be marked PageWriteback by its new user, and already be waited upon? What would that look like? It would look like BUG_ON(PageWriteback) after wait_on_page_writeback() in write_cache_pages() (though I have never seen that crash myself). Matthew Wilcox explaining this to himself: "page is allocated, added to page cache, dirtied, writeback starts, --- thread A --- filesystem calls end_page_writeback() test_clear_page_writeback() --- context switch to thread B --- truncate_inode_pages_range() finds the page, it doesn't have writeback set, we delete it from the page cache. Page gets reallocated, dirtied, writeback starts again. Then we call write_cache_pages(), see PageWriteback() set, call wait_on_page_writeback() --- context switch back to thread A --- wake_up_page(page, PG_writeback); ... thread B is woken, but because the wakeup was for the old use of the page, PageWriteback is still set. Devious" And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") this would have been much less likely: before that, wake_page_function()'s non-exclusive case would stop walking and not wake if it found Writeback already set again; whereas now the non-exclusive case proceeds to wake. I have not thought of a fix that does not add a little overhead: the simplest fix is for end_page_writeback() to get_page() before calling test_clear_page_writeback(), then put_page() after wake_up_page(). Was there a chance of missed wakeups before, since a page freed before reaching wake_up_page() would have PageWaiters cleared? I think not, because each waiter does hold a reference on the page. This bug comes when the old use of the page, the one we do TestClearPageWriteback on, had *no* waiters, so no additional page reference beyond the page cache (and whoever racily freed it). The reuse of the page has a waiter holding a reference, and its own PageWriteback set; but the belated wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback). Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com Reported-by: Qian Cai <cai@lca.pw> Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org # v5.8+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/page-writeback: support tail pages in wait_for_stable_pageMatthew Wilcox (Oracle)1-0/+1
page->mapping is undefined for tail pages, so operate exclusively on the head page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-11-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-24bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flagChristoph Hellwig1-6/+6
Replace the two negative flags that are always used together with a single positive flag that indicates the writeback capability instead of two related non-capabilities. Also remove the pointless wrappers to just check the flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: invert BDI_CAP_NO_ACCT_WBChristoph Hellwig1-2/+2
Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to make the checks more obvious. Also remove the pointless bdi_cap_account_writeback wrapper that just obsfucates the check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flagChristoph Hellwig1-1/+1
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the backing_dev_info shared between the block drivers and the writeback code. To help untangling the dependency replace it with a queue flag and a superblock flag derived from it. This also helps with the case of e.g. a file system requiring stable writes due to its own checksumming, but not forcing it on other users of the block device like the swap code. One downside is that we an't support the stable_pages_required bdi attribute in sysfs anymore. It is replaced with a queue attribute which also is writable for easier testing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-07mm: remove vm_total_pagesDavid Hildenbrand1-4/+2
The global variable "vm_total_pages" is a relic from older days. There is only a single user that reads the variable - build_all_zonelists() - and the first thing it does is update it. Use a local variable in build_all_zonelists() instead and remove the global variable. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-05mm/page-writeback: fix a typo in comment "effictive"->"effective"Ethon Paul1-1/+1
There is a typo in comment, fix it. Signed-off-by: Ethon Paul <ethp@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200411003513.14613-1-ethp@qq.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds1-10/+6
Pull networking updates from David Miller: 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz Augusto von Dentz. 2) Add GSO partial support to igc, from Sasha Neftin. 3) Several cleanups and improvements to r8169 from Heiner Kallweit. 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a device self-test. From Andrew Lunn. 5) Start moving away from custom driver versions, use the globally defined kernel version instead, from Leon Romanovsky. 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin. 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet. 8) Add sriov and vf support to hinic, from Luo bin. 9) Support Media Redundancy Protocol (MRP) in the bridging code, from Horatiu Vultur. 10) Support netmap in the nft_nat code, from Pablo Neira Ayuso. 11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina Dubroca. Also add ipv6 support for espintcp. 12) Lots of ReST conversions of the networking documentation, from Mauro Carvalho Chehab. 13) Support configuration of ethtool rxnfc flows in bcmgenet driver, from Doug Berger. 14) Allow to dump cgroup id and filter by it in inet_diag code, from Dmitry Yakunin. 15) Add infrastructure to export netlink attribute policies to userspace, from Johannes Berg. 16) Several optimizations to sch_fq scheduler, from Eric Dumazet. 17) Fallback to the default qdisc if qdisc init fails because otherwise a packet scheduler init failure will make a device inoperative. From Jesper Dangaard Brouer. 18) Several RISCV bpf jit optimizations, from Luke Nelson. 19) Correct the return type of the ->ndo_start_xmit() method in several drivers, it's netdev_tx_t but many drivers were using 'int'. From Yunjian Wang. 20) Add an ethtool interface for PHY master/slave config, from Oleksij Rempel. 21) Add BPF iterators, from Yonghang Song. 22) Add cable test infrastructure, including ethool interfaces, from Andrew Lunn. Marvell PHY driver is the first to support this facility. 23) Remove zero-length arrays all over, from Gustavo A. R. Silva. 24) Calculate and maintain an explicit frame size in XDP, from Jesper Dangaard Brouer. 25) Add CAP_BPF, from Alexei Starovoitov. 26) Support terse dumps in the packet scheduler, from Vlad Buslov. 27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei. 28) Add devm_register_netdev(), from Bartosz Golaszewski. 29) Minimize qdisc resets, from Cong Wang. 30) Get rid of kernel_getsockopt and kernel_setsockopt in order to eliminate set_fs/get_fs calls. From Christoph Hellwig. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits) selftests: net: ip_defrag: ignore EPERM net_failover: fixed rollback in net_failover_open() Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv" Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv" vmxnet3: allow rx flow hash ops only when rss is enabled hinic: add set_channels ethtool_ops support selftests/bpf: Add a default $(CXX) value tools/bpf: Don't use $(COMPILE.c) bpf, selftests: Use bpf_probe_read_kernel s390/bpf: Use bcr 0,%0 as tail call nop filler s390/bpf: Maintain 8-byte stack alignment selftests/bpf: Fix verifier test selftests/bpf: Fix sample_cnt shared between two threads bpf, selftests: Adapt cls_redirect to call csum_level helper bpf: Add csum_level helper for fixing up csum levels bpf: Fix up bpf_skb_adjust_room helper's skb csum setting sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf() crypto/chtls: IPv6 support for inline TLS Crypto/chcr: Fixes a coccinile check error Crypto/chcr: Fixes compilations warnings ...
2020-06-02mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK insteadNeilBrown1-13/+4
After an NFS page has been written it is considered "unstable" until a COMMIT request succeeds. If the COMMIT fails, the page will be re-written. These "unstable" pages are currently accounted as "reclaimable", either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a 'reclaimable' count. This might have made sense when sending the COMMIT required a separate action by the VFS/MM (e.g. releasepage() used to send a COMMIT). However now that all writes generated by ->writepages() will automatically be followed by a COMMIT (since commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete")) it makes more sense to treat them as writeback pages. So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in NR_WRITEBACK and WB_WRITEBACK. A particular effect of this change is that when wb_check_background_flush() calls wb_over_bg_threshold(), the latter will report 'true' a lot less often as the 'unstable' pages are no longer considered 'dirty' (as there is nothing that writeback can do about them anyway). Currently wb_check_background_flush() will trigger writeback to NFS even when there are relatively few dirty pages (if there are lots of unstable pages), this can result in small writes going to the server (10s of Kilobytes rather than a Megabyte) which hurts throughput. With this patch, there are fewer writes which are each larger on average. Where the NR_UNSTABLE_NFS count was included in statistics virtual-files, the entry is retained, but the value is hard-coded as zero. static trace points and warning printks which mentioned this counter no longer report it. [akpm@linux-foundation.org: re-layout comment] [akpm@linux-foundation.org: fix printk warning] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com> Acked-by: Michal Hocko <mhocko@suse.com> [mm] Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLENeilBrown1-8/+33
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a daemon needs to write to one bdi (the final bdi) in order to free up writes queued to another bdi (the client bdi). The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty pages, so that it can still dirty pages after other processses have been throttled. The purpose of this is to avoid deadlock that happen when the PF_LESS_THROTTLE process must write for any dirty pages to be freed, but it is being thottled and cannot write. This approach was designed when all threads were blocked equally, independently on which device they were writing to, or how fast it was. Since that time the writeback algorithm has changed substantially with different threads getting different allowances based on non-trivial heuristics. This means the simple "add 25%" heuristic is no longer reliable. The important issue is not that the daemon needs a *larger* dirty page allowance, but that it needs a *private* dirty page allowance, so that dirty pages for the "client" bdi that it is helping to clear (the bdi for an NFS filesystem or loop block device etc) do not affect the throttling of the daemon writing to the "final" bdi. This patch changes the heuristic so that the task is not throttled when the bdi it is writing to has a dirty page count below below (or equal to) the free-run threshold for that bdi. This ensures it will always be able to have some pages in flight, and so will not deadlock. In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might still be throttled by global threshold, but that is acceptable as it is only the deadlock state that is interesting for this flag. This approach of "only throttle when target bdi is busy" is consistent with the other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes attention to be focussed only on the target bdi. So this patch - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, - removes the 25% bonus that that flag gives, and - If PF_LOCAL_THROTTLE is set, don't delay at all unless the global and the local free-run thresholds are exceeded. Note that previously realtime threads were treated the same as PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for real-time threads, so it is now different from the behaviour of nfsd and loop tasks. I don't know what is wanted for realtime. [akpm@linux-foundation.org: coding style fixes] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Chuck Lever <chuck.lever@oracle.com> [nfsd] Cc: Christoph Hellwig <hch@lst.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/page-writeback.c: remove unused variableChao Yu1-3/+1
Commit 64081362e8ff ("mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock") left unused variable, remove it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200528033740.17269-1-yuchao0@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-27sysctl: pass kernel pointers to ->proc_handlerChristoph Hellwig1-10/+6
Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-02mm/gup/writeback: add callbacks for inaccessible pagesClaudio Imbrenda1-1/+8
With the introduction of protected KVM guests on s390 there is now a concept of inaccessible pages. These pages need to be made accessible before the host can access them. While cpu accesses will trigger a fault that can be resolved, I/O accesses will just fail. We need to add a callback into architecture code for places that will do I/O, namely when writeback is started or when a page reference is taken. This is not only to enable paging, file backing etc, it is also necessary to protect the host against a malicious user space. For example a bad QEMU could simply start direct I/O on such protected memory. We do not want userspace to be able to trigger I/O errors and thus the logic is "whenever somebody accesses that page (gup) or does I/O, make sure that this page can be accessed". When the guest tries to access that page we will wait in the page fault handler for writeback to have finished and for the page_ref to be the expected value. On s390x the function is not supposed to fail, so it is ok to use a WARN_ON on failure. If we ever need some more finegrained handling we can tackle this when we know the details. Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Will Deacon <will@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/page-writeback.c: use VM_BUG_ON_PAGE in clear_page_dirty_for_ioMatthew Wilcox (Oracle)1-1/+1
Dumping the page information in this circumstance helps for debugging. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-7-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/page-writeback.c: write_cache_pages(): deduplicate identical checksMauricio Faria de Oliveira1-4/+4
There used to be a 'retry' label in between the two (identical) checks when first introduced in commit f446daaea9d4 ("mm: implement writeback livelock avoidance using page tagging"), and later modified/updated in commit 6e6938b6d313 ("writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage"). The label has been removed in commit 64081362e8ff ("mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock"), and the (identical) checks are now present / performed immediately one after another. So, remove/deduplicate the latter check, moving tag_pages_for_writeback() into the former check before the 'tag' variable assignment, so it's clear that it's not used in this (similarly-named) function call but only later in pagevec_lookup_range_tag(). Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Link: http://lkml.kernel.org/r/20200218221716.1648-1-mfo@canonical.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-14mm/page-writeback.c: improve arithmetic divisionsWen Yang1-1/+1
Use div64_ul() instead of do_div() if the divisor is unsigned long, to avoid truncation to 32-bit on 64-bit platforms. Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.com Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Qian Cai <cai@lca.pw> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-14mm/page-writeback.c: use div64_ul() for u64-by-unsigned-long divideWen Yang1-2/+2
The two variables 'numerator' and 'denominator', though they are declared as long, they should actually be unsigned long (according to the implementation of the fprop_fraction_percpu() function) And do_div() does a 64-by-32 division, while the divisor 'denominator' is unsigned long, thus 64-bit on 64-bit platforms. Hence the proper function to call is div64_ul(). Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.com Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Qian Cai <cai@lca.pw> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-14mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio()Wen Yang1-2/+2
Patch series "use div64_ul() instead of div_u64() if the divisor is unsigned long". We were first inspired by commit b0ab99e7736a ("sched: Fix possible divide by zero in avg_atom () calculation"), then refer to the recently analyzed mm code, we found this suspicious place. 201 if (min) { 202 min *= this_bw; 203 do_div(min, tot_bw); 204 } And we also disassembled and confirmed it: /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201 0xffffffff811c37da <__wb_calc_thresh+234>: xor %r10d,%r10d 0xffffffff811c37dd <__wb_calc_thresh+237>: test %rax,%rax 0xffffffff811c37e0 <__wb_calc_thresh+240>: je 0xffffffff811c3800 <__wb_calc_thresh+272> /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202 0xffffffff811c37e2 <__wb_calc_thresh+242>: imul %r8,%rax /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203 0xffffffff811c37e6 <__wb_calc_thresh+246>: mov %r9d,%r10d ---> truncates it to 32 bits here 0xffffffff811c37e9 <__wb_calc_thresh+249>: xor %edx,%edx 0xffffffff811c37eb <__wb_calc_thresh+251>: div %r10 0xffffffff811c37ee <__wb_calc_thresh+254>: imul %rbx,%rax 0xffffffff811c37f2 <__wb_calc_thresh+258>: shr $0x2,%rax 0xffffffff811c37f6 <__wb_calc_thresh+262>: mul %rcx 0xffffffff811c37f9 <__wb_calc_thresh+265>: shr $0x2,%rdx 0xffffffff811c37fd <__wb_calc_thresh+269>: mov %rdx,%r10 This series uses div64_ul() instead of div_u64() if the divisor is unsigned long, to avoid truncation to 32-bit on 64-bit platforms. This patch (of 3): The variables 'min' and 'max' are unsigned long and do_div truncates them to 32 bits, which means it can test non-zero and be truncated to zero for division. Fix this issue by using div64_ul() instead. Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com Fixes: 693108a8a667 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware") Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Qian Cai <cai@lca.pw> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-27writeback, memcg: Implement foreign dirty flushingTejun Heo1-0/+4
There's an inherent mismatch between memcg and writeback. The former trackes ownership per-page while the latter per-inode. This was a deliberate design decision because honoring per-page ownership in the writeback path is complicated, may lead to higher CPU and IO overheads and deemed unnecessary given that write-sharing an inode across different cgroups isn't a common use-case. Combined with inode majority-writer ownership switching, this works well enough in most cases but there are some pathological cases. For example, let's say there are two cgroups A and B which keep writing to different but confined parts of the same inode. B owns the inode and A's memory is limited far below B's. A's dirty ratio can rise enough to trigger balance_dirty_pages() sleeps but B's can be low enough to avoid triggering background writeback. A will be slowed down without a way to make writeback of the dirty pages happen. This patch implements foreign dirty recording and foreign mechanism so that when a memcg encounters a condition as above it can trigger flushes on bdi_writebacks which can clean its pages. Please see the comment on top of mem_cgroup_track_foreign_dirty_slowpath() for details. A reproducer follows. write-range.c:: #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <sys/types.h> static const char *usage = "write-range FILE START SIZE\n"; int main(int argc, char **argv) { int fd; unsigned long start, size, end, pos; char *endp; char buf[4096]; if (argc < 4) { fprintf(stderr, usage); return 1; } fd = open(argv[1], O_WRONLY); if (fd < 0) { perror("open"); return 1; } start = strtoul(argv[2], &endp, 0); if (*endp != '\0') { fprintf(stderr, usage); return 1; } size = strtoul(argv[3], &endp, 0); if (*endp != '\0') { fprintf(stderr, usage); return 1; } end = start + size; while (1) { for (pos = start; pos < end; ) { long bread, bwritten = 0; if (lseek(fd, pos, SEEK_SET) < 0) { perror("lseek"); return 1; } bread = read(0, buf, sizeof(buf) < end - pos ? sizeof(buf) : end - pos); if (bread < 0) { perror("read"); return 1; } if (bread == 0) return 0; while (bwritten < bread) { long this; this = write(fd, buf + bwritten, bread - bwritten); if (this < 0) { perror("write"); return 1; } bwritten += this; pos += bwritten; } } } } repro.sh:: #!/bin/bash set -e set -x sysctl -w vm.dirty_expire_centisecs=300000 sysctl -w vm.dirty_writeback_centisecs=300000 sysctl -w vm.dirtytime_expire_seconds=300000 echo 3 > /proc/sys/vm/drop_caches TEST=/sys/fs/cgroup/test A=$TEST/A B=$TEST/B mkdir -p $A $B echo "+memory +io" > $TEST/cgroup.subtree_control echo $((1<<30)) > $A/memory.high echo $((32<<30)) > $B/memory.high rm -f testfile touch testfile fallocate -l 4G testfile echo "Starting B" (echo $BASHPID > $B/cgroup.procs pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) & echo "Waiting 10s to ensure B claims the testfile inode" sleep 5 sync sleep 5 sync echo "Starting A" (echo $BASHPID > $A/cgroup.procs pv < /dev/urandom | ./write-range testfile 0 $((2<<30))) v2: Added comments explaining why the specific intervals are being used. v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort flushing while avoding possible livelocks. v4: Use get_jiffies_64() and time_before/after64() instead of raw jiffies_64 and arthimetic comparisons as suggested by Jan. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-12mm: remove the account_page_dirtied exportChristoph Hellwig1-1/+0
account_page_dirtied() is only used by our set_page_dirty() helpers and should not be used anywhere else. Link: http://lkml.kernel.org/r/20190605183702.30572-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-21treewide: Add SPDX license identifier for missed filesThomas Gleixner1-0/+1
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-14mm/page-writeback: introduce tracepoint for wait_on_page_writeback()Yafang Shao1-0/+12
Recently there have been some hung tasks on our server due to wait_on_page_writeback(), and we want to know the details of this PG_writeback, i.e. this page is writing back to which device. But it is not so convenient to get the details. I think it would be better to introduce a tracepoint for diagnosing the writeback details. Link: http://lkml.kernel.org/r/1556274402-19018-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>