summaryrefslogtreecommitdiff
path: root/fs/erofs
AgeCommit message (Collapse)AuthorFilesLines
2022-11-15Merge tag 'erofs-for-6.1-rc6-fixes' of ↵Linus Torvalds5-37/+54
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "Most patches randomly fix error paths or corner cases in fscache mode reported recently. One fixes an invalid access relating to fragments on crafted images. Summary: - Fix packed_inode invalid access when reading fragments on crafted images - Add a missing erofs_put_metabuf() in an error path in fscache mode - Fix incorrect `count' for unmapped extents in fscache mode - Fix use-after-free of fsid and domain_id string when remounting - Fix missing xas_retry() in fscache mode" * tag 'erofs-for-6.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix missing xas_retry() in fscache mode erofs: fix use-after-free of fsid and domain_id string erofs: get correct count for unmapped range in fscache mode erofs: put metabuf in error path in fscache mode erofs: fix general protection fault when reading fragment
2022-11-14erofs: fix missing xas_retry() in fscache modeJingbo Xu1-3/+7
The xarray iteration only holds the RCU read lock and thus may encounter XA_RETRY_ENTRY if there's process modifying the xarray concurrently. This will cause oops when referring to the invalid entry. Fix this by adding the missing xas_retry(), which will make the iteration wind back to the root node if XA_RETRY_ENTRY is encountered. Fixes: d435d53228dd ("erofs: change to use asynchronous io for fscache readpage/readahead") Suggested-by: David Howells <dhowells@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20221114121943.29987-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-11-10erofs: fix use-after-free of fsid and domain_id stringJingbo Xu4-30/+37
When erofs instance is remounted with fsid or domain_id mount option specified, the original fsid and domain_id string pointer in sbi->opt is directly overridden with the fsid and domain_id string in the new fs_context, without freeing the original fsid and domain_id string. What's worse, when the new fsid and domain_id string is transferred to sbi, they are not reset to NULL in fs_context, and thus they are freed when remount finishes, while sbi is still referring to these strings. Reconfiguration for fsid and domain_id seems unusual. Thus clarify this restriction explicitly and dump a warning when users are attempting to do this. Besides, to fix the use-after-free issue, move fsid and domain_id from erofs_mount_opts to outside. Fixes: c6be2bd0a5dd ("erofs: register fscache volume") Fixes: 8b7adf1dff3d ("erofs: introduce fscache-based domain") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221021023153.1330-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-11-09fs: fix leaked psi pressure stateJohannes Weiner1-7/+11
When psi annotations were added to to btrfs compression reads, the psi state tracking over add_ra_bio_pages and btrfs_submit_compressed_read was faulty. A pressure state, once entered, is never left. This results in incorrectly elevated pressure, which triggers OOM kills. pflags record the *previous* memstall state when we enter a new one. The code tried to initialize pflags to 1, and then optimize the leave call when we either didn't enter a memstall, or were already inside a nested stall. However, there can be multiple PageWorkingset pages in the bio, at which point it's that path itself that enters repeatedly and overwrites pflags. This causes us to miss the exit. Enter the stall only once if needed, then unwind correctly. erofs has the same problem, fix that up too. And move the memstall exit past submit_bio() to restore submit accounting originally added by b8e24a9300b0 ("block: annotate refault stalls from IO submission"). Link: https://lkml.kernel.org/r/Y2UHRqthNUwuIQGS@cmpxchg.org Fixes: 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads") Fixes: 99486c511f68 ("erofs: add manual PSI accounting for the compressed address space") Fixes: 118f3663fbc6 ("block: remove PSI accounting from the bio layer") Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/ Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Tested-by: Thorsten Leemhuis <linux@leemhuis.info> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Sterba <dsterba@suse.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08erofs: get correct count for unmapped range in fscache modeJingbo Xu1-3/+4
For unmapped range, the returned map.m_llen is zero, and thus the calculated count is unexpected zero. Prior to the refactoring introduced by commit 1ae9470c3e14 ("erofs: clean up .read_folio() and .readahead() in fscache mode"), only the readahead routine suffers from this. With the refactoring of making .read_folio() and .readahead() calling one common routine, both read_folio and readahead have this issue now. Fix this by calculating count separately in unmapped condition. Fixes: c665b394b9e8 ("erofs: implement fscache-based data readahead") Fixes: 1ae9470c3e14 ("erofs: clean up .read_folio() and .readahead() in fscache mode") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221104054028.52208-3-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-11-08erofs: put metabuf in error path in fscache modeJingbo Xu1-1/+3
For tail packing layout, put metabuf when error is encountered. Fixes: 1ae9470c3e14 ("erofs: clean up .read_folio() and .readahead() in fscache mode") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221104054028.52208-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-11-08erofs: fix general protection fault when reading fragmentYue Hu1-0/+3
As syzbot reported [1], the fragment feature sb flag is not set, so packed_inode != NULL needs to be checked in z_erofs_read_fragment(). [1] https://lore.kernel.org/all/0000000000002e7a8905eb841ddd@google.com/ Reported-by: syzbot+3faecbfd845a895c04cb@syzkaller.appspotmail.com Fixes: b15b2e307c3a ("erofs: support on-disk compressed fragments data") Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221021085325.25788-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-10-17erofs: protect s_inodes with s_inode_list_lock for fscacheDawei Li1-0/+3
s_inodes is superblock-specific resource, which should be protected by sb's specific lock s_inode_list_lock. Link: https://lore.kernel.org/r/TYCP286MB23238380DE3B74874E8D78ABCA299@TYCP286MB2323.JPNP286.PROD.OUTLOOK.COM Fixes: 7d41963759fe ("erofs: Support sharing cookies in the same domain") Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Dawei Li <set_pte_at@outlook.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-10-17erofs: fix up inplace decompression success rateGao Xiang1-5/+4
Partial decompression should be checked after updating length. It's a new regression when introducing multi-reference pclusters. Fixes: 2bfab9c0edac ("erofs: record the longest decompressed size in this round") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20221014064915.8103-1-hsiangkao@linux.alibaba.com
2022-10-17erofs: shouldn't churn the mapping page for duplicated copiesGao Xiang2-8/+6
If other duplicated copies exist in one decompression shot, should leave the old page as is rather than replace it with the new duplicated one. Otherwise, the following cold path to deal with duplicated copies will use the invalid bvec. It impacts compressed data deduplication. Also, shift the onlinepage EIO bit to avoid touching the signed bit. Fixes: 267f2492c8f7 ("erofs: introduce multi-reference pclusters (fully-referenced)") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20221012045056.13421-1-hsiangkao@linux.alibaba.com
2022-10-17erofs: fix illegal unmapped accesses in z_erofs_fill_inode_lazy()Yue Hu1-12/+10
Note that we are still accessing 'h_idata_size' and 'h_fragmentoff' after calling erofs_put_metabuf(), that is not correct. Fix it. Fixes: ab92184ff8f1 ("erofs: add on-disk compressed tail-packing inline support") Fixes: b15b2e307c3a ("erofs: support on-disk compressed fragments data") Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221005013528.62977-1-zbestahu@163.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-10-07Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds1-1/+12
Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quisce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups to being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ...
2022-09-27erofs: clean up erofs_iget()Gao Xiang4-22/+14
isdir indicated REQ_META|REQ_PRIO which no longer works now. Get rid of isdir entirely. Link: https://lore.kernel.org/r/20220927063607.54832-2-hsiangkao@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-27erofs: clean up unnecessary code and commentsGao Xiang4-16/+2
Some conditional macros and comments are useless. Link: https://lore.kernel.org/r/20220927063607.54832-1-hsiangkao@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-27erofs: fold in z_erofs_reload_indexes()Yue Hu1-20/+8
The name of this function looks not very accurate compared to it's implementation and it's only a wrapper to erofs_read_metabuf(). So, let's fold it directly instead. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220927032518.25266-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-26erofs: introduce partial-referenced pclustersGao Xiang7-2/+23
Due to deduplication for compressed data, pclusters can be partially referenced with their prefixes. Together with the user-space implementation, it enables EROFS variable-length global compressed data deduplication with rolling hash. Link: https://lore.kernel.org/r/20220923014915.4362-1-hsiangkao@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-26erofs: support on-disk compressed fragments dataYue Hu6-17/+152
Introduce on-disk compressed fragments data feature. This approach adds a new field called `h_fragmentoff' in the per-file compression header to indicate the fragment offset of each tail pcluster or the whole file in the special packed inode. Similar to ztailpacking, it will also find and record the 'headlcn' of the tail pcluster when initializing per-inode zmap for making follow-on requests more easy. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/YzHKxcFTlHGgXeH9@B-P7TQMD6M-0146.local Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-23erofs: support interlaced uncompressed data for compressed filesYue Hu4-23/+41
Currently, uncompressed data is all handled in the shifted way, which means we have to shift the whole on-disk plain pcluster to get the logical data. However, since we are also using in-place I/O for uncompressed data, data copy will be reduced a lot if pcluster is recorded in the interlaced way as illustrated below: _______________________________________________________________ | | | |_ tail part |_ head part _| |<- blk0 ->| .. |<- blkn-2 ->|<- blkn-1 ->| The logical data then becomes: ________________________________________________________ |_ head part _|_ blk0 _| .. |_ blkn-2 _|_ tail part _| In addition, non-4k plain pclusters are also survived by the interlaced way, which can be used for non-4k lclusters as well. However, it's almost impossible to de-duplicate uncompressed data in the interlaced way, therefore shifted uncompressed data is still useful. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/8369112678604fdf4ef796626d59b1fdd0745a53.1663898962.git.huyue2@coolpad.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-23erofs: clean up .read_folio() and .readahead() in fscache modeJingbo Xu1-130/+83
The implementation of these two functions in fscache mode is almost the same. Extract the same part as a generic helper to remove the code duplication. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Link: https://lore.kernel.org/r/20220922062414.20437-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-20erofs: add manual PSI accounting for the compressed address spaceChristoph Hellwig1-1/+12
erofs uses an additional address space for compressed data read from disk in addition to the one directly associated with the inode. Reading into the lower address space is open coded using add_to_page_cache_lru instead of using the filemap.c helper for page allocation micro-optimizations, which means it is not covered by the MM PSI annotations for ->read_folio and ->readahead, so add manual ones instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220915094200.139713-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20erofs: introduce 'domain_id' mount optionJia Zhu2-2/+34
Introduce 'domain_id' mount option to enable shared domain sementics. In which case, the related cookie is shared if two mountpoints in the same domain have the same data blob. Users could specify the name of domain by this mount option. Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220918043456.147-7-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-20erofs: Support sharing cookies in the same domainJia Zhu2-6/+96
Several erofs filesystems can belong to one domain, and data blobs can be shared among these erofs filesystems of same domain. Users could specify domain_id mount option to create or join into a domain. Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220918110150.6338-1-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-20erofs: introduce a pseudo mnt to manage shared cookiesJia Zhu3-2/+45
Use a pseudo mnt to manage shared cookies. Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220918043456.147-5-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-20erofs: introduce fscache-based domainJia Zhu2-17/+121
A new fscache-based shared domain mode is going to be introduced for erofs. In which case, same data blobs in same domain will be shared and reused to reduce on-disk space usage. The implementation of sharing blobs will be introduced in subsequent patches. Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220918043456.147-4-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-20erofs: code clean up for fscacheJia Zhu3-43/+36
Some cleanups. No logic changes. Suggested-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220918043456.147-3-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-20erofs: use kill_anon_super() to kill super in fscache modeJia Zhu1-1/+1
Use kill_anon_super() instead of generic_shutdown_super() since the mount() in erofs fscache mode uses get_tree_nodev() and associated anon bdev needs to be freed. Fixes: 9c0cc9c729657 ("erofs: add 'fsid' mount option") Suggested-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220918043456.147-2-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-20erofs: fix order >= MAX_ORDER warning due to crafted negative i_sizeGao Xiang1-1/+1
As syzbot reported [1], the root cause is that i_size field is a signed type, and negative i_size is also less than EROFS_BLKSIZ. As a consequence, it's handled as fast symlink unexpectedly. Let's fall back to the generic path to deal with such unusual i_size. [1] https://lore.kernel.org/r/000000000000ac8efa05e7feaa1f@google.com Reported-by: syzbot+f966c13b1b4fc0403b19@syzkaller.appspotmail.com Fixes: 431339ba9042 ("staging: erofs: add inode operations") Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20220909023948.28925-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-05erofs: fix pcluster use-after-free on UP platformsGao Xiang1-29/+0
During stress testing with CONFIG_SMP disabled, KASAN reports as below: ================================================================== BUG: KASAN: use-after-free in __mutex_lock+0xe5/0xc30 Read of size 8 at addr ffff8881094223f8 by task stress/7789 CPU: 0 PID: 7789 Comm: stress Not tainted 6.0.0-rc1-00002-g0d53d2e882f9 #3 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <TASK> .. __mutex_lock+0xe5/0xc30 .. z_erofs_do_read_page+0x8ce/0x1560 .. z_erofs_readahead+0x31c/0x580 .. Freed by task 7787 kasan_save_stack+0x1e/0x40 kasan_set_track+0x20/0x30 kasan_set_free_info+0x20/0x40 __kasan_slab_free+0x10c/0x190 kmem_cache_free+0xed/0x380 rcu_core+0x3d5/0xc90 __do_softirq+0x12d/0x389 Last potentially related work creation: kasan_save_stack+0x1e/0x40 __kasan_record_aux_stack+0x97/0xb0 call_rcu+0x3d/0x3f0 erofs_shrink_workstation+0x11f/0x210 erofs_shrink_scan+0xdc/0x170 shrink_slab.constprop.0+0x296/0x530 drop_slab+0x1c/0x70 drop_caches_sysctl_handler+0x70/0x80 proc_sys_call_handler+0x20a/0x2f0 vfs_write+0x555/0x6c0 ksys_write+0xbe/0x160 do_syscall_64+0x3b/0x90 The root cause is that erofs_workgroup_unfreeze() doesn't reset to orig_val thus it causes a race that the pcluster reuses unexpectedly before freeing. Since UP platforms are quite rare now, such path becomes unnecessary. Let's drop such specific-designed path directly instead. Fixes: 73f5c66df3e2 ("staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'") Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20220902045710.109530-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-05erofs: avoid the potentially wrong m_plen for big pclusterYue Hu1-8/+8
Actually, 'compressedlcs' stores compressed block count rather than lcluster count. Therefore, the number of bits for shifting the count should be 'LOG_BLOCK_SIZE' rather than 'lclusterbits' although current lcluster size is 4K. The value of 'm_plen' will be wrong once we enable the non 4K-sized lcluster. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20220812060150.8510-1-huyue2@coolpad.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-09-05erofs: fix error return code in erofs_fscache_{meta_,}read_folioSun Ke1-2/+6
If erofs_fscache_alloc_request fail and then goto out, it will return 0. it should return a negative error code instead of 0. Fixes: d435d53228dd ("erofs: change to use asynchronous io for fscache readpage/readahead") Signed-off-by: Sun Ke <sunke32@huawei.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20220815034829.3940803-1-sunke32@huawei.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-08-06Merge tag 'mm-stable-2022-08-03' of ↵Linus Torvalds2-5/+7
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
2022-07-31erofs: update ctx->pos for every emitted direntHongnan Li1-9/+7
erofs_readdir update ctx->pos after filling a batch of dentries and it may cause dir/files duplication for NFS readdirplus which depends on ctx->pos to fill dir correctly. So update ctx->pos for every emitted dirent in erofs_fill_dentries to fix it. Also fix the update of ctx->pos when the initial file position has exceeded nameoff. Fixes: 3e917cc305c6 ("erofs: make filesystem exportable") Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20220722082732.30935-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-07-22erofs: get rid of the leftover PAGE_SIZE in dir.cGao Xiang1-2/+2
Convert the last hardcoded PAGE_SIZEs of uncompressed cases. Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220619150940.121005-1-hsiangkao@linux.alibaba.com
2022-07-22erofs: get rid of erofs_prepare_dio() helperGao Xiang1-24/+15
Fold in erofs_prepare_dio() in order to simplify the code. Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220720082229.12172-1-hsiangkao@linux.alibaba.com
2022-07-22erofs: introduce multi-reference pclusters (fully-referenced)Gao Xiang4-56/+93
Let's introduce multi-reference pclusters at runtime. In details, if one pcluster is requested by multiple extents at almost the same time (even belong to different files), the longest extent will be decompressed as representative and the other extents are actually copied from the longest one in one round. After this patch, fully-referenced extents can be correctly handled and the full decoding check needs to be bypassed for partial-referenced extents. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-17-hsiangkao@linux.alibaba.com
2022-07-21erofs: record the longest decompressed size in this roundGao Xiang2-58/+31
Currently, `pcl->length' records the longest decompressed length as long as the pcluster itself isn't reclaimed. However, such number is unneeded for the general cases since it doesn't indicate the exact decompressed size in this round. Instead, let's record the decompressed size for this round instead, thus `pcl->nr_pages' can be completely dropped and pageofs_out is also designed to be kept in sync with `pcl->length'. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-16-hsiangkao@linux.alibaba.com
2022-07-21erofs: introduce z_erofs_do_decompressed_bvec()Gao Xiang1-27/+22
Both out_bvecs and in_bvecs share the common logic for decompressed buffers. So let's make a helper for this. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-15-hsiangkao@linux.alibaba.com
2022-07-21erofs: try to leave (de)compressed_pages on stack if possibleGao Xiang1-13/+21
For the most cases, small pclusters can be decompressed with page arrays on stack. Try to leave both (de)compressed_pages on stack if possible as before. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-14-hsiangkao@linux.alibaba.com
2022-07-21erofs: introduce struct z_erofs_decompress_backendGao Xiang2-67/+76
Let's introduce struct z_erofs_decompress_backend in order to pass on the decompression backend context between helper functions more easier and avoid too many arguments. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-13-hsiangkao@linux.alibaba.com
2022-07-21erofs: get rid of `z_pagemap_global'Gao Xiang2-25/+4
In order to introduce multi-reference pclusters for compressed data deduplication, let's get rid of the global page array for now since it needs to be re-designed then at least. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-12-hsiangkao@linux.alibaba.com
2022-07-21erofs: clean up `enum z_erofs_collectmode'Gao Xiang1-32/+31
`enum z_erofs_collectmode' is really ambiguous, but I'm not quite sure if there are better naming, basically it's used to judge whether inplace I/O can be used due to the current status of pclusters in the chain. Rename it as `enum z_erofs_pclustermode' instead. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-11-hsiangkao@linux.alibaba.com
2022-07-21erofs: get rid of `enum z_erofs_page_type'Gao Xiang1-25/+5
Remove it since pagevec[] is no longer used. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-10-hsiangkao@linux.alibaba.com
2022-07-21erofs: rework online page handlingGao Xiang2-83/+42
Since all decompressed offsets have been integrated to bvecs[], this patch avoids all sub-indexes so that page->private only includes a part count and an eio flag, thus in the future folio->private can have the same meaning. In addition, PG_error will not be used anymore after this patch and we're heading to use page->private (later folio->private) and page->mapping (later folio->mapping) only. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-9-hsiangkao@linux.alibaba.com
2022-07-21erofs: switch compressed_pages[] to bufvecGao Xiang2-60/+57
Convert compressed_pages[] to bufvec in order to avoid using page->private to keep onlinepage_index (decompressed offset) for inplace I/O pages. In the future, we only rely on folio->private to keep a countdown to unlock folios and set folio_uptodate. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-8-hsiangkao@linux.alibaba.com
2022-07-21erofs: introduce `z_erofs_parse_in_bvecs'Gao Xiang1-52/+80
`z_erofs_decompress_pcluster()' is too long therefore it'd be better to introduce another helper to parse compressed pages (or laterly, compressed bvecs.) BTW, since `compressed_bvecs' is too long as a part of the function name, `in_bvecs' is used here instead. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-7-hsiangkao@linux.alibaba.com
2022-07-21erofs: drop the old pagevec approachGao Xiang3-171/+18
Remove the old pagevec approach but keep z_erofs_page_type for now. It will be reworked in the following commits as well. Also rename Z_EROFS_NR_INLINE_PAGEVECS as Z_EROFS_INLINE_BVECS with the new value 2 since it's actually enough to bootstrap. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-6-hsiangkao@linux.alibaba.com
2022-07-21erofs: introduce bufvec to store decompressed buffersGao Xiang2-50/+153
For each pcluster, the total compressed buffers are determined in advance, yet the number of decompressed buffers actually vary. Too many decompressed pages can be recorded if one pcluster is highly compressed or its pcluster size is large. That takes extra memory footprints compared to uncompressed filesystems, especially a lot of I/O in flight on low-ended devices. Therefore, similar to inplace I/O, pagevec was introduced to reuse page cache to store these pointers in the time-sharing way since these pages are actually unused before decompressing. In order to make it more flexable, a cleaner bufvec is used to replace the old pagevec stuffs so that - Decompressed offsets can be stored inline, thus it can be used for the upcoming feature like compressed data deduplication. It's calculated by `page_offset(page) - map->m_la'; - Towards supporting large folios for compressed inodes since our final goal is to completely avoid page->private but use folio->private only for all page cache pages. Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-5-hsiangkao@linux.alibaba.com
2022-07-21erofs: introduce `z_erofs_parse_out_bvecs()'Gao Xiang1-38/+43
`z_erofs_decompress_pcluster()' is too long therefore it'd be better to introduce another helper to parse decompressed pages (or laterly, decompressed bvecs.) BTW, since `decompressed_bvecs' is too long as a part of the function name, `out_bvecs' is used instead. Reviewed-by: Yue Hu <huyue2@coolpad.com> Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-4-hsiangkao@linux.alibaba.com
2022-07-21erofs: clean up z_erofs_collector_begin()Gao Xiang1-17/+15
Rearrange the code and get rid of all gotos. Reviewed-by: Yue Hu <huyue2@coolpad.com> Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-3-hsiangkao@linux.alibaba.com
2022-07-21erofs: get rid of unneeded `inode', `map' and `sb'Gao Xiang1-23/+19
Since commit 5c6dcc57e2e5 ("erofs: get rid of `struct z_erofs_collector'"), these arguments can be dropped as well. No logic changes. Reviewed-by: Yue Hu <huyue2@coolpad.com> Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220715154203.48093-2-hsiangkao@linux.alibaba.com