summaryrefslogtreecommitdiff
path: root/fs/overlayfs
AgeCommit message (Collapse)AuthorFilesLines
2023-08-30Merge tag 'ovl-update-6.6' of ↵Linus Torvalds11-68/+580
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs updates from Amir Goldstein: - add verification feature needed by composefs (Alexander Larsson) - improve integration of overlayfs and fanotify (Amir Goldstein) - fortify some overlayfs code (Andrea Righi) * tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: validate superblock in OVL_FS() ovl: make consistent use of OVL_FS() ovl: Kconfig: introduce CONFIG_OVERLAY_FS_DEBUG ovl: auto generate uuid for new overlay filesystems ovl: store persistent uuid/fsid with uuid=on ovl: add support for unique fsid per instance ovl: support encoding non-decodable file handles ovl: Handle verity during copy-up ovl: Validate verity xattr when resolving lowerdata ovl: Add versioned header for overlay.metacopy xattr ovl: Add framework for verity support
2023-08-28Merge tag 'v6.6-vfs.misc' of ↵Linus Torvalds1-8/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual filesystems. Features: - Block mode changes on symlinks and rectify our broken semantics - Report file modifications via fsnotify() for splice - Allow specifying an explicit timeout for the "rootwait" kernel command line option. This allows to timeout and reboot instead of always waiting indefinitely for the root device to show up - Use synchronous fput for the close system call Cleanups: - Get rid of open-coded lockdep workarounds for async io submitters and replace it all with a single consolidated helper - Simplify epoll allocation helper - Convert simple_write_begin and simple_write_end to use a folio - Convert page_cache_pipe_buf_confirm() to use a folio - Simplify __range_close to avoid pointless locking - Disable per-cpu buffer head cache for isolated cpus - Port ecryptfs to kmap_local_page() api - Remove redundant initialization of pointer buf in pipe code - Unexport the d_genocide() function which is only used within core vfs - Replace printk(KERN_ERR) and WARN_ON() with WARN() Fixes: - Fix various kernel-doc issues - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE - Fix a mainly theoretical issue in devpts - Check the return value of __getblk() in reiserfs - Fix a racy assert in i_readcount_dec - Fix integer conversion issues in various functions - Fix LSM security context handling during automounts that prevented NFS superblock sharing" * tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) cachefiles: use kiocb_{start,end}_write() helpers ovl: use kiocb_{start,end}_write() helpers aio: use kiocb_{start,end}_write() helpers io_uring: use kiocb_{start,end}_write() helpers fs: create kiocb_{start,end}_write() helpers fs: add kerneldoc to file_{start,end}_write() helpers io_uring: rename kiocb_end_write() local helper splice: Convert page_cache_pipe_buf_confirm() to use a folio libfs: Convert simple_write_begin and simple_write_end to use a folio fs/dcache: Replace printk and WARN_ON by WARN fs/pipe: remove redundant initialization of pointer buf fs: Fix kernel-doc warnings devpts: Fix kernel-doc warnings doc: idmappings: fix an error and rephrase a paragraph init: Add support for rootwait timeout parameter vfs: fix up the assert in i_readcount_dec fs: Fix one kernel-doc comment docs: filesystems: idmappings: clarify from where idmappings are taken fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing ...
2023-08-28Merge tag 'v6.6-vfs.ctime' of ↵Linus Torvalds4-5/+8
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This adds VFS support for multi-grain timestamps and converts tmpfs, xfs, ext4, and btrfs to use them. This carries acks from all relevant filesystems. The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g., backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This introduces fine-grained timestamps that are used when they are actively queried. This uses the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. As POSIX generally mandates that when the mtime changes, the ctime must also change the kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Various preparatory changes, fixes and cleanups are included: - Fixup all relevant places where POSIX requires updating ctime together with mtime. This is a wide-range of places and all maintainers provided necessary Acks. - Add new accessors for inode->i_ctime directly and change all callers to rely on them. Plain accesses to inode->i_ctime are now gone and it is accordingly rename to inode->__i_ctime and commented as requiring accessors. - Extend generic_fillattr() to pass in a request mask mirroring in a sense the statx() uapi. This allows callers to pass in a request mask to only get a subset of attributes filled in. - Rework timestamp updates so it's possible to drop the @now parameter the update_time() inode operation and associated helpers. - Add inode_update_timestamps() and convert all filesystems to it removing a bunch of open-coding" * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits) btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps tmpfs: add support for multigrain timestamps fs: add infrastructure for multigrain timestamps fs: drop the timespec64 argument from update_time xfs: have xfs_vn_update_time gets its own timestamp fat: make fat_update_time get its own timestamp fat: remove i_version handling from fat_update_time ubifs: have ubifs_update_time use inode_update_timestamps btrfs: have it use inode_update_timestamps fs: drop the timespec64 arg from generic_update_time fs: pass the request_mask to generic_fillattr fs: remove silly warning from current_time gfs2: fix timestamp handling on quota inodes fs: rename i_ctime field to __i_ctime selinux: convert to ctime accessor functions security: convert to ctime accessor functions apparmor: convert to ctime accessor functions sunrpc: convert to ctime accessor functions ...
2023-08-21ovl: use kiocb_{start,end}_write() helpersAmir Goldstein1-8/+2
Use helpers instead of the open coded dance to silence lockdep warnings. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Message-Id: <20230817141337.1025891-7-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-12ovl: validate superblock in OVL_FS()Andrea Righi1-0/+3
When CONFIG_OVERLAY_FS_DEBUG is enabled add an explicit check to make sure that OVL_FS() is always used with a valid overlayfs superblock. Otherwise trigger a WARN_ON_ONCE(). Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: make consistent use of OVL_FS()Andrea Righi8-24/+26
Always use OVL_FS() to retrieve the corresponding struct ovl_fs from a struct super_block. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: Kconfig: introduce CONFIG_OVERLAY_FS_DEBUGAndrea Righi1-0/+9
Provide a Kconfig option to enable extra debugging checks for overlayfs. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: auto generate uuid for new overlay filesystemsAmir Goldstein3-2/+27
Add a new mount option uuid=auto, which is the default. If a persistent UUID xattr is found it is used. Otherwise, an existing ovelrayfs with copied up subdirs in upper dir that was never mounted with uuid=on retains the null UUID. A new overlayfs with no copied up subdirs, generates the persistent UUID on first mount. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: store persistent uuid/fsid with uuid=onAmir Goldstein4-3/+54
With uuid=on, store a persistent uuid in xattr on the upper dir to give the overlayfs instance a persistent identifier. This also makes f_fsid persistent and more reliable for reporting fid info in fanotify events. uuid=on is not supported on non-upper overlayfs or with upper fs that does not support xattrs. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: add support for unique fsid per instanceAmir Goldstein6-13/+53
The legacy behavior of ovl_statfs() reports the f_fsid filled by underlying upper fs. This fsid is not unique among overlayfs instances on the same upper fs. With mount option uuid=on, generate a non-persistent uuid per overlayfs instance and use it as the seed for f_fsid, similar to tmpfs. This is useful for reporting fanotify events with fid info from different instances of overlayfs over the same upper fs. The old behavior of null uuid and upper fs fsid is retained with the mount option uuid=null, which is the default. The mount option uuid=off that disables uuid checks in underlying layers also retains the legacy behavior. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: support encoding non-decodable file handlesAmir Goldstein5-7/+32
When all layers support file handles, we support encoding non-decodable file handles (a.k.a. fid) even with nfs_export=off. When file handles do not need to be decoded, we do not need to copy up redirected lower directories on encode, and we encode also non-indexed upper with lower file handle, so fid will not change on copy up. This enables reporting fanotify events with file handles on overlayfs with default config/mount options. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: Handle verity during copy-upAlexander Larsson3-5/+79
During regular metacopy, if lowerdata file has fs-verity enabled, and the verity option is enabled, we add the digest to the metacopy xattr. If verity is required, and lowerdata does not have fs-verity enabled, fall back to full copy-up (or the generated metacopy would not validate). Signed-off-by: Alexander Larsson <alexl@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: Validate verity xattr when resolving lowerdataAlexander Larsson6-8/+178
The new digest field in the metacopy xattr is used during lookup to record whether the header contained a digest in the OVL_HAS_DIGEST flags. When accessing file data the first time, if OVL_HAS_DIGEST is set, we reload the metadata and check that the source lowerdata inode matches the specified digest in it (according to the enabled verity options). If the verity check passes we store this info in the inode flags as OVL_VERIFIED_DIGEST, so that we can avoid doing it again if the inode remains in memory. The verification is done in ovl_maybe_validate_verity() which needs to be called in the same places as ovl_maybe_lookup_lowerdata(), so there is a new ovl_verify_lowerdata() helper that calls these in the right order, and all current callers of ovl_maybe_lookup_lowerdata() are changed to call it instead. Signed-off-by: Alexander Larsson <alexl@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: Add versioned header for overlay.metacopy xattrAlexander Larsson3-10/+61
Historically overlay.metacopy was a zero-size xattr, and it's existence marked a metacopy file. This change adds a versioned header with a flag field, a length and a digest. The initial use-case of this will be for validating a fs-verity digest, but the flags field could also be used later for other new features. ovl_check_metacopy_xattr() now returns the size of the xattr, emulating a size of OVL_METACOPY_MIN_SIZE for empty xattrs to distinguish it from the no-xattr case. Signed-off-by: Alexander Larsson <alexl@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: Add framework for verity supportAlexander Larsson3-3/+65
This adds the scaffolding (docs, config, mount options) for supporting the new digest field in the metacopy xattr. This contains a fs-verity digest that need to match the fs-verity digest of the lowerdata file. The mount option "verity" specifies how this xattr is handled. If you enable verity ("verity=on") all existing xattrs are validated before use, and during metacopy we generate verity xattr in the upper metacopy file (if the source file has verity enabled). This means later accesses can guarantee that the same data is used. Additionally you can use "verity=require". In this mode all metacopy files must have a valid verity xattr. For this to work metadata copy-up must be able to create a verity xattr (so that later accesses are validated). Therefore, in this mode, if the lower data file doesn't have fs-verity enabled we fall back to a full copy rather than a metacopy. Actual implementation follows in a separate commit. Signed-off-by: Alexander Larsson <alexl@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-11fs: drop the timespec64 argument from update_timeJeff Layton2-2/+2
Now that all of the update_time operations are prepared for it, we can drop the timespec64 argument from the update_time operation. Do that and remove it from some associated functions like inode_update_time and inode_needs_update_time. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-06vfs: get rid of old '->iterate' directory operationLinus Torvalds1-1/+2
All users now just use '->iterate_shared()', which only takes the directory inode lock for reading. Filesystems that never got convered to shared mode now instead use a wrapper that drops the lock, re-takes it in write mode, calls the old function, and then downgrades the lock back to read mode. This way the VFS layer and other callers no longer need to care about filesystems that never got converted to the modern era. The filesystems that use the new wrapper are ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf. Honestly, several of them look like they really could just iterate their directories in shared mode and skip the wrapper entirely, but the point of this change is to not change semantics or fix filesystems that haven't been fixed in the last 7+ years, but to finally get rid of the dual iterators. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-26ovl: Always reevaluate the file signature for IMAEric Snowberg1-1/+1
Commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version") partially closed an IMA integrity issue when directly modifying a file on the lower filesystem. If the overlay file is first opened by a user and later the lower backing file is modified by root, but the extended attribute is NOT updated, the signature validation succeeds with the old original signature. Update the super_block s_iflags to SB_I_IMA_UNVERIFIABLE_SIGNATURE to force signature reevaluation on every file access until a fine grained solution can be found. Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-24overlayfs: convert to ctime accessor functionsJeff Layton2-3/+6
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-64-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-03ovl: move all parameter handling into params.{c,h}Christian Brauner4-564/+581
While initially I thought that we couldn't move all new mount api handling into params.{c,h} it turns out it is possible. So this just moves a good chunk of code out of super.c and into params.{c,h}. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-29Merge tag 'ovl-update-6.5' of ↵Linus Torvalds13-699/+1350
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs update from Amir Goldstein: - fix two NULL pointer deref bugs (Zhihao Cheng) - add support for "data-only" lower layers destined to be used by composefs - port overlayfs to the new mount api (Christian Brauner) * tag 'ovl-update-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: (26 commits) ovl: add Amir as co-maintainer ovl: reserve ability to reconfigure mount options with new mount api ovl: modify layer parameter parsing ovl: port to new mount api ovl: factor out ovl_parse_options() helper ovl: store enum redirect_mode in config instead of a string ovl: pass ovl_fs to xino helpers ovl: clarify ovl_get_root() semantics ovl: negate the ofs->share_whiteout boolean ovl: check type and offset of struct vfsmount in ovl_entry ovl: implement lazy lookup of lowerdata in data-only layers ovl: prepare for lazy lookup of lowerdata inode ovl: prepare to store lowerdata redirect for lazy lowerdata lookup ovl: implement lookup in data-only layers ovl: introduce data-only lower layers ovl: remove unneeded goto instructions ovl: deduplicate lowerdata and lowerstack[] ovl: deduplicate lowerpath and lowerstack[] ovl: move ovl_entry into ovl_inode ovl: factor out ovl_free_entry() and ovl_stack_*() helpers ...
2023-06-26Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds1-1/+22
Pull splice updates from Jens Axboe: "This kills off ITER_PIPE to avoid a race between truncate, iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio with unpinned/unref'ed pages from an O_DIRECT splice read. This causes memory corruption. Instead, we either use (a) filemap_splice_read(), which invokes the buffered file reading code and splices from the pagecache into the pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads into it and then pushes the filled pages into the pipe; or (c) handle it in filesystem-specific code. Summary: - Rename direct_splice_read() to copy_splice_read() - Simplify the calculations for the number of pages to be reclaimed in copy_splice_read() - Turn do_splice_to() into a helper, vfs_splice_read(), so that it can be used by overlayfs and coda to perform the checks on the lower fs - Make vfs_splice_read() jump to copy_splice_read() to handle direct-I/O and DAX - Provide shmem with its own splice_read to handle non-existent pages in the pagecache. We don't want a ->read_folio() as we don't want to populate holes, but filemap_get_pages() requires it - Provide overlayfs with its own splice_read to call down to a lower layer as overlayfs doesn't provide ->read_folio() - Provide coda with its own splice_read to call down to a lower layer as coda doesn't provide ->read_folio() - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs and random files as they just copy to the output buffer and don't splice pages - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation - Make cifs use filemap_splice_read() - Replace pointers to generic_file_splice_read() with pointers to filemap_splice_read() as DIO and DAX are handled in the caller; filesystems can still provide their own alternate ->splice_read() op - Remove generic_file_splice_read() - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read was the only user" * tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits) splice: kdoc for filemap_splice_read() and copy_splice_read() iov_iter: Kill ITER_PIPE splice: Remove generic_file_splice_read() splice: Use filemap_splice_read() instead of generic_file_splice_read() cifs: Use filemap_splice_read() trace: Convert trace/seq to use copy_splice_read() zonefs: Provide a splice-read wrapper xfs: Provide a splice-read wrapper orangefs: Provide a splice-read wrapper ocfs2: Provide a splice-read wrapper ntfs3: Provide a splice-read wrapper nfs: Provide a splice-read wrapper f2fs: Provide a splice-read wrapper ext4: Provide a splice-read wrapper ecryptfs: Provide a splice-read wrapper ceph: Provide a splice-read wrapper afs: Provide a splice-read wrapper 9p: Add splice_read wrapper net: Make sock_splice_read() use copy_splice_read() by default tty, proc, kernfs, random: Use copy_splice_read() ...
2023-06-20ovl: reserve ability to reconfigure mount options with new mount apiChristian Brauner1-7/+18
Using the old mount api to remount an overlayfs superblock via mount(MS_REMOUNT) all mount options will be silently ignored. For example, if you create an overlayfs mount: mount -t overlay overlay -o lowerdir=/mnt/a:/mnt/b,upperdir=/mnt/upper,workdir=/mnt/work /mnt/merged and then issue a remount via: # force mount(8) to use mount(2) export LIBMOUNT_FORCE_MOUNT2=always mount -t overlay overlay -o remount,WOOTWOOT,lowerdir=/DOESNT-EXIST /mnt/merged with completely nonsensical mount options whatsoever it will succeed nonetheless. This prevents us from every changing any mount options we might introduce in the future that could reasonably be changed during a remount. We don't need to carry this issue into the new mount api port. Similar to FUSE we can use the fs_context::oldapi member to figure out that this is a request coming through the legacy mount api. If we detect it we continue silently ignoring all mount options. But for the new mount api we simply report that mount options cannot currently be changed. This will allow us to potentially alter mount properties for new or even old properties. It any case, silently ignoring everything is not something new apis should do. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-20ovl: modify layer parameter parsingChristian Brauner5-258/+520
We ran into issues where mount(8) passed multiple lower layers as one big string through fsconfig(). But the fsconfig() FSCONFIG_SET_STRING option is limited to 256 bytes in strndup_user(). While this would be fixable by extending the fsconfig() buffer I'd rather encourage users to append layers via multiple fsconfig() calls as the interface allows nicely for this. This has also been requested as a feature before. With this port to the new mount api the following will be possible: fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", "/lower1", 0); /* set upper layer */ fsconfig(fs_fd, FSCONFIG_SET_STRING, "upperdir", "/upper", 0); /* append "/lower2", "/lower3", and "/lower4" */ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", ":/lower2:/lower3:/lower4", 0); /* turn index feature on */ fsconfig(fs_fd, FSCONFIG_SET_STRING, "index", "on", 0); /* append "/lower5" */ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", ":/lower5", 0); Specifying ':' would have been rejected so this isn't a regression. And we can't simply use "lowerdir=/lower" to append on top of existing layers as "lowerdir=/lower,lowerdir=/other-lower" would make "/other-lower" the only lower layer so we'd break uapi if we changed this. So the ':' prefix seems a good compromise. Users can choose to specify multiple layers at once or individual layers. A layer is appended if it starts with ":". This requires that the user has already added at least one layer before. If lowerdir is specified again without a leading ":" then all previous layers are dropped and replaced with the new layers. If lowerdir is specified and empty than all layers are simply dropped. An additional change is that overlayfs will now parse and resolve layers right when they are specified in fsconfig() instead of deferring until super block creation. This allows users to receive early errors. It also allows users to actually use up to 500 layers something which was theoretically possible but ended up not working due to the mount option string passed via mount(2) being too large. This also allows a more privileged process to set config options for a lesser privileged process as the creds for fsconfig() and the creds for fsopen() can differ. We could restrict that they match by enforcing that the creds of fsopen() and fsconfig() match but I don't see why that needs to be the case and allows for a good delegation mechanism. Plus, in the future it means we're able to extend overlayfs mount options and allow users to specify layers via file descriptors instead of paths: fsconfig(FSCONFIG_SET_PATH{_EMPTY}, "lowerdir", "lower1", dirfd); /* append */ fsconfig(FSCONFIG_SET_PATH{_EMPTY}, "lowerdir", "lower2", dirfd); /* append */ fsconfig(FSCONFIG_SET_PATH{_EMPTY}, "lowerdir", "lower3", dirfd); /* clear all layers specified until now */ fsconfig(FSCONFIG_SET_STRING, "lowerdir", NULL, 0); This would be especially nice if users create an overlayfs mount on top of idmapped layers or just in general private mounts created via open_tree(OPEN_TREE_CLONE). Those mounts would then never have to appear anywhere in the filesystem. But for now just do the minimal thing. We should probably aim to move more validation into ovl_fs_parse_param() so users get errors before fsconfig(FSCONFIG_CMD_CREATE). But that can be done in additional patches later. This is now also rebased on top of the lazy lowerdata lookup which allows the specificatin of data only layers using the new "::" syntax. The rules are simple. A data only layers cannot be followed by any regular layers and data layers must be preceeded by at least one regular layer. Parsing the lowerdir mount option must change because of this. The original patchset used the old lowerdir parsing function to split a lowerdir mount option string such as: lowerdir=/lower1:/lower2::/lower3::/lower4 simply replacing each non-escaped ":" by "\0". So sequences of non-escaped ":" were counted as layers. For example, the previous lowerdir mount option above would've counted 6 layers instead of 4 and a lowerdir mount option such as: lowerdir="/lower1:/lower2::/lower3::/lower4:::::::::::::::::::::::::::" would be counted as 33 layers. Other than being ugly this didn't matter much because kern_path() would reject the first "\0" layer. However, this overcounting of layers becomes problematic when we base allocations on it where we very much only want to allocate space for 4 layers instead of 33. So the new parsing function rejects non-escaped sequences of colons other than ":" and "::" immediately instead of relying on kern_path(). Link: https://github.com/util-linux/util-linux/issues/2287 Link: https://github.com/util-linux/util-linux/issues/1992 Link: https://bugs.archlinux.org/task/78702 Link: https://lore.kernel.org/linux-unionfs/20230530-klagen-zudem-32c0908c2108@brauner Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19ovl: enable fsnotify events on underlying real filesAmir Goldstein1-2/+2
Overlayfs creates the real underlying files with fake f_path, whose f_inode is on the underlying fs and f_path on overlayfs. Those real files were open with FMODE_NONOTIFY, because fsnotify code was not prapared to handle fsnotify hooks on files with fake path correctly and fanotify would report unexpected event->fd with fake overlayfs path, when the underlying fs was being watched. Teach fsnotify to handle events on the real files, and do not set real files to FMODE_NONOTIFY to allow operations on real file (e.g. open, access, modify, close) to generate async and permission events. Because fsnotify does not have notifications on address space operations, we do not need to worry about ->vm_file not reporting events to a watched overlayfs when users are accessing a mapped overlayfs file. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230615112229.2143178-6-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19fs: use backing_file container for internal files with "fake" f_pathAmir Goldstein1-2/+2
Overlayfs uses open_with_fake_path() to allocate internal kernel files, with a "fake" path - whose f_path is not on the same fs as f_inode. Allocate a container struct backing_file for those internal files, that is used to hold the "fake" ovl path along with the real path. backing_file_real_path() can be used to access the stored real path. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230615112229.2143178-5-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19fs: rename {vfs,kernel}_tmpfile_open()Amir Goldstein1-2/+3
Overlayfs and cachefiles use vfs_open_tmpfile() to open a tmpfile without accounting for nr_files. Rename this helper to kernel_tmpfile_open() to better reflect this helper is used for kernel internal users. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230615112229.2143178-2-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19ovl: port to new mount apiChristian Brauner1-205/+193
We recently ported util-linux to the new mount api. Now the mount(8) tool will by default use the new mount api. While trying hard to fall back to the old mount api gracefully there are still cases where we run into issues that are difficult to handle nicely. Now with mount(8) and libmount supporting the new mount api I expect an increase in the number of bug reports and issues we're going to see with filesystems that don't yet support the new mount api. So it's time we rectify this. When ovl_fill_super() fails before setting sb->s_root, we need to cleanup sb->s_fs_info. The logic is a bit convoluted but tl;dr: If sget_fc() has succeeded fc->s_fs_info will have been transferred to sb->s_fs_info. So by the time ->fill_super()/ovl_fill_super() is called fc->s_fs_info is NULL consequently fs_context->free() won't call ovl_free_fs(). If we fail before sb->s_root() is set then ->put_super() won't be called which would call ovl_free_fs(). IOW, if we fail in ->fill_super() before sb->s_root we have to clean it up. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19ovl: factor out ovl_parse_options() helperAmir Goldstein2-116/+135
For parsing a single mount option. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19ovl: store enum redirect_mode in config instead of a stringAmir Goldstein6-84/+103
Do all the logic to set the mode during mount options parsing and do not keep the option string around. Use a constant_table to translate from enum redirect mode to string in preperation for new mount api option parsing. The mount option "off" is translated to either "follow" or "nofollow", depending on the "redirect_always_follow" build/module config, so in effect, there are only three possible redirect modes. This results in a minor change to the string that is displayed in show_options() - when redirect_dir is enabled by default and the user mounts with the option "redirect_dir=off", instead of displaying the mode "redirect_dir=off" in show_options(), the displayed mode will be either "redirect_dir=follow" or "redirect_dir=nofollow", depending on the value of "redirect_always_follow" build/module config. The displayed mode reflects the effective mode, so mounting overlayfs again with the dispalyed redirect_dir option will result with the same effective and displayed mode. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19ovl: pass ovl_fs to xino helpersAmir Goldstein4-31/+43
Internal ovl methods should use ovl_fs and not sb as much as possible. Use a constant_table to translate from enum xino mode to string in preperation for new mount api option parsing. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19ovl: clarify ovl_get_root() semanticsAmir Goldstein1-1/+3
Change the semantics to take a reference on upperdentry instead of transferrig the reference. This is needed for upcoming port to new mount api. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19ovl: negate the ofs->share_whiteout booleanAmir Goldstein3-7/+4
The default common case is that whiteout sharing is enabled. Change to storing the negated no_shared_whiteout state, so we will not need to initialize it. This is the first step towards removing all config and feature initializations out of ovl_fill_super(). Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19ovl: check type and offset of struct vfsmount in ovl_entryChristian Brauner1-0/+9
Porting overlayfs to the new amount api I started experiencing random crashes that couldn't be explained easily. So after much debugging and reasoning it became clear that struct ovl_entry requires the point to struct vfsmount to be the first member and of type struct vfsmount. During the port I added a new member at the beginning of struct ovl_entry which broke all over the place in the form of random crashes and cache corruptions. While there's a comment in ovl_free_fs() to the effect of "Hack! Reuse ofs->layers as a vfsmount array before freeing it" there's no such comment on struct ovl_entry which makes this easy to trip over. Add a comment and two static asserts for both the offset and the type of pointer in struct ovl_entry. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19ovl: implement lazy lookup of lowerdata in data-only layersAmir Goldstein7-14/+107
Defer lookup of lowerdata in the data-only layers to first data access or before copy up. We perform lowerdata lookup before copy up even if copy up is metadata only copy up. We can further optimize this lookup later if needed. We do best effort lazy lookup of lowerdata for d_real_inode(), because this interface does not expect errors. The only current in-tree caller of d_real_inode() is trace_uprobe and this caller is likely going to be followed reading from the file, before placing uprobes on offset within the file, so lowerdata should be available when setting the uprobe. Tested-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: prepare for lazy lookup of lowerdata inodeAmir Goldstein4-5/+33
Make the code handle the case of numlower > 1 and missing lowerdata dentry gracefully. Missing lowerdata dentry is an indication for lazy lookup of lowerdata and in that case the lowerdata_redirect path is stored in ovl_inode. Following commits will defer lookup and perform the lazy lookup on access. Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: prepare to store lowerdata redirect for lazy lowerdata lookupAmir Goldstein6-4/+24
Prepare to allow ovl_lookup() to leave the last entry in a non-dir lowerstack empty to signify lazy lowerdata lookup. In this case, ovl_lookup() stores the redirect path from metacopy to lowerdata in ovl_inode, which is going to be used later to perform the lazy lowerdata lookup. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: implement lookup in data-only layersAmir Goldstein1-1/+72
Lookup in data-only layers only for a lower metacopy with an absolute redirect xattr. The metacopy xattr is not checked on files found in the data-only layers and redirect xattr are not followed in the data-only layers. Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: introduce data-only lower layersAmir Goldstein3-8/+72
Introduce the format lowerdir=lower1:lower2::lowerdata1::lowerdata2 where the lower layers on the right of the :: separators are not merged into the overlayfs merge dirs. Data-only lower layers are only allowed at the bottom of the stack. The files in those layers are only meant to be accessible via absolute redirect from metacopy files in lower layers. Following changes will implement lookup in the data layers. This feature was requested for composefs ostree use case, where the lower data layer should only be accessiable via absolute redirects from metacopy inodes. The lower data layers are not required to a have a unique uuid or any uuid at all, because they are never used to compose the overlayfs inode st_ino/st_dev. Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: remove unneeded goto instructionsAmir Goldstein1-12/+9
There is nothing in the out goto target of ovl_get_layers(). Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: deduplicate lowerdata and lowerstack[]Amir Goldstein6-17/+22
The ovl_inode contains a copy of lowerdata in lowerstack[], so the lowerdata inode member can be removed. Use accessors ovl_lowerdata*() to get the lowerdata whereever the member was accessed directly. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: deduplicate lowerpath and lowerstack[]Amir Goldstein7-21/+13
The ovl_inode contains a copy of lowerpath in lowerstack[0], so the lowerpath member can be removed. Use accessor ovl_lowerpath() to get the lowerpath whereever the member was accessed directly. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: move ovl_entry into ovl_inodeAmir Goldstein8-62/+57
The lower stacks of all the ovl inode aliases should be identical and there is redundant information in ovl_entry and ovl_inode. Move lowerstack into ovl_inode and keep only the OVL_E_FLAGS per overlay dentry. Following patches will deduplicate redundant ovl_inode fields. Note that for pure upper and negative dentries, OVL_E(dentry) may be NULL now, so it is imporatnt to use the ovl_numlower() accessor. Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: factor out ovl_free_entry() and ovl_stack_*() helpersAmir Goldstein5-22/+46
In preparation for moving lowerstack into ovl_inode. Note that in ovl_lookup() the temp stack dentry refs are now cloned into the final ovl_lowerstack instead of being transferred, so cleanup always needs to call ovl_stack_free(stack). Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: use ovl_numlower() and ovl_lowerstack() accessorsAmir Goldstein5-52/+73
This helps fortify against dereferencing a NULL ovl_entry, before we move the ovl_entry reference into ovl_inode. Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: use OVL_E() and OVL_E_FLAGS() accessorsAmir Goldstein5-16/+21
Instead of open coded instances, because we are about to split the two apart. Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: update of dentry revalidate flags after copy upAmir Goldstein7-13/+30
After copy up, we may need to update d_flags if upper dentry is on a remote fs and lower dentries are not. Add helpers to allow incremental update of the revalidate flags. Fixes: bccece1ead36 ("ovl: allow remote upper") Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: fix null pointer dereference in ovl_get_acl_rcu()Zhihao Cheng1-6/+6
Following process: P1 P2 path_openat link_path_walk may_lookup inode_permission(rcu) ovl_permission acl_permission_check check_acl get_cached_acl_rcu ovl_get_inode_acl realinode = ovl_inode_real(ovl_inode) drop_cache __dentry_kill(ovl_dentry) iput(ovl_inode) ovl_destroy_inode(ovl_inode) dput(oi->__upperdentry) dentry_kill(upperdentry) dentry_unlink_inode upperdentry->d_inode = NULL ovl_inode_upper upperdentry = ovl_i_dentry_upper(ovl_inode) d_inode(upperdentry) // returns NULL IS_POSIXACL(realinode) // NULL pointer dereference , will trigger an null pointer dereference at realinode: [ 205.472797] BUG: kernel NULL pointer dereference, address: 0000000000000028 [ 205.476701] CPU: 2 PID: 2713 Comm: ls Not tainted 6.3.0-12064-g2edfa098e750-dirty #1216 [ 205.478754] RIP: 0010:do_ovl_get_acl+0x5d/0x300 [ 205.489584] Call Trace: [ 205.489812] <TASK> [ 205.490014] ovl_get_inode_acl+0x26/0x30 [ 205.490466] get_cached_acl_rcu+0x61/0xa0 [ 205.490908] generic_permission+0x1bf/0x4e0 [ 205.491447] ovl_permission+0x79/0x1b0 [ 205.491917] inode_permission+0x15e/0x2c0 [ 205.492425] link_path_walk+0x115/0x550 [ 205.493311] path_lookupat.isra.0+0xb2/0x200 [ 205.493803] filename_lookup+0xda/0x240 [ 205.495747] vfs_fstatat+0x7b/0xb0 Fetch a reproducer in [Link]. Use the helper ovl_i_path_realinode() to get realinode and then do non-nullptr checking. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217404 Fixes: 332f606b32b6 ("ovl: enable RCU'd ->get_acl()") Cc: <stable@vger.kernel.org> # v5.15 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Suggested-by: Christian Brauner <brauner@kernel.org> Suggested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: fix null pointer dereference in ovl_permission()Zhihao Cheng1-3/+2
Following process: P1 P2 path_lookupat link_path_walk inode_permission ovl_permission ovl_i_path_real(inode, &realpath) path->dentry = ovl_i_dentry_upper(inode) drop_cache __dentry_kill(ovl_dentry) iput(ovl_inode) ovl_destroy_inode(ovl_inode) dput(oi->__upperdentry) dentry_kill(upperdentry) dentry_unlink_inode upperdentry->d_inode = NULL realinode = d_inode(realpath.dentry) // return NULL inode_permission(realinode) inode->i_sb // NULL pointer dereference , will trigger an null pointer dereference at realinode: [ 335.664979] BUG: kernel NULL pointer dereference, address: 0000000000000002 [ 335.668032] CPU: 0 PID: 2592 Comm: ls Not tainted 6.3.0 [ 335.669956] RIP: 0010:inode_permission+0x33/0x2c0 [ 335.678939] Call Trace: [ 335.679165] <TASK> [ 335.679371] ovl_permission+0xde/0x320 [ 335.679723] inode_permission+0x15e/0x2c0 [ 335.680090] link_path_walk+0x115/0x550 [ 335.680771] path_lookupat.isra.0+0xb2/0x200 [ 335.681170] filename_lookup+0xda/0x240 [ 335.681922] vfs_statx+0xa6/0x1f0 [ 335.682233] vfs_fstatat+0x7b/0xb0 Fetch a reproducer in [Link]. Use the helper ovl_i_path_realinode() to get realinode and then do non-nullptr checking. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217405 Fixes: 4b7791b2e958 ("ovl: handle idmappings in ovl_permission()") Cc: <stable@vger.kernel.org> # v5.19 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Suggested-by: Christian Brauner <brauner@kernel.org> Suggested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19ovl: let helper ovl_i_path_real() return the realinodeZhihao Cheng2-4/+5
Let helper ovl_i_path_real() return the realinode to prepare for checking non-null realinode in RCU walking path. [msz] Use d_inode_rcu() since we are depending on the consitency between dentry and inode being non-NULL in an RCU setting. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Fixes: ffa5723c6d25 ("ovl: store lower path in ovl_inode") Cc: <stable@vger.kernel.org> # v5.19 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>