summaryrefslogtreecommitdiff
path: root/fs/fuse/inode.c
AgeCommit message (Collapse)AuthorFilesLines
2021-05-01Merge tag 'fuse-update-5.13' of ↵Linus Torvalds1-2/+5
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Fix a page locking bug in write (introduced in 2.6.26) - Allow sgid bit to be killed in setacl() - Miscellaneous fixes and cleanups * tag 'fuse-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: cuse: simplify refcount cuse: prevent clone virtiofs: fix userns virtiofs: remove useless function virtiofs: split requests that exceed virtqueue size virtiofs: fix memory leak in virtio_fs_probe() fuse: invalidate attrs when page writeback completes fuse: add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGID fuse: extend FUSE_SETXATTR request fuse: fix matching of FUSE_DEV_IOC_CLONE command fuse: fix a typo fuse: don't zero pages twice fuse: fix typo for fuse_conn.max_pages comment fuse: fix write deadlock
2021-04-14virtiofs: split requests that exceed virtqueue sizeConnor Kuehl1-1/+2
If an incoming FUSE request can't fit on the virtqueue, the request is placed onto a workqueue so a worker can try to resubmit it later where there will (hopefully) be space for it next time. This is fine for requests that aren't larger than a virtqueue's maximum capacity. However, if a request's size exceeds the maximum capacity of the virtqueue (even if the virtqueue is empty), it will be doomed to a life of being placed on the workqueue, removed, discovered it won't fit, and placed on the workqueue yet again. Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect Descriptors) of the virtio spec: "A driver MUST NOT create a descriptor chain longer than the Queue Size of the device." To fix this, limit the number of pages FUSE will use for an overall request. This way, each request can realistically fit on the virtqueue when it is decomposed into a scattergather list and avoid violating section 2.6.5.3.1 of the virtio spec. Signed-off-by: Connor Kuehl <ckuehl@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-14fuse: extend FUSE_SETXATTR requestVivek Goyal1-1/+3
Fuse client needs to send additional information to file server when it calls SETXATTR(system.posix_acl_access), so add extra flags field to the structure. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-03-08new helper: inode_wrong_type()Al Viro1-1/+1
inode_wrong_type(inode, mode) returns true if setting inode->i_mode to given value would've changed the inode type. We have enough of those checks open-coded to make a helper worthwhile. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-10fuse: fix bad inodeMiklos Szeredi1-2/+2
Jan Kara's analysis of the syzbot report (edited): The reproducer opens a directory on FUSE filesystem, it then attaches dnotify mark to the open directory. After that a fuse_do_getattr() call finds that attributes returned by the server are inconsistent, and calls make_bad_inode() which, among other things does: inode->i_mode = S_IFREG; This then confuses dnotify which doesn't tear down its structures properly and eventually crashes. Avoid calling make_bad_inode() on a live inode: switch to a private flag on the fuse inode. Also add the test to ops which the bad_inode_ops would have caught. This bug goes back to the initial merge of fuse in 2.6.14... Reported-by: syzbot+f427adf9324b92652ccc@syzkaller.appspotmail.com Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Tested-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org>
2020-11-11fuse: support SB_NOSEC flag to improve write performanceVivek Goyal1-1/+13
Virtiofs can be slow with small writes if xattr are enabled and we are doing cached writes (No direct I/O). Ganesh Mahalingam noticed this. Some debugging showed that file_remove_privs() is called in cached write path on every write. And everytime it calls security_inode_need_killpriv() which results in call to __vfs_getxattr(XATTR_NAME_CAPS). And this goes to file server to fetch xattr. This extra round trip for every write slows down writes tremendously. Normally to avoid paying this penalty on every write, vfs has the notion of caching this information in inode (S_NOSEC). So vfs sets S_NOSEC, if filesystem opted for it using super block flag SB_NOSEC. And S_NOSEC is cleared when setuid/setgid bit is set or when security xattr is set on inode so that next time a write happens, we check inode again for clearing setuid/setgid bits as well clear any security.capability xattr. This seems to work well for local file systems but for remote file systems it is possible that VFS does not have full picture and a different client sets setuid/setgid bit or security.capability xattr on file and that means VFS information about S_NOSEC on another client will be stale. So for remote filesystems SB_NOSEC was disabled by default. Commit 9e1f1de02c22 ("more conservative S_NOSEC handling") mentioned that these filesystems can still make use of SB_NOSEC as long as they clear S_NOSEC when they are refreshing inode attriutes from server. So this patch tries to enable SB_NOSEC on fuse (regular fuse as well as virtiofs). And clear SB_NOSEC when we are refreshing inode attributes. This is enabled only if server supports FUSE_HANDLE_KILLPRIV_V2. This says that server will clear setuid/setgid/security.capability on chown/truncate/write as apporpriate. This should provide tighter coherency because now suid/sgid/ security.capability will be cleared even if fuse client cache has not seen these attrs. Basic idea is that fuse client will trigger suid/sgid/security.capability clearing based on its attr cache. But even if cache has gone stale, it is fine because FUSE_HANDLE_KILLPRIV_V2 will make sure WRITE clear suid/sgid/security.capability. We make this change only if server supports FUSE_HANDLE_KILLPRIV_V2. This should make sure that existing filesystems which might be relying on seucurity.capability always being queried from server are not impacted. This tighter coherency relies on WRITE showing up on server (and not being cached in guest). So writeback_cache mode will not provide that tight coherency and it is not recommended to use two together. Having said that it might work reasonably well for lot of use cases. This change improves random write performance very significantly. Running virtiofsd with cache=auto and following fio command: fio --ioengine=libaio --direct=1 --name=test --filename=/mnt/virtiofs/random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randwrite Bandwidth increases from around 50MB/s to around 250MB/s as a result of applying this patch. So improvement is very significant. Link: https://github.com/kata-containers/runtime/issues/2815 Reported-by: "Mahalingam, Ganesh" <ganesh.mahalingam@intel.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: introduce the notion of FUSE_HANDLE_KILLPRIV_V2Vivek Goyal1-1/+4
We already have FUSE_HANDLE_KILLPRIV flag that says that file server will remove suid/sgid/caps on truncate/chown/write. But that's little different from what Linux VFS implements. To be consistent with Linux VFS behavior what we want is. - caps are always cleared on chown/write/truncate - suid is always cleared on chown, while for truncate/write it is cleared only if caller does not have CAP_FSETID. - sgid is always cleared on chown, while for truncate/write it is cleared only if caller does not have CAP_FSETID as well as file has group execute permission. As previous flag did not provide above semantics. Implement a V2 of the protocol with above said constraints. Server does not know if caller has CAP_FSETID or not. So for the case of write()/truncate(), client will send information in special flag to indicate whether to kill priviliges or not. These changes are in subsequent patches. FUSE_HANDLE_KILLPRIV_V2 relies on WRITE being sent to server to clear suid/sgid/security.capability. But with ->writeback_cache, WRITES are cached in guest. So it is not recommended to use FUSE_HANDLE_KILLPRIV_V2 and writeback_cache together. Though it probably might be good enough for lot of use cases. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: add fuse_sb_destroy() helperMiklos Szeredi1-9/+7
This is to avoid minor code duplication between fuse_kill_sb_anon() and fuse_kill_sb_blk(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: get rid of fuse_mount refcountMiklos Szeredi1-13/+4
Fuse mount now only ever has a refcount of one (before being freed) so the count field is unnecessary. Remove the refcounting and fold fuse_mount_put() into callers. The only caller of fuse_mount_put() where fm->fc was NULL is fuse_dentry_automount() and here the fuse_conn_put() can simply be omitted. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11virtiofs: simplify sb setupMiklos Szeredi1-7/+0
Currently when acquiring an sb for virtiofs fuse_mount_get() is being called from virtio_fs_set_super() if a new sb is being filled and fuse_mount_put() is called unconditionally after sget_fc() returns. The exact same result can be obtained by checking whether fs_contex->s_fs_info was set to NULL (ref trasferred to sb->s_fs_info) and only calling fuse_mount_put() if the ref wasn't transferred (error or matching sb found). This allows getting rid of virtio_fs_set_super() and fuse_mount_get(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-20Merge tag 'fuse-update-5.10' of ↵Linus Torvalds1-88/+303
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Support directly accessing host page cache from virtiofs. This can improve I/O performance for various workloads, as well as reducing the memory requirement by eliminating double caching. Thanks to Vivek Goyal for doing most of the work on this. - Allow automatic submounting inside virtiofs. This allows unique st_dev/ st_ino values to be assigned inside the guest to files residing on different filesystems on the host. Thanks to Max Reitz for the patches. - Fix an old use after free bug found by Pradeep P V K. * tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits) virtiofs: calculate number of scatter-gather elements accurately fuse: connection remove fix fuse: implement crossmounts fuse: Allow fuse_fill_super_common() for submounts fuse: split fuse_mount off of fuse_conn fuse: drop fuse_conn parameter where possible fuse: store fuse_conn in fuse_req fuse: add submount support to <uapi/linux/fuse.h> fuse: fix page dereference after free virtiofs: add logic to free up a memory range virtiofs: maintain a list of busy elements virtiofs: serialize truncate/punch_hole and dax fault path virtiofs: define dax address space operations virtiofs: add DAX mmap support virtiofs: implement dax read/write operations virtiofs: introduce setupmapping/removemapping commands virtiofs: implement FUSE_INIT map_alignment field virtiofs: keep a list of free dax memory ranges virtiofs: add a mount option to enable dax virtiofs: set up virtio_fs dax_device ...
2020-10-12fuse: connection remove fixMiklos Szeredi1-0/+7
Re-add lost removal of fc from fuse_conn_list and the control filesystem. Reported-by: kernel test robot <rong.a.chen@intel.com> Fixes: fcee216beb9c ("fuse: split fuse_mount off of fuse_conn") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-09fuse: implement crossmountsMax Reitz1-2/+24
FUSE servers can indicate crossmount points by setting FUSE_ATTR_SUBMOUNT in fuse_attr.flags. The inode will then be marked as S_AUTOMOUNT, and the .d_automount implementation creates a new submount at that location, so that the submount gets a distinct st_dev value. Note that all submounts get a distinct superblock and a distinct st_dev value, so for virtio-fs, even if the same filesystem is mounted more than once on the host, none of its mount points will have the same st_dev. We need distinct superblocks because the superblock points to the root node, but the different host mounts may show different trees (e.g. due to submounts in some of them, but not in others). Right now, this behavior is only enabled when fuse_conn.auto_submounts is set, which is the case only for virtio-fs. Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-24bdi: invert BDI_CAP_NO_ACCT_WBChristoph Hellwig1-1/+2
Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to make the checks more obvious. Also remove the pointless bdi_cap_account_writeback wrapper that just obsfucates the check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: initialize ->ra_pages and ->io_pages in bdi_initChristoph Hellwig1-1/+0
Set up a readahead size by default, as very few users have a good reason to change it. This means code, ecryptfs, and orangefs now set up the values while they were previously missing it, while ubifs, mtd and vboxsf manually set it to 0 to avoid readahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-18fuse: Allow fuse_fill_super_common() for submountsMax Reitz1-21/+88
Submounts have their own superblock, which needs to be initialized. However, they do not have a fuse_fs_context associated with them, and the root node's attributes should be taken from the mountpoint's node. Extend fuse_fill_super_common() to work for submounts by making the @ctx parameter optional, and by adding a @submount_finode parameter. (There is a plain "unsigned" in an existing code block that is being indented by this commit. Extend it to "unsigned int" so checkpatch does not complain.) Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-18fuse: split fuse_mount off of fuse_connMax Reitz1-49/+121
We want to allow submounts for the same fuse_conn, but with different superblocks so that each of the submounts has its own device ID. To do so, we need to split all mount-specific information off of fuse_conn into a new fuse_mount structure, so that multiple mounts can share a single fuse_conn. We need to take care only to perform connection-level actions once (i.e. when the fuse_conn and thus the first fuse_mount are established, or when the last fuse_mount and thus the fuse_conn are destroyed). For example, fuse_sb_destroy() must invoke fuse_send_destroy() until the last superblock is released. To do so, we keep track of which fuse_mount is the root mount and perform all fuse_conn-level actions only when this fuse_mount is involved. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: serialize truncate/punch_hole and dax fault pathVivek Goyal1-0/+1
Currently in fuse we don't seem have any lock which can serialize fault path with truncate/punch_hole path. With dax support I need one for following reasons. 1. Dax requirement DAX fault code relies on inode size being stable for the duration of fault and want to serialize with truncate/punch_hole and they explicitly mention it. static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, const struct iomap_ops *ops) /* * Check whether offset isn't beyond end of file now. Caller is * supposed to hold locks serializing us with truncate / punch hole so * this is a reliable test. */ max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); 2. Make sure there are no users of pages being truncated/punch_hole get_user_pages() might take references to page and then do some DMA to said pages. Filesystem might truncate those pages without knowing that a DMA is in progress or some I/O is in progress. So use dax_layout_busy_page() to make sure there are no such references and I/O is not in progress on said pages before moving ahead with truncation. 3. Limitation of kvm page fault error reporting If we are truncating file on host first and then removing mappings in guest lateter (truncate page cache etc), then this could lead to a problem with KVM. Say a mapping is in place in guest and truncation happens on host. Now if guest accesses that mapping, then host will take a fault and kvm will either exit to qemu or spin infinitely. IOW, before we do truncation on host, we need to make sure that guest inode does not have any mapping in that region or whole file. 4. virtiofs memory range reclaim Soon I will introduce the notion of being able to reclaim dax memory ranges from a fuse dax inode. There also I need to make sure that no I/O or fault is going on in the reclaimed range and nobody is using it so that range can be reclaimed without issues. Currently if we take inode lock, that serializes read/write. But it does not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem for this purpose. It can be used to serialize with faults. As of now, I am adding taking this semaphore only in dax fault path and not regular fault path because existing code does not have one. May be existing code can benefit from it as well to take care of some races, but that we can fix later if need be. For now, I am just focussing only on DAX path which is new path. Also added logic to take fuse_inode->i_mmap_sem in truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and fuse dax fault are mutually exlusive and avoid all the above problems. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: implement dax read/write operationsVivek Goyal1-4/+17
This patch implements basic DAX support. mmap() is not implemented yet and will come in later patches. This patch looks into implemeting read/write. We make use of interval tree to keep track of per inode dax mappings. Do not use dax for file extending writes, instead just send WRITE message to daemon (like we do for direct I/O path). This will keep write and i_size change atomic w.r.t crash. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: implement FUSE_INIT map_alignment fieldStefan Hajnoczi1-1/+16
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment constraints via the FUST_INIT map_alignment field. Parse this field and ensure our DAX mappings meet the alignment constraints. We don't actually align anything differently since our mappings are already 2MB aligned. Just check the value when the connection is established. If it becomes necessary to honor arbitrary alignments in the future we'll have to adjust how mappings are sized. The upshot of this commit is that we can be confident that mappings will work even when emulating x86 on Power and similar combinations where the host page sizes are different. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: add a mount option to enable daxVivek Goyal1-1/+17
Add a mount option to allow using dax with virtio_fs. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: get rid of no_mount_optionsVivek Goyal1-14/+16
This option was introduced so that for virtio_fs we don't show any mounts options fuse_show_options(). Because we don't offer any of these options to be controlled by mounter. Very soon we are planning to introduce option "dax" which mounter should be able to specify. And no_mount_options does not work anymore. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-14fuse: reject options on reconfigure via fsconfig(2)Miklos Szeredi1-6/+10
Previous patch changed handling of remount/reconfigure to ignore all options, including those that are unknown to the fuse kernel fs. This was done for backward compatibility, but this likely only affects the old mount(2) API. The new fsconfig(2) based reconfiguration could possibly be improved. This would make the new API less of a drop in replacement for the old, OTOH this is a good chance to get rid of some weirdnesses in the old API. Several other behaviors might make sense: 1) unknown options are rejected, known options are ignored 2) unknown options are rejected, known options are rejected if the value is changed, allowed otherwise 3) all options are rejected Prior to the backward compatibility fix to ignore all options all known options were accepted (1), even if they change the value of a mount parameter; fuse_reconfigure() does not look at the config values set by fuse_parse_param(). To fix that we'd need to verify that the value provided is the same as set in the initial configuration (2). The major drawback is that this is much more complex than just rejecting all attempts at changing options (3); i.e. all options signify initial configuration values and don't make sense on reconfigure. This patch opts for (3) with the rationale that no mount options are reconfigurable in fuse. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-14fuse: ignore 'data' argument of mount(..., MS_REMOUNT)Miklos Szeredi1-0/+7
The command mount -o remount -o unknownoption /mnt/fuse succeeds on kernel versions prior to v5.4 and fails on kernel version at or after. This is because fuse_parse_param() rejects any unrecognised options in case of FS_CONTEXT_FOR_RECONFIGURE, just as for FS_CONTEXT_FOR_MOUNT. This causes a regression in case the fuse filesystem is in fstab, since remount sends all options found there to the kernel; even ones that are meant for the initial mount and are consumed by the userspace fuse server. Fix this by ignoring mount options, just as fuse_remount_fs() did prior to the conversion to the new API. Reported-by: Stefan Priebe <s.priebe@profihost.ag> Fixes: c30da2e981a7 ("fuse: convert to use the new mount API") Cc: <stable@vger.kernel.org> # v5.4 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-14fuse: use ->reconfigure() instead of ->remount_fs()Miklos Szeredi1-3/+5
s_op->remount_fs() is only called from legacy_reconfigure(), which is not used after being converted to the new API. Convert to using ->reconfigure(). This restores the previous behavior of syncing the filesystem and rejecting MS_MANDLOCK on remount. Fixes: c30da2e981a7 ("fuse: convert to use the new mount API") Cc: <stable@vger.kernel.org> # v5.4 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-19fuse: update attr_version counter on fuse_notify_inval_inode()Miklos Szeredi1-0/+7
A GETATTR request can race with FUSE_NOTIFY_INVAL_INODE, resulting in the attribute cache being updated with stale information after the invalidation. Fix this by bumping the attribute version in fuse_reverse_inval_inode(). Reported-by: Krzysztof Rusek <rusek@9livesdata.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-19virtiofs: do not use fuse_fill_super_common() for device installationVivek Goyal1-7/+12
fuse_fill_super_common() allocates and installs one fuse_device. Hence virtiofs allocates and install all fuse devices by itself except one. This makes logic little twisted. There does not seem to be any real need that why virtiofs can't allocate and install all fuse devices itself. So opt out of fuse device allocation and installation while calling fuse_fill_super_common(). Regular fuse still wants fuse_fill_super_common() to install fuse_device. It needs to prevent against races where two mounters are trying to mount fuse using same fd. In that case one will succeed while other will get -EINVAL. virtiofs does not have this issue because sget_fc() resolves the race w.r.t multiple mounters and only one instance of virtio_fs_fill_super() should be in progress for same filesystem. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-02-09Merge branch 'merge.nfs-fs_parse.1' of ↵Linus Torvalds1-15/+10
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs file system parameter updates from Al Viro: "Saner fs_parser.c guts and data structures. The system-wide registry of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is the horror switch() in fs_parse() that would have to grow another case every time something got added to that system-wide registry. New syntax types can be added by filesystems easily now, and their namespace is that of functions - not of system-wide enum members. IOW, they can be shared or kept private and if some turn out to be widely useful, we can make them common library helpers, etc., without having to do anything whatsoever to fs_parse() itself. And we already get that kind of requests - the thing that finally pushed me into doing that was "oh, and let's add one for timeouts - things like 15s or 2h". If some filesystem really wants that, let them do it. Without somebody having to play gatekeeper for the variants blessed by direct support in fs_parse(), TYVM. Quite a bit of boilerplate is gone. And IMO the data structures make a lot more sense now. -200LoC, while we are at it" * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits) tmpfs: switch to use of invalfc() cgroup1: switch to use of errorfc() et.al. procfs: switch to use of invalfc() hugetlbfs: switch to use of invalfc() cramfs: switch to use of errofc() et.al. gfs2: switch to use of errorfc() et.al. fuse: switch to use errorfc() et.al. ceph: use errorfc() and friends instead of spelling the prefix out prefix-handling analogues of errorf() and friends turn fs_param_is_... into functions fs_parse: handle optional arguments sanely fs_parse: fold fs_parameter_desc/fs_parameter_spec fs_parser: remove fs_parameter_description name field add prefix to fs_context->log ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log new primitive: __fs_parse() switch rbd and libceph to p_log-based primitives struct p_log, variants of warnf() et.al. taking that one instead teach logfc() to handle prefices, give it saner calling conventions get rid of cg_invalf() ...
2020-02-07fuse: switch to use errorfc() et.al.Al Viro1-6/+6
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07fs_parse: fold fs_parameter_desc/fs_parameter_specAl Viro1-8/+4
The former contains nothing but a pointer to an array of the latter... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07fs_parser: remove fs_parameter_description name fieldEric Sandeen1-1/+0
Unused now. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-06fuse: use true,false for bool variablezhengbin1-7/+7
Fixes coccicheck warning: fs/fuse/readdir.c:335:1-19: WARNING: Assignment of 0/1 to bool variable fs/fuse/file.c:1398:2-19: WARNING: Assignment of 0/1 to bool variable fs/fuse/file.c:1400:2-20: WARNING: Assignment of 0/1 to bool variable fs/fuse/cuse.c:454:1-20: WARNING: Assignment of 0/1 to bool variable fs/fuse/cuse.c:455:1-19: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:497:2-17: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:504:2-23: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:511:2-22: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:518:2-23: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:522:2-26: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:526:2-18: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:1000:1-20: WARNING: Assignment of 0/1 to bool variable Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-15virtio-fs: don't show mount optionsMiklos Szeredi1-0/+4
Virtio-fs does not accept any mount options, so it's confusing and wrong to show any in /proc/mounts. Reported-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-28Merge tag 'virtio-fs-5.4' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse virtio-fs support from Miklos Szeredi: "Virtio-fs allows exporting directory trees on the host and mounting them in guest(s). This isn't actually a new filesystem, but a glue layer between the fuse filesystem and a virtio based back-end. It's similar in functionality to the existing virtio-9p solution, but significantly faster in benchmarks and has better POSIX compliance. Further permformance improvements can be achieved by sharing the page cache between host and guest, allowing for faster I/O and reduced memory use. Kata Containers have been including the out-of-tree virtio-fs (with the shared page cache patches as well) since version 1.7 as an experimental feature. They have been active in development and plan to switch from virtio-9p to virtio-fs as their default solution. There has been interest from other sources as well. The userspace infrastructure is slated to be merged into qemu once the kernel part hits mainline. This was developed by Vivek Goyal, Dave Gilbert and Stefan Hajnoczi" * tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: virtio-fs: add virtiofs filesystem virtio-fs: add Documentation/filesystems/virtiofs.rst fuse: reserve values for mapping protocol
2019-09-24fuse: kmemcg account fs dataKhazhismel Kumykov1-1/+1
account per-file, dentry, and inode data blockdev/superblock and temporary per-request data was left alone, as this usually isn't accounted Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-18virtio-fs: add virtiofs filesystemStefan Hajnoczi1-0/+4
Add a basic file system module for virtio-fs. This does not yet contain shared data support between host and guest or metadata coherency speedups. However it is already significantly faster than virtio-9p. Design Overview =============== With the goal of designing something with better performance and local file system semantics, a bunch of ideas were proposed. - Use fuse protocol (instead of 9p) for communication between guest and host. Guest kernel will be fuse client and a fuse server will run on host to serve the requests. - For data access inside guest, mmap portion of file in QEMU address space and guest accesses this memory using dax. That way guest page cache is bypassed and there is only one copy of data (on host). This will also enable mmap(MAP_SHARED) between guests. - For metadata coherency, there is a shared memory region which contains version number associated with metadata and any guest changing metadata updates version number and other guests refresh metadata on next access. This is yet to be implemented. How virtio-fs differs from existing approaches ============================================== The unique idea behind virtio-fs is to take advantage of the co-location of the virtual machine and hypervisor to avoid communication (vmexits). DAX allows file contents to be accessed without communication with the hypervisor. The shared memory region for metadata avoids communication in the common case where metadata is unchanged. By replacing expensive communication with cheaper shared memory accesses, we expect to achieve better performance than approaches based on network file system protocols. In addition, this also makes it easier to achieve local file system semantics (coherency). These techniques are not applicable to network file system protocols since the communications channel is bypassed by taking advantage of shared memory on a local machine. This is why we decided to build virtio-fs rather than focus on 9P or NFS. Caching Modes ============= Like virtio-9p, different caching modes are supported which determine the coherency level as well. The “cache=FOO” and “writeback” options control the level of coherence between the guest and host filesystems. - cache=none metadata, data and pathname lookup are not cached in guest. They are always fetched from host and any changes are immediately pushed to host. - cache=always metadata, data and pathname lookup are cached in guest and never expire. - cache=auto metadata and pathname lookup cache expires after a configured amount of time (default is 1 second). Data is cached while the file is open (close to open consistency). - writeback/no_writeback These options control the writeback strategy. If writeback is disabled, then normal writes will immediately be synchronized with the host fs. If writeback is enabled, then writes may be cached in the guest until the file is closed or an fsync(2) performed. This option has no effect on mmap-ed writes or writes going through the DAX mechanism. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: allow skipping control interface and forced unmountVivek Goyal1-1/+6
virtio-fs does not support aborting requests which are being processed. That is requests which have been sent to fuse daemon on host. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: dissociate DESTROY from fuseblkMiklos Szeredi1-4/+8
Allow virtio-fs to also send DESTROY request. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: separate fuse device allocation and installation in fuse_connVivek Goyal1-4/+21
As of now fuse_dev_alloc() both allocates a fuse device and installs it in fuse_conn list. fuse_dev_alloc() can fail if fuse_device allocation fails. virtio-fs needs to initialize multiple fuse devices (one per virtio queue). It initializes one fuse device as part of call to fuse_fill_super_common() and rest of the devices are allocated and installed after that. But, we can't afford to fail after calling fuse_fill_super_common() as we don't have a way to undo all the actions done by fuse_fill_super_common(). So to avoid failures after the call to fuse_fill_super_common(), pre-allocate all fuse devices early and install them into fuse connection later. This patch provides two separate helpers for fuse device allocation and fuse device installation in fuse_conn. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: add fuse_iqueue_ops callbacksStefan Hajnoczi1-4/+9
The /dev/fuse device uses fiq->waitq and fasync to signal that requests are available. These mechanisms do not apply to virtio-fs. This patch introduces callbacks so alternative behavior can be used. Note that queue_interrupt() changes along these lines: spin_lock(&fiq->waitq.lock); wake_up_locked(&fiq->waitq); + kill_fasync(&fiq->fasync, SIGIO, POLL_IN); spin_unlock(&fiq->waitq.lock); - kill_fasync(&fiq->fasync, SIGIO, POLL_IN); Since queue_request() and queue_forget() also call kill_fasync() inside the spinlock this should be safe. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: extract fuse_fill_super_common()Stefan Hajnoczi1-59/+53
fuse_fill_super() includes code to process the fd= option and link the struct fuse_dev to the fd's struct file. In virtio-fs there is no file descriptor because /dev/fuse is not used. This patch extracts fuse_fill_super_common() so that both classic fuse and virtio-fs can share the code to initialize a mount. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: export fuse_send_init_request()Vivek Goyal1-1/+2
This will be used by virtio-fs to send init request to fuse server after initialization of virt queues. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: fix request limitMiklos Szeredi1-2/+5
The size of struct fuse_req was reduced from 392B to 144B on a non-debug config, thus the sanitize_global_limit() helper was setting a larger default limit. This doesn't really reflect reduction in the memory used by requests, since the fields removed from fuse_req were added to fuse_args derived structs; e.g. sizeof(struct fuse_writepages_args) is 248B, thus resulting in slightly more memory being used for writepage requests overalll (due to using 256B slabs). Make the calculatation ignore the size of fuse_req and use the old 392B value. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert init to simple apiMiklos Szeredi1-28/+37
Bypass the fc->initialized check by setting the force flag. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert destroy to simple apiMiklos Szeredi1-18/+10
We can use the "force" flag to make sure the DESTROY request is always sent to userspace. So no need to keep it allocated during the lifetime of the filesystem. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: simplify 'nofail' requestMiklos Szeredi1-1/+0
Instead of complex games with a reserved request, just use __GFP_NOFAIL. Both calers (flush, readdir) guarantee that connection was already initialized, so no need to wait for fc->initialized. Also remove unneeded clearing of FR_BACKGROUND flag. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: flatten 'struct fuse_args'Miklos Szeredi1-6/+6
...to make future expansion simpler. The hiearachical structure is a historical thing that does not serve any practical purpose. The generated code is excatly the same before and after the patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: fix deadlock with aio poll and fuse_iqueue::waitq.lockEric Biggers1-0/+1
When IOCB_CMD_POLL is used on the FUSE device, aio_poll() disables IRQs and takes kioctx::ctx_lock, then fuse_iqueue::waitq.lock. This may have to wait for fuse_iqueue::waitq.lock to be released by one of many places that take it with IRQs enabled. Since the IRQ handler may take kioctx::ctx_lock, lockdep reports that a deadlock is possible. Fix it by protecting the state of struct fuse_iqueue with a separate spinlock, and only accessing fuse_iqueue::waitq using the versions of the waitqueue functions which do IRQ-safe locking internally. Reproducer: #include <fcntl.h> #include <stdio.h> #include <sys/mount.h> #include <sys/stat.h> #include <sys/syscall.h> #include <unistd.h> #include <linux/aio_abi.h> int main() { char opts[128]; int fd = open("/dev/fuse", O_RDWR); aio_context_t ctx = 0; struct iocb cb = { .aio_lio_opcode = IOCB_CMD_POLL, .aio_fildes = fd }; struct iocb *cbp = &cb; sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd); mkdir("mnt", 0700); mount("foo", "mnt", "fuse", 0, opts); syscall(__NR_io_setup, 1, &ctx); syscall(__NR_io_submit, ctx, 1, &cbp); } Beginning of lockdep output: ===================================================== WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected 5.3.0-rc5 #9 Not tainted ----------------------------------------------------- syz_fuse/135 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: 000000003590ceda (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline] 000000003590ceda (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1751 [inline] 000000003590ceda (&fiq->waitq){+.+.}, at: __io_submit_one.constprop.0+0x203/0x5b0 fs/aio.c:1825 and this task is already holding: 0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:363 [inline] 0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1749 [inline] 0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one.constprop.0+0x1f4/0x5b0 fs/aio.c:1825 which would create a new lock dependency: (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.} but this new dependency connects a SOFTIRQ-irq-safe lock: (&(&ctx->ctx_lock)->rlock){..-.} [...] Reported-by: syzbot+af05535bb79520f95431@syzkaller.appspotmail.com Reported-by: syzbot+d86c4426a01f60feddc7@syzkaller.appspotmail.com Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL") Cc: <stable@vger.kernel.org> # v4.19+ Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-06vfs: subtype handling moved to fuseDavid Howells1-2/+1
The unused vfs code can be removed. Don't pass empty subtype (same as if ->parse callback isn't called). The bits that are left involve determining whether it's permitted to split the filesystem type string passed in to mount(2). Consequently, this means that we cannot get rid of the FS_HAS_SUBTYPE flag unless we define that a type string with a dot in it always indicates a subtype specification. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-06fuse: convert to use the new mount APIDavid Howells1-125/+167
Convert the fuse filesystem to the new internal mount API as the old one will be obsoleted and removed. This allows greater flexibility in communication of mount parameters between userspace, the VFS and the filesystem. See Documentation/filesystems/mount_api.txt for more information. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>