summaryrefslogtreecommitdiff
path: root/fs/unicode
AgeCommit message (Collapse)AuthorFilesLines
2022-02-14kbuild: unify cmd_copy and cmd_shippedMasahiro Yamada1-1/+1
cmd_copy and cmd_shipped have similar functionality. The difference is that cmd_copy uses 'cp' while cmd_shipped 'cat'. Unify them into cmd_copy because this macro name is more intuitive. Going forward, cmd_copy will use 'cat' to avoid the permission issue. I also thought of 'cp --no-preserve=mode' but this option is not mentioned in the POSIX spec [1], so I am keeping the 'cat' command. [1]: https://pubs.opengroup.org/onlinepubs/009695299/utilities/cp.html Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2022-02-01Merge tag 'unicode-for-next-5.17-rc3' of ↵Linus Torvalds2-15/+9
git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode Pull unicode cleanup from Gabriel Krisman Bertazi: "A fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into the previous CONFIG_UNICODE. It is -rc material since we don't want to expose the former symbol on 5.17. This has been living on linux-next for the past week" * tag 'unicode-for-next-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode: unicode: clean up the Kconfig symbol confusion
2022-01-21unicode: clean up the Kconfig symbol confusionChristoph Hellwig2-15/+9
Turn the CONFIG_UNICODE symbol into a tristate that generates some always built in code and remove the confusing CONFIG_UNICODE_UTF8_DATA symbol. Note that a lot of the IS_ENABLED() checks could be turned from cpp statements into normal ifs, but this change is intended to be fairly mechanic, so that should be cleaned up later. Fixes: 2b3d04787012 ("unicode: Add utf8-data module") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2022-01-17unicode: fix .gitignore for generated utfdata fileLinus Torvalds1-1/+1
Commit 2b3d04787012 ("unicode: Add utf8-data module") changed the generated utf8data file from 'utf8data.h' to 'utf8data.c', but didn't change the comments or the .gitignore to match. The comments should be updated too, but at least they don't cause any visible breakage. But the gitignore file needs changing to avoid git complaining about untracked files. Fixes: 2b3d04787012 ("unicode: Add utf8-data module") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-12unicode: only export internal symbols for the selftestsChristoph Hellwig1-4/+7
The exported symbols in utf8-norm.c are not needed for normal file system consumers, so move them to conditional _GPL exports just for the selftest. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-10-12unicode: Add utf8-data moduleChristoph Hellwig8-91/+124
utf8data.h contains a large database table which is an auto-generated decodification trie for the unicode normalization functions. Allow building it into a separate module. Based on a patch from Shreeya Patel <shreeya.patel@collabora.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-10-11unicode: cache the normalization tables in struct unicode_mapChristoph Hellwig4-94/+78
Instead of repeatedly looking up the version add pointers to the NFD and NFD+CF tables to struct unicode_map, and pass a unicode_map plus index to the functions using the normalization tables. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-10-11unicode: move utf8cursor to utf8-selftest.cChristoph Hellwig3-18/+6
Only used by the tests, so no need to keep it in the core. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-10-11unicode: simplify utf8lenChristoph Hellwig3-31/+5
Just use the utf8nlen implementation with a (size_t)-1 len argument, similar to utf8_lookup. Also move the function to utf8-selftest.c, as it isn't used anywhere else. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-10-11unicode: remove the unused utf8{,n}age{min,max} functionsChristoph Hellwig2-129/+0
No actually used anywhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-10-11unicode: pass a UNICODE_AGE() tripple to utf8_loadChristoph Hellwig4-73/+17
Don't bother with pointless string parsing when the caller can just pass the version in the format that the core expects. Also remove the fallback to the latest version that none of the callers actually uses. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-10-11unicode: remove the charset field from struct unicode_mapChristoph Hellwig1-3/+0
It is hardcoded and only used for a f2fs sysfs file where it can be hardcoded just as easily. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-05-01.gitignore: prefix local generated files with a slashMasahiro Yamada1-2/+2
The pattern prefixed with '/' matches files in the same directory, but not ones in sub-directories. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Miguel Ojeda <ojeda@kernel.org> Acked-by: Rob Herring <robh@kernel.org> Acked-by: Andra Paraschiv <andraprs@amazon.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2020-09-11unicode: Add utf8_casefold_hashDaniel Rosenberg1-1/+22
This adds a case insensitive hash function to allow taking the hash without needing to allocate a casefolded copy of the string. The existing d_hash implementations for casefolding allocate memory within rcu-walk, by avoiding it we can be more efficient and avoid worrying about a failed allocation. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-25.gitignore: add SPDX License IdentifierMasahiro Yamada1-0/+1
Add SPDX License Identifier to all .gitignore files. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-03kbuild: rename hostprogs-y/always to hostprogs/always-yMasahiro Yamada1-1/+1
In old days, the "host-progs" syntax was used for specifying host programs. It was renamed to the current "hostprogs-y" in 2004. It is typically useful in scripts/Makefile because it allows Kbuild to selectively compile host programs based on the kernel configuration. This commit renames like follows: always -> always-y hostprogs-y -> hostprogs So, scripts/Makefile will look like this: always-$(CONFIG_BUILD_BIN2C) += ... always-$(CONFIG_KALLSYMS) += ... ... hostprogs := $(always-y) $(always-m) I think this makes more sense because a host program is always a host program, irrespective of the kernel configuration. We want to specify which ones to compile by CONFIG options, so always-y will be handier. The "always", "hostprogs-y", "hostprogs-m" will be kept for backward compatibility for a while. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2019-09-17unicode: make array 'token' static const, makes object smallerColin Ian King1-1/+1
Don't populate the array 'token' on the stack but instead make it static const. Makes the object code smaller by 234 bytes. Before: text data bss dec hex filename 5371 272 0 5643 160b fs/unicode/utf8-core.o After: text data bss dec hex filename 5041 368 0 5409 1521 fs/unicode/utf8-core.o (gcc version 9.2.1, amd64) Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2019-09-17unicode: Move static keyword to the front of declarationsKrzysztof Wilczynski1-2/+2
Move the static keyword to the front of declarations of nfdi_test_data and nfdicf_test_data, and resolve the following compiler warnings that can be seen when building with warnings enabled (W=1): fs/unicode/utf8-selftest.c:38:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] fs/unicode/utf8-selftest.c:92:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] Signed-off-by: Krzysztof Wilczynski <kw@linux.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2019-07-11Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-0/+28
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Many bug fixes and cleanups, and an optimization for case-insensitive lookups" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix coverity warning on error path of filename setup ext4: replace ktype default_attrs with default_groups ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree() ext4: refactor initialize_dirent_tail() ext4: rename "dirent_csum" functions to use "dirblock" ext4: allow directory holes jbd2: drop declaration of journal_sync_buffer() ext4: use jbd2_inode dirty range scoping jbd2: introduce jbd2_inode dirty range scoping mm: add filemap_fdatawait_range_keep_errors() ext4: remove redundant assignment to node ext4: optimize case-insensitive lookups ext4: make __ext4_get_inode_loc plug ext4: clean up kerneldoc warnigns when building with W=1 ext4: only set project inherit bit for directory ext4: enforce the immutable flag on open files ext4: don't allow any modifications to an immutable file jbd2: fix typo in comment of journal_submit_inode_data_buffers jbd2: fix some print format mistakes ext4: gracefully handle ext4_break_layouts() failure during truncate
2019-06-20ext4: optimize case-insensitive lookupsGabriel Krisman Bertazi1-0/+28
Temporarily cache a casefolded version of the file name under lookup in ext4_filename, to avoid repeatedly casefolding it. I got up to 30% speedup on lookups of large directories (>100k entries), depending on the length of the string under lookup. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 294Thomas Gleixner2-20/+2
Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license 2 as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation this program is distributed in the hope that it [would] be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 9 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141901.804956444@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 282Thomas Gleixner1-9/+1
Based on 1 normalized pattern(s): this software is licensed under the terms of the gnu general public license version 2 as published by the free software foundation and may be copied distributed and modified under those terms this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 285 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141900.642774971@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21treewide: Add SPDX license identifier - Makefile/KconfigThomas Gleixner1-0/+1
Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-12unicode: update to Unicode 12.1.0 finalTheodore Ts'o1-21/+7
Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
2019-05-12unicode: add missing check for an error return from utf8lookup()Theodore Ts'o1-0/+2
Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
2019-04-28unicode: refactor the rule for regenerating utf8data.hMasahiro Yamada5-16/+3455
scripts/mkutf8data is used only when regenerating utf8data.h, which never happens in the normal kernel build. However, it is irrespectively built if CONFIG_UNICODE is enabled. Moreover, there is no good reason for it to reside in the scripts/ directory since it is only used in fs/unicode/. Hence, move it from scripts/ to fs/unicode/. In some cases, we bypass build artifacts in the normal build. The conventional way to do so is to surround the code with ifdef REGENERATE_*. For example, - 7373f4f83c71 ("kbuild: add implicit rules for parser generation") - 6aaf49b495b4 ("crypto: arm,arm64 - Fix random regeneration of S_shipped") I rewrote the rule in a more kbuild'ish style. In the normal build, utf8data.h is just shipped from the check-in file. $ make [ snip ] SHIPPED fs/unicode/utf8data.h CC fs/unicode/utf8-norm.o CC fs/unicode/utf8-core.o CC fs/unicode/utf8-selftest.o AR fs/unicode/built-in.a If you want to generate utf8data.h based on UCD, put *.txt files into fs/unicode/, then pass REGENERATE_UTF8DATA=1 from the command line. The mkutf8data tool will be automatically compiled to generate the utf8data.h from the *.txt files. $ make REGENERATE_UTF8DATA=1 [ snip ] HOSTCC fs/unicode/mkutf8data GEN fs/unicode/utf8data.h CC fs/unicode/utf8-norm.o CC fs/unicode/utf8-core.o CC fs/unicode/utf8-selftest.o AR fs/unicode/built-in.a I renamed the check-in utf8data.h to utf8data.h_shipped so that this will work for the out-of-tree build. You can update it based on the latest UCD like this: $ make REGENERATE_UTF8DATA=1 fs/unicode/ $ cp fs/unicode/utf8data.h fs/unicode/utf8data.h_shipped Also, I added entries to .gitignore and dontdiff. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25unicode: update unicode database unicode version 12.1.0Gabriel Krisman Bertazi3-2075/+2138
Regenerate utf8data.h based on the latest UCD files and run tests against the latest version. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25unicode: introduce test module for normalized utf8 implementationGabriel Krisman Bertazi3-0/+326
This implements a in-kernel sanity test module for the utf8 normalization core. At probe time, it will run basic sequences through the utf8n core, to identify problems will equivalent sequences and normalization/casefold code. This is supposed to be useful for regression testing when adding support for a new version of utf8 to linux. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25unicode: implement higher level API for string handlingGabriel Krisman Bertazi4-1/+197
This patch integrates the utf8n patches with some higher level API to perform UTF-8 string comparison, normalization and casefolding operations. Implemented is a variation of NFD, and casefold is performed by doing full casefold on top of NFD. These algorithms are based on the core implemented by Olaf Weber from SGI. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25unicode: reduce the size of utf8data[]Olaf Weber4-12646/+3028
Remove the Hangul decompositions from the utf8data trie, and do algorithmic decomposition to calculate them on the fly. To store the decomposition the caller of utf8lookup()/utf8nlookup() must provide a 12-byte buffer, which is used to synthesize a leaf with the decomposition. This significantly reduces the size of the utf8data[] array. Changes made by Gabriel: Rebase to mainline Fix checkpatch errors Extract robustness fixes and merge back to original mkutf8data.c patch Regenerate utf8data.h Signed-off-by: Olaf Weber <olaf@sgi.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25unicode: introduce code for UTF-8 normalizationOlaf Weber3-0/+756
Supporting functions for UTF-8 normalization are in utf8norm.c with the header utf8norm.h. Two normalization forms are supported: nfdi and nfdicf. nfdi: - Apply unicode normalization form NFD. - Remove any Default_Ignorable_Code_Point. nfdicf: - Apply unicode normalization form NFD. - Remove any Default_Ignorable_Code_Point. - Apply a full casefold (C + F). For the purposes of the code, a string is valid UTF-8 if: - The values encoded are 0x1..0x10FFFF. - The surrogate codepoints 0xD800..0xDFFFF are not encoded. - The shortest possible encoding is used for all values. The supporting functions work on null-terminated strings (utf8 prefix) and on length-limited strings (utf8n prefix). From the original SGI patch and for conformity with coding standards, the utf8data_t typedef was dropped, since it was just masking the struct keyword. On other occasions, namely utf8leaf_t and utf8trie_t, I decided to keep it, since they are simple pointers to memory buffers, and using uchars here wouldn't provide any more meaningful information. From the original submission, we also converted from the compatibility form to canonical. Changes made by Gabriel: Rebase to Mainline Fix up checkpatch.pl warnings Drop typedefs move out of libxfs Convert from NFKD to NFD Signed-off-by: Olaf Weber <olaf@sgi.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25unicode: introduce UTF-8 character databaseGabriel Krisman Bertazi4-0/+13913
The decomposition and casefolding of UTF-8 characters are described in a prefix tree in utf8data.h, which is a generate from the Unicode Character Database (UCD), published by the Unicode Consortium, and should not be edited by hand. The structures in utf8data.h are meant to be used for lookup operations by the unicode subsystem, when decoding a utf-8 string. mkutf8data.c is the source for a program that generates utf8data.h. It was written by Olaf Weber from SGI and originally proposed to be merged into Linux in 2014. The original proposal performed the compatibility decomposition, NFKD, but the current version was modified by me to do canonical decomposition, NFD, as suggested by the community. The changes from the original submission are: * Rebase to mainline. * Fix out-of-tree-build. * Update makefile to build 11.0.0 ucd files. * drop references to xfs. * Convert NFKD to NFD. * Merge back robustness fixes from original patch. Requested by Dave Chinner. The original submission is archived at: <https://linux-xfs.oss.sgi.narkive.com/Xx10wjVY/rfc-unicode-utf-8-support-for-xfs> The utf8data.h file can be regenerated using the instructions in fs/unicode/README.utf8data. - Notes on the update from 8.0.0 to 11.0: The structure of the ucd files and special cases have not experienced any changes between versions 8.0.0 and 11.0.0. 8.0.0 saw the addition of Cherokee LC characters, which is an interesting case for case-folding. The update is accompanied by new tests on the test_ucd module to catch specific cases. No changes to mkutf8data script were required for the updates. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>