Comment by koakuma-chan a day ago
> What if the file name is not valid UTF-8
Nothing? Neither Go nor the OS require file names to be UTF-8, I believe
That sounds like your kernel refusing to create that file, nothing to do with Go.
$ cat main.go
package main

import (
	"log"
	"os"
)

func main() {
	f, err := os.Create("\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98")
	if err != nil {
		log.Fatalf("create: %v", err)
	}
	_ = f
}
$ go run .
$ ls -1
''$'\275\262''='$'\274'' ⌘'
go.mod
main.go
I've posted a longer explanation in https://news.ycombinator.com/item?id=44991638. I'm interested to hear which kernel and which filesystem zimpenfish is using that has this problem.
I believe macOS forces UTF-8 filenames and normalizes them to something near-but-not-quite Unicode NFD.
Windows doing something similar wouldn't surprise me at all. I believe NTFS internally stores filenames as UTF-16, so enforcing UTF-8 at the API boundary sounds likely.
I'm confused: is Go restricted to UTF-8-only filenames? It can read and write arbitrary byte sequences (which is what a string can hold), and that should be sufficient for dealing with other encodings.
Go is not restricted, since strings are UTF-8 only by convention and are not required to be.
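To make that concrete, here is a minimal sketch using the same byte sequence as the file name earlier in the thread. The helper name `describe` is made up for this example; `utf8.ValidString` is the standard library check.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the byte length of s and whether s is valid UTF-8.
// The name is hypothetical, chosen only for this example.
func describe(s string) (n int, valid bool) {
	return len(s), utf8.ValidString(s)
}

func main() {
	// Same bytes as the file name used earlier in the thread.
	name := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

	n, valid := describe(name)
	fmt.Println(n, valid) // the string holds all 8 bytes but is not valid UTF-8

	// Ranging over a string decodes runes. Invalid bytes show up as U+FFFD
	// in the loop, but the bytes stored in the string are left untouched.
	for i, r := range name {
		fmt.Printf("%d: %U\n", i, r)
	}
}
```

The string type itself never rejects the bytes; only code that chooses to decode or validate them notices anything unusual.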
> That sounds like your kernel refusing to create that file
Yes, that was my assumption when bash et al also had problems with it.
Well, Windows is an odd beast when 8-bit file names are used. If done naively, you can’t express all valid filenames, even with broken UTF-8, and filenames that are not valid Unicode cannot be encoded to UTF-8 without loss or some weird convention.
You can do something like WTF-8 (not a misspelling, alas) to make it bidirectional. Rust does this under the hood but doesn’t expose the internal representation.
What do you mean by "when 8-bit filenames are used"? Do you mean the -A APIs, like CreateFileA()? Those do not take UTF-8, mind you -- unless you are using a relatively recent version of Windows that allows you to run your process with a UTF-8 codepage.
In general, Windows filenames are Unicode and you can always express those filenames by using the -W APIs (like CreateFileW()).
Windows filenames in the W APIs are 16-bit (which the A APIs essentially wrap with conversions to the active old-school codepage), and are normally well formed UTF-16. But they aren’t required to be - NTFS itself only cares about 0x0000 and 0x005C (backslash) I believe, and all layers of the stack accept invalid UTF-16 surrogates. Don’t get me started on the normal Win32 path processing (Unicode normalization, “COM” is still a special file, etc.), some of which can be bypassed with the “\\?\” prefix when in NTFS.
The upshot is that since the values aren’t always UTF-16, there’s no canonical way to convert them to single byte strings such that valid UTF-16 gets turned into valid UTF-8 but the rest can still be roundtripped. That’s what bastardized encodings like WTF-8 solve. The Rust Path API is the best take on this I’ve seen that doesn’t choke on bad Unicode.
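A rough Go sketch of the idea, for illustration only: `encodeWTF8` and `decodeWTF8` are hypothetical helpers, not an existing library API and not Rust's implementation. Well-formed UTF-16 comes out as ordinary UTF-8, while an unpaired surrogate is written as the three bytes that generalized UTF-8 would use for that code point, so the original code units can be recovered.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodeWTF8 converts possibly ill-formed UTF-16 to bytes (hypothetical helper).
func encodeWTF8(units []uint16) []byte {
	var out []byte
	for i := 0; i < len(units); i++ {
		u := rune(units[i])
		if utf16.IsSurrogate(u) {
			// A high surrogate followed by a low surrogate is a valid pair.
			if i+1 < len(units) {
				if r := utf16.DecodeRune(u, rune(units[i+1])); r != utf8.RuneError {
					out = utf8.AppendRune(out, r)
					i++
					continue
				}
			}
			// Unpaired surrogate: encode by hand, because utf8.AppendRune
			// would substitute U+FFFD.
			out = append(out, 0xE0|byte(u>>12), 0x80|byte(u>>6)&0x3F, 0x80|byte(u)&0x3F)
			continue
		}
		out = utf8.AppendRune(out, u)
	}
	return out
}

// decodeWTF8 reverses encodeWTF8. It assumes its input came from encodeWTF8
// and does not validate anything else.
func decodeWTF8(b []byte) []uint16 {
	var out []uint16
	for len(b) > 0 {
		// Surrogates occupy ED A0..BF 80..BF, which utf8.DecodeRune rejects,
		// so they are picked out before falling back to the standard decoder.
		if len(b) >= 3 && b[0] == 0xED && b[1] >= 0xA0 && b[1] <= 0xBF && b[2]&0xC0 == 0x80 {
			out = append(out, 0xD000|uint16(b[1]&0x3F)<<6|uint16(b[2]&0x3F))
			b = b[3:]
			continue
		}
		r, size := utf8.DecodeRune(b)
		out = append(out, utf16.Encode([]rune{r})...)
		b = b[size:]
	}
	return out
}

func main() {
	// "a", an unpaired high surrogate, "b": ill-formed UTF-16.
	name := []uint16{'a', 0xD800, 'b'}

	// The standard decoder substitutes U+FFFD, so the original is lost.
	fmt.Printf("%04X\n", utf16.Encode(utf16.Decode(name)))

	// The WTF-8-style encoding keeps enough information to round-trip.
	enc := encodeWTF8(name)
	fmt.Printf("% X\n", enc)
	fmt.Printf("%04X\n", decodeWTF8(enc))
}
```

The encoder always merges valid surrogate pairs before falling back to the by-hand path, which is what keeps the output identical to UTF-8 for well-formed input.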
I think it depends on the underlying filesystem. Unicode (UTF-16) is first-class on NTFS. But Windows still supports FAT, I guess, where multiple 8-bit encodings are possible: the so-called "OEM" code pages (437, 850 etc.) or "ANSI" code pages (1250, 1251 etc.). I haven't checked how recent Windows versions cope with FAT file names that cannot be represented as Unicode.
I believe the same is true on linux, which only cares about 0x2f bytes (i.e. /)
Windows paths are not necessarily well-formed UTF-16 (UCS-2 by some people’s definition) down to the filesystem level. If they were always well formed, you could convert to a single byte representation by straightforward Unicode re-encoding. But since they aren’t, choices need to be made about what to do with malformed UTF-16 if you want to round trip them to single byte strings such that they match the UTF-8 encoding when they are well formed.
In Linux, they’re 8-bit almost-arbitrary strings like you noted, and usually UTF-8. So they always have a convenient 8-bit encoding (i.e. leave them alone). If you hated yourself and wanted to convert them to UTF-16, however, you’d have the same problem Windows does but in reverse.
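A small sketch of that reverse problem in Go, assuming only the standard library conversions (the helper name `viaUTF16` is made up): a valid UTF-8 name survives a trip through UTF-16, while a name with stray bytes does not.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// viaUTF16 converts a byte-string file name to UTF-16 and back using only
// standard library conversions. The name is hypothetical.
func viaUTF16(name string) string {
	// Converting to []rune turns every invalid byte into U+FFFD, so the
	// original bytes are already gone before utf16.Encode runs.
	units := utf16.Encode([]rune(name))
	return string(utf16.Decode(units))
}

func main() {
	good := "⌘.txt"
	bad := "\xbd\xb2=\xbc ⌘" // same bytes as the file name earlier in the thread

	fmt.Println(viaUTF16(good) == good) // valid UTF-8 round-trips
	fmt.Println(viaUTF16(bad) == bad)   // invalid bytes do not
}
```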
> Nothing?
It breaks. Which is weird because you can create a string which isn't valid UTF-8 (eg "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98") and print it out with no trouble; you just can't pass it to e.g. `os.Create` or `os.Open`.
(Bash and a variety of other utils will also complain about it not being valid UTF-8; neovim won't save a file under that name; etc.)