Chagashi: a Nim implementation of the WHATWG encoding standard

Chagashi is a Nim text encoding/decoding library that implements the WHATWG Encoding standard. It was written for the Chawan browser.

Minimal example

First, add it as a dependency in your .nimble file:

requires "chagashi"

Note: the following code uses the (very) high-level interface, which is rather inefficient. The lower-level interfaces are normally faster.

# Makeshift iconv.
# Usage: nim r whatever.nim -f fromCharset -t toCharset <infile.txt >outfile.txt
import std/os, chagashi/[encoder, decoder, charset]

var fromCharset = CHARSET_UTF_8
var toCharset = CHARSET_UTF_8
for i in countup(1, paramCount(), 2): # arguments come in "-x value" pairs
  case paramStr(i)
  of "-f": fromCharset = getCharset(paramStr(i + 1))
  of "-t": toCharset = getCharset(paramStr(i + 1))
  else: assert false, "wrong parameter"
assert fromCharset != CHARSET_UNKNOWN and toCharset != CHARSET_UNKNOWN
let ins = stdin.readAll()
let insDecoded = ins.decodeAll(fromCharset)
if toCharset == CHARSET_UTF_8: # insDecoded is already UTF-8, nothing to do
  stdout.write(insDecoded)
else:
  stdout.write(insDecoded.encodeAll(toCharset))

Q&A

Q: What encodings does Chagashi support?

A: All the ones you can find on https://encoding.spec.whatwg.org/, no more and no less.

Q: What is the intermediate format?

A: UTF-8, because it is the native encoding of Nim. In general, you can just take whatever non-UTF-8 string you want to decode, pass it to the decoder, and use the result immediately.
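
For example, here is a minimal sketch that decodes an EUC-JP file into an ordinary Nim string with the high-level decodeAll from the example above (the file name is just a placeholder):

import chagashi/[decoder, charset]

let cs = getCharset("EUC-JP") # look the charset up by its WHATWG label
assert cs != CHARSET_UNKNOWN
let eucJp = readFile("input-euc-jp.txt") # hypothetical EUC-JP encoded file
let s = eucJp.decodeAll(cs)
echo s # s is valid UTF-8; use it like any other Nim string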

Q: What API should I use?

A: For decoding: the TextDecoderContext.decode() iterator provides a fairly high-level API that does no unnecessary copying, and I recommend using it where you can.

You may also use decodeAll when performance is less of a concern and/or you need the output in a string, or reach for decodercore directly if you really need the best performance. (In the latter case, I recommend studying the decoder module first, because it is very easy to get wrong.)
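
As a rough sketch of what using the iterator looks like (the constructor name, the finish parameter, and the type of the yielded chunks are assumptions on my part; check the decoder module's documentation for the real signatures):

import chagashi/[decoder, charset]

# Decode standard input to UTF-8 and stream the result out chunk by chunk,
# without building a second copy of the whole document.
# initTextDecoderContext() is an assumed constructor name.
let input = stdin.readAll()
var ctx = initTextDecoderContext(getCharset("Shift_JIS"))
for chunk in ctx.decode(input.toOpenArrayByte(0, input.high), finish = true):
  stdout.write(chunk) # each chunk is (assumed to be) a view of decoded UTF-8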

For encoding: sorry, at the moment you need to use encodercore or stick with the (non-optimal) encodeAll. I'll see if I can add an in-between API in the future.

Q: Is it correct?

A: To my knowledge, yes. However, testing is still somewhat inadequate: many single-byte encodings are not covered yet, and we do not have fuzzing either.

Q: Is it fast?

A: Not really; I have done very little optimization, because it is not necessary for my use case.

If you need better performance, feel free to complain in the tickets with a specific input and I may look into it. Patches are welcome, too.

Q: How do I decode UTF-8?

A: Like any other character set. Obviously, it won't be "decoded", just validated, because the target charset is UTF-8 as well.

Previously, the API did not have a way to return views into the input data, so we had a separate UTF-8 validator API. This turned out to be very annoying to use, so the two APIs have been unified.
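
In other words (assuming the default error handling follows the standard's replacement mode, i.e. invalid sequences come out as U+FFFD; the file name below is just a placeholder):

import chagashi/[decoder, charset]

let raw = readFile("maybe-utf8.txt") # hypothetical file, possibly invalid UTF-8
# "Decoding" UTF-8 to UTF-8 only sanitizes the input: the output is
# guaranteed to be valid UTF-8, with invalid sequences replaced.
let sanitized = raw.decodeAll(CHARSET_UTF_8)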

Q: How do I encode UTF-8?

A: You have to make sure that the UTF-8 you are passing to the encoder is at least valid WTF-8. The encoder will convert surrogate codepoints to replacement characters, but it does not validate the input byte stream.

To validate your input, you can run validateUtf8() from std/unicode, or validateUTF8Surr from chagashi/decoder.
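
Here is a sketch using std/unicode's validateUtf8, which returns -1 for valid input and the position of the first invalid byte otherwise; encodeAll and getCharset are the high-level routines from the example above, and the file names are placeholders. (For validateUTF8Surr's exact signature, see chagashi/decoder.)

import std/unicode
import chagashi/[encoder, charset]

let s = readFile("input-utf8.txt") # hypothetical file expected to hold UTF-8
doAssert s.validateUtf8() == -1, "input is not valid UTF-8"
# encode the now validated UTF-8 string to Shift_JIS
let encoded = s.encodeAll(getCharset("Shift_JIS"))
writeFile("output-shift-jis.txt", encoded)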

Q: Why no UTF-16 encoder?

A: It's not specified in the encoding standard, and I don't need one. Maybe try std/encodings.
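
If you do need UTF-16, a possible workaround is std/encodings, which wraps iconv on POSIX and the Windows conversion API on Windows, so the accepted encoding names vary by platform:

import std/encodings

# convert() takes (input, destEncoding, srcEncoding); "UTF-16LE" is an
# iconv-style name and may need adjusting on Windows.
let utf16 = convert("hello", destEncoding = "UTF-16LE", srcEncoding = "UTF-8")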

Q: Why replace your previous character decoding library?

A: Because it didn't work.

Thanks

To the standard authors, for writing a detailed, easy-to-implement specification.

Chagashi's multibyte test files (test/data.tar.xz) were borrowed from Henri Sivonen's excellent encoding_rs library. His writeup on compressing the encoding data was also very helpful, and Chagashi applies similar techniques.

License

Chagashi is dedicated to the public domain. See the UNLICENSE file for details.