Chagashi: a Nim implementation of the WHATWG encoding standard

Chagashi is a Nim text encoding/decoding library that implements the WHATWG Encoding standard. It was written for the Chawan browser.

Minimal example

First, add it as a dependency in your .nimble file:

requires "chagashi"

Note: the following code uses the (very) high-level interface, which is rather inefficient. The lower-level interfaces are normally faster.

# Makeshift iconv.
# Usage: nim r whatever.nim -f fromCharset -t toCharset <infile.txt >outfile.txt
import std/os, chagashi/[encoder, decoder, charset]

var fromCharset = CHARSET_UTF_8
var toCharset = CHARSET_UTF_8
for i in countup(1, paramCount(), 2): # arguments come in "-x value" pairs
  case paramStr(i)
  of "-f": fromCharset = getCharset(paramStr(i + 1))
  of "-t": toCharset = getCharset(paramStr(i + 1))
  else: assert false, "wrong parameter"
assert fromCharset != CHARSET_UNKNOWN and toCharset != CHARSET_UNKNOWN
let ins = stdin.readAll()
let insDecoded = ins.decodeAll(fromCharset)
if toCharset == CHARSET_UTF_8: # insDecoded is already UTF-8, nothing to do
  stdout.write(insDecoded)
else:
  stdout.write(insDecoded.encodeAll(toCharset))

Q&A

Q: What encodings does Chagashi support?

A: All the ones you can find on https://encoding.spec.whatwg.org/, no more and no less.

Q: What is the intermediate format?

A: UTF-8, because it is the native encoding of Nim. In general, you can just take whatever non-UTF-8 string you want to decode, pass it to the decoder, and use the result immediately.
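
For example, here is a minimal sketch that decodes an EUC-JP file into an ordinary Nim string with the high-level decodeAll from the example above (the file name is just a placeholder):

import chagashi/[decoder, charset]

let cs = getCharset("EUC-JP") # look the charset up by its WHATWG label
assert cs != CHARSET_UNKNOWN
let eucJp = readFile("input-euc-jp.txt") # hypothetical EUC-JP encoded file
let s = eucJp.decodeAll(cs)
echo s # s is valid UTF-8; use it like any other Nim string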

Q: What API should I use?

A: For decoding: the TextDecoderContext.decode() iterator provides a fairly high-level API that does no unnecessary copying, and I recommend using it where you can.

You may also use decodeAll when performance is less of a concern and/or you need the output in a string, or reach for decodercore directly if you really need the best performance. (In the latter case, I recommend studying the decoder module first, because it is very easy to get wrong.)
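
As a rough sketch of what using the iterator looks like (the constructor name, the finish parameter, and the type of the yielded chunks are assumptions on my part; check the decoder module's documentation for the real signatures):

import chagashi/[decoder, charset]

# Decode standard input to UTF-8 and stream the result out chunk by chunk,
# without building a second copy of the whole document.
# initTextDecoderContext() is an assumed constructor name.
let input = stdin.readAll()
var ctx = initTextDecoderContext(getCharset("Shift_JIS"))
for chunk in ctx.decode(input.toOpenArrayByte(0, input.high), finish = true):
  stdout.write(chunk) # each chunk is (assumed to be) a view of decoded UTF-8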

For encoding: sorry, at the moment you need to use encodercore or stick with the (non-optimal) encodeAll. I'll see if I can add an in-between API in the future.

Q: Is it correct?

A: To my knowledge, yes. However, testing is still somewhat inadequate: many single-byte encodings are not covered yet, and we do not have fuzzing either.

Q: Is it fast?

A: Not really; I have done very little optimization, because it is not necessary for my use case.

If you need better performance, feel free to complain in the tickets with a specific input and I may look into it. Patches are welcome, too.

Q: How do I decode UTF-8?

A: Like any other character set. Obviously, it won't be "decoded", just validated, because the target charset is UTF-8 as well.

Previously, the API did not have a way to return views into the input data, so we had a separate UTF-8 validator API. This turned out to be very annoying to use, so the two APIs have been unified.
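
In other words (assuming the default error handling follows the standard's replacement mode, i.e. invalid sequences come out as U+FFFD; the file name below is just a placeholder):

import chagashi/[decoder, charset]

let raw = readFile("maybe-utf8.txt") # hypothetical file, possibly invalid UTF-8
# "Decoding" UTF-8 to UTF-8 only sanitizes the input: the output is
# guaranteed to be valid UTF-8, with invalid sequences replaced.
let sanitized = raw.decodeAll(CHARSET_UTF_8)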

Q: How do I encode UTF-8?

A: You have to make sure that the UTF-8 you are passing to the encoder is at least valid WTF-8. The encoder will convert surrogate codepoints to replacement characters, but it does not validate the input byte stream.

To validate your input, you can run validateUtf8() from std/unicode, or validateUTF8Surr from chagashi/decoder.
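
Here is a sketch using std/unicode's validateUtf8, which returns -1 for valid input and the position of the first invalid byte otherwise; encodeAll and getCharset are the high-level routines from the example above, and the file names are placeholders. (For validateUTF8Surr's exact signature, see chagashi/decoder.)

import std/unicode
import chagashi/[encoder, charset]

let s = readFile("input-utf8.txt") # hypothetical file expected to hold UTF-8
doAssert s.validateUtf8() == -1, "input is not valid UTF-8"
# encode the now validated UTF-8 string to Shift_JIS
let encoded = s.encodeAll(getCharset("Shift_JIS"))
writeFile("output-shift-jis.txt", encoded)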

Q: Why no UTF-16 encoder?

A: It's not specified in the encoding standard, and I don't need one. Maybe try std/encodings.
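
If you do need UTF-16, a possible workaround is std/encodings, which wraps iconv on POSIX and the Windows conversion API on Windows, so the accepted encoding names vary by platform:

import std/encodings

# convert() takes (input, destEncoding, srcEncoding); "UTF-16LE" is an
# iconv-style name and may need adjusting on Windows.
let utf16 = convert("hello", destEncoding = "UTF-16LE", srcEncoding = "UTF-8")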

Q: Why replace your previous character decoding library?

A: Because it didn't work.

Thanks

To the standard authors, for writing a detailed, easy-to-implement specification.

Chagashi's multibyte test files (test/data.tar.xz) were borrowed from Henri Sivonen's excellent encoding_rs library. His writeup on compressing the encoding data was also very helpful, and Chagashi applies similar techniques.

License

Chagashi is dedicated to the public domain. See the UNLICENSE file for details.