computer/encoding.html 2020-11-02

encoding errors

Nowadays all text in files should be encoded as utf-8 without a BOM (byte order mark), unless you have a very good reason not to.
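A minimal Python3 sketch of what "without a BOM" means in practice. It relies only on the standard "utf-8" and "utf-8-sig" codecs; the variable names are illustrative.

```python
import codecs

text = "é"

plain = text.encode("utf-8")         # no BOM: just the bytes of the text
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF

# The only difference between the two is the three-byte BOM prefix.
assert with_bom == codecs.BOM_UTF8 + plain

# Decoding BOM-prefixed bytes as plain utf-8 leaves U+FEFF as the first character.
leaked = with_bom.decode("utf-8")

# "utf-8-sig" strips a BOM if present and also accepts input without one.
clean = with_bom.decode("utf-8-sig")

print(repr(leaked))  # '\ufeffé'
print(repr(clean))   # 'é'
```

When reading files of unknown origin, decoding with "utf-8-sig" is a tolerant choice; when writing, plain "utf-8" keeps the BOM out.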

That said, let's explore the problems that appear when it is not.

Here are a few examples:

Encoding	Interpreted as	Encoded text	Looks like	Python3
ASCII	utf-8	A (0x41)	A	"A".encode("ascii").decode("utf-8")
iso_8859-1	utf-8	é (0xE9)	error / �	"é".encode("iso_8859-1").decode("utf-8")
utf-8	ascii	é	error / �	"é".encode("utf-8").decode("ascii")
utf-8	iso_8859-1	é	Ã© (0xC3 0xA9)	"é".encode("utf-8").decode("iso_8859-1")
binary	iso_8859-1	1111111...1111111	ÿÿÿ....ÿÿÿ	bytes([0b11111111]).decode("iso_8859-1")
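The "error / �" entries depend on how the decoder is asked to handle invalid bytes. The sketch below shows both outcomes in Python3. The helper name decode_report is hypothetical and used only for illustration.

```python
def decode_report(data: bytes, encoding: str) -> str:
    """Decode strictly; on failure, report the error and the lenient result."""
    try:
        return data.decode(encoding)  # strict mode: raises on invalid bytes
    except UnicodeDecodeError as exc:
        # errors="replace" substitutes U+FFFD (�) for what cannot be decoded
        lenient = data.decode(encoding, errors="replace")
        return f"error ({exc.reason}); with errors='replace': {lenient}"


latin1_bytes = "é".encode("iso_8859-1")  # b'\xe9'
utf8_bytes = "é".encode("utf-8")         # b'\xc3\xa9'

print(decode_report(latin1_bytes, "utf-8"))     # error, then a single �
print(decode_report(utf8_bytes, "ascii"))       # error, then one � per byte
print(decode_report(utf8_bytes, "iso_8859-1"))  # no error, but prints Ã©
```

The last case is the treacherous one: iso_8859-1 assigns a character to every byte value, so decoding never fails and the damage is only visible to a reader.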

If you see Â, Ã, Ä or Å followed by another character, you are viewing a utf-8 encoded file in a viewer that interprets it as iso_8859-1 (ISO Latin-1). The affected characters come from the C1 Controls and Latin-1 Supplement or Latin Extended-A blocks, whose utf-8 encodings start with the bytes 0xC2 to 0xC5.
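The pattern in the paragraph above can be reproduced, and usually undone, in Python3. This is a sketch that assumes the viewer really used iso_8859-1; the function name repair_mojibake and the sample characters are illustrative choices.

```python
def repair_mojibake(garbled: str) -> str:
    """Undo utf-8 text that was wrongly decoded as iso_8859-1."""
    return garbled.encode("iso_8859-1").decode("utf-8")


# One sample per lead byte: U+00A9, U+00E9, U+0113, U+0142
samples = ["©", "é", "ē", "ł"]

for ch in samples:
    seen = ch.encode("utf-8").decode("iso_8859-1")  # what the viewer shows
    # seen[0] is Â, Ã, Ä or Å; seen[1] may be an invisible control character
    print(ch, "->", repr(seen), "->", repair_mojibake(seen))
```

If the viewer used windows-1252 instead of iso_8859-1, the second character can differ and this round trip may fail, so treat the repair as a best effort.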

If you see □, your font does not contain the character. The character may also be unprintable, like the vertical space that Microsoft Word incorrectly adds when you press Ctrl-Enter.
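To find out which character hides behind a □, it can help to list the code points of the text. The sketch below uses Python's standard unicodedata module; the function name describe is hypothetical, and the vertical tab in the sample is only an assumed example of an unprintable character.

```python
import unicodedata


def describe(text: str) -> list:
    """Return one line per character: code point, category and Unicode name."""
    lines = []
    for ch in text:
        # Control characters have no Unicode name, so supply a default.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}")
    return lines


for line in describe("a\x0bé"):
    print(line)
```

Characters in category Cc are control characters and have no glyph, which is why a viewer may fall back to □ or show nothing at all.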

Wilbert's website at SocSci
