Wilbert's website at SocSci

computer/encoding.html 2020-11-02

encoding errors

Nowadays, all text stored in files should be encoded as utf-8 without a BOM (byte order mark), unless you have a very good reason not to.

That said, let's explore the problems that arise when it is not.
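To make "without BOM" concrete, here is a minimal sketch that writes the same text to a file with and without a BOM and compares the raw bytes. The file names are hypothetical, and it assumes Python's "utf-8-sig" codec as the BOM-writing variant; other tools may behave differently.

```python
import codecs
import os
import tempfile

text = "é"

# Write the same text twice into a temporary directory:
# once as plain utf-8, once as "utf-8-sig", which prepends a BOM.
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.txt")  # hypothetical file name
    bom_path = os.path.join(tmp, "bom.txt")      # hypothetical file name

    with open(plain_path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(bom_path, "w", encoding="utf-8-sig") as f:
        f.write(text)

    # Read both files back as raw bytes to see what was actually stored.
    with open(plain_path, "rb") as f:
        plain_bytes = f.read()
    with open(bom_path, "rb") as f:
        bom_bytes = f.read()

print(plain_bytes)  # b'\xc3\xa9'
print(bom_bytes)    # b'\xef\xbb\xbf\xc3\xa9'

# A file starts with a BOM if its first three bytes are EF BB BF.
has_bom = bom_bytes.startswith(codecs.BOM_UTF8)
```

The three extra bytes at the start are the BOM. A program that does not expect them may show them as stray characters at the top of the file.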

Here are a few examples:

| Encoding | Interpreted as | Encoded text | Looks like | Python3 |
|---|---|---|---|---|
| ASCII | utf-8 | A (0x41) | A | "A".encode("ascii").decode("utf-8") |
| iso_8859-1 | utf-8 | é (0xE9) | error / � | "é".encode("iso_8859-1").decode("utf-8") |
| utf-8 | ascii | é (0xC3 0xA9) | error / � | "é".encode("utf-8").decode("ascii") |
| utf-8 | iso_8859-1 | é (0xC3 0xA9) | Ã© | "é".encode("utf-8").decode("iso_8859-1") |
| binary | iso_8859-1 | 1111111...1111111 | ÿÿÿ....ÿÿÿ | bytes([0b11111111]).decode("iso_8859-1") |
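The "error / �" entries depend on how the reader handles invalid bytes. The sketch below runs the Python3 expressions from the examples above both strictly (which raises an error) and with errors="replace" (which yields �). The variable names are only illustrative.

```python
# iso_8859-1 bytes read as utf-8: a lone 0xE9 is not valid utf-8.
latin1_bytes = "é".encode("iso-8859-1")  # b'\xe9'
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True  # the "error" case
latin1_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")  # the "�" case

# utf-8 bytes read as ascii: both bytes are outside the ascii range,
# so each one becomes its own replacement character.
utf8_bytes = "é".encode("utf-8")  # b'\xc3\xa9'
utf8_as_ascii = utf8_bytes.decode("ascii", errors="replace")

# utf-8 bytes read as iso_8859-1: never an error, because every byte
# value maps to some character, but the result is garbled.
utf8_as_latin1 = utf8_bytes.decode("iso-8859-1")

# A byte with all bits set is ÿ in iso_8859-1.
all_ones = bytes([0b11111111]).decode("iso-8859-1")

print(strict_failed, latin1_as_utf8, utf8_as_ascii, utf8_as_latin1, all_ones)
```

Note that the iso_8859-1 case is the dangerous one: it never fails, so the mistake goes unnoticed until someone looks at the text.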

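If you see Â, Ã, Ä or Å followed by another character, you are viewing a utf-8 encoded file in a viewer that interprets it as iso_8859 (ISO Latin). The file contains characters from the C1 Controls and Latin-1 Supplement or Latin Extended-A blocks. utf-8 encodes each of these as two bytes, and the first byte (0xC2 to 0xC5) shows up as one of those four letters.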
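As a minimal sketch of where these four letters come from, and of how such text can often be repaired, the code below reverses the wrong decoding step. The function name fix_mojibake is hypothetical, and the repair assumes the viewer used exactly iso_8859-1; text that went through another encoding, or was garbled more than once, will not round-trip this way.

```python
def fix_mojibake(garbled: str) -> str:
    """Undo a utf-8-read-as-iso_8859-1 mistake; return the input if that fails."""
    try:
        # Turn the wrongly decoded characters back into the original bytes,
        # then decode those bytes as the utf-8 they really were.
        return garbled.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or not damaged at all): leave it alone.
        return garbled


# One sample character per lead byte 0xC2, 0xC3, 0xC4 and 0xC5.
samples = "¡éĀŁ"
lead_letters = "".join(
    ch.encode("utf-8").decode("iso-8859-1")[0] for ch in samples
)
print(lead_letters)  # ÂÃÄÅ

print(fix_mojibake("Ã©"))  # é
```

Text that is already correct is returned unchanged, because re-encoding it either fails or produces bytes that are not valid utf-8.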

If you see □, your font does not contain the character. The character may also be unprintable, like the vertical space that Microsoft Word incorrectly adds when you press Ctrl-Enter.
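To tell a missing glyph from an unprintable character, it can help to look at the code point itself. The sketch below uses Python's unicodedata module; the helper name describe is hypothetical, and the vertical tab (U+000B) is used purely as a sample control character, not as a claim about what Word inserts.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, general category, name) for one character."""
    code = f"U+{ord(ch):04X}"
    category = unicodedata.category(ch)  # "Cc" means control character
    name = unicodedata.name(ch, "<no name>")  # control characters have no name
    return code, category, name


print(describe("é"))     # ('U+00E9', 'Ll', 'LATIN SMALL LETTER E WITH ACUTE')
print(describe("\x0b"))  # ('U+000B', 'Cc', '<no name>')

# Characters in category "Cc" are unprintable; changing the font will not help.
is_control = unicodedata.category("\x0b") == "Cc"
```

A character with a letter or symbol category that still shows as □ points to a font problem. A control category points to stray data in the file.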