Unicode character encodings – Python Morsels

Topic Series: Files

By Trey Hunner

All text that comes from outside of your Python process starts as binary data.

All input starts as raw bytes

When you open a file in Python, the default mode is r (or rt), for read text mode:

>>> with open("my_file.txt") as f:
...     contents = f.read()
...
>>> f.mode
'r'

This means that when we read our file, we get back strings that represent text:

>>> contents
'This is a file ✨\n'

But that’s not what Python actually reads from disk.

If we open the file with the mode rb and read from it, we’ll see what Python actually sees, which is bytes:

>>> with open("my_file.txt", mode="rb") as f:
...     contents = f.read()
...
>>> contents
b'This is a file \xe2\x9c\xa8\n'
>>> type(contents)
<class 'bytes'>

Bytes are what Python decodes to make strings.
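
To make that decoding step visible, here is a minimal sketch that writes the bytes shown above to a temporary file, reads the file back in both binary and text mode, and decodes the raw bytes by hand. The temporary directory and the names raw_bytes, text, and decoded are only for illustration, and UTF-8 is assumed as the file's encoding.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "my_file.txt"
    # Write the raw UTF-8 bytes directly, with no encoding step involved.
    path.write_bytes(b"This is a file \xe2\x9c\xa8\n")

    # Binary mode hands back the bytes exactly as stored on disk.
    with open(path, mode="rb") as f:
        raw_bytes = f.read()

    # Text mode reads the same bytes and decodes them for us.
    with open(path, mode="rt", encoding="utf-8") as f:
        text = f.read()

# Decoding the raw bytes ourselves gives the same string as text mode.
decoded = raw_bytes.decode("utf-8")
print(decoded == text)  # True
```

Text mode is, roughly, binary mode plus a decode step (and newline handling) performed on our behalf.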

Encoding strings into bytes

If you have a string in Python and you’d like to convert it into bytes, you can call its encode method:

>>> text = "Hello there! \u2728"
>>> text.encode()
b'Hello there! \xe2\x9c\xa8'

The encode method uses the character encoding utf-8 by default:

>>> text.encode("utf-8")
b'Hello there! \xe2\x9c\xa8'

But you can specify a different character encoding if you’d like:

>>> text.encode("utf-16-le")
b"H\x00e\x00l\x00l\x00o\x00 \x00t\x00h\x00e\x00r\x00e\x00!\x00 \x00('"
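
As a rough illustration of how much the choice of encoding matters, this sketch encodes the same string three ways and compares the number of bytes produced. The sizes dictionary and the ascii_failed flag are names made up for this example. It also tries the ascii encoding, which has no way to represent the sparkles character.

```python
text = "Hello there! \u2728"

# Encode the same 14-character string with three different encodings.
sizes = {
    encoding: len(text.encode(encoding))
    for encoding in ("utf-8", "utf-16-le", "utf-32-le")
}
print(sizes)  # {'utf-8': 16, 'utf-16-le': 28, 'utf-32-le': 56}

# An encoding that cannot represent a character raises UnicodeEncodeError.
try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    ascii_failed = True
    print(error)
else:
    ascii_failed = False
```

UTF-8 uses one byte for each ASCII character and three for the sparkles emoji, while the UTF-16 and UTF-32 variants here use a fixed two and four bytes per character for this particular string.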

Decoding bytes into strings

If you have a bytes object and you’d like to convert it into a string, you need to decode it by calling its decode method:

>>> data = b"Hello there! \xe2\x9c\xa8"
>>> data.decode()
'Hello there! ✨'

Like the string encode method, the bytes decode method uses the character encoding utf-8 by default:

>>> data.decode("utf-8")
'Hello there! ✨'

But if you have bytes that represent data in a different character encoding, you’ll need to specify that character encoding instead:

>>> data = b"H\x00e\x00l\x00l\x00o\x00 \x00t\x00h\x00e\x00r\x00e\x00!\x00 \x00('"
>>> data.decode("utf-16le")
'Hello there! ✨'
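
Decoding only gives back the original text when the encoding matches the one used to encode. The sketch below (the variable names right and wrong are just for illustration) decodes the same UTF-16 bytes twice, once with the matching encoding and once with utf-8. For this particular data the mismatched decode does not fail, because every byte happens to be a valid single-byte UTF-8 character.

```python
text = "Hello there! \u2728"
data = text.encode("utf-16-le")

# Decoding with the encoding that produced the bytes round-trips cleanly.
right = data.decode("utf-16-le")

# Decoding with a different encoding raises no error for these bytes,
# since each one is below 0x80, but the resulting string is not our text.
wrong = data.decode("utf-8")

print(right == text)    # True
print(wrong == text)    # False
print(repr(wrong[:6]))  # 'H\x00e\x00l\x00'
```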

Specifying a character encoding when opening files

When you open a file in Python, whether for writing or for reading, it’s considered a best practice to specify the character encoding that you’re working with:

>>> with open("message.txt", mode="wt", encoding="utf-8") as f:
...     f.write("In Jan 2020 I said \u201cI'm glad I upgraded to Python 3\u201d.")
...
53
>>> with open("message.txt", mode="rt", encoding="utf-8") as f:
...     contents = f.read()
...
>>> contents
"In Jan 2020 I said “I'm glad I upgraded to Python 3”."

This is because on different operating systems, Python will use a different character encoding by default when it’s working with text files.

On my machine, the default character encoding is utf-8. But on Windows, the default character encoding is usually cp1252.
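
One way to check which default applies on a given machine is to ask Python directly. This is a small sketch rather than a definitive recipe: locale.getpreferredencoding(False) reports the locale's preferred encoding, and the encoding attribute on an open text file shows what that particular file object is using. The temporary file and the names default_encoding and explicit_encoding exist only for this example.

```python
import locale
import tempfile
from pathlib import Path

# The encoding Python prefers for text on this machine (varies by system).
print(locale.getpreferredencoding(False))

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "example.txt"

    # No encoding given: the file object falls back to the platform default.
    with open(path, mode="wt") as f:
        default_encoding = f.encoding

    # Encoding given: the file object uses exactly what we asked for.
    with open(path, mode="wt", encoding="utf-8") as f:
        explicit_encoding = f.encoding

print(default_encoding)   # depends on the operating system and settings
print(explicit_encoding)  # utf-8
```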

Be careful with your character encodings

So if we read this UTF-8 file on a Windows machine without specifying an encoding, we would get a UnicodeDecodeError:

>>> with open("message.txt", mode="rt") as f:
...     contents = f.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.10/encodings/cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 55: character maps to <undefined>
>>>

The UnicodeDecodeError means there’s a mismatch between the character encoding of the bytes that we’re reading and the character encoding that Python is trying to use to read them.
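
The same failure can be reproduced on any operating system by decoding the UTF-8 bytes with cp1252 explicitly. In this sketch, the failure tuple is a name invented for the example; it records where decoding stopped and which byte caused the problem, using the start attribute that UnicodeDecodeError provides.

```python
message = "In Jan 2020 I said \u201cI'm glad I upgraded to Python 3\u201d."
data = message.encode("utf-8")

try:
    # Decode UTF-8 bytes with the wrong encoding on purpose.
    data.decode("cp1252")
except UnicodeDecodeError as error:
    # error.start is the index of the first byte that could not be decoded.
    failure = (error.start, data[error.start])
    print(f"byte {hex(failure[1])} at position {failure[0]}")
else:
    failure = None
```

The offending byte is the last byte of the closing curly quote, which cp1252 has no character for.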

But you can’t rely on UnicodeDecodeErrors always being raised when there’s a character encoding mismatch. Sometimes two different encodings may use the same bytes to represent different text.

Here we’ve saved a file using the UTF-8 character encoding:

>>> text = "Yay unicode! \N{SPARKLES}"
>>> print(text)
Yay unicode! ✨
>>> with open("sparkles.txt", mode="wt", encoding="utf-8") as f:
...     f.write(text)
...
14

If we read this file using the cp1252 character encoding, we’ll see different text than what we started with:

>>> with open("sparkles.txt", encoding="cp1252") as f:
...     contents = f.read()
...
>>> contents
'Yay unicode! âœ¨'
>>>

We used cp1252 to decode bytes that were encoded using utf-8 and ended up with mojibake.

This is a really common problem between utf-8 (the default encoding on Linux and Mac) and cp1252 (the default encoding on Windows) in particular, because these two character encodings represent ASCII characters with the same bytes but represent every other character differently.
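
The mojibake can be reproduced without touching the filesystem, and in simple cases it can be undone by reversing the mistaken step. This is only a sketch under two assumptions: that the garbled text really came from UTF-8 bytes decoded as cp1252, and that every garbled character maps back to a cp1252 byte. The names garbled and repaired are just for illustration.

```python
text = "Yay unicode! \N{SPARKLES}"

# Encode as UTF-8, then decode with the wrong encoding to produce mojibake.
garbled = text.encode("utf-8").decode("cp1252")
print(garbled)  # Yay unicode! âœ¨

# Reverse the mistake: recover the original bytes, then decode them properly.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == text)  # True
```

Specifying the right encoding when first reading the data is the more reliable fix, since not every mismatch can be reversed this way.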

Summary

When you read a file, Python will read bytes from disk and then decode those bytes to make them into strings.

When you write to a file, Python will take your strings and encode those strings into bytes to write them to disk.

It’s considered a best practice to specify the character encoding that you’re working with whenever you’re reading or writing text from outside of your Python process, especially if you’re working with non-ASCII text.
