Python 3 Unicode annoyances

Every once in a while I feel guilty for not using Python 3, so I spin it up for a few rounds. My experience is usually:

  1. Start using Python 3
  2. oops, UnicodeDecodeError
  3. Go back to Python 2

Looks like I’m not the only one with this frustration. Knowing when to use encode vs decode was always an exercise in trial and error. There are some good tips in the linked thread, which is worth a thorough read. A useful bit is this comment from redditor Fylwind, part of which is:

  • encode: textual data to binary data.
  • decode: binary data to textual data.

The term “encode” means a transformation from some high-level structure into bytes, hence in the context of strings it means converting text into binary data.
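
To make the direction concrete, here is a minimal Python 3 sketch of the round trip. The sample string and the choice of UTF-8 are my own assumptions for illustration; they are not from the thread.

```python
# Sample text and the UTF-8 codec are illustrative choices.
text = "café"

data = text.encode("utf-8")       # encode: text (str) -> binary (bytes)
restored = data.decode("utf-8")   # decode: binary (bytes) -> text (str)

print(type(data).__name__, data)
print(type(restored).__name__, restored)
```

The same codec has to be used in both directions; decoding with a different one can raise an error or quietly produce the wrong characters.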

Q. What are the appropriate data types for textual data and binary data?

  • In Python 3:
    • Textual data is str, written as "foo".
    • Binary data is bytes, written as b"foo".
    • The encode method only works on textual data, and the decode method only works on binary data.
  • In Python 2:
    • Textual data is unicode, written as u"foo". If unicode_literals is enabled, then it’s "foo".
    • Binary data is str (alias: bytes), written as "foo". If unicode_literals is enabled, then it’s b"foo".
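
A rough Python 3 sketch of how strict that split is, and of where the dreaded UnicodeDecodeError tends to come from. The variable names and the sample bytes are assumptions of mine, not part of the quoted comment.

```python
text = "foo"     # str: textual data
data = b"foo"    # bytes: binary data

# In Python 3 each type only carries the method that makes sense for it.
print(hasattr(text, "decode"))  # str has encode but no decode
print(hasattr(data, "encode"))  # bytes has decode but no encode

# Mixing the two types raises an error instead of converting implicitly.
mixing_error = None
try:
    text + data
except TypeError as exc:
    mixing_error = exc

# Decoding bytes with a codec that cannot interpret them raises UnicodeDecodeError.
decode_error = None
try:
    b"caf\xc3\xa9".decode("ascii")  # these bytes are UTF-8, not ASCII
except UnicodeDecodeError as exc:
    decode_error = exc

print(type(mixing_error).__name__)
print(type(decode_error).__name__)
```

In my experience, the usual way out is to decode bytes once at the edge of the program, with the codec the data was actually written in, and keep everything as str after that.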
