Coarse Programmer's guide to Python's stringy bits

July 5, 2020

If you program in Python you probably deal with strings a lot. It's one of the areas where the language excels. Writing apps for journalists, I'm always splitting, slicing, iterating on words, checking if in. But you're also painfully aware that string and "bit of text" are not equivalent. Sometimes you try a string method and find out you're dealing with bytes. Sometimes you've got Unicode on your hands, sometimes it's ASCII. If it's Unicode it might be UTF-8, other times UTF-16. You know you need to decode something to get another something, but what is the parameter... and wait, is it encode or decode? Me, I usually just try a bunch of methods or copy something from Stack Overflow until it works and then forget about it until the next time. That's the Coarse Programmer™ way.

But no more! Today I'm going to find out exactly what a string is, how it's different from bytes, and what Unicode really means. When I'm done, I'll report back here.

Ok, got it. It wasn't hard, and it's clear I'm not the first person with this problem. There's a wealth of great resources, not only technical, but full of curiosities about language (as in human language) and some of them are quite funny. If you're really interested in this subject, you should read them instead. If you just want a breezy summary, stick with me.

Unicode & Character Encodings in Python: A Painless Guide (Real Python)
UTF-8 (Wikipedia)
Strings, Unicode, and Bytes in Python 3: Everything You Always Wanted to Know (Andrea Colangelo)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (Joel Spolsky)
What every programmer absolutely, positively needs to know about encodings and character sets to work with text (David C. Zentgraf)
Python built-in types: str (Python docs)
Python built-in types: bytes (Python docs)

Bits of String

The heart of my confusion was always the difference between the str and bytes type in Python 3. (It's a different matter in Python 2 but if you're using Python 2, you've been around long enough to have mastered this subject.) They look almost the same. Seems like the only difference is you put a little "b" in front of a bytes literal, and you get one back when you output to the terminal.

s = "This is a sentence."
b = b"This is a sentence."

print(s)
# >> This is a sentence.

print(b)
# >> b'This is a sentence.'

They've got many of the same methods, most of which you'd typically associate with manipulating text, e.g. split, upper, zfill and so on.

s.split()
# >> ['This', 'is', 'a', 'sentence.']

b.split()
# >> [b'This', b'is', b'a', b'sentence.']

So what's the difference?

Bytes

Let's start with bytes. Under the hood, the bytes object is quite simple. It is just an array of 8-bit bytes; that is, integers with a value between 0 and 255. When output to the terminal, most values, including all the values greater than 127, are displayed as hexadecimal, a base-16 shorthand for writing binary numbers. You can read more about hexadecimal here if you need to. Here is how hexadecimals are written as literals in Python, along with their decimal and binary equivalents.

\x8c	140	10001100
\x8d	141	10001101
\x8e	142	10001110
\x8f	143	10001111
\x90	144	10010000
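If you want to check the table yourself, here is a small sketch that prints each value in the three notations. The loop and the `rows` list are just my own scaffolding for illustration.

```python
# Print the values 140-144 as a hex escape, a decimal and 8-bit binary.
rows = []
for n in range(140, 145):
    row = f"\\x{n:02x}\t{n}\t{n:08b}"
    rows.append(row)
    print(row)

# A hex integer literal and a decimal literal are the same number.
print(0x8c == 140)
# >> True
```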

Not all the values of a bytes object are displayed as hexadecimal, however. The majority of the values less than 128 are displayed as single ASCII characters. ASCII is a character encoding, a big look-up table for converting integers to characters and back again. ASCII includes all the unaccented letters in the Latin alphabet, Arabic numerals, plus a handful of punctuation symbols. It also includes some keyboard commands that are not printable, such as backspace and enter. If a value in a bytes object coincides with a visible character in ASCII, that is how it will appear in the terminal.

0	-	16	-	32	-	48	0	64	@	80	P	96	`	112	p
1	-	17	-	33	!	49	1	65	A	81	Q	97	a	113	q
2	-	18	-	34	"	50	2	66	B	82	R	98	b	114	r
3	-	19	-	35	#	51	3	67	C	83	S	99	c	115	s
4	-	20	-	36	$	52	4	68	D	84	T	100	d	116	t
5	-	21	-	37	%	53	5	69	E	85	U	101	e	117	u
6	-	22	-	38	&	54	6	70	F	86	V	102	f	118	v
7	-	23	-	39	'	55	7	71	G	87	W	103	g	119	w
8	-	24	-	40	(	56	8	72	H	88	X	104	h	120	x
9	-	25	-	41	)	57	9	73	I	89	Y	105	i	121	y
10	-	26	-	42	*	58	:	74	J	90	Z	106	j	122	z
11	-	27	-	43	+	59	;	75	K	91	[	107	k	123	{
12	-	28	-	44	,	60	<	76	L	92	\	108	l	124	|
13	-	29	-	45	-	61	=	77	M	93	]	109	m	125	}
14	-	30	-	46	.	62	>	78	N	94	^	110	n	126	~
15	-	31	-	47	/	63	?	79	O	95	_	111	o	127
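Here is a quick sketch of that display rule in action. The four values are ones I picked for illustration: two that land on visible ASCII characters and two that don't.

```python
# Values that match a visible ASCII character are shown as that character;
# the others are shown as \x?? escapes.
data = bytes([80, 121, 0, 200])
print(data)
# >> b'Py\x00\xc8'

# The escape and the character are two spellings of the same byte.
print(b"\x41" == b"A")
# >> True
```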

ASCII is only one of many encodings that have existed over the years, but for decades it was the most common. It is still necessary to support ASCII in places so it is useful that the bytes object can be used to represent ASCII natively. However, there's nothing inherently "charactery" or textual about the lower range of integers in a bytes object. In fact, nothing says a bytes object has to represent text at all. It could be any kind of data, such as a jpg or mp3. For example, here's an audio file opened in binary format.

f = open("urbanization.mp3", "rb")
f.readline()
# >> b'...\xbf0*\xb8R\x86\xd9l6\xec\xa1y&D:\xf815B\xfb\x99\x08<\x18\xe2\xc9vZ\x1b\x94\x9d.\x8b\xf8\xff\xc4c\xad\xcaF\xc8\x0bd^\xa5\x87\x82\x1cy3\xfa\xfe)\x93ai\xb1\xa8j-\x19z...'

Notice the output has that little "b" in front of the line. Let's take the first bit of that line and break it down into individual bytes. Anything with the format \x?? is hexadecimal, everything else is ASCII. Of course to the computer they're all just numbers.

what you see	\xbf	0	*	\xb8	R	\x86	\xd9	l	6
what the computer sees	191	48	42	184	82	134	217	108	54
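You don't have to do that conversion by hand. As a sketch, passing the same nine bytes to list() gives you the computer's-eye view directly (the names `excerpt` and `values` are mine).

```python
# The first nine bytes of the excerpt, written as a bytes literal.
excerpt = b"\xbf0*\xb8R\x86\xd9l6"

# Iterating over (or indexing into) a bytes object yields plain integers.
values = list(excerpt)
print(values)
# >> [191, 48, 42, 184, 82, 134, 217, 108, 54]
print(excerpt[0])
# >> 191
```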

The audio is a documentary about Haiti though you probably can't tell from such a short excerpt.

Strings

Strings are very similar to bytes in that they too are sequences of integers, but in this case, the value of each "slot" in the sequence can range from 0 to 1,114,111. In this way, strings are able to represent many, many more characters than the ASCII set. In fact, Python 3 strings are designed explicitly to represent every character in Unicode. Like ASCII, Unicode is a look-up table to convert integers to characters but it is huge. It covers almost every symbol in almost every language on Earth (though not, as is sometimes claimed, Klingon) and every emoji you may want to use, and some you may not. Even more remarkable, to me anyway, is that every one of those symbols has a unique name that somebody has had to sit down and come up with.

LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE	ǻ
WHITE-FEATHERED RIGHTWARDS ARROW	➳
HEAVY TEARDROP-SPOKED PINWHEEL ASTERISK	❃
SHOCKED FACE WITH EXPLODING HEAD	🤯
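Python will look these names up for you with the standard library's unicodedata module. This is only a sketch: I've written the three characters as escapes so the example doesn't depend on how your editor saves them, and the `names` dictionary is my own invention.

```python
import unicodedata

# The first three characters from the list above: ǻ, ➳ and ❃.
names = {}
for ch in "\u01fb\u27b3\u2743":
    names[f"U+{ord(ch):04X}"] = unicodedata.name(ch)

for code_point, name in names.items():
    print(code_point, name)
# >> U+01FB LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
# >> U+27B3 WHITE-FEATHERED RIGHTWARDS ARROW
# >> U+2743 HEAVY TEARDROP-SPOKED PINWHEEL ASTERISK

# lookup() goes the other way, from a name to a character.
arrow = unicodedata.lookup("WHITE-FEATHERED RIGHTWARDS ARROW")
print(arrow == "\u27b3")
# >> True
```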

Exotica aside, Unicode also covers all the ASCII characters. In fact the first 128 characters of Unicode (code points 0 to 127) are the exact same as the ASCII set. That is why a string can look the same as a bytes object in the terminal, as we saw above. Throw in a few diacriticals and a pile of poo, on the other hand, and the bytes object breaks down crying.

b = b"Peña Nieto olía a 💩"
# >>   File "<stdin>", line 1
# >>    SyntaxError: bytes can only contain ASCII literal characters.
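The way to get those characters into a bytes object is to start from a string and encode it, which is where the next section is headed. A minimal sketch, assuming UTF-8 as the encoding (the variable names are mine):

```python
text = "Peña Nieto olía a 💩"
encoded = text.encode("utf-8")

print(encoded)
# >> b'Pe\xc3\xb1a Nieto ol\xc3\xada a \xf0\x9f\x92\xa9'

# The string has 19 characters, but three of them need more than one byte.
print(len(text), len(encoded))
# >> 19 24
```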

Encoding and Decoding

If strings are so powerful, what's the point of bytes objects? It's because strings have an idiosyncratic way of storing their contents in memory designed specifically for handling the big range of values possible in Unicode. As we will see, it takes some finagling to interpret them. Bytes objects, on the other hand, are just sequences of single, 8-bit bytes strung together. They have exactly the same shape as binary data stored on disc, sent over the internet, received from a media device or through any other form of IO. Any data that comes into your program is going to come in as bytes. Python might turn it into a string for you or you might have to do it yourself, but either way, you're going to have to decode it. And when you send it back out into the world through some IO channel, you'll have to encode it again.
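To make that round trip concrete, here is a sketch that pushes a string out to disk and pulls it back in two ways. The scratch file comes from the tempfile module and the sample word is arbitrary. Nothing here is specific to files, they are just the easiest IO channel to demonstrate.

```python
import os
import tempfile

text = "mañana"

# Write the text to a scratch file, encoding it explicitly on the way out.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

# Binary mode hands back the raw bytes exactly as they sit on disk.
with open(path, "rb") as f:
    raw = f.read()
print(raw)
# >> b'ma\xc3\xb1ana'

# Text mode decodes those bytes back into a string for us.
with open(path, "r", encoding="utf-8") as f:
    decoded = f.read()
print(decoded == text)
# >> True

os.remove(path)
```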

Decoding with ASCII is easy because there is a one-to-one correspondence between a character code and the byte of information that it expresses. With Unicode it is not so simple because Unicode codes can be stored in memory in more than one way. To see why, let's take characters from either end of the table: "Ñ", which is the 209th character, and "🎇", which is the 127,879th. Those ordinal numbers are called "code points" in Unicode and they are usually expressed in hexadecimal with a "U+" at the beginning. This table shows the characters' code points and the hexadecimal values converted to binary, which is how the computer sees them.

Unicode code point	Binary	Name	Symbol
U+00D1	00000000 00000000 11010001	Latin Capital Letter N with Tilde	Ñ
U+1F387	00000001 11110011 10000111	Firework Sparkler Emoji	🎇
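The numbers in the table can be reproduced with ord() and a bit of string formatting. In this sketch the two characters are written as escapes, and padding the binary to 24 digits is my choice to match the three bytes shown above.

```python
rows = []
for ch in ("\u00d1", "\U0001F387"):  # Ñ and 🎇
    n = ord(ch)  # the code point as a plain integer
    rows.append((n, f"U+{n:04X}", f"{n:024b}"))
    print(*rows[-1])
# >> 209 U+00D1 000000000000000011010001
# >> 127879 U+1F387 000000011111001110000111
```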

Looking at the binary, you can see that to express the 🎇 you need three bytes. Same goes for all the other characters at the high end of the Unicode table. To express Ñ you could be consistent and use three bytes as well, but you're going to be dragging around a lot of zeroes on the left. Same goes for all the other characters at the low end of the table, including most of the Latin alphabet. If you wrote a book in English and stored every character this way, you'd be taking up acres of memory with unused zeroes.

To get around this problem, there are several different implementations of Unicode. When people talk about a Unicode encoding, they are referring to these implementations rather than the giant look-up table itself. The encodings have different strategies. The UTF-32 encoding just throws up its hands and saves everything with four bytes, redundant zeroes be damned. But UTF-8, which is far and away the most common encoding today, has a clever way of storing code points using a variable number of bytes, depending on which end of the table a character resides.
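You can see the different strategies by encoding the same characters several ways and counting the bytes. This is a sketch: I've used the "-le" (little-endian) codec names so Python doesn't prepend a byte order mark, which would inflate the counts, and the `sizes` dictionary is just for illustration.

```python
encodings = ("utf-8", "utf-16-le", "utf-32-le")

sizes = {}
for ch in ("a", "\u00d1", "\U0001F387"):  # a, Ñ and 🎇
    label = f"U+{ord(ch):04X}"
    # Number of bytes each encoding needs for this one character.
    sizes[label] = [len(ch.encode(enc)) for enc in encodings]
    print(label, sizes[label])
# >> U+0061 [1, 2, 4]
# >> U+00D1 [2, 2, 4]
# >> U+1F387 [4, 4, 4]
```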

Code point range: hex (decimal)	Byte 1	Byte 2	Byte 3	Byte 4
U+0000 (0) to U+007F (127)	0xxxxxxx
U+0080 (128) to U+07FF (2047)	110xxxxx	10xxxxxx
U+0800 (2048) to U+FFFF (65535)	1110xxxx	10xxxxxx	10xxxxxx
U+10000 (65536) to U+10FFFF (1114111)	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

Source: https://en.wikipedia.org/wiki/UTF-8

This table shows how many bytes are required for any given range of code points. The lowest, those between 0 and 127, only need one byte, and that one byte is distinguished by the fact it begins with a 0. In that way it is compatible with ASCII values and cannot be mistaken for part of a multibyte value. The higher values are composed of sequences of bytes. The first byte of a two-byte sequence begins with 110, the first byte of a three-byte sequence with 1110 and a four-byte sequence, 11110. The continuation bytes that follow on in the sequence all begin with a 10. These leading bit signatures make the role of each byte unambiguous. The remaining bits (the xxx's) are the actual data, which are concatenated together to yield the code point. Let's see how that works from the computer's point of view using this secret message encoded in UTF-8.

b = b'\xe2\x99\xbb\xf0\x9f\x92\x9c\xc3\xb1\xf0\x9f\xa6\xa1'

To a human, there's no obvious way to parse this array. Is the first character a single byte of "\xe2" followed by a two byte character of "\x99\xbb" or perhaps it's one four byte character of "\xe2\x99\xbb\xf0"... or what? You might think you can combine any set of adjacent bytes. But no!

d = b'\xe2'
d.decode()
# >> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data

d = b'\xe2\x99\xbb\xf0'
d.decode()
# >> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 3: unexpected end of data
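If you'd rather poke at different groupings without a traceback each time, a tiny helper does the job. Note that try_decode is a hypothetical function of my own, not something built into Python.

```python
def try_decode(data):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

# A lead byte with no continuation bytes after it.
print(try_decode(b"\xe2") is None)
# >> True
# A complete sequence followed by a dangling lead byte.
print(try_decode(b"\xe2\x99\xbb\xf0") is None)
# >> True
# Continuation bytes with no lead byte in front of them.
print(try_decode(b"\x99\xbb") is None)
# >> True
# One lead byte plus exactly the two continuation bytes it asks for.
print(try_decode(b"\xe2\x99\xbb") is None)
# >> False
```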

Looks like any collection of hexadecimal integers can make a valid bytes object, but not necessarily a valid Unicode string. To see which groups of bytes form valid sequences, let's convert that same bytes object into binary.

for i in b:
    print(bin(i), end=", ")
# >> 0b11100010, 0b10011001, 0b10111011, 0b11110000, 0b10011111, 0b10010010, 0b10011100, 0b11000011, 0b10110001, 0b11110000, 0b10011111, 0b10100110, 0b10100001

Now, let's decode it following the UTF-8 rules. First we look at the leading bits of each byte to determine its role, whether that's the start of a multibyte sequence, a continuation of a multibyte sequence or a single-byte code. We drop off the leading bits and keep the remaining bits as data. For each sequence we concatenate all those bits of data into one long binary number, convert that number to hex and prepend our U+ to get a Unicode code point. Then we use the Unicode lookup-table to figure out what character it stands for.

byte	role	data	code point (binary)	code point (unicode)	character
11100010	3 byte sequence	0010	00100110 01111011	U+267B	♻
10011001	continuation	011001
10111011	continuation	111011

11110000	4 byte sequence	000	00001 11110100 10011100	U+1F49C	💜
10011111	continuation	011111
10010010	continuation	010010
10011100	continuation	011100

11000011	2 byte sequence	00011	000 11110001	U+00F1	ñ
10110001	continuation	110001

11110000	4 byte sequence	000	00001 11111001 10100001	U+1F9A1	🦡
10011111	continuation	011111
10100110	continuation	100110
10100001	continuation	100001
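The same walk-through can be sketched in code. To be clear, decode_utf8 is a hypothetical teaching function of mine. It assumes well-formed input and skips the validity checks that the real bytes.decode() performs, so it is not how Python does it internally.

```python
def decode_utf8(data):
    """Turn UTF-8 bytes into a list of code points, by hand."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        if first >> 7 == 0b0:          # 0xxxxxxx: single byte
            length, value = 1, first
        elif first >> 5 == 0b110:      # 110xxxxx: start of a 2-byte sequence
            length, value = 2, first & 0b00011111
        elif first >> 4 == 0b1110:     # 1110xxxx: start of a 3-byte sequence
            length, value = 3, first & 0b00001111
        elif first >> 3 == 0b11110:    # 11110xxx: start of a 4-byte sequence
            length, value = 4, first & 0b00000111
        else:
            raise ValueError(f"byte at position {i} cannot start a sequence")
        # Each continuation byte (10xxxxxx) contributes its low six bits.
        for byte in data[i + 1 : i + length]:
            value = (value << 6) | (byte & 0b00111111)
        code_points.append(value)
        i += length
    return code_points

message = b"\xe2\x99\xbb\xf0\x9f\x92\x9c\xc3\xb1\xf0\x9f\xa6\xa1"
points = decode_utf8(message)
print([f"U+{p:04X}" for p in points])
# >> ['U+267B', 'U+1F49C', 'U+00F1', 'U+1F9A1']

# chr() is the lookup-table step; the result matches the built-in decoder.
print("".join(chr(p) for p in points) == message.decode("utf-8"))
# >> True
```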

And there's our secret message: ♻💜ñ🦡!

"Recycling Heart ñ Badger. Pass it on."

After having done all of that, we can see there is a lot going on between a bytes object and a Python string. They may look alike, but the interpreter has to do a lot of work to turn UTF-8 bytes into characters. That work also explains why slicing behaves differently. A bytes object can be split right at any numeric index, but with UTF-8 data there's no guarantee those indices line up with the beginnings of multibyte sequences, so you can end up cutting a character in half. A string is indexed by code point instead, so a slice always lands between characters. (CPython, since version 3.3, manages this by storing each decoded string at a fixed width of 1, 2 or 4 bytes per code point, so it doesn't need to count over variable-length sequences to find the slice point.)
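A quick sketch shows the difference between slicing a string and slicing its UTF-8 bytes. The word is arbitrary, any text with a non-ASCII character in it would do.

```python
s = "mañana"
b = s.encode("utf-8")

# Six characters, but seven bytes, because ñ takes two.
print(len(s), len(b))
# >> 6 7

# Slicing the string counts characters.
print(s[:3] == "mañ")
# >> True

# Slicing the bytes counts bytes, and here it cuts the ñ in half.
print(b[:3])
# >> b'ma\xc3'

try:
    b[:3].decode("utf-8")
except UnicodeDecodeError as err:
    print(err.reason)
# >> unexpected end of data
```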

Loose ends (of string. Get it?)

Usually, when you see example Python code, you see the encode() and decode() methods being given a parameter, more often than not "utf-8".

b = b'ma\xc3\xb1ana'
b.decode("utf-8")
# >> 'mañana'

The encode() and decode() methods default to UTF-8, which is why I haven't included the parameter in this article. UTF-8 accounts for something like 95% of encodings on the web, and is the de facto standard elsewhere as well. However, other encodings do exist and you may need to use them sometimes. The Real Python article cited above recommends you always explicitly include the parameter, if for no other reason than as a reminder not to take encodings for granted. The 5% of the time you hit another encoding will cause you much grief! The author mentions at least one common scenario where this can happen... using the popular requests library. And now that I think about it, I believe it was exactly that scenario that led me to write this article.
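Here's a sketch of the kind of grief I mean. I've picked Latin-1 as the "wrong" encoding purely for illustration. It happens to map every possible byte to some character, so decoding never raises an error, it just quietly gives you the wrong text.

```python
raw = b"ma\xc3\xb1ana"

# Decoded with the encoding it was written in, the text comes out right.
print(raw.decode("utf-8") == "mañana")
# >> True

# Decoded as Latin-1, each of the two ñ bytes becomes its own character.
wrong = raw.decode("latin-1")
print(wrong == "maÃ±ana", len(wrong))
# >> True 7
```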

Conrad Fox
