Python Concepts/Bytes objects and Bytearrays

Objective

 * What is a bytes object?
 * Why is a bytes object important?
 * What is the difference between a bytes object and a bytearray?
 * How is a bytes object created and used?
 * How do you convert from a bytes object to other sequences based on bytes?
 * How do you avoid errors when using bytes objects or bytearrays?

Lesson
One byte is a memory location with a size of 8 bits. A bytes object is an immutable sequence of bytes, conceptually similar to a string.

Because each byte must fit into 8 bits, each member $$x$$ of a bytes object is an unsigned int that satisfies $$0 \le x \le 0b1111\_1111.$$

The bytes object is important because data written to disk is written as a stream of bytes, and because integers and strings are sequences of bytes. How the sequence of bytes is interpreted or displayed makes it an integer or a string.

bytes object displayed
A bytes object is displayed as a sequence of bytes between quotes and preceded by 'b' (a literal may be written with the prefix 'b' or 'B'):

The representation '\x00' is not read literally. This representation means a byte with value 0x00.

If a member of a bytes object can be displayed as a printable ASCII character, then it is so displayed.
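A short sketch of these display rules (the sample values are chosen only for illustration):

```python
# Four bytes: a null byte, two printable ASCII characters, and 0xFF.
b1 = bytes([0x00, 0x41, 0x7A, 0xFF])

# repr() is what the interactive interpreter displays.
shown = repr(b1)
print(shown)    # b'\x00Az\xff'
print(len(b1))  # 4
```

The four characters of '\x00' in the display stand for a single byte.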

When you look at the contents of a bytes object, it is easy to overlook embedded ASCII characters:

Parts of a bytes object:
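As an illustration (sample bytes chosen arbitrarily), embedded ASCII characters blend into the display, and indexing and slicing return different types:

```python
b1 = b'\x41\x42\x00\x43\x0a\x7e'

# 0x41, 0x42, 0x43 and 0x7e are printable, so they appear as characters.
shown = repr(b1)
print(shown)       # b'AB\x00C\n~'

first = b1[0]      # indexing returns an int
part = b1[1:4]     # slicing returns a bytes object
print(first, part) # 65 b'B\x00C'
```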

Some control characters are recognized as such in a literal but not displayed as such:
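A small sketch of that asymmetry, using five common control characters as an example:

```python
# \a and \b are accepted in a literal, but only \t, \n and \r
# are displayed in their short form.
b1 = b'\a\b\t\n\r'

shown = repr(b1)
print(shown)      # b'\x07\x08\t\n\r'
print(list(b1))   # [7, 8, 9, 10, 13]
```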

bytes object initialized
The bytes object can contain recognized control characters:

As with strings, the prefix 'r' or 'R' may be used:
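For instance (an illustrative sketch), the same characters typed with and without the raw prefix give different bytes:

```python
b1 = b'a\tb'    # the escape \t becomes one tab byte
b2 = rb'a\tb'   # raw: the backslash and the 't' are kept as two bytes

print(len(b1), list(b1))   # 3 [97, 9, 98]
print(len(b2), list(b2))   # 4 [97, 92, 116, 98]
```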

A suitable sequence can be converted to bytes:
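A minimal sketch of such conversions; "suitable" here means an iterable of ints in the range 0 to 255:

```python
from_list = bytes([72, 105, 0])
from_range = bytes(range(0x30, 0x3A))
from_tuple = bytes((0xFF, 0x00))

print(from_list)    # b'Hi\x00'
print(from_range)   # b'0123456789'
print(from_tuple)   # b'\xff\x00'

# A value that does not fit into one byte is rejected.
try:
    bytes([256])
    too_big_rejected = False
except ValueError:
    too_big_rejected = True
```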

Like a string, the bytes object doesn't support item assignment:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bytes' object does not support item assignment

Like a string, the bytes object can be repeated:
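Both string-like behaviours can be checked with a short sketch (the sample values are illustrative):

```python
b1 = b'abc'

# Item assignment raises TypeError because bytes objects are immutable.
try:
    b1[0] = 0x41
    assignment_allowed = True
except TypeError:
    assignment_allowed = False

# Repetition builds a new bytes object.
b2 = b'ab' * 3
print(assignment_allowed, b2)   # False b'ababab'
```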

Behavior of  and behavior of   can be significantly different:

The concatenation of 2 or more bytes objects:
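Two common ways to concatenate, shown as a sketch:

```python
# The + operator joins bytes objects left to right.
b1 = b'\x01\x02' + b'AB' + bytes([0xFF])

# bytes.join() is convenient when the pieces are already in a sequence.
b2 = b''.join([b'ab', b'cd', b'ef'])
b3 = b', '.join([b'ab', b'cd'])

print(b1)   # b'\x01\x02AB\xff'
print(b2)   # b'abcdef'
print(b3)   # b'ab, cd'
```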

bytes object as iterable
The bytes object accepts the usual operations over iterables:
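A few of those operations, sketched on an arbitrary sample value. Note that iterating over a bytes object yields ints:

```python
b1 = b'bytes\x00obj'

length = len(b1)                      # 9
has_null = 0 in b1                    # membership test with an int
has_sub = b'obj' in b1                # membership test with a bytes object
values = [byte for byte in b1[:3]]    # [98, 121, 116]
smallest, largest = min(b1), max(b1)  # 0 and 121
ordered = bytes(sorted(b1))           # b'\x00bbejosty'

print(length, has_null, has_sub, values, smallest, largest, ordered)
```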

from int
The following code illustrates the process for a positive integer:

i1 = 0x0, b1 = b'\x00'
i1 = 0x89abcde, b1 = b'\x08\x9a\xbc\xde'
i1 = 0xa89abcde, b1 = b'\x00\xa8\x9a\xbc\xde'
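The listing that produced this output is not shown. The following is a minimal sketch that reproduces it, under the assumption that a leading b'\x00' is added whenever the most significant bit would otherwise be set (so that the bit stays free for a sign); the function name int_to_bytes is hypothetical:

```python
def int_to_bytes(i1):
    """Big-endian bytes for a non-negative int, with the top bit kept clear."""
    if i1 < 0:
        raise ValueError('only non-negative integers are handled here')
    digits = format(i1, 'x')
    if len(digits) % 2:          # fromhex() needs two hex digits per byte
        digits = '0' + digits
    b1 = bytes.fromhex(digits)
    if b1[0] & 0x80:             # assumption: keep the top bit clear
        b1 = b'\x00' + b1
    return b1

for i1 in (0x0, 0x89abcde, 0xa89abcde):
    print('i1 = {}, b1 = {}'.format(hex(i1), int_to_bytes(i1)))
```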

Method int.to_bytes(length, ....)
Method int.to_bytes(length, byteorder, *, signed=False) returns a bytes object representing an integer.
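A brief sketch of the three parameters, using 1024 as an arbitrary example value:

```python
big = (1024).to_bytes(2, byteorder='big')        # most significant byte first
little = (1024).to_bytes(2, byteorder='little')  # least significant byte first
negative = (-1024).to_bytes(2, byteorder='big', signed=True)

print(big, little, negative)   # b'\x04\x00' b'\x00\x04' b'\xfc\x00'

# If length is too small for the value, OverflowError is raised.
try:
    (256).to_bytes(1, byteorder='big')
    fits = True
except OverflowError:
    fits = False
```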

from str
If each character of the string fits into one byte, the process is simple:

A listcomp simplifies the process:

The above implements encoding 'Latin-1':
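A sketch of the list comprehension approach, with a sample string whose code points are all below 256, compared against the 'latin-1' codec:

```python
s1 = 'caf\u00e9 123'   # 'cafe' with an acute accent on the e, then ' 123'

# ord() gives the code point; each one fits into a single byte here.
b1 = bytes([ord(c) for c in s1])

print(b1)                           # b'caf\xe9 123'
print(b1 == s1.encode('latin-1'))   # True
```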

Method str.encode(encoding='utf-8', errors='strict') creates a bytes object containing the encoded string:
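A short sketch of encoding one sample string with different codecs and error handlers:

```python
s1 = 'caf\u00e9'

as_utf8 = s1.encode()             # default encoding is 'utf-8'
as_latin1 = s1.encode('latin-1')
print(as_utf8, as_latin1)         # b'caf\xc3\xa9' b'caf\xe9'

# 'ascii' cannot represent the accented character.
try:
    s1.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

replaced = s1.encode('ascii', errors='replace')   # b'caf?'
```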

from str containing international text
Each Greek character has a code point that needs 2 bytes (for example ord('π') = 0x3C0), and 'utf-8' encodes it as 2 bytes. Note for example:
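A sketch using the five-letter Greek word that also appears in the disk file example later in this lesson. The word is written with escapes so that the code points are visible:

```python
s1 = '\u03c0\u03b1\u03c1\u03cc\u03bd'   # five Greek letters

b1 = s1.encode('utf-8')
print(len(s1), len(b1))   # 5 10
print(b1)                 # b'\xcf\x80\xce\xb1\xcf\x81\xcf\x8c\xce\xbd'

# A single-byte Greek code page needs only one byte per letter.
b2 = s1.encode('iso-8859-7')
print(len(b2))            # 5
```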

from str containing hexadecimal digits
classmethod bytes.fromhex(string) returns a bytes object, decoding the given string object:

classmethod bytes.fromhex(string) can be used to convert from positive int to bytes object:
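A sketch of both uses. The helper name int_to_bytes_via_hex is hypothetical, and it assumes a non-negative argument:

```python
# Spaces between byte pairs are ignored; upper and lower case are accepted.
b1 = bytes.fromhex('2Ef0 F1f2')
print(b1)   # b'.\xf0\xf1\xf2'

def int_to_bytes_via_hex(i1):
    """Convert a non-negative int to bytes by way of its hex digits."""
    digits = format(i1, 'x')
    if len(digits) % 2:        # fromhex() needs two hex digits per byte
        digits = '0' + digits
    return bytes.fromhex(digits)

print(int_to_bytes_via_hex(0x13d8))   # b'\x13\xd8'
print(int_to_bytes_via_hex(0))        # b'\x00'
```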

Some technical information about encoding standard 'utf-8'
Strings encoded according to encoding standard 'utf-8' conform to the following table:
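The table itself is not reproduced here. As a sketch of the byte layout that 'utf-8' uses, the comments below give the bit pattern and code point range for each sequence length, and the hypothetical helper utf8_sequence_length reads the length from the marker bits of the first byte:

```python
def utf8_sequence_length(first_byte):
    """Length of a UTF-8 sequence, judged from the marker bits of its first byte."""
    if first_byte & 0b1000_0000 == 0b0000_0000:
        return 1   # 0xxxxxxx                             U+0000  .. U+007F
    if first_byte & 0b1110_0000 == 0b1100_0000:
        return 2   # 110xxxxx 10xxxxxx                    U+0080  .. U+07FF
    if first_byte & 0b1111_0000 == 0b1110_0000:
        return 3   # 1110xxxx 10xxxxxx 10xxxxxx           U+0800  .. U+FFFF
    if first_byte & 0b1111_1000 == 0b1111_0000:
        return 4   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  U+10000 .. U+10FFFF
    raise ValueError('continuation byte (10xxxxxx) or invalid first byte')

# Check the lowest and highest code point of each range.
ranges = [(1, 0x00, 0x7F), (2, 0x80, 0x7FF), (3, 0x800, 0xFFFF), (4, 0x10000, 0x10FFFF)]
for n, low, high in ranges:
    for code_point in (low, high):
        encoded = chr(code_point).encode('utf-8')
        assert len(encoded) == n == utf8_sequence_length(encoded[0])
```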

Encoding standard 'utf-8' is a good choice for default encoding because:


 * 2, 3 or 4 bytes are used only if necessary,
 * it doesn't depend on byte ordering, big or little, and
 * arbitrary binary data is not likely to conform to the above specification.

Examples of characters encoded with 'utf-8'
The following code examines chr(0x10006), encoded in 4 bytes:

c1 = 𐀆
ord(c1) = 0x10006
c1_encoded = b'\xf0\x90\x80\x86'
['0xf0', '0x90', '0x80', '0x86'] # each byte of c1_encoded

The marker bits:
c1_encoded[0] & 0b11111_000 == 0b11110_000 : True
c1_encoded[1] & 0b11_000000 == 0b10_000000 : True
c1_encoded[2] & 0b11_000000 == 0b10_000000 : True
c1_encoded[3] & 0b11_000000 == 0b10_000000 : True

The payload bits:
payload[0] = c1_encoded[0] & 0x07 = 0xf0 & 0x07 = 0x0
payload[1] = c1_encoded[1] & 0x3F = 0x90 & 0x3F = 0x10
payload[2] = c1_encoded[2] & 0x3F = 0x80 & 0x3F = 0x0
payload[3] = c1_encoded[3] & 0x3F = 0x86 & 0x3F = 0x6

Building c1:
i1 = payload[3] + (payload[2] << 6) + (payload[1] << 12) + (payload[0] << 18) = 0x10006
i1 == ord(c1) : True
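The steps above can be sketched as follows. This is a reconstruction that follows the printed results, not the original listing:

```python
c1 = chr(0x10006)
c1_encoded = c1.encode('utf-8')            # b'\xf0\x90\x80\x86'

# Marker bits: the first byte starts with 11110, the rest start with 10.
assert c1_encoded[0] & 0b11111_000 == 0b11110_000
assert all(byte & 0b11_000000 == 0b10_000000 for byte in c1_encoded[1:])

# Payload bits: 3 bits from the first byte, 6 bits from each following byte.
payload = [c1_encoded[0] & 0x07] + [byte & 0x3F for byte in c1_encoded[1:]]

# Rebuild the code point from the 21 payload bits.
i1 = payload[3] + (payload[2] << 6) + (payload[1] << 12) + (payload[0] << 18)
print(hex(i1), i1 == ord(c1))   # 0x10006 True
```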

Theoretically 21 payload bits can contain '\U001FFFFF' but the standard stops at '\U0010FFFF':
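A sketch of what happens at that limit:

```python
# The highest code point allowed by the standard encodes normally.
top = chr(0x10FFFF).encode('utf-8')
print(top)   # b'\xf4\x8f\xbf\xbf'

# One past the limit is rejected by chr() ...
try:
    chr(0x110000)
    chr_ok = True
except ValueError:
    chr_ok = False

# ... and the 4-byte pattern that would represent it cannot be decoded.
try:
    b'\xf4\x90\x80\x80'.decode('utf-8')
    decode_ok = True
except UnicodeDecodeError:
    decode_ok = False
```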

A disadvantage of 'utf-8'

A bytes object produced with encoding 'utf-8' contains the null byte b'\x00' whenever the string contains the character '\x00'. This could cause a problem if you are sending a stream of bytes through a filter that interprets b'\x00' as end of data. Standard 'utf-8' never produces b'\xFF'. If your bytes object must not contain b'\x00' after encoding, you could convert the null byte to b'\xFF', then convert b'\xFF' to b'\x00' before decoding:
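The swap described above might look like this; the function names hide_nulls and restore_nulls are hypothetical:

```python
def hide_nulls(encoded):
    """Replace each null byte with 0xFF, which 'utf-8' never produces."""
    return encoded.replace(b'\x00', b'\xff')

def restore_nulls(filtered):
    """Undo hide_nulls() before decoding."""
    return filtered.replace(b'\xff', b'\x00')

text = 'ab\x00cd'
safe = hide_nulls(text.encode('utf-8'))
print(safe)                                          # b'ab\xffcd'
print(restore_nulls(safe).decode('utf-8') == text)   # True
```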

to int
The following code illustrates the process for a positive integer:

b1 = b'', i1 = 0x0
b1 = b'\x00\x00\x00', i1 = 0x0
b1 = b'\x13\xd8', i1 = 0x13d8
b1 = b'\x00\xf7\x14', i1 = 0xf714
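The listing that produced this output is not shown. A minimal sketch that reproduces it, assuming big-endian byte order (the function name bytes_to_int is hypothetical):

```python
def bytes_to_int(b1):
    """Interpret b1 as a big-endian, non-negative integer."""
    i1 = 0
    for byte in b1:
        i1 = (i1 << 8) | byte   # make room for 8 more bits, then add them
    return i1

for b1 in (b'', b'\x00\x00\x00', b'\x13\xd8', b'\x00\xf7\x14'):
    print('b1 = {}, i1 = {}'.format(b1, hex(bytes_to_int(b1))))
```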

Class method int.from_bytes(bytes, ....)
Class method int.from_bytes(bytes, byteorder, *, signed=False) simplifies the conversion from bytes to int:
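A brief sketch of the effect of byteorder and signed on the same two bytes:

```python
big = int.from_bytes(b'\x04\x00', byteorder='big')
little = int.from_bytes(b'\x04\x00', byteorder='little')
print(big, little)   # 1024 4

# The same bytes read as unsigned and as signed (two's complement).
unsigned = int.from_bytes(b'\xfc\x00', byteorder='big')
signed = int.from_bytes(b'\xfc\x00', byteorder='big', signed=True)
print(unsigned, signed)   # 64512 -1024
```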

The following code ensures that the integer produced after encoding and decoding is the same as the original int:
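One way to get a reliable round trip, sketched under the assumption that both directions use big-endian order and signed=True; the function name round_trip is hypothetical:

```python
def round_trip(i1):
    """Encode i1 to bytes and decode it again; the result should equal i1."""
    # One extra bit leaves room for the sign, so negative values also fit.
    length = (i1.bit_length() + 8) // 8
    b1 = i1.to_bytes(length, byteorder='big', signed=True)
    return int.from_bytes(b1, byteorder='big', signed=True)

samples = [0, 1, 127, 128, 255, 256, -1, -128, -129, 0xa89abcde, -0xa89abcde]
print(all(round_trip(i1) == i1 for i1 in samples))   # True
```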

to str
If the bytes object contains only characters that fit into one byte:

Method bytes.decode(encoding='utf-8', errors='strict') creates a string representing the decoded bytes object:

It is important to use the correct decoding:
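A sketch of the one-byte-per-character case, followed by what happens when the wrong codec is chosen (the sample bytes are illustrative):

```python
b1 = b'caf\xe9'

# Each byte is taken as one code point, which is what 'latin-1' does.
by_hand = ''.join(chr(byte) for byte in b1)
decoded = b1.decode('latin-1')
print(by_hand == decoded)   # True

# The same bytes are not valid 'utf-8'.
try:
    b1.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```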

to str containing international text
It is possible to produce different results depending on encoding/decoding:
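For example (a sketch using the two bytes that 'utf-8' produces for the Greek letter pi), no error is raised by either codec, yet the results differ:

```python
b1 = b'\xcf\x80'

as_utf8 = b1.decode('utf-8')       # one character: U+03C0
as_latin1 = b1.decode('latin-1')   # two characters: U+00CF and U+0080

print(len(as_utf8), hex(ord(as_utf8)))             # 1 0x3c0
print(len(as_latin1), [hex(ord(c)) for c in as_latin1])   # 2 ['0xcf', '0x80']
```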

to str containing hexadecimal digits
Method bytes.hex returns a string object containing two hexadecimal digits for each byte in the instance.

Method bytes.hex can be used to convert from bytes object to positive int:
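A sketch of both steps. Note that an empty bytes object gives an empty string, which int() does not accept:

```python
b1 = b'\x00\xf7\x14'

digits = b1.hex()       # two hex digits per byte
i1 = int(digits, 16)    # parse the digits as a base-16 number
print(digits, hex(i1))  # 00f714 0xf714
```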

Operations with methods on bytes objects
Operations on strings usually require str arguments. Similarly, operations on bytes objects usually require bytes arguments. Occasionally, a suitable int may be substituted.

The following methods on bytes are representative of methods described in the reference. All can be used with arbitrary binary data.

bytes.count(sub[, start[, end]]) returns the number of non-overlapping occurrences of subsequence sub in the range [start, end]:
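A sketch of count() with bytes arguments, with an int argument, and with a str argument, which is rejected:

```python
b1 = b'abracadabra\x00\x00'

print(b1.count(b'a'))      # 5
print(b1.count(b'abra'))   # 2
print(b1.count(b'a', 1))   # 4, the search starts at index 1
print(b1.count(0))         # 2, an int in range(256) is accepted

# A str argument is not accepted.
try:
    b1.count('a')
    str_accepted = True
except TypeError:
    str_accepted = False
```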

Creating and using a translation table:
Static method bytes.maketrans(from, to) returns a translation table that maps each byte in from into the byte in the same position in to.

Method bytes.translate(table, delete=bytes(0)) returns a copy of the bytes object where all bytes occurring in the optional argument delete are removed, and the remaining bytes have been mapped through the given translation table, which must be a bytes object of length 256.

To invert the case of all alphabetic characters:
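A minimal sketch, assuming that "alphabetic" means the ASCII letters a-z and A-Z:

```python
import string

lower = string.ascii_lowercase.encode('ascii')
upper = string.ascii_uppercase.encode('ascii')

# Map every lower-case letter to upper case and vice versa.
table = bytes.maketrans(lower + upper, upper + lower)

b1 = b'Hello, World! \x00\xff'
inverted = b1.translate(table)
print(len(table), inverted)   # 256 b'hELLO, wORLD! \x00\xff'
```

Bytes that are not ASCII letters pass through unchanged.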

To delete specified bytes:

Deletion is completed before translation:
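A sketch of deletion alone and of deletion combined with translation (the sample table and data are illustrative):

```python
b1 = b'aabbcc--cab'

# With table=None nothing is translated; the listed bytes are only removed.
deleted_only = b1.translate(None, b'-')
print(deleted_only)   # b'aabbcccab'

# b'a' is deleted first, so it is never mapped to b'x'.
table = bytes.maketrans(b'abc', b'xyz')
both = b1.translate(table, b'a')
print(both)           # b'yyzz--zy'
```

Had translation come first, every b'a' would have become b'x' and nothing would have been deleted.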

bytes objects and disk files
Data is written to disk as a stream of bytes. Therefore the bytes object is ideal for this purpose.

The following code writes a stream of bytes to disk and then reads the data on disk as text.

When reading text, Python automatically decodes the bytes on disk. The default encoding is platform dependent (commonly 'utf-8'), so passing encoding='utf-8' to open() is the safer choice.

$ cat test.py
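The listing of test.py does not appear here. The following is a minimal sketch consistent with the byte count and character counts shown in the transcript below. Assumptions: the five text lines are taken from that transcript, a temporary directory stands in for the working directory, and the encoding is passed explicitly:

```python
import os
import tempfile

lines = [
    'English (cur | prev)\n',
    'Chinese （当前 | 先前）\n',
    'Japanese (最新 | 前)\n',
    'Greek (παρόν | προηγ.)\n',
    'Russian (текущ. | пред.)\n',
]
data = ''.join(lines).encode('utf-8')    # the stream of bytes to write

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'test.bin')
    with open(path, 'wb') as f:          # binary mode: bytes are written as is
        f.write(data)
    size_on_disk = os.path.getsize(path)
    with open(path, 'r', encoding='utf-8') as f:   # text mode: bytes are decoded
        char_counts = [len(line) for line in f]

print(size_on_disk)   # number of bytes in the file
print(char_counts)    # number of characters per line, including '\n'
```

The file holds more bytes than characters because every non-ASCII character occupies 2 or 3 bytes on disk.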

$ python3.6 test.py >test.sout 2>test.serr
$
$ od -t x1 test.bin # The contents of disk file test.bin (edited for clarity):
0000000    E   n   g   l   i   s   h ' '   (   c   u   r ' '   | ' '   p   # English
0000016    r   e   v   )'\n'
                           C   h   i   n   e   s   e ' '  ef  bc  88   # Chinese
0000032   e5  bd  93  e5  89  8d ' '   |  20  e5  85  88  e5  89  8d  ef
0000048   bc  89'\n'
                       J   a   p   a   n   e   s   e ' '   (  e6  9c  80   # Japanese
0000064   e6  96  b0 ' '   | ' '  e5  89  8d   )'\n'
                                                       G   r   e   e   k   # Greek
0000080  ' '   (  cf  80  ce  b1  cf  81  cf  8c  ce  bd ' '   | ' '  cf
0000096   80  cf  81  ce  bf  ce  b7  ce  b3   .   )'\n'
                                                           R   u   s   s   # Russian
0000112    i   a   n ' '   (  d1  82  d0  b5  d0  ba  d1  83  d1  89   .
0000128  ' '   | ' '  d0  bf  d1  80  d0  b5  d0  b4   .   )'\n'
0000142   # Values in left hand column are decimal.
$
$ ls -la test.bin
-rw-r--r-- 1 user  staff  142 Nov 12 08:46 test.bin
$
$ cat test.bin
English (cur | prev)
Chinese （当前 | 先前）
Japanese (最新 | 前)
Greek (παρόν | προηγ.)
Russian (текущ. | пред.)
$
$ cat test.sout
21 English (cur | prev)
18 Chinese （当前 | 先前）# 18 characters including '\n'
18 Japanese (最新 | 前)
23 Greek (παρόν | προηγ.)
25 Russian (текущ. | пред.)
$

bytearrays
The bytearray is a mutable sequence of bytes, similar to the bytes object in that each member of the bytearray fits into one byte, and similar to a list in that the bytearray or any slice of it may be changed dynamically.

bytearray displayed
The bytearray is displayed as a bytes object within parentheses prepended by the word 'bytearray':

Individual member is returned as int:

Slices of bytearray ba1:
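A short sketch of the display, of a single member and of a slice; the name ba1 follows the text and its value is illustrative:

```python
ba1 = bytearray(b'\x00AB\xff')

shown = repr(ba1)
print(shown)     # bytearray(b'\x00AB\xff')

member = ba1[1]  # an int
piece = ba1[1:3] # a new bytearray
print(member, piece)   # 65 bytearray(b'AB')
```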

bytearray initialized
Any bytes object may be converted to a bytearray:

A suitable sequence can be converted to a bytearray:
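A sketch of the common ways to create a bytearray:

```python
from_bytes_object = bytearray(b'abc')
from_list = bytearray([1, 2, 255])
from_range = bytearray(range(3))
zero_filled = bytearray(4)        # an int argument gives that many null bytes

print(from_bytes_object)   # bytearray(b'abc')
print(from_list)           # bytearray(b'\x01\x02\xff')
print(from_range)          # bytearray(b'\x00\x01\x02')
print(zero_filled)         # bytearray(b'\x00\x00\x00\x00')
```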

Concatenation of bytearray and bytes object:
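A sketch showing that the type of the result follows the left operand:

```python
ba1 = bytearray(b'ab')
b1 = b'cd'

left = ba1 + b1    # bytearray on the left: the result is a bytearray
right = b1 + ba1   # bytes on the left: the result is a bytes object

print(left)    # bytearray(b'abcd')
print(right)   # b'cdab'
```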

Because the bytearray is a mutable sequence, the bytearray accepts assignment:
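A sketch of item assignment, slice assignment and list-like growth (sample values are illustrative):

```python
ba1 = bytearray(b'hello world')

ba1[0] = ord('H')        # a single member takes an int in range(256)
ba1[6:11] = b'there!'    # a slice may be replaced by a different length
ba1.append(0x00)         # grows like a list
del ba1[-1]              # and shrinks like one

print(ba1)   # bytearray(b'Hello there!')

# A value that does not fit into one byte is rejected.
try:
    ba1[0] = 256
    accepted = True
except ValueError:
    accepted = False
```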

from int
The following code illustrates the process for a positive integer:

Method int.to_bytes(length, ....)
Method int.to_bytes(length, byteorder, *, signed=False) returns a bytes object representing an integer. If a bytearray is required, convert the bytes object to bytearray:
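A minimal sketch of that conversion, with an arbitrary sample value:

```python
# to_bytes() returns bytes; wrapping it in bytearray() makes it mutable.
ba1 = bytearray((0x13d8).to_bytes(2, byteorder='big'))
print(ba1)   # bytearray(b'\x13\xd8')

ba1[0] = 0xFF
print(hex(int.from_bytes(ba1, byteorder='big')))   # 0xffd8
```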

from str
If each character of the string fits into one byte, the process is simple:

Method str.encode(encoding='utf-8', errors='strict') creates a bytes object containing the encoded string. If a bytearray is required, convert the bytes object to bytearray:

from str containing hexadecimal digits
classmethod bytearray.fromhex(string) returns a bytearray, decoding the given string object:

classmethod bytearray.fromhex(string) can be used to convert from positive int to bytearray:
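A sketch of both uses. The helper name int_to_bytearray is hypothetical, and it assumes a non-negative argument:

```python
ba1 = bytearray.fromhex('2Ef0 F1f2')
print(ba1)   # bytearray(b'.\xf0\xf1\xf2')

def int_to_bytearray(i1):
    """Convert a non-negative int to a bytearray by way of its hex digits."""
    digits = format(i1, 'x')
    if len(digits) % 2:        # fromhex() needs two hex digits per byte
        digits = '0' + digits
    return bytearray.fromhex(digits)

print(int_to_bytearray(0xf714))   # bytearray(b'\xf7\x14')
```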

to int
The following code illustrates the process for a positive integer:

Class method int.from_bytes(bytes, ....)
Class method int.from_bytes(bytes, byteorder, *, signed=False) simplifies the conversion from bytearray to int:

to str
If the bytearray contains only characters that fit into one byte:

Method bytearray.decode(encoding='utf-8', errors='strict') creates a string representing the decoded bytearray:

It is important to use the correct decoding:

to str containing hexadecimal digits
Method bytearray.hex returns a string object containing two hexadecimal digits for each byte in the instance.

Method bytearray.hex can be used to convert from bytearray to positive int:
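A sketch of both steps, with an illustrative sample value:

```python
ba1 = bytearray(b'\x13\xd8')

digits = ba1.hex()      # two hex digits per byte
i1 = int(digits, 16)    # parse the digits as a base-16 number
print(digits, hex(i1))  # 13d8 0x13d8
```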

Operations with methods on bytearrays
The following methods on bytearrays are representative of methods described in the reference. All can be used with bytes objects.
