Exploring CPython's Bytecode

Floris Bruynooghe

flub@devork.be

@flubdevork

Comments in here will not show up on the slides.

Me

Unravelling Bytecode

Python: Compiler & VM

Python overview

(Image courtesy Ned Batchelder)

code-stages.svg

Example Module

"""Docstring for example.py"""

def sum(a, b):
    """Return a * 2 + b * 3"""
    a = a * 2
    c = b * 3
    return a + c

if __name__ == '__main__':
    print(sum(15, 4))

Executing the Bytecode

The Compiler

Benefits over 2.x .pyc files:

  • Less clutter in top level directory
  • Multiple versions can co-exist (thanks Barry Warsaw); Linux distros like this.
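The cached path for a source file can be computed rather than guessed. A minimal sketch, assuming a Python that has `importlib.util.cache_from_source` (3.4 or later; on 3.2 the same function lives in `imp`):

```python
import importlib.util
import sys

# Where the compiler will cache the bytecode for example.py.
pyc_path = importlib.util.cache_from_source('example.py')
print(pyc_path)

# The interpreter tag in the file name is what lets several
# Python versions share one cache directory.
print(sys.implementation.cache_tag)
```

On CPython 3.2 the path is `__pycache__/example.cpython-32.pyc`.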

"Optimisation"

Reading .pyc Files

Structure of .pyc Files

Lazy developers search the web rather than read the code.

http://nedbatchelder.com/blog/200804/the_structure_of_pyc_files.html

  1. Magic number (4 bytes)
  2. Timestamp (4 bytes)
  3. Code Object (the rest)

1. Magic Number

2. Timestamp

3. Code Object

Reading .pyc Files

import imp
import marshal
import struct

with open('__pycache__/example.cpython-32.pyc',
          'rb') as fp:
    magic = fp.read(4)
    assert magic == imp.get_magic()
    # Timestamp is a little-endian unsigned 32-bit integer
    timestamp = struct.unpack('<L', fp.read(4))
    timestamp = timestamp[0]
    code = marshal.load(fp)
# Use code here...
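The 8-byte header above is the Python 3.2 layout. Later releases grew the header: 3.3 added a 4-byte source size and 3.7 added a 4-byte flags field. A hedged sketch of a reader that skips whichever header the running interpreter writes; `read_pyc` and `demo` are names made up for this example:

```python
import importlib.util
import marshal
import os
import py_compile
import sys
import tempfile

def read_pyc(path):
    """Return (magic, code object) from a .pyc written by this interpreter."""
    # 8 bytes in 3.2 (magic + timestamp), 12 in 3.3-3.6, 16 from 3.7 on.
    if sys.version_info >= (3, 7):
        header_size = 16
    elif sys.version_info >= (3, 3):
        header_size = 12
    else:
        header_size = 8
    with open(path, 'rb') as fp:
        header = fp.read(header_size)
        code = marshal.load(fp)
    return header[:4], code

def demo():
    """Compile a throwaway module, read its .pyc back and execute it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'example.py')
        with open(src, 'w') as fp:
            fp.write('answer = 15 * 2 + 4 * 3\n')
        pyc = py_compile.compile(src, cfile=os.path.join(tmp, 'example.pyc'),
                                 doraise=True)
        magic, code = read_pyc(pyc)
    namespace = {}
    exec(code, namespace)
    return magic, code, namespace

magic, code, namespace = demo()
print(code.co_name, namespace['answer'])
```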

Back to the Compiler

Compile your own code object:

compile(source, filename, mode, flags=0,
        dont_inherit=False, optimize=-1)
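A short sketch of calling `compile()` directly on a cut-down version of the example module. The source string and variable names are only for illustration:

```python
import types

SOURCE = '''\
def sum(a, b):
    """Return a * 2 + b * 3"""
    a = a * 2
    c = b * 3
    return a + c
'''

# mode='exec' compiles a whole module; the filename is only used for reporting.
module_code = compile(SOURCE, 'example.py', 'exec')

# The function body is compiled to its own code object, stored among the
# module's constants.
sum_code = next(c for c in module_code.co_consts
                if isinstance(c, types.CodeType))

print(module_code.co_name, sum_code.co_name, sum_code.co_varnames)

# A code object can be executed in any namespace.
namespace = {}
exec(module_code, namespace)
print(namespace['sum'](15, 4))
```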

Code Objects

<module> Code Object

co_filename = 'example.py'

co_firstlineno = 1

co_lnotab = b'\x06\x03\t\x07\x0c\x01' (Objects/lnotab_notes.txt)

co_flags = 0x40

co_name = '<module>'

co_names = ('__doc__', 'sum', '__name__', 'print')

co_consts = ('Docstring for example.py',
<code object sum at 0x1a3bbe0, file "example.py", line 4>, '__main__', 15, 4, None)

co_stacksize = 4

co_code = b'd\x00\x00Z\x00\x00d\x01\x00\x84\x00\x00Z\x01\x00e\x02\x00d\x02\x00k\x02\x00r1\x00e\x03\x00e\x01\x00d\x03\x00d\x04\x00\x83\x02\x00\x83\x01\x00\x01n\x00\x00d\x05\x00S'
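The module's co_lnotab packs (bytecode offset increment, line increment) byte pairs. A minimal decoding sketch, assuming the unsigned increments used by Python 3.2 (3.6 made line increments signed and 3.10 replaced the table). `line_starts` is a name chosen for this example:

```python
def line_starts(lnotab, firstlineno):
    """Yield (bytecode_offset, line_number) pairs from an old-style co_lnotab."""
    addr, lineno, last = 0, firstlineno, None
    # Even bytes are offset increments, odd bytes are line increments.
    for addr_incr, line_incr in zip(lnotab[0::2], lnotab[1::2]):
        if addr_incr:
            if lineno != last:
                yield addr, lineno
                last = lineno
            addr += addr_incr
        lineno += line_incr
    if lineno != last:
        yield addr, lineno

# co_lnotab and co_firstlineno of the <module> code object.
MODULE_LNOTAB = b'\x06\x03\t\x07\x0c\x01'
module_lines = list(line_starts(MODULE_LNOTAB, 1))
print(module_lines)
```

Offset 6 maps to line 4, which matches co_firstlineno of the sum code object.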

sum Code Object

co_filename = 'example.py'

co_flags = 0x43

co_firstlineno = 4

co_lnotab = b'\x00\x02\n\x01\n\x01'

co_name = 'sum'

co_names = ()

co_consts = ('Return a * 2 + b * 3', 2, 3)

The first item would be None if there was no docstring.

co_argcount = 2

co_kwonlyargcount = 0

co_cellvars = ()

Names of local variables that nested functions also reference.

co_freevars = ()

Free variables are variables that must come from an enclosing scope.

co_nlocals = 3

co_varnames = ('a', 'b', 'c')

co_stacksize = 2

co_code = b'|\x00\x00d\x01\x00\x14}\x00\x00|\x01\x00d\x02\x00\x14}\x02\x00|\x00\x00|\x02\x00\x17S'
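Both co_cellvars and co_freevars are empty for sum because it has no nested functions. A small illustrative closure shows them filled in; `outer` and `inner` are names made up for this example:

```python
def outer():
    x = 1            # used by inner, so x is a cell variable of outer

    def inner():
        return x     # supplied by outer, so x is a free variable of inner

    return inner

inner = outer()
print(outer.__code__.co_cellvars)   # names shared with nested functions
print(inner.__code__.co_freevars)   # names taken from the enclosing scope
```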

Bytecode

Disassembler

dis.show_code()
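`dis.show_code()` prints a summary of the code object attributes listed earlier. A brief sketch using `dis.code_info()`, which returns the same report as a string:

```python
import dis

def sum(a, b):
    """Return a * 2 + b * 3"""
    a = a * 2
    c = b * 3
    return a + c

# code_info() returns the text that show_code() would print.
info = dis.code_info(sum)
print(info)
```

The exact fields in the report vary between Python versions.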

First Taste of Bytecode

First Taste of Bytecode (2)

LOAD_FAST(var_num):
 Pushes a reference to the local co_varnames[var_num] onto the stack.
LOAD_CONST(consti):
 Pushes co_consts[consti] onto the stack.
BINARY_MULTIPLY:
 Implements TOS = TOS1 * TOS.
STORE_FAST(var_num):
 Stores TOS into the local co_varnames[var_num].
BINARY_ADD:
 Implements TOS = TOS1 + TOS.
RETURN_VALUE:
 Returns with TOS to the caller of the function.
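These six opcodes are all that sum's co_code contains. A hedged sketch of decoding those bytes by hand, assuming the Python 3.2 layout (one opcode byte, followed by a 2-byte little-endian argument for opcodes of 90 and above) and the 3.2 opcode numbers. `decode` and `OPNAMES` are names chosen here. Newer interpreters use a different encoding, so this works on the literal bytes only:

```python
HAVE_ARGUMENT = 90  # opcodes >= 90 carry a 2-byte argument (pre-3.6 layout)

# Opcode numbers as in CPython 3.2; only the six used by sum().
OPNAMES = {
    20: 'BINARY_MULTIPLY',
    23: 'BINARY_ADD',
    83: 'RETURN_VALUE',
    100: 'LOAD_CONST',
    124: 'LOAD_FAST',
    125: 'STORE_FAST',
}

# co_code of the sum code object.
SUM_CODE = (b'|\x00\x00d\x01\x00\x14}\x00\x00|\x01\x00d\x02\x00\x14}\x02\x00'
            b'|\x00\x00|\x02\x00\x17S')

def decode(co_code):
    """Yield (offset, name, arg) per instruction; arg is None when absent."""
    i = 0
    while i < len(co_code):
        offset, op = i, co_code[i]
        i += 1
        arg = None
        if op >= HAVE_ARGUMENT:
            arg = co_code[i] | (co_code[i + 1] << 8)  # little-endian
            i += 2
        yield offset, OPNAMES[op], arg

for offset, name, arg in decode(SUM_CODE):
    print(offset, name, '' if arg is None else arg)
```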

First Model of the VM

Reading Bytecode

Bytecode Arguments

First Eval Loop

Passing Arguments

Handling Bytecodes

LOAD_FAST • LOAD_CONST • STORE_FAST

BINARY_MULTIPLY

BINARY_ADD

RETURN_VALUE
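A minimal sketch of an eval loop for these six opcodes, run against the sum code object's bytes and constants. This is an illustration under the Python 3.2 encoding, not the vm.py from the talk; `run` and the constant names are made up here:

```python
HAVE_ARGUMENT = 90
LOAD_CONST, LOAD_FAST, STORE_FAST = 100, 124, 125     # CPython 3.2 numbers
BINARY_MULTIPLY, BINARY_ADD, RETURN_VALUE = 20, 23, 83

# Taken from the sum code object.
CO_CODE = (b'|\x00\x00d\x01\x00\x14}\x00\x00|\x01\x00d\x02\x00\x14}\x02\x00'
           b'|\x00\x00|\x02\x00\x17S')
CO_CONSTS = ('Return a * 2 + b * 3', 2, 3)
CO_NLOCALS = 3

def run(*args):
    """Execute CO_CODE with args as the first locals; return the result."""
    # Arguments fill the first slots of the locals array (co_varnames order).
    fastlocals = list(args) + [None] * (CO_NLOCALS - len(args))
    stack = []
    pc = 0
    while True:
        op = CO_CODE[pc]
        pc += 1
        if op >= HAVE_ARGUMENT:
            arg = CO_CODE[pc] | (CO_CODE[pc + 1] << 8)
            pc += 2
        if op == LOAD_FAST:
            stack.append(fastlocals[arg])
        elif op == LOAD_CONST:
            stack.append(CO_CONSTS[arg])
        elif op == STORE_FAST:
            fastlocals[arg] = stack.pop()
        elif op == BINARY_MULTIPLY:
            right = stack.pop()       # TOS
            left = stack.pop()        # TOS1
            stack.append(left * right)
        elif op == BINARY_ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == RETURN_VALUE:
            return stack.pop()
        else:
            raise NotImplementedError('opcode %d' % op)

result = run(15, 4)
print(result)
```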

The VM So Far

Disassembling the Module

Disassembling the Module (contd)

STORE_NAME • LOAD_NAME

MAKE_FUNCTION

CALL_FUNCTION

CALL_FUNCTION (2)
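In Python 3.2 the CALL_FUNCTION argument packs two counts: the low byte is the number of positional arguments and the high byte the number of keyword arguments. A sketch of how a handler might unpack the stack under that convention; `call_function` is a hypothetical helper, not the talk's code:

```python
def call_function(stack, arg):
    """Pop a callable and its arguments off stack, push the call's result."""
    n_pos = arg & 0xFF           # low byte: positional argument count
    n_kw = (arg >> 8) & 0xFF     # high byte: keyword argument count
    kwargs = {}
    for _ in range(n_kw):
        value = stack.pop()      # value sits on top of its key
        name = stack.pop()
        kwargs[name] = value
    # Positional arguments were pushed left to right, so pop and reverse.
    args = [stack.pop() for _ in range(n_pos)][::-1]
    func = stack.pop()           # the callable is below all its arguments
    stack.append(func(*args, **kwargs))

def sum(a, b):
    return a * 2 + b * 3

stack = [sum, 15, 4]
call_function(stack, 2)          # two positional arguments
print(stack)
```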

POP_JUMP_IF_FALSE

JUMP_FORWARD • POP_TOP

Executing example.pyc

import imp
import io
import marshal

with open(imp.cache_from_source('example.py'), 'rb') as fp:
    assert fp.read(4) == imp.get_magic()
    fp.seek(4, io.SEEK_CUR)  # skip the timestamp
    code_mod = marshal.load(fp)
f = Frame(code_mod)
f.exec()

$ python3 vm.py
42

Incomplete

EXTENDED_ARG
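With 2-byte arguments, any value above 65535 needs EXTENDED_ARG: it supplies the upper 16 bits of the following opcode's argument. A hedged sketch of folding it in while decoding, assuming the 3.2 layout and opcode numbers; `decode_args` and the byte string are made up for illustration:

```python
HAVE_ARGUMENT = 90
EXTENDED_ARG = 144   # opcode number in CPython 3.2
LOAD_CONST = 100

def decode_args(co_code):
    """Yield (opcode, arg) pairs, folding EXTENDED_ARG into the next opcode."""
    i = 0
    extended = 0
    while i < len(co_code):
        op = co_code[i]
        i += 1
        arg = None
        if op >= HAVE_ARGUMENT:
            arg = co_code[i] | (co_code[i + 1] << 8) | extended
            i += 2
            extended = 0
            if op == EXTENDED_ARG:
                extended = arg << 16   # upper 16 bits for the next opcode
                continue
        yield op, arg

# Hypothetical bytes: LOAD_CONST with argument 0x10002, too large for 16 bits.
code = bytes([EXTENDED_ARG, 0x01, 0x00, LOAD_CONST, 0x02, 0x00])
decoded = list(decode_args(code))
print(decoded)
```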

Abusing Bytecode

Don't Use These

The Tools

Classic: GOTO

Improve Tracing

The Bad

The Ugly

Questions

?

Thanks to my employer

abilisoft.png