Pythonathan: miscellaneous Python development tips

Nathan Schneider

Are you a Python hacker? This is a scattered list of patterns, tools, and conventions that I have found useful and thought were worth sharing. It is not a tutorial, though excellent ones exist.

Data structures

taxonomy of basic data structures

FixedDict: preventing reassignment to keys

Sometimes it is desirable to create a mapping which enforces the requirement that values cannot be reassigned. This is easily done by subclassing the dict class:

class FixedDict(dict):
    '''Dict subclass which prevents reassignment to existing keys.'''
    def __setitem__(self, key, newvalue):
        if key in self:
            raise KeyError('FixedDict cannot reassign to key {0!r} (current: {1!r}, new value: {2!r})'.format(key,self[key],newvalue))
        dict.__setitem__(self, key, newvalue)

Thus, the last line of the following will trigger an error:

d = FixedDict()
d['a'] = 1
d['b'] = 2
d['a'] = 3
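One caveat worth knowing: in CPython, dict.update() and the dict constructor do not go through an overridden __setitem__(), so they can silently overwrite existing keys. Here is a minimal sketch of one way to close that gap, assuming it is acceptable to route update() through item assignment. The update() override is my own addition for illustration, and the error message is shortened:

```python
class FixedDict(dict):
    '''Dict subclass which prevents reassignment to existing keys.'''
    def __setitem__(self, key, newvalue):
        if key in self:
            raise KeyError('FixedDict cannot reassign to key {0!r}'.format(key))
        dict.__setitem__(self, key, newvalue)

    def update(self, *args, **kwargs):
        # Illustrative addition: send every pair through __setitem__
        # so that the no-reassignment rule also applies to update().
        for key, value in dict(*args, **kwargs).items():
            self[key] = value

d = FixedDict()
d['a'] = 1
d.update(b=2)            # fine: 'b' is a new key
try:
    d.update({'a': 3})   # rejected: 'a' already has a value
    rejected = False
except KeyError:
    rejected = True
```

Note that keys passed to the constructor, e.g. FixedDict(a=1), are still stored directly; only later assignments are checked.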

We can make our class slightly fancier to allow several options for handling of reassignment attempts: with the version below, FixedDict(reassign='exception') will create an object behaving as above; FixedDict(reassign='ignore') will cause reassignment attempts to fail silently; and FixedDict(reassign='different') will cause an error to be thrown only if the new value and the old value are not equal to each other. A function can even be passed to customize the reassignment behavior.

class FixedDict(dict):
    '''Dict subclass which constrains reassignment to existing keys.'''
    def __init__(self, *args, **kwargs):
        '''
        Optional argument 'reassign' to specify what to do if __setitem__() is invoked 
        for a key that is already present: 
         * reassign='exception' indicates that an error should be thrown (default)
         * reassign='ignore' indicates that the new value should be ignored without error
         * reassign='replace' indicates that the new value should replace the old one without error
         * reassign='different' indicates that an error should be thrown if the new value is 
           different from the old one; if they are the same, the new value will be ignored silently
         * reassign=<2-arg function> indicates that the key will be reassigned to the value returned 
           by the function when applied to the old and new values
        The reassignment policy can also be specified as an argument to __setitem__(), overriding 
        any instance-level policy.
        '''
        reass = kwargs.get('reassign','exception')
        assert reass in ('exception','ignore','replace','different') or callable(reass),"Invalid 'reassign' parameter: {0}".format(reass)
        self._reassign = reass
        if 'reassign' in kwargs: del kwargs['reassign']
        dict.__init__(self, *args, **kwargs)
    def __setitem__(self, key, value, reassign=None):
        if key in self:
            return self.reassign(key, value, policy=reassign)
        dict.__setitem__(self, key, value)
    def reassign(self, key, newvalue, policy=None):
        p = policy or self._reassign
        if p=='exception' or (p=='different' and newvalue!=self[key]):
            raise KeyError('FixedDict cannot reassign to key {0!r} (current: {1!r}, new value: {2!r})'.format(key,self[key],newvalue))
        elif p=='ignore':
            pass
        elif p=='replace':
            dict.__setitem__(self, key, newvalue)
        elif callable(p):
            dict.__setitem__(self, key, p(self[key], newvalue))

BetweenDict: mapping ranges to values

Sometimes information is organized by ranges of numeric values; thus we want to store the information compactly though lookup will be on a single value at a time. The following data structure accomplishes this, assuming ranges do not overlap:

class BetweenDict(dict):
    def __init__(self, d = {}):
        for k,v in d.items():
            self[k] = v

    def __getitem__(self, key):
        for k, v in self.items():
            if k[0] <= key < k[1]:
                return v
        raise KeyError("Key '%s' is not between any values in the BetweenDict" % key)

    def __setitem__(self, key, value):
        try:
            if len(key) == 2:
                if key[0] < key[1]:
                    dict.__setitem__(self, (key[0], key[1]), value)
                else:
                    raise RuntimeError('First element of a BetweenDict key '
                                       'must be strictly less than the '
                                       'second element')
            else:
                raise ValueError('Key of a BetweenDict must be an iterable '
                                 'with length two')
        except TypeError:
            raise TypeError('Key of a BetweenDict must be an iterable '
                             'with length two')

    def __contains__(self, key):
        try:
            return bool(self[key]) or True
        except KeyError:
            return False

source: Joshua Kugler's blog
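Lookup in BetweenDict scans every range, so it is linear in the number of ranges. If that ever matters, a different technique (binary search with the bisect module, not what the class above does) gives logarithmic lookup. A rough sketch, assuming half-open, non-overlapping ranges that are all known up front; the class name SortedRanges and the grade data are made up for illustration:

```python
import bisect

class SortedRanges(object):
    '''Hypothetical alternative to BetweenDict: binary search over sorted ranges.'''
    def __init__(self, ranges):
        # ranges: iterable of ((low, high), value); each range is half-open: [low, high)
        items = sorted(ranges, key=lambda item: item[0])
        self._lows = [low for (low, high), value in items]
        self._highs = [high for (low, high), value in items]
        self._values = [value for (low, high), value in items]

    def __getitem__(self, key):
        # Index of the last range whose lower bound is <= key
        i = bisect.bisect_right(self._lows, key) - 1
        if i >= 0 and key < self._highs[i]:
            return self._values[i]
        raise KeyError(key)

grades = SortedRanges([((0, 60), 'F'), ((60, 70), 'D'), ((70, 80), 'C'),
                       ((80, 90), 'B'), ((90, 101), 'A')])
```

With this data, grades[85] gives 'B' and grades[60] gives 'D', while grades[101] raises KeyError. Unlike BetweenDict, this sketch does not support adding ranges after construction.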

tuple/dict/set literals

list/dict/set comprehensions and generator expressions

List comprehensions are syntactic sugar for creating a list via a loop:

evens = [i*2 for i in range(6)]   # [0, 2, 4, 6, 8, 10]
charcodes = [chr(ord(c)+1) for c in 'ABCabc']   # ['B', 'C', 'D', 'b', 'c', 'd']
fours = [j for j in evens if j%4==0]   # [0, 4, 8]

Recent versions of Python also support dict and set comprehensions:

d = {'one': 1, 'three': 3, 'two': 2}
inverted = {v: k for k,v in d.items()}   # {1: 'one', 3: 'three', 2: 'two'}
halved = {k.upper(): v/2 for k,v in d.items() if v%2==0}   # {'TWO': 1}
uniqchars = {c.lower() for c in 'OneTwoThree'}   # {'e', 'h', 'n', 'o', 'r', 't', 'w'}

Generator expressions are like comprehensions, but create a generator rather than a new data structure in memory. This is useful for calling functions that accept iterable arguments, and gives rise to an elegant expression of the count-if pattern:

sum(1 for x in items if condition(x))
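A concrete instance of the count-if pattern, plus one more aggregate fed by a generator expression (the word list is just sample data):

```python
words = ['apple', 'Bob', 'cat', 'Dave', 'Eve']

# Count the items satisfying a condition without building a list
n_capitalized = sum(1 for w in words if w[0].isupper())

# Any function that accepts an iterable can consume a generator expression
longest = max(len(w) for w in words)
```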

Cf. sections on generators, iteration patterns, and iteration/sequence operations.

» Jonathan Elsas points out that, since generator expressions were introduced in Python 2.4, the following alternatives to dict and set comprehensions work even for Python <2.7:

inverted = dict((v, k) for k,v in d.items())
uniqchars = set(c.lower() for c in 'OneTwoThree')

sequence literal repetition operator

Shorthand for concatenating something with itself a given number of times:

[None]*3 == [None, None, None]	# True
(1, 2, 3)*2 == (1, 2, 3, 1, 2, 3)	# True
'c'+'o'*10+'l!'*4 == 'cooooooooool!l!l!l!'	# True

You probably want to avoid repeating mutable items, because there will be multiple references to the same instance:

x = [[]]*5
x[0].append(1)
x == [[1], [], [], [], []]	# False!
x == [[1], [1], [1], [1], [1]]	# True

(Suggested by Jonathan Elsas.)
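If independent inner lists are what you want, a list comprehension creates a fresh object for each slot. A small sketch contrasting the two:

```python
shared = [[]] * 3
shared[0].append(1)      # all three slots refer to the same list object

separate = [[] for _ in range(3)]
separate[0].append(1)    # only the first list changes
```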

built-in iteration/sequence operations

selecting a single element

max(a, b[, c[, ...]][, key=func]) or max(iter[, key=func])
returns the maximum of several items. A 1-argument function to return a comparison key (as in list.sort() and sorted()) can be supplied via key=func.
min(a, b[, c[, ...]][, key=func]) or min(iter[, key=func])
analogous to max(...).

checking elements

all(iter)
returns whether all elements of the argument are true
any(iter)
returns whether any elements of the argument are true

aggregating elements

reduce(function, iter[, initializer])
sum(iter[, start])
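A few of the selecting, checking, and aggregating functions in action. This is only a sketch with made-up data; reduce is imported from functools so that the snippet also runs on Python 3, where it is no longer a built-in:

```python
from functools import reduce
import operator

words = ['pear', 'fig', 'banana']

longest = max(words, key=len)                  # compare by length rather than alphabetically
shortest = min(words, key=len)
has_short = any(len(w) < 4 for w in words)
all_lower = all(w.islower() for w in words)
total_len = sum(len(w) for w in words)
product = reduce(operator.mul, [1, 2, 3, 4], 1)   # ((((1*1)*2)*3)*4)
```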

iterator protocol

iter(obj[, sentinel])
returns an iterator via obj.__iter__() unless sentinel is provided, in which case obj is called with no arguments for each step of iteration and iteration terminates once it returns the value of sentinel
next(iter[, default])
equivalent to iter.next(), but returns default if the end of iteration has been reached
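The two-argument form of iter() is handy for reading fixed-size chunks until a read comes back empty. A small sketch using an in-memory stream (io.StringIO stands in for a real file here):

```python
import io

stream = io.StringIO(u'abcdefgh')
# Call stream.read(3) repeatedly until it returns the sentinel u'' (end of input)
chunks = list(iter(lambda: stream.read(3), u''))

it = iter([10])
first = next(it)
second = next(it, None)   # the iterator is exhausted, so the default comes back
```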

producing a new sequence or modifying the iteration

enumerate(sequence[, start=0])
returns an indexed sequence
zip([iter[, ...]])
reversed(seq)
constructs a reverse iterator
sorted(iter[, cmp[, key[, reverse]]])
returns a sorted list
filter([function,] iterable)
returns a list containing the elements e of iterable for which function(e) is true (or which are true themselves, if function is omitted)
map(function, iterable, ...)
range([start], stop[, step])
creates a list with the specified integer progression
xrange([start], stop[, step])
behaves like range() with respect to iteration but doesn't store all the values simultaneously

Cf. itertools module.

the inverse of zip(...) is zip(*...)

x = zip([1,2,3], 'abc', 'ABC')
x==[(1,'a','A'), (2,'b','B'), (3,'c','C')]   # True
y = zip(*x)
y==[(1,2,3), ('a','b','c'), ('A','B','C')]   # True
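The same trick is a compact way to "unzip" a list of pairs into parallel sequences. The sketch below wraps zip() in list() where a list is needed, since zip() returns a lazy iterator in Python 3:

```python
pairs = [(1, 'a'), (2, 'b'), (3, 'c')]

numbers, letters = zip(*pairs)            # unpack the two resulting tuples
rezipped = list(zip(numbers, letters))    # zipping them again restores the pairs
```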

functional operations between sets

In addition to methods like .union(), .intersection(), and .difference() there are operator versions:

hi = set('hello')
bye = set('goodbye')
# all True:
hi & bye=={'e', 'o'}
hi | bye=={'b', 'd', 'e', 'g', 'h', 'l', 'o', 'y'}
hi ^ bye=={'b', 'd', 'g', 'h', 'l', 'y'}
hi - bye=={'h', 'l'}

breaking a sequence into parts of length n

parts = [seq[i:i+n] for i in xrange(0, len(seq), n)]
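The slicing approach needs a sequence with a known length. For arbitrary iterables (files, generators), something along these lines should work; the function name chunks is my own choice:

```python
from itertools import islice

def chunks(iterable, n):
    '''Yield successive lists of up to n items from any iterable.'''
    it = iter(iterable)
    while True:
        part = list(islice(it, n))   # take the next n items (fewer at the end)
        if not part:
            return
        yield part

parts = list(chunks(range(7), 3))
```

Here parts comes out as three lists, the last one holding the single leftover item.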


sequence comparison

Sequences are compared element-wise. More precisely, seq1 < seq2 pseudocode is as follows:

for i in range(len(seq1)):
	if i>=len(seq2):
		return False
	elt1 = seq1[i]
	elt2 = seq2[i]
	if elt1==elt2:
		continue
	elif elt1<elt2:
		return True
	else: # ASSUMES elt2<elt1!
		return False
if len(seq2)>len(seq1):
	return True
return False

In other words, when comparing or sorting sequences, it is assumed that exactly one of the following is true for a given pair of elements: a<b, a==b, b<a
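A few concrete comparisons, including the common use of tuples as multi-part sort keys (sample data only):

```python
first_difference_decides = (1, 2, 9) < (1, 3)   # 2 < 3 settles it; later elements are ignored
prefix_is_smaller = (1, 2) < (1, 2, 0)          # a proper prefix sorts first
strings_too = 'abc' < 'abd'

records = [('smith', 30), ('jones', 45), ('smith', 25)]
ordered = sorted(records)   # by name, then by number among equal names
```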

more: [1], [2]

namedtuple creation decorator

A shortcut for defining namedtuple types with a more Pythonic syntax:

from collections import namedtuple
import inspect

def make_namedtuple(fn):
	args = ' '.join(inspect.getargspec(fn).args)
	return namedtuple(fn.__name__, args)

Usage example:

@make_namedtuple
def Point(x, y): pass

p = Point(1, 2)	# Point(x=1, y=2)
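inspect.getargspec() has been removed from recent Python 3 releases. A sketch of the same decorator built on inspect.signature() (Python 3.3+), in case you need it there:

```python
from collections import namedtuple
import inspect

def make_namedtuple(fn):
    # Parameter names, in declaration order, become the field names
    fields = list(inspect.signature(fn).parameters)
    return namedtuple(fn.__name__, fields)

@make_namedtuple
def Point(x, y): pass

p = Point(1, 2)
```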

non-mutating version of dict.update()

Extending dict with a non-mutating update operation:

class Xdict(dict):
     def __lshift__(x, y):
         return Xdict(x.items() + y.items())

d = Xdict({'a': 9, 'b': 12})
d << {'a': 3, 'c': 8}   # result: {'a': 3, 'b': 12, 'c': 8}
d['a']==9	# True

(Original suggestion: [1])
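In Python 3, items() returns a view, and views cannot be concatenated with +. A variant that should behave the same way on both Python 2 and 3, sketched with copy-then-update:

```python
class Xdict(dict):
    def __lshift__(self, other):
        result = Xdict(self)   # shallow copy of the left operand
        result.update(other)   # entries from the right operand win
        return result

d = Xdict({'a': 9, 'b': 12})
merged = d << {'a': 3, 'c': 8}
```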

Overloading << with these semantics is a new idiom, but I think the symbol << plausibly denotes a sort of modified concatenation/asymmetric union operation. Reasons to prefer this over other options:

Variables & Types

built-in types in Python 2.7

These come with corresponding conversion functions. See also Built-in Types.

basestring, bool, complex, dict, file, float, frozenset, int, list, long, memoryview, object, set, slice, str, tuple, type, unicode 

checking for built-in types

Preferred methods for checking if a variable is a number, a string, an iterable, etc.:

from numbers import Number
from collections import Iterable

def describeType(x):
	s = str(type(x))
	if x is None:
		s += ' : None!'
	elif x is True or x is False:
		s += ' : a boolean!'
	elif isinstance(x, Number):
		s += ' : a number!'
	elif isinstance(x, basestring): # Python 2.x; simply use str in 3.x (all strings are Unicode)
		s += ' : a (possibly Unicode) string!'
	if callable(x):
		s += ' : a callable, such as a function!'
	if hasattr(x, '__iter__'):
		s += ' : a non-string iterable, such as a collection!'
	if isinstance(x, Iterable):
		s += ' : any iterable, including strings and collections!'
	return s

To check whether a variable's type is any of several possibilities, pass a tuple of types as the second argument to isinstance():

if isinstance(x, (Number, basestring)):
	pass	# a number or string

Unicode and Python strings

Tutorial: "Unicode in Python, Completely Demystified" (Kumar McMillan, PyCon 2008)

To summarize: in Python 2.7,

s = '\xe2\x96\xaa'	# UTF-8 bytes
u = u'\u25aa'
# s or u will print as: ▪
u==s.decode('utf-8')	# True
s==u.encode('utf-8')	# True
len(s)==3	# True
len(u)==1	# True
s==u	# False; UnicodeWarning (comparing str with unicode)
v = '\u25aa'
v=='\\u25aa'	# True (i.e. \u escape is meaningless for plain strings)
s.encode('utf-8')	# UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
isinstance(u,str)	# False
isinstance(u,unicode)	# True
isinstance(u,basestring)	# True
isinstance(s,basestring)	# True
s+u	# UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Python 3 makes Unicode strings the default for working with string literals/text files, and provides a bytes type for plain bytestrings.
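For comparison, roughly the same experiment in Python 3 (this sketch is Python 3 only). The main difference is that bytes and text never mix implicitly:

```python
s = b'\xe2\x96\xaa'    # bytes: the UTF-8 encoding of the character
u = '\u25aa'           # str: a single Unicode code point

decoded = s.decode('utf-8')
encoded = u.encode('utf-8')

try:
    s + u
    mixed_ok = True
except TypeError:      # no implicit ASCII decoding as in Python 2
    mixed_ok = False
```

Here s == u is simply False (no warning), and bytes objects have no encode() method at all.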

built-in string functions

format(value[, format_spec])
intern(string)
return the interned version of the string

character codes

ord(c)
inverse of chr()/unichr()

string, regular expression references

variable scoping

See "Understanding UnboundLocalError in Python". Essentially: with respect to assignment, variables not declared as global—or (in Python 3.x), nonlocal—are local to the scope in which they are assigned. Variables local to an enclosing scope can be accessed:

x = 0
l = []
def f0():
	l.append(x)	# reads 'x' (and 'l') from the enclosing scope
f0()
l==[0]	# True

However, they cannot be assigned to—the following results in a different, local variable which shadows the first one:

x = 0
def f1():
	x = 1	# a different 'x'
f1()
x==0	# True

Moreover, a variable name in a given scope cannot refer variously to an enclosing-scope variable and to a local variable shadowing it:

x = 0
def f2():
	print(x)	# refers to first 'x'
	x = 2	# attempt to create a second 'x'
def f3():
	x = x + 1	# attempt to refer to the first 'x' (RHS), then create a second 'x'
def f4():
	x += 1	# equivalent to x = x + 1
def f5():
	global x
	x = 5	# global, because 'global x' occurs in this scope
f2()	# UnboundLocalError
f3()	# UnboundLocalError
f4()	# UnboundLocalError
f5()
x==5	# True
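For completeness, here is what nonlocal looks like in Python 3.x. It plays the role of global for variables of an enclosing function. A minimal sketch (Python 3 only; the counter example is my own):

```python
def make_counter():
    count = 0
    def increment():
        nonlocal count   # rebind the enclosing function's variable instead of shadowing it
        count += 1
        return count
    return increment

counter = make_counter()
first = counter()
second = counter()
```

Without the nonlocal line, count += 1 would raise UnboundLocalError, just like f4() above.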

naming conventions

I like to use the following conventions for variable names where I think it will help avoid confusion:

Suffix  Meaning                                                 Examples
S       string (esp. in contrast to a numeric value or list)    dataS
T       template string or regex pattern
F       file                                                    inF, outF
FP      file path (as a string)                                 inFP, outFP
R       file reader (e.g. for CSV files)                        inR
W       file writer (e.g. for CSV files)                        outW
X       function (e.g., in an argument list)

(Suffixes I haven't settled on yet, but might: D for dict or directory or distribution, P for probability or prior, G for graph, N for node, I for integer, U for tuple, M for markup or map, V for vector/matrix, KV for key-value pairs (such as are returned by dict.items()), Q for query string, etc.)

(Possible convention: Double the suffix to indicate a variable may be a single instance or a collection, such as in a parameter list. For instance, inFF would be "input file or files".)

I often prefix variable names with n if they are counts: e.g. nItems "number of items".

Infixed 2 is short for "to" in mappings (dicts or conversion functions) like name2id.

I also typically reserve the variable names s for a string, f for a file, d for a dict, m for a regular expression match, and ex for an exception.

Doubling a single-character variable name indicates a collection of them: mm is a list of matches, for example. For variable names that are words, pluralize them to indicate a collection.

Expressions & Statements

iteration patterns for everyone

# seq is a sequence (like a list, tuple, or string)
for x in seq: pass
for i,x in enumerate(seq): pass	  # i is the element index
for i in range(len(seq)): pass
for x in sorted(seq): pass	# iterates through contents in sort order
for x,y in zip(seq1,seq2): pass	# iterates until the end of the shortest sequence is reached

# d is a mapping type (like a dict)
for k in d: pass	# k is the entry key
for v in d.values(): pass
for k,v in d.items(): pass
for i,(k,v) in enumerate(d.items()): pass	# note that the ordering of items may be arbitrary (depending on the mapping type)
for k,v in sorted(d.items(), key=lambda kv: kv[0]): pass # sort order of keys
for k,v in sorted(d.items(), key=lambda kv: kv[1]): pass # sort order of values

See also: comprehensions and generator expressions, generators, iteration/sequence operations

debugging with assert

Python's assert statement is great for debugging. In the basic case, it lets you specify assumptions about the state of your program that are not otherwise checked but could lead to subtle bugs (say, if user code is unaware of a function's expectations, or if you forget your own assumptions down the road). For example:

assert x>=1
assert len(items1)==len(items2),'Length mismatch: {} and {}'.format(len(items1),len(items2))

The optional expression following the comma is a value to be printed along with the AssertionError that is raised if your condition fails.

Callers can intercept AssertionErrors, but otherwise they will terminate the execution of your code prematurely like any other exception. This is another way in which they are handy for quick debugging: for example, if I want to add a bunch of print statements and then have the program terminate, I use

assert False

or if I simply want to print a couple of variables,

assert False,(x,y)

This is less clunky than adding return or exit statements, and is readily identifiable as debugging code because assert False would not normally occur as part of a program.

Note that assert False,x is equivalent to assert False if x is None or ()—no message is added to the AssertionError, which can be misleading. A solution is to enclose the mysterious expression in a tuple: assert False,(x,) will always display the value of x.

To print an empty or complicated string for debugging purposes, convert it to its literal form with repr(s).

boolean expression values

When it comes to truth values in Python, False, 0, None, '', (), and [] are all false. In fact, any instance of a class that defines a __len__() method evaluates as false if its length is 0.

The value of a boolean expression is not always of type bool; rather, it is the value of the last operand that was evaluated. All of the following are true:

(None or 'a')=='a'
('a' or None)=='a'	# short-circuited: None is never evaluated because 'a' is a true value
('a' and None) is None
('a' and False) is False

This property gives us a shorthand for expressions of the form a-if-a-is-true-else-b:

x = y if y else z	# can be rewritten as:
x = y or z
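One thing to watch for: the shorthand treats every false value as "missing", including legitimate ones like 0 or ''. A sketch of the pitfall and an explicit alternative (both function names are made up for illustration):

```python
def timeout_or_default(timeout, default=30):
    return timeout or default          # 0 is false, so it is replaced by the default

def timeout_if_given(timeout, default=30):
    return default if timeout is None else timeout   # only None means "not provided"

surprising = timeout_or_default(0)
intended = timeout_if_given(0)
```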

When a file is opened in the header of a with statement, it will automatically be closed at the end of the block, even if an exception is encountered.

fancy uses of with

Nested with blocks can be exploited to represent structured content, such as XML (discussion). Cf. PEP 359, which proposes new syntax for this sort of functionality.
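To give a flavor of the idea, here is a toy sketch using contextlib.contextmanager. The tag() helper is hypothetical (it is not from the linked discussion) and does no escaping:

```python
from contextlib import contextmanager
import io

@contextmanager
def tag(out, name):
    '''Hypothetical helper: write <name> on entry and </name> on normal exit.'''
    out.write(u'<{0}>'.format(name))
    yield
    out.write(u'</{0}>'.format(name))

out = io.StringIO()
with tag(out, 'html'):
    with tag(out, 'body'):
        with tag(out, 'p'):
            out.write(u'hello')
markup = out.getvalue()
```

The indentation of the with blocks mirrors the nesting of the elements, which is the appeal of the approach.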

built-in math functions

divmod(a, b)
returns (essentially) (a//b, a%b)
pow(x, y[, z])
round(x[, n])
max(), min(), sum()
see iteration/sequence operations

base conversion

bin(x)
convert to a binary string
oct(x)
convert to an octal string
hex(x)
convert to a hexadecimal string

Cf. math module.
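A few quick examples of these (values chosen arbitrarily). int() with an explicit base goes in the opposite direction from bin() and hex():

```python
minutes, seconds = divmod(125, 60)     # split 125 seconds into minutes and seconds
quotient, remainder = divmod(-7, 2)    # floor division: the remainder takes the sign of the divisor
modpow = pow(2, 10, 1000)              # (2**10) % 1000, computed efficiently
rounded = round(3.14159, 2)

as_binary = bin(5)
as_hex = hex(255)
from_hex = int('ff', 16)
from_binary = int('101', 2)
```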


generators are awesome

Generators are functions that do not return but yield values; they are designed to be iterated over lazily. Calling the function returns a generator instance, which maintains the internal state of a use of the function. Iterating over that instance (in a for loop or by explicitly calling next()) will resume evaluation of the function until the next yield statement is reached, returning the yielded value. After the generator instance has been exhausted it will raise a StopIteration exception, per the iterator protocol.

Notably, lazy evaluation with generators can make code more efficient by avoiding the creation of intermediate data structures. They are especially powerful when chained together. See this tutorial for an in-depth discussion. An example from slide I-39:

wwwlog	   = open("access-log") 
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes	   = (int(x) for x in bytecolumn if x != '-')

print "Total", sum(bytes)

Rather than creating and operating over lists, we are operating directly over generators (here, generator expressions) arranged in a pipeline. For very large files, there is a major performance advantage because items are "pulled" through the pipeline one by one for processing. They do not need to be kept around after being tallied by the loop within the sum() function.
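The same stage can be written as a generator function, which also makes the lazy behavior easy to observe. A sketch over a few in-memory lines (the log lines are invented sample data, not from the tutorial):

```python
def byte_counts(lines):
    '''Yield the last whitespace-separated column of each line as an int, skipping "-".'''
    for line in lines:
        field = line.rsplit(None, 1)[1]
        if field != '-':
            yield int(field)

log_lines = [
    '127.0.0.1 "GET /index.html" 200 512',
    '127.0.0.1 "GET /missing" 404 -',
    '127.0.0.1 "GET /logo.png" 200 2048',
]

counts = byte_counts(log_lines)   # nothing has been processed yet
first = next(counts)              # runs the body up to the first yield
total = first + sum(counts)       # sum() pulls the remaining values
leftover = next(counts, 'done')   # the generator is now exhausted
```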

def foo(*args, bar='baz') is illegal, though intuitive

…because named arguments can also be provided positionally (without the keyword), which means a call foo(x) would be ambiguous between foo(args=(x,)) and foo(args=(), bar=x). (The same holds true if there are named arguments before *args.) This restriction applies to Python 2; Python 3 accepts the signature and treats bar as a keyword-only argument. In Python 2, the workaround is to use **kwargs:

def foo(*args, **kwargs):
    bar = 'baz'   # default
    if 'bar' in kwargs:
        bar = kwargs['bar']
        del kwargs['bar']
    assert len(kwargs)==0   # check that illegal arguments haven't been provided
    # main body

We can write a decorator to reduce the amount of code required for this case (as far as I know there is no equivalent in the standard library):

from functools import wraps
def xkwargs(**defaults):
    def wrap(fxn):
        @wraps(fxn)
        def _(*args, **kwargs):
            for k in kwargs:
                assert k in defaults,'Invalid keyword argument: {}'.format(k)
            d = dict(defaults)
            d.update(kwargs)   # caller-supplied values override the defaults
            return fxn(*args, **d)
        return _
    return wrap

@xkwargs(bar='baz')
def foo(*args, **kwargs):
    bar = kwargs['bar']
    # main body

If we want to allow extra keyword arguments besides the ones with defaults, functools.partial almost fits the bill. We can instead define the decorator as follows:

from functools import partial, update_wrapper
def xkwargs(**defaults):
    def wrap(fxn):
        return update_wrapper(partial(fxn, **defaults), fxn)
    return wrap

foo will then be a partial object which delegates to the original function, filling in with defaults as necessary. The defaults will live in foo.keywords.
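A quick check of how the partial-based version behaves. In this sketch foo simply returns what it received, so the effect of the defaults is visible:

```python
from functools import partial, update_wrapper

def xkwargs(**defaults):
    def wrap(fxn):
        return update_wrapper(partial(fxn, **defaults), fxn)
    return wrap

@xkwargs(bar='baz')
def foo(*args, **kwargs):
    return args, kwargs   # stand-in for the main body

default_call = foo(1)                      # 'bar' filled in from the defaults
override_call = foo(1, bar='x', extra=2)   # caller's value wins; extra keywords pass through
```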

keyword arguments are great for rapid prototyping

Often it is impossible to predict the twists and turns of how functionality will evolve as a piece of code is being developed. Even libraries with general-purpose utility functions will evolve as user code wants finer-grained options. Keyword arguments allow great flexibility in this context, because it's almost always possible to add (optional) keyword arguments with defaults that preserve previous behavior but add new functionality. Adding parameters to a function is as easy as adding attributes or methods to a class.

For example, I have a function for drawing F score contours on a precision-vs.-recall plot with matplotlib. It started off like this:

def fcurves():
    from pylab import ogrid, divide, clabel, contour, plot
    X, Y = ogrid[0:1:.001,0:1:.001]    # range of R and P values, respectively. X is a row vector, Y is a column vector.
    F = divide(2*X*Y, X+Y)   # matrix s.t. F[P,R] = 2PR/(P+R)
    plot(X[...,0], X[...,0], color='#cccccc')   # P=R
     # show F score curves at values .5, .7, and .9
    clabel(contour(X[...,0], Y[0,...], F, levels=[.5,.7,.9], colors='#aaaaaa', linewidths=2), fmt='F=%.1f', inline_spacing=1)

This was sufficient for what I needed at the time. But later I needed to make a similar plot, and found this function, but wanted slightly different functionality. Rather than replace it, I opted to generalize it via keyword arguments:

def fcurves(levels=[.5,.7,.9], lblfmt='F=%.1f'):
    from pylab import ogrid, divide, clabel, contour, plot
    X, Y = ogrid[0:1:.001,0:1:.001]    # range of R and P values, respectively. X is a row vector, Y is a column vector.
    F = divide(2*X*Y, X+Y)   # matrix s.t. F[P,R] = 2PR/(P+R)
    plot(X[...,0], X[...,0], color='#cccccc')   # P=R
     # show F score curves at specified levels
    clabel(contour(X[...,0], Y[0,...], F, levels=levels, colors='#aaaaaa', linewidths=2), fmt=lblfmt, inline_spacing=1)

Calling the function without arguments produces the same result as before! To the extent that this customization power (a) does not add too much additional complexity and (b) will be reused in the future, it is an improvement over the original implementation of the function, without breaking backward compatibility.

Warning: When adding new keyword arguments to a function, be sure to check that they are passed appropriately in any recursive calls!

Classes & Objects

built-in object operations

callable(obj)
returns whether the argument is callable
cmp(x,y) (comparison)
returns 0 if x==y, strictly positive if x>y, negative if x<y
hash(obj) (hashing)
id(obj) (identifier)
returns an integer id which is unique for the object instance during its lifetime
isinstance(obj, classinfo)
issubclass(cls, classinfo)
len(obj) (length)
super(type[, obj_or_type])
type(obj)
returns the type of an object
type(name, bases, dict)
constructs a type (class) with the specified name, superclasses, and contents

functions pertaining to attributes (members)

hasattr(obj, name)
getattr(obj, name[, default])
equivalent to obj.<name>, with an optional default value in case the attribute does not exist
setattr(obj, name, value)
delattr(obj, name)
equivalent to del obj.<name>
property([fget[, fset[, fdel[, doc]]]])
class property attribute with the given getter, setter, and deleter

decorators for methods/properties

@staticmethod
decorates a member function that does not take an instance argument
@classmethod
decorates a member function that is like a static method, but takes the type of the class as its first parameter
@property
decorates a function whose name is the name of a property and which defines the getter for that property
@x.setter
decorates a function whose name is property x and which defines a setter for that property
@x.deleter
decorates a function whose name is property x and which defines a deleter for that property
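All five decorators in one small class, as a sketch (the Temperature class is invented for illustration):

```python
class Temperature(object):
    def __init__(self, degrees):
        self._degrees = degrees

    @staticmethod
    def to_fahrenheit(celsius):        # no instance or class argument
        return celsius * 9 / 5.0 + 32

    @classmethod
    def freezing(cls):                 # receives the class, so subclasses construct their own type
        return cls(0)

    @property
    def degrees(self):                 # getter: t.degrees
        return self._degrees

    @degrees.setter
    def degrees(self, value):          # setter: t.degrees = value
        if value < -273.15:
            raise ValueError('below absolute zero')
        self._degrees = value

    @degrees.deleter
    def degrees(self):                 # deleter: del t.degrees
        del self._degrees

t = Temperature.freezing()
t.degrees = 100
boiling_f = Temperature.to_fahrenheit(t.degrees)
```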

decorator to store constructor arguments as attributes

Implementation and examples here. For instance,

class Person(object):
    def __init__(self, first, last, email, dob, role='student'):
        self.name = first + ' ' + last
        self.email = email
        self.dob = dob
        self.role = role

can be simplified to

class Person(object):
    @autoassign('email', 'dob', 'role')
    def __init__(self, first, last, email, dob, role='student'):
        self.name = first + ' ' + last

or even

class Person(object):
    @autoassign(exclude=('first', 'last'))
    def __init__(self, first, last, email, dob, role='student'):
        self.name = first + ' ' + last

Simply using @autoassign (with no arguments) stores all constructor arguments as attributes.

adding bound methods to instances


passing methods to higher-order functions

Jonathan Elsas points out that, because object methods take self as their first parameter, an expression like ' abc '.strip() can be rewritten as str.strip(' abc '). Why is this useful? Because often it is desirable to invoke instance methods indirectly via higher-order functions, decoupling the method name from the call. For example:

import operator
ops = [str.strip, str.split, operator.itemgetter(0), str.lower]
for op in ops:
	data = map(op, data)

Note the use of the operator module, which provides convenience methods for passing to higher-order functions operations that would normally be encoded with special syntax (operators).
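A runnable version with sample data. The list() call is there because map() is lazy in Python 3, and it is harmless in Python 2:

```python
import operator

data = ['  Hello World ', ' Foo bar']
ops = [str.strip, str.split, operator.itemgetter(0), str.lower]
for op in ops:
    data = list(map(op, data))   # strip, split into words, keep the first word, lowercase it
```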

super(), multiple inheritance, and method resolution order


Execution & I/O

with statement is now the preferred way to open files

with open('file.txt') as f:
	for ln in f:
		process(ln)	# 'process' stands for whatever is done with each line

built-in I/O, execution environment, and reflection functions

print([object, ...][, sep=' '][, end='\n'][, file=sys.stdout])
input([prompt])
expects a valid Python expression as input
open(filename[, mode[, bufsize]])
creates a file object
dir([obj])
returns a list of names in obj, or in the current scope if no argument is provided
eval(expression[, globals[, locals]])
execfile(filename[, globals[, locals]])
compile(source, ...)

file and stream handling libraries

special files

In addition to the built-in open() there are ways to open special kinds of files:

In Python 2.x, codecs.open() is for encoded text files:

import codecs
with codecs.open('file.txt', 'r', 'utf-8') as f:
	for ln in f:
		process(ln)	# 'ln' is automatically decoded into Unicode from UTF-8

gzip.open() is for gzipped files.

The tempfile module can generate temporary files/directories.
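A sketch combining the two ideas: writing and reading UTF-8 text inside a temporary directory. I use io.open() here because it accepts an encoding argument on both Python 2.6+ and Python 3 (where it is the same function as the built-in open()); the file name is arbitrary:

```python
import io
import os
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()              # a fresh, private directory
path = os.path.join(tmpdir, 'sample.txt')
try:
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u'caf\u00e9\n')          # text is encoded to UTF-8 on the way out
    with io.open(path, encoding='utf-8') as f:
        lines = [ln.rstrip(u'\n') for ln in f]   # and decoded on the way in
finally:
    shutil.rmtree(tmpdir)                # clean up the directory and its contents
```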

filesystem access

os provides basic filesystem support. Of particular note:

shutil offers additional support for copying and (re)moving files.

Unix-style pattern matching is supported for names of files (fnmatch) and paths (glob).
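For instance, fnmatch works on plain strings, so it can be tried without touching the disk (the names below are made up). glob.glob('*.txt') applies the same kind of pattern but returns paths that actually exist:

```python
import fnmatch

names = ['notes.txt', 'data.csv', 'README', 'draft.txt']
text_files = fnmatch.filter(names, '*.txt')      # keep only the matching names, in order
is_data_file = fnmatch.fnmatch('data.csv', 'data.*')
```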

fileinput [suggested by Jonathan Elsas]: iterate through lines of input from files (typically specified as script arguments) or standard input.

import fileinput, sys, glob
# reads from files named by all but the first 2 arguments, as well as all .txt files
# (fileinput does not expand wildcards itself, hence glob)
for ln in fileinput.input(sys.argv[3:]+glob.glob('*.txt')):
	process(ln)
# reads from stdin
for ln in fileinput.input([]):
	process(ln)
# defaults to sys.argv[1:], or sys.stdin if no arguments
for ln in fileinput.input():
	process(ln)
# read input encoded as UTF-8, whether in sys.stdin or a file specified as an argument
import codecs
sys.stdin = codecs.getreader("utf-8")(sys.stdin)
for ln in fileinput.input(openhook=fileinput.hook_encoded("utf-8")):
	process(ln)

Cf. io.

INI-style configuration files

The ConfigParser module provides functionality for working with configuration files of the format traditionally associated with the .ini extension. This format groups option assignments under one or more section headers (in brackets). Option values are then accessed by section.

Comments begin with a semicolon. Default options (implicitly part of a DEFAULT section) can be provided to the constructor. Option names are case-insensitive, though section names are case-sensitive. The SafeConfigParser class, which supports interpolation (variable substitution), is recommended. (Interpolation does not cross sections, though defaults are available from all sections.)

Supposing the contents of opts.ini are as follows:

[Input]
dir = ./in
file = %(dir)s/in.txt ; The interpolation syntax is %(...)s

[Output]
file = %(dir)s/out.txt ; Uses default value of 'dir'

[Hyperparams]
; These values can be tuned
alpha = 0.5
k = 30
optimize = no
evaluate = FALSE

This file can be read as follows:

import ConfigParser
config = ConfigParser.SafeConfigParser({'dir': '.'})
config.read('opts.ini')

config.get('Input', 'dir')=='./in'
config.get('Input', 'file')=='./in/in.txt'

config.get('Output', 'file')=='./out.txt'

config.items('Hyperparams')==[('dir', '.'), ('alpha', '0.5'), ('k', '30'), ('optimize', 'no'), ('evaluate', 'FALSE')]
config.items('Hyperparams', vars={'k': '100'})==[('dir','.'), ('alpha','0.5'), ('k', '100'), ('optimize','no'), ('evaluate','FALSE')]  # With override

Classes in the ConfigParser module can also be used to create configuration files.
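In Python 3 the module is named configparser. Here is a sketch of the same idea there, reading the settings from a string and using the typed accessors getfloat(), getint(), and getboolean(). I have left out the inline comments, since Python 3 does not strip them by default:

```python
import configparser   # Python 3 name of the ConfigParser module

INI_TEXT = u"""
[Input]
dir = ./in
file = %(dir)s/in.txt

[Output]
file = %(dir)s/out.txt

[Hyperparams]
; These values can be tuned
alpha = 0.5
k = 30
optimize = no
evaluate = FALSE
"""

config = configparser.ConfigParser({'dir': '.'})   # defaults, as before
config.read_string(INI_TEXT)

in_file = config.get('Input', 'file')              # interpolates the section's own 'dir'
out_file = config.get('Output', 'file')            # falls back to the default 'dir'
alpha = config.getfloat('Hyperparams', 'alpha')
k = config.getint('Hyperparams', 'k')
optimize = config.getboolean('Hyperparams', 'optimize')
evaluate = config.getboolean('Hyperparams', 'evaluate')
```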

PHP provides similar (but more limited) functionality in parse_ini_file(): interpolation and default options are not supported. The PHP implementation additionally supports array-valued options.

executing other processes: subprocess

The subprocess module is used to spawn new processes (e.g. system commands) and pipe to/from them. It is quite complex; see this tutorial for the important aspects.
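Two of the most common cases, sketched with the Python interpreter itself as the child process so that the example does not depend on any particular system command:

```python
import subprocess
import sys

# Run a child process and capture its standard output
output = subprocess.check_output([sys.executable, '-c', 'print(6 * 7)'])
answer = output.decode('utf-8').strip()

# Pipe data to a child's stdin and read the result from its stdout
child_code = 'import sys; sys.stdout.write(sys.stdin.read().upper())'
proc = subprocess.Popen([sys.executable, '-c', child_code],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
piped_output, _ = proc.communicate(b'hello')   # sends the input, waits for the child to exit
```

Passing the command as a list (rather than a single string with shell=True) avoids quoting problems.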

runtime CPU profiling: cProfile

See the Instant User's Manual. (Recommended by Jonathan Elsas.)
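A minimal sketch of profiling a single function from within a script and printing the most expensive calls. It is written for Python 3 (pstats writes text to the io.StringIO buffer), and slow_sum is just a stand-in workload:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(10000)
profiler.disable()

report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats('cumulative').print_stats(5)   # top 5 entries by cumulative time
summary = report.getvalue()
```

For a whole script, running python -m cProfile script.py from the command line is usually simpler.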

catching Ctrl+C

try:
	# stuff...
except KeyboardInterrupt:
	# Ctrl+C handling

interactive post-execution for debugging

From Ned Batchelder's blog:

The -i switch to the Python interpreter will run your program, then leave you in the command prompt when it ends. This can be good for interactively testing the functions defined in the file, or for debugging if an exception happens.

If you use "python -i" and need to debug, pdb.pm() is the thing to use. It places you at the point where the last traceback was raised. It seems like magic, since the exception has already risen to the top level of the program, but pdb.pm() puts you "back in time" before the exception started its climb up through the stack.

Source files

file headers

My Python source files start something like this:

# -*- coding: utf-8 -*-
'''
Utilities for scoring/evaluating data (human annotations or system output).

Plotting functions require the 'pylab' library (includes numpy and matplotlib).

@author: Nathan Schneider (nschneid)
@since: 2010-08-25
'''

# Strive towards Python 3 compatibility
from __future__ import print_function, division, absolute_import
from future_builtins import map, filter, zip

The first line is to indicate to the interpreter that there may be Unicode characters in the source file (e.g. in comments). Next is the docstring describing the module. Finally, when writing 2.7 code, there are several imports to enable Python 3 behavior: print() as a function rather than a statement; real division (never integer division) with the / operator; etc. This will make it easier to port the code without bugs.

Additionally, for text processing scripts I typically use the following:

import os, sys, re, codecs, fileinput
from collections import defaultdict, Counter

Specialty modules


mathy stuff

times and dates

Python's support for date/time/calendar functionality is horribly confusing, and distributed across several modules with varied APIs—only some of which are in the standard library. See datetime module overview (with links at the bottom to related modules). The inscrutability of these APIs is well known (see "Pythonic Dates, Times, and Deltas" thread on python-ideas; cf. "Dealing with Timezones in Python").

I hacked together a small module as a means of providing a consistent interface for some of this functionality (esp. converting among string, integer, and special API representations of dates and times). Others have similarly written alternative APIs: [1], [2]
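That code is not reproduced here. As a rough illustration of the kind of conversion involved, the sketch below uses only the standard library and treats every value as UTC, which sidesteps the timezone problems rather than solving them. All four function names are hypothetical.

```python
import calendar
import time
from datetime import datetime

DEFAULT_FORMAT = '%Y-%m-%d %H:%M:%S'

def parse_utc(s, fmt=DEFAULT_FORMAT):
    '''String -> naive datetime, which we agree to interpret as UTC.'''
    return datetime.strptime(s, fmt)

def to_epoch(dt):
    '''Naive UTC datetime -> integer seconds since 1970-01-01 00:00:00 UTC.'''
    return calendar.timegm(dt.timetuple())

def from_epoch(secs):
    '''Integer seconds since the epoch -> naive UTC datetime.'''
    return datetime(*time.gmtime(secs)[:6])

def to_string(dt, fmt=DEFAULT_FORMAT):
    '''Datetime -> string.'''
    return dt.strftime(fmt)

stamp = to_epoch(parse_utc('2010-08-25 00:00:00'))
print(stamp, to_string(from_epoch(stamp)))
```

Note that calendar.timegm() is the UTC counterpart of time.mktime(), which assumes local time; mixing the two is a common source of off-by-some-hours bugs.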

Development tools

The Pydev plugin for Eclipse is fantastic. It includes some fairly sophisticated static code analysis in order to support code completion, etc.

IPython is an interactive Python shell with a number of useful features: support for some basic filesystem operations, tab completion of object attributes, and automatic saving of command history across sessions. It also incorporates an interface for parallel computing.

General resources

installation/configuration for Mac OS X 10.6 (Snow Leopard)

I am by no means an expert on Python distribution methods or compiling code on OS X, but after several frustrating encounters trying to install new Python modules, here is my current understanding of the issues:

Cf. Installing Python 2.7, matplotlib, and ipython on Mac OS X Snow Leopard

Multiple friends have recommended the Homebrew package manager. I have not tried it yet and thus am not sure how it will handle the architecture issues.

general programming resources


Why I use Python (in case you care)

Today, programmers have a wide menu of options when it comes to programming languages. This is a good thing, as people have different needs and different tastes. I do not presume that any particular language is best for all people or all purposes.

That being said, I find Python uniquely fun and satisfying. It essentially boils down to two reasons: usability and community.


Usability

I like to think of Python as pseudocode that works. Ideally, code should effortlessly reflect algorithmic ideas; the language and development tools should work to support the developer, not the other way around.

There are several parts to usability:

Language expressivity and transparency
I refer here to the ease of mapping between the conceptual and the formal when it comes to describing an algorithm or system. This includes learnability (effort required to master the basics of the language), writeability (effort required to transform ideas into code once you know the language), and readability (effort to reconstruct ideas from code).
Domain flexibility
Some languages target a particular type of application. PHP is specialized for Web scripting; Perl is specialized for text processing; Matlab and R are specialized for mathematical/statistical computation. Python, in contrast, seeks to support important general-purpose programming styles and idioms in the language, and to provide specialized functionality in libraries (such as Django, NLTK, numpy/scipy/matplotlib).
Performance and compatibility
Because Python is a dynamic, interpreted language, cross-platform compatibility is rarely an issue with standard Python implementations. Performance is perhaps the greatest challenge Python must face if it is to become a truly general-purpose language. There is a lot of community excitement around performance-oriented implementations such as PyPy ([1], [2], [3]). Efforts including Jython, Cython, ctypes, and rpy2 provide cross-language interoperability, whether for performance or other reasons.
Development support
Tools/resources like editors, interpreters, debuggers, and documentation make the development process less frustrating and more efficient. Python's interactive interpreter and introspection capabilities, I think, are enormous assets when it comes to understanding how code works.
Low overhead
The strengths above make for a low barrier to entry for Python novices and a better day-to-day experience for experts. Starting a new project, understanding and tweaking existing projects, and making a piece of software robust (documentation, cross-platform compatibility, etc.) are all important aspects of this experience.


Community

The size, diversity, and enthusiasm of a programming language's user base are important. These characteristics dictate how much pressure there is to keep the language (and its libraries and resources) polished and relevant. Statistics attest to Python's large and growing user base ([1], [2], [3]). It is an open-source effort with support from individual volunteers as well as industry (e.g. Google). It is popular for many purposes, ranging from scientific computing to text processing to client and Web applications to education.

Usability and community feed each other: a more usable technology translates to less of a barrier to entry and greater user satisfaction, which makes for a larger community. A larger community means more diversity of the user base, and more energy towards making the technology more useful to more people.


[2.06] added sequence comparison and BetweenDict, included type() in built-in object operations, improved file headers and fileinput examples, fixed diveintopython link and bug in @xkwargs decorator
[2.05] elaborated the issues with installing/configuring Python for Snow Leopard; linked to Dive Into Python and "Dealing with Timezones in Python"
[2.04] added INI-style configuration files
[2.03] added interactive post-execution for debugging
[2.02] added iteration patterns for everyone, sequence literal repetition operator
[2.01] added super()/multiple inheritance, passing methods to higher-order functions, runtime CPU profiling: cProfile
[2.0] new section: Execution & I/O, including a new subsection on file and stream handling; updated mathematical links, including new link to scikits.learn; added Jonathan Elsas's comment about generator expressions
[1.01] added variable scoping