Running big jobs without user interaction

For many sophisticated calculations, computers can take hours, days, weeks, or even longer to calculate solutions. For example, here is a script that estimates certain curves using Monte Carlo integration. Save this file as perc.py.

__author__ = 'Timothy Reluga <treluga@psu.edu>'
__date__ = '2016.05.27'
__copyright__ = 'Copyright (c) 2016'
__license__ = 'For personal use only, with author approval'

import numpy
from matplotlib.pyplot import *

def find_bottom(A, i0=None):
    n = A.shape[1]
    if i0 is None:
        return max([find_bottom(A, i) for i in range(n)])
    B = 0*A
    def next(i, j):
        if j >= n:
            return n
        if B[i, j] == 1:
            return 0
        B[i, j] = 1
        #print hstack([A, B])

        if A[i, j] == 1:
            return j
        k = j
        k = max(k, next(i, j+1))
        if k == n:
            return k
        if j > 0:
            k = max(k, next(i, j-1))
            if k == n:
                return k
        if i > 0:
            k = max(k, next(i-1, j))
            if k == n:
                return k
        if i+1 < n:
            k = max(k, next(i+1, j))
            if k == n:
                return k
        return k
    return next(i0, 0)

def prob_perc(N = 1000, p = 0.5, n = 40):
    d = [find_bottom(numpy.floor(numpy.random.rand(n, n) + p))
         for i in range(N)]
    #e = array([[i, d.count(i)] for i in xrange(n+1)])
    return float(d.count(n))/float(N)

def main():
    numsamples = 5000
    p_vals = numpy.linspace(0.3, 0.6, 30)
    n_set = [ 5, 10, 20, 50, 100]
    for n in n_set:
        result = numpy.array([prob_perc(N=numsamples, p=p,n=n) for p in p_vals])
        plot(p_vals, result,'-')
    legend(["N = %d"%n for n in n_set], loc='upper right')
    xlabel('Probability a site is occupied ($p$)',fontsize=18)
    ylabel('Fraction of lattices percolating through $h(p,N)$',fontsize=18)
    show()

main()

Saving data and figures

First, this script is very slow as written, even though it is designed for interactive use. We can test it more quickly by greatly reducing the value of numsamples (even though this will make the figure inaccurate).

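Now, let's make the script work better as a background calculation with a few changes: take numsamples from the command line, and save the computed data to a file rather than plotting it right away.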
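
One way to arrange this is sketched below: numsamples is read from the command line, and the estimated curves are written to a text file instead of being plotted. The helper names (parse_numsamples, compute_curves, save_curves) and the file name perc_data.txt are illustrative assumptions, not part of the original script. The estimator prob_perc is passed in as an argument so that the sketch stands on its own.

```python
import sys
import numpy

def parse_numsamples(argv, default=5000):
    """Read numsamples from the first command-line argument, if given."""
    return int(argv[1]) if len(argv) > 1 else default

def compute_curves(prob_perc, numsamples, p_vals, n_set):
    """Return a table: column 0 holds p, then one column per n in n_set."""
    columns = [p_vals]
    for n in n_set:
        columns.append([prob_perc(N=numsamples, p=p, n=n) for p in p_vals])
    return numpy.column_stack(columns)

def save_curves(filename, table, n_set):
    """Write the table as plain text, with a header naming each column."""
    header = "p " + " ".join("n=%d" % n for n in n_set)
    numpy.savetxt(filename, table, header=header)

def main(argv, prob_perc, filename="perc_data.txt"):
    # perc_data.txt is an assumed file name; choose whatever suits you.
    numsamples = parse_numsamples(argv)
    p_vals = numpy.linspace(0.3, 0.6, 30)
    n_set = [5, 10, 20, 50, 100]
    table = compute_curves(prob_perc, numsamples, p_vals, n_set)
    save_curves(filename, table, n_set)

# In perc.py, the final call main() would then become:
#     main(sys.argv, prob_perc)
```

With these pieces in place of the original main(), the script no longer opens a plot window, so nothing blocks it when it runs unattended.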

Now, you can change your plot easily without having to recalculate all your data just to change an axis label, for example.
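
As an illustration, a separate plotting script might look like the sketch below. It assumes a hypothetical layout for the data file: plain text, with p in the first column and one curve per remaining column. The function name plot_curves and its arguments are made up for this example.

```python
import numpy
import matplotlib
matplotlib.use("Agg")  # draw to a file, so no display is needed
import matplotlib.pyplot as plt

def plot_curves(datafile, figfile, n_set):
    """Plot saved curves: column 0 is p, followed by one column per n."""
    table = numpy.loadtxt(datafile)
    p_vals = table[:, 0]
    fig, ax = plt.subplots()
    for column, n in zip(table[:, 1:].T, n_set):
        ax.plot(p_vals, column, '-', label="N = %d" % n)
    ax.legend(loc='upper right')
    ax.set_xlabel('Probability a site is occupied ($p$)', fontsize=18)
    ax.set_ylabel('Fraction of lattices percolating through $h(p,N)$',
                  fontsize=18)
    fig.savefig(figfile)
    plt.close(fig)
```

Calling plot_curves("perc_data.txt", "perc.png", [5, 10, 20, 50, 100]) would redraw the figure from the saved file in a moment, however long the original calculation took. Both file names here are placeholders.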

The other advantage is that you can run your calculation in the background on your computer or another computer and just retrieve the data file when you are done to explore your result. Try this in the terminal by cd'ing to the right directory and running

$ python perc.py 3

Estimate how large numsamples has to be for your program to take 8 minutes to run, using the time shell command and filling in small values for ? below.

$ time python perc.py ?
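
The extrapolation itself is simple arithmetic. Because each sample is drawn independently, the running time should grow roughly in proportion to numsamples. A minimal sketch of the scaling, with a hypothetical helper name and made-up trial timings:

```python
def estimate_numsamples(small_numsamples, small_seconds, target_seconds=8 * 60):
    """Scale a timed trial run linearly up to the target running time."""
    seconds_per_sample = small_seconds / float(small_numsamples)
    return int(target_seconds / seconds_per_sample)

# Hypothetical trial: numsamples = 3 took 6 seconds, i.e. 2 seconds per sample.
print(estimate_numsamples(3, 6.0))  # prints 240
```

Very small trial runs also include fixed start-up costs such as importing numpy, so the estimate tends to improve if you base it on a trial that runs for at least several seconds.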

Then run perc.py for 8 minutes. Use the & character to make the job run in the terminal's background. Once the execution completes, plot the result data, and inspect your resulting figure -- notice that the curves will be much smoother than they were for small numbers of samples.

Add a title to the figure without rerunning perc.py.

Shell scripts for batch runs

Now, one advantage of running a script from the terminal is that you can configure it through its arguments and keep a shell script that documents your configuration, without having to edit the code every time.

  1. Modify perc.py so that it now takes two arguments, the first being numsamples and the second being a value for n. Use the %d string trick to create a unique file name based on the parameters you are passing in, and save the resulting data to this file.

  2. Write a shell script that will do the calculations for values of n in the set {2,4,8,16,32}.
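
For the first step, the argument handling and file naming could be sketched as follows. The function names and the naming pattern perc_N%d_n%d.txt are assumptions chosen for illustration; any pattern that includes both parameters would do.

```python
import sys

def parse_args(argv):
    """Return (numsamples, n) from the two command-line arguments."""
    if len(argv) != 3:
        raise SystemExit("usage: python perc.py numsamples n")
    return int(argv[1]), int(argv[2])

def output_filename(numsamples, n):
    """Build a file name that is unique to this pair of parameters."""
    # The pattern below is an assumed naming scheme.
    return "perc_N%d_n%d.txt" % (numsamples, n)

print(output_filename(1000, 8))  # prints perc_N1000_n8.txt
```

Inside perc.py, parse_args(sys.argv) would supply the two values, and output_filename would give each run its own data file so that batch runs do not overwrite one another.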

Now, one of the very powerful tricks the shell can do is run several copies of your script in parallel so that they use all of your CPU cores (Python does not do this automatically).

  1. Find out how many processor cores your workstation has. Store this in a shell variable called processors.

  2. Run the following command for an appropriate value of numsamples.

    echo "2 4 8 16 32" | xargs -n 1 -P $processors python perc.py $numsamples

    If you open up a separate terminal window, you can monitor your memory and CPU usage using top.
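
For comparison, the same fan-out can be written in Python itself. This is a stand-in for the xargs command, not a replacement for the exercise. The sketch assumes perc.py accepts numsamples and n as its two arguments; the helper names are hypothetical. Each worker thread only launches a separate Python process and waits for it, so the real work still spreads across the cores.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def build_commands(numsamples, n_values, script="perc.py"):
    """One command line per value of n: python perc.py numsamples n."""
    return [[sys.executable, script, str(numsamples), str(n)]
            for n in n_values]

def run_parallel(commands, processors):
    """Run the commands, at most `processors` at once; return exit codes."""
    with ThreadPoolExecutor(max_workers=processors) as pool:
        return list(pool.map(subprocess.call, commands))
```

A call such as run_parallel(build_commands(1000, [2, 4, 8, 16, 32]), 4) would then play the role of the pipeline above, with 4 standing in for the number of cores on your machine.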

Tracking and sharing your work

One other very helpful tool when writing software is a version control system, which tracks the changes you make to a program and can be used to recover old versions and merge in changes that other people make. Initially we had CVS, then Subversion, but today we have Git, which is widely popular.

Git for version control and sharing

Public (and private) repositories