Blog Archive

Tuesday, July 31, 2012

Data Fitting Python

First here are few helpful links:
Data fitting using fmin

Fitting nonlinear models to data
Simple example of curve-fitting


1) Get Data

Lets say you have some data you want to fit -
from pylab import *
from pylab import *
from numpy import *



## Generating data to fit
n = 20
xmin = 1
xmax = 11
a=2
b=4
funct=lambda x:a*x+b
x = linspace(xmin,xmax,n)
y = funct(x) 

a=plot(x,y,'ro')
show(a)

This data was generated from the line y=ax+b where the slope a=2 and intercept b=4.

However, if you just have the data handed to you or you recorded it-you do not know this. Which brings us to step 2.







2) Guess the Relationship between the x and y coordinates

You might guess y=ax+b is the right relationship between the data points. "Hey" you say "That looks like a line!". And yes you would be right. You might even be able to guess with a decent amount of certainty what the slope (a) and intercept (b) should be. However sometimes you get messier relationships that are not so straightforward when you fit data. So for example's sake, we will continue to pretend we are dumbfounded as to what the slope(a) and intercept (b) should be.

We can graph the relationship and throw out a guess for "a" and "b"....



from pylab import *
from numpy import *


## Generating data to fit
n = 20
xmin = 1
xmax = 11
a=2
b=4
funct=lambda x:a*x+b
x = linspace(xmin,xmax,n)
y = funct(x)

#Our function we think might fit the data
A=3
B=4
my_guess_function=lambda x:A*x+B
X=linspace(xmin,xmax,100)
Y=my_guess_function(X)


a=plot(x,y,'ro',X,Y,'-')
title('The Data with Best Guess Function')
show(a)






Well...almost I guessed A=3 and B=4. We can keep guessing and graphing to find something closer. But how do we tell if one guess is closer than another? It can be hard to eyeball at a certain point. This brings us to step 3.





3) Write a Function to Score the Fit

How close is close? Well we want to minimize the distance between the
blue line (our guess) and the red dots (the data). Ideally we want them on top of each other (aka distance from blue line to red dot = zero).

Whatever guess gets us close to the most data points wins...basically.

You can determine which guess gets closest by calculating the "sum of squares".
If you have never heard of it click here. There are other ways to calculate how close your guess is to data but this one is the most straightforward and useful.

Here is an example of how to code it to check and see 'which guess is best'


## Generating data to fit
n = 20
xmin = 1
xmax = 11
a=2
b=4
funct=lambda x:a*x+b
x = linspace(xmin,xmax,n)

y = funct(x)

# The function we think fits the data
my_guess_function=lambda x:A*x+B

#Use sum of squared distances (SS) to score the fit
ss = lambda x, y: ((my_guess_function(x)-y)**2).sum()

#our 1st fit attempt
A=3
B=4
X1=linspace(xmin,xmax,100)
Y1=my_guess_function(X1)
ss1=ss(my_guess_function(x),y)

#our 2nd fit attempt
A=1.5
B=2
X2=linspace(xmin,xmax,100)
Y2=my_guess_function(X2)
ss2=ss(my_guess_function(x),y)

#plotting the data and the fits
plt.plot(x,y,'ro',X1,Y1,'-',X2,Y2,'-')
plt.legend(('data', 'fit 1', 'fit 2'),'upper center')
graphname=' Fit1: '+str(ss1)+' Fit2: '+str(ss2)
plt.title(graphname)
plt.show()


The summed squares of the distances are listed in the title of this figure. we can see that the second fit in green has a lower number (136) then the first fit in blue (67346.) This means that the green line was, overall,closer to the red data points since its overall squared distances were smaller.





 4)  Optimize to find the Best Fit.

Previously, we figured out a way to determine if one fit was better than another. Now we need to find the 'best' fit possible. We could keep guessing over and over again, and try to find the fit that gives us the lowest sum of squared distances.
But this is a pain. Really we need to have an optimizer do this for us.


The optimizer will use an algorithm to find the values of A and B that minimize
the sum of squares. It will give us the 'least sum of squares' .


import numpy 
import matplotlib.pyplot as plt
from scipy.optimize import fmin

#==============================================
## Generating data to fit
n = 20
xmin = 1
xmax = 11
a=2
b=4
funct=lambda x,a,b:a*x+b
x = linspace(xmin,xmax,n)
y = funct(x,a,b)
#===============================================



#scores fit of functions for a set of parameters
def score((parms)):
    #1 pull off parameters from optimizer input
    A,B=parms
    #2 data you are matching to
    funct=lambda x,a,b:a*x+b
    x = linspace(1,11,20)
    y = funct(x,2,4)
    #3 write function to fit data
    my_guess_function=lambda x,A,B:A*x+B
    #4 write cost function to see how well it fits
    ss=lambda x,y,A,B:((my_guess_function(x,A,B)-y)**2).sum()
    #5 return the fit score for this try
    return ss(x,y,A,B)


#find the parameters that give us the lowest score
answ=fmin(score,(3,1),full_output=1,maxiter=1000)

#Create function from lowest scoring parameters
function=lambda x,A,B:A*x+B
A=answ[0][0]
B=answ[0][1]
X1 = linspace(xmin,xmax,n*5)
Y1= funct(X1,A,B)

#===============================================
#plotting the data and lowest scoring (aka best)fit
plt.plot(x,y,'ro',X1,Y1,'-')
plt.legend(('data', 'the lowest scoring fit'),'upper center')
graphname='optimized fit'
plt.title(graphname)
plt.show()
#===============================================




















5) Other Optimization Functions

There are a variety of optimization functions we could have used
see here.

For instance if you have a discrete optimization problem it might make
more sense to use an optimizer that anneals...

However if you have a problem whose minimization function does not
tend to 'jump' around as you shift parameters, then you should stick
with something like fmin.


from numpy import *
import matplotlib.pyplot as plt
from scipy.optimize import fmin
from scipy.optimize import anneal

#==============================================
## Generating data to fit
n = 20
xmin = 1
xmax = 11
a=2
b=4
funct=lambda x,a,b:a*x+b
x = linspace(xmin,xmax,n)
y = funct(x,a,b)
#===============================================



#scores fit of functions for a set of parameters
def score((parms)):
    #1 pull off parameters from optimizer input
    A,B=parms
    #2 data you are matching to
    funct=lambda x,a,b:a*x+b
    x = linspace(1,11,20)
    y = funct(x,2,4)
    #3 write function to fit data
    my_guess_function=lambda x,A,B:A*x+B
    #4 write cost function to see how well it fits
    ss=lambda x,y,A,B:((my_guess_function(x,A,B)-y)**2).sum()
    #5 return the fit score for this try
    return ss(x,y,A,B)


#find the parameters that give us the lowest score
answ=anneal(score,(3,1),schedule='boltzmann',full_output=1,maxiter=10000)

#Create function from lowest scoring parameters
function=lambda x,A,B:A*x+B
A=answ[0][0]
B=answ[0][1]
X1 = linspace(xmin,xmax,n*5)
Y1= funct(X1,A,B)

#===============================================
#plotting the data and lowest scoring (aka best)fit
plt.plot(x,y,'ro',X1,Y1,'-')
plt.legend(('data', 'the lowest scoring fit'),'upper center')
graphname='optimized fit'
plt.title(graphname)
plt.show()
#===============================================

Wednesday, July 25, 2012

How to Use a Nested Dictionary


Sometimes you will want to look up two sets of criteria... in this case what amino acid is represented by 'A' and what its side chain is

adic={
     'R':{'name':'Arginine','color':'red','shape':'s','sidechain':'poscharged'},
     'H':{'name':'Histidine','color':'Maroon','shape':'s','sidechain':'poscharged'},
     'K':{'name':'Lysine','color':'Salmon','shape':'s','sidechain':'poscharged'},
     'D':{'name':'Aspartic Acid','color':'blue','shape':'s','sidechain':'negcharged'},
     'E':{'name':'Glutamic Acid','color':'DarkBlue','shape':'s','sidechain':'negcharged'},
     'S':{'name':'Serine','color':'Purple','shape':'o','sidechain':'polar'},
     'T':{'name':'Threonine','color':'Indigo','shape':'o','sidechain':'polar'},
     'N':{'name':'Asparagine','color':'DarkViolet','shape':'o','sidechain':'polar'},
     'Q':{'name':'Glutamine','color':'DarkViolet','shape':'o','sidechain':'polar'},
     'C':{'name':'Cystein','color':'Tomato','shape':'^','sidechain':'---'},
     'U':{'name':'Selenocysteine','color':'OrangeRed','shape':'^','sidechain':'---'},
     'G':{'name':'Glycine','color':'DarkOrange','shape':'^','sidechain':'---'},
     'P':{'name':'Proline','color':'Gold','shape':'^','sidechain':'---'},
     'A':{'name':'Alanine','color':'GreenYellow','shape':'D','sidechain':'hydrophobic'},
     'V':{'name':'Valine','color':'SpringGreen','shape':'D','sidechain':'hydrophobic'},
     'I':{'name':'Isoleucine','color':'ForestGreen','shape':'D','sidechain':'hydrophobic'},
     'L':{'name':'Leucine','color':'Lime','shape':'D','sidechain':'hydrophobic'},
     'M':{'name':'Methionine','color':'DarkGreen','shape':'D','sidechain':'hydrophobic'},
     'F':{'name':'Phenylalanine','color':'LimeGreen','shape':'D','sidechain':'hydrophobic'},
     'T':{'name':'Tyrosine','color':'SeaGreen','shape':'D','sidechain':'hydrophobic'},
     'W':{'name':'Tryptophan','color':'MediumSpringGreen','shape':'D','sidechain':'hydrophobic'},
     }

>>> adic['A']['name']
'Alanine'
>>> adic['A']['sidechain']
'hydrophobic'

Tuesday, July 24, 2012

Sorting Hat Function

Sometimes you will want to sort a spreadsheet by columns.

Take for example
'green',100
'blue',100
'orange',3
'blue',10
'orange',11

we could sort this spread sheet by either the color column or the number column.

Using the following function-
def sorting_hat(a_list,column):
    col=column
    bycolumn=lambda x:x[col]
    from itertools import groupby
    a_list=sorted(a_list,key=bycolumn)
    cat=[]
    gr=[]
    for c,g in groupby(a_list,key=bycolumn):
        cat.append(c)
        gr.append(list(g))
    a_list=gr 
    return a_list,cat
we can sort by color (column 0)
info=[['green',100],
      ['blue',100],
      ['orange',3],
      ['blue',10],
      ['orange',11]]

sortedlist,categories=sorting_hat(info,0)

>>> categories
['blue', 'green', 'orange']
>>> sortedlist
[[['blue', 100], ['blue', 10]],
 [['green', 100]],
 [['orange', 3], ['orange', 11]]]


and we can sort by number (column 1)

info=[['green',100],
      ['blue',100],
      ['orange',3],
      ['blue',10],
      ['orange',11]]

sortedlist,categories=sorting_hat(info,1)

>>> categories
[3, 10, 11, 100]
>>> sortedlist
[[['orange', 3]], 
[['blue', 10]],
 [['orange', 11]],
 [['green', 100], ['blue', 100]]]