## Summary

There is no general solution for the minimization of chi-squared to find the best estimates of the parameters when the model is nonlinear. A number of methods have been developed that can often find the best estimates and their uncertainties with arbitrary precision. Iterative methods are usually used to minimize a nonlinear chi-squared. Another brute force method, called the grid method, tests every possible set of parameters with a certain precision. This is usually not possible when the parameter space or the number of parameters are large. Commercial curve fitting software usually implements an iterative method called the Levenberg-Marquardt method. If you have a nonlinear curve fitting software you can also apply it to linear models also if you want, but there is no guarantee you get the analytic solution. In this lesson, we discuss how to implement the grid method for simple problems. We give an example of Newton’s method as a basic iterative method. We explore the gradient descent method, Gauss-Newton method, and the Levenberg Marquardt method both theoretically and in MATLAB. Users of curve fitting software should understand the underlying algorithms and be at least able to check if the software is performing the calculation correctly.

## The minimization of nonlinear data

We minimize the error function in least squares theory.

We showed in a previous lesson that if is a linear model, or

Then has a unique global minimum and we can analytically derive expressions for the best estimates of the model parameters and their uncertainties, and .

When is not of the above form then it is nonlinear. An example is

When you take the derivative of for a nonlinear model, you get a nonlinear equation which may have no general analytical solution in terms of linear algebra. One then has to resort to different methods. One typically resorts to iterative methods that require an initial guess for and update rules for the parameters that move towards the minimum of . One repeats this a number of times until the algorithm converges to some minimum of . This may be a local minimum of the actual global minimum. There is not a general test to show that a local minimum is actually the global minimum. It is always a good idea to start with different initial guesses and see if the same minimum is reached. This is no guarantee, but it is generally tolerated.

We want to work on some nonlinear data in this lesson so we can take a simple nonlinear model. For example, a common nonlinear model is a mass oscillating on a spring with damping.

For a certain choice of constants, , the motion is called critically damped and obeys the following solution.

We can add some random noise to that function and take the average of five samples to generate our data set. This way we know how close different approach fit to the actual parameters. Now we generate a dataset in MATLAB and save it as a .mat file.

Example Generate some sample data that represent critical damping.

We take five simulated measurements and average them.

clear;figure(1);clf;

A = 2;

B = -5;

C = 1;

xi = 0:0.15:5;

Ni = 5;

ymodel = (A+B.*xi).*exp(-C.*xi);

yi = [];

sigi = [];

for i = 1:length(xi)

temp = [];

for j = 1:Ni

temp(j) = (A + B.*xi(i)).*exp(-C.*xi(i)) + normrnd(0,0.5);

end

yi(i) = mean(temp);

sigi(i) = std(temp)/sqrt(Ni);

end

subplot(2,1,1);

plot(xi,ymodel,’.-k’,’markersize’,20);

axis([0 5 -2 3]);

hold on;

xlabel( ‘Time [s]’,’fontsize’,20);

ylabel( ‘Position [m]’,’fontsize’,20);

title(‘True values for critical damping’,’fontsize’,20);

subplot(2,1,2)

errorbar(xi,yi, sigi,’.k’,’markersize’,15);

xlabel( ‘Time [s]’,’fontsize’,20);

ylabel( ‘Position [m]’,’fontsize’,20);

title(‘Simulated values for critical damping’,’fontsize’,20);

axis([0 5 -2 3]);

hold on;

plot(xi,ymodel,’-k’,’markersize’,20)

save criticaldampingdata xi yi sigi

## Brute force grid method

If you have the computational power then there is a brute force approach to curve fitting. This method is called the grid method. Suppose from the graph of some linear data you can tell that is between 3 and 4 and is between 0.5 and 1.5. Then you can calculate for every value in between to 0.01 or 0.001 resolution. You then just find the least one by direct search in Matlab. Of course, those might take 10,000 or 100,000 calculations but you will get the same answer as the analytical result. You can do a coarse grid to find local minima and then do a fine grid to find the global minimum. One advantage of the grid method is that you don’t need to know much theory and it is pretty obvious how it works. If you cannot narrow the parameter space to include the global minimum then you might require too many calculations to perform the optimization.

We will now study some iterative methods that can be used to minimize chi-squared. Grid method is not really an iterative method like the rest. You try to cast a net over the parameter surface. You see which loop of the net produces the minimum, then you cast a finer net over that loop and rapidly converge towards a minimum. This method is good because you get out of it what you expect. When you have lots of parameters it might not be possible to cast a fine enough net to progress because there are too many combinations.

Example. Implement the grid method on Matlab to fit a straight line. Compare the results to

We can implement the grid method to fit a straight line model. It works just the same with nonlinear models.

%Grid Method

clear; clf;

A = 10;

B = 3;

Ni = 15;

xi = 1:2:15

for i = 1:length(xi)

tx = [];

for j = 1:Ni

tx(j) = A + B*xi(i) + normrnd(0,3);

end

ei(i) = std(tx)/sqrt(Ni);

yi(i) = sum(tx)/Ni;

end

errorbar(xi,yi,ei,’.k’, ‘markersize’,10);

xlabel(‘trial’)

ylabel(‘measurement’)

hold on;

[A sigA B sigB] = lineABx(xi,yi,ei)

A

sigA

B

sigB

a = 9.5:0.0001:10.5;

b = 2.9:0.0001:3.1;

for j = 1:length(a)

for k = 1:length(b)

chisquared(j,k) = sum( 1./ei.^2.*(yi-b(k).*xi – a(j)).^2 );

end

end

g = min(min(chisquared));

[I J] = find(chisquared == g);

AA = a(I);

BB = b(J);

A

AA

B

BB

## Important matrix math

We can represent in terms of vectors. Let

We define the residual vector.

We define the weight matrix which is diagonal.

So we can write as

We go back and forth between column and row vectors using the transpose property. Some properties of transpose are

When you multiply an matrix with an matrix you get an matrix. You can use size(A) command in Matlab to see how your matrices are arranged. You can use the transpose operator A’ to convert a column vector to a row vector and vice versa. You may have to check your expression if you haven’t kept track of your transposes.

We will be taking partial derivatives of with respect to the parameters which is the Jacobian matrix.

We also have the Hessian matrix which is defined as

Here is an important result from Matrix calculus for a symmetric matrix .

## Newton’s method

We will explore Newton’s method because it is a simple example of an iterative method which we will be using to minimize . Consider the following diagram.

Illustration of Newton’s method

We have

We can rearrange the expression as

So if we know the initial guess, we can find . Then we can plug back in in the next iteration to find and so on.

We get closer and closer to the point where with each iteration.

Example Find the square root of two to high precision.

We can try out Newton’s method to calculate the square root of 2. If we have the equation,

Then we can write . This function goes to zero at so we can apply Newton’s method.

clear; format long;

N = 10;

x(1) = 2.

for i = 1:N

x(i+1) = x(i) – (x(i)^2 -2)/(2*x(i));

end

If we look at the contents of we find,

n x_n

1 2

2 1.500

3 1.416666666666667

4 1.414215686274510

5 1.414213562374690

6 1.414213562373095

7 1.414213562373095

The update function just produces fractions or rational numbers if you choose an initial guess like 2. In old times, before the calculator one could calculate the square root of any number to sufficient accuracy by finding the fraction, then doing long division in the last step. It is quite a good method, 15 digits are perfect after the 6th iteration.

There are a number of things that can go wrong with Newton’s method. Each successive update may result in oscillation or the updates may be directed to another solution you are not looking for. The best way to find a root is to make a plot of near the root and plan your initial guess such that it climbs down the hill to the root avoiding any bumps or being too far away from the minimum. Also, your function may have a restricted domain or singularities where division by zero occurs. These will obviously crash Newton’s method.

## Gauss-Newton method

The Gauss-Newton algorithm is another iterative curve fitting method.

One can arrive at the update formula by considering a Taylor expansion of with respect to an update .

At the minimum of , then

Then we have

We want to take the derivative of with respect to and set it equal to zero. Remember the important derivative.

We can take the transpose of that expression to get

Remember that is symmetric. Solving for gives the Gauss Newton update rule.

To implement the Gauss-Newton method we only need to code for the update rule. Gauss-Newton seems to be more picky about the initial guess than other methods.

Example. Implement the Gauss-Newton method to fit the model data

We change the initial guess and update the delta rule.

clear;clf;figure(1);

load modeldata.mat

errorbar(xi, ]yi, sigi,’.k’);

hold on;

xlabel(‘time [s]’, ‘fontsize’,20);

ylabel(‘distance [m]’, ‘fontsize’,20);

N = length(xi);

W = zeros(N,N);

for i = 1:N

W(i,i) = 1/sigi(i)^2;

end

%some initial random guess

A(1) = 1.4

B(1) = -4.3

C(1) = 1.3

gamma = 0.00001; %Arbitrary step size

S = 10000; %Number of steps

for s = 1:S

for i = 1:N

J(i,1) = exp(-C(s)*xi(i));

J(i,2) = xi(i)*exp(-C(s)*xi(i));

J(i,3) = (A(s) + B(s)*xi(i) )*(-xi(i))*exp(-C(s)*xi(i));

ri(i) = (yi(i) – (A(s)+ B(s)*xi(i))*exp(-C(s)*xi(i)));

end

% we really only have to change one line of code for delta

delta = inv(J’*W*J)*J’*W*ri’;

A(s+1)=A(s)+delta(1);

B(s+1)=B(s)+delta(2);

C(s+1)=C(s)+delta(3);

end;

ymodel = (A(end)+B(end).*xi).*exp(-C(end).*xi);

plot(xi, ymodel,’r’);

title(‘Gauss Newton fits’,’fontsize’,20);

Gauss-Newton works

## Gradient decent method

Imagine as a multidimensional hyper-surface of the possible fit parameters, . A choice of corresponds to a location on the hypersurface. We can use the result from vector calculus that will point towards a local minimum. We can then step in that direction and repeat the process. We take an arbitrary step size of .The gradient of chi squared is calculated as

The gradient points up the slope of a function so our update rule should point in the opposite direction. We choose the step size to be proportional to which is something we set arbitrarily. We can make it and see if the calculation converges.

Example Implement gradient descent with the model data described earlier in this chapter.

We load the data and do the calculation

%Imprementing gradiate descent

clear;clf;figure(1);

load modeldata.mat

errorbar(xi, yi, sigi,’.k’);

hold on;

xlabel(‘time [s]’, ‘fontsize’,20);

ylabel(‘distance [m]’, ‘fontsize’,20);

N = length(xi);

W = zeros(N,N);

for i = 1:N

W(i,i) = 1/sigi(i)^2;

end

%some initial random guess

A(1) = .2

B(1) = -.2

C(1) = .3

gamma = 0.00001; %Arbitrary step size

S = 10000; %Number of steps

for s = 1:S

for i = 1:N

J(i,1) = exp(-C(s)*xi(i));

J(i,2) = xi(i)*exp(-C(s)*xi(i));

J(i,3) = (A(s) + B(s)*xi(i) )*(-xi(i))*exp(-C(s)*xi(i));

ri(i) = (yi(i) – (A(s)+ B(s)*xi(i))*exp(-C(s)*xi(i)));

end

delta = 2*gamma*ri*W*J;

A(s+1)=A(s)+delta(1);

B(s+1)=B(s)+delta(2);

C(s+1)=C(s)+delta(3);

end;

ymodel = (A(end)+B(end).*xi).*exp(-C(end).*xi);

plot(xi, ymodel,’r’);

title(‘Gradient descent fit’,’fontsize’,20);

Gradient descent works

## Levenberg Marquardt Method

The Levenberg-Marquardt algorithm is one of the most powerful curve fitting methods available. Most commercial software packages use it in one form or another. It is robust, meaning that it will converge to a local minimum for a wide range of initial guesses. It does not always find the global minimum. The Levenberg-Marquardt algorithm can be thought of as a combination of the Gauss-Newton method and the gradient descent method. To derive the Levenberg-Marquardt algorithm one starts along the same lines as the Gauss-Newton method and we arrive at

The Levenberg-Marquardt method slightly modifies this equation to

The diag symbol means only the components along the diagonal are non-zero.

The update can be solved for by an inverse matrix.

The Levenberg-Marquardt update rule is given by

Example Implement the Levenberg Marquardt algorithm in Matlab

Here is the code

clear; clf; figure(1);

load criticaldampingdata

errorbar(xi, yi, sigi,’.k’);

hold on;

xlabel(‘time [s]’, ‘fontsize’,20);

ylabel(‘distance [m]’, ‘fontsize’,20);

N = length(xi);

W = zeros(N,N);

for i = 1:N

W(i,i) = 1/sigi(i)^2;

end

%some initial random guess

A(1) = 1;

B(1) = -4;

C(1) = 1.3;

S = 10000;

lambda = 0.01;

for s = 1:S

for i = 1:N

J(i,1) = exp(-C(s)*xi(i));

J(i,2) = xi(i)*exp(-C(s)*xi(i));

J(i,3) = (A(s) + B(s)*xi(i) )*(-xi(i))*exp(-C(s)*xi(i));

ri(i) = (yi(i) – (A(s)+ B(s)*xi(i))*exp(-C(s)*xi(i)));

end

Q = J’*W*J;

QQ = zeros(3,3);

QQ(1,1) = Q(1,1)*(1+lambda);

QQ(2,2) = Q(2,2)*(1+lambda);

QQ(3,3) = Q(3,3)*(1+lambda);

delta = inv(Q+QQ)*J’*W*ri’;

A(s+1)=A(s)+delta(1);

B(s+1)=B(s)+delta(2);

C(s+1)=C(s)+delta(3);

end;

ymodel = (A(end)+B(end).*xi).*exp(-C(end).*xi);

plot(xi, ymodel,’-.’);

title(‘Levenberg Marquardt fits’,’fontsize’,20);\end{verbatim}

## Nonlinear least squares uncertainties

If you can have already found the optimal parameters by one of the various methods we are about to discuss then it is straightforward to calculate the estimated parameter uncertainties. Recall from the generalized linear theory, we had the matrix . There is a second way to calculate this matrix just take

Now we can find and that leads to the parameter uncertainties . One should verify that this method works by testing in on the model to generate .

For nonlinear models we can write out the matrix elements by calculating the derivatives.

Example. Find the parameter estimate for and its uncertainty in the straight line model through the origin by calculating . The formula for when we fit the line through the origin was

We can easily find it is a 1 by 1 matrix.

Since is the inverse of we have

Therefore

This is exactly what we got before with this model using different methods.

Example Calculate for the critical damping problem and then find the uncertainties in the estimates.

For this task, I definitely want to use Mathematica. I notice that is symmetric so we only have to calculate six quantities. To find the derivatives in Mathematica I type the following.

The results are implemented in MATLAB

wi = sigi.^(-2);

A=A(end);

B=B(end);

C=C(end);

E1 = exp(-C.*xi);

E2 = exp(-2.*C.*xi);

P1 = (A + B.*xi);

X2 = xi.*xi;

D(1,1) = sum( wi.*E2);

D(2,2) = sum( wi.*X2.*E2);

D(3,3) = sum( wi.*(E2.*X2.*P1.^2)-E1.*X2.*P1.*(-E1.*P1 + yi));

D(1,2) = sum( wi.*E2.*xi);

D(1,3) = sum( -wi.*E2.*xi.*P1 + wi.*E1.*xi.*(-E1.*P1 + yi));

D(2,3) = sum( -wi.*E2.*X2.*P1 + wi.*E1.*X2.*(-E1.*P1 +yi));

D

M = inv(D)

A

sqrt(M(1,1))

B

sqrt(M(2,2))

C

sqrt(M(3,3))

With five samples the values are usually pretty close to the true parameters. If you increase the number of samples then you get even closer to the true values. I was expecting that the error bars would be larger and enclose the true values, but this is not always the case. The curve fit (dotted-dashed line) is visually pretty close to the true model, yet small differences remain. One can experiment with the code to try different starting models and initial guesses. This exercise was meant to show the LM method and how to implement it in MATLAB. You can easily now adapt the code to your curve fitting functions.

Levenberg Marquardt works. For this plot, the agreement with the true values is reasonable. A = 2.077(93), B = -5.17(12), and C = 1.026(16)

We have to get into further discussions of statistics later to determine what it means to have a good fit. This depends on the value of chi-squared and the number of degrees of freedom. For the meantime, we are stuck with chi by eye.

NonlinearLeastSquaresCurveFitting.pdf

Visit www.sciencemathmastery.com for more lessons on error analysis, probability, and statistics.