User:Egm6936.f09/Weibull distribution

Introduction
Here, we present some real experimental data that fit the Weibull distribution, which is a more general form of the exponential distribution. The data can be found in a spreadsheet format, here. The cumulative distribution function (CDF) of the Weibull distribution is given as


$$  \displaystyle F_X (x) = 1 - \exp \Big[ - \Big( \frac{x}{p_2} \Big)^{p_1} \Big] $$.     (A1)

To fit the experimental data, the Weibull distribution is generalized by introducing a third coefficient:


$$  \displaystyle F_X (x) = p_3 \Big[ 1 - \exp \Big( - \Big( \frac{x}{p_2} \Big)^{p_1} \Big) \Big] $$.     (A2)

When a nonlinear fit is applied, the parameters are found as


$$  \displaystyle p = [1.6407, 740.2473, 933.7323] $$.     (A3)
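As a quick sanity check, the generalized CDF of Eq. (A2) with the fitted parameters of Eq. (A3) can be evaluated directly. The sketch below (Python is used here for illustration; the document itself works in Octave and Matlab) shows that the curve rises from zero toward the saturation value $$ \displaystyle p_3 $$.

```python
import math

def gen_weibull_cdf(x, p1, p2, p3):
    """Generalized Weibull CDF of Eq. (A2): p3 * (1 - exp(-(x/p2)^p1))."""
    return p3 * (1.0 - math.exp(-(x / p2) ** p1))

# Fitted parameters from Eq. (A3)
p1, p2, p3 = 1.6407, 740.2473, 933.7323

print(gen_weibull_cdf(0.0, p1, p2, p3))     # zero faults at t = 0
print(gen_weibull_cdf(1300.0, p1, p2, p3))  # approaching the plateau near 1300 staff days
```

For very large $$ \displaystyle x $$ the exponential term vanishes and the CDF saturates at $$ \displaystyle p_3 \approx 934 $$, consistent with the plateau seen in the data.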

With the above parameters, the distribution looks like



FIG 1: The distribution and the data points

When we take $$ \displaystyle p(3) = 1 $$ as in the usual Weibull distribution, we obtain the CDF as



FIG 2. The Weibull distribution for $$ \displaystyle p(3) = 1 $$

The story about the experimental data
The data was first given in the paper by Dalal and McIntosh (1994) and reproduced in the book by Cook and Lawless (2007, Appendix D, Table D3, p.368). The example comes from the debugging process of a software system with around seven million noncommentary source lines. Cumulative staff days, $$ \displaystyle t $$, the cumulative number of faults detected, $$ \displaystyle N(t) $$, and the cumulative number of lines of code added, $$ \displaystyle C(t) $$, were recorded. Here is a table of selected data (every tenth data point).

As seen from the table, over $$ \displaystyle 1,300 $$ total staff days were spent on the debugging process, $$ \displaystyle 870 $$ faults were detected, and more than $$ \displaystyle 342,000 $$ lines were added. The reason to fit a model to the data is to estimate when to stop testing the code. The selected data and the full set are shown in the following figure.



FIG 3. Selected data and the full data set

At the beginning, the total number of faults found increases very slowly; it then rises quickly, and finally reaches a plateau with time. Since the data represent a recurrent event in time, and given the shape of the curve, one may consider fitting an exponential distribution, or a more general one such as the Weibull distribution, which would also capture the early times.

Curve fitting with actual data
The general form of the Weibull distribution with three parameters is given in Eq. (A2). To fit this curve to the existing data, one may use a nonlinear least-squares analysis. First, an initial guess for the parameters is required; the coefficients are then estimated with a nonlinear fit. If the initial guess is very far from the actual values, the parameters do not converge. For example, an initial guess of


$$  \displaystyle p = [100, 10, 1] $$.     (A4)

does not yield converged estimates of the parameters. The output of Matlab is:

Warning: The Jacobian at the solution is ill-conditioned, and some model parameters may not be estimated well (they are not identifiable). Use caution in making predictions.
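This sensitivity to the initial guess can be reproduced with any nonlinear least-squares routine. The sketch below uses Python's scipy.optimize.curve_fit (an illustrative substitute; the document itself uses Octave's leasqr and Matlab's nlinfit) on synthetic noise-free data generated from the fitted parameters of Eq. (A3), and shows that a guess of the kind suggested in the next sections converges to the true values.

```python
import numpy as np
from scipy.optimize import curve_fit

# Generalized Weibull model of Eq. (A2)
def model(x, p1, p2, p3):
    return p3 * (1.0 - np.exp(-(x / p2) ** p1))

# Synthetic, noise-free data standing in for the N(t) measurements
# (true parameters taken from the fit in Eq. (A3))
true_p = (1.6407, 740.2473, 933.7323)
x = np.linspace(1.0, 1400.0, 150)
y = model(x, *true_p)

# A reasonable starting guess (cf. the parameter-estimation sections):
# p3 ~ maximum of the data, p2 ~ scale of the rise, p1 ~ 2
p_est, _ = curve_fit(model, x, y, p0=[2.0, 600.0, 870.0])
print(p_est)  # recovers the true parameters
```

Starting instead from a guess like Eq. (A4) leaves the solver in a flat region of the residual surface, which is exactly the ill-conditioned-Jacobian situation Matlab warns about.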

Parameter estimation
So, it is important to understand the role of each parameter. Here, we present two graphs where $$ \displaystyle p_3 = 1 $$ for both cases.



FIG 4. Effect of parameters when $$ \displaystyle p_3 = 1 $$, a) Study of $$ \displaystyle p_1 $$, b) Study of $$ \displaystyle p_2 $$.

Figure 4a studies the effect of the first parameter, $$ \displaystyle p_1 $$. Note that the exponential distribution is recovered when $$ \displaystyle p_1 = 1 $$. For $$ \displaystyle p_1>1 $$, the curve increases slowly at the beginning, which is the case in our problem (Figs 1-3); indeed, $$ \displaystyle p_1 $$ turned out to be $$ \displaystyle 1.64 $$. The parameter $$ \displaystyle p_2 $$ controls how fast the curve rises after it passes through this slow initial region. Finally, a good initial guess for the parameter $$ \displaystyle p_3 $$ is the maximum of the data $$ \displaystyle N(t) $$, i.e. $$ \displaystyle 870 $$.
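The observation that $$ \displaystyle p_1 = 1 $$ recovers the exponential distribution is easy to verify numerically. The short Python sketch below (illustrative, not part of the original analysis) compares the two CDFs at a few points.

```python
import math

def weibull_cdf(x, p1, p2):
    # Standard Weibull CDF, Eq. (A1)
    return 1.0 - math.exp(-(x / p2) ** p1)

def exponential_cdf(x, p2):
    # Exponential CDF with scale parameter p2
    return 1.0 - math.exp(-x / p2)

# With p1 = 1 the Weibull CDF coincides with the exponential CDF
for x in [10.0, 100.0, 1000.0]:
    assert abs(weibull_cdf(x, 1.0, 740.0) - exponential_cdf(x, 740.0)) < 1e-12
print("p1 = 1 reproduces the exponential distribution")
```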

An alternative method for parameter estimation
Another way to estimate the parameters $$ \displaystyle p_1 $$, $$ \displaystyle p_2 $$, and $$ \displaystyle p_3 $$ is to take the logarithm of the generalized Weibull distribution twice, which gives


$$  \displaystyle \ln\bigg(-\ln\Big(1-\frac{F_X }{p_3} \Big) \bigg) = p_1 \ln(x) - p_1 \ln(p_2) $$.     (A5)

As explained in the previous section, a good guess for $$ \displaystyle p_3 $$ would be around the maximum value of $$ \displaystyle N(t) $$, which is $$ \displaystyle 870 $$. A plot of the left-hand side versus $$ \displaystyle \ln(x) $$ gives $$ \displaystyle p_1 $$ as the slope and $$ \displaystyle - p_1 \ln(p_2) $$ as the intercept. Using a spreadsheet, if we take $$ \displaystyle p_3=880 $$ so that the natural logarithm of the last few data points exists, the slope becomes $$ \displaystyle 1.7177 $$ and the intercept $$ \displaystyle -11.195 $$. Hence, $$ \displaystyle p_1 $$ is $$ \displaystyle 1.7177 $$ and $$ \displaystyle p_2 $$ is $$ \displaystyle 678 $$. Recall that the result of the nonlinear fit was $$ \displaystyle p = [1.6407, 740.2473, 933.7323] $$. So an initial guess of $$ \displaystyle p = [1.7177, 678, 880] $$ is a good starting point. If there were only two parameters, this analysis would have given the final result without the need for a nonlinear fit.
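The double-logarithm procedure can be sketched in code as well (Python with NumPy is used here for illustration; the document itself does this in a spreadsheet). On synthetic noise-free data generated from the fitted parameters, the slope and intercept of the transformed data recover $$ \displaystyle p_1 $$ and $$ \displaystyle p_2 $$ essentially exactly.

```python
import numpy as np

# True parameters used to generate synthetic N(t)-like data
p1_true, p2_true, p3 = 1.6407, 740.2473, 933.7323
x = np.linspace(50.0, 1400.0, 100)
y = p3 * (1.0 - np.exp(-(x / p2_true) ** p1_true))

# Eq. (A5): ln(-ln(1 - F/p3)) = p1*ln(x) - p1*ln(p2)
lhs = np.log(-np.log(1.0 - y / p3))
slope, intercept = np.polyfit(np.log(x), lhs, 1)

p1_est = slope                        # slope gives p1 directly
p2_est = np.exp(-intercept / slope)   # intercept = -p1*ln(p2)
print(p1_est, p2_est)
```

With real, noisy data and a guessed $$ \displaystyle p_3 $$, the recovered values are only approximate, which is why they serve as an initial guess for the nonlinear fit rather than as the final answer.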

Detailed Octave Procedure
Copy the first two columns of the Exponential Data into soffice (or Excel), then save as a .csv file, e.g., junk.csv. Replace the blank spaces with commas using sed, and save the result into the file exponential.cdf.csv:  cat junk.csv | sed 's/ /,/g' >! exponential.cdf.csv 

Load the 1st column into vector x and the 2nd column into vector y with the Octave commands:  x = csvread ('exponential.cdf.csv','A1:A158')   y = csvread ('exponential.cdf.csv','B1:B158') 

Define the inline function to do the curve fitting: <ul> F = inline ("p(3) * ( 1 - exp ( - ( x/p(2) ).^p(1) ) ) ","x","p") </ul>

Define the initial parameters in the matrix "pin": <ul> pin = [2 ; 600 ; 870] </ul>

Then give the nonlinear least square curve fitting command: <ul> [f,p,kvg,iter,corp,covp,covr,stdresid,Z,r2]=leasqr(x,y,pin,F); </ul>

Now plot the data (x,y) and the fitted function (x,"f"): <ul> plot (x,y, "+;y;", x, f , "-;f;") </ul>

Save the figure in png format for uploading to google doc: <ul> print -dpng weibull.fit.png </ul>

Detailed Matlab Procedure
First, create the model <ul> modelFun = @(p,x)p(3) * ( 1 - exp ( - ( x/p(2) ).^p(1) ) ) </ul>

Then, assign an initial guess for the parameters p(i) <ul> startingVals = [2 600 870]; </ul>

The command nlinfit gives the coefficients based on the model and the starting values <ul> coefEsts = nlinfit(x, y, modelFun, startingVals) </ul>

The parameters are found as <ul> coefEsts = [1.6404 740.2474  933.7324] </ul> which is very close to the result of the Octave leasqr command, <ul> p (result) = [1.6407, 740.2473, 933.7323] </ul>