Linear regression with SAS
The example and data

When there is one independent variable and one dependent variable, and the question is the strength of the linear relationship between the two variables, a correlation (the Pearson correlation coefficient) can be calculated. Suppose that in a health screening, seven people are measured on Gender, Height, Weight, and Age. The question is to find the linear association between each pair of variables, for example Height and Weight, Height and Age, and Weight and Age. The data are in "measurement.csv".

Setting up the data

Open the data set from SAS, or import it with the following commands.

  data measurement;
    infile "H:\sas\data\measurement.csv" dlm=',' firstobs=2;
    input gender $ height weight age;
  run;

Analyzing the data: correlation syntax

  proc corr data=measurement;
    title "Example of correlation matrix";
    var height weight age;
  run;

Reading the output of the correlation

  Example of correlation matrix

  The CORR Procedure

  3 Variables: height weight age

  Simple Statistics

  Variable   N   Mean        Std Dev    Sum        Minimum    Maximum
  height     7   66.85714    3.97612    468.00000  61.00000   72.00000
  weight     7   155.57143   45.79613   1089       99.00000   220.00000
  age        7   31.71429    11.42679   222.00000  20.00000   48.00000

  Pearson Correlation Coefficients, N = 7
  Prob > |r| under H0: Rho=0

            height     weight     age
  height    1.00000    0.97165    0.86467
                       0.0003     0.0120
  weight    0.97165    1.00000    0.92621
            0.0003                0.0027
  age       0.86467    0.92621    1.00000
            0.0120     0.0027

Interpreting the result of the correlation

Proc Corr gives some simple descriptive statistics on the variables in the Var list, along with a correlation matrix. In each cell of the matrix, the correlation is the top number and the p-value is the number below it. For example, "height" and "weight" are highly correlated, with a correlation of 0.97165; furthermore, the p-value is 0.0003 (< 0.05), which suggests that the correlation is significant. In other words, it is unlikely that a correlation this large would be obtained by chance if there were no correlation between "height" and "weight".
The matrix is symmetric about the diagonal.
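As a hypothetical cross-check outside SAS (a Python sketch, not part of the original tutorial), the p-value that Proc Corr reports for a correlation r with n observations comes from a t statistic with n - 2 degrees of freedom; plugging in the height-weight numbers from the output reproduces the t value that Proc Reg reports later for the height slope.

```python
import math

# Cross-check of the PROC CORR output (values taken from the listing above).
# The significance test for a Pearson correlation r with n observations uses
#   t = r * sqrt(n - 2) / sqrt(1 - r^2),  with n - 2 degrees of freedom.
r, n = 0.97165, 7  # height-weight correlation and sample size from the output
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(t, 2))  # 9.19 -- the same t value PROC REG reports for height
```

That the two procedures agree is expected: with a single predictor, testing "correlation = 0" and testing "slope = 0" are the same hypothesis.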
Linear regression

Correlation shows the linear association between two variables. If the question is to predict one variable from another, linear regression can be used. For example, to predict weight from height, the following regression model can be run.

  proc reg data=measurement;
    title "Example of linear regression";
    model weight = height;
  run;

Reading the output of the linear regression

  Example of linear regression

  The REG Procedure
  Model: MODEL1
  Dependent Variable: weight

  Number of Observations Read  7
  Number of Observations Used  7

  Analysis of Variance

  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              1   11880            11880         84.45     0.0003
  Error              5   703.38705        140.67741
  Corrected Total    6   12584

  Root MSE         11.86075    R-Square   0.9441
  Dependent Mean   155.57143   Adj R-Sq   0.9329
  Coeff Var        7.62399

  Parameter Estimates

  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1   -592.64458           81.54217         -7.27     0.0008
  height       1   11.19127             1.21780          9.19      0.0003

Interpreting the result of the linear regression

Linear regression assumes that the dependent variable (Y) depends linearly on the independent variable (X), i.e., Y = a + bX, where a is the intercept and b is the slope. The output shows the estimates of a and b, i.e., weight = -592.64458 + 11.19127 * height. Therefore, one can predict weight given height using this linear function. For example, the predicted weight of a 70-inch-tall person is -592.64458 + 11.19127 * 70 ≈ 190.74 lbs. Furthermore, under the heading "Parameter Estimates" are columns labeled "Standard Error", "t Value", and "Pr > |t|". Each t value and its associated p-value test the hypothesis that the corresponding parameter is zero, or in other words, whether the (linear) effect of height on weight is zero. In this case the p-value is 0.0003, so we can conclude that height has a significant linear effect on weight.
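The pieces of the Proc Reg listing are internally consistent, and a short sketch (Python, outside SAS; not part of the original tutorial) can verify them using only the numbers printed above: the prediction for a 70-inch person, R-Square as the squared correlation, and the F value as the squared slope t value.

```python
# Sanity checks on the PROC REG listing, using only numbers printed in it.
a, b = -592.64458, 11.19127  # intercept and slope from Parameter Estimates

# Prediction for a 70-inch-tall person.
pred = a + b * 70
print(round(pred, 2))  # 190.74 lbs

# With one predictor, R-Square equals the squared Pearson correlation...
r = 0.97165  # height-weight correlation from PROC CORR
print(round(r**2, 4))  # 0.9441, matching R-Square

# ...and the overall F statistic equals the squared t value of the slope.
t = 9.19
print(round(t**2, 2))  # 84.46, matching F Value 84.45 up to rounding
```

These identities (R² = r² and F = t² for the slope) hold only in simple regression, i.e., with a single predictor.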
Checking assumptions for the linear regression

Linear regression assumes that the relationship between the two variables is linear and that the residuals (defined as actual Y - predicted Y) are normally distributed. These assumptions can be checked with a scatter plot and a residual plot. You can also request these plots within proc reg. For example:

  proc reg data=measurement;
    title "Regression and residual plots";
    model weight = height;
    plot weight*height residual.*height;
  run;

The two plots are shown here. The plot of residuals suggests that a second-order term (X²) might improve the model, since the points do not look random and form a curve that could be fit by a second-order equation. One quick way to address the problem is to fit a possible quadratic relationship between height and weight. A second-order variable, denoted height2, can be created in the data step and used later in proc reg. The syntax is shown below.

  data measurement2;
    infile "H:\sas\data\measurement.csv" dlm=',' firstobs=2;
    input gender $ height weight age;
    height2 = height**2;
  run;

  proc reg data=measurement2;
    title "Regression and residual plots with quadratic term";
    model weight = height height2;
    plot residual.*height;
  run;

The resulting residual plot is given below. The distribution of the residuals is more random than in the earlier plot.
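To make the residual definition concrete, here is a hypothetical illustration (Python, outside SAS): it applies the fitted line from the earlier Proc Reg output to a made-up person, not to the tutorial's data set.

```python
# Residual = actual Y - predicted Y, using the fitted line from the earlier
# PROC REG output. The 72-inch / 205-lb person below is made up for
# illustration; they are not from the measurement.csv data set.
a, b = -592.64458, 11.19127  # estimates from the Parameter Estimates table

def predicted_weight(height):
    """Predicted weight (lbs) from the fitted simple regression line."""
    return a + b * height

actual = 205                       # hypothetical observed weight
resid = actual - predicted_weight(72)
print(round(resid, 2))             # -8.13: this person is lighter than predicted
```

A residual plot simply graphs such residuals against the predictor; a curved pattern in that plot is what motivates adding the height2 term above.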
© COPYRIGHT 2010 ALL RIGHTS RESERVED tqin@purdue.edu