Linear regression with SAS

  1. The example and data
  2. Setting up the data
  3. Analyzing the data correlation
  4. Analyzing the independent variable effect on the dependent variable with linear regression

The example and data

When there is one independent variable and one dependent variable, and the question is to measure the strength of the linear relationship between the two variables, a correlation (the Pearson correlation coefficient) can be calculated.

Suppose that in a health screening, seven people have their Gender, Height, Weight and Age measured. The question is to find the linear association between each pair of variables, for example, Height and Weight, Height and Age, and Weight and Age.

The data set is "measurement.csv".

Setting up the data

Open the data set in SAS, or import it with the following DATA step.

  data measurement;
     infile "H:\sas\data\measurement.csv" dlm=',' firstobs=2;
     input gender $ height weight age;
  run;
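A quick way to verify that the import worked is to print the data set (this proc print step is an addition to the example, not part of the original program):

  proc print data=measurement;
     title "Check the imported data";
  run;

The output should show seven observations with the four variables gender, height, weight and age.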

Analyzing the data correlation: syntax

  proc corr data=measurement;
     title "Example of correlation matrix";
     var height weight age;
  run;

Reading the output of the correlation

                                 Example of correlation matrix                                 
                                                               

                                       The CORR Procedure

                            3  Variables:    height   weight   age


                                       Simple Statistics

   Variable           N          Mean       Std Dev           Sum       Minimum       Maximum

   height             7      66.85714       3.97612     468.00000      61.00000      72.00000
   weight             7     155.57143      45.79613          1089      99.00000     220.00000
   age                7      31.71429      11.42679     222.00000      20.00000      48.00000


                            Pearson Correlation Coefficients, N = 7
                                   Prob > |r| under H0: Rho=0

                                      height        weight           age

                        height       1.00000       0.97165       0.86467
                                                    0.0003        0.0120

                        weight       0.97165       1.00000       0.92621
                                      0.0003                      0.0027

                        age          0.86467       0.92621       1.00000
                                      0.0120        0.0027




Interpreting the result of the correlation

Proc Corr gives us some simple descriptive statistics on the variables in the VAR list along with a correlation matrix. In each cell of the matrix, the correlation is the top number and the p-value is the second number. For example, "height" and "weight" are highly correlated, with a correlation of 0.97165; furthermore, the p-value is 0.0003 (<0.05), which suggests that the correlation is significant. In other words, it is unlikely to obtain a correlation this large by chance if there is no correlation between "height" and "weight". The matrix is symmetric about the diagonal.
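If confidence limits for the correlations are also wanted, proc corr can report them through the FISHER option, which applies Fisher's z transformation. The step below is a sketch added to the example; it was not part of the original program:

  proc corr data=measurement fisher;
     var height weight age;
  run;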

Linear regression

Correlation shows the linear association between two variables. If the question is to predict one variable from another, linear regression can be used. For example, if one wants to predict weight from height, the following regression model can be run.

 
  proc reg data=measurement;
     title "Example of linear regression";
     model weight = height;
  run;

Reading the output of the linear regression

                                   Example of linear regression

                                        The REG Procedure
                                          Model: MODEL1
                                   Dependent Variable: weight

                             Number of Observations Read           7
                             Number of Observations Used           7


                                      Analysis of Variance

                                             Sum of           Mean
         Source                   DF        Squares         Square    F Value    Pr > F

         Model                     1          11880          11880      84.45    0.0003
         Error                     5      703.38705      140.67741
         Corrected Total           6          12584


                      Root MSE             11.86075    R-Square     0.9441
                      Dependent Mean      155.57143    Adj R-Sq     0.9329
                      Coeff Var             7.62399


                                      Parameter Estimates

                                   Parameter       Standard
              Variable     DF       Estimate          Error    t Value    Pr > |t|

              Intercept     1     -592.64458       81.54217      -7.27      0.0008
              height        1       11.19127        1.21780       9.19      0.0003



Interpreting the result of the linear regression

Linear regression assumes that the dependent variable (Y) depends linearly on the independent variable (X), i.e., Y = a + bX, where a is the intercept and b is the slope. The output shows the estimates of a and b respectively, i.e., weight = -592.64458 + 11.19127*height. Therefore, one can predict weight given height using this linear function between the two variables. For example, the predicted weight of a 70-inch-tall person is -592.64458 + 11.19127 * 70 = 190.74 lbs.
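Rather than computing predictions by hand, proc reg can save the predicted values and residuals with an OUTPUT statement. The sketch below is an addition to the example; the data set name "predicted" and the variable names "predweight" and "resid" are illustrative choices:

  proc reg data=measurement;
     model weight = height;
     output out=predicted p=predweight r=resid;  * p= saves predictions, r= saves residuals;
  run;

  proc print data=predicted;
     var height weight predweight resid;
  run;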

Furthermore, under the heading "Parameter Estimates" are columns labeled "Standard Error", "t Value", and "Pr > |t|". The t value and the associated p-value test the hypothesis that the parameter is zero, or in other words, whether the (linear) effect of height on weight is zero. In this case the p-value is 0.0003; therefore, we can conclude that height has a significant linear effect on weight.
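Confidence limits for the parameter estimates can be requested with the CLB option on the MODEL statement (an addition to the example; the limits do not appear in the output shown above):

  proc reg data=measurement;
     model weight = height / clb;  * clb adds 95% confidence limits for the estimates;
  run;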

Checking assumptions for the linear regression

Linear regression assumes that the relationship between the two variables is linear and that the residuals (defined as actual Y minus predicted Y) are normally distributed. These assumptions can be checked with a scatter plot and a residual plot.

You can also request these plots from "proc reg". For example:

  proc reg data=measurement;
     title "Regression and residual plots";
     model weight = height;
     plot weight*height;
     plot residual.*height;
  run;

The two plots are shown here:

The plot of residuals suggests that a second-order term (X**2) might improve the model, since the points do not look random and form a curve that could be fit by a second-order equation. One quick way to address the problem is to fit a quadratic relationship between height and weight. The second-order variable, the square of height (denoted height2), can be created in the DATA step and used later in proc reg. The syntax is shown below.

  data measurement2;
     infile "H:\sas\data\measurement.csv" dlm=',' firstobs=2;
     input gender $ height weight age;
     height2 = height**2;
  run;

  proc reg data=measurement2;
     title "Regression and residual plots with quadratic term";
     model weight = height height2;
     plot residual.*height;
  run;

The resulting residual plot is given below.

The distribution of the residuals is more random than in the earlier plot.
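The normality of the residuals can also be checked more formally. One sketch, added to the example, saves the residuals from the quadratic model and runs proc univariate with the NORMAL option (the names "resids2" and "resid2" are illustrative):

  proc reg data=measurement2;
     model weight = height height2;
     output out=resids2 r=resid2;  * save residuals to a data set;
  run;

  proc univariate data=resids2 normal;
     var resid2;       * NORMAL option adds tests such as Shapiro-Wilk;
     histogram resid2;
  run;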