## Introduction to Probability Models

Lecture 40

Qi Wang, Department of Statistics

Dec 4, 2017

### Calculate Correlation

• Population
• Variance: $\sigma^2 = \frac{\sum_{i=1}^N(x_i -\mu)^2}{N}$
• Covariance: $\sigma_{x, y} = \frac{\sum_{i=1}^N(x_i - \mu_x)(y_i - \mu_y)}{N}$
• Correlation: $\rho_{x, y} = \frac{\sum_{i=1}^N (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^N(x_i - \mu_x)^2}\sqrt{\sum_{i=1}^N(y_i - \mu_y)^2}}$
• Sample
• Variance: $s^2 = \frac{\sum_{i = 1}^n(x_i - \bar{x})^2}{n - 1}$, SD: $s_x = \sqrt{\frac{\sum_{i = 1}^n(x_i - \bar{x})^2}{n - 1}}$
• Covariance: $s_{x, y} = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• Correlation: $r_{x, y} = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^N(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^N(y_i - \bar{y})^2}}$
• Sample correlation is often written as $r_{x, y} = \frac{s_{x, y}}{s_x s_y}$

### Example 1

You want to compare the airspeed velocity of unladen swallows by species. Suppose you collect the following data on pairs of African and European swallows.

African European
18 21
22 22
26 25
30 28
Calculate the means, variances, standard deviations of each. Then calculate the covariance and the correlation between the two species.

### Least-Squares Regression

Can we do better than just a scatter plot and the correlation in describing how x and y are related? What if we want to predict y for other values of x? Least-Squares Regression fits a straight line through the data points that will minimize the sum of the vertical distances of the data points from the line

• Minimizes $\sum_i^n e_i^2$
• Equation of the line is: $\hat{y} = b_0 + b_1 x$
• Slope of the line is:$b_1$, where the slope measures the amount of change caused in the response variable when the explanatory variable is increased by one unit.
• Intercept of the line is:$b_0$, where the intercept is the value of the response variable when the explanatory variable = 0. (i.e. value where line intersects the y-axis)
• Used for Prediction: using the line to find y-values corresponding to x-values that are within the range of your data x-values
• Using values outside range of the collected data can lead to extrapolation
• Coefficient of Determination: Denoted by 𝑟2, it gives the proportion of the variance of the response variable that is predicted by the explanatory variable. So when $r^2$ is high, close to 1 or 100%, you have explained most of the variability
• Residuals: the difference between the observed value of the response variable ($y$) and the predicted value ($\hat{y}$): residuals = observed y - predicted y, $$e = y - \hat{y}$$

### Example 2

We want to examine whether the amount of rainfall per year increases or decreases corn bushel output. A sample of 10 observations was taken, and the amount of rainfall (in inches) was measured, as was the subsequent growth of corn.

Amount of Rain Bushels of Corn
3.03 80
3.47 84
4.21 90
4.44 95
4.95 97
5.11 102
5.63 105
6.34 112
6.56 115
6.82 115

The regression line (also called the prediction line or trend line) is $=\hat{y} = 50.832+9.625x$ On the rain/corn data above, predict the corn yield for

1. 5 inches of rain
2. 6.56 inches of rain
3. 0 inches of rain
4. 100 inches of rain
5. For which amounts of rainfall above do you think the line does a good job of predicting actual corn yield? Why?
6. What percentages of the variation in corn yield is explained by the relationship with amount of rain?
7. Calculate the residual when the amount of rain is 6.56 inches

### Histogram

Histogram is used for large amount of data, displays a count (frequency distribution) or percent or relative frequency distribution (probabilities) QUANTITATIVE DATA ONLY!!!! It is constructed by placing the class intervals on the horizontal axis and the frequencies, relative frequencies, or percent frequencies on the vertical axis. When making a histogram, you need to pick an adequate number of classes (or, equivalently, an appropriate width of the interval for each class).

### When to use mean and when to use median

Suppose there are 7 people eating in a diner. We ask “what is their income/ salary? Then Bill Gates walks into the diner and we add in his salary.

which measure is more representative of the group after Bill Gates arrives?

• When data is sysmetric --want to use mean and standard deviation
• When data is skewed or have outlier(s) --want to use 5 number summary (or median and IQR)

Since the median is relatively unchanged by a few very large or very small measurements in the data set, we say that the median is resistant. The mean is non-resistant.