Review of Probability

Random Variables

A random variable such as your weight takes its values by chance. It is described by a probability distribution.

A continuous random variable is one whose values may fill an interval. Its probability distribution is described by its probability density function (pdf). The probability that the random variable falls into an interval is the area under the pdf over that interval.

Random variables are usually denoted by capital letters \(X, Y, Z\).

Independence

Two random variables \(X\) and \(Y\) are independent if \[ P(Y\le b|X\le a)=P(Y\le b) \] for any values \(a\) and \(b\).

Mean and Variance

The mean of \(Y\), denoted by \(E(Y)\) or \(\mu\), is the center of the probability distribution. The variance of \(Y\) is defined by \[ Var(Y)=E[(Y-\mu)^2]. \] Expanding the square and using the linearity of expectation gives \[ Var(Y)=E(Y^2)-\mu^2. \]

Properties

  1. \(E\left(a_{1} Y_{1}+a_{2} Y_{2}\right)=a_{1} E\left(Y_{1}\right)+a_{2} E\left(Y_{2}\right)\) for any constants \(a_1\) and \(a_2\).
  2. \(\operatorname{Var}\left(a_{1} Y_{1}\right)=a_{1}^{2} \operatorname{Var}\left(Y_{1}\right)\).
  3. \(\operatorname{Var}\left(a_{1} Y_{1}+a_{2} Y_{2}\right) = a_{1}^{2} \operatorname{Var}\left(Y_{1}\right)+a_{2}^{2} \operatorname{Var}\left(Y_{2}\right)+2a_1a_2\mbox{Cov}(Y_1, Y_2).\)
  4. If \(Y_{1}\) and \(Y_{2}\) are independent or uncorrelated, then \(\operatorname{Var}\left(a_{1} Y_{1}+a_{2} Y_{2}\right)=\) \(a_{1}^{2} \operatorname{Var}\left(Y_{1}\right)+a_{2}^{2} \operatorname{Var}\left(Y_{2}\right).\)

The above properties generalize to the sum of \(n\) variables. For example, \[ E\left(\sum_{i=1}^n a_iY_i\right)=\sum_{i=1}^n a_i E(Y_i). \]
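
As a quick numerical check, here is a minimal simulation sketch (the distributions, coefficients, and sample size are arbitrary choices, not part of the notes) verifying Property 4 for independent variables:

```python
import numpy as np

rng = np.random.default_rng(0)
a1, a2 = 2.0, -3.0                          # arbitrary constants
y1 = rng.normal(1.0, 2.0, size=1_000_000)   # Var(Y1) = 4
y2 = rng.normal(0.0, 1.0, size=1_000_000)   # Var(Y2) = 1, independent of Y1

# Var(a1*Y1 + a2*Y2) should be close to a1^2 Var(Y1) + a2^2 Var(Y2) = 16 + 9 = 25
print(np.var(a1 * y1 + a2 * y2))
print(a1**2 * np.var(y1) + a2**2 * np.var(y2))
```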

Special distributions

Normal distributions

The normal distribution \(N(\mu, \sigma^2)\) has pdf \[ f(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \] where \(\mu\) is the mean and \(\sigma^2\) is the variance. \(N(0, 1)\) is called the standard normal distribution.

  • If \(Y\) is \(N(\mu, \sigma^2)\), \(a+bY\) is \(N(a+b\mu, b^2\sigma^2).\)

  • If \(Y\) is \(N(\mu, \sigma^2)\), then \((Y-\mu)/\sigma\sim N(0, 1)\). This transformation is called standardization.

  • If \(Y_1\) is \(N(\mu_1, \sigma_1^2)\) and \(Y_2\) is \(N(\mu_2, \sigma_2^2)\) and the two variables are independent, then \(b_1Y_1+b_2Y_2\) is \(N(b_1\mu_1+b_2\mu_2, b_1^2\sigma_1^2+b_2^2\sigma_2^2).\)

The sum of independent normal random variables is a normal random variable.
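
These facts can be checked numerically; the following sketch (using NumPy and SciPy, with arbitrary parameter values) illustrates standardization and the distribution of a sum of independent normals:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 5.0, 2.0
y = 7.3                                     # an arbitrary point

# Standardization: P(Y <= y) equals P(Z <= (y - mu)/sigma) for Z ~ N(0, 1)
print(norm.cdf(y, loc=mu, scale=sigma), norm.cdf((y - mu) / sigma))

# Sum of independent normals: Y1 ~ N(1, 3^2) plus Y2 ~ N(2, 4^2) is N(3, 5^2)
rng = np.random.default_rng(1)
s = rng.normal(1, 3, 500_000) + rng.normal(2, 4, 500_000)
print(s.mean(), s.std())                    # approximately 3 and 5
```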

\(\chi^2\) distributions

For practical purposes, we take the following fact as both the definition and a key property of the \(\chi^2\) distribution:

If \(Z_1, \ldots, Z_n\) are independent standard normal random variables, then \(\sum_{i=1}^n Z_i^2\) has the \(\chi^2\) distribution with \(n\) degrees of freedom.

If \(Y_1\) has \(\chi_m^2\) distribution and \(Y_2\) has the \(\chi_n^2\) distribution and the two variables are independent, then \(Y_1+Y_2\) has the \(\chi_{m+n}^2\) distribution.
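
A small simulation sketch (the degrees of freedom and sample size are arbitrary) illustrates the defining fact:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 5                                       # degrees of freedom
z = rng.standard_normal((200_000, n))
q = (z**2).sum(axis=1)                      # sum of n squared standard normals

# The empirical 95th percentile should match the chi-square quantile with n df
print(np.quantile(q, 0.95), chi2.ppf(0.95, df=n))
```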

Cochran’s theorem

If \(Z_{1}, \ldots, Z_{k}\) are independent and identically distributed (i.i.d.) standard normal random variables, then \(Q=\sum_{i=1}^{k}\left(Z_{i}-\bar{Z}\right)^{2} \sim \chi_{k-1}^{2}\), where \[ \bar{Z}=\frac{1}{k} \sum_{i=1}^{k} Z_{i}. \] In addition, \(Q\) and \(\bar Z\) are independent.

If \(X_{1}, \ldots, X_{n}\) are i.i.d. \(N\left(\mu, \sigma^{2}\right)\) random variables, then

\[V=\frac{1}{\sigma^2} \sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2} \sim \chi_{n-1}^{2} \]
where \(\bar{X}=\frac{1}{n} \sum_{i=1}^{n} X_{i}\), and \(V\) and \(\bar X\) are independent.
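
The second statement can be illustrated by simulation; in the sketch below (with arbitrary \(\mu\), \(\sigma\), and \(n\)) the quantiles of \(V\) match those of \(\chi_{n-1}^2\), and \(V\) is essentially uncorrelated with \(\bar X\):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
mu, sigma, n = 10.0, 3.0, 8
x = rng.normal(mu, sigma, size=(100_000, n))

xbar = x.mean(axis=1)
v = ((x - xbar[:, None])**2).sum(axis=1) / sigma**2

print(np.quantile(v, 0.95), chi2.ppf(0.95, df=n - 1))  # should agree
print(np.corrcoef(v, xbar)[0, 1])                       # near 0, consistent with independence
```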

t-distribution

The \(t\)-distribution with \(\nu\) degrees of freedom can be defined as the distribution of the random variable \(T\) with \[ T=\frac{Z}{\sqrt{V / \nu}} \] where

  • \(Z\) is a standard normal random variable;

  • \(V\) has a chi-squared distribution (\(\chi^{2}\)-distribution) with \(\nu\) degrees of freedom;

  • \(Z\) and \(V\) are independent.

The \(t\)-distribution is essential for inference about the mean of a normal distribution.
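
The construction can be mimicked directly in code; this sketch (with an arbitrary \(\nu\)) builds \(T\) from its ingredients and compares it with SciPy's \(t\)-distribution:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
nu = 6                                      # degrees of freedom
size = 200_000

z = rng.standard_normal(size)               # Z ~ N(0, 1)
v = rng.chisquare(nu, size)                 # V ~ chi-square with nu df, independent of Z
T = z / np.sqrt(v / nu)

print(np.quantile(T, 0.975), t.ppf(0.975, df=nu))  # empirical vs exact quantile
```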

F-distribution

The F-distribution with \(d_{1}\) and \(d_{2}\) degrees of freedom is the distribution of \[ X=\frac{S_{1} / d_{1}}{S_{2} / d_{2}} \] where \(S_{1}\) and \(S_{2}\) are independent random variables with chi-square distributions with respective degrees of freedom \(d_{1}\) and \(d_{2}\).

The \(F\)-distribution is essential for comparing the means of multiple normal distributions.
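
Analogously, a short sketch (with arbitrary \(d_1\) and \(d_2\)) constructs the ratio and compares it with SciPy's \(F\)-distribution:

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(5)
d1, d2, size = 3, 12, 200_000

s1 = rng.chisquare(d1, size)                # S1 ~ chi-square with d1 df
s2 = rng.chisquare(d2, size)                # S2 ~ chi-square with d2 df, independent of S1
x = (s1 / d1) / (s2 / d2)

print(np.quantile(x, 0.95), f.ppf(0.95, dfn=d1, dfd=d2))  # empirical vs exact quantile
```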

Review of Statistics

Sampling distributions

Let \(Y_1, \ldots, Y_n\) be a random sample from \(N(\mu, \sigma^2)\), and let \[ \bar Y=\frac 1 n \sum_{i=1}^n Y_i \] and \[ S^2=\frac 1 {n-1} \sum_{i=1}^n (Y_i-\bar Y)^2 \] be the sample mean and sample variance.

The distributions of sample statistics such as the sample mean and sample variance are called sampling distributions.

  • \(\bar Y\) and \((Y_1-\bar Y, \ldots, Y_n-\bar Y)\) are independent.

  • \(\bar Y\) and \(S^2\) are independent.

  • \(\bar Y\sim N(\mu, \sigma^2/n)\).

  • \((n-1)S^2/\sigma^2 \sim \chi_{n-1}^2.\)

  • \(\frac{\bar Y-\mu}{S/\sqrt n }\sim t_{n-1}.\)

These properties are extremely important for statistical inference.
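
For instance, the last fact can be checked by simulation; the sketch below (with arbitrary \(\mu\), \(\sigma\), and \(n\)) compares the empirical quantiles of \((\bar Y-\mu)/(S/\sqrt n)\) with those of \(t_{n-1}\):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(6)
mu, sigma, n = 2.0, 1.5, 10
y = rng.normal(mu, sigma, size=(100_000, n))

ybar = y.mean(axis=1)
s = y.std(axis=1, ddof=1)                   # sample standard deviation
stat = (ybar - mu) / (s / np.sqrt(n))

print(np.quantile(stat, [0.05, 0.95]))      # empirical quantiles
print(t.ppf([0.05, 0.95], df=n - 1))        # t_{n-1} quantiles
```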

Hypothesis Testing

For one sample mean

\[ H_0: \mu=\mu_0, ~ H_1: \mu\ne \mu_0. \]

Data: \(y_1, \ldots, y_n\).

Test statistic:

\[ t=\frac{\bar Y-\mu_0}{S/\sqrt n} \]

where \(\bar Y\) and \(S\) are the sample mean and sample standard deviation, respectively.

Intuition: Reject \(H_0\) if \(t\) is too large or too small.

P-value:

\[ p=2P(T>|t|) \]

where \(T\) has a t-distribution with \(n-1\) degrees of freedom, and \(t\) is the value of the test statistic.
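
The test is straightforward to carry out; the sketch below uses a small hypothetical data set (the numbers are made up for illustration) and checks the hand computation against SciPy's ttest_1samp:

```python
import numpy as np
from scipy.stats import t, ttest_1samp

y = np.array([5.1, 4.8, 5.6, 5.3, 4.9, 5.4, 5.0, 5.2])  # hypothetical data
mu0 = 5.0

n = len(y)
tstat = (y.mean() - mu0) / (y.std(ddof=1) / np.sqrt(n))
pval = 2 * (1 - t.cdf(abs(tstat), df=n - 1))             # p = 2 P(T > |t|)
print(tstat, pval)

print(ttest_1samp(y, popmean=mu0))                        # same statistic and p-value
```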

Two-sample means

We observe two samples from two normal distributions and test whether the means are equal.

  • Sample one from \(N(\mu_1, \sigma^2)\):

    \[ Y_{11}, Y_{12}, \ldots, Y_{1, n_1}. \]

    • Sample statistics: \(\bar Y_1\) and \(S_1^2\).
  • Sample two from \(N(\mu_2, \sigma^2)\):

    \[ Y_{21}, Y_{22}, \ldots, Y_{2, n_2}. \]

    • Sample statistics: \(\bar Y_2\) and \(S_2^2\).
  • Note the variances are assumed to be equal but unknown.

  • Hypothesis:

    \[ H_0: \mu_1=\mu_2, ~ H_1: \mu_1\ne \mu_2. \]

  • Test statistic

    \[ t=\frac{\bar{Y}_{1}-\bar{Y_2}}{S_{p} \sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}} \]

    where

    \[ S_{p}=\sqrt{\frac{\left(n_{1}-1\right) S_{1}^{2}+\left(n_{2}-1\right) S_{2}^{2}}{n_{1}+n_{2}-2}} \]

    Under \(H_0\) it has a t-distribution with \(n_1+n_2-2\) degrees of freedom. To see this, observe

    • Distribution of the difference:

    \[ \bar{Y_{1}}-\bar{Y_2} \sim N\left(\mu_1-\mu_2, \sigma^2(1/n_1+1/n_2)\right). \]

    • Distribution of the sample variances:
    \[\begin{gathered} (n_1-1)S_1^2/\sigma^2 \sim \chi_{n_1-1}^2 \\ (n_2-1)S_2^2/\sigma^2 \sim \chi_{n_2-1}^2 \\ \frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{\sigma^2} \sim \chi_{n_1+n_2-2}^2 \end{gathered}\]

    Since \(\bar Y_1-\bar Y_2\) is independent of \(S_1^2\) and \(S_2^2\), dividing the standardized difference \((\bar Y_1-\bar Y_2)/\sqrt{\sigma^2(1/n_1+1/n_2)}\) by \(\sqrt{\frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{\sigma^2(n_1+n_2-2)}}\) cancels \(\sigma\) and yields the test statistic, which therefore has a \(t_{n_1+n_2-2}\) distribution under \(H_0\).
  • P-value

\[ p=2P(T>|t|) \] where \(T\sim t_{n_1+n_2-2}\).
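
As with the one-sample test, the pooled two-sample test can be computed by hand and checked against SciPy; the data below are hypothetical:

```python
import numpy as np
from scipy.stats import t, ttest_ind

y1 = np.array([6.2, 5.9, 6.5, 6.1, 6.3, 5.8])              # hypothetical sample one
y2 = np.array([5.6, 5.4, 5.9, 5.7, 5.5, 5.8, 5.3])          # hypothetical sample two
n1, n2 = len(y1), len(y2)

sp = np.sqrt(((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n1 + n2 - 2))
tstat = (y1.mean() - y2.mean()) / (sp * np.sqrt(1 / n1 + 1 / n2))
pval = 2 * (1 - t.cdf(abs(tstat), df=n1 + n2 - 2))
print(tstat, pval)

print(ttest_ind(y1, y2, equal_var=True))                    # pooled test gives the same result
```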

Revisit of the two-sample test

Given two samples from two normal populations, consider testing \[ H_0: \mu_1=\mu_2, ~ H_1: \mu_1\ne \mu_2. \]

We can rewrite the t-test in a form that can be generalized to testing for more than two normal means.

Write \[\begin{gathered} \bar{y}_{..}=\frac{1}{n_{1}+n_{2}}\left(\sum_{j=1}^{n_{1}} y_{1 j}+\sum_{j=1}^{n_{2}} y_{2 j}\right)=\frac{1}{n_{1}+n_{2}}\left(n_1\bar y_{1}+n_2\bar y_2\right) \\ \bar{y}_{1}-\bar{y}_{..}=\frac{n_{2}}{n_{1}+n_{2}}\left(\bar{y}_{1}-\bar{y}_{2}\right) \\ \bar{y}_{2}-\bar{y}_{..}=\frac{n_{1}}{n_{1}+n_{2}}\left(\bar{y}_{2}-\bar{y}_{1}\right) \\ n_{1}\left(\bar{y}_{1}-\bar{y}_{..}\right)^{2}+n_{2}\left(\bar{y}_{2}-\bar{y}_{..}\right)^{2}=\frac{n_{1} n_{2}}{n_{1}+n_{2}}\left(\bar{y}_{1}-\bar{y}_{2}\right)^{2} \end{gathered}\]

Then for the two-sample \(t\)-test, we have

\[ F=t^2=\frac{n_{1}\left(\bar{y}_{1}-\bar{y}_{..}\right)^{2}+n_{2}\left(\bar{y}_{2}-\bar{y}_{..}\right)^2}{\frac{\sum_{j=1}^{n_1}(y_{1j}-\bar y_1)^2+\sum_{j=1}^{n_2}(y_{2j}-\bar y_2)^2}{n_1+n_2-2} } \]

Under \(H_0\), this test statistic has an \(F\)-distribution with \(1\) and \(n_1+n_2-2\) degrees of freedom.
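
The identity \(F=t^2\) can be verified numerically; the sketch below reuses the hypothetical samples from the previous example and computes the between-group and within-group terms directly:

```python
import numpy as np
from scipy.stats import f

y1 = np.array([6.2, 5.9, 6.5, 6.1, 6.3, 5.8])              # hypothetical sample one
y2 = np.array([5.6, 5.4, 5.9, 5.7, 5.5, 5.8, 5.3])          # hypothetical sample two
n1, n2 = len(y1), len(y2)

ybar_all = (n1 * y1.mean() + n2 * y2.mean()) / (n1 + n2)    # grand mean
between = n1 * (y1.mean() - ybar_all)**2 + n2 * (y2.mean() - ybar_all)**2
within = (((y1 - y1.mean())**2).sum() + ((y2 - y2.mean())**2).sum()) / (n1 + n2 - 2)
F = between / within

sp = np.sqrt(((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n1 + n2 - 2))
tstat = (y1.mean() - y2.mean()) / (sp * np.sqrt(1 / n1 + 1 / n2))

print(F, tstat**2)                                           # the two agree
print(1 - f.cdf(F, dfn=1, dfd=n1 + n2 - 2))                  # same p-value as the t-test
```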

Note

  • The numerator represents between-group variation.

  • The denominator represents within-group variation.

  • The numerator and denominator are independent (why?)

  • The following graphs further illustrate these terms.