Lecture 37

Qi Wang, Department of Statistics

Nov 27, 2017

A **two-way table**:

- Describes the relationship between two categorical variables.
- Presents the data as a table of counts (can include percentages).

Examples

- Gender versus major
- Political party versus voting status

Sometimes one or both variables are quantitative, but we classify them into categories for data collection and/or analysis. For example, suppose our variables are years of college education and income. We decide to group years of education into four classes: none, some college, Bachelor’s degree, and post-graduate. We also decide to classify annual income in dollars into four classes: less than 10,000; 10,000 to 30,000; 30,001 to 50,000; and more than 50,000.
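As a rough illustration of turning a quantitative variable into a categorical one, the income classes above could be coded as follows. This is only a sketch: the function name `income_class` and the sample incomes are made up for illustration, and incomes are assumed to be whole dollars.

```python
def income_class(income):
    """Map an annual income (whole dollars, an assumption) to one of four classes."""
    if income < 10_000:
        return "<10,000"
    if income <= 30_000:
        return "10,000-30,000"
    if income <= 50_000:
        return "30,001-50,000"
    return ">50,000"


# Made-up incomes chosen to sit on or near the class boundaries.
sample_incomes = [8_500, 10_000, 30_000, 30_001, 50_000, 72_000]
sample_classes = [income_class(x) for x in sample_incomes]

for income, label in zip(sample_incomes, sample_classes):
    print(f"{income:>7,} -> {label}")
```

Once every person has a class label for each variable, the labels can be counted in a two-way table just like any other pair of categorical variables.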

Example 1: An instructor taught four sections of a large statistics course, one at each of four class times (One through Four), and had the following distribution of grades when the semester was finished.

Grade | One | Two | Three | Four | Total |
---|---|---|---|---|---|
A | 12 | 18 | 10 | 12 | |
B | 26 | 26 | 16 | 16 | |
C | 28 | 20 | 24 | 18 | |
D | 6 | 8 | 20 | 18 | |
F | 4 | 4 | 8 | 12 | |
Total | | | | | |

The **joint distribution** of the two categorical variables gives, for each cell, the proportion of all cases that fall in that cell:
$$\text{Joint probability} = \frac{\text{Total in cell}}{\text{Overall total}}$$
All the joint proportions should add to 1 (or 100%).
For example, 18/306 = 0.0588, or 5.88%, is the joint proportion for students with a grade of A AND Class Time Two. Joint distributions use or imply “AND” (i.e., intersection).

Fill in the table of Joint distributions:

Grade | One | Two | Three | Four | Total |
---|---|---|---|---|---|
A | | 5.88% | | 3.92% | 16.99% |
B | 8.50% | | 5.23% | 5.23% | 27.45% |
C | 9.15% | 6.54% | 7.84% | | 29.41% |
D | | 2.61% | 6.54% | 5.88% | 16.99% |
F | 1.31% | 1.31% | 2.61% | | 9.15% |
Total | 24.84% | 24.84% | 25.49% | 24.84% | 100% |
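The joint proportions can also be checked with a few lines of code. The sketch below assumes the counts are typed in by hand from the grade table in Example 1; the variable names are illustrative only.

```python
# Counts from Example 1: one row per grade, columns are Class Times One..Four.
class_times = ["One", "Two", "Three", "Four"]
counts = {
    "A": [12, 18, 10, 12],
    "B": [26, 26, 16, 16],
    "C": [28, 20, 24, 18],
    "D": [6, 8, 20, 18],
    "F": [4, 4, 8, 12],
}

# Overall total = sum of every cell in the table.
overall_total = sum(sum(row) for row in counts.values())

# Joint proportion for each cell = cell count / overall total.
joint = {
    grade: [cell / overall_total for cell in row]
    for grade, row in counts.items()
}

for grade, row in joint.items():
    print(grade, " ".join(f"{p:7.2%}" for p in row))

# The joint proportions over all cells should add to 1.
print("sum of all cells:", round(sum(sum(row) for row in joint.values()), 10))
```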

The **marginal distribution** allows us to study one variable at a time. The marginal distribution of each categorical variable is obtained from the row totals or the column totals, so we are examining the distribution of a single variable in the two-way table.
Marginal distributions allow us to compare the relative frequencies among the levels of a single categorical variable.

- The marginals for the row variable should add to 1 (or 100%).
- The marginals for the column variable should add to 1 (or 100%).
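A minimal sketch of how the marginal distributions come out of the row and column totals, again assuming the Example 1 counts are entered by hand (all names are illustrative):

```python
class_times = ["One", "Two", "Three", "Four"]
counts = {
    "A": [12, 18, 10, 12],
    "B": [26, 26, 16, 16],
    "C": [28, 20, 24, 18],
    "D": [6, 8, 20, 18],
    "F": [4, 4, 8, 12],
}

# Row totals give the counts for the row variable (letter grade).
row_totals = {grade: sum(row) for grade, row in counts.items()}

# Column totals give the counts for the column variable (class time).
col_totals = [sum(col) for col in zip(*counts.values())]

overall_total = sum(row_totals.values())

# Marginal distribution = each total divided by the overall total.
grade_marginal = {g: t / overall_total for g, t in row_totals.items()}
time_marginal = {t: n / overall_total for t, n in zip(class_times, col_totals)}

print("grade marginal sums to", round(sum(grade_marginal.values()), 10))
print("class time marginal sums to", round(sum(time_marginal.values()), 10))
```

Both printed sums should be 1, which is a quick check that no row or column was missed.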

Find the marginal distribution of Class Time for Example 1.

Class Time | One | Two | Three | Four |
---|---|---|---|---|
Counts | | | | |
Percents | | | | |

Find the marginal distribution of Letter Grade for Example 1.

Letter Grade | A | B | C | D | F |
---|---|---|---|---|---|
Counts | | | | | |
Percents | | | | | |

In **conditional distributions**, we find the distribution of one categorical variable given a common level of another categorical variable. Look for key words that indicate a conditional, such as “given” or “knowing”.
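In terms of arithmetic, conditioning means dividing each cell by the total of the row or column we are told about, rather than by the overall total. The sketch below assumes the Example 1 counts; the helper names `grade_given_time` and `time_given_grade` are hypothetical.

```python
grades = ["A", "B", "C", "D", "F"]
class_times = ["One", "Two", "Three", "Four"]
counts = {
    "A": [12, 18, 10, 12],
    "B": [26, 26, 16, 16],
    "C": [28, 20, 24, 18],
    "D": [6, 8, 20, 18],
    "F": [4, 4, 8, 12],
}


def grade_given_time(time):
    """Distribution of letter grade among students in one class time."""
    j = class_times.index(time)
    col_total = sum(counts[g][j] for g in grades)  # denominator: column total
    return {g: counts[g][j] / col_total for g in grades}


def time_given_grade(grade):
    """Distribution of class time among students who earned one grade."""
    row = counts[grade]
    row_total = sum(row)  # denominator: row total
    return {t: c / row_total for t, c in zip(class_times, row)}


# Each conditional distribution adds to 1 on its own.
print(round(sum(grade_given_time("Two").values()), 10))
print(round(sum(time_given_grade("D").values()), 10))
```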

Find the conditional distribution of Letter Grade for Class Time One.

Letter Grade | A | B | C | D | F |
---|---|---|---|---|---|
Counts | | | | | |
Percents | | | | | |

Find the conditional distribution of Class Time for Letter Grade C.

Class Time | One | Two | Three | Four |
---|---|---|---|---|
Counts | | | | |
Percents | | | | |

- What percent of students in Class Time Four earned a B? Is this joint, conditional, or marginal?
- Of all students earning a B, what proportion were in Class Time Four? Is this joint, conditional, or marginal?
- What percent of students were enrolled in Class Time Three? Is this joint, conditional, or marginal?
- What proportion of students earned a B and were in Class Time Two? Is this joint, conditional, or marginal?
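One way to tell the three types apart is to look at the denominator. The sketch below uses a cell that is not part of the questions above (grade D, Class Time Three) and assumes the Example 1 counts; the variable names are illustrative.

```python
class_times = ["One", "Two", "Three", "Four"]
counts = {
    "A": [12, 18, 10, 12],
    "B": [26, 26, 16, 16],
    "C": [28, 20, 24, 18],
    "D": [6, 8, 20, 18],
    "F": [4, 4, 8, 12],
}

grade, time = "D", "Three"
j = class_times.index(time)

cell = counts[grade][j]                              # students with D AND Three
row_total = sum(counts[grade])                       # all students with a D
col_total = sum(row[j] for row in counts.values())   # all students in Three
overall_total = sum(sum(row) for row in counts.values())

joint = cell / overall_total             # "D and Three": divide by everyone
d_given_three = cell / col_total         # "of those in Three": divide by column
three_given_d = cell / row_total         # "of those with a D": divide by row
marginal_d = row_total / overall_total   # "earned a D": one variable only

print(f"joint={joint:.4f}  D|Three={d_given_three:.4f}  "
      f"Three|D={three_given_d:.4f}  marginal D={marginal_d:.4f}")
```

The same cell count gives three different answers depending on which group the question restricts attention to.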

A **stemplot**, or **stem-and-leaf plot**, is a technique that orders quantitative data points and provides insight into the shape of the distribution.
To make a stem-and-leaf plot, use the last digit of each number as the leaf and the rest of the number as the stem. Leaves are arranged in ascending order on each stem. Additionally, any stem that is not used but is within the range of the data is kept in the plot.

The data set is: 1, 3, 5, 7, 12, 15, 17, 19, 21, 21, 21, 30, 33, 39, and 56. Create a stem-and-leaf plot of the data.
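The construction rule can be sketched in code as a way to check a hand-drawn plot. This sketch assumes non-negative whole numbers, treats single-digit values as having stem 0, and uses a hypothetical function name `stem_and_leaf`.

```python
def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot for non-negative integers."""
    data = sorted(values)
    leaves_by_stem = {}
    for v in data:
        stem, leaf = divmod(v, 10)  # last digit is the leaf, the rest is the stem
        leaves_by_stem.setdefault(stem, []).append(leaf)

    rows = []
    # Walk every stem in the range of the data so unused stems are kept.
    for stem in range(data[0] // 10, data[-1] // 10 + 1):
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


data = [1, 3, 5, 7, 12, 15, 17, 19, 21, 21, 21, 30, 33, 39, 56]
plot = stem_and_leaf(data)
print("\n".join(plot))
```

Because the data are sorted before the leaves are collected, the leaves on each stem come out in ascending order automatically.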