Contingency procedure demonstrated with an example

  1. The contingency test
  2. Analyzing simple data with the contingency test when the counts are known
  3. Analyzing comprehensive questionnaire data with the contingency test when the counts are not known
  4. Output, interpretation and assumption checking

The contingency test

The contingency table test (chi-square test of independence) is used when both the dependent and the independent variable are categorical. It is usually used to check whether two variables are related. The test assumes that:

  1. No more than 20% of the cells have an expected count less than 5; otherwise Fisher's exact test (discussed at the end) should be used.
  2. The samples must be independent. For example, when checking the effect of gender (female/male) on some opinion (yes/no), the females and males must be selected independently. If the females and males are dependent (as with husbands and wives), McNemar's test should be used instead.
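
For the paired case, a minimal sketch of McNemar's test is given below. The data set name couples, the variables husband and wife, and the counts are hypothetical and serve only as an illustration. The "agree" option of the tables statement requests McNemar's test for a 2 by 2 table.

	data couples;
	input husband $ wife $ count;
	datalines;
	yes yes 40
	yes no 15
	no yes 10
	no no 35
	;

	proc freq data=couples;
	tables husband*wife / agree;
	weight count;
	run;

Each row of this data describes couples rather than individuals, which is what makes the two answers dependent.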

Analyzing simple data with counts

The easiest way to carry out a contingency test is when the counts are available for each combination of categories, for example, when the frequencies are given as below.

Frequency table:

                      gender
Counts            female   male   total
opinion   yes         55     50     105
          no          65     80     145
total                120    130     250

Then one can use the counts, rather than the original data file, to run the contingency test. The counts can be entered and analyzed as below.

	data simple;
	input opinion $ gender $ count;
	datalines;
	yes female 55
	yes male 50
	no female 65
	no male 80
	;
	
	proc freq data=simple;
	tables opinion*gender / chisq nocol norow nopercent expected;
	weight count;
	run;

The "chisq" option requests a chi-square test; "nocol", "norow" and "nopercent" simplify the output; and "expected" requests the expected values.

The "weight" statement tells the procedure how many subjects there are for each combination of gender and opinion.
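
To see what the "weight" statement does, the sketch below expands the counts into one row per subject and runs the same test without "weight". The data set name expanded and the loop variable i are introduced here for illustration only. Under these assumptions the resulting table and statistics should match those of the weighted analysis.

	data expanded;
	set simple;
	/* write one observation per counted subject */
	do i = 1 to count;
		output;
	end;
	drop i count;
	run;

	proc freq data=expanded;
	tables opinion*gender / chisq nocol norow nopercent expected;
	run;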

                                    The FREQ Procedure

                                   Table of opinion by gender

                               opinion     gender

                               Frequency
                               Expected    female     male    Total
                               ------------------------------------
                               no              65       80      145
                                             69.6     75.4
                               ------------------------------------
                               yes             55       50      105
                                             50.4     54.6
                               ------------------------------------
                               Total          120      130      250


                            Statistics for Table of opinion by gender

                     Statistic                     DF       Value      Prob
                     ------------------------------------------------------
                     Chi-Square                     1      1.3920    0.2381
                     Likelihood Ratio Chi-Square    1      1.3926    0.2380
                     Continuity Adj. Chi-Square     1      1.1059    0.2930
                     Mantel-Haenszel Chi-Square     1      1.3865    0.2390
                     Phi Coefficient                      -0.0746
                     Contingency Coefficient               0.0744
                     Cramer's V                           -0.0746


                                      Fisher's Exact Test
                               ----------------------------------
                               Cell (1,1) Frequency (F)        65
                               Left-sided Pr <= F          0.1465
                               Right-sided Pr >= F         0.9046

                               Table Probability (P)       0.0511
                               Two-sided Pr <= P           0.2508

                                        Sample Size = 250


The results include a contingency table with observed and expected values, followed by a list of tests, the first of which is the classic Chi-square test. With a large p-value of 0.2381, we do not reject the null hypothesis of independence, so there is no evidence that gender and opinion are associated.

Note that the expected values for all four cells are large (69.6, 75.4, 50.4, 54.6), so the assumption is met and the conclusion is sound. Otherwise, we should use Fisher's exact test at the end of the output, where the two-sided p-value is 0.2508.
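
The expected values and the Chi-square statistic in the output can be reproduced by hand. The symbols below are notation introduced here, not part of the SAS output: O_{ij} is the observed count, R_i the row total, C_j the column total and n the sample size.

	E_{ij} = \frac{R_i \, C_j}{n}, \qquad
	E_{\text{no},\,\text{female}} = \frac{145 \times 120}{250} = 69.6

	\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
	       = \frac{(65 - 69.6)^2}{69.6} + \frac{(80 - 75.4)^2}{75.4}
	       + \frac{(55 - 50.4)^2}{50.4} + \frac{(50 - 54.6)^2}{54.6}
	       \approx 1.392

The statistic has (2 - 1)(2 - 1) = 1 degree of freedom, which agrees with the DF column of the output.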

Analyzing comprehensive questionnaire data with the contingency test, without knowing the counts

An important application of the contingency test is the analysis of questionnaire and survey data, where the measurements (variables) are often categorical.

Suppose there is a survey with 8 questions about profile and political opinions. The survey form is shown here, and the responses are recorded in "questionnaire.csv".

Suppose the question is to test whether race and opinions for president are related.

In the following sections, we first organize the data then run the contingency test.

Open the data set in SAS, or import it with the following data step.

 
	data questionnaire;
	infile "H:\sas\data\questionnaire.csv" dlm='|' firstobs=2;
	input age gender race marital education president arm city;
	run;

Next, define formats that attach readable labels to the coded values.

	proc format;
	/* character formats for variables read with $ */
	value $gender   '1'='Male'
	                '2'='Female'
	                other='Miscoded';
	value $race     '1'='White'
	                '2'='African Am.'
	                '3'='Hispanic'
	                '4'='Other';
	value $marital  '1'='Single'
	                '2'='Married'
	                '3'='Widowed'
	                '4'='Divorced';
	value $educ     '1'='High School or Less'
	                '2'='Two Yr. College'
	                '3'='Four Yr. College'
	                '4'='Graduate Degree';
	/* numeric formats for the opinion questions and the age group */
	value opinion   1='Str Disagree'
	                2='Disagree'
	                3='No opinion'
	                4='Agree'
	                5='Str Agree';
	value agegroup  1='0-20'
	                2='21-40'
	                3='41-60'
	                4='Greater than 60';
	run;

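	data questionnaire;
	infile "H:\sas\data\questionnaire.csv" dlm='|' firstobs=2;
	/* gender, race, marital and education are read as character ($)
	   so that the character formats defined above apply to them */
	input age gender $ race $ marital $ education $ president arm city;

	/* derive a categorical age group; a missing age leaves agegroup missing */
	IF age GE 0 AND age LE 20 THEN agegroup=1;
	ELSE IF age GT 20 AND age LE 40 THEN agegroup=2;
	ELSE IF age GT 40 AND age LE 60 THEN agegroup=3;
	ELSE IF age GT 60 THEN agegroup=4;

	/* attach the formats so that labels are printed instead of codes */
	format agegroup  agegroup.
	       gender    $gender.
	       race      $race.
	       marital   $marital.
	       education $educ.
	       president arm city opinion.;
	run;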

Note that readability is improved by attaching formats (value labels) to the variables. Furthermore, a new variable agegroup has been derived from age and formatted. The variable agegroup is categorical.

The IF-THEN and ELSE IF statements have the following general form and can be used to define new variables:

	IF condition THEN statement;
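
As a further illustration of the same form, the sketch below derives a two-level race variable from the coded values of race. The data set name questionnaire2 and the variable racegroup are hypothetical names chosen for this example.

	data questionnaire2;
	set questionnaire;
	/* declare the length first so that 'Non-white' is not truncated */
	length racegroup $ 9;
	IF race = '1' THEN racegroup = 'White';
	ELSE IF race IN ('2', '3', '4') THEN racegroup = 'Non-white';
	run;

The comparisons use the stored codes ('1' to '4'), not the formatted labels, and any other code leaves racegroup blank.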

Summary tables can be produced with the following statements.

	proc freq data=questionnaire;
		title "Frequency Counts for Categorical Variables";
		tables gender race marital education president arm city;
		run;

Now request a contingency test with the SAS proc freq.

	proc freq data=questionnaire;
		title "Contingency test for race and president";
		tables race*president /CHISQ expected norow nocol nopercent;
		run;

The "CHISQ" option requests a contingency test; "expected" requests the expected values for checking the assumption; and "norow", "nocol" and "nopercent" hide the minor results and make the output more readable.
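
The expected values can also be checked programmatically rather than by eye. The following is only a sketch: it assumes that the ODS table CrossTabFreqs produced by proc freq contains the column Expected and marks the interior cells with _TYPE_ = '11'. The names cells, n_cells, n_small and pct_small are introduced here for illustration.

	ods output CrossTabFreqs=cells;
	proc freq data=questionnaire;
		tables race*president / chisq expected norow nocol nopercent;
		run;

	proc sql;
		/* share of interior cells whose expected count is below 5 */
		select count(*) as n_cells,
		       sum(Expected < 5) as n_small,
		       100 * calculated n_small / calculated n_cells as pct_small
		from cells
		where _TYPE_ = '11';
	quit;

A value of pct_small above 20 would indicate that the first assumption of the contingency test is not met.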

Checking assumptions

The contingency test assumes that

  1. No more than 20% of the cells have an expected count less than 5;
  2. The samples are independent.

Reading the output

            The FREQ Procedure

                                   Table of race by president

                    race         president

                    Frequency   |
                    Expected    |Str Disa|Disagree|No opini|Agree   | Total
                                |gree    |        |on      |        |
                    ________________________________________________________
                    White       |      1 |      1 |      1 |      0 |      3
                                |    0.5 |    0.5 |    0.5 |    1.5 |
                    ________________________________________________________
                    African Am. |      0 |      0 |      0 |      2 |      2
                                |   0.33 |   0.33 |   0.33 |      1 |
                    ________________________________________________________
                    Hispanic    |      0 |      0 |      0 |      1 |      1
                                |   0.17 |   0.17 |   0.17 |    0.5 |
                    ________________________________________________________
                    Total              1        1        1        3        6
                                   
                            Statistics for Table of race by president

                     Statistic                     DF       Value      Prob
                     _________________________________________________________
                     Chi-Square                     6      6.0000    0.4232
                     Likelihood Ratio Chi-Square    6      8.3178    0.2157
                     Mantel-Haenszel Chi-Square     1      3.0000    0.0833
                     Phi Coefficient                       1.0000
                     Contingency Coefficient               0.7071
                     Cramer's V                            0.7071

                      WARNING: 100% of the cells have expected counts less
                               than 5. Chi-Square may not be a valid test.

                                         Sample Size = 6



Interpreting the result

In the "Statistics for Table of race by president" output, the p-value of the contingency test (Chi-Square test) is 0.4232. Therefore, at α = 0.05, we do not reject the null hypothesis; the data provide no evidence that race and opinion on the President are associated.

Expected values of each cell are shown in the resulting table. For example, the expected count for White and Strongly disagree is 0.5, given that the row total for White is 3 persons and 1 of the 6 respondents strongly disagrees (3 × 1 / 6 = 0.5). Strictly speaking, this example should not be analyzed with the contingency test because of the small sample size. If the question can be reduced to a 2 by 2 table, i.e., race (white/non-white) by opinion (agree/disagree), one option would be to use Fisher's exact test. For further discussion of the methods, especially for 2 by 2 tables, please refer to Campbell (2007).
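
Fisher's exact test can also be requested directly for a table larger than 2 by 2. The sketch below uses the "fisher" option of the tables statement on the same race by president table; the title text is only an illustration.

	proc freq data=questionnaire;
		title "Fisher's exact test for race and president";
		tables race*president / fisher norow nocol nopercent;
		run;

For large tables or large samples the exact computation can take a long time, so this approach is best suited to small data sets such as this one.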

Reference

  • Campbell, I. Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statist. Med. 2007; 26:3661-3675.