Procedure demonstrated with an example

  1. The contingency test
  2. Analyzing data with contingency test
  3. Output, interpretation and assumption checking

The contingency test

Questionnaires and surveys are common and useful ways to collect information. Measurements (variables) in surveys are often categorical, for example, gender (F/M) or race (White/Hispanic/African American/Other). If the question is to study the relationship between two categorical variables, a chi-square test is possible if the following assumptions are met:

  1. No more than 20% of the cells have an expected count less than 5; otherwise other methods (discussed at the end) should be used.
  2. The data of the study should be one of the following:
    • Two or more independent random samples, each observed on a categorical variable; or
    • One random sample, observed with respect to two categorical variables.
    • For either type of data, the observations within a sample must be independent of each other; otherwise McNemar's test should be used.
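
The first assumption is stated in terms of expected counts. As a reminder of the standard definitions (the symbols are introduced here for illustration and are not part of the original text): for a table with r rows and c columns, let O_ij be the observed count in cell (i, j), R_i the total of row i, C_j the total of column j, and N the sample size. Under independence, the expected count and the chi-square statistic are usually written as:

```latex
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
df = (r - 1)(c - 1)
```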

Analyzing survey data with contingency test

Suppose there is a survey with 8 questions about profile and political opinions. The survey form is shown here, and the responses are recorded in "questionnaire.csv".

Suppose the question is to test whether race and opinion on the president are related.

In the following sections, we first organize the data and then run the contingency test.

Open the data set from SAS, or import it with the following commands.

 
  data questionnaire;
	/* comma-delimited file; firstobs=2 skips the header row */
	infile "H:\sas\data\questionnaire.csv" dlm=',' firstobs=2;
	input age gender race marital education president arm city;
  run;
  /* Value labels (formats) for the coded survey answers */
  proc format;
	value $gender		'1'='Male'
				'2'='Female'
				OTHER='Miscoded';
	value $race		'1'='White'
				'2'='African Am.'
				'3'='Hispanic'
				'4'='Other';
	value $marital		'1'='Single'
				'2'='Married'
				'3'='Widowed'
				'4'='Divorced';
	value $educ		'1'='High School or Less'
				'2'='Two Yr. College'
				'3'='Four Yr. College'
				'4'='Graduate Degree';
	value opinion		1='Str Disagree'
				2='Disagree'
				3='No opinion'
				4='Agree'
				5='Str Agree';
	value agegroup		1='0-20'
				2='21-40'
				3='41-60'
				4='Greater than 60';
  run;

  data questionnaire;
	/* comma-delimited file; firstobs=2 skips the header row */
	infile "H:\sas\data\questionnaire.csv" dlm=',' firstobs=2;
	/* $ reads gender, race, marital and education as character variables */
	input age gender $ race $ marital $ education $ president arm city;

	/* Group age into four categories */
	IF age GE 0 AND age LE 20 THEN agegroup=1;
	ELSE IF age GT 20 AND age LE 40 THEN agegroup=2;
	ELSE IF age GT 40 AND age LE 60 THEN agegroup=3;
	ELSE IF age GT 60 THEN agegroup=4;

	/* Attach the formats defined in proc format */
	format	agegroup  agegroup.
		gender    $gender.
		race      $race.
		marital   $marital.
		education $educ.
		president arm city opinion.;
  run;

Note that readability can be improved by adding labels for the values of the variables, as done with the formats above. Furthermore, a new variable agegroup has been defined from age and labeled. The agegroup variable is categorical.
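
Variable names themselves can also be given descriptive labels with a LABEL statement. The following is a minimal sketch, assuming the questionnaire data set created above; the label texts are hypothetical and should be replaced by the actual survey wording.

```sas
proc freq data=questionnaire;
	/* Hypothetical label texts, not taken from the survey form */
	label agegroup  = "Age group"
	      president = "Opinion on the president";
	tables agegroup president;
run;
```

A LABEL statement placed inside a procedure step applies only to that step; placing it in the data step would store the labels with the data set.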

The IF and ELSE IF statements have the following general form and can be used to define new variables:

	IF condition THEN statement;
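
As a further illustration of this form, the sketch below derives a two-level race variable from the coded race variable. The data set name questionnaire2 and the variable name race2 are hypothetical names introduced only for this example.

```sas
data questionnaire2;                 /* hypothetical output data set */
	set questionnaire;
	length race2 $ 9;            /* long enough for 'Non-white' */
	/* race is stored as the character codes '1' to '4' */
	if race = '1' then race2 = 'White';
	else if race in ('2', '3', '4') then race2 = 'Non-white';
	else race2 = ' ';            /* unexpected codes are left missing */
run;
```

The comparisons use the stored codes, not the formatted values, because a format changes only how a value is displayed.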

Summary tables can be produced with the following statements.

	proc freq data=questionnaire;
		title "Frequency Counts for Categorical Variables";
		tables gender race marital education president arm city;
		run;

Now request a contingency test with SAS proc freq.

	proc freq data=questionnaire;
		title "Contingency test for race and president";
		tables race*president /CHISQ expected norow nocol nopercent;
		run;

The "CHISQ" option requests a contingency test; the "expected" option requests the expected values for checking the assumption; and "norow", "nocol", and "nopercent" hide the minor results and make the output more readable.

Checking assumptions

The contingency (chi-square) test assumes that

  1. The observations are independent of each other;
  2. No more than 20% of the cells have an expected count less than 5. The expected counts printed by the "expected" option are used to check this.

Reading the output

            The FREQ Procedure

                            Table of race by president

                    race         president

                    Frequency   |
                    Expected    |Str Disa|Disagree|No opini|Agree   |  Total
                                |gree    |        |on      |        |
                    ________________________________________________________
                    White       |      1 |      1 |      1 |      0 |      3
                                |    0.5 |    0.5 |    0.5 |    1.5 |
                    ________________________________________________________
                    African Am. |      0 |      0 |      0 |      2 |      2
                                |   0.33 |   0.33 |   0.33 |      1 |
                    ________________________________________________________
                    Hispanic    |      0 |      0 |      0 |      1 |      1
                                |   0.17 |   0.17 |   0.17 |    0.5 |
                    ________________________________________________________
                    Total              1        1        1        3        6
                                   
                            Statistics for Table of race by president

                     Statistic                     DF       Value      Prob
                     _________________________________________________________
                     Chi-Square                     6      6.0000    0.4232
                     Likelihood Ratio Chi-Square    6      8.3178    0.2157
                     Mantel-Haenszel Chi-Square     1      3.0000    0.0833
                     Phi Coefficient                       1.0000
                     Contingency Coefficient               0.7071
                     Cramer's V                            0.7071

                      WARNING: 100% of the cells have expected counts less
                               than 5. Chi-Square may not be a valid test.

                                         Sample Size = 6



Interpreting the result

In the "Statistics for Table of race by president", the p-value of the contingency test (Chi-Square test) is 0.4232. Therefore, at α = 0.05, do not reject the null hypothesis: there is no evidence that race and opinion on the president are related.
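
The reported statistic can be checked by hand from the observed and expected counts in the table, using the usual Pearson formula (the sum over cells of (observed - expected)^2 / expected; a standard definition that the text above does not spell out). The rounded expected counts 0.33 and 0.17 are taken as 1/3 and 1/6. The closed form used for the p-value holds for chi-square distributions with even degrees of freedom.

```latex
\begin{align*}
\text{White:}       &\quad 3 \cdot \frac{(1 - 0.5)^2}{0.5} + \frac{(0 - 1.5)^2}{1.5} = 1.5 + 1.5 = 3 \\
\text{African Am.:} &\quad 3 \cdot \frac{(0 - \tfrac{1}{3})^2}{\tfrac{1}{3}} + \frac{(2 - 1)^2}{1} = 1 + 1 = 2 \\
\text{Hispanic:}    &\quad 3 \cdot \frac{(0 - \tfrac{1}{6})^2}{\tfrac{1}{6}} + \frac{(1 - 0.5)^2}{0.5} = 0.5 + 0.5 = 1 \\
\chi^2 &= 3 + 2 + 1 = 6, \qquad df = (3 - 1)(4 - 1) = 6 \\
p &= P(\chi^2_6 > 6) = e^{-3}\left(1 + 3 + \tfrac{3^2}{2}\right) = 8.5\, e^{-3} \approx 0.4232
\end{align*}
```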

Expected values of each cell are shown in the resulting table. For example, the expected count for White and Strongly Disagree is 0.5, given that the row total is 3 persons, the column total is 1, and the sample size is 6 (3 × 1 / 6 = 0.5). Strictly speaking, the example should not be analyzed with the contingency test because of the small sample size. If the question can be reduced to a 2 by 2 table, i.e., race (white/non-white) by opinion (agree/disagree), one option would be to use Fisher's exact test. For further discussion of the methods, especially for 2 by 2 tables, please refer to Campbell (2007).
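
A minimal sketch of requesting Fisher's exact test in proc freq follows. It is applied here to the original race by president table through the FISHER option of the tables statement, which is an assumption about how one might proceed and not a step from the original analysis.

```sas
proc freq data=questionnaire;
	title "Fisher's exact test for race and president";
	/* FISHER requests the exact test for tables larger than 2 by 2 */
	tables race*president / fisher norow nocol nopercent;
run;
```

For a 2 by 2 table the FISHER option is not needed, because the CHISQ option already prints Fisher's exact test along with the chi-square statistics.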

Reference

  • Campbell, I. Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statist. Med. 2007; 26:3661-3675.