Contingency procedure demonstrated with an example
The contingency test
The contingency table test is used when both the dependent and the independent variable are categorical. It is usually used to check whether two variables are related.
Analyzing simple data with counts
The easiest way to carry out a contingency test is when counts are available for each category, for example, if the frequencies are given as below.

Frequency table:

               female    male
    yes            55      50
    no             65      80
Then one can use the counts, rather than the original data file, to run the contingency test. The counts can be entered and analyzed as below.

    data simple;
      input opinion $ gender $ count;
      datalines;
    yes female 55
    yes male 50
    no female 65
    no male 80
    ;
    proc freq data=simple;
      tables opinion*gender / chisq nocol norow nopercent expected;
      weight count;
    run;

The "chisq" option requests a chi-square test; "nocol", "norow", and "nopercent" simplify the output; and "expected" requests the expected values. The "weight" statement tells the procedure how many subjects there are for each combination of gender and opinion.

    The FREQ Procedure

    Table of opinion by gender

    opinion      gender
    Frequency
    Expected     female     male    Total
    -----------------------------------
    no               65       80      145
                   69.6     75.4
    -----------------------------------
    yes              55       50      105
                   50.4     54.6
    -----------------------------------
    Total           120      130      250

    Statistics for Table of opinion by gender

    Statistic                      DF      Value     Prob
    ------------------------------------------------------
    Chi-Square                      1     1.3920   0.2381
    Likelihood Ratio Chi-Square     1     1.3926   0.2380
    Continuity Adj. Chi-Square      1     1.1059   0.2930
    Mantel-Haenszel Chi-Square      1     1.3865   0.2390
    Phi Coefficient                      -0.0746
    Contingency Coefficient               0.0744
    Cramer's V                           -0.0746

    Fisher's Exact Test
    ----------------------------------
    Cell (1,1) Frequency (F)        65
    Left-sided Pr <= F          0.1465
    Right-sided Pr >= F         0.9046

    Table Probability (P)       0.0511
    Two-sided Pr <= P           0.2508

    Sample Size = 250

The results include a contingency table with observed and expected values. A list of tests is performed, of which the first is the classic chi-square test. With a large p-value of 0.2381 we do not reject the null hypothesis of independence, so the data give no evidence that gender and opinion are associated. Note that the expected values for all combinations are large (69.6, 75.4, 50.4, 54.6), so the assumption of the chi-square test is met and the conclusion is sound. Otherwise, we should use Fisher's exact test at the end of the output, where the two-sided p-value is 0.2508.
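The expected values and the chi-square statistic reported by PROC FREQ can be reproduced by hand from the four counts. The worked equation below is a sketch using the standard Pearson formulas; the symbols O_ij, E_ij, R_i, C_j and N are notation introduced here for illustration and do not appear in the SAS output.

```latex
% O_{ij}: observed count, R_i: row total, C_j: column total, N: sample size
\[
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
E_{\text{no},\,\text{female}} = \frac{145 \times 120}{250} = 69.6
\]
% Every cell differs from its expected value by 4.6 in absolute value,
% so each numerator is 4.6^2 = 21.16.
\[
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
       = \frac{21.16}{69.6} + \frac{21.16}{75.4}
       + \frac{21.16}{50.4} + \frac{21.16}{54.6}
       \approx 1.392,
\qquad
\text{DF} = (2-1)(2-1) = 1
\]
```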
Analyzing comprehensive questionnaire data with the contingency test, without knowing the counts
An important application of the contingency test is the analysis of questionnaire and survey data, where measurements (variables) are often categorical. Suppose there is a survey with 8 questions about profile and political opinions. The survey form is shown here, and the responses are recorded in "questionnaire.csv". Suppose the question is to test whether race and opinion on the president are related. In the following sections, we first organize the data and then run the contingency test.

Open the data set in SAS, or import it with the following statements.

    data questionnaire;
      infile "H:\sas\data\questionnaire.csv" dlm='|' firstobs=2;
      input age gender race marital education president arm city;
    run;

Next, define formats that give readable labels to the coded values, and read the data again with the formats attached.

    proc format;
      value $gender  '1'='Male' '2'='Female' OTHER='Miscoded';
      value $race    '1'='White' '2'='African Am.' '3'='Hispanic' '4'='Other';
      value $marital '1'='Single' '2'='Married' '3'='Widowed' '4'='Divorced';
      value $educ    '1'='High Scho or Less' '2'='Two Yr. College'
                     '3'='Four Yr. College' '4'='Graduate Degree';
      value opinion  1='Str Disagree' 2='Disagree' 3='No opinion'
                     4='Agree' 5='Str Agree';
      value agegroup 1='0-20' 2='21-40' 3='41-60' 4='Greater than 60';
    run;

    data questionnaire;
      infile "H:\sas\data\questionnaire.csv" dlm='|' firstobs=2;
      input age gender $ race $ marital $ education $ president arm city;
      IF age GE 0 AND age LE 20 THEN agegroup=1;
      ELSE IF age GT 20 AND age LE 40 THEN agegroup=2;
      ELSE IF age GT 40 AND age LE 60 THEN agegroup=3;
      ELSE IF age GT 60 THEN agegroup=4;
      format agegroup agegroup. gender $gender. race $race. marital $marital.
             education $educ. president arm city opinion.;
    run;

Note that readability is improved by attaching labels to the coded values of the variables. Furthermore, a new variable agegroup has been defined from age and labeled. The variable agegroup is categorical.
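Besides labels for the coded values, labels for the variables themselves can make the output easier to read. The following is a minimal sketch, assuming the data set questionnaire sits in the WORK library; the label texts are illustrative guesses based on the variable names, and arm and city are left unlabeled because their meaning is not described here.

```sas
/* Attach descriptive variable labels without re-reading the data. */
/* The label texts below are assumptions chosen for illustration.  */
proc datasets library=work nolist;
  modify questionnaire;
  label gender    = 'Gender'
        race      = 'Race'
        marital   = 'Marital status'
        education = 'Education level'
        agegroup  = 'Age group'
        president = 'Opinion on the president';
quit;
```

Procedures such as PROC FREQ then show these labels alongside the variable names in their table headings.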
The IF and ELSE IF statements have the following general form and can be used to define new variables:

    IF condition THEN statement;

A summary table can be produced with the following statements.

    proc freq data=questionnaire;
      title "Frequency Counts for Categorical Variables";
      tables gender race marital education president arm city;
    run;

Now request a contingency test with the SAS proc freq.

    proc freq data=questionnaire;
      title "Contingency test for race and president";
      tables race*president / CHISQ expected norow nocol nopercent;
    run;

The "CHISQ" option requests a contingency test; "expected" requests the expected values for checking the assumption; and "norow", "nocol", and "nopercent" hide the minor results and make the output more readable.

Checking assumptions
The chi-square contingency test assumes that the expected count in each cell is large enough. SAS prints a warning when cells have expected counts less than 5.
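The expected counts can also be checked programmatically rather than read off the printed table. The sketch below assumes the questionnaire data set created earlier; the output data set name cellcounts is a placeholder, and the threshold of 5 is the one SAS uses in its warning message.

```sas
/* Write every cell of the race by president table, including cells */
/* with zero observed count (SPARSE), together with its expected    */
/* count (OUTEXPECT), to the data set cellcounts (name assumed).    */
proc freq data=questionnaire noprint;
  tables race*president / sparse outexpect out=cellcounts;
run;

/* List the cells whose expected count falls below 5. */
proc print data=cellcounts;
  where expected < 5;
  var race president count expected;
run;
```

If this listing is empty, no cell has an expected count below 5.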
Reading the output

    The FREQ Procedure

    Table of race by president

    race          president

    Frequency   |
    Expected    |Str Disa|Disagree|No opini|Agree   |  Total
                |gree    |        |on      |        |
    _________________________________________________________
    White       |      1 |      1 |      1 |      0 |      3
                |    0.5 |    0.5 |    0.5 |    1.5 |
    _________________________________________________________
    African Am. |      0 |      0 |      0 |      2 |      2
                |   0.33 |   0.33 |   0.33 |      1 |
    _________________________________________________________
    Hispanic    |      0 |      0 |      0 |      1 |      1
                |   0.17 |   0.17 |   0.17 |    0.5 |
    _________________________________________________________
    Total              1        1        1        3        6

    Statistics for Table of race by president

    Statistic                      DF      Value     Prob
    _________________________________________________________
    Chi-Square                      6     6.0000   0.4232
    Likelihood Ratio Chi-Square     6     8.3178   0.2157
    Mantel-Haenszel Chi-Square      1     3.0000   0.0833
    Phi Coefficient                       1.0000
    Contingency Coefficient               0.7071
    Cramer's V                            0.7071

    WARNING: 100% of the cells have expected counts less
             than 5. Chi-Square may not be a valid test.

    Sample Size = 6

Interpreting the result
In the "Statistics for Table of race by president", the p-value of the contingency test (chi-square test) is 0.4232. Therefore, at α = 0.05, we do not reject the null hypothesis: the data give no evidence that race and opinion on the president are associated.

The expected value of each cell is shown in the resulting table. For example, the expected number of White respondents who strongly disagree is 0.5, given that there are 3 White respondents and 1 of the 6 respondents strongly disagrees (3 × 1 / 6 = 0.5).

Theoretically speaking, the example should not be analyzed with the contingency test because of the small sample size; the warning in the output says the same. If the question can be reduced to a 2 by 2 table, i.e., race (white/non-white) by opinion (agree/disagree), one could consider alternatives to the chi-square test. One option would be to use Fisher's exact test. For further discussion of the methods, especially for 2 by 2 tables, please refer to Campbell (2007).
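The reduction to a 2 by 2 table can be sketched with the IF and ELSE IF statements introduced earlier. The names questionnaire2x2, race2 and opinion2 are hypothetical, and leaving "No opinion" responses out of the table is an assumption made for this illustration, not a recommendation from the survey design.

```sas
/* Collapse race and opinion into two levels each. */
data questionnaire2x2;
  set questionnaire;
  length race2 $ 9 opinion2 $ 8;

  /* race is stored as the character codes '1' to '4' */
  if race = '1' then race2 = 'White';
  else if race in ('2', '3', '4') then race2 = 'Non-white';

  /* president is stored as the numeric codes 1 to 5;           */
  /* code 3 ('No opinion') stays missing and is dropped from    */
  /* the table by PROC FREQ (an assumption for this sketch).    */
  if president in (4, 5) then opinion2 = 'Agree';
  else if president in (1, 2) then opinion2 = 'Disagree';
run;

/* For a 2 by 2 table, CHISQ also prints Fisher's exact test. */
proc freq data=questionnaire2x2;
  title "Race (white/non-white) by opinion (agree/disagree)";
  tables race2*opinion2 / chisq expected norow nocol nopercent;
run;
```

For tables larger than 2 by 2, such as the original race by president table, Fisher's exact test can be requested by adding the "fisher" option to the tables statement.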
© COPYRIGHT 2010 ALL RIGHTS RESERVED tqin@purdue.edu