Attribute Agreement Analysis

Between Appraisers - Fleiss' Kappa Statistics


You can assess the consistency of ratings between appraisers.

If kappa = 1, there is perfect agreement. If kappa = 0, agreement is no better than would be expected by chance. The higher the kappa value, the stronger the agreement between appraisers. Negative values, which indicate agreement weaker than expected by chance, are rare. As a general guideline, a kappa value less than 0.7 indicates that the measurement system needs improvement, and values greater than 0.9 are considered excellent; the appropriate threshold depends on the application.
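Fleiss' kappa compares the observed agreement among appraisers with the agreement expected by chance: kappa = (P_observed - P_expected) / (1 - P_expected). The following is a minimal sketch of the overall calculation in Python; the ratings table is made-up illustration data, not the fabric data, and real output such as the table below also reports per-response kappas and standard errors.

import numpy as np

def fleiss_kappa(counts):
    # counts[i, j] = number of appraisers who assigned category j to sample i.
    # Assumes every sample is rated by the same number of appraisers.
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]  # appraisers per sample
    # Observed agreement: mean proportion of agreeing appraiser pairs per sample
    p_obs = ((counts * (counts - 1)).sum(axis=1) / (n * (n - 1))).mean()
    # Chance agreement: based on the overall category proportions
    p_cat = counts.sum(axis=0) / counts.sum()
    p_exp = (p_cat ** 2).sum()
    return (p_obs - p_exp) / (1 - p_exp)

# Illustration only: 5 samples, 4 appraisers, ratings on a 1-to-5 scale
table = [[4, 0, 0, 0, 0],
         [0, 3, 1, 0, 0],
         [0, 0, 4, 0, 0],
         [0, 0, 0, 4, 0],
         [0, 0, 1, 0, 3]]
print(round(fleiss_kappa(table), 4))  # 0.7452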

Compare the kappa statistics for each Response and Overall. Are appraisers having difficulty with a particular response?

For the fabric data, all kappa statistics for all responses are greater than 0.7. The consistency between the appraisers' ratings is within acceptable limits.

Use the p-values to choose between two opposing hypotheses, based on your sample data:

·    H0: The agreement between appraisers is due to chance.

·    H1: The agreement between appraisers is not due to chance.

The p-value provides the likelihood of obtaining your sample, with its particular kappa statistic, if the null hypothesis (H0) is true. If the p-value is less than or equal to a predetermined level of significance (α-level), then you reject the null hypothesis and claim support for the alternative hypothesis.
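To see how the reported values fit together, the Z statistic is the kappa estimate divided by its standard error, and P(vs > 0) is the upper-tail probability of that Z under the standard normal distribution. A short sketch using the Overall row from the output below (scipy is assumed to be available):

from scipy.stats import norm

kappa = 0.937151        # Overall kappa from the output below
se_kappa = 0.0173844    # its standard error
z = kappa / se_kappa    # 53.9076, matching the Z column
p_value = norm.sf(z)    # one-sided P(Z > z); displays as 0.0000
alpha = 0.05            # predetermined significance level
print(f"Z = {z:.4f}, p = {p_value:.4f}, reject H0: {p_value <= alpha}")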

Note

The between-appraiser statistics do not compare the appraisers' ratings to the standard. Although the appraisers' ratings may be consistent with each other, they are not necessarily correct.

Example Output

Fleiss' Kappa Statistics

Response     Kappa   SE Kappa        Z  P(vs > 0)
1         0.974356  0.0345033  28.2395     0.0000
2         0.934060  0.0345033  27.0716     0.0000
3         0.884560  0.0345033  25.6370     0.0000
4         0.911754  0.0345033  26.4251     0.0000
5         0.973542  0.0345033  28.2159     0.0000
Overall   0.937151  0.0173844  53.9076     0.0000

Interpretation

For the fabric data, with α = 0.05, p = 0.0000 for all responses, so you can reject the null hypothesis. The between-appraiser agreement is significantly different from what would be expected by chance.