General Notions

Inter-rater agreement (also known as inter-rater reliability) is a measure of the consensus among \(n\) raters in the classification of \(N\) objects into \(k\) different categories.

In the general case, the rater evaluations can be represented by the reliability data matrix: an \(n \times N\)-matrix \(R\) such that \(R[i,j]\) stores the category selected by the \(i\)-th rater for the \(j\)-th object.

A more succinct representation is provided by an \(N \times k\)-matrix \(C\) whose element \(C[i,j]\) counts how many raters evaluated the \(i\)-th object as belonging to the \(j\)-th category. This matrix is the classification matrix.

Whenever the number of raters is \(2\), i.e., \(n=2\), the rater evaluations can be represented by the agreement matrix: a \(k \times k\)-matrix \(A\) such that \(A[i,j]\) stores the number of objects classified as belonging to the \(i\)-th category by the first rater and to the \(j\)-th category by the second rater.
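To make these representations concrete, the following minimal NumPy sketch (the names `R`, `C`, `A`, `n`, `N`, and `k` simply mirror the notation above and are not part of any library interface) builds the classification matrix and, for two raters, the agreement matrix from a reliability data matrix.

```python
import numpy as np

# Reliability data matrix R: n raters (rows) x N objects (columns);
# R[i, j] is the category (0 .. k-1) chosen by rater i for object j.
R = np.array([[0, 1, 2, 1, 0],
              [0, 1, 1, 1, 0]])
n, N = R.shape
k = 3

# Classification matrix C: N objects x k categories;
# C[i, j] counts how many raters put object i in category j.
C = np.zeros((N, k), dtype=int)
for rater in range(n):
    for obj in range(N):
        C[obj, R[rater, obj]] += 1

# Agreement matrix A (only meaningful when n == 2): k x k;
# A[i, j] counts objects rated i by the first rater and j by the second.
A = np.zeros((k, k), dtype=int)
for obj in range(N):
    A[R[0, obj], R[1, obj]] += 1

print(C)
print(A)
```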

Bennett, Alpert and Goldstein’s S

Bennett, Alpert and Goldstein’s \(S\) is an inter-rater agreement measure on a nominal scale (see [BAG54] and [War12]). It is defined as:

\[S \stackrel{\tiny\text{def}}{=} \frac { k * P_0 - 1 } { k - 1 } \]

where \(P_0\) is the probability of agreement among the raters and \(k\) is the number of different categories in the classification.
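For two raters, \(P_0\) can be read off the diagonal of the agreement matrix; the sketch below computes \(S\) under that assumption (plain NumPy, with the illustrative function name `bennett_s`, not a library call).

```python
import numpy as np

def bennett_s(A):
    """Bennett, Alpert and Goldstein's S for a k x k agreement matrix A."""
    A = np.asarray(A, dtype=float)
    k = A.shape[0]
    p_0 = np.trace(A) / A.sum()   # observed probability of agreement
    return (k * p_0 - 1) / (k - 1)

# Example: 3 categories, 30 objects, 24 exact agreements.
A = np.array([[10, 1, 0],
              [2, 8, 1],
              [0, 2, 6]])
print(bennett_s(A))   # (3 * 0.8 - 1) / 2 = 0.7
```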

Bangdiwala’s B

Bangdiwala’s \(B\) is an inter-rater agreement measure on a nominal scale (see [MB97]). It is defined as:

\[B \stackrel{\tiny\text{def}}{=} \frac{\sum_{i} A[i,i]^2}{\sum_{i} A_{i\cdot}*A_{\cdot{}i}}\]

where \(A_{i\cdot}\) and \(A_{\cdot{}i}\) are the sums of the elements in the \(i\)-th row and \(i\)-th column of the matrix \(A\), respectively.
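The sketch below computes \(B\) directly from the diagonal and the marginal sums of the agreement matrix (again plain NumPy; `bangdiwala_b` is an illustrative name, not a library function).

```python
import numpy as np

def bangdiwala_b(A):
    """Bangdiwala's B for a k x k agreement matrix A."""
    A = np.asarray(A, dtype=float)
    row_sums = A.sum(axis=1)   # A_{i.}
    col_sums = A.sum(axis=0)   # A_{.i}
    return (np.diag(A) ** 2).sum() / (row_sums * col_sums).sum()

A = np.array([[10, 1, 0],
              [2, 8, 1],
              [0, 2, 6]])
print(bangdiwala_b(A))
```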

Cohen’s Kappa

Cohen’s \(\kappa\) is an inter-rater agreement measure on a nominal scale (see [Coh60]). It is defined as:

\[\kappa \stackrel{\tiny\text{def}}{=} \frac{P_0-P_e}{1-P_e}\]

where \(P_0\) is the probability of agreement among the raters and \(P_e\) is the probability of agreement expected by chance.
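For two raters, \(P_0\) is the normalized trace of the agreement matrix and \(P_e\) is the sum of the products of the two raters’ marginal proportions; a minimal NumPy sketch (the name `cohen_kappa` is illustrative):

```python
import numpy as np

def cohen_kappa(A):
    """Cohen's kappa for a k x k agreement matrix A."""
    A = np.asarray(A, dtype=float)
    total = A.sum()
    p_0 = np.trace(A) / total              # observed agreement
    row_marg = A.sum(axis=1) / total       # first rater's marginal proportions
    col_marg = A.sum(axis=0) / total       # second rater's marginal proportions
    p_e = (row_marg * col_marg).sum()      # agreement expected by chance
    return (p_0 - p_e) / (1 - p_e)

A = np.array([[10, 1, 0],
              [2, 8, 1],
              [0, 2, 6]])
print(cohen_kappa(A))
```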

Scott’s Pi

Scott’s \(\pi\) is an inter-rater agreement measure on a nominal scale (see [Sco55]). Similarly to Cohen’s \(\kappa\), it is defined as:

\[\pi \stackrel{\tiny\text{def}}{=} \frac{P_0-P_e}{1-P_e}\]

where \(P_0\) is the probability of agreement among the raters (as in Cohen’s \(\kappa\)) and \(P_e\) is the sum of the squared joint proportions, whereas in Cohen’s \(\kappa\) it is the sum of the squared geometric means of the marginal proportions. Here the joint proportions are the arithmetic means of the two raters’ marginal proportions.
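The only change with respect to the previous sketch is how \(P_e\) is computed; here it uses the arithmetic means of the marginal proportions (illustrative NumPy code, not a library interface):

```python
import numpy as np

def scott_pi(A):
    """Scott's pi for a k x k agreement matrix A."""
    A = np.asarray(A, dtype=float)
    total = A.sum()
    p_0 = np.trace(A) / total
    # Joint proportions: arithmetic means of the two raters' marginals.
    joint = (A.sum(axis=1) + A.sum(axis=0)) / (2 * total)
    p_e = (joint ** 2).sum()
    return (p_0 - p_e) / (1 - p_e)

A = np.array([[10, 1, 0],
              [2, 8, 1],
              [0, 2, 6]])
print(scott_pi(A))
```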

Yule’s Y

Yule’s \(Y\) (see [Yul12]), sometimes called the coefficient of colligation, measures the relation between two binary random variables (i.e., it can be computed exclusively on \(2 \times 2\) agreement matrices). It is defined as:

\[Y \stackrel{\tiny\text{def}}{=} \frac{\sqrt{\text{OR}}-1}{\sqrt{\text{OR}}+1}\]

where \(\text{OR}\) is the odds ratio:

\[\text{OR} \stackrel{\tiny\text{def}}{=} \frac{A[0,0]*A[1,1]}{A[1,0]*A[0,1]}.\]
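A minimal sketch, assuming a \(2 \times 2\) agreement matrix with non-zero off-diagonal counts (otherwise the odds ratio is not finite); the name `yule_y` is illustrative:

```python
import numpy as np

def yule_y(A):
    """Yule's Y (coefficient of colligation) for a 2 x 2 agreement matrix A."""
    A = np.asarray(A, dtype=float)
    odds_ratio = (A[0, 0] * A[1, 1]) / (A[1, 0] * A[0, 1])
    sqrt_or = np.sqrt(odds_ratio)
    return (sqrt_or - 1) / (sqrt_or + 1)

A = np.array([[30, 5],
              [10, 25]])
print(yule_y(A))
```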

Fleiss’s Kappa

Fleiss’s \(\kappa\) (see [Fle71]) is a multi-rater generalization of Scott’s Pi.

If the classifications are represented in a classification matrix (see General Notions), the proportion of the classifications assigned to the \(j\)-th category is:

\[p_j \stackrel{\tiny\text{def}}{=} \frac{1}{N*n}\sum_{i=1}^{N} C[i,j]\]

and the sum of their squares is:

\[\bar{P_e} \stackrel{\tiny\text{def}}{=} \sum_{j=1}^k p_j^2.\]

Instead, the ratio between the number of rater pairs which agree on the \(i\)-th object and the total number of rater pairs is:

\[P_i \stackrel{\tiny\text{def}}{=} \frac{1}{n*(n-1)}\left(\left(\sum_{j=1}^k C[i,j]^2\right) - n\right)\]

and its mean is:

\[\bar{P} \stackrel{\tiny\text{def}}{=} \frac{1}{N}\sum_{i=1}^{N}P_i.\]

Fleiss’s \(\kappa\) is defined as:

\[\kappa \stackrel{\tiny\text{def}}{=} \frac{\bar{P}-\bar{P_e}}{1-\bar{P_e}}.\]
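The definitions above translate directly into code; the following NumPy sketch (the name `fleiss_kappa` is illustrative and assumes every object received exactly \(n\) ratings) computes \(\kappa\) from a classification matrix:

```python
import numpy as np

def fleiss_kappa(C):
    """Fleiss's kappa for an N x k classification matrix C
    (each of the N rows sums to the number of raters n)."""
    C = np.asarray(C, dtype=float)
    N, k = C.shape
    n = C[0].sum()                                    # ratings per object
    p_j = C.sum(axis=0) / (N * n)                     # category proportions
    P_e_bar = (p_j ** 2).sum()
    P_i = ((C ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-object agreement
    P_bar = P_i.mean()
    return (P_bar - P_e_bar) / (1 - P_e_bar)

# Example: 4 objects, 3 categories, 3 raters per object.
C = np.array([[3, 0, 0],
              [0, 3, 0],
              [1, 2, 0],
              [0, 1, 2]])
print(fleiss_kappa(C))
```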

Information Agreement

The Information Agreement (\(\text{IA}\)) is an inter-rater agreement measure on a nominal scale (see [CFG20a]) which gauges the dependence between the classifications of two raters.

The probability distributions for the evaluations of the rater \(\mathfrak{X}\), those of the rater \(\mathfrak{Y}\), and the joint evaluations \(\mathfrak{X}\mathfrak{Y}\) on the agreement matrix \(A\) are:

\[p_{X_{A}}(j_0) \stackrel{\tiny\text{def}}{=} \frac{\sum_{i} A[i,j_0]}{\sum_{i}\sum_{j} A[i,j]}, \quad\quad\quad p_{Y_{A}}(i_0) \stackrel{\tiny\text{def}}{=} \frac{\sum_{j} A[i_0,j]}{\sum_{i}\sum_{j} A[i,j]}, \]

and

\[p_{X_{A}Y_{A}}(i_0,j_0) \stackrel{\tiny\text{def}}{=} \frac{A[i_0,j_0]}{\sum_{i}\sum_{j} A[i,j]}, \]

respectively. The entropy functions for the random variables \(X_{A}\), \(Y_{A}\), and \(X_{A}Y_{A}\) are:

\[H(X_{A}) \stackrel{\tiny\text{def}}{=} - \sum_{i} p_{X_{A}}(i) \log_2 p_{X_{A}}(i), \quad\quad\quad H(Y_{A}) \stackrel{\tiny\text{def}}{=} - \sum_{j} p_{Y_{A}}(j) \log_2 p_{Y_{A}}(j), \]

and

\[H(X_{A}Y_{A}) \stackrel{\tiny\text{def}}{=} - \sum_{i}\sum_{j} p_{X_{A}Y_{A}}(i,j) \log_2 p_{X_{A}Y_{A}}(i,j). \]

The mutual information between the classification of \(\mathfrak{X}\) and \(\mathfrak{Y}\) is:

\[I(X_{A},Y_{A}) \stackrel{\tiny\text{def}}{=} H(X_{A})+H(Y_{A})-H(X_{A}Y_{A}). \]

The Information Agreement of \(A\) is the ratio between \(I(X_{A},Y_{A})\) and the minimum of \(H(X_{A})\) and \(H(Y_{A})\), i.e.,

\[\text{IA} \stackrel{\tiny\text{def}}{=} \frac{I(X_{A},Y_{A})} { \min(H(X_{A}), H(Y_{A})) }. \]
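Putting the definitions together, a minimal NumPy sketch (illustrative names, not a library API) computes \(\text{IA}\) for an agreement matrix with strictly positive entries, with \(p_{X_{A}}\) taken from the column marginals and \(p_{Y_{A}}\) from the row marginals, as above:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a probability vector."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def information_agreement(A):
    """IA for a k x k agreement matrix A with strictly positive entries."""
    A = np.asarray(A, dtype=float)
    joint = A / A.sum()                   # p_{X_A Y_A}
    p_x = joint.sum(axis=0)               # p_{X_A}: column marginals
    p_y = joint.sum(axis=1)               # p_{Y_A}: row marginals
    h_x, h_y = entropy(p_x), entropy(p_y)
    mutual_info = h_x + h_y - entropy(joint.ravel())
    return mutual_info / min(h_x, h_y)

A = np.array([[10, 1, 2],
              [2, 8, 1],
              [1, 2, 6]])
print(information_agreement(A))
```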

Extension-by-Continuity of IA

\(\text{IA}\) was proven to be effective in gauging agreement and avoids some of the pitfalls of Cohen’s \(\kappa\). However, it is not defined over all agreement matrices and, in particular, it cannot be directly computed on agreement matrices containing zeros (see [CFG20b]).

The extension-by-continuity of Information Agreement (\(\text{IA}_{C}\)) extends \(\text{IA}\)’s domain so that it can deal with matrices containing zeros (see [CFG20b]). In order to achieve this goal, the considered agreement matrix \(A\) is replaced by the symbolic matrix \(A_{\epsilon}\) defined as:

\[A_{\epsilon}[i,j] \stackrel{\tiny\text{def}}{=} \begin{cases} A[i,j] & \textrm{if $A[i,j]\neq 0$}\\ \epsilon & \textrm{if $A[i,j]=0$} \end{cases} \]

where \(\epsilon\) is a real variable with values in the open interval \((0, +\infty)\). On this matrix, the mutual information of the variables \(X_{A_{\epsilon}}\) and \(Y_{A_{\epsilon}}\) and their entropy functions are defined as above. The extension-by-continuity of Information Agreement of \(A\) is the limit of the ratio between \(I(X_{A_{\epsilon}},Y_{A_{\epsilon}})\) and the minimum of \(H(X_{A_{\epsilon}})\) and \(H(Y_{A_{\epsilon}})\) as \(\epsilon\) tends to \(0\) from the right, i.e.,

\[\text{IA}_{C}(A) \stackrel{\tiny\text{def}}{=} \lim_{\epsilon \rightarrow 0^+} \frac{I(X_{A_{\epsilon}},Y_{A_{\epsilon}})} { \min(H(X_{A_{\epsilon}}), H(Y_{A_{\epsilon}})) }. \]

\(\text{IA}_{C}(A)\) was proven to be defined over any non-null agreement matrix having more than one row/column and, if \(l\) and \(m\) are the numbers of non-null columns and non-null rows in \(A\), respectively, then:

\[\text{IA}_{C}(A) = \begin{cases} 1-\frac{m}{k} & \text{if $H(\overline{X_{A}})=0$}\\ 1-\frac{l}{k} & \text{if $H(\overline{Y_{A}})=0$}\\ \frac{I(\overline{X_{A}},\overline{Y_{A}})} { \min\left(H\left(\overline{X_{A}}\right), H\left(\overline{Y_{A}}\right)\right) }&\text{otherwise} \end{cases} \]

where \(\overline{X_{A}}\), \(\overline{Y_{A}}\), and \(\overline{X_{A}Y_{A}}\) are three random variables having the same probability distributions of \({X_{A}}\), \({Y_{A}}\), and \({X_{A}Y_{A}}\) except for \(0\)-probability events which are removed from their domains (see [CFG20b]).
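The closed form above translates directly into code; the following NumPy sketch (illustrative names, not a library interface) relies on the convention \(0 \log 0 = 0\), which is equivalent to removing the \(0\)-probability events, and falls back to the two degenerate cases when one of the marginal entropies vanishes:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) with the convention 0 * log 0 = 0."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def ia_c(A, k=None):
    """Extension-by-continuity IA_C for a non-null k x k agreement matrix A."""
    A = np.asarray(A, dtype=float)
    if k is None:
        k = A.shape[0]                    # number of categories
    joint = A / A.sum()                   # p_{X_A Y_A}
    p_x = joint.sum(axis=0)               # p_{X_A}: column marginals
    p_y = joint.sum(axis=1)               # p_{Y_A}: row marginals
    h_x, h_y = entropy(p_x), entropy(p_y)
    l = np.count_nonzero(p_x)             # non-null columns
    m = np.count_nonzero(p_y)             # non-null rows
    if h_x == 0:
        return 1 - m / k
    if h_y == 0:
        return 1 - l / k
    mutual_info = h_x + h_y - entropy(joint.ravel())
    return mutual_info / min(h_x, h_y)

# Example: an agreement matrix containing zeros, on which IA itself
# is not directly defined.
A = np.array([[10, 0, 0],
              [2, 8, 1],
              [0, 2, 6]])
print(ia_c(A))
```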

References

BAG54

E.M. Bennett, R. Alpert, and A.C. Goldstein. Communications Through Limited-Response Questioning. Public Opinion Quarterly, 18(3):303–308, 1954. doi:10.1086/266520.

CFG20a

Alberto Casagrande, Francesco Fabris, and Rossano Girometti. Beyond Kappa: An Informational Index for Diagnostic Agreement in Dichotomous and Multivalue Ordered-Categorical Ratings. Submitted for publication, 2020.

CFG20b

Alberto Casagrande, Francesco Fabris, and Rossano Girometti. Extending Information Agreement by Continuity. 2020.

Coh60

Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960. doi:10.1177/001316446002000104.

Fle71

Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971. doi:10.1037/h0031619.

MB97

Sergio R. Munoz and Shrikant I. Bangdiwala. Interpretation of Kappa and B statistics measures of agreement. Journal of Applied Statistics, 24(1):105–112, 1997. doi:10.1080/02664769723918.

Sco55

William A. Scott. Reliability of content analysis: the case of nominal scale coding. The Public Opinion Quarterly, 19(3):321–325, 1955. doi:10.1086/266577.

War12

Matthijs J. Warrens. The effect of combining categories on Bennett, Alpert and Goldstein’s S. Statistical Methodology, 9(3):341–352, 2012. doi:10.1016/j.stamet.2011.09.001.

Yul12

George Udny Yule. On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75(6):579–652, 1912. doi:10.2307/2340126.