General Notions

Inter-rater agreement (also known as inter-rater reliability) is a measure of the consensus among \(n\) raters in the classification of \(N\) objects into \(k\) different categories.

In the general case, the rater evaluations can be represented by the reliability data matrix: an \(n \times N\)-matrix \(R\) such that \(R[i,j]\) stores the category selected by the \(i\)-th rater for the \(j\)-th object.

A more succinct representation is provided by an \(N \times k\)-matrix \(C\) whose element \(C[i,j]\) counts how many raters evaluated the \(i\)-th object as belonging to the \(j\)-th category. This matrix is the classification matrix.

Whenever the number of raters is \(2\), i.e., \(n=2\), the rater evaluations can be represented by the agreement matrix: a \(k \times k\)-matrix \(A\) such that \(A[i,j]\) stores the number of objects classified as belonging to the \(i\)-th category by the first rater and to the \(j\)-th category by the second rater.
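To make these representations concrete, the following minimal NumPy sketch (the names `R`, `C`, `A`, `n`, `N`, and `k` simply mirror the notation above and are not part of any library interface) builds the classification matrix and, for two raters, the agreement matrix from a reliability data matrix.

```python
import numpy as np

# Reliability data matrix R: n raters (rows) x N objects (columns);
# R[i, j] is the category (0 .. k-1) chosen by rater i for object j.
R = np.array([[0, 1, 2, 1, 0],
              [0, 1, 1, 1, 0]])
n, N = R.shape
k = 3

# Classification matrix C: N objects x k categories;
# C[i, j] counts how many raters put object i in category j.
C = np.zeros((N, k), dtype=int)
for rater in range(n):
    for obj in range(N):
        C[obj, R[rater, obj]] += 1

# Agreement matrix A (only meaningful when n == 2): k x k;
# A[i, j] counts objects rated i by the first rater and j by the second.
A = np.zeros((k, k), dtype=int)
for obj in range(N):
    A[R[0, obj], R[1, obj]] += 1

print(C)
print(A)
```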

Bennett, Alpert and Goldstein’s S

Bennett, Alpert and Goldstein’s \(S\) is an inter-rater agreement measure on a nominal scale (see [BAG54] and [War12]). It is defined as:

\[S \stackrel{\tiny\text{def}}{=} \frac { k * P_0 - 1 } { k - 1 } \]

where \(P_0\) is the probability of agreement among the raters and \(k\) is the number of different categories in the classification.
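For two raters, \(P_0\) can be read off the diagonal of the agreement matrix; the sketch below computes \(S\) under that assumption (plain NumPy, with the illustrative function name `bennett_s`, not a library call).

```python
import numpy as np

def bennett_s(A):
    """Bennett, Alpert and Goldstein's S for a k x k agreement matrix A."""
    A = np.asarray(A, dtype=float)
    k = A.shape[0]
    p_0 = np.trace(A) / A.sum()   # observed probability of agreement
    return (k * p_0 - 1) / (k - 1)

# Example: 3 categories, 30 objects, 24 exact agreements.
A = np.array([[10, 1, 0],
              [2, 8, 1],
              [0, 2, 6]])
print(bennett_s(A))   # (3 * 0.8 - 1) / 2 = 0.7
```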

Bangdiwala’s B

Bangdiwala’s \(B\) is an inter-rater agreement measure on a nominal scale (see [MB97]). It is defined as:

\[B \stackrel{\tiny\text{def}}{=} \frac{\sum_{i} A[i,i]^2}{\sum_{i} A_{i\cdot}*A_{\cdot{}i}}\]

where \(A_{i\cdot}\) and \(A_{\cdot{}i}\) are the sums of the elements in the \(i\)-th row and \(i\)-th column of the matrix \(A\), respectively.
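The sketch below computes \(B\) directly from the diagonal and the marginal sums of the agreement matrix (again plain NumPy; `bangdiwala_b` is an illustrative name, not a library function).

```python
import numpy as np

def bangdiwala_b(A):
    """Bangdiwala's B for a k x k agreement matrix A."""
    A = np.asarray(A, dtype=float)
    row_sums = A.sum(axis=1)   # A_{i.}
    col_sums = A.sum(axis=0)   # A_{.i}
    return (np.diag(A) ** 2).sum() / (row_sums * col_sums).sum()

A = np.array([[10, 1, 0],
              [2, 8, 1],
              [0, 2, 6]])
print(bangdiwala_b(A))
```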

Cohen’s Kappa

Cohen’s \(\kappa\) is an inter-rater agreement measure on a nominal scale (see [Coh60]). It is defined as:

\[\kappa \stackrel{\tiny\text{def}}{=} \frac{P_0-P_e}{1-P_e}\]

where \(P_0\) is the probability of agreement among the raters and \(P_e\) is the probability of agreement expected by chance.
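For two raters, \(P_0\) is the normalized trace of the agreement matrix and \(P_e\) is the sum of the products of the two raters’ marginal proportions; a minimal NumPy sketch (the name `cohen_kappa` is illustrative):

```python
import numpy as np

def cohen_kappa(A):
    """Cohen's kappa for a k x k agreement matrix A."""
    A = np.asarray(A, dtype=float)
    total = A.sum()
    p_0 = np.trace(A) / total              # observed agreement
    row_marg = A.sum(axis=1) / total       # first rater's marginal proportions
    col_marg = A.sum(axis=0) / total       # second rater's marginal proportions
    p_e = (row_marg * col_marg).sum()      # agreement expected by chance
    return (p_0 - p_e) / (1 - p_e)

A = np.array([[10, 1, 0],
              [2, 8, 1],
              [0, 2, 6]])
print(cohen_kappa(A))
```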

Scott’s Pi

Scott’s \(\pi\) is an inter-rater agreement measure on a nominal scale (see [Sco55]). Similarly to Cohen’s \(\kappa\), it is defined as:

\[\pi \stackrel{\tiny\text{def}}{=} \frac{P_0-P_e}{1-P_e}\]

where \(P_0\) is the probability of agreement among the raters (as in Cohen’s \(\kappa\)) and \(P_e\) is the sum of the squared joint proportions, whereas in Cohen’s \(\kappa\) it is the sum of the squared geometric means of the marginal proportions. Here the joint proportions are the arithmetic means of the two raters’ marginal proportions.
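The only change with respect to the previous sketch is how \(P_e\) is computed; here it uses the arithmetic means of the marginal proportions (illustrative NumPy code, not a library interface):

```python
import numpy as np

def scott_pi(A):
    """Scott's pi for a k x k agreement matrix A."""
    A = np.asarray(A, dtype=float)
    total = A.sum()
    p_0 = np.trace(A) / total
    # Joint proportions: arithmetic means of the two raters' marginals.
    joint = (A.sum(axis=1) + A.sum(axis=0)) / (2 * total)
    p_e = (joint ** 2).sum()
    return (p_0 - p_e) / (1 - p_e)

A = np.array([[10, 1, 0],
              [2, 8, 1],
              [0, 2, 6]])
print(scott_pi(A))
```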

Yule’s Y

Yule’s \(Y\) (see [Yul12]), sometimes called the coefficient of colligation, measures the relation between two binary random variables (i.e., it can be computed exclusively on \(2 \times 2\) agreement matrices). It is defined as:

\[Y \stackrel{\tiny\text{def}}{=} \frac{\sqrt{\text{OR}}-1}{\sqrt{\text{OR}}+1}\]

where \(\text{OR}\) is the odds ratio:

\[\text{OR} \stackrel{\tiny\text{def}}{=} \frac{A[0,0]*A[1,1]}{A[1,0]*A[0,1]}.\]
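A minimal sketch, assuming a \(2 \times 2\) agreement matrix with non-zero off-diagonal counts (otherwise the odds ratio is not finite); the name `yule_y` is illustrative:

```python
import numpy as np

def yule_y(A):
    """Yule's Y (coefficient of colligation) for a 2 x 2 agreement matrix A."""
    A = np.asarray(A, dtype=float)
    odds_ratio = (A[0, 0] * A[1, 1]) / (A[1, 0] * A[0, 1])
    sqrt_or = np.sqrt(odds_ratio)
    return (sqrt_or - 1) / (sqrt_or + 1)

A = np.array([[30, 5],
              [10, 25]])
print(yule_y(A))
```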

Fleiss’s Kappa

Fleiss’s \(\kappa\) (see [Fle71]) is a multi-rater generalization of Scott’s Pi.

If the classifications are represented in a classification matrix (see General Notions), the proportion of the classifications assigned to the \(j\)-th category is:

\[p_j \stackrel{\tiny\text{def}}{=} \frac{1}{N*n}\sum_{i=1}^{N} C[i,j]\]

and the sum of their squares is:

\[\bar{P_e} \stackrel{\tiny\text{def}}{=} \sum_{j=1}^k p_j^2.\]

Instead, the ratio between the number of rater pairs which agree on the \(i\)-th object and the total number of rater pairs is:

\[P_i \stackrel{\tiny\text{def}}{=} \frac{1}{n*(n-1)}\left(\left(\sum_{j=1}^k C[i,j]^2\right) - n\right)\]

and its mean is:

\[\bar{P} \stackrel{\tiny\text{def}}{=} \frac{1}{N}\sum_{i=1}^{N}P_i.\]

Fleiss’s \(\kappa\) is defined as:

\[\kappa \stackrel{\tiny\text{def}}{=} \frac{\bar{P}-\bar{P_e}}{1-\bar{P_e}}.\]
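The definitions above translate directly into code; the following NumPy sketch (the name `fleiss_kappa` is illustrative and assumes every object received exactly \(n\) ratings) computes \(\kappa\) from a classification matrix:

```python
import numpy as np

def fleiss_kappa(C):
    """Fleiss's kappa for an N x k classification matrix C
    (each of the N rows sums to the number of raters n)."""
    C = np.asarray(C, dtype=float)
    N, k = C.shape
    n = C[0].sum()                                    # ratings per object
    p_j = C.sum(axis=0) / (N * n)                     # category proportions
    P_e_bar = (p_j ** 2).sum()
    P_i = ((C ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-object agreement
    P_bar = P_i.mean()
    return (P_bar - P_e_bar) / (1 - P_e_bar)

# Example: 4 objects, 3 categories, 3 raters per object.
C = np.array([[3, 0, 0],
              [0, 3, 0],
              [1, 2, 0],
              [0, 1, 2]])
print(fleiss_kappa(C))
```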

Information Agreement

The Information Agreement (\(\text{IA}\)) is an inter-rater agreement measure on a nominal scale (see [CFG20a]) which gauges the dependence between the classifications of two raters.

The probability distributions for the evaluations of the rater \(\mathfrak{X}\), those of the rater \(\mathfrak{Y}\), and the joint evaluations \(\mathfrak{X}\mathfrak{Y}\) on the agreement matrix \(A\) are:

\[p_{X_{A}}(j_0) \stackrel{\tiny\text{def}}{=} \frac{\sum_{i} A[i,j_0]}{\sum_{i}\sum_{j} A[i,j]}, \quad\quad\quad p_{Y_{A}}(i_0) \stackrel{\tiny\text{def}}{=} \frac{\sum_{j} A[i_0,j]}{\sum_{i}\sum_{j} A[i,j]}, \]

and

\[p_{X_{A}Y_{A}}(i_0,j_0) \stackrel{\tiny\text{def}}{=} \frac{A[i_0,j_0]}{\sum_{i}\sum_{j} A[i,j]}, \]

respectively. The entropy functions for the random variables \(X_{A}\), \(Y_{A}\), and \(X_{A}Y_{A}\) are:

\[H(X_{A}) \stackrel{\tiny\text{def}}{=} - \sum_{i} p_{X_{A}}(i) \log_2 p_{X_{A}}(i), \quad\quad\quad H(Y_{A}) \stackrel{\tiny\text{def}}{=} - \sum_{j} p_{Y_{A}}(j) \log_2 p_{Y_{A}}(j), \]

and

\[H(X_{A}Y_{A}) \stackrel{\tiny\text{def}}{=} - \sum_{i}\sum_{j} p_{X_{A}Y_{A}}(i,j) \log_2 p_{X_{A}Y_{A}}(i,j). \]

The mutual information between the classification of \(\mathfrak{X}\) and \(\mathfrak{Y}\) is:

\[I(X_{A},Y_{A}) \stackrel{\tiny\text{def}}{=} H(X_{A})+H(Y_{A})-H(X_{A}Y_{A}). \]

The Information Agreement of \(A\) is the ratio between \(I(X_{A},Y_{A})\) and the minimum of \(H(X_{A})\) and \(H(Y_{A})\), i.e.,

\[\text{IA} \stackrel{\tiny\text{def}}{=} \frac{I(X_{A},Y_{A})} { \min(H(X_{A}), H(Y_{A})) }. \]
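Putting the definitions together, a minimal NumPy sketch (illustrative names, not a library API) computes \(\text{IA}\) for an agreement matrix with strictly positive entries, with \(p_{X_{A}}\) taken from the column marginals and \(p_{Y_{A}}\) from the row marginals, as above:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a probability vector."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def information_agreement(A):
    """IA for a k x k agreement matrix A with strictly positive entries."""
    A = np.asarray(A, dtype=float)
    joint = A / A.sum()                   # p_{X_A Y_A}
    p_x = joint.sum(axis=0)               # p_{X_A}: column marginals
    p_y = joint.sum(axis=1)               # p_{Y_A}: row marginals
    h_x, h_y = entropy(p_x), entropy(p_y)
    mutual_info = h_x + h_y - entropy(joint.ravel())
    return mutual_info / min(h_x, h_y)

A = np.array([[10, 1, 2],
              [2, 8, 1],
              [1, 2, 6]])
print(information_agreement(A))
```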

Extension-by-Continuity of IA

\(\text{IA}\) was proven to be effective in gauging agreement and avoids some of the pitfalls of Cohen’s \(\kappa\). However, it is not defined over all agreement matrices and, in particular, it cannot be directly computed on agreement matrices containing zeros (see [CFG20b]).

The extension-by-continuity of Information Agreement (\(\text{IA}_{C}\)) extends \(\text{IA}\)’s domain so that it can deal with matrices containing zeros (see [CFG20b]). In order to achieve this goal, the considered agreement matrix \(A\) is replaced by the symbolic matrix \(A_{\epsilon}\) defined as:

\[A_{\epsilon}[i,j] \stackrel{\tiny\text{def}}{=} \begin{cases} A[i,j] & \textrm{if $A[i,j]\neq 0$}\\ \epsilon & \textrm{if $A[i,j]=0$} \end{cases} \]

where \(\epsilon\) is a real variable with values in the open interval \((0, +\infty)\). On this matrix, the mutual information of the variables \(X_{A_{\epsilon}}\) and \(Y_{A_{\epsilon}}\) and their entropy functions are defined as above. The extension-by-continuity of Information Agreement of \(A\) is the limit of the ratio between \(I(X_{A_{\epsilon}},Y_{A_{\epsilon}})\) and the minimum of \(H(X_{A_{\epsilon}})\) and \(H(Y_{A_{\epsilon}})\) as \(\epsilon\) tends to \(0\) from the right, i.e.,

\[\text{IA}_{C}(A) \stackrel{\tiny\text{def}}{=} \lim_{\epsilon \rightarrow 0^+} \frac{I(X_{A_{\epsilon}},Y_{A_{\epsilon}})} { \min(H(X_{A_{\epsilon}}), H(Y_{A_{\epsilon}})) }. \]

\(\text{IA}_{C}(A)\) was proven to be defined over any non-null agreement matrix having more than one row/column and, if \(l\) and \(m\) are the numbers of non-null columns and non-null rows in \(A\), respectively, then:

\[\text{IA}_{C}(A) = \begin{cases} 1-\frac{m}{k} & \text{if $H(\overline{X_{A}})=0$}\\ 1-\frac{l}{k} & \text{if $H(\overline{Y_{A}})=0$}\\ \frac{I(\overline{X_{A}},\overline{Y_{A}})} { \min\left(H\left(\overline{X_{A}}\right), H\left(\overline{Y_{A}}\right)\right) }&\text{otherwise} \end{cases} \]

where \(\overline{X_{A}}\), \(\overline{Y_{A}}\), and \(\overline{X_{A}Y_{A}}\) are three random variables having the same probability distributions of \({X_{A}}\), \({Y_{A}}\), and \({X_{A}Y_{A}}\) except for \(0\)-probability events which are removed from their domains (see [CFG20b]).
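The closed form above translates directly into code; the following NumPy sketch (illustrative names, not a library interface) relies on the convention \(0 \log 0 = 0\), which is equivalent to removing the \(0\)-probability events, and falls back to the two degenerate cases when one of the marginal entropies vanishes:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) with the convention 0 * log 0 = 0."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def ia_c(A, k=None):
    """Extension-by-continuity IA_C for a non-null k x k agreement matrix A."""
    A = np.asarray(A, dtype=float)
    if k is None:
        k = A.shape[0]                    # number of categories
    joint = A / A.sum()                   # p_{X_A Y_A}
    p_x = joint.sum(axis=0)               # p_{X_A}: column marginals
    p_y = joint.sum(axis=1)               # p_{Y_A}: row marginals
    h_x, h_y = entropy(p_x), entropy(p_y)
    l = np.count_nonzero(p_x)             # non-null columns
    m = np.count_nonzero(p_y)             # non-null rows
    if h_x == 0:
        return 1 - m / k
    if h_y == 0:
        return 1 - l / k
    mutual_info = h_x + h_y - entropy(joint.ravel())
    return mutual_info / min(h_x, h_y)

# Example: an agreement matrix containing zeros, on which IA itself
# is not directly defined.
A = np.array([[10, 0, 0],
              [2, 8, 1],
              [0, 2, 6]])
print(ia_c(A))
```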

References

BAG54

E.M. Bennett, R. Alpert, and A.C. Goldstein. Communications Through Limited-Response Questioning. Public Opinion Quarterly, 18(3):303–308, 1954. doi:10.1086/266520.

CFG20a

Alberto Casagrande, Francesco Fabris, and Rossano Girometti. Beyond Kappa: An Informational Index for Diagnostic Agreement in Dichotomous and Multivalue Ordered-Categorical Ratings. Submitted for publication, 2020.

CFG20b

Alberto Casagrande, Francesco Fabris, and Rossano Girometti. Extending Information Agreement by Continuity. 2020.

Coh60

Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960. doi:10.1177/001316446002000104.

Fle71

Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971. doi:10.1037/h0031619.

MB97

Sergio R. Munoz and Shrikant I. Bangdiwala. Interpretation of Kappa and B statistics measures of agreement. Journal of Applied Statistics, 24(1):105–112, 1997. doi:10.1080/02664769723918.

Sco55

William A. Scott. Reliability of content analysis: the case of nominal scale coding. The Public Opinion Quarterly, 19(3):321–325, 1955. doi:10.1086/266577.

War12

Matthijs J. Warrens. The effect of combining categories on Bennett, Alpert and Goldstein’s S. Statistical Methodology, 9(3):341–352, 2012. doi:10.1016/j.stamet.2011.09.001.

Yul12

George Udny Yule. On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75(6):579–652, 1912. doi:10.2307/2340126.