Measurement

Cohen’s Kappa: Measuring Agreement Beyond Chance



Attribute inspection is one of the most widespread, yet difficult-to-control, measurement methods in manufacturing. Whether inspecting machined surfaces for cosmetic defects, checking weld quality, reviewing molded parts, evaluating assembly completeness, or verifying diameters with go/no-go gages, many operations depend on human inspectors making these pass-fail, subjective judgments.

To evaluate these inspection systems, many organizations still rely on observed agreement, the simplest measure of measurement consistency. Unfortunately, excellent observed agreement results can mask an unreliable measurement system and lead astray attempts to understand the variation introduced by an attribute gage.

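Let’s consider a case study in which two appraisers evaluate an attribute gage pin by each measuring the same 35 samples and comparing their results. Each appraiser declares each of the 35 samples as either “G” (Good) or “NG” (Not Good) on the basis of the output of the attribute gage. A useful presentation of the study data follows:

Table A: Appraiser classifications of the 35 samples

                     Appraiser 2: G    Appraiser 2: NG    Total
Appraiser 1: G             21                 6             27
Appraiser 1: NG             3                 5              8
Total                      24                11             35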

Observed agreement (P_o) represents the proportion of parts on which two inspectors -- or an inspector and a master standard -- give the same classification. To calculate it, you count the number of times both evaluations match (e.g. both calling a part G or both calling it NG) and divide that by the total number of parts inspected. For a G/NG inspection,

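P_o = (n_GG + n_NGNG) / N

where n_GG is the number of parts both evaluations call G, n_NGNG is the number of parts both call NG, and N is the total number of parts.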

In our case study, the appraisers agreed 21 times that the sample was Good (n_GG) and five times that the sample was Not Good (n_NGNG) for a total of 26 points of agreement in 35 opportunities (N) yielding an observed agreement (P_o) of,

P_o = (21 + 5) / 35 = 74.3%
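The same arithmetic can be sketched in a few lines of Python. The function name observed_agreement and its parameter names are illustrative choices rather than part of any standard library; the counts are those of the case study.

```python
def observed_agreement(n_gg: int, n_ngng: int, n_total: int) -> float:
    """Proportion of parts on which both appraisers gave the same call."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (n_gg + n_ngng) / n_total


# Case-study counts: 21 G/G matches, 5 NG/NG matches, 35 samples in total.
p_o = observed_agreement(n_gg=21, n_ngng=5, n_total=35)
print(f"P_o = {p_o:.3f}")  # prints "P_o = 0.743"
```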

While observed agreement is easy to compute and intuitive, it often overestimates the true reliability of a gage system, especially when one category (usually G) dominates the population.

Consider this: If 95% of all parts are “Good,” then an inspector who simply calls everything “Good” will achieve 95% agreement, even if they cannot properly detect defects. This false sense of capability leads to poor decisions and recurring quality escapes.
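A small sketch makes the trap concrete. The lot below is hypothetical (100 parts, 5 of them defective according to a master standard), and the inspector is modeled as calling every part Good without looking.

```python
# Hypothetical lot: 95 good parts and 5 defective parts per the master standard.
reference = ["G"] * 95 + ["NG"] * 5
# An inspector who calls every part Good, regardless of its condition.
inspector = ["G"] * len(reference)

# Observed agreement: share of parts where the call matches the standard.
matches = sum(r == i for r, i in zip(reference, inspector))
agreement = matches / len(reference)

# Inspector calls on the parts that are truly defective.
calls_on_defects = [i for r, i in zip(reference, inspector) if r == "NG"]
defects_caught = sum(call == "NG" for call in calls_on_defects)

print(f"Observed agreement: {agreement:.0%}")
print(f"Defects caught: {defects_caught} of {len(calls_on_defects)}")
```

The agreement figure looks excellent even though none of the five defective parts is detected.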

To avoid this trap, quality engineers can use Cohen’s Kappa (κ), a statistical measure of agreement beyond chance. Kappa tells you how much appraisers agree in a meaningful way, not merely due to guesswork.

In attribute gage studies, Kappa can be used to quantify:

Repeatability - Does the same inspector classify the same part consistently across trials?
Reproducibility - Do multiple inspectors agree with each other?
Accuracy - Does the inspector’s classification match a reference or master standard?

Observed agreement treats all agreement as equally meaningful. Kappa corrects this bias by asking:

Given how each inspector tends to classify items, how much agreement would we expect purely by chance?

In the Kappa calculation, chance agreement is the baseline, and the κ statistic measures agreement beyond that baseline. This makes Kappa far more reliable than observed agreement for evaluating human classification systems, especially those involving G/NG decisions.

The Core Formulas: P_o, P_e, and Kappa

At this point in our discussion, we understand that P_o is the observed agreement of our study, the ratio of the total number of times the two appraisers agreed about a sample’s inspection status to the total number of samples in the study. But another critical question is, what is the probability that our appraisers declared the same inspection status by pure chance? This is where Expected Agreement (P_e) enters the discussion.

P_e is how often the inspectors would agree just by chance, based on how frequently each uses each category (G or NG). This total expected agreement is the sum of the expected agreement on G and the expected agreement on NG.

For a 2×2 G/NG system:

P_e = P_G + P_NG

P_G = (number of G_{App 1} / Total) * (number of G_{App 2} / Total)

P_NG = (number of NG_{App 1} / Total) * (number of NG_{App 2} / Total)

Drawing from the data in Table A in our case study,

P_G = ((21 + 6) / 35) * ((21 + 3) / 35) = 0.529

P_NG = ((3 + 5) / 35) * ((6 + 5) / 35) = 0.072

P_e = 0.529 + 0.072 = 0.601
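The expected-agreement calculation can be sketched in Python as follows. The counts are the four cell counts used in the calculations above; the dictionary layout and the function name expected_agreement are illustrative assumptions.

```python
# Case-study counts keyed by (Appraiser 1 call, Appraiser 2 call).
counts = {
    ("G", "G"): 21,
    ("G", "NG"): 6,
    ("NG", "G"): 3,
    ("NG", "NG"): 5,
}


def expected_agreement(counts: dict) -> float:
    """Chance agreement P_e: sum over categories of the product of each
    appraiser's share of calls in that category."""
    total = sum(counts.values())
    categories = {a for a, _ in counts} | {b for _, b in counts}
    p_e = 0.0
    for c in categories:
        share_1 = sum(n for (a, _), n in counts.items() if a == c) / total
        share_2 = sum(n for (_, b), n in counts.items() if b == c) / total
        p_e += share_1 * share_2
    return p_e


p_e = expected_agreement(counts)
print(f"P_e = {p_e:.3f}")  # prints "P_e = 0.601"
```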

Once you have P_o and P_e, calculating Cohen’s Kappa (κ) is straightforward:

κ = (P_o − P_e) / (1 − P_e)

Again, referring to our case study,

κ = (0.743 − 0.601) / (1 − 0.601) = 0.356
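For studies recorded as one call per appraiser per sample, the whole calculation can be sketched as a single Python function. The name cohens_kappa is an illustrative choice, and the two lists below are rebuilt from the case-study counts rather than taken from the original data sheet, so the sample order is arbitrary.

```python
from collections import Counter


def cohens_kappa(ratings_1, ratings_2) -> float:
    """Cohen's Kappa for two equal-length sequences of categorical calls."""
    if len(ratings_1) != len(ratings_2) or len(ratings_1) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_1)
    # Observed agreement: share of samples with matching calls.
    p_o = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    # Expected agreement: product of each appraiser's category shares.
    freq_1, freq_2 = Counter(ratings_1), Counter(ratings_2)
    p_e = sum(freq_1[c] * freq_2[c] for c in freq_1.keys() | freq_2.keys()) / n**2
    if p_e == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Calls rebuilt from the case-study counts: 21 G/G, 6 G/NG, 3 NG/G, 5 NG/NG.
appraiser_1 = ["G"] * 21 + ["G"] * 6 + ["NG"] * 3 + ["NG"] * 5
appraiser_2 = ["G"] * 21 + ["NG"] * 6 + ["G"] * 3 + ["NG"] * 5

kappa = cohens_kappa(appraiser_1, appraiser_2)
print(f"kappa = {kappa:.3f}")  # prints "kappa = 0.356"
```

Working from raw calls rather than a prebuilt table keeps the function usable when a study has more than two categories.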

A quality engineer’s interpretation of Kappa will vary with industry and application, but a good starting point for interpreting values of Kappa can be found in Table B.

A Kappa of 1 implies perfect agreement. A Kappa of 0 implies no agreement beyond what chance produces. The Kappa calculation can also produce negative values (i.e., agreement worse than chance), implying systematic disagreement. This unusual case is typically caused by a structural problem with the inspection process, such as a misunderstanding of what counts as G and NG or incorrectly recorded data.
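These three reference points can be checked with a short sketch. The 2×2 counts below are hypothetical, chosen only to hit each case exactly, and the helper kappa_2x2 is an illustrative name.

```python
def kappa_2x2(n_gg: int, n_g_ng: int, n_ng_g: int, n_ngng: int) -> float:
    """Cohen's Kappa from 2x2 counts.

    n_g_ng: Appraiser 1 says G, Appraiser 2 says NG.
    n_ng_g: Appraiser 1 says NG, Appraiser 2 says G.
    """
    n = n_gg + n_g_ng + n_ng_g + n_ngng
    p_o = (n_gg + n_ngng) / n
    p_g = ((n_gg + n_g_ng) / n) * ((n_gg + n_ng_g) / n)
    p_ng = ((n_ng_g + n_ngng) / n) * ((n_g_ng + n_ngng) / n)
    p_e = p_g + p_ng
    return (p_o - p_e) / (1 - p_e)


# Hypothetical studies of 20 parts each.
perfect = kappa_2x2(10, 0, 0, 10)   # appraisers never disagree
chance = kappa_2x2(5, 5, 5, 5)      # agreement no better than chance
reversed_ = kappa_2x2(0, 10, 10, 0) # appraisers always disagree

print(perfect, chance, reversed_)  # prints "1.0 0.0 -1.0"
```

The last case is the kind of result that would appear if one appraiser had the G and NG definitions swapped.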

In our case, even though the inspectors agree 74.3% of the time, once chance agreement is removed, the chance-corrected agreement is only κ = 0.356, an unacceptable result in most environments.

This means the attribute inspection system needs improvement, possibly via:

Better defect acceptance standards
Boundary samples
Lighting, magnification, or fixture improvements
Inspector training

This example demonstrates why Kappa is essential for meaningful attribute gage analysis. Observed agreement alone masks underlying variability and can allow an unreliable inspection system into production.

Human classification systems are among the most variable measurement systems in manufacturing. But Cohen’s Kappa provides a rigorous, chance-corrected measure of inspection consistency and is one of the most important tools for evaluating:

Visual inspection
Go/no-go attribute checks
Defect classification
Assembly verification
NDT/NDE categorical decisions

For manufacturing quality engineers and process engineers, understanding and applying Kappa is essential for ensuring reliable attribute gage performance and preventing costly inspection errors.


Ray Harkins is the General Manager of Lexington Technologies in Lexington, North Carolina. He earned his Master of Science from Rochester Institute of Technology and his Master of Business Administration from Youngstown State University. He has also taught quality-related skills to over 100,000 students through courses such as Gage R&R Simplified: Essential Tools for Quality Engineers, Quality Engineering Statistics, and Root Cause Analysis and the 8D Corrective Action Process on the online learning platform Udemy. He can be reached via LinkedIn at linkedin.com/in/ray-harkins or by email at [email protected]. www.TheManufacturingAcademy.com.