The Importance of Hypothesis Testing in Quality Management
In modern manufacturing plants, hypothesis testing is still seldom taken seriously; many regard it as merely a matter of theory. Its application in quality management nevertheless deserves to be promoted. Both parametric tests (t-test and z-test) and nonparametric tests (sign test and Wilcoxon rank-sum test) are appropriate for use in a manufacturing environment.
Data collection establishes the foundation for appraising the quality of a product or service. But without correct data processing, it becomes challenging to draw an objective conclusion. Sometimes, observations are wrongly interpreted.
For instance, suppose that the fallout rates of samples drawn from two different groups are 15% and 10%, respectively. Concluding from these figures alone that one group is better than the other would be a one-sided judgment, because the difference may be due to sampling error. In such cases, hypothesis testing helps explain what is observed. Unfortunately, in many manufacturing facilities people tend to focus merely on descriptive statistics such as the arithmetic mean and range. Simply put, hypothesis testing is indispensable for better understanding quality data and for guiding production control. Cases of parametric and nonparametric tests are presented below.
Definition of Terms
Hypothesis testing. This is the process of using sample statistics to judge whether the data are consistent with a specific hypothesis about a population. Hypothesis tests are categorized as parametric and nonparametric. The parametric tests include the z-test, t-test, F-test and χ^{2} test. The nonparametric tests include the sign test, Wilcoxon rank-sum test, Kruskal-Wallis test and permutation test.
Parametric test. In this test, samples are taken from a population with a known distribution (typically the normal distribution), and a test of population parameters is carried out.
Nonparametric test. Also called a distribution-free test, this test does not require the population to conform to a normal distribution, nor do the population parameters need to be statistically estimated.
Application of Parametric Test
Estimation of population mean with confidence interval
To estimate the population mean, a confidence interval is introduced because the sample mean is generally not equal to the population mean. For spot checks in the manufacturing process, there are two risks of drawing a wrong conclusion: Type I and Type II risks. Type I risk (α) is the probability of rejecting qualified products (the producer's risk); Type II risk (β) is the probability of accepting nonconforming products (the customer's risk).
To illustrate the application of the confidence interval, consider the following example: Twenty-five plastic rims are randomly chosen, and the hub ID is checked by caliper and by coordinate measuring machine (CMM). See Table 1.
Both the caliper and the CMM are operated by well-trained inspectors. In the far left column of Table 1, x stands for the caliper measurement, y stands for the CMM measurement, and d stands for the difference between the caliper and CMM measurements. According to Table 1, the sample means of the caliper and CMM measurements are 12.92 and 12.90 millimeters, respectively; the ranges (difference between the smallest and largest values) are 0.14 and 0.05 millimeter, respectively.
A significance level of α=0.05 is chosen and analysis software is used to estimate the population mean.
The sample mean is a single value, whereas the estimated population mean is given as an interval within which the true mean is likely to lie. Consequently, one cannot statistically conclude that the average hub ID of all production parts checked by caliper and CMM is exactly 12.92 and 12.90 millimeters, respectively. Expressed as the following confidence interval, the estimated population mean has more practical significance because it provides a sounder basis for adjusting process parameters:
Moreover, the confidence interval also differs from the range. In an actual production process, the range is generally not used for adjusting process parameters because the smallest and largest values are often abnormal data that require root cause analysis. Note that the t-statistic is usually used when the sample size is less than 30; the z-statistic is preferable when the sample size is greater than 30.
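As a rough illustration of how such an interval is obtained, the sketch below computes a t-based 95% confidence interval for the caliper mean. The sample size (25) and sample mean (12.92 millimeters) come from the example; the standard deviation is a hypothetical value chosen only for illustration, since the underlying measurements are not reproduced here, and the function name is likewise an assumption.

```python
import math


def mean_confidence_interval(sample_mean, sample_std, n, t_critical):
    """Return the two-sided t-based confidence interval for a population mean."""
    margin = t_critical * sample_std / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# From the example: 25 rims, caliper sample mean of 12.92 mm.
N = 25
CALIPER_MEAN = 12.92
# ASSUMPTION: hypothetical sample standard deviation, not taken from the article.
ASSUMED_STD = 0.03
# Two-sided 95% critical value of Student's t with n - 1 = 24 degrees of freedom.
T_CRIT_24 = 2.064

lower, upper = mean_confidence_interval(CALIPER_MEAN, ASSUMED_STD, N, T_CRIT_24)
print(f"95% confidence interval for mean hub ID: {lower:.3f} to {upper:.3f} mm")
```

With a different standard deviation the interval widens or narrows accordingly, which is why the interval, rather than the single sample mean, is the more useful basis for process adjustment.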
Paired-sample t-test used to determine disparity of measurement systems (caliper vs. CMM)
Table 1 shows paired-sample data obtained with two different measurement devices. The paired-sample t-test is based on the paired differences between measurement values, and the null hypothesis is normally that the mean difference is zero. The measurement values themselves are not the objects of the paired-sample t-test. To keep this article short, the process of setting the null and alternative hypotheses is omitted, and only the formula of the t-statistic is presented below:
Here, d stands for the average difference between the caliper and CMM measurements, and “s” means the standard deviation of the 25 paired differences.
A significance level of α=0.05 is chosen and analysis software is used, finding that the measurement values by caliper and CMM are significantly different. As a matter of fact, the instrument errors of the caliper and the CMM are 0.02 and 0.001 millimeter, respectively. Additionally, when using a caliper to measure the hub ID of the plastic rim, two factors are worth considering: the elastic nature of the plastic product and the different force applied through the caliper legs by different operators.
As a result, the reading accuracy of the caliper is questionable. Although the CMM is a highly precise measurement system, its application in mass production is not cost-effective. In view of that, a go/no-go gage commonly replaces the CMM for checking hub ID in mass production.
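For readers who want to see the mechanics of the paired-sample t-test, the following is a minimal sketch that assumes the usual form of the statistic: the mean of the paired differences divided by their standard deviation over the square root of the number of pairs. The six measurement pairs are hypothetical stand-ins for the 25 pairs of the example, and the function name is an assumption.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """t-statistic for paired samples: mean difference over its standard error."""
    differences = [a - b for a, b in zip(x, y)]
    n = len(differences)
    d_bar = statistics.mean(differences)
    s_d = statistics.stdev(differences)  # sample standard deviation (n - 1)
    return d_bar / (s_d / math.sqrt(n))


# ASSUMPTION: hypothetical readings in millimeters, not the article's Table 1.
caliper = [12.93, 12.90, 12.95, 12.91, 12.94, 12.92]
cmm = [12.90, 12.89, 12.91, 12.90, 12.91, 12.90]

t_value = paired_t_statistic(caliper, cmm)
# Two-sided 95% critical value of Student's t with 6 - 1 = 5 degrees of freedom.
T_CRIT_5 = 2.571
significant = abs(t_value) > T_CRIT_5
print(f"t = {t_value:.2f}, significant at 0.05: {significant}")
```

For the 25 pairs of the example, the comparison would instead use the critical value for 24 degrees of freedom.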
Application of two-sample proportion z-test in evaluating different groups
Certain conditions for applying the two-sample proportion z-test should be met: (1) the sample size is greater than 30; (2) the population follows a normal distribution; and (3) the two groups are independent. To keep the article short, the process of setting the null and alternative hypotheses is omitted, and only the formula of the z-statistic is presented below:
Here, n_{1} and n_{2} are the sample sizes of the two groups; x_{1} and x_{2} stand for the number of units with a specific characteristic, such as the number of acceptable samples.
Here is an example from a factory. Two machines (A and B) are used to machine the hub ID of a type of product. Table 2 below lists 35 samples processed by each of the two machines. The measurement error can be neglected, and the specification of hub ID is 28.20 to 28.50 millimeters.
According to Table 2, the numbers of nonconforming samples machined by machines A and B are 10 and 6, respectively, so the fallout rate of samples is 28.6% for machine A and 17.1% for machine B. Does this show that machine B performs better than machine A? The answer is uncertain because sampling error exists even when samples are chosen randomly.
A significance level of α=0.05 is chosen and analysis software is used, finding that the fallout rates of products processed by machine A and machine B do not differ significantly. In actual production, however, the root cause of variation in measurement values should still be analyzed. Perhaps an unskilled operator caused such a high percentage of poor parts, or perhaps the machine is not precise enough because of a lack of maintenance. In a nutshell, the quality performance of the machines should be monitored carefully.
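The machine A versus machine B comparison can be reproduced with a short sketch. It assumes the common pooled form of the two-sample proportion z-statistic and a two-sided test; the counts (10 and 6 nonconforming out of 35 each) are those of the example, while the function name is an assumption.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample proportion z-statistic and two-sided p-value."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Nonconforming counts from the example: machine A 10 of 35, machine B 6 of 35.
z_value, p_value = two_proportion_z(10, 35, 6, 35)
print(f"z = {z_value:.3f}, p-value = {p_value:.3f}")
```

Because the p-value is well above 0.05, the apparent gap between 28.6% and 17.1% is not statistically significant at this sample size, in line with the software result reported above.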
Application of one-sample proportion z-test in verifying AQL of sampling plan
As previously stated, a sampling plan needs to take into account Type I and Type II risks. In order to reduce the cost of non-value-added spot checks and to assure product quality, companies have been striving to find the most appropriate acceptable quality level (AQL). Currently, many producers have adopted MIL-STD-105E, the military standard sampling procedure for inspection by attributes. Of course, the determination of AQL varies with the kind of product and customer requirements; nevertheless, AQL is sometimes selected inappropriately. An example of the one-sample proportion z-test is presented below.
Supplier X produces crankshafts that will be assembled in a customer’s factory. For almost every lot of product, however, some crankshafts do not fit well with the gears assembled within them at the customer’s worksite, causing the fallout rate to remain at 2.3% to 2.8%. Investigation shows that in the final inspection supplier X spot-checks the OD of the crankshaft as per the MIL-STD-105E Sampling Plan for Normal Inspection (SPNI), and the AQL selected is 1.5%. Assume that the population size is 7,200 and look up the MIL-STD-105E SPNI. The sample size is 200; at AQL=1.5%, the acceptable quantity of nonconforming parts out of 200 samples is no greater than seven. Is AQL=1.5% correct for the purpose of keeping the defect rate within 1.5%? A proportion z-test is constructed first.
Set the null hypothesis H_{0}: p ≥ 98.5%; alternative hypothesis H_{1}: p < 98.5%
(Above, p stands for the qualification rate of the population.)
Using analysis software, one finds that the null hypothesis is rejected. In other words, the qualification rate of the population is less than 98.5% when AQL=1.5%, and the reliability of the conclusion is 95%. If seven nonconforming units are detected out of 200 samples, the estimated fallout rate of the 7,200 units is greater than 1.5%. Likewise, if six bad parts are found, the one-sample proportion z-test shows that the estimated fallout rate of the population is also greater than 1.5%. However, if five bad parts are found, the estimated fallout rate of the population is not significantly greater than 1.5%. Obviously, there is a chance that bad parts are still shipped out of the factory when AQL=1.5 is adopted, depending on how many bad parts are found in the spot check and how inspection results are dispositioned. Further, if AQL=0.65 and the sample size is still 200, the one-sample proportion z-test demonstrates that the estimated fallout rate of the population does not exceed 1.5%. In this sense, the fallout rate of 2.3% to 2.8% for supplier X is not surprising statistically. In fact, the MIL-STD-105E Sampling Plan for Tightened Inspection specifies that at AQL=1.5% the acceptable quantity of nonconforming parts out of 200 samples is no greater than five.
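The cases of five, six and seven bad parts can be checked with a short sketch. It assumes the standard one-sample proportion z-statistic, a one-sided test at α=0.05, and works with the fallout rate (1.5%) rather than the equivalent qualification rate (98.5%); the function name is an assumption.

```python
import math


def one_proportion_z(defects, n, p0):
    """z-statistic for testing whether the fallout rate exceeds p0."""
    p_hat = defects / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


SAMPLE_SIZE = 200
TARGET_FALLOUT = 0.015
# One-sided critical value of the standard normal distribution at alpha = 0.05.
Z_CRIT = 1.645

results = {k: one_proportion_z(k, SAMPLE_SIZE, TARGET_FALLOUT) for k in (5, 6, 7)}
for defects, z in results.items():
    verdict = "greater than 1.5%" if z > Z_CRIT else "not shown to exceed 1.5%"
    print(f"{defects} bad parts: z = {z:.3f}, fallout rate {verdict}")
```

Six and seven bad parts give z-values above the critical value, while five does not, which matches the three outcomes described above.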
Therefore, if supplier X wants to ensure that the fallout rate of crankshafts stays within 1.5%, AQL=1.5 should not be adopted; instead, AQL=0.65 is recommended when adopting the MIL-STD-105E SPNI. Note that MIL-STD-105E states the condition for switching the sampling procedure from normal to tightened:
“When normal inspection is in effect, tightened inspection shall be instituted when 2 out of 2, 3, 4 or 5 consecutive lots or batches have been rejected on original inspection (i.e., ignoring resubmitted lots or batches for this procedure).”
In common practice, for unimportant dimensions and undemanding appearance features, the AQL can be relaxed somewhat, depending on the requirements of the customer and the process capability of the manufacturer.
All in all, when it comes to critical dimensions that could bring about a high rejection rate and considerable cost, the one-sample proportion z-test is a useful tool for verifying the appropriateness of the selected AQL.
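To see why a lot with a fallout rate in the reported 2.3% to 2.8% range can still pass final inspection, the sketch below estimates the probability of accepting a lot under the two acceptance numbers quoted in the crankshaft example (seven for normal inspection, five for tightened inspection). It rests on two assumptions: a true fallout rate of 2.5%, picked from within the reported range, and a binomial model for the number of bad parts in the sample, which only approximates sampling without replacement from a lot of 7,200. The function name is also an assumption.

```python
import math


def acceptance_probability(n, accept_number, fallout_rate):
    """Probability that a sample of n contains at most accept_number bad parts."""
    return sum(
        math.comb(n, k) * fallout_rate**k * (1 - fallout_rate) ** (n - k)
        for k in range(accept_number + 1)
    )


SAMPLE_SIZE = 200
# ASSUMPTION: true fallout rate chosen inside the reported 2.3% to 2.8% range.
ASSUMED_FALLOUT = 0.025

p_accept_normal = acceptance_probability(SAMPLE_SIZE, 7, ASSUMED_FALLOUT)
p_accept_tightened = acceptance_probability(SAMPLE_SIZE, 5, ASSUMED_FALLOUT)
print(f"Accept up to 7 bad parts: P(accept) = {p_accept_normal:.2f}")
print(f"Accept up to 5 bad parts: P(accept) = {p_accept_tightened:.2f}")
```

Under these assumptions most such lots are accepted when up to seven bad parts are allowed, and noticeably fewer when the limit is five.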
Application of Nonparametric Test
Application of sign test in assessing measurement under different conditions
Sign test, the simplest nonparametric test, involves the use of matched pairs. The sign test is usually used in place of the one-sample t-test when the normality assumption is questionable. Yet the sign test is a less powerful alternative to the Wilcoxon rank-sum test. An example of the sign test follows.
In a workshop, different inspectors have different skills. To better assess the inspectors’ skill level, an experiment is performed by asking two inspectors (Tom and John) to measure the hub ID of 15 samples with the same caliper. Results are shown in Table 3.
Note: (1) If the difference of two measurement values is positive, use the sign +; if negative, use the sign −; if equal, use the numeral 0; (2) n^{+} and n^{−} represent the total number of + signs and − signs, respectively.
A significance level of α=0.05 is chosen and analysis software is used. The result indicates that the operational skill of Tom differs significantly from that of John, and the reliability of the conclusion is 95%. In this case, further investigation into Tom and John is necessary, because perhaps one of them has a problem using the caliper properly, or possibly they measured different locations of the hub where roundness differs.
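The calculation behind a sign test is simple enough to sketch directly. Under the null hypothesis, plus and minus signs are equally likely, so the smaller of the two counts follows a binomial distribution with probability 0.5 once ties are dropped. The counts below are hypothetical (12 plus signs, 2 minus signs and 1 tie among 15 samples) and do not come from the inspectors' data; the function name is an assumption.

```python
import math


def sign_test_p_value(n_plus, n_minus):
    """Two-sided exact sign test p-value; ties are excluded before calling."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    # Probability of k or fewer of the rarer sign when both signs are equally likely.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# ASSUMPTION: hypothetical counts for 15 samples (one tie dropped).
N_PLUS = 12
N_MINUS = 2

p_value = sign_test_p_value(N_PLUS, N_MINUS)
print(f"p-value = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

With counts this lopsided the p-value falls below 0.05, so the two sets of readings would be judged significantly different.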
Application of Wilcoxon rank-sum test in assessing quality difference of products
Compared with the sign test, the Wilcoxon rank-sum test (also commonly called the Mann-Whitney test) is more widely applied in mass production to test the equivalence of two independent populations because the test does not require paired data. Below is the procedure of the Wilcoxon rank-sum test for two groups:
1. Pool the data of group A and group B and sort them into ascending order; assign each value a rank such as 1, 2 and 3.
2. Compute the sum of ranks (T) for the group with the smaller group size.
3. Set the symbols n_{1} and n_{2}. If group size n_{A} < n_{B}, then n_{1}= n_{A} and n_{2}= n_{B}, and vice versa.
4. Choose a significance level α and look up T_{1} and T_{2} in the rank-sum table.
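5. If T_{1} < T < T_{2}, the difference between the two groups is not significant; otherwise, the difference is significant.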
Below is a worked example of the Wilcoxon rank-sum test. In a factory, the better of two candidate suppliers of plastic switches is to be selected. The selection criterion is that the pull strength of the switch must not be less than 700 newtons. Table 4 compares samples from supplier A and supplier B.
The average strength of samples from supplier A is 740.0 newtons, from supplier B 729.3 newtons. The fallout rate of samples from supplier A is 42.9% and that from supplier B is 40%.
Assign a rank to each measured value. See Table 5.
Because n_{A}=7 < n_{B}=12, n_{1}=n_{A}=7 and n_{2}=n_{B}=12.
Compute the sum of ranks for the group of size n_{1}: T=2+4+6+11+13+14+(15+16)/2=65.5
Note: Both suppliers have a sample with the same value of 868, so each of these is assigned the average rank (15+16)/2.
A significance level of α=0.05 is chosen and the Wilcoxon rank-sum table is used: T_{1}=45, T_{2}=81.
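T_{1} < T < T_{2} (45 < 65.5 < 81), so the pull strength of switches from supplier A and supplier B does not differ significantly.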
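The ranking and rank-sum steps of the procedure can be sketched in a few lines. Tied values receive the average of the ranks they occupy, as in the note on the value 868. The pull-strength values below are hypothetical, and the decision rule assumes the usual reading of a rank-sum table, in which a rank sum strictly between the two tabulated limits means no significant difference; the function names are assumptions. The final check uses the figures of the supplier example (T=65.5, T_{1}=45, T_{2}=81).

```python
def rank_sum(smaller_group, larger_group):
    """Sum of ranks of smaller_group within the pooled, sorted data (ties averaged)."""
    pooled = sorted(smaller_group + larger_group)
    average_rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Values at positions i..j-1 occupy ranks i+1..j; assign their average.
        average_rank[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(average_rank[value] for value in smaller_group)


def not_significant(t, t_lower, t_upper):
    """ASSUMED decision rule: no significant difference if t lies between the limits."""
    return t_lower < t < t_upper


# ASSUMPTION: hypothetical pull strengths in newtons, not the article's Table 4.
group_a = [690, 720, 868]
group_b = [680, 700, 750, 868, 900]
t_small = rank_sum(group_a, group_b)
print(f"Rank sum of the smaller hypothetical group: {t_small}")

# Figures from the supplier example: T = 65.5, T1 = 45, T2 = 81.
print(f"Supplier example not significant: {not_significant(65.5, 45, 81)}")
```

For real data, the limits must be looked up for the actual group sizes and significance level; the hypothetical groups here are too small to be compared against the limits of the supplier example.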
Undeniably, the deficiency of the nonparametric test lies in its incomplete use of the information in the observations, such as the sample mean, median and range. Nor can the nonparametric test be used to examine the interaction of multiple factors.
The above examples of hypothesis testing substantiate that the use of only descriptive statistics, such as the arithmetic mean, sum and range, fails to provide a panoramic view of product or service quality. Also, a difference in the fallout rates of samples from different populations cannot guarantee that one population is superior to the other. Further, hypothesis testing helps verify whether the selected AQL is appropriate in the spot-checking of critical dimensions.
Tech Tips
Parametric tests (t-test and z-test) and nonparametric tests (sign test and Wilcoxon rank-sum test) are appropriate for use in a manufacturing environment.
Application of hypothesis testing will allow manufacturers to better understand quality data and provide guidance to production control.
Hypothesis testing substantiates that the use of only a descriptive statistic, such as arithmetic mean, sum and range, fails to provide a panoramic view of product or service quality.