In modern manufacturing plants, hypothesis testing is still seldom given much importance; many people regard it as merely a matter of theory. Its application in quality management nevertheless deserves to be promoted. Both parametric tests (t-test and z-test) and nonparametric tests (sign test and Wilcoxon rank-sum test) are appropriate for use in a manufacturing environment.

Data collection lays the foundation for appraising the quality of a product or service. Without correct data processing, however, it is difficult to draw an objective conclusion, and observations are sometimes interpreted wrongly.

For instance, suppose that the fallout rates of samples drawn from two different groups are 15% and 10%, respectively. It would be a hasty judgment to say that one group is better than the other. In such a case, hypothesis testing is instrumental in explaining what is observed. Unfortunately, in many manufacturing facilities people tend to focus only on descriptive statistics such as the arithmetic mean and range. Simply put, hypothesis testing is indispensable for understanding quality data better and for guiding production control. Cases of parametric and nonparametric tests are presented below.

## Definition of Terms

Hypothesis testing. This is the process of using statistics to decide whether sample data provide enough evidence to reject a specific hypothesis about a population. Hypothesis tests are categorized as parametric and nonparametric. Parametric tests include the z-test, t-test, F-test and χ² (chi-square) test. Nonparametric tests include the sign test, Wilcoxon rank-sum test, Kruskal-Wallis test and permutation test.

Parametric test. In this test, samples are assumed to come from a population with a known distribution (typically the normal distribution), and a test of population parameters is carried out.

Nonparametric test. Also called a distribution-free test, this test does not require the population to conform to a normal distribution, nor do the population parameters need to be statistically estimated.

Table 1. Hub ID of 25 samples: caliper vs. CMM (unit: millimeter)

## Application of Parametric Test

### Estimation of population mean with confidence interval

To estimate the population mean, a confidence interval is introduced because the sample mean is generally not equal to the population mean. For spot checks in the manufacturing process, there are two risks of drawing a wrong conclusion: Type I and Type II. Type I risk (α) is the probability of rejecting qualified products (the producer's risk); Type II risk (β) is the probability of accepting nonconforming products (the customer's risk).
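The two risks can be made concrete with a small sketch. It assumes a simple binomial model for the number of nonconforming units in a sample, and it borrows the plan from the crankshaft example later in this article (200 samples, accept the lot when no more than seven are nonconforming). The 2.8% level used for the customer's risk is an illustrative assumption, and the function name `prob_accept` is hypothetical.

```python
from math import comb


def prob_accept(n: int, accept_max: int, p_defective: float) -> float:
    """P(lot accepted) = P(at most accept_max nonconforming units among n samples)."""
    return sum(
        comb(n, k) * p_defective**k * (1 - p_defective) ** (n - k)
        for k in range(accept_max + 1)
    )


# Plan taken from the crankshaft example: n = 200, accept if <= 7 nonconforming.
n, accept_max = 200, 7

# Type I (producer's) risk: a lot running at 1.5% nonconforming is rejected.
alpha = 1 - prob_accept(n, accept_max, 0.015)

# Type II (customer's) risk: a lot running at 2.8% nonconforming is accepted.
# The 2.8% level is an assumption chosen for illustration.
beta = prob_accept(n, accept_max, 0.028)

print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
```

Under these assumptions the producer's risk is small while the customer's risk is large, which is the trade-off a sampling plan has to balance.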

To illustrate the application of the confidence interval, consider the following example: Twenty-five plastic rims are randomly chosen, and the hub ID of each is checked by caliper and by coordinate measuring machine (CMM); see table 1. Both caliper and CMM are operated by well-trained inspectors. In the far-left column of the table, x stands for the caliper measurement, y for the CMM measurement and d for the difference between the two. According to table 1, the sample mean of caliper measurement and CMM measurement is 12.92 and 12.90 millimeters, respectively; the range (difference between the smallest and largest values) of caliper measurement and CMM measurement is 0.14 and 0.05 millimeter, respectively.

A significance level of α=0.05 is chosen and analysis software is used to estimate the population mean. The estimate differs in kind from the sample mean: it is an interval that is likely to contain the population mean rather than a single value. Consequently, it cannot be statistically concluded that the average hub ID of all production parts checked by caliper and CMM is exactly 12.92 and 12.90 millimeters, respectively. Expressed as a confidence interval, the estimated population mean has more practical significance because it assists in adjusting process parameters. The confidence interval is also distinct from the range. In an actual production process, the range is generally not used for adjusting process parameters because the smallest and largest values are often abnormal data that require root cause analysis. Note that the t-statistic is usually used when the sample size is less than 30 and the population standard deviation is unknown; the z-statistic is preferable when the sample size is greater than 30.
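As a minimal sketch of how such an interval is obtained, the snippet below computes a two-sided confidence interval for the mean with the t-statistic. The readings are hypothetical (they are not the data of table 1), the function name is an assumption, and the critical value 2.365 is the standard t-table value for α=0.05 (two-sided) with 7 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def mean_confidence_interval(values, t_crit):
    """Two-sided CI for the population mean: x_bar +/- t_crit * s / sqrt(n)."""
    n = len(values)
    x_bar = mean(values)
    half_width = t_crit * stdev(values) / sqrt(n)  # stdev() is the sample std. dev.
    return x_bar - half_width, x_bar + half_width


# Hypothetical hub ID readings in millimeters (NOT the data of table 1).
readings = [12.90, 12.93, 12.91, 12.95, 12.92, 12.89, 12.94, 12.92]

# t critical value for alpha = 0.05 (two-sided), n - 1 = 7 degrees of freedom,
# read from a standard t table.
T_CRIT_DF7 = 2.365

low, high = mean_confidence_interval(readings, T_CRIT_DF7)
print(f"sample mean = {mean(readings):.3f} mm, 95% CI = ({low:.3f}, {high:.3f}) mm")
```

The interval, not the single sample mean, is what would be compared against the process target when deciding whether to adjust process parameters.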

### Paired sample t-test used to determine disparity of measurement systems (caliper vs. CMM)

Table 1 shows paired-sample data obtained with two different measurement devices. The paired sample t-test is based on the paired differences between measurement values, and the null hypothesis is normally that the mean difference is zero. The measurement values themselves are not the objects of the paired sample t-test. To keep this article short, the process of setting the null hypothesis and alternative hypothesis is omitted, and only the formula of the t-statistic is presented below:

$$t = \frac{\bar{d}}{s / \sqrt{n}}$$

Here, $\bar{d}$ stands for the average difference between caliper measurement and CMM measurement, s means the standard deviation of the paired differences, and n = 25 is the number of pairs.

A significance level of α=0.05 is chosen and analysis software is used; the result is that the measurement values from the caliper and the CMM are significantly different. In fact, the instrument error of the caliper and the CMM is 0.02 and 0.001 millimeter, respectively. Additionally, when a caliper is used to measure the hub ID of the plastic rim, two factors are worth considering: the elastic nature of the plastic product and the different force applied through the caliper legs by different operators.

As a result, the reading accuracy of the caliper is questionable. Although the CMM is a highly precise measurement system, its application in mass production is not cost-effective. In view of that, a go/no-go gage commonly replaces the CMM for checking hub ID in mass production.
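The paired comparison can be sketched in a few lines. The paired readings below are hypothetical stand-ins for table 1, the function name is an assumption, and the decision uses the standard t-table critical value for α=0.05 (two-sided) with 7 degrees of freedom rather than a software-computed p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(x, y):
    """t = mean(d) / (s_d / sqrt(n)), where d are the pairwise differences x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Hypothetical paired hub ID readings in millimeters (NOT the data of table 1).
caliper = [12.93, 12.95, 12.90, 12.96, 12.92, 12.94, 12.91, 12.95]
cmm = [12.90, 12.91, 12.89, 12.92, 12.90, 12.91, 12.90, 12.91]

t_stat = paired_t_statistic(caliper, cmm)

# Critical value for alpha = 0.05 (two-sided), n - 1 = 7 degrees of freedom.
T_CRIT_DF7 = 2.365
significant = abs(t_stat) > T_CRIT_DF7

print(f"t = {t_stat:.2f}, significantly different: {significant}")
```

Note that the test operates on the differences only; the individual caliper and CMM readings never enter the statistic directly.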

### Application of two-sample proportion z-test in evaluating different groups

Certain conditions should be met before the two-sample proportion z-test is applied: (1) the sample size of each group is greater than 30; (2) the sample proportions are approximately normally distributed, which requires enough conforming and nonconforming units in each group; and (3) the two groups are independent. To keep the article short, the process of setting the null hypothesis and alternative hypothesis is omitted, and only the formula of the z-statistic is presented below:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad \hat{p}_1 = \frac{x_1}{n_1},\quad \hat{p}_2 = \frac{x_2}{n_2},\quad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

Here, n1 and n2 are the sample sizes of the two groups; x1 and x2 stand for the number of units with a specific characteristic, such as the number of acceptable samples.

Here is an example from a factory. Two machines (A and B) are used to machine the hub ID of a type of product. Table 2 lists 35 samples processed by each of the two machines. The measurement error can be neglected, and the specification of hub ID is 28.20 to 28.50 millimeters.

Table 2. Machine A&B comparisons (unit: millimeter)

From table 2, in which the quantity of nonconforming samples machined by machine A and machine B is 10 and 6, respectively, the fallout rate of samples is 28.6% for machine A and 17.1% for machine B. Does this show that machine B performs better than machine A? The answer is uncertain because sampling error exists even when the samples are chosen randomly.

A significance level of α=0.05 is chosen and analysis software is used, finding that the fallout rates of products processed by machine A and machine B do not significantly differ from each other. In actual production, however, the root cause of the variation in measurement values should still be analyzed. Perhaps an unskilled operator produced such a high percentage of poor parts, or perhaps the machine is not precise enough because of a lack of maintenance. In short, the quality performance of the machines should be monitored carefully.
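The comparison of machines A and B can be reproduced with a short sketch using the counts quoted above (10 and 6 nonconforming units out of 35 each). It assumes the pooled form of the two-sample proportion z-statistic, which is the common textbook form; which form the analysis software applied is not stated. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    std_err = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Nonconforming counts from the example: machine A 10 of 35, machine B 6 of 35.
z, p_value = two_proportion_z_test(10, 35, 6, 35)

verdict = "significantly different" if p_value < 0.05 else "not significantly different"
print(f"z = {z:.2f}, p-value = {p_value:.3f}: fallout rates are {verdict}")
```

Here x1 and x2 count nonconforming units; counting acceptable units instead only flips the sign of z and leaves the two-sided conclusion unchanged.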

### Application of one-sample proportion z-test in verifying AQL of sampling plan

As previously stated, a sampling plan needs to take Type I and Type II risks into account. To reduce the cost of non-value-added spot checks while assuring product quality, companies strive to find the most appropriate acceptable quality level (AQL). Currently, many producers have adopted MIL-STD-105E, the military standard sampling procedure for inspection by attributes. Of course, the choice of AQL varies with the kind of product and the customer requirements; nevertheless, the AQL is sometimes selected inappropriately. An example of the one-sample proportion z-test is presented below.

Supplier X produces crankshafts that are assembled in a customer’s factory. In almost every lot, however, some crankshafts do not fit well with the gears assembled within them at the customer’s worksite, keeping the fallout rate at 2.3% to 2.8%. Investigation shows that in the final inspection, supplier X spot-checks the OD of the crankshaft as per the MIL-STD-105E Sampling Plan for Normal Inspection (SPNI), and the AQL selected is 1.5%. Assume that the population size is 7,200 and look up the MIL-STD-105E SPNI: the sample size is 200, and at AQL=1.5% the lot is accepted when no more than seven nonconforming parts are found among the 200 samples. Is AQL=1.5% correct for the purpose of keeping the defect rate within 1.5%? A proportion z-test is constructed first.

Set the null hypothesis H0: p ≥ 98.5%; alternative hypothesis H1: p < 98.5%
(Here p stands for the qualification rate of the population.)

Using analysis software, one finds that the null hypothesis is rejected when seven nonconforming units are detected among the 200 samples. In other words, at the 95% confidence level the qualification rate of the population is less than 98.5%, so the fallout rate of the 7,200 units is significantly greater than 1.5% even though the lot is accepted at AQL=1.5%. Likewise, if six bad parts are found, the one-sample proportion z-test indicates that the estimated fallout rate of the population is also greater than 1.5%. However, if five bad parts are found, the estimated fallout rate of the population is not significantly greater than 1.5%. Clearly, there is a chance that bad parts are still shipped out of the factory when AQL=1.5% is adopted, depending on how many bad parts are found in the spot check and how the inspection results are dispositioned. Further, if AQL=0.65% is adopted and the sample size is still 200, the one-sample proportion z-test shows that the estimated fallout rate of the population does not exceed 1.5%. In this sense, the fallout rate of 2.3% to 2.8% for supplier X is not statistically surprising. In fact, the MIL-STD-105E Sampling Plan for Tightened Inspection specifies that at AQL=1.5% the acceptable quantity of nonconforming parts out of 200 samples is not greater than five.
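The three cases discussed above (seven, six and five nonconforming parts out of 200) can be checked with a small sketch of the lower-tailed one-sample proportion z-test. It assumes the usual normal approximation with the standard error evaluated at the hypothesized qualification rate of 98.5%; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(defectives, n, p0_qualified=0.985):
    """Lower-tailed test of H0: p >= p0 against H1: p < p0 (p = qualification rate)."""
    p_hat = (n - defectives) / n
    std_err = sqrt(p0_qualified * (1 - p0_qualified) / n)
    z = (p_hat - p0_qualified) / std_err
    p_value = NormalDist().cdf(z)  # probability in the lower tail
    return z, p_value


for defectives in (7, 6, 5):
    z, p_value = one_proportion_z_test(defectives, 200)
    verdict = "reject H0" if p_value < 0.05 else "do not reject H0"
    print(f"{defectives} nonconforming of 200: z = {z:.2f}, p-value = {p_value:.3f}, {verdict}")
```

With α=0.05, the sketch rejects the null hypothesis for seven and six nonconforming parts and does not reject it for five, in line with the conclusions stated above.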

Therefore, if supplier X wants to ensure that the fallout rate of crankshafts stays within 1.5%, AQL=1.5% should not be adopted; instead, AQL=0.65% is recommended when adopting MIL-STD-105E SPNI. Note that MIL-STD-105E states the condition for switching the sampling procedure from normal to tightened:

“When normal inspection is in effect, tightened inspection shall be instituted when 2 out of 2, 3, 4 or 5 consecutive lots or batches have been rejected on original inspection (i.e., ignoring resubmitted lots or batches for this procedure).”

In common practice, for unimportant dimensions and undemanding appearance features, the AQL can be relaxed, depending on the requirements of the customer and the process capability of the manufacturer.

All in all, when it comes to critical dimensions that could bring about a high disqualification rate and considerable cost, the one-sample proportion z-test is a useful tool to verify the appropriateness of the selected AQL.

Table 3. Hub ID data from Tom and John (unit: millimeter)

## Application of Nonparametric Test

### Application of sign test in assessing measurement under different conditions

The sign test, the simplest nonparametric test, involves the use of matched pairs. It is usually used in place of the one-sample t-test when the normality assumption is questionable. It is, however, generally less powerful than the Wilcoxon signed-rank test, its rank-based counterpart for paired data. An example of the sign test is demonstrated as follows.

In a workshop, different inspectors have different skills. To better assess the inspectors’ skill level, an experiment is performed in which two inspectors (Tom and John) measure the hub ID of 15 samples with the same caliper. Results are shown in table 3.

Table 4. Pull strength of switches (supplier A vs. supplier B) (unit: newton)

Note: (1) If the difference of two measurement values is positive, use the sign +; if negative, use the sign -; if equal, use the numeral 0; (2) n+ and n- represent the total number of sign + and sign -, respectively.

A significance level of α=0.05 is chosen and analysis software is used. The test indicates that the operational skill of Tom, as reflected in his measurements, differs significantly from that of John at the 95% confidence level. In this case, further investigation into Tom and John is necessary, because perhaps one of them has a problem using the caliper properly, or possibly they measured different locations on the hub where roundness differs.
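The sign test itself needs only the counts n+ and n-. The sketch below computes an exact two-sided p-value from the binomial distribution with probability 0.5; the counts used (12 plus signs, 2 minus signs and 1 tie among 15 pairs) are hypothetical and are not the data of table 3, and the function name is an assumption.

```python
from math import comb


def sign_test_p_value(n_plus, n_minus):
    """Exact two-sided sign test; pairs with zero difference are left out."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    # Probability of k or fewer of the rarer sign if + and - are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical outcome for 15 pairs: 12 plus signs, 2 minus signs, 1 tie.
p_value = sign_test_p_value(12, 2)

verdict = "significantly different" if p_value < 0.05 else "not significantly different"
print(f"p-value = {p_value:.4f}: the two inspectors are {verdict}")
```

Because only the signs are used, the magnitude of each difference is ignored, which is why the sign test is simple but comparatively weak.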

### Application of Wilcoxon rank-sum test in assessing quality difference of products

Compared with the sign test, the Wilcoxon rank-sum test (also commonly called the Mann-Whitney test) is more widely applied in mass production to test the equivalence of two independent populations because the test does not require paired data. Below is the procedure of the Wilcoxon rank-sum test for two groups:

1. Pool the data of group A and group B, sort them in ascending order and assign each value a rank (1, 2, 3 and so on).
2. Compute the sum of ranks (T) for the group with the smaller size.
3. Set the symbols n1 and n2: if group size nA < nB, then n1 = nA and n2 = nB, and vice versa.
4. Choose a significance level α and look up T1 and T2 in the rank-sum table.
5. If T1 < T < T2, then group A and group B do not significantly differ from each other.

Below is a worked example of the Wilcoxon rank-sum test. In a factory, the better of two candidate suppliers of plastic switches is to be selected. The criterion of selection is that the pull strength of the switch must not be less than 700 newtons. Table 4 compares samples from supplier A and supplier B.

Table 5. Rank Table

The average strength of samples from supplier A is 740.0 newtons, from supplier B 729.3 newtons. The fallout rate of samples from supplier A is 42.9% and that from supplier B is 40%.

Assign a rank to each measured value (see table 5).

Because nA=7 < nB=12, n1=nA=7 and n2=nB=12.

Compute the sum of ranks for n1: T=2+4+6+11+13+14+(15+16)/2=65.5

Note: Both suppliers have a sample with the same value of 868, so the rank of that value is the average (15+16)/2.

A significance level α=0.05 is chosen and Wilcoxon rank-sum table is used: T1=45, T2=81.

T1 < T < T2 (45 < 65.5 < 81) signifies that the pull strength of switches from supplier A and supplier B is not significantly different at the 95% confidence level. Nevertheless, if judgment were made on average pull strength alone, supplier A would appear better than supplier B, while the fallout rates of the samples (42.9% vs. 40%) point slightly the other way. In this sense, the Wilcoxon rank-sum test theoretically lends support to the practice of selecting different suppliers to provide the same products.
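The ranking and decision steps can be sketched as follows. The pull-strength values are hypothetical (they are not the data of table 4) and the function names are assumptions; the decision rule is then applied to the rank sum and table values quoted in this example (T=65.5, T1=45, T2=81).

```python
def rank_sum(small_group, large_group):
    """Sum of ranks (T) of the smaller group; tied values share the average rank."""
    pooled = sorted(small_group + large_group)
    midrank = {}
    for value in set(pooled):
        first = pooled.index(value) + 1         # 1-based rank of first occurrence
        last = first + pooled.count(value) - 1  # 1-based rank of last occurrence
        midrank[value] = (first + last) / 2
    return sum(midrank[value] for value in small_group)


def differs_significantly(t_sum, t_lower, t_upper):
    """Groups differ unless the rank sum lies strictly between the table values."""
    return not (t_lower < t_sum < t_upper)


# Hypothetical pull strengths in newtons (NOT the data of table 4).
supplier_a = [650, 700, 868, 720, 910]
supplier_b = [640, 868, 705, 690, 930, 880, 760]
t_demo = rank_sum(supplier_a, supplier_b)
print(f"rank sum of the smaller (hypothetical) group: {t_demo}")

# Decision for the example in this article, using its own figures.
t_article = 2 + 4 + 6 + 11 + 13 + 14 + (15 + 16) / 2
print(f"T = {t_article}, significant: {differs_significantly(t_article, 45, 81)}")
```

The tie handling mirrors the note above: the two samples that share a value each receive the average of the ranks they occupy.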

Undeniably, the deficiency of the nonparametric test lies in its incomplete use of the information in the observations, such as the sample mean, median and range. Nor can the nonparametric test be used to examine the interaction of multiple factors.

The above examples of hypothesis testing substantiate that the use of descriptive statistics alone, such as arithmetic mean, sum and range, fails to provide a panoramic view of product or service quality. Also, a difference in the fallout rates of samples from different populations cannot guarantee that one population is superior to the other. Further, hypothesis testing helps verify whether the selected AQL is appropriate in the spot-checking of critical dimensions.

## Tech Tips

• Parametric tests (t-test and z-test) and nonparametric tests (sign test and Wilcoxon rank-sum test) are appropriate for use in a manufacturing environment.

• Application of hypothesis testing will allow manufacturers to better understand quality data and provide guidance to production control.

• Hypothesis testing substantiates that the use of only a descriptive statistic, such as arithmetic mean, sum and range, fails to provide a panoramic view of product or service quality.