Tests with Matched Samples

Mann Whitney U Test (Wilcoxon Rank Sum Test)

The modules on hypothesis testing presented techniques for testing the equality of means in two independent samples. An underlying assumption for appropriate use of the tests described was that the continuous outcome was approximately normally distributed or that the samples were sufficiently large (usually n₁> 30 and n₂> 30) to justify their use based on the Central Limit Theorem. When comparing two independent samples when the outcome is not normally distributed and the samples are small, a nonparametric test is appropriate.

A popular nonparametric test to compare outcomes between two independent groups is the Mann Whitney U test. The Mann Whitney U test, sometimes called the Mann Whitney Wilcoxon Test or the Wilcoxon Rank Sum Test, is used to test whether two samples are likely to derive from the same population (i.e., that the two populations have the same shape). Some investigators interpret this test as comparing the medians between the two populations. Recall that the parametric test compares the means (H₀: μ₁=μ₂) between independent groups.

In contrast, the null and two-sided research hypotheses for the nonparametric test are stated as follows:

H₀: The two populations are equal versus

H₁: The two populations are not equal.

This test is often performed as a two-sided test and, thus, the research hypothesis indicates that the populations are not equal as opposed to specifying directionality. A one-sided research hypothesis is used if interest lies in detecting a positive or negative shift in one population as compared to the other. The procedure for the test involves pooling the observations from the two samples into one combined sample, keeping track of which sample each observation comes from, and then ranking lowest to highest from 1 to n₁+n₂, respectively.

Example:

Consider a Phase II clinical trial designed to investigate the effectiveness of a new drug to reduce symptoms of asthma in children. A total of n=10 participants are randomized to receive either the new drug or a placebo. Participants are asked to record the number of episodes of shortness of breath over a 1 week period following receipt of the assigned treatment. The data are shown below.

Placebo	7	5	6	4	12
New Drug	3	6	4	2	1

Is there a difference in the number of episodes of shortness of breath over a 1 week period in participants receiving the new drug as compared to those receiving the placebo? By inspection, it appears that participants receiving the placebo have more episodes of shortness of breath, but is this statistically significant?

In this example, the outcome is a count and in this sample the data do not follow a normal distribution.

[Figure: Frequency Histogram of Number of Episodes of Shortness of Breath]

In addition, the sample size is small (n₁=n₂=5), so a nonparametric test is appropriate. The hypothesis is given below, and we run the test at the 5% level of significance (i.e., α=0.05).

H₀: The two populations are equal versus

H₁: The two populations are not equal.

Note that if the null hypothesis is true (i.e., the two populations are equal), we expect to see similar numbers of episodes of shortness of breath in each of the two treatment groups, and we would expect to see some participants reporting few episodes and some reporting more episodes in each group. This does not appear to be the case with the observed data. A test of hypothesis is needed to determine whether the observed data is evidence of a statistically significant difference in populations.

The first step is to assign ranks and to do so we order the data from smallest to largest. This is done on the combined or total sample (i.e., pooling the data from the two treatment groups (n=10)), and assigning ranks from 1 to 10, as follows. We also need to keep track of the group assignments in the total sample.

		Total Sample (Ordered Smallest to Largest)		Ranks
Placebo	New Drug	Placebo	New Drug	Placebo	New Drug
7	3		1		1
5	6		2		2
6	4		3		3
4	2	4	4	4.5	4.5
12	1	5		6
		6	6	7.5	7.5
		7		9
		12		10

Note that the lower ranks (e.g., 1, 2 and 3) are assigned to responses in the new drug group while the higher ranks (e.g., 9, 10) are assigned to responses in the placebo group. Again, the goal of the test is to determine whether the observed data support a difference in the populations of responses. Recall that in parametric tests (discussed in the modules on hypothesis testing), when comparing means between two groups, we analyzed the difference in the sample means relative to their variability and summarized the sample information in a test statistic. A similar approach is employed here. Specifically, we produce a test statistic based on the ranks.

First, we sum the ranks in each group. In the placebo group, the sum of the ranks is 37; in the new drug group, the sum of the ranks is 18. Recall that the sum of the ranks will always equal n(n+1)/2. As a check on our assignment of ranks, we have n(n+1)/2 = 10(11)/2=55 which is equal to 37+18 = 55.
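The ranking step can be sketched in Python. This is a minimal illustration rather than part of the original module, and the function name `midranks` is an assumption introduced here. Tied values receive the mean of the ranks they would otherwise occupy, which is how the two 4s and the two 6s end up with ranks 4.5 and 7.5.

```python
def midranks(values):
    """Rank values from smallest to largest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


placebo = [7, 5, 6, 4, 12]
new_drug = [3, 6, 4, 2, 1]

# Rank the pooled sample, then split the ranks back into the two groups.
ranks = midranks(placebo + new_drug)
r1 = sum(ranks[:len(placebo)])
r2 = sum(ranks[len(placebo):])
n = len(placebo) + len(new_drug)

print(r1, r2, n * (n + 1) / 2)  # 37.0 18.0 55.0
```

The last printed value is the check n(n+1)/2, which should equal the sum of the two rank sums.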

For the test, we call the placebo group 1 and the new drug group 2 (assignment of groups 1 and 2 is arbitrary). We let R₁ denote the sum of the ranks in group 1 (i.e., R₁=37), and R₂ denote the sum of the ranks in group 2 (i.e., R₂=18). If the null hypothesis is true (i.e., if the two populations are equal), we expect R₁ and R₂ to be similar. In this example, the lower values (lower ranks) are clustered in the new drug group (group 2), while the higher values (higher ranks) are clustered in the placebo group (group 1). This is suggestive, but is the observed difference in the sums of the ranks simply due to chance? To answer this we will compute a test statistic to summarize the sample information and look up the corresponding value in a probability distribution.

Test Statistic for the Mann Whitney U Test

The test statistic for the Mann Whitney U Test is denoted U and is the smaller of U₁ and U₂, defined below.

U₁ = n₁*n₂ + n₁(n₁+1)/2 - R₁

U₂ = n₁*n₂ + n₂(n₂+1)/2 - R₂

where R₁ = sum of the ranks for group 1 and R₂ = sum of the ranks for group 2.

For this example,

U₁ = 5(5) + 5(6)/2 - 37 = 3

U₂ = 5(5) + 5(6)/2 - 18 = 22

In our example, U=3. Is this evidence in support of the null or research hypothesis? Before we address this question, we consider the range of the test statistic U in two different situations.
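The computation of U from the rank sums can be sketched as follows. The sketch assumes the standard definitions U₁ = n₁n₂ + n₁(n₁+1)/2 - R₁ and U₂ = n₁n₂ + n₂(n₂+1)/2 - R₂; the function name `mann_whitney_u` is a label chosen here for illustration.

```python
def mann_whitney_u(r1, r2, n1, n2):
    """Return (U1, U2, U) from the rank sums and sample sizes of the two groups."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2, min(u1, u2)


# Asthma example: placebo is group 1 (R1 = 37), new drug is group 2 (R2 = 18).
u1, u2, u = mann_whitney_u(r1=37, r2=18, n1=5, n2=5)
print(u1, u2, u)  # 3.0 22.0 3.0
```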

Situation #1

Consider the situation where there is complete separation of the groups, supporting the research hypothesis that the two populations are not equal. If all of the higher numbers of episodes of shortness of breath (and thus all of the higher ranks) are in the placebo group, all of the lower numbers of episodes (and ranks) are in the new drug group, and there are no ties, then:

R₁ = 6+7+8+9+10 = 40 and R₂ = 1+2+3+4+5 = 15,

and

U₁ = 5(5) + 5(6)/2 - 40 = 0 and U₂ = 5(5) + 5(6)/2 - 15 = 25.

Therefore, when there is clearly a difference in the populations, U=0.

Situation #2

Consider a second situation where low and high scores are approximately evenly distributed in the two groups, supporting the null hypothesis that the groups are equal. If ranks of 2, 4, 6, 8 and 10 are assigned to the numbers of episodes of shortness of breath reported in the placebo group and ranks of 1, 3, 5, 7 and 9 are assigned to the numbers of episodes of shortness of breath reported in the new drug group, then:

R₁= 2+4+6+8+10 = 30 and R₂= 1+3+5+7+9 = 25,

and

U₁ = 5(5) + 5(6)/2 - 30 = 10 and U₂ = 5(5) + 5(6)/2 - 25 = 15.

When there is clearly no difference between populations, then U=10.

Thus, smaller values of U support the research hypothesis, and larger values of U support the null hypothesis.

Key Concept:

For any Mann-Whitney U test, each of U₁ and U₂ can range from 0 to n₁*n₂. Because U is the smaller of the two, it ranges from 0 (complete separation between groups, H₀ most likely false and H₁ most likely true) to n₁*n₂/2 (little evidence in support of H₁).

In every test, U₁+U₂ is always equal to n₁*n₂. In the example above, U₁ and U₂ can each range from 0 to 25, so U can be at most 12.5, and smaller values of U support the research hypothesis (i.e., we reject H₀ if U is small). The procedure for determining exactly when to reject H₀ is described below.
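The identity U₁+U₂ = n₁n₂ can be derived in a few lines. The derivation below assumes the standard definitions U₁ = n₁n₂ + n₁(n₁+1)/2 - R₁ and U₂ = n₁n₂ + n₂(n₂+1)/2 - R₂, together with the fact that the ranks 1 through n₁+n₂ always sum to (n₁+n₂)(n₁+n₂+1)/2.

```latex
\begin{align*}
U_1 + U_2 &= 2 n_1 n_2 + \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} - (R_1 + R_2) \\
R_1 + R_2 &= \frac{(n_1+n_2)(n_1+n_2+1)}{2}
           = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1 n_2 \\
\Rightarrow\quad U_1 + U_2 &= 2 n_1 n_2 - n_1 n_2 = n_1 n_2
\end{align*}
```

In the asthma example this gives 3 + 22 = 25 = 5*5.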

In every test, we must determine whether the observed U supports the null or research hypothesis. This is done following the same approach used in parametric testing. Specifically, we determine a critical value of U such that if the observed value of U is less than or equal to the critical value, we reject H₀ in favor of H₁ and if the observed value of U exceeds the critical value we do not reject H₀.

The critical value of U can be found in the table below. To determine the appropriate critical value we need the sample sizes (in this example, n₁=n₂=5) and our two-sided level of significance (α=0.05). For this example the critical value is 2, and the decision rule is to reject H₀ if U ≤ 2. We do not reject H₀ because 3 > 2. We do not have statistically significant evidence at α=0.05 to show that the two populations of numbers of episodes of shortness of breath are not equal. However, in this example, the failure to reach statistical significance may be due to low power. The sample data suggest a difference, but the sample sizes are too small to conclude that there is a statistically significant difference.
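The decision can also be cross-checked by brute force. The sketch below is an illustration added here, not the module's method: it enumerates every way of assigning 5 of the 10 observed ranks to group 1 and reports the share of assignments whose U is at most the observed value, which is an exact two-sided permutation p-value for these data. The function name `exact_two_sided_p` is hypothetical.

```python
from itertools import combinations


def exact_two_sided_p(pooled_ranks, n1, u_observed):
    """Share of all group assignments whose U is less than or equal to u_observed."""
    n = len(pooled_ranks)
    n2 = n - n1
    total = sum(pooled_ranks)
    extreme = 0
    count = 0
    for idx in combinations(range(n), n1):
        r1 = sum(pooled_ranks[i] for i in idx)
        r2 = total - r1
        u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
        u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
        if min(u1, u2) <= u_observed:
            extreme += 1
        count += 1
    return extreme / count


# Ranks of the pooled asthma sample, including the tied ranks 4.5 and 7.5.
pooled_ranks = [1, 2, 3, 4.5, 4.5, 6, 7.5, 7.5, 9, 10]
p = exact_two_sided_p(pooled_ranks, n1=5, u_observed=3)
print(round(p, 4))
```

For the asthma data this share is 16/252, about 0.063, which is above 0.05 and agrees with the decision not to reject H₀.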

Table of Critical Values for U

Example:

A new approach to prenatal care is proposed for pregnant women living in a rural community. The new program involves in-home visits during the course of pregnancy in addition to the usual or regularly scheduled visits. A pilot randomized trial with 15 pregnant women is designed to evaluate whether women who participate in the program deliver healthier babies than women receiving usual care. The outcome is the APGAR score measured 5 minutes after birth. Recall that APGAR scores range from 0 to 10 with scores of 7 or higher considered normal (healthy), 4-6 low and 0-3 critically low. The data are shown below.

Usual Care	8	7	6	2	5	8	7	3
New Program	9	8	7	8	10	9	6

Is there statistical evidence of a difference in APGAR scores in women receiving the new and enhanced versus usual prenatal care? We run the test using the five-step approach.

Step 1. Set up hypotheses and determine level of significance.

H₀: The two populations are equal versus

H₁: The two populations are not equal. α =0.05

Step 2. Select the appropriate test statistic.

Because APGAR scores are not normally distributed and the samples are small (n₁=8 and n₂=7), we use the Mann Whitney U test. The test statistic is U, the smaller of

U₁ = n₁*n₂ + n₁(n₁+1)/2 - R₁ and U₂ = n₁*n₂ + n₂(n₂+1)/2 - R₂

where R₁ and R₂ are the sums of the ranks in groups 1 and 2, respectively.

Step 3. Set up decision rule.

The appropriate critical value can be found in the table above. To determine the appropriate critical value we need sample sizes (n₁=8 and n₂=7) and our two-sided level of significance (α=0.05). The critical value for this test with n₁=8, n₂=7 and α=0.05 is 10 and the decision rule is as follows: Reject H₀ if U ≤ 10.

Step 4. Compute the test statistic.

The first step is to assign ranks of 1 through 15 to the smallest through largest values in the total sample, as follows:

		Total Sample (Ordered Smallest to Largest)		Ranks
Usual Care	New Program	Usual Care	New Program	Usual Care	New Program
8	9	2		1
7	8	3		2
6	7	5		3
2	8	6	6	4.5	4.5
5	10	7	7	7	7
8	9	7		7
7	6	8	8	10.5	10.5
3		8	8	10.5	10.5
			9		13.5
			9		13.5
			10		15
				R₁=45.5	R₂=74.5

Next, we sum the ranks in each group. In the usual care group, the sum of the ranks is R₁=45.5 and in the new program group, the sum of the ranks is R₂=74.5. Recall that the sum of the ranks will always equal n(n+1)/2. As a check on our assignment of ranks, we have n(n+1)/2 = 15(16)/2=120 which is equal to 45.5+74.5 = 120.

We now compute U₁ and U₂, as follows:

U₁ = 8(7) + 8(9)/2 - 45.5 = 46.5

U₂ = 8(7) + 7(8)/2 - 74.5 = 9.5

Thus, the test statistic is U=9.5.

Step 5. Conclusion:

We reject H₀ because 9.5 < 10. We have statistically significant evidence at α =0.05 to show that the populations of APGAR scores are not equal in women receiving usual prenatal care as compared to the new program of prenatal care.

Example:

A clinical trial is run to assess the effectiveness of a new anti-retroviral therapy for patients with HIV. Patients are randomized to receive a standard anti-retroviral therapy (usual care) or the new anti-retroviral therapy and are monitored for 3 months. The primary outcome is viral load which represents the number of HIV copies per milliliter of blood. A total of 30 participants are randomized and the data are shown below.

Standard Therapy	7500	8000	2000	550	1250	1000	2250	6800	3400	6300	9100	970	1040	670	400
New Therapy	400	250	800	1400	8000	7400	1020	6000	920	1420	2700	4200	5200	4100	undetectable

Is there statistical evidence of a difference in viral load in patients receiving the standard versus the new anti-retroviral therapy?

Step 1. Set up hypotheses and determine level of significance.

H₀: The two populations are equal versus

H₁: The two populations are not equal. α=0.05

Step 2. Select the appropriate test statistic.

Because viral load measures are not normally distributed (with outliers as well as limits of detection (e.g., "undetectable")), we use the Mann-Whitney U test. The test statistic is U, the smaller of

U₁ = n₁*n₂ + n₁(n₁+1)/2 - R₁ and U₂ = n₁*n₂ + n₂(n₂+1)/2 - R₂

where R₁ and R₂ are the sums of the ranks in groups 1 and 2, respectively.

Step 3. Set up the decision rule.

The critical value can be found in the table of critical values based on sample sizes (n₁=n₂=15) and a two-sided level of significance (α=0.05). The critical value is 64 and the decision rule is as follows: Reject H₀ if U ≤ 64.

Step 4. Compute the test statistic.

The first step is to assign ranks of 1 through 30 to the smallest through largest values in the total sample. Note in the table below that the "undetectable" measurement is listed first in the ordered values (smallest) and assigned a rank of 1.

		Total Sample (Ordered Smallest to Largest)		Ranks
Standard Anti-retroviral	New Anti-retroviral	Standard Anti-retroviral	New Anti-retroviral	Standard Anti-retroviral	New Anti-retroviral
7500	400		undetectable		1
8000	250		250		2
2000	800	400	400	3.5	3.5
550	1400	550		5
1250	8000	670		6
1000	7400		800		7
2250	1020		920		8
6800	6000	970		9
3400	920	1000		10
6300	1420		1020		11
9100	2700	1040		12
970	4200	1250		13
1040	5200		1400		14
670	4100		1420		15
400	undetectable	2000		16
		2250		17
			2700		18
		3400		19
			4100		20
			4200		21
			5200		22
			6000		23
		6300		24
		6800		25
			7400		26
		7500		27
		8000	8000	28.5	28.5
		9100		30
				R₁ = 245	R₂ = 220

Next, we sum the ranks in each group. In the standard anti-retroviral therapy group, the sum of the ranks is R₁=245; in the new anti-retroviral therapy group, the sum of the ranks is R₂=220. Recall that the sum of the ranks will always equal n(n+1)/2. As a check on our assignment of ranks, we have n(n+1)/2 = 30(31)/2=465 which is equal to 245+220 = 465. We now compute U₁ and U₂, as follows:

U₁ = 15(15) + 15(16)/2 - 245 = 100

U₂ = 15(15) + 15(16)/2 - 220 = 125

Thus, the test statistic is U=100.

Step 5. Conclusion.

We do not reject H₀ because 100 > 64. We do not have sufficient evidence to conclude that the treatment groups differ in viral load.
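The viral load example can be reproduced end to end in a short sketch. One assumption is made loudly here: the "undetectable" measurement is encoded as 0, which is only a placeholder below every measured value. Because the test uses ranks alone, any value smaller than 250 would give the same result. The helper name `rank_sums` is hypothetical.

```python
UNDETECTABLE = 0  # assumption: placeholder below all measured values; only its order matters

standard = [7500, 8000, 2000, 550, 1250, 1000, 2250, 6800,
            3400, 6300, 9100, 970, 1040, 670, 400]
new = [400, 250, 800, 1400, 8000, 7400, 1020, 6000,
       920, 1420, 2700, 4200, 5200, 4100, UNDETECTABLE]


def rank_sums(group1, group2):
    """Sum of pooled midranks in each group (ties share the mean of their ranks)."""
    pooled = sorted(group1 + group2)

    def midrank(value):
        first = pooled.index(value) + 1          # first 1-based position of the value
        last = first + pooled.count(value) - 1   # last position of the same value
        return (first + last) / 2

    return sum(midrank(v) for v in group1), sum(midrank(v) for v in group2)


r1, r2 = rank_sums(standard, new)
n1, n2 = len(standard), len(new)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
u = min(u1, u2)

print(r1, r2, u1, u2, u)  # 245.0 220.0 100.0 125.0 100.0
print("reject H0" if u <= 64 else "do not reject H0")  # do not reject H0
```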

Matched (Paired) Samples

This section describes nonparametric tests to compare two groups with respect to a continuous outcome when the data are collected on matched or paired samples. The parametric procedure for doing this was presented in the modules on hypothesis testing for the situation in which the continuous outcome was normally distributed. This section describes procedures that should be used when the outcome cannot be assumed to follow a normal distribution. There are two popular nonparametric tests to compare outcomes between two matched or paired groups. The first is called the Sign Test and the second the Wilcoxon Signed Rank Test.

Recall that when data are matched or paired, we compute difference scores for each individual and analyze difference scores. The same approach is followed in nonparametric tests. In parametric tests, the null hypothesis is that the mean difference (μ_d) is zero. In nonparametric tests, the null hypothesis is that the median difference is zero.

Example:

Consider a clinical investigation to assess the effectiveness of a new drug designed to reduce repetitive behaviors in children affected with autism. If the drug is effective, children will exhibit fewer repetitive behaviors on treatment as compared to when they are untreated. A total of 8 children with autism enroll in the study. Each child is observed by the study psychologist for a period of 3 hours both before treatment and then again after taking the new drug for 1 week. The time that each child is engaged in repetitive behavior during each 3 hour observation period is measured. Repetitive behavior is scored on a scale of 0 to 100 and scores represent the percent of the observation time in which the child is engaged in repetitive behavior. For example, a score of 0 indicates that during the entire observation period the child did not engage in repetitive behavior while a score of 100 indicates that the child was constantly engaged in repetitive behavior. The data are shown below.

Child	Before Treatment	After 1 Week of Treatment
1	85	75
2	70	50
3	40	50
4	65	40
5	80	20
6	75	65
7	55	40
8	20	25

Looking at the data, it appears that some children improve (e.g., Child 5 scored 80 before treatment and 20 after treatment), but some got worse (e.g., Child 3 scored 40 before treatment and 50 after treatment). Is there statistically significant improvement in repetitive behavior after 1 week of treatment?

Because the before and after treatment measures are paired, we compute difference scores for each child. In this example, we subtract the assessment of repetitive behaviors after treatment from that measured before treatment so that difference scores represent improvement in repetitive behavior. The question of interest is whether there is significant improvement after treatment.

Child	Before Treatment	After 1 Week of Treatment	Difference (Before-After)
1	85	75	10
2	70	50	20
3	40	50	-10
4	65	40	25
5	80	20	60
6	75	65	10
7	55	40	15
8	20	25	-5

In this small sample, the observed difference (or improvement) scores vary widely and are subject to extremes (e.g., the observed difference of 60 is an outlier). Thus, a nonparametric test is appropriate to test whether there is significant improvement in repetitive behavior before versus after treatment. The hypotheses are given below.

H₀: The median difference is zero versus

H₁: The median difference is positive. α=0.05

In this example, the null hypothesis is that there is no difference in scores before versus after treatment. If the null hypothesis is true, we expect to see some positive differences (improvement) and some negative differences (worsening). If the research hypothesis is true, we expect to see more positive differences after treatment as compared to before.

The Sign Test

The Sign Test is the simplest nonparametric test for matched or paired data. The approach is to analyze only the signs of the difference scores, as shown below:

Child	Before Treatment	After 1 Week of Treatment	Difference (Before-After)	Sign
1	85	75	10	+
2	70	50	20	+
3	40	50	-10	–
4	65	40	25	+
5	80	20	60	+
6	75	65	10	+
7	55	40	15	+
8	20	25	-5	–

If the null hypothesis is true (i.e., if the median difference is zero) then we expect to see approximately half of the differences as positive and half of the differences as negative. If the research hypothesis is true, we expect to see more positive differences.

Test Statistic for the Sign Test

The test statistic for the Sign Test is the number of positive signs or number of negative signs, whichever is smaller. In this example, we observe 2 negative and 6 positive signs. Is this evidence of significant improvement or simply due to chance?
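A minimal sketch of the Sign Test statistic for the autism data is shown below. The variable names are chosen here for illustration; differences are computed as before minus after so that positive values represent improvement.

```python
before = [85, 70, 40, 65, 80, 75, 55, 20]
after = [75, 50, 50, 40, 20, 65, 40, 25]

# Difference scores (before - after); only their signs are used by the Sign Test.
differences = [b - a for b, a in zip(before, after)]
n_positive = sum(d > 0 for d in differences)
n_negative = sum(d < 0 for d in differences)

# The test statistic is the smaller of the two counts.
statistic = min(n_positive, n_negative)
print(differences)                        # [10, 20, -10, 25, 60, 10, 15, -5]
print(n_positive, n_negative, statistic)  # 6 2 2
```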

Determining whether the observed test statistic supports the null or research hypothesis is done following the same approach used in parametric testing. Specifically, we determine a critical value such that if the smaller of the number of positive or negative signs is less than or equal to that critical value, then we reject H₀ in favor of H₁ and if the smaller of the number of positive or negative signs is greater than the critical value, then we do not reject H₀. Notice that this is a one-sided decision rule corresponding to our one-sided research hypothesis (the two-sided situation is discussed in the next example).

Table of Critical Values for the Sign Test

The critical values for the Sign Test are in the table below.

To determine the appropriate critical value we need the sample size, which is equal to the number of matched pairs (n=8), and our one-sided level of significance α=0.05. For this example, the critical value is 1, and the decision rule is to reject H₀ if the smaller of the number of positive or negative signs ≤ 1. We do not reject H₀ because 2 > 1. We do not have sufficient evidence at α=0.05 to show that there is improvement in repetitive behavior after taking the drug as compared to before. Here we used the critical value to decide whether to reject the null hypothesis. An alternative is to calculate the p-value, as described below.

Computing P-values for the Sign Test

With the Sign test we can readily compute a p-value based on our observed test statistic. The test statistic for the Sign Test is the smaller of the number of positive or negative signs and it follows a binomial distribution with n = the number of subjects in the study and p=0.5 (See the module on Probability for details on the binomial distribution). In the example above, n=8 and p=0.5 (the probability of success under H₀).

By using the binomial distribution formula:

P(x successes) = n!/(x!(n-x)!) * p^x * (1-p)^(n-x)

we can compute the probability of observing different numbers of successes during 8 trials. These are shown in the table below.

x=Number of Successes	P(x successes)
0	0.0039
1	0.0313
2	0.1094
3	0.2188
4	0.2734
5	0.2188
6	0.1094
7	0.0313
8	0.0039

Recall that a p-value is the probability of observing a test statistic as or more extreme than that observed. We observed 2 negative signs. Thus, the p-value for the test is: p-value = P(x ≤ 2). Using the table above,

p-value = P(x ≤ 2) = P(0) + P(1) + P(2) = 0.0039 + 0.0313 + 0.1094 = 0.1446.

Because the p-value = 0.1446 exceeds the level of significance α=0.05, we do not have statistically significant evidence that there is improvement in repetitive behaviors after taking the drug as compared to before. Notice in the table of binomial probabilities above, that we would have had to observe at most 1 negative sign to declare statistical significance using a 5% level of significance. Recall the critical value for our test was 1 based on the table of critical values for the Sign Test (above).
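The binomial probabilities and the one-sided p-value can be reproduced with a short sketch, assuming the usual binomial formula with n=8 and p=0.5. The function name `binomial_pmf` is a label chosen here. Summing the exact probabilities gives 37/256, about 0.1445; the value 0.1446 quoted above comes from adding the rounded table entries.

```python
from math import comb


def binomial_pmf(x, n, p=0.5):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)


n = 8
for x in range(n + 1):
    print(x, f"{binomial_pmf(x, n):.6f}")

# One-sided p-value: probability of 2 or fewer negative signs under H0.
p_value = sum(binomial_pmf(x, n) for x in range(3))
print(f"{p_value:.6f}")
```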

One-Sided versus Two-Sided Test

In the example looking for differences in repetitive behaviors in autistic children, we used a one-sided test (i.e., we hypothesize improvement after taking the drug). A two-sided test can be used if we hypothesize a difference in repetitive behavior after taking the drug as compared to before. From the table of critical values for the Sign Test, we can determine a two-sided critical value and again reject H₀ if the smaller of the number of positive or negative signs is less than or equal to that two-sided critical value.

Alternatively, we can compute a two-sided p-value. If the research hypothesis is a two-sided alternative (i.e., H₁: The median difference is not zero), then the p-value is computed as: p-value = 2*P(x ≤ 2). Notice that this is equivalent to p-value = P(x ≤ 2) + P(x ≥ 6), representing the situation of few or many successes. Recall that in two-sided tests, we reject the null hypothesis if the test statistic is extreme in either direction. Here we observe 2 negative signs (and thus 6 positive signs). The opposite situation would be 6 negative signs (and thus 2 positive signs, as n=8). The two-sided p-value is the probability of observing a test statistic as or more extreme in either direction (i.e., P(x ≤ 2) + P(x ≥ 6)).

When Difference Scores are Zero

There is a special circumstance that needs attention when implementing the Sign Test which arises when one or more participants have difference scores of zero (i.e., their paired measurements are identical). If there is just one difference score of zero, some investigators drop that observation and reduce the sample size by 1 (i.e., the sample size for the binomial distribution would be n-1). This is a reasonable approach if there is just one zero. However, if there are two or more zeros, an alternative approach is preferred.

If there is an even number of zeros, we randomly assign them positive or negative signs.
If there is an odd number of zeros, we randomly drop one and reduce the sample size by 1, and then randomly assign the remaining observations positive or negative signs. The following example illustrates the approach.

Example:

A new chemotherapy treatment is proposed for patients with breast cancer. Investigators are concerned with patients' ability to tolerate the treatment and assess their quality of life both before and after receiving the new chemotherapy treatment. Quality of life (QOL) is measured on an ordinal scale and for analysis purposes, numbers are assigned to each response category as follows: 1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent. The data are shown below.

Patient	QOL Before Chemotherapy Treatment	QOL After Chemotherapy Treatment
1	3	2
2	2	3
3	3	4
4	2	4
5	1	1
6	3	4
7	2	4
8	3	3
9	2	1
10	1	3
11	3	4
12	2	3

The question of interest is whether there is a difference in QOL after chemotherapy treatment as compared to before.

Step 1. Set up hypotheses and determine level of significance.

H₀: The median difference is zero versus

H₁: The median difference is not zero. α=0.05

Step 2. Select the appropriate test statistic.

The test statistic for the Sign Test is the smaller of the number of positive or negative signs.

Step 3. Set up the decision rule.

The appropriate critical value for the Sign Test can be found in the table of critical values for the Sign Test. To determine the appropriate critical value we need the sample size (or number of matched pairs, n=12), and our two-sided level of significance α=0.05.

The critical value for this two-sided test with n=12 and α=0.05 is 2, and the decision rule is as follows: Reject H₀ if the smaller of the number of positive or negative signs ≤ 2.

Step 4. Compute the test statistic.

Because the before and after treatment measures are paired, we compute difference scores for each patient. In this example, we subtract the QOL measured before treatment from that measured after.

Patient	QOL Before Chemotherapy Treatment	QOL After Chemotherapy Treatment	Difference (After-Before)
1	3	2	-1
2	2	3	1
3	3	4	1
4	2	4	2
5	1	1	0
6	3	4	1
7	2	4	2
8	3	3	0
9	2	1	-1
10	1	3	2
11	3	4	1
12	2	3	1

We now capture the signs of the difference scores and because there are two zeros, we randomly assign one negative sign (i.e., "–" to patient 5) and one positive sign (i.e., "+" to patient 8), as follows:

Patient	QOL Before Chemotherapy Treatment	QOL After Chemotherapy Treatment	Difference (After-Before)	Sign
1	3	2	-1	–
2	2	3	1	+
3	3	4	1	+
4	2	4	2	+
5	1	1	0	–
6	3	4	1	+
7	2	4	2	+
8	3	3	0	+
9	2	1	-1	–
10	1	3	2	+
11	3	4	1	+
12	2	3	1	+

The test statistic is the number of negative signs which is equal to 3.
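The handling of zero differences can be sketched as follows. The function name `signs_with_zero_rule` is hypothetical, and one assumption is made: after dropping one zero when their count is odd, the remaining zeros are split evenly between positive and negative signs, as in the example above where two zeros became one "+" and one "–". Which zero receives which sign does not affect the counts.

```python
import random


def signs_with_zero_rule(differences, rng):
    """Return '+'/'-' signs; zeros are dropped or split as described in the lead-in."""
    nonzero_signs = ["+" if d > 0 else "-" for d in differences if d != 0]
    n_zeros = sum(1 for d in differences if d == 0)
    if n_zeros % 2 == 1:
        n_zeros -= 1  # odd number of zeros: drop one, reducing the sample size by 1
    # Assumption: split the remaining zeros evenly between '+' and '-'.
    zero_signs = ["+"] * (n_zeros // 2) + ["-"] * (n_zeros // 2)
    rng.shuffle(zero_signs)  # random order mirrors the random assignment
    return nonzero_signs + zero_signs  # note: not in the original patient order


qol_before = [3, 2, 3, 2, 1, 3, 2, 3, 2, 1, 3, 2]
qol_after = [2, 3, 4, 4, 1, 4, 4, 3, 1, 3, 4, 3]
differences = [a - b for a, b in zip(qol_after, qol_before)]

signs = signs_with_zero_rule(differences, random.Random(0))
n_positive = signs.count("+")
n_negative = signs.count("-")
print(len(signs), n_positive, n_negative, min(n_positive, n_negative))  # 12 9 3 3
```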

Step 5. Conclusion.

We do not reject H₀ because 3 > 2. We do not have statistically significant evidence at α=0.05 to show that there is a difference in QOL after chemotherapy treatment as compared to before.

We can also compute the p-value directly using the binomial distribution with n=12 and p=0.5. The two-sided p-value for the test is p-value = 2*P(x ≤ 3) (which is equivalent to p-value = P(x ≤ 3) + P(x ≥ 9)). Again, the two-sided p-value is the probability of observing few or many positive or negative signs. Here we observe 3 negative signs (and thus 9 positive signs). The opposite situation would be 9 negative signs (and thus 3 positive signs, as n=12). The two-sided p-value is the probability of observing a test statistic as or more extreme in either direction (i.e., P(x ≤ 3) + P(x ≥ 9)). We can compute the p-value using the binomial formula or a statistical computing package, as follows:

p-value = 2*P(x ≤ 3) = 2(0.0730) = 0.1460

Because the p-value = 0.1460 exceeds the level of significance (α=0.05) we do not have statistically significant evidence at α =0.05 to show that there is a difference in QOL after chemotherapy treatment as compared to before.
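A small helper can compute one-sided and two-sided Sign Test p-values for any n. This is a sketch under the assumption that p=0.5 under H₀ and that k is the smaller of the two sign counts; the name `sign_test_p_value` is introduced here and is not from a statistics library.

```python
from math import comb


def sign_test_p_value(n, k, two_sided=True):
    """P-value for the Sign Test; k is the smaller count of signs among n pairs."""
    lower_tail = sum(comb(n, x) for x in range(k + 1)) / 2**n  # P(x <= k) when p = 0.5
    if two_sided:
        return min(1.0, 2 * lower_tail)  # doubling the tail, capped at 1
    return lower_tail


p = sign_test_p_value(n=12, k=3)
print(round(p, 4))  # 0.146
```

The same function with n=8, k=2 and `two_sided=False` returns the one-sided p-value for the autism example.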

Key Concept:

In each of the two previous examples, we failed to show statistical significance because the p-value was not less than the stated level of significance. While the test statistic for the Sign Test is easy to compute, it actually does not take much of the information in the sample data into account. All we measure is the direction (sign) of the difference in each participant's scores; we do not account for the magnitude of those differences.
