The aim of factorMerger is to provide a set of tools to support results from post hoc comparisons. Post hoc testing is an analysis performed after running ANOVA to examine differences between group means (of some numeric response variable) for each pair of groups (groups are defined by a factor variable).
This project arose from the need for a method of post hoc testing that gives a hierarchical interpretation of the relations between group means. Consequently, for a given significance level we may divide the groups into non-overlapping clusters.
In the current version the factorMerger package supports the following parametric models:
- one-dimensional Gaussian (family = "gaussian"),
- multivariate Gaussian (family = "gaussian"),
- binomial (family = "binomial"),
- survival (family = "survival").
The set of hypotheses tested during merging may be either comprehensive or limited. This gives two possibilities:
- all-to-all (successive = FALSE),
- successive (successive = TRUE).
The all-to-all version considers all possible pairs of factor levels. In the successive approach, factor levels are first sorted and then only consecutive groups are tested for equality of means.
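The difference between the two approaches can be seen by counting the candidate pairs in the first iteration. The sketch below is plain base R and does not call factorMerger; the level names and their alphabetical order are placeholders chosen only for illustration.

```r
# Candidate pairs in the first iteration for a factor with k levels.
# The level names and their order are illustrative placeholders.
k <- 10
levels_sorted <- LETTERS[1:k]

# all-to-all: every unordered pair of levels
all_pairs <- t(combn(levels_sorted, 2))
n_all <- nrow(all_pairs)

# successive: only neighbours in the sorted order
succ_pairs <- cbind(head(levels_sorted, -1), tail(levels_sorted, -1))
n_succ <- nrow(succ_pairs)

c(all_to_all = n_all, successive = n_succ)
```

For ten levels this gives 45 pairs in the all-to-all version and 9 in the successive one.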
The factorMerger package also implements two strategies for a single iteration of the algorithm. They use one of the following:
- Likelihood Ratio Test (method = "LRT"),
- hierarchical clustering (method = "hclust").
To illustrate the functionality of factorMerger, we use samples in which the response variable is generated from one of the distributions listed above and the corresponding factor variable is sampled uniformly from a finite set of size k.
To do so, we may use the function generateSample or generateMultivariateSample.
library(factorMerger)
library(knitr)
library(dplyr)
randSample <- generateMultivariateSample(N = 100, k = 10, d = 3)
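mergeFactors is a function that performs hierarchical post hoc testing. As arguments it takes the response and the factor that defines the groups; the optional arguments family, successive and method, described above, select the model and the merging strategy.
By default (with the argument abbreviate = TRUE), factor levels are abbreviated and surrounded with brackets.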
fmAll <- mergeFactors(randSample$response, randSample$factor)
mergeFactors returns an object containing information about the ‘merging history’.
mergingHistory(fmAll, showStats = TRUE) %>%
kable()
groupA | groupB | model | GIC | pvalVsFull | pvalVsPrevious |
---|---|---|---|---|---|
|  |  | -583.0985 | 1186.197 | 1.0000 | 1.0000 |
(F) | (E) | -583.1770 | 1184.354 | 0.9868 | 0.9868 |
(A) | (B) | -583.5992 | 1183.198 | 0.9891 | 0.8600 |
(D) | (C) | -584.7755 | 1183.551 | 0.9627 | 0.5461 |
(G) | (I) | -585.9748 | 1183.949 | 0.9489 | 0.5330 |
(H) | (A)(B) | -588.4350 | 1186.870 | 0.8344 | 0.2078 |
(F)(E) | (J) | -591.6986 | 1191.397 | 0.6111 | 0.1067 |
(H)(A)(B) | (D)(C) | -597.1654 | 1200.331 | 0.2354 | 0.0159 |
(H)(A)(B)(D)(C) | (F)(E)(J) | -605.6794 | 1215.359 | 0.0160 | 0.0010 |
(H)(A)(B)(D)(C)(F)(E)(J) | (G)(I) | -619.1469 | 1240.294 | 0.0001 | 0.0000 |
Each row of the above data frame describes one step of the merging algorithm; the first row corresponds to the full model, before any merging. The first two columns specify which groups were merged in the iteration. The columns model and GIC give the log-likelihood and the Generalized Information Criterion of the model after merging. The last two columns are p-values of the Likelihood Ratio Test against the full model (pvalVsFull) and against the previous model (pvalVsPrevious).
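As a reading aid, the GIC column can be recomputed from the model column. The printed values are consistent with GIC = -2 * loglikelihood + penalty * (number of groups) with penalty = 2. This formula is inferred from the numbers in the table, not taken from the package source, so treat it as an assumption.

```r
# Log-likelihoods copied from the 'model' column (first five rows of the table)
loglik   <- c(-583.0985, -583.1770, -583.5992, -584.7755, -585.9748)
n_groups <- 10:6   # groups remaining after 0, 1, 2, 3 and 4 merges
penalty  <- 2      # assumed penalty, matching the default mentioned for cutTree

# Assumed form of the criterion, inferred from the printed values
gic <- -2 * loglik + penalty * n_groups
round(gic, 3)
```

The results agree with the GIC column up to rounding of the printed log-likelihoods.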
If we set successive = TRUE, then at the beginning a one-dimensional representation of the response is fitted using isoMDS{MASS}. Next, in each step only groups whose means are close to each other are compared.
fm <- mergeFactors(randSample$response, randSample$factor,
successive = TRUE,
method = "hclust")
mergingHistory(fm, showStats = TRUE) %>%
kable()
groupA | groupB | model | GIC | pvalVsFull | pvalVsPrevious |
---|---|---|---|---|---|
|  |  | -583.0985 | 1186.197 | 1.0000 | 1.0000 |
(F) | (E) | -583.1770 | 1184.354 | 0.9868 | 0.9868 |
(A) | (B) | -583.5992 | 1183.198 | 0.9891 | 0.8600 |
(D) | (C) | -584.7755 | 1183.551 | 0.9627 | 0.5461 |
(G) | (I) | -585.9748 | 1183.949 | 0.9489 | 0.5330 |
(A)(B) | (F)(E) | -588.9627 | 1187.926 | 0.7767 | 0.1370 |
(H) | (D)(C) | -591.5613 | 1191.123 | 0.6354 | 0.1824 |
(A)(B)(F)(E) | (J) | -596.7372 | 1199.475 | 0.2691 | 0.0205 |
(H)(D)(C) | (A)(B)(F)(E)(J) | -605.6794 | 1215.359 | 0.0160 | 0.0007 |
(H)(D)(C)(A)(B)(F)(E)(J) | (G)(I) | -619.1469 | 1240.294 | 0.0001 | 0.0000 |
The algorithms implemented in the factorMerger package make it possible to create an unambiguous partition of a factor. Below we show how to extract the partition from the mergeFactors output.
cutTree(fm)
#> [1] (C) (J) (F)(E) (H) (H) (H) (F)(E) (J) (F)(E) (J)
#> [11] (C) (G) (A)(B) (F)(E) (G) (A)(B) (F)(E) (F)(E) (I) (D)
#> [21] (D) (C) (G) (F)(E) (I) (J) (J) (C) (A)(B) (C)
#> [31] (F)(E) (H) (I) (F)(E) (F)(E) (F)(E) (C) (F)(E) (A)(B) (J)
#> [41] (F)(E) (G) (A)(B) (A)(B) (F)(E) (F)(E) (F)(E) (G) (F)(E) (A)(B)
#> [51] (D) (J) (I) (J) (J) (F)(E) (H) (D) (C) (C)
#> [61] (H) (F)(E) (A)(B) (J) (F)(E) (A)(B) (G) (J) (A)(B) (C)
#> [71] (I) (J) (A)(B) (I) (A)(B) (I) (F)(E) (I) (F)(E) (H)
#> [81] (G) (I) (H) (J) (I) (A)(B) (G) (H) (H) (F)(E)
#> [91] (A)(B) (J) (F)(E) (A)(B) (G) (F)(E) (J) (A)(B) (J) (H)
#> Levels: (H) (D) (C) (A)(B) (F)(E) (J) (G) (I)
By default, cutTree returns the factor split for the model with the optimal GIC (with penalty = 2). However, we can specify a different metric (stat = c("loglikelihood", "p-value", "GIC")) to use in cutting. If loglikelihood or p-value is chosen, an exact threshold must be given as the value parameter. cutTree then returns the factor for the smallest model whose statistic is higher than the threshold. If we choose GIC, then value is interpreted as the GIC penalty.
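mH <- mergingHistory(fm, showStats = TRUE)
# use the log-likelihood of the model halfway along the merging path as the threshold
thres <- mH$model[nrow(mH) %/% 2]
cutTree(fm, stat = "loglikelihood", value = thres)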
#> [1] (D)(C) (J) (F)(E) (H) (H) (H) (F)(E) (J) (F)(E) (J)
#> [11] (D)(C) (G)(I) (A)(B) (F)(E) (G)(I) (A)(B) (F)(E) (F)(E) (G)(I) (D)(C)
#> [21] (D)(C) (D)(C) (G)(I) (F)(E) (G)(I) (J) (J) (D)(C) (A)(B) (D)(C)
#> [31] (F)(E) (H) (G)(I) (F)(E) (F)(E) (F)(E) (D)(C) (F)(E) (A)(B) (J)
#> [41] (F)(E) (G)(I) (A)(B) (A)(B) (F)(E) (F)(E) (F)(E) (G)(I) (F)(E) (A)(B)
#> [51] (D)(C) (J) (G)(I) (J) (J) (F)(E) (H) (D)(C) (D)(C) (D)(C)
#> [61] (H) (F)(E) (A)(B) (J) (F)(E) (A)(B) (G)(I) (J) (A)(B) (D)(C)
#> [71] (G)(I) (J) (A)(B) (G)(I) (A)(B) (G)(I) (F)(E) (G)(I) (F)(E) (H)
#> [81] (G)(I) (G)(I) (H) (J) (G)(I) (A)(B) (G)(I) (H) (H) (F)(E)
#> [91] (A)(B) (J) (F)(E) (A)(B) (G)(I) (F)(E) (J) (A)(B) (J) (H)
#> Levels: (H) (D)(C) (A)(B) (F)(E) (J) (G)(I)
In this example the partition corresponds to the last model on the merging path whose log-likelihood reaches the threshold of -585.9748, that is, the model obtained after merging (G) and (I).
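The selection rule can be mimicked in base R on the printed merging history. This is only a sketch of the rule as the output above suggests it behaves (the comparison is taken to be "greater than or equal to", which is an assumption); it is not the package's implementation.

```r
# 'model' column of the merging history printed for fm
loglik <- c(-583.0985, -583.1770, -583.5992, -584.7755, -585.9748,
            -588.9627, -591.5613, -596.7372, -605.6794, -619.1469)

# threshold taken from the middle of the path, as in the example
thres <- loglik[length(loglik) %/% 2]

# last (most merged) model whose log-likelihood still reaches the threshold
chosen <- max(which(loglik >= thres))

# row 1 has 10 groups and every later row merges two groups into one
n_clusters <- length(loglik) - chosen + 1
c(row = chosen, clusters = n_clusters)
```

The rule picks row 5 and leaves six clusters, matching the six levels in the cutTree output.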
getOptimalPartition(fm)
#> [1] "(H)" "(D)" "(C)" "(A)(B)" "(F)(E)" "(J)" "(G)" "(I)"
The function getOptimalPartition returns a vector with the final cluster names from the factorMerger object.
getOptimalPartitionDf(fm)
#> orig pred
#> 1 (C) (C)
#> 2 (J) (J)
#> 3 (F) (F)(E)
#> 4 (H) (H)
#> 7 (E) (F)(E)
#> 12 (G) (G)
#> 13 (B) (A)(B)
#> 16 (A) (A)(B)
#> 19 (I) (I)
#> 20 (D) (D)
The function getOptimalPartitionDf returns a dictionary in data frame format. Each row gives the original label of a factor level and its new (cluster) label.
Like cutTree, the functions getOptimalPartition and getOptimalPartitionDf take the arguments stat and threshold.
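Such a dictionary can be used to relabel other observations. The sketch below rebuilds the data frame printed above by hand and applies it with match from base R; the vector new_obs is a made-up input used only for illustration.

```r
# Dictionary copied from the getOptimalPartitionDf(fm) output above
dict <- data.frame(
  orig = c("(C)", "(J)", "(F)", "(H)", "(E)", "(G)", "(B)", "(A)", "(I)", "(D)"),
  pred = c("(C)", "(J)", "(F)(E)", "(H)", "(F)(E)", "(G)", "(A)(B)", "(A)(B)", "(I)", "(D)"),
  stringsAsFactors = FALSE
)

# Hypothetical new observations carrying the original level labels
new_obs <- c("(A)", "(E)", "(J)", "(B)")

# Look up the cluster label of each observation
relabelled <- dict$pred[match(new_obs, dict$orig)]
relabelled
```

Here (A) and (B) both map to (A)(B), (E) maps to (F)(E), and (J) keeps its own label.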
We may plot the results using the plot function.
plot(fm, panel = "all", nodesSpacing = "equidistant", color = "cluster")
plot(fmAll, panel = "merging", show = "p-value",
nodesSpacing = "effects", color = "cluster",
clusterSplit = list(stat = "p-value", alpha = 0.5),
alpha = 0.5)
plot(fm, color = "cluster", panel = "response")
The heatmap on the right shows the group means of all variables included in the analysis.
plot(fm, color = "cluster", panel = "response", summary = "profile")
In the above plots, colours correspond to groups. The plot on the right shows the rankings of group means for all variables included in the algorithm.
It is also possible to plot GIC together with the merging path plot.
plot(fm, panel = "GIC", penalty = 5,
clusterSplit = list(stat = "GIC", value = 5))
The model with the lowest GIC is marked.
oneDimRandSample <- generateSample(1000, 10)
oneDimFm <- mergeFactors(oneDimRandSample$response, oneDimRandSample$factor,
method = "hclust")
mergingHistory(oneDimFm, showStats = TRUE) %>%
kable()
groupA | groupB | model | GIC | pvalVsFull | pvalVsPrevious |
---|---|---|---|---|---|
|  |  | -3219.290 | 6458.581 | 1.0000 | 1.0000 |
(B) | (E) | -3219.292 | 6456.583 | 0.9592 | 0.9592 |
(B)(E) | (G) | -3219.301 | 6454.602 | 0.9897 | 0.8928 |
(C) | (F) | -3219.312 | 6452.624 | 0.9977 | 0.8826 |
(J) | (I) | -3219.470 | 6450.941 | 0.9859 | 0.5748 |
(A) | (B)(E)(G) | -3219.865 | 6449.729 | 0.9507 | 0.3763 |
(D) | (H) | -3221.041 | 6450.083 | 0.7474 | 0.1260 |
(A)(B)(E)(G) | (C)(F) | -3222.905 | 6451.811 | 0.4108 | 0.0540 |
(D)(H) | (J)(I) | -3231.854 | 6467.708 | 0.0016 | 0.0000 |
(A)(B)(E)(G)(C)(F) | (D)(H)(J)(I) | -3296.211 | 6594.422 | 0.0000 | 0.0000 |
plot(oneDimFm, responsePalette = "Reds")
plot(oneDimFm, summary = "boxplot", color = "cluster")
If family = "binomial", the response must take two values: 0 and 1 (1 is interpreted as success).
binomRandSample <- generateSample(1000, 10, distr = "binomial")
table(binomRandSample$response, binomRandSample$factor) %>%
kable()
|  | I | F | A | G | H | J | E | B | C | D |
---|---|---|---|---|---|---|---|---|---|---|
0 | 84 | 66 | 61 | 55 | 52 | 52 | 51 | 40 | 17 | 8 |
1 | 21 | 29 | 37 | 47 | 46 | 54 | 55 | 69 | 68 | 88 |
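The counts above determine the success proportion in each group, which is the group mean of a 0/1 response. The sketch below recomputes these proportions in base R and lists the neighbouring pairs in increasing order. That the successive approach sorts levels by exactly this quantity is an assumption based on the description of sorting given earlier.

```r
# Counts copied from the table above (rows: response value, columns: factor level)
counts <- rbind(
  "0" = c(I = 84, F = 66, A = 61, G = 55, H = 52, J = 52, E = 51, B = 40, C = 17, D = 8),
  "1" = c(I = 21, F = 29, A = 37, G = 47, H = 46, J = 54, E = 55, B = 69, C = 68, D = 88)
)

# Proportion of successes (response equal to 1) in each group
p_success <- counts["1", ] / colSums(counts)
round(p_success, 3)

# Levels in increasing order of success proportion and their neighbouring pairs
ord <- names(sort(p_success))
neighbours <- paste(head(ord, -1), tail(ord, -1), sep = "-")
neighbours
```

The resulting order is the same as the column order of the table.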
binomFm <- mergeFactors(binomRandSample$response,
binomRandSample$factor,
family = "binomial",
successive = TRUE)
mergingHistory(binomFm, showStats = TRUE) %>%
kable()
groupA | groupB | model | GIC | pvalVsFull | pvalVsPrevious |
---|---|---|---|---|---|
|  |  | -602.6546 | 1225.309 | 1.0000 | 1.0000 |
(G) | (H) | -602.6620 | 1223.324 | 0.9029 | 0.9029 |
(J) | (E) | -602.6715 | 1221.343 | 0.9833 | 0.8907 |
(G)(H) | (J)(E) | -603.1691 | 1220.338 | 0.7942 | 0.3185 |
(F) | (A) | -603.7303 | 1219.461 | 0.7079 | 0.2894 |
(C) | (D) | -606.3384 | 1222.677 | 0.1947 | 0.0224 |
(I) | (F)(A) | -609.7837 | 1227.567 | 0.0269 | 0.0087 |
(G)(H)(J)(E) | (B) | -613.3427 | 1232.685 | 0.0033 | 0.0076 |
(I)(F)(A) | (G)(H)(J)(E)(B) | -633.8721 | 1271.744 | 0.0000 | 0.0000 |
(I)(F)(A)(G)(H)(J)(E)(B) | (C)(D) | -692.7551 | 1387.510 | 0.0000 | 0.0000 |
plot(binomFm, color = "cluster", GICcolor = "red")
plot(binomFm, color = "cluster",
clusterSplit = list(stat = "GIC", penalty = 7), penalty = 7)
plot(binomFm, GICcolor = "red")
If family = "survival", the response must be of class Surv.
library(survival)
data(veteran)
survResponse <- Surv(time = veteran$time,
event = veteran$status)
survivalFm <- mergeFactors(response = survResponse,
factor = veteran$celltype,
family = "survival")
mergingHistory(survivalFm, showStats = TRUE) %>%
kable()
groupA | groupB | model | GIC | pvalVsFull | pvalVsPrevious |
---|---|---|---|---|---|
|  |  | -493.0247 | 994.0495 | 1.0000 | 1.0000 |
(smll) | (aden) | -493.1951 | 992.3902 | 0.5594 | 0.5594 |
(sqms) | (larg) | -493.5304 | 991.0609 | 0.6031 | 0.4128 |
(sqms)(larg) | (smll)(aden) | -505.4491 | 1012.8981 | 0.0000 | 0.0000 |
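The p-values in this table can be checked by hand from the model column. The sketch below assumes a chi-squared Likelihood Ratio Test with one degree of freedom per merge; this assumption is inferred from the printed numbers rather than from the package source. The helper lrt_p is a hypothetical name introduced only for this example.

```r
# Log-likelihoods copied from the 'model' column of the table above
loglik <- c(-493.0247, -493.1951, -493.5304, -505.4491)

# Hypothetical helper: p-value of a likelihood ratio test, larger vs smaller model
lrt_p <- function(ll_big, ll_small, df) {
  pchisq(2 * (ll_big - ll_small), df = df, lower.tail = FALSE)
}

# Against the full model: one degree of freedom per merge performed so far
p_vs_full <- lrt_p(loglik[1], loglik[-1], df = 1:3)

# Against the previous model: always one degree of freedom
p_vs_prev <- lrt_p(loglik[-4], loglik[-1], df = 1)

round(cbind(pvalVsFull = p_vs_full, pvalVsPrevious = p_vs_prev), 4)
```

Up to rounding of the printed log-likelihoods, the results reproduce the pvalVsFull and pvalVsPrevious columns.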
plot(survivalFm)
plot(survivalFm, nodesSpacing = "effects", color = "cluster")
Prochenka, Agnieszka. 2016. “Delete or Merge Regressors Algorithm.” In Polska Akademia Nauk, 37, 44–91.