Introduction

Algorithm inputs

Generating samples

Merging factors

Multi-dimensional Gaussian model

Computations
Final clusters
Visualizations

One-dimensional Gaussian model
Binomial model
Survival model

Bibliography

Introduction

The aim of factorMerger is to provide a set of tools that support the interpretation of post hoc comparisons. Post hoc testing is an analysis performed after running ANOVA to examine differences between group means (of a numeric response variable) for each pair of groups (groups are defined by a factor variable).

This project arose from the need for a post hoc testing method that gives a hierarchical interpretation of the relations between group means. As a result, for a given significance level the groups can be divided into non-overlapping clusters.

Algorithm inputs

In the current version the factorMerger package supports parametric models:

one-dimensional Gaussian (with the argument family = "gaussian"),
multi-dimensional Gaussian (with the argument family = "gaussian"),
binomial (with the argument family = "binomial"),
survival (with the argument family = "survival").

The set of hypotheses tested during merging may be either comprehensive or limited. This gives two possibilities:

all-to-all (with the argument successive = FALSE),
successive (with the argument successive = TRUE).

The all-to-all version considers all possible pairs of factor levels. In the successive approach, factor levels are first sorted and then only consecutive groups are tested for equality of means.
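To make the difference concrete, the following base R sketch enumerates the candidate pairs under both schemes for a toy one-dimensional response. It is not the package's internal code; the data and the names groupMeans, allPairs and successivePairs are illustrative assumptions.

```r
# Toy data: k = 5 groups with a numeric response (illustrative only).
set.seed(1)
k <- 5
groups <- factor(sample(LETTERS[1:k], 200, replace = TRUE),
                 levels = LETTERS[1:k])
response <- rnorm(200, mean = as.integer(groups))

# Group means, sorted as in the successive approach.
groupMeans <- sort(sapply(split(response, groups), mean))

# all-to-all: every unordered pair of levels is a candidate for merging.
allPairs <- t(combn(names(groupMeans), 2))

# successive: only neighbours in the sorted order are candidates.
successivePairs <- cbind(head(names(groupMeans), -1),
                         tail(names(groupMeans), -1))

nrow(allPairs)         # choose(k, 2) candidate pairs
nrow(successivePairs)  # k - 1 candidate pairs
```

With k levels the all-to-all scheme starts from choose(k, 2) hypotheses, while the successive scheme starts from only k - 1, which is why the latter scales better for factors with many levels.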

The factorMerger package also implements two strategies of a single iteration of the algorithm. They use one of the following:

Likelihood Ratio Test (with the argument method = "LRT"),
agglomerative clustering with constant distance matrix (based on the DMR4glm algorithm, Prochenka 2016; with the argument method = "hclust").

Generating samples

To illustrate the functionality of factorMerger we use samples in which the response variable is generated from one of the distributions listed above and the corresponding factor variable is sampled uniformly from a finite set of size $k$.

To do so, we may use the function generateSample or generateMultivariateSample.

library(factorMerger) 
library(knitr)
library(dplyr)
randSample <- generateMultivariateSample(N = 100, k = 10, d = 3)
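For readers without the package at hand, the sampling scheme described above can be imitated in base R. This is only a sketch of the idea (uniformly sampled factor, Gaussian response around group-specific means), not the implementation of generateSample; the names toySample and effects and the choice of group effects are assumptions.

```r
set.seed(3)
N <- 100  # number of observations
k <- 10   # number of factor levels

# Factor variable sampled uniformly from a set of k levels.
fac <- factor(sample(LETTERS[1:k], N, replace = TRUE),
              levels = LETTERS[1:k])

# Hypothetical group effects; the response is Gaussian around them.
effects <- rnorm(k)
resp <- rnorm(N, mean = effects[as.integer(fac)])

# Same shape as the objects used in this document: $response and $factor.
toySample <- list(response = resp, factor = fac)
str(toySample)
```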

Merging factors

mergeFactors is a function that performs hierarchical post hoc testing. As arguments it takes:

matrix/data.frame/vector with numeric response,
factor vector defining groups.

By default (with argument abbreviate = TRUE) factor levels are abbreviated and surrounded with brackets.

Multi-dimensional Gaussian model

Computations

fmAll <- mergeFactors(randSample$response, randSample$factor)

mergeFactors returns an object that stores information about the ‘merging history’.

mergingHistory(fmAll, showStats = TRUE) %>% 
    kable()

groupA	groupB	model	GIC	pvalVsFull	pvalVsPrevious
		-583.0985	1186.197	1.0000	1.0000
(F)	(E)	-583.1770	1184.354	0.9868	0.9868
(A)	(B)	-583.5992	1183.198	0.9891	0.8600
(D)	(C)	-584.7755	1183.551	0.9627	0.5461
(G)	(I)	-585.9748	1183.949	0.9489	0.5330
(H)	(A)(B)	-588.4350	1186.870	0.8344	0.2078
(F)(E)	(J)	-591.6986	1191.397	0.6111	0.1067
(H)(A)(B)	(D)(C)	-597.1654	1200.331	0.2354	0.0159
(H)(A)(B)(D)(C)	(F)(E)(J)	-605.6794	1215.359	0.0160	0.0010
(H)(A)(B)(D)(C)(F)(E)(J)	(G)(I)	-619.1469	1240.294	0.0001	0.0000

Each row of the above data frame describes one step of the merging algorithm. The first two columns specify which groups were merged in the iteration; the columns model and GIC give the loglikelihood and the Generalized Information Criterion of the model after merging. The last two columns are p-values of the Likelihood Ratio Test against the full model (pvalVsFull) and against the previous one (pvalVsPrevious).
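The relation between the model and GIC columns can be checked by hand. The printed values are consistent with GIC = -2 * loglikelihood + penalty * (number of groups) for penalty = 2. This formula is inferred from the table, not quoted from the package source, so the sketch below should be read as a consistency check only.

```r
# Loglikelihoods copied from the `model` column of the table above.
loglik <- c(-583.0985, -583.1770, -583.5992, -584.7755, -585.9748,
            -588.4350, -591.6986, -597.1654, -605.6794, -619.1469)

# GIC values copied from the same table.
gicTable <- c(1186.197, 1184.354, 1183.198, 1183.551, 1183.949,
              1186.870, 1191.397, 1200.331, 1215.359, 1240.294)

# Each merging step reduces the number of groups by one: 10, 9, ..., 1.
nGroups <- rev(seq_along(loglik))
penalty <- 2  # assumed penalty

gic <- -2 * loglik + penalty * nGroups

# Largest discrepancy; differences come only from rounding in the printout.
max(abs(gic - gicTable))

# Number of groups in the model with the lowest GIC.
nGroups[which.min(gic)]
```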

If we set successive = TRUE, then at the beginning a one-dimensional representation of the response is fitted using isoMDS{MASS}. Next, in each step only groups whose means are close are compared.

fm <- mergeFactors(randSample$response, randSample$factor, 
                   successive = TRUE, 
                   method = "hclust")

mergingHistory(fm, showStats = TRUE) %>% 
    kable()

groupA	groupB	model	GIC	pvalVsFull	pvalVsPrevious
		-583.0985	1186.197	1.0000	1.0000
(F)	(E)	-583.1770	1184.354	0.9868	0.9868
(A)	(B)	-583.5992	1183.198	0.9891	0.8600
(D)	(C)	-584.7755	1183.551	0.9627	0.5461
(G)	(I)	-585.9748	1183.949	0.9489	0.5330
(A)(B)	(F)(E)	-588.9627	1187.926	0.7767	0.1370
(H)	(D)(C)	-591.5613	1191.123	0.6354	0.1824
(A)(B)(F)(E)	(J)	-596.7372	1199.475	0.2691	0.0205
(H)(D)(C)	(A)(B)(F)(E)(J)	-605.6794	1215.359	0.0160	0.0007
(H)(D)(C)(A)(B)(F)(E)(J)	(G)(I)	-619.1469	1240.294	0.0001	0.0000
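The preliminary ordering used with successive = TRUE can be sketched with MASS directly. How factorMerger calls isoMDS internally is not shown in this document, so the inputs below (Euclidean distances between group means, projection onto a single dimension) are assumptions made for illustration, and the toy means are random.

```r
library(MASS)  # provides isoMDS

# Toy group means: k = 6 groups, d = 3 response variables (illustrative only).
set.seed(2)
k <- 6
groupMeans <- matrix(rnorm(k * 3), nrow = k,
                     dimnames = list(LETTERS[1:k], NULL))

# Non-metric MDS of the distances between group means onto one dimension.
projection <- isoMDS(dist(groupMeans), k = 1, trace = FALSE)$points[, 1]

# Ordering of the groups along the fitted axis; in the successive
# approach only neighbours in this ordering would be compared.
groupOrder <- rownames(groupMeans)[order(projection)]
groupOrder
```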

Final clusters

The algorithms implemented in the factorMerger package make it possible to create an unambiguous partition of a factor. Below we show how to extract the partition from the mergeFactors output.

predict new labels for observations

cutTree(fm)
#>   [1] (C)    (J)    (F)(E) (H)    (H)    (H)    (F)(E) (J)    (F)(E) (J)   
#>  [11] (C)    (G)    (A)(B) (F)(E) (G)    (A)(B) (F)(E) (F)(E) (I)    (D)   
#>  [21] (D)    (C)    (G)    (F)(E) (I)    (J)    (J)    (C)    (A)(B) (C)   
#>  [31] (F)(E) (H)    (I)    (F)(E) (F)(E) (F)(E) (C)    (F)(E) (A)(B) (J)   
#>  [41] (F)(E) (G)    (A)(B) (A)(B) (F)(E) (F)(E) (F)(E) (G)    (F)(E) (A)(B)
#>  [51] (D)    (J)    (I)    (J)    (J)    (F)(E) (H)    (D)    (C)    (C)   
#>  [61] (H)    (F)(E) (A)(B) (J)    (F)(E) (A)(B) (G)    (J)    (A)(B) (C)   
#>  [71] (I)    (J)    (A)(B) (I)    (A)(B) (I)    (F)(E) (I)    (F)(E) (H)   
#>  [81] (G)    (I)    (H)    (J)    (I)    (A)(B) (G)    (H)    (H)    (F)(E)
#>  [91] (A)(B) (J)    (F)(E) (A)(B) (G)    (F)(E) (J)    (A)(B) (J)    (H)   
#> Levels: (H) (D) (C) (A)(B) (F)(E) (J) (G) (I)

By default, cutTree returns a factor split for the optimal GIC (with penalty = 2) model. However, we can specify a different statistic (stat = c("loglikelihood", "p-value", "GIC")) to use in cutting. If loglikelihood or p-value is chosen, an exact threshold must be given as the value parameter. Then cutTree returns the factor for the smallest model whose statistic is higher than the threshold. If we choose GIC, then value is interpreted as the GIC penalty.

mH <- mergingHistory(fm, showStats = TRUE)
thres <- mH$model[nrow(mH) / 2]
cutTree(fm, stat = "loglikelihood", value = thres)
#>   [1] (D)(C) (J)    (F)(E) (H)    (H)    (H)    (F)(E) (J)    (F)(E) (J)   
#>  [11] (D)(C) (G)(I) (A)(B) (F)(E) (G)(I) (A)(B) (F)(E) (F)(E) (G)(I) (D)(C)
#>  [21] (D)(C) (D)(C) (G)(I) (F)(E) (G)(I) (J)    (J)    (D)(C) (A)(B) (D)(C)
#>  [31] (F)(E) (H)    (G)(I) (F)(E) (F)(E) (F)(E) (D)(C) (F)(E) (A)(B) (J)   
#>  [41] (F)(E) (G)(I) (A)(B) (A)(B) (F)(E) (F)(E) (F)(E) (G)(I) (F)(E) (A)(B)
#>  [51] (D)(C) (J)    (G)(I) (J)    (J)    (F)(E) (H)    (D)(C) (D)(C) (D)(C)
#>  [61] (H)    (F)(E) (A)(B) (J)    (F)(E) (A)(B) (G)(I) (J)    (A)(B) (D)(C)
#>  [71] (G)(I) (J)    (A)(B) (G)(I) (A)(B) (G)(I) (F)(E) (G)(I) (F)(E) (H)   
#>  [81] (G)(I) (G)(I) (H)    (J)    (G)(I) (A)(B) (G)(I) (H)    (H)    (F)(E)
#>  [91] (A)(B) (J)    (F)(E) (A)(B) (G)(I) (F)(E) (J)    (A)(B) (J)    (H)   
#> Levels: (H) (D)(C) (A)(B) (F)(E) (J) (G)(I)

In this example the data partition is created for the last model on the merging path whose loglikelihood is greater than -585.9748.
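The selection rule can be mimicked on the merging history itself. The sketch below works on loglikelihoods copied from the merging history of fm and uses a hypothetical threshold of -586, chosen so that it does not coincide with any printed value. It illustrates the rule described in the text and is not the implementation of cutTree.

```r
# Merging path of `fm`: loglikelihood after each step, copied from the
# merging history table (step 0 is the full model with 10 groups).
history <- data.frame(
  step   = 0:9,
  loglik = c(-583.0985, -583.1770, -583.5992, -584.7755, -585.9748,
             -588.9627, -591.5613, -596.7372, -605.6794, -619.1469)
)
history$nGroups <- 10 - history$step

# Hypothetical threshold chosen for illustration.
threshold <- -586

# Loglikelihood decreases along the path, so the smallest model above the
# threshold is the last row that still exceeds it.
chosen <- history[max(which(history$loglik > threshold)), ]
chosen$nGroups  # 6 clusters, matching the six levels printed above
```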

get final clusters and clusters dictionary

getOptimalPartition(fm)
#> [1] "(H)"    "(D)"    "(C)"    "(A)(B)" "(F)(E)" "(J)"    "(G)"    "(I)"

Function getOptimalPartition returns a vector with the final cluster names from the factorMerger object.

getOptimalPartitionDf(fm)
#>    orig   pred
#> 1   (C)    (C)
#> 2   (J)    (J)
#> 3   (F) (F)(E)
#> 4   (H)    (H)
#> 7   (E) (F)(E)
#> 12  (G)    (G)
#> 13  (B) (A)(B)
#> 16  (A) (A)(B)
#> 19  (I)    (I)
#> 20  (D)    (D)

Function getOptimalPartitionDf returns a dictionary in a data frame format. Each row gives an original label of a factor level and its new (cluster) label.
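Such a dictionary can be applied to any vector of original labels with a plain lookup. The sketch below rebuilds the printed data frame by hand and relabels a few hypothetical observations; partition and newObs are illustrative names, not objects created by the package.

```r
# Dictionary copied from the getOptimalPartitionDf(fm) output above.
partition <- data.frame(
  orig = c("(C)", "(J)", "(F)", "(H)", "(E)",
           "(G)", "(B)", "(A)", "(I)", "(D)"),
  pred = c("(C)", "(J)", "(F)(E)", "(H)", "(F)(E)",
           "(G)", "(A)(B)", "(A)(B)", "(I)", "(D)"),
  stringsAsFactors = FALSE
)

# Hypothetical new observations carrying original (abbreviated) labels.
newObs <- c("(A)", "(E)", "(H)", "(B)")

# Look up the cluster label of each observation.
newLabels <- partition$pred[match(newObs, partition$orig)]
newLabels
```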

Similarly to cutTree, the functions getOptimalPartition and getOptimalPartitionDf take the arguments stat and value.

Visualizations

We may plot the results using the function plot.

plot(fm, panel = "all", nodesSpacing = "equidistant", color = "cluster")

plot(fmAll, panel = "merging", show = "p-value", 
     nodesSpacing = "effects", color = "cluster", 
     clusterSplit = list(stat = "p-value", alpha = 0.5),
     alpha = 0.5)

plot(fm, color = "cluster", panel = "response")

The heatmap on the right shows the group means of all variables included in the analysis.

plot(fm, color = "cluster", panel = "response", summary = "profile")

In the above plots colours correspond to groups. The plot on the right shows the rankings of group means for all variables included in the algorithm.

It is also possible to plot GIC together with the merging path plot.

plot(fm, panel = "GIC", penalty = 5, 
     clusterSplit = list(stat = "GIC", value = 5))

The model with the lowest GIC is marked.

One-dimensional Gaussian model

oneDimRandSample <- generateSample(1000, 10)

oneDimFm <- mergeFactors(oneDimRandSample$response, oneDimRandSample$factor, 
                         method = "hclust")
mergingHistory(oneDimFm, showStats = TRUE) %>% 
    kable()

groupA	groupB	model	GIC	pvalVsFull	pvalVsPrevious
		-3219.290	6458.581	1.0000	1.0000
(B)	(E)	-3219.292	6456.583	0.9592	0.9592
(B)(E)	(G)	-3219.301	6454.602	0.9897	0.8928
(C)	(F)	-3219.312	6452.624	0.9977	0.8826
(J)	(I)	-3219.470	6450.941	0.9859	0.5748
(A)	(B)(E)(G)	-3219.865	6449.729	0.9507	0.3763
(D)	(H)	-3221.041	6450.083	0.7474	0.1260
(A)(B)(E)(G)	(C)(F)	-3222.905	6451.811	0.4108	0.0540
(D)(H)	(J)(I)	-3231.854	6467.708	0.0016	0.0000
(A)(B)(E)(G)(C)(F)	(D)(H)(J)(I)	-3296.211	6594.422	0.0000	0.0000

plot(oneDimFm, responsePalette = "Reds")

plot(oneDimFm, summary = "boxplot", color = "cluster")

Binomial model

If family = "binomial", the response must take two values: 0 and 1 (1 is interpreted as success).

binomRandSample <- generateSample(1000, 10, distr = "binomial")
table(binomRandSample$response, binomRandSample$factor) %>% 
    kable()

	I	F	A	G	H	J	E	B	C	D
0	84	66	61	55	52	52	51	40	17	8
1	21	29	37	47	46	54	55	69	68	88

binomFm <- mergeFactors(binomRandSample$response, 
                        binomRandSample$factor, 
                        family = "binomial", 
                        successive = TRUE)
mergingHistory(binomFm, showStats = TRUE) %>% 
    kable()

groupA	groupB	model	GIC	pvalVsFull	pvalVsPrevious
		-602.6546	1225.309	1.0000	1.0000
(G)	(H)	-602.6620	1223.324	0.9029	0.9029
(J)	(E)	-602.6715	1221.343	0.9833	0.8907
(G)(H)	(J)(E)	-603.1691	1220.338	0.7942	0.3185
(F)	(A)	-603.7303	1219.461	0.7079	0.2894
(C)	(D)	-606.3384	1222.677	0.1947	0.0224
(I)	(F)(A)	-609.7837	1227.567	0.0269	0.0087
(G)(H)(J)(E)	(B)	-613.3427	1232.685	0.0033	0.0076
(I)(F)(A)	(G)(H)(J)(E)(B)	-633.8721	1271.744	0.0000	0.0000
(I)(F)(A)(G)(H)(J)(E)(B)	(C)(D)	-692.7551	1387.510	0.0000	0.0000
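For this binomial example the two p-value columns appear consistent with a chi-square likelihood ratio test: one degree of freedom per merge for pvalVsPrevious, and as many degrees of freedom as merges performed so far for pvalVsFull. The reference distribution is inferred from the printed numbers rather than taken from the package source, so the sketch below is a consistency check on the first four steps only.

```r
# Loglikelihoods of the full model and the first four merges (from the table).
loglik <- c(-602.6546, -602.6620, -602.6715, -603.1691, -603.7303)

# p-values printed in the table for the same four merges.
pPrevTable <- c(0.9029, 0.8907, 0.3185, 0.2894)
pFullTable <- c(0.9029, 0.9833, 0.7942, 0.7079)

# LRT against the previous model: one parameter is removed per merge.
statPrev <- -2 * diff(loglik)
pPrev <- pchisq(statPrev, df = 1, lower.tail = FALSE)

# LRT against the full model: i parameters are removed after i merges.
statFull <- -2 * (loglik[-1] - loglik[1])
pFull <- pchisq(statFull, df = seq_along(statFull), lower.tail = FALSE)

# Side-by-side comparison; small gaps come from rounding in the printout.
round(cbind(pPrev, pPrevTable, pFull, pFullTable), 4)
```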

plot(binomFm, color = "cluster", GICcolor = "red")

plot(binomFm, color = "cluster", 
     clusterSplit = list(stat = "GIC", penalty = 7), penalty = 7)

plot(binomFm, GICcolor = "red")

Survival model

If family = "survival", the response must be of class Surv.

library(survival)
data(veteran)
survResponse <- Surv(time = veteran$time, 
                 event = veteran$status)
survivalFm <- mergeFactors(response = survResponse, 
                   factor = veteran$celltype, 
                   family = "survival")

mergingHistory(survivalFm, showStats = TRUE) %>% 
    kable()

groupA	groupB	model	GIC	pvalVsFull	pvalVsPrevious
		-493.0247	994.0495	1.0000	1.0000
(smll)	(aden)	-493.1951	992.3902	0.5594	0.5594
(sqms)	(larg)	-493.5304	991.0609	0.6031	0.4128
(sqms)(larg)	(smll)(aden)	-505.4491	1012.8981	0.0000	0.0000

plot(survivalFm)

plot(survivalFm, nodesSpacing = "effects", color = "cluster")

Bibliography

Prochenka, Agnieszka. 2016. “Delete or Merge Regressors algorithm.” In Polska Akademia Nauk, 37, 44–91.

factorMerger: set of tools to support results from post hoc testing

Agnieszka Sitko

2017-05-03