Clean a dataset with single column containing multiple variables as delimited key-value
pairs.
This can be done using tidyr
functions spread
, gather
, separate
, etc.
Creating a test dataset with multiple variables of interest -
var1_text = c("Sachin", "Sourav", "Rahul", "Laxman")
var2_text = c("Virat", "Jinx", "Pujara", "Rohit")
sep1 = ":"
sep2 = "|"
no_rows = 100
set.seed(9653)
d1 = data.frame(id = 1:no_rows,
retired = sample(x = var1_text, size = 10, replace = TRUE),
current = sample(x = var2_text, size = 10, replace = TRUE),
garbage = paste0("my_var", 1:no_rows),
stringsAsFactors = FALSE)
knitr::kable(head(d1))
id | retired | current | garbage |
---|---|---|---|
1 | Sachin | Pujara | my_var1 |
2 | Rahul | Virat | my_var2 |
3 | Laxman | Virat | my_var3 |
4 | Sourav | Virat | my_var4 |
5 | Sachin | Virat | my_var5 |
6 | Sachin | Virat | my_var6 |
Combining variable names (keys
) with variable values
to make it slightly unclean.
d2 = d1
var_names = names(d1)[-1]
d2$var1_pair = paste(var_names[1], d2$retired, sep = sep1)
d2$var2_pair = paste(var_names[2], d2$current, sep = sep1)
d2$var3_pair = paste(var_names[3], d2$garbage, sep = sep1)
d2 = d2[, c("id", "var1_pair", "var2_pair", "var3_pair")]
knitr::kable(head(d2))
id | var1_pair | var2_pair | var3_pair |
---|---|---|---|
1 | retired:Sachin | current:Pujara | garbage:my_var1 |
2 | retired:Rahul | current:Virat | garbage:my_var2 |
3 | retired:Laxman | current:Virat | garbage:my_var3 |
4 | retired:Sourav | current:Virat | garbage:my_var4 |
5 | retired:Sachin | current:Virat | garbage:my_var5 |
6 | retired:Sachin | current:Virat | garbage:my_var6 |
Combining the key-value
pairs for these variables into a single variable separated by a delimiter. Also, removing certain pairs and altering the order of the pairs.
d3 = d2
d3$text = NA
d3$text[4 * (1:25) - 3] = paste(d3$var1_pair[4 * (1:25) - 3],
d3$var2_pair[4 * (1:25) - 3],
d3$var3_pair[4 * (1:25) - 3],
sep = sep2)
d3$text[4 * (1:25) - 2] = paste(d3$var2_pair[4 * (1:25) - 2],
d3$var3_pair[4 * (1:25) - 2],
sep = sep2)
d3$text[4 * (1:25) - 1] = paste(d3$var3_pair[4 * (1:25) - 1],
d3$var2_pair[4 * (1:25) - 1],
d3$var1_pair[4 * (1:25) - 1],
sep = sep2)
d3$text[4 * (1:25)] = d3$var2_pair[4 * (1:25)]
d3 = d3[ , c("id", "text")]
The test dataset now looks like -
id | text |
---|---|
1 | retired:Sachin|current:Pujara|garbage:my_var1 |
2 | current:Virat|garbage:my_var2 |
3 | garbage:my_var3|current:Virat|retired:Laxman |
4 | current:Virat |
5 | retired:Sachin|current:Virat|garbage:my_var5 |
6 | current:Virat|garbage:my_var6 |
Couple of assumptions about this dataset make this problem a non-trivial one -
key-value
pairs).retired
is not always followed by current
and current
is not always followed by garbage
.library(plyr)
library(dplyr)
library(tidyr)
library(stringr)
A 2 step solution:
key: value pairs
into different columns i.e. key1: value1 | key2: value2
—-> key1: value1
and key2: value2
First we need to know at the most how many key value pairs could be there in a single text
instance. Once we know the max number of vars, we can use tidyr::separate
function to separate these pairs -
# Finding max number of key: val pairs in a row
len = max(str_count(string = d3$text, pattern = paste0("[",sep2,"]")))
vec_names = paste0("X", 1:(len + 1))
d2_rev = d3 %>%
separate(col = "text", into = vec_names, sep = paste0("[",sep2,"]"), extra = "drop")
Result after this step -
id | X1 | X2 | X3 |
---|---|---|---|
1 | retired:Sachin | current:Pujara | garbage:my_var1 |
2 | current:Virat | garbage:my_var2 | NA |
3 | garbage:my_var3 | current:Virat | retired:Laxman |
4 | current:Virat | NA | NA |
5 | retired:Sachin | current:Virat | garbage:my_var5 |
6 | current:Virat | garbage:my_var6 | NA |
value
s from key
s i.e. key1: value1
—–> key1
and value1
The next step is to separate key: val
pairs such that the val
values are stored in corresponding key
columns. For this, we’ll first need to go from the wide version that we have now to long version of the dataset using tidyr::gather
. After that, it’s again applying tidyr::separate
to separate key
s from value
s. The last step is to convert this long dataset back to a wide one using tidyr::spread
. Too long an explanation, better see the code -
d3_rev = d2_rev %>%
gather(key = "temp_var", value = "kv_pair", -id, na.rm = TRUE) %>%
select(-temp_var) %>%
separate(col = "kv_pair", into = c("key", "val"), sep = paste0("[",sep1,"]"), extra = "drop") %>%
spread(key = "key", value = "val")
Final result -
id | current | garbage | retired |
---|---|---|---|
1 | Pujara | my_var1 | Sachin |
2 | Virat | my_var2 | NA |
3 | Virat | my_var3 | Laxman |
4 | Virat | NA | NA |
5 | Virat | my_var5 | Sachin |
6 | Virat | my_var6 | NA |
This is very similar to the original clean dataset d1
except for the column order and some NA
values (generated cause of missing values we introduced while creating the test dataset, in order to make this example more generic).
id | retired | current | garbage |
---|---|---|---|
1 | Sachin | Pujara | my_var1 |
2 | Rahul | Virat | my_var2 |
3 | Laxman | Virat | my_var3 |
4 | Sourav | Virat | my_var4 |
5 | Sachin | Virat | my_var5 |
6 | Sachin | Virat | my_var6 |