Clean a dataset with single column containing multiple variables as delimited key-value pairs.
This can be done using tidyr functions spread, gather, separate, etc.
Creating a test dataset with multiple variables of interest -
var1_text = c("Sachin", "Sourav", "Rahul", "Laxman")
var2_text = c("Virat", "Jinx", "Pujara", "Rohit")
sep1 = ":"
sep2 = "|"
no_rows = 100
set.seed(9653)
d1 = data.frame(id = 1:no_rows,
retired = sample(x = var1_text, size = 10, replace = TRUE),
current = sample(x = var2_text, size = 10, replace = TRUE),
garbage = paste0("my_var", 1:no_rows),
stringsAsFactors = FALSE)
knitr::kable(head(d1))
| id | retired | current | garbage |
|---|---|---|---|
| 1 | Sachin | Pujara | my_var1 |
| 2 | Rahul | Virat | my_var2 |
| 3 | Laxman | Virat | my_var3 |
| 4 | Sourav | Virat | my_var4 |
| 5 | Sachin | Virat | my_var5 |
| 6 | Sachin | Virat | my_var6 |
Combining variable names (keys) with variable values to make it slightly unclean.
d2 = d1
var_names = names(d1)[-1]
d2$var1_pair = paste(var_names[1], d2$retired, sep = sep1)
d2$var2_pair = paste(var_names[2], d2$current, sep = sep1)
d2$var3_pair = paste(var_names[3], d2$garbage, sep = sep1)
d2 = d2[, c("id", "var1_pair", "var2_pair", "var3_pair")]
knitr::kable(head(d2))
| id | var1_pair | var2_pair | var3_pair |
|---|---|---|---|
| 1 | retired:Sachin | current:Pujara | garbage:my_var1 |
| 2 | retired:Rahul | current:Virat | garbage:my_var2 |
| 3 | retired:Laxman | current:Virat | garbage:my_var3 |
| 4 | retired:Sourav | current:Virat | garbage:my_var4 |
| 5 | retired:Sachin | current:Virat | garbage:my_var5 |
| 6 | retired:Sachin | current:Virat | garbage:my_var6 |
Combining the key-value pairs for these variables into a single variable separated by a delimiter. Also, removing certain pairs and altering the order of the pairs.
d3 = d2
d3$text = NA
d3$text[4 * (1:25) - 3] = paste(d3$var1_pair[4 * (1:25) - 3],
d3$var2_pair[4 * (1:25) - 3],
d3$var3_pair[4 * (1:25) - 3],
sep = sep2)
d3$text[4 * (1:25) - 2] = paste(d3$var2_pair[4 * (1:25) - 2],
d3$var3_pair[4 * (1:25) - 2],
sep = sep2)
d3$text[4 * (1:25) - 1] = paste(d3$var3_pair[4 * (1:25) - 1],
d3$var2_pair[4 * (1:25) - 1],
d3$var1_pair[4 * (1:25) - 1],
sep = sep2)
d3$text[4 * (1:25)] = d3$var2_pair[4 * (1:25)]
d3 = d3[ , c("id", "text")]
The test dataset now looks like -
| id | text |
|---|---|
| 1 | retired:Sachin|current:Pujara|garbage:my_var1 |
| 2 | current:Virat|garbage:my_var2 |
| 3 | garbage:my_var3|current:Virat|retired:Laxman |
| 4 | current:Virat |
| 5 | retired:Sachin|current:Virat|garbage:my_var5 |
| 6 | current:Virat|garbage:my_var6 |
Couple of assumptions about this dataset make this problem a non-trivial one -
key-value pairs).retired is not always followed by current and current is not always followed by garbage.library(plyr)
library(dplyr)
library(tidyr)
library(stringr)
A 2 step solution:
key: value pairs into different columns i.e. key1: value1 | key2: value2 —-> key1: value1 and key2: value2First we need to know at the most how many key value pairs could be there in a single text instance. Once we know the max number of vars, we can use tidyr::separate function to separate these pairs -
# Finding max number of key: val pairs in a row
len = max(str_count(string = d3$text, pattern = paste0("[",sep2,"]")))
vec_names = paste0("X", 1:(len + 1))
d2_rev = d3 %>%
separate(col = "text", into = vec_names, sep = paste0("[",sep2,"]"), extra = "drop")
Result after this step -
| id | X1 | X2 | X3 |
|---|---|---|---|
| 1 | retired:Sachin | current:Pujara | garbage:my_var1 |
| 2 | current:Virat | garbage:my_var2 | NA |
| 3 | garbage:my_var3 | current:Virat | retired:Laxman |
| 4 | current:Virat | NA | NA |
| 5 | retired:Sachin | current:Virat | garbage:my_var5 |
| 6 | current:Virat | garbage:my_var6 | NA |
values from keys i.e. key1: value1 —–> key1 and value1The next step is to separate key: val pairs such that the val values are stored in corresponding key columns. For this, we’ll first need to go from the wide version that we have now to long version of the dataset using tidyr::gather. After that, it’s again applying tidyr::separate to separate keys from values. The last step is to convert this long dataset back to a wide one using tidyr::spread. Too long an explanation, better see the code -
d3_rev = d2_rev %>%
gather(key = "temp_var", value = "kv_pair", -id, na.rm = TRUE) %>%
select(-temp_var) %>%
separate(col = "kv_pair", into = c("key", "val"), sep = paste0("[",sep1,"]"), extra = "drop") %>%
spread(key = "key", value = "val")
Final result -
| id | current | garbage | retired |
|---|---|---|---|
| 1 | Pujara | my_var1 | Sachin |
| 2 | Virat | my_var2 | NA |
| 3 | Virat | my_var3 | Laxman |
| 4 | Virat | NA | NA |
| 5 | Virat | my_var5 | Sachin |
| 6 | Virat | my_var6 | NA |
This is very similar to the original clean dataset d1 except for the column order and some NA values (generated cause of missing values we introduced while creating the test dataset, in order to make this example more generic).
| id | retired | current | garbage |
|---|---|---|---|
| 1 | Sachin | Pujara | my_var1 |
| 2 | Rahul | Virat | my_var2 |
| 3 | Laxman | Virat | my_var3 |
| 4 | Sourav | Virat | my_var4 |
| 5 | Sachin | Virat | my_var5 |
| 6 | Sachin | Virat | my_var6 |