This report lists the candidate variables for the DataSchema variables of the construct sex.
This report is a record of interaction with a data transfer object (dto) produced by ./manipulation/0-ellis-island.R.
The next section recaps this script, exposes the architecture of the DTO, and demonstrates the language of interacting with it.
All data land on Ellis Island.
The script 0-ellis-island.R is the first script in the analytic workflow. It accomplished the following:
- It exported metadata to ./data/shared/derived/meta-data-live.csv, which is updated every time the Ellis Island script is executed.
- It drew on the manually entered values in ./data/shared/meta-data-map.csv. They are used by automatic scripts in later harmonization and analysis.
# load the product of 0-ellis-island.R, a list object containing data and metadata
dto <- readRDS("./data/unshared/derived/dto.rds")
# the list is composed of the following elements
names(dto)
[1] "studyName" "filePath" "unitData" "metaData"
# 1st element - names of the studies as character vector
dto[["studyName"]]
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# 2nd element - file paths of the data files for each study as character vector
dto[["filePath"]]
[1] "./data/unshared/raw/ALSA-Wave1.Final.sav" "./data/unshared/raw/LBSL-Panel2-Wave1.Final.sav"
[3] "./data/unshared/raw/SATSA-Q3.Final.sav" "./data/unshared/raw/SHARE-Israel-Wave1.Final.sav"
[5] "./data/unshared/raw/TILDA-Wave1.Final.sav"
# 3rd element - is a list object containing the following elements
names(dto[["unitData"]])
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# each of these elements is a raw data set of a corresponding study, for example
dplyr::tbl_df(dto[["unitData"]][["lbsl"]])
Source: local data frame [656 x 30]
id AGE94 SEX94 MSTAT94 EDUC94 NOWRK94 SMK94 SMOKE
(int) (int) (int) (fctr) (int) (fctr) (fctr) (fctr)
1 4001026 68 1 divorced 16 no, retired no never smoked
2 4012015 94 2 widowed 12 no, retired no never smoked
3 4012032 94 2 widowed 20 no, retired no don't smoke at present but smoked in the past
4 4022004 93 2 NA NA NA NA never smoked
5 4022026 93 2 widowed 12 no, retired no never smoked
6 4031031 92 1 married 8 no, retired no don't smoke at present but smoked in the past
7 4031035 92 1 widowed 13 no, retired no don't smoke at present but smoked in the past
8 4032201 92 2 NA NA NA NA don't smoke at present but smoked in the past
9 4041062 91 1 widowed 7 NA no don't smoke at present but smoked in the past
10 4042057 91 2 NA NA NA NA NA
.. ... ... ... ... ... ... ... ...
Variables not shown: ALCOHOL (fctr), WINE (int), BEER (int), HARDLIQ (int), SPORT94 (int), FIT94 (int), WALK94 (int),
SPEC94 (int), DANCE94 (int), CHORE94 (int), EXCERTOT (int), EXCERWK (int), HEIGHT94 (int), WEIGHT94 (int), HWEIGHT
(int), HHEIGHT (int), SRHEALTH (fctr), smoke_now (lgl), smoked_ever (lgl), year_of_wave (dbl), age_in_years (dbl),
year_born (dbl)
# 4th element - a dataset of names and labels of raw variables + added metadata for all studies
dto[["metaData"]] %>% dplyr::select(study_name, name, item, construct, type, categories, label_short, label) %>%
DT::datatable(
class = 'cell-border stripe',
caption = "This is the primary metadata file. Edit at `./data/shared/meta-data-map.csv`",
filter = "top",
options = list(pageLength = 6, autoWidth = TRUE)
)
Everybody wants to be somebody.
We query the metadata set to retrieve all variables potentially tapping the construct sex. These are the candidates to enter the DataSchema and contribute to computing harmonized variables.
NOTE: what is retrieved depends on the manually entered values in the column construct of the metadata file ./data/shared/meta-data-map.csv. To specify a different group of variables, edit the metadata, not the script.
meta_data <- dto[["metaData"]] %>%
dplyr::filter(construct %in% c('sex')) %>%
dplyr::select(study_name, name, construct, label_short, categories, url) %>%
dplyr::arrange(construct, study_name)
knitr::kable(meta_data)
study_name | name | construct | label_short | categories | url |
---|---|---|---|---|---|
alsa | SEX | sex | Sex | 2 | |
lbsl | SEX94 | sex | Sex | 2 | |
satsa | SEX | sex | Sex | 2 | |
share | GENDER | sex | Sex | 2 | |
tilda | SEX | sex | Gender | 2 | |
tilda | GD002 | sex | Male or Female? | 2 | |
View descriptives : sex for closer examination of each candidate. After reviewing these descriptives and relevant codebooks, the following operationalization of the harmonized variable female has been adopted:
female
- 0 - FALSE - male (reference group)
- 1 - TRUE - female
These variables will be generated next, in the Development section.
The particular goal of this section is to ensure that the schema used to encode the values of the sex variable is consistent across studies.
In this section we define the schema sets for harmonizing the sex construct (i.e., we specify which variables from which studies will contribute to computing harmonized variables). Each of these schema sets (e.g. "tilda" = c("SEX", "GD002")) will have a particular pattern of possible response values to these variables (e.g. "id_1" = c("SEX"="Male", "GD002"="MALE")), which we export for inspection as .csv tables. We then manually edit these .csv tables, populating new columns that map values of harmonized variables to the specific response pattern of the schema set variables. Finally, we import the harmonization algorithms encoded in the .csv tables and apply them to compute harmonized variables in the dataset combining raw and harmonized variables for the sex construct across studies.
Having all potential variables in categorical format, we have defined the sets of data schema variables thus:
schema_sets <- list(
"alsa" = c("SEX"),
"lbsl" = c("SEX94"),
"satsa" = c("SEX"),
"share" = c("GENDER"),
"tilda" = c("SEX","GD002")
)
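Because the NOTE above asks readers to edit the metadata rather than the script, the same list could in principle be derived from the metadata instead of being typed by hand. The following is a minimal sketch under that assumption: the rows are copied from the table of candidate variables above, and the names meta_sex and derived_sets are hypothetical, not objects used elsewhere in this report.

```r
# Metadata rows copied from the table of candidate variables above
meta_sex <- data.frame(
  study_name = c("alsa", "lbsl", "satsa", "share", "tilda", "tilda"),
  name       = c("SEX", "SEX94", "SEX", "GENDER", "SEX", "GD002"),
  construct  = "sex",
  stringsAsFactors = FALSE
)

# Group the variable names by study; derived_sets is a hypothetical name
derived_sets <- split(meta_sex$name, meta_sex$study_name)

# Inspect the result; it should mirror the hand-written schema_sets
str(derived_sets)
```

Deriving the list this way keeps the metadata file as the single place where the set of contributing variables is declared, at the cost of making the script's behavior less visible when read on its own.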
Each of these schema sets has a particular pattern of possible response values, for example:
# view the profile of responses
dto[["unitData"]][["alsa"]] %>%
dplyr::group_by(SEX) %>%
dplyr::summarize(count = n())
Source: local data frame [2 x 2]
SEX count
(fctr) (int)
1 Male 1056
2 Female 1031
We output these tables into self-standing .csv files, so that we can manually provide the logic for computing harmonized variables.
# define function to extract profiles
response_profile <- function(dto, h_target, study, varnames_values){
ds <- dto[["unitData"]][[study]]
varnames_values <- lapply(varnames_values, as.symbol) # Convert character vector to list of symbols
d <- ds %>%
dplyr::group_by_(.dots=varnames_values) %>%
dplyr::summarize(count = n())
write.csv(d,paste0("./data/meta/response-profiles-live/",h_target,"-",study,".csv"))
}
# extract response profile for data schema set from each study
for(s in names(schema_sets)){
response_profile(dto,
study = s,
h_target = 'sex',
varnames_values = schema_sets[[s]]
)
}
You can examine them in ./data/meta/response-profiles-live/.
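The function above relies on dplyr::group_by_(), which has been deprecated in later dplyr releases. For readers on a recent version, the counting step could be written as sketched below. This assumes dplyr 1.0 or later; the toy data and the helper name response_profile_df are invented for illustration, and the sketch returns the profile instead of writing it to disk.

```r
library(dplyr)

# Toy data standing in for one study's raw data set; values are invented
toy <- data.frame(
  SEX   = c("Male", "Female", "Female", "Male", "Female"),
  GD002 = c("Male", "Female", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count each observed combination of the schema-set variables
response_profile_df <- function(ds, varnames) {
  ds %>%
    group_by(across(all_of(varnames))) %>%
    summarize(count = n(), .groups = "drop")
}

# One row per observed response pattern, with its frequency
profile <- response_profile_df(toy, c("SEX", "GD002"))
profile
```

Returning the data frame and leaving the write.csv() call to the caller makes the helper easier to check interactively before any file is overwritten.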
sex
- 1 - male
- 2 - female
Items that can contribute to generating values for the harmonized variable female are:
dto[["metaData"]] %>%
dplyr::filter(study_name=="alsa", name %in% c("SEX")) %>%
dplyr::select(study_name, name, label,categories)
study_name name label categories
1 alsa SEX Sex 2
We encode the harmonization rule by manually editing the values in a corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "alsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-sex-alsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("SEX"),
harmony_name = "female"
)
Source: local data frame [2 x 3]
Groups: SEX [?]
SEX female n
(chr) (lgl) (int)
1 Female TRUE 1031
2 Male FALSE 1056
# verify
dto[["unitData"]][["alsa"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select_("id", "SEX","female")
id SEX female
1 1702 Female TRUE
2 5692 Female TRUE
3 11111 Female TRUE
4 21491 Female TRUE
5 24292 Female TRUE
6 24971 Male FALSE
7 25041 Male FALSE
8 25781 Female TRUE
9 28681 Male FALSE
10 39831 Female TRUE
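The function recode_with_hrule() is called above, but its definition is not part of this report. As a rough illustration of the idea only, the sketch below looks up each record's response pattern in a rule table and appends the harmonized value. The helper name apply_hrule, the layout of the rule table (one row per response pattern, one column per contributing variable, one column for the harmonized variable), and all data values are assumptions.

```r
# Toy raw data; ids and values are invented
ds <- data.frame(
  id  = c(1, 2, 3, 4),
  SEX = c("Male", "Female", "Female", NA),
  stringsAsFactors = FALSE
)

# Assumed layout of a rule table: one row per response pattern
hrule <- data.frame(
  SEX    = c("Male", "Female", NA),
  female = c(FALSE, TRUE, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: match each row's response pattern against the rule table
apply_hrule <- function(ds, hrule, variable_names, harmony_name) {
  # Collapse the contributing variables into one key per row (NA becomes "NA")
  key_data <- do.call(paste, c(ds[variable_names], sep = "|"))
  key_rule <- do.call(paste, c(hrule[variable_names], sep = "|"))
  # Copy the harmonized value from the matching rule row
  ds[[harmony_name]] <- hrule[[harmony_name]][match(key_data, key_rule)]
  ds
}

ds <- apply_hrule(ds, hrule, variable_names = "SEX", harmony_name = "female")
ds
```

A pattern that is absent from the rule table yields NA for the harmonized variable in this sketch, so unmatched patterns surface as missing values in the frequency counts.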
Items that can contribute to generating values for the harmonized variable female are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "lbsl", construct == "sex") %>%
# dplyr::filter(name %in% c("SEX94")) %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 lbsl SEX94 Sex 2
We encode the harmonization rule by manually editing the values in a corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "lbsl"
path_to_hrule <- "./data/meta/h-rules/h-rules-sex-lbsl.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("SEX94"),
harmony_name = "female"
)
Source: local data frame [2 x 3]
Groups: SEX94 [?]
SEX94 female n
(chr) (lgl) (int)
1 1 FALSE 314
2 2 TRUE 342
# verify
dto[["unitData"]][["lbsl"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select_("id", "SEX94", "female")
id SEX94 female
1 4112089 2 TRUE
2 4121170 1 FALSE
3 4141039 1 FALSE
4 4141201 1 FALSE
5 4142049 2 TRUE
6 4251084 1 FALSE
7 4291081 1 FALSE
8 4372007 2 TRUE
9 4411036 1 FALSE
10 4502034 2 TRUE
Items that can contribute to generating values for the harmonized variable female are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "satsa", construct == "sex") %>%
# dplyr::filter(name %in% c("SEX")) %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 satsa SEX Sex 2
We encode the harmonization rule by manually editing the values in a corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "satsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-sex-satsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("SEX"),
harmony_name = "female"
)
Source: local data frame [2 x 3]
Groups: SEX [?]
SEX female n
(chr) (lgl) (int)
1 female TRUE 887
2 male FALSE 610
# verify
dto[["unitData"]][["satsa"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select_("id", "SEX", "female")
id SEX female
1 17812 female TRUE
2 140821 male FALSE
3 153712 female TRUE
4 158502 female TRUE
5 162832 female TRUE
6 173682 female TRUE
7 180261 male FALSE
8 260402 female TRUE
9 2105422 male FALSE
10 2208562 male FALSE
Items that can contribute to generating values for the harmonized variable female are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "tilda", construct == "sex") %>%
# dplyr::filter(name %in% c("SEX", "GD002")) %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 tilda SEX Gender 2
2 tilda GD002 Male or Female? 2
We encode the harmonization rule by manually editing the values in a corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "tilda"
path_to_hrule <- "./data/meta/h-rules/h-rules-sex-tilda.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("SEX", "GD002"),
harmony_name = "female"
)
Source: local data frame [2 x 4]
Groups: SEX, GD002 [?]
SEX GD002 female n
(chr) (chr) (lgl) (int)
1 Female Female TRUE 4724
2 Male Male FALSE 3780
# verify
dto[["unitData"]][["tilda"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select_("id", "SEX", "GD002", "female")
id SEX GD002 female
1 104201 Female Female TRUE
2 228431 Male Male FALSE
3 240481 Male Male FALSE
4 322682 Male Male FALSE
5 336182 Female Female TRUE
6 342411 Male Male FALSE
7 359731 Male Male FALSE
8 459431 Female Female TRUE
9 467011 Male Male FALSE
10 534131 Male Male FALSE
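In the output above the two tilda items always agree, so only two response patterns occur. When a schema set contains more than one item, it may be worth checking for disagreement before writing the rule. The sketch below shows one way to do that in base R; the data are invented and include one discordant and one partially missing record on purpose.

```r
# Toy data with one discordant and one partially missing record; values are invented
ds <- data.frame(
  id    = 1:5,
  SEX   = c("Male", "Female", "Female", "Male", "Female"),
  GD002 = c("Male", "Female", "Male",   NA,     "Female"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two items, keeping missing values visible
xtab <- table(SEX = ds$SEX, GD002 = ds$GD002, useNA = "ifany")
xtab

# Flag records where both items are observed but disagree
discordant <- !is.na(ds$SEX) & !is.na(ds$GD002) & ds$SEX != ds$GD002
ds[discordant, ]
```

How discordant or partially missing patterns should map onto the harmonized variable is a substantive decision that belongs in the rule table, not in this check.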
At this point the dto[["unitData"]] elements (raw data files for each study) have been augmented with the harmonized variable female. We retrieve harmonized variables to view frequency counts across studies:
dumlist <- list()
for(s in dto[["studyName"]]){
ds <- dto[["unitData"]][[s]]
dumlist[[s]] <- ds[,c("id","female")]
}
ds <- plyr::ldply(dumlist,data.frame,.id = "study_name")
head(ds)
study_name id female
1 alsa 41 FALSE
2 alsa 42 TRUE
3 alsa 61 TRUE
4 alsa 71 FALSE
5 alsa 91 FALSE
6 alsa 121 TRUE
ds$id <- 1:nrow(ds) # some id values might be identical across studies; replace with a running index
table( ds$female, ds$study_name, useNA = "always")
alsa lbsl satsa share tilda <NA>
FALSE 1056 314 610 1139 3780 0
TRUE 1031 342 887 1459 4724 0
<NA> 0 0 0 0 0 0
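The stacking step above uses plyr::ldply(). An equivalent could be sketched with dplyr::bind_rows(), which labels each row with the name of the list element it came from. The toy list below is invented; it reuses two id values across studies to show why the report replaces ids, and the composite key uid is a hypothetical alternative to a running index.

```r
# Toy per-study data; ids 41 and 42 deliberately appear in both studies
unit_data <- list(
  alsa = data.frame(id = c(41, 42, 61), female = c(FALSE, TRUE, TRUE)),
  lbsl = data.frame(id = c(41, 42),     female = c(TRUE, FALSE))
)

# Stack the studies, recording the source study in study_name
stacked <- dplyr::bind_rows(
  lapply(unit_data, function(d) d[, c("id", "female")]),
  .id = "study_name"
)

# Hypothetical composite key that stays unique and keeps the original id
stacked$uid <- paste(stacked$study_name, stacked$id, sep = "-")

# Frequency counts of the harmonized variable by study
counts <- table(stacked$female, stacked$study_name, useNA = "always")
counts
```

A composite key preserves the link back to the source records, which a running index discards.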
Finally, having added the newly created, harmonized variables to the raw source objects, we save the data transfer object.
# Save as a compressed, binary R dataset. It's no longer readable with a text editor, but it saves metadata (e.g., factor information).
saveRDS(dto, file="./data/unshared/derived/dto.rds", compress="xz")
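To see what "saves metadata" means in practice, a round trip through saveRDS() and readRDS() can be checked on a small object. The sketch below writes to a temporary file so that nothing in the project is touched; toy_dto and its contents are invented for illustration.

```r
# Toy object with a factor column and a logical harmonized variable; values are invented
toy_dto <- list(
  studyName = "alsa",
  unitData  = list(alsa = data.frame(
    id     = 1:2,
    SEX    = factor(c("Male", "Female"), levels = c("Male", "Female")),
    female = c(FALSE, TRUE)
  ))
)

# Write to a temporary file with the same compression as above, then read back
path <- tempfile(fileext = ".rds")
saveRDS(toy_dto, file = path, compress = "xz")
restored <- readRDS(path)

# Factor levels and column types should survive the round trip
all.equal(toy_dto, restored)
levels(restored[["unitData"]][["alsa"]]$SEX)
class(restored[["unitData"]][["alsa"]]$female)
```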