This report lists the candidate variables for the DataSchema variables of the construct physique.
This report is a record of interaction with a data transfer object (dto) produced by ./manipulation/0-ellis-island.R.
The next section recaps this script, exposes the architecture of the DTO, and demonstrates the language of interacting with it.
All data land on Ellis Island.
The script 0-ellis-island.R is the first script in the analytic workflow. It accomplished the following:
./data/shared/derived/meta-data-live.csv, which is updated every time the Ellis Island script is executed.
./data/shared/meta-data-map.csv. They are used by automatic scripts in later harmonization and analysis.
# load the product of 0-ellis-island.R, a list object containing data and metadata
dto <- readRDS("./data/unshared/derived/dto.rds")
# the list is composed of the following elements
names(dto)
[1] "studyName" "filePath" "unitData" "metaData"
# 1st element - names of the studies as character vector
dto[["studyName"]]
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# 2nd element - file paths of the data files for each study as character vector
dto[["filePath"]]
[1] "./data/unshared/raw/ALSA-Wave1.Final.sav" "./data/unshared/raw/LBSL-Panel2-Wave1.Final.sav"
[3] "./data/unshared/raw/SATSA-Q3.Final.sav" "./data/unshared/raw/SHARE-Israel-Wave1.Final.sav"
[5] "./data/unshared/raw/TILDA-Wave1.Final.sav"
# 3rd element - is a list object containing the following elements
names(dto[["unitData"]])
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# each of these elements is a raw data set of a corresponding study, for example
dplyr::tbl_df(dto[["unitData"]][["lbsl"]])
Source: local data frame [656 x 38]
id AGE94 SEX94 MSTAT94 EDUC94 NOWRK94 SMK94 SMOKE
(int) (int) (int) (fctr) (int) (fctr) (fctr) (fctr)
1 4001026 68 1 divorced 16 no, retired no never smoked
2 4012015 94 2 widowed 12 no, retired no never smoked
3 4012032 94 2 widowed 20 no, retired no don't smoke at present but smoked in the past
4 4022004 93 2 NA NA NA NA never smoked
5 4022026 93 2 widowed 12 no, retired no never smoked
6 4031031 92 1 married 8 no, retired no don't smoke at present but smoked in the past
7 4031035 92 1 widowed 13 no, retired no don't smoke at present but smoked in the past
8 4032201 92 2 NA NA NA NA don't smoke at present but smoked in the past
9 4041062 91 1 widowed 7 NA no don't smoke at present but smoked in the past
10 4042057 91 2 NA NA NA NA NA
.. ... ... ... ... ... ... ... ...
Variables not shown: ALCOHOL (fctr), WINE (int), BEER (int), HARDLIQ (int), SPORT94 (int), FIT94 (int), WALK94 (int),
SPEC94 (int), DANCE94 (int), CHORE94 (int), EXCERTOT (int), EXCERWK (int), HEIGHT94 (int), WEIGHT94 (int), HWEIGHT
(int), HHEIGHT (int), SRHEALTH (fctr), smoke_now (lgl), smoked_ever (lgl), year_of_wave (dbl), age_in_years (dbl),
year_born (dbl), female (lgl), marital (chr), single (lgl), educ3 (chr), current_work_2 (lgl), current_drink (lgl),
sedentary (lgl), poor_health (lgl)
# 4th element - a dataset of names and labels of raw variables + added metadata for all studies
dto[["metaData"]] %>% dplyr::select(study_name, name, item, construct, type, categories, label_short, label) %>%
  DT::datatable(
    class = 'cell-border stripe',
    caption = "This is the primary metadata file. Edit at `./data/shared/meta-data-map.csv`",
    filter = "top",
    options = list(pageLength = 6, autoWidth = TRUE)
  )
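The structure shown above can be mimicked with a toy object. The following is a minimal sketch, not the actual Ellis Island script: the two small studies, their values, and their file paths are invented for illustration, and only the four element names are taken from the output above.

```r
# Toy DTO mirroring the four elements listed by names(dto).
# Study contents and paths are invented for illustration only.
toy_dto <- list(
  studyName = c("study_a", "study_b"),
  filePath  = c("./raw/study_a.sav", "./raw/study_b.sav"),  # hypothetical paths
  unitData  = list(
    study_a = data.frame(id = 1:3, WEIGHT = c(70, 82, NA)),
    study_b = data.frame(id = 1:2, WT_LB = c(150, 180))
  ),
  metaData  = data.frame(
    study_name = c("study_a", "study_b"),
    name       = c("WEIGHT", "WT_LB"),
    construct  = c("physique", "physique"),
    stringsAsFactors = FALSE
  )
)

names(toy_dto)
# unit-level data of one study, addressed by study name
toy_dto[["unitData"]][["study_b"]]
# number of rows in each study's data set
n_rows <- sapply(toy_dto[["unitData"]], nrow)
n_rows
```

Addressing every study by name through the same list is what allows later steps to loop over dto[["studyName"]].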
Everybody wants to be somebody.
We query the metadata set to retrieve all variables potentially tapping the construct physique. These are the candidates to enter the DataSchema and contribute to computing harmonized variables.
NOTE: what is being retrieved depends on the manually entered values in the column construct of the metadata file ./data/shared/meta-data-map.csv. To specify a different group of variables, edit the metadata, not the script.
meta_data <- dto[["metaData"]] %>%
dplyr::filter(construct %in% c('physique')) %>%
dplyr::select(study_name, name, construct, label_short, categories, url) %>%
dplyr::arrange(construct, study_name)
knitr::kable(meta_data)
study_name | name | construct | label_short | categories | url |
---|---|---|---|---|---|
alsa | WEIGHT | physique | Weight in kilograms | NA | |
lbsl | HEIGHT94 | physique | Height in Inches | NA | |
lbsl | WEIGHT94 | physique | Weight in Pounds | NA | |
lbsl | HWEIGHT | physique | Self-reported weight in pounds | NA | |
lbsl | HHEIGHT | physique | Self-reported height in inches | NA | |
satsa | GHTCM | physique | | NA | |
satsa | GWTKG | physique | | NA | |
satsa | GPI | physique | | NA | |
share | PH0130 | physique | how tall are you? | NA | |
share | PH0120 | physique | weight of respondent | NA | |
tilda | SR.HEIGHT.CENTIMETRES | physique | Height Centimetres | NA | |
tilda | HEIGHT | physique | Respondent height | NA | |
tilda | SR.WEIGHT.KILOGRAMMES | physique | Weight Kilogrammes | NA | |
tilda | WEIGHT | physique | Respondent weight | NA | |
View descriptives: physique for closer examination of each candidate.
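To illustrate the note on metadata-driven retrieval, here is a small sketch in base R on an invented metadata table. The rows and the helper `retrieve_candidates()` are hypothetical and not part of the project scripts; only the column names follow the report. The query stays fixed, and re-tagging a variable in the construct column changes what it returns.

```r
# Invented metadata rows; only the column names follow the report.
meta <- data.frame(
  study_name = c("alsa", "lbsl", "lbsl"),
  name       = c("WEIGHT", "HEIGHT94", "SMK94"),
  construct  = c("physique", "physique", "smoking"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the result depends only on the metadata content.
retrieve_candidates <- function(meta, construct_name) {
  out <- meta[meta$construct %in% construct_name,
              c("study_name", "name", "construct")]
  out[order(out$construct, out$study_name), ]
}

before <- retrieve_candidates(meta, "physique")$name

# Editing the metadata (not the query) changes the result.
meta$construct[meta$name == "HEIGHT94"] <- "other"
after <- retrieve_candidates(meta, "physique")$name

before
after
```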
After reviewing descriptives and relevant codebooks, the following operationalization of the harmonized variables for physique has been adopted:
bmi
from metric:
bmi = weight_kg / height_m ^ 2
from imperial:
bmi = weight_lb * 703 / height_in ^ 2
These variables will be generated next, in the Development section.
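The two formulas express the same quantity in different units. As a sketch (the function names `bmi_metric()` and `bmi_imperial()` are ours, not taken from the project scripts), the factor 703 follows from the unit definitions 1 lb = 0.45359237 kg and 1 in = 0.0254 m:

```r
bmi_metric   <- function(weight_kg, height_m)  weight_kg / height_m^2
bmi_imperial <- function(weight_lb, height_in) weight_lb * 703 / height_in^2

# Conversion factor implied by the unit definitions; 703 is its rounded value
factor_exact <- 0.45359237 / 0.0254^2
factor_exact

# The same (invented) person expressed in both unit systems
weight_lb <- 160
height_in <- 67
weight_kg <- weight_lb * 0.45359237
height_m  <- height_in * 0.0254

bmi_from_imperial <- bmi_imperial(weight_lb, height_in)
bmi_from_metric   <- bmi_metric(weight_kg, height_m)
c(imperial = bmi_from_imperial, metric = bmi_from_metric)
```

Rounding the factor to 703 changes the result by roughly 0.01 percent, which is negligible next to measurement and self-report error.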
The particular goal of this section is to ensure that the schema used to encode the values of the physique variable is consistent across studies.
In this section we will define the schema sets for harmonizing the physique construct (i.e. specify which variables from which studies will contribute to computing harmonized variables). Each of these schema sets will have a particular pattern of possible response values to these variables, which we will export for inspection as .csv tables. We will then manually edit these .csv tables, populating new columns that map values of harmonized variables to the specific response pattern of the schema set variables. Finally, we will import the harmonization algorithms encoded in the .csv tables and apply them to compute harmonized variables in the dataset combining raw and harmonized variables for the physique construct across studies.
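The export, edit, import, and apply cycle described above can be sketched on invented data. Everything in this example is an assumption made for illustration: the items `item_a` and `item_b`, their values, the column `harmonized`, and the temporary file. The manual edit of the .csv table is simulated in code.

```r
# Invented unit-level data with two categorical items (hypothetical names)
ds <- data.frame(
  id     = 1:6,
  item_a = c("yes", "no", "no", "yes", NA, "no"),
  item_b = c("daily", "never", "never", "weekly", NA, "never"),
  stringsAsFactors = FALSE
)

# 1. Tabulate the observed response patterns of the schema set and export them
patterns <- unique(ds[, c("item_a", "item_b")])
csv_path <- tempfile(fileext = ".csv")
write.csv(patterns, csv_path, row.names = FALSE)

# 2. Simulate the manual edit: add a column mapping each pattern
#    to a value of the harmonized variable
rules <- read.csv(csv_path, stringsAsFactors = FALSE)
rules$harmonized <- ifelse(is.na(rules$item_a), NA, rules$item_a == "yes")
write.csv(rules, csv_path, row.names = FALSE)

# 3. Import the edited rules and apply them to the unit-level data
rules <- read.csv(csv_path, stringsAsFactors = FALSE)
ds_h  <- merge(ds, rules, by = c("item_a", "item_b"), all.x = TRUE)
ds_h  <- ds_h[order(ds_h$id), c("id", "item_a", "item_b", "harmonized")]
ds_h
```

Keeping the rules in a table rather than in code means a change of mapping requires editing one row of the .csv file, not the script.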
Having all potential variables in categorical format, we have defined the sets of data schema variables thus:
bmi
from metric:
bmi = weight_kg / height_m ^ 2
from imperial:
bmi = weight_lb * 703 / height_in ^ 2
Items that can contribute to generating values for the harmonized variable bmi are:
dto[["metaData"]] %>%
dplyr::filter(study_name=="alsa", construct %in% c("physique")) %>%
dplyr::select(study_name, name, label,categories)
study_name name label categories
1 alsa WEIGHT Weight in kilograms NA
ALSA is lacking the measure of height. It is not possible to calculate bmi for this study.
dto[["unitData"]][["alsa"]] <- dto[["unitData"]][["alsa"]] %>%
  dplyr::mutate(
    HEIGHT = NA_real_, # height was not measured in ALSA
    bmi    = WEIGHT / HEIGHT^2
  )
# verify
dto[["unitData"]][["alsa"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "WEIGHT","HEIGHT", "bmi")
id WEIGHT HEIGHT bmi
1 10761 73.2 NA NA
2 18821 76.4 NA NA
3 19401 89.1 NA NA
4 19812 65.9 NA NA
5 21501 66.4 NA NA
6 25131 88.2 NA NA
7 26302 65.5 NA NA
8 29251 63.6 NA NA
9 35831 NA NA NA
10 42861 66.8 NA NA
Items that can contribute to generating values for the harmonized variable bmi are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "lbsl", construct == "physique") %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 lbsl HEIGHT94 Height in Inches NA
2 lbsl WEIGHT94 Weight in Pounds NA
3 lbsl HWEIGHT Self-reported weight in pounds NA
4 lbsl HHEIGHT Self-reported height in inches NA
We compute bmi according to the declared formula:
dto[["unitData"]][["lbsl"]] <- dto[["unitData"]][["lbsl"]] %>%
dplyr::mutate(bmi = (WEIGHT94 * 703)/ (HEIGHT94^2))
# verify
dto[["unitData"]][["lbsl"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select_("id", "WEIGHT94","HEIGHT94", "bmi")
id WEIGHT94 HEIGHT94 bmi
1 4051040 NA NA NA
2 4101168 160 67 25.05681
3 4111082 170 71 23.70760
4 4121036 150 67 23.49076
5 4141059 195 69 28.79332
6 4152078 160 61 30.22843
7 4162001 150 65 24.95858
8 4312027 140 64 24.02832
9 4462037 145 68 22.04477
10 4472001 175 62 32.00442
# graph
histogram_continuous(dto[["unitData"]][["lbsl"]],"bmi")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
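As a quick arithmetic check of the imperial formula, the first few complete rows of the LBSL verification listing can be recomputed directly. The data frame below simply retypes those printed values; nothing else is assumed.

```r
# Values retyped from the LBSL verification listing
check <- data.frame(
  id       = c(4101168, 4111082, 4121036),
  WEIGHT94 = c(160, 170, 150),   # pounds
  HEIGHT94 = c(67, 71, 67),      # inches
  bmi      = c(25.05681, 23.70760, 23.49076)
)

recomputed <- check$WEIGHT94 * 703 / check$HEIGHT94^2
round(recomputed, 5)

# largest absolute difference from the printed values (rounding error only)
max_diff <- max(abs(recomputed - check$bmi))
max_diff
```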
Items that can contribute to generating values for the harmonized variable bmi are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "satsa", construct == "physique") %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 satsa GHTCM NA
2 satsa GWTKG NA
3 satsa GPI NA
We compute bmi according to the declared formula:
dto[["unitData"]][["satsa"]] <- dto[["unitData"]][["satsa"]] %>%
dplyr::mutate(bmi = (GWTKG)/ ((GHTCM/100)^2))
# verify
dto[["unitData"]][["satsa"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select_("id", "GWTKG","GHTCM","GPI", "bmi")
id GWTKG GHTCM GPI bmi
1 16261 85 170 29.41016 29.41176
2 122131 61 151 26.75000 26.75321
3 134301 72 174 23.77734 23.78121
4 147872 75 NA NA NA
5 153711 58 160 22.65234 22.65625
6 175251 70 166 25.40234 25.40282
7 191271 73 168 25.86328 25.86451
8 225241 50 154 21.08203 21.08281
9 2176041 65 172 21.96875 21.97134
10 2301901 67 171 22.91016 22.91303
# graph
histogram_continuous(dto[["unitData"]][["satsa"]],"bmi")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
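In the sampled SATSA rows, the computed bmi tracks the study's own variable GPI closely. The metadata give no label for GPI, so reading it as a study-supplied body mass index is an assumption. The sketch below only quantifies the agreement for the complete rows printed in the verification listing.

```r
# Complete rows retyped from the SATSA verification listing
satsa_check <- data.frame(
  GWTKG = c(85, 61, 72, 58, 70, 73, 50, 65, 67),           # kilograms
  GHTCM = c(170, 151, 174, 160, 166, 168, 154, 172, 171),  # centimetres
  GPI   = c(29.41016, 26.75000, 23.77734, 22.65234, 25.40234,
            25.86328, 21.08203, 21.96875, 22.91016)
)

bmi <- satsa_check$GWTKG / (satsa_check$GHTCM / 100)^2

# largest absolute discrepancy between the computed bmi and GPI
max_gap <- max(abs(bmi - satsa_check$GPI))
max_gap
```

In these nine rows the discrepancy stays below 0.005, which suggests the two variables measure the same quantity up to rounding or storage precision.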
Items that can contribute to generating values for the harmonized variable bmi are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "tilda", construct == "physique") %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 tilda SR.HEIGHT.CENTIMETRES Height Centimetres NA
2 tilda HEIGHT Respondent height NA
3 tilda SR.WEIGHT.KILOGRAMMES Weight Kilogrammes NA
4 tilda WEIGHT Respondent weight NA
We compute bmi according to the declared formula:
ds <- dto[["unitData"]][["tilda"]]
# use WEIGHT and HEIGHT when observed; otherwise fall back to the SR.* variables
ds <- ds %>%
  dplyr::mutate(
    weight = ifelse(
      !is.na(WEIGHT), WEIGHT, ifelse(
        !is.na(SR.WEIGHT.KILOGRAMMES), SR.WEIGHT.KILOGRAMMES, NA)),
    height = ifelse(
      !is.na(HEIGHT), HEIGHT, ifelse(
        !is.na(SR.HEIGHT.CENTIMETRES), SR.HEIGHT.CENTIMETRES, NA))
  )
# height is in centimetres, so convert to metres before squaring
ds <- ds %>%
  dplyr::mutate(bmi = (weight)/ ((height/100)^2))
dto[["unitData"]][["tilda"]] <- ds
# verify
dto[["unitData"]][["tilda"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select_("id","weight", "height", "bmi")
id weight height bmi
1 48281 NA NA NA
2 61281 70.10 164.8 25.81093
3 102151 78.40 166.4 28.31454
4 283521 NA NA NA
5 291281 80.25 165.3 29.36969
6 305281 NA NA NA
7 405012 53.75 156.0 22.08662
8 433381 74.40 165.4 27.19581
9 448912 85.20 168.6 29.97260
10 457072 91.10 174.2 30.02079
# graph
histogram_continuous(dto[["unitData"]][["tilda"]],"bmi")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
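The nested ifelse() calls in the TILDA step implement a fallback: take the first variable when it is observed, otherwise the second. A minimal sketch on invented values shows that the inner ifelse() is redundant and that a single call gives the same result. The helper name `first_observed()` is hypothetical; dplyr::coalesce() offers the same behaviour for vectors of a common type.

```r
# Invented values standing in for a primary variable and its fallback
primary  <- c(70.1, NA, 80.3, NA)
fallback <- c(71.0, 65.5, NA, NA)

# Pattern used in the report
nested <- ifelse(!is.na(primary), primary,
                 ifelse(!is.na(fallback), fallback, NA))

# Hypothetical helper with the same behaviour: when x is missing,
# y is returned whether or not y itself is missing
first_observed <- function(x, y) ifelse(!is.na(x), x, y)
simple <- first_observed(primary, fallback)

nested
same_result <- identical(nested, simple)
same_result
```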
At this point the dto[["unitData"]] elements (raw data files for each study) have been augmented with the harmonized variable bmi.
dumlist <- list()
for(s in dto[["studyName"]]){
ds <- dto[["unitData"]][[s]]
dumlist[[s]] <- ds[,c("id","bmi")]
}
ds <- plyr::ldply(dumlist,data.frame,.id = "study_name")
head(ds)
study_name id bmi
1 alsa 41 NA
2 alsa 42 NA
3 alsa 61 NA
4 alsa 71 NA
5 alsa 91 NA
6 alsa 121 NA
ds$id <- seq_len(nrow(ds)) # ids may repeat across studies; replace them with a unique row index
for(s in dto[["studyName"]]){
print(s)
print(summary(dto[["unitData"]][[s]]$bmi))
}
[1] "alsa"
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
NA NA NA NaN NA NA 2087
[1] "lbsl"
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
9.683 23.050 25.740 26.540 29.210 48.820 105
[1] "satsa"
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
14.70 22.45 24.39 24.75 26.78 48.90 46
[1] "share"
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
14.69 23.88 26.09 26.67 29.00 53.78 82
[1] "tilda"
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
16.46 25.23 28.09 28.64 31.36 56.77 2372
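The stacking step above uses a loop together with plyr::ldply(). As an alternative sketch in base R, on an invented two-study list (study names and values are hypothetical), the same long table and per-study summaries could be obtained as follows:

```r
# Invented stand-in for dto[["unitData"]]
unit_data <- list(
  study_a = data.frame(id = 1:3, bmi = c(22.5, NA, 30.1)),
  study_b = data.frame(id = 1:2, bmi = c(27.0, 24.4))
)

# Stack id and bmi from every study, recording the study of origin
stacked <- do.call(rbind, lapply(names(unit_data), function(s) {
  data.frame(study_name = s, unit_data[[s]][, c("id", "bmi")],
             stringsAsFactors = FALSE)
}))

# ids can repeat across studies, so replace them with a unique row index
stacked$id <- seq_len(nrow(stacked))

# Number of missing bmi values and mean bmi per study
na_count <- tapply(stacked$bmi, stacked$study_name, function(x) sum(is.na(x)))
mean_bmi <- tapply(stacked$bmi, stacked$study_name, mean, na.rm = TRUE)
na_count
mean_bmi
```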
Finally, having added the newly created harmonized variables to the raw source objects, we save the data transfer object.
# Save as a compressed, binary R dataset. It's no longer readable with a text editor, but it saves metadata (eg, factor information).
saveRDS(dto, file="./data/unshared/derived/dto.rds", compress="xz")
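A short sketch of why the .rds format suits this step: it preserves R-specific attributes such as factor levels, which a plain-text export would not carry. The object and the temporary file below are invented for illustration.

```r
# Invented data with a factor that has an unused level
toy <- data.frame(
  id    = 1:3,
  SMOKE = factor(c("never", "past", "never"),
                 levels = c("never", "past", "current"))
)

rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, file = rds_path, compress = "xz")
restored <- readRDS(rds_path)

# The unused level "current" survives the round trip
levels(restored$SMOKE)
round_trip_ok <- identical(toy, restored)
round_trip_ok
```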