This report lists the candidate variables for the DataSchema variables of the construct self-reported health.
This report is a record of interaction with a data transfer object (dto) produced by ./manipulation/0-ellis-island.R.
The next section recaps this script, exposes the architecture of the DTO, and demonstrates the language of interacting with it.
All data land on Ellis Island.
The script 0-ellis-island.R is the first script in the analytic workflow. It accomplished the following:
- ./data/shared/derived/meta-data-live.csv, which is updated every time the Ellis Island script is executed.
- ./data/shared/meta-data-map.csv. They are used by automatic scripts in later harmonization and analysis.
# load the product of 0-ellis-island.R, a list object containing data and metadata
dto <- readRDS("./data/unshared/derived/dto.rds")
# the list is composed of the following elements
names(dto)
[1] "studyName" "filePath" "unitData" "metaData"
# 1st element - names of the studies as character vector
dto[["studyName"]]
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# 2nd element - file paths of the data files for each study as character vector
dto[["filePath"]]
[1] "./data/unshared/raw/ALSA-Wave1.Final.sav" "./data/unshared/raw/LBSL-Panel2-Wave1.Final.sav"
[3] "./data/unshared/raw/SATSA-Q3.Final.sav" "./data/unshared/raw/SHARE-Israel-Wave1.Final.sav"
[5] "./data/unshared/raw/TILDA-Wave1.Final.sav"
# 3rd element - a list object containing the following elements
names(dto[["unitData"]])
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# each of these elements is a raw data set of a corresponding study, for example
dplyr::tbl_df(dto[["unitData"]][["lbsl"]])
Source: local data frame [656 x 37]
id AGE94 SEX94 MSTAT94 EDUC94 NOWRK94 SMK94 SMOKE
(int) (int) (int) (fctr) (int) (fctr) (fctr) (fctr)
1 4001026 68 1 divorced 16 no, retired no never smoked
2 4012015 94 2 widowed 12 no, retired no never smoked
3 4012032 94 2 widowed 20 no, retired no don't smoke at present but smoked in the past
4 4022004 93 2 NA NA NA NA never smoked
5 4022026 93 2 widowed 12 no, retired no never smoked
6 4031031 92 1 married 8 no, retired no don't smoke at present but smoked in the past
7 4031035 92 1 widowed 13 no, retired no don't smoke at present but smoked in the past
8 4032201 92 2 NA NA NA NA don't smoke at present but smoked in the past
9 4041062 91 1 widowed 7 NA no don't smoke at present but smoked in the past
10 4042057 91 2 NA NA NA NA NA
.. ... ... ... ... ... ... ... ...
Variables not shown: ALCOHOL (fctr), WINE (int), BEER (int), HARDLIQ (int), SPORT94 (int), FIT94 (int), WALK94 (int),
SPEC94 (int), DANCE94 (int), CHORE94 (int), EXCERTOT (int), EXCERWK (int), HEIGHT94 (int), WEIGHT94 (int), HWEIGHT
(int), HHEIGHT (int), SRHEALTH (fctr), smoke_now (lgl), smoked_ever (lgl), year_of_wave (dbl), age_in_years (dbl),
year_born (dbl), female (lgl), marital (chr), single (lgl), educ3 (chr), current_work_2 (lgl), current_drink (lgl),
sedentary (lgl)
# 4th element - a dataset of names and labels of raw variables + added metadata for all studies
dto[["metaData"]] %>% dplyr::select(study_name, name, item, construct, type, categories, label_short, label) %>%
DT::datatable(
class = 'cell-border stripe',
caption = "This is the primary metadata file. Edit at `./data/shared/meta-data-map.csv`",
filter = "top",
options = list(pageLength = 6, autoWidth = TRUE)
)
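As a minimal sketch of the structure described above, the following builds a toy list with the same four element names and walks over its studies. The study names study_a and study_b, the file paths, and all values are invented for illustration; the real dto is produced by 0-ellis-island.R.

```r
# Toy stand-in for the dto; every value below is made up for illustration.
toy_dto <- list(
  studyName = c("study_a", "study_b"),
  filePath  = c("./raw/study_a.sav", "./raw/study_b.sav"),
  unitData  = list(
    study_a = data.frame(id = 1:3, SRH = c("good", "poor", NA),
                         stringsAsFactors = FALSE),
    study_b = data.frame(id = 1:2, HLTH = c("fair", "good"),
                         stringsAsFactors = FALSE)
  ),
  metaData  = data.frame(
    study_name = c("study_a", "study_b"),
    name       = c("SRH", "HLTH"),
    construct  = c("health", "health"),
    stringsAsFactors = FALSE
  )
)

# Number of respondents per study, looked up by study name
n_by_study <- vapply(
  toy_dto[["studyName"]],
  function(s) nrow(toy_dto[["unitData"]][[s]]),
  integer(1)
)
n_by_study
```

Indexing unitData by the entries of studyName, rather than by position, keeps loops over studies readable and independent of element order.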
Everybody wants to be somebody.
We query the metadata set to retrieve all variables potentially tapping the construct self-reported health. These are the candidates to enter the DataSchema and contribute to computing harmonized variables.
NOTE: what is retrieved depends on the manually entered values in the column construct of the metadata file ./data/shared/meta-data-map.csv. To specify a different group of variables, edit the metadata, not the script.
meta_data <- dto[["metaData"]] %>%
dplyr::filter(construct %in% c('health')) %>%
dplyr::select(study_name, name, construct, label_short, categories, url) %>%
dplyr::arrange(construct, study_name)
knitr::kable(meta_data)
study_name | name | construct | label_short | categories | url |
---|---|---|---|---|---|
alsa | BTSM12MN | health | Health comp with 12mths ago | NA | |
alsa | HLTHBTSM | health | Health compared to others | NA | |
alsa | HLTHLIFE | health | Self-rated health | NA | |
lbsl | SRHEALTH | health | Self-reported health compared to age peers | NA | |
satsa | GHLTHOTH | health | Judge your health compared to others your age? | NA | |
satsa | GGENHLTH | health | How do you judge your general state of health? | NA | |
share | PH0020 | health | health in general question v 1 | NA | |
share | PH0030 | health | health in general question v 2 | NA | |
share | PH0520 | health | health in general question v 2 | NA | |
share | PH0530 | health | health in general question v 1 | NA | |
tilda | PH001 | health | What about your health. Would you say ? | NA | |
tilda | PH009 | health | Compared to others your age, your health is | NA |
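The query above can be sketched in base R on a toy metadata table. The rows, the variable names, and the smoking construct below are invented for illustration and do not come from meta-data-map.csv.

```r
# Toy metadata; all rows are hypothetical.
toy_meta <- data.frame(
  study_name = c("study_b", "study_a", "study_a", "study_b"),
  name       = c("HLTH", "SRH", "SMK", "AGE"),
  construct  = c("health", "health", "smoking", NA),
  stringsAsFactors = FALSE
)

# Keep rows whose manually entered construct is 'health', ordered by study
is_health  <- toy_meta$construct %in% c("health")
candidates <- toy_meta[is_health, c("study_name", "name", "construct")]
candidates <- candidates[order(candidates$construct, candidates$study_name), ]
candidates$name
```

Using %in% rather than == means that variables with a missing construct are simply excluded instead of producing NA rows in the subset.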
View descriptives : health for closer examination of each candidate.
After reviewing descriptives and relevant codebooks, the following operationalization of the harmonized variables for the construct self-reported health has been adopted:
poor_health
- 0 - FALSE - reference group
- 1 - TRUE - risk factor
These variables will be generated next, in the Development section.
The particular goal of this section is to ensure that the schema used to encode the values of the poor_health variable is consistent across studies.
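Once the harmonized variable exists in every study, the consistency goal stated above could be checked along the following lines. This is only a sketch on toy data: the study names and values are hypothetical, and the check (the column exists and is stored as logical) is an assumption about what "consistent" should mean here.

```r
# Toy harmonized data for two hypothetical studies
toy_units <- list(
  study_a = data.frame(id = 1:3, poor_health = c(TRUE, FALSE, NA)),
  study_b = data.frame(id = 1:2, poor_health = c(FALSE, TRUE))
)

# A study passes if poor_health exists and is stored as logical (TRUE/FALSE/NA)
is_consistent <- vapply(
  toy_units,
  function(d) "poor_health" %in% names(d) && is.logical(d[["poor_health"]]),
  logical(1)
)
is_consistent
all(is_consistent)
```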
In this section we define the schema sets for harmonizing the self-reported health construct (i.e. we specify which variables from which studies will contribute to computing harmonized variables). Each of these schema sets will have a particular pattern of possible response values to these variables, which we will export for inspection as .csv tables. We will then manually edit these .csv tables, populating new columns that map values of harmonized variables to the specific response patterns of the schema set variables. Finally, we will import the harmonization algorithms encoded in the .csv tables and apply them to compute harmonized variables in a dataset combining raw and harmonized variables for the self-reported health construct across studies.
Having all potential variables in categorical format, we have defined the sets of data schema variables thus:
Each of these schema sets has a particular pattern of possible response values, for example:
We output these tables into self-standing .csv files, so we can manually provide the logic of computing harmonized variables.
You can examine them in ./data/meta/response-profiles-live/.
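A response profile of this kind can be sketched in base R as follows. The variables SRH and SRH_2, their values, and the output file name are hypothetical; the sketch writes to a temporary directory rather than to ./data/meta/response-profiles-live/.

```r
# Toy raw data; variable names and values are invented for illustration.
toy <- data.frame(
  SRH   = c("good", "good", "poor", NA, "good"),
  SRH_2 = c("same", "same", "worse", NA, "better"),
  stringsAsFactors = FALSE
)

# Key for each response combination; NA is pasted as the string "NA",
# which is adequate for a toy example but would collide with a literal "NA"
make_key <- function(d) paste(d$SRH, d$SRH_2, sep = " | ")

counts  <- table(make_key(toy))              # frequency of each combination
profile <- unique(toy)                       # one row per observed combination
profile$n <- as.integer(counts[make_key(profile)])

# Empty column to be filled in by hand with the harmonized value
profile$poor_health <- NA

out_path <- file.path(tempdir(), "response-profile-toy.csv")
write.csv(profile, out_path, row.names = FALSE)
profile
```

Keeping one row per observed combination, including combinations with missing values, makes every recoding decision explicit in the table that is edited by hand.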
poor_health
- 0 - FALSE - reference group
- 1 - TRUE - risk factor
Items that can contribute to generating values for the harmonized variable poor_health are:
dto[["metaData"]] %>%
dplyr::filter(study_name=="alsa", construct %in% c("health")) %>%
dplyr::select(study_name, name, label,categories)
study_name name label categories
1 alsa BTSM12MN Health comp with 12mths ago NA
2 alsa HLTHBTSM Health compared to others NA
3 alsa HLTHLIFE Self-rated health NA
We encode the harmonization rule by manually editing the values in a corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "alsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-health-alsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("HLTHLIFE", "BTSM12MN","HLTHBTSM"),
harmony_name = "poor_health"
)
Source: local data frame [66 x 5]
Groups: HLTHLIFE, BTSM12MN, HLTHBTSM [?]
HLTHLIFE BTSM12MN HLTHBTSM poor_health n
(chr) (chr) (chr) (lgl) (int)
1 Don t Know Better now Better NA 1
2 Excellent About the same Better FALSE 122
3 Excellent About the same Don t Know FALSE 1
4 Excellent About the same Same FALSE 19
5 Excellent About the same Worse FALSE 1
6 Excellent Better now Better FALSE 29
7 Excellent Better now Same FALSE 2
8 Excellent Not as good now Better FALSE 16
9 Excellent Not as good now Same FALSE 1
10 Fair About the same Better TRUE 91
11 Fair About the same Don t Know TRUE 9
12 Fair About the same Same TRUE 76
13 Fair About the same Worse TRUE 13
14 Fair About the same NA TRUE 4
15 Fair Better now Better TRUE 23
16 Fair Better now Don t Know TRUE 7
17 Fair Better now Same TRUE 29
18 Fair Better now Worse TRUE 5
19 Fair Not as good now Better TRUE 67
20 Fair Not as good now Don t Know TRUE 7
21 Fair Not as good now Same TRUE 107
22 Fair Not as good now Worse TRUE 35
23 Fair Not as good now NA TRUE 3
24 Fair NA NA TRUE 1
25 Good About the same Better FALSE 245
26 Good About the same Don t Know FALSE 5
27 Good About the same Same FALSE 138
28 Good About the same Worse FALSE 4
29 Good About the same NA FALSE 1
30 Good Better now Better FALSE 49
31 Good Better now Same FALSE 29
32 Good Better now Worse FALSE 2
33 Good Not as good now Better FALSE 86
34 Good Not as good now Don t Know FALSE 5
35 Good Not as good now Same FALSE 60
36 Good Not as good now Worse FALSE 8
37 Good Not as good now NA FALSE 1
38 Poor About the same Better TRUE 13
39 Poor About the same Don t Know TRUE 2
40 Poor About the same Same TRUE 12
41 Poor About the same Worse TRUE 14
42 Poor About the same NA TRUE 2
43 Poor Better now Better TRUE 4
44 Poor Better now Don t Know TRUE 1
45 Poor Better now Same TRUE 5
46 Poor Better now Worse TRUE 7
47 Poor Not as good now Better TRUE 19
48 Poor Not as good now Don t Know TRUE 5
49 Poor Not as good now Same TRUE 39
50 Poor Not as good now Worse TRUE 56
51 Poor Not as good now NA TRUE 1
52 Poor NA NA TRUE 1
53 Very Good About the same Better FALSE 308
54 Very Good About the same Don t Know FALSE 5
55 Very Good About the same Same FALSE 87
56 Very Good About the same NA FALSE 1
57 Very Good Better now Better FALSE 77
58 Very Good Better now Don t Know FALSE 2
59 Very Good Better now Same FALSE 13
60 Very Good Don t Know Same FALSE 1
61 Very Good Not as good now Better FALSE 74
62 Very Good Not as good now Don t Know FALSE 1
63 Very Good Not as good now Same FALSE 28
64 Very Good Not as good now Worse FALSE 2
65 NA Don t Know Don t Know NA 1
66 NA NA NA NA 4
# verify
dto[["unitData"]][["alsa"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select(id, HLTHLIFE, BTSM12MN, HLTHBTSM, poor_health)
id HLTHLIFE BTSM12MN HLTHBTSM poor_health
1 832 Very Good About the same Better FALSE
2 4081 Good About the same Better FALSE
3 4211 Fair Not as good now Better TRUE
4 5302 Good About the same Same FALSE
5 11951 Good About the same Same FALSE
6 14051 Good Better now Same FALSE
7 17951 Good Better now Better FALSE
8 18562 Fair Not as good now Same TRUE
9 18722 Poor About the same Worse TRUE
10 25861 Very Good About the same Better FALSE
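The rule-based recoding applied above can be illustrated with a small stand-in. The helper apply_hrule_sketch() below is hypothetical and is not the recode_with_hrule() function used in this report, whose internals are not shown here; the sketch assumes a rule table with one row per response pattern and defines it inline rather than reading it from ./data/meta/h-rules/.

```r
# Hypothetical helper; NOT the recode_with_hrule() used in this report.
apply_hrule_sketch <- function(d, hrule, variable_names, harmony_name) {
  # Build a lookup key from the source variables (NA is pasted as "NA")
  make_key <- function(x) {
    do.call(paste, c(unname(as.list(x[variable_names])), sep = "\r"))
  }
  # Row of the rule table matching each respondent's response pattern
  idx <- match(make_key(d), make_key(hrule))
  # Patterns absent from the rule table yield NA
  d[[harmony_name]] <- hrule[[harmony_name]][idx]
  d
}

# Toy data and toy rule; values are invented for illustration
toy <- data.frame(id = 1:4, SRH = c("good", "poor", NA, "fair"),
                  stringsAsFactors = FALSE)
toy_rule <- data.frame(
  SRH         = c("good", "fair", "poor", NA),
  poor_health = c(FALSE, TRUE, TRUE, NA),
  stringsAsFactors = FALSE
)

toy <- apply_hrule_sketch(toy, toy_rule, "SRH", "poor_health")
toy$poor_health
```

Because the mapping lives in a table rather than in code, changing a recoding decision means editing one cell of the rule file and rerunning the script.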
Items that can contribute to generating values for the harmonized variable poor_health are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "lbsl", construct == "health") %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 lbsl SRHEALTH Self-reported health compared to age peers NA
We encode the harmonization rule by manually editing the values in a corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "lbsl"
path_to_hrule <- "./data/meta/h-rules/h-rules-health-lbsl.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("SRHEALTH"),
harmony_name = "poor_health"
)
Source: local data frame [7 x 3]
Groups: SRHEALTH [?]
SRHEALTH poor_health n
(chr) (lgl) (int)
1 good FALSE 173
2 moderately good TRUE 177
3 moderately poor TRUE 41
4 poor TRUE 7
5 very good FALSE 163
6 very poor TRUE 3
7 NA NA 92
# verify
dto[["unitData"]][["lbsl"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select(id, SRHEALTH, poor_health)
id SRHEALTH poor_health
1 4142014 good FALSE
2 4212076 good FALSE
3 4261078 very good FALSE
4 4272075 moderately good TRUE
5 4311077 very good FALSE
6 4322009 good FALSE
7 4411033 very good FALSE
8 4411038 very good FALSE
9 4582004 moderately poor TRUE
10 4612003 good FALSE
Items that can contribute to generating values for the harmonized variable poor_health are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "satsa", construct == "health") %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 satsa GHLTHOTH Judge your health compared to others your age? NA
2 satsa GGENHLTH How do you judge your general state of health? NA
We encode the harmonization rule by manually editing the values in a corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "satsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-health-satsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("GGENHLTH","GHLTHOTH"),
harmony_name = "poor_health"
)
Source: local data frame [15 x 4]
Groups: GGENHLTH, GHLTHOTH [?]
GGENHLTH GHLTHOTH poor_health n
(chr) (chr) (lgl) (int)
1 bad about the same TRUE 10
2 bad worse TRUE 31
3 bad NA TRUE 1
4 good about the same FALSE 539
5 good better FALSE 305
6 good UNDOCUMENTED CODE FALSE 1
7 good worse FALSE 3
8 good NA FALSE 5
9 reasonable about the same TRUE 468
10 reasonable better TRUE 56
11 reasonable worse TRUE 57
12 reasonable NA TRUE 6
13 NA about the same NA 2
14 NA better NA 1
15 NA NA NA 12
# verify
dto[["unitData"]][["satsa"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select(id, GGENHLTH, GHLTHOTH, poor_health)
id GGENHLTH GHLTHOTH poor_health
1 111991 reasonable about the same TRUE
2 154411 good about the same FALSE
3 160702 reasonable about the same TRUE
4 191491 good about the same FALSE
5 2119622 reasonable about the same TRUE
6 2125211 good better FALSE
7 2163521 reasonable about the same TRUE
8 2299301 good better FALSE
9 2317701 good about the same FALSE
10 2397802 reasonable about the same TRUE
Items that can contribute to generating values for the harmonized variable poor_health are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "tilda", construct == "health") %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 tilda PH001 What about your health. Would you say ? NA
2 tilda PH009 Compared to others your age, your health is NA
We encode the harmonization rule by manually editing the values in a corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "tilda"
path_to_hrule <- "./data/meta/h-rules/h-rules-health-tilda.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("PH001", "PH009"),
harmony_name = "poor_health"
)
Source: local data frame [28 x 4]
Groups: PH001, PH009 [?]
PH001 PH009 poor_health n
(chr) (chr) (lgl) (int)
1 Excellent Excellent FALSE 1041
2 Excellent Good FALSE 49
3 Excellent Very good FALSE 270
4 Fair Excellent TRUE 38
5 Fair Fair TRUE 717
6 Fair Good TRUE 561
7 Fair Poor TRUE 43
8 Fair Very good TRUE 152
9 Fair NA TRUE 6
10 Good Excellent FALSE 194
11 Good Fair FALSE 165
12 Good Good FALSE 1541
13 Good Poor FALSE 7
14 Good Very good FALSE 847
15 Good NA FALSE 4
16 Poor Excellent TRUE 6
17 Poor Fair TRUE 126
18 Poor Good TRUE 56
19 Poor Poor TRUE 222
20 Poor Very good TRUE 7
21 Poor NA TRUE 3
22 Very good Excellent FALSE 520
23 Very good Fair FALSE 13
24 Very good Good FALSE 318
25 Very good Poor FALSE 1
26 Very good Very good FALSE 1594
27 Very good NA FALSE 2
28 NA Very good NA 1
# verify
dto[["unitData"]][["tilda"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select(id, PH001, PH009, poor_health)
id PH001 PH009 poor_health
1 80301 Excellent Excellent FALSE
2 231361 Good Good FALSE
3 262011 Very good Very good FALSE
4 302431 Poor Good TRUE
5 417321 Very good Very good FALSE
6 442612 Very good Very good FALSE
7 472362 Very good Very good FALSE
8 525821 Excellent Excellent FALSE
9 571821 Good Very good FALSE
10 590911 Poor Fair TRUE
At this point the dto[["unitData"]] elements (raw data files for each study) have been augmented with the harmonized variable poor_health. We retrieve harmonized variables to view frequency counts across studies:
dumlist <- list()
for(s in dto[["studyName"]]){
ds <- dto[["unitData"]][[s]]
dumlist[[s]] <- ds[,c("id","poor_health")]
}
ds <- plyr::ldply(dumlist,data.frame,.id = "study_name")
head(ds)
study_name id poor_health
1 alsa 41 FALSE
2 alsa 42 FALSE
3 alsa 61 FALSE
4 alsa 71 TRUE
5 alsa 91 FALSE
6 alsa 121 FALSE
ds$id <- seq_len(nrow(ds)) # some id values might be identical across studies; replace them with a running index
table( ds$poor_health, ds$study_name, useNA = "always")
alsa lbsl satsa share tilda <NA>
FALSE 1423 336 853 1411 6566 0
TRUE 658 228 629 1184 1937 0
<NA> 6 92 15 3 1 0
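The stacking and tabulation step above can be sketched in base R on toy data, with a per-study proportion added as one possible follow-up. The study names and values are hypothetical, and the proportion is an illustration rather than a result of this report.

```r
# Toy harmonized data for two hypothetical studies
toy_units <- list(
  study_a = data.frame(id = 1:4, poor_health = c(TRUE, FALSE, FALSE, NA)),
  study_b = data.frame(id = 1:2, poor_health = c(TRUE, TRUE))
)

# Stack the studies, tagging each row with its study name
stacked <- do.call(rbind, lapply(names(toy_units), function(s) {
  data.frame(
    study_name = s,
    toy_units[[s]][, c("id", "poor_health")],
    stringsAsFactors = FALSE
  )
}))

# Frequency counts, with missing values shown explicitly
counts <- table(stacked$poor_health, stacked$study_name, useNA = "always")

# Share of respondents in poor health among those with a non-missing value
prop_poor <- tapply(stacked$poor_health, stacked$study_name, mean, na.rm = TRUE)
counts
prop_poor
```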
Finally, having added the newly created harmonized variables to the raw source objects, we save the data transfer object.
# Save as a compressed, binary R dataset. It's no longer readable with a text editor, but it saves metadata (e.g., factor information).
saveRDS(dto, file="./data/unshared/derived/dto.rds", compress="xz")
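The point about preserved metadata can be illustrated with a small round trip. This sketch uses a toy data frame and a temporary file, both invented for illustration, rather than the dto itself.

```r
# Toy data with a factor whose level order differs from alphabetical order
toy <- data.frame(
  id  = 1:3,
  SRH = factor(c("good", "poor", "good"), levels = c("poor", "good"))
)

tmp_path <- tempfile(fileext = ".rds")
saveRDS(toy, file = tmp_path, compress = "xz")
restored <- readRDS(tmp_path)

# The restored object keeps column types and factor levels
identical(toy, restored)
levels(restored$SRH)
```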