This report lists the candidate variables for the DataSchema variables of the construct education.
This report is a record of interaction with a data transfer object (dto) produced by ./manipulation/0-ellis-island.R.
The next section recaps this script, exposes the architecture of the DTO, and demonstrates the language of interacting with it.
All data land on Ellis Island.
The script 0-ellis-island.R is the first script in the analytic workflow. It accomplished the following:
./data/shared/derived/meta-data-live.csv, which is updated every time the Ellis Island script is executed.
./data/shared/meta-data-map.csv. They are used by automatic scripts in later harmonization and analysis.
# load the product of 0-ellis-island.R, a list object containing data and metadata
dto <- readRDS("./data/unshared/derived/dto.rds")
# the list is composed of the following elements
names(dto)
[1] "studyName" "filePath" "unitData" "metaData"
# 1st element - names of the studies as character vector
dto[["studyName"]]
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# 2nd element - file paths of the data files for each study as character vector
dto[["filePath"]]
[1] "./data/unshared/raw/ALSA-Wave1.Final.sav" "./data/unshared/raw/LBSL-Panel2-Wave1.Final.sav"
[3] "./data/unshared/raw/SATSA-Q3.Final.sav" "./data/unshared/raw/SHARE-Israel-Wave1.Final.sav"
[5] "./data/unshared/raw/TILDA-Wave1.Final.sav"
# 3rd element - is a list object containing the following elements
names(dto[["unitData"]])
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# each of these elements is a raw data set of a corresponding study, for example
dplyr::tbl_df(dto[["unitData"]][["lbsl"]])
Source: local data frame [656 x 33]
id AGE94 SEX94 MSTAT94 EDUC94 NOWRK94 SMK94 SMOKE
(int) (int) (int) (fctr) (int) (fctr) (fctr) (fctr)
1 4001026 68 1 divorced 16 no, retired no never smoked
2 4012015 94 2 widowed 12 no, retired no never smoked
3 4012032 94 2 widowed 20 no, retired no don't smoke at present but smoked in the past
4 4022004 93 2 NA NA NA NA never smoked
5 4022026 93 2 widowed 12 no, retired no never smoked
6 4031031 92 1 married 8 no, retired no don't smoke at present but smoked in the past
7 4031035 92 1 widowed 13 no, retired no don't smoke at present but smoked in the past
8 4032201 92 2 NA NA NA NA don't smoke at present but smoked in the past
9 4041062 91 1 widowed 7 NA no don't smoke at present but smoked in the past
10 4042057 91 2 NA NA NA NA NA
.. ... ... ... ... ... ... ... ...
Variables not shown: ALCOHOL (fctr), WINE (int), BEER (int), HARDLIQ (int), SPORT94 (int), FIT94 (int), WALK94 (int),
SPEC94 (int), DANCE94 (int), CHORE94 (int), EXCERTOT (int), EXCERWK (int), HEIGHT94 (int), WEIGHT94 (int), HWEIGHT
(int), HHEIGHT (int), SRHEALTH (fctr), smoke_now (lgl), smoked_ever (lgl), year_of_wave (dbl), age_in_years (dbl),
year_born (dbl), female (lgl), marital (chr), single (lgl)
# 4th element - a dataset of names and labels of raw variables + added metadata for all studies
dto[["metaData"]] %>% dplyr::select(study_name, name, item, construct, type, categories, label_short, label) %>%
DT::datatable(
class = 'cell-border stripe',
caption = "This is the primary metadata file. Edit at `./data/shared/meta-data-map.csv`",
filter = "top",
options = list(pageLength = 6, autoWidth = TRUE)
)
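The four-element architecture of the DTO can be illustrated with a toy list. This is a minimal sketch in base R, not the output of 0-ellis-island.R: the object name `toy_dto`, the file paths, and the data values are invented for illustration, and only the element names follow the report.

```r
# Toy stand-in for the DTO; element names mirror the report, values are invented.
unit_data <- list(
  alsa = data.frame(id = c(41, 42),
                    SCHOOL = c("Fifteen years", "Fourteen years"),
                    stringsAsFactors = FALSE),
  lbsl = data.frame(id = c(4001026, 4012015), EDUC94 = c(16L, 12L))
)

meta_data <- data.frame(
  study_name = c("alsa", "lbsl"),
  name       = c("SCHOOL", "EDUC94"),
  construct  = c("education", "education"),
  stringsAsFactors = FALSE
)

toy_dto <- list(
  studyName = names(unit_data),
  filePath  = c("./toy/alsa.sav", "./toy/lbsl.sav"),  # hypothetical paths
  unitData  = unit_data,
  metaData  = meta_data
)

# The same access pattern as in the report: element first, then study
names(toy_dto)
toy_dto[["unitData"]][["lbsl"]]
```

Keeping data and metadata in one list means a single object can be passed between scripts, and each study is reached by name rather than by position.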
Everybody wants to be somebody.
We query the metadata set to retrieve all variables potentially tapping the construct education. These are the candidates to enter the DataSchema and contribute to computing harmonized variables.
NOTE: what is being retrieved depends on the manually entered values in the column construct of the metadata file ./data/shared/meta-data-map.csv. To specify a different group of variables, edit the metadata, not the script.
meta_data <- dto[["metaData"]] %>%
dplyr::filter(construct %in% c('education')) %>%
dplyr::select(study_name, name, construct, label_short, categories, url) %>%
dplyr::arrange(construct, study_name)
knitr::kable(meta_data)
study_name | name | construct | label_short | categories | url |
---|---|---|---|---|---|
alsa | SCHOOL | education | Age left school | 8 | |
alsa | TYPQUAL | education | Highest qualification | 10 | |
lbsl | EDUC94 | education | Years of school completed | 18 | |
satsa | EDUC | education | Education | 4 | |
share | DN0100 | education | Edcuation | 13 | |
share | DN012D01 | education | yeshiva, religious high institution | NA | |
share | DN012D02 | education | nursing school | NA | |
share | DN012D03 | education | polytechnic | NA | |
share | DN012D04 | education | university, Bachelors degree | NA | |
share | DN012D05 | education | university, graduate degree | NA | |
share | DN012D09 | education | still in further education or training | NA | |
share | DN012DNO | education | no further education | NA | |
share | DN012DOT | education | other further education | NA | |
share | DN012DRF | education | refused | NA | |
share | DN012DDK | education | dont know | NA | |
tilda | DM001 | education | | 4 | |
View descriptives: education for closer examination of each candidate.
After reviewing descriptives and relevant codebooks, the following operationalization of the harmonized variables for education has been adopted:
educ3
-1 - less than high school
0 - high school - REFERENCE group
1 - more than high school
These variables will be generated next, in the Development section.
The particular goal of this section is to ensure that the schema used to encode the values of the education variable is consistent across studies.
In this section we define the schema sets for harmonizing the education construct (i.e., we specify which variables from which studies will contribute to computing harmonized variables). Each schema set has a particular pattern of possible response values for these variables, which we export for inspection as .csv tables. We then manually edit these .csv tables, populating new columns that map values of the harmonized variables to the specific response patterns of the schema set variables. Finally, we import the harmonization algorithms encoded in the .csv tables and apply them to compute harmonized variables in the dataset, which combines raw and harmonized variables for the education construct across studies.
Having all potential variables in categorical format, we have defined the sets of data schema variables thus:
Each of these schema sets has a particular pattern of possible response values, for example:
We output these tables into self-standing .csv files, so we can manually provide the logic of computing harmonized variables.
You can examine them in `./data/meta/response-profiles-live/`
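The idea of a response profile (every observed combination of responses with its frequency, plus an empty column to be filled in by hand) can be sketched in base R. This is an illustrative sketch only: the toy data, the `<missing>` placeholder, and the temporary output file are assumptions, not the report's actual export code.

```r
# Invented raw responses for two categorical items
raw <- data.frame(
  SCHOOL  = c("Fourteen years", "Fourteen years", "Sixteen years", NA),
  TYPQUAL = c("Trade or Apprenticeship", "Trade or Apprenticeship", NA, NA),
  stringsAsFactors = FALSE
)

# Give missing values a visible label so that aggregate() does not drop them
keys <- raw[c("SCHOOL", "TYPQUAL")]
keys[is.na(keys)] <- "<missing>"
keys$n <- 1L

# One row per observed response pattern, with its frequency
profile <- aggregate(n ~ SCHOOL + TYPQUAL, data = keys, FUN = sum)

# Empty column in which the harmonized value is later entered manually
profile$educ3 <- NA_character_

# Export for manual editing (a temporary file here, not the project folder)
out_file <- tempfile(fileext = ".csv")
write.csv(profile, out_file, row.names = FALSE)
profile
```

Because the table contains one row per observed pattern, the manual step is limited to deciding a harmonized value for each row.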
educ4
1 - less than high-school
2 - high-school most
3 - college
4 - college plus
Items that can contribute to generating values for the harmonized variable education are:
dto[["metaData"]] %>%
dplyr::filter(study_name=="alsa", construct %in% c("education")) %>%
dplyr::select(study_name, name, label,categories)
study_name name label categories
1 alsa SCHOOL Age left school 8
2 alsa TYPQUAL Highest qualification 10
We encode the harmonization rule by manually editing the values in the corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "alsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-education-alsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("SCHOOL","TYPQUAL"),
harmony_name = "educ3"
)
Source: local data frame [50 x 4]
Groups: SCHOOL, TYPQUAL [?]
SCHOOL TYPQUAL educ3 n
(chr) (chr) (chr) (int)
1 Eighteen or more years Bachelor Degree or Post Graduate Diploma more than high school 31
2 Eighteen or more years Certificate or Diploma more than high school 34
3 Eighteen or more years Higher Qualification more than high school 6
4 Eighteen or more years No Formal Tuition more than high school 1
5 Eighteen or more years Primary School Course more than high school 1
6 Eighteen or more years Secondary School Course more than high school 1
7 Eighteen or more years Trade or Apprenticeship more than high school 6
8 Eighteen or more years NA more than high school 33
9 Fifteen years Adult Education or Hobby Course more than high school 1
10 Fifteen years Bachelor Degree or Post Graduate Diploma more than high school 6
11 Fifteen years Certificate or Diploma more than high school 71
12 Fifteen years Other more than high school 1
13 Fifteen years Secondary School Course more than high school 8
14 Fifteen years Trade or Apprenticeship more than high school 52
15 Fifteen years NA more than high school 243
16 Fourteen years Adult Education or Hobby Course high school 8
17 Fourteen years Bachelor Degree or Post Graduate Diploma high school 6
18 Fourteen years Certificate or Diploma high school 72
19 Fourteen years Higher Qualification high school 5
20 Fourteen years No Formal Tuition high school 2
21 Fourteen years Other high school 2
22 Fourteen years Secondary School Course high school 1
23 Fourteen years Trade or Apprenticeship high school 109
24 Fourteen years NA high school 614
25 Never went to school Certificate or Diploma less than high school 4
26 Never went to school Trade or Apprenticeship less than high school 1
27 Never went to school NA less than high school 25
28 Seventeen years Adult Education or Hobby Course less than high school 1
29 Seventeen years Bachelor Degree or Post Graduate Diploma more than high school 22
30 Seventeen years Certificate or Diploma more than high school 41
31 Seventeen years Higher Qualification more than high school 1
32 Seventeen years Other more than high school 2
33 Seventeen years Secondary School Course more than high school 1
34 Seventeen years Trade or Apprenticeship more than high school 6
35 Seventeen years NA more than high school 57
36 Sixteen years Bachelor Degree or Post Graduate Diploma more than high school 14
37 Sixteen years Certificate or Diploma more than high school 82
38 Sixteen years Higher Qualification more than high school 2
39 Sixteen years Secondary School Course more than high school 4
40 Sixteen years Trade or Apprenticeship more than high school 26
41 Sixteen years NA more than high school 152
42 Under fourteen years Adult Education or Hobby Course less than high school 1
43 Under fourteen years Bachelor Degree or Post Graduate Diploma less than high school 1
44 Under fourteen years Certificate or Diploma less than high school 27
45 Under fourteen years Other less than high school 1
46 Under fourteen years Secondary School Course less than high school 2
47 Under fourteen years Trade or Apprenticeship less than high school 36
48 Under fourteen years NA less than high school 238
49 NA Certificate or Diploma NA 1
50 NA NA NA 25
# verify
dto[["unitData"]][["alsa"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select_("id", "SCHOOL","TYPQUAL","educ3")
id SCHOOL TYPQUAL educ3
1 2001 Fifteen years <NA> more than high school
2 3491 Under fourteen years <NA> less than high school
3 4892 Fourteen years <NA> high school
4 5622 Under fourteen years <NA> less than high school
5 10091 Eighteen or more years <NA> more than high school
6 12411 Sixteen years <NA> more than high school
7 22411 Sixteen years <NA> more than high school
8 23341 Under fourteen years <NA> less than high school
9 30352 Sixteen years <NA> more than high school
10 32572 Sixteen years <NA> more than high school
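The report does not show the internals of recode_with_hrule(), so the following is only a sketch of how a rule table could be applied as a lookup. The function `apply_hrule()`, its arguments, and the toy data are hypothetical; the three mappings shown agree with the ALSA table above for respondents with a missing TYPQUAL.

```r
# Hypothetical rule table: one row per response pattern, harmonized value entered by hand
hrule <- data.frame(
  SCHOOL = c("Under fourteen years", "Fourteen years", "Sixteen years"),
  educ3  = c("less than high school", "high school", "more than high school"),
  stringsAsFactors = FALSE
)

# Invented unit-level data
unit <- data.frame(
  id     = 1:4,
  SCHOOL = c("Sixteen years", "Fourteen years", NA, "Under fourteen years"),
  stringsAsFactors = FALSE
)

# Look up each response pattern in the rule table;
# patterns without a matching rule receive NA
apply_hrule <- function(unit, hrule, variable_names, harmony_name) {
  key_unit <- do.call(paste, c(unname(as.list(unit[variable_names])), sep = "\r"))
  key_rule <- do.call(paste, c(unname(as.list(hrule[variable_names])), sep = "\r"))
  unit[[harmony_name]] <- hrule[[harmony_name]][match(key_unit, key_rule)]
  unit
}

res <- apply_hrule(unit, hrule, variable_names = "SCHOOL", harmony_name = "educ3")
res
```

Keeping the rule in a table rather than in code means a mapping can be changed by editing one cell of the .csv file, without touching the script.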
Items that can contribute to generating values for the harmonized variable education are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "lbsl", construct == "education") %>%
# dplyr::filter(name %in% c("EDUC94")) %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 lbsl EDUC94 Years of school completed 18
We encode the harmonization rule by manually editing the values in the corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "lbsl"
path_to_hrule <- "./data/meta/h-rules/h-rules-education-lbsl.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("EDUC94"),
harmony_name = "educ3"
)
Source: local data frame [18 x 3]
Groups: EDUC94 [?]
EDUC94 educ3 n
(chr) (chr) (int)
1 10 less than high school 29
2 11 less than high school 18
3 12 high school 170
4 13 more than high school 40
5 14 more than high school 85
6 15 more than high school 37
7 16 more than high school 62
8 17 more than high school 15
9 18 more than high school 28
10 19 more than high school 10
11 20 more than high school 31
12 21 more than high school 1
13 23 more than high school 1
14 4 less than high school 1
15 7 less than high school 6
16 8 less than high school 16
17 9 less than high school 4
18 NA NA 102
# verify
dto[["unitData"]][["lbsl"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select_("id", "EDUC94", "educ3")
id EDUC94 educ3
1 4092016 14 more than high school
2 4101168 16 more than high school
3 4131061 NA <NA>
4 4141203 12 high school
5 4242074 13 more than high school
6 4291048 NA <NA>
7 4301087 18 more than high school
8 4312017 8 less than high school
9 4612001 12 high school
10 4612005 13 more than high school
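For a numeric item such as EDUC94, the rule table above amounts to two cut points: fewer than 12 years, exactly 12 years, and more than 12 years. As a cross-check on the table-driven approach, the same mapping can be sketched with cut(); the vector `years` is invented, and this is an alternative illustration rather than the method used in the report.

```r
# Invented years of schooling, including a missing value
years <- c(8L, 12L, 16L, NA, 11L, 13L)

# Intervals are right-closed: (-Inf, 11], (11, 12], (12, Inf)
educ3 <- cut(
  years,
  breaks = c(-Inf, 11, 12, Inf),
  labels = c("less than high school", "high school", "more than high school")
)

data.frame(years, educ3)
```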
Items that can contribute to generating values for the harmonized variable education are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "satsa", construct == "education") %>%
# dplyr::filter(name %in% c("EDUC")) %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 satsa EDUC Education 4
We encode the harmonization rule by manually editing the values in the corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "satsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-education-satsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("EDUC"),
harmony_name = "educ3"
)
Source: local data frame [5 x 3]
Groups: EDUC [?]
EDUC educ3 n
(chr) (chr) (int)
1 Elementary school less than high school 858
2 gymnasium (A-level) high school 121
3 O-level or vocational school or folk school less than high school 381
4 university or higher more than high school 109
5 NA NA 28
# verify
dto[["unitData"]][["satsa"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select_("id", "EDUC", "educ3")
id EDUC educ3
1 24262 O-level or vocational school or folk school less than high school
2 132021 Elementary school less than high school
3 138291 Elementary school less than high school
4 150541 university or higher more than high school
5 165662 O-level or vocational school or folk school less than high school
6 178602 Elementary school less than high school
7 274701 Elementary school less than high school
8 294801 O-level or vocational school or folk school less than high school
9 295001 Elementary school less than high school
10 2151972 O-level or vocational school or folk school less than high school
Items that can contribute to generating values for the harmonized variable education are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "tilda", construct == "education") %>%
# dplyr::filter(name %in% c("DM001")) %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 tilda DM001 4
We encode the harmonization rule by manually editing the values in the corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
study_name <- "tilda"
path_to_hrule <- "./data/meta/h-rules/h-rules-education-tilda.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("DM001"),
harmony_name = "educ3"
)
Source: local data frame [9 x 3]
Groups: DM001 [?]
DM001 educ3 n
(chr) (chr) (int)
1 Diploma/certificate high school 1335
2 Intermediate/junior/group certificate or equivalent less than high school 1971
3 Leaving certificate or equivalent high school 1460
4 None less than high school 9
5 Postgraduate/higher degree more than high school 483
6 Primary degree less than high school 730
7 Primary or equivalent less than high school 2232
8 Some primary (not complete) less than high school 280
9 NA NA 4
# verify
dto[["unitData"]][["tilda"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select_("id", "DM001", "educ3")
id DM001 educ3
1 75911 Intermediate/junior/group certificate or equivalent less than high school
2 97071 Primary or equivalent less than high school
3 195521 Diploma/certificate high school
4 252201 Postgraduate/higher degree more than high school
5 286431 Postgraduate/higher degree more than high school
6 463892 Primary or equivalent less than high school
7 477791 Primary degree less than high school
8 492302 Intermediate/junior/group certificate or equivalent less than high school
9 493222 Primary degree less than high school
10 564502 Diploma/certificate high school
At this point the dto[["unitData"]] elements (raw data files for each study) have been augmented with the harmonized variable educ3. We retrieve harmonized variables to view frequency counts across studies:
dumlist <- list()
for(s in dto[["studyName"]]){
ds <- dto[["unitData"]][[s]]
dumlist[[s]] <- ds[,c("id","educ3")]
}
ds <- plyr::ldply(dumlist,data.frame,.id = "study_name")
head(ds)
study_name id educ3
1 alsa 41 more than high school
2 alsa 42 high school
3 alsa 61 high school
4 alsa 71 high school
5 alsa 91 more than high school
6 alsa 121 high school
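The stacking step can also be written in base R without plyr. The sketch below assumes a toy list `toy_unit` with only two studies and three rows; it mirrors the structure of the loop above but is not the report's code.

```r
# Toy per-study data sets that already contain the harmonized variable
toy_unit <- list(
  alsa = data.frame(id = c(41, 42),
                    educ3 = c("more than high school", "high school"),
                    stringsAsFactors = FALSE),
  lbsl = data.frame(id = 4141203,
                    educ3 = "high school",
                    stringsAsFactors = FALSE)
)

# Keep the common columns of each study, tag rows with the study name, then stack
stacked <- do.call(rbind, lapply(names(toy_unit), function(s) {
  data.frame(
    study_name = s,
    toy_unit[[s]][, c("id", "educ3")],
    stringsAsFactors = FALSE
  )
}))
rownames(stacked) <- NULL
stacked
```

Selecting the same columns from every study before stacking is what makes rbind() safe here, since the raw data sets differ in their other variables.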
ds$id <- 1:nrow(ds) # some id values might be identical across studies, replace
# Convert the text labels into a factor coded 0/1/2 in a single step.
# Note: if the labels have already been recoded to 0/1/2 (e.g. with car::recode()),
# the levels below match nothing and every value becomes NA, as in the table below.
ds$educ3 <- factor(
ds$educ3,
levels = c("less than high school",
"high school",
"more than high school"),
labels = c(0,1,2)
)
table( ds$educ3, ds$study_name, useNA = "always")
alsa lbsl satsa share tilda <NA>
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
<NA> 2087 656 1497 2598 8504 0
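A table in which every observation falls under <NA> is a typical symptom of factor levels that do not match the values in the data. The toy vector below (invented values) sketches the difference between converting text labels once and applying the same conversion to values that are already coded.

```r
labels_text <- c("less than high school", "high school", "more than high school")
x <- c("high school", "more than high school", NA, "less than high school")

# One-step conversion: each text label is matched to a level and shown as 0, 1, or 2
coded <- factor(x, levels = labels_text, labels = c(0, 1, 2))
table(coded, useNA = "always")

# The same call on already-coded values ("0", "1", "2") matches no level,
# so every element becomes NA
recoded_twice <- factor(as.character(coded), levels = labels_text, labels = c(0, 1, 2))
all(is.na(recoded_twice))
```

Checking table(..., useNA = "always") right after each recoding step is a cheap way to catch this before the data are saved.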
Finally, having added the newly created, harmonized variables to the raw source objects, we save the data transfer object.
# Save as a compressed, binary R dataset. It's no longer readable with a text editor, but it saves metadata (e.g., factor information).
saveRDS(dto, file="./data/unshared/derived/dto.rds", compress="xz")
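That the binary format preserves metadata such as factor levels can be checked with a round trip. This sketch uses an invented list `toy` and a temporary file rather than the project path.

```r
# Invented object containing a factor
toy <- list(
  studyName = c("alsa", "lbsl"),
  unitData  = list(alsa = data.frame(id = 1:2, educ3 = factor(c("0", "1"))))
)

# Write to a temporary file with the same compression setting, then read back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, file = tmp, compress = "xz")
restored <- readRDS(tmp)

# The restored object is identical, including the factor levels
identical(toy, restored)
levels(restored$unitData$alsa$educ3)
```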