This report lists the candidate variables for the DataSchema of the construct work status.
This report is a record of interaction with a data transfer object (dto) produced by ./manipulation/0-ellis-island.R.
The next section recaps this script, exposes the architecture of the DTO, and demonstrates the language of interacting with it.
All data land on Ellis Island.
The script 0-ellis-island.R is the first script in the analytic workflow. It accomplished the following:
./data/shared/derived/meta-data-live.csv, which is updated every time the Ellis Island script is executed.
./data/shared/meta-data-map.csv. They are used by automatic scripts in later harmonization and analysis.
# load the product of 0-ellis-island.R, a list object containing data and metadata
dto <- readRDS("./data/unshared/derived/dto.rds")
# the list is composed of the following elements
names(dto)
[1] "studyName" "filePath" "unitData" "metaData"
# 1st element - names of the studies as character vector
dto[["studyName"]]
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# 2nd element - file paths of the data files for each study as character vector
dto[["filePath"]]
[1] "./data/unshared/raw/ALSA-Wave1.Final.sav" "./data/unshared/raw/LBSL-Panel2-Wave1.Final.sav"
[3] "./data/unshared/raw/SATSA-Q3.Final.sav" "./data/unshared/raw/SHARE-Israel-Wave1.Final.sav"
[5] "./data/unshared/raw/TILDA-Wave1.Final.sav"
# 3rd element - a list object containing the following elements
names(dto[["unitData"]])
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# each of these elements is a raw data set of a corresponding study, for example
dplyr::tbl_df(dto[["unitData"]][["lbsl"]])
Source: local data frame [656 x 34]
id AGE94 SEX94 MSTAT94 EDUC94 NOWRK94 SMK94 SMOKE
(int) (int) (int) (fctr) (int) (fctr) (fctr) (fctr)
1 4001026 68 1 divorced 16 no, retired no never smoked
2 4012015 94 2 widowed 12 no, retired no never smoked
3 4012032 94 2 widowed 20 no, retired no don't smoke at present but smoked in the past
4 4022004 93 2 NA NA NA NA never smoked
5 4022026 93 2 widowed 12 no, retired no never smoked
6 4031031 92 1 married 8 no, retired no don't smoke at present but smoked in the past
7 4031035 92 1 widowed 13 no, retired no don't smoke at present but smoked in the past
8 4032201 92 2 NA NA NA NA don't smoke at present but smoked in the past
9 4041062 91 1 widowed 7 NA no don't smoke at present but smoked in the past
10 4042057 91 2 NA NA NA NA NA
.. ... ... ... ... ... ... ... ...
Variables not shown: ALCOHOL (fctr), WINE (int), BEER (int), HARDLIQ (int), SPORT94 (int), FIT94 (int), WALK94 (int),
SPEC94 (int), DANCE94 (int), CHORE94 (int), EXCERTOT (int), EXCERWK (int), HEIGHT94 (int), WEIGHT94 (int), HWEIGHT
(int), HHEIGHT (int), SRHEALTH (fctr), smoke_now (lgl), smoked_ever (lgl), year_of_wave (dbl), age_in_years (dbl),
year_born (dbl), female (lgl), marital (chr), single (lgl), educ3 (chr)
# 4th element - a dataset of names and labels of raw variables + added metadata for all studies
dto[["metaData"]] %>% dplyr::select(study_name, name, item, construct, type, categories, label_short, label) %>%
DT::datatable(
class = 'cell-border stripe',
caption = "This is the primary metadata file. Edit at `./data/shared/meta-data-map.csv`",
filter = "top",
options = list(pageLength = 6, autoWidth = TRUE)
)
Everybody wants to be somebody.
We query the metadata set to retrieve all variables potentially tapping the construct work status. These are the candidates to enter the DataSchema and contribute to computing harmonized variables.
NOTE: what is retrieved depends on the manually entered values in the column construct of the metadata file ./data/shared/meta-data-map.csv. To specify a different group of variables, edit the metadata, not the script.
meta_data <- dto[["metaData"]] %>%
dplyr::filter(construct %in% c('work_status')) %>%
dplyr::select(study_name, name, construct, label_short, categories, url) %>%
dplyr::arrange(construct, study_name)
knitr::kable(meta_data)
study_name | name | construct | label_short | categories | url |
---|---|---|---|---|---|
alsa | RETIRED | work_status | Are you retired from your last job? | 2 | |
alsa | CURRWORK | work_status | Currently working | NA | |
lbsl | NOWRK94 | work_status | Working at present time? | 9 | |
satsa | GAMTWORK | work_status | Describe current work/retirement situation | 11 | |
share | EP0050 | work_status | Current job situation | 10 | |
tilda | WE001 | work_status | Describe current job situation | 9 | |
tilda | WE003 | work_status | Any paid work last week? | 9 | |
View descriptives : work for closer examination of each candidate.
After reviewing descriptives and relevant codebooks, the following operationalization of the harmonized variables for work status has been adopted:
current_work_2
0 - FALSE - REFERENCE group
1 - TRUE
The operationalization of this variable does not distinguish between retired and unemployed statuses.
These variables will be generated next, in the Development section.
The particular goal of this section is to ensure that the schema used to encode the values of the work status variable is consistent across studies.
In this section we define the schema sets for harmonizing the work status construct (i.e. specify which variables from which studies will contribute to computing harmonized variables). Each of these schema sets has a particular pattern of possible response values to these variables, which we export for inspection as .csv tables. We then manually edit these .csv tables, populating new columns that map values of the harmonized variables to the specific response patterns of the schema set variables. Finally, we import the harmonization algorithms encoded in the .csv tables and apply them to compute harmonized variables in the dataset that combines raw and harmonized variables for the work status construct across studies.
Having all potential variables in categorical format, we have defined the sets of data schema variables thus:
Each of these schema sets has a particular pattern of possible response values, for example:
We output these tables into self-standing .csv files, so we can manually provide the logic of computing harmonized variables.
You can examine them in `./data/meta/response-profiles-live/`
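As a minimal sketch of how such a response profile could be produced (this is not the project's own export script), the snippet below tabulates every observed combination of responses in a toy data frame, adds an empty column for the harmonized value, and writes the result to a .csv file. The toy data, the object names, and the output file name are assumptions made for illustration; base R is used instead of dplyr to keep the example self-contained.

```r
# Toy stand-in for one study's raw data; the values are illustrative only.
toy <- data.frame(
  RETIRED  = c("Yes", "Yes", "No", NA, NA),
  CURRWORK = c("No",  "No",  "No", "Yes", NA),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two items, keeping missing values as their own category.
profile <- as.data.frame(
  table(RETIRED = toy$RETIRED, CURRWORK = toy$CURRWORK, useNA = "ifany"),
  responseName = "n",
  stringsAsFactors = FALSE
)

# Keep only the response patterns that actually occur in the data.
profile <- profile[profile$n > 0, ]

# Empty column to be populated by hand with the harmonized value.
profile$current_work_2 <- NA

# Hypothetical output location; the report's real files live elsewhere.
out_path <- file.path(tempdir(), "response-profile-toy.csv")
write.csv(profile, out_path, row.names = FALSE)
```

The empty current_work_2 column is the one a person would later fill in by hand, one value per response pattern.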
current_work_2
current_work_2
0 - FALSE
1 - TRUE
Items that can contribute to generating values for the harmonized variable current_work_2 are:
dto[["metaData"]] %>%
dplyr::filter(study_name=="alsa", construct %in% c("work_status")) %>%
dplyr::select(study_name, name, label,categories)
study_name name label categories
1 alsa RETIRED Are you retired from your last job? 2
2 alsa CURRWORK Currently working NA
We encode the harmonization rule by manually editing the values in the corresponding .csv file located in ./data/meta/h-rules/. Then we apply the recoding logic it contains and append the newly created harmonized variable to the initial data set.
study_name <- "alsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-work-alsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("RETIRED","CURRWORK"),
harmony_name = "current_work_2"
)
Source: local data frame [5 x 4]
Groups: RETIRED, CURRWORK [?]
RETIRED CURRWORK current_work_2 n
(chr) (chr) (lgl) (int)
1 No No FALSE 134
2 Yes No FALSE 1767
3 NA No FALSE 137
4 NA Yes TRUE 31
5 NA NA NA 18
# verify
dto[["unitData"]][["alsa"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select(id, RETIRED, CURRWORK, current_work_2)
id RETIRED CURRWORK current_work_2
1 2722 Yes No FALSE
2 5741 No No FALSE
3 10241 Yes No FALSE
4 11302 Yes No FALSE
5 14891 Yes No FALSE
6 15731 Yes No FALSE
7 24031 Yes No FALSE
8 24781 Yes No FALSE
9 25932 No No FALSE
10 29581 Yes No FALSE
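The helper recode_with_hrule() is defined elsewhere in the project and is not shown in this report. As a rough sketch of the idea only, and not the project's implementation, a rule table can be joined to the data on the source variables. The function name apply_hrule_sketch, the toy data, and the in-memory rule table are hypothetical; the rule rows mirror the ALSA response patterns listed above.

```r
# Hypothetical helper, NOT the project's recode_with_hrule():
# attaches the harmonized variable by joining a rule table on the source items.
apply_hrule_sketch <- function(ds, hrule, variable_names, harmony_name) {
  ds$.row_order <- seq_len(nrow(ds))  # merge() may reorder rows; remember the order
  out <- merge(
    ds,
    hrule[, c(variable_names, harmony_name)],
    by = variable_names, all.x = TRUE, sort = FALSE
  )
  out <- out[order(out$.row_order), ]  # restore the original row order
  out$.row_order <- NULL
  rownames(out) <- NULL
  out
}

# Rule table: one row per response pattern, with the harmonized value filled in.
hrule <- data.frame(
  RETIRED        = c("No", "Yes", NA,   NA,    NA),
  CURRWORK       = c("No", "No",  "No", "Yes", NA),
  current_work_2 = c(FALSE, FALSE, FALSE, TRUE, NA),
  stringsAsFactors = FALSE
)

# Toy unit-level data (made-up ids and responses).
ds <- data.frame(
  id       = 1:5,
  RETIRED  = c("Yes", "No", NA,    NA, "Yes"),
  CURRWORK = c("No",  "No", "Yes", NA, "No"),
  stringsAsFactors = FALSE
)

res <- apply_hrule_sketch(ds, hrule, c("RETIRED", "CURRWORK"), "current_work_2")
```

Keeping the rule in a table rather than in code means a change to the harmonization logic is a change to a .csv file, which matches the workflow described in this report.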
Items that can contribute to generating values for the harmonized variable current_work_2 are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "lbsl", construct == "work_status") %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 lbsl NOWRK94 Working at present time? 9
We encode the harmonization rule by manually editing the values in the corresponding .csv file located in ./data/meta/h-rules/. Then we apply the recoding logic it contains and append the newly created harmonized variable to the initial data set.
study_name <- "lbsl"
path_to_hrule <- "./data/meta/h-rules/h-rules-work-lbsl.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("NOWRK94"),
harmony_name = "current_work_2"
)
Source: local data frame [9 x 3]
Groups: NOWRK94 [?]
NOWRK94 current_work_2 n
(chr) (lgl) (int)
1 no, disabled FALSE 14
2 no, homemaker FALSE 34
3 no, not seeking work FALSE 7
4 no, retired FALSE 318
5 no, unemployed FALSE 7
6 yes, full time TRUE 105
7 yes, more than one job TRUE 2
8 yes, part time TRUE 64
9 NA NA 105
# verify
dto[["unitData"]][["lbsl"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select(id, NOWRK94, current_work_2)
id NOWRK94 current_work_2
1 4131197 no, retired FALSE
2 4152082 yes, part time TRUE
3 4201073 no, retired FALSE
4 4251084 <NA> NA
5 4272076 no, retired FALSE
6 4361043 no, retired FALSE
7 4422007 <NA> NA
8 4491032 yes, full time TRUE
9 4512034 yes, full time TRUE
10 4631001 <NA> NA
Items that can contribute to generating values for the harmonized variable current_work_2 are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "satsa", construct == "work_status") %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 satsa GAMTWORK Describe current work/retirement situation 11
We encode the harmonization rule by manually editing the values in the corresponding .csv file located in ./data/meta/h-rules/. Then we apply the recoding logic it contains and append the newly created harmonized variable to the initial data set.
study_name <- "satsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-work-satsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("GAMTWORK"),
harmony_name = "current_work_2"
)
Source: local data frame [11 x 3]
Groups: GAMTWORK [?]
GAMTWORK current_work_2 n
(chr) (lgl) (int)
1 full time student FALSE 3
2 housewife/houseman FALSE 22
3 old-age pensioner FALSE 778
4 On leave of absence from work FALSE 2
5 other' FALSE 58
6 pension due to sickness FALSE 77
7 Unemployed (looking for a job) FALSE 13
8 Unemployed (not looking for job) FALSE 3
9 work full-time TRUE 407
10 work half-time TRUE 112
11 NA NA 22
# verify
dto[["unitData"]][["satsa"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select(id, GAMTWORK, current_work_2)
id GAMTWORK current_work_2
1 127251 old-age pensioner FALSE
2 138021 old-age pensioner FALSE
3 151222 old-age pensioner FALSE
4 166911 old-age pensioner FALSE
5 178622 old-age pensioner FALSE
6 178802 old-age pensioner FALSE
7 210622 work full-time TRUE
8 2157402 work full-time TRUE
9 2185881 <NA> NA
10 2236201 old-age pensioner FALSE
Items that can contribute to generating values for the harmonized variable current_work_2 are:
dto[["metaData"]] %>%
dplyr::filter(study_name == "tilda", construct == "work_status") %>%
dplyr::select(study_name, name, label_short,categories)
study_name name label_short categories
1 tilda WE001 Describe current job situation 9
2 tilda WE003 Any paid work last week? 9
We encode the harmonization rule by manually editing the values in the corresponding .csv file located in ./data/meta/h-rules/. Then we apply the recoding logic it contains and append the newly created harmonized variable to the initial data set.
study_name <- "tilda"
path_to_hrule <- "./data/meta/h-rules/h-rules-work-tilda.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
dto,
study_name = study_name,
variable_names = c("WE001","WE003"),
harmony_name = "current_work_2"
)
Source: local data frame [15 x 4]
Groups: WE001, WE003 [?]
WE001 WE003 current_work_2 n
(chr) (chr) (lgl) (int)
1 Employed UNDOCUMENTED CODE TRUE 2218
2 In education or training No FALSE 47
3 In education or training Yes FALSE 8
4 Looking after home or family No FALSE 1289
5 Looking after home or family Yes FALSE 57
6 Other (Specify) No FALSE 79
7 Other (Specify) Yes FALSE 25
8 Permanently sick or disabled No FALSE 385
9 Permanently sick or disabled Yes FALSE 10
10 Retired No FALSE 2910
11 Retired Yes TRUE 138
12 Self-employed (including farming) UNDOCUMENTED CODE TRUE 923
13 Unemployed No FALSE 395
14 Unemployed Yes FALSE 18
15 NA No NA 2
# verify
dto[["unitData"]][["tilda"]] %>%
dplyr::filter(id %in% sample(unique(id),10)) %>%
dplyr::select(id, WE001, WE003, current_work_2)
id WE001 WE003 current_work_2
1 162792 Self-employed (including farming) UNDOCUMENTED CODE TRUE
2 200091 Retired No FALSE
3 237281 Retired No FALSE
4 269501 Employed UNDOCUMENTED CODE TRUE
5 277011 Looking after home or family No FALSE
6 441322 Looking after home or family No FALSE
7 504251 Employed UNDOCUMENTED CODE TRUE
8 597251 Looking after home or family No FALSE
9 611571 Looking after home or family No FALSE
10 622844 Permanently sick or disabled No FALSE
At this point the dto[["unitData"]] elements (raw data files for each study) have been augmented with the harmonized variable current_work_2. We retrieve harmonized variables to view frequency counts across studies:
dumlist <- list()
for(s in dto[["studyName"]]){
ds <- dto[["unitData"]][[s]]
dumlist[[s]] <- ds[,c("id","current_work_2")]
}
ds <- plyr::ldply(dumlist,data.frame,.id = "study_name")
head(ds)
study_name id current_work_2
1 alsa 41 FALSE
2 alsa 42 FALSE
3 alsa 61 FALSE
4 alsa 71 TRUE
5 alsa 91 FALSE
6 alsa 121 FALSE
ds$id <- 1:nrow(ds) # ids may repeat across studies, so replace them with a unique row index
table( ds$current_work_2, ds$study_name, useNA = "always")
alsa lbsl satsa share tilda <NA>
FALSE 2038 380 956 1658 5223 0
TRUE 31 171 519 932 3279 0
<NA> 18 105 22 8 2 0
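The counts in the table above can be turned into within-study proportions, which are easier to compare across studies of different size. The sketch below re-enters the FALSE and TRUE counts from the table by hand (missing values are left out, which is an assumption about the denominator) and uses prop.table() to compute the share of respondents coded TRUE in each study.

```r
# Counts copied by hand from the cross-study table; rows with <NA> are excluded.
counts <- rbind(
  `FALSE` = c(alsa = 2038, lbsl = 380, satsa = 956, share = 1658, tilda = 5223),
  `TRUE`  = c(alsa = 31,   lbsl = 171, satsa = 519, share = 932,  tilda = 3279)
)

# Column-wise proportions: each study's column sums to 1.
props <- prop.table(counts, margin = 2)

# Share of respondents currently working, per study.
prop_working <- props["TRUE", ]
round(prop_working, 3)
```

In a live session the same result could be obtained by passing the output of table() directly to prop.table() instead of retyping the counts.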
Finally, we have added the newly created harmonized variables to the raw source objects and saved the data transfer object.
# Save as a compressed, binary R dataset. It's no longer readable with a text editor, but it saves metadata (e.g., factor information).
saveRDS(dto, file="./data/unshared/derived/dto.rds", compress="xz")