(I) Exposition
- (I.A) Ellis Island
  - Meta
- (I.B) Target-H
(II) Development
- (II.A)
  - (1) Schema sets
- (II.B) current_work_2
  - ALSA
  - LBSL
  - SATSA
  - SHARE
  - TILDA
(III) Recapitulation

This report lists the candidate variables for the DataSchema variables of the construct work status.

(I) Exposition

This report is a record of interaction with a data transfer object (dto) produced by ./manipulation/0-ellis-island.R.

The next section recaps this script, exposes the architecture of the DTO, and demonstrates the language of interacting with it.

(I.A) Ellis Island

All data land on Ellis Island.

The script 0-ellis-island.R is the first script in the analytic workflow. It accomplishes the following:

1. Reads in raw data files from the candidate studies.
2. Extracts, combines, and exports their metadata (specifically, variable names and labels, if provided) into ./data/shared/derived/meta-data-live.csv, which is updated every time the Ellis Island script is executed.
3. Augments raw metadata with instructions for renaming and classifying variables. The instructions are provided as manually entered values in ./data/shared/meta-data-map.csv. They are used by automatic scripts in later harmonization and analysis.
4. Combines unit data and metadata into a single DTO to serve as a starting point for all subsequent analyses.
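
The steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the contents of 0-ellis-island.R: the toy data frames, the object names `unit_data`, `meta_live`, `meta_map`, and `dto_sketch` are hypothetical, and the `filePath` element of the real DTO is left out.

```r
# Toy stand-ins for two studies; the real script reads raw data files instead.
unit_data <- list(
  alsa = data.frame(id = 1:3, CURRWORK = c("No", "Yes", NA),
                    stringsAsFactors = FALSE),
  lbsl = data.frame(id = 1:2, NOWRK94 = c("no, retired", "yes, full time"),
                    stringsAsFactors = FALSE)
)

# Extract variable names from every study into one metadata table.
meta_live <- do.call(rbind, lapply(names(unit_data), function(s) {
  data.frame(study_name = s, name = names(unit_data[[s]]),
             stringsAsFactors = FALSE)
}))

# Manually entered classifications (hypothetical stand-in for meta-data-map.csv).
meta_map <- data.frame(
  study_name = c("alsa", "lbsl"),
  name       = c("CURRWORK", "NOWRK94"),
  construct  = "work_status",
  stringsAsFactors = FALSE
)

# Augment the extracted metadata; unclassified variables get construct = NA.
meta_data <- merge(meta_live, meta_map, by = c("study_name", "name"), all.x = TRUE)

# Combine unit data and metadata into a single list object.
dto_sketch <- list(
  studyName = names(unit_data),
  unitData  = unit_data,
  metaData  = meta_data
)
names(dto_sketch)
```

Keeping the classifications in a separate, hand-edited table means the extraction step can be rerun at any time without overwriting manual work.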

# load the product of 0-ellis-island.R,  a list object containing data and metadata
dto <- readRDS("./data/unshared/derived/dto.rds")

# the list is composed of the following elements
names(dto)

[1] "studyName" "filePath"  "unitData"  "metaData"

# 1st element - names of the studies as character vector
dto[["studyName"]]

[1] "alsa"  "lbsl"  "satsa" "share" "tilda"

# 2nd element - file paths of the data files for each study as character vector
dto[["filePath"]]

[1] "./data/unshared/raw/ALSA-Wave1.Final.sav"         "./data/unshared/raw/LBSL-Panel2-Wave1.Final.sav" 
[3] "./data/unshared/raw/SATSA-Q3.Final.sav"           "./data/unshared/raw/SHARE-Israel-Wave1.Final.sav"
[5] "./data/unshared/raw/TILDA-Wave1.Final.sav"

# 3rd element - a list object containing the following elements
names(dto[["unitData"]])

[1] "alsa"  "lbsl"  "satsa" "share" "tilda"

# each of these elements is a raw data set of a corresponding study, for example
dplyr::tbl_df(dto[["unitData"]][["lbsl"]])

Source: local data frame [656 x 34]

        id AGE94 SEX94  MSTAT94 EDUC94     NOWRK94  SMK94                                         SMOKE
     (int) (int) (int)   (fctr)  (int)      (fctr) (fctr)                                        (fctr)
1  4001026    68     1 divorced     16 no, retired     no                                  never smoked
2  4012015    94     2  widowed     12 no, retired     no                                  never smoked
3  4012032    94     2  widowed     20 no, retired     no don't smoke at present but smoked in the past
4  4022004    93     2       NA     NA          NA     NA                                  never smoked
5  4022026    93     2  widowed     12 no, retired     no                                  never smoked
6  4031031    92     1  married      8 no, retired     no don't smoke at present but smoked in the past
7  4031035    92     1  widowed     13 no, retired     no don't smoke at present but smoked in the past
8  4032201    92     2       NA     NA          NA     NA don't smoke at present but smoked in the past
9  4041062    91     1  widowed      7          NA     no don't smoke at present but smoked in the past
10 4042057    91     2       NA     NA          NA     NA                                            NA
..     ...   ...   ...      ...    ...         ...    ...                                           ...
Variables not shown: ALCOHOL (fctr), WINE (int), BEER (int), HARDLIQ (int), SPORT94 (int), FIT94 (int), WALK94 (int),
  SPEC94 (int), DANCE94 (int), CHORE94 (int), EXCERTOT (int), EXCERWK (int), HEIGHT94 (int), WEIGHT94 (int), HWEIGHT
  (int), HHEIGHT (int), SRHEALTH (fctr), smoke_now (lgl), smoked_ever (lgl), year_of_wave (dbl), age_in_years (dbl),
  year_born (dbl), female (lgl), marital (chr), single (lgl), educ3 (chr)

(I.B) Target-H

Everybody wants to be somebody.

We query the metadata set to retrieve all variables potentially tapping the construct work status. These are the candidates to enter the DataSchema and contribute to computing harmonized variables.

NOTE: what is being retrieved depends on the manually entered values in the column construct of the metadata file ./data/shared/meta-data-map.csv. To specify a different group of variables, edit the metadata, not the script.

meta_data <- dto[["metaData"]] %>%
  dplyr::filter(construct %in% c('work_status')) %>% 
  dplyr::select(study_name, name, construct, label_short, categories, url) %>%
  dplyr::arrange(construct, study_name)
knitr::kable(meta_data)

study_name	name	construct	label_short	categories
alsa	RETIRED	work_status	Are you retired from your last job?	2
alsa	CURRWORK	work_status	Currently working	NA
lbsl	NOWRK94	work_status	Working at present time?	9
satsa	GAMTWORK	work_status	Describe current work/retirement situation	11
share	EP0050	work_status	Current job situation	10
tilda	WE001	work_status	Describe current job situation	9
tilda	WE003	work_status	Any paid work last week?	9

View descriptives: work for a closer examination of each candidate.

After reviewing descriptives and relevant codebooks, the following operationalization of the harmonized variable for work status has been adopted:

Target : `current_work_2`

0 - FALSE - REFERENCE group
1 - TRUE

The operationalization of this variable does not distinguish between retired and unemployed statuses.
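
A minimal sketch of what this collapse means in practice, using response labels from the LBSL item NOWRK94. The prefix rule below is an assumption made for illustration only; the actual mapping is entered by hand in the h-rules files.

```r
# Response labels of the LBSL item NOWRK94, plus a missing value.
responses <- c("no, retired", "no, unemployed", "yes, full time", "yes, part time", NA)

# Illustrative assumption: any label starting with "yes" counts as working.
# Retired and unemployed both end up as FALSE; NA stays NA.
current_work_2 <- startsWith(responses, "yes")

data.frame(responses, current_work_2)
```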

This variable will be generated next, in the Development section.

(II) Development

The particular goal of this section is to ensure that the schema used to encode the values of the work status variable is consistent across studies.

In this section we define the schema sets for harmonizing the work status construct (i.e., we specify which variables from which studies contribute to computing harmonized variables). Each schema set has a particular pattern of possible response values to these variables, which we export for inspection as .csv tables. We then manually edit these .csv tables, populating new columns that map values of harmonized variables to the specific response patterns of the schema set variables. Finally, we import the harmonization algorithms encoded in the .csv tables and apply them to compute harmonized variables, producing a dataset that combines raw and harmonized variables for the work status construct across studies.

(II.A)

(1) Schema sets

Having all potential variables in categorical format, we have defined the sets of data schema variables thus:

Each of these schema sets has a particular pattern of possible response values, for example:

We output these tables into self-standing .csv files, so we can manually provide the logic of computing harmonized variables.

You can examine them in `./data/meta/response-profiles-live/`.
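
One way to produce such a table is sketched below. The unit records are made up, the output file name is hypothetical, and the file is written to a temporary directory rather than to the project folder; this is not the project's own export code.

```r
# Made-up records for the two ALSA items named in this report.
ds <- data.frame(
  RETIRED  = c("Yes", "Yes", "No", NA, NA),
  CURRWORK = c("No", "No", "No", "Yes", NA),
  stringsAsFactors = FALSE
)

# Count every combination of responses, keeping NA as its own level.
profile <- as.data.frame(
  table(RETIRED = ds$RETIRED, CURRWORK = ds$CURRWORK, useNA = "ifany"),
  responseName = "n", stringsAsFactors = FALSE
)

# Keep only the combinations that actually occur in the data.
profile <- profile[profile$n > 0, ]

# Add an empty column to be filled in by hand, then export for editing.
profile$current_work_2 <- NA
path_out <- file.path(tempdir(), "response-profile-alsa.csv")  # hypothetical name
write.csv(profile, path_out, row.names = FALSE)
```

Each row of the exported file is one observed response pattern, so the manual step reduces to entering one harmonized value per row.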

(II.B) `current_work_2`

Target (1) : `current_work_2`

0 - FALSE
1 - TRUE

ALSA

Items that can contribute to generating values for the harmonized variable current_work_2 are:

dto[["metaData"]] %>%
  dplyr::filter(study_name=="alsa", construct %in% c("work_status")) %>%
  dplyr::select(study_name, name, label,categories)

  study_name     name                               label categories
1       alsa  RETIRED Are you retired from your last job?          2
2       alsa CURRWORK                   Currently working         NA

We encode the harmonization rule by manually editing the values in a corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
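
The body of recode_with_hrule() is not shown in this report. As a rough sketch of the idea, a rule table can be applied as a lookup on the response pattern. The helper `apply_hrule_sketch` and the unit records below are hypothetical; the rule table reuses the ALSA response patterns reported in this section.

```r
# Hypothetical helper; NOT the project's recode_with_hrule().
apply_hrule_sketch <- function(ds, hrule, variable_names, harmony_name) {
  # Build a lookup key from the response pattern; NA becomes the string "NA".
  make_key <- function(d) {
    do.call(paste, c(unname(lapply(d[variable_names], as.character)), sep = "|"))
  }
  # Find the rule row matching each record and copy its harmonized value.
  ds[[harmony_name]] <- hrule[[harmony_name]][match(make_key(ds), make_key(hrule))]
  ds
}

# Rule table: one row per response pattern, with the hand-entered target value.
hrule <- data.frame(
  RETIRED        = c("No", "Yes", NA, NA, NA),
  CURRWORK       = c("No", "No", "No", "Yes", NA),
  current_work_2 = c(FALSE, FALSE, FALSE, TRUE, NA),
  stringsAsFactors = FALSE
)

# Made-up unit records.
ds <- data.frame(id = 1:3, RETIRED = c("Yes", NA, NA), CURRWORK = c("No", "Yes", NA),
                 stringsAsFactors = FALSE)
ds <- apply_hrule_sketch(ds, hrule, c("RETIRED", "CURRWORK"), "current_work_2")
ds
```

Note that this sketch would confuse a missing value with a literal response label "NA"; a production version would need to handle that case explicitly.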

study_name <- "alsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-work-alsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("RETIRED","CURRWORK"), 
  harmony_name = "current_work_2"
)

Source: local data frame [5 x 4]
Groups: RETIRED, CURRWORK [?]

  RETIRED CURRWORK current_work_2     n
    (chr)    (chr)          (lgl) (int)
1      No       No          FALSE   134
2     Yes       No          FALSE  1767
3      NA       No          FALSE   137
4      NA      Yes           TRUE    31
5      NA       NA             NA    18

# verify
dto[["unitData"]][["alsa"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "RETIRED","CURRWORK", "current_work_2")

      id RETIRED CURRWORK current_work_2
1   2722     Yes       No          FALSE
2   5741      No       No          FALSE
3  10241     Yes       No          FALSE
4  11302     Yes       No          FALSE
5  14891     Yes       No          FALSE
6  15731     Yes       No          FALSE
7  24031     Yes       No          FALSE
8  24781     Yes       No          FALSE
9  25932      No       No          FALSE
10 29581     Yes       No          FALSE

LBSL

Items that can contribute to generating values for the harmonized variable current_work_2 are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "lbsl", construct == "work_status") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name    name              label_short categories
1       lbsl NOWRK94 Working at present time?          9

study_name <- "lbsl"
path_to_hrule <- "./data/meta/h-rules/h-rules-work-lbsl.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("NOWRK94"), 
  harmony_name = "current_work_2"
)

Source: local data frame [9 x 3]
Groups: NOWRK94 [?]

                 NOWRK94 current_work_2     n
                   (chr)          (lgl) (int)
1           no, disabled          FALSE    14
2          no, homemaker          FALSE    34
3   no, not seeking work          FALSE     7
4            no, retired          FALSE   318
5         no, unemployed          FALSE     7
6         yes, full time           TRUE   105
7 yes, more than one job           TRUE     2
8         yes, part time           TRUE    64
9                     NA             NA   105

# verify
dto[["unitData"]][["lbsl"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "NOWRK94", "current_work_2")

        id        NOWRK94 current_work_2
1  4131197    no, retired          FALSE
2  4152082 yes, part time           TRUE
3  4201073    no, retired          FALSE
4  4251084           <NA>             NA
5  4272076    no, retired          FALSE
6  4361043    no, retired          FALSE
7  4422007           <NA>             NA
8  4491032 yes, full time           TRUE
9  4512034 yes, full time           TRUE
10 4631001           <NA>             NA

SATSA

Items that can contribute to generating values for the harmonized variable current_work_2 are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "satsa", construct == "work_status") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name     name                                label_short categories
1      satsa GAMTWORK Describe current work/retirement situation         11

study_name <- "satsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-work-satsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("GAMTWORK"), 
  harmony_name = "current_work_2"
)

Source: local data frame [11 x 3]
Groups: GAMTWORK [?]

                           GAMTWORK current_work_2     n
                              (chr)          (lgl) (int)
1                 full time student          FALSE     3
2                housewife/houseman          FALSE    22
3                 old-age pensioner          FALSE   778
4     On leave of absence from work          FALSE     2
5                            other'          FALSE    58
6           pension due to sickness          FALSE    77
7    Unemployed (looking for a job)          FALSE    13
8  Unemployed (not looking for job)          FALSE     3
9                    work full-time           TRUE   407
10                   work half-time           TRUE   112
11                               NA             NA    22

# verify
dto[["unitData"]][["satsa"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "GAMTWORK", "current_work_2")

        id          GAMTWORK current_work_2
1   127251 old-age pensioner          FALSE
2   138021 old-age pensioner          FALSE
3   151222 old-age pensioner          FALSE
4   166911 old-age pensioner          FALSE
5   178622 old-age pensioner          FALSE
6   178802 old-age pensioner          FALSE
7   210622    work full-time           TRUE
8  2157402    work full-time           TRUE
9  2185881              <NA>             NA
10 2236201 old-age pensioner          FALSE

SHARE

Items that can contribute to generating values for the harmonized variable current_work_2 are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "share", construct == "work_status") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name   name           label_short categories
1      share EP0050 Current job situation         10

study_name <- "share"
path_to_hrule <- "./data/meta/h-rules/h-rules-work-share.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("EP0050"), 
  harmony_name = "current_work_2"
)

Source: local data frame [10 x 3]
Groups: EP0050 [?]

                                                         EP0050 current_work_2     n
                                                          (chr)          (lgl) (int)
1                                                    Don't know          FALSE     1
2  Employed or self-employed (including working for family busi           TRUE   932
3                                                     Homemaker          FALSE   289
4                                               Other (specify)          FALSE    34
5                                  Permanently sick or disabled          FALSE    89
6                                                       Retired          FALSE  1071
7                                  Temporarily sick or disabled          FALSE    46
8                                   Unemployed, not seeking job          FALSE    64
9                                       Unemployed, seeking job          FALSE    64
10                                                           NA             NA     8

# verify
knitr::kable(dto[["unitData"]][["share"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "EP0050", "current_work_2"))

id	EP0050	current_work_2
2.505203e+12	Permanently sick or disabled	FALSE
2.505222e+12	Retired	FALSE
2.505228e+12	Employed or self-employed (including working for family busi	TRUE
2.505244e+12	Retired	FALSE
2.505278e+12	Employed or self-employed (including working for family busi	TRUE
2.505285e+12	Employed or self-employed (including working for family busi	TRUE
2.605237e+12	Retired	FALSE
2.605278e+12	Employed or self-employed (including working for family busi	TRUE
2.705263e+12	Retired	FALSE
2.705277e+12	Retired	FALSE

TILDA

Items that can contribute to generating values for the harmonized variable current_work_2 are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "tilda", construct == "work_status") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name  name                    label_short categories
1      tilda WE001 Describe current job situation          9
2      tilda WE003       Any paid work last week?          9

study_name <- "tilda"
path_to_hrule <- "./data/meta/h-rules/h-rules-work-tilda.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("WE001","WE003"), 
  harmony_name = "current_work_2"
)

Source: local data frame [15 x 4]
Groups: WE001, WE003 [?]

                               WE001             WE003 current_work_2     n
                               (chr)             (chr)          (lgl) (int)
1                           Employed UNDOCUMENTED CODE           TRUE  2218
2           In education or training                No          FALSE    47
3           In education or training               Yes          FALSE     8
4       Looking after home or family                No          FALSE  1289
5       Looking after home or family               Yes          FALSE    57
6                    Other (Specify)                No          FALSE    79
7                    Other (Specify)               Yes          FALSE    25
8       Permanently sick or disabled                No          FALSE   385
9       Permanently sick or disabled               Yes          FALSE    10
10                           Retired                No          FALSE  2910
11                           Retired               Yes           TRUE   138
12 Self-employed (including farming) UNDOCUMENTED CODE           TRUE   923
13                        Unemployed                No          FALSE   395
14                        Unemployed               Yes          FALSE    18
15                                NA                No             NA     2

# verify
dto[["unitData"]][["tilda"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id","WE001","WE003","current_work_2")

                   id                             WE001             WE003 current_work_2
1  162792             Self-employed (including farming) UNDOCUMENTED CODE           TRUE
2  200091                                       Retired                No          FALSE
3  237281                                       Retired                No          FALSE
4  269501                                      Employed UNDOCUMENTED CODE           TRUE
5  277011                  Looking after home or family                No          FALSE
6  441322                  Looking after home or family                No          FALSE
7  504251                                      Employed UNDOCUMENTED CODE           TRUE
8  597251                  Looking after home or family                No          FALSE
9  611571                  Looking after home or family                No          FALSE
10 622844                  Permanently sick or disabled                No          FALSE

(III) Recapitulation

At this point the dto[["unitData"]] elements (raw data files for each study) have been augmented with the harmonized variable current_work_2. We retrieve harmonized variables to view frequency counts across studies:

dumlist <- list()
for(s in dto[["studyName"]]){
  ds <- dto[["unitData"]][[s]]
  dumlist[[s]] <- ds[,c("id","current_work_2")]
}
ds <- plyr::ldply(dumlist,data.frame,.id = "study_name")
head(ds)

  study_name  id current_work_2
1       alsa  41          FALSE
2       alsa  42          FALSE
3       alsa  61          FALSE
4       alsa  71           TRUE
5       alsa  91          FALSE
6       alsa 121          FALSE

ds$id <- 1:nrow(ds) # some id values might be identical across studies; replace with a running index
table( ds$current_work_2, ds$study_name, useNA = "always")

       
        alsa lbsl satsa share tilda <NA>
  FALSE 2038  380   956  1658  5223    0
  TRUE    31  171   519   932  3279    0
  <NA>    18  105    22     8     2    0

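To compare studies of different sizes, the counts in the table above can be turned into proportions. The sketch below copies the counts from that table; the object names `counts`, `valid`, and `prop_working` are introduced here for illustration.

```r
# Counts copied from the cross-tabulation above (rows: FALSE, TRUE, NA).
counts <- matrix(
  c(2038,   31,  18,
     380,  171, 105,
     956,  519,  22,
    1658,  932,   8,
    5223, 3279,   2),
  nrow = 3,
  dimnames = list(c("FALSE", "TRUE", "NA"),
                  c("alsa", "lbsl", "satsa", "share", "tilda"))
)

# Share of respondents coded as working, among those with a non-missing value.
valid <- counts[c("FALSE", "TRUE"), ]
prop_working <- valid["TRUE", ] / colSums(valid)
round(prop_working, 3)
```
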
Finally, having added the newly created, harmonized variable to the raw source objects, we save the data transfer object.

# Save as a compressed, binary R dataset.  It's no longer readable with a text editor, but it preserves metadata (e.g., factor information).
saveRDS(dto, file="./data/unshared/derived/dto.rds", compress="xz")
