(I) Exposition
- (I.A) Ellis Island
  - Meta
- (I.B) Target-H
(II) Development
- (II.A)
  - (1) Schema sets
- (II.B) sex
  - ALSA
  - LBSL
  - SATSA
  - SHARE
  - TILDA
(III) Recapitulation

This report lists the candidate variables for the DataSchema variables of the construct sex.

(I) Exposition

This report is a record of interaction with a data transfer object (dto) produced by ./manipulation/0-ellis-island.R.

The next section recaps this script, exposes the architecture of the DTO, and demonstrates the language of interacting with it.

(I.A) Ellis Island

All data land on Ellis Island.

The script 0-ellis-island.R is the first script in the analytic workflow. It accomplishes the following:

1. Reads in raw data files from the candidate studies.
2. Extracts, combines, and exports their metadata (specifically, variable names and labels, if provided) into ./data/shared/derived/meta-data-live.csv, which is updated every time the Ellis Island script is executed.
3. Augments raw metadata with instructions for renaming and classifying variables. The instructions are provided as manually entered values in ./data/shared/meta-data-map.csv. They are used by automatic scripts in later harmonization and analysis.
4. Combines unit data and metadata into a single DTO to serve as a starting point for all subsequent analyses.
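As a rough illustration of step 4, the sketch below assembles a toy list with the same four element names that names(dto) reports. The two toy data frames, their values, and the file paths are assumptions made for illustration; the real DTO is built from the raw study files by 0-ellis-island.R.

```r
# Toy stand-ins for two studies (illustrative values only).
alsa <- data.frame(id = 1:3, SEX = factor(c("Male", "Female", "Female")))
lbsl <- data.frame(id = 1:2, SEX94 = c(1L, 2L))

unit_data <- list(alsa = alsa, lbsl = lbsl)

# One metadata row per variable per study (names only; labels omitted here).
meta_data <- do.call(rbind, lapply(names(unit_data), function(s) {
  data.frame(study_name = s, name = names(unit_data[[s]]),
             stringsAsFactors = FALSE)
}))

# Assemble a list with the four elements: study names, paths, unit data, metadata.
dto_toy <- list(
  studyName = names(unit_data),
  filePath  = c("alsa-toy.sav", "lbsl-toy.sav"),  # hypothetical paths
  unitData  = unit_data,
  metaData  = meta_data
)
names(dto_toy)
```

Keeping unit data and metadata in one object means later scripts need only a single readRDS() call to get both.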

# load the product of 0-ellis-island.R,  a list object containing data and metadata
dto <- readRDS("./data/unshared/derived/dto.rds")

# the list is composed of the following elements
names(dto)

[1] "studyName" "filePath"  "unitData"  "metaData"

# 1st element - names of the studies as character vector
dto[["studyName"]]

[1] "alsa"  "lbsl"  "satsa" "share" "tilda"

# 2nd element - file paths of the data files for each study as character vector
dto[["filePath"]]

[1] "./data/unshared/raw/ALSA-Wave1.Final.sav"         "./data/unshared/raw/LBSL-Panel2-Wave1.Final.sav" 
[3] "./data/unshared/raw/SATSA-Q3.Final.sav"           "./data/unshared/raw/SHARE-Israel-Wave1.Final.sav"
[5] "./data/unshared/raw/TILDA-Wave1.Final.sav"

# 3rd element - is a list object containing the following elements
names(dto[["unitData"]])

[1] "alsa"  "lbsl"  "satsa" "share" "tilda"

# each of these elements is a raw data set of a corresponding study, for example
dplyr::tbl_df(dto[["unitData"]][["lbsl"]])

Source: local data frame [656 x 30]

        id AGE94 SEX94  MSTAT94 EDUC94     NOWRK94  SMK94                                         SMOKE
     (int) (int) (int)   (fctr)  (int)      (fctr) (fctr)                                        (fctr)
1  4001026    68     1 divorced     16 no, retired     no                                  never smoked
2  4012015    94     2  widowed     12 no, retired     no                                  never smoked
3  4012032    94     2  widowed     20 no, retired     no don't smoke at present but smoked in the past
4  4022004    93     2       NA     NA          NA     NA                                  never smoked
5  4022026    93     2  widowed     12 no, retired     no                                  never smoked
6  4031031    92     1  married      8 no, retired     no don't smoke at present but smoked in the past
7  4031035    92     1  widowed     13 no, retired     no don't smoke at present but smoked in the past
8  4032201    92     2       NA     NA          NA     NA don't smoke at present but smoked in the past
9  4041062    91     1  widowed      7          NA     no don't smoke at present but smoked in the past
10 4042057    91     2       NA     NA          NA     NA                                            NA
..     ...   ...   ...      ...    ...         ...    ...                                           ...
Variables not shown: ALCOHOL (fctr), WINE (int), BEER (int), HARDLIQ (int), SPORT94 (int), FIT94 (int), WALK94 (int),
  SPEC94 (int), DANCE94 (int), CHORE94 (int), EXCERTOT (int), EXCERWK (int), HEIGHT94 (int), WEIGHT94 (int), HWEIGHT
  (int), HHEIGHT (int), SRHEALTH (fctr), smoke_now (lgl), smoked_ever (lgl), year_of_wave (dbl), age_in_years (dbl),
  year_born (dbl)

(I.B) Target-H

Everybody wants to be somebody.

We query the metadata set to retrieve all variables potentially tapping the construct sex. These are the candidates to enter the DataSchema and contribute to computing harmonized variables.

NOTE: what is being retrieved depends on the manually entered values in the column construct of the metadata file ./data/shared/meta-data-map.csv. To specify a different group of variables, edit the metadata, not the script.

meta_data <- dto[["metaData"]] %>%
  dplyr::filter(construct %in% c('sex')) %>% 
  dplyr::select(study_name, name, construct, label_short, categories, url) %>%
  dplyr::arrange(construct, study_name)
knitr::kable(meta_data)

study_name	name	construct	label_short	categories
alsa	SEX	sex	Sex	2
lbsl	SEX94	sex	Sex	2
satsa	SEX	sex	Sex	2
share	GENDER	sex	Sex	2
tilda	SEX	sex	Gender	2
tilda	GD002	sex	Male or Female?	2

View descriptives: sex for a closer examination of each candidate. After reviewing these descriptives and the relevant codebooks, the following operationalization of the harmonized variable female has been adopted:

Target : `female`

0 - FALSE - male - reference group
1 - TRUE - female
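A minimal sketch of what this target encoding means in practice, assuming a raw item coded with the labels "Male" and "Female" (as the ALSA item is). The vector raw_sex is made up for illustration.

```r
# Hypothetical raw responses, including one missing value.
raw_sex <- c("Male", "Female", "Female", NA)

# Harmonized target: TRUE for female, FALSE for male (the reference group).
# A missing raw response stays missing in the harmonized variable.
female <- raw_sex == "Female"
female

# The same variable on the 0/1 scale.
as.integer(female)
```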

These variables will be generated next, in the Development section.

(II) Development

The particular goal of this section is to ensure that the schema used to encode the values of the sex variable is consistent across studies.

In this section we define the schema sets for harmonizing the sex construct, i.e. we specify which variables from which studies will contribute to computing the harmonized variables. Each schema set (e.g. "tilda" = c("SEX", "GD002")) has a particular pattern of possible response values to its variables (e.g. "id_1" = c("SEX"="Male", "GD002"="MALE")), which we export for inspection as .csv tables. We then manually edit these .csv tables, populating new columns that map the values of the harmonized variables to each response pattern of the schema set variables. Finally, we import the harmonization algorithms encoded in the .csv tables and apply them to compute the harmonized variables, producing a dataset that combines raw and harmonized variables for the sex construct across studies.
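The first two steps of this workflow can be sketched on toy data. The data frame tilda_toy and its values are assumptions for illustration, and the "manual edit" of the .csv table is imitated in code by adding a column; in the actual workflow that column is typed in by hand.

```r
# Toy data mimicking a two-item schema set (illustrative values only).
tilda_toy <- data.frame(
  id    = 1:5,
  SEX   = c("Male", "Female", "Female", "Male", "Female"),
  GD002 = c("Male", "Female", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Step 1: response profile = every observed combination of values, with counts.
profile <- aggregate(id ~ SEX + GD002, data = tilda_toy, FUN = length)
names(profile)[names(profile) == "id"] <- "count"

# Step 2: imitate the manual edit by adding a column with the harmonized value
# assigned to each response pattern.
hrule <- profile
hrule$female <- hrule$SEX == "Female" & hrule$GD002 == "Female"
hrule
```

Each row of the resulting table is one response pattern, so the rule for a whole study fits in a handful of lines that a reviewer can check by eye.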

(II.A)

(1) Schema sets

With all candidate variables in categorical format, we define the sets of data schema variables as follows:

schema_sets <- list(
  "alsa" = c("SEX"),
  "lbsl" = c("SEX94"),
  "satsa" =  c("SEX"),
  "share" = c("GENDER"), 
  "tilda" = c("SEX","GD002") 
)

Each of these schema sets has a particular pattern of possible response values, for example:

# view the profile of responses
dto[["unitData"]][["alsa"]] %>% 
  dplyr::group_by(SEX) %>% 
  dplyr::summarize(count = n())

Source: local data frame [2 x 2]

     SEX count
  (fctr) (int)
1   Male  1056
2 Female  1031

We output these tables into self-standing .csv files, so we can manually provide the logic of computing harmonized variables.

# define function to extract profiles
response_profile <- function(dto, h_target, study, varnames_values){
  ds <- dto[["unitData"]][[study]]
  varnames_values <- lapply(varnames_values, as.symbol)   # Convert character vector to list of symbols
  d <- ds %>% 
    dplyr::group_by_(.dots=varnames_values) %>% 
    dplyr::summarize(count = n()) 
  write.csv(d,paste0("./data/meta/response-profiles-live/",h_target,"-",study,".csv"))
}
# extract response profile for data schema set from each study
for(s in names(schema_sets)){
  response_profile(dto,
                   study = s,
                   h_target = 'sex',
                   varnames_values = schema_sets[[s]]
                   )
}

You can examine them in `./data/meta/response-profiles-live/`.

(II.B) `sex`

Target : `sex`

1 - male
2 - female

ALSA

Items that can contribute to generating values for the harmonized variable sex are:

dto[["metaData"]] %>%
  dplyr::filter(study_name=="alsa", name %in% c("SEX")) %>%
  dplyr::select(study_name, name, label,categories)

  study_name name label categories
1       alsa  SEX   Sex          2

We encode the harmonization rule by manually editing the values in a corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.
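The source of recode_with_hrule() is not shown in this report. The sketch below is one plausible way to apply such a rule table, using a left join on the schema set variables. The function apply_hrule_sketch(), its arguments, and the toy data are all hypothetical and are not the report's implementation.

```r
# Hypothetical helper; NOT the report's recode_with_hrule().
# Assumes ds has an `id` column and hrule has one row per response pattern.
apply_hrule_sketch <- function(ds, hrule, variable_names, harmony_name) {
  # Compare as character so factor and integer codings match the rule table.
  for (v in variable_names) {
    ds[[v]]    <- as.character(ds[[v]])
    hrule[[v]] <- as.character(hrule[[v]])
  }
  rule <- hrule[, c(variable_names, harmony_name), drop = FALSE]
  # Left join: keep every row of ds; unmatched response patterns get NA.
  out <- merge(ds, rule, by = variable_names, all.x = TRUE, sort = FALSE)
  out[order(out$id), ]
}

# Toy data and a toy rule table (illustrative values only).
ds_toy    <- data.frame(id = 1:4, SEX94 = c(1L, 2L, 2L, 1L))
hrule_toy <- data.frame(SEX94 = c(1L, 2L), female = c(FALSE, TRUE))

res <- apply_hrule_sketch(ds_toy, hrule_toy, "SEX94", "female")
res
```

A join keeps the recoding logic entirely in the rule table, so changing a rule means editing the .csv file rather than the script.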

study_name <- "alsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-sex-alsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("SEX"), 
  harmony_name = "female"
)

Source: local data frame [2 x 3]
Groups: SEX [?]

     SEX female     n
   (chr)  (lgl) (int)
1 Female   TRUE  1031
2   Male  FALSE  1056

# verify
dto[["unitData"]][["alsa"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "SEX","female")

      id    SEX female
1   1702 Female   TRUE
2   5692 Female   TRUE
3  11111 Female   TRUE
4  21491 Female   TRUE
5  24292 Female   TRUE
6  24971   Male  FALSE
7  25041   Male  FALSE
8  25781 Female   TRUE
9  28681   Male  FALSE
10 39831 Female   TRUE

LBSL

Items that can contribute to generating values for the harmonized variable sex are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "lbsl", construct == "sex") %>%
  # dplyr::filter(name %in% c("SEX94")) %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name  name label_short categories
1       lbsl SEX94         Sex          2

study_name <- "lbsl"
path_to_hrule <- "./data/meta/h-rules/h-rules-sex-lbsl.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("SEX94"), 
  harmony_name = "female"
)

Source: local data frame [2 x 3]
Groups: SEX94 [?]

  SEX94 female     n
  (chr)  (lgl) (int)
1     1  FALSE   314
2     2   TRUE   342

# verify
dto[["unitData"]][["lbsl"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "SEX94", "female")

        id SEX94 female
1  4112089     2   TRUE
2  4121170     1  FALSE
3  4141039     1  FALSE
4  4141201     1  FALSE
5  4142049     2   TRUE
6  4251084     1  FALSE
7  4291081     1  FALSE
8  4372007     2   TRUE
9  4411036     1  FALSE
10 4502034     2   TRUE

SATSA

Items that can contribute to generating values for the harmonized variable sex are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "satsa", construct == "sex") %>%
  # dplyr::filter(name %in% c("SEX")) %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name name label_short categories
1      satsa  SEX         Sex          2

study_name <- "satsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-sex-satsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("SEX"), 
  harmony_name = "female"
)

Source: local data frame [2 x 3]
Groups: SEX [?]

     SEX female     n
   (chr)  (lgl) (int)
1 female   TRUE   887
2   male  FALSE   610

# verify
dto[["unitData"]][["satsa"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "SEX", "female")

        id    SEX female
1    17812 female   TRUE
2   140821   male  FALSE
3   153712 female   TRUE
4   158502 female   TRUE
5   162832 female   TRUE
6   173682 female   TRUE
7   180261   male  FALSE
8   260402 female   TRUE
9  2105422   male  FALSE
10 2208562   male  FALSE

SHARE

Items that can contribute to generating values for the harmonized variable sex are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "share", construct == "sex") %>%
  # dplyr::filter(name %in% c("GENDER")) %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name   name label_short categories
1      share GENDER         Sex          2

study_name <- "share"
path_to_hrule <- "./data/meta/h-rules/h-rules-sex-share.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("GENDER"), 
  harmony_name = "female"
)

Source: local data frame [2 x 3]
Groups: GENDER [?]

  GENDER female     n
   (chr)  (lgl) (int)
1 female   TRUE  1459
2   male  FALSE  1139

# verify
dto[["unitData"]][["share"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "GENDER", "female")

             id GENDER female
1  2.505206e+12   male  FALSE
2  2.505218e+12   male  FALSE
3  2.505223e+12 female   TRUE
4  2.505227e+12   male  FALSE
5  2.505236e+12   male  FALSE
6  2.505246e+12   male  FALSE
7  2.505248e+12   male  FALSE
8  2.505287e+12   male  FALSE
9  2.605208e+12 female   TRUE
10 2.705202e+12   male  FALSE

TILDA

Items that can contribute to generating values for the harmonized variable sex are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "tilda", construct == "sex") %>%
  # dplyr::filter(name %in% c("SEX", "GD002")) %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name  name     label_short categories
1      tilda   SEX          Gender          2
2      tilda GD002 Male or Female?          2

study_name <- "tilda"
path_to_hrule <- "./data/meta/h-rules/h-rules-sex-tilda.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("SEX", "GD002"), 
  harmony_name = "female"
)

Source: local data frame [2 x 4]
Groups: SEX, GD002 [?]

     SEX  GD002 female     n
   (chr)  (chr)  (lgl) (int)
1 Female Female   TRUE  4724
2   Male   Male  FALSE  3780

# verify
dto[["unitData"]][["tilda"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "SEX", "GD002", "female")

                   id    SEX  GD002 female
1  104201             Female Female   TRUE
2  228431               Male   Male  FALSE
3  240481               Male   Male  FALSE
4  322682               Male   Male  FALSE
5  336182             Female Female   TRUE
6  342411               Male   Male  FALSE
7  359731               Male   Male  FALSE
8  459431             Female Female   TRUE
9  467011               Male   Male  FALSE
10 534131               Male   Male  FALSE

(III) Recapitulation

At this point the dto[["unitData"]] elements (raw data files for each study) have been augmented with the harmonized variable female. We retrieve this harmonized variable to view frequency counts across studies:

dumlist <- list()
for(s in dto[["studyName"]]){
  ds <- dto[["unitData"]][[s]]
  dumlist[[s]] <- ds[,c("id","female")]
}
ds <- plyr::ldply(dumlist,data.frame,.id = "study_name")
head(ds)

  study_name  id female
1       alsa  41  FALSE
2       alsa  42   TRUE
3       alsa  61   TRUE
4       alsa  71  FALSE
5       alsa  91  FALSE
6       alsa 121   TRUE

ds$id <- seq_len(nrow(ds)) # id values may repeat across studies; replace them with a unique row index
table( ds$female, ds$study_name, useNA = "always")

       
        alsa lbsl satsa share tilda <NA>
  FALSE 1056  314   610  1139  3780    0
  TRUE  1031  342   887  1459  4724    0
  <NA>     0    0     0     0     0    0
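As a quick plausibility check on the harmonized variable, the counts in this cross-tabulation can be turned into within-study proportions. The sketch below re-enters the counts by hand rather than reading them from the DTO, so it runs on its own; the object names are illustrative.

```r
# Counts copied from the cross-tabulation of female by study.
counts <- rbind(
  "FALSE" = c(alsa = 1056, lbsl = 314, satsa = 610, share = 1139, tilda = 3780),
  "TRUE"  = c(alsa = 1031, lbsl = 342, satsa = 887, share = 1459, tilda = 4724)
)

# Column totals should match each study's sample size
# (e.g. 656 rows for LBSL, as in the data frame printed earlier).
colSums(counts)

# Column-wise proportions: share of females (TRUE) within each study.
prop_female <- round(prop.table(counts, margin = 2)["TRUE", ], 3)
prop_female
```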

Finally, having added the newly created harmonized variables to the raw source objects, we save the data transfer object.

# Save as a compressed, binary R dataset.  It's no longer readable with a text editor, but it preserves metadata (e.g., factor information).
saveRDS(dto, file="./data/unshared/derived/dto.rds", compress="xz")
