This report lists the candidate variables for the DataSchema variables of the construct smoking.

(I) Exposition

This report is a record of interaction with a data transfer object (dto) produced by ./manipulation/0-ellis-island.R.

The next section recaps this script, exposes the architecture of the DTO, and demonstrates the language of interacting with it.

(I.A) Ellis Island

All data land on Ellis Island.

The script 0-ellis-island.R is the first script in the analytic workflow. It accomplishes the following:

1. Reads in raw data files from the candidate studies.
2. Extracts, combines, and exports their metadata (specifically, variable names and labels, if provided) into ./data/meta/names-labels-live/names-labels-live.csv, which is updated every time the Ellis Island script is executed.
3. Augments raw metadata with instructions for renaming and classifying variables. The instructions are provided as manually entered values in ./data/meta/meta-data-map.csv. They are used by automatic scripts in later harmonization and analysis.
4. Combines unit data and metadata into a single DTO to serve as a starting point for all subsequent analyses.
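
The four steps above can be sketched with toy inputs. This is a minimal illustration only, not the contents of 0-ellis-island.R: the two small data frames stand in for the raw study files, the hand-built `meta_map` stands in for the manually edited metadata map, and the name `dto_sketch` is hypothetical. The real DTO also carries a `filePath` element, which is omitted here.

```r
# Toy stand-ins for two raw study files (made-up values)
unit_data <- list(
  alsa = data.frame(id = 1:3, SMOKER = c("Yes", "No", NA), stringsAsFactors = FALSE),
  lbsl = data.frame(id = 4:5, SMK94 = c("no", "yes"), stringsAsFactors = FALSE)
)

# Step 2: extract variable names from every study into one metadata table
names_live <- do.call(rbind, lapply(names(unit_data), function(s) {
  data.frame(study_name = s, name = names(unit_data[[s]]), stringsAsFactors = FALSE)
}))

# Step 3: augment with manually entered classifications (stand-in for the metadata map)
meta_map <- data.frame(
  study_name = c("alsa", "lbsl"),
  name       = c("SMOKER", "SMK94"),
  construct  = "smoking",
  stringsAsFactors = FALSE
)
meta_data <- merge(names_live, meta_map, by = c("study_name", "name"), all.x = TRUE)

# Step 4: combine unit data and metadata into a single list object
dto_sketch <- list(
  studyName = names(unit_data),
  unitData  = unit_data,
  metaData  = meta_data
)
names(dto_sketch)
```

Variables without an entry in the map (here, `id`) keep a missing `construct`, so they are never picked up by a construct-based query.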

# load the product of 0-ellis-island.R, a list object containing data and metadata
dto <- readRDS("./data/unshared/derived/dto.rds")

# the list is composed of the following elements
names(dto)

[1] "studyName" "filePath"  "unitData"  "metaData"

# 1st element - names of the studies as character vector
dto[["studyName"]]

[1] "alsa"  "lbsl"  "satsa" "share" "tilda"

# 2nd element - file paths of the data files for each study as character vector
dto[["filePath"]]

[1] "./data/unshared/raw/ALSA-Wave1.Final.sav"         "./data/unshared/raw/LBSL-Panel2-Wave1.Final.sav" 
[3] "./data/unshared/raw/SATSA-Q3.Final.sav"           "./data/unshared/raw/SHARE-Israel-Wave1.Final.sav"
[5] "./data/unshared/raw/TILDA-Wave1.Final.sav"

# 3rd element - is a list object containing the following elements
names(dto[["unitData"]])

[1] "alsa"  "lbsl"  "satsa" "share" "tilda"

# each of these elements is a raw data set of a corresponding study, for example
dplyr::tbl_df(dto[["unitData"]][["lbsl"]])

Source: local data frame [656 x 25]

        id AGE94 SEX94  MSTAT94 EDUC94     NOWRK94  SMK94                                         SMOKE
     (int) (int) (int)   (fctr)  (int)      (fctr) (fctr)                                        (fctr)
1  4001026    68     1 divorced     16 no, retired     no                                  never smoked
2  4012015    94     2  widowed     12 no, retired     no                                  never smoked
3  4012032    94     2  widowed     20 no, retired     no don't smoke at present but smoked in the past
4  4022004    93     2       NA     NA          NA     NA                                  never smoked
5  4022026    93     2  widowed     12 no, retired     no                                  never smoked
6  4031031    92     1  married      8 no, retired     no don't smoke at present but smoked in the past
7  4031035    92     1  widowed     13 no, retired     no don't smoke at present but smoked in the past
8  4032201    92     2       NA     NA          NA     NA don't smoke at present but smoked in the past
9  4041062    91     1  widowed      7          NA     no don't smoke at present but smoked in the past
10 4042057    91     2       NA     NA          NA     NA                                            NA
..     ...   ...   ...      ...    ...         ...    ...                                           ...
Variables not shown: ALCOHOL (fctr), WINE (int), BEER (int), HARDLIQ (int), SPORT94 (int), FIT94 (int), WALK94 (int),
  SPEC94 (int), DANCE94 (int), CHORE94 (int), EXCERTOT (int), EXCERWK (int), HEIGHT94 (int), WEIGHT94 (int), HWEIGHT
  (int), HHEIGHT (int), SRHEALTH (fctr)

(I.B) Target-H

Everybody wants to be somebody.

We query the metadata set to retrieve all variables potentially tapping the construct smoking. These are the candidates to enter the DataSchema and contribute to computing harmonized variables.

NOTE: what is being retrieved depends on the manually entered values in the column construct of the metadata file ./data/shared/meta-data-map.csv. To specify a different group of variables, edit the metadata, not the script.

meta_data <- dto[["metaData"]] %>%
  dplyr::filter(construct %in% c('smoking')) %>% 
  dplyr::select(study_name, name, construct, label_short, categories, url) %>%
  dplyr::arrange(construct, study_name)
knitr::kable(meta_data)

study_name	name	construct	label_short	categories	url
alsa	SMOKER	smoking	Do you currently smoke cigarettes?	2	link
alsa	PIPCIGAR	smoking	Do you regularly smoke pipe or cigar?	2	link
lbsl	SMK94	smoking	Currently smoke?	2	link
lbsl	SMOKE	smoking	Smoke, tobacco use	3	link
satsa	GEVRSMK	smoking	Do you smoke tobacco?	3	link
satsa	GEVRSNS	smoking	Do you take snuff?	3	link
satsa	GSMOKNOW	smoking	Smoked some last month?	2	link
share	BR0010	smoking	Ever smoked tobacco daily for a year?	2	link
share	BR0020	smoking	Smoke at present?	2	link
share	BR0030	smoking	How many years smoked?	NA
tilda	BH001	smoking	Ever smoked tobacco daily for a year?	2	link
tilda	BH002	smoking	Smoke at present?	2	link
tilda	BH003	smoking	Age when stopped smoking	NA
tilda	BEHSMOKER	smoking	Respondent is a smoker	3	link
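
To illustrate the NOTE above, here is a minimal base-R sketch with a toy metadata table. The rows are a hand-copied subset of the table above, and the `candidates()` helper is hypothetical. Changing a value in the `construct` column changes what is retrieved, with no change to the querying code.

```r
# Toy metadata: a subset of the rows listed above
meta_toy <- data.frame(
  study_name = c("alsa", "alsa", "lbsl"),
  name       = c("SMOKER", "PIPCIGAR", "SMK94"),
  construct  = c("smoking", "smoking", "smoking"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of variables tagged with a given construct
candidates <- function(meta, construct_name) {
  meta[meta$construct %in% construct_name, "name"]
}

candidates(meta_toy, "smoking")

# Re-tagging a variable in the metadata removes it from the query result
meta_toy$construct[meta_toy$name == "PIPCIGAR"] <- "other"
candidates(meta_toy, "smoking")
```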

View the descriptives for smoking for a closer examination of each candidate. After reviewing these descriptives and the relevant codebooks, the following operationalization of the harmonized variables for smoking has been adopted:

(1) `smoke_now` : Are you a smoker presently?

0 - FALSE - healthy choice
1 - TRUE - unhealthy choice

(2) `smoked_ever` : Have you ever smoked?

0 - FALSE - healthy choice
1 - TRUE - unhealthy choice

These two variables will be manufactured by qualitative harmonization in the Development section.

(II) Development

In this section we define the schema sets for harmonizing the smoking construct, i.e. we specify which variables from which studies contribute to computing the harmonized variables. Each schema set (e.g. "alsa" = c("SMOKER", "PIPCIGAR")) has a particular pattern of possible response values on its variables (e.g. "id_1" = c("SMOKER"="YES", "PIPCIGAR"="NO")), which we export for inspection as .csv tables. We then manually edit these .csv tables, populating new columns that map values of the harmonized variables to each response pattern of the schema set variables. Finally, we import the harmonization algorithms encoded in the .csv tables and apply them to compute the harmonized variables, producing a dataset that combines raw and harmonized variables for the smoking construct across studies.

(II.A)

(1) Categorization

Preliminary inspection of the schema sets variables revealed two continuous measures:

dto[["metaData"]] %>% dplyr::filter(study_name=="share", name=="BR0030") %>% dplyr::select(name,label)

    name                 label
1 BR0030 how many years smoked

dto[["metaData"]] %>% dplyr::filter(study_name=="tilda", name=="BH003") %>% dplyr::select(name,label)

   name                                             label
1 BH003 bh003  How old were you when you stopped smoking?

The variable BR0030 of SHARE has been excluded from the Data Schema for operationalizing harmonized items of the construct smoking. We do not need to know how long they smoked, only that they do not smoke now.

The variable BH003 of TILDA has been excluded from the Data Schema for operationalizing harmonized items of the construct smoking. We do not need to know when they quit, only that they did quit (or never smoked).

(2) Schema sets

Having all potential variables in categorical format, we have defined the sets of data schema variables thus:

schema_sets <- list(
  "alsa" = c("SMOKER", "PIPCIGAR"),
  "lbsl" = c("SMK94","SMOKE"),
  "satsa" = c("GSMOKNOW", "GEVRSMK","GEVRSNS"),
  "share" = c("BR0010","BR0020"), # "BR0030" is continuous
  "tilda" = c("BH001","BH002", "BEHSMOKER") # "BH003" is continuous
)
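
Because the schema sets are typed by hand, a misspelled variable name would only surface later, when profiles are extracted. A minimal sketch of a pre-check follows. It assumes toy stand-ins (`unit_toy`, `sets_toy`) rather than the real DTO, and one variable is left out of the toy data on purpose to show what a mismatch looks like.

```r
# Toy stand-ins for dto[["unitData"]] (made-up values)
unit_toy <- list(
  alsa = data.frame(id = 1:2, SMOKER = c("Yes", "No"), PIPCIGAR = c("No", NA)),
  lbsl = data.frame(id = 3:4, SMK94 = c("no", "yes"))
)
sets_toy <- list(
  "alsa" = c("SMOKER", "PIPCIGAR"),
  "lbsl" = c("SMK94", "SMOKE")   # SMOKE is deliberately absent from unit_toy
)

# For each study, list schema-set variables that are not columns of its data
missing_vars <- lapply(names(sets_toy), function(s) {
  setdiff(sets_toy[[s]], names(unit_toy[[s]]))
})
names(missing_vars) <- names(sets_toy)
missing_vars
```

An empty character vector for a study means every variable in its schema set was found.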

Each of these schema sets has a particular pattern of possible response values, for example:

# view the joint profile of responses
dto[["unitData"]][["alsa"]] %>% 
  dplyr::group_by(SMOKER, PIPCIGAR) %>% 
  dplyr::summarize(count = n())

Source: local data frame [5 x 3]
Groups: SMOKER [?]

  SMOKER PIPCIGAR count
  (fctr)   (fctr) (int)
1    Yes      Yes     7
2    Yes       No   169
3     No      Yes    41
4     No       NA  1851
5     NA       NA    19

We output these tables into self-standing .csv files, so we can manually provide the logic of computing harmonized variables.

# define function to extract profiles
response_profile <- function(dto, h_target, study, varnames_values){
  ds <- dto[["unitData"]][[study]]
  varnames_values <- lapply(varnames_values, as.symbol)   # Convert character vector to list of symbols
  d <- ds %>% 
    dplyr::group_by_(.dots=varnames_values) %>% 
    dplyr::summarize(count = n()) 
  write.csv(d,paste0("./data/meta/response-profiles-live/",h_target,"-",study,".csv"))
}
# extract response profile for data schema set from each study
for(s in names(schema_sets)){
  response_profile(dto,
                   study = s,
                   h_target = 'smoking',
                   varnames_values = schema_sets[[s]]
                   )
}

You can examine them in `./data/meta/response-profiles-live/`.
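
For readers without the dplyr version used in this report, the same kind of response profile can be approximated in base R. The sketch below is an assumption-laden stand-in, not the project's function: the data values are made up, and the file is written to a temporary folder rather than to the project directory.

```r
# Toy data using the ALSA variable names (values are made up)
ds <- data.frame(
  SMOKER   = c("Yes", "Yes", "No", "No", NA),
  PIPCIGAR = c("No",  "No",  "Yes", NA,  NA),
  stringsAsFactors = FALSE
)

# Count each combination of responses, keeping NA as its own level
profile <- as.data.frame(table(ds, useNA = "ifany"), responseName = "count",
                         stringsAsFactors = FALSE)
# Drop combinations that never occur in the data
profile <- profile[profile$count > 0, ]

# Write to a temporary file rather than the project folder
out_file <- file.path(tempdir(), "smoking-toy.csv")
write.csv(profile, out_file, row.names = FALSE)
read.csv(out_file, stringsAsFactors = FALSE)
```

Keeping NA as a level matters here, because patterns with missing responses need their own row in the rule table.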

(II.B) `smoke_now`

Are you a smoker presently?

0 - FALSE - healthy choice
1 - TRUE - unhealthy choice

ALSA

Items that can contribute to generating values for the harmonized variable smoke_now are:

dto[["metaData"]] %>%
  dplyr::filter(name %in% c("SMOKER", "PIPCIGAR")) %>%
  dplyr::select(study_name, name, label,categories)

  study_name     name                                 label categories
1       alsa   SMOKER    Do you currently smoke cigarettes?          2
2       alsa PIPCIGAR Do you regularly smoke pipe or cigar?          2

We encode the harmonization rule by manually editing the values in a corresponding .csv file located in ./data/meta/h-rules/. Then, we apply the recoding logic it contains and append the newly created, harmonized variable to the initial data set.

study_name <- "alsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-smoking-alsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("SMOKER", "PIPCIGAR"), 
  harmony_name = "smoke_now"
)

Source: local data frame [5 x 4]
Groups: SMOKER, PIPCIGAR [?]

  SMOKER PIPCIGAR smoke_now     n
   (chr)    (chr)     (lgl) (int)
1     No      Yes      TRUE    41
2     No       NA     FALSE  1851
3    Yes       No      TRUE   169
4    Yes      Yes      TRUE     7
5     NA       NA        NA    19
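
`recode_with_hrule()` is defined elsewhere in the project and its body is not shown in this report. The sketch below is only one plausible way such a step could work, on toy data: a rule table lists each response pattern with the harmonized value assigned to it, and that value is looked up for every respondent. The helper name `apply_hrule_sketch()` and all respondent values are hypothetical. The toy rule table mirrors the mapping shown above.

```r
# Toy unit data (made-up respondents, ALSA variable names)
ds <- data.frame(
  id       = 1:5,
  SMOKER   = c("No", "No", "Yes", "Yes", NA),
  PIPCIGAR = c("Yes", NA,  "No",  "Yes", NA),
  stringsAsFactors = FALSE
)

# Toy rule table: one row per response pattern, plus the harmonized value
hrule <- data.frame(
  SMOKER    = c("No", "No", "Yes", "Yes", NA),
  PIPCIGAR  = c("Yes", NA,  "No",  "Yes", NA),
  smoke_now = c(TRUE, FALSE, TRUE, TRUE, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the harmonized value for each response pattern
apply_hrule_sketch <- function(ds, hrule, variable_names, harmony_name) {
  # Build one key per row; NA is pasted as the string "NA", so missing
  # responses form their own pattern
  key_ds   <- do.call(paste, c(ds[variable_names],    sep = "|"))
  key_rule <- do.call(paste, c(hrule[variable_names], sep = "|"))
  ds[[harmony_name]] <- hrule[[harmony_name]][match(key_ds, key_rule)]
  ds
}

ds <- apply_hrule_sketch(ds, hrule, c("SMOKER", "PIPCIGAR"), "smoke_now")
ds
```

A pattern that is absent from the rule table would receive NA, which is one reason to regenerate the response profiles whenever the raw data change.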

# verify
dto[["unitData"]][["alsa"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "SMOKER", "PIPCIGAR", "smoke_now")

      id SMOKER PIPCIGAR smoke_now
1   2932     No     <NA>     FALSE
2   3111     No     <NA>     FALSE
3   3492     No     <NA>     FALSE
4   5592     No     <NA>     FALSE
5   9681     No     <NA>     FALSE
6  10901     No     <NA>     FALSE
7  27891     No     <NA>     FALSE
8  28311     No     <NA>     FALSE
9  35282     No     <NA>     FALSE
10 42861     No     <NA>     FALSE

LBSL

Items that can contribute to generating values for the harmonized variable smoke_now are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "lbsl", construct == "smoking") %>%
  # dplyr::filter(name %in% c("SMK94", "SMOKE")) %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name  name        label_short categories
1       lbsl SMK94   Currently smoke?          2
2       lbsl SMOKE Smoke, tobacco use          3

study_name <- "lbsl"
path_to_hrule <- "./data/meta/h-rules/h-rules-smoking-lbsl.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("SMK94", "SMOKE"), 
  harmony_name = "smoke_now"
)

Source: local data frame [9 x 4]
Groups: SMK94, SMOKE [?]

  SMK94                                         SMOKE smoke_now     n
  (chr)                                         (chr)     (lgl) (int)
1    no don't smoke at present but smoked in the past     FALSE   272
2    no                                  never smoked     FALSE   205
3    no                         smoke at present time        NA     3
4    no                                            NA     FALSE     3
5   yes don't smoke at present but smoked in the past        NA     3
6   yes                         smoke at present time      TRUE    71
7    NA don't smoke at present but smoked in the past        NA     8
8    NA                                  never smoked        NA     2
9    NA                                            NA        NA    89

# verify
dto[["unitData"]][["lbsl"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "SMK94", "SMOKE", "smoke_now")

        id SMK94                                         SMOKE smoke_now
1  4001026    no                                  never smoked     FALSE
2  4142014    no don't smoke at present but smoked in the past     FALSE
3  4191077    no                                  never smoked     FALSE
4  4192096  <NA>                                          <NA>        NA
5  4221086    no don't smoke at present but smoked in the past     FALSE
6  4242074    no don't smoke at present but smoked in the past     FALSE
7  4272032   yes                         smoke at present time      TRUE
8  4381043    no don't smoke at present but smoked in the past     FALSE
9  4402047    no don't smoke at present but smoked in the past     FALSE
10 4412037    no                                  never smoked     FALSE

SATSA

Items that can contribute to generating values for the harmonized variable smoke_now are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "satsa", construct == "smoking") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name     name             label_short categories
1      satsa  GEVRSMK   Do you smoke tobacco?          3
2      satsa  GEVRSNS      Do you take snuff?          3
3      satsa GSMOKNOW Smoked some last month?          2

study_name <- "satsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-smoking-satsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("GEVRSMK", "GEVRSNS","GSMOKNOW"), 
  harmony_name = "smoke_now"
)

Source: local data frame [33 x 5]
Groups: GEVRSMK, GEVRSNS, GSMOKNOW [?]

                   GEVRSMK                      GEVRSNS GSMOKNOW smoke_now     n
                     (chr)                        (chr)    (chr)     (lgl) (int)
1  No, I have never smoked No, I have never taken snuff       No     FALSE    56
2  No, I have never smoked No, I have never taken snuff      Yes      TRUE     1
3  No, I have never smoked No, I have never taken snuff       NA     FALSE   635
4  No, I have never smoked                  No, I quit.       No     FALSE     1
5  No, I have never smoked                  No, I quit.       NA     FALSE     6
6  No, I have never smoked                          Yes       No     FALSE     2
7  No, I have never smoked                          Yes      Yes      TRUE    13
8  No, I have never smoked                           NA       NA     FALSE     6
9              No, I quit. No, I have never taken snuff       No     FALSE    66
10             No, I quit. No, I have never taken snuff      Yes      TRUE     6
11             No, I quit. No, I have never taken snuff       NA     FALSE   206
12             No, I quit.                  No, I quit.       No     FALSE    13
13             No, I quit.                  No, I quit.       NA     FALSE    25
14             No, I quit.                          Yes       No     FALSE    10
15             No, I quit.                          Yes      Yes      TRUE    34
16             No, I quit.                          Yes       NA     FALSE     3
17             No, I quit.                           NA       No     FALSE     1
18             No, I quit.                           NA      Yes      TRUE     2
19             No, I quit.                           NA       NA     FALSE     3
20                     Yes No, I have never taken snuff       No     FALSE    24
21                     Yes No, I have never taken snuff      Yes      TRUE   249
22                     Yes No, I have never taken snuff       NA        NA     8
23                     Yes                  No, I quit.       No     FALSE     1
24                     Yes                  No, I quit.      Yes      TRUE    13
25                     Yes                          Yes      Yes      TRUE    26
26                     Yes                           NA      Yes      TRUE     6
27                      NA No, I have never taken snuff      Yes      TRUE     1
28                      NA No, I have never taken snuff       NA        NA     2
29                      NA                          Yes      Yes      TRUE     2
30                      NA                          Yes       NA        NA     1
31                      NA                           NA       No     FALSE     9
32                      NA                           NA      Yes      TRUE    12
33                      NA                           NA       NA        NA    54

# verify
dto[["unitData"]][["satsa"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "GEVRSMK", "GEVRSNS","GSMOKNOW", "smoke_now")

        id                 GEVRSMK                      GEVRSNS GSMOKNOW smoke_now
1   134911                     Yes No, I have never taken snuff      Yes      TRUE
2   140411                    <NA>                         <NA>     <NA>        NA
3   185322 No, I have never smoked No, I have never taken snuff     <NA>     FALSE
4   250001             No, I quit. No, I have never taken snuff     <NA>     FALSE
5   293201 No, I have never smoked No, I have never taken snuff     <NA>     FALSE
6  2105662 No, I have never smoked No, I have never taken snuff     <NA>     FALSE
7  2122671                    <NA>                         <NA>     <NA>        NA
8  2168051 No, I have never smoked No, I have never taken snuff     <NA>     FALSE
9  2205702 No, I have never smoked No, I have never taken snuff     <NA>     FALSE
10 2405031                     Yes No, I have never taken snuff      Yes      TRUE

SHARE

Items that can contribute to generating values for the harmonized variable smoke_now are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "share", construct == "smoking") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name   name                           label_short categories
1      share BR0010 Ever smoked tobacco daily for a year?          2
2      share BR0020                     Smoke at present?          2
3      share BR0030                How many years smoked?         NA

study_name <- "share"
path_to_hrule <- "./data/meta/h-rules/h-rules-smoking-share.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("BR0010", "BR0020"), 
  harmony_name = "smoke_now"
)

Source: local data frame [4 x 4]
Groups: BR0010, BR0020 [?]

  BR0010             BR0020 smoke_now     n
   (chr)              (chr)     (lgl) (int)
1     no                 NA     FALSE  1542
2    yes no, i have stopped     FALSE   644
3    yes                yes      TRUE   408
4     NA                 NA        NA     4

# verify
dto[["unitData"]][["share"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "BR0010", "BR0020", "smoke_now")

             id BR0010             BR0020 smoke_now
1  2.505201e+12     no               <NA>     FALSE
2  2.505225e+12     no               <NA>     FALSE
3  2.505247e+12     no               <NA>     FALSE
4  2.505259e+12    yes no, i have stopped     FALSE
5  2.505260e+12    yes no, i have stopped     FALSE
6  2.505267e+12     no               <NA>     FALSE
7  2.505281e+12    yes no, i have stopped     FALSE
8  2.605233e+12     no               <NA>     FALSE
9  2.605274e+12    yes                yes      TRUE
10 2.605299e+12     no               <NA>     FALSE

TILDA

Items that can contribute to generating values for the harmonized variable smoke_now are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "tilda", construct == "smoking") %>%
  dplyr::select(study_name, name, label_short, categories)

  study_name      name                           label_short categories
1      tilda     BH001 Ever smoked tobacco daily for a year?          2
2      tilda     BH002                     Smoke at present?          2
3      tilda     BH003              Age when stopped smoking         NA
4      tilda BEHSMOKER                Respondent is a smoker          3

study_name <- "tilda"
path_to_hrule <- "./data/meta/h-rules/h-rules-smoking-tilda.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("BH001", "BH002","BEHSMOKER" ), 
  harmony_name = "smoke_now"
)

Source: local data frame [4 x 5]
Groups: BH001, BH002, BEHSMOKER [?]

  BH001              BH002 BEHSMOKER smoke_now     n
  (chr)              (chr)     (chr)     (lgl) (int)
1    No  UNDOCUMENTED CODE     Never     FALSE  3726
2   Yes No, I have stopped      Past     FALSE  3213
3   Yes                Yes   Current      TRUE  1564
4    NA  UNDOCUMENTED CODE        NA        NA     1

# verify
dto[["unitData"]][["tilda"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "BH001", "BH002","BEHSMOKER", "smoke_now")

                   id BH001              BH002 BEHSMOKER smoke_now
1  36032                Yes No, I have stopped      Past     FALSE
2  43861                 No  UNDOCUMENTED CODE     Never     FALSE
3  123052               Yes No, I have stopped      Past     FALSE
4  142051               Yes No, I have stopped      Past     FALSE
5  183911               Yes No, I have stopped      Past     FALSE
6  265821               Yes                Yes   Current      TRUE
7  309791                No  UNDOCUMENTED CODE     Never     FALSE
8  313412                No  UNDOCUMENTED CODE     Never     FALSE
9  461482               Yes No, I have stopped      Past     FALSE
10 568411               Yes No, I have stopped      Past     FALSE
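
With `smoke_now` appended to every study, the per-study data can be combined for cross-study comparison, as the introduction to this section anticipates. The following is a minimal sketch on toy data. The object names `unit_toy` and `stacked` are hypothetical, and the values are made up.

```r
# Toy per-study data after harmonization (made-up values)
unit_toy <- list(
  alsa = data.frame(id = 1:2, SMOKER = c("Yes", "No"), smoke_now = c(TRUE, FALSE)),
  lbsl = data.frame(id = 3:4, SMK94  = c("no", NA),    smoke_now = c(FALSE, NA))
)

# Keep id and the harmonized variable, tag each row with its study, and stack
stacked <- do.call(rbind, lapply(names(unit_toy), function(s) {
  data.frame(study_name = s, unit_toy[[s]][c("id", "smoke_now")])
}))

# Cross-study tally of the harmonized variable, NA included
table(stacked$study_name, stacked$smoke_now, useNA = "ifany")
```

Only the harmonized variable is stacked here because the raw variables differ in name and coding from study to study.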

(II.C) `smoked_ever`

Have you ever smoked?

0 - FALSE - healthy choice
1 - TRUE - unhealthy choice

ALSA

Items that can contribute to generating values for the harmonized variable smoked_ever are:

dto[["metaData"]] %>%
  dplyr::filter(name %in% c("SMOKER", "PIPCIGAR")) %>%
  dplyr::select(study_name, name, label,categories)

  study_name     name                                 label categories
1       alsa   SMOKER    Do you currently smoke cigarettes?          2
2       alsa PIPCIGAR Do you regularly smoke pipe or cigar?          2

study_name <- "alsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-smoking-alsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("SMOKER", "PIPCIGAR"), 
  harmony_name = "smoked_ever"
)

Source: local data frame [5 x 4]
Groups: SMOKER, PIPCIGAR [?]

  SMOKER PIPCIGAR smoked_ever     n
   (chr)    (chr)       (lgl) (int)
1     No      Yes        TRUE    41
2     No       NA       FALSE  1851
3    Yes       No        TRUE   169
4    Yes      Yes        TRUE     7
5     NA       NA          NA    19

# verify
dto[["unitData"]][["alsa"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "SMOKER", "PIPCIGAR", "smoked_ever")

      id SMOKER PIPCIGAR smoked_ever
1   4111     No     <NA>       FALSE
2   5222     No     <NA>       FALSE
3  17531     No     <NA>       FALSE
4  19081     No     <NA>       FALSE
5  22882     No     <NA>       FALSE
6  25261     No     <NA>       FALSE
7  25991     No     <NA>       FALSE
8  27001     No     <NA>       FALSE
9  27611   <NA>     <NA>          NA
10 30021     No     <NA>       FALSE

LBSL

Items that can contribute to generating values for the harmonized variable smoked_ever are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "lbsl", construct == "smoking") %>%
  # dplyr::filter(name %in% c("SMK94", "SMOKE")) %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name  name        label_short categories
1       lbsl SMK94   Currently smoke?          2
2       lbsl SMOKE Smoke, tobacco use          3

study_name <- "lbsl"
path_to_hrule <- "./data/meta/h-rules/h-rules-smoking-lbsl.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("SMK94", "SMOKE"), 
  harmony_name = "smoked_ever"
)

Source: local data frame [9 x 4]
Groups: SMK94, SMOKE [?]

  SMK94                                         SMOKE smoked_ever     n
  (chr)                                         (chr)       (lgl) (int)
1    no don't smoke at present but smoked in the past        TRUE   272
2    no                                  never smoked       FALSE   205
3    no                         smoke at present time          NA     3
4    no                                            NA          NA     3
5   yes don't smoke at present but smoked in the past          NA     3
6   yes                         smoke at present time        TRUE    71
7    NA don't smoke at present but smoked in the past        TRUE     8
8    NA                                  never smoked       FALSE     2
9    NA                                            NA          NA    89

# verify
dto[["unitData"]][["lbsl"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "SMK94", "SMOKE", "smoked_ever")

        id SMK94                                         SMOKE smoked_ever
1  4221087    no don't smoke at present but smoked in the past        TRUE
2  4231082  <NA>                                          <NA>          NA
3  4251013    no don't smoke at present but smoked in the past        TRUE
4  4252073    no                                  never smoked       FALSE
5  4282088    no don't smoke at present but smoked in the past        TRUE
6  4291073   yes                         smoke at present time        TRUE
7  4371016    no                                  never smoked       FALSE
8  4452038    no                                  never smoked       FALSE
9  4591000    no                                  never smoked       FALSE
10 4601003    no                                  never smoked       FALSE

SATSA

Items that can contribute to generating values for the harmonized variable smoked_ever are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "satsa", construct == "smoking") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name     name             label_short categories
1      satsa  GEVRSMK   Do you smoke tobacco?          3
2      satsa  GEVRSNS      Do you take snuff?          3
3      satsa GSMOKNOW Smoked some last month?          2

study_name <- "satsa"
path_to_hrule <- "./data/meta/h-rules/h-rules-smoking-satsa.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("GEVRSMK", "GEVRSNS","GSMOKNOW"), 
  harmony_name = "smoked_ever"
)

Source: local data frame [33 x 5]
Groups: GEVRSMK, GEVRSNS, GSMOKNOW [?]

                   GEVRSMK                      GEVRSNS GSMOKNOW smoked_ever     n
                     (chr)                        (chr)    (chr)       (lgl) (int)
1  No, I have never smoked No, I have never taken snuff       No       FALSE    56
2  No, I have never smoked No, I have never taken snuff      Yes          NA     1
3  No, I have never smoked No, I have never taken snuff       NA       FALSE   635
4  No, I have never smoked                  No, I quit.       No       FALSE     1
5  No, I have never smoked                  No, I quit.       NA          NA     6
6  No, I have never smoked                          Yes       No       FALSE     2
7  No, I have never smoked                          Yes      Yes          NA    13
8  No, I have never smoked                           NA       NA       FALSE     6
9              No, I quit. No, I have never taken snuff       No        TRUE    66
10             No, I quit. No, I have never taken snuff      Yes        TRUE     6
11             No, I quit. No, I have never taken snuff       NA        TRUE   206
12             No, I quit.                  No, I quit.       No        TRUE    13
13             No, I quit.                  No, I quit.       NA        TRUE    25
14             No, I quit.                          Yes       No        TRUE    10
15             No, I quit.                          Yes      Yes        TRUE    34
16             No, I quit.                          Yes       NA        TRUE     3
17             No, I quit.                           NA       No        TRUE     1
18             No, I quit.                           NA      Yes        TRUE     2
19             No, I quit.                           NA       NA        TRUE     3
20                     Yes No, I have never taken snuff       No        TRUE    24
21                     Yes No, I have never taken snuff      Yes        TRUE   249
22                     Yes No, I have never taken snuff       NA        TRUE     8
23                     Yes                  No, I quit.       No        TRUE     1
24                     Yes                  No, I quit.      Yes        TRUE    13
25                     Yes                          Yes      Yes        TRUE    26
26                     Yes                           NA      Yes        TRUE     6
27                      NA No, I have never taken snuff      Yes          NA     1
28                      NA No, I have never taken snuff       NA          NA     2
29                      NA                          Yes      Yes          NA     2
30                      NA                          Yes       NA          NA     1
31                      NA                           NA       No          NA     9
32                      NA                           NA      Yes          NA    12
33                      NA                           NA       NA          NA    54

# verify
dto[["unitData"]][["satsa"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "GEVRSMK", "GEVRSNS","GSMOKNOW", "smoked_ever")

        id                 GEVRSMK                      GEVRSNS GSMOKNOW smoked_ever
1   161332                     Yes No, I have never taken snuff      Yes        TRUE
2   163101             No, I quit. No, I have never taken snuff       No        TRUE
3   163402 No, I have never smoked No, I have never taken snuff       No       FALSE
4   180262             No, I quit. No, I have never taken snuff     <NA>        TRUE
5   211191 No, I have never smoked No, I have never taken snuff     <NA>       FALSE
6   216222                     Yes No, I have never taken snuff      Yes        TRUE
7   294162 No, I have never smoked No, I have never taken snuff     <NA>       FALSE
8  2105802             No, I quit. No, I have never taken snuff     <NA>        TRUE
9  2222901             No, I quit. No, I have never taken snuff     <NA>        TRUE
10 2329602             No, I quit. No, I have never taken snuff     <NA>        TRUE

SHARE

Items that can contribute to generating values for the harmonized variable smoked_ever are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "share", construct == "smoking") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name   name                           label_short categories
1      share BR0010 Ever smoked tobacco daily for a year?          2
2      share BR0020                     Smoke at present?          2
3      share BR0030                How many years smoked?         NA

study_name <- "share"
path_to_hrule <- "./data/meta/h-rules/h-rules-smoking-share.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("BR0010", "BR0020"), 
  harmony_name = "smoked_ever"
)

Source: local data frame [4 x 4]
Groups: BR0010, BR0020 [?]

  BR0010             BR0020 smoked_ever     n
   (chr)              (chr)       (lgl) (int)
1     no                 NA       FALSE  1542
2    yes no, i have stopped        TRUE   644
3    yes                yes        TRUE   408
4     NA                 NA          NA     4
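
The summary above reads as a lookup table: each observed combination of BR0010 and BR0020 maps to one value of smoked_ever. The internals of recode_with_hrule are not shown in this report, so the following is only a minimal sketch of such a lookup, assuming the rule is available as a data frame with one row per response combination. The helper apply_hrule_sketch, the in-memory hrule object, and the toy unit data are hypothetical; base R is used to keep the example self-contained.

```r
# Hypothetical rule table: one row per response combination (mirrors the summary above).
hrule <- data.frame(
  BR0010      = c("no", "yes", "yes", NA),
  BR0020      = c(NA, "no, i have stopped", "yes", NA),
  smoked_ever = c(FALSE, TRUE, TRUE, NA),
  stringsAsFactors = FALSE
)

# Toy unit data standing in for dto[["unitData"]][["share"]] (values invented).
unit <- data.frame(
  id     = 1:5,
  BR0010 = c("yes", "no", "yes", NA, "no"),
  BR0020 = c("yes", NA, "no, i have stopped", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up each row's response combination in the rule table.
apply_hrule_sketch <- function(d, rule, variable_names, harmony_name) {
  # Build one key per row. paste() turns NA into the string "NA",
  # so missing responses match the rule rows that list NA.
  key_d    <- do.call(paste, c(d[variable_names], sep = "|"))
  key_rule <- do.call(paste, c(rule[variable_names], sep = "|"))
  # Combinations absent from the rule table yield NA.
  d[[harmony_name]] <- rule[[harmony_name]][match(key_d, key_rule)]
  d
}

unit <- apply_hrule_sketch(unit, hrule, c("BR0010", "BR0020"), "smoked_ever")
print(unit)
```

Keeping the rule as data rather than as nested ifelse() calls means the mapping can be reviewed and edited without touching the recoding code.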

# verify: inspect source items and the harmonized value for 10 randomly sampled respondents
dto[["unitData"]][["share"]] %>%
  dplyr::filter(id %in% sample(unique(id), 10)) %>%
  dplyr::select(id, BR0010, BR0020, smoked_ever)

             id BR0010             BR0020 smoked_ever
1  2.505200e+12    yes                yes        TRUE
2  2.505211e+12     no               <NA>       FALSE
3  2.505213e+12    yes no, i have stopped        TRUE
4  2.505216e+12    yes                yes        TRUE
5  2.505244e+12    yes no, i have stopped        TRUE
6  2.505249e+12    yes                yes        TRUE
7  2.505255e+12     no               <NA>       FALSE
8  2.505281e+12     no               <NA>       FALSE
9  2.505286e+12     no               <NA>       FALSE
10 2.705249e+12     no               <NA>       FALSE

TILDA

Items that can contribute to generating values for the harmonized variable smoked_ever are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "tilda", construct == "smoking") %>%
  dplyr::select(study_name, name, label_short, categories)

  study_name      name                           label_short categories
1      tilda     BH001 Ever smoked tobacco daily for a year?          2
2      tilda     BH002                     Smoke at present?          2
3      tilda     BH003              Age when stopped smoking         NA
4      tilda BEHSMOKER                Respondent is a smoker          3

study_name <- "tilda"
path_to_hrule <- "./data/meta/h-rules/h-rules-smoking-tilda.csv"
dto[["unitData"]][[study_name]] <- recode_with_hrule(
  dto,
  study_name = study_name, 
  variable_names = c("BH001", "BH002","BEHSMOKER" ), 
  harmony_name = "smoked_ever"
)

Source: local data frame [4 x 5]
Groups: BH001, BH002, BEHSMOKER [?]

  BH001              BH002 BEHSMOKER smoked_ever     n
  (chr)              (chr)     (chr)       (lgl) (int)
1    No  UNDOCUMENTED CODE     Never       FALSE  3726
2   Yes No, I have stopped      Past        TRUE  3213
3   Yes                Yes   Current        TRUE  1564
4    NA  UNDOCUMENTED CODE        NA          NA     1

# verify: inspect source items and the harmonized value for 10 randomly sampled respondents
dto[["unitData"]][["tilda"]] %>%
  dplyr::filter(id %in% sample(unique(id), 10)) %>%
  dplyr::select(id, BH001, BH002, BEHSMOKER, smoked_ever)

                   id BH001              BH002 BEHSMOKER smoked_ever
1  21051                Yes                Yes   Current        TRUE
2  40411                Yes No, I have stopped      Past        TRUE
3  43432                 No  UNDOCUMENTED CODE     Never       FALSE
4  56201                 No  UNDOCUMENTED CODE     Never       FALSE
5  176652                No  UNDOCUMENTED CODE     Never       FALSE
6  262522                No  UNDOCUMENTED CODE     Never       FALSE
7  335151               Yes No, I have stopped      Past        TRUE
8  425322                No  UNDOCUMENTED CODE     Never       FALSE
9  578751                No  UNDOCUMENTED CODE     Never       FALSE
10 611201               Yes No, I have stopped      Past        TRUE
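
A random sample of ten respondents can miss rare response combinations. As a complement, one could check every row against the expected mapping once the harmonized column is attached. The sketch below assumes the simple rule visible in the summary for BH001 ("Yes" gives TRUE, "No" gives FALSE, missing stays missing). The toy data frame stands in for dto[["unitData"]][["tilda"]] and its values are invented for illustration.

```r
# Toy stand-in for dto[["unitData"]][["tilda"]] (values invented for illustration).
d <- data.frame(
  id          = 1:6,
  BH001       = c("Yes", "No", "Yes", "No", NA, "Yes"),
  smoked_ever = c(TRUE, FALSE, TRUE, FALSE, NA, TRUE),
  stringsAsFactors = FALSE
)

# Cross-tabulate the source item against the harmonized value, keeping NAs visible.
check <- table(d$BH001, d$smoked_ever, useNA = "always")
print(check)

# Expected mapping: "Yes" -> TRUE, "No" -> FALSE, NA -> NA.
expected <- d$BH001 == "Yes"

# identical() treats NA in the same position as a match.
all_rows_agree <- identical(expected, d$smoked_ever)
all_rows_agree
```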

(III) Recapitulation

At this point, each element of dto[["unitData"]] (the raw data for one study) has been augmented with the harmonized variables smoke_now and smoked_ever. We now retrieve these variables to view frequency counts across studies:

dumlist <- list()
for(s in dto[["studyName"]]){
  ds <- dto[["unitData"]][[s]]
  dumlist[[s]] <- ds[,c("id","smoke_now","smoked_ever")]
}
ds <- plyr::ldply(dumlist,data.frame,.id = "study_name")
head(ds)

  study_name  id smoke_now smoked_ever
1       alsa  41     FALSE       FALSE
2       alsa  42     FALSE       FALSE
3       alsa  61     FALSE       FALSE
4       alsa  71     FALSE       FALSE
5       alsa  91     FALSE       FALSE
6       alsa 121     FALSE       FALSE
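
The loop and plyr::ldply call above subset each study to the shared columns and stack the results, recording the study name in a new column. For readers without plyr installed, a roughly equivalent base R sketch follows. The object toy_dto and its values are invented for illustration and only mimic the shape of the real dto.

```r
# Toy stand-in for the dto: two studies with different extra columns (values invented).
toy_dto <- list(
  studyName = c("alsa", "lbsl"),
  unitData  = list(
    alsa = data.frame(id = c(41, 42), smoke_now = c(FALSE, TRUE),
                      smoked_ever = c(FALSE, TRUE), extra = 1:2),
    lbsl = data.frame(id = c(41, 77), smoke_now = c(NA, FALSE),
                      smoked_ever = c(TRUE, FALSE), other = 3:4)
  )
)

keep <- c("id", "smoke_now", "smoked_ever")

# Subset each study to the shared columns and tag its rows with the study name.
pieces <- lapply(toy_dto[["studyName"]], function(s) {
  data.frame(
    study_name = s,
    toy_dto[["unitData"]][[s]][, keep],
    stringsAsFactors = FALSE
  )
})

# Stack the per-study pieces into one long data frame.
ds_toy <- do.call(rbind, pieces)
rownames(ds_toy) <- NULL
print(ds_toy)
```

Note that id 41 occurs in both toy studies, which is why an id is only unique in combination with study_name.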

ds$id <- seq_len(nrow(ds)) # ids may repeat across studies, so replace them with a unique row index
table( ds$smoke_now, ds$study_name, useNA = "always")

       
        alsa lbsl satsa share tilda <NA>
  FALSE 1851  480  1067  2186  6939    0
  TRUE   217   71   365   408  1564    0
  <NA>    19  105    65     4     1    0

table( ds$smoked_ever, ds$study_name, useNA = "always")

       
        alsa lbsl satsa share tilda <NA>
  FALSE 1851  207   700  1542  3726    0
  TRUE   217  351   696  1052  4777    0
  <NA>    19   98   101     4     1    0
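
Raw counts are hard to compare across studies of different sizes. As a sketch, the counts printed in the smoked_ever table can be turned into per-study proportions. The matrix below is typed in by hand from that table, and the object names counts, prevalence, and missing_share are illustrative only.

```r
# Counts copied from the smoked_ever table above (rows: FALSE, TRUE, <NA>).
counts <- matrix(
  c(1851,  217,  19,   # alsa
     207,  351,  98,   # lbsl
     700,  696, 101,   # satsa
    1542, 1052,   4,   # share
    3726, 4777,   1),  # tilda
  nrow = 3,
  dimnames = list(
    smoked_ever = c("FALSE", "TRUE", "<NA>"),
    study_name  = c("alsa", "lbsl", "satsa", "share", "tilda")
  )
)

# Share of ever-smokers among respondents with a non-missing harmonized value.
valid      <- counts[c("FALSE", "TRUE"), ]
prevalence <- valid["TRUE", ] / colSums(valid)
round(prevalence, 3)

# Share of respondents whose harmonized value is missing.
missing_share <- counts["<NA>", ] / colSums(counts)
round(missing_share, 3)
```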

Finally, having added the newly created harmonized variables to the raw source objects, we save the data transfer object.

# Save as a compressed binary R dataset. It is no longer readable with a text editor, but it preserves metadata (e.g., factor information).
saveRDS(dto, file="./data/unshared/derived/dto.rds", compress="xz")
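
To illustrate what preserving factor information means in practice, here is a small round-trip sketch. It writes a toy list to a temporary file, not to the project path, and the object toy with its values is invented for illustration.

```r
# Toy object standing in for the dto (values invented for illustration).
toy <- list(
  studyName = c("alsa", "lbsl"),
  unitData  = list(
    alsa = data.frame(
      id    = 1:3,
      SMOKE = factor(c("Yes", "No", "No"), levels = c("No", "Yes", "Refused"))
    )
  )
)

# Write to a temporary file with the same compression setting, then read it back.
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path, compress = "xz")
restored <- readRDS(path)

# The restored object matches the original, including the unused level "Refused".
same_object <- identical(toy, restored)
restored_levels <- levels(restored[["unitData"]][["alsa"]][["SMOKE"]])
print(restored_levels)

unlink(path)
```

A CSV export would drop the level set and level order, which is why the binary format suits a DTO that carries labelled categorical variables.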
