BBMC Validator: catch and communicate data errors

OUHSC Statistical Computing User Group

Will Beasley¹, Geneva Marshall¹, Thomas Wilson¹,
Som Bohora², & Maleeha Shahid²

  1. Biomedical and Behavioral Methodology Core (BBMC)
  2. Center on Child Abuse and Neglect (CCAN)

November 1, 2016

Objectives for Validator Software

  1. catches & displays data entry errors,
  2. communicates problems to statisticians,
  3. communicates problems to data collectors and managers
    (who typically have some tech phobia),
  4. executes with automation, and
  5. produces a self-contained report file that can be emailed (a render sketch follows).
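
One way to meet objective 5 is to knit the checks into a single HTML file with every asset inlined. A minimal sketch, assuming the report lives in a hypothetical `validator.Rmd`:

# Render one self-contained HTML file (images & CSS inlined),
# so the attachment survives any email client.
rmarkdown::render(
  input         = "validator.Rmd",   # hypothetical file name
  output_format = rmarkdown::html_document(self_contained = TRUE),
  output_file   = "validator-report.html"
)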

`validation_check` S3 object

validation_check <- function(
  name, error_message, priority, passing_test
) {
  # Construct an S3 object representing one data check.
  l <- list()
  class(l)         <- "check"
  l$name           <- name           # short identifier for the check
  l$error_message  <- error_message  # message shown in the violation report
  l$priority       <- priority       # integer rank used to sort/filter violations
  l$passing_test   <- passing_test   # function(d); TRUE for each record that passes
  return( l )
}

Declare List of Checks

# Add to this list for new validators.
checks <- list(
  validation_check(
    name          = "record_id_no_white_space",
    error_message = "'record_id' contains white space.",
    priority      = 1L,
    passing_test  = function( d ) {
      !grepl("\\s", d$record_id, perl=T)
    }
  ),
  validation_check(
    name          = "interview_started_set",
    error_message = "`interview_started` can't be missing.",
    priority      = 2L,
    passing_test  = function( d ) {
      !is.na(d$interview_started)
    }
  ),
  ...
)
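
Each `passing_test` is vectorized: it returns one logical per record. A toy demonstration with made-up data:

d_toy <- data.frame(
  record_id        = c("1001", "10 02", "1003"),
  stringsAsFactors = FALSE
)
checks[[1]]$passing_test(d_toy)   # the white-space check above
#> [1]  TRUE FALSE  TRUE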

Execute Checks

# Start with an empty list; it gains one element per check with violations.
ds_violation_list <- list()

for( check in checks ) {
  index      <- length(ds_violation_list) + 1L
  violations <- !check$passing_test(ds_interview)  # TRUE = record fails check

  # Retain only the records that violate the current check.
  ds_violation <- ds_interview %>%
    dplyr::filter(violations)

  if( nrow(ds_violation) > 0L ) {
    ds_violation_list[[index]] <- extract_violation_info(ds_violation, check)
  }
  rm(violations, ds_violation)
}
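
`extract_violation_info()` is project-specific and not shown here. A minimal sketch of what it might return, assuming hypothetical column names and a hypothetical REDCap URL; note it carries only the record ID and check metadata, never PHI:

extract_violation_info <- function( d, check ) {
  # One row per violating record.  The URL links back into REDCap,
  # so the report itself contains no PHI.
  data.frame(
    record_id     = d$record_id,
    check_name    = check$name,
    error_message = check$error_message,
    priority      = check$priority,
    record_url    = paste0(
      "https://redcap.example.edu/redcap/DataEntry/index.php",  # hypothetical
      "?pid=999&id=", d$record_id
    ),
    stringsAsFactors = FALSE
  )
}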

Display Failures as HTML

DT::datatable(
  data         = ds_violation_pretty,
  filter       = "bottom",
  caption      = paste("Violations at", Sys.time()),
  escape       = FALSE,
  options      = list(pageLength = 30, dom = 'tip')
)
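
`ds_violation_pretty` is assumed in the call above. A sketch of how it might be assembled from `ds_violation_list` (the names `ds_violations` and `record_url` come from the hypothetical sketches above): combine, sort the highest-priority problems to the top, and wrap each URL in an anchor tag, which is why `escape = FALSE` is set.

ds_violations <- ds_violation_list %>%
  dplyr::bind_rows() %>%
  dplyr::arrange(priority, record_id)

ds_violation_pretty <- ds_violations %>%
  dplyr::mutate(
    record_url = paste0("<a href='", record_url, "'>open in REDCap</a>")
  )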

Save Failures as CSV

# ---- save-to-disk ----------------------------------
message("Saving list of violations to `", path_output, "`.")

readr::write_csv(ds_violations, path=path_output)

Live Products

Table of Violations

[screenshot: display-table]

Portable HTML Report

[screenshot: full-html-report]

Example Checks

[screenshot: example-checks]

Important Characteristics

  1. No PHI within report.
    Because you can't control where it will be emailed.
  2. URLs link to PHI within REDCap.
    Let REDCap handle all the authentication duties.
  3. Sortable & filterable table.
    By date, user, and error type.
  4. Portable & disconnected report.
    The data collectors aren't always OUHSC employees or on campus.
  5. Database agnostic.
    Accommodates REDCap, SQL Server, CSV, … (a sketch follows).
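
A sketch of that agnosticism, with hypothetical connection details; whichever source is used, it yields the same `ds_interview` data frame, so the checks never change:

# Option A: REDCap, via the REDCapR package.
ds_interview <- REDCapR::redcap_read(
  redcap_uri = "https://redcap.example.edu/api/",   # hypothetical URI
  token      = Sys.getenv("REDCAP_API_TOKEN")
)$data

# Option B: SQL Server, through an already-open DBI connection `channel`.
ds_interview <- DBI::dbGetQuery(channel, "SELECT * FROM dbo.interview")

# Option C: a plain CSV extract.
ds_interview <- readr::read_csv("interviews.csv")   # hypothetical file name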

Human Considerations

  1. Each check should be easy to understand.
  2. Each violation should be as easy as possible to fix.
  3. Send reports frequently to data collectors,
    • so the list doesn't become overwhelmingly large, and
    • so the cases are fresh in their minds.
  4. What other suggestions do you have?

Upcoming Features/Uses

  1. The report reruns every 10 minutes and is displayed in Shiny.
  2. Report-level checks will supplement the record-level checks.
    (e.g., “At least 30% of participants should be female.”; see the sketch after this list.)
  3. Graph the performance of each data collector.
    (Suggested by Geneva Marshall.)
  4. The data collectors could check the report after their 3-hour interview, but before leaving the participant's home.
    (Suggested by Thomas Wilson.)
  5. Pull the reusable code into a package, leaving a file with only the checks and a few project-specific parameters.
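
A sketch of one report-level check, reusing the same constructor; unlike the record-level checks, `passing_test` returns a single logical for the whole dataset (the `gender` column name and the threshold are assumptions):

validation_check(
  name          = "female_proportion_minimum",
  error_message = "At least 30% of participants should be female.",
  priority      = 2L,
  passing_test  = function( d ) {
    # One logical for the entire dataset, not one per record.
    mean(d$gender == "female", na.rm=TRUE) >= 0.30
  }
)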

Generalizable

  • We want this mechanism to be used in almost all our research that involves live data collection. We'll also make it publicly available.
  • Ideally, a single mechanism accommodates all these types of research.
  • How could this be modified/expanded to accommodate your type of research and human environments?

Feedback During Presentation

  • Mike Anderson: Use a similar tool to create an action-item report that fills the gap between (a) static REDCap scheduling and (b) the “errors” of the Validator report.
  • Summer Frank: Building off of Thomas's idea, the data collectors could review only the top-priority violations before leaving the participant, ignoring the checks that can be corrected later without eating up more of the participant's time.
  • Dwayne Geller: Use REDCap's DET (data entry trigger) to run a mini version of the validator that shows errors to the data collector in a REDCap text field. This would save a round trip between the Validator report and REDCap.