3.4 Data Dictionary

Example Data Dictionary

3.4.1 Variable names:

  • Must be unique
  • Should be short and meaningful; longer descriptions can be put in the Description column
  • Must begin with a letter
  • Should not contain spaces or special characters, except for an underscore (_)
  • Long survey questions should be placed in the description. For the variable name, use either a shortened version or simply Q1, Q2, etc.

Bad Variable Names                   Good Variable Names
patient date in clinic               clinic_date
pre-treatment ECOG                   ECOG_pre
Q1. What is your relationship to…    Q1_relationship
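
The naming rules above can be checked automatically before a dictionary is used. The following is a minimal sketch in Python, not part of any particular tool: the function name check_variable_names and the exact pattern are assumptions, and uniqueness is treated here as an exact, case-sensitive match.

```python
import re

# Assumed pattern: a letter first, then only letters, digits, or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_variable_names(names):
    """Return a dict mapping each problematic name to a list of problems.

    Hypothetical helper for illustration; it checks the 'must' rules
    (unique, begins with a letter, no spaces or special characters).
    """
    problems = {}
    seen = set()
    for name in names:
        issues = []
        if not NAME_PATTERN.fullmatch(name):
            issues.append("must begin with a letter and use only letters, digits, or underscores")
        if name in seen:
            issues.append("duplicate name")
        seen.add(name)
        if issues:
            problems[name] = issues
    return problems


# The names from the table above, plus one deliberate duplicate.
report = check_variable_names(
    ["clinic_date", "ECOG_pre", "pre-treatment ECOG", "Q1_relationship", "clinic_date"]
)
for name, issues in report.items():
    print(name, "->", "; ".join(issues))
```

Whether a name is "short and meaningful" still needs a human reader; the sketch only covers the rules that can be tested mechanically.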

3.4.2 Types of Data

Identifiers

  • No personally identifying data should appear, including EMR numbers
  • Instead, keep a sheet, separate from the data, that links Study IDs to patient IDs

Numeric Data

  • Enter continuous data, such as Age or Weight, as a single numeric field without any extra text (e.g. enter 50 instead of 50kg)
  • Do not enter both Age and Age Category. Instead, enter Age and specify AgeCat as a calculated variable
  • Entering data once reduces the amount of data entry and the potential for errors.

Categorical Data

  • Enter the levels of categorical and code variables in the order you would like them presented (e.g. CR=Complete Recovery, PR=Partial Recovery, SD=Stable Disease, PD=Progressive Disease)
  • Categorical data can be entered as numbers, letters or abbreviations instead of text
  • Categories are entered separated by commas
  • Example:
    • T1,T2,T3,T4
  • Codes are entered in the data dictionary in the format code=label separated by commas
  • Examples:
    • 1=Female, 2=Male
    • CR=Complete Recovery, PR=Partial Recovery, SD=Stable Disease, PD=Progressive Disease

Dates

  • Should be entered in an unambiguous format, e.g. “01-Jan-2020”
  • Should not begin with the year (formatting and validation won’t work properly)
  • Dates that begin with a year can still be copied in, but data checks then need to be performed manually
  • Dates after the current date will be highlighted in red as a warning
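
The date rules above can be illustrated with a small check written in Python. This is a sketch under stated assumptions, not the validation the data entry sheet itself performs: the function name check_date is hypothetical, the format string assumes the day-month-year style shown in the example, and the month abbreviation is parsed using an English locale.

```python
from datetime import date, datetime

# Assumed format, matching the example 01-Jan-2020.
# %b relies on the current locale providing English month abbreviations.
DATE_FORMAT = "%d-%b-%Y"


def check_date(text, today=None):
    """Return (parsed_date, warning).

    parsed_date is None when the text is not in the assumed format;
    warning is None when no problem was found.
    """
    today = today or date.today()
    try:
        parsed = datetime.strptime(text.strip(), DATE_FORMAT).date()
    except ValueError:
        return None, "not in DD-Mon-YYYY format"
    if parsed > today:
        # Mirrors the red highlight for dates after the current date.
        return parsed, "date is after the current date"
    return parsed, None


for entry in ["01-Jan-2020", "2020-01-01", "01-Jan-2999"]:
    print(entry, "->", check_date(entry))
```

A date that begins with the year, such as 2020-01-01, is rejected by this sketch, which corresponds to the case where checks must be done by hand.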

3.4.3 Variable Ranges

  • All date, integer and numeric values should have ranges specified. These should correspond to the inclusion and exclusion criteria for your study, or to the natural values the variable can take. If you don’t know the upper limit (e.g. of a biomarker), then put in the maximum reasonable value. Values above this will be flagged, and you can adjust the maximum to include them if you wish.

The Minimum and Maximum can be:

  • values (e.g. 40)
  • variable names (for example, DxDate if you want it to be the minimum for the date of death)
  • for date variables, you can enter today to allow all dates up to the date of data entry
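
The three kinds of Minimum and Maximum described above can be sketched as a small range check in Python. This only illustrates the idea and is not how any particular tool implements it: resolve_bound, in_range, and the record layout are hypothetical, and only DxDate is a name taken from the text.

```python
from datetime import date


def resolve_bound(bound, record, today=None):
    """Turn a Minimum or Maximum entry into a concrete value for one record.

    A bound may be a plain value, the word 'today', or the name of
    another variable in the same record. None means no limit.
    """
    if isinstance(bound, str):
        if bound.lower() == "today":
            return today or date.today()
        if bound in record:
            return record[bound]
        raise KeyError(f"unknown variable name in range: {bound}")
    return bound


def in_range(value, minimum, maximum, record, today=None):
    """Return True when value lies within the resolved bounds (inclusive)."""
    low = resolve_bound(minimum, record, today)
    high = resolve_bound(maximum, record, today)
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


# Hypothetical record; DeathDate and Age are illustrative variable names.
record = {"DxDate": date(2019, 3, 1), "DeathDate": date(2020, 6, 15), "Age": 50}

# Date of death must fall between the DxDate value and the day of data entry.
print(in_range(record["DeathDate"], "DxDate", "today", record))
# Age must fall between two fixed values.
print(in_range(record["Age"], 40, 90, record))
```

Values that fail the check would be flagged for review rather than rejected, so the maximum can be widened if a flagged value turns out to be legitimate.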