3.2 importExcelData Sub- Functions

3.2.1 Import the Data Dictionary

The readDataDict function will import the dictionary file. This is slightly more useful than just using readxl::read_excel because checks will be performed to ensure the dictionary is complete and dates will be converted to easy to read text strings.

The file and sheet name must be specified.

Optional Arguments:

  • colnames Column names of the DataDictionary, defaults to those used in the Excel template
  • range allows a specified range of the sheet to be imported (ie range = “A1:F10”)
  • origin enables the origin of Excel dates to be set (defaults to 30 December 1899)

!readDataDict must be run before readExcelData

3.2.2 Import the Data

The readExcelData function will import the data, using the specifications provided in the data dictionary. Again, this should be slightly more useful than a readxl::read_excel because:

  • dates will be imported nicely
  • variables not in the data dictionary will be removed
  • there is an option to store any import warnings, which may be useful to send to the user
  • the data can be read in ‘strictly’, setting all values that fall outside the allowed range to NA (This can be useful if users insist on using missing value codes)

The file and sheetname, as well as the object returned by the readDataDict function must be specified.

Optional Arguments:

  • range allows a specified range of the sheet to be imported (ie range = “A1:AB102”)
  • origin enables the origin of Excel dates to be set (defaults to 30 December 1899)
  • saveWarnings will return a list with the data frame and the import warnings (if there are any)
  • setErrorsMissing will set all values out of range to NA this should only be done if no warnings are given

3.2.3 Perform Checks

The checkData function will simply check the data against the data dictionary. This requires the outputs of readDataDict and readExcelData.

The checkData function will check for duplicate rows and check that all numeric, integer, date, code and category variables meet the requirements of the data dictionary. It does not check the data types, because these are enforced by the readExcelData function.

The outputs from readDataDict and readExcelData are required.

Optional Argument:

  • id can be used to indicate the patient identifier (ie id =‘SubjectID’) which will provide errors for each id, instead of by row number

checkData returns a list with three elements:

  • a data frame containing all the errors (rows and columns without errors are removed)

  • a two-column data frame with Variable Names in one column and the location of errors in the second (either row numbers or IDs if provided)

  • a two-column data frame with Row Numbers (or IDs) in one columns and the variable containing errors in the second column.

3.2.4 Create Factor Variables

If keepOriginal = TRUE then addFactorVariables function will rename all the code and category variables suffixed with _orig (so ecog becomes ecog_orig) and create factor variables using the codes provided in the data dictionary as the factor levels and labels. Otherwise, if keepOriginal = FALSE then the existing variables are over-written with factor variables.

3.2.5 Create Calculated Variables

The ‘createCalculated’ function calculated variables according the specifications given in the data dictionary. For recoded variables a contingency table is exported to the log file to ensure the categories have been created as expected. For categorised variables a table with the minimum and maximum values on the continuous variable is given for each category and also exported to the log file. For combined variables an attribute is tagged onto the variable so that it is printed as a Pareto graph by the PlotVariables function.