Introduction

anumaan is an R package for preprocessing antimicrobial resistance (AMR) surveillance data and performing GBD-style burden estimation. It takes messy hospital data and produces analysis-ready, standardized datasets - handling organism and antibiotic normalization, event deduplication, MDR/XDR classification, polymicrobial weighting, and DALY calculations.

Key Features

  • Automated pipeline: run_preprocess() runs the full sequence in one call
  • Input validation: column checks, key quality, date repair before every step
  • Organism normalization: handles typos, abbreviations, case, MRSA/CoNS flags
  • Antibiotic classification: WHO AWaRe categories and drug classes
  • AST cleaning: standardizes free-text R/S/I values, flags invalid entries
  • MDR/XDR classification: Magiorakos 2012 (WHO/CDC-compliant)
  • Event deduplication: unique infection episodes across repeat cultures
  • Polymicrobial detection and fractional weighting
  • HAI / CAI classification: derived from admission-to-culture date gap
  • Diagnosis -> ICD-10 -> syndrome mapping: text, rule-based, or embedding
  • Contaminant flagging: syndrome-aware contaminant lists
  • Attrition tracking: patient and event counts at every filter step
  • DALY burden estimation: YLL, YLD, attributable and associated burden
  • Visualization: ggplot2-based resistance, burden, and spatial plots

Installation

# From GitHub
remotes::install_github("saketlab/anumaan")

# From source during development
devtools::install()

Load the Package

library(anumaan)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Data Format

anumaan expects long format: one row per patient-organism-antibiotic combination. The minimum required columns are:

Column            Description
patient_id        Unique patient identifier
organism_name     Organism name (raw, any format)
antibiotic_name   Antibiotic name (raw)
antibiotic_value  Susceptibility result: R, I, or S
date_of_culture   Date the culture was collected

Optional but used when present: date_of_admission, date_of_final_outcome, DOB, Age, specimen_type, gender, final_outcome, infection_type.

sample_data <- data.frame(
  patient_id            = c("P001","P001","P002","P003","P003",
                             "P004","P004","P005","P006","P006"),
  date_of_admission     = as.Date(c("2024-01-10","2024-01-10","2024-01-18",
                                    "2024-02-01","2024-02-01","2024-02-10",
                                    "2024-02-10","2024-03-05","2024-03-15",
                                    "2024-03-15")),
  date_of_culture       = as.Date(c("2024-01-15","2024-01-15","2024-01-20",
                                    "2024-02-03","2024-02-03","2024-02-14",
                                    "2024-02-14","2024-03-07","2024-03-17",
                                    "2024-03-17")),
  date_of_final_outcome = as.Date(c("2024-01-25","2024-01-25","2024-01-28",
                                    "2024-02-15","2024-02-15","2024-02-22",
                                    "2024-02-22","2024-03-18","2024-03-28",
                                    "2024-03-28")),
  final_outcome         = c("Survived","Survived","Died","Survived","Survived",
                             "Died","Died","Survived","Survived","Survived"),
  specimen_type         = c("Blood","Blood","Urine","Blood","Blood",
                             "Urine","Urine","Blood","Blood","Blood"),
  organism_name         = c("E. coli","E. coli","K. pneumoniae","S. aureus",
                             "S. aureus","E. coli","K. pneumoniae",
                             "A. baumannii","K. pneumoniae","K. pneumoniae"),
  antibiotic_name       = c("Ampicillin","Meropenem","Ciprofloxacin","Oxacillin",
                             "Vancomycin","Ampicillin","Ciprofloxacin","Meropenem",
                             "Ceftriaxone","Meropenem"),
  antibiotic_value      = c("R","S","R","R","S","R","R","R","R","S"),
  DOB                   = as.Date(c("1960-05-01","1960-05-01","1975-08-15",
                                    "1990-03-20","1990-03-20","2000-11-05",
                                    "2000-11-05","1955-02-28","1968-07-12",
                                    "1968-07-12")),
  gender                = c("Male","Male","Female","Male","Male",
                             "Female","Female","Male","Female","Female"),
  center_name           = c("Centre A","Centre A","Centre A",
                             "Centre B","Centre B","Centre B","Centre B",
                             "Centre C","Centre C","Centre C"),
  age_years             = c(63,63,48,33,33,23,23,69,55,55),
  Age_bin               = factor(
                            c("45-64","45-64","45-64","18-44","18-44",
                              "18-44","18-44","65+","45-64","45-64"),
                            levels = c("<5","18-44","45-64","65+")),
  is_polymicrobial      = c(FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE),
  location              = c("ICU","ICU","Ward","ICU","ICU",
                             "Ward","Ward","ICU","Ward","Ward"),
  infectious_syndrome   = c("BSI","BSI","UTI","BSI","BSI",
                             "UTI","UTI","BSI","BSI","BSI"),
  stringsAsFactors      = FALSE
)

The Complete Pipeline

run_preprocess() is the single entry point that runs all phases in sequence.

result <- run_preprocess(
  data            = sample_data,
  config          = amr_config(),
  phases          = "all",       # "standardize", "enrich", "derive", or "all"
  verbose         = TRUE,
  validate        = TRUE,
  generate_report = TRUE
)

clean_data     <- result$data      # preprocessed data frame
config_used    <- result$config    # amr_config() object used
processing_log <- result$log       # per-phase transformation logs
report         <- result$report    # preprocessing report
metadata       <- result$metadata  # timing, row counts, phases run

Configuration

Pass arguments to amr_config() to override any default, then supply the result as the config argument of run_preprocess():

config <- amr_config(
  fuzzy_match               = TRUE,          # auto-detect column names
  hai_cutoff                = 3,             # days after admission -> HAI
  event_gap_days            = 14,            # deduplication window
  mortality_window          = 14,            # days post-culture to attribute death
  age_bins                  = "GBD_standard",# "pediatric", "geriatric", or custom
  mdr_definition            = "CDC",         # "CDC", "WHO", or numeric threshold
  xdr_definition            = "CDC",
  intermediate_as_resistant = TRUE,          # treat I as R (except Colistin)
  strict_validation         = FALSE          # warn vs stop on issues
)
print(config)
#> AMR Preprocessing Configuration
#> ================================
#> 
#> Phase 1 - Standardization:
#>   Fuzzy column matching: TRUE
#>   Strict validation: FALSE
#>   Date columns: date_of_admission, date_of_culture, date_of_final_outcome, DOB
#> 
#> Phase 2 - Enrichment:
#>   HAI cutoff: 3 days
#>   Infer department: TRUE
#> 
#> Phase 3 - Derivation:
#>   Event gap: 14 days
#>   Mortality window: 14 days
#>   Age bins: custom
#>   Contaminant method: auto
#>   MDR definition: CDC
#>   XDR definition: CDC
#> 
#> Reference Data:
#>   RR table: GBD_2021
#>   Organism map: default
#>   Antibiotic map: WHO_2023
#> 
#> Processing Rules:
#>   Intermediate as Resistant: TRUE
#>   Verbose: TRUE

Run on a Subset of Phases

# Only standardization
result_std <- run_preprocess(sample_data, phases = "standardize")

# Standardization + Enrichment (no event/MDR derivation)
result_enr <- run_preprocess(sample_data, phases = c("standardize", "enrich"))

Input Validation and Schema Checks

Run these before any join or preprocessing step to catch problems early.

Column and Key Checks

# Check required columns exist and report types
prep_check_columns(
  sample_data,
  required    = c("patient_id", "organism_name", "antibiotic_name", "antibiotic_value"),
  table_label = "sample_data"
)
#> [sample_data] Column check: 17 cols | 4 required | 0 missing | 0 type warnings

# Check key column quality: missing, duplicates, placeholder values
prep_check_keys(
  sample_data,
  key_col     = "patient_id",
  table_label = "sample_data"
)
#> [sample_data] Key 'patient_id': 10 rows | 0 missing (0.0%) | 6 distinct | 4 duplicated | 0 rows missing both date_of_admission and date_of_culture

Full Table Validation

prep_validate_table() combines column checks, key checks, and date coercion in one call. It is the function used internally before every merge step.

val <- prep_validate_table(
  data          = sample_data,
  required_cols = c("patient_id", "organism_name"),
  key_col       = "patient_id",
  date_cols     = c("date_of_admission", "date_of_culture"),
  table_label   = "sample_data"
)
#> [sample_data] Column check: 17 cols | 2 required | 0 missing | 0 type warnings
#> [sample_data] Key 'patient_id': 10 rows | 0 missing (0.0%) | 6 distinct | 4 duplicated | 0 rows missing both date_of_admission and date_of_culture
# val$data        - date-coerced data frame
# val$col_report  - per-column type/missing summary
# val$key_report  - key quality summary

Missingness Report

prep_missingness_report(
  sample_data,
  threshold = 20,   # flag columns with > 20% missing
  cols      = c("patient_id", "organism_name", "antibiotic_value", "DOB")
)
#>           col_name n_total n_missing pct_missing is_high_missing
#> 1       patient_id      10         0           0           FALSE
#> 2    organism_name      10         0           0           FALSE
#> 3 antibiotic_value      10         0           0           FALSE
#> 4              DOB      10         0           0           FALSE
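
The same kind of summary can be approximated in base R. This is a minimal sketch, assuming pct_missing is the share of NA values per column and that a column is flagged when it exceeds the threshold. missingness_sketch() and the toy data frame are hypothetical and are not part of anumaan:

```r
# Hypothetical helper: per-column NA summary in base R (not anumaan's implementation).
missingness_sketch <- function(df, threshold = 20, cols = names(df)) {
  n_total   <- nrow(df)
  # Count NA values in each requested column
  n_missing <- vapply(df[cols], function(x) sum(is.na(x)), integer(1))
  pct       <- 100 * n_missing / n_total
  data.frame(
    col_name         = cols,
    n_total          = n_total,
    n_missing        = unname(n_missing),
    pct_missing      = unname(pct),
    is_high_missing  = unname(pct > threshold),
    row.names        = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data: DOB is missing for two of four rows
toy <- data.frame(
  patient_id = c("P1", "P2", "P3", "P4"),
  DOB        = as.Date(c("1960-05-01", NA, NA, "1990-03-20")),
  stringsAsFactors = FALSE
)
miss <- missingness_sketch(toy, threshold = 20)
miss
```

With a 20% threshold, only DOB (50% missing) is flagged in the toy data.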

Required Fields and Data Quality

validation <- validate_required_fields(
  sample_data,
  required_cols    = c("patient_id", "organism_name",
                        "antibiotic_name", "antibiotic_value"),
  min_completeness = 0.8
)
validation$valid
validation$messages

quality <- validate_data_quality(
  sample_data,
  min_rows        = 1,
  max_missing_pct = 90,
  stop_on_failure = FALSE
)

Detect Schema Drift Across Datasets

Useful when combining data from multiple sources or time periods (data_A and data_B stand for per-centre data frames that you supply):

prep_detect_schema_drift(
  data_list        = list(centre_A = data_A, centre_B = data_B),
  reference_centre = "centre_A"
)

Column Name Standardization

Automatic Mapping

# Renames known aliases to the package-standard names
# e.g. "PatientID" -> "patient_id", "Organism" -> "organism_name"
mapped <- prep_standardize_column_names(
  sample_data,
  fuzzy_match     = TRUE,
  fuzzy_threshold = 0.3
)
#> Auto fuzzy: 'patient_id' -> 'patient_id' (dist 0.00)
#> Auto fuzzy: 'gender' -> 'gender' (dist 0.00)
#> Auto fuzzy: 'location' -> 'location' (dist 0.00)
#> Auto fuzzy: 'date_of_admission' -> 'date_of_admission' (dist 0.00)
#> Auto fuzzy: 'date_of_culture' -> 'date_of_culture' (dist 0.00)
#> Auto fuzzy: 'date_of_final_outcome' -> 'date_of_final_outcome' (dist 0.00)
#> Auto fuzzy: 'final_outcome' -> 'final_outcome' (dist 0.00)
#> Auto fuzzy: 'organism_name' -> 'organism_name' (dist 0.00)
#> Auto fuzzy: 'antibiotic_name' -> 'antibiotic_name' (dist 0.00)
#> Auto fuzzy: 'antibiotic_value' -> 'antibiotic_value' (dist 0.00)
#> Auto fuzzy: 'specimen_type' -> 'specimen_type' (dist 0.00)
#> [prep_apply_column_map] Renamed 13 column(s).
#> [prep_apply_column_map] 4 column(s) not in map (kept as-is): center_name, Age_bin, is_polymicrobial, infectious_syndrome
# mapped$data        - renamed data frame
# mapped$mapping_log - what was renamed and how (exact / fuzzy)
# mapped$unmapped    - columns that couldn't be matched
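
To make the idea of a fuzzy threshold concrete, here is a small base-R sketch. It assumes the distance is the Levenshtein edit distance divided by the length of the longer name; that metric, the helper fuzzy_match_names(), and the example names are assumptions for illustration and may not match what prep_standardize_column_names() does internally:

```r
# Hypothetical sketch of fuzzy column-name matching (not anumaan's implementation).
# Assumption: distance = edit distance / length of the longer of the two names.
fuzzy_match_names <- function(raw, standard, threshold = 0.3) {
  d       <- utils::adist(tolower(raw), tolower(standard))   # raw x standard matrix
  longest <- outer(nchar(raw), nchar(standard), pmax)
  d_norm  <- d / longest
  best      <- apply(d_norm, 1, which.min)                   # closest standard name per raw name
  best_dist <- d_norm[cbind(seq_along(raw), best)]
  # Accept the closest candidate only if it is within the threshold
  ifelse(best_dist <= threshold, standard[best], NA_character_)
}

standard_names <- c("patient_id", "organism_name", "date_of_culture", "gender")
raw_names      <- c("PatientID", "date_of_cultur", "xyz")
matched <- fuzzy_match_names(raw_names, standard_names)
matched
```

Under this metric "PatientID" and "date_of_cultur" fall well inside a 0.3 threshold, while "xyz" matches nothing and stays NA. Names that differ by more than a few characters need an explicit alias, which is what the manual column map is for.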

Manual Column Map

For datasets with non-standard names that fuzzy matching misses:

col_map <- prep_build_column_map(
  sample_data,
  column_map = list(
    patient_id    = c("PtID", "patient_number"),
    organism_name = c("Pathogen", "bug")
  )
)
data_renamed <- prep_apply_column_map(sample_data, col_map)

Assert Standard Names

Stops with a clear error if a required standard column is absent - useful as a guard at the start of analysis scripts:

prep_assert_standard_names(
  sample_data,
  required_standard_names = c("patient_id", "organism_name", "antibiotic_value"),
  strict = TRUE
)

Detect Preprocessing Capabilities

Reports which preprocessing steps are possible given the columns present:

prep_report_capabilities(sample_data)
#> === Preprocessing Capabilities ===
#> Enabled (5):
#>   [+] standardize_organism
#>   [+] standardize_antibiotic
#>   [+] harmonize_ast
#>   [+] flag_polymicrobial
#>   [+] build_outputs
#> Skipped / missing inputs (8):
#>   [-] parse_dates
#>   [-] derive_hai
#>   [-] derive_los
#>   [-] derive_age
#>   [-] standardize_specimen
#>   [-] create_events
#>   [-] flag_contaminants
#>   [-] classify_mdr
#> ===================================

Date Handling

Parse a Single Date Column

Handles Excel serial numbers, Unix timestamps (ms), ISO strings, DMY/MDY formats, and reversed YYYYDDMM patterns automatically:

dates_raw <- c("2024-01-15", "45306", "15/01/2024", NA)
prep_parse_date_column(dates_raw, col_name = "date_of_culture", table_label = "example")
#> [example] 'date_of_culture': 1 value(s) decoded as Excel serial date.
#> [example] 'date_of_culture': 3 / 3 non-missing value(s) successfully parsed as Date (0 failed).
#> [1] "2024-01-15" "2024-01-15" "2024-01-15" NA
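
For readers curious how mixed formats can be decoded, the following base-R sketch handles three of the cases above. It assumes slash-separated dates are day-first and that five-digit integers are Excel serials in the 1900 date system (origin 1899-12-30). parse_mixed_dates() is a hypothetical helper, not the package's parser:

```r
# Hypothetical sketch of mixed-format date parsing (not anumaan's implementation).
parse_mixed_dates <- function(x) {
  x   <- trimws(as.character(x))
  out <- as.Date(rep(NA_real_, length(x)), origin = "1970-01-01")

  # ISO strings: YYYY-MM-DD
  iso <- !is.na(x) & grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  out[iso] <- as.Date(x[iso], format = "%Y-%m-%d")

  # Day-first strings: DD/MM/YYYY (assumed; month-first data needs "%m/%d/%Y")
  dmy <- !is.na(x) & grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$", x)
  out[dmy] <- as.Date(x[dmy], format = "%d/%m/%Y")

  # Five-digit integers: treated as Excel serial dates
  serial <- !is.na(x) & grepl("^[0-9]{5}$", x)
  out[serial] <- as.Date(as.numeric(x[serial]), origin = "1899-12-30")

  out
}

parsed <- parse_mixed_dates(c("2024-01-15", "45306", "15/01/2024", NA))
parsed
```

All three non-missing inputs decode to 2024-01-15, matching the output shown above. Anything the patterns do not recognize stays NA rather than being guessed.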

Coerce All Date Columns in a Data Frame

data_with_dates <- prep_coerce_dates(
  sample_data,
  cols        = c("date_of_admission", "date_of_culture", "date_of_final_outcome"),
  table_label = "sample_data"
)

Validate Date Logic

Checks that admission <= culture <= outcome, and that DOB is before culture:

prep_validate_date_logic(
  sample_data,
  admission_col = "date_of_admission",
  culture_col   = "date_of_culture",
  outcome_col   = "date_of_final_outcome",
  dob_col       = "DOB"
)

Organism Standardization

Normalize Organism Names

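Maps free-text organism names to a standard reference, handling typos, case differences, and abbreviations (E. coli -> Escherichia coli), and sets MRSA/CoNS flags. In the example below, the misspelled "Klebsiella pnuemoniae" resolves only to the genus-level entry "klebsiella spp. other":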

org_data <- data.frame(
  organism_name = c("E. coli", "ACINETOBACTER BAUMANNII",
                     "Klebsiella pnuemoniae", "MRSA", "CoNS")
)
normalized <- prep_standardize_organisms(org_data, organism_col = "organism_name")
#> Normalized 5/5 organisms (100.0%)
#> Result: 5 unique organisms
#> is_MRSA: 1
#> is_MSSA: 0
#> is_MRCONS: 0
#> is_MSCONS: 0
normalized[, c("organism_name", "organism_normalized")]
#>             organism_name              organism_normalized
#> 1                 E. coli                 escherichia coli
#> 2 ACINETOBACTER BAUMANNII          acinetobacter baumannii
#> 3   Klebsiella pnuemoniae            klebsiella spp. other
#> 4                    MRSA            staphylococcus aureus
#> 5                    CoNS coagulase-negative staphylococci

Assign Organism Group

Adds organism_group (e.g. “Enterobacterales”, “Gram-positive cocci”, “Fungal isolates”) - required for MDR/XDR classification:

data <- prep_assign_organism_group(normalized, organism_col = "organism_normalized")
table(data$organism_group)

Extract Genus and Species

data <- prep_extract_genus(data,   organism_col = "organism_normalized")
data <- prep_extract_species(data, organism_col = "organism_normalized")
# Adds: org_genus, org_species

Flag Unmatched Organisms

data <- prep_flag_organism_unmatched(data, organism_col = "organism_normalized")
# Adds: organism_match_status ("matched" / "unmatched")
table(data$organism_match_status)

Specimen, Sex, and Outcome Standardization

data <- prep_standardize_specimens(
  sample_data,
  specimen_col = "specimen_type"
)
#> Specimen normalization: 10/10 matched (100.0%)
#> Sterile classification: 7 sterile, 3 non-sterile

data <- prep_standardize_sex(data, col = "gender")
#> Standardized gender: M=5, F=5, NA=0

data <- prep_standardize_final_outcome(data, col = "final_outcome")
#> Standardized final outcome distribution:
#> 
#>     Died Survived 
#>        3        7

data <- prep_standardize_infection_type(data, col = "infection_type")
#> Warning in prep_standardize_infection_type(data, col = "infection_type"):
#> Column 'infection_type' not found. Skipping infection type standardization.

Typos and shorthand in specimen names are handled with rule-based cleaning and fuzzy matching against the specimen reference table:

spec_typo <- data.frame(
  specimen_type = c("blood culure", "urne c/s", "csff"),
  stringsAsFactors = FALSE
)
spec_out <- prep_standardize_specimens(spec_typo, specimen_col = "specimen_type")
#> Specimen normalization: 3/3 matched (100.0%)
#> Sterile classification: 2 sterile, 1 non-sterile
spec_out[, c("specimen_type", "specimen_normalized")]
#>   specimen_type specimen_normalized
#> 1  blood culure               Blood
#> 2      urne c/s               Urine
#> 3          csff                 CSF

Antibiotic Standardization

Normalize Antibiotic Names

Maps to WHO standard names and adds drug class and AWaRe category:

abx_data <- data.frame(
  antibiotic_name = c("Ampicillin", "Meropenem", "Ciprofloxacin", "Vancomycin")
)
normalized_abx <- prep_standardize_antibiotics(
  abx_data,
  antibiotic_col = "antibiotic_name",
  add_class      = TRUE,
  add_aware      = TRUE
)
#> Normalizing 4 unique antibiotic names against WHO reference (257 antibiotics)...
#> Normalized: 4 unique names -> 4
#> With antibiotic_class: 4 (100.0%)
#> With AWaRe category: 4 (100.0%)
normalized_abx[, c("antibiotic_name", "antibiotic_normalized",
                    "antibiotic_class", "aware_category")]
#>   antibiotic_name antibiotic_normalized antibiotic_class aware_category
#> 1      Ampicillin            ampicillin Aminopenicillins         Access
#> 2       Meropenem             meropenem      Carbapenems          Watch
#> 3   Ciprofloxacin         ciprofloxacin Fluoroquinolones          Watch
#> 4      Vancomycin         vancomycin_iv       Vancomycin          Watch

Classify Drug Class and AWaRe Separately

data <- prep_classify_antibiotic_class(data, antibiotic_col = "antibiotic_normalized")
data <- prep_classify_aware(data, antibiotic_col = "antibiotic_normalized")

AST Value Cleaning

Clean Free-Text R/S/I Values

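Standardizes free-text entries such as “Resistant”, “res”, “r”, and “SENS” to R / I / S. In the example below, the numeric codes “1” and “0” are not recognized and become NA: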

ast_data <- data.frame(
  antibiotic_value = c("R", "Resistant", "S", "Sensitive",
                        "I", "Intermediate", "resistant", "1", "0", NA)
)
cleaned <- prep_clean_ast_values(ast_data, value_col = "antibiotic_value")
#> Cleaning antibiotic values: 9 unique values found
#> Cleaned: 9 unique values -> 3 (S/I/R)
#> [!] 3 values could not be parsed (30.0%)
#> 
#> Value distribution:
#> 
#>    I    R    S <NA> 
#>    2    3    2    3
cleaned
#>    antibiotic_value
#> 1                 R
#> 2                 R
#> 3                 S
#> 4                 S
#> 5                 I
#> 6                 I
#> 7                 R
#> 8              <NA>
#> 9              <NA>
#> 10             <NA>
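
The core of this kind of cleaning is a case-insensitive lookup. Here is a minimal base-R sketch; the alias list and the clean_ast() helper are illustrative assumptions, not the package's reference table:

```r
# Hypothetical sketch of free-text AST cleaning via a named lookup vector.
# The alias list is illustrative, not anumaan's reference table.
ast_lookup <- c(
  r = "R", res = "R", resistant = "R",
  s = "S", sens = "S", sensitive = "S", susceptible = "S",
  i = "I", intermediate = "I"
)

clean_ast <- function(x) {
  key <- tolower(trimws(as.character(x)))
  unname(ast_lookup[key])   # unknown keys and NA both come back as NA
}

raw_ast <- c("R", "Resistant", "S", "Sensitive",
             "I", "Intermediate", "resistant", "1", "0", NA)
cleaned_ast <- clean_ast(raw_ast)
table(cleaned_ast, useNA = "ifany")
```

On the same ten inputs this reproduces the distribution above: three R, two S, two I, and three NA.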

Harmonize AST Interpretations

Applies breakpoint-based harmonization (handles MIC/disk zone inputs):

data <- prep_harmonize_ast(data, ast_col = "antibiotic_value")

Recode Intermediate as Resistant

# Per WHO/EUCAST, I means "susceptible, increased exposure". Keep I as-is, or recode it to R:
data <- prep_recode_intermediate_ast(data, col = "antibiotic_value")

Flag Invalid AST Entries

data <- prep_flag_invalid_ast(data, col = "ast_value_harmonized")
# Adds: ast_invalid (TRUE/FALSE)

Wide-Format AST Import

If your data has one column per antibiotic, pivot to long first:

long_data <- prep_pivot_ast_wide_to_long(
  wide_data,
  id_cols         = c("patient_id", "date_of_culture", "organism_name"),
  antibiotic_cols = NULL   # NULL = auto-detect
)
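
To show the shape change, here is a base-R sketch of the same pivot on a toy table. The data frame and column names are made up for illustration, and dropping untested (NA) cells is an assumption about the desired result, not documented package behaviour:

```r
# Hypothetical toy data: one column per antibiotic
wide_toy <- data.frame(
  patient_id = c("P1", "P2"),
  Ampicillin = c("R", "S"),
  Meropenem  = c("S", NA),
  stringsAsFactors = FALSE
)

id_col   <- "patient_id"
abx_cols <- setdiff(names(wide_toy), id_col)

# Stack one block of rows per antibiotic column
long_toy <- do.call(rbind, lapply(abx_cols, function(a) {
  data.frame(
    patient_id       = wide_toy[[id_col]],
    antibiotic_name  = a,
    antibiotic_value = wide_toy[[a]],
    stringsAsFactors = FALSE
  )
}))

# Assumption: drop cells with no result so each row is a tested combination
long_toy <- long_toy[!is.na(long_toy$antibiotic_value), ]
long_toy
```

The result has one row per patient-antibiotic pair, which is the long layout described under Data Format.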

Age and Demographics

Fill Missing Age from Date of Birth

age_data <- data.frame(
  DOB             = as.Date(c("1960-05-01", "1975-08-15", NA)),
  date_of_culture = as.Date(c("2024-01-15", "2024-01-20", "2024-02-03")),
  Age             = c(NA, NA, 45)
)
result_age <- prep_fill_age(
  age_data,
  age_col   = "Age",
  dob_col   = "DOB",
  date_col  = "date_of_culture",
  overwrite = FALSE
)
#> Enriched Age: 2 rows filled using DOB calculation
#> 
#> Age enrichment summary:
#>            age_method age_confidence n
#> 1 calculated_from_dob           high 2
#> 2            provided           high 1
result_age[, c("Age", "age_method", "age_confidence")]
#>        Age          age_method age_confidence
#> 1 63.70705 calculated_from_dob           high
#> 2 48.43258 calculated_from_dob           high
#> 3 45.00000            provided           high
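
The calculated ages above are consistent with dividing elapsed days by 365.25. A base-R sketch under that assumption (the variable names are hypothetical, and the package may handle edge cases differently):

```r
dob     <- as.Date(c("1960-05-01", "1975-08-15", NA))
culture <- as.Date(c("2024-01-15", "2024-01-20", "2024-02-03"))
age     <- c(NA, NA, 45)

# Assumption: age in years = elapsed days / 365.25
age_from_dob <- as.numeric(culture - dob) / 365.25

# Fill only where the recorded age is missing (mirrors overwrite = FALSE)
age_filled <- ifelse(is.na(age), age_from_dob, age)
age_method <- ifelse(is.na(age) & !is.na(age_from_dob), "calculated_from_dob",
                     ifelse(!is.na(age), "provided", NA_character_))

data.frame(Age = age_filled, age_method = age_method)
```

Rows with neither a recorded age nor a DOB would stay NA in both columns.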

Assign Age Bins

binned <- prep_assign_age_bins(
  result_age,
  age_col = "Age",
  bins    = "GBD_standard"   # or "pediatric", "geriatric", or custom numeric vector
)
#> Assigned age bins: 3 binned, 0 unbinned
table(binned$Age_bin)
#> 
#>    <1   1-5  5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-45 45-50 50-55 55-60 
#>     0     0     0     0     0     0     0     0     0     0     2     0     0 
#> 60-65 65-70 70-75 75-80 80-85   85+ 
#>     1     0     0     0     0     0
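
Binning of this kind can be sketched with base R's cut(). The edges below are read off the labels in the table above, and left-closed intervals are an assumption (consistent with age 45 landing in 45-50 in that output):

```r
ages <- c(63.7, 48.4, 45, 0.5, 90)

# Assumed bin edges: <1, 1-5, then 5-year bands up to 85, then 85+
edges  <- c(0, 1, 5, seq(10, 85, by = 5), Inf)
labels <- c("<1", "1-5",
            paste(seq(5, 80, by = 5), seq(10, 85, by = 5), sep = "-"),
            "85+")

# right = FALSE makes each interval [lower, upper)
age_bin <- cut(ages, breaks = edges, labels = labels, right = FALSE)
table(age_bin)
```

A custom numeric vector of edges can be swapped in for edges in the same way.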

Derive DOB from Components

When DOB is split across day/month/year columns:

data <- prep_derive_dob_from_components(
  data,
  day_col   = "birth_day",
  month_col = "birth_month",
  year_col  = "birth_year"
)

Length of Stay and HAI/CAI

Derive Length of Stay

los_data <- prep_derive_los_from_dates(
  sample_data,
  admission_col = "date_of_admission",
  outcome_col   = "date_of_final_outcome",
  los_col       = "los_days"
)
#> [prep_derive_los_from_dates] 10 rows filled from date_of_final_outcome - date_of_admission.
#> [prep_derive_los_from_dates] 10 filled; 0 remain missing.
#> 
#> LOS summary:
#>   mean_los median_los min_los max_los
#> 1     13.1         13      10      15
los_data[, c("patient_id", "date_of_admission", "date_of_final_outcome", "los_days")]
#>    patient_id date_of_admission date_of_final_outcome los_days
#> 1        P001        2024-01-10            2024-01-25       15
#> 2        P001        2024-01-10            2024-01-25       15
#> 3        P002        2024-01-18            2024-01-28       10
#> 4        P003        2024-02-01            2024-02-15       14
#> 5        P003        2024-02-01            2024-02-15       14
#> 6        P004        2024-02-10            2024-02-22       12
#> 7        P004        2024-02-10            2024-02-22       12
#> 8        P005        2024-03-05            2024-03-18       13
#> 9        P006        2024-03-15            2024-03-28       13
#> 10       P006        2024-03-15            2024-03-28       13

Classify HAI vs CAI

hc_data <- prep_derive_hai_cai(
  sample_data,
  admission_col = "date_of_admission",
  culture_col   = "date_of_culture",
  hai_cutoff    = 3   # cultures >= 3 days after admission = HAI
)
#> Inferring infection type using 3-day HAI cutoff...
#> Enriched infection_type: 10 rows filled
#> 
#> Infection type distribution:
#>   infection_type infection_type_method n
#> 1            CAI  inferred_3day_cutoff 6
#> 2            HAI  inferred_3day_cutoff 4
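# In this run the derived labels are written to infection_type (see the summary
# above); hc_data has no hai_cai column, so table(hc_data$hai_cai) comes back empty.
table(hc_data$infection_type)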
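
The cutoff rule itself is simple date arithmetic. A base-R sketch using one row per patient from sample_data (the variable names are hypothetical, and the rule is taken from the ">= 3 days" comment above):

```r
# Admission and culture dates for P001..P006 from sample_data
admission <- as.Date(c("2024-01-10", "2024-01-18", "2024-02-01",
                       "2024-02-10", "2024-03-05", "2024-03-15"))
culture   <- as.Date(c("2024-01-15", "2024-01-20", "2024-02-03",
                       "2024-02-14", "2024-03-07", "2024-03-17"))
hai_cutoff <- 3

# Days from admission to culture
gap_days <- as.numeric(culture - admission)

# Culture taken on or after the cutoff day counts as hospital-acquired
infection_type <- ifelse(is.na(gap_days), NA_character_,
                         ifelse(gap_days >= hai_cutoff, "HAI", "CAI"))
data.frame(gap_days, infection_type)
```

P001 (5 days) and P004 (4 days) are classified as HAI. Each contributes two rows in sample_data, which gives the 4 HAI and 6 CAI rows reported above.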

Flag HAI Observations Not Clinically Coded

data <- prep_flag_hai_inferred(data)
# Adds: hai_inferred (TRUE where hai_cai derived differs from coded infection_type)

Derive ICU Flag

data <- prep_derive_icu_flag(data, department_col = "department")
# Adds: is_icu (TRUE/FALSE)

Event Deduplication

Repeat cultures of the same organism within a time window belong to the same infection episode. prep_create_event_ids() groups them into events.

event_data <- prep_create_event_ids(
  sample_data,
  patient_col  = "patient_id",
  date_col     = "date_of_culture",
  organism_col = "organism_name",
  specimen_col = "specimen_type",
  gap_days     = 14   # > 14 days between same organism/site = new event
)
#> Creating events (gap threshold: >14 days) ...
#> Done: 6 patients -> 7 events (1.17 per patient)
event_data[, c("patient_id", "organism_name", "date_of_culture", "event_id")]
#>    patient_id organism_name date_of_culture
#> 1        P001       E. coli      2024-01-15
#> 2        P001       E. coli      2024-01-15
#> 3        P002 K. pneumoniae      2024-01-20
#> 4        P003     S. aureus      2024-02-03
#> 5        P003     S. aureus      2024-02-03
#> 6        P004       E. coli      2024-02-14
#> 7        P004 K. pneumoniae      2024-02-14
#> 8        P005  A. baumannii      2024-03-07
#> 9        P006 K. pneumoniae      2024-03-17
#> 10       P006 K. pneumoniae      2024-03-17
#>                                 event_id
#> 1        P001_blood_20240115_e. coli_001
#> 2        P001_blood_20240115_e. coli_001
#> 3  P002_urine_20240120_k. pneumoniae_001
#> 4      P003_blood_20240203_s. aureus_001
#> 5      P003_blood_20240203_s. aureus_001
#> 6        P004_urine_20240214_e. coli_001
#> 7  P004_urine_20240214_k. pneumoniae_002
#> 8   P005_blood_20240307_a. baumannii_001
#> 9  P006_blood_20240317_k. pneumoniae_001
#> 10 P006_blood_20240317_k. pneumoniae_001
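
The grouping logic can be sketched in base R. This version assumes the gap is measured from the previous culture in the same patient/organism/specimen group; whether the package measures from the previous culture or from the start of the event is not stated here. The toy data are hypothetical:

```r
cultures <- data.frame(
  patient_id = c("P1", "P1", "P1", "P2"),
  organism   = c("E. coli", "E. coli", "E. coli", "E. coli"),
  specimen   = c("Blood", "Blood", "Blood", "Urine"),
  date       = as.Date(c("2024-01-01", "2024-01-10", "2024-02-15", "2024-01-05")),
  stringsAsFactors = FALSE
)
gap_days <- 14

# Sort so that diff() compares consecutive cultures within a group
cultures <- cultures[order(cultures$patient_id, cultures$organism,
                           cultures$specimen, cultures$date), ]
group_key <- paste(cultures$patient_id, cultures$organism,
                   cultures$specimen, sep = "|")

# Start a new event whenever the gap to the previous culture exceeds gap_days
cultures$event_seq <- ave(
  as.numeric(cultures$date), group_key,
  FUN = function(d) cumsum(c(TRUE, diff(d) > gap_days))
)
cultures
```

For P1 the second culture (9 days later) joins event 1, while the third (36 days after the second) opens event 2.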

Remove Duplicate Rows Within Events

deduped <- prep_deduplicate_events(
  event_data,
  event_col      = "event_id",
  organism_col   = "organism_name",
  antibiotic_col = "antibiotic_name"
)

Readmission Classification

data <- prep_flag_readmission(
  data,
  patient_col   = "patient_id",
  admission_col = "date_of_admission",
  gap_days      = 30
)
data <- prep_classify_readmission(data, readmission_col = "readmission_class")

Contaminant Detection

Syndrome-Aware Contaminant Lists

# Retrieve the contaminant list for a syndrome
contaminants <- prep_get_contaminant_list(
  syndrome   = "Bloodstream infections",
  return_all = FALSE
)
contaminants$names
#>  [1] "Coagulase-negative Staphylococcus" "Staphylococcus epidermidis"       
#>  [3] "Staphylococcus hominis"            "Staphylococcus haemolyticus"      
#>  [5] "Staphylococcus capitis"            "Staphylococcus warneri"           
#>  [7] "Corynebacterium species"           "Corynebacterium striatum"         
#>  [9] "Corynebacterium jeikeium"          "Cutibacterium"                    
#> [11] "Cutibacterium acnes"               "Micrococcus species"              
#> [13] "Micrococcus luteus"                "Bacillus (non-anthracis)"         
#> [15] "Bacillus subtilis"                 "Bacillus cereus"                  
#> [17] "Viridans streptococci"             "Aerococcus spp."                  
#> [19] "Kocuria spp."                      "Dermacoccus spp."                 
#> [21] "Rothia spp."

Flag Likely Contaminants

data <- prep_flag_contaminants(
  data,
  method = "auto"   # "auto", "device_based", "heuristic", "provided"
)
table(data$is_contaminant)

Test Individual Organism Names

prep_is_contaminant(
  organism_name = c("Staphylococcus epidermidis", "Escherichia coli"),
  syndrome      = "Bloodstream infections"
)
#> [1]  TRUE FALSE

Exclude Fungal Isolates

data <- prep_filter_fungal(data, organism_group_col = "organism_group")

Polymicrobial Infections

Multiple organisms from the same patient-episode must be weighted so each organism does not count as a full independent event.

# 1. Flag episodes with > 1 organism
poly_data <- prep_flag_polymicrobial(
  event_data,
  patient_col  = "patient_id",
  organism_col = "organism_name"
)
#> Identifying polymicrobial infections ...
#> 
#> Polymicrobial: 1/6 patient-context groups (16.7%)
#> 
#> Organism count distribution per patient-context:
#> # A tibble: 2 × 2
#>   n_organisms n_groups
#>         <int>    <int>
#> 1           1        5
#> 2           2        1
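table(poly_data$is_polymicrobial)
#> < table of extent 0 >
# An empty table means there is no top-level is_polymicrobial column in poly_data;
# use names(poly_data) or str(poly_data) to see where the flag is stored.

# 2. Assign fractional weight to each organism within an episode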
poly_data <- prep_compute_poly_weights(
  poly_data,
  episode_col = "event_id",
  method      = "monomicrobial_proportion"
)

# 3. Expand: one row per organism with its weight
poly_data <- prep_split_poly_episode(poly_data)
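
To illustrate fractional weighting, the sketch below uses the simplest possible scheme: an equal split of 1/n across the n organisms in an episode. This is a deliberately simplified stand-in and is not the "monomicrobial_proportion" method named above, whose details are not described here. The toy data are hypothetical:

```r
isolates <- data.frame(
  event_id = c("ev1", "ev2", "ev2", "ev3"),
  organism = c("E. coli", "E. coli", "K. pneumoniae", "S. aureus"),
  stringsAsFactors = FALSE
)

# Number of distinct organisms per episode
n_org_by_event <- tapply(isolates$organism, isolates$event_id,
                         function(o) length(unique(o)))
isolates$n_organisms      <- as.integer(n_org_by_event[isolates$event_id])
isolates$is_polymicrobial <- isolates$n_organisms > 1

# Equal-split weight: each organism gets 1/n of its episode
isolates$weight <- 1 / isolates$n_organisms
isolates
```

Whatever scheme is used, the weights within an episode should sum to 1 so that the total across all rows equals the number of episodes (3 here).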

MDR / XDR Classification

Classification follows Magiorakos et al. (2012) criteria. Requires organism_group, antibiotic_class, and antibiotic_value.

# Collapse to antibiotic-class level first (one row per event-class)
class_data <- prep_collapse_class_level(
  data,
  event_col          = "event_id",
  organism_col       = "organism_normalized",
  class_col          = "antibiotic_class",
  susceptibility_col = "antibiotic_value"
)

# Classify
class_data <- prep_classify_mdr(class_data, definition = "CDC")
class_data <- prep_classify_xdr(class_data, definition = "CDC")

table(class_data$mdr)
table(class_data$xdr)

Reference Tables

# MDR/XDR thresholds per organism group (Magiorakos 2012)
anumaan:::get_magiorakos_thresholds()
#> # A tibble: 6 × 4
#>   organism_group           mdr_threshold xdr_threshold total_categories
#>   <chr>                            <dbl> <chr>                    <dbl>
#> 1 Enterobacterales                     3 all_but_2                    9
#> 2 Pseudomonas aeruginosa               3 all_but_2                   10
#> 3 Acinetobacter spp                    3 all_but_2                    9
#> 4 Staphylococcus aureus                3 all_but_2                    9
#> 5 Enterococcus spp                     3 all_but_2                    7
#> 6 Streptococcus pneumoniae             3 all_but_2                    5

# Antimicrobial categories for a specific organism group
anumaan:::get_antimicrobial_categories("Enterobacterales")
#> [1] "Aminoglycosides"                        
#> [2] "Carbapenems"                            
#> [3] "Cephalosporins (3rd gen)"               
#> [4] "Cephalosporins (4th gen)"               
#> [5] "Fluoroquinolones"                       
#> [6] "Monobactams"                            
#> [7] "Penicillins + beta-lactamase inhibitors"
#> [8] "Polymyxins"                             
#> [9] "Tigecycline"

# Beta-lactam hierarchy (broadest -> narrowest spectrum)
anumaan:::get_beta_lactam_hierarchy()
#> [1] "Carbapenems"                                          
#> [2] "Fourth-generation-cephalosporins"                     
#> [3] "Third-generation-cephalosporins"                      
#> [4] "Beta-lactam/beta-lactamase-inhibitor_anti-pseudomonal"
#> [5] "Beta-lactam/beta-lactamase-inhibitor"                 
#> [6] "Aminopenicillins"                                     
#> [7] "Penicillins"
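
The thresholds in the first table above translate into a simple counting rule. Under Magiorakos et al. (2012), an isolate is MDR when it is non-susceptible to at least one agent in three or more antimicrobial categories, and XDR when it is non-susceptible in all but two or fewer categories. The sketch below applies that rule to hypothetical class-level data; it treats untested categories as susceptible, which is a simplification, and it is not the package's implementation:

```r
# One row per isolate x antimicrobial category (hypothetical data);
# TRUE = non-susceptible to at least one agent in that category
class_level <- data.frame(
  event_id = rep(c("ev1", "ev2"), each = 4),
  category = rep(c("Aminoglycosides", "Carbapenems",
                   "Fluoroquinolones", "Cephalosporins (3rd gen)"), times = 2),
  non_susceptible = c(TRUE, TRUE, TRUE, FALSE,
                      FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

mdr_threshold    <- 3   # from the threshold table
total_categories <- 9   # Enterobacterales row of the threshold table

# Count non-susceptible categories per isolate
n_resistant <- tapply(class_level$non_susceptible, class_level$event_id, sum)

is_mdr <- n_resistant >= mdr_threshold
is_xdr <- n_resistant >= total_categories - 2   # "all_but_2"
data.frame(n_resistant, is_mdr, is_xdr)
```

Here ev1 is MDR (three categories) and ev2 is not (two); neither reaches the seven categories needed for XDR.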

Resistance Profiles

Per-Isolate Resistance Profile

data <- prep_create_resistance_profile(
  data,
  organism_col       = "organism_normalized",
  antibiotic_col     = "antibiotic_normalized",
  susceptibility_col = "antibiotic_value"
)

Wide AST Matrix

Pivot to one row per isolate, one column per antibiotic:

wide_ast <- prep_create_wide_ast_matrix(
  data,
  event_col          = "event_id",
  antibiotic_col     = "antibiotic_normalized",
  susceptibility_col = "antibiotic_value"
)

Diagnosis and Syndrome Mapping

Map free-text diagnoses to ICD-10 codes and then to clinical syndromes.

Text-Based Diagnosis Mapping

# Rule-based matching (no Python required)
data <- prep_map_diagnosis_to_icd(
  data,
  diagnosis_col = "diagnosis",
  method        = "text_match"   # or "python_embedding" (requires alethia)
)

ICD-10 to Syndrome

data <- prep_map_icd_to_syndrome(
  data,
  icd_col        = "icd10_code",
  hierarchy_path = NULL   # NULL uses the bundled infectious_syndrome_hierarchy.csv
)

Assign Patient-Level Syndrome

data <- prep_assign_patient_syndrome(
  data,
  patient_col   = "patient_id",
  syndrome_col  = "infectious_syndrome"
)

Outcome Cohorts and Attrition Tracking

Track Patient Counts at Every Step

prep_attrition_flow() accumulates a table of rows, unique patients, and unique events at each stage of your pipeline:

# Initialize
flow <- prep_attrition_flow(
  flow        = NULL,
  data        = sample_data,
  stage_name  = "raw_input",
  reason      = "all loaded records",
  patient_col = "patient_id"
)

# After a filter step
filtered_data <- sample_data[sample_data$antibiotic_value != "S", ]
flow <- prep_attrition_flow(
  flow        = flow,
  data        = filtered_data,
  stage_name  = "resistant_only",
  reason      = "removed susceptible records",
  patient_col = "patient_id"
)
flow
#>            stage n_rows n_patients n_events n_removed
#> 1      raw_input     10          6       NA         0
#> 2 resistant_only      7          6       NA         3
#>                        reason
#> 1          all loaded records
#> 2 removed susceptible records
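The bookkeeping behind such a table is simple to sketch in base R. The track_stage() helper and the toy data below are hypothetical; they only mirror the n_rows, n_patients, and n_removed columns shown above and are not the package's implementation.

```r
# Hypothetical helper: append one stage to a running attrition table
track_stage <- function(flow, data, stage, patient_col = "patient_id") {
  # Rows at the previous stage (or the current count for the first stage)
  prev_rows <- if (is.null(flow)) nrow(data) else flow$n_rows[nrow(flow)]
  rbind(flow, data.frame(
    stage      = stage,
    n_rows     = nrow(data),
    n_patients = length(unique(data[[patient_col]])),
    n_removed  = prev_rows - nrow(data),
    stringsAsFactors = FALSE
  ))
}

toy <- data.frame(
  patient_id       = c("P1", "P1", "P2", "P3", "P3"),
  antibiotic_value = c("R", "S", "R", "S", "R"),
  stringsAsFactors = FALSE
)

toy_flow <- track_stage(NULL, toy, "raw_input")
toy_flow <- track_stage(toy_flow, toy[toy$antibiotic_value != "S", ],
                        "resistant_only")
toy_flow
```

Note that rows can be removed without losing any patients, which is why row and patient counts are tracked separately.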

Analytical Readiness Filters

# Keep only patients with at least one non-NA value in every required column
ready <- prep_filter_analysis_ready(
  data,
  required_cols = c("final_outcome", "organism_name",
                     "antibiotic_name", "antibiotic_value")
)
# ready$data          - filtered data
# ready$attrition     - attrition table
# ready$patient_flags - per-patient completeness flags

# Final gate: remove rows that are completely unusable
data <- prep_filter_minimally_usable(data)

# Full analytical readiness assertion (stops if required fields absent)
prep_validate_analysis_ready(data)

Build Outcome Cohorts

fatal_cohort    <- prep_build_fatal_cohort(data,
                     outcome_col = "final_outcome",
                     fatal_value = "Died")

nonfatal_cohort <- prep_build_nonfatal_cohort(data,
                     outcome_col = "final_outcome",
                     fatal_value = "Died")

DALY Burden Estimation

After preprocessing, anumaan supports GBD-style DALY calculations, where DALY = Years of Life Lost (YLL) + Years Lived with Disability (YLD).
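As a worked example of the arithmetic, the sketch below uses the standard incidence-based definitions: YLL is deaths multiplied by remaining life expectancy at the age of death, and YLD is incident cases multiplied by a disability weight and a duration. All input numbers are made up, and the daly_* functions may stratify or weight these quantities differently.

```r
# Made-up inputs for a single stratum (illustration only)
deaths             <- 12     # deaths in the stratum
life_exp_remaining <- 30.5   # remaining life expectancy at age of death (years)
cases              <- 200    # incident cases
disability_weight  <- 0.133  # 0 = full health, 1 = equivalent to death
duration_years     <- 0.05   # average duration of the health state (years)

yll  <- deaths * life_exp_remaining
yld  <- cases * disability_weight * duration_years
daly <- yll + yld

c(YLL = yll, YLD = yld, DALY = daly)
```

In this toy stratum almost all of the burden comes from YLL, which is typical when deaths occur at ages with many remaining years of life.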

Pathogen Fractions and Deaths

# Total deaths by cause
deaths <- daly_calc_deaths_by_cause(pop_data, data)

# Fraction attributable to infection
inf_fraction <- daly_calc_infection_fraction(pop_data, data)

# Syndrome-specific death counts
syndrome_deaths <- daly_calc_deaths_by_syndrome(
  d_j      = deaths,
  inf_frac = inf_fraction
)

# Incident cases from case-fatality ratios
incidents <- daly_calc_incidence_from_cfr(deaths_L = syndrome_deaths)

# Pathogen fraction of deaths (isolate-level)
pfrac <- daly_calc_pathogen_fraction_fatal(data, syndrome_deaths)

Years of Life Lost (YLL)

# Associated YLL: all deaths in patients with this pathogen
yll_associated <- daly_calc_yll_associated(
  data,
  life_expectancy_path = "GBD_2019_LE.csv"
)

# Attributable YLL: deaths caused by resistance (RR-based)
yll_attributable <- daly_calc_yll_attributable(
  data,
  rr_mortality = rr_table
)

Years Lived with Disability (YLD) and PAF

yld_base <- daly_calc_yld_baseline(incidence_data)

paf_los  <- daly_calc_paf_los(data, rr_los = rr_los_table)

yld_associated   <- daly_calc_fraction_associated_yld(yld_base, paf_los)
yld_attributable <- daly_calc_yld_attributable(yld_base, paf_los)
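A common way to turn a relative risk into a population attributable fraction is Levin's formula, PAF = p(RR - 1) / (1 + p(RR - 1)), where p is the prevalence of resistance. The sketch below applies it to made-up values; paf_levin() is a hypothetical helper, and whether daly_calc_paf_los() uses exactly this form is an assumption.

```r
# Levin's formula for the population attributable fraction
paf_levin <- function(p_resistant, rr) {
  excess <- p_resistant * (rr - 1)
  excess / (1 + excess)
}

# Made-up example: 40% of isolates resistant, relative risk of 1.5
paf <- paf_levin(p_resistant = 0.4, rr = 1.5)
paf

# Attributable burden = PAF x baseline burden (hypothetical baseline of 10 YLD)
yld_baseline_toy <- 10
paf * yld_baseline_toy
```

With RR = 1 the formula returns 0, so resistance that carries no excess risk contributes no attributable burden.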

Hospital-Level DALY Summary

hospital_daly <- compute_hospital_daly(
  hospital_counts  = centre_counts,
  total_deaths     = sum(deaths$total),
  total_discharged = sum(centre_counts$discharged_h),
  yll_base         = sum(yll_base),
  yll_associated   = sum(yll_associated),
  yll_attributable = sum(yll_attributable),
  yld_base         = sum(yld_base),
  yld_associated   = sum(yld_associated),
  yld_attributable = sum(yld_attributable)
)

Visualization

All plot functions return a ggplot2 object that can be further customized.

Resistance Heatmap

plot_resistance_heatmap(
  data        = result$data,
  isolate_col = "organism_normalized",
  class_col   = "antibiotic_class",
  result_col  = "antibiotic_value"
)

Bar and Grouped Bar Charts

plot_bar(
  data  = result$data,
  x     = "organism_normalized",
  fill  = "mdr",
  title = "MDR Status by Organism"
)

plot_grouped_bar(
  data = result$data,
  x    = "organism_group",
  fill = "aware_category"
)

plot_stacked_bar(
  data = result$data,
  x    = "specimen_type",
  fill = "mdr"
)

LOS Distributions

plot_los_distributions(hospital_daly)

DALY Burden Plots

# Hospital-level YLL/YLD bars (output of compute_hospital_daly())
plot_burden_by_hospital(hospital_daly, metric = "YLL")
plot_burden_by_hospital(hospital_daly, metric = "YLD")

# Top organisms by YLL/YLD burden
plot_burden_by_organism(organism_yll, metric = "YLL", n = 8)

# YLL heatmap: resistance class x pathogen group
plot_yll_heatmap(yll_class_data, value_col = "YLL_class",
                 type = "attributable", n_admissions = 500)

# YLD heatmap: associated vs attributable by organism
plot_yld_heatmap(yld_data, n_admissions = 500)

Consistent AMR Theme

library(ggplot2)
ggplot(result$data, aes(x = organism_normalized)) +
  geom_bar() +
  amr_theme()

EDA Plots

The plot_* functions produce ready-made exploratory charts. The examples below use the same sample_data defined earlier. Where function defaults differ from sample_data column names, explicit overrides are shown.

Enrolment

# Unique patients per hospital, sorted by count
plot_patients_by_hospital(sample_data, patient_col = "patient_id")

# Syndrome distribution -- overall, faceted, or single centre
plot_syndrome_distribution(sample_data, mode = "overall",
                           patient_col = "patient_id")
plot_syndrome_distribution(sample_data, mode = "faceted",
                           patient_col = "patient_id", ncol = 2)
plot_syndrome_distribution(sample_data, mode = "single",
                           patient_col = "patient_id", center = "Centre A")

Organisms and Resistance

# Top organisms by unique patient count
plot_top_organisms(sample_data, mode = "overall", n = 10,
                   patient_col = "patient_id")
plot_top_organisms(sample_data, mode = "faceted", n = 5,
                   patient_col = "patient_id")

# Resistance rate per antibiotic (stacked R/I/S)
plot_abx_susceptibility(sample_data, mode = "overall",
                        patient_col = "patient_id")

# Resistance by specimen type
plot_resistance_by_sample(sample_data, mode = "overall",
                          patient_col = "patient_id",
                          sample_col  = "specimen_type")

# Pathogen x antibiotic resistance heatmap
plot_abx_heatmap(sample_data, mode = "all", patient_col = "patient_id")

Outcomes

# Outcome distribution -- pooled and faceted
plot_outcome_distribution(sample_data, mode = "overall",
                          patient_col = "patient_id")
plot_outcome_distribution(sample_data, mode = "faceted",
                          patient_col = "patient_id", ncol = 2)

# Death vs Discharged split (sample_data uses "Died"/"Survived")
plot_death_discharged(sample_data, mode = "overall",
                      patient_col       = "patient_id",
                      death_label       = "Died",
                      discharged_label  = "Survived")

# Outcomes within each organism
plot_outcome_by_organism(sample_data, mode = "overall",
                         patient_col = "patient_id")

# Outcomes by GBD age bin
plot_outcome_by_agebin(sample_data, mode = "overall",
                       patient_col = "patient_id")

# Outcome trends by year
plot_outcome_by_year(sample_data, mode = "overall",
                     patient_col = "patient_id",
                     date_col    = "date_of_final_outcome")

Infection Type and Location

# HAI vs CAI per centre (derived from admission-to-culture gap)
plot_hai_cai_by_facility(sample_data,
                         patient_col   = "patient_id",
                         admission_col = "date_of_admission",
                         culture_col   = "date_of_culture")

# Mono vs polymicrobial infections per centre
plot_mono_poly_by_facility(sample_data, patient_col = "patient_id")

# ICU / Ward / Other breakdown per centre
plot_location_by_facility(sample_data, patient_col = "patient_id")
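The HAI/CAI split rests on the gap between admission and culture. A minimal sketch follows, assuming the common convention that a culture taken more than 2 days after admission counts as hospital-acquired; the hai_cutoff_days value and toy_dates table are hypothetical, and the cutoff anumaan actually applies may differ.

```r
toy_dates <- data.frame(
  patient_id        = c("P1", "P2", "P3"),
  date_of_admission = as.Date(c("2024-01-01", "2024-01-01", "2024-01-10")),
  date_of_culture   = as.Date(c("2024-01-01", "2024-01-05", "2024-01-12"))
)

hai_cutoff_days <- 2  # assumed threshold, not taken from the package

# Days between admission and culture
gap_days <- as.numeric(toy_dates$date_of_culture - toy_dates$date_of_admission)

# Later cultures are treated as hospital-acquired, earlier ones as community-acquired
toy_dates$infection_type <- ifelse(gap_days > hai_cutoff_days, "HAI", "CAI")
toy_dates
```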

Length of Stay and Age

# LOS distribution ridge plot (plot_los_ridge defaults patient_col = "patient_id")
plot_los_ridge(sample_data, mode = "all",
               admission_col = "date_of_admission",
               discharge_col = "date_of_final_outcome")

# Age distribution ridge plot
plot_age_ridge(sample_data, mode = "all", patient_col = "patient_id")

# Median LOS by age group
plot_los_by_agebin(sample_data, mode = "overall",
                   patient_col   = "patient_id",
                   admission_col = "date_of_admission",
                   discharge_col = "date_of_final_outcome")

All EDA plot functions return a ggplot2 object and accept base_size, title, and subtitle arguments for quick customization.


Common Analysis Tasks

Resistance Rates

sample_data %>%
  filter(!is.na(antibiotic_value)) %>%
  group_by(organism_name) %>%
  summarise(
    n               = n(),
    n_resistant     = sum(antibiotic_value == "R"),
    resistance_rate = round(100 * n_resistant / n, 1),
    .groups         = "drop"
  ) %>%
  arrange(desc(resistance_rate))
#> # A tibble: 4 × 4
#>   organism_name     n n_resistant resistance_rate
#>   <chr>         <int>       <int>           <dbl>
#> 1 A. baumannii      1           1           100  
#> 2 K. pneumoniae     4           3            75  
#> 3 E. coli           3           2            66.7
#> 4 S. aureus         2           1            50

MDR Prevalence by Organism

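# Note: n() counts rows; if result$data is still long format (one row per
# isolate-antibiotic pair), this is a row-level rather than isolate-level rate
result$data %>%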
  group_by(organism_normalized) %>%
  summarise(
    total    = n(),
    n_mdr    = sum(mdr == "MDR", na.rm = TRUE),
    mdr_rate = round(100 * n_mdr / total, 1),
    .groups  = "drop"
  ) %>%
  arrange(desc(mdr_rate))

Resistance by Specimen Type

result$data %>%
  filter(!is.na(specimen_type), !is.na(antibiotic_value)) %>%
  group_by(specimen_type) %>%
  summarise(
    n               = n(),
    resistance_rate = round(100 * mean(antibiotic_value == "R"), 1),
    .groups         = "drop"
  )

Export Results

write.csv(result$data, "processed_amr_data.csv", row.names = FALSE)

# Save full pipeline result for reproducibility
saveRDS(result, "preprocessing_result.rds")

# Reload
result <- readRDS("preprocessing_result.rds")

Troubleshooting

Column not found after standardization

prep_standardize_column_names() applies fuzzy matching when fuzzy_match = TRUE. If a column is still missing afterwards, inspect your column names and supply explicit overrides:

names(sample_data)
#>  [1] "patient_id"            "date_of_admission"     "date_of_culture"      
#>  [4] "date_of_final_outcome" "final_outcome"         "specimen_type"        
#>  [7] "organism_name"         "antibiotic_name"       "antibiotic_value"     
#> [10] "DOB"                   "gender"                "center_name"          
#> [13] "age_years"             "Age_bin"               "is_polymicrobial"     
#> [16] "location"              "infectious_syndrome"
# Then supply a manual map:
# config <- amr_config(column_mappings = list(organism_name = "Pathogen"))
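To see why some headers match and others do not, it can help to look at edit distances directly. The sketch below uses utils::adist() on made-up column names; it illustrates fuzzy matching in general and is not a description of the matching rule inside prep_standardize_column_names().

```r
raw_names    <- c("Patient ID", "Organism", "Antibiotic")
target_names <- c("patient_id", "organism_name", "antibiotic_name")

# Normalise case and punctuation before comparing
clean <- gsub("[^a-z0-9]+", "_", tolower(raw_names))

# Edit-distance matrix: rows = cleaned raw names, columns = targets
d <- utils::adist(clean, target_names)

toy_match <- data.frame(
  raw      = raw_names,
  matched  = target_names[apply(d, 1, which.min)],
  distance = apply(d, 1, min),
  stringsAsFactors = FALSE
)
toy_match
```

A header such as "Pathogen" shares almost no characters with organism_name, which is the kind of case where an explicit mapping is needed.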

Organism names not normalizing

Organisms absent from the reference pass through unchanged. Check coverage:

table(result$data$organism_match_status)
# "unmatched" rows will have organism_normalized == organism_name (unchanged)

Too many / too few events

Tune the deduplication window:

config <- amr_config(event_gap_days = 7)   # smaller -> more events
config <- amr_config(event_gap_days = 30)  # larger  -> fewer events
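The effect of the window is easiest to see on a toy timeline. The count_events() helper below is hypothetical: it starts a new episode whenever the gap to the previous culture exceeds the window, for a single patient. The package's own rule (for example, how it groups by organism or which date it measures from) may differ.

```r
toy_cultures <- data.frame(
  patient_id      = "P1",
  date_of_culture = as.Date(c("2024-01-01", "2024-01-05",
                              "2024-01-20", "2024-02-25"))
)

# Count episodes: a new one starts when the gap to the previous culture
# exceeds gap_days (hypothetical helper, single patient)
count_events <- function(dates, gap_days) {
  dates <- sort(dates)
  1 + sum(diff(as.numeric(dates)) > gap_days)
}

n_events_7  <- count_events(toy_cultures$date_of_culture, gap_days = 7)
n_events_30 <- count_events(toy_cultures$date_of_culture, gap_days = 30)
c(gap_7 = n_events_7, gap_30 = n_events_30)
```

The gaps here are 4, 15, and 36 days, so a 7-day window splits the cultures into three episodes and a 30-day window into two.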

Intermediate (I) counted as resistant

If intermediate results should not be counted as resistant, switch the option off:

config <- amr_config(intermediate_as_resistant = FALSE)
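The choice can shift resistance rates noticeably. The sketch below compares both conventions on made-up results; resistance_rate() is a hypothetical helper, and it assumes intermediate results stay in the denominator either way, which may not match the package's handling.

```r
ast_values <- c("R", "I", "S", "S", "I", "R", "S", "S")  # made-up results

# Percentage of results counted as resistant under either convention
resistance_rate <- function(values, intermediate_as_resistant = TRUE) {
  resistant <- if (intermediate_as_resistant) c("R", "I") else "R"
  round(100 * mean(values %in% resistant), 1)
}

rate_i_as_r  <- resistance_rate(ast_values, intermediate_as_resistant = TRUE)
rate_i_not_r <- resistance_rate(ast_values, intermediate_as_resistant = FALSE)
c(I_as_R = rate_i_as_r, I_not_R = rate_i_not_r)
```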

Date columns not parsing

Use prep_parse_date_column() directly to inspect how values are being interpreted:

prep_parse_date_column(
  c("15/01/2024", "45306", "2024-01-15"),
  col_name    = "date_of_culture",
  table_label = "debug"
)
#> [debug] 'date_of_culture': 1 value(s) decoded as Excel serial date.
#> [debug] 'date_of_culture': 3 / 3 non-missing value(s) successfully parsed as Date (0 failed).
#> [1] "2024-01-15" "2024-01-15" "2024-01-15"
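For intuition about the three formats in the example above, here is a minimal base-R sketch. The parse_mixed_date() helper is hypothetical and handles only these three patterns; it relies on the fact that Excel serial numbers (on the default 1900 date system) count days from 1899-12-30. prep_parse_date_column() may support more formats and apply different rules.

```r
# Hypothetical parser covering three input formats
parse_mixed_date <- function(x) {
  out <- rep(as.Date(NA), length(x))

  # ISO dates: yyyy-mm-dd
  iso <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  out[iso] <- as.Date(x[iso], format = "%Y-%m-%d")

  # Day-first dates: dd/mm/yyyy
  dmy <- grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}$", x)
  out[dmy] <- as.Date(x[dmy], format = "%d/%m/%Y")

  # Excel serial numbers: days since 1899-12-30
  serial <- grepl("^[0-9]{5}$", x)
  out[serial] <- as.Date(as.numeric(x[serial]), origin = "1899-12-30")

  out
}

toy_parsed <- parse_mixed_date(c("15/01/2024", "45306", "2024-01-15"))
toy_parsed
```

Values that match none of the patterns stay NA, which makes unparsed entries easy to list with is.na().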

Getting Help

?run_preprocess
?amr_config
?prep_standardize_organisms
?prep_create_event_ids
?prep_classify_mdr
?prep_derive_hai_cai
?prep_attrition_flow
?prep_map_diagnosis_to_icd

help(package = "anumaan")
ls("package:anumaan")

Report bugs at https://github.com/saketlab/anumaan/issues.


Session Info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.2.1        anumaan_0.1.0.9009
#> 
#> loaded via a namespace (and not attached):
#>  [1] bit_4.6.0         jsonlite_2.0.0    crayon_1.5.3      compiler_4.6.0   
#>  [5] tidyselect_1.2.1  stringr_1.6.0     parallel_4.6.0    jquerylib_0.1.4  
#>  [9] systemfonts_1.3.2 textshaping_1.0.5 yaml_2.3.12       fastmap_1.2.0    
#> [13] readr_2.2.0       R6_2.6.1          generics_0.1.4    knitr_1.51       
#> [17] htmlwidgets_1.6.4 tibble_3.3.1      desc_1.4.3        lubridate_1.9.5  
#> [21] tzdb_0.5.0        bslib_0.10.0      pillar_1.11.1     rlang_1.2.0      
#> [25] utf8_1.2.6        stringi_1.8.7     cachem_1.1.0      xfun_0.57        
#> [29] fs_2.1.0          sass_0.4.10       bit64_4.8.0       otel_0.2.0       
#> [33] timechange_0.4.0  cli_3.6.6         withr_3.0.2       pkgdown_2.2.0    
#> [37] magrittr_2.0.5    stringdist_0.9.17 digest_0.6.39     vroom_1.7.1      
#> [41] hms_1.1.4         lifecycle_1.0.5   vctrs_0.7.3       evaluate_1.0.5   
#> [45] glue_1.8.1        ragg_1.5.2        rmarkdown_2.31    tools_4.6.0      
#> [49] pkgconfig_2.0.3   htmltools_0.5.9