Getting Started with anumaan
Nilanjana Dey, Nima Jijo, Namrata Kharat, Saket Choudhary
2026-05-15
Source: vignettes/getting-started.Rmd
Introduction
anumaan is an R package for preprocessing antimicrobial resistance (AMR) surveillance data and performing GBD-style burden estimation. It takes messy hospital data and produces analysis-ready, standardized datasets - handling organism and antibiotic normalization, event deduplication, MDR/XDR classification, polymicrobial weighting, and DALY calculations.
Key Features
- Automated pipeline: run_preprocess() runs the full sequence in one call
- Input validation: column checks, key quality, date repair before every step
- Organism normalization: handles typos, abbreviations, case, MRSA/CoNS flags
- Antibiotic classification: WHO AWaRe categories and drug classes
- AST cleaning: standardizes free-text R/S/I values, flags invalid entries
- MDR/XDR classification: Magiorakos 2012 (WHO/CDC-compliant)
- Event deduplication: unique infection episodes across repeat cultures
- Polymicrobial detection and fractional weighting
- HAI / CAI classification: derived from admission-to-culture date gap
- Diagnosis -> ICD-10 -> syndrome mapping: text, rule-based, or embedding
- Contaminant flagging: syndrome-aware contaminant lists
- Attrition tracking: patient and event counts at every filter step
- DALY burden estimation: YLL, YLD, attributable and associated burden
- Visualization: ggplot2-based resistance, burden, and spatial plots
Data Format
anumaan expects long format: one row per patient-organism-antibiotic combination. The minimum required columns are:
| Column | Description |
|---|---|
| patient_id | Unique patient identifier |
| organism_name | Organism name (raw, any format) |
| antibiotic_name | Antibiotic name (raw) |
| antibiotic_value | Susceptibility result: R, I, or S |
| date_of_culture | Date the culture was collected |
Optional but used when present: date_of_admission,
date_of_final_outcome, DOB, Age,
specimen_type, gender,
final_outcome, infection_type.
sample_data <- data.frame(
patient_id = c("P001","P001","P002","P003","P003",
"P004","P004","P005","P006","P006"),
date_of_admission = as.Date(c("2024-01-10","2024-01-10","2024-01-18",
"2024-02-01","2024-02-01","2024-02-10",
"2024-02-10","2024-03-05","2024-03-15",
"2024-03-15")),
date_of_culture = as.Date(c("2024-01-15","2024-01-15","2024-01-20",
"2024-02-03","2024-02-03","2024-02-14",
"2024-02-14","2024-03-07","2024-03-17",
"2024-03-17")),
date_of_final_outcome = as.Date(c("2024-01-25","2024-01-25","2024-01-28",
"2024-02-15","2024-02-15","2024-02-22",
"2024-02-22","2024-03-18","2024-03-28",
"2024-03-28")),
final_outcome = c("Survived","Survived","Died","Survived","Survived",
"Died","Died","Survived","Survived","Survived"),
specimen_type = c("Blood","Blood","Urine","Blood","Blood",
"Urine","Urine","Blood","Blood","Blood"),
organism_name = c("E. coli","E. coli","K. pneumoniae","S. aureus",
"S. aureus","E. coli","K. pneumoniae",
"A. baumannii","K. pneumoniae","K. pneumoniae"),
antibiotic_name = c("Ampicillin","Meropenem","Ciprofloxacin","Oxacillin",
"Vancomycin","Ampicillin","Ciprofloxacin","Meropenem",
"Ceftriaxone","Meropenem"),
antibiotic_value = c("R","S","R","R","S","R","R","R","R","S"),
DOB = as.Date(c("1960-05-01","1960-05-01","1975-08-15",
"1990-03-20","1990-03-20","2000-11-05",
"2000-11-05","1955-02-28","1968-07-12",
"1968-07-12")),
gender = c("Male","Male","Female","Male","Male",
"Female","Female","Male","Female","Female"),
center_name = c("Centre A","Centre A","Centre A",
"Centre B","Centre B","Centre B","Centre B",
"Centre C","Centre C","Centre C"),
age_years = c(63,63,48,33,33,23,23,69,55,55),
Age_bin = factor(
c("45-64","45-64","45-64","18-44","18-44",
"18-44","18-44","65+","45-64","45-64"),
levels = c("<5","18-44","45-64","65+")),
is_polymicrobial = c(FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE),
location = c("ICU","ICU","Ward","ICU","ICU",
"Ward","Ward","ICU","Ward","Ward"),
infectious_syndrome = c("BSI","BSI","UTI","BSI","BSI",
"UTI","UTI","BSI","BSI","BSI"),
stringsAsFactors = FALSE
)
The Complete Pipeline
run_preprocess() is the single entry point that runs all
phases in sequence.
result <- run_preprocess(
data = sample_data,
config = amr_config(),
phases = "all", # "standardize", "enrich", "derive", or "all"
verbose = TRUE,
validate = TRUE,
generate_report = TRUE
)
clean_data <- result$data # preprocessed data frame
config_used <- result$config # amr_config() object used
processing_log <- result$log # per-phase transformation logs
report <- result$report # preprocessing report
metadata <- result$metadata # timing, row counts, phases run
Configuration
Pass arguments to amr_config() to override any default, then supply the result as the config argument of run_preprocess():
config <- amr_config(
fuzzy_match = TRUE, # auto-detect column names
hai_cutoff = 3, # days after admission -> HAI
event_gap_days = 14, # deduplication window
mortality_window = 14, # days post-culture to attribute death
age_bins = "GBD_standard",# "pediatric", "geriatric", or custom
mdr_definition = "CDC", # "CDC", "WHO", or numeric threshold
xdr_definition = "CDC",
intermediate_as_resistant = TRUE, # treat I as R (except Colistin)
strict_validation = FALSE # warn vs stop on issues
)
print(config)
#> AMR Preprocessing Configuration
#> ================================
#>
#> Phase 1 - Standardization:
#> Fuzzy column matching: TRUE
#> Strict validation: FALSE
#> Date columns: date_of_admission, date_of_culture, date_of_final_outcome, DOB
#>
#> Phase 2 - Enrichment:
#> HAI cutoff: 3 days
#> Infer department: TRUE
#>
#> Phase 3 - Derivation:
#> Event gap: 14 days
#> Mortality window: 14 days
#> Age bins: custom
#> Contaminant method: auto
#> MDR definition: CDC
#> XDR definition: CDC
#>
#> Reference Data:
#> RR table: GBD_2021
#> Organism map: default
#> Antibiotic map: WHO_2023
#>
#> Processing Rules:
#> Intermediate as Resistant: TRUE
#> Verbose: TRUE
Run on a Subset of Phases
# Only standardization
result_std <- run_preprocess(sample_data, phases = "standardize")
# Standardization + Enrichment (no event/MDR derivation)
result_enr <- run_preprocess(sample_data, phases = c("standardize", "enrich"))
Input Validation and Schema Checks
Run these before any join or preprocessing step to catch problems early.
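Conceptually, the checks in this section reduce to a few base R operations. The sketch below is not the package's implementation: check_table() is a hypothetical helper written only to show what a column check and a key-quality check compute.

```r
# Hypothetical helper (not part of anumaan): minimal column and key checks.
check_table <- function(df, required, key_col) {
  key <- df[[key_col]]
  list(
    # required columns that are absent from the data frame
    missing_cols = setdiff(required, names(df)),
    # keys that are NA or blank after trimming whitespace
    n_missing_key = sum(is.na(key) | !nzchar(trimws(as.character(key)))),
    n_distinct_key = length(unique(key[!is.na(key)])),
    # rows whose key repeats an earlier row
    n_duplicated_key = sum(duplicated(key))
  )
}

toy <- data.frame(
  patient_id = c("P001", "P001", "P002", NA),
  organism_name = c("E. coli", "E. coli", "K. pneumoniae", "S. aureus"),
  stringsAsFactors = FALSE
)
chk <- check_table(
  toy,
  required = c("patient_id", "organism_name", "antibiotic_value"),
  key_col = "patient_id"
)
chk$missing_cols
```

Duplicated keys are expected in long-format data (one row per antibiotic), so a duplicate count is informational rather than an error.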
Column and Key Checks
# Check required columns exist and report types
prep_check_columns(
sample_data,
required = c("patient_id", "organism_name", "antibiotic_name", "antibiotic_value"),
table_label = "sample_data"
)
#> [sample_data] Column check: 17 cols | 4 required | 0 missing | 0 type warnings
# Check key column quality: missing, duplicates, placeholder values
prep_check_keys(
sample_data,
key_col = "patient_id",
table_label = "sample_data"
)
#> [sample_data] Key 'patient_id': 10 rows | 0 missing (0.0%) | 6 distinct | 4 duplicated | 0 rows missing both date_of_admission and date_of_culture
Full Table Validation
prep_validate_table() combines column checks, key checks, and date coercion in one call. It is the function used internally before every merge step.
val <- prep_validate_table(
data = sample_data,
required_cols = c("patient_id", "organism_name"),
key_col = "patient_id",
date_cols = c("date_of_admission", "date_of_culture"),
table_label = "sample_data"
)
#> [sample_data] Column check: 17 cols | 2 required | 0 missing | 0 type warnings
#> [sample_data] Key 'patient_id': 10 rows | 0 missing (0.0%) | 6 distinct | 4 duplicated | 0 rows missing both date_of_admission and date_of_culture
# val$data - date-coerced data frame
# val$col_report - per-column type/missing summary
# val$key_report - key quality summary
Missingness Report
prep_missingness_report(
sample_data,
threshold = 20, # flag columns with > 20% missing
cols = c("patient_id", "organism_name", "antibiotic_value", "DOB")
)
#> col_name n_total n_missing pct_missing is_high_missing
#> 1 patient_id 10 0 0 FALSE
#> 2 organism_name 10 0 0 FALSE
#> 3 antibiotic_value 10 0 0 FALSE
#> 4 DOB 10 0 0 FALSE
Required Fields and Data Quality
validation <- validate_required_fields(
sample_data,
required_cols = c("patient_id", "organism_name",
"antibiotic_name", "antibiotic_value"),
min_completeness = 0.8
)
validation$valid
validation$messages
quality <- validate_data_quality(
sample_data,
min_rows = 1,
max_missing_pct = 90,
stop_on_failure = FALSE
)
Detect Schema Drift Across Datasets
Useful when combining data from multiple sources or time periods:
prep_detect_schema_drift(
data_list = list(centre_A = data_A, centre_B = data_B),
reference_centre = "centre_A"
)
Column Name Standardization
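To make the idea of fuzzy column matching concrete, here is a minimal base R sketch that uses normalized edit distance from utils::adist(). The helper match_column(), the cleaning rule, and the meaning given to the 0.3 threshold are assumptions for illustration; the package's own matching logic may differ.

```r
# Illustrative only: match a raw column name to a standard name by
# normalized edit distance, accepting the best match at or below a threshold.
standard <- c("patient_id", "organism_name", "antibiotic_name")
raw <- c("PatientID", "Organism_Name", "ward")

match_column <- function(x, candidates, threshold = 0.3) {
  # lower-case and drop punctuation so "PatientID" and "patient_id" compare equal
  clean <- function(s) gsub("[^a-z0-9]", "", tolower(s))
  d <- as.vector(utils::adist(clean(x), clean(candidates)))
  # scale by the longer string so the distance lies in [0, 1]
  d <- d / pmax(nchar(clean(x)), nchar(clean(candidates)))
  best <- which.min(d)
  if (d[best] <= threshold) candidates[best] else NA_character_
}

matched <- vapply(raw, match_column, character(1), candidates = standard)
matched
```

Names with no candidate under the threshold come back as NA, which mirrors the idea of an "unmapped" column that is kept as-is.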
Automatic Mapping
# Renames known aliases to the package-standard names
# e.g. "PatientID" -> "patient_id", "Organism" -> "organism_name"
mapped <- prep_standardize_column_names(
sample_data,
fuzzy_match = TRUE,
fuzzy_threshold = 0.3
)
#> Auto fuzzy: 'patient_id' -> 'patient_id' (dist 0.00)
#> Auto fuzzy: 'gender' -> 'gender' (dist 0.00)
#> Auto fuzzy: 'location' -> 'location' (dist 0.00)
#> Auto fuzzy: 'date_of_admission' -> 'date_of_admission' (dist 0.00)
#> Auto fuzzy: 'date_of_culture' -> 'date_of_culture' (dist 0.00)
#> Auto fuzzy: 'date_of_final_outcome' -> 'date_of_final_outcome' (dist 0.00)
#> Auto fuzzy: 'final_outcome' -> 'final_outcome' (dist 0.00)
#> Auto fuzzy: 'organism_name' -> 'organism_name' (dist 0.00)
#> Auto fuzzy: 'antibiotic_name' -> 'antibiotic_name' (dist 0.00)
#> Auto fuzzy: 'antibiotic_value' -> 'antibiotic_value' (dist 0.00)
#> Auto fuzzy: 'specimen_type' -> 'specimen_type' (dist 0.00)
#> [prep_apply_column_map] Renamed 13 column(s).
#> [prep_apply_column_map] 4 column(s) not in map (kept as-is): center_name, Age_bin, is_polymicrobial, infectious_syndrome
# mapped$data - renamed data frame
# mapped$mapping_log - what was renamed and how (exact / fuzzy)
# mapped$unmapped - columns that couldn't be matched
Manual Column Map
For datasets with non-standard names that fuzzy matching misses:
col_map <- prep_build_column_map(
sample_data,
column_map = list(
patient_id = c("PtID", "patient_number"),
organism_name = c("Pathogen", "bug")
)
)
data_renamed <- prep_apply_column_map(sample_data, col_map)
Assert Standard Names
Stops with a clear error if a required standard column is absent - useful as a guard at the start of analysis scripts:
prep_assert_standard_names(
sample_data,
required_standard_names = c("patient_id", "organism_name", "antibiotic_value"),
strict = TRUE
)
Detect Preprocessing Capabilities
Reports which preprocessing steps are possible given the columns present:
prep_report_capabilities(sample_data)
#> === Preprocessing Capabilities ===
#> Enabled (5):
#> [+] standardize_organism
#> [+] standardize_antibiotic
#> [+] harmonize_ast
#> [+] flag_polymicrobial
#> [+] build_outputs
#> Skipped / missing inputs (8):
#> [-] parse_dates
#> [-] derive_hai
#> [-] derive_los
#> [-] derive_age
#> [-] standardize_specimen
#> [-] create_events
#> [-] flag_contaminants
#> [-] classify_mdr
#> ===================================
Date Handling
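As background for the helpers in this section, the following base R sketch decodes a mixed vector of ISO strings, DMY strings, and Excel serial numbers. parse_mixed_dates() is a hypothetical function that covers only these three patterns; it is not how the package parses dates.

```r
# Illustrative only: decode three common date encodings in base R.
parse_mixed_dates <- function(x) {
  out <- rep(as.Date(NA), length(x))
  iso <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  dmy <- grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}$", x)
  serial <- grepl("^[0-9]{5}$", x)
  out[iso] <- as.Date(x[iso], format = "%Y-%m-%d")
  out[dmy] <- as.Date(x[dmy], format = "%d/%m/%Y")
  # Excel's 1900 date system: serial numbers count days from 1899-12-30
  out[serial] <- as.Date(as.numeric(x[serial]), origin = "1899-12-30")
  out
}

parsed <- parse_mixed_dates(c("2024-01-15", "45306", "15/01/2024", NA))
parsed
```

All three non-missing inputs decode to 2024-01-15. Ambiguous day/month order (for example 03/04/2024) is deliberately not handled here.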
Parse a Single Date Column
Handles Excel serial numbers, Unix timestamps (ms), ISO strings, DMY/MDY formats, and reversed YYYYDDMM patterns automatically:
dates_raw <- c("2024-01-15", "45306", "15/01/2024", NA)
prep_parse_date_column(dates_raw, col_name = "date_of_culture", table_label = "example")
#> [example] 'date_of_culture': 1 value(s) decoded as Excel serial date.
#> [example] 'date_of_culture': 3 / 3 non-missing value(s) successfully parsed as Date (0 failed).
#> [1] "2024-01-15" "2024-01-15" "2024-01-15" NA
Coerce All Date Columns in a Data Frame
data_with_dates <- prep_coerce_dates(
sample_data,
cols = c("date_of_admission", "date_of_culture", "date_of_final_outcome"),
table_label = "sample_data"
)
Validate Date Logic
Checks that admission <= culture <= outcome, and that DOB is before culture:
prep_validate_date_logic(
sample_data,
admission_col = "date_of_admission",
culture_col = "date_of_culture",
outcome_col = "date_of_final_outcome",
dob_col = "DOB"
)
Organism Standardization
Normalize Organism Names
Maps free-text organism names to a standard reference, handling typos, case differences, and abbreviations (E. coli -> Escherichia coli), and sets MRSA/CoNS flags:
org_data <- data.frame(
organism_name = c("E. coli", "ACINETOBACTER BAUMANNII",
"Klebsiella pnuemoniae", "MRSA", "CoNS")
)
normalized <- prep_standardize_organisms(org_data, organism_col = "organism_name")
#> Normalized 5/5 organisms (100.0%)
#> Result: 5 unique organisms
#> is_MRSA: 1
#> is_MSSA: 0
#> is_MRCONS: 0
#> is_MSCONS: 0
normalized[, c("organism_name", "organism_normalized")]
#> organism_name organism_normalized
#> 1 E. coli escherichia coli
#> 2 ACINETOBACTER BAUMANNII acinetobacter baumannii
#> 3 Klebsiella pnuemoniae klebsiella spp. other
#> 4 MRSA staphylococcus aureus
#> 5 CoNS coagulase-negative staphylococci
Assign Organism Group
Adds organism_group (e.g. “Enterobacterales”,
“Gram-positive cocci”, “Fungal isolates”) - required for MDR/XDR
classification:
data <- prep_assign_organism_group(normalized, organism_col = "organism_normalized")
table(data$organism_group)
Extract Genus and Species
data <- prep_extract_genus(data, organism_col = "organism_normalized")
data <- prep_extract_species(data, organism_col = "organism_normalized")
# Adds: org_genus, org_species
Flag Unmatched Organisms
data <- prep_flag_organism_unmatched(data, organism_col = "organism_normalized")
# Adds: organism_match_status ("matched" / "unmatched")
table(data$organism_match_status)
Specimen, Sex, and Outcome Standardization
data <- prep_standardize_specimens(
sample_data,
specimen_col = "specimen_type"
)
#> Specimen normalization: 10/10 matched (100.0%)
#> Sterile classification: 7 sterile, 3 non-sterile
data <- prep_standardize_sex(data, col = "gender")
#> Standardized gender: M=5, F=5, NA=0
data <- prep_standardize_final_outcome(data, col = "final_outcome")
#> Standardized final outcome distribution:
#>
#> Died Survived
#> 3 7
data <- prep_standardize_infection_type(data, col = "infection_type")
#> Warning in prep_standardize_infection_type(data, col = "infection_type"):
#> Column 'infection_type' not found. Skipping infection type standardization.
Typos and shorthand in specimen names are handled with rule-based cleaning and fuzzy matching against the specimen reference table:
spec_typo <- data.frame(
specimen_type = c("blood culure", "urne c/s", "csff"),
stringsAsFactors = FALSE
)
spec_out <- prep_standardize_specimens(spec_typo, specimen_col = "specimen_type")
#> Specimen normalization: 3/3 matched (100.0%)
#> Sterile classification: 2 sterile, 1 non-sterile
spec_out[, c("specimen_type", "specimen_normalized")]
#> specimen_type specimen_normalized
#> 1 blood culure Blood
#> 2 urne c/s Urine
#> 3 csff CSF
Antibiotic Standardization
Normalize Antibiotic Names
Maps to WHO standard names and adds drug class and AWaRe category:
abx_data <- data.frame(
antibiotic_name = c("Ampicillin", "Meropenem", "Ciprofloxacin", "Vancomycin")
)
normalized_abx <- prep_standardize_antibiotics(
abx_data,
antibiotic_col = "antibiotic_name",
add_class = TRUE,
add_aware = TRUE
)
#> Normalizing 4 unique antibiotic names against WHO reference (257 antibiotics)...
#> Normalized: 4 unique names -> 4
#> With antibiotic_class: 4 (100.0%)
#> With AWaRe category: 4 (100.0%)
normalized_abx[, c("antibiotic_name", "antibiotic_normalized",
"antibiotic_class", "aware_category")]
#> antibiotic_name antibiotic_normalized antibiotic_class aware_category
#> 1 Ampicillin ampicillin Aminopenicillins Access
#> 2 Meropenem meropenem Carbapenems Watch
#> 3 Ciprofloxacin ciprofloxacin Fluoroquinolones Watch
#> 4 Vancomycin vancomycin_iv Vancomycin Watch
Classify Drug Class and AWaRe Separately
data <- prep_classify_antibiotic_class(data, antibiotic_col = "antibiotic_normalized")
data <- prep_classify_aware(data, antibiotic_col = "antibiotic_normalized")
AST Value Cleaning
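The core of free-text AST cleaning is a lookup from messy strings to the three codes R, I, and S. The sketch below shows that idea in base R; clean_ast() and its small lookup table are hypothetical and far less complete than what the package would need.

```r
# Illustrative only: collapse free-text susceptibility results to R / I / S.
clean_ast <- function(x) {
  key <- toupper(trimws(x))
  lookup <- c(
    R = "R", RES = "R", RESISTANT = "R",
    I = "I", INT = "I", INTERMEDIATE = "I",
    S = "S", SENS = "S", SENSITIVE = "S", SUSCEPTIBLE = "S"
  )
  # values absent from the lookup (and NA inputs) become NA
  unname(lookup[key])
}

cleaned_toy <- clean_ast(c("R", "Resistant", " sens ", "Intermediate", "unknown", NA))
cleaned_toy
```

Returning NA for anything unrecognized, rather than guessing, keeps unparseable entries visible so they can be counted and reviewed.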
Clean Free-Text R/S/I Values
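Standardizes free-text values such as “Resistant”, “res”, “r”, and “SENS” to R / I / S. In the run below, the numeric codes “1” and “0” are not parsed and are returned as NA: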
ast_data <- data.frame(
antibiotic_value = c("R", "Resistant", "S", "Sensitive",
"I", "Intermediate", "resistant", "1", "0", NA)
)
cleaned <- prep_clean_ast_values(ast_data, value_col = "antibiotic_value")
#> Cleaning antibiotic values: 9 unique values found
#> Cleaned: 9 unique values -> 3 (S/I/R)
#> [!] 3 values could not be parsed (30.0%)
#>
#> Value distribution:
#>
#> I R S <NA>
#> 2 3 2 3
cleaned
#> antibiotic_value
#> 1 R
#> 2 R
#> 3 S
#> 4 S
#> 5 I
#> 6 I
#> 7 R
#> 8 <NA>
#> 9 <NA>
#> 10 <NA>
Harmonize AST Interpretations
Applies breakpoint-based harmonization (handles MIC/disk zone inputs):
data <- prep_harmonize_ast(data, ast_col = "antibiotic_value")
Recode Intermediate as Resistant
# Per WHO/EUCAST, I means "susceptible, increased exposure".
# Keep I as-is, or recode it as R:
data <- prep_recode_intermediate_ast(data, col = "antibiotic_value")
Flag Invalid AST Entries
data <- prep_flag_invalid_ast(data, col = "ast_value_harmonized")
# Adds: ast_invalid (TRUE/FALSE)
Wide-Format AST Import
If your data has one column per antibiotic, pivot to long first:
long_data <- prep_pivot_ast_wide_to_long(
wide_data,
id_cols = c("patient_id", "date_of_culture", "organism_name"),
antibiotic_cols = NULL # NULL = auto-detect
)
Age and Demographics
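The age arithmetic in this section can be reproduced by hand. The base R sketch below divides the day gap between date of birth and culture date by 365.25, which yields the same decimal ages that appear in the prep_fill_age() output further down, and then bins them with cut(). The divisor and the left-closed bins are inferred from the printed output, not taken from the package source.

```r
# Illustrative only: decimal age from DOB, then left-closed 5-year bins.
dob <- as.Date(c("1960-05-01", "1975-08-15"))
culture <- as.Date(c("2024-01-15", "2024-01-20"))

# Date subtraction returns a difference in days
age <- as.numeric(culture - dob) / 365.25

# right = FALSE makes bins [45, 50), [50, 55), ... so age 45 falls in "45-50"
age_bin <- cut(
  age,
  breaks = c(45, 50, 55, 60, 65),
  labels = c("45-50", "50-55", "55-60", "60-65"),
  right = FALSE
)
data.frame(age = round(age, 5), age_bin = age_bin)
```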
Fill Missing Age from Date of Birth
age_data <- data.frame(
DOB = as.Date(c("1960-05-01", "1975-08-15", NA)),
date_of_culture = as.Date(c("2024-01-15", "2024-01-20", "2024-02-03")),
Age = c(NA, NA, 45)
)
result_age <- prep_fill_age(
age_data,
age_col = "Age",
dob_col = "DOB",
date_col = "date_of_culture",
overwrite = FALSE
)
#> Enriched Age: 2 rows filled using DOB calculation
#>
#> Age enrichment summary:
#> age_method age_confidence n
#> 1 calculated_from_dob high 2
#> 2 provided high 1
result_age[, c("Age", "age_method", "age_confidence")]
#> Age age_method age_confidence
#> 1 63.70705 calculated_from_dob high
#> 2 48.43258 calculated_from_dob high
#> 3 45.00000 provided high
Assign Age Bins
binned <- prep_assign_age_bins(
result_age,
age_col = "Age",
bins = "GBD_standard" # or "pediatric", "geriatric", or custom numeric vector
)
#> Assigned age bins: 3 binned, 0 unbinned
table(binned$Age_bin)
#>
#> <1 1-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-45 45-50 50-55 55-60
#> 0 0 0 0 0 0 0 0 0 0 2 0 0
#> 60-65 65-70 70-75 75-80 80-85 85+
#> 1 0 0 0 0 0
Derive Age from DOB Components
When DOB is split across day/month/year columns:
data <- prep_derive_dob_from_components(
data,
day_col = "birth_day",
month_col = "birth_month",
year_col = "birth_year"
)
Length of Stay and HAI/CAI
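Both derivations in this section are date differences. The base R sketch below uses three patients from sample_data (P001, P002, P004) and assumes the rule "culture taken 3 or more days after admission counts as HAI"; whether the package treats the boundary as inclusive is an assumption here.

```r
# Illustrative only: length of stay and HAI/CAI from date gaps.
admission <- as.Date(c("2024-01-10", "2024-01-18", "2024-02-10"))
culture <- as.Date(c("2024-01-15", "2024-01-20", "2024-02-14"))
outcome <- as.Date(c("2024-01-25", "2024-01-28", "2024-02-22"))

los_days <- as.numeric(outcome - admission)   # days from admission to final outcome
gap_days <- as.numeric(culture - admission)   # days from admission to culture

hai_cutoff <- 3
infection_type <- ifelse(gap_days >= hai_cutoff, "HAI", "CAI")
data.frame(los_days, gap_days, infection_type)
```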
Derive Length of Stay
los_data <- prep_derive_los_from_dates(
sample_data,
admission_col = "date_of_admission",
outcome_col = "date_of_final_outcome",
los_col = "los_days"
)
#> [prep_derive_los_from_dates] 10 rows filled from date_of_final_outcome - date_of_admission.
#> [prep_derive_los_from_dates] 10 filled; 0 remain missing.
#>
#> LOS summary:
#> mean_los median_los min_los max_los
#> 1 13.1 13 10 15
los_data[, c("patient_id", "date_of_admission", "date_of_final_outcome", "los_days")]
#> patient_id date_of_admission date_of_final_outcome los_days
#> 1 P001 2024-01-10 2024-01-25 15
#> 2 P001 2024-01-10 2024-01-25 15
#> 3 P002 2024-01-18 2024-01-28 10
#> 4 P003 2024-02-01 2024-02-15 14
#> 5 P003 2024-02-01 2024-02-15 14
#> 6 P004 2024-02-10 2024-02-22 12
#> 7 P004 2024-02-10 2024-02-22 12
#> 8 P005 2024-03-05 2024-03-18 13
#> 9 P006 2024-03-15 2024-03-28 13
#> 10 P006 2024-03-15 2024-03-28 13
Classify HAI vs CAI
hc_data <- prep_derive_hai_cai(
sample_data,
admission_col = "date_of_admission",
culture_col = "date_of_culture",
hai_cutoff = 3 # cultures >= 3 days after admission = HAI
)
#> Inferring infection type using 3-day HAI cutoff...
#> Enriched infection_type: 10 rows filled
#>
#> Infection type distribution:
#> infection_type infection_type_method n
#> 1 CAI inferred_3day_cutoff 6
#> 2 HAI inferred_3day_cutoff 4
# The derived classification is stored in infection_type (see the log above)
table(hc_data$infection_type)
Flag HAI Observations Not Clinically Coded
data <- prep_flag_hai_inferred(data)
# Adds: hai_inferred (TRUE where hai_cai derived differs from coded infection_type)
Derive ICU Flag
data <- prep_derive_icu_flag(data, department_col = "department")
# Adds: is_icu (TRUE/FALSE)
Event Deduplication
Repeat cultures of the same organism within a time window belong to
the same infection episode. prep_create_event_ids() groups
them into events.
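The gap rule can be sketched in base R as follows. This toy version starts a new event whenever the gap to the previous culture of the same patient and organism exceeds gap_days; it ignores specimen type, and whether the package measures the gap from the previous culture or from the first culture of the event is not stated here, so treat that choice as an assumption.

```r
# Illustrative only: number events within each patient/organism group.
cultures <- data.frame(
  patient_id = c("P1", "P1", "P1", "P2"),
  organism = c("E. coli", "E. coli", "E. coli", "E. coli"),
  date = as.Date(c("2024-01-01", "2024-01-10", "2024-02-15", "2024-01-05")),
  stringsAsFactors = FALSE
)
cultures <- cultures[order(cultures$patient_id, cultures$organism, cultures$date), ]

gap_days <- 14
group <- paste(cultures$patient_id, cultures$organism)

# Within each group: the first culture opens event 1, and every gap
# larger than gap_days opens the next event.
cultures$event_seq <- ave(
  as.numeric(cultures$date), group,
  FUN = function(d) cumsum(c(TRUE, diff(d) > gap_days))
)
cultures
```

Patient P1 ends up with two events (the 36-day gap opens a second one), while the cultures 9 days apart collapse into the first. The package call that follows applies the same kind of grouping with patient, organism, and specimen columns.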
event_data <- prep_create_event_ids(
sample_data,
patient_col = "patient_id",
date_col = "date_of_culture",
organism_col = "organism_name",
specimen_col = "specimen_type",
gap_days = 14 # > 14 days between same organism/site = new event
)
#> Creating events (gap threshold: >14 days) ...
#> Done: 6 patients -> 7 events (1.17 per patient)
event_data[, c("patient_id", "organism_name", "date_of_culture", "event_id")]
#> patient_id organism_name date_of_culture
#> 1 P001 E. coli 2024-01-15
#> 2 P001 E. coli 2024-01-15
#> 3 P002 K. pneumoniae 2024-01-20
#> 4 P003 S. aureus 2024-02-03
#> 5 P003 S. aureus 2024-02-03
#> 6 P004 E. coli 2024-02-14
#> 7 P004 K. pneumoniae 2024-02-14
#> 8 P005 A. baumannii 2024-03-07
#> 9 P006 K. pneumoniae 2024-03-17
#> 10 P006 K. pneumoniae 2024-03-17
#> event_id
#> 1 P001_blood_20240115_e. coli_001
#> 2 P001_blood_20240115_e. coli_001
#> 3 P002_urine_20240120_k. pneumoniae_001
#> 4 P003_blood_20240203_s. aureus_001
#> 5 P003_blood_20240203_s. aureus_001
#> 6 P004_urine_20240214_e. coli_001
#> 7 P004_urine_20240214_k. pneumoniae_002
#> 8 P005_blood_20240307_a. baumannii_001
#> 9 P006_blood_20240317_k. pneumoniae_001
#> 10 P006_blood_20240317_k. pneumoniae_001
Remove Duplicate Rows Within Events
deduped <- prep_deduplicate_events(
event_data,
event_col = "event_id",
organism_col = "organism_name",
antibiotic_col = "antibiotic_name"
)
Readmission Classification
data <- prep_flag_readmission(
data,
patient_col = "patient_id",
admission_col = "date_of_admission",
gap_days = 30
)
data <- prep_classify_readmission(data, readmission_col = "readmission_class")
Contaminant Detection
Syndrome-Aware Contaminant Lists
# Retrieve the contaminant list for a syndrome
contaminants <- prep_get_contaminant_list(
syndrome = "Bloodstream infections",
return_all = FALSE
)
contaminants$names
#> [1] "Coagulase-negative Staphylococcus" "Staphylococcus epidermidis"
#> [3] "Staphylococcus hominis" "Staphylococcus haemolyticus"
#> [5] "Staphylococcus capitis" "Staphylococcus warneri"
#> [7] "Corynebacterium species" "Corynebacterium striatum"
#> [9] "Corynebacterium jeikeium" "Cutibacterium"
#> [11] "Cutibacterium acnes" "Micrococcus species"
#> [13] "Micrococcus luteus" "Bacillus (non-anthracis)"
#> [15] "Bacillus subtilis" "Bacillus cereus"
#> [17] "Viridans streptococci" "Aerococcus spp."
#> [19] "Kocuria spp." "Dermacoccus spp."
#> [21] "Rothia spp."
Flag Likely Contaminants
data <- prep_flag_contaminants(
data,
method = "auto" # "auto", "device_based", "heuristic", "provided"
)
table(data$is_contaminant)
Test Individual Organism Names
prep_is_contaminant(
organism_name = c("Staphylococcus epidermidis", "Escherichia coli"),
syndrome = "Bloodstream infections"
)
#> [1] TRUE FALSE
Polymicrobial Infections
Multiple organisms from the same patient-episode must be weighted so each organism does not count as a full independent event.
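As a simple picture of fractional weighting, the base R sketch below splits each episode equally among its organisms so that the weights within an episode sum to 1. Equal splitting is the simplest possible choice and is an assumption made for illustration; it is not a reproduction of the package's "monomicrobial_proportion" method.

```r
# Illustrative only: equal fractional weights within each episode.
episodes <- data.frame(
  event_id = c("E1", "E2", "E2", "E3"),
  organism = c("E. coli", "E. coli", "K. pneumoniae", "S. aureus"),
  stringsAsFactors = FALSE
)

# number of distinct organisms per episode, repeated on every row of that episode
n_org <- ave(
  seq_along(episodes$organism), episodes$event_id,
  FUN = function(i) length(unique(episodes$organism[i]))
)

episodes$is_polymicrobial <- n_org > 1
episodes$weight <- 1 / n_org
episodes
```

With these weights, summing over rows counts episode E2 once rather than twice.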
# 1. Flag episodes with > 1 organism
poly_data <- prep_flag_polymicrobial(
event_data,
patient_col = "patient_id",
organism_col = "organism_name"
)
#> Identifying polymicrobial infections ...
#>
#> Polymicrobial: 1/6 patient-context groups (16.7%)
#>
#> Organism count distribution per patient-context:
#> # A tibble: 2 × 2
#> n_organisms n_groups
#> <int> <int>
#> 1 1 5
#> 2 2 1
table(poly_data$is_polymicrobial)
#> < table of extent 0 >
# 2. Assign fractional weight to each organism within an episode
poly_data <- prep_compute_poly_weights(
poly_data,
episode_col = "event_id",
method = "monomicrobial_proportion"
)
# 3. Expand: one row per organism with its weight
poly_data <- prep_split_poly_episode(poly_data)
MDR / XDR Classification
Classification follows Magiorakos et al. (2012) criteria. Requires
organism_group, antibiotic_class, and
antibiotic_value.
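In the Magiorakos scheme, an isolate is MDR when it is non-susceptible to at least one agent in three or more antimicrobial categories, and XDR when it remains susceptible in at most two categories. The base R sketch below applies that rule to a single made-up isolate; total_categories = 9 is the Enterobacterales value from the reference table shown later in this section, and the code is an illustration rather than the package's classifier.

```r
# Illustrative only: MDR / XDR flags for one isolate.
ast <- data.frame(
  antibiotic_class = c("Carbapenems", "Fluoroquinolones", "Aminoglycosides",
                       "Aminoglycosides", "Polymyxins"),
  antibiotic_value = c("S", "R", "I", "S", "R"),
  stringsAsFactors = FALSE
)

# a category counts as non-susceptible if any agent in it is R or I
non_susceptible <- ast$antibiotic_value %in% c("R", "I")
ns_classes <- unique(ast$antibiotic_class[non_susceptible])

total_categories <- 9
is_mdr <- length(ns_classes) >= 3
is_xdr <- (total_categories - length(ns_classes)) <= 2
c(n_nonsusceptible_classes = length(ns_classes), mdr = is_mdr, xdr = is_xdr)
```

Counting at the category level is why the package collapses rows to one per event and antibiotic class before classifying.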
# Collapse to antibiotic-class level first (one row per event-class)
class_data <- prep_collapse_class_level(
data,
event_col = "event_id",
organism_col = "organism_normalized",
class_col = "antibiotic_class",
susceptibility_col = "antibiotic_value"
)
# Classify
class_data <- prep_classify_mdr(class_data, definition = "CDC")
class_data <- prep_classify_xdr(class_data, definition = "CDC")
table(class_data$mdr)
table(class_data$xdr)
Reference Tables
# MDR/XDR thresholds per organism group (Magiorakos 2012)
anumaan:::get_magiorakos_thresholds()
#> # A tibble: 6 × 4
#> organism_group mdr_threshold xdr_threshold total_categories
#> <chr> <dbl> <chr> <dbl>
#> 1 Enterobacterales 3 all_but_2 9
#> 2 Pseudomonas aeruginosa 3 all_but_2 10
#> 3 Acinetobacter spp 3 all_but_2 9
#> 4 Staphylococcus aureus 3 all_but_2 9
#> 5 Enterococcus spp 3 all_but_2 7
#> 6 Streptococcus pneumoniae 3 all_but_2 5
# Antimicrobial categories for a specific organism group
anumaan:::get_antimicrobial_categories("Enterobacterales")
#> [1] "Aminoglycosides"
#> [2] "Carbapenems"
#> [3] "Cephalosporins (3rd gen)"
#> [4] "Cephalosporins (4th gen)"
#> [5] "Fluoroquinolones"
#> [6] "Monobactams"
#> [7] "Penicillins + beta-lactamase inhibitors"
#> [8] "Polymyxins"
#> [9] "Tigecycline"
# Beta-lactam hierarchy (broadest -> narrowest spectrum)
anumaan:::get_beta_lactam_hierarchy()
#> [1] "Carbapenems"
#> [2] "Fourth-generation-cephalosporins"
#> [3] "Third-generation-cephalosporins"
#> [4] "Beta-lactam/beta-lactamase-inhibitor_anti-pseudomonal"
#> [5] "Beta-lactam/beta-lactamase-inhibitor"
#> [6] "Aminopenicillins"
#> [7] "Penicillins"
Resistance Profiles
Per-Isolate Resistance Profile
data <- prep_create_resistance_profile(
data,
organism_col = "organism_normalized",
antibiotic_col = "antibiotic_normalized",
susceptibility_col = "antibiotic_value"
)
Wide AST Matrix
Pivot to one row per isolate, one column per antibiotic:
wide_ast <- prep_create_wide_ast_matrix(
data,
event_col = "event_id",
antibiotic_col = "antibiotic_normalized",
susceptibility_col = "antibiotic_value"
)
Diagnosis and Syndrome Mapping
Map free-text diagnoses to ICD-10 codes and then to clinical syndromes.
Text-Based Diagnosis Mapping
# Rule-based matching (no Python required)
data <- prep_map_diagnosis_to_icd(
data,
diagnosis_col = "diagnosis",
method = "text_match" # or "python_embedding" (requires alethia)
)
ICD-10 to Syndrome
data <- prep_map_icd_to_syndrome(
data,
icd_col = "icd10_code",
hierarchy_path = NULL # NULL uses the bundled infectious_syndrome_hierarchy.csv
)
Assign Patient-Level Syndrome
data <- prep_assign_patient_syndrome(
data,
patient_col = "patient_id",
syndrome_col = "infectious_syndrome"
)
Outcome Cohorts and Attrition Tracking
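Attrition tracking amounts to recording row and patient counts after each filter and appending them to a running table. The base R sketch below shows that bookkeeping; add_stage() is a hypothetical helper, not a package function.

```r
# Illustrative only: accumulate row and patient counts across filter steps.
add_stage <- function(flow, data, stage) {
  row <- data.frame(
    stage = stage,
    n_rows = nrow(data),
    n_patients = length(unique(data$patient_id)),
    stringsAsFactors = FALSE
  )
  # rows removed relative to the previous stage (0 for the first stage)
  row$n_removed <- if (is.null(flow)) 0 else flow$n_rows[nrow(flow)] - row$n_rows
  rbind(flow, row)
}

toy <- data.frame(
  patient_id = c("P1", "P1", "P2", "P3"),
  antibiotic_value = c("R", "S", "S", "R"),
  stringsAsFactors = FALSE
)

flow_toy <- add_stage(NULL, toy, "raw_input")
flow_toy <- add_stage(flow_toy, toy[toy$antibiotic_value != "S", ], "resistant_only")
flow_toy
```

Tracking patients separately from rows matters in long-format data: here the filter drops two rows but only one patient.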
Track Patient Counts at Every Step
prep_attrition_flow() accumulates a table of rows,
unique patients, and unique events at each stage of your pipeline:
# Initialize
flow <- prep_attrition_flow(
flow = NULL,
data = sample_data,
stage_name = "raw_input",
reason = "all loaded records",
patient_col = "patient_id"
)
# After a filter step
filtered_data <- sample_data[sample_data$antibiotic_value != "S", ]
flow <- prep_attrition_flow(
flow = flow,
data = filtered_data,
stage_name = "resistant_only",
reason = "removed susceptible records",
patient_col = "patient_id"
)
flow
#> stage n_rows n_patients n_events n_removed
#> 1 raw_input 10 6 NA 0
#> 2 resistant_only 7 6 NA 3
#> reason
#> 1 all loaded records
#> 2 removed susceptible records
Analytical Readiness Filters
# Keep only patients with at least one non-NA value in all required columns
ready <- prep_filter_analysis_ready(
data,
required_cols = c("final_outcome", "organism_name",
"antibiotic_name", "antibiotic_value")
)
# ready$data - filtered data
# ready$attrition - attrition table
# ready$patient_flags - per-patient completeness flags
# Final gate: remove rows that are completely unusable
data <- prep_filter_minimally_usable(data)
# Full analytical readiness assertion (stops if required fields absent)
prep_validate_analysis_ready(data)
Build Outcome Cohorts
fatal_cohort <- prep_build_fatal_cohort(data,
outcome_col = "final_outcome",
fatal_value = "Died")
nonfatal_cohort <- prep_build_nonfatal_cohort(data,
outcome_col = "final_outcome",
fatal_value = "Died")
DALY Burden Estimation
After preprocessing, anumaan supports GBD-style DALY calculations: Years of Life Lost (YLL) + Years Lived with Disability (YLD).
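The arithmetic behind DALY = YLL + YLD is small enough to show with toy numbers. In the base R sketch below, YLL is the sum of remaining life expectancy at each death, and YLD uses one common incidence-based form (cases x average duration x disability weight). Every number is hypothetical and chosen only for illustration; none are GBD values, and the package functions in this section may compute these quantities differently.

```r
# Illustrative only: DALY = YLL + YLD with made-up inputs.
deaths_toy <- data.frame(
  age_at_death = c(63, 23),
  remaining_life_expectancy = c(22.5, 58.0)  # hypothetical, not GBD figures
)
yll <- sum(deaths_toy$remaining_life_expectancy)

n_cases <- 100               # hypothetical incident cases
duration_years <- 10 / 365.25  # hypothetical average episode duration
disability_weight <- 0.133   # hypothetical disability weight
yld <- n_cases * duration_years * disability_weight

daly <- yll + yld
c(YLL = yll, YLD = yld, DALY = daly)
```

For acute infections with short episodes, YLL usually dominates the total, which is why mortality attribution receives most of the attention in the steps that follow.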
Pathogen Fractions and Deaths
# Total deaths by cause
deaths <- daly_calc_deaths_by_cause(pop_data, data)
# Fraction attributable to infection
inf_fraction <- daly_calc_infection_fraction(pop_data, data)
# Syndrome-specific death counts
syndrome_deaths <- daly_calc_deaths_by_syndrome(
d_j = deaths,
inf_frac = inf_fraction
)
# Incident cases from case-fatality ratios
incidents <- daly_calc_incidence_from_cfr(deaths_L = syndrome_deaths)
# Pathogen fraction of deaths (isolate-level)
pfrac <- daly_calc_pathogen_fraction_fatal(data, syndrome_deaths)
Years of Life Lost (YLL)
# Associated YLL: all deaths in patients with this pathogen
yll_associated <- daly_calc_yll_associated(
data,
life_expectancy_path = "GBD_2019_LE.csv"
)
# Attributable YLL: deaths caused by resistance (RR-based)
yll_attributable <- daly_calc_yll_attributable(
data,
rr_mortality = rr_table
)
Years Lived with Disability (YLD) and PAF
yld_base <- daly_calc_yld_baseline(incidence_data)
paf_los <- daly_calc_paf_los(data, rr_los = rr_los_table)
yld_associated <- daly_calc_fraction_associated_yld(yld_base, paf_los)
yld_attributable <- daly_calc_yld_attributable(yld_base, paf_los)
Hospital-Level DALY Summary
hospital_daly <- compute_hospital_daly(
hospital_counts = centre_counts,
total_deaths = sum(deaths$total),
total_discharged = sum(centre_counts$discharged_h),
yll_base = sum(yll_base),
yll_associated = sum(yll_associated),
yll_attributable = sum(yll_attributable),
yld_base = sum(yld_base),
yld_associated = sum(yld_associated),
yld_attributable = sum(yld_attributable)
)
Visualization
All plot functions return a ggplot2 object that can be
further customized.
Resistance Heatmap
plot_resistance_heatmap(
data = result$data,
isolate_col = "organism_normalized",
class_col = "antibiotic_class",
result_col = "antibiotic_value"
)
Bar and Grouped Bar Charts
plot_bar(
data = result$data,
x = "organism_normalized",
fill = "mdr",
title = "MDR Status by Organism"
)
plot_grouped_bar(
data = result$data,
x = "organism_group",
fill = "aware_category"
)
plot_stacked_bar(
data = result$data,
x = "specimen_type",
fill = "mdr"
)
LOS Distributions
plot_los_distributions(hospital_daly)
DALY Burden Plots
# Hospital-level YLL/YLD bars (output of compute_hospital_daly())
plot_burden_by_hospital(hospital_daly, metric = "YLL")
plot_burden_by_hospital(hospital_daly, metric = "YLD")
# Top organisms by YLL/YLD burden
plot_burden_by_organism(organism_yll, metric = "YLL", n = 8)
# YLL heatmap: resistance class x pathogen group
plot_yll_heatmap(yll_class_data, value_col = "YLL_class",
type = "attributable", n_admissions = 500)
# YLD heatmap: associated vs attributable by organism
plot_yld_heatmap(yld_data, n_admissions = 500)
EDA Plots
The plot_* functions produce ready-made exploratory
charts. The examples below use the same sample_data defined
earlier. Where function defaults differ from sample_data
column names, explicit overrides are shown.
Enrolment
# Unique patients per hospital, sorted by count
plot_patients_by_hospital(sample_data, patient_col = "patient_id")
# Syndrome distribution -- overall, faceted, or single centre
plot_syndrome_distribution(sample_data, mode = "overall",
patient_col = "patient_id")
plot_syndrome_distribution(sample_data, mode = "faceted",
patient_col = "patient_id", ncol = 2)
plot_syndrome_distribution(sample_data, mode = "single",
patient_col = "patient_id", center = "Centre A")
Organisms and Resistance
# Top organisms by unique patient count
plot_top_organisms(sample_data, mode = "overall", n = 10,
patient_col = "patient_id")
plot_top_organisms(sample_data, mode = "faceted", n = 5,
patient_col = "patient_id")
# Resistance rate per antibiotic (stacked R/I/S)
plot_abx_susceptibility(sample_data, mode = "overall",
patient_col = "patient_id")
# Resistance by specimen type
plot_resistance_by_sample(sample_data, mode = "overall",
patient_col = "patient_id",
sample_col = "specimen_type")
# Pathogen x antibiotic resistance heatmap
plot_abx_heatmap(sample_data, mode = "all", patient_col = "patient_id")
Outcomes
# Outcome distribution -- pooled and faceted
plot_outcome_distribution(sample_data, mode = "overall",
patient_col = "patient_id")
plot_outcome_distribution(sample_data, mode = "faceted",
patient_col = "patient_id", ncol = 2)
# Death vs Discharged split (sample_data uses "Died"/"Survived")
plot_death_discharged(sample_data, mode = "overall",
patient_col = "patient_id",
death_label = "Died",
discharged_label = "Survived")
# Outcomes within each organism
plot_outcome_by_organism(sample_data, mode = "overall",
patient_col = "patient_id")
# Outcomes by GBD age bin
plot_outcome_by_agebin(sample_data, mode = "overall",
patient_col = "patient_id")
# Outcome trends by year
plot_outcome_by_year(sample_data, mode = "overall",
patient_col = "patient_id",
date_col = "date_of_final_outcome")
Infection Type and Location
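The HAI/CAI split in this section comes from the gap between the admission and culture dates. A minimal base R sketch of that derivation, assuming a 2-day (48-hour) cutoff, which is a common surveillance convention; the threshold anumaan applies may differ, and `toy` and `classify_hai_cai()` are hypothetical names used only for illustration:

```r
# Toy data: column names mirror sample_data, values are made up.
toy <- data.frame(
  patient_id        = c("P1", "P2", "P3"),
  date_of_admission = as.Date(c("2024-01-01", "2024-01-01", "2024-01-10")),
  date_of_culture   = as.Date(c("2024-01-01", "2024-01-05", "2024-01-12"))
)

# Hypothetical helper: cultures taken more than `cutoff_days` after admission
# are labelled HAI, the rest CAI. The 2-day default is an assumption.
classify_hai_cai <- function(admission, culture, cutoff_days = 2) {
  gap <- as.numeric(culture - admission)  # Date - Date is measured in days
  ifelse(is.na(gap), NA_character_, ifelse(gap > cutoff_days, "HAI", "CAI"))
}

toy$infection_type <- classify_hai_cai(toy$date_of_admission, toy$date_of_culture)
toy$infection_type
#> [1] "CAI" "HAI" "CAI"
```

A culture taken exactly at the cutoff (P3, gap of 2 days) stays CAI under this rule; whether the boundary is inclusive is itself a choice worth checking against `?prep_derive_hai_cai`.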
# HAI vs CAI per centre (derived from admission-to-culture gap)
plot_hai_cai_by_facility(sample_data,
patient_col = "patient_id",
admission_col = "date_of_admission",
culture_col = "date_of_culture")
# Mono vs polymicrobial infections per centre
plot_mono_poly_by_facility(sample_data, patient_col = "patient_id")
# ICU / Ward / Other breakdown per centre
plot_location_by_facility(sample_data, patient_col = "patient_id")
Length of Stay and Age
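Length of stay in these plots is derived from the admission and final-outcome dates passed as `admission_col` and `discharge_col`. A minimal base R sketch of that calculation, assuming LOS is counted in whole days and that negative values (outcome recorded before admission) are treated as missing; how anumaan handles such rows is not shown here, and `toy` is made-up data:

```r
# Toy data: column names mirror sample_data, values are made up.
toy <- data.frame(
  patient_id            = c("P1", "P2", "P3"),
  date_of_admission     = as.Date(c("2024-01-01", "2024-01-03", "2024-01-10")),
  date_of_final_outcome = as.Date(c("2024-01-08", "2024-01-04", "2024-01-09"))
)

# Length of stay in days.
toy$los_days <- as.numeric(
  difftime(toy$date_of_final_outcome, toy$date_of_admission, units = "days")
)

# Assumption: a negative LOS signals a date-entry error, so set it to NA.
toy$los_days[toy$los_days < 0] <- NA

toy$los_days
#> [1]  7  1 NA
median(toy$los_days, na.rm = TRUE)
#> [1] 4
```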
# LOS distribution ridge plot (plot_los_ridge defaults patient_col = "patient_id")
plot_los_ridge(sample_data, mode = "all",
admission_col = "date_of_admission",
discharge_col = "date_of_final_outcome")
# Age distribution ridge plot
plot_age_ridge(sample_data, mode = "all", patient_col = "patient_id")
# Median LOS by age group
plot_los_by_agebin(sample_data, mode = "overall",
patient_col = "patient_id",
admission_col = "date_of_admission",
discharge_col = "date_of_final_outcome")
All functions return a ggplot2 object and accept
base_size, title, and subtitle
arguments for quick customisation.
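Because the return value is a ggplot2 object, the usual ggplot2 layers and `ggsave()` should apply as well. A minimal sketch, using a hand-built plot `p` as a stand-in for the output of any plot_* function; the toy data and the output file name are made up:

```r
library(ggplot2)

# Stand-in for a plot returned by one of the plot_* functions.
toy <- data.frame(
  organism   = c("E. coli", "K. pneumoniae", "S. aureus"),
  n_patients = c(3, 4, 2)
)
p <- ggplot(toy, aes(x = organism, y = n_patients)) +
  geom_col()

# Add labels and theme tweaks on top of the returned object.
p2 <- p +
  labs(title = "Patients per organism", x = NULL, y = "Unique patients") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Save to a temporary location; adjust path and size as needed.
out_file <- file.path(tempdir(), "patients_per_organism.png")
ggsave(out_file, plot = p2, width = 6, height = 4, dpi = 150)
```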
Common Analysis Tasks
Resistance Rates
sample_data %>%
filter(!is.na(antibiotic_value)) %>%
group_by(organism_name) %>%
summarise(
n = n(),
n_resistant = sum(antibiotic_value == "R"),
resistance_rate = round(100 * n_resistant / n, 1),
.groups = "drop"
) %>%
arrange(desc(resistance_rate))
#> # A tibble: 4 × 4
#>   organism_name     n n_resistant resistance_rate
#>   <chr>         <int>       <int>           <dbl>
#> 1 A. baumannii      1           1           100
#> 2 K. pneumoniae     4           3            75
#> 3 E. coli           3           2            66.7
#> 4 S. aureus         2           1            50
Export Results
write.csv(result$data, "processed_amr_data.csv", row.names = FALSE)
# Save full pipeline result for reproducibility
saveRDS(result, "preprocessing_result.rds")
# Reload
result <- readRDS("preprocessing_result.rds")
Troubleshooting
Column not found after standardization
prep_standardize_column_names() uses fuzzy matching when
fuzzy_match = TRUE. Inspect your column names and provide
explicit overrides if needed:
names(sample_data)
#> [1] "patient_id" "date_of_admission" "date_of_culture"
#> [4] "date_of_final_outcome" "final_outcome" "specimen_type"
#> [7] "organism_name" "antibiotic_name" "antibiotic_value"
#> [10] "DOB" "gender" "center_name"
#> [13] "age_years" "Age_bin" "is_polymicrobial"
#> [16] "location" "infectious_syndrome"
# Then supply a manual map:
# config <- amr_config(column_mappings = list(organism_name = "Pathogen"))
Organism names not normalizing
Organisms absent from the reference pass through unchanged. Check coverage:
table(result$data$organism_match_status)
# "unmatched" rows will have organism_normalized == organism_name (unchanged)
Too many / too few events
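Repeat cultures are grouped into a single event when they fall within the gap window. A minimal dplyr sketch of gap-based grouping, assuming a new event starts whenever more than `gap_days` separate consecutive cultures for the same patient and organism. This is one plausible reading of `event_gap_days`, not necessarily the exact rule in `prep_create_event_ids()`; `count_events()` and `toy` are hypothetical:

```r
library(dplyr)

# Toy data: one patient, one organism, four cultures (values made up).
toy <- data.frame(
  patient_id      = "P1",
  organism_name   = "E. coli",
  date_of_culture = as.Date(c("2024-01-01", "2024-01-05",
                              "2024-01-20", "2024-02-25"))
)

# Hypothetical helper: count events under a given gap window.
count_events <- function(data, gap_days) {
  data %>%
    arrange(patient_id, organism_name, date_of_culture) %>%
    group_by(patient_id, organism_name) %>%
    mutate(
      gap       = as.numeric(date_of_culture - lag(date_of_culture)),
      new_event = is.na(gap) | gap > gap_days,  # first culture always starts one
      event_id  = cumsum(new_event)
    ) %>%
    ungroup() %>%
    distinct(patient_id, organism_name, event_id) %>%
    nrow()
}

count_events(toy, gap_days = 7)
#> [1] 3
count_events(toy, gap_days = 30)
#> [1] 2
```

The gaps between cultures here are 4, 15, and 36 days, so a 7-day window splits them into three events and a 30-day window into two.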
Tune the deduplication window:
config <- amr_config(event_gap_days = 7) # smaller -> more events
config <- amr_config(event_gap_days = 30) # larger -> fewer events
Intermediate (I) counted as resistant
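Whether intermediate (I) results count as resistant changes the numerator of every resistance rate. A minimal sketch on a made-up vector of AST results; `resistance_rate()` is a hypothetical helper for illustration, not a package function:

```r
# Made-up AST results, including one missing value.
ast <- c("R", "I", "S", "S", "R", "I", NA)

# Hypothetical helper: percent resistant among non-missing results.
resistance_rate <- function(x, intermediate_as_resistant) {
  x <- x[!is.na(x)]
  resistant <- if (intermediate_as_resistant) x %in% c("R", "I") else x == "R"
  round(100 * mean(resistant), 1)
}

resistance_rate(ast, intermediate_as_resistant = TRUE)
#> [1] 66.7
resistance_rate(ast, intermediate_as_resistant = FALSE)
#> [1] 33.3
```

In the pipeline, the same choice is made through `amr_config()`: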
config <- amr_config(intermediate_as_resistant = FALSE)
Date columns not parsing
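Mixed formats are a common cause: a single column may hold day-first strings, ISO dates, and Excel serial numbers. A minimal base R sketch of decoding these three forms, assuming the Windows Excel origin (1899-12-30) and day-first slash dates; `parse_one()` is a hypothetical helper and not the package's parser:

```r
raw <- c("15/01/2024", "45306", "2024-01-15")

# Hypothetical helper: pick a parsing rule from the shape of the string.
parse_one <- function(x) {
  if (grepl("^[0-9]+$", x)) {
    # All digits: treat as an Excel serial day count.
    as.Date(as.numeric(x), origin = "1899-12-30")
  } else if (grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$", x)) {
    # Slash-separated: assumed to be day/month/year.
    as.Date(x, format = "%d/%m/%Y")
  } else {
    # Fallback: ISO 8601 year-month-day.
    as.Date(x, format = "%Y-%m-%d")
  }
}

parsed <- do.call(c, lapply(raw, parse_one))
format(parsed)
#> [1] "2024-01-15" "2024-01-15" "2024-01-15"
```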
Use prep_parse_date_column() directly to inspect how
values are being interpreted:
prep_parse_date_column(
c("15/01/2024", "45306", "2024-01-15"),
col_name = "date_of_culture",
table_label = "debug"
)
#> [debug] 'date_of_culture': 1 value(s) decoded as Excel serial date.
#> [debug] 'date_of_culture': 3 / 3 non-missing value(s) successfully parsed as Date (0 failed).
#> [1] "2024-01-15" "2024-01-15" "2024-01-15"
Getting Help
?run_preprocess
?amr_config
?prep_standardize_organisms
?prep_create_event_ids
?prep_classify_mdr
?prep_derive_hai_cai
?prep_attrition_flow
?prep_map_diagnosis_to_icd
help(package = "anumaan")
ls("package:anumaan")
Report bugs at https://github.com/saketlab/anumaan/issues.
Session Info
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.2.1 anumaan_0.1.0.9009
#>
#> loaded via a namespace (and not attached):
#> [1] bit_4.6.0 jsonlite_2.0.0 crayon_1.5.3 compiler_4.6.0
#> [5] tidyselect_1.2.1 stringr_1.6.0 parallel_4.6.0 jquerylib_0.1.4
#> [9] systemfonts_1.3.2 textshaping_1.0.5 yaml_2.3.12 fastmap_1.2.0
#> [13] readr_2.2.0 R6_2.6.1 generics_0.1.4 knitr_1.51
#> [17] htmlwidgets_1.6.4 tibble_3.3.1 desc_1.4.3 lubridate_1.9.5
#> [21] tzdb_0.5.0 bslib_0.10.0 pillar_1.11.1 rlang_1.2.0
#> [25] utf8_1.2.6 stringi_1.8.7 cachem_1.1.0 xfun_0.57
#> [29] fs_2.1.0 sass_0.4.10 bit64_4.8.0 otel_0.2.0
#> [33] timechange_0.4.0 cli_3.6.6 withr_3.0.2 pkgdown_2.2.0
#> [37] magrittr_2.0.5 stringdist_0.9.17 digest_0.6.39 vroom_1.7.1
#> [41] hms_1.1.4 lifecycle_1.0.5 vctrs_0.7.3 evaluate_1.0.5
#> [45] glue_1.8.1 ragg_1.5.2 rmarkdown_2.31 tools_4.6.0
#> [49] pkgconfig_2.0.3 htmltools_0.5.9