HealthTable: Preprocessing Functions

This page documents the preprocessing and transformation functions available for HealthTable objects when working with OMOP CDM data. These functions are provided by the OMOP CDM extension and enable data preparation workflows for machine learning and analysis.

One-Hot Encoding

Transform categorical variables into binary indicator columns suitable for machine learning algorithms.

HealthBase.one_hot_encode - Function
one_hot_encode(ht::HealthTable; cols, drop_original=true, return_features_only=false)

One-hot encode the categorical columns in ht using FeatureTransforms.jl.

For every requested column, the function appends Boolean indicator columns, one per unique (non-missing) level. New columns are named col_value, e.g. gender_concept_id_8507.

Boolean source columns are detected and skipped automatically with a warning.

Arguments

  • ht::HealthTable: Table to transform (schema-aware).

Keyword Arguments

  • cols::Vector{Symbol}: Categorical columns to encode.
  • drop_original::Bool=true: Drop the source columns after encoding.
  • return_features_only::Bool=false: If true, return a DataFrame containing only the encoded data; if false, wrap the result in a HealthTable with disable_type_enforcement=true (because the output is no longer standard OMOP CDM).

Returns

  • DataFrame or HealthTable depending on return_features_only.

Example

ht_ohe = one_hot_encode(ht; cols = [:gender_concept_id, :race_concept_id])
X = one_hot_encode(ht; cols = [:gender_concept_id], return_features_only = true) # ML features
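The encoding rule described above (one Boolean indicator per unique non-missing level, named col_value, with Boolean source columns skipped) can be illustrated with a minimal plain-Julia sketch. This is not the HealthBase implementation, which delegates to FeatureTransforms.jl. The helper `one_hot_sketch` is hypothetical, a NamedTuple of vectors stands in for a table, and encoding missing rows as false in every indicator is an assumption.

```julia
# Hypothetical helper illustrating the documented naming and skipping rules.
# It is not part of HealthBase; a NamedTuple of vectors stands in for a table.
function one_hot_sketch(table::NamedTuple, col::Symbol)
    vals = table[col]
    # Boolean source columns are already indicators, so skip them with a warning.
    if nonmissingtype(eltype(vals)) === Bool
        @warn "Skipping Boolean column" col
        return NamedTuple()
    end
    # One indicator per unique non-missing level, named col_value.
    levels = sort!(unique(skipmissing(vals)))
    colnames = Tuple(Symbol(col, "_", level) for level in levels)
    # Assumption: rows with a missing source value are false in every indicator.
    columns = Tuple([!ismissing(v) && v == level for v in vals] for level in levels)
    return NamedTuple{colnames}(columns)
end

demo = (gender_concept_id = [8507, 8532, missing, 8507],)
encoded = one_hot_sketch(demo, :gender_concept_id)
```

Here `encoded` has the columns gender_concept_id_8507 and gender_concept_id_8532, mirroring the naming scheme in the docstring.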

Vocabulary Compression

Reduce the dimensionality of categorical variables by grouping infrequent levels under a common label.

HealthBase.apply_vocabulary_compression - Function
apply_vocabulary_compression(ht::HealthTable; cols, min_freq=10, other_label="Other", drop_original=false)

Group infrequent categorical levels under a single label (other_label).

Arguments

  • ht::HealthTable: Input data table.

Keyword Arguments

  • cols::Vector{Symbol}: Columns to compress.
  • min_freq::Int=10: Minimum frequency for a value to remain unchanged.
  • other_label::String="Other": Label used to replace infrequent values.
  • drop_original::Bool=false: Whether to drop original columns after compression.

Returns

  • HealthTable: Table with compressed categorical levels.

Example

ht_small = apply_vocabulary_compression(ht; cols=[:condition_source_value], min_freq=5)
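To make the min_freq rule concrete, here is a minimal plain-Julia sketch of frequency-based grouping on a single vector. The helper `compress_levels` is hypothetical and not part of HealthBase. It assumes that a level is kept when its count is at least min_freq and that missing values are left untouched.

```julia
# Hypothetical helper showing frequency-based grouping on one column.
function compress_levels(values::AbstractVector; min_freq::Int = 10, other_label = "Other")
    # Count occurrences of each non-missing level.
    counts = Dict{Any,Int}()
    for v in skipmissing(values)
        counts[v] = get(counts, v, 0) + 1
    end
    # Keep levels seen at least min_freq times; replace the rest with other_label.
    # Assumption: missing values stay missing.
    return [ismissing(v) || counts[v] >= min_freq ? v : other_label for v in values]
end

codes = ["I10", "I10", "I10", "E11", "E11", "J45"]
compressed = compress_levels(codes; min_freq = 2)
```

With min_freq = 2, the single "J45" entry falls below the threshold and is replaced by "Other", while "I10" and "E11" are kept.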

Concept Translation

Concept Mapping (Immutable)

Map OMOP concept IDs to human-readable concept names using the OMOP vocabulary tables, returning a new HealthTable.

HealthBase.map_concepts - Function
map_concepts(ht::HealthTable, col::Symbol, new_col::String, conn::DuckDB.DB; drop_original::Bool = false, concept_table::String = "concept", schema::String = "main")

Map concept IDs in a column to their corresponding concept names using the OMOP concept table. Only direct mappings using concept IDs are supported.

Arguments

  • ht::HealthTable: Input OMOP data table.
  • cols::Union{Symbol, Vector{Symbol}}: Column(s) containing concept IDs.
  • conn::DuckDB.DB: Database connection for concept lookup.

Keyword Arguments

  • new_cols: Name(s) for the output column(s). If not provided, each name defaults to the source column name followed by suffix.
  • suffix::String="_mapped": Suffix for default new column names.
  • drop_original::Bool=false: Drop source column(s) after mapping.
  • concept_table::String="concept": Table name for concepts.
  • schema::String="main": Schema containing the concept table.

Returns

  • A new HealthTable with the concept names added in new_col.

Example

using DBInterface, DuckDB

conn = DBInterface.connect(DuckDB.DB, "path/to/db/.duckdb")

# Map gender_concept_id to concept_name
ht_mapped = map_concepts(ht, :gender_concept_id, "gender_name", conn; schema = "dbt_synthea_dev")
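Conceptually, the mapping is a lookup from concept ID to concept name that adds a new column and leaves the input untouched. The sketch below shows that idea in plain Julia, with a Dict standing in for the OMOP concept table so that no database is needed. The helper `map_concepts_sketch` is hypothetical, and returning missing for IDs without a match is an assumption rather than documented behaviour.

```julia
# A Dict stands in for the OMOP concept table (concept_id => concept_name).
concept_names = Dict(8507 => "MALE", 8532 => "FEMALE")

# Hypothetical helper, not the HealthBase function.
function map_concepts_sketch(table::NamedTuple, col::Symbol, new_col::String, lookup::AbstractDict)
    # Assumption: IDs with no entry in the lookup become missing.
    mapped = [get(lookup, id, missing) for id in table[col]]
    # merge builds a new NamedTuple, so the input table is not modified.
    return merge(table, NamedTuple{(Symbol(new_col),)}((mapped,)))
end

people = (person_id = [1, 2, 3], gender_concept_id = [8507, 8532, 0])
people_mapped = map_concepts_sketch(people, :gender_concept_id, "gender_name", concept_names)
```

After the call, `people_mapped` carries a gender_name column, while `people` still has only its original two columns.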

Concept Mapping (In-Place)

In-place version of concept mapping that modifies the original HealthTable directly for memory efficiency.

HealthBase.map_concepts! - Function
map_concepts!(ht::HealthTable, cols, conn; ...)

In-place version of map_concepts. Maps concept IDs to human-readable names using the OMOP concept table.

Arguments

  • ht::HealthTable: The table to update.
  • cols: Single column or list of columns with concept IDs.
  • conn::DuckDB.DB: Connection to the OMOP database.

Keyword Arguments

  • new_cols: Optional new column names. Defaults to col * "_mapped".
  • suffix: Suffix used when new_cols is not provided.
  • drop_original: Whether to drop the original columns.
  • concept_table, schema: Source table and schema.

Returns

  • The mutated HealthTable.

Example

using DBInterface, DuckDB

conn = DBInterface.connect(DuckDB.DB, "path/to/db/.duckdb")

# Map gender_concept_id to concept_name in-place
map_concepts!(ht, :gender_concept_id, conn; new_cols="gender_name", schema="dbt_synthea_dev")
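The in-place pattern, including the default naming rule (source column name plus suffix), can be sketched in plain Julia with a Dict of columns as a mutable stand-in for a table. The helper `map_concepts_sketch!` is hypothetical. Its keywords echo the documented ones, but their behaviour here is an assumption, and a Dict replaces the database lookup.

```julia
# Hypothetical in-place helper; a Dict of columns stands in for a mutable table.
function map_concepts_sketch!(table::AbstractDict{Symbol}, cols, lookup::AbstractDict;
                              suffix::String = "_mapped", drop_original::Bool = false)
    # Accept either a single Symbol or a collection of Symbols.
    for col in (cols isa Symbol ? (cols,) : cols)
        # Default output name: source column name followed by the suffix.
        table[Symbol(col, suffix)] = [get(lookup, id, missing) for id in table[col]]
        drop_original && delete!(table, col)
    end
    return table  # the same object that was passed in, now mutated
end

lookup = Dict(8507 => "MALE", 8532 => "FEMALE")
tbl = Dict{Symbol,Any}(:gender_concept_id => [8507, 8532])
result = map_concepts_sketch!(tbl, :gender_concept_id, lookup)
```

Because the function mutates and returns its argument, `result` and `tbl` are the same object, so no second copy of the table is allocated.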