Quickstart
Welcome to the Quickstart guide for HealthBase.jl! This guide walks you through setting up your Julia environment, creating example OMOP CDM data, validating it, and applying preprocessing steps using the HealthTable system.
Getting Started
Launch Julia and Enter Your Project Environment
To get started:
- Open your terminal or Julia REPL.
- Navigate to your project folder (where
Project.tomlis located):
cd path/to/your/project- Activate the project:
julia --project=.- (Optional for docs) For working on documentation:
julia --project=docs1. Load Packages
Before loading HealthBase, you must first load some trigger packages. These packages enable HealthBase's extensions, which power important features like type validation and concept mapping.
⚠️ Important: Load the following packages before
using HealthBase. Otherwise, some functions may not be available due to missing extensions.
# First, load the trigger packages
using DataFrames, OMOPCommonDataModel, InlineStrings, Serialization, Dates, FeatureTransforms, DBInterface, DuckDB
# Then, load HealthBase
using HealthBase2. Create Example DataFrames
We'll create two DataFrames:
good_df- a minimal, valid slice of the OMOP person table.wrong_df- intentionally invalid (wrong types & extra column) so you can see the constructor’s validation in action.
good_df = DataFrame(
person_id = 1:6,
gender_concept_id = [8507, 8507, 8532, 8532, 8507, 8532],
year_of_birth = [1980, 1995, 1990, 1975, 1988, 2001],
race_concept_id = [8527, 8515, 8527, 8516, 8527, 8516]
)
# Invalid DataFrame to test validation
wrong_df = DataFrame(
person_id = ["1", "2"], # Should be Int64
gender_concept_id = [8507, 8532],
year_of_birth = [1990, 1985],
race_concept_id = [8527, 8516],
extra_col = [true, false], # Extra column not in the OMOP schema
)
ht = HealthTable(good_df; omop_cdm_version="v5.4.1")
# OMOP CDM version metadata
metadata(ht.source, "omop_cdm_version")
# Will give column-specific metadata
colmetadata(ht.source, :gender_concept_id)
# This will throw an error (strict enforcement)
ht = HealthTable(wrong_df; omop_cdm_version="v5.4.1", disable_type_enforcement = false)
# If you want to *load anyway* and just receive warnings, disable type enforcement:
ht_relaxed = HealthTable(wrong_df; omop_cdm_version="v5.4.1", disable_type_enforcement = true)3. Preprocessing Pipeline
Now, we'll apply a series of transformations to clean and prepare the data.
Mapping Concepts
Convert concept codes (e.g., gender ID) into readable or binary columns using a DuckDB connection.
conn = DBInterface.connect(DuckDB.DB, "synthea_1M_3YR.duckdb")
# Single column, auto-suffixed column name (gender_concept_id_mapped)
ht_mapped = map_concepts(ht, :gender_concept_id, conn; schema = "dbt_synthea_dev")
# Multiple columns, custom new column names
ht_mapped2 = map_concepts(ht, [:gender_concept_id, :race_concept_id], conn; new_cols = ["gender", "race"], schema = "dbt_synthea_dev", drop_original=true)
# In-place variant
map_concepts!(ht, [:gender_concept_id], conn; schema = "dbt_synthea_dev")Manual Concept Mapping (Without DB)
Sometimes, you may want to map concept IDs using a custom dictionary instead of querying the database.
# Define custom mapping manually
custom_map = Dict(8507 => "Male", 8532 => "Female")
# Option 1: Add a new column using `Base.map`
ht.source.gender_label = map(x -> get(custom_map, x, "Unknown"), ht.source.gender_concept_id)
# Option 2: Use `Base.map!` with a new destination vector
gender_labels = Vector{String}(undef, length(ht.source.gender_concept_id))
map!(x -> get(custom_map, x, "Unknown"), gender_labels, ht.source.gender_concept_id)
ht.source.gender_label = gender_labelsCompress sparse categories
Group rare values into an "Other" category so they don’t overwhelm your model.
ht_compressed = apply_vocabulary_compression(ht_mapped; cols = [:race_concept_id], min_freq = 2, other_label = "Other")One-hot encode categorical columns
Convert categorical codes into binary indicator columns (true/false).
ht_ohe = one_hot_encode(ht_compressed; cols=[:gender_concept_id, :race_concept_id])For Developers: Interactive Use in the REPL
When working interactively in the REPL during development:
- Always load the trigger packages first
- Then load
HealthBase - Only after that, use extension functions like
one_hot_encode,map_concepts, etc.
# Correct load order for extensions to work:
using DataFrames, OMOPCommonDataModel, InlineStrings, Serialization, Dates, FeatureTransforms, DBInterface, DuckDB
using HealthBase
# Now this will work:
# ht_ohe = one_hot_encode(ht; cols=[:gender_concept_id])Happy experimenting with HealthBase.jl! 🎉 Feel free to explore more advanced workflows in the other guide sections.