Training and Evaluation

With the feature matrix built and preprocessed, this page covers the final step of the pipeline: splitting the data, training three classification models, and comparing their predictive performance with AUC.

Load Preprocessed Data

The preprocessing step saved the final training and test splits to CSV. Load them here:

julia
using CSV, DataFrames
using CategoricalArrays: categorical
using MLJ

train_df = CSV.read(joinpath("output", "train.csv"), DataFrame)
test_df  = CSV.read(joinpath("output", "test.csv"),  DataFrame)

# MLJ classifiers expect a categorical target; convert the 0/1 column to a CategoricalVector with a fixed level order
train_df.outcome = categorical(string.(train_df.outcome); levels = ["0", "1"])
test_df.outcome  = categorical(string.(test_df.outcome);  levels = ["0", "1"])

y_train = train_df.outcome
X_train = select(train_df, Not(:outcome))

y_test  = test_df.outcome
X_test  = select(test_df,  Not(:outcome))

The levels = ["0", "1"] call ensures both the train and test vectors share the same encoding - important for models that inspect the level order internally.

Train-Test Split

The 80/20 split is performed during preprocessing (see src/06_preprocessing.jl) with a fixed random seed so the partition is reproducible:

julia
# Done in src/06_preprocessing.jl:
# train, test = partition(df, 0.8; shuffle = true, rng = 42)
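
For reference, MLJ's partition splits any table row-wise; a toy sketch on an illustrative DataFrame (not project data):

julia
using DataFrames, MLJ

toy = DataFrame(x = 1:10, outcome = rand(0:1, 10))
train, test = partition(toy, 0.8; shuffle = true, rng = 42)
nrow(train), nrow(test)    # -> (8, 2)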

Training Models with MLJ.jl

MLJ.jl provides a uniform interface to a wide range of model families in Julia. The same machine -> fit! -> predict pattern works regardless of the algorithm, making it straightforward to swap models or run comparisons.

Three models are trained and compared. The evaluation metric is AUC (Area Under the ROC Curve): 1.0 is a perfect classifier, 0.5 is random guessing.

julia
using ROCAnalysis

function evaluate_model(model, X_train, y_train, X_test, y_test)
    mach = machine(model, X_train, y_train)
    fit!(mach; verbosity = 0)
    # pdf.(..., "1") extracts the predicted probability of the positive class
    probs = pdf.(predict(mach, X_test), "1")
    # ROCAnalysis.roc expects two separate score vectors (targets, non-targets);
    # the calls are qualified because MLJ exports an `auc` measure of its own
    is_pos = y_test .== "1"
    score = ROCAnalysis.auc(ROCAnalysis.roc(probs[is_pos], probs[.!is_pos]))
    return score, mach
end
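
For intuition about the ROCAnalysis API used above: the package builds the curve from two separate score vectors, one for targets and one for non-targets. A standalone toy example with made-up scores:

julia
using ROCAnalysis

pos = [0.9, 0.8, 0.7, 0.6]   # hypothetical scores of true positives
neg = [0.4, 0.3, 0.2, 0.1]   # hypothetical scores of true negatives
println(ROCAnalysis.auc(ROCAnalysis.roc(pos, neg)))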

Logistic Regression with L1 Regularization

Logistic regression is the transparent baseline. L1 regularization (Lasso) drives the coefficients of irrelevant features exactly to zero, effectively selecting the most informative predictors and keeping the model interpretable.

See: MLJLinearModels.jl documentation

julia
using MLJLinearModels

logreg = MLJLinearModels.LogisticClassifier(penalty = :l1, lambda = 0.0428)
auc_lr, mach_lr = evaluate_model(logreg, X_train, y_train, X_test, y_test)
println("Logistic Regression  AUC: $auc_lr")

Random Forest

Random forests build an ensemble of decision trees on random subsets of the training data and average their predictions. They handle non-linear relationships naturally, require minimal preprocessing, and are robust to correlated and noisy features.

See: MLJDecisionTreeInterface.jl documentation

julia
using MLJDecisionTreeInterface

rf = MLJDecisionTreeInterface.RandomForestClassifier(n_trees = 100, max_depth = 10)
auc_rf, mach_rf = evaluate_model(rf, X_train, y_train, X_test, y_test)
println("Random Forest AUC: $auc_rf")

n_trees = 100 provides a stable ensemble; max_depth = 10 caps each tree's complexity to reduce overfitting.
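
Both values are sensible defaults rather than tuned optima. A hedged sketch of how they could be grid-searched with MLJ's TunedModel (the candidate values are illustrative):

julia
r_trees = range(rf, :n_trees;   values = [50, 100, 200])
r_depth = range(rf, :max_depth; values = [5, 10, 20])

tuned_rf = TunedModel(model = rf, tuning = Grid(),
                      resampling = CV(nfolds = 5, rng = 42),
                      range = [r_trees, r_depth],
                      measure = MLJ.auc)   # qualified: ROCAnalysis also exports `auc`

mach_tuned = machine(tuned_rf, X_train, y_train)
fit!(mach_tuned; verbosity = 0)
report(mach_tuned).best_model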

XGBoost

XGBoost (eXtreme Gradient Boosting) trains trees sequentially, with each new tree correcting the residual errors of the ensemble built so far. It consistently delivers strong performance on structured tabular data and is often among the best-performing algorithms in clinical prediction benchmarks.

See: MLJXGBoostInterface.jl documentation

julia
using MLJXGBoostInterface

xgb = MLJXGBoostInterface.XGBoostClassifier(num_round = 100, max_depth = 5, eta = 0.1)
auc_xgb, mach_xgb = evaluate_model(xgb, X_train, y_train, X_test, y_test)
println("XGBoost AUC: $auc_xgb")

eta = 0.1 is the learning rate: how strongly each new tree corrects the ensemble. Lower values generalise better but need more boosting rounds to converge. max_depth = 5 limits tree depth to control model complexity.
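
The two knobs trade off against each other; a quick sketch of the slower-but-steadier end of that trade-off (values illustrative, not tuned):

julia
# Lower learning rate, compensated by more boosting rounds
xgb_slow = MLJXGBoostInterface.XGBoostClassifier(num_round = 500, max_depth = 5, eta = 0.02)
auc_slow, _ = evaluate_model(xgb_slow, X_train, y_train, X_test, y_test)
println("XGBoost (eta = 0.02) AUC: $auc_slow")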

Full Pipeline Summary

The complete PLP workflow from raw database to evaluated model:

| Step | Script | Key Package |
| --- | --- | --- |
| Study initialization | setup | HealthBase.jl |
| Cohort download | setup | OHDSIAPI.jl |
| Data exploration | src/01_data_loader.jl | DuckDB.jl + PrettyTables.jl |
| Cohort SQL translation & execution | src/02_cohort_definition.jl | OHDSICohortExpressions.jl + FunSQL.jl |
| Feature extraction | src/03_feature_extraction.jl | DuckDB.jl + DataFrames.jl |
| Distribution check | src/04_distribution_check.jl | Statistics |
| Outcome labeling | src/05_outcome_attach.jl | DataFrames.jl |
| Imputation, standardization, encoding | src/06_preprocessing.jl | CategoricalArrays.jl + Statistics |
| Train / test split | src/06_preprocessing.jl | MLJ.jl |
| Model training & evaluation | src/07_train_model.jl | MLJLinearModels · MLJDecisionTreeInterface · MLJXGBoostInterface |
| AUC scoring | src/07_train_model.jl | ROCAnalysis.jl |

The entire workflow is orchestrated by run.jl:

julia
# run.jl
steps = [
    ("Defining cohorts",     joinpath("src", "02_cohort_definition.jl")),
    ("Extracting features",  joinpath("src", "03_feature_extraction.jl")),
    ("Checking distributions", joinpath("src", "04_distribution_check.jl")),
    ("Attaching outcomes",   joinpath("src", "05_outcome_attach.jl")),
    ("Preprocessing",        joinpath("src", "06_preprocessing.jl")),
    ("Training models",      joinpath("src", "07_train_model.jl")),
]

for (label, path) in steps
    println("\n── $label ────")
    include(path)
end

To reproduce the full pipeline:

bash
julia --project=. run.jl
