HFR

ML Copilot Agent

Preprint under review at the Journal of Engineering Artificial Intelligence

From Prompt to Reproducible Biomedical ML

ML Copilot executes leakage-aware workflows across genomics, proteomics, and hematology data, turning task cards into calibrated models, decision-curve analyses, and citable manifests without bespoke scripting.

  • 6 biomedical cohorts (TCGA, CPTAC, CBC)
  • Deterministic seeds + leakage guards
  • Calibration & DCA exported by default

ML Workflow Automation

AI Assistant

Automated Preprocessing

Handles missing values, encoding, scaling, and feature engineering

Model Selection

Recommends models based on problem type and dataset

Evaluation & Reporting

Generates metrics, visualizations, and ready-to-share reports

Validated Across Oncology & Hematology

The copilot replicated six task cards from our manuscript, spanning TCGA, CPTAC, and clinical CBC datasets, with deterministic seeding, leakage checks, and calibrated outputs. Highlights from the four flagship case studies:

HNSCC Prognostic Classifier

RNA-seq clustering and supervised refinement discovered a 12-gene signature that stratified TCGA survival and generalized to CPTAC validation.

Log-rank p = 0.00562 (CPTAC)
Concordance index 0.70
TCGA-HNSC | CPTAC-HNSC | 12 genes | Calibrated RF
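
For readers unfamiliar with the concordance index reported above, here is a minimal pure-Python sketch of Harrell's C-index. It is illustrative only: the function name and the toy data are hypothetical, pairs with tied survival times are simply skipped, and the manuscript's own pipeline (whose stack lists lifelines) may compute the statistic differently.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's follow-up time. It is concordant when
    the subject who failed earlier also has the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied risk counts as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: follow-up time, event flag (1 = death), risk score.
times = [5, 8, 12, 20, 25]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.6, 0.3, 0.1]
c_index = concordance_index(times, events, risk)
print(f"C-index: {c_index:.3f}")  # 7 of 8 comparable pairs concordant -> 0.875
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.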

LUAD TRU/PP/PI Transfer

Harmonized expression matrices enabled cross-cohort subtype prediction with disciplined ComBat ablations and probability calibration.

Balanced accuracy 0.912
Macro F1 0.905, Macro AUPRC 0.974
TCGA-LUAD | CPTAC-LUAD | Logistic Regression | TRU · PP · PI
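
The balanced accuracy and macro F1 figures above are both per-class averages, which matters when subtypes are imbalanced. The sketch below shows how each is computed from raw predictions. The helper names and the eight toy labels are hypothetical and are not data from the manuscript.

```python
from collections import Counter


def _counts(y_true, y_pred):
    """Tally per-class true positives, false positives, and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    labels = sorted(set(y_true) | set(y_pred))
    return labels, tp, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    labels, tp, _, fn = _counts(y_true, y_pred)
    recalls = [tp[c] / (tp[c] + fn[c]) for c in labels if tp[c] + fn[c] > 0]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    labels, tp, fp, fn = _counts(y_true, y_pred)
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy predictions over the three LUAD subtypes.
y_true = ["TRU", "TRU", "TRU", "PP", "PP", "PI", "PI", "PI"]
y_pred = ["TRU", "TRU", "PP", "PP", "PP", "PI", "PI", "TRU"]
print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.778
print(round(macro_f1(y_true, y_pred), 3))           # 0.756
```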

CPTAC-BRCA Receptor Profiling

Parallel RNA vs proteome pipelines quantified hormone receptor status with calibration curves and net-benefit analysis for ER/PR/HER2.

Proteome ROC-AUC: ER 0.954, HER2 0.951
RNA PR-AUC for PR (progesterone receptor): 0.932
CPTAC-BRCA | Calibrated LR | DCA & Brier logs
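
The Brier score and the net benefit used in decision-curve analysis (DCA) are both simple functions of predicted probabilities. The following sketch uses the standard definitions, with net benefit at threshold pt equal to TP/n − (FP/n) · pt/(1 − pt). The function names and toy probabilities are hypothetical, and this is not the copilot's own export code.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy for a decision curve: treat every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical toy labels (1 = receptor positive) and model probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
print(round(brier_score(labels, probs), 3))          # 0.125
print(round(net_benefit(labels, probs, 0.5), 3))     # 0.25
print(round(net_benefit_treat_all(labels, 0.5), 3))  # 0.0
```

A decision curve repeats the net-benefit calculation over a range of thresholds and compares the model against the treat-all and treat-none (net benefit 0) strategies.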

CBC-Based Anemia Triage

Routine labs plus derived Mentzer-style indices delivered interpretable triage with SHAP explanations for physicians.

Random forest F1 1.00 on hold-out
Logistic regression F1 0.984 with best calibration
Adult & Pediatric CBC | SHAP Diagnostics | Mentzer · Shine-Lal
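
The two derived indices named above are ratios of routine CBC values. A minimal sketch using their textbook definitions follows. The function names, the example values, and the cutoffs are assumptions for illustration. The cutoffs shown are commonly cited ones, and the page does not state which thresholds, if any, the manuscript's models use.

```python
def mentzer_index(mcv_fl, rbc_millions_per_ul):
    """Mentzer index = MCV (fL) / RBC count (millions per microliter)."""
    return mcv_fl / rbc_millions_per_ul


def shine_lal_index(mcv_fl, mch_pg):
    """Shine-Lal index = MCV^2 (fL^2) * MCH (pg) / 100."""
    return (mcv_fl ** 2) * mch_pg / 100.0


# Hypothetical microcytic sample: MCV 65 fL, RBC 5.8 million/uL, MCH 20 pg.
mentzer = mentzer_index(65.0, 5.8)
shine_lal = shine_lal_index(65.0, 20.0)

# Commonly cited cutoffs (assumed here, not taken from the manuscript):
# Mentzer < 13 or Shine-Lal < 1530 points toward thalassemia trait
# rather than iron deficiency.
print(round(mentzer, 2), mentzer < 13)       # 11.21 True
print(round(shine_lal, 1), shine_lal < 1530)  # 845.0 True
```

Because both indices are explicit formulas, they can be added as features without hiding anything from SHAP-style attribution.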

How ML Copilot Works

Command Input

The user provides natural-language instructions for ML tasks

LLM Processing

GPT-4o or Gemini interprets the request and generates Python code

Code Execution

The code interpreter executes the generated Python code in a sandbox

Result Delivery

Outputs, visualizations, and files are returned to the user
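
The four steps above can be sketched as a small loop. This is a rough illustration under stated assumptions, not ML Copilot's implementation. `fake_planner` is a hypothetical stand-in for the GPT-4o/Gemini call and returns fixed code. `run_generated_code` runs that code in a separate Python process with a timeout, which gives process isolation only and is not a real security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def fake_planner(instruction: str) -> str:
    """Hypothetical stand-in for the LLM: returns Python source as text."""
    return "print(sum(range(10)))"


def run_generated_code(code: str, timeout_s: float = 10.0) -> dict:
    """Run code in a child interpreter inside a throwaway working directory."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "step.py"
        script.write_text(code)
        proc = subprocess.run(
            [sys.executable, "-I", str(script)],  # -I: isolated mode
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    return {
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def handle(instruction: str) -> dict:
    """Command input -> LLM processing -> code execution -> result delivery."""
    code = fake_planner(instruction)
    result = run_generated_code(code)
    return {"instruction": instruction, "code": code, **result}


reply = handle("add the numbers 0 through 9")
print(reply["stdout"].strip())  # 45
```

Keeping the generated code alongside its output in the reply is what makes each step auditable afterwards.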

Event-Driven Architecture & Guardrails

Each workflow step is an event with explicit inputs, schema validation, deterministic seeds, and rollback protection. The copilot records prompts, generated code, package versions, SHAP/DCA assets, and final manifests for exact reproduction.

  • Leakage guards on patient/sample IDs plus per-cohort z-scaling and ComBat ablations.
  • Built-in calibration, decision-curve analysis, and Brier logging for every classifier.
  • Automatic failure recovery with schema snapshots and immutable manifests.
  • Modular events for preprocess, train, evaluate, visualize, and custom instructions.
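
Two of the guardrails listed above, leakage guards on patient IDs and per-cohort z-scaling, can be sketched with the standard library alone. The function names, the default seed, and the toy IDs are hypothetical. The sketch shows the general technique (grouped splitting plus within-cohort standardization) and is not the copilot's actual checks.

```python
import random
from statistics import mean, pstdev


def group_split(patient_ids, test_fraction=0.25, seed=42):
    """Assign whole patients to train or test so no ID appears in both."""
    unique = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_patients = set(unique[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


def assert_no_leakage(patient_ids, train_idx, test_idx):
    """Raise if any patient contributes samples to both partitions."""
    overlap = {patient_ids[i] for i in train_idx} & {patient_ids[i] for i in test_idx}
    if overlap:
        raise ValueError(f"patients in both train and test: {sorted(overlap)}")


def zscale_per_cohort(values, cohorts):
    """Standardize each value with the mean and SD of its own cohort."""
    stats = {}
    for cohort in set(cohorts):
        vals = [v for v, c in zip(values, cohorts) if c == cohort]
        sd = pstdev(vals)
        stats[cohort] = (mean(vals), sd if sd > 0 else 1.0)
    return [(v - stats[c][0]) / stats[c][1] for v, c in zip(values, cohorts)]


# Hypothetical samples: patients P1 and P3 each contribute two samples.
patient_ids = ["P1", "P1", "P2", "P3", "P3", "P4"]
train_idx, test_idx = group_split(patient_ids)
assert_no_leakage(patient_ids, train_idx, test_idx)

# Two cohorts on different scales end up on a common scale.
scaled = zscale_per_cohort([1, 2, 3, 10, 20, 30], ["A", "A", "A", "B", "B", "B"])
```
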
ML Copilot system architecture diagram

The manuscript’s system diagram shows the GPT-4o/Gemini planner orchestrating sandboxed Python execution via LlamaIndex events with provenance captured at each hop.
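
The event-and-manifest idea described in this section can be illustrated without any framework. The sketch below deliberately does not use the LlamaIndex API. The `Event` class, the handler names, and the manifest fields are hypothetical, and rollback and failure recovery are omitted. It shows only schema validation before each step, a fixed seed, and a manifest that records a hash of every step's input.

```python
import hashlib
import json
import platform
import random
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    payload: dict


# Minimal "schema": keys each event must carry before its handler runs.
REQUIRED_KEYS = {"preprocess": {"rows"}, "train": {"rows", "seed"}}


def validate(event):
    missing = REQUIRED_KEYS.get(event.name, set()) - set(event.payload)
    if missing:
        raise ValueError(f"{event.name} is missing keys: {sorted(missing)}")


def preprocess(event):
    rows = [r for r in event.payload["rows"] if r is not None]  # drop missing
    return Event("train", {"rows": rows, "seed": 7})


def train(event):
    rng = random.Random(event.payload["seed"])  # deterministic sampling
    rows = event.payload["rows"]
    sample = rng.sample(rows, k=min(3, len(rows)))
    return Event("done", {"model": {"mean": sum(sample) / len(sample)}})


HANDLERS = {"preprocess": preprocess, "train": train}


def run(start):
    """Dispatch events until no handler remains; return the manifest."""
    manifest = {"python": platform.python_version(), "steps": []}
    event = start
    while event.name in HANDLERS:
        validate(event)
        blob = json.dumps(event.payload, sort_keys=True).encode()
        manifest["steps"].append(
            {"event": event.name, "input_sha256": hashlib.sha256(blob).hexdigest()}
        )
        event = HANDLERS[event.name](event)
    manifest["result"] = event.payload
    return manifest


first = run(Event("preprocess", {"rows": [1, 2, None, 4, 5]}))
second = run(Event("preprocess", {"rows": [1, 2, None, 4, 5]}))
print(first == second)  # True: same inputs and seed give the same manifest
```

Hashing each step's input makes it cheap to confirm later that a rerun started from the same data.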

Key Features

  • Interactive command-line interface for ML workflows
  • Support for multiple LLM providers (OpenAI, Gemini)
  • Built-in code interpreter for safe Python execution
  • Automated data preprocessing and feature engineering
  • Multiple ML algorithms with hyperparameter tuning
  • Comprehensive model evaluation metrics
  • Automated visualization and reporting

Technical Specifications

  • Core Architecture: LLM Agent with Code Interpreter
  • Supported Data: CSV, Excel, JSON, Images (with extensions)
  • ML Libraries: scikit-learn, pandas, numpy, matplotlib, seaborn
  • Interface: CLI with Python API options
  • Security: Local code execution with sandboxing
  • Documentation: Auto-generated reports in Markdown/HTML

Technology Stack

Python
LlamaIndex
Gemini
GPT-4o
scikit-learn
pandas
numpy
matplotlib
seaborn
lifelines
GitHub
ReadTheDocs

Case Study Visuals

Direct exports from the manuscript highlight what the copilot produced: Kaplan-Meier survival plots, subtype transfer precision-recall curves, receptor-status comparisons, and SHAP-based interpretability. Each figure is reproducible from the task cards shipped with ML Copilot.

Case Study 1 - HNSCC survival stratification

Case Study 1: Survival stratification for HNSCC with TCGA discovery and CPTAC validation showing the 12-gene classifier’s log-rank separation.

Case Study 2 - LUAD subtype transfer results

Case Study 2: Cross-cohort LUAD TRU/PP/PI transfer with per-class precision-recall curves and balanced accuracy summaries for linear vs. tree-based models.

Case Study 3 - CPTAC BRCA receptor results

Case Study 3: RNA vs. proteome comparison for ER/PR/HER2 receptor status including label distributions and modality-specific PR-AUC.

Case Study 4 - CBC anemia model diagnostics

Case Study 4: CBC-based anemia classification with learning curves, calibration overlays, and SHAP feature importances that mirror hematology heuristics.

Experience ML Copilot

Try Sample Commands

list files

Show files in current directory

preprocess data.csv target=outcome

Clean and prepare your dataset

train model=random_forest

Train a machine learning model

plot confusion_matrix

Visualize model performance

document

Generate analysis report

Output Preview

$ python -m ml_copilot_agent
ML Copilot Agent initialized with GPT-4o
$ preprocess data.csv target=outcome
Loading data.csv (1425 rows × 28 columns)
Handling missing values: 5 rows dropped
Encoding categorical variables: 3 features one-hot encoded
Normalizing numerical features
Saved preprocessed data to data_preprocessed.csv
$ train model=random_forest
Training Random Forest classifier...
Using 5-fold cross-validation
Best parameters: n_estimators=200, max_depth=10
Validation accuracy: 0.87 ± 0.03
Saved model to models/rf_model.pkl
$ plot confusion_matrix
Generating confusion matrix...
Saved plot to results/confusion_matrix.png
Confusion Matrix
TP: 245 | FP: 32
FN: 28 | TN: 267
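
As a worked example, the four counts in the preview above determine the usual summary metrics. The arithmetic below uses only the numbers printed in the sample output, and the variable names are illustrative.

```python
# Counts copied from the sample confusion matrix in the preview.
tp, fp, fn, tn = 245, 32, 28, 267

total = tp + fp + fn + tn                # 572 samples
accuracy = (tp + tn) / total             # 512 / 572
precision = tp / (tp + fp)               # 245 / 277
recall = tp / (tp + fn)                  # 245 / 273
specificity = tn / (tn + fp)             # 267 / 299
f1 = 2 * tp / (2 * tp + fp + fn)         # 490 / 550

print(f"accuracy    {accuracy:.3f}")     # 0.895
print(f"precision   {precision:.3f}")    # 0.884
print(f"recall      {recall:.3f}")       # 0.897
print(f"specificity {specificity:.3f}")  # 0.893
print(f"F1          {f1:.3f}")           # 0.891
```

The resulting accuracy of about 0.895 falls inside the 0.87 ± 0.03 cross-validation range shown in the same preview.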

Academic Research

Our preprint "ML Copilot: An LLM-Powered Agent for Orchestrating Complex Machine Learning Workflows" is currently under peer review at the Journal of Engineering Artificial Intelligence.

Let’s Build Clinical‑Grade AI Together

Partner with HFR on pilots, research collaborations, or enterprise integrations across healthcare and biosciences.