ML Copilot Agent - AI-Powered Machine Learning Assistant

From Prompt to Reproducible Biomedical ML

ML Copilot executes leakage-aware workflows across genomics, proteomics, and hematology data—turning task cards into calibrated models, decision-curve analysis, and citable manifests without bespoke scripting.

6 biomedical cohorts (TCGA, CPTAC, CBC)

Deterministic seeds + leakage guards

Calibration & DCA exported by default

ML Workflow Automation

AI Assistant

Automated Preprocessing

Handles missing values, encoding, scaling, and feature engineering

Model Selection

Recommends models based on problem type and dataset

Evaluation & Reporting

Generates metrics, visualizations, and ready-to-share reports

Validated Across Oncology & Hematology

The copilot replicated six task cards from our manuscript, spanning TCGA, CPTAC, and clinical CBC datasets, with deterministic seeding, leakage checks, and calibrated outputs. Highlights from the four flagship case studies:

HNSCC Prognostic Classifier

RNA-seq clustering and supervised refinement discovered a 12-gene signature that stratified TCGA survival and generalized to CPTAC validation.

Log-rank p = 0.00562 (CPTAC)

Concordance index 0.70

TCGA-HNSC | CPTAC-HNSC | 12 genes | Calibrated RF
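
For readers unfamiliar with the concordance index quoted in this card: it is the share of comparable patient pairs in which the patient with the higher risk score has the earlier event. The sketch below is a plain-Python version of Harrell's C on invented toy data (not TCGA or CPTAC values), and it ignores tied event times for simplicity. It is an illustration only; the stack lists lifelines, which ships its own implementation.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before patient j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # higher risk failed first
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical): follow-up time, event flag (1 = event), risk score.
toy_times = [5, 8, 12, 20, 25]
toy_events = [1, 1, 0, 1, 0]
toy_risk = [0.9, 0.4, 0.6, 0.3, 0.1]

c_index = concordance_index(toy_times, toy_events, toy_risk)
print(round(c_index, 3))  # 7 of 8 comparable pairs are concordant -> 0.875
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported 0.70 should be read.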

LUAD TRU/PP/PI Transfer

Harmonized expression matrices enabled cross-cohort subtype prediction with disciplined ComBat ablations and probability calibration.

Balanced accuracy 0.912

Macro F1 0.905, Macro AUPRC 0.974

TCGA-LUAD | CPTAC-LUAD | Logistic Regression | TRU · PP · PI
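
Balanced accuracy and macro F1, as quoted in this card, both average per-class scores so that a small subtype counts as much as a large one. The sketch below computes them in plain Python on invented labels; the predictions are hypothetical and the function names are ours, not part of ML Copilot. It assumes every label occurs at least once in the true labels.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)} using a one-vs-rest view of each class."""
    counts = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts


def balanced_accuracy(y_true, y_pred, labels):
    """Unweighted mean of per-class recall."""
    counts = per_class_counts(y_true, y_pred, labels)
    recalls = [tp / (tp + fn) for tp, _fp, fn in counts.values()]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1."""
    counts = per_class_counts(y_true, y_pred, labels)
    scores = []
    for tp, fp, fn in counts.values():
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["TRU", "PP", "PI"]
y_true = ["TRU"] * 4 + ["PP"] * 3 + ["PI"] * 3
y_pred = ["TRU", "TRU", "TRU", "PP", "PP", "PP", "PI", "PI", "PI", "PI"]

bal_acc = balanced_accuracy(y_true, y_pred, labels)
mf1 = macro_f1(y_true, y_pred, labels)
print(round(bal_acc, 3), round(mf1, 3))
```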

CPTAC-BRCA Receptor Profiling

Parallel RNA vs proteome pipelines quantified hormone receptor status with calibration curves and net-benefit analysis for ER/PR/HER2.

Proteome ROC-AUC: ER 0.954, HER2 0.951

RNA PR-AUC (PR) 0.932

CPTAC-BRCA | Calibrated LR | DCA & Brier logs
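
The net-benefit analysis and Brier logging mentioned in this card rest on two short formulas. Net benefit at a threshold probability pt is TP/n - (FP/n) * pt / (1 - pt), and the Brier score is the mean squared difference between predicted probability and outcome. The sketch below applies the standard definitions to invented probabilities; it is not the copilot's own export code, and the function names are illustrative.

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def net_benefit(y_true, probs, threshold):
    """Decision-curve net benefit of treating everyone with prob >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    pairs = list(zip(y_true, probs))
    tp = sum(1 for y, p in pairs if p >= threshold and y == 1)
    fp = sum(1 for y, p in pairs if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every patient regardless of the model."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical receptor-status labels and calibrated probabilities.
y = [1, 1, 1, 0, 0, 0, 1, 0]
p = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1, 0.3, 0.7]

nb_model = net_benefit(y, p, threshold=0.5)
nb_all = net_benefit_treat_all(y, threshold=0.5)
brier = brier_score(y, p)
print(round(nb_model, 3), round(nb_all, 3), round(brier, 3))
```

A decision curve repeats the net-benefit calculation over a range of thresholds and plots the model against the treat-all and treat-none (net benefit 0) strategies.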

CBC-Based Anemia Triage

Routine labs plus derived Mentzer-style indices delivered interpretable triage with SHAP explanations for physicians.

Random forest F1 1.00 on hold-out

Logistic regression F1 0.984 with best calibration

Adult & Pediatric CBC | SHAP Diagnostics | Mentzer · Shine-Lal
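
The derived indices named in this card have simple conventional definitions: the Mentzer index is MCV divided by RBC count, and the Shine-Lal index is MCV squared times MCH divided by 100. The sketch below computes both as features from a single CBC row. The input values are invented, the function names are ours, and the source does not say which cutoffs or variants the copilot uses, so none are applied here.

```python
def mentzer_index(mcv_fl, rbc_millions_per_ul):
    """Mentzer index: MCV (fL) divided by RBC count (10^6 cells per microlitre)."""
    if rbc_millions_per_ul <= 0:
        raise ValueError("RBC count must be positive")
    return mcv_fl / rbc_millions_per_ul


def shine_lal_index(mcv_fl, mch_pg):
    """Shine-Lal index: MCV^2 * MCH / 100."""
    return mcv_fl ** 2 * mch_pg / 100.0


def add_derived_indices(cbc_row):
    """Return a copy of a CBC record with both indices added as features."""
    enriched = dict(cbc_row)
    enriched["mentzer"] = mentzer_index(cbc_row["mcv"], cbc_row["rbc"])
    enriched["shine_lal"] = shine_lal_index(cbc_row["mcv"], cbc_row["mch"])
    return enriched


# Hypothetical microcytic sample: MCV 65 fL, RBC 5.8 x 10^6/uL, MCH 20 pg.
row = add_derived_indices({"mcv": 65.0, "rbc": 5.8, "mch": 20.0})
print(round(row["mentzer"], 2), row["shine_lal"])
```

In screening practice a Mentzer index below roughly 13 is commonly read as pointing toward thalassemia trait rather than iron deficiency, which is the kind of heuristic a SHAP plot can be checked against.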

How ML Copilot Works

Command Input

User provides natural language instructions for ML tasks

LLM Processing

GPT-4o/Gemini interprets request and generates Python code

Code Execution

Code interpreter safely executes generated Python code

Result Delivery

Outputs, visualizations, and files are returned to user
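
The four steps above can be sketched as one small loop. In this illustration the LLM call is replaced by a stub that returns fixed Python source, and the generated code runs through `exec` with captured output. All names here are hypothetical, and `exec` is not a sandbox, so this shows the control flow only, not the isolation the real code interpreter provides.

```python
import contextlib
import io


def stub_planner(instruction):
    """Stand-in for the GPT-4o/Gemini call: returns fixed Python source."""
    return "values = [3, 1, 2]\nprint(sorted(values))\n"


def run_generated_code(source):
    """Execute source in a fresh namespace and capture what it prints.

    NOTE: exec() offers no isolation; a real deployment needs a sandbox.
    """
    buffer = io.StringIO()
    namespace = {}
    with contextlib.redirect_stdout(buffer):
        exec(source, namespace)
    return buffer.getvalue()


def handle_command(instruction, planner=stub_planner):
    """Command input -> code generation -> execution -> result delivery."""
    code = planner(instruction)
    output = run_generated_code(code)
    return {"instruction": instruction, "code": code, "output": output}


result = handle_command("sort the values")
print(result["output"], end="")  # prints [1, 2, 3]
```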

Event-Driven Architecture & Guardrails

Each workflow step is an event with explicit inputs, schema validation, deterministic seeds, and rollback protection. The copilot records prompts, generated code, package versions, SHAP/DCA assets, and final manifests for exact reproduction.

Leakage guards on patient/sample IDs plus per-cohort z-scaling and ComBat ablations.
Built-in calibration, decision-curve analysis, and Brier logging for every classifier.
Automatic failure recovery with schema snapshots and immutable manifests.
Modular events for preprocess, train, evaluate, visualize, and custom instructions.
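
A leakage guard on patient IDs, as listed above, boils down to two rules: split by patient rather than by sample, and fail loudly if any patient appears on both sides. The sketch below shows one way to do that with a seeded shuffle so the split is repeatable. The function names and the sample-to-patient mapping are invented for illustration and are not ML Copilot's API.

```python
import random


def group_split(sample_to_patient, test_fraction=0.25, seed=0):
    """Split samples so that all samples of a patient land on the same side."""
    patients = sorted(set(sample_to_patient.values()))
    rng = random.Random(seed)          # local RNG keeps the split deterministic
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = sorted(s for s, p in sample_to_patient.items() if p not in test_patients)
    test = sorted(s for s, p in sample_to_patient.items() if p in test_patients)
    return train, test


def assert_no_leakage(train, test, sample_to_patient):
    """Raise if any patient contributes samples to both train and test."""
    train_patients = {sample_to_patient[s] for s in train}
    test_patients = {sample_to_patient[s] for s in test}
    overlap = train_patients & test_patients
    if overlap:
        raise ValueError(f"patients in both splits: {sorted(overlap)}")


# Hypothetical cohort: two samples (e.g. tumor and normal) per patient.
mapping = {
    "s1": "P1", "s2": "P1",
    "s3": "P2", "s4": "P2",
    "s5": "P3", "s6": "P3",
    "s7": "P4", "s8": "P4",
}

train_ids, test_ids = group_split(mapping, test_fraction=0.25, seed=42)
assert_no_leakage(train_ids, test_ids, mapping)
print(len(train_ids), len(test_ids))  # 6 2
```

Splitting on sample IDs instead would let a patient's second sample leak into the test set, which inflates every downstream metric.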

The manuscript’s system diagram shows the GPT-4o/Gemini planner orchestrating sandboxed Python execution via LlamaIndex events with provenance captured at each hop.
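
To make the provenance idea in this section concrete, here is a minimal sketch of what a per-step manifest could contain: the prompt, a hash of the generated code, the seed, and package versions, sealed with a digest of the whole record. The field names and the `build_manifest` helper are assumptions made for illustration; the source does not document the actual manifest schema.

```python
import hashlib
import json
import sys


def build_manifest(prompt, code, seed, package_versions):
    """Assemble a provenance record and seal it with a SHA-256 digest."""
    record = {
        "prompt": prompt,
        "code_sha256": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        "seed": seed,
        "python": sys.version.split()[0],
        "packages": dict(sorted(package_versions.items())),
    }
    # Canonical JSON (sorted keys) so the same record always hashes the same.
    canonical = json.dumps(record, sort_keys=True)
    record["manifest_sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record


# Hypothetical step: versions are placeholders, not a tested configuration.
versions = {"scikit-learn": "x.y.z", "pandas": "x.y.z"}
manifest = build_manifest(
    prompt="train model=random_forest",
    code="print('fit')",
    seed=42,
    package_versions=versions,
)
print(json.dumps(manifest, indent=2))
```

Because the digest covers every field, any change to the prompt, code, seed, or environment yields a different `manifest_sha256`, which is what makes a stored manifest checkable after the fact.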

Key Features

Interactive command-line interface for ML workflows
Support for multiple LLM providers (OpenAI, Gemini)
Built-in code interpreter for safe Python execution
Automated data preprocessing and feature engineering
Multiple ML algorithms with hyperparameter tuning
Comprehensive model evaluation metrics
Automated visualization and reporting

Technical Specifications

Core Architecture: LLM Agent with Code Interpreter
Supported Data: CSV, Excel, JSON, Images (with extensions)
ML Libraries: scikit-learn, pandas, numpy, matplotlib, seaborn
Interface: CLI with Python API options
Security: Local code execution with sandboxing
Documentation: Auto-generated reports in Markdown/HTML

Technology Stack

Python

LlamaIndex

Gemini

GPT-4o

scikit-learn

pandas

numpy

matplotlib

seaborn

lifelines

GitHub

ReadTheDocs

Case Study Visuals

Direct exports from the manuscript highlight what the copilot produced—Kaplan-Meier survival plots, subtype transfer precision-recall curves, receptor-status comparisons, and SHAP-based interpretability. Each figure is reproducible from the task cards shipped with ML Copilot.

Case Study 1 - HNSCC survival stratification

Case Study 1: Survival stratification for HNSCC with TCGA discovery and CPTAC validation showing the 12-gene classifier’s log-rank separation.

Case Study 2 - LUAD subtype transfer results

Case Study 2: Cross-cohort LUAD TRU/PP/PI transfer with per-class precision-recall curves and balanced accuracy summaries for linear vs. tree-based models.

Case Study 3 - CPTAC BRCA receptor results

Case Study 3: RNA vs. proteome comparison for ER/PR/HER2 receptor status including label distributions and modality-specific PR-AUC.

Case Study 4 - CBC anemia model diagnostics

Case Study 4: CBC-based anemia classification with learning curves, calibration overlays, and SHAP feature importances that mirror hematology heuristics.

Experience ML Copilot

Try Sample Commands

list files

Show files in current directory

preprocess data.csv target=outcome

Clean and prepare your dataset

train model=random_forest

Train a machine learning model

plot confusion_matrix

Visualize model performance

document

Generate analysis report

Output Preview

$ python -m ml_copilot_agent

ML Copilot Agent initialized with GPT-4o

$ preprocess data.csv target=outcome

Loading data.csv (1425 rows × 28 columns)

Handling missing values: 5 rows dropped

Encoding categorical variables: 3 features one-hot encoded

Normalizing numerical features

Saved preprocessed data to data_preprocessed.csv

$ train model=random_forest

Training Random Forest classifier...

Using 5-fold cross-validation

Best parameters: n_estimators=200, max_depth=10

Validation accuracy: 0.87 ± 0.03

Saved model to models/rf_model.pkl

$ plot confusion_matrix

Generating confusion matrix...

Saved plot to results/confusion_matrix.png

Confusion Matrix

TP: 245 | FP: 32

FN: 28 | TN: 267
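
The four counts in the preview above are enough to derive the usual summary metrics. The short calculation below is a worked example on those numbers only; it is not output from the tool.

```python
# Counts copied from the confusion matrix in the preview above.
tp, fp, fn, tn = 245, 32, 28, 267

total = tp + fp + fn + tn
accuracy = (tp + tn) / total            # share of all predictions that are correct
precision = tp / (tp + fp)              # share of predicted positives that are real
recall = tp / (tp + fn)                 # share of real positives that were found
f1 = 2 * tp / (2 * tp + fp + fn)        # harmonic mean of precision and recall

print(f"n={total} accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# n=572 accuracy=0.895 precision=0.884 recall=0.897 f1=0.891
```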

Academic Research

Our preprint "ML Copilot: An LLM-Powered Agent for Orchestrating Complex Machine Learning Workflows" is currently under peer review at the Journal of Engineering Artificial Intelligence.

Let’s Build Clinical‑Grade AI Together

Partner with HFR on pilots, research collaborations, or enterprise integrations across healthcare and biosciences.