# PET-MBF-AI

**Repository Path**: mirrors_mitre/PET-MBF-AI

## Basic Information

- **Project Name**: PET-MBF-AI
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-15
- **Last Updated**: 2026-05-17

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Repo Structure

Analysis_notebooks
- These notebooks contain analysis, mainly model selection and hyperparameter
  optimization for the various models, datasets, and prediction problems.

codebase
- This contains reusable code and scripts that are often leveraged across
  notebooks and scripts and includes scripts for training final models.

config_files
- This contains json files describing optimal hyperparameters for the final
  models

data
- This contains the raw data and the data split into files based on predictive
  task (localization / detection) and train / validation / test splits

logs
- This contains log files associated with the training of final models (i.e.
  model checkpoints and tensorboard logs for keras models)

models
- Contains final trained models. For mlp and unet models that are ensembled, the 
  individual trainings are saved in models/model0 - models/model11. The final
  ensemble is stored in models/ensembles. All other models are stored in models/model0

results
- Contains the results from comparison of models and from evaluation of final
  models (one model for detection and one for localization) for generalization
  performance.

scripts
- Contains the scripts used to create patientwise splits stratified by outcome,
  to train final models, and to evaluate models

Singularity
- .def and .sif files for the Singularity environment

run_analysis.sh
- A file that shows how to run all stages of analysis (besides iterative 
  hyperparameter tuning) with an example dataset of random values.

# Model selection experiments
A tool was written to streamline the hyperparameter tuning process and help
keep hyperparameter tuning organized. This tool launches hyperparameter tuning
experiments to SLURM from a simple interface. This makes it easy to adjust
hyperparameter ranges searched and keep track of these changes from a notebook
or single script, rather than having to create many versions of similar scripts
and submitting each to SLURM. This tool also makes it easy to keep code and output
organized by enforcing naming conventions and file structure for the files
created during hyperparameter tuning.

There is an option to use this same interface to launch model selection jobs
locally if one is not running this code on a cluster managed by SLURM. The
structure of the code that creates this tool is held within
codebase/hpc_scripts. See Analysis_notebooks/17_segment_hpc/model_selection_17_segment_localization.ipynb
for an example using this tool through the interfaces in codebase/model_selection_utils.py.


### Model selection SLURM job launcher overview
The directory codebase/hpc_scripts contains reusable scripts to help launch model selection / hyperparameter
tuning jobs on the HPC cluster managed with SLURM for jobs related to the
MITRE UOHI collaboration PET imaging project. These reusable scripts allow for
the launching of hyperparameter tuning jobs for scikit-learn's SVM and random forest
models using scikit-learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
The scripts also allow for the launching of
hyperparameter tuning jobs for a keras fully connected neural network and a keras U-Net with
[Bayesian optimization](http://hyperopt.github.io/hyperopt/) and
[grid search](http://hyperopt.github.io/hyperopt/). The scripts run
HyperOpt's implementation of Tree-structured Parzen Estimators for the Bayesian
optimization algorithm. Both the Bayesian optimization and grid search for the
fully connected neural network are run through [Ray Tune](https://docs.ray.io/en/latest/tune/index.html).

Currently, these algorithms are implemented for use with the classification
of normal/abnormal scans with the 3 vessel, 17 segment, and polar plot datasets.
The scripts were designed to be easily expanded to handle the localization and
classification-scar/ischemia problems.

### Files

There are three files that are used together to launch these hyperparameter tuning jobs
to SLURM: run_ms_on_hpc.py, ms_on_hpc.py, and model_selection_utils.py. run_ms_on_hpc.py
and ms_on_hpc.py are command line programs. ms_on_hpc.py contains the code
for hyperparameter tuning for both sklearn and keras jobs. run_ms_on_hpc.py
requests resources and submits a hyperparameter tuning job to SLURM by running
ms_on_hpc.py. model_selection_utils.py contains the functions
launch_mlp_hyper_opt_hpc and ms_on_hpc. These functions provide a clean interface
to launch hyperparameter tuning jobs on SLURM from within python scripts or
jupyter notebooks. These functions also help organize the names and locations
of files created as a part of the hyperparameter tuning process.

In order for these scripts to be completely reusable--requiring no code changes
to launch a new hyperparameter tuning job--the hyperparameter search space
to be searched must be specified in either a .json file or a .joblib file 
(depending on if the model is sklearn or keras), and this file will be read in
by ms_on_hpc.py where the search is executed. The interfaces in 
model_selection_utils will generate, save, and name these hyperparameter
space specification files and will pass the appropriate file path to 
run_ms_on_hpc.py.

To launch a new hyperparameter tuning job, the only functions you will
need are launch_mlp_hyper_opt_hpc and ms_on_hpc from model_selection_utils.py.

### Hyperparameter space specification

The format of the hyperparameter space passed to the model selection SLURM job
launcher depends on model type. The hyperparameter space for scikit-learn
models is in the format of the param_grid argument of scikit-learn's
model_selection.GridSearchCV. The hyperparameter space for keras models is
in the format of a hyperopt search space when the search algorithm is hyperopt,
and in the format of the Ray Tune grid search interface (tune.grid_search)
when the search algorithm is grid search.

For scikit-learn models, hyperparameter names can be found in the scikit-learn
documentation.

Please see Analysis_Notebooks/17_segment/model_selection_17_segment_localiation.ipynb
for description of the included hyperparameters and hyperparameter names and
examples of correctly formatted search spaces. 
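
As a rough illustration of the scikit-learn case, the sketch below writes a search space in the param_grid format (a dict mapping hyperparameter names to lists of candidate values) and enumerates the configurations an exhaustive grid search would try. The candidate values, the helper `expand_grid`, and the file name `svm_hp_spec.json` are assumptions made for illustration and are not taken from this repository; JSON is used here only because it needs no extra packages.

```python
import itertools
import json
import tempfile
from pathlib import Path

# Search space in the param_grid format: name -> list of candidate values.
# "C", "kernel", and "gamma" are real sklearn.svm.SVC hyperparameters; the
# values below are illustrative only.
param_grid = {
    "C": [0.1, 1.0, 10.0],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}


def expand_grid(grid):
    """Return every combination of values as a list of dicts (hypothetical helper)."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in itertools.product(*(grid[n] for n in names))]


configs = expand_grid(param_grid)

# Save the spec to a file and read it back (hypothetical file name).
spec_path = Path(tempfile.mkdtemp()) / "svm_hp_spec.json"
spec_path.write_text(json.dumps(param_grid, indent=2))
loaded = json.loads(spec_path.read_text())

print(len(configs), "configurations")
print(configs[0])
```

The number of configurations is the product of the list lengths, so adding one more candidate value to any hyperparameter multiplies the cost of a grid search.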

### Output

When run, various output will be saved depending on the model type.

sklearn models:
- results: a csv with various model
configurations and performance for each according to various metrics.

keras models:
- results: a csv with various model
configurations and performance for each according to various metrics.
- Ray Tune log files: files containing performance info about each model configuration.
Performance information is stored in a separate file for each model configuration
as this enables Ray to run hyperparameter tuning in a distributed manner across nodes.
- TensorBoard files: Files containing information about both training and performance
of each model configuration. 

A directory structure will be created as follows for the output files. The
out_dir argument of the interfaces provided in model_selection_utils specifies the
base directory of the output directory structure. This directory and all
subdirectories will be created if they do not exist.

The output structure:

    e-emagin-pet/output
        3_vessel
            norm_abn
                mlp
                    logs
                    hp_specs
                    results
                svm
                    logs
                    hp_specs
                    results
                rf
                    logs
                    hp_specs
                    results
        17_segment
            norm_abn
                mlp
                    logs
                    hp_specs
                    results
                svm
                    logs
                    hp_specs
                    results
                rf
                    logs
                    hp_specs
                    results
            localization
                mlp
                    ...
                svm 
                    ...
                rf
                    ...
        polar_plot
            ...
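
The layout above follows the pattern out_dir/dataset/problem/model/{logs, hp_specs, results}. A minimal sketch of creating one branch of that tree is shown below. `make_output_dirs` is a hypothetical helper written for this example, not a function from model_selection_utils.

```python
import tempfile
from pathlib import Path

SUBDIRS = ("logs", "hp_specs", "results")


def make_output_dirs(out_dir, dataset, problem, model):
    """Create out_dir/dataset/problem/model/{logs,hp_specs,results}.

    Hypothetical helper; returns a dict mapping subdirectory name to path.
    """
    base = Path(out_dir) / dataset / problem / model
    paths = {}
    for name in SUBDIRS:
        path = base / name
        # parents=True creates missing ancestors; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths


out_root = tempfile.mkdtemp()  # stands in for e-emagin-pet/output
paths = make_output_dirs(out_root, "17_segment", "norm_abn", "svm")
print(paths["results"])
```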


# Data 

    e-emagin-pet/data/raw
        formatted_raw_17s_3v_data.csv
        polar_plot
            - contains csv files representing cartesian mappings of full polar plot
              image data for rest and stress scans for each study
            - naming convention STUDYNO_SCANTYPE.csv where SCANTYPE is either 
              'rest' or 'stress'
    e-emagin-pet/data/splits
        3_vessel
            norm_abn
                3v_norm_abn_train.csv
                3v_norm_abn_val.csv
                3v_norm_abn_test.csv
                3v_norm_abn_full.csv
        17_segment
            norm_abn
                17s_norm_abn_train.csv
                17s_norm_abn_val.csv
                17s_norm_abn_test.csv
                17s_norm_abn_full.csv
            localization
                17s_loc_train.csv
                17s_loc_val.csv
                17s_loc_test.csv
                17s_loc_full.csv
        

The above directory structure and csv files containing data splits can be
generated by following the instructions in scripts/make_stratified_datasets.py.
The csv files contain tabular data for the relevant flow
measurements, the appropriate outcome columns, and columns for patient_id and
study_no. The files containing training sets have two additional columns,
'cv_splits' and 'nn_val_split', which indicate different groups within the
training set to be used during hyperparameter tuning for the classical ML
models and the neural networks respectively. The 'cv_splits' column indicates
membership in 1 of 10 groups, which are the folds of the data
used in 10-fold cross validation for the SVM and random forest. The 'nn_val_split'
column indicates membership in one of two groups for the neural networks: a
training set (0) and a validation set for hyperparameter tuning (1). Helpful functions for
loading the datasets can be found in codebase/data_utils.py.
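
To make the role of the two extra columns concrete, here is a small sketch that reads a stand-in training file and derives the cross-validation folds and the neural network tuning split from them. The inline data (with 3 folds instead of 10) and the helper names `cv_folds` and `nn_split` are assumptions for illustration; the repository's own loader is data_utils.load_dataset.

```python
import csv
import io

# Tiny stand-in for a training split file; real files also hold flow features.
TRAIN_CSV = """study_no,patient_id,abnormal,cv_splits,nn_val_split
s1,p1,0,0,0
s2,p2,1,1,0
s3,p3,0,2,1
s4,p4,1,0,0
s5,p5,0,1,1
s6,p6,1,2,0
"""

rows = list(csv.DictReader(io.StringIO(TRAIN_CSV)))


def cv_folds(rows, column="cv_splits"):
    """Yield (fold_id, train_rows, held_out_rows) for each fold label."""
    fold_ids = sorted({r[column] for r in rows}, key=int)
    for fold in fold_ids:
        held_out = [r for r in rows if r[column] == fold]
        train = [r for r in rows if r[column] != fold]
        yield fold, train, held_out


def nn_split(rows, column="nn_val_split"):
    """Split rows into training (0) and tuning-validation (1) groups."""
    train = [r for r in rows if r[column] == "0"]
    val = [r for r in rows if r[column] == "1"]
    return train, val


folds = list(cv_folds(rows))
nn_train, nn_val = nn_split(rows)
print(len(folds), "folds;", len(nn_train), "train /", len(nn_val), "val rows")
```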

# Environment

### Singularity

The Singularity directory contains both a Singularity .def and .sif file. While
all of the code in this repo will run in the provided Singularity
environment, the model selection job launcher cannot communicate with SLURM if
it is run from within a Singularity environment, as it needs access to the
local machine's installation of SLURM. This is not possible with Singularity,
as the system's installations are kept separate from any process operating in
the Singularity image. Model selection initiated by the launcher tool can be
run from within the provided Singularity environment only when local mode is
enabled. Below are instructions for setting up a conda environment from which
jobs can be launched on SLURM using this launcher tool.
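
One way to decide between SLURM submission and local mode from a script is to check whether SLURM's `sbatch` command is visible on the PATH of the current environment. The sketch below is a heuristic under that assumption; `slurm_available` and `choose_local_mode` are hypothetical names, and the launcher in this repository may not perform any such check.

```python
import shutil


def slurm_available(which=shutil.which):
    """Return True if the SLURM `sbatch` command is found on PATH."""
    return which("sbatch") is not None


def choose_local_mode(which=shutil.which):
    """Use local mode whenever jobs cannot be submitted to SLURM."""
    return not slurm_available(which)


# Inside the Singularity image sbatch is normally not visible, so this
# prints True there; in a conda environment on a SLURM login node it
# would typically print False.
print("local mode:", choose_local_mode())
```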

### Conda

In order for these scripts to run, create a virtual environment 'e-emagin-pet'.
To set up this environment, execute the following commands.

1. conda create -n e-emagin-pet tensorflow pip scikit-learn pandas seaborn jupyter xlrd openpyxl
2. conda activate e-emagin-pet
3. pip install -U hyperopt


# Evaluation

A command line script, codebase/evaluate.py, can be used to compare the
performance of the various models. This evaluation script takes objects of
class ModelEvalWrapper. The ModelEvalWrapper class gives the same interface
to scikit-learn and keras models for the functions necessary for evaluation. It
also records the input features used by each model, as the logistic regression
may use fewer features than the other models on the tabular dataset.
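
The idea of a uniform evaluation interface can be sketched as follows. `UniformEvalWrapper` is a hypothetical stand-in written for this README, not the repository's ModelEvalWrapper, and the two fake model classes only mimic the shape of scikit-learn's `predict_proba` and Keras's `predict` outputs. It assumes a binary problem where the Keras-style model has a single sigmoid output.

```python
class UniformEvalWrapper:
    """Hypothetical stand-in, not the repository's ModelEvalWrapper."""

    def __init__(self, model, feature_names):
        self.model = model
        self.feature_names = list(feature_names)  # inputs this model expects

    def select_features(self, row):
        """Pick this model's features from a dict of all available features."""
        return [row[name] for name in self.feature_names]

    def predict_score(self, row):
        """Return P(abnormal) regardless of the underlying model's API."""
        x = [self.select_features(row)]
        if hasattr(self.model, "predict_proba"):
            # scikit-learn classifiers: one row per sample, one column per class.
            return float(self.model.predict_proba(x)[0][1])
        # Keras-style models: predict() returns one output row per sample.
        return float(self.model.predict(x)[0][0])


class FakeSklearnModel:
    def predict_proba(self, x):
        return [[0.3, 0.7] for _ in x]


class FakeKerasModel:
    def predict(self, x):
        return [[0.4] for _ in x]


row = {"rest_lad": 1.0, "stress_lad": 2.1, "reserve_lad": 2.1}
wrappers = [
    UniformEvalWrapper(FakeSklearnModel(), ["rest_lad", "stress_lad"]),
    UniformEvalWrapper(FakeKerasModel(), ["rest_lad", "stress_lad", "reserve_lad"]),
]
scores = [w.predict_score(row) for w in wrappers]
print(scores)
```

Storing the feature names on the wrapper lets an evaluation loop pass the same full row to every model while each model sees only the columns it was trained on.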


# Example

Follow the steps below to leverage the code from this repository for your
own analysis. Run run_analysis.sh to run all steps below (besides the iterative
hyperparameter tuning step) using fake data, formatted as expected by the
scripts in this repository, but with randomly generated values. The purpose of
run_analysis.sh is to illustrate how to run the included scripts, not to train
meaningful models. To train with data from your own institution, format data
as described below and as shown in the example random data, and run the
hyperparameter tuning for each model type with the example jupyter notebook
provided in Analysis_Notebooks.

## Step 1: Format tabular data

Save a csv containing feature columns for 17 segment and 3 vessel data
and label columns for detection and localization/classification of
abnormality type as described below. The file should also contain a unique
identifier for each study, 'study_no', and a unique identifier for each
patient, 'pt_id'. The file should be saved in ./data/raw/
    
### Feature columns:

- 3 vessel data:
    - 12 feature columns with names ${MEASUREMENT}_${VESSEL_TERRITORY} for
      measurements: ['rest', 'stress', 'reserve', 'difference'] and vessel territories:
      ['lad', 'rca', 'lcx']


- 17 segment data:
    - 68 feature columns with names ${MEASUREMENT}_${REGION} for measurements:
      ['rest', 'stress', 'reserve', 'difference'] and regions ['basal_anterior',
      'basal_anteroseptal', 'basal_inferoseptal', 'basal_inferior',
      'basal_inferolateral', 'basal_anterolateral', 'mid_anterior',
      'mid_anteroseptal', 'mid_inferoseptal', 'mid_inferior',
      'mid_inferolateral', 'mid_anterolateral', 'apical_anterior',
      'apical_septal', 'apical_inferior', 'apical_lateral', 'apex']

- polar plot data:
    - A directory data/raw/polar_plot with csvs of each rest and stress study
      in the format of 48 x 48 feature matrix with cartesian representation
      of polar plot and the naming convention STUDYNO_SCANTYPE.csv where
      SCANTYPE is either 'rest' or 'stress'
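
The tabular column names described above can be generated and checked programmatically. The sketch below builds the 12 and 68 names from the naming convention and reports which expected columns are missing from a csv header. The helper names are hypothetical, and the column order is an assumption; the README specifies only the names.

```python
MEASUREMENTS = ["rest", "stress", "reserve", "difference"]
VESSELS = ["lad", "rca", "lcx"]
REGIONS = [
    "basal_anterior", "basal_anteroseptal", "basal_inferoseptal",
    "basal_inferior", "basal_inferolateral", "basal_anterolateral",
    "mid_anterior", "mid_anteroseptal", "mid_inferoseptal", "mid_inferior",
    "mid_inferolateral", "mid_anterolateral", "apical_anterior",
    "apical_septal", "apical_inferior", "apical_lateral", "apex",
]


def feature_columns(measurements, locations):
    """Build MEASUREMENT_LOCATION names for every pair."""
    return [f"{m}_{loc}" for m in measurements for loc in locations]


THREE_VESSEL_COLUMNS = feature_columns(MEASUREMENTS, VESSELS)
SEVENTEEN_SEGMENT_COLUMNS = feature_columns(MEASUREMENTS, REGIONS)


def missing_columns(header, expected):
    """Return the expected column names that are absent from a csv header."""
    present = set(header)
    return [c for c in expected if c not in present]


print(len(THREE_VESSEL_COLUMNS), len(SEVENTEEN_SEGMENT_COLUMNS))
print(missing_columns(["study_no", "rest_lad"], THREE_VESSEL_COLUMNS)[:3])
```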

### Label columns:

- Per-patient detection:
    - 1 binary label column 'abnormal'. 0 represents normal, 1 represents abnormal.

- Per-vessel localization and classification of abnormality type:
    - 6 binary columns ['scar_lad', 'scar_rca', 'scar_lcx', 'ischemia_lad',
      'ischemia_rca', 'ischemia_lcx'], where 0 indicates that the abnormality
      of the given type is not present in the given vessel territory, and 1
      indicates that the abnormality of the given type is present in the given
      vessel territory.

## Step 2: Split the data

- Use the script scripts/make_stratified_datasets.py to split and store
  the formatted data file.
- Once stratified datasets are saved by the above script and image data is
  saved in the specified format, data can be loaded with
  data_utils.load_dataset, which is used widely throughout the code.
- If the script does not converge, try increasing max_class_dev. See the argument
  description with --help.

Example:

    singularity exec --bind /home:/home ./Singularity/ray_nvidia.sif ./scripts/make_stratified_datasets.py \
        --outdir ./data/splits \
        --in_datapath ./data/raw/data_random_values.csv \
        --max_class_dev 0.02
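
The general idea of a patientwise split stratified by outcome can be sketched as below. This is not the algorithm in make_stratified_datasets.py. It assumes that `max_class_dev` bounds the allowed difference in outcome prevalence between each split and the full dataset, which this README does not confirm, and it uses simple random retries. The function names, split fractions, and synthetic data are all illustrative.

```python
import random
from collections import defaultdict


def prevalence(studies):
    """Fraction of studies labeled abnormal (label 1)."""
    return sum(label for _, label in studies) / len(studies)


def patientwise_split(studies, fractions=(0.6, 0.2, 0.2),
                      max_class_dev=0.05, max_tries=1000, seed=0):
    """Split (patient_id, label) studies so no patient spans two splits.

    Retries random patient-level assignments until every split's abnormal
    prevalence is within max_class_dev of the overall prevalence.
    """
    by_patient = defaultdict(list)
    for patient_id, label in studies:
        by_patient[patient_id].append((patient_id, label))
    patients = sorted(by_patient)
    target = prevalence(studies)
    rng = random.Random(seed)

    n = len(patients)
    cut1 = round(fractions[0] * n)
    cut2 = cut1 + round(fractions[1] * n)
    for _ in range(max_tries):
        rng.shuffle(patients)
        groups = (patients[:cut1], patients[cut1:cut2], patients[cut2:])
        splits = [[s for p in group for s in by_patient[p]] for group in groups]
        if all(split and abs(prevalence(split) - target) <= max_class_dev
               for split in splits):
            return splits
    raise RuntimeError("no split found; try increasing max_class_dev")


# Synthetic data: 100 patients, two studies each, half labeled abnormal.
studies = [(f"p{i}", i % 2) for i in range(100) for _ in range(2)]
train, val, test = patientwise_split(studies)
print(len(train), len(val), len(test))
```

Splitting at the patient level keeps all studies from one patient in the same split, which avoids leaking patient-specific information from training into validation or test data. A tighter `max_class_dev` makes an acceptable assignment rarer, which is one plausible reason a search like this could fail to converge.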

## Step 3: Hyperparameter tuning

- Identify optimal hyperparameters for each model, for each predictive
  problem (detection, localization).
- See example of iterative hyperparameter tuning process and hyperparameter
  tuning job launcher in Analysis_Notebooks/17_segment/model_selection_17_segment_localiation.ipynb 
- If your institution does not run a cluster managed by SLURM, it is still
  possible to use the python interface to the hyperparameter tuning job
  launcher by passing the argument local=True. If you use the local option,
  launch the jupyter notebook within the Singularity environment to ensure
  access to all necessary packages. If you want to launch jobs on your
  institution's SLURM cluster, do not use the local argument, and launch
  the jupyter notebook within the e-emagin-pet conda environment described
  above in the environment section.


## Step 4: Train final models

- Once the optimal hyperparameters have been identified for each model,
  record optimal hyperparameters for each model in a json file in /config_files. 

Example:

    singularity exec --bind /home:/home e-emagin-pet/Singularity/ray_nvidia.sif python3 train_models_17s_loc.py \
        --datadir=./data \
        --outdir=./logs \
        --saved_models_path=./models/17_segment/localization \
        --hyperparam_path=./config_files/17s_loc_hyperparams.json
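
For orientation, a hyperparameter file of this kind could be written and read as in the sketch below. The schema (one entry per model type, each holding keyword arguments), the values, and the helper `load_hyperparams` are assumptions for illustration; check the files shipped in config_files for the layout the training scripts actually expect.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical schema: one entry per model type, each a dict of keyword
# arguments. The real files in config_files may be organized differently.
hyperparams = {
    "svm": {"C": 1.0, "kernel": "rbf"},
    "rf": {"n_estimators": 200, "max_depth": 8},
    "mlp": {"hidden_units": [64, 32], "learning_rate": 0.001, "dropout": 0.2},
}

# Written to a temporary directory here; the README stores these in config_files.
config_path = Path(tempfile.mkdtemp()) / "17s_loc_hyperparams.json"
config_path.write_text(json.dumps(hyperparams, indent=2))


def load_hyperparams(path, model_type):
    """Read the JSON config and return the settings for one model type."""
    with open(path) as f:
        config = json.load(f)
    if model_type not in config:
        raise KeyError(f"no hyperparameters recorded for {model_type!r}")
    return config[model_type]


print(load_hyperparams(config_path, "rf"))
```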
      
## Step 5: Model comparison

- Use the script /codebase/evaluate.py to compare final models and identify
  the highest performing model for each predictive task and representation of
  perfusion data.
- Run evaluation on the validation set for model comparison.

Example:

    singularity exec --bind /home:/home e-emagin-pet/Singularity/ray_nvidia.sif python3 evaluate.py \
        MODEL1.joblib \
        MODEL2.joblib \
        MODEL3.joblib \
        --outdir ./results/norm_abn \
        --datadir ./data \
        --dataset 17_segment \
        --problem norm_abn \
        --all


## Step 6: Generalization performance

- Use /codebase/evaluate.py to calculate generalization performance on the
  held-out test set for the highest performing model from the comparison.

Example:

    singularity exec --bind /home:/home e-emagin-pet/Singularity/ray_nvidia.sif python3 evaluate.py \
        FINAL_MODEL.joblib \
        --outdir ./results/norm_abn/generalization \
        --datadir ./data \
        --dataset 17_segment \
        --problem norm_abn \
        --final

# Public Release
©2022 The MITRE Corporation and The Ottawa Heart Institute
Approved for Public Release; Distribution Unlimited. 
Public Release Case Number 22-1848

# License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

`http://www.apache.org/licenses/LICENSE-2.0`

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.