Predicting Sex and ADHD Diagnosis from fMRI matrices

About The Datathon

From the Kaggle challenge's description:

The WiDS Datathon Global Challenge was developed with Ann S. Bowers Women’s Brain Health Initiative (WBHI) in collaboration with Cornell University and UC Santa Barbara. Datasets and support are provided by the Healthy Brain Network (HBN), the signature scientific initiative of the Child Mind Institute, and the Reproducible Brain Charts project (RBC).

Neuropsychiatric disorders that occur in development, like anxiety, depression, autism, and attention deficit hyperactivity disorder, or ADHD, often differ in how and to what extent they affect males and females. ADHD occurs in about 11% of adolescents, with around 14% of boys and 8% of girls having a diagnosis. There is some evidence that girls with ADHD can often go undiagnosed, as they tend to have more inattentive symptoms which are harder to detect. Girls with ADHD who are undiagnosed will continue suffering with symptoms that burden their mental health and capacity to function.

Exploratory Data Analysis

Three sets of data were provided for 1213 participants:

  • Processed functional connectome matrix data, consisting of 19900 columns
  • Quantitative measures, such as the Strengths and Difficulties Questionnaire (SDQ), Edinburgh Handedness Questionnaire (EHQ), Color Vision (CV) Test (for red-green color deficiency), etc.
  • Qualitative measures like race, ethnicity, parent education levels, parent occupation, etc.

From the result of:

import ydata_profiling as yp
profile = yp.ProfileReport(df_train)
profile.to_notebook_iframe()
profile.to_file('eda_report.html')

We have:

High Correlations

  • ADHD_Outcome is highly correlated with SDQ_SDQ_Externalizing and other fields.
  • APQ_P_APQ_P_INV is highly correlated with APQ_P_APQ_P_PP.
  • APQ_P_APQ_P_PM is highly correlated with MRI_Track_Age_at_Scan.
  • SDQ_SDQ_Difficulties_Total has strong correlations with multiple SDQ-related fields.

Missing Values

  • MRI_Track_Age_at_Scan has 360 missing values (29.7%).

Zero Values

  • SDQ_SDQ_Conduct_Problems has 348 zeros (28.7%).
  • SDQ_SDQ_Emotional_Problems has 299 zeros (24.6%).
  • SDQ_SDQ_Generating_Impact has 204 zeros (16.8%).

Categorical Variables

  • participant_id: Unique identifier (100% distinct).
  • ADHD_Outcome: Binary variable (1: 831, 0: 382) (more ADHD diagnoses than no ADHD diagnoses)
  • Sex_F: Binary variable (0: 797, 1: 416) (more males than females)

Numerical Variables

  • EHQ_EHQ_Total: Mean = 58.88, Min = -100, Max = 100, 12% negative values.
  • ColorVision_CV_Score: Mean = 13.16, Min = 0, Max = 14.
  • APQ_P_APQ_P_CP: Mean = 3.78, Min = 0, Max = 12.
  • SDQ_SDQ_Conduct_Problems: Mean = 2.06, Min = 0, Max = 10, 28.7% zeros.
  • SDQ_SDQ_Difficulties_Total: Mean = 12.12, Min = 0, Max = 34, 1.9% zeros.
  • MRI_Track_Age_at_Scan: Mean = 11.24, Min = 0, Max = 21.56, 360 missing values (29.7%).

Correlations

  • Heatmap analysis shows strong positive correlations between SDQ_SDQ_Difficulties_Total and other SDQ subscales.
  • APQ_P_APQ_P_INV and APQ_P_APQ_P_PP show high correlations.

Missing Values Visualization

MRI_Track_Age_at_Scan has the highest percentage of missing values.
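This kind of overview can be reproduced without the profiling report; a minimal sketch, assuming df_train is the training DataFrame:

import matplotlib.pyplot as plt

# Percentage of missing values per column, highest first
missing_pct = df_train.isna().mean().sort_values(ascending=False) * 100

# Plot the ten most affected columns
missing_pct.head(10).plot.barh(figsize=(8, 4))
plt.xlabel("Missing values (%)")
plt.title("Top 10 columns by missing values")
plt.tight_layout()
plt.show()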

Other Notes

  • Differences in mean SDQ scores between ADHD and non-ADHD participants were statistically significant (Welch's t-test, p < 0.05); a minimal sketch of the test follows this list
  • Differences in mean Color Vision and EHQ scores between ADHD and non-ADHD participants were not statistically significant
  • The fMRI connectome columns are unlabeled, so I don't know which pair of brain regions each column represents
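Here's a minimal sketch of the test (assuming df_train holds both the SDQ scores and ADHD_Outcome; scipy's ttest_ind with equal_var=False is Welch's t-test):

from scipy.stats import ttest_ind

# Split one SDQ score by diagnosis and compare the group means
adhd = df_train.loc[df_train["ADHD_Outcome"] == 1, "SDQ_SDQ_Difficulties_Total"].dropna()
non_adhd = df_train.loc[df_train["ADHD_Outcome"] == 0, "SDQ_SDQ_Difficulties_Total"].dropna()

t_stat, p_value = ttest_ind(adhd, non_adhd, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")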

Building the Neural Network

Due to time constraints, I focused on the fMRI data and quantitative data only.

Data Import

import pandas as pd

# fMRI data
TRAIN_FMRI_PATH = "data/TRAIN/TRAIN_FUNCTIONAL_CONNECTOME_MATRICES.csv"
train_fmri_df = pd.read_csv(TRAIN_FMRI_PATH)

# Quantitative metadata
TRAIN_QUANT_PATH = "data/TRAIN/TRAIN_QUANTITATIVE_METADATA.xlsx"
train_quant_df = pd.read_excel(TRAIN_QUANT_PATH)

# Solutions
TRAIN_SOLUTIONS_PATH = "data/TRAIN/TRAINING_SOLUTIONS.xlsx"
train_solutions_df = pd.read_excel(TRAIN_SOLUTIONS_PATH)
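A quick shape check (my own addition) to confirm the EDA numbers above: 1213 participants, with 19900 connectome columns plus participant_id:

print(train_fmri_df.shape)       # Expect (1213, 19901)
print(train_quant_df.shape)      # (1213, number of quantitative measures)
print(train_solutions_df.shape)  # Expect (1213, 3): participant_id, ADHD_Outcome, Sex_F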

Principal Component Analysis (on fMRI data only)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Extract connectome features (excluding participant ID)
X = train_fmri_df.iloc[:, 1:].values  # Shape: (n_samples, 19900)

# Apply PCA to reduce dimensionality
# This reduces to 1188 components—down from 19900!
pca = PCA(n_components=0.99)  # Retain 99% of variance
X_pca = pca.fit_transform(X)  # Shape: (n_samples, reduced_dim)

# Print number of components selected
print(f"Reduced to {X_pca.shape[1]} principal components")

# Plot explained variance
plt.figure(figsize=(8, 5))
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o', linestyle='-')
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Explained Variance vs. Number of Components")
plt.grid()
plt.show()
Plot of cumulative explained variance vs. number of fMRI components

Downsize the input dataset to include only the most important features

# Based on: https://stackoverflow.com/a/56722874/11620221
# Get number of PCA components that explain 99% of our variance
num_components = pca.components_.shape[0]

# Get the index of the most important feature
# on EACH component i.e. largest absolute value
most_important = [np.abs(pca.components_[i]).argmax() for i in range(num_components)]

# Get the column names
initial_feature_names = train_fmri_df.columns[1:]
most_important_col_names = [initial_feature_names[most_important[i]] for i in range(num_components)]

# Keep participant_id (index 0, for merging) and the SDQ scores
QUANT_INDICES = [0, 9, *range(11, 18)]  # Skip Difficulties_Total

# Merge the datasets by Participant ID
train_merged_df = pd.merge(
    # Select only the most important features
    train_fmri_df.loc[:, ["participant_id", *most_important_col_names]],
    train_quant_df.iloc[:, QUANT_INDICES],
    on="participant_id",
)

train_merged_df = pd.merge(train_merged_df, train_solutions_df, on="participant_id")
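One caveat with this selection strategy (my note, not from the StackOverflow answer above): the per-component argmax can pick the same column for more than one component, so it's worth checking how many unique features actually survive:

# Compare the number of selected columns against the number of unique ones
print(len(most_important_col_names), len(set(most_important_col_names)))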

Account for class imbalance

This is achieved by:

  • Creating a custom loss function
  • Computing class weights

import tensorflow.keras.backend as K
from sklearn.utils.class_weight import compute_class_weight

# Custom loss function to account for class imbalance
def weighted_binary_crossentropy(y_true, y_pred, weight_0, weight_1):
    epsilon = K.epsilon()
    y_pred = K.clip(y_pred, epsilon, 1.0 - epsilon)

    loss = -(
        weight_0 * (1 - y_true) * K.log(1 - y_pred) + weight_1 * y_true * K.log(y_pred)
    )

    return K.mean(loss)

# Compute class weights
adhd_weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=train_solutions_df.iloc[:, 1])
sex_weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=train_solutions_df.iloc[:, 2])

adhd_loss = lambda y_true, y_pred: weighted_binary_crossentropy(
    y_true, y_pred, weight_0=adhd_weights[0], weight_1=adhd_weights[1]
)
sex_loss = lambda y_true, y_pred: weighted_binary_crossentropy(
    y_true, y_pred, weight_0=sex_weights[0], weight_1=sex_weights[1]
)
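As a sanity check on the custom loss (a sketch of mine, not from the notebook): with both weights set to 1, it should reproduce Keras' standard binary cross-entropy.

import tensorflow as tf

y_true = tf.constant([[0.0], [1.0], [1.0]])
y_prob = tf.constant([[0.1], [0.8], [0.6]])

custom = weighted_binary_crossentropy(y_true, y_prob, weight_0=1.0, weight_1=1.0)
keras_bce = tf.keras.losses.BinaryCrossentropy()(y_true, y_prob)

# The two values should match (up to epsilon clipping)
print(float(custom), float(keras_bce))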

Create a hypermodel for hyperparameter tuning

from tensorflow.keras import layers, Input, optimizers, metrics
from tensorflow.keras.models import Model

def create_hypermodel(hp):
    # MRI Input Branch (num_components features)
    mri_input = Input(shape=(num_components,), name="mri_input")
    x1 = mri_input

    # Add multiple dense layers with hyperparameter tuning
    for i in range(hp.Int("num_mri_layers", 1, 3)):
        x1 = layers.Dense(
            units=hp.Int(
                f"mri_units_{i}",
                min_value=16,
                max_value=int(num_components / 2),
                step=32,
            ),
            activation="relu",
        )(x1)

    # Questionnaire Input Branch
    # -1 to exclude participant ID
    survey_input = Input(shape=(len(QUANT_INDICES) - 1,), name="survey_input")
    x2 = layers.Dense(len(QUANT_INDICES) * 3, activation="relu")(survey_input)

    # Merge both feature representations
    merged = layers.Concatenate()([x1, x2])

    # Add multiple dense layers with hyperparameter tuning
    for i in range(hp.Int("num_layers", 1, 3)):
        merged = layers.Dense(
            units=hp.Int(
                f"units_{i}", min_value=16, max_value=num_components / 3, step=32
            ),
            activation="relu",
        )(merged)

    # Add two final one-unit Dense layers with sigmoid
    # activation for binary classification
    # (1 = ADHD, 0 = No ADHD / 1 = Female, 0 = Male)
    output_adhd = layers.Dense(1, activation="sigmoid", name="adhd_output")(merged)
    output_sex = layers.Dense(1, activation="sigmoid", name="sex_output")(merged)

    # Compile the model with:
    # - weighted binary cross-entropy loss
    # - Adam optimizer
    # - AUC metric
    model = Model(inputs=[mri_input, survey_input], outputs=[output_adhd, output_sex])
    model.compile(
        loss={
            "adhd_output": adhd_loss,
            "sex_output": sex_loss,
        },
        metrics={
            "adhd_output": [
                metrics.AUC(name="roc_auc"),
                metrics.AUC(name="pr_auc", curve="PR"),
            ],
            "sex_output": [
                metrics.AUC(name="roc_auc"),
                metrics.AUC(name="pr_auc", curve="PR"),
            ],
        },
        optimizer=optimizers.legacy.Adam(learning_rate=0.001),
    )

    return model

Model Evaluation

import numpy as np
import keras_tuner as kt
from sklearn.model_selection import KFold

MAX_EPOCHS = 50


def model_eval(X_mri, X_SDQ, targets):
    # Initialize the RandomSearch tuner
    # for hyperparameter search based on Area Under Curve
    tuner = kt.RandomSearch(
        create_hypermodel,
        objective=[
            kt.Objective("val_adhd_output_pr_auc", direction="max"),
            kt.Objective("val_sex_output_pr_auc", direction="max"),
        ],
        directory="archive",
        project_name="bayes",
        overwrite=True,
    )

    # K-fold Cross Validation fold counter
    fold_no = 1

    # Define per-fold score containers
    loss_per_fold = []
    auc_per_fold = []

    # Initialize K-fold cross-validation with 10 splits
    kfolder = KFold(n_splits=10, shuffle=True, random_state=42)

    # Go through each fold
    for train, val in kfolder.split(X_mri, targets):
        # Get the training data
        input_vectors = [X_mri[train], X_SDQ[train]]
        y_train = targets[train]
        y_train_vectors = [y_train[:, 0], y_train[:, 1]]

        # Perform hyperparameter search with Keras Tuner
        # (no epochs argument, so each trial trains for Keras' default of 1 epoch)
        tuner.search(
            input_vectors,
            y_train_vectors,
            validation_split=0.2,
            verbose=False,
        )

        # Get the optimal hyperparameters
        best_hps = tuner.get_best_hyperparameters()[0]

        # Build the model with the optimal hyperparameters and train it
        model = tuner.hypermodel.build(best_hps)
        history = model.fit(
            input_vectors,
            y_train_vectors,
            epochs=MAX_EPOCHS,
            validation_split=0.2,
            verbose=False,
        )

        # Find the best epoch for each output based on validation
        # PR-AUC, then average the two (+2 converts the pair of
        # 0-based indices into 1-based epoch counts)
        val_auc_adhd_per_epoch = history.history["val_adhd_output_pr_auc"]
        val_auc_sex_per_epoch = history.history["val_sex_output_pr_auc"]
        best_epoch = int(
            (
                val_auc_adhd_per_epoch.index(max(val_auc_adhd_per_epoch))
                + val_auc_sex_per_epoch.index(max(val_auc_sex_per_epoch))
                + 2
            )
            / 2
        )

        # Retrain the model on the best hyperparameters and epoch
        hypermodel = tuner.hypermodel.build(best_hps)
        hypermodel.fit(
            input_vectors,
            y_train_vectors,
            epochs=best_epoch,
            validation_split=0.2,
            verbose=False,
        )

        # Vectorize the validation data
        input_vectors_val = [X_mri[val], X_SDQ[val]]
        y_val = targets[val]
        y_val_vectors = [y_val[:, 0], y_val[:, 1]]

        # Evaluate the model on the validation data
        (
            loss,
            adhd_output_loss,
            sex_output_loss,
            adhd_output_roc_auc,
            adhd_output_pr_auc,
            sex_output_roc_auc,
            sex_output_pr_auc,
        ) = hypermodel.evaluate(input_vectors_val, y_val_vectors, verbose=False)

        loss_per_fold.append(loss)
        auc_per_fold.append((adhd_output_pr_auc + sex_output_pr_auc) / 2)

        # Increase fold number
        fold_no = fold_no + 1

    # Return the best hyperparameters and their
    # corresponding AUC and loss metrics
    best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
    hp_data = {
        "loss": round(np.mean(loss_per_fold), 3),
        "auc": round(np.mean(auc_per_fold), 3),
        "num_mri_layers": best_hps.values["num_mri_layers"],
        "num_layers": best_hps.values["num_layers"],
    }

    # Capture the best model's structure
    # Track the sequence of layer(s)
    # and their no. of units
    mri_units = []
    units = []

    # Iterate over num_mri_layers
    for i in range(best_hps.values["num_mri_layers"]):
        try:
            # Save the number of units per MRI
            # layer in the correct sequence
            mri_units.append(str(best_hps.values[f"mri_units_{i}"]))
        except KeyError:
            # The tuner may not have recorded units for this layer
            pass

    # Iterate over num_layers
    for i in range(best_hps.values["num_layers"]):
        try:
            # Save the number of units per post-merge
            # layer in the correct sequence
            units.append(str(best_hps.values[f"units_{i}"]))
        except KeyError:
            pass

    # Store in hp_data for return
    hp_data["mri_units"] = mri_units
    hp_data["units"] = units

    return hp_data


# Helper function to find the best model
def find_best_model(tries, X_mri, X_SDQ, y_train):
    # Track metric & hyperparameter
    # information about the neural networks
    data = []

    for _ in range(tries):
        # Find and save the best neural network structure
        data.append(model_eval(X_mri, X_SDQ, y_train))

    # Print a table summarizing the neural networks' structures
    print("Loss\t", "AUC\t", "Hidden MRI Layers\t", "Hidden Layers\t")

    max_auc_index = -1
    max_auc = 0

    for i in range(len(data)):
        hp_data = data[i]

        print(
            f"{hp_data['loss']}\t",
            f"{hp_data['auc']}\t",
            f"{' → '.join(hp_data['mri_units'])}\t",
            f"{' → '.join(hp_data['units'])}\t",
        )

        if hp_data["auc"] > max_auc:
            max_auc = hp_data["auc"]
            max_auc_index = i

    return data[max_auc_index]


# Try to find the best neural network structure, 10 times
from sklearn.preprocessing import StandardScaler

# Get features and outcomes
X = train_merged_df.iloc[:, 1:-2]  # All feature columns (drop participant_id and the two targets)
y = train_merged_df.iloc[:, -2:].values.astype("float32")  # [ADHD_Outcome, Sex_F]

# Correctly slice the input arrays
X_mri = X.iloc[:, :num_components].values
X_SDQ_unscaled = X.iloc[:, num_components:].values

# Scale the features
scaler = StandardScaler()
X_SDQ_scaled = scaler.fit_transform(X_SDQ_unscaled)

best_model_data = find_best_model(10, X_mri.copy(), X_SDQ_scaled.copy(), y.copy())
best_model_data

Output should look something like:

Loss	 AUC	 Hidden MRI Layers	 Hidden Layers
1.199	 0.683	 208 → 80	 304 → 272 → 112
1.189	 0.672	 432 → 16	 368 → 112 → 16
1.231	 0.668	 368 → 368	 368 → 16
1.237	 0.684	 560	 272
1.21	 0.66	 240	 240 → 80 → 208
1.239	 0.654	 304	 48 → 176 → 208
1.474	 0.644	 464 → 528 → 432	 176 → 16 → 16
1.187	 0.687	 208 → 272	 368 → 240
1.201	 0.685	 240 → 432 → 80	 304 → 80
1.288	 0.65	 240 → 80	 144 → 176 → 16

Predicting Labels for Performance Evaluation

from sklearn.model_selection import train_test_split


# This function is similar to the hypermodel one,
# but it simply removes hyperparameter tuning
def create_model(mri_units, units):
    # MRI Input Branch (num_components features)
    mri_input = Input(shape=(num_components,), name="mri_input")
    x1 = mri_input

    # Add the MRI dense layers with the tuned unit counts
    for n in mri_units:
        x1 = layers.Dense(
            units=int(n),
            activation="relu",
        )(x1)

    # Questionnaire Input Branch
    # -1 to exclude participant ID
    survey_input = Input(shape=(len(QUANT_INDICES) - 1,), name="survey_input")
    x2 = layers.Dense(len(QUANT_INDICES) * 3, activation="relu")(survey_input)

    # Merge both feature representations
    merged = layers.Concatenate()([x1, x2])

    # Add the post-merge dense layers with the tuned unit counts
    for n in units:
        merged = layers.Dense(
            units=int(n),
            activation="relu",
        )(merged)

    # Add two final one-unit Dense layers with sigmoid
    # activation for binary classification
    # (1 = ADHD, 0 = No ADHD / 1 = Female, 0 = Male)
    output_adhd = layers.Dense(1, activation="sigmoid", name="adhd_output")(merged)
    output_sex = layers.Dense(1, activation="sigmoid", name="sex_output")(merged)

    # Compile the model with:
    # - weighted binary cross-entropy loss
    # - Adam optimizer
    # - AUC metric
    model = Model(inputs=[mri_input, survey_input], outputs=[output_adhd, output_sex])
    model.compile(
        loss={
            "adhd_output": adhd_loss,
            "sex_output": sex_loss,
        },
        metrics={
            "adhd_output": [
                metrics.AUC(name="roc_auc"),
                metrics.AUC(name="pr_auc", curve="PR"),
            ],
            "sex_output": [
                metrics.AUC(name="roc_auc"),
                metrics.AUC(name="pr_auc", curve="PR"),
            ],
        },
        optimizer=optimizers.legacy.Adam(learning_rate=0.001),
    )

    return model


# Re-create the model from the best hyperparameters
best_model = create_model(best_model_data["mri_units"], best_model_data["units"])

# Train-test split on our training dataset (1213 rows)
X_mri_train, X_mri_test, X_SDQ_train, X_SDQ_test, y_train, y_test = train_test_split(
    X_mri, X_SDQ_scaled, y, test_size=0.2, random_state=42
)

# Train the model
# (no epochs argument here, so Keras' default of a single epoch is used)
best_model.fit(
    [X_mri_train, X_SDQ_train],
    [y_train[:, 0], y_train[:, 1]],
    verbose=False,
)

# Predict classes
y_pred = best_model.predict([X_mri_test, X_SDQ_test])
y_pred

# Optional Saving of Model
best_model.save("nn-sdq-model", save_format='h5')
best_model.summary()
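Because the losses are plain lambdas, they won't deserialize when loading; a minimal sketch for restoring the saved model (compile=False skips the losses; recompile before any further training):

from tensorflow.keras.models import load_model

# compile=False because the lambda-based losses can't be deserialized
restored_model = load_model("nn-sdq-model", compile=False)
restored_model.summary()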

Result

Visualization

# Check distribution of predictions
# Distinctly bimodal distribution = more likely to have higher AUC score
plt.figure(figsize=(12, 6))

# ADHD Prediction Distribution
plt.subplot(1, 2, 1)
plt.hist(y_pred[0], bins=20, color='blue', density=True, alpha=0.7)
plt.title('Distribution of ADHD Predictions')
plt.xlabel('Predicted Probability')
plt.ylabel('Density')

# Sex Prediction Distribution
plt.subplot(1, 2, 2)
plt.hist(y_pred[1], bins=20, color='orange', density=True, alpha=0.7)
plt.title('Distribution of Sex Predictions')
plt.xlabel('Predicted Probability')
plt.ylabel('Density')

plt.tight_layout()
plt.show()
Histogram of prediction distributions

Calculating ROC-AUC, Accuracy, and F1 scores

from sklearn.metrics import roc_auc_score, accuracy_score, classification_report

# Evaluate the best model on the test set
y_test_vector = [y_test[:, 0], y_test[:, 1]]
converted_y_pred = [
    (np.array(y_pred[0]) >= 0.5).astype(int),
    (np.array(y_pred[1]) >= 0.5).astype(int),
]

# Calculate ROC-AUC score for each output
# (note: computed on the thresholded predictions here; passing the raw
# probabilities in y_pred would give the conventional ROC-AUC)
roc_auc_adhd = roc_auc_score(y_test_vector[0], converted_y_pred[0])
roc_auc_sex = roc_auc_score(y_test_vector[1], converted_y_pred[1])

# Report the ROC-AUC score on the test set
print("ROC-AUC Score on Test Set (ADHD):", roc_auc_adhd)
print("ROC-AUC Score on Test Set (Sex):", roc_auc_sex)

# Calculate accuracy for each output
accuracy_adhd = accuracy_score(y_test_vector[0], converted_y_pred[0])
accuracy_sex = accuracy_score(y_test_vector[1], converted_y_pred[1])

# Report the accuracy on the test set
print("Accuracy Score on Test Set (ADHD):", accuracy_adhd)
print("Accuracy Score on Test Set (Sex):", accuracy_sex)

# Print classification report for each output
print("")
print("Classification Report (ADHD):")
print(classification_report(y_test_vector[0], converted_y_pred[0], zero_division=1))
print("Classification Report (Sex):")
print(classification_report(y_test_vector[1], converted_y_pred[1], zero_division=1))

ROC-AUC Score on Test Set (ADHD): 0.7675438596491229
ROC-AUC Score on Test Set (Sex): 0.6022875816993463
Accuracy Score on Test Set (ADHD): 0.7860082304526749
Accuracy Score on Test Set (Sex): 0.6172839506172839

Classification Report (ADHD):
              precision    recall  f1-score   support

         0.0       0.62      0.72      0.67        72
         1.0       0.87      0.81      0.84       171

    accuracy                           0.79       243
   macro avg       0.75      0.77      0.75       243
weighted avg       0.80      0.79      0.79       243

Classification Report (Sex):
              precision    recall  f1-score   support

         0.0       0.71      0.66      0.68       153
         1.0       0.49      0.54      0.51        90

    accuracy                           0.62       243
   macro avg       0.60      0.60      0.60       243
weighted avg       0.63      0.62      0.62       243

^ Not satisfactory

Submission to Kaggle

Unsurprisingly, the F1-score on the test set is only 0.3+ :')))

Most probable reason: classification for Sex_F is not good enough, which drags the average F1 score down. A quick local check is sketched below.
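Here's a sketch computing the per-target F1 scores and their mean (assuming the leaderboard averages the two F1 scores; I haven't verified the exact metric):

from sklearn.metrics import f1_score

f1_adhd = f1_score(y_test_vector[0], converted_y_pred[0])
f1_sex = f1_score(y_test_vector[1], converted_y_pred[1])

print(f"F1 (ADHD): {f1_adhd:.3f}")
print(f"F1 (Sex):  {f1_sex:.3f}")
print(f"Average:   {(f1_adhd + f1_sex) / 2:.3f}")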

Other Findings and Next Steps

  • My teammate Godson found that PCA and K-fold cross-validation seem to hurt the F1-score rather than improve it.
  • Godson also introduced me to LazyClassifier, which evaluates many models at once so you don't have to build a pipeline for each one (see the sketch after this list). Sex_F has been very hard to predict; the best scores came from RidgeClassifier (a model that is generally robust to class imbalance), but even then the training F1-score was only 0.71.
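For reference, LazyClassifier usage looks roughly like this (a sketch assuming the lazypredict package, shown for the Sex_F target only):

from lazypredict.Supervised import LazyClassifier

# Fit dozens of off-the-shelf classifiers on the Sex_F target at once
clf = LazyClassifier(verbose=0, ignore_warnings=True)
models, predictions = clf.fit(X_SDQ_train, X_SDQ_test, y_train[:, 1], y_test[:, 1])
print(models)  # One row of metrics (accuracy, F1, etc.) per model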

My intended next steps:

  • Try more models! I'm currently experimenting with a deep autoencoder; it's producing NaN loss values at the moment, but I'll figure it out.
  • Prof Watson (my advisor) has suggested creating a balanced dataset and using it for training; a rough undersampling sketch follows below.
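A minimal sketch of what that could look like, undersampling the majority class for the Sex_F target with sklearn's resample (my interpretation of the suggestion):

from sklearn.utils import resample

# Split the merged training frame by the Sex_F target
majority = train_merged_df[train_merged_df["Sex_F"] == 0]
minority = train_merged_df[train_merged_df["Sex_F"] == 1]

# Undersample the majority class to match the minority class size
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
balanced_df = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)

print(balanced_df["Sex_F"].value_counts())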