Comment by noahho
Author here! The breast cancer dataset is simple and heavily saturated, so small differences between methods are expected. As you say, single train/test evaluations can be noisy because of randomness in how the data is split into training and test sets, especially for a saturated dataset like this one. Cross-validation reduces this variance by averaging over multiple splits. I just ran this below (results first, then the code):
TabPFN mean ROC AUC: 0.9973
SVM mean ROC AUC: 0.9903
TabPFN per split: [0.99737963 0.99639699 0.99966931 0.99338624 0.99966465]
SVM per split: [0.99312152 0.98788077 0.99603175 0.98313492 0.99128102]
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from tabpfn import TabPFNClassifier
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target

# TabPFN: 5-fold cross-validated ROC AUC
tabpfn_clf = TabPFNClassifier()
tabpfn_scores = cross_val_score(tabpfn_clf, X, y, cv=5, scoring='roc_auc')
print("TabPFN per split:", tabpfn_scores)
print("TabPFN mean ROC AUC:", np.mean(tabpfn_scores))

# Linear SVM baseline on the same folds and metric
# (the roc_auc scorer uses LinearSVC's decision_function, so no predict_proba is needed)
svm_clf = LinearSVC(C=0.01)
svm_scores = cross_val_score(svm_clf, X, y, cv=5, scoring='roc_auc')
print("SVM per split:", svm_scores)
print("SVM mean ROC AUC:", np.mean(svm_scores))
It's hard to communicate this properly; we should probably keep a favourable example ready, but I just included the simplest one!
Thanks, this is helpful!
I certainly appreciate how the example in the README makes it instantly apparent how to use the code.