Comment by noahho
Author here! The breast cancer dataset is simple and heavily saturated, so small differences between methods are expected. As you say, single train/test evaluations can be noisy because of randomness in how the data is split into training and test sets, especially for a saturated dataset like this one. Cross-validation reduces this variance by averaging over multiple splits. I just ran this below (results first, then the code):
TabPFN mean ROC AUC: 0.9973
SVM mean ROC AUC: 0.9903
TabPFN per split: [0.99737963 0.99639699 0.99966931 0.99338624 0.99966465]
SVM per split: [0.99312152 0.98788077 0.99603175 0.98313492 0.99128102]
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from tabpfn import TabPFNClassifier
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target

# TabPFN: 5-fold cross-validated ROC AUC
tabpfn_clf = TabPFNClassifier()
tabpfn_scores = cross_val_score(tabpfn_clf, X, y, cv=5, scoring='roc_auc')
print("TabPFN per split:", tabpfn_scores)
print("TabPFN mean ROC AUC:", np.mean(tabpfn_scores))

# Linear SVM baseline on the same folds and metric
# (the roc_auc scorer uses LinearSVC's decision_function, so no predict_proba is needed)
svm_clf = LinearSVC(C=0.01)
svm_scores = cross_val_score(svm_clf, X, y, cv=5, scoring='roc_auc')
print("SVM per split:", svm_scores)
print("SVM mean ROC AUC:", np.mean(svm_scores))
It's hard to communicate this properly; we should probably keep a favourable example ready, but I just included the simplest one!
Thanks, this is helpful!
I certainly appreciate how the example in the README makes it instantly apparent how to use the code.