Comment by simonw
Paper: https://arxiv.org/abs/2505.18878
Code: https://github.com/SalesforceAIResearch/CRMArena
Data: https://huggingface.co/datasets/Salesforce/CRMArenaPro (8,614 rows)
Here's one of those JSON files loaded in Datasette Lite (15MB page load): https://lite.datasette.io/?json=https://huggingface.co/datas...
I had Gemini 2.5 Pro extract the prompts they used from the code:
llm install llm-gemini
llm install llm-fragments-github
llm -m gemini/gemini-2.5-pro-preview-06-05 \
-f github:SalesforceAIResearch/CRMArena \
-s 'Markdown with a comprehensive list of all prompts used and how they are used'
Result here: https://gist.github.com/simonw/33d51edc574dbbd9c7e3fa9c9f79e...
I recommend folks check out the linked paper -- it covers more than just confidentiality tests as a benchmark for whether these systems are ready for B2B AI usage.
But when it comes to confidentiality, fine-grained authorization securing your RAG layer is the only valid solution I've seen used in industry. Injecting data into the context window and relying on prompting will never be secure.
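To make that concrete, here's a minimal sketch of what "authorization at the RAG layer" means (all names are hypothetical, not from the paper or repo): the ACL check runs before retrieval results are assembled, so unauthorized records never reach the context window in the first place, and no prompt wording can leak them.

```python
# Hypothetical sketch: permission-filtered retrieval. The authorization
# check happens BEFORE any document reaches the model's context window.

from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str
    allowed_roles: frozenset  # ACL stored alongside the document


def retrieve(query: str, user_roles: set, index: list) -> list:
    """Return only documents the requesting user is authorized to read."""
    # Authorization filter first -- unauthorized docs are dropped here,
    # so nothing downstream (ranking, prompting) can expose them.
    visible = [d for d in index if d.allowed_roles & user_roles]
    # Toy relevance step (real systems would use vector search etc.).
    return [d for d in visible if query.lower() in d.text.lower()]


index = [
    Document("a1", "Q3 revenue forecast for Acme", frozenset({"finance"})),
    Document("b2", "Public product FAQ", frozenset({"finance", "support"})),
]

# A support agent gets nothing for this query, no matter how the
# eventual prompt is phrased; a finance user gets the record.
support_hits = retrieve("forecast", {"support"}, index)
finance_hits = retrieve("forecast", {"finance"}, index)
```

The contrast with the prompting approach is that here the model is never asked to withhold anything -- the confidential record simply isn't in its input.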