LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives
(arxiv.org)
26 points by oldfuture 4 hours ago
Google provides a great tool to reduce the attack surface: https://github.com/google-research/arxiv-latex-cleaner
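The core idea behind such a cleaner is simple: strip LaTeX comments before submission. A minimal sketch of that one step (the real tool handles many more cases, such as `\iffalse` blocks, unused files, and figure downsizing):

```python
import re

def strip_latex_comments(tex: str) -> str:
    """Blank out LaTeX comments while keeping escaped percent signs (\\%)."""
    cleaned_lines = []
    for line in tex.splitlines():
        # A '%' starts a comment unless escaped as '\%'. Keep a bare '%'
        # so a trailing comment still suppresses the newline as before.
        cleaned_lines.append(re.sub(r"(?<!\\)%.*", "%", line))
    return "\n".join(cleaned_lines)

source = "Results are significant. % TODO: rerun with seed 42\n" \
         r"Accuracy: 95\% on the test set."
print(strip_latex_comments(source))
```

This sketch ignores edge cases like a literal backslash before a comment (`\\%...`) and verbatim environments, which a production cleaner has to handle.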
Paper LaTeX files often contain surprising details. When a paper lacks code, reading the LaTeX source has become part of my reproduction workflow. The comments often reveal non-trivial insights; frequently they contain a simpler version of the methodology section, which is purposely obscured with mathematical jargon for the sake of perceived "novelty".
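That workflow is easy to automate: pull the comments out of a downloaded source file and skim them. A rough sketch, with the same minimal handling of escaped `\%` (file names and output format are just illustrative):

```python
import re

def extract_comments(tex: str) -> list[str]:
    """Collect non-empty LaTeX comments, tagged with their line number."""
    comments = []
    for lineno, line in enumerate(tex.splitlines(), start=1):
        # Match a '%' that is not escaped as '\%', capture the rest of the line.
        m = re.search(r"(?<!\\)%(.*)", line)
        if m and m.group(1).strip():
            comments.append(f"{lineno}: {m.group(1).strip()}")
    return comments

sample = "\\section{Method}\n% simpler explanation: we just average\nAccuracy: 95\\% held out."
for comment in extract_comments(sample):
    print(comment)
```

The source tarball itself is available for most arXiv papers via the "TeX Source" download link on the abstract page.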
I sort of understand the reasoning for why arXiv prefers TeX to PDF[1], even though I feel it's a bit much to make it mandatory to submit the original TeX file if they detect that a submitted PDF was produced from one. But I've never understood the added value of hosting the source publicly.
Though I have to admit, when I was still in academia, whenever I saw a beautiful figure or nice formatting in a preprint, I'd often take some inspiration from the source for my own work, occasionally learning a neat new trick or package.
A huge value in having authors upload the original source is that it (mostly) divorces the content from the presentation. Having the original sources available is what made it possible to automatically render a large majority of the corpus into HTML for easier reading on many devices: https://info.arxiv.org/about/accessible_HTML.html. I don't think it would have been as simple if they had to convert PDFs.
As far as I can tell they trawled a big archive for sensitive information, (unsurprisingly) found some, and then didn't try to contact anyone affected before telling the world "hey, there are login credentials to be found in here".
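The comment doesn't say how the paper's scan worked, but secret scanning generally boils down to running regex rules over text. A minimal hypothetical sketch (these two patterns are illustrative, not the paper's actual rules; real scanners like trufflehog or gitleaks use hundreds of rules plus entropy checks to cut false positives):

```python
import re

# Illustrative detection rules only, keyed by a label for reporting.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(
        r"(?i)api[_-]?key\s*[:=]\s*['\"]?([A-Za-z0-9]{20,})"
    ),
}

def scan(text: str) -> list[tuple[str, str]]:
    """Return (rule_name, matched_text) pairs for every pattern hit."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits

# A leaked credential hiding in a LaTeX comment would be caught like this:
print(scan("% api_key = 'abcd1234abcd1234abcd1234'"))
```

Responsible disclosure would mean running something like this, then contacting the affected authors before publishing the findings, which is exactly the step the comment says was skipped.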