Comment by dredmorbius 21 hours ago
This reduces to parsing PDFs, which is a hard and still unsolved problem.
At low volumes, my preferred approach is to select and extract the text (copy/paste by hand, or the poppler library for larger-scale work), dump it to plain text, and convert that to Markdown, either manually or with a script. From there you can get to PDF or pretty much any other format with tools such as pandoc.
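For the scripted variant, a rough sketch of that pipeline in Python might look like the following. It assumes the `pdftotext` command-line tool (shipped with poppler's utilities) and `pandoc` are both on the PATH. The function names and the cleanup rules in `text_to_markdown` are illustrative assumptions, not a general PDF-to-Markdown converter; real documents usually need hand-tuning at that step.

```python
import re
import shutil
import subprocess


def pdf_to_text(pdf_path: str) -> str:
    """Extract text using poppler's pdftotext CLI ('-' writes to stdout)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler utilities) not found on PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def text_to_markdown(text: str) -> str:
    """Hypothetical cleanup pass: plain paragraphs are already valid Markdown.

    Caveat: the de-hyphenation rule also merges genuine compounds that
    happen to break at a line end (e.g. "well-" / "known").
    """
    # pdftotext separates pages with form feeds; treat them as paragraph breaks.
    text = text.replace("\f", "\n\n")
    # Rejoin words hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines delimit paragraphs; unwrap hard line breaks inside each one.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p) + "\n"


def convert_with_pandoc(md_path: str, out_path: str) -> None:
    """Let pandoc infer the target format from the output file extension."""
    subprocess.run(["pandoc", md_path, "-o", out_path], check=True)
```

Typical use would be to write `text_to_markdown(pdf_to_text("in.pdf"))` to a `.md` file, fix it up by hand, then call `convert_with_pandoc("in.md", "out.html")`. Note that pandoc needs a separate PDF engine (such as a LaTeX install) to produce PDF output.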