Comment by dredmorbius

This reduces to parsing PDFs, which is an unsolved hard problem.

At low volumes, my preferred approach is to select and extract text (copy/paste, perhaps using the poppler library for larger-scale work), dump that to plain-text and convert that (manually / scripted) to Markdown. From there you can get to PDF or pretty much any other format through tools such as pandoc.