Comment by brotchie 14 hours ago

8 replies

You'd think the go-to workflow for releasing redacted PDFs would be to draw black rectangles and then rasterize to image-only PDFs :shrug:

selinkocalar 13 hours ago

As someone who's built an entire business on "anti-screenshots" this is brilliant.

PDF redaction fails are everywhere and it's usually because people don't understand that covering text with a black box doesn't actually remove the underlying data.

I see this constantly in compliance. People think they're protecting sensitive info but the original text is still there in the PDF structure.
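To make that concrete, here is a minimal sketch using a toy, uncompressed page content stream. The stream contents and the helper names `extract_shown_text` and `has_filled_rectangle` are made up for illustration. Real PDFs usually compress their streams and can encode text in other ways, so treat this as a picture of the failure mode, not a redaction checker.

```python
import re

# Hypothetical, hand-written content stream: text is shown first,
# then a black rectangle is painted over the same area.
content_stream = b"""
BT
/F1 12 Tf
72 700 Td
(SSN: 123-45-6789) Tj
ET
0 0 0 rg
70 695 150 16 re
f
"""

def extract_shown_text(stream: bytes) -> list:
    """Return the literal strings passed to the Tj (show text) operator."""
    matches = re.findall(rb"\(([^()]*)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]

def has_filled_rectangle(stream: bytes) -> bool:
    """True if the stream appends a rectangle (re) and then fills it (f)."""
    return re.search(rb"\bre\s+f\b", stream) is not None

# The rectangle is drawn, yet the text operand is still in the stream.
print(has_filled_rectangle(content_stream))
print(extract_shown_text(content_stream))
```

The black box only changes what gets painted on screen. Anything that reads the stream directly (copy/paste, search, a text extractor) still sees the string.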

shbooms 13 hours ago

Oftentimes you will have requirements that the documents you release be digitally searchable, so in those cases this would not be an option.

  • pottertheotter 12 hours ago

    This made me think of something I came across recently that’s almost the opposite problem of requiring PDFs to be searchable. A local government would publish PDFs where the text is clearly readable on screen, but the selectable text layer is intentionally scrambled, so copy/paste or search returns garbage. It's a very hostile thing to do, especially with public data!

    • eviks 6 hours ago

      Hostile indeed, and also happens in user-facing documents like product manuals!

  • 8note 13 hours ago

    Run some OCR on them afterwards to recreate the text layer?

    • albert_e 8 hours ago

      With the aggressive push of LLMs and generative AI, I am expecting a lot of OCR features to become "smarter" by default, namely to go beyond mechanical OCR and start inserting hallucinations and semantically/contextually "more correct" information into OCR output.

      It's not hard to imagine some powerful LLMs being able to undo some light redactions that are deducible based on context
