
DEFINITION REDACTOR INSTALL
Tests require some additional packages:

    pip install -r requirements-dev.txt

If you're redacting metadata, you should check the output using pdfinfo from the poppler-utils package to confirm that the metadata is fully redacted.

Exotic fonts

This tool has a limited understanding of glyph-to-Unicode codepoint mappings. Some unusual fonts may not be processed correctly, in which case text layer redaction regular expressions may not match or substitution text may not render correctly.
DEFINITION REDACTOR FULL
To decompress a PDF with qpdf, redact it, and then re-compress and linearize the result, the full command would be something like:

    qpdf --stream-data=uncompress document.pdf - \
     | python3 pdf_redactor.py > /tmp/temp.pdf \
     && qpdf --linearize /tmp/temp.pdf document-redacted.pdf

(qpdf's first argument can't be standard input, unfortunately, so a one-liner isn't possible.)
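
For scripted use, the same decompress, redact, and linearize steps could be driven from Python. This is only a sketch: the function names `build_commands` and `run_pipeline` are hypothetical, and it assumes that `qpdf` and `python3` are on the PATH and that `pdf_redactor.py` reads a PDF on standard input and writes the redacted PDF to standard output, as in the shell command above.

```python
import subprocess


def build_commands(src, tmp, dst):
    """Return the argv lists for the decompress, redact, and linearize steps."""
    decompress = ["qpdf", "--stream-data=uncompress", src, "-"]  # "-" = stdout
    redact = ["python3", "pdf_redactor.py"]                      # stdin -> stdout
    linearize = ["qpdf", "--linearize", tmp, dst]
    return decompress, redact, linearize


def run_pipeline(src, tmp, dst):
    """Run the three steps, using a temporary file between redact and linearize."""
    decompress, redact, linearize = build_commands(src, tmp, dst)
    with open(tmp, "wb") as tmp_file:
        p1 = subprocess.Popen(decompress, stdout=subprocess.PIPE)
        p2 = subprocess.Popen(redact, stdin=p1.stdout, stdout=tmp_file)
        p1.stdout.close()  # only p2 should hold the read end of the pipe
        redact_status = p2.wait()
        p1.wait()  # reap the qpdf process
    # Like the shell's "&&", continue only if the redaction step succeeded.
    if redact_status != 0:
        raise RuntimeError("redaction step failed")
    subprocess.run(linearize, check=True)


# Example call (not executed here):
# run_pipeline("document.pdf", "/tmp/temp.pdf", "document-redacted.pdf")
```

The temporary file is kept for the same reason as in the shell version: qpdf cannot read its input from standard input, so the redacted output has to be written to disk before it is linearized.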
DEFINITION REDACTOR PDF
Content stream compression

Because pdfrw doesn't support all content stream compression methods, you should use a tool like qpdf to decompress the PDF prior to using this tool, and then to re-compress and web-optimize (linearize) the PDF after.
DEFINITION REDACTOR HOW TO
PDF documents have so many exotic capabilities that it would be difficult to list them all, so the limitations described here are only a very partial list. It would take a lot more effort to write a redaction tool that scanned every place content can be hidden inside a PDF beyond the places that this tool looks at, so please be aware that it is your responsibility to ensure that the PDFs you use this tool on only use the capabilities of the PDF format that this tool knows how to redact.

One of the PDF format's strengths is that it embeds font information so that documents can be displayed even if the fonts used to create the PDF aren't available when the PDF is viewed. Most PDFs are optimized to only embed the font information for characters that are actually used in the document. So if a document doesn't contain a particular letter or symbol, information for rendering the letter or symbol is not stored in the PDF.

This has an unfortunate consequence for redaction in the text layer. Since redaction in the text layer works by performing simple text substitution in the text stream, you may create replacement text that contains characters that were not previously in the PDF. Those characters simply won't show up when the PDF is viewed because the PDF doesn't contain any information about how to display them.

To get around this problem, pdf_redactor checks your replacement text for new characters and replaces them with characters from the content_replacement_glyphs list (defaulting to ?, #, *, and a space) if any of those characters are present in the font information already stored in the PDF. Hopefully at least one of those characters is present (maybe none are!), and in that case your replacement text will at least show up as something and not disappear.
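
The glyph fallback idea, swapping characters the embedded font cannot render for a character from a short list of stand-ins, can be sketched in a few lines. This is an illustration only, not pdf_redactor's actual code: the function name, the `font_chars` set, and the choice to use the first available stand-in are all assumptions made for the example.

```python
# Default stand-ins, mirroring the defaults described above: ?, #, *, and a space.
DEFAULT_REPLACEMENT_GLYPHS = ["?", "#", "*", " "]


def substitute_missing_glyphs(replacement, font_chars,
                              fallback_glyphs=DEFAULT_REPLACEMENT_GLYPHS):
    """Swap characters that the font cannot render for an available stand-in.

    font_chars is a hypothetical set of the characters the embedded font
    already knows how to draw.
    """
    # Assumption: use the first stand-in that the font contains, if any.
    fallback = next((g for g in fallback_glyphs if g in font_chars), None)
    result = []
    for ch in replacement:
        if ch in font_chars or fallback is None:
            # Renderable, or no stand-in exists: leave the character as-is
            # (in the second case it would simply not be displayed).
            result.append(ch)
        else:
            result.append(fallback)
    return "".join(result)


font_chars = set("Helo wrd#")
print(substitute_missing_glyphs("XXX", font_chars))  # prints "###"
```

Here the font has no "X" and no "?", but it does contain "#", so each "X" becomes "#". If none of the stand-ins were available, the text would pass through unchanged and the missing characters would not render.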

This Python module is a general tool to help you automatically redact text from PDFs. Pdf-redactor uses pdfrw under the hood to parse and write out the PDF.

The tool operates on the text layer of the document's pages (content stream text) and on the Document Information Dictionary (the PDF's metadata). Graphical elements, images, and other embedded resources are not touched.

You can use regular expressions to perform text substitution on the text layer.
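
As a rough illustration of regular-expression substitution on extracted text, the sketch below applies a list of (pattern, replacement function) pairs to a string. It uses only Python's standard `re` module and does not call pdf_redactor itself. The `redact_text` helper, the `filters` list, and the social-security-number pattern are hypothetical choices for the example.

```python
import re

# Hypothetical filter list: each entry pairs a compiled pattern with a
# function that receives the match object and returns the replacement text.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
filters = [(SSN_PATTERN, lambda match: "XXX-XX-XXXX")]


def redact_text(text, filters):
    """Apply every (pattern, replacement) pair to the text in order."""
    for pattern, replacement in filters:
        text = pattern.sub(replacement, text)
    return text


sample = "Applicant SSN: 123-45-6789, phone 555-0100."
redacted = redact_text(sample, filters)
print(redacted)  # prints "Applicant SSN: XXX-XX-XXXX, phone 555-0100."
```

Keeping the replacement as a function rather than a fixed string leaves room for replacements that depend on the matched text, such as preserving its length.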

A general-purpose PDF text-layer redaction tool, in pure Python, by Joshua Tauberer and Antoine McGrath.
