Not Just Hands, Your PDFs Also Need to be Sanitized

Cybercriminals often harvest sensitive data from poorly sanitized PDFs. Research suggests that thorough sanitization is needed to keep published PDFs from being exploited.

Most organizations and security agencies publish and share Portable Document Format (PDF) files without proper sanitization, leaving them open to data theft. Cybersecurity experts suggest that most users and businesses are unaware that cybercriminals often target such PDF documents to pilfer sensitive information and use it to attack an organization.

A recent analysis found that security agencies are not sanitizing PDF documents before publishing them. The researchers collected a corpus of 39,664 PDF files published by 75 security agencies from 47 countries to assess the quality and quantity of data leaked by these files. They found that the files can be misused to find weaknesses in an organization, for example by identifying employees who use outdated software.

What is PDF Sanitization? 

PDF sanitization is a process of removing classified and sensitive data from a protected document before its publication. Also known as Data Anonymization, the sanitization process reduces the document’s classification level, possibly making the document an unclassified one.

Low Adoption of Sanitization

The analysis also revealed that few security agencies have adopted sanitization. Only seven of the organizations used it to remove hidden sensitive information from their PDF files before publishing, and 65% of these sanitized PDFs still contained sensitive information because some organizations use weak sanitization techniques. A proper sanitization procedure requires removing all the hidden sensitive data from the PDF documents, not simply deleting the important data that is visible.

Hidden Data Found in PDF Files 

According to the National Security Agency (NSA), 11 types of hidden data and embedded content can be found in PDF files. These include:

  • Metadata
  • Embedded Content and Attached Files
  • Scripts
  • Hidden Layers
  • Embedded Search Index
  • Stored Interactive Form Data
  • Reviewing and Commenting
  • Hidden Page, Image, and Update Data
  • Obscured Text and Images
  • PDF Comments (Non-Displayed)
  • Unreferenced Data

The NSA stated that a PDF file is safe for publication and distribution only after removing these 11 types of hidden information from it.   
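
As a rough illustration of how some of the categories above can be spotted, the following Python sketch scans the raw bytes of a PDF for a few name tokens defined in the PDF specification (for example `/JavaScript` for scripts or `/EmbeddedFiles` for attachments). The `MARKERS` mapping and the function name `find_hidden_data_markers` are assumptions made for this example; they do not come from the NSA guidance or the analysis. The check is only a heuristic: it covers six of the 11 categories, and tokens stored inside compressed object streams will not be seen.

```python
import re

# Hypothetical mapping from some hidden-data categories to PDF name tokens.
# It is illustrative and incomplete: categories such as unreferenced data
# or obscured text cannot be detected by a simple token search.
MARKERS = {
    "Metadata": [b"/Info", b"/Metadata"],
    "Embedded Content and Attached Files": [b"/EmbeddedFiles"],
    "Scripts": [b"/JavaScript", b"/JS"],
    "Hidden Layers": [b"/OCProperties"],
    "Stored Interactive Form Data": [b"/AcroForm"],
    "Reviewing and Commenting": [b"/Annots"],
}


def find_hidden_data_markers(pdf_bytes: bytes) -> dict[str, list[str]]:
    """Return the categories whose marker tokens appear in the raw bytes."""
    found: dict[str, list[str]] = {}
    for category, tokens in MARKERS.items():
        hits = []
        for token in tokens:
            # Require that the token is not followed by a letter or digit,
            # so that "/Info" does not match a longer name like "/Information".
            pattern = re.escape(token) + rb"(?![A-Za-z0-9])"
            if re.search(pattern, pdf_bytes):
                hits.append(token.decode("ascii"))
        if hits:
            found[category] = hits
    return found


if __name__ == "__main__":
    # Tiny hand-written fragment, not a complete or valid PDF.
    sample = (
        b"%PDF-1.7\n"
        b"1 0 obj\n<< /Type /Catalog /Names << /JavaScript 2 0 R >> "
        b"/AcroForm 3 0 R >>\nendobj\n"
        b"trailer\n<< /Root 1 0 R /Info 4 0 R >>\n"
    )
    for category, tokens in find_hidden_data_markers(sample).items():
        print(f"{category}: {', '.join(tokens)}")
```

A hit only means the document deserves a closer look before publication; an empty result does not prove the file is clean.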

Levels of Sanitization

The NSA also listed four levels of sanitization:

Level-0: Consists of PDF files that include complete metadata information. There is no sanitization.

Level-1: Consists of PDF files with partial metadata after removing certain metadata fields.

Level-2: Consists of PDF files without any metadata.

Level-3: Consists of PDF files that have been properly cleaned and leak no information (full sanitization).
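
The four levels can be read as a simple decision rule. The Python sketch below is one possible encoding, under stated assumptions: "complete metadata" is taken to mean that every key in `STANDARD_INFO_KEYS` (the standard entries of a PDF document information dictionary) has a value, and the `other_hidden_data` flag stands in for whatever separate check detects non-metadata leakage. Both the function `sanitization_level` and these definitions are hypothetical and are not taken from the source.

```python
# Standard entries of the PDF document information dictionary.
# Treating "all of these present" as "complete metadata" is an
# assumption made for this sketch.
STANDARD_INFO_KEYS = frozenset({
    "Title", "Author", "Subject", "Keywords",
    "Creator", "Producer", "CreationDate", "ModDate",
})


def sanitization_level(metadata, other_hidden_data,
                       expected_keys=STANDARD_INFO_KEYS):
    """Map a file's metadata and a hidden-data flag to a level from 0 to 3.

    metadata: dict of metadata field name -> value found in the file.
    other_hidden_data: True if any non-metadata hidden data was detected.
    """
    # Only fields that actually carry a value count as present.
    present = {key for key, value in metadata.items() if value}
    if not present:
        # No metadata at all: Level-2, or Level-3 if nothing else leaks.
        return 2 if other_hidden_data else 3
    if expected_keys <= present:
        # Every expected field is still there: no sanitization.
        return 0
    # Some fields were removed, but metadata remains.
    return 1


if __name__ == "__main__":
    partial = {"Author": "J. Doe", "Producer": "ExampleWriter 1.0"}
    print(sanitization_level(partial, other_hidden_data=True))
    print(sanitization_level({}, other_hidden_data=False))
```

Level-3 cannot be established from metadata alone, which is why the sketch takes the hidden-data result as a separate input.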

“The issue is that popular PDF producer tools are keeping metadata by default with much other information while creating a PDF file. They provide no option for sanitization or it can only be achieved by following a complex procedure. Software producing PDF files needs to enforce sanitization by default. The user should be able to add metadata only as an option,” the researchers said.