Most organizations and security agencies publish and share Portable Document Format (PDF) files without proper sanitization, leaving them open to data theft. Cybersecurity experts suggest that most users and businesses are unaware that cybercriminals often target such documents to pilfer sensitive information and use it to attack an organization.
A recent analysis found that security agencies are not sanitizing PDF documents before sending them to others. The analysis collected a corpus of 39,664 PDF files published by 75 security agencies from 47 countries to measure the quality and quantity of data leaked by these files. It found that the files can be misused to find weaknesses in an organization, such as identifying employees who use outdated software.
What is PDF Sanitization?
PDF sanitization is the process of removing classified and sensitive data from a protected document before its publication. Also known as data anonymization, the process reduces the document's classification level, possibly making the document an unclassified one.
Low Adoption of Sanitization
The analysis also revealed that few security agencies have adopted a sanitization procedure. Only seven organizations used one to remove hidden sensitive information from their PDF files before publishing, and 65% of these sanitized PDFs still contained sensitive information because some organizations use weak sanitization techniques. A proper sanitization procedure requires removing all the hidden sensitive data from the PDF documents, not simply deleting important data.
Hidden Data Found in PDF Files
According to the National Security Agency (NSA), 11 types of hidden data and embedded content can be found in PDF files. These include:
- Metadata
- Embedded Content and Attached Files
- Scripts
- Hidden Layers
- Embedded Search Index
- Stored Interactive Form Data
- Reviewing and Commenting
- Hidden Page, Image, and Update Data
- Obscured Text and Images
- PDF Comments (Non-Displayed)
- Unreferenced Data
The NSA stated that a PDF file is safe for publication and distribution only after removing these 11 types of hidden information from it.
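As a rough illustration of how some of these categories can be spotted, the sketch below scans the raw bytes of a PDF for name tokens that commonly accompany them. The mapping from categories to tokens and the function name `find_hidden_data_markers` are assumptions made for this example, not an NSA-defined test. The check is only a heuristic: it misses anything stored inside compressed streams, and it does not cover categories such as obscured text or unreferenced data.

```python
import re

# PDF name tokens loosely associated with some of the categories above.
# This mapping is an illustrative assumption, not an official checklist.
MARKERS = {
    "Metadata": [b"/Metadata", b"/Author", b"/Creator", b"/Producer"],
    "Embedded Content and Attached Files": [b"/EmbeddedFile", b"/Filespec"],
    "Scripts": [b"/JavaScript", b"/JS"],
    "Hidden Layers": [b"/OCG", b"/OCProperties"],
    "Stored Interactive Form Data": [b"/AcroForm"],
    "Reviewing and Commenting": [b"/Annots"],
}


def find_hidden_data_markers(pdf_bytes: bytes) -> dict:
    """Return {category: [tokens found]} for tokens present in the raw bytes."""
    found = {}
    for category, names in MARKERS.items():
        hits = []
        for name in names:
            # The lookahead stops "/JS" from matching a longer name like "/JSX".
            pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
            if re.search(pattern, pdf_bytes):
                hits.append(name.decode("ascii"))
        if hits:
            found[category] = hits
    return found


# A tiny hand-written fragment standing in for a real file.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Author (J. Doe) /Producer (ExampleWriter 1.0) >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)
print(find_hidden_data_markers(sample))
```

For a real document, the bytes could be read with `open(path, "rb").read()`. A clean result from this scan does not show that a file is sanitized; it only means none of the listed tokens appear uncompressed.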
Levels of Sanitization
The NSA also listed four levels of sanitization:
Level-0: Consists of PDF files that include complete metadata information. There is no sanitization.
Level-1: Consists of PDF files with partial metadata after removing certain metadata fields.
Level-2: Consists of PDF files without any metadata.
Level-3: Consists of PDF files with no information leakage and properly cleaned (Full Sanitization).
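The four levels can be read as a simple decision rule. The sketch below is one possible interpretation, not a published algorithm: the function `sanitization_level` and its parameters are hypothetical, and "complete metadata" is assumed to mean that every field the producing tool writes by default is still present.

```python
def sanitization_level(expected_fields, fields_present, other_hidden_data):
    """Map a file's state to Level 0-3 under the assumptions stated above.

    expected_fields:   metadata fields the producer tool writes by default
    fields_present:    metadata fields actually found in the file
    other_hidden_data: True if any non-metadata hidden data remains
    """
    expected = set(expected_fields)
    present = set(fields_present)
    if expected and present >= expected:
        return 0  # complete metadata, no sanitization
    if present:
        return 1  # some metadata fields removed, some remain
    if other_hidden_data:
        return 2  # no metadata, but other hidden data still leaks
    return 3      # no metadata and no other leakage: full sanitization


# Hypothetical defaults for an imaginary producer tool.
defaults = {"Author", "Creator", "Producer", "CreationDate"}
print(sanitization_level(defaults, {"Producer"}, True))
```

Separating metadata from other hidden data reflects the difference between Level-2 and Level-3: removing every metadata field is not enough if scripts, attachments, or similar content remain.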
“The issue is that popular PDF producer tools are keeping metadata by default with much other information while creating a PDF file. They provide no option for sanitization or it can only be achieved by following a complex procedure. Software producing PDF files needs to enforce sanitization by default. The user should be able to add metadata only as an option,” the researchers said.