Securely Deleting Data

11 minute read

Published:

Imagine the following scenario: You’re editing a private document on your computer. When you’re done, you delete the file for security reasons. However, in an unforeseen turn of events, your computer gets stolen. In an ideal situation, the thief should be completely blocked from accessing any of your files. With full disk encryption and robust authentication measures (like a high-entropy user password), your data would generally be considered secure. But, what if the thief manages to crack your authentication system (perhaps your password wasn’t as robust as you believed)? In such a case, you would still hope that the deleted document remains unrecoverable. This is the essence of secure deletion: ensuring that, even if someone gains unrestricted access to your computer, your erased files remain confidential.

Secure deletion might seem like a basic aspect of data security, yet it’s not provided by most commodity file systems. For example, on popular Linux file systems like EXT4, deleting a file marks the file’s on-device data blocks as unallocated. However, the original data persists and is only replaced if the file system eventually chooses to allocate those blocks to a new file. Additionally, attempts to manually overwrite old file data, like using the Linux shred command for zero-filling, are often ineffective. This is particularly true with solid-state drives (SSDs) found in most modern laptops and desktops. SSDs employ wear-leveling techniques that inadvertently preserve file data by duplicating it across different parts of the drive, even after a file is supposedly deleted. For this reason, the implementation of software-based approaches for secure deletion can be tricky. One must either (a) make assumptions about how the underlying storage controller manages and writes data, or (b) treat the storage controller itself as adversarial.

Secure deletion has been extensively explored in previous research. While I won’t delve into a detailed analysis here, I recommend the following surveys by Zinkus, Jois, and Green and Reardon, Basin, and Capkun. The most common method for secure deletion is cryptographic erasure. This technique involves encrypting all data written to the disk and decrypting it upon reading. Securely deleting a file becomes a matter of making its corresponding encryption key unrecoverable. This is usually accomplished by only storing the encryption key in a secure, wipeable storage (often with limited capacity), and then erasing/rotating the key during a secure deletion operation.

Take Apple devices as an example. Apple devices encrypt the entire file system with a key, which is itself encrypted by an “effaceable key” stored in a reserved area of flash memory. When this effaceable key is wiped from memory, it renders the file system, and therefore the data, inaccessible.

Figure 1. Effaceable storage on Apple devices. The effaceable key encrypts the file system key which in turn encrypts the per-file keys. Note that this is a slightly simplified figure, and in reality, the file keys are additionally protected by one or more "class keys." The class keys are encrypted under a secure enclave UID key and the user's passcode.

However, Apple’s method offers a coarse-grained approach to secure deletion. There’s often a need for more refined control, such as per-file secure deletion. Consider a scenario where a user’s device is compromised without their knowledge, so that the user never issues a wipe command. Even if the user is aware that their device has been compromised, they may be unable to securely wipe it without facing legal consequences, e.g., in the case of device seizure by warrant. In such cases, the user would still expect any previously deleted files (prior to the intruder gaining access) to remain unrecoverable.

This is where our research enters the picture! My co-authors (Wittmann Goh, Abe Wieland, James Mickens, and Ryan Williams) and I wrote a paper that was recently accepted to Oakland’24! In this work, we designed Holepunch, a new block device driver that provides full disk encryption and secure deletion of file system data. Holepunch improves over the current state-of-the-art by:

  • making no assumptions about how storage hardware internally manages and writes data
  • reducing both the memory space and storage IOs needed to manage file keys
  • providing crash consistency, ensuring that the file system is usable after a crash or power outage occurs

Given these properties, we believe that Holepunch is the first practical system for securely deleting file system data.

Holepunch Overview

Holepunch leverages a cryptographic primitive known as a puncturable pseudorandom function (puncturable PRF). A psuedorandom function $F$ accepts a key $k$ and an input $x$ and outputs a bit string $y$ that is indisitnguishable from random.1 Given such an object, one can output many random strings that are suitable for cryptographic use, e.g., we can quickly generate the $n$ keys:

$k_1 = F(k, 1), k_2 = F(k, 2), \dots, k_n = F(k, n)$

and then use these keys to encrypt data. Notably, a puncturable PRF includes the additional property that the PRF key can be punctured at an evaluation point. For example, given a key $k$ and a point $x$, one can puncture $k$ to produce a new key $k’$ such that $F(k’, x) = \bot$. We can use this puncturing feature to “forget” how to generate certain cryptographic keys.

In Holepunch puncturable PRF generates the wrapping keys that encrypt the individual per-file keys, which in turn are used for encrypting file data. The puncturable PRF key is then encrypted by a master key that is only ever stored in tamper-resistant hardware such as a Trusted Platform Module (TPM).

Figure 3. A high-level overview of Holepunch.

To delete a file, the puncturable PRF key is punctured at a single point such that the new key can no longer generate the necessary wrapping key. Holepunch then rotates the master key stored within the TPM, and the newly punctured PRF key is then re-encrypted under the updated master key.

Unsurprisingly, there are some technical hurdles with making our approach efficient and usable. First, our implementation of the puncturable PRF is based on the Goldreich, Goldwasser, and Micali construction. As a result, the puncturable PRF key grows linearly with every puncture operation. Re-encrypting this ever-growing punctured PRF key under a newly rotated master key can become increasingly resource-intensive over time, potentially impacting system performance. Second, we require precise management and synchronization between the file system’s on-disk state and the TPM’s state. This coordination is essential to prevent data loss during unforeseen events like system crashes or power outages. If you’re interested in the technical details of how we solved these problems, check out our paper!

I’m happy to say that after solving some of these problems, our Holepunch implementation was able to match the performance of dm-crypt (standard full disk encryption without secure deletion) and Eraser. Improving upon the Eraser tool was one of the original motivations of this work and we were able to significantly reduce the memory footprint of key management using our puncturable PRF based scheme.

Figure 4. Baseline memory consumption of Eraser (Onarlioglu et al.) compared to various Holepunch configurations.

Conclusion

Secure deletion frameworks are increasingly relevant for several reasons. Governments worldwide have begun enforcing laws that regulate the collection and retention of user data. Notable examples include the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Both of these grant users a “right to be forgotten,” enabling them to request the complete removal of their information from a business’s records. While this right is a boon for privacy advocates, its effective implementation heavily relies on robust technical solutions that can enforce GDPR-style regulations in real-world scenarios.

Beyond legal requirements, secure deletion is also critical for mitigating inherent security risks associated with improper data sanitization practices. There have been notable incidents, such as the Pennsylvania Department of Labor and Industry inadvertently reselling computers with unsanitized drives containing state employee records, or the United States Veterans Administration distributing hard disks with sensitive patient data like credit card numbers. These examples highlight a prevalent issue: if devices don’t facilitate secure deletion, many individuals and organizations might neglect to properly sanitize storage media, leading to significant security breaches. Therefore, the implementation of secure deletion technology is not just a regulatory compliance measure but also a crucial step in safeguarding sensitive information for both individuals and organizations.

A Puncturable PRF Implementation in Python

As a bonus, if you’re interested in playing around with puncturable PRFs, here’s a reference implementation we programmed in Python that’s based on the GGM’86 construction.

import bisect, pyaes, secrets

KEY_PREFIX = 0
KEY_DEPTH = 1
KEY_VALUE = 2

class PPRF:

    def __init__(self, key, domain_bits=128):
        assert domain_bits <= 128, "PPRF only supports domain sizes up to 2^128" 
        assert len(key) == 16, "PPRF key must be 16 bytes"
        self.key = [(0, 0, key)]
        self.domain_bits = domain_bits
        # Used for length-doubling PRG
        self.inputs = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
        self.inputs += b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'

    # Length Doubling PRG
    # Expands 128-bit seed to 256-bit pseudorandom output
    # Uses AES-ECB as a PRF
    def __prg(self, seed):
        aes = pyaes.AESModeOfOperationECB(seed)
        ciphertext = aes.encrypt(self.inputs[:16])
        ciphertext += aes.encrypt(self.inputs[16:])
        return ciphertext

    # Punctures the PRF at a point x and returns a new punctured key
    def puncture(self, x):
        key, key_idx = self.__get_longest_matching_prefix(x)
        seed = key[KEY_VALUE]
        check_val = key[KEY_PREFIX]
        prefix = key[KEY_PREFIX]
        self.key.pop(key_idx)
        new_keys = []
        for i in range(key[KEY_DEPTH], self.domain_bits):
            prg_output = self.__prg(seed)
            bit = x >> (self.domain_bits - 1 - i) & 1
            prefix_add = (1 - bit) * (2 ** (self.domain_bits - 1 - i))
            if bit:
                seed = prg_output[16:]
                bisect.insort(new_keys, (prefix + prefix_add, i + 1, prg_output[:16]))
                check_val += (2 ** (self.domain_bits - 1 - i))
            else:
                seed = prg_output[:16]
                bisect.insort(new_keys, (prefix + prefix_add, i + 1, prg_output[16:]))

            prefix += bit * (2 ** (self.domain_bits - 1 - i))

        # Key was already punctured at this point.
        if check_val != x:
            self.key.insert(key_idx, key)
            return self.key
 
        self.key[key_idx:key_idx] = new_keys
        return self.key

    # Get key that can evaluate the point x
    def __get_longest_matching_prefix(self, x):
        i = bisect.bisect_left(self.key, (x, 2 ** self.domain_bits, 2 ** self.domain_bits))

        if i == len(self.key):
            return self.key[i - 1], i - 1
        elif self.key[i] == x:
            return self.key[i], i

        return self.key[i - 1], i - 1

    def get_key_idx(self, x):
        _, idx = self.__get_longest_matching_prefix(x)
        return idx

    # Evaluate the PPRF at a point x
    def eval(self, x):
        key,_ = self.__get_longest_matching_prefix(x)
        seed = key[KEY_VALUE]
        check_val = key[KEY_PREFIX]

        for i in range(key[KEY_DEPTH], self.domain_bits):
            prg_output = self.__prg(seed)
            if x >> (self.domain_bits - 1 - i) & 1:
                seed = prg_output[16:]
                check_val += (2 ** (self.domain_bits - 1 - i))
            else:
                seed = prg_output[:16]

        # value already punctured
        if check_val != x:
            seed = None

        return seed

uniformly random string.

  1. Technically speaking, the output of the pseudorandom function is computationally indistinguishable from random. This means that any computationally-bounded (polynomial-time) adversary cannot distinguish the output of the PRF from a