Plan B for UUIDs: double AES-128

Hacker News - Fri Aug 5 13:23

It looks like internauts are having another go at the “UUID as primary key” debate, where the fundamental problem is the tension between nicely structured primary keys that tend to improve spatial locality in the storage engine, and unique but otherwise opaque identifiers that avoid running into Hyrum’s law when communicating with external entities and generally prevent unintentional information leakage.1

I guess I’m lucky that the systems I’ve worked on mostly fall in two classes:2

  1. those with trivial write load (often trivial load in general), where the performance implications of UUIDs for primary keys are irrelevant.

  2. those where performance concerns lead us to heavily partition the data, by tenant if not more finely… making information leaks from sequentially allocation a minor concern.

Of course, there’s always the possibility that a system in the first class eventually handles a much higher load. Until roughly 2016, I figured we could always sacrifice some opacity and switch to one of the many k-sorted alternatives created by web-scale companies.

By 2016-17, I felt comfortable assuming AES-NI was available on any x86 server,3 and that opens up a different option: work with structured “leaky” keys internally, and encrypt/decrypt them at the edge (e.g., by printing a user-defined type in the database server). Assuming we get the cryptography right, such an approach lets us have our cake (present structured keys to the database’s storage engine), and eat it too (present opaque unique identifiers to external parties), as long as the computation overhead of repeated encryption and decryption at the edge remains reasonable.

I can’t know why this approach has so little mindshare, but I think part of the reason must be that developers tend to have an outdated mental cost model for strong encryption like AES-128.4 This quantitative concern is the easiest to address, so that’s what I’ll do in this post. That leaves the usual hard design questions around complexity, debuggability, and failure modes… and new ones related to symmetric key management.

Brandur compares sequential keys and UUIDs. I’m thinking more generally about “structured” keys, which may be sequential in single-node deployments, or include a short sharding prefix in smaller (range-sharded) distributed systems. Eventually, a short prefix will run out of bits, and fully random UUIDs are definitely more robust for range-sharded systems that might scale out to hundreds of nodes… especially ones focused more on horizontal scalability than single-node performance.

That being said, design decisions that unlock scalability to hundreds or thousands of nodes have a tendency to also force you to distribute work over a dozen machines when a laptop might have sufficed.

Mentioning cryptography makes people ask for a crisp threat model. There isn’t one here (and the question makes sense outside cryptography and auth!).

Depending on the domain, leaky or guessable external ids can enable scraping, let competitors estimate the creation rate and number of accounts (or, similarly, activity) in your application, or, more benignly, expose an accidentally powerful API endpoint that will be difficult to replace.

Rather than try to pinpoint the exact level of dedication we’re trying to foil, from curious power user to nation state actor, let’s aim for something that’s hopefully as hard to break as our transport (e.g., HTTPS). AES should be helpful.

Hardware-assisted AES: not not fast

Intel shipped their first chip with AES-NI in 2010, and AMD in 2013. A decade later, it’s anything but exotic, and is available even in low-power Goldmont Atoms. For consumer hardware, with a longer tail of old machines than servers, the May 2022 Steam hardware survey shows 96.28% of the responses came from machines that support AES-NI (under “Other Settings”), an availability rate somewhere between those of AVX (2011) and SSE4.2 (2008).

The core of the AES-NI extension to the x86-64 instruction set is a pair of instructions to perform one round of AES encryption (AESENC) or one round of decryption (AESDEC) on a 16-byte block. Andreas Abel’s uops.info shows that the first implementation, in Westmere, had a 6-cycle latency for each round, and that Intel and AMD have been optimising the instructions to bring their latencies down to 3 (Intel) or 4 (AMD) cycles per round.

That’s pretty good (on the order of a multiplication), but each instruction only handles one round. The schedule for AES-128, the fastest option, consists of 10 rounds: an initial whitening xor, 9 aesenc / aesdec and 1 aesenclast / aesdeclast. Multiply 3 cycles per round by 10 “real” rounds, and we find a latency of 30 cycles (\(+ 1\) for the whitening xor) on recent Intels and \(40 + 1\) cycles on recent AMDs, assuming the key material is already available in registers or L1 cache.

This might be disappointing given that AES128-CTR could already achieve more than 1 byte/cycle in 2013. There’s a gap between throughput and latency because pipelining lets contemporary x86 chips start two rounds per cycle, while prior rounds are still in flight (i.e., 6 concurrent rounds when each has a 3 cycle latency).

Still, 35-50 cycles latency to encrypt or decrypt a single 16-byte block with AES-128 is similar to a L3 cache hit… really not that bad compared to executing a durable DML statement, or even a single lookup in a big hash table stored in RAM.

A trivial encryption scheme for structured keys

AES works on 16 byte blocks, and 16-byte randomish external ids are generally accepted practice. The simplest approach to turn structured keys into something that’s provably difficult to distinguish from random bits probably goes as follows:

  1. Fix a global AES-128 key.
  2. Let primary keys consist of a sequential 64-bit id and a randomly generated 64-bit integer.5
  3. Convert a primary key to an external id by encrypting the primary key’s 128 bits with AES-128, using the global key (each global key defines a unique permutation from 128 bits input to 128 bit output).
  4. Convert an external id to a potential primary key by decrypting the external id with AES-128, using the same global key.

source: aes128.c

The computational core lies in the encode and decode functions, two identical functions from a performance point of view. We can estimate how long it takes to encode (or decode) an identifier by executing encode in a tight loop, with a data dependency linking each iteration to the next; the data dependency is necessary to prevent superscalar chips from overlapping multiple loop iterations.6

uiCA predicts 36 cycles per iteration on Ice Lake. On my unloaded 2 GHz EPYC 7713, I observe 50 cycles/encode (without frequency boost), and 13.5 ns/encode when boosting a single active core. That’s orders of magnitude less than a syscall, and in the same range as a slow L3 hit.

source: aes128-latency.c

This simple solution works if our external interface may expose arbitrary 16-byte ids. AES-128 defines permutation, so we could also run it in reverse to generate sequence/nonce pairs for preexisting rows that avoid changing their external id too much (e.g., pad integer ids with zero bytes).

However, it’s sometimes important to generate valid UUIDs, or to at least save one bit in the encoding as an escape hatch for a versioning scheme. We can do that, with format-preserving encryption.

Controlling one bit in the external encrypted id

We view our primary keys as pairs of 64-bit integers, where the first integer is a sequentially allocated identifier. Realistically, the top bit of that sequential id will always be zero (i.e., the first integer’s value will be less than \(2^{63}\)). Let’s ask the same of our external ids.

The code in this post assumes a little-endian encoding, for simplicity (and because the world runs on little endian), but the same logic works for big endian.

Black and Rogaway’s cycle-walking method can efficiently fix one input/output bit: we just keep encrypting the data until bit 63 is zero.

When decrypting, we know the initial (fully decrypted) value had a zero in bit 63, and we also know that we only re-encrypted when the output did not have a zero in bit 63. This means we can keep iterating the decryption function (at least once) until we find a value with a zero in bit 63.

source: aes128-cycle-walk.c

This approach terminates after two rounds of encryption (encode) or decryption (decode), in expectation.

That’s not bad, but some might prefer a deterministic algorithm. More importantly, the expected runtime scales exponentially with the number of bits we want to control, and no one wants to turn their database server into a glorified shitcoin miner. This exponential scaling is far from ideal for UUIDv4, where only 122 of the 128 bits act as payload: we can expect to loop 64 times in order to fix the remaining 6 bits.

Controlling more bits with a Feistel network

A Feistel network derives a permutation over tuples of values from hash functions over the individual values. There are NIST recommendations for general format-preserving encryption (FFX) with Feistel networks, but they call for 8+ AES invocations to encrypt one value.

FFX solves a much harder problem than ours: we only have 64 bits (not even) of actual information, the rest is just random bits. Full format-preserving encryption must assume everything in the input is meaningful information that must not be leaked, and supports arbitrary domains (e.g., decimal credit card numbers).

Our situation is closer to a 64-bit payload (the sequential id) and a 64-bit random nonce. It’s tempting to simply xor the payload with the low bits of (truncated) AES-128, or any PRF like SipHash7 or BLAKE3 applied to the nonce:

BrokenPermutation(id, nonce):
    id ^= PRF_k(nonce)[0:len(id)]  # e.g., truncated AES_k
    return (id, nonce)

The nonce is still available, so we can apply the same PRF_k to the nonce, and undo the xor (xor is a self-inverse) to recover the original id. Unfortunately, random 64-bit values could repeat on realistic database sizes (a couple billion rows). When an attacker observes two external ids with the same nonce, they can xor the encrypted payloads and find the xor of the two plaintext sequential ids. This might seem like a minor information leak, but clever people have been known to amplify similar leaks and fully break encryption systems.

Intuitively, we’d want to also mix the 64 random bits with before returning an external id. That sounds a lot like a Feistel network, for which Luby and Rackoff have shown that 3 rounds are pretty good:

PseudoRandomPermutation(A, B):
    B ^= PRF_k1(A)[0:len(b)]  # e.g., truncated AES_k1
    A ^= PRF_k2(B)[0:len(a)]
    B ^= PRF_k3(A)[0:len(b)]
    
    return (A, B)

This function is reversible (a constructive proof that it’s a permutation): apply the ^= PRF_k steps in reverse order (at each step, the value fed to the PRF passes unscathed), like peeling an onion.

If we let A be the sequentially allocated id, and B the 64 random bits, we can observe that xoring the uniformly generated B with a pseudorandom function’s output is the same as generating bits uniformly. In our case, we can skip the first round of the Feistel network; we deterministically need exactly two PRF evaluations, instead of the two expected AES (PRP) evaluations for the previous cycle-walking algorithm.

ReducedPseudoRandomPermutation(id, nonce):
    id ^= AES_k1(nonce)[0:len(id)]
    nonce ^= AES_k2(id)[0:len(nonce)]
    return (id, nonce)

This is a minimal tweak to fix BrokenPermutation: we hide the value of nonce before returning it, in order to make it harder to use collisions. That Feistel network construction works for arbitrary splits between id and nonce, but closer (balanced) bitwidths are safer. For example, we can work within the layout proposed for UUIDv8 and assign \(48 + 12 = 60\) bits for the sequential id (row id or timestamp), and 62 bits for the uniformly generated value.8

source: aes128-feistel.c

Again, we can evaluate the time it takes to encode (or symmetrically, decode) an internal identifier into an opaque UUID by encoding in a loop, with a data dependency between each iteration and the next (source: aes128-feistel-latency.c).

The format-preserving Feistel network essentially does double the work of a plain AES-128 encryption, with a serial dependency between the two AES-128 evaluations. We expect roughly twice the latency, and uiCA agrees: 78 cycles/format-preserving encoding on Ice Lake (compared to 36 cycles for AES-128 of 16 bytes).

On my unloaded 2 GHz EPYC 7713, I observe 98 cycles/format-preserving encoding (compared to 50 cycles for AES-128 of 16 bytes), and 26.5 ns/format-presering encoding when boosting a single active core (13.5 ns for AES-128).

Still much faster than a syscall, and, although twice as slow as AES-128 of one 16 byte block, not that slow: somewhere between a L3 hit and a load from RAM.

Sortable internal ids, pseudo-random external ids: not not fast

With hardware-accelerated AES-128 (SipHash or BLAKE3 specialised for 8-byte inputs would probably be slower, but not unreasonably so), converting between structured 128-bit ids and opaque UUIDs takes less than 100 cycles on contemporary x86-64 servers… faster than a load from main memory!

This post only addressed the question of runtime performance. I think the real challenges with encrypting external ids aren’t strictly technical in nature, and have more to do with making it hard for programmers to accidentally leak internal ids. I don’t know how that would go because I’ve never had to use this trick in a production system, but it seems like it can’t be harder than doing the same in a schemas that have explicit internal primary keys and external ids on each table. I’m also hopeful that one could do something smart with views and user-defined types.

Either way, I believe the runtime overhead of encrypting and decrypting 128-bit identifiers is a non-issue for the vast majority of database workloads. Arguments against encrypting structured identifiers should probably focus on system complexity, key management9 (e.g., between production and testing environments), and graceful failure in the face of faulty hardware or code accidentally leaking internal identifiers.

Thank you Andrew, Barkley, Chris, Jacob, Justin, Marius, and Ruchir, for helping me clarify this post, and for reminding me about things like range-sharded distributed databases.