So Zero It's ... Negative? (Zero-Copy #3)

lobste.rs - Fri Aug 5 18:31

This is part 3 of a three-part series on interesting abstractions for zero-copy deserialization I’ve been working on over the last year. This part is about eliminating the deserialization step entirely. Part 1 is about making it more pleasant to work with and can be found here, while Part 2 is about making it work for more types and can be found here. The posts can be read in any order, though only the first post contains an explanation of what zero-copy deserialization is.

And when Alexander saw the breadth of his work, he wept. For there were no more copies left to zero.

—Hans Gruber, after designing three increasingly unhinged zero-copy crates

Part 1 of this series attempted to answer the question “how can we make zero-copy deserialization pleasant”, while part 2 answered “how do we make zero-copy deserialization more useful?”.

This part goes one step further and asks “what if we could avoid deserialization altogether?”.

Speech bubble for character Confused pion

Wait, what?

Bear with me.

As mentioned in the previous posts, internationalization libraries like ICU4X need to be able to load and manage a lot of internationalization data. ICU4X in particular wants this part of the process to be as flexible and efficient as possible. The focus on efficiency is why we use zero-copy deserialization for basically everything, whereas the focus on flexibility has led to a robust and pluggable data loading infrastructure that allows you to mix and match data sources.

Deserialization is a great way to load data since it’s in and of itself quite flexible! You can put your data in a neat little package and load it off the filesystem! Or send it over the network! It’s even better when you have efficient techniques like zero-copy deserialization because the cost is low.

But the thing is, there is still a cost. Even with zero-copy deserialization, you have to validate the data you receive. It’s often a cost folks are happy to pay, but that’s not always the case.

For example, you might be, say, a web browser interested in using ICU4X, and you really care about startup times. Browsers typically need to set up a lot of stuff when being started up (and when opening a new tab!), and every millisecond counts when it comes to giving the user a smooth experience. Browsers also typically ship with most of the internationalization data they need already. Spending precious time deserializing data that you shipped with is suboptimal.

What would be ideal would be something that works like this:

static DATA: &Data = &serde_json::deserialize!(include_bytes!("./testdata.json"));

where you can have stuff get deserialized at compile time and loaded into a static. Unfortunately, Rust const support is not at the stage where the above code is possible whilst working within serde’s generic framework, though it might be in a year or so.

You could write a very unsafe version of serde::Deserialize that operates on fully trusted data and uses some data format that is easy to zero-copy deserialize whilst avoiding any kind of validation. However, this would still have some cost: you still have to scan the data to reconstruct the full deserialized output. More importantly, it would require a parallel universe of unsafe serde-like traits that everyone has to derive or implement, where even small bugs in manual implementations would likely cause memory corruption.

Speech bubble for character Positive pion

Sounds like you need some format that needs no validation or scanning to zero-copy deserialize, and can be produced safely. But that doesn’t exist, does it?

It does.

… but you’re not going to like where I’m going with this.

Speech bubble for character Positive pion

Oh no.

There is such a format: Rust code. Specifically, Rust code in statics. When compiled, Rust statics are basically “free” to load, beyond the typical costs involved in paging in memory. The Rust compiler trusts itself to be good at codegen, so it doesn’t need validation when loading a compiled static from memory. There is the possibility of codegen bugs, but we have to trust the compiler about that for the rest of our program anyway!

This is even more “zero” than “zero-copy deserialization”! Regular “zero-copy deserialization” still involves a scanning step and potentially a validation step; it’s really more about “zero allocations” than actually avoiding all of the copies. On the other hand, there are truly no copies or anything going on when you load Rust statics; the data is already ready to go as a &'static reference!
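To make “free to load” concrete, here is a minimal sketch of data living in a static. The `DecimalSymbols` type and the `load` function are made up for this illustration and are not ICU4X types; the point is only that “loading” amounts to taking a reference.

```rust
// Hypothetical data type, used only for illustration (not an ICU4X type).
#[derive(Debug, PartialEq)]
pub struct DecimalSymbols<'a> {
    pub decimal_separator: &'a str,
    pub grouping_separator: &'a str,
    pub grouping_size: u8,
}

// The compiler lays this value out in the compiled binary.
// Nothing is parsed, scanned, or validated at runtime.
pub static EN_SYMBOLS: DecimalSymbols<'static> = DecimalSymbols {
    decimal_separator: ".",
    grouping_separator: ",",
    grouping_size: 3,
};

// "Loading" is just handing out a reference to the static.
pub fn load() -> &'static DecimalSymbols<'static> {
    &EN_SYMBOLS
}

fn main() {
    let symbols = load();
    println!("{symbols:?}");
}
```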

We just have to figure out a way to “serialize to const Rust code” such that the resultant Rust code could just be compiled in to the binary, and people who need to load trusted data into ICU4X can load it for free!

Speech bubble for character Confused pion

What does “const code” mean in this context?

In Rust, const code is essentially code that can be proven to be side-effect-free, and it’s the only kind of code allowed in statics, consts, and const fns.

Speech bubble for character Confused pion

I see! Does this code actually have to be “constant”?

Not quite! Rust supports mutation and even things like loops in const code! Ultimately, it has to be the kind of code that can be computed at compile time with no difference in behavior: so no reading from files or the network, or using random numbers.
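As a small sketch of what that looks like in practice, the (made-up) function below mutates local variables and loops, yet can still initialize a static. It uses `while` rather than `for`, because `for` goes through the `Iterator` trait, and as far as I know that is not usable in const contexts on stable Rust.

```rust
// A `const fn` may mutate locals and loop, as long as the result is
// fully determined at compile time.
pub const fn sum_of_squares(n: u32) -> u32 {
    let mut total = 0;
    let mut i = 1;
    while i <= n {
        total += i * i;
        i += 1;
    }
    total
}

// Evaluated by the compiler; the static is initialized with the result.
pub static SUM: u32 = sum_of_squares(10);

fn main() {
    println!("{SUM}");
}
```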

For a long time only very simple code was allowed in const, but over the last year the scope of what that environment can do has expanded greatly, and it’s actually possible to do complicated things here, which is precisely what enables us to actually do “serialize to Rust code” in a reasonable way.

databake

A lot of the design here can also be found in the design doc. While I did the bulk of the design for this crate, it was almost completely implemented by Robert, who also worked on integrating it into ICU4X, and cleaned up the design in the process.

Enter databake (née crabbake). databake is a crate that provides just this: the ability to serialize your types to const code that can then be used in statics, allowing for truly zero-cost data loading, no deserialization necessary!

The core entry point to databake is the Bake trait:

pub trait Bake {
    fn bake(&self, ctx: &CrateEnv) -> TokenStream;
}

A TokenStream is the type typically used in Rust procedural macros to represent a snippet of Rust code. The Bake trait allows you to take an instance of a type, and convert it to Rust code that represents the same value.

The CrateEnv object is used to track which crates are needed, so that it is possible for tools generating this code to let the user know which direct dependencies are needed.

This trait is augmented by a #[derive(Bake)] custom derive that can be used to apply it to most types automatically:

// inside crate `bar`, module `module.rs`

use databake::Bake;

#[derive(Bake)]
#[databake(path = bar::module)]
pub struct Person<'a> {
   pub name: &'a str,
   pub age: u32,
}

As with most custom derives, this only works on structs and enums that contain other types that already implement Bake. Most types not involving mandatory allocation should be able to implement it.
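To give a feel for what “convert a value to Rust code” means, here is a deliberately simplified, std-only sketch. It is not databake’s API: the `BakeToSource` trait is a made-up stand-in that produces a `String` of source code instead of a `TokenStream` and has no `CrateEnv`, and the hand-written impl for `Person` only approximates the shape of what a derive might generate.

```rust
// Simplified stand-in for a `Bake`-like trait (assumption: not databake's
// real trait). It returns Rust source as a `String`.
pub trait BakeToSource {
    fn bake_to_source(&self) -> String;
}

impl BakeToSource for u32 {
    fn bake_to_source(&self) -> String {
        // Emit a suffixed integer literal, e.g. `30u32`.
        format!("{}u32", self)
    }
}

impl BakeToSource for &str {
    fn bake_to_source(&self) -> String {
        // `{:?}` emits a quoted, escaped string literal.
        format!("{:?}", self)
    }
}

pub struct Person<'a> {
    pub name: &'a str,
    pub age: u32,
}

// Roughly what a derive could write for us: a struct expression whose
// fields are the baked forms of each field, using the full type path.
impl BakeToSource for Person<'_> {
    fn bake_to_source(&self) -> String {
        format!(
            "bar::module::Person {{ name: {}, age: {} }}",
            self.name.bake_to_source(),
            self.age.bake_to_source()
        )
    }
}

fn main() {
    let person = Person { name: "Alice", age: 30 };
    println!("{}", person.bake_to_source());
}
```

The full path in the output is the reason the derive needs to be told where the type lives: the generated code may end up in a different crate than the one defining `Person`.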

How to use it

databake itself doesn’t really prescribe any particular code generation strategy. It can be used in a proc macro, in a build.rs, or even in a separate binary. ICU4X does the latter, since that’s just what ICU4X’s model for data generation is: clients can use the binary to customize the format and contents of the data they need.

So a typical way of using this crate might be to do something like this in build.rs:

use std::{env, fs, path::Path};

use databake::Bake;
use quote::quote;
use some_dep::Data;

fn main() {
   // load data from file (path is relative to build.rs, at the crate root)
   let json_data = include_str!("src/data.json");

   // deserialize from json
   let my_data: Data = serde_json::from_str(json_data).unwrap();

   // get a token tree out of it; the CrateEnv passed in records which
   // crates the generated code refers to
   let baked = my_data.bake(&Default::default());

   // Construct rust code with this in a static
   // The quote macro is used by procedural macros to do easy codegen,
   // but it's useful in build scripts as well.
   let my_data_rs = quote! {
      use some_dep::Data;
      static MY_DATA: Data = #baked;
   };

   // Write to file
   let out_dir = env::var_os("OUT_DIR").unwrap();
   let dest_path = Path::new(&out_dir).join("data.rs");
   fs::write(
      &dest_path,
      my_data_rs.to_string()
   ).unwrap();

   // (Optional step omitted: run rustfmt on the file)

   // tell Cargo that we depend on this file
   println!("cargo:rerun-if-changed=src/data.json");
}

What it looks like

ICU4X generates all of its test data into JSON, postcard, and “baked” formats. For example, for this JSON data representing how a particular locale does numbers, the “baked” data looks like this. That’s a rather simple data type, but we do use this for more complex data like date time symbol data, which is unfortunately too big for GitHub to render normally.

ICU4X’s code for generating this is in this file. It’s complicated primarily because ICU4X’s data generation pipeline is super configurable and complicated. The core thing that it does is, for each piece of data, call tokenize(), which is a thin wrapper around calling .bake() on the data and some other stuff. It then takes all of the data and organizes it into files like those linked above, populated with a static for each piece of data. In our case, we include all this generated Rust code into our “testdata” crate as a module, but there are many possibilities here!

For our “test” data, which is currently 2.7 MB in the postcard format (which is optimized for being lightweight), the same data ends up being 11 MB of JSON, and 18 MB of generated Rust code! That’s … a lot of Rust code, and tools like rust-analyzer struggle to load it. It’s of course much smaller once compiled into the binary, though that’s much harder to measure, because Rust is quite aggressive at optimizing unused data out in the baked version (where it has ample opportunity to). From various unscientific tests, it seems like 2 MB of deduplicated postcard data corresponds to roughly 500 KB of deduplicated baked data. This makes sense, since one can expect baked data to be near the theoretical limit of how small the data can be without applying some heavy compression. Furthermore, while we deduplicate baked data at a per-locale level, it can take advantage of LLVM’s ability to deduplicate statics further, so if, for example, two different locales have mostly the same data for a given data key with some differences, LLVM may be able to use the same statics for sub-data.

Limitations

const support in Rust still has a ways to go. For example, it doesn’t yet support creating objects like Strings, which usually live on the heap, though work is underway to allow this. This isn’t a huge problem for us; all of our data already supports zero-copy deserialization, which means that for every instance of our data types, there is some way to represent it as a borrow from another static.
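A small sketch of why zero-copy-friendly types sidestep the heap limitation: a type that can either own or borrow its data can always be written in its borrowed form in a static. The `MonthName` type here is made up for illustration.

```rust
use std::borrow::Cow;

// Hypothetical data type that can either borrow or own its string.
#[derive(Debug)]
pub struct MonthName<'a> {
    pub name: Cow<'a, str>,
}

// Not possible in a static: `String::from` allocates and is not `const`.
// static OWNED: MonthName<'static> =
//     MonthName { name: Cow::Owned(String::from("January")) };

// Fine: the borrowed variant just points at string data in the binary.
pub static JANUARY: MonthName<'static> = MonthName {
    name: Cow::Borrowed("January"),
};

fn main() {
    // No allocation has happened; the value is still the borrowed variant.
    assert!(matches!(JANUARY.name, Cow::Borrowed(_)));
    println!("{JANUARY:?}");
}
```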

A more pesky limitation is that you can’t interact with traits in const environments. To some extent, were that possible, the purpose of this crate could also have been fulfilled by making the serde pipeline const-friendly, and then the code snippet from the beginning of this post would work:

static DATA: &Data = &serde_json::deserialize!(include_bytes!("./testdata.json"));

The lack of trait support also means that for things like ZeroVec (see part 2), we can’t actually just make their safe constructors const and pass in data to be validated (the validation code is all behind traits), so we have to unsafely construct them. This is somewhat unfortunate; however, if the zerovec byte representation had trouble roundtripping we would have larger problems, so this doesn’t introduce a new surface of unsafety. We’re still able to validate things when generating the baked data; we just can’t get the compiler to also re-validate before agreeing to compile the const code.
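Here is a rough sketch of that tradeoff, using made-up names rather than zerovec’s actual API. `ByteVec` and `FromLeBytes` are hypothetical: the safe constructor validates through a trait method and so cannot be a const fn, while the unchecked constructor can be const and is what generated code would call.

```rust
use std::marker::PhantomData;

// Hypothetical trait describing how to validate and read elements.
pub trait FromLeBytes: Sized {
    const SIZE: usize;
    fn validate(bytes: &[u8]) -> bool;
    fn read(chunk: &[u8]) -> Self;
}

impl FromLeBytes for u32 {
    const SIZE: usize = 4;
    fn validate(bytes: &[u8]) -> bool {
        bytes.len() % 4 == 0
    }
    fn read(chunk: &[u8]) -> Self {
        u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]])
    }
}

// Hypothetical byte-backed vector (a stand-in, not zerovec's ZeroVec).
pub struct ByteVec<'a, T> {
    bytes: &'a [u8],
    marker: PhantomData<T>,
}

impl<'a, T: FromLeBytes> ByteVec<'a, T> {
    // Safe, but not `const`: validation goes through a trait method.
    pub fn try_from_bytes(bytes: &'a [u8]) -> Option<Self> {
        if T::validate(bytes) {
            Some(Self { bytes, marker: PhantomData })
        } else {
            None
        }
    }

    pub fn get(&self, index: usize) -> Option<T> {
        let start = index.checked_mul(T::SIZE)?;
        let end = start.checked_add(T::SIZE)?;
        Some(T::read(self.bytes.get(start..end)?))
    }
}

impl<'a, T> ByteVec<'a, T> {
    /// Unchecked constructor usable in statics.
    ///
    /// # Safety
    /// The caller promises `bytes` would pass `T::validate`.
    pub const unsafe fn from_bytes_unchecked(bytes: &'a [u8]) -> Self {
        Self { bytes, marker: PhantomData }
    }
}

// What generated code might look like: the generator validated these
// bytes when producing them, so the static skips validation.
pub static BAKED: ByteVec<'static, u32> =
    unsafe { ByteVec::from_bytes_unchecked(&[1, 0, 0, 0, 0, 1, 0, 0]) };

fn main() {
    println!("{:?} {:?}", BAKED.get(0), BAKED.get(1));
}
```

In this toy version a bad length would not cause memory unsafety, so the `unsafe` marker mostly documents the contract; in a real zero-copy type, skipping validation is where the unsafety lives.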

Try it out!

databake is much less mature compared to yoke and zerovec, but it does seem to work rather well so far. Try it out! Let me know what you think!

Thanks to Finch, Jane, Shane, and Robert for reviewing drafts of this post