Data Laced with History: Causal Trees & Operational CRDTs
After mulling over my bullet points, it occurred to me that the network problems I was dealing with—background cloud sync, editing across multiple devices, real-time collaboration, offline support, and reconciliation of distant or conflicting revisions—were all pointing to the same question: was it possible to design a system where any two revisions of the same document could be merged deterministically and sensibly without requiring user intervention?
It’s what happened after sync that was troubling. On encountering a merge conflict, you’d be thrown into a busy conversation between the network, model, persistence, and UI layers just to get back into a consistent state. The data couldn’t be left alone to live its peaceful, functional life: every concurrent edit immediately became a cross-architectural matter.
I kept several questions in mind while doing my analysis. Could a given technique be generalized to arbitrary and novel data types? Did the technique pass the PhD Test? And was it possible to use the technique in an architecture with smart clients and dumb servers?
Concurrent edits are sibling branches. Subtrees are runs of characters. By the nature of reverse timestamp+UUID sort, sibling subtrees are sorted in the order of their head operations.
This is the underlying premise of the Causal Tree. In contrast to all the other CRDTs I’d been looking into, the design presented in Victor Grishchenko’s brilliant paper was simultaneously clean, performant, and consequential. Instead of dense layers of theory and labyrinthine data structures, everything was centered around the idea of atomic, immutable, metadata-tagged, and causally-linked operations, stored in low-level data structures and directly usable as the data they represented.
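The sibling ordering described above can be sketched as a comparator. This is a minimal sketch in Python; the id scheme (a Lamport timestamp paired with a site UUID string) and the exact tie-break direction are my own illustrative assumptions, not taken from the paper:

```python
from functools import cmp_to_key

def sibling_order(a, b):
    """Order sibling atom ids: higher timestamp first, then site UUID."""
    if a[0] != b[0]:
        return -1 if a[0] > b[0] else 1  # reverse timestamp: newest first
    if a[1] != b[1]:
        return -1 if a[1] > b[1] else 1  # deterministic tie-break on UUID
    return 0

# Three hypothetical sibling head operations, as (timestamp, site) pairs.
siblings = [(3, "site-a"), (5, "site-b"), (3, "site-c")]
ordered = sorted(siblings, key=cmp_to_key(sibling_order))
# the timestamp-5 sibling subtree sorts ahead of both timestamp-3 subtrees
```

Because every peer applies the same deterministic comparison, concurrent sibling subtrees land in the same order everywhere, with no coordination.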
I’m going to be calling this new breed of CRDTs operational replicated data types—partly to avoid confusion with the existing term “operation-based CRDTs” (or CmRDTs), and partly because “replicated data type” (RDT) seems to be gaining popularity over “CRDT” and the term can be expanded to “ORDT” without impinging on any existing terminology.
Much like Causal Trees, ORDTs are assembled out of atomic, immutable, uniquely-identified and timestamped “operations” which are arranged in a basic container structure. (For clarity, I’m going to be referring to this container as the structured log of the ORDT.) Each operation represents an atomic change to the data while simultaneously functioning as the unit of data resultant from that action. This crucial event–data duality means that an ORDT can be understood as either a conventional data structure in which each unit of data has been augmented with event metadata; or alternatively, as an event log of atomic actions ordered to resemble its output data structure for ease of execution.
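This duality can be illustrated with a toy example: reading the same operation list once as an event log to replay, and once directly as data. All field names here are hypothetical, not from any particular implementation:

```python
# Each entry is both an event ("insert 'a' caused by op 1") and a unit
# of data (the character itself), tagged with id and causality metadata.
ops = [
    {"id": 1, "cause": 0, "do": "insert", "value": "c"},
    {"id": 2, "cause": 1, "do": "insert", "value": "a"},
    {"id": 3, "cause": 2, "do": "insert", "value": "t"},
]

# Reading 1: treat the log as events and replay each action, inserting
# every character just after its causing operation.
replayed = []
for op in ops:
    idx = next((i for i, v in enumerate(replayed) if v[0] == op["cause"]), -1)
    replayed.insert(idx + 1, (op["id"], op["value"]))

# Reading 2: treat the log as a data structure; since it is already
# ordered to resemble the output, each unit simply *is* the data.
direct = [op["value"] for op in ops]

assert "".join(v for _, v in replayed) == "".join(direct) == "cat"
```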
To implement a custom data type as a CT, you first have to “atomize” it, or decompose it into a set of basic operations, then figure out how to link those operations such that a mostly linear traversal of the CT will produce your output data. (In other words, make the structure analogous to a one- or two-pass parsable format.)
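As a hedged sketch of this atomization step for the simplest case, a plain string: each character becomes an insert operation causally linked to its predecessor, and a single linear pass over the causally-ordered log reproduces the string. The id scheme and names are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

OpId = Tuple[int, str]  # (timestamp, site) -- an assumed id scheme

@dataclass(frozen=True)  # operations are atomic and immutable
class Insert:
    id: OpId
    cause: Optional[OpId]  # the operation this character comes after
    char: str

def atomize(text: str, site: str = "site-a") -> list:
    """Decompose a string into causally-linked insert operations."""
    ops, cause = [], None
    for t, ch in enumerate(text, start=1):
        op = Insert(id=(t, site), cause=cause, char=ch)
        ops.append(op)
        cause = op.id
    return ops

def evaluate(ops: list) -> str:
    """One linear pass over the causally-ordered log yields the data."""
    return "".join(op.char for op in ops)

assert evaluate(atomize("CRDT")) == "CRDT"
```

A richer type (a tree, a register map, a vector drawing) would follow the same recipe: pick the smallest meaningful units, give each an id and a causal link, and arrange the log so a near-linear traversal emits the output.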
OT and CRDT papers often cite 50ms as the threshold at which people start to notice latency in their text editors. Therefore, any code we might want to run on a CT—including merge, initialization, and serialization/deserialization—has to fall within this range. Except for trivial cases, this precludes O(n²) or slower complexity: a 10,000-word article at 0.01ms per character would take 7 hours to process! The essential CT functions have to be O(n log n) at the very worst.
Of course, CRDTs aren’t without their difficulties. For instance, a CRDT-based document will always be “live”, even when offline. If a user inadvertently revises the same CRDT-based document on two offline devices, they won’t see the familiar pick-a-revision dialog on reconnection: both documents will happily merge and retain any duplicate changes. (With ORDTs, this can be fixed after the fact by filtering changes by device, but the user will still have to learn to treat their documents with a bit more caution.) In fully decentralized contexts, malicious users will have a lot of power to irrevocably screw up the data without any possibility of a rollback, and encryption schemes, permission models, and custom protocols may have to be deployed to guard against this. In terms of performance and storage, CRDTs contain a lot of metadata and require smart and performant peers, whereas centralized architectures are inherently more resource-efficient and only demand the bare minimum of their clients. You’d be hard-pressed to use CRDTs in data-heavy scenarios such as screen sharing or video editing. You also won’t necessarily be able to layer them on top of existing infrastructure without significant refactoring.
Perhaps a CRDT-based text editor will never quite be as fast or as bandwidth-efficient as Google Docs, for such is the power of centralization. But in exchange for a totally decentralized computing future? A world full of devices that control their own data and freely collaborate with one another? Data-centric code that’s entirely free from network concerns? I’d say: it’s surely worth a shot!
·archagon.net·
Netflix's head of design on the future of Netflix - Fast Company
At Netflix, we have such a diverse population of shows in 183 countries around the world. We’re really trying to serve up lots of stories people haven’t heard before. When you go into our environment, you’re like, “Ooh, what is that?” You’re almost kind of afraid to touch it, because you’re like, “Well, I don’t want to waste my time.”

That level of discovery is literally, I’m not bullshitting you, man, that’s the thing that keeps me up at night. How do I help figure out how to help people discover things, with enough evidence that they trust it? And when they click on it, they love it, and then they immediately ping their best friend, “Have you seen this documentary? It’s amazing.” And she tells her friends, and then that entire viral loop starts.
The discovery engine is very temporal. Member number 237308 could have been into [reality TV] because she or he just had a breakup. Now they just met somebody, so all of a sudden it shifts to rom-coms.

Now that person that they met loves to travel. So [they might get into] travel documentaries. And now that person that they’re with, they may have a kid, so they might want more kids’ shows. So, it’s very dangerous for us to ever kind of say, “This is what you like. You have a cat. You must like cat documentaries.”
We don’t see each other, obviously, and I don’t want to social network on Netflix. But knowing other humans exist there is part of it.

You answered the question absolutely perfectly. Not only because it’s your truth, but that’s what everyone says! That connection part. So another thing that goes back to your previous question, when you’re asking me what’s on my mind? It’s that. How do I help make sure that when you’re in that discovery loop, you still feel that you’re connected to others.

I’m not trying to be the Goth kids on campus who are like, “I don’t care about what’s popular.” But I’m also not trying to be the super poppy kids who are always chasing trends. There’s something in between which is, “Oh, hey, I haven’t heard about that, and I kind of want to be up on it.”
I am looking forward to seeing what Apple does with this and then figuring out more, how are people going to use it? Then I think that we should have a real discussion about how Netflix does it.

But to just port Netflix over? No. It’s got to make sure that it’s using the power of the system as much as humanly possible so that it’s really making that an immersive experience. I don’t want to put resources toward that right now.
On porting Netflix to Apple Vision Pro
The design team here at Netflix, we played a really big hand in how that worked because we had to design the back-end tool. What people don’t know about our team is 30% of our organization is actually designing and developing the software tools that we use to make the movies. We had to design a tool that allowed the teams to understand both what extra footage to shoot and how that might branch. When the Black Mirror team was trying to figure out how to make this narrative work, the software we provided really made that easier.
·fastcompany.com·