CEX Resilience · Cue Cards (with voice)

Slide 1

Opening

开场

▶ First — big thanks to Joshua. That was a fantastic share. Really great content.

首先,非常感谢 Joshua。刚才那段分享真的非常棒,内容也非常精彩。

▶ For my part, I'll focus on something different.

我这一部分会讲点不一样的。

▶ Over the past two years, we've had in-depth conversations with many CEX customers.

过去两年,我们和很多 CEX 客户有过深入交流。

▶ We saw the same patterns come up again and again.

我们反复看到一些共性的 pattern 出现。

▶ Today I want to share what we saw.

今天我想分享一下我们的观察。

▶ The common patterns. How architectures change over time. And how these exchanges build their resilience.

这些共性的 pattern。架构怎么随时间演进。以及这些交易所怎么构建他们的韧性。

▶ This 40-minute session has two parts.

这 40 分钟分两部分。

▶ First — me. I'll walk through the architecture patterns we see across CEX customers.

第一部分 — 我来讲。我会讲我们在 CEX 客户身上看到的架构 pattern。

▶ Second — my colleague Kenny will go deeper on Aeron, with more details.

第二部分 — 我的同事 Kenny 会深入讲 Aeron,以及更多细节。

▶ You'll see this on the agenda in a moment.

这些等下你会在 agenda 里看到。

▶ One more thing — please feel free to interrupt me anytime.

还有一点 — 任何时候都可以打断我。

▶ This is a conversation, not a one-way share.

这是一次交流,不是单向的分享。

▶ And if my English fails me, my colleagues here will rescue me.

另外如果我英语卡住了,旁边同事会救场。

▶ OK, let's start.

好,我们开始。

Slide 2

Agenda

议程

▶ Quick look at the agenda.

快速看一下议程。

▶ First — why resilience matters. We'll start with a short look at major AWS incidents over the past 15 years.

首先 — 为什么韧性重要。我们会快速回顾过去 15 年 AWS 的几次大事故。

▶ Second — our resilience framework. The mental model we use with customers.

其次 — 我们的韧性框架。我们和客户对话时用的思维模型。

▶ Third — CEX architecture tiers. From leg to head. And how each tier handles HA and DR.

第三 — CEX 架构分层。从腿部到头部。每一层是怎么做 HA 和 DR 的。

▶ At the end, I'll touch on Aeron Premium, and walk through one real deployment case.

最后,我会简单提一下 Aeron Premium,并过一个真实部署案例。

Slide 3

Three-Layer Defense + Chaos Engineering

三层防护 + 混沌工程

▶ So this is the mental model we use.

这是我们的思维模型。

▶ Resilience has three layers. Plus one practice that ties them together.

韧性分三层。加上一个贯穿全局的实践。

▶ Layer 1 is Backup. The mission is simple — no data loss.

第一层是 Backup。目标很简单 — 数据不丢。

▶ Periodic snapshots. Cross-region replication. Recovery drills.

定期快照。跨 Region 复制。恢复演练。

▶ Cost is low — about 5 to 10 percent.

成本很低 — 大概 5% 到 10%。

▶ Every customer needs this. No exceptions.

所有客户都需要。没有例外。

▶ Layer 2 is HA. The mission — no service outage.

第二层是 HA。目标 — 服务不停。

▶ Today, most mid-tier customers stop at single-AZ HA.

现在,大多数腰部客户停在单 AZ HA。

▶ But AZ-level failures do happen. They are rare, but they happen.

但 AZ 级故障真的会发生。罕见,但会发生。

▶ So we strongly recommend Multi-AZ HA as the new baseline.

所以我们强烈推荐 Multi-AZ HA 作为新基线。

▶ Cost goes up to 100 to 150 percent. But recovery becomes seconds to minutes.

成本上升到 100% 到 150%。但恢复时间是秒到分钟级。

▶ Layer 3 is DR. The mission — business continuity.

第三层是 DR。目标 — 业务延续。

▶ This is multi-region active or warm standby. For region-level events.

这是多 Region active 或 warm standby。应对 Region 级事件。

▶ For mission-critical workloads, this is not optional.

对核心业务来说,这不是可选项。

▶ Now — the most important part.

现在 — 最重要的部分。

▶ Chaos Engineering.

混沌工程。

▶ All these layers only work if you have actually tested them.

前面这些层,只有你真的演练过,才会 work。

▶ We see this again and again. Companies build great architectures.

我们反复看到这一点。公司建了很好的架构。

▶ Three-AZ. Multi-region. All the right patterns.

三 AZ。多 Region。所有该有的 pattern 都有。

▶ But when something real happens — the runbook is out of date.

但真出事的时候 — runbook 已经过时。

▶ The team has never practiced. Decisions get made under pressure.

团队从来没演练过。压力下做决定。

▶ Often at three in the morning.

经常是凌晨三点。

▶ Chaos Engineering adds only 2 to 5 percent in cost.

混沌工程只增加 2% 到 5% 的成本。

▶ But it tells you whether the previous 150 percent of investment will actually pay off.

但它决定了前面 150% 的投入是否真的有用。

▶ Without chaos drills, the rest is just paper architecture.

没有混沌演练,前面都是纸上谈兵。

Slide 4

Architecture Summary

架构分层概览

▶ Now let's look at where real CEX customers actually sit today.

现在我们看一下,真实的 CEX 客户今天都在什么位置。

▶ We map customers into five tiers — leg, waist, chest, shoulder, and head.

我们把客户分成五层 — 腿、腰、胸、肩颈、头。

▶ But — important — this is by architectural maturity, not by business size.

但要注意 — 这是按架构成熟度分的,不是按业务体量。

▶ Some very large exchanges by volume actually sit at the waist or chest tier.

有些体量很大的交易所,其实架构上还在腰部或胸部。

▶ At the bottom — leg tier.

最底下 — 腿部。

▶ Single-leader OMS. Kafka. Single-leader matching engine. Synchronous DB writes.

单主 OMS。Kafka。单主撮合引擎。同步写库。

▶ Throughput under 10K TPS per trading pair. P99 around 30 ms.

单币吞吐量 1 万以下。P99 大约 30 毫秒。

▶ The DB is the bottleneck.

瓶颈在数据库。

▶ Waist tier — most CEX customers are here.

腰部 — 大多数 CEX 客户在这里。

▶ Active-standby OMS. Sharding. In-memory matching with async DB.

主备 OMS。分片。内存化撮合,异步写库。

▶ 20K to 50K TPS. P99 10 to 20 ms.

2 万到 5 万 TPS。P99 10 到 20 毫秒。

▶ Some are multi-AZ. Many are still single-AZ.

有些是多 AZ。很多还停在单 AZ。

▶ Chest tier. OMS already on Raft. Kafka still in the middle. ME still active-standby.

胸部。OMS 已经上 Raft。Kafka 还在中间。ME 还是主备。

▶ 40K to 150K TPS. P99 around 15 ms.

4 万到 15 万 TPS。P99 大约 15 毫秒。

▶ The OMS upgrade to Raft is the key step here.

OMS 升级到 Raft,是这一层的关键一步。

▶ Shoulder tier. Full-stack Sofa-Jraft. Kafka removed from the trading path.

肩颈部。全链路 Sofa-Jraft。Kafka 从交易路径上去掉了。

▶ 150K to 500K TPS. P99 under 10 ms.

15 万到 50 万 TPS。P99 在 10 毫秒以内。

▶ TCP at the limit.

TCP 极限。

▶ Head tier. Aeron or in-house multicast. UDP-based, end-to-end Raft, mostly multi-AZ.

头部。Aeron 或自研组播。基于 UDP,全链路 Raft,大多多 AZ。

▶ 400K to 1.2 million TPS. P99 in single-digit milliseconds.

40 万到 120 万 TPS。P99 个位数毫秒。

▶ The key observation —

关键观察 —

▶ Head-tier exchanges are not just chasing performance.

头部交易所不只是在追性能。

▶ They also have the strongest resilience setup. Multi-AZ, automatic failover, the works.

他们的韧性设计也是最强的。多 AZ、自动切换,该有的都有。

▶ Below the shoulder, most customers are still single-AZ.

肩颈部以下,大多数客户还停在单 AZ。

▶ That is where we see the biggest engagement opportunity. Helping them evolve to multi-AZ without sacrificing latency.

这是我们看到最大的 engagement 空间。帮他们演进到多 AZ,同时不牺牲延迟。

Slide 5

Three Tech Stacks at a Glance

三种技术栈对比

▶ Now let's compress this view into three technology stacks.

现在我们把刚才看到的客户,归纳成三种技术栈来看。

▶ Kafka standard — about 80 percent of CEX customers sit here.

Kafka 标准 — 大约 80% 的 CEX 客户在这里。

▶ Mature ecosystem. Standardized components. Optional dual-AZ.

生态成熟。组件标准化。可选双 AZ 部署。

▶ P99 15 to 20 ms. 20K to 100K TPS. Easiest to operate.

P99 在 15 到 20 毫秒。2 万到 10 万 TPS。运维最简单。

▶ Sofa-Jraft over gRPC — about 10 percent.

Sofa-Jraft 配 gRPC — 大约 10%。

▶ No message broker. Heavy TCP optimization — Core Pinning, DPDK, NUMA, ENA.

没有消息中间件。TCP 极致优化 — Core Pinning、DPDK、NUMA、ENA。

▶ P99 under 10 ms. Up to 500K TPS.

P99 在 10 毫秒以内。最高到 50 万 TPS。

▶ Aeron over UDP — about 5 percent. Head tier.

Aeron 配 UDP — 大约 5%。头部所。

▶ Reliable UDP multicast. Raft consensus. GC-free. Ring buffer zero-copy. Memory-mapped files.

可靠 UDP 组播。Raft 共识。GC-free。Ring Buffer 零拷贝。Memory Mapped File。

▶ P99 under 1 ms. Up to 1.2 million TPS.

P99 在 1 毫秒以内。最高到 120 万 TPS。

▶ The key point — complexity and cost go up steeply as you move right.

关键点 — 越往右,复杂度和成本陡升。

▶ Kafka is three stars and three dollars.

Kafka 是三颗星、三块钱。

▶ Aeron is five and five.

Aeron 是五颗星、五块钱。

▶ So our advice is — match the stack to the business stage. Not the other way around.

所以我们的建议是 — 让技术栈匹配业务阶段。不是反过来。

▶ Don't pick Aeron because it's cool. Pick it because you've outgrown the alternatives.

不要因为 Aeron 酷就选它。要等你真的发现别的栈不够用了再选。

Slide 6

Learner vs Follower

▶ Quick concept page — Learner vs Follower.

快速过一个概念 — Learner 和 Follower。

▶ In Raft, Followers vote. They count toward quorum.

在 Raft 里,Follower 是投票的。他们计入多数派。

▶ Three voting nodes means the majority is two. A write succeeds once two of them confirm, and the Leader's own copy counts as one.

3 个投票节点,多数派就是 2。其中两个确认了,写入就成功,Leader 自己也算一个。

▶ Learners do not vote. They receive logs, apply logs. But they don't participate in elections.

Learner 不投票。他们接收日志,应用日志。但不参与选举。

▶ Why does this matter? Three reasons.

为什么这个重要?三个原因。

▶ One — read-only replicas. Distribute read load without slowing writes.

一 — 只读副本。分担读压力,不影响写入。

▶ Two — cross-region deployment. Far-away nodes don't slow down primary writes.

二 — 跨地域部署。远端节点不会拖慢主集群的写入。

▶ Three — node warm-up. A new node syncs as a Learner first, then gets promoted to Follower when ready.

三 — 节点预热。新节点先以 Learner 同步,准备好之后再提升。

▶ The key insight is on the right.

关键 insight 在右边。

▶ Three voting nodes, two Learners.

3 个投票节点,2 个 Learner。

▶ A write needs acks from two of the three voting nodes.

写入只需要 3 个投票节点中的 2 个确认。

▶ The two Learners — even if they're far away or in another region — do not block the write.

那 2 个 Learner — 即使在很远或者别的 Region — 不会阻塞写入。

▶ So this is how Jraft and Aeron achieve cross-region DR without sacrificing primary latency.

所以这就是 Jraft 和 Aeron 实现跨 Region DR 而不牺牲主集群延迟的原理。
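
The quorum rule on this slide can be sketched in a few lines. This is a minimal illustration, not Sofa-Jraft or Aeron code: the node names, the round-trip times, and the helpers `majority` and `commit_latency_ms` are all hypothetical, and the sketch assumes three voting nodes (one leader plus two followers) and two learners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    is_voter: bool   # leader and followers vote; learners do not
    rtt_ms: float    # hypothetical round-trip time from the leader


def majority(voter_count: int) -> int:
    """Smallest number of voters that forms a majority."""
    return voter_count // 2 + 1


def commit_latency_ms(nodes: list[Node]) -> float:
    """Time until a write commits: the leader waits only for the
    fastest majority of voters and never waits for learners."""
    voter_rtts = sorted(n.rtt_ms for n in nodes if n.is_voter)
    needed = majority(len(voter_rtts))
    return voter_rtts[needed - 1]


cluster = [
    Node("leader-az1", True, 0.0),          # the leader acks its own write
    Node("follower-az2", True, 1.0),
    Node("follower-az3", True, 1.5),
    Node("learner-region-b-1", False, 70.0),
    Node("learner-region-b-2", False, 75.0),
]

print(commit_latency_ms(cluster))  # 1.0
```

However slow the two learners are, the commit latency stays at the second-fastest voter, which is the property the slide relies on for cross-region DR.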

Slide 7

Resilience Evolution Across Three Stacks

三种栈的韧性演进

▶ Now — the evolution.

现在我们看演进路径。

▶ Two dimensions on this page.

这一页有两个维度。

▶ Horizontal — left to right — is the stack upgrade. Kafka, Jraft, Aeron.

横向 — 从左到右 — 是技术栈升级。Kafka、Jraft、Aeron。

▶ Vertical — bottom to top — is the resilience level.

纵向 — 从下往上 — 是韧性级别。

▶ Every customer follows some version of this path.

每个客户都会沿着这条路径的某个版本演进。

▶ Bottom — baseline.

最底层 — 基础。

▶ Kafka with single-leader OMS and ME. Jraft and Aeron with a basic Raft cluster.

Kafka 配单主 OMS 和 ME。Jraft 和 Aeron 配基础 Raft 集群。

▶ Next step — Baseline HA.

下一步 — 基础 HA。

▶ Kafka adds cross-AZ active-standby to OMS and ME.

Kafka 这边给 OMS 和 ME 加跨 AZ 的主备。

▶ Jraft and Aeron add cross-AZ Learners. Same purpose, different mechanism.

Jraft 和 Aeron 加跨 AZ 的 Learner。目的一样,机制不同。

▶ Next — Standard HA.

再上一层 — 标准 HA。

▶ Kafka customers begin migrating OMS to Raft. This is the qualitative jump.

Kafka 客户开始把 OMS 迁到 Raft。这是质变的一步。

▶ Jraft and Aeron — continue strengthening Multi-AZ Learners.

Jraft 和 Aeron — 继续加强多 AZ 的 Learner。

▶ Standard DR — cross-region Learner.

标准 DR — 跨 Region 的 Learner。

▶ You'll notice the Kafka column is empty here.

你会看到 Kafka 这一列是空的。

▶ That is intentional. Kafka here is the messaging layer. The authoritative trading state lives in OMS and ME.

这是故意的。Kafka 在这里是消息层。权威的交易状态在 OMS 和 ME 里。

▶ Cross-region replication on Kafka itself isn't where we recommend customers invest.

我们不建议客户在 Kafka 这一层投入跨 Region 复制。

▶ The right place to do cross-region DR is on the stateful layer — OMS, ME.

跨 Region DR 该做在有状态的服务上 — OMS、ME。

▶ That is where Raft and Aeron Learners shine.

这正是 Raft 和 Aeron Learner 发挥作用的地方。

▶ Top — Advanced DR.

顶层 — 进阶 DR。

▶ All three stacks can reach this. Cross-region Learners. Full multi-region active or warm standby.

三种栈都可以到达。跨 Region 的 Learner。完整的多 Region active 或 warm standby。

▶ The takeaway — every customer can be on this map.

启示 — 每个客户都能定位在这张图上。

▶ The conversation is — where are you now, and what's the next step.

对话就变成 — 你现在在哪里,下一步是什么。
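
The "where are you now, what's the next step" conversation can be sketched as a lookup over the map on this slide. The level names and cell contents below are taken from the cue cards above; the data layout and the `next_step` helper are assumptions made for illustration.

```python
LEVELS = ["Baseline", "Baseline HA", "Standard HA", "Standard DR", "Advanced DR"]

# Jraft and Aeron follow the same ladder in the notes above.
RAFT_STYLE = {
    "Baseline": "basic Raft cluster",
    "Baseline HA": "cross-AZ Learners",
    "Standard HA": "strengthened multi-AZ Learners",
    "Standard DR": "cross-region Learner",
    "Advanced DR": "multi-region active or warm standby",
}

RESILIENCE_MAP = {
    "Kafka": {
        "Baseline": "single-leader OMS and ME",
        "Baseline HA": "cross-AZ active-standby for OMS and ME",
        "Standard HA": "OMS migrated to Raft",
        "Standard DR": None,  # intentionally empty on the slide
        "Advanced DR": "multi-region active or warm standby",
    },
    "Jraft": RAFT_STYLE,
    "Aeron": RAFT_STYLE,
}


def next_step(stack: str, current_level: str):
    """Return (level, action) for the next populated cell above the
    current level, or None if the stack is already at the top."""
    start = LEVELS.index(current_level) + 1
    for level in LEVELS[start:]:
        action = RESILIENCE_MAP[stack][level]
        if action is not None:
            return level, action
    return None


print(next_step("Kafka", "Standard HA"))
print(next_step("Jraft", "Standard HA"))
```

For a Kafka customer at Standard HA the lookup skips the empty Standard DR cell and lands on Advanced DR, while a Jraft customer at the same level gets the cross-region Learner step first.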

Slide 8

Multi-AZ MSK · Active-Standby Reference

多 AZ MSK 主备参考架构

▶ Let me make this concrete with four real architectures.

我们用四张真实架构图把它落地。

▶ I'll keep these brief. The goal is to show what these patterns look like end-to-end. Not to deep-dive.

我会简短带过。目的是让大家看到这些 pattern 端到端长什么样。不深入细节。

▶ First — waist tier.

第一张 — 腰部。

▶ OMS active-standby across two AZs. Two-AZ MSK Kafka. ME active-standby on EC2.

OMS 跨双 AZ 主备。两 AZ 的 MSK Kafka。EC2 上的 ME 主备。

▶ Routing decided at the trading server based on user ID.

Trading Server 根据 user ID 做路由。

▶ Odd user IDs go to AZ-1. Even user IDs go to AZ-2. That's the basic idea.

单数 user 到 AZ-1。双数 user 到 AZ-2。大概就是这样。
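
A minimal sketch of the odd/even routing rule, assuming numeric user IDs. The function name `route_az` and the AZ labels are illustrative and not taken from any customer's trading server.

```python
def route_az(user_id: int) -> str:
    """Route odd user IDs to AZ-1 and even user IDs to AZ-2."""
    return "AZ-1" if user_id % 2 == 1 else "AZ-2"


for uid in (1001, 1002):
    print(uid, route_az(uid))
```

Because the rule depends only on the user ID, every order from the same user lands on the same OMS shard without any shared routing table.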

▶ Snapshots every 10 minutes. Async write to Aurora downstream.

每 10 分钟一次 snapshot。下游异步写 Aurora。

▶ P99 around 10 ms. Each shard handles around 50K TPS.

P99 大约 10 毫秒。每个分片大约 50K TPS。

▶ The Chinese sticky notes on the diagram cover snapshot strategy and offset sync logic. I'll translate as we go.

图里的中文 sticky 是 snapshot 策略和 offset 同步的逻辑。我边讲边翻。

▶ The key takeaway — this is the most adopted pattern.

关键 takeaway — 这是最主流的 pattern。

▶ It works well for waist-tier exchanges that want resilience without going to Sofa-Jraft yet.

适合腰部所 — 想要韧性,但还没到上 Sofa-Jraft 的阶段。

Slide 9

Single-AZ Self-Managed Kafka · Raft OMS

单 AZ 自建 Kafka + Raft OMS

▶ Second case — chest tier in transition.

第二个 — 胸部过渡态。

▶ The big change here — OMS is now Raft.

这里的大变化 — OMS 已经上 Raft 了。

▶ But Kafka is still in the middle. And now it's self-managed in a single AZ. Not MSK.

但 Kafka 还在中间。而且变成自建单 AZ 了。不是 MSK。

▶ ME is still active-standby.

ME 还是主备。

▶ Why this hybrid?

为什么是这种混合?

▶ Customers at this tier want OMS consistency guarantees from Raft.

这一层的客户想要 OMS 的 Raft 一致性保证。

▶ But they don't want to give up Kafka. Their ecosystem already depends on it.

但又不想丢 Kafka。他们的整个生态都依赖 Kafka。

▶ So they upgrade the front first — OMS to Raft. And leave the rest stable.

所以他们先升级前端 — OMS 上 Raft。后面留稳定。

▶ One ME snapshot per minute here. More aggressive than the previous case.

ME 在这里是每分钟一次 snapshot。比上一个更激进。

▶ P99 stays around 10 to 15 ms.

P99 保持在 10 到 15 毫秒。

Slide 10

Sofa-Jraft End-to-End

Sofa-Jraft 全链路

▶ Third — shoulder tier. Full Sofa-Jraft.

第三个 — 肩颈部。全链路 Sofa-Jraft。

▶ Kafka is gone from the trading path.

Kafka 从交易路径上消失了。

▶ OMS cluster and ME cluster — both Raft, both gRPC, both deployed on bare-metal instances.

OMS 集群和 ME 集群 — 都是 Raft,都用 gRPC,都部署在裸金属上。

▶ Notice OMS-shard-1 and OMS-shard-2 each have their own Raft group. With leader and followers.

你会看到 OMS-shard-1 和 shard-2 各自有自己的 Raft group。有 leader 和 followers。

▶ Same on the ME side.

ME 那边也一样。

▶ Snapshots still 10 minutes. Sofa-Jraft handles consistency. Snapshots are for fast recovery.

Snapshot 还是 10 分钟。Sofa-Jraft 管一致性。Snapshot 用于快速恢复。
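
Why a snapshot speeds up recovery can be shown with a toy state machine: a restarting node loads the snapshot and replays only the log entries written after it. This is a conceptual sketch, not the Sofa-Jraft API. The `Snapshot` class, the `(index, account, delta)` log format, and the `recover` function are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    last_index: int            # index of the last log entry already included
    balances: dict[str, int]   # state as of last_index


def recover(snapshot: Snapshot, log: list[tuple[int, str, int]]):
    """Rebuild state from a snapshot plus the log tail.

    Returns the recovered balances and how many entries were replayed."""
    state = dict(snapshot.balances)
    replayed = 0
    for index, account, delta in log:
        if index <= snapshot.last_index:
            continue  # already reflected in the snapshot
        state[account] = state.get(account, 0) + delta
        replayed += 1
    return state, replayed


log = [
    (1, "alice", 100),
    (2, "bob", 70),
    (3, "bob", -20),
    (4, "alice", -30),
    (5, "bob", 30),
]
snapshot = Snapshot(last_index=3, balances={"alice": 100, "bob": 50})

state, replayed = recover(snapshot, log)
print(state, replayed)
```

Replaying the full log from an empty snapshot gives the same state but touches every entry, so the snapshot interval bounds how much log has to be replayed after a restart.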

▶ Customers at this tier have invested heavily — DPDK, NUMA, custom kernels.

这一层的客户投入很重 — DPDK、NUMA、自定义内核。

▶ P99 under 10 ms is achievable, with the right tuning.

调优好可以做到 P99 10 毫秒以内。

▶ One trade-off worth mentioning — the Sofa-Jraft codebase is mature but evolves slowly.

有一个 trade-off 值得提一下 — Sofa-Jraft 代码库成熟但更新慢。

▶ Most customers at this tier maintain their own forks.

这一层的客户大多维护自己 fork 的版本。

Slide 11

Aeron Cluster End-to-End · Head Tier

Aeron Cluster 全链路 · 头部参考

▶ Last one — head tier. Aeron Cluster end-to-end.

最后 — 头部。全链路 Aeron Cluster。

▶ This is the publicly known architecture pattern. Based on Frank Yu's QCon and Aeron MeetUp talks.

这是公开已知的架构 pattern。来自 Frank Yu 在 QCon 和 Aeron MeetUp 的演讲。

▶ Aeron Cluster on both OMS and ME sides. UDP Aeron Transport between them.

OMS 端和 ME 端都是 Aeron Cluster。中间是 UDP 的 Aeron Transport。

▶ All deployed on bare-metal instances.

全部部署在裸金属上。

▶ Two sharding strategies — OMS shards by user. ME shards by trading pair.

两种分片策略 — OMS 按 user 分片。ME 按交易对分片。

▶ What gives this architecture its edge —

是什么让这个架构有优势 —

▶ Reliable UDP multicast.

可靠的 UDP 组播。

▶ Aeron Cluster Raft consensus.

Aeron Cluster 的 Raft 共识。

▶ NAK-based precise retransmission.

基于 NAK 的精准重传。
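
The idea behind NAK-based retransmission is that the receiver stays silent while data arrives in order and, on detecting a gap, asks only for the missing range. The sketch below is a generic illustration using message sequence numbers. It is not Aeron's wire protocol, and the `NakReceiver` class and its fields are hypothetical.

```python
class NakReceiver:
    """Toy receiver that delivers in order and NAKs only missing ranges."""

    def __init__(self) -> None:
        self.next_to_deliver = 0
        self.highest_seen = -1
        self.pending: dict[int, str] = {}           # out-of-order messages
        self.delivered: list[str] = []
        self.naks_sent: list[tuple[int, int]] = []  # (first, last) missing

    def on_message(self, seq: int, payload: str) -> None:
        if seq < self.next_to_deliver or seq in self.pending:
            return  # duplicate, ignore
        if seq > self.highest_seen + 1:
            # Gap detected: request exactly the missing range, once.
            self.naks_sent.append((self.highest_seen + 1, seq - 1))
        self.highest_seen = max(self.highest_seen, seq)
        self.pending[seq] = payload
        # Deliver everything that is now contiguous.
        while self.next_to_deliver in self.pending:
            self.delivered.append(self.pending.pop(self.next_to_deliver))
            self.next_to_deliver += 1


rx = NakReceiver()
# Messages 2 and 3 are lost at first and arrive later as retransmissions.
for seq, payload in [(0, "a"), (1, "b"), (4, "e"), (5, "f"), (2, "c"), (3, "d")]:
    rx.on_message(seq, payload)

print(rx.delivered)
print(rx.naks_sent)
```

Compared with per-message ACKs, the sender hears nothing on the happy path and one small request per loss event, which helps keep the reliable-UDP path cheap when loss is rare.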

▶ GC-free, zero-copy ring buffers.

GC-free 的零拷贝 Ring Buffer。

▶ The numbers speak — P99 under 1 millisecond.

数字会说话 — P99 在 1 毫秒以内。

▶ A note for our peers here today.

给在场各位同事一个 note。

▶ What we draw on this slide is the public pattern.

这一页上画的是公开的 pattern。

▶ We've also been spending time understanding what makes this architecture work in production. And where the edge cases are.

我们也花了不少时间研究这个架构在生产中怎么 work。以及哪些是 edge case。

▶ So we'd love to hear your perspective later. Especially on how this pattern evolves over the next 12 to 24 months.

所以稍后非常希望听听各位的看法。特别是这个 pattern 在接下来 12 到 24 个月怎么演进。

▶ That's the overview. Happy to take questions.

overview 到此结束。欢迎随时提问。

▶ And if any of these architectures are familiar to you, I'd love to hear more.

如果这些架构里有你们熟悉的,也很想多聊聊。