Slide 1
Opening
开场
▶
First, a big thanks to Joshua. That was a fantastic talk, with really great content.
首先,非常感谢 Joshua。刚才那段分享真的非常棒,内容也非常精彩。
▶
For my part, I'll focus on something different.
我这一部分会讲点不一样的。
▶
Over the past two years, we talked to many CEX customers.
过去两年,我们和很多 CEX 客户有过深入交流。
▶
We saw the same patterns come up again and again.
我们反复看到一些共性的 pattern 出现。
▶
Today I want to share what we saw.
今天我想分享一下我们的观察。
▶
The common patterns. How architectures change over time. And how these exchanges build their resilience.
这些共性的 pattern。架构怎么随时间演进。以及这些交易所怎么构建他们的韧性。
▶
This 40-minute session has two parts.
这 40 分钟分两部分。
▶
First — me. I'll walk through the architecture patterns we see across CEX customers.
第一部分 — 我来讲。我会讲我们在 CEX 客户身上看到的架构 pattern。
▶
Second — my colleague Kenny will go deeper on Aeron, with more details.
第二部分 — 我的同事 Kenny 会深入讲 Aeron,以及更多细节。
▶
You'll see this on the agenda in a moment.
这些等下你会在 agenda 里看到。
▶
One more thing — please feel free to interrupt me anytime.
还有一点 — 任何时候都可以打断我。
▶
This is a conversation, not a one-way presentation.
这是一次交流,不是单向的分享。
▶
And if my English fails me, my colleagues here will rescue me.
另外如果我英语卡住了,旁边同事会救场。
▶
OK, let's start.
好,我们开始。
Slide 2
Agenda
议程
▶
Quick look at the agenda.
快速看一下议程。
▶
First, why resilience matters. We'll start with a short look at major AWS incidents over the past 15 years.
首先 — 为什么韧性重要。我们会快速回顾过去 15 年 AWS 的几次大事故。
▶
Second — our resilience framework. The mental model we use with customers.
其次 — 我们的韧性框架。我们和客户对话时用的思维模型。
▶
Third — CEX architecture tiers. From leg to head. And how each tier handles HA and DR.
第三 — CEX 架构分层。从腿部到头部。每一层是怎么做 HA 和 DR 的。
▶
At the end, I'll touch on Aeron Premium, and walk through one real deployment case.
最后,我会简单提一下 Aeron Premium,并过一个真实部署案例。
Slide 3
Three-Layer Defense + Chaos Engineering
三层防护 + 混沌工程
▶
So this is the mental model we use.
这是我们的思维模型。
▶
Resilience has three layers. Plus one practice that ties them together.
韧性分三层。加上一个贯穿全局的实践。
▶
Layer 1 is Backup. The mission is simple — no data loss.
第一层是 Backup。目标很简单 — 数据不丢。
▶
Periodic snapshots. Cross-region replication. Recovery drills.
定期快照。跨 Region 复制。恢复演练。
▶
Cost is low — about 5 to 10 percent.
成本很低 — 大概 5% 到 10%。
▶
Every customer needs this. No exceptions.
所有客户都需要。没有例外。
▶
Layer 2 is HA. The mission — no service outage.
第二层是 HA。目标 — 服务不停。
▶
Today, most mid-tier customers stop at single-AZ HA.
现在,大多数腰部客户停在单 AZ HA。
▶
But AZ-level failures do happen. They are rare, but they happen.
但 AZ 级故障真的会发生。罕见,但会发生。
▶
So we strongly recommend Multi-AZ HA as the new baseline.
所以我们强烈推荐 Multi-AZ HA 作为新基线。
▶
Cost goes up to 100 to 150 percent. But recovery becomes seconds to minutes.
成本上升到 100% 到 150%。但恢复时间是秒到分钟级。
▶
Layer 3 is DR. The mission — business continuity.
第三层是 DR。目标 — 业务延续。
▶
This is multi-region active or warm standby. For region-level events.
这是多 Region active 或 warm standby。应对 Region 级事件。
▶
For mission-critical workloads, this is not optional.
对核心业务来说,这不是可选项。
▶
Now — the most important part.
现在 — 最重要的部分。
▶
Chaos Engineering.
混沌工程。
▶
All these layers only work if you have actually tested them.
前面这些层,只有你真的演练过,才会 work。
▶
We see this again and again. Companies build great architectures.
我们反复看到这一点。公司建了很好的架构。
▶
Three-AZ. Multi-region. All the right patterns.
三 AZ。多 Region。所有该有的 pattern 都有。
▶
But when something real happens — the runbook is out of date.
但真出事的时候 — runbook 已经过时。
▶
The team has never practiced. Decisions get made under pressure.
团队从来没演练过。压力下做决定。
▶
Often at three in the morning.
经常是凌晨三点。
▶
Chaos Engineering is only 2 to 5 percent extra cost.
混沌工程只增加 2% 到 5% 的成本。
▶
But it tells you whether the previous 150 percent of investment will actually pay off.
但它决定了前面 150% 的投入是否真的有用。
▶
Without chaos drills, the rest is just paper architecture.
没有混沌演练,前面都是纸上谈兵。
Slide 4
Architecture Summary
架构分层概览
▶
Now let's look at where real CEX customers actually sit today.
现在我们看一下,真实的 CEX 客户今天都在什么位置。
▶
We map customers into five tiers — leg, waist, chest, shoulder, and head.
我们把客户分成五层 — 腿、腰、胸、肩颈、头。
▶
One important point: the tiers are by architectural maturity, not by business size.
但要注意 — 这是按架构成熟度分的,不是按业务体量。
▶
Some very large exchanges by volume actually sit at the waist or chest tier.
有些体量很大的交易所,其实架构上还在腰部或胸部。
▶
At the bottom — leg tier.
最底下 — 腿部。
▶
Single-leader OMS. Kafka. Single-leader matching engine. Synchronous DB writes.
单主 OMS。Kafka。单主撮合引擎。同步写库。
▶
Throughput is under 10K TPS per trading pair. P99 is around 30 ms.
单币吞吐量 1 万以下。P99 大约 30 毫秒。
▶
The DB is the bottleneck.
瓶颈在数据库。
▶
Waist tier — most CEX customers are here.
腰部 — 大多数 CEX 客户在这里。
▶
Active-standby OMS. Sharding. In-memory matching with async DB.
主备 OMS。分片。内存化撮合,异步写库。
▶
20K to 50K TPS. P99 10 to 20 ms.
2 万到 5 万 TPS。P99 10 到 20 毫秒。
▶
Some are multi-AZ. Many are still single-AZ.
有些是多 AZ。很多还停在单 AZ。
▶
Chest tier. OMS already on Raft. Kafka still in the middle. ME still active-standby.
胸部。OMS 已经上 Raft。Kafka 还在中间。ME 还是主备。
▶
40K to 150K TPS. P99 around 15 ms.
4 万到 15 万 TPS。P99 大约 15 毫秒。
▶
The OMS upgrade to Raft is the key step here.
OMS 升级到 Raft,是这一层的关键一步。
▶
Shoulder tier. Full-stack Sofa-Jraft. Kafka removed from the trading path.
肩颈部。全链路 Sofa-Jraft。Kafka 从交易路径上去掉了。
▶
150K to 500K TPS. P99 under 10 ms.
15 万到 50 万 TPS。P99 在 10 毫秒以内。
▶
This is TCP pushed to its limit.
TCP 极限。
▶
Head tier. Aeron or in-house multicast. UDP-based, end-to-end Raft, mostly multi-AZ.
头部。Aeron 或自研组播。基于 UDP,全链路 Raft,大多多 AZ。
▶
400K to 1.2 million TPS. P99 in single-digit milliseconds.
40 万到 120 万 TPS。P99 个位数毫秒。
▶
The key observation —
关键观察 —
▶
Head-tier exchanges are not just chasing performance.
头部交易所不只是在追性能。
▶
They also have the strongest resilience setup. Multi-AZ, automatic failover, the works.
他们的韧性设计也是最强的。多 AZ、自动切换,该有的都有。
▶
Below the shoulder, most customers are still single-AZ.
肩颈部以下,大多数客户还停在单 AZ。
▶
That is where we see the biggest engagement opportunity. Helping them evolve to multi-AZ without sacrificing latency.
这是我们看到最大的 engagement 空间。帮他们演进到多 AZ,同时不牺牲延迟。
Slide 5
Three Tech Stacks at a Glance
三种技术栈对比
▶
Now let's compress this view into three technology stacks.
现在我们把刚才看到的客户,归纳成三种技术栈来看。
▶
Kafka standard — about 80 percent of CEX customers sit here.
Kafka 标准 — 大约 80% 的 CEX 客户在这里。
▶
Mature ecosystem. Standardized components. Optional dual-AZ.
生态成熟。组件标准化。可选双 AZ 部署。
▶
P99 15 to 20 ms. 20K to 100K TPS. Easiest to operate.
P99 在 15 到 20 毫秒。2 万到 10 万 TPS。运维最简单。
▶
Sofa-Jraft over gRPC — about 10 percent.
Sofa-Jraft 配 gRPC — 大约 10%。
▶
No message broker. Heavy TCP optimization — Core Pinning, DPDK, NUMA, ENA.
没有消息中间件。TCP 极致优化 — Core Pinning、DPDK、NUMA、ENA。
▶
P99 under 10 ms. Up to 500K TPS.
P99 在 10 毫秒以内。最高到 50 万 TPS。
▶
Aeron over UDP — about 5 percent. Head tier.
Aeron 配 UDP — 大约 5%。头部所。
▶
Reliable UDP multicast. Raft consensus. GC-free. Ring buffer zero-copy. Memory-mapped files.
可靠 UDP 组播。Raft 共识。GC-free。Ring Buffer 零拷贝。Memory Mapped File。
▶
P99 under 1 ms. Up to 1.2 million TPS.
P99 在 1 毫秒以内。最高到 120 万 TPS。
▶
The key point — complexity and cost go up steeply as you move right.
关键点 — 越往右,复杂度和成本陡升。
▶
On the slide, Kafka rates three stars for complexity and three dollar signs for cost.
Kafka 是三颗星、三块钱。
▶
Aeron is five and five.
Aeron 是五颗星、五块钱。
▶
So our advice is — match the stack to the business stage. Not the other way around.
所以我们的建议是 — 让技术栈匹配业务阶段。不是反过来。
▶
Don't pick Aeron because it's cool. Pick it because you've outgrown the alternatives.
不要因为 Aeron 酷就选它。要等你真的发现别的栈不够用了再选。
Slide 6
Learner vs Follower
Learner vs Follower
▶
Quick concept page — Learner vs Follower.
快速过一个概念 — Learner 和 Follower。
▶
In Raft, Followers vote. They count toward quorum.
在 Raft 里,Follower 是投票的。他们计入多数派。
▶
Three voting members, the Leader plus two Followers, means the majority is two. A write commits once two of the three have it.
3 个投票成员(Leader 加 2 个 Follower),多数派就是 2。3 个里有 2 个确认,写入就成功。
▶
Learners do not vote. They receive logs, apply logs. But they don't participate in elections.
Learner 不投票。他们接收日志,应用日志。但不参与选举。
▶
Why does this matter? Three reasons.
为什么这个重要?三个原因。
▶
One — read-only replicas. Distribute read load without slowing writes.
一 — 只读副本。分担读压力,不影响写入。
▶
Two — cross-region deployment. Far-away nodes don't slow down primary writes.
二 — 跨地域部署。远端节点不会拖慢主集群的写入。
▶
Three — node warm-up. A new node syncs as Learner first, then promotes when ready.
三 — 节点预热。新节点先以 Learner 同步,准备好之后再提升。
▶
The key insight is on the right.
关键 insight 在右边。
▶
Three voting members, two Learners.
3 个投票成员,2 个 Learner。
▶
A write needs acks from two of the three voting members, and the Leader counts as one of them.
写入只需要 3 个投票成员中的 2 个确认,Leader 自己也算一个。
▶
The two Learners — even if they're far away or in another region — do not block the write.
那 2 个 Learner — 即使在很远或者别的 Region — 不会阻塞写入。
▶
So this is how Jraft and Aeron achieve cross-region DR without sacrificing primary latency.
所以这就是 Jraft 和 Aeron 实现跨 Region DR 而不牺牲主集群延迟的原理。
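The quorum rule on this slide can be sketched in a few lines. This is a minimal illustration, not Sofa-Jraft or Aeron code: the `Node` class, the node names, and the assumption that the three voting members are one Leader plus two Followers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    voter: bool          # True for the Leader and Followers, False for Learners
    acked: bool = False  # has this node persisted the log entry?


def majority(voter_count: int) -> int:
    """Smallest number of voters that forms a majority."""
    return voter_count // 2 + 1


def is_committed(nodes: list[Node]) -> bool:
    """An entry commits once a majority of voters hold it. Learners are ignored."""
    voters = [n for n in nodes if n.voter]
    acks = sum(1 for n in voters if n.acked)
    return acks >= majority(len(voters))


# Hypothetical cluster: three voters in the primary region, two remote Learners.
cluster = [
    Node("leader-az1", voter=True, acked=True),      # the Leader holds its own entry
    Node("follower-az2", voter=True, acked=True),
    Node("follower-az3", voter=True, acked=False),   # slow or temporarily down
    Node("learner-region2-a", voter=False, acked=False),  # far away, still catching up
    Node("learner-region2-b", voter=False, acked=False),
]

print(is_committed(cluster))
```

The two Learners have not acknowledged anything, yet the entry still commits, because only the three voters enter the majority calculation.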
Slide 7
Resilience Evolution Across Three Stacks
三种栈的韧性演进
▶
Now — the evolution.
现在我们看演进路径。
▶
Two dimensions on this page.
这一页有两个维度。
▶
Horizontal — left to right — is the stack upgrade. Kafka, Jraft, Aeron.
横向 — 从左到右 — 是技术栈升级。Kafka、Jraft、Aeron。
▶
Vertical — bottom to top — is the resilience level.
纵向 — 从下往上 — 是韧性级别。
▶
Every customer follows some version of this path.
每个客户都会沿着这条路径的某个版本演进。
▶
Bottom — baseline.
最底层 — 基础。
▶
Kafka with a single-leader OMS and ME. Jraft and Aeron with a basic Raft cluster.
Kafka 配单主 OMS 和 ME。Jraft 和 Aeron 配基础 Raft 集群。
▶
Next step — Baseline HA.
下一步 — 基础 HA。
▶
Kafka adds cross-AZ active-standby to OMS and ME.
Kafka 这边给 OMS 和 ME 加跨 AZ 的主备。
▶
Jraft and Aeron add cross-AZ Learners. Same purpose, different mechanism.
Jraft 和 Aeron 加跨 AZ 的 Learner。目的一样,机制不同。
▶
Next — Standard HA.
再上一层 — 标准 HA。
▶
Kafka customers begin migrating OMS to Raft. This is the qualitative jump.
Kafka 客户开始把 OMS 迁到 Raft。这是质变的一步。
▶
Jraft and Aeron — continue strengthening Multi-AZ Learners.
Jraft 和 Aeron — 继续加强多 AZ 的 Learner。
▶
Standard DR — cross-region Learner.
标准 DR — 跨 Region 的 Learner。
▶
You'll notice the Kafka column is empty here.
你会看到 Kafka 这一列是空的。
▶
That is intentional. In these designs Kafka is the message transport, not where the authoritative trading state lives.
这是故意的。在这些架构里,Kafka 只是消息通道,核心交易状态并不在它上面。
▶
Cross-region replication on Kafka itself isn't where we recommend customers invest.
我们不建议客户在 Kafka 这一层投入跨 Region 复制。
▶
The right place to do cross-region DR is on the stateful layer — OMS, ME.
跨 Region DR 该做在有状态的服务上 — OMS、ME。
▶
That is where Raft and Aeron Learners shine.
这正是 Raft 和 Aeron Learner 发挥作用的地方。
▶
Top — Advanced DR.
顶层 — 进阶 DR。
▶
All three stacks can reach this. Cross-region Learners. Full multi-region active or warm standby.
三种栈都可以到达。跨 Region 的 Learner。完整的多 Region active 或 warm standby。
▶
The takeaway — every customer can be on this map.
启示 — 每个客户都能定位在这张图上。
▶
The conversation is — where are you now, and what's the next step.
对话就变成 — 你现在在哪里,下一步是什么。
Slide 8
Multi-AZ MSK · Active-Standby Reference
多 AZ MSK 主备参考架构
▶
Let me make this concrete with four real architectures.
我们用四张真实架构图把它落地。
▶
I'll keep these brief. The goal is to show what these patterns look like end-to-end. Not to deep-dive.
我会简短带过。目的是让大家看到这些 pattern 端到端长什么样。不深入细节。
▶
First — waist tier.
第一张 — 腰部。
▶
OMS active-standby across two AZs. Two-AZ MSK Kafka. ME active-standby on EC2.
OMS 跨双 AZ 主备。两 AZ 的 MSK Kafka。EC2 上的 ME 主备。
▶
Routing decided at the trading server based on user ID.
Trading Server 根据 user ID 做路由。
▶
Odd user IDs go to AZ-1. Even user IDs go to AZ-2. That's the basic idea.
单数 user 到 AZ-1。双数 user 到 AZ-2。大概就是这样。
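The odd/even routing rule can be sketched as below. The function names and the fallback to the other AZ when the home AZ is unhealthy are assumptions for illustration; the slide only states the odd/even split.

```python
AZS = ("AZ-1", "AZ-2")


def home_az(user_id: int) -> str:
    """Odd user IDs map to AZ-1, even user IDs map to AZ-2."""
    return AZS[0] if user_id % 2 == 1 else AZS[1]


def route(user_id: int, healthy=frozenset(AZS)) -> str:
    """Route to the home AZ, or to the other AZ if the home AZ is unhealthy.

    The failover branch is a hypothetical reading of "active-standby".
    """
    home = home_az(user_id)
    if home in healthy:
        return home
    standby = AZS[1] if home == AZS[0] else AZS[0]
    if standby in healthy:
        return standby
    raise RuntimeError("no healthy AZ available")


print(route(7), route(8), route(7, healthy={"AZ-2"}))
```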
▶
Snapshots every 10 minutes. Async write to Aurora downstream.
每 10 分钟一次 snapshot。下游异步写 Aurora。
▶
P99 around 10 ms. Each shard handles around 50K TPS.
P99 大约 10 毫秒。每个分片大约 50K TPS。
▶
The Chinese sticky notes on the diagram cover snapshot strategy and offset sync logic. I'll translate as we go.
图里的中文 sticky 是 snapshot 策略和 offset 同步的逻辑。我边讲边翻。
▶
The key takeaway: this is the most widely adopted pattern.
关键 takeaway — 这是最主流的 pattern。
▶
It works well for waist-tier exchanges that want resilience without going to Sofa-Jraft yet.
适合腰部所 — 想要韧性,但还没到上 Sofa-Jraft 的阶段。
Slide 9
Single-AZ Self-Managed Kafka · Raft OMS
单 AZ 自建 Kafka + Raft OMS
▶
Second case — chest tier in transition.
第二个 — 胸部过渡态。
▶
The big change here — OMS is now Raft.
这里的大变化 — OMS 已经上 Raft 了。
▶
But Kafka is still in the middle. And now it's self-managed in a single AZ. Not MSK.
但 Kafka 还在中间。而且变成自建单 AZ 了。不是 MSK。
▶
ME is still active-standby.
ME 还是主备。
▶
Why this hybrid?
为什么是这种混合?
▶
Customers at this tier want OMS consistency guarantees from Raft.
这一层的客户想要 OMS 的 Raft 一致性保证。
▶
But they don't want to give up Kafka. Their ecosystem already depends on it.
但又不想丢 Kafka。他们的整个生态都依赖 Kafka。
▶
So they upgrade the front first — OMS to Raft. And leave the rest stable.
所以他们先升级前端 — OMS 上 Raft。后面留稳定。
▶
One ME snapshot per minute here. More aggressive than the previous case.
ME 在这里是每分钟一次 snapshot。比上一个更激进。
▶
P99 stays around 10 to 15 ms.
P99 保持在 10 到 15 毫秒。
Slide 10
Sofa-Jraft End-to-End
Sofa-Jraft 全链路
▶
Third — shoulder tier. Full Sofa-Jraft.
第三个 — 肩颈部。全链路 Sofa-Jraft。
▶
Kafka is gone from the trading path.
Kafka 从交易路径上消失了。
▶
OMS cluster and ME cluster — both Raft, both gRPC, both deployed on metal.
OMS 集群和 ME 集群 — 都是 Raft,都用 gRPC,都部署在裸金属上。
▶
Notice OMS-shard-1 and OMS-shard-2 each have their own Raft group. With leader and followers.
你会看到 OMS-shard-1 和 shard-2 各自有自己的 Raft group。有 leader 和 followers。
▶
Same on the ME side.
ME 那边也一样。
▶
Snapshots are still taken every 10 minutes. Sofa-Jraft handles consistency; snapshots are there for fast recovery.
Snapshot 还是 10 分钟。Sofa-Jraft 管一致性。Snapshot 用于快速恢复。
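Why a snapshot speeds up recovery can be shown with a toy state machine: a restarting node loads the latest snapshot and replays only the log entries after it. This is a generic sketch, not the Sofa-Jraft API; `StateMachine`, `recover`, and the balance-update entries are hypothetical.

```python
import copy


class StateMachine:
    """Toy state machine: each log entry is (user, balance_delta)."""

    def __init__(self):
        self.balances = {}
        self.last_applied = 0  # index of the last applied log entry

    def apply(self, index, entry):
        user, delta = entry
        self.balances[user] = self.balances.get(user, 0) + delta
        self.last_applied = index

    def snapshot(self):
        return {"index": self.last_applied,
                "balances": copy.deepcopy(self.balances)}


def recover(snapshot, log):
    """Rebuild state from a snapshot (if any) plus the log entries after it."""
    sm = StateMachine()
    if snapshot is not None:
        sm.balances = copy.deepcopy(snapshot["balances"])
        sm.last_applied = snapshot["index"]
    replayed = 0
    for index, entry in log:  # log is ordered by index
        if index > sm.last_applied:
            sm.apply(index, entry)
            replayed += 1
    return sm, replayed


log = [(1, ("alice", 100)), (2, ("bob", 50)), (3, ("alice", -30)),
       (4, ("bob", 20)), (5, ("alice", 5))]

live = StateMachine()
for index, entry in log[:3]:
    live.apply(index, entry)
snap = live.snapshot()  # snapshot taken after entry 3

with_snap, replayed_with = recover(snap, log)
without_snap, replayed_without = recover(None, log)
print(replayed_with, replayed_without, with_snap.balances == without_snap.balances)
```

Both recoveries reach the same state; the one with a snapshot replays fewer entries, which is the whole point of taking them.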
▶
Customers at this tier have invested heavily — DPDK, NUMA, custom kernels.
这一层的客户投入很重 — DPDK、NUMA、自定义内核。
▶
P99 under 10 ms is achievable, with the right tuning.
调优好可以做到 P99 10 毫秒以内。
▶
One trade-off worth mentioning — the Sofa-Jraft codebase is mature but evolves slowly.
有一个 trade-off 值得提一下 — Sofa-Jraft 代码库成熟但更新慢。
▶
Most customers at this tier maintain their own forks.
这一层的客户大多维护自己 fork 的版本。
Slide 11
Aeron Cluster End-to-End · Head Tier
Aeron Cluster 全链路 · 头部参考
▶
Last one — head tier. Aeron Cluster end-to-end.
最后 — 头部。全链路 Aeron Cluster。
▶
This is the publicly known architecture pattern. Based on Frank Yu's QCon and Aeron MeetUp talks.
这是公开已知的架构 pattern。来自 Frank Yu 在 QCon 和 Aeron MeetUp 的演讲。
▶
Aeron Cluster on both OMS and ME sides. UDP Aeron Transport between them.
OMS 端和 ME 端都是 Aeron Cluster。中间是 UDP 的 Aeron Transport。
▶
All deployed on metal instances.
全部部署在裸金属上。
▶
Two sharding strategies — OMS shards by user. ME shards by trading pair.
两种分片策略 — OMS 按 user 分片。ME 按交易对分片。
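The two sharding keys can be sketched as follows. The shard counts and the modulo/CRC32 mapping are assumptions for illustration only; a production system may well use an explicit pair-to-shard table so that hot pairs get dedicated shards.

```python
import zlib

OMS_SHARDS = 2  # hypothetical shard counts
ME_SHARDS = 4


def oms_shard(user_id: int) -> int:
    """OMS shards by user: all orders of one user land on one OMS shard."""
    return user_id % OMS_SHARDS


def me_shard(pair: str) -> int:
    """ME shards by trading pair: one order book lives on exactly one ME shard.

    CRC32 is used instead of the built-in hash() because it is stable
    across processes and restarts.
    """
    return zlib.crc32(pair.encode("utf-8")) % ME_SHARDS


order = {"user_id": 1001, "pair": "BTC-USDT"}
print(oms_shard(order["user_id"]), me_shard(order["pair"]))
```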
▶
What gives this architecture its edge —
是什么让这个架构有优势 —
▶
Reliable UDP multicast.
可靠的 UDP 组播。
▶
Aeron Cluster Raft consensus.
Aeron Cluster 的 Raft 共识。
▶
NAK-based precise retransmission.
基于 NAK 的精准重传。
▶
GC-free, zero-copy ring buffers.
GC-free 的零拷贝 Ring Buffer。
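The idea behind NAK-based retransmission can be illustrated with a toy receiver: it stays silent while packets arrive in order and asks only for the exact sequence numbers it is missing. This is a simplified simulation under assumed names (`Receiver`, integer sequence numbers), not Aeron's actual wire protocol.

```python
class Receiver:
    """Toy receiver that reports gaps instead of acknowledging every packet."""

    def __init__(self):
        self.buffer = {}        # seq -> payload, held until deliverable in order
        self.next_expected = 0  # lowest sequence number not yet delivered
        self.highest_seen = -1

    def on_packet(self, seq, payload):
        """Store the packet and return the sequence numbers to NAK."""
        if seq >= self.next_expected:  # ignore duplicates of delivered packets
            self.buffer[seq] = payload
            self.highest_seen = max(self.highest_seen, seq)
        return self.missing()

    def missing(self):
        return [s for s in range(self.next_expected, self.highest_seen)
                if s not in self.buffer]

    def deliver(self):
        """Hand over every packet that is now contiguous, in order."""
        out = []
        while self.next_expected in self.buffer:
            out.append(self.buffer.pop(self.next_expected))
            self.next_expected += 1
        return out


sent = {seq: f"msg-{seq}" for seq in range(5)}
rx = Receiver()
delivered, naks = [], []

for seq in (0, 1, 3, 4):  # packet 2 is lost in transit
    for gap in rx.on_packet(seq, sent[seq]):
        if gap not in naks:
            naks.append(gap)
    delivered += rx.deliver()

for seq in naks:  # the sender retransmits only what was NAKed
    rx.on_packet(seq, sent[seq])
    delivered += rx.deliver()

print(naks, delivered)
```

Only the one lost packet is resent, and the receiver still delivers everything in order.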
▶
The numbers speak for themselves: P99 under 1 millisecond.
数字会说话 — P99 在 1 毫秒以内。
▶
A note for our peers here today.
给在场各位同事一个 note。
▶
What we draw on this slide is the public pattern.
这一页上画的是公开的 pattern。
▶
We've also been spending time understanding what makes this architecture work in production. And where the edge cases are.
我们也花了不少时间研究这个架构在生产中怎么 work。以及哪些是 edge case。
▶
So we'd love to hear your perspective later. Especially on how this pattern evolves over the next 12 to 24 months.
所以稍后非常希望听听各位的看法。特别是这个 pattern 在接下来 12 到 24 个月怎么演进。
▶
That's the overview. Happy to take questions.
overview 到此结束。欢迎随时提问。
▶
And if any of these architectures are familiar to you, I'd love to hear more.
如果这些架构里有你们熟悉的,也很想多聊聊。