A look at Amazon AWS's massive network

Last year, Lincoln Dale from Amazon AWS gave an excellent talk at AusNOG 2023 on the massive scale of AWS's network. The talk is available on YouTube here, and I have embedded it below. While some of the metrics may be outdated by now, they convey the fundamental design ideas.



Some interesting parts from the presentation:

  • There are over a million devices in the network!
  • No global IGP to reduce blast radius.
  • They increased internet capacity in some North American locations by over 100Tbps for football streaming.
  • The network runs on commodity hardware sourced from multiple vendors.
  • A 12.8Tbps switch is the fundamental building block: 32 x 400G ports with 8GB of RAM.
  • 32 of these switches are stacked in a rack, giving 100Tbps of usable capacity with redundancy.
  • 32 racks (each rack with 32 switches, each switch with 32 x 400G ports) give a 3.2Pbps (petabits per second) fabric (see the back-of-the-envelope math after this list).
  • No central UPS; instead, per-rack UPS for the important components.
  • The switch silicon supports up to 51.2Tbps, but that hardware isn't used yet due to the associated costs.
  • Specially created SN connectors are used instead of breakout cables to reduce cabling.
  • 2.3 million config changes a day are made across the network by the automation system, including pre-checks, post-checks, rollback if needed, etc. (a sketch of this control flow follows the list).
  • Network engineers do not make changes on the network directly; they make changes in the automation system, which in turn applies them to the network.
  • No use of BFD internally; fast LACP is used instead for sub-second failure detection.
  • eBGP is used inside the fabric, with BGP confederations.
  • They ran out of private IPv4 pools and are hence using Class E space (240/4). The combination of Class E and the CGNAT pool (100.64/10) breaks traceroutes towards AWS (see the address-classification sketch after the list).
  • Some devices do not decrement TTL (a fundamental mechanism traceroute relies on!) because paths hit 64 hops in many places.
  • Disaggregated control plane: peering connectivity physically terminates on a device, while BGP runs in a container on compute.
  • Pings from connected peers can return DUP responses, because one reply comes from the physical device and one from the container. This happens only when packets are sourced from the /31 peering IP.
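
To make the fabric arithmetic above concrete, here is the back-of-the-envelope math in Python. This is my own sketch; the gap between raw and usable per-rack capacity is my assumption about ports consumed by fabric interconnect and redundancy, not something the talk breaks down.

```python
# Back-of-the-envelope fabric math from the figures in the talk.
port_gbps = 400                        # per-port speed (400G)
ports_per_switch = 32                  # 32 x 400G per switch

switch_tbps = port_gbps * ports_per_switch / 1000
print(f"per-switch capacity: {switch_tbps} Tbps")      # 12.8 Tbps

switches_per_rack = 32
rack_raw_tbps = switch_tbps * switches_per_rack
print(f"per-rack raw capacity: {rack_raw_tbps} Tbps")  # 409.6 Tbps

# The talk quotes ~100Tbps *usable* per rack with redundancy; the rest of
# the raw capacity presumably goes to fabric interconnect and failover.
rack_usable_tbps = 100
racks = 32
fabric_pbps = rack_usable_tbps * racks / 1000
print(f"fabric capacity: {fabric_pbps} Pbps")          # 3.2 Pbps
```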
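
The change-automation bullets describe a classic pre-check / apply / post-check / rollback pipeline. Below is a minimal sketch of that control flow; it is entirely my own illustration (every name is hypothetical), and the real AWS system is certainly far more elaborate.

```python
from typing import Callable, Iterable

Check = Callable[[], bool]

def run_change(
    pre_checks: Iterable[Check],
    apply: Callable[[], None],
    post_checks: Iterable[Check],
    rollback: Callable[[], None],
) -> bool:
    """Generic change pipeline: verify, apply, re-verify, roll back on failure."""
    if not all(check() for check in pre_checks):
        return False              # device not in the expected state: abort early
    apply()                       # push the config change
    if all(check() for check in post_checks):
        return True               # change verified healthy
    rollback()                    # post-checks failed: revert the change
    return False
```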
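
The Class E / CGNAT point is easy to check for yourself with Python's ipaddress module. A minimal sketch (the hop addresses are hypothetical, purely for illustration):

```python
import ipaddress

# Blocks mentioned in the talk: Class E (240.0.0.0/4) and the shared
# CGNAT address space (100.64.0.0/10, RFC 6598).
CLASS_E = ipaddress.ip_network("240.0.0.0/4")
CGNAT = ipaddress.ip_network("100.64.0.0/10")

def classify_hop(addr: str) -> str:
    """Explain why a traceroute hop towards AWS may look odd."""
    ip = ipaddress.ip_address(addr)
    if ip in CLASS_E:
        return "Class E (240/4): reserved; many hosts drop it or refuse to reply"
    if ip in CGNAT:
        return "CGNAT (100.64/10): shared address space, not globally routable"
    return "ordinary address"

for hop in ("241.0.12.7", "100.65.3.1", "52.95.110.1"):
    print(f"{hop}: {classify_hop(hop)}")
```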