27 Oct

Being Open How Facebook Got Its Edge

An excellent presentation by James Quinn from Facebook on “Being Open How Facebook Got Its Edge” at NANOG68. YouTube link here and video is embedded in the post below.

 

 

Some key points mentioned by James:

  1. BGP routing is inefficient as scale grows especially around distributing traffic. They can get a lot of traffic concentrated to a specific PoP apart from the fact that BGP best AS_PATH can simply be an inefficient low AS_PATH based path.
  2. Facebook comes with a cool idea of “evolving beyond BGP with BGP” where they use BGP concepts to beat some of the BGP-related problems.
  3. He also points to fact that IPv6 has much larger address space and huge summarization can result in egress problems for them. A single route announcement can just have almost entire network behind it!
  4. Traffic management is based on local and a global controller. Local controller picks efficient routes, injects them via BGP and takes care of traffic balancing within a given PoP/city, balancing traffic across local circuits. On the other hand, Global PoP is based on DNS logic and helps in moving traffic across cities.

 

It’s wonderful to see that Facebook is solving the performance and load related challenges using fundamental blocks like BGP (local controller) and DNS (global controller). 🙂

20 Aug

Bangladesh .bd TLD outage on 18th August 2016

 

outage

Day before yesterday i.e on 18th August 2016 Bangladesh’s TLD .bd went had an outage. It was originally reported by Jasim Alam on bdNOG mailing list.

 

His message shows that DNS resolution of BTCL (Bangladesh Telecommunications Company Ltd) was failing. Later Alok Das that it was the power problem resulting in outage.

Let’s look ask one of 13 root DNS server about NS records on who has the delegation for .bd.

So two of out of these three seem to be on BTCL network and that too on same /24.

 

Let’s ping to all these three using NLNOG Ring node of bdHUB: bdhub01.ring.nlnog.net

So clearly all three servers are in Bangladesh/local as per super low latency from bdHUB node. From traces from outside India it’s quite unlikely of any other anycast node outside Bangladesh. This is a serious design issue. For a country’s TLD one should have much more resiliency.

My good friend Fakrul from APNIC mentioned on mailing list about PCH becoming secondary for .bd. Same is visible now in the authority NS records of the domain.

dig @dns.bd. bd. ns +short
jamuna.btcl.net.bd.
dns.bd.
bd-ns.anycast.pch.net.
surma.btcl.net.bd.

 

So once the same is added on root DNS servers, it will bring up bit more resiliency with PCH’s platform with large number of anycast nodes.

So what was impact of this outage?
Well, probably a lot. .bd TLD outage would have brought down a lot of websites running on .bd domain. Any fresh DNS lookup would have failed, any websites with lower TTL would have went down. As per bdIX traffic graph some disturbance is visible across that day.

bdix drop