31 May

BGP Administrative Shutdown Communication

I recently came across an excellent draft at the IETF by Job Snijders & friends. It addresses scenarios where a network might miss communication about a maintenance activity when a BGP session is shut down. Once implemented, it lets a router send its peer a message of up to 128 bytes with information about the shutdown, like “Ticket XXX: We are upgrading the router, will be back live in 1hr”. It works by appending this data to the NOTIFICATION message, which is already part of the BGP protocol. The message is sent just before the session is shut down, so it is similar to what you see when a session is torn down due to prefix limits. This has already been implemented in some open source routing implementations such as OpenBGPD, GoBGP, pmacct and ExaBGP.
Here’s the latest draft of this change: https://tools.ietf.org/html/draft-ietf-idr-shutdown-09. And here’s Job’s talk from the NANOG conference at the start of this year.
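To make the mechanism concrete, here is a minimal sketch of how such a NOTIFICATION could be laid out on the wire, based on my reading of the draft (later published as RFC 8203): a Cease error (code 6) with the Administrative Shutdown subcode (2), followed by a one-byte length and the UTF-8 message. The function names are my own and hypothetical, and this is not taken from any of the implementations mentioned above.

```python
import struct

BGP_MARKER = b"\xff" * 16        # 16-byte all-ones marker in every BGP header
BGP_HEADER_LEN = 19              # marker (16) + length (2) + type (1)
MSG_TYPE_NOTIFICATION = 3
ERR_CEASE = 6                    # Cease error code
SUB_ADMIN_SHUTDOWN = 2           # Administrative Shutdown subcode
MAX_COMM_LEN = 128               # limit mentioned in the post


def build_shutdown_notification(text: str) -> bytes:
    """Build a Cease/Administrative Shutdown NOTIFICATION carrying `text`.

    Hypothetical helper: illustrates the layout only, not a full BGP speaker.
    """
    payload = text.encode("utf-8")
    if len(payload) > MAX_COMM_LEN:
        raise ValueError("shutdown communication exceeds 128 bytes")
    body = bytes([ERR_CEASE, SUB_ADMIN_SHUTDOWN, len(payload)]) + payload
    header = BGP_MARKER + struct.pack(
        "!HB", BGP_HEADER_LEN + len(body), MSG_TYPE_NOTIFICATION
    )
    return header + body


def parse_shutdown_communication(msg: bytes) -> str:
    """Extract the free-text message from a NOTIFICATION built as above."""
    code, subcode, length = msg[19], msg[20], msg[21]
    if code != ERR_CEASE or subcode != SUB_ADMIN_SHUTDOWN:
        raise ValueError("not an Administrative Shutdown notification")
    return msg[22:22 + length].decode("utf-8")


if __name__ == "__main__":
    note = "Ticket XXX: We are upgrading the router, will be back live in 1hr"
    wire = build_shutdown_notification(note)
    print(len(wire), parse_shutdown_communication(wire))
```

Note that the limit applies to encoded bytes, not characters, so a message with non-ASCII text has less room than it appears.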

Hopefully, we will see this implemented across large vendor routers!

29 May

Welcome Facebook (AS32934) to India!

Today I was having a chat with my friend Hari Haran. He mentioned that Facebook has started its PoP in Mumbai. This seems to be true: Facebook lists GPX Mumbai as a private peering PoP in its PeeringDB record.

I triggered a quick test trace to “www.facebook.com” on IPv4 from all Indian RIPE Atlas probes, with “www.facebook.com” resolved on the probe itself. The lowest latency is from Airtel Karnataka, and that’s still hitting Facebook in Singapore. I do not see any of the networks with probe coverage hitting a Facebook node locally.

Except for a few, most probes just seem to be hitting Singapore. So clearly there’s huge scope for peering.
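For anyone who wants to repeat the comparison, here is a rough sketch of ranking probes by the RTT of the last hop that answered. It assumes the traceroute results have already been downloaded as JSON and that each result carries `prb_id` plus a `result` list of hops with per-packet `rtt` values, which is my understanding of the RIPE Atlas format and worth checking against their documentation. The sample records are made up for illustration.

```python
def last_hop_rtt(result):
    """Return the lowest RTT (ms) of the last hop that answered, or None.

    Assumption: the last answering hop may not be the destination itself,
    so this is only a proxy for latency to the target.
    """
    for hop in reversed(result.get("result", [])):
        rtts = [pkt["rtt"] for pkt in hop.get("result", []) if "rtt" in pkt]
        if rtts:
            return min(rtts)
    return None


def rank_probes(results):
    """Return (probe_id, rtt) pairs sorted by RTT, skipping silent traces."""
    ranked = [(r["prb_id"], last_hop_rtt(r)) for r in results]
    return sorted(
        [(p, rtt) for p, rtt in ranked if rtt is not None],
        key=lambda item: item[1],
    )


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the actual measurement.
    sample = [
        {"prb_id": 1, "result": [
            {"hop": 1, "result": [{"rtt": 1.2}]},
            {"hop": 2, "result": [{"rtt": 61.0}, {"rtt": 58.5}]},
        ]},
        {"prb_id": 2, "result": [
            {"hop": 1, "result": [{"rtt": 0.9}]},
            {"hop": 2, "result": [{"x": "*"}]},
        ]},
    ]
    for probe_id, rtt in rank_probes(sample):
        print(probe_id, rtt)
```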

Full measurement results here: https://atlas.ripe.net/measurements/8777252/#!general
If you are an ISP in India, you can start peering with Facebook right away!

29 May

What makes BSNL AS9829 the most unstable ASN in the world?!

Over the weekend I was looking at the BGP Instability Report data. As usual (and unfortunately) BSNL tops that list, making it the most unstable autonomous system in the world. I have written previously about how AS9829 is the rotten IP backbone.

This isn’t a surprise since they keep coming out on top, but I think it’s well worth checking what exactly is causing it. So I looked into the BGP table updates published by Oregon route-views from 21st May to 27th May and pulled data specifically for AS9829. I see zero withdrawals, which is very interesting. I thought there would be a lot of announcements & withdrawals as they switch transits to balance traffic.
If I plot the data, I get the following chart of updates against timestamp. It is a summarised view at 15-minute intervals, taken from 653 routing update dumps. Graphing the data for all 653 dumps did not seem feasible, so I picked the top 300. Here’s how it looks:

Except for a few large spikes, it seems to have a relatively consistent pattern. We can see close to 50,000 fresh announcements daily.
This data alone doesn’t tell me much. So instead of looking at updates, I pulled last week’s RIBs and extracted the AS9829 announcements. The idea here is to map announcements to each upstream against timestamp, along with announcements across the various subnet masks.

Here’s total route announcement graph:
The graph above clearly shows that total route announcements increased on 23rd May at 06:00 UTC from 127664 to 129298. They then dipped significantly at 14:00 on 26th May to 124301. So between 10:00 and 14:00 on the 26th, routes dropped by as much as 4%, clearly indicating a large outage in their network.
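As a quick sanity check on those percentages, here is the arithmetic spelled out. It assumes the dip is measured from the 129298 level, since the exact count at 10:00 on the 26th is not quoted above.

```python
def pct_change(old: int, new: int) -> float:
    """Percentage change from old to new (negative means a drop)."""
    return (new - old) / old * 100


if __name__ == "__main__":
    rise = pct_change(127664, 129298)   # 23rd May, 06:00 UTC
    drop = pct_change(129298, 124301)   # 26th May, 14:00 (assumed baseline)
    print(f"rise: {rise:.2f}%  drop: {drop:.2f}%")
```

The rise works out to a little over 1% and the drop to a little under 4%, which matches the "as much as 4%" figure.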

The next part is to look at how they tweak their announcements to each upstream.

So clearly they are announcing a large number of routes to Tata AS6453, and these are IPLC links where they are buying IP transit outside India. Some of the key spikes are mirrored on the other transits, a clear hint that they balance circuits by moving route announcements between them.
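The per-upstream counts can be derived roughly as follows: for every RIB entry originated by AS9829, take the ASN that appears immediately before the first 9829 in the AS path. This is a minimal sketch that assumes the RIB has already been parsed into (prefix, AS path) pairs with plain space-separated paths and no AS_SETs. The prefixes and the ASNs other than 6453 and 9829 are documentation values used purely for illustration.

```python
from collections import Counter

ORIGIN_ASN = 9829  # BSNL


def upstream_of(as_path: str, origin: int = ORIGIN_ASN):
    """Return the ASN just before the first occurrence of `origin`.

    Using the first occurrence handles prepending (e.g. "6453 9829 9829").
    Returns None when the origin is absent or is the first hop.
    """
    hops = [int(asn) for asn in as_path.split()]
    if origin not in hops:
        return None
    idx = hops.index(origin)
    return hops[idx - 1] if idx > 0 else None


def routes_per_upstream(rib_entries):
    """Count routes per upstream for an iterable of (prefix, as_path)."""
    counts = Counter()
    for _prefix, as_path in rib_entries:
        upstream = upstream_of(as_path)
        if upstream is not None:
            counts[upstream] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample entries, not real route-views data.
    sample = [
        ("192.0.2.0/24", "64500 6453 9829"),
        ("198.51.100.0/24", "64501 6453 9829 9829 9829"),
        ("203.0.113.0/24", "64500 64510 9829"),
    ]
    print(routes_per_upstream(sample))
```

Running this over each RIB snapshot and plotting the counts against the snapshot time gives a per-upstream series like the one discussed above.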
The next part is to view their announcements in terms of prefix size.

Both /20 and /22 seem relatively consistent, except for a dip on the 26th.
So all I can say based on the above data is the following:

  1. BSNL had some issues last week. Possibly one of their upstream pipes had problems and they increased their announcements on Tata AS6453 during that time.
  2. They are the only large operator buying transit from as many as 9 upstreams. This fragments capacity across at least 9, and possibly 30-40, circuits, resulting in a major capacity management challenge across these upstreams.
  3. They are announcing a large number of prefix sizes: /18, /20, /22, /23 and even /24s. This isn’t good practice at their scale.
  4. They need to start peering. They are the only network of that scale that isn’t peering, except with a couple of content players like Google AS15169. They need to peer aggressively inside India & follow the same outside India if they keep running such a network. Otherwise, even buying only domestic transit would be a better strategy.

Most of these problems can be fixed if BSNL consolidates its number of transits (and circuits per transit) and aggregates its routes. For a three-transit scenario, they can follow a /18, /20 and /22 strategy and leave /24s only for emergency cases to balance traffic.
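To illustrate what route aggregation buys, here is a small sketch using Python's standard `ipaddress` module. It counts prefixes per mask length and then collapses contiguous prefixes into their covering blocks. The prefixes are made-up private addresses, and the collapse is pure address arithmetic: it assumes all prefixes share the same origin and routing policy, which a real aggregation plan would need to verify first.

```python
import ipaddress
from collections import Counter


def prefix_length_histogram(prefixes):
    """Count how many prefixes are announced at each mask length."""
    return Counter(ipaddress.ip_network(p).prefixlen for p in prefixes)


def aggregate(prefixes):
    """Collapse adjacent and overlapping prefixes into the fewest blocks."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(net) for net in ipaddress.collapse_addresses(networks)]


if __name__ == "__main__":
    # Hypothetical announcements: four contiguous /24s plus one /22.
    announced = [
        "10.0.0.0/24",
        "10.0.1.0/24",
        "10.0.2.0/24",
        "10.0.3.0/24",
        "10.0.8.0/22",
    ]
    print(prefix_length_histogram(announced))
    print(aggregate(announced))
```

Here five announcements shrink to two /22s, so a flap on the underlying circuit generates fewer updates for the rest of the Internet to process.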