Transit at IXP & next-hop-self
College has started again after a pretty good Holi break. I am again having a somewhat painful time because of the hot weather, and this is just the start of summer. All I can hope is that there won’t be voltage issues in the village again (like last time). Just to be sure about that, I have filed two RTIs asking the Power department for details of their preparation. :)
Anyway, coming to the topic of the day - the effect of “next-hop-self” at an IXP where a network has both peers and transit customers. To be clear from the start - this post sticks to the technical side and does not go into IXP policy.
OK, let’s consider an Internet Exchange Point (IXP) with three participants - A, B and C. A is a very big ISP, B is big (though not as big as A), and C is pretty small. All three are connected to the same switch and share the same broadcast subnet, 10.0.0.0/24. A is AS1 and has been allocated 10.0.0.1 (from the IXP’s /24), B is AS2 with 10.0.0.2, and C is AS3 with 10.0.0.3.
Now B “requests” peering from A, and A decides that since B carries a significant part of the routing table, it’s a good idea to peer. So A and B start peering. Next, C goes to A with the same request. Considering the (small) size of C, A rejects the peering request and instead offers paid transit at some price X. B hears about this and goes to C offering “cheap transit” to reach A (since B already peers with A). Eventually C agrees and becomes a downstream customer of B.
I have set this scenario up in GNS, and here’s how things look:
Looking at router B (AS2):
b.net>sh ip bgp summary
BGP router identifier 10.0.0.2, local AS number 2
BGP table version is 6, main routing table version 6
5 network entries using 485 bytes of memory
5 path entries using 180 bytes of memory
3 BGP path attribute entries using 180 bytes of memory
2 BGP AS-PATH entries using 48 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 893 total bytes of memory
BGP activity 5/0 prefixes, 5/0 paths, scan interval 60 secs
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.0.0.1        4     1      25      26        6    0    0 00:20:02        3
10.0.0.3        4     3       5       6        6    0    0 00:00:39        1
b.net>
So we have two sessions - one with A (peering) and one with C (customer).
Now let’s check customer C’s router to see what it has in its BGP table:
c.net>
c.net>sh ip bgp
BGP table version is 6, local router ID is 10.0.0.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network          Next Hop            Metric LocPrf Weight Path
*> 60.0.0.0/24      10.0.0.1                               0 2 1 i
*> 61.0.0.0/24      10.0.0.1                               0 2 1 i
*> 62.0.0.0/24      10.0.0.1                               0 2 1 i
*> 70.0.0.0/24      10.0.0.2                 0             0 2 i
*> 80.0.0.0/24      0.0.0.0                  0         32768 i
c.net>
OK - we can see C is receiving 60.0.0.0/24, 61.0.0.0/24 and 62.0.0.0/24, which are originated by AS1 (ISP A). The AS path is 2 > 1, which looks fine, but the “next hop” is 10.0.0.1, which is the IP of router A. Because A, B and C all sit on one subnet, B passes A’s address on unchanged as the next hop (BGP’s third-party next-hop behaviour) instead of inserting its own. So the traffic is NOT going via B: the AS path says C > B > A, but the actual flow of packets is C > A directly.
Let’s look at a traceroute:
c.net>traceroute 60.0.0.1
Type escape sequence to abort.
Tracing the route to 60.0.0.1
1 10.0.0.1 16 msec 16 msec *
c.net>
This is something some people dislike, because traffic flows directly between C and A while C is paying B. So B is making money without even carrying the traffic on its ports!
Again - I am not going into whether this argument is good or not, because then we end up at the question of why A didn’t peer with C when they were already on the same switch?! On the other hand, if traffic does flow through B’s port, it has to cross the IXP switch twice in each direction: C > switch > B and B > switch > A on the way out, then A > switch > B and B > switch > C on the way back.
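To put numbers on that, here is a tiny Python sketch. It is only a toy model under my own assumptions: every hop between two IXP members counts as one crossing of the shared switch, and the return traffic follows the reverse of the forward path. The function names are made up for this illustration.

```python
def switch_crossings(path):
    """One-way crossings of the IXP switch for routers visited in order."""
    # Each consecutive pair of routers is one frame through the shared switch.
    return len(path) - 1


def round_trip_crossings(path):
    """Crossings for forward plus return traffic (assumes a symmetric return path)."""
    return 2 * switch_crossings(path)


direct = ["C", "A"]       # next hop left as A (10.0.0.1)
via_b = ["C", "B", "A"]   # traffic actually handed to B (10.0.0.2) first

print(round_trip_crossings(direct))  # 2
print(round_trip_crossings(via_b))   # 4
```

So pushing the traffic through B doubles the load on the switch for the same packets, and the extra crossings land on B’s port.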
What if the IXP tries to make sure that this direct flow doesn’t happen?
What if the IXP enforces “next-hop-self”?
First, let’s recall what exactly “next-hop-self” means:
It’s an important per-neighbour option in BGP sessions that makes a router put itself as the “next hop” for the adjacent peer. It is needed in lots of cases where the router originating a prefix appears to be on the same subnet but is not actually reachable by the other peer.
Let’s reconfigure B.net’s router for its neighbour C.net at 10.0.0.3:
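As a rough illustration, the next-hop decision for this lab can be sketched in Python. This is a simplified model and not how any router implements it: the function name, its parameters and the single hard-coded IXP subnet are my own assumptions, and real BGP implementations consider more conditions than this.

```python
import ipaddress

# Shared IXP LAN from this post's lab.
IXP_SUBNET = ipaddress.ip_network("10.0.0.0/24")


def advertised_next_hop(learned_next_hop, own_ip, peer_ip, next_hop_self=False):
    """Next hop a router announces to an eBGP peer (simplified model)."""
    if next_hop_self:
        # next-hop-self: always announce our own address.
        return own_ip
    next_hop = ipaddress.ip_address(learned_next_hop)
    peer = ipaddress.ip_address(peer_ip)
    # Third-party next hop: if the peer shares the subnet with the learned
    # next hop, it can reach that router directly, so leave it unchanged.
    if next_hop in IXP_SUBNET and peer in IXP_SUBNET:
        return learned_next_hop
    return own_ip


# B (10.0.0.2) announcing A's prefixes (learned next hop 10.0.0.1) to C (10.0.0.3):
print(advertised_next_hop("10.0.0.1", "10.0.0.2", "10.0.0.3"))
# 10.0.0.1
print(advertised_next_hop("10.0.0.1", "10.0.0.2", "10.0.0.3", next_hop_self=True))
# 10.0.0.2
```

The first call is the default case, where C ends up with A’s address as next hop. The second is what B announces once the flag is set on the session towards C.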
b.net>enable
Password:
b.net#conf t
Enter configuration commands, one per line. End with CNTL/Z.
b.net(config)#router b
b.net(config)#router bgp 2
b.net(config-router)#nei
b.net(config-router)#neighbor 10.0.0.3 next-hop-self
b.net(config-router)#end
b.net#
00:37:18: %SYS-5-CONFIG_I: Configured from console by console
b.net#
Checking the BGP table on router C again:
c.net>sh ip bgp
BGP table version is 9, local router ID is 10.0.0.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network          Next Hop            Metric LocPrf Weight Path
*> 60.0.0.0/24      10.0.0.2                               0 2 1 i
*> 61.0.0.0/24      10.0.0.2                               0 2 1 i
*> 62.0.0.0/24      10.0.0.2                               0 2 1 i
*> 70.0.0.0/24      10.0.0.2                 0             0 2 i
*> 80.0.0.0/24      0.0.0.0                  0         32768 i
c.net>
And as we can see, the next hop is now 10.0.0.2, so traffic will actually pass through B:
c.net>traceroute 60.0.0.1
Type escape sequence to abort.
Tracing the route to 60.0.0.1
1 10.0.0.2 16 msec 20 msec 20 msec
2 10.0.0.1 44 msec 44 msec *
c.net>
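The two traceroutes can be reproduced with a small Python sketch that simply follows next hops from router to router. The tables below are a hand-written simplification of the lab: C’s entries are taken from the “sh ip bgp” outputs in this post, while B’s entry is inferred from the second traceroute. The `trace` function and the table layout are hypothetical, purely for illustration.

```python
def trace(source, prefix, tables, origin, max_hops=10):
    """Follow next hops from source until the origin router is reached."""
    hops = []
    current = source
    # max_hops guards against a loop in a badly built table.
    while current != origin and len(hops) < max_hops:
        current = tables[current][prefix]
        hops.append(current)
    return hops


PREFIX = "60.0.0.0/24"
A, B, C = "10.0.0.1", "10.0.0.2", "10.0.0.3"

# Default behaviour: C's next hop for A's prefix is A itself.
default_tables = {C: {PREFIX: A}, B: {PREFIX: A}}
# With next-hop-self on B's session to C: C's next hop is B.
nhs_tables = {C: {PREFIX: B}, B: {PREFIX: A}}

print(trace(C, PREFIX, default_tables, origin=A))  # ['10.0.0.1']
print(trace(C, PREFIX, nhs_tables, origin=A))      # ['10.0.0.2', '10.0.0.1']
```

The AS path seen by C is 2 1 in both cases. Only the next hop in C’s table changes, and that alone decides which router receives the packets.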
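In general, the default behaviour of BGP on a shared subnet like this is to leave the next hop pointing at the router that is actually closest to the destination, and ideally packets should go to that router directly rather than crossing the IXP switch twice. So well, that’s about it.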
Time to get ready for college class!
Disclaimer: This post is completely a reflection of my personal thoughts and has NOTHING to do with my employer. It does not reflect the thoughts or vision of my employer.