31 Mar

Dark spot in Global IPv6 routing

 

 

Fest time at college – good, since I get a lot of free time to spend looking at routing tables. It has been an interesting period too, since last week brought some major submarine cable cuts that had a huge impact on Indian networks.

Anyway, today's post is about an interesting issue in global IPv6 routing. There are "dark spots" in global IPv6 routing because of a peering dispute between two tier 1 ISPs: Hurricane Electric (AS6939) and Cogent Communications (AS174). What has happened is that the two tier 1 providers failed to reach an agreement to keep their IPv6 peering up. This has resulted in parts of the global IPv6 internet where packets from one network (and its downstream customers) can't reach the other network or its single-homed downstream networks.

The only publicly known information about the de-peering of Cogent from HE is Mike Leber's email to the NANOG mailing list here. Overall, Hurricane Electric is pretty open about peering and the networking community knows this well, so it is not hard to believe Mike's mail. In fact, they even baked a cake to cheer Cogent up at NANOG 47 in Dearborn, Michigan in 2009.

 

 

Why is the IPv6 internet broken when just two providers de-peered?

The answer lies in the fundamental definition of a Tier 1 network, i.e. a "transit-free" network. Hurricane Electric is the world's biggest IPv6 backbone in terms of number of interconnections, while Cogent Communications is a big ISP in the US and Europe with significant last-mile fiber in many areas of the US and a popular choice for cheap datacenter upstream transit.

Now, since both ISPs are tier 1, i.e. transit free, on the IPv6 internet, they simply do not pay anyone (at layer 3) to reach any network. Packets from HE can't get to Cogent simply because HE has no transit provider for IPv6 (in fact, it is the transit provider for a lot of networks!). At the same time, Cogent has no IPv6 transit provider either. Transit matters here because there are many networks in the world which are not directly connected. For example, India's BSNL doesn't connect to Hurricane Electric, and Tulip Telecom doesn't connect to AT&T directly, but packets can still be routed because in both cases they buy transit from an upstream network which eventually connects to, or peers with, AT&T.
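One can see the effect from any vantage point that sits behind only one of the two networks. A rough sketch, assuming a Linux host with working IPv6 and the usual iputils/traceroute tools, and assuming the hostnames below resolve to addresses inside HE and Cogent respectively:

# Trace towards Hurricane Electric over IPv6
traceroute6 he.net

# Trace towards Cogent over IPv6 (www.cogentco.com is my assumption for a
# target inside AS174)
traceroute6 www.cogentco.com

# From a network whose only IPv6 upstream is HE (or its customer cone), the
# second trace should die without ever entering AS174 – and vice versa for a
# Cogent-only network trying to reach HE.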

 

 

Looking at Cogent's IPv6 prefix – 2001:0550::/32, announced by AS174 – from Hurricane Electric's route server:

 

route-server> show bgp ipv6 2001:0550::/32
% Network not in table
route-server>
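For anyone who wants to repeat this: as far as I remember, HE's route server is publicly reachable over plain telnet (no login needed), and the lookup above is the standard IOS-style command:

telnet route-server.he.net
# then at the prompt:
#   show bgp ipv6 2001:0550::/32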

 

Cogent has no public route server, so I am using their looking glass to ping the IPv6 address of he.net to test connectivity:

PING he.net(he.net) 56 data bytes
From 2001:550:1:31f::1 icmp_seq=2 Destination unreachable: No route
From 2001:550:1:31f::1 icmp_seq=3 Destination unreachable: No route

--- he.net ping statistics ---
5 packets transmitted, 0 received, +2 errors, 100% packet loss, time 14003ms

 

 

Is the dark spot only in IPv6? What about their IPv4?

Yes, this problem is IPv6-specific. HE and Cogent do not peer in IPv4 either, but since HE is not a tier 1 in IPv4, it has a couple of transit providers, and those transit providers have a (peering or transit) relationship with Cogent.

Looking at Cogent's IPv4 address 38.100.128.10 (covered by 38.0.0.0/8) from HE's route server:

route-server> show ip bgp 38.0.0.0/8 long
BGP table version is 0, local router ID is 64.62.142.154
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale, R Removed
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
* i38.0.0.0 213.248.92.33 48 70 0 1299 174 i
* i 213.248.92.33 60 70 0 1299 174 i
* i 216.218.252.174 48 70 0 1299 174 i
* i 213.248.86.53 48 70 0 1299 174 i
* i 213.248.93.81 48 70 0 1299 174 i
* i 213.248.93.81 48 70 0 1299 174 i
* i 213.248.67.105 48 70 0 1299 174 i
* i 213.248.96.177 48 70 0 1299 174 i
* i 213.248.67.125 48 70 0 1299 174 i
* i 213.248.70.37 48 70 0 1299 174 i
* i 213.248.92.33 48 70 0 1299 174 i
* i 213.248.101.145 48 70 0 1299 174 i

(a short excerpt of a much longer output)

 

So clearly HE seems to be using AS1299 – Telia's global network, one of the IPv4 Tier 1 ISPs – to reach Cogent, and I would guess it is a transit provider for HE. At the same time, I can see a route from Cogent to HE in IPv4 via Global Crossing:

traceroute to 216.218.186.2 (216.218.186.2), 30 hops max, 60 byte packets
1 vl99.mag01.ord01.atlas.cogentco.com (66.250.250.89) 0.497 ms 0.444 ms
2 te0-5-0-3.ccr21.ord01.atlas.cogentco.com (154.54.45.193) 0.437 ms 0.569 ms
3 te0-5-0-5.ccr22.ord03.atlas.cogentco.com (154.54.44.162) 0.647 ms te0-5-0-1.ccr22.ord03.atlas.cogentco.com (154.54.43.230) 0.821 ms
4 Tenge4-4-10000M.ar3.CHI2.gblx.net (64.212.107.73) 0.554 ms 0.562 ms
5 Hurrican-Electric-LLC.Port-channel100.ar3.SJC2.gblx.net (64.214.174.246) 54.313 ms 54.016 ms
6 10gigabitethernet1-1.core1.fmt1.he.net (72.52.92.109) 54.792 ms 55.231 ms
7 * *
8 * *

 

So clearly the networks have IPv4 connectivity via HE's upstreams, Global Crossing (which is now part of Level 3) & Telia. In IPv6, HE simply does not have a customer relationship with Gblx or Telia, and so the dark spot remains.

 

Another fact that confirms Telia is a transit provider for HE (in IPv4) is the RADB record of AS1299:

 

Anurags-MacBook-Pro:~ anurag$ whois -h whois.radb.net as1299 | grep 6939
import: from AS6939 action pref=50; accept AS-HURRICANE
export: to AS6939 announce ANY
mp-import: afi ipv6 from AS6939 accept AS-HURRICANE
mp-export: afi ipv6 to AS6939 announce AS-TELIANET-V6
Anurags-MacBook-Pro:~ anurag$

 

Clearly Telia is announcing ANY, i.e. the full routing table, to HE in IPv4, while in IPv6 it is announcing only AS-TELIANET-V6 – in other words, transit in IPv4 but just peering in IPv6.
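The same relationship can be cross-checked from the HE side, assuming AS6939 registers similar import/export policy lines in its aut-num object (the grep pattern is just illustrative):

# What does AS6939 say it accepts from / announces to Telia (AS1299)?
whois -h whois.radb.net AS6939 | grep -i 1299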

 

With the hope that this issue gets resolved in the near future, it's time for me to get some sleep! 🙂

 

Disclaimer: The focus of this blog post is not on who is responsible for not peering and creating this situation, but rather on a technical analysis of what happens when big Tier 1 ISPs de-peer.

Comments are personal and have nothing to do with my employer. I know most of the people mentioned in this post personally, and that fact has nothing to do with this blog post either!

 

21 Nov

Analysis: Inconsistent latency between two end points

An interesting evening here in the village. Sessional tests started at college today, and so do my blog posts (to keep myself charged with positive energy!) 😉

 

 

 

Learned something new while troubleshooting. 🙂

I am used to getting a latency of ~350ms to my server in Europe, as I have mentioned in past blog posts. My connection > server path goes direct, but the return path goes via the US, and this is what pushes the latency up. Today, all of a sudden, I saw a latency of 200ms to my server. 150ms less – that's significant.

My immediate thought was that BSNL had changed its BGP announcements and was likely announcing the prefix in Europe to get a direct return path. To confirm that, I connected to my server and fired off a traceroute. It gave me a strange result: latency of over 350ms, just as it had always been. I pinged my home router from the server and the latency was still 350ms, while from the other side, i.e. my home connection > server, the ping was 200ms. Very, very strange!

Remember: packets CAN take different forward and return paths, but the round trip is the sum of both legs, so it stays the same whichever end you measure from. So if home > server is 200ms (for SURE not via the US), then how come server > home goes via the US at 350ms? Based on my understanding, even if the forward and return routes are different, the round-trip latency between the same two addresses stays the same.
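Written out, just to make the logic explicit:

RTT measured at home   = delay(home -> server) + delay(server -> home)
RTT measured at server = delay(server -> home) + delay(home -> server)

Same two one-way legs, same sum – so both ends must report the same value for the same pair of addresses.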

 

I decided to collect some more data to confirm my observation, so I ran a 1000-packet ping from both points – home to server and server to home.
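For reference, this was nothing fancier than the standard Linux ping with a fixed count (the destination below is a placeholder):

# 1000 probes, print only the summary at the end
ping -c 1000 -q <destination-ip>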

 

Results: 

Home > Server ping:

1000 packets transmitted, 1000 received, 0% packet loss, time 999739ms
rtt min/avg/max/mdev = 197.128/201.314/294.035/11.347 ms

Server > Home ping:

1000 packets transmitted, 960 received, 4% packet loss, time 999710ms
rtt min/avg/max/mdev = 319.060/377.464/499.294/18.203 ms

 

Clearly, home > server is ~200ms while server > home is 320ms+ (377ms on average). So what is going on here?

I did a trace from both endpoints to compare the latency at each hop; the return path is clearly via the US.

 

Home to server trace:

anurag@laptop:~$ traceroute 178.238.225.247 -A
traceroute to 178.238.225.247 (178.238.225.247), 30 hops max, 60 byte packets
1 router.local (10.0.0.1) [AS1] 3.578 ms 3.894 ms 4.234 ms
2 117.200.48.1 (117.200.48.1) [AS9829] 30.015 ms 32.443 ms 35.071 ms
3 218.248.173.46 (218.248.173.46) [AS9829] 37.402 ms 39.812 ms 43.527 ms
4 115.254.1.138 (115.254.1.138) [AS18101] 50.175 ms 52.586 ms 55.038 ms
5 115.255.252.57 (115.255.252.57) [AS18101] 80.995 ms 84.568 ms 92.177 ms
6 62.216.147.101 (62.216.147.101) [AS15412] 321.376 ms 287.380 ms 299.725 ms
7 xe-8-3-0.0.pjr04.mmb004.flagtel.com (85.95.26.69) [AS15412] 74.802 ms 76.024 ms 77.518 ms
8 xe-0-0-0.0.pjr04.ldn001.flagtel.com (85.95.25.186) [AS15412] 362.197 ms 364.614 ms 367.255 ms
9 xe-11-0-0.edge5.London1.Level3.net (212.187.138.53) [AS3356/AS9057] 365.825 ms 368.374 ms 370.728 ms
10 ae-52-52.csw2.London1.Level3.net (4.69.139.120) [AS3356] 383.136 ms 386.773 ms 387.971 ms
11 ae-57-222.ebr2.London1.Level3.net (4.69.153.133) [AS3356] 391.669 ms ae-59-224.ebr2.London1.Level3.net (4.69.153.141) [AS3356] 384.353 ms ae-58-223.ebr2.London1.Level3.net (4.69.153.137) [AS3356] 390.391 ms
12 ae-24-24.ebr2.Frankfurt1.Level3.net (4.69.148.198) [AS3356] 367.976 ms 368.210 ms ae-23-23.ebr2.Frankfurt1.Level3.net (4.69.148.194) [AS3356] 374.705 ms
13 ae-82-82.csw3.Frankfurt1.Level3.net (4.69.140.26) [AS3356] 368.563 ms ae-62-62.csw1.Frankfurt1.Level3.net (4.69.140.18) [AS3356] 371.392 ms ae-82-82.csw3.Frankfurt1.Level3.net (4.69.140.26) [AS3356] 380.242 ms
14 ae-71-71.ebr1.Frankfurt1.Level3.net (4.69.140.5) [AS3356] 366.064 ms ae-81-81.ebr1.Frankfurt1.Level3.net (4.69.140.9) [AS3356] 367.639 ms ae-61-61.ebr1.Frankfurt1.Level3.net (4.69.140.1) [AS3356] 375.044 ms
15 ae-1-19.bar1.Munich1.Level3.net (4.69.153.245) [AS3356] 388.552 ms 390.943 ms 393.417 ms
16 GIGA-HOSTIN.bar1.Munich1.Level3.net (62.140.24.126) [AS9057/AS3356] 222.769 ms 225.136 ms 227.559 ms
17 server7 (178.238.225.247) [AS51167] 231.750 ms 234.424 ms 236.893 ms

 

(Note the clear latency drop between the 15th and 16th hops – a hint that replies from different hops are taking different return paths.)

 

Server > Home trace:

root@server7:~# traceroute 117.200.61.87
traceroute to 117.200.61.87 (117.200.61.87), 30 hops max, 60 byte packets
1 gw.giga-dns.com (91.194.90.1) 0.725 ms 0.719 ms 0.705 ms
2 host-93-104-204-33.customer.m-online.net (93.104.204.33) 0.694 ms 0.678 ms 0.669 ms
3 xe-1-1-0.rt-decix-2.m-online.net (82.135.16.102) 7.859 ms 7.855 ms 7.853 ms
4 xe-1-1-0.rt-decix-2.m-online.net (82.135.16.102) 7.592 ms 7.597 ms 7.591 ms
5 213.198.72.237 (213.198.72.237) 8.048 ms 8.048 ms 8.271 ms
6 ae-5.r21.frnkge03.de.bb.gin.ntt.net (129.250.4.162) 7.776 ms ae-2.r20.frnkge04.de.bb.gin.ntt.net (129.250.5.217) 8.073 ms ae-5.r21.frnkge03.de.bb.gin.ntt.net (129.250.4.162) 7.820 ms
7 ae-0.r20.frnkge04.de.bb.gin.ntt.net (129.250.2.13) 8.067 ms ae-1.r21.asbnva02.us.bb.gin.ntt.net (129.250.3.20) 98.758 ms ae-0.r20.frnkge04.de.bb.gin.ntt.net (129.250.2.13) 8.042 ms
8 ae-0.r20.asbnva02.us.bb.gin.ntt.net (129.250.4.4) 93.975 ms ae-1.r21.asbnva02.us.bb.gin.ntt.net (129.250.3.20) 113.841 ms ae-0.r20.asbnva02.us.bb.gin.ntt.net (129.250.4.4) 93.847 ms
9 ae-0.r20.asbnva02.us.bb.gin.ntt.net (129.250.4.4) 99.587 ms 93.837 ms 93.829 ms
10 ae-2.r04.lsanca03.us.bb.gin.ntt.net (129.250.5.70) 168.760 ms 168.761 ms 167.983 ms
11 ae-2.r04.lsanca03.us.bb.gin.ntt.net (129.250.5.70) 167.236 ms xe-0-1-0-10.r04.lsanca03.us.ce.gin.ntt.net (198.172.90.222) 159.727 ms ae-2.r04.lsanca03.us.bb.gin.ntt.net (129.250.5.70) 160.751 ms
12 xe-0-1-0-10.r04.lsanca03.us.ce.gin.ntt.net (198.172.90.222) 163.235 ms 160.475 ms 165.715 ms
13 115.254.1.137 (115.254.1.137) 340.748 ms 344.682 ms 338.766 ms
14 115.254.1.137 (115.254.1.137) 336.735 ms 218.248.255.101 (218.248.255.101) 335.243 ms 115.254.1.137 (115.254.1.137) 338.721 ms
15 218.248.173.41 (218.248.173.41) 345.967 ms 218.248.255.101 (218.248.255.101) 335.266 ms 342.750 ms
16 218.248.173.41 (218.248.173.41) 342.260 ms 345.005 ms 343.014 ms
17 218.248.173.41 (218.248.173.41) 347.736 ms * 353.451 ms
18 * * *
19 * * *
20 * * *

 

 

This brings me back to the question – if the return path is via the US, how come the latency is only 200ms when pinging from home?

 

All of a sudden I thought of the multiple subnets on the server! (YES YES YES!)

There are multiple subnets configured on the server, with IPs belonging to completely different ranges and, in fact, to different ASNs and BGP announcements (here we go!). Let's call them IP1 and IP2, where IP1 is the primary address (the gateway of IP1's range is the server's default route) and IP2 is secondary. So far I had been pinging IP2. Let's ping IP1 from home:

5 packets transmitted, 5 received, 0% packet loss, time 4000ms
rtt min/avg/max/mdev = 396.381/399.734/403.037/2.617 ms

 

Clearly expected latency!! 🙂

 

Explanation:

Here's what was happening – the server has two IPs, IP1 and IP2.
IP1 is the primary address and is covered by the BGP announcement of M-Online (a German ISP), while IP2 is secondary and is covered by the BGP announcement of the datacenter itself (they recently got an ASN and run their own autonomous network). M-Online is a relatively big ISP with transit from multiple providers including Level3 and Telia, plus a lot of peering, while the datacenter's network has almost no peering, just transit from Level3 and Telia. Thus IP1 (primary) is from M-Online, and the server's default route also points to the M-Online gateway.
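A quick way to confirm that the two addresses really sit behind different origin ASNs is an IP-to-ASN lookup – a rough sketch using Team Cymru's public whois service (IP1/IP2 are placeholders for the actual addresses):

# Map each address to the origin AS seen in the global table
whois -h whois.cymru.com " -v <IP1>"
whois -h whois.cymru.com " -v <IP2>"

# The two answers should show different AS numbers – M-Online's for IP1 and
# the datacenter's own for IP2.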

But since IP2 is from a different subnet, covered by a different BGP announcement, traffic sourced from it follows the datacenter's own gateway. For IP1, the preferred return route goes via Level3: Level3 (EU) > Level3 (US) > Reliance-FLAG (US) > BSNL (India) – hence a trip via the US with high latency. For IP2, which is from the datacenter's network, they seem to prefer Telia (rather than Level3). Telia appears to have a better relationship with Tata AS6453, which is one of BSNL's upstream transit providers besides Reliance. Also, Tata carries BSNL's prefixes on its border routers in London, so the path followed for IP2 is: Datacenter > Telia (Europe) > Tata (Europe) > Tata (India) > Tata-VSNL (India) > BSNL. This is a better path and gives relatively low latency.
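Before looking at this from Telia's side, the source-address dependence can also be checked from the server itself by pinning the source address of the probes – a rough sketch, assuming the standard Linux traceroute/ping and using IP1/IP2 as placeholders:

# Confirm which address is primary and where the default route points
ip addr show
ip route show

# Probe home with the primary address (IP1) as source, then with IP2;
# replies come back to whichever source was used, so the per-hop RTTs (and
# the overall RTT) should differ if the return paths really are different
traceroute -s <IP1> 117.200.61.87
traceroute -s <IP2> 117.200.61.87
ping -c 10 -I <IP2> 117.200.61.87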

 

TeliaSonera Looking Glass – traceroute inet 117.200.61.87 as-number-lookup

Router: Munich
Command: traceroute inet 117.200.61.87 as-number-lookup

traceroute to 117.200.61.87 (117.200.61.87), 30 hops max, 40 byte packets
1 ffm-bb1-link.telia.net (213.155.134.12) 7.760 ms 8.963 ms ffm-bb1-link.telia.net (80.91.248.30) 9.625 ms
2 ffm-b2-link.telia.net (80.91.252.168) 8.028 ms ffm-b2-link.telia.net (80.91.246.225) 8.049 ms ffm-b2-link.telia.net (80.91.252.168) 18.213 ms
3 teleglobe-122701-ffm-b2.telia.net (213.248.69.38) 8.105 ms 8.014 ms 8.120 ms
4 if-3-2.tcore1.PVU-Paris.as6453.net (80.231.153.53) [AS 6453] 135.317 ms if-5-2.tcore1.PVU-Paris.as6453.net (80.231.153.121) [AS 6453] 137.277 ms if-3-2.tcore1.PVU-Paris.as6453.net (80.231.153.53) [AS 6453] 153.583 ms
MPLS Label=332640 CoS=0 TTL=1 S=1
5 if-2-2.tcore1.PYE-Paris.as6453.net (80.231.154.18) [AS 6453] 145.658 ms if-12-2.tcore1.PYE-Paris.as6453.net (80.231.154.69) [AS 6453] 135.219 ms if-2-2.tcore1.PYE-Paris.as6453.net (80.231.154.18) [AS 6453] 131.225 ms
MPLS Label=522146 CoS=0 TTL=1 S=1
6 if-8-1600.tcore1.WYN-Marseille.as6453.net (80.231.217.5) [AS 6453] 135.192 ms 149.243 ms 135.519 ms
MPLS Label=646803 CoS=0 TTL=1 S=1
7 if-9-5.tcore1.MLV-Mumbai.as6453.net (80.231.217.18) [AS 6453] 152.803 ms 138.552 ms 171.287 ms
8 180.87.38.74 (180.87.38.74) [AS 6453] 133.772 ms 134.308 ms 134.257 ms
9 115.114.59.170.static-Mumbai.vsnl.net.in (115.114.59.170) [AS 4755] 179.789 ms 179.780 ms 179.786 ms
10 218.248.255.101 (218.248.255.101) [AS 9829] 186.435 ms 176.686 ms 176.350 ms
11 218.248.173.41 (218.248.173.41) [AS 9829] 184.978 ms 185.500 ms 194.190 ms
12 218.248.173.41 (218.248.173.41) [AS 9829] 185.008 ms 184.857 ms 187.702 ms
13 * * *
14 * * *

 

This is why the home > server vs server > home latency differs. Case closed! 🙂

 

Heart of the issue:

The core cause of all these strange results is BSNL itself. On one hand, it purchases transit in India from Indian ISPs like Tata. This works well: since it purchases transit, it announces its routes to Tata-VSNL and gets the global table from Tata-VSNL. Tata-VSNL then announces these prefixes to Tata Communications' main Tier 1 backbone, AS6453, which announces them consistently across its border routers in the US, Europe and Asia. At the same time, BSNL makes crazy purchases of IPLC (International Private Leased Circuit) capacity and connects to ISPs' routers outside India over layer 2 leased circuits; here BSNL's router in India runs a BGP session with the upstream's router outside India. That is OK, BUT it should then be done everywhere. Since BSNL uses the IPLC setup only towards the US, it ends up with only a few paths into its network, most of them via its upstreams' New York or Los Angeles routers. Either BSNL has to stop using IPLCs and simply purchase bandwidth in India, or it has to set up BGP sessions in Europe and Asia as well.
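One way to see how a BSNL prefix is reachable globally is to ask a public route collector for it and look at the AS paths into AS9829 – a rough sketch, using the home IP traced above (riswhois and Route Views are public services; the exact prompt and output format may differ):

# RIPE RIS: quick origin/prefix lookup for the address
whois -h riswhois.ripe.net 117.200.61.87

# Route Views: full set of AS paths for the covering prefix
#   telnet route-views.routeviews.org
#   route-views> show ip bgp 117.200.61.87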

 

Comparison of setup

Let's pick the transit provider from whom BSNL purchases within India – Tata. The clear identification here is that BSNL connects to Tata-VSNL AS4755, and AS4755 in turn connects to Tata AS6453. AS4755 is used only within India, while AS6453 is used everywhere else. In India, AS6453 generally connects only to AS4755, so the AS4755 > AS9829 hand-off happens only in India. 🙂

So let's look up the routing table on Tata's routers in Mumbai, London, New York and Singapore:

Router: gin-mlv-core1
Site: IN, Mumbai – MLV, VSNL
Command: show ip bgp 117.200.61.87

BGP routing table entry for 117.200.48.0/20
Bestpath Modifiers: deterministic-med
Paths: (3 available, best #3)
Multipath: eBGP
4755 9829
mlv-tcore1. (metric 1) from cxr-tcore1. (66.110.10.113)
Origin IGP, valid, internal
Community:
Originator: 66.110.10.202
4755 9829
mlv-tcore1. (metric 1) from mlv-tcore2. (66.110.10.215)
Origin IGP, valid, internal
Community:
Originator: 66.110.10.202
4755 9829
mlv-tcore1. (metric 1) from mlv-tcore1. (66.110.10.202)
Origin IGP, valid, internal, best
Community:

 

Router: gin-lhx-core1
Site: GB, London – LHX, TATA COMM. HARBOR EXCHANGE
Command: show ip bgp 117.200.61.87

BGP routing table entry for 117.200.48.0/20
Bestpath Modifiers: deterministic-med
Paths: (2 available, best #1)
Multipath: eBGP
4755 9829
mlv-tcore1. (metric 3636) from ldn-mcore3. (ldn-mcore3.)
Origin IGP, valid, internal, best
Community:
Originator: 66.110.10.202
4755 9829
mlv-tcore1. (metric 3636) from l78-tcore1. (66.110.10.237)
Origin IGP, valid, internal
Community:

 

Router: gin-nyt-core1
Site: US, New York – NYT, TELEHOUSE
Command: show ip bgp 117.200.61.87

BGP routing table entry for 117.200.48.0/20
Bestpath Modifiers: deterministic-med
Paths: (1 available, best #1)
4755 9829
tv2-tcore2. (metric 13282) from nyy-tcore1. (66.110.10.204)
Origin IGP, valid, internal, best
Community:
Originator: 66.110.10.202
 
 

Router: gin-s9r-core1
Site: SG, Singapore – S9R, GLOBAL SW F6-SC6
Command: show ip bgp 117.200.61.87

BGP routing table entry for 117.200.48.0/20
Bestpath Modifiers: deterministic-med
Paths: (8 available, best #6)
Multipath: eBGP
4755 9829
mlv-tcore1. (metric 3176) from tv2-core1. (tv2-core1.)
Origin IGP, valid, internal
Community:
Originator: 66.110.10.202
4755 9829
mlv-tcore1. (metric 3176) from hk2-core3. (hk2-core3.)
Origin IGP, valid, internal
Community:
Originator: 66.110.10.202
4755 9829
mlv-tcore1. (metric 3176) from klt-tcore1. (66.110.11.12)
Origin IGP, valid, internal
Community:
Originator: 66.110.10.202
4755 9829
mlv-tcore1. (metric 3176) from pye-core1. (pye-core1.)
Origin IGP, valid, internal
Community:
Originator: 66.110.10.202
4755 9829
mlv-tcore1. (metric 3176) from l78-mcore3. (Loopback5.mcore3.L78-London.)
Origin IGP, valid, internal
Community:
Originator: 66.110.10.202
4755 9829
mlv-tcore1. (metric 3176) from mlv-tcore1. (66.110.10.202)
Origin IGP, valid, internal, best
Community:
4755 9829
mlv-tcore1. (metric 3176) from rsd-core1. (rsd-core1.)
Origin IGP, valid, internal
Community:
Originator: 66.110.10.202
4755 9829
mlv-tcore1. (metric 3176) from jsd-core1. (jsd-core1.)
Origin IGP, valid, internal
Community:
Originator: 66.110.10.202

 

 

Clearly, all paths go to Tata-VSNL AS4755 in India directly (not via the US) and then on to BSNL. This is not the case with their IPLC bandwidth purchases, where the entry point into the network is only a router of BSNL's upstream in New York or Los Angeles.

This is how BSNL screws itself up. Anyone listening?

 

 

Oh, and by the way, I was "made to write" a 2-page answer to a question in today's sessional test at college. The question was: what is TCP/IP? I wish the teacher would ignore the crap I wrote there and look at this blog post instead. THIS IS TCP/IP! 🙂