Router problem

Forum: LXer Meta Forum
Total Replies: 8
Author Content
djohnston

Jan 16, 2013
7:16 PM EDT
An internet router problem caused me to lose connection to the LXer site today. The outage lasted 30 minutes or less. I lose access to the (very fast loading) LXer site intermittently. I would guesstimate it happens a couple of times a month.

I am not complaining. I'm just making it known. Traceroute and ping information during the time I could not access the site is shown below. (I've had to add extra carriage returns to make the display somewhat legible.)

traceroute to lxer.com (108.166.170.174), 30 hops max, 60 byte packets

1 192.168.1.1 (192.168.1.1) 0.238 ms 0.287 ms 0.427 ms

2 10.6.0.1 (10.6.0.1) 10.985 ms 11.010 ms 11.033 ms

3 COX-68-12-10-113-static.coxinet.net (68.12.10.113) 11.077 ms 11.102 ms 11.124 ms

4 COX-68-12-10-2-static.coxinet.net (68.12.10.2) 12.248 ms 12.973 ms 15.624 ms

5 68.1.5.140 (68.1.5.140) 21.692 ms 20.986 ms 21.600 ms

6 gigabitethernet2-12.ar5.GRU1.gblx.net (64.209.94.93) 38.124 ms 16.262 ms 14.055 ms

7 te1-1-10G.asr1.DAL2.gblx.net (67.17.79.110) 17.617 ms 17.568 ms 17.512 ms

8 highwinds-network-group.tengigabitethernet2-2.asr1.dal2.gblx.net (208.48.236.130) 18.544 ms 18.495 ms 18.408 ms

9 * * *

10 * * *

11 * * *

12 * * *

13 * * *

14 * * *

15 * * *

16 * * *

17 * * *

18 * * *

19 * * *

20 * * *

21 * * *

22 * * *

23 * * *

24 * * *

25 * * *

26 * * *

27 * * *

28 * * *

29 * * *

30 * * *

PING lxer.com (108.166.170.174) 56(84) bytes of data.

--- lxer.com ping statistics ---

10 packets transmitted, 0 received, 100% packet loss, time 9006ms

Wed Jan 16 04:25:55 PM CST 2013
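For anyone who wants to pull the "last hop that answered" out of a trace like the one above, here is a minimal Python sketch. It assumes the usual Linux traceroute text layout (hop number first, responding address in parentheses, `* * *` for no reply). The function name `last_responding_hop` and the `SAMPLE` constant are made up for illustration; the sample lines are copied from the trace in this post.

```python
import re

# A few hop lines copied from the failed traceroute above.
SAMPLE = """\
 7 te1-1-10G.asr1.DAL2.gblx.net (67.17.79.110) 17.617 ms 17.568 ms 17.512 ms
 8 highwinds-network-group.tengigabitethernet2-2.asr1.dal2.gblx.net (208.48.236.130) 18.544 ms 18.495 ms 18.408 ms
 9 * * *
10 * * *
"""

# Hop lines start with the hop number; the header line does not.
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
# A responding hop shows its IPv4 address in parentheses.
ADDR_RE = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3})\)")


def last_responding_hop(output):
    """Return (hop_number, ip) for the last hop that replied, or None."""
    last = None
    for line in output.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # skip the header and blank lines
        hop, rest = int(match.group(1)), match.group(2)
        addr = ADDR_RE.search(rest)
        if addr:  # "* * *" lines have no address and are ignored
            last = (hop, addr.group(1))
    return last


print(last_responding_hop(SAMPLE))
```

On the sample lines this reports hop 8 (208.48.236.130), which matches the trace: everything up to the Highwinds hand-off answered and nothing beyond it did.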



Here's the traceroute info after I was once again able to access the site.

traceroute to lxer.com (108.166.170.174), 30 hops max, 60 byte packets

1 192.168.1.1 (192.168.1.1) 0.201 ms 0.269 ms 0.331 ms

2 10.6.0.1 (10.6.0.1) 9.536 ms 9.804 ms 9.880 ms

3 COX-68-12-10-113-static.coxinet.net (68.12.10.113) 9.916 ms 9.938 ms 9.966 ms

4 COX-68-12-10-2-static.coxinet.net (68.12.10.2) 10.312 ms 10.303 ms 14.526 ms

5 68.1.5.140 (68.1.5.140) 20.231 ms 20.136 ms 20.052 ms

6 gigabitethernet2-12.ar5.GRU1.gblx.net (64.209.94.93) 20.159 ms 14.439 ms TenGigabitEthernet6-1.ar3.DAL2.gblx.net (208.51.117.137) 29.460 ms

7 te1-1-10G.asr1.DAL2.gblx.net (67.17.79.110) 29.536 ms 29.486 ms 29.706 ms

8 highwinds-network-group.tengigabitethernet2-2.asr1.dal2.gblx.net (208.48.236.130) 30.262 ms 30.156 ms 30.271 ms

9 cust-108-166-160-18.corexchange.com (108.166.160.18) 35.221 ms 35.848 ms 35.395 ms

10 dal4.wmkt.net (108.166.170.174) 35.273 ms 35.297 ms 35.321 ms

Wed Jan 16 05:02:08 PM CST 2013

bob

Jan 16, 2013
8:09 PM EDT
LXer.com experienced a service outage today from 2212 UTC to 2240 UTC for a total of 28 minutes. The outage was due to a failure of the backup power system at our data center in Dallas. Several thousand web sites were affected.

This is a highly unusual event in that professional data centers are usually very good at providing high availability continuous power to their users. I don't have the details yet from our data center as to why the backup power systems failed, but I will pass that along when I get the info.

Sorry for the inconvenience to the LXer readers during this 28 minute outage.
djohnston

Jan 17, 2013
12:30 AM EDT
Thanks, bob. No apologies necessary. This is a fine site and a great group of people.
dinotrac

Jan 17, 2013
1:16 PM EDT
@bob --

Not good enough. Not even close.

The world could end in 28 minutes. That just wouldn't do:

The world ends, and LXer readers won't even know we're dead yet.

Work on it.
bob

Jan 17, 2013
2:55 PM EDT
I'm right on top of it, dinotrac. :)
dinotrac

Jan 18, 2013
5:39 PM EDT
Good. I'd hate to wait for my own funeral that won't be given because nobody else is left to give it.
bob

Jan 19, 2013
12:37 AM EDT
Reason for outage:

Summary:

On January 16th, 2013 at approximately 4:22 PM CST, UPS #1 at The Connection (8600 Harry Hines) had a capacitor failure. This resulted in a phase imbalance, which caused an insulated-gate transistor (IGT) failure that in turn resulted in the shorting of both the input and bypass breakers. With both breakers tripped, downstream power distribution units (PDUs) fed from UPS #1 were without power. Our on-site facility engineers responded immediately and we also dispatched our UPS vendor. On-site engineers determined it was safe to reset the bypass breaker and were able to restore service via the bypass breaker at approximately 4:32 PM. Our facility team turned on each downstream PDU one at a time to alleviate in-rush issues. The last PDU was turned on at approximately 4:38 PM. Our UPS vendor identified the parts required to repair the UPS and had them dispatched from their warehouse.

On January 17th, 2013 at approximately 1:00 PM CST, the replacement parts arrived and work began to replace the failed components within our UPS system. During the repair work, exhaustive testing and calibration were performed to ensure that our UPS operates properly. On January 18th at 12:40 AM CST, the load was transferred back onto UPS #1.

UPS #1 feeds the "B" PDUs on our data center floor. If you have a non-redundant power configuration with circuits provisioned on the "B" PDUs, then your power service was impacted for approximately 10 to 16 minutes.
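The "10 to 16 minutes" figure follows directly from the times given in the summary above (failure at about 4:22 PM, bypass restored at about 4:32 PM, last PDU on at about 4:38 PM). A small Python sketch of that arithmetic is below. The variable names are illustrative only, and the fixed UTC-6 offset is an assumption that holds for CST in January. It also checks that the failed ping timestamp from the first post (04:25:55 PM CST) falls inside that window.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for CST (assumption: no daylight saving time in January).
CST = timezone(timedelta(hours=-6))


def at(hour, minute, second=0):
    """Build a timestamp on January 16th, 2013 in CST."""
    return datetime(2013, 1, 16, hour, minute, second, tzinfo=CST)


# Approximate times quoted in the outage summary.
failure = at(16, 22)          # UPS #1 fails, both breakers trip
bypass_restored = at(16, 32)  # service restored via bypass breaker
last_pdu_on = at(16, 38)      # last downstream PDU turned back on

# Timestamp printed under the failed traceroute/ping in the first post.
failed_ping = at(16, 25, 55)


def minutes(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


shortest = minutes(failure, bypass_restored)
longest = minutes(failure, last_pdu_on)
in_window = failure <= failed_ping <= last_pdu_on

print(f"power impact: {shortest} to {longest} minutes")
print(f"failed ping inside outage window: {in_window}")
```

The spread comes from the PDUs being switched on one at a time, so a cabinet on the first PDU was back sooner than one on the last.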

Customers on cxswitch #4, 5, 8, 10, 12, 13, 15, 17, 19, 21, 23, 26, 54 would have seen network loss for approximately 14 to 20 minutes, even if their cabinet itself was not powered from a B-side PDU.

Customers with a redundant power and redundant network configuration should not have been impacted. If you have a redundant configuration and were impacted, please contact our support team and we will be happy to confirm proper setup of your systems.

Future Mitigation:

We will evaluate our power infrastructure to include further failover protection at the UPS level as well as the network switch level. We are also taking steps to provide more information more quickly for any service interruptions, so customers can be aware of what is going on before a full RFO can be provided.
djohnston

Jan 19, 2013
2:16 AM EDT
What brand was UPS #1? Just kidding! Thanks for the very thorough update, bob.

cr

Jan 19, 2013
4:23 AM EDT
The fact that they could (and did) diagnose down to the component level is impressive.
