Weather-related event and subsequent email system failure of May 1st and 2nd, 2009
At 1:05 a.m. on Friday, May 1, UpLync was impacted by a significant weather event that caused email services to be unavailable for a prolonged period of time. As of 9:03 a.m. Saturday, May 2, all email services have been restored. Please read below for additional details about the event.
Where did the weather-related event occur?
UpLync has its core network servers (including email servers) located at Springnet Underground, a division of City Utilities of Springfield, Missouri (http://www.springnet.net/). Springnet Underground boasts one of the nation's newest and most advanced data centers. With multiple redundant Internet connections to Tier 1 Internet providers such as AT&T and Sprint, and electrical power subsystems that include a 3-megawatt generator, Springnet is the region's premier facility for hosting mission-critical information technology equipment. The event occurred at this vendor's facility and adversely affected all of Springnet's customers, including thousands of companies across the United States.
Who was affected?
All customers who have mail processed by UpLync Technologies were temporarily affected by this event. This includes GOIN Internet Services customers (addresses ending in @getgoin.net) as well as those customers of GOIN that have their email domains hosted with GOIN. All incoming and outgoing email was affected.
What happened?
At approximately 1:05 a.m. Friday, the facility at Springnet Underground suffered an electrical power outage. At that time, a severe electrical storm was in the area. While the data center at Springnet is designed to maintain electrical power indefinitely in these situations, the electrical subsystems failed to engage, causing a power failure across the entire Springnet Underground facility. Power was restored shortly thereafter. See below for a report-in-progress from Springnet's facility administrators.
The power outage at the Springnet facility caused a temporary failure in the operating system of UpLync's core mail server. UpLync technicians worked around the clock to repair the damage and restore email services to its customers.
Why did this occur?
Springnet continues to invest significant time and resources in diagnosing the root cause of this event. At this time, Springnet reports that they have yet to ascertain the complete chain of events that caused the initial power failure.
Will this sort of thing occur in the future?
While UpLync and Springnet have taken every feasible precaution in the protection of their respective network subsystems, it remains an unfortunate truth that planning for every conceivable adverse event is impossible. However, it is believed that this event was isolated in nature and will not occur in the near future.
Did I lose any mail?
UpLync has designed its mail systems so that events such as this do not cause irreparable data loss. Incoming mail destined for UpLync customers is queued across multiple servers if its final destination is temporarily unavailable. In the event of a prolonged unavailability, an explanation of the delay is dispatched to the sender. Should the delay last for more than twenty-four hours, the original message is returned to the sender along with an explanation that the destination was unreachable. This event is called a 'bounce'.
All queued mail is delivered to its final destination once that destination becomes available. Therefore, this event should not have resulted in any loss of email. In the worst case, a sender who received a bounce notification may need to resend the original message.
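The queueing behavior described above can be sketched roughly as follows. This is only an illustration, not UpLync's actual mail software: the function name, the action labels, and the four-hour delay-notice threshold are assumptions (the notice does not say how long "prolonged" is). Only the twenty-four-hour bounce limit comes from the text.

```python
from datetime import datetime, timedelta

# Assumed value: the notice does not define "prolonged unavailability".
DELAY_NOTICE_AFTER = timedelta(hours=4)
# From the notice: mail delayed more than twenty-four hours is bounced.
BOUNCE_AFTER = timedelta(hours=24)


def queue_action(queued_at, now, destination_up):
    """Decide what to do with one queued message (hypothetical helper).

    Simplification: a real mail server would send the delay notice only
    once per message; this sketch just reports the current state.
    """
    if destination_up:
        return "deliver"
    waited = now - queued_at
    if waited > BOUNCE_AFTER:
        return "bounce"
    if waited >= DELAY_NOTICE_AFTER:
        return "notify_sender_of_delay"
    return "keep_queued"


if __name__ == "__main__":
    outage_start = datetime(2009, 5, 1, 1, 5)   # 1:05 a.m. Friday
    restored = datetime(2009, 5, 2, 9, 3)       # 9:03 a.m. Saturday
    print(restored - outage_start)
    print(queue_action(outage_start, restored, destination_up=False))
```

Under these assumptions, a message queued at the very start of the outage would have waited just under 32 hours by the time service was restored, which is why some senders may have received a bounce and need to resend.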
All of us at UpLync Technologies would like to extend our sincere apologies for the inconvenience that this event may have caused you. As always, our commitment to you is the timely resolution of issues that impact the quality of our services to you. While the event was unforeseeable and unavoidable, we will continue to work closely with Springnet to take whatever mitigating action is necessary to avoid similar events in the future.
Mike Bristol, CEO
UpLync Technologies, Inc.
Loss of Critical Power at SpringNet Underground - 4:42 a.m., Friday, May 1, 2009
Today, at approximately 1:05am, SpringNet Underground lost critical power. Power to critical load was restored at approximately 1:50am. We currently have the UPS in maintenance bypass and are running on generator power. Eaton Powerware, the UPS manufacturer, has been contacted and a system technician has been dispatched. As information is collected regarding the cause of this event, we will provide updates.
Our apologies for the inconveniences this has caused.
Chad Marsh
SpringNet NOC Supervisor
==============
Update #1 - Loss of Critical Power at SpringNet Underground - 7:26 a.m., Friday, May 1, 2009
EATON Powerware is currently onsite; they have downloaded the logs from the UPS and are performing a visual inspection. EATON is sending additional staff, who are due to arrive at 9:00am. We will continue to run on generator until further notice.
Chad Marsh
SpringNet NOC Supervisor
==============
Update #2 - Loss of Critical Power at SpringNet Underground - 10:18 a.m., Friday, May 1, 2009
EATON Powerware has completed their initial inspection of the UPS. They will bring each module individually online without load for further inspection. We will remain on generator power until further notice.
Chad Marsh
SpringNet NOC Supervisor
==============
Update #3 - Loss of Critical Power at SpringNet Underground - 1:41 p.m., Friday, May 1, 2009
We have verified that all recommended grounding enhancements and TVSS devices installed over the past nine months functioned as designed.
EATON Powerware technicians have completed their onsite field test and information gathering process. Once they have submitted their findings to their engineering team, we will schedule a conference call with them.
At 3:00 pm today EATON Powerware technicians will return the UPS to online mode. This will be a closed transition and should not disrupt critical load.
While we work with EATON Powerware to determine a cause and identify a solution, we will operate in "Storm Contingency" mode. In Storm Contingency mode, when radar indicates a storm is within one hour of the facility, we will transition the UPS and battery load from utility power to generator power for the duration of the storm. This procedure will minimize our exposure. We will continue in Storm Contingency mode until we have a resolution to this issue.
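As a rough illustration of the Storm Contingency rule described above, the decision can be written as a small function. This is a sketch only: the function and parameter names are hypothetical, and how the storm's arrival time is estimated from radar is not covered in this update, so it is taken here as an input.

```python
from datetime import timedelta

# From the update: transfer when a storm is within one hour of the facility.
STORM_LEAD_TIME = timedelta(hours=1)


def select_ups_source(storm_eta, storm_overhead=False):
    """Pick the power source feeding the UPS (hypothetical helper).

    storm_eta: estimated time until a storm reaches the facility,
               or None when radar shows no approaching storm.
    storm_overhead: True while a storm is at the facility, so the
               load stays on generator for the storm's duration.
    """
    if storm_overhead:
        return "generator"
    if storm_eta is not None and storm_eta <= STORM_LEAD_TIME:
        return "generator"
    return "utility"


if __name__ == "__main__":
    print(select_ups_source(None))
    print(select_ups_source(timedelta(minutes=45)))
    print(select_ups_source(timedelta(hours=3)))
```

The point of the rule is that the transfer to generator happens before the storm arrives, so a lightning-induced utility fault does not require the UPS to switch sources during the storm.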
We will continue to provide updates as we have them.
Chad Marsh
SpringNet NOC Supervisor
==============
Update #4 - Loss of Critical Power at SpringNet Underground - 3:37 p.m., Friday, May 1, 2009
This will be the last planned update for the weekend.
EATON Powerware has transferred critical load back to the UPS. We are still working with their local technicians to schedule a conference call with the engineers.
We have engaged specialists from City Utilities to verify data that resides within the utility relays and validate the corresponding data downloaded from the UPS.
As mentioned in the prior update, we are in Storm Contingency mode and are using the generator to power the UPS.
We will provide more updates next week as information becomes available.
Chad Marsh
SpringNet NOC Supervisor
==============
Update #5 - Loss of Critical Power at SpringNet Underground - 9:23 p.m., Tuesday, May 5, 2009
We attended a conference call with EATON Powerware's engineering group late Monday afternoon. They have received the initial data that was gathered onsite and have begun reviewing the information. We have also been able to provide them with the information logged from our relay devices (onsite as well as from the substation relay). SpringNet and EATON Powerware are comparing this recent event with the event in July, as well as with power outages since then that resulted in normal UPS operations. Based on this analysis, EATON Powerware is developing a course of action; we will provide that information when it is available.
EATON Powerware has requested SpringNet's next maintenance window to perform further testing on the UPMs. We have scheduled this testing for Sunday 5/10 from 12:01am-8:00am. We will transfer to generator power and switch the UPS to Maintenance By-Pass during this time.
We will continue to provide updates as they are available.
Chad Marsh
SpringNet NOC Supervisor
==============
Update #6 - Loss of Critical Power at SpringNet Underground - 12:27 p.m., Friday, May 8, 2009
SpringNet and EATON Powerware have developed a plan for testing during the maintenance window on Sunday.
This will include the following:
- Powerware will inspect and test each UPM.
- Powerware will review the internal modifications of UPM 3 and 4 made during the migration to the parallel capacity system.
- Powerware will review and test for any fluctuating battery currents on all UPMs.
- Powerware will inspect and review any differences between the configuration of the system now and prior to July 2008.
- City Utilities power quality will perform thermal imaging on the UPS.
- City Utilities substation technicians will inspect switchgear and relay devices.
EATON Powerware has assured us that this is a priority for them and that they will see this through until we find the root cause and have identified a solution.
City Utilities has provided power quality monitoring equipment for the main utility feed as well as the UPS output. City Utilities has been taking measurements from the monitoring equipment and will provide us with an analysis as the data is available.
We have received lightning data from Vaisala. The data indicates that 2 seconds before the City Utilities electric feeder went offline, there was a lightning strike within 0.2 miles of our location.
In addition to the above mentioned testing, the following will be performed in the upcoming week.
- City Utilities substation technicians will perform a ground resistance test on the facility.
- City Utilities substation technicians will install an advanced power meter on the substation feeder that provides electrical power to the facility.
- City Utilities power quality will install an advanced power quality meter on the facility's primary meter.
As a reminder, this Sunday from 12:01am to approximately 8:00am we will transition the facility to maintenance bypass and operate on generator power for the duration of the maintenance event.
Chad Marsh
SpringNet NOC Supervisor
==============
Update #7 - Loss of Critical Power at SpringNet Underground - 5:41 a.m., Sunday, May 10, 2009
Eaton Powerware has completed the scheduled inspection and testing process during this maintenance window. Powerware engineering has indicated they will have an analysis of this data by Tuesday afternoon.
Chad Marsh
SpringNet NOC Supervisor
==============
Update #8 - Loss of Critical Power at SpringNet Underground - 9:30 a.m., Tuesday, May 12, 2009
EATON Powerware has provided us with a preliminary report of findings and action items from this weekend's inspection and testing.
During the inspection window, EATON Powerware changed the UPS neutral configuration from a "Solidly Connected" configuration to a "Separately Derived" configuration. Prior to this change, the UPMs were connected to two neutral sources, one through the SBM and one directly to the main distribution panels feeding the UPMs. This modification removes a possible alternate path of current flow during a lightning strike.
Upon inspection of UPM 3 and UPM 4, EATON Powerware removed and isolated connectors that were required when the system was in a parallel redundant configuration but are not required as part of the present parallel capacity system.
There was a large quantity of data collected during the inspection window. We will continue to provide updates as EATON Powerware goes through the data and provides the analysis to us.
Chad Marsh
SpringNet NOC Supervisor
==============
Update #9 - Loss of Critical Power at SpringNet Underground - 4:18 a.m., Tuesday, May 19, 2009
EATON Powerware met with us today to review the data collected and provide us with an action plan. We will implement the following during the next maintenance window, which is this Sunday, May 24, from 12:01am to 8:00am.
- EATON Powerware recommends the UPS be connected as a "separately derived" configuration regarding neutral and ground bonding. We implemented this change to the UPMs on Sunday May 10. During this maintenance window the SBM will be changed to match the UPM configuration. This will make the entire UPS a separately derived system. This is important because the SBM (brains of the UPS) monitors phase to neutral voltages for reference. Utilizing a separately derived system will allow us to provide a consistent neutral to the UPS.
- We will remove the neutral to ground bond in the switch gear. The neutral to ground bond inside the utility transformer will remain per code requirements.
- EATON Powerware will inspect all the communication and breaker wiring inside the SBM.
- EATON Powerware will inspect the maintenance bypass breaker and test the maintenance bypass procedure.
- EATON Powerware will replace the monitoring board in the SBM.
EATON Powerware will then test the equipment to make sure it is responding as expected with the new board.
We have created a website for these updates and all updates in the future; this can be found at support.springnet.net.
Chad Marsh
SpringNet NOC Supervisor
==============