Network Outage

As you might have noticed, yesterday the Sports Blog Nation network, which is the parent of your dearly beloved Bright Side of the Sun, experienced an extended system outage.

This was the first-ever outage of this kind, and we are very sorry that it occurred. At least it wasn't on a game day. Maybe the network gods decided that we, along with the Suns, needed a rest day...

Details of the outage after the jump.

As painful as these things can be, I am sure we have all gone through them before and know that the techies of the world work very hard to prevent and correct instances like this.

Here's the word from on high and/or the bowels of server hell...

You probably noticed that around 1 a.m. EST the outage ended and the
blogs on the sbn2.0 platform were back online.

The summary cause of the outage was that we experienced a major
hardware failure. Through a brute force process of elimination, we
determined that the RAM (memory) we received in the recent upgrade was
"bad". That type of problem can cause unusual, unexplained errors in
software that makes great use of RAM and that is exactly what happened
to us. It sounds like an easy, quick problem to solve, but "bad" RAM
is so rarely the first explanation for an outage like this that it
took us a while to go through the process of confirming it and finally
making hardware changes to deal with it. Basically, this was a
worst-case scenario.

I want to apologize to everyone for such an extended outage. We tried
to handle this situation as quickly and cautiously as possible to be
certain you were back online again without any more interruptions. I
know that GameThreads were missed, and so were opportunities to discuss
trades and firings. For the past few months, we've been living on
your blogs, engaging in the communities and reading the comments so
much that we understand how frustrating it must be to lose that line
to the people you want to talk to about this stuff. It was as much a
nightmare for us as it must've been for y'all. We take a lot of pride
in what we've built and in making sure it runs properly 24/7.

Finally, we always try to learn from events like this. We are thinking
about more hardware redundancy, a more paranoid server hardware upgrade
process with our hosting provider, a more flexible application stack
that can adjust dynamically to failures, a different message on our
downtime/maintenance screen, better communication with y'all during an
emergency like this, and known external communication outlets about
the network so your readers have a source of updates during outages.
You can expect that we'll be having many internal conversations about
this stuff, but I think the last few points are worth a conversation
on this list as well.