At 2:05 PM CT, Singlehop, Aleyant's data center provider in Chicago, experienced a significant outage in its internal network. The outage took all Aleyant services hosted in that data center offline, including Pressero, eDocBuilder, PrintJobManager, the Support Portal, and the Forum. Aleyant staff immediately contacted the data center to report the problem.
Singlehop worked on the problem, reporting initial resolution at 2:52 PM CT and confirmed resolution at 3:21 PM CT. All services were then restored except for Pressero. Our team investigated immediately and found that the network outage had corrupted a log file on the Pressero database (SQL) server. The team resolved this problem with no data loss. The team then discovered that a few of the Pressero caching servers in the Redis cluster, along with the state server, had also been taken offline by the outage. The team resolved this as well and restored Pressero services by 4:13 PM CT.
We are extremely sorry for this outage. We will be working with our data center provider to learn more about what caused it and what is being done to prevent it from happening again. Please feel free to open a support ticket with us if you would like any additional information.
March 27, 2018: We received this update from the data center, dated March 26:
ROOT CAUSE ANALYSIS: The exact cause for the software state, as well as the redundancy failure is still under investigation internally and with Cisco software developers under a level 1 TAC case. We’ve so far identified a hardware failure in a controller module (which has been taken out of service), but the discovery as to why hot-standby controllers within the cluster failed to detect and assume control during this state is still ongoing.
CORRECTIVE ACTIONS: We’re continuing to work diligently with the Cisco software development team to identify all aspects of this failure and obtain appropriate corrective measures to prevent this instance from occurring in the future. We expect to complete this effort within 2 - 3 weeks to ensure a thorough diagnosis. An RMA has been issued for the failed controller and will be replaced under a scheduled maintenance window within the next 96 hours.