October 15, 2018
Dear Pressero Customer,
Please accept our apologies for the significant problems we experienced with Pressero in our Chicago datacenter last week. We understand that the reliability of Pressero is critical to your business and agree that this situation was unacceptable. Below, we will explain what happened, what we have done so far, and what we hope to do to prevent this scenario from happening again.
On October 9 we released a new version of Pressero. Despite extensive testing beforehand, we noticed performance problems almost immediately. Our infrastructure team quickly reverted the new version to the previous version, but the performance problems continued.
It’s important to note that Pressero is not a single server. It’s a collection of servers, with groups of servers performing different roles such as load balancing, caching, application serving, file storage, and database processing. There is a high level of redundancy throughout the stack. The team began to investigate all the components of Pressero in addition to analyzing network traffic and overall system utilization.
To address the problem, we made a number of configuration changes, including providing more RAM or CPU capacity to selected components. Some changes produced temporary performance gains, but the overall problem continued. On the evening of October 9, our team removed an application server that was reporting problems. Performance improved temporarily, but it quickly degraded again as the overall system load increased.
Some may recall that Pressero experienced significant hosting issues almost exactly one year ago. One of the many improvements we made after that situation was an investment in an Application Performance Monitoring (APM) system. This is software that runs on the application server and analyzes the myriad processes executed by the application. It’s a very helpful tool for understanding where application bottlenecks and problems are occurring. After reviewing the data and reports generated by the APM, our team concluded that it was not providing adequate help for the current problem. The team then installed a second APM tool. After some data collection and analysis, the team realized that the first APM tool was itself causing the problem on some web application servers.
What We Did
The team took three steps that ultimately resolved the problem:
- Rebuilt the web application servers
- Removed the first APM tool
- Added two more servers to Pressero dedicated to serving static files (images, PDFs, etc.), reducing the burden on the application servers
What We Will Do
While we are grateful that system performance has been restored, we are mindful that we need to continue to enhance the services we are providing. There are a number of next steps:
- Continue using the new APM at the server level, but only when needed to diagnose a problem. In addition, we will add more monitoring of the end user experience. We want to make sure we don’t miss any early warning signs and want to measure and benchmark this more closely. We would like to monitor page load speeds for a cross-section of Pressero sites. If you have a Pressero site capturing 50 or more orders per month and would be willing to allow us to monitor this site in more depth, please let us know by opening a ticket at http://support.aleyant.com. The only data we will capture is related to performance.
- Over the last few months, our team has been working to migrate the hosting approach for all Aleyant applications to treat server infrastructure deployment and updates as programming code. This approach, which relies on containers, allows for a high level of automation. (You can google “Docker” if you want to learn more.) A significant benefit of using containers is that our team will be able to change the configuration of components more quickly, or completely rebuild components, should a problem happen again.
- While our current data center provider in Chicago cannot be blamed for this problem, the reality is that they were responsible for a few other issues in 2018. Our contract with them ends in April 2019. We are currently evaluating other options and believe that a new hosting provider may offer more stability and more options for resiliency.
- In early 2019, we will be adding the option of dedicated instances of Pressero & eDocBuilder. This solution consists of a single cloud-based server that is used by a single customer. While this option will naturally cost more, one of its compelling benefits is the ability to have dedicated resources that remain separated from the traffic and load pitfalls of a multi-tenant instance.
Again, we apologize for the problems experienced last week. If you have additional questions or would like to speak with a manager about this, please open a ticket at http://support.aleyant.com.