Slack 503



We all come across website errors at some point, and every website out there has inevitably experienced its share of 404 and 500 status codes: a 404 is shown when a URL is not found, whether because of a typo or a broken link, and a 500 indicates a general server error.

There are many other, less well-known status codes in the 400 and 500 ranges: HTTP status codes beginning with 4 indicate errors on the client side, while codes beginning with 5 indicate errors on the website server's side.

Some of the common 500-range errors that websites encounter are 502, 503, and 504, all of which mean the website server was unable to fulfill a request that should have been valid. These codes give both the user and the website owner additional information about why a page is not loading. Websites can use logs and alerts to track how often these errors occur and to be notified when they happen, so that they can be resolved promptly.
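
For illustration, here is a minimal Python sketch of that kind of log-based tracking: it tallies 5xx responses in an access log and flags any status code that crosses a threshold. The log path, log format, and threshold are assumptions for the example, not a recommendation for any particular stack.

import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # assumed location and (combined) log format
ALERT_THRESHOLD = 50                    # assumed: flag any 5xx code seen this often

# In the common/combined log format the status code follows the quoted request line.
STATUS_RE = re.compile(r'" (\d{3}) ')

def count_5xx(path):
    counts = Counter()
    with open(path) as log:
        for line in log:
            match = STATUS_RE.search(line)
            if match and match.group(1).startswith("5"):
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for status, n in sorted(count_5xx(LOG_PATH).items()):
        flag = "ALERT" if n >= ALERT_THRESHOLD else "ok"
        print(f"{status}: {n} ({flag})")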

502 Bad Gateway

A 502 error means that a server acting as a reverse proxy for the website origin server (for example, a CDN PoP) did not receive a valid response from the origin server. This may happen because the origin server is experiencing issues, because of an invalid or incorrect DNS name, or because a firewall on the origin server has blocked the reverse proxy's request.

This may also occur when requests to the origin server take several minutes to complete and a caching tool such as Varnish Cache has been configured to time out after a set number of seconds. Varnish Cache has a default timeout of 60 seconds, which Section recommends keeping for security and alerting reasons.

503 Service Unavailable

The 503 Service Unavailable message means that the website origin server is not available, usually as a temporary state. This error can be triggered because something running on the server side has crashed, or because the site has been taken down intentionally for maintenance. It is also commonly served when a site receives more traffic than it can handle, a symptom of poor website scalability. The famous Twitter “Fail Whale,” which appeared frequently in the social media site's early days, is one example of a 503 triggered by excess traffic.

To avoid serving 503 errors due to an overload of visitors, we recommend using a Content Delivery Network and caching tool so that more of your requests can be served from the CDN and cache.
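
As a minimal sketch of the idea, the Python snippet below caches origin responses in memory for a short TTL so that repeated requests are answered without touching the origin. A production setup would use a CDN or a caching reverse proxy such as Varnish rather than code like this, and the TTL and timeout values are assumptions.

import time
import urllib.request

CACHE_TTL = 60       # seconds; assumed value
ORIGIN_TIMEOUT = 10  # seconds; assumed value
_cache = {}          # url -> (fetched_at, body)

def fetch_cached(url):
    now = time.time()
    hit = _cache.get(url)
    if hit and now - hit[0] < CACHE_TTL:
        return hit[1]  # served from cache; the origin is not touched
    with urllib.request.urlopen(url, timeout=ORIGIN_TIMEOUT) as resp:
        body = resp.read()
    _cache[url] = (now, body)
    return body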

504 Gateway Timeout

Similar to the 502 error, the 504 Gateway Timeout error occurs if the server that is acting as a proxy for the website origin server did not receive a response from the website origin server within a set time period. This may indicate an issue with the DNS host or hosting company, or with the connection or configuration between the reverse proxy servers and the website origin server.
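
To make the 502/504 distinction concrete, here is a small Python sketch of a toy reverse proxy: it returns 504 when the origin accepts the connection but does not answer within the timeout, and 502 when the origin is unreachable or returns an invalid response. The origin address, listen port, and timeout are assumptions for the example.

import socket
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ORIGIN = "http://127.0.0.1:8080"  # assumed origin address
UPSTREAM_TIMEOUT = 5              # seconds; assumed

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            with urllib.request.urlopen(ORIGIN + self.path, timeout=UPSTREAM_TIMEOUT) as upstream:
                body = upstream.read()
                status = upstream.status
            self.send_response(status)
            self.end_headers()
            self.wfile.write(body)
        except socket.timeout:
            # The origin did not respond within the allotted time.
            self.send_error(504, "Gateway Timeout")
        except urllib.error.HTTPError as err:
            # The origin answered with its own error status; pass it through.
            self.send_error(err.code)
        except (urllib.error.URLError, ConnectionError):
            # The origin was unreachable or the response was invalid.
            self.send_error(502, "Bad Gateway")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 9090), ProxyHandler).serve_forever()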

Troubleshooting 500 errors

502 and 504 errors both relate to a bad gateway, meaning that while the reverse proxy server is operational, something it needs to retrieve from the origin server is not working, or the connection between the reverse proxy server and the origin server is broken. To troubleshoot these issues, websites should check that their origin server and all the servers it depends on are running correctly, then check the configuration between the origin server and the reverse proxy server (such as the CDN PoP or a locally installed Varnish Cache server). You should also check that your firewall is not accidentally blocking legitimate traffic, that your DNS name is set up correctly, and that your origin hosting provider is not down.
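
The first of those checks can be scripted. The sketch below resolves the origin's DNS name and then probes the origin directly, bypassing the reverse proxy, so you can tell whether the problem is the origin itself or the path to it. The hostname and port are placeholders.

import socket
import urllib.request

ORIGIN_HOST = "origin.example.com"  # placeholder origin hostname
ORIGIN_PORT = 443                   # placeholder port

def check_dns(host):
    try:
        addrs = sorted({info[4][0] for info in socket.getaddrinfo(host, None)})
        print(f"DNS ok: {host} -> {addrs}")
        return True
    except socket.gaierror as err:
        print(f"DNS failure for {host}: {err}")
        return False

def check_origin(host, port):
    url = f"https://{host}:{port}/"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(f"Origin reachable, status {resp.status}")
    except Exception as err:  # kept broad to keep the sketch simple
        print(f"Origin check failed: {err}")

if __name__ == "__main__":
    if check_dns(ORIGIN_HOST):
        check_origin(ORIGIN_HOST, ORIGIN_PORT)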

503 errors are often expected when your website is intentionally down to make updates or changes. They may also be triggered by a large influx of traffic. If your website does not scale well, we suggest exploring a CDN such as Section’s and/or a caching reverse proxy such as Varnish Cache. To learn more, please contact us or sign up for a trial account to explore our features free for two weeks.

This story describes the technical details of the problems that caused the Slack downtime on May 12th, 2020. To learn more about the process behind incident response for the same outage, read Ryan Katkov’s post, “All Hands on Deck”.

On May 12, 2020, Slack had our first significant outage in a long time. We published a summary of the incident shortly after, but this story is an interesting one, and we’d like to go into more detail on the technical issues around it.

The user-visible outage began at 4:45pm Pacific time, but the story really begins around 8:30am that morning. Our Database Reliability Engineering team was alerted to a significant load increase in part of our database infrastructure at the same time as our Traffic team received alerts that we were failing some API requests. The increased load on the database was due to a rollout of a configuration change, which triggered a longstanding performance bug. The change was quickly pinpointed and rolled back — it was a feature flag which performed a percentage-based rollout, so this was a fast process. We had some customer impact, but it lasted only three minutes and most users were still able to send messages successfully throughout this brief morning incident.

One of the incident’s effects was a significant scale-up of our main webapp tier. Our CEO Stewart Butterfield has written about some of the impact of the lockdown and stay-at-home orders on Slack usage. As a result of the pandemic, we’ve been running significantly higher numbers of instances in the webapp tier than we were in the long-ago days of February 2020. We autoscale quickly when workers become saturated, as happened here — but workers were waiting much longer for some database requests to complete, leading to higher utilization. We increased our instance count by 75% during the incident, ending with the highest number of webapp hosts that we’ve ever run to date.

Everything seemed fine for the next eight hours — until we were alerted that we were serving more HTTP 503 errors than normal. We spun up a new incident response channel and the on-call engineer for the webapp tier manually scaled up the webapp fleet as an initial mitigation. Unusually, this didn’t help at all. We very quickly noticed that a subset of the webapp fleet was under heavy load while the rest of the webapp instances were not. Multiple strands of investigations began, looking into both webapp performance and our loadbalancer tier. A few minutes later, we identified the problem.

We use a fleet of HAProxy instances behind a layer 4 load-balancer to distribute requests to the webapp tier. We use Consul for service discovery, and consul-template to render lists of healthy webapp backends that HAProxy should route requests to.

We don’t render our webapp host list directly into our HAProxy configuration file, however. The reason for this is that updating the host list via the configuration file requires reloading HAProxy. The process of reloading HAProxy involves creating a brand-new HAProxy process while keeping the old one around until it’s finished dealing with in-flight requests. Very frequent reloads could lead to too many running HAProxy processes and poor performance. This constraint is in tension with the goal of autoscaling the webapp tier, which is to get new instances into service as quickly as possible. Therefore, we use HAProxy’s Runtime API to manipulate the HAProxy server state without doing a reload each time a web tier backend comes into or goes out of service. It’s worth noting that HAProxy can integrate with Consul’s DNS interface, but this adds lag due to the DNS TTL, it limits the ability to use Consul tags, and managing very large DNS responses often seems to lead to hitting painful edge-cases and bugs.
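
As an illustration of that approach (this is not Slack's haproxy-server-state-management program, and the socket path, backend name, and slot names below are assumptions), a small script can drive the Runtime API over HAProxy's admin socket to repoint a slot at a new instance or drain one, with no reload involved:

import socket

HAPROXY_SOCKET = "/var/run/haproxy.sock"  # assumes a "stats socket ... level admin" in haproxy.cfg

def runtime_cmd(command):
    # Send a single Runtime API command over the admin socket and return the reply.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall(command.encode() + b"\n")
        return sock.makefile().read()

def put_host_in_slot(backend, slot, addr, port):
    # Point an existing server slot at a new webapp instance and put it in
    # rotation, all without reloading HAProxy.
    runtime_cmd(f"set server {backend}/{slot} addr {addr} port {port}")
    runtime_cmd(f"set server {backend}/{slot} state ready")

def drain_slot(backend, slot):
    # Take a slot out of rotation when its webapp instance goes away.
    runtime_cmd(f"set server {backend}/{slot} state maint")

if __name__ == "__main__":
    put_host_in_slot("be_webapp", "websrv1", "10.0.0.12", 8080)
    drain_slot("be_webapp", "websrv2")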

We define HAProxy server templates in our HAProxy configuration that are effectively ‘slots’ which our webapp backends can occupy. When a new webapp instance is provisioned, or an old one becomes unhealthy, the Consul service catalog is updated. Consul-template renders a new version of the host list, and a separate program developed at Slack, haproxy-server-state-management, reads that host list and uses the HAProxy Runtime API to update the HAProxy state.

We run M parallel pools of HAProxy instances and webapp instances, each pool in its own AWS Availability Zone. HAProxy is configured with N ‘slots’ for webapp backends in each AZ, giving a total of N * M backends that can be routed to across all the AZs. A few months ago, this total was more than ample headroom — we’d never needed to run anything even approaching that number of instances of our webapp tier. However, after the morning’s database incident, we were running slightly more than N * M instances of the webapp. If you think of the HAProxy slots as a giant game of musical chairs, a few of these webapp instances were left without a seat. That wasn’t a problem — we had more than enough serving capacity.

However, over the course of the day, a problem developed. The program which synced the host list generated by consul-template with the HAProxy server state had a bug. It always attempted to find a slot for new webapp instances before it freed slots taken up by old webapp instances that were no longer running. This program began to fail and exit early because it was unable to find any empty slots, meaning that the running HAProxy instances weren’t getting their state updated. As the day passed and the webapp autoscaling group scaled up and down, the list of backends in the HAProxy state became more and more stale.
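
To make the ordering problem concrete, here is an illustrative sketch (not Slack's actual code) of a slot-sync pass that tries to seat new hosts before freeing slots held by departed ones; once every slot is occupied by stale hosts, it fails before it ever reaches the freeing step:

def sync_slots(slots, healthy_hosts):
    """slots maps slot name -> assigned host name, or None if the slot is empty."""
    assigned = {host for host in slots.values() if host is not None}

    # Buggy ordering: try to seat newly discovered hosts first...
    for host in sorted(healthy_hosts - assigned):
        empty = next((s for s, h in slots.items() if h is None), None)
        if empty is None:
            # With every slot occupied by stale hosts, the sync exits here,
            # and HAProxy's view of the backends never gets refreshed.
            raise RuntimeError("no empty slot available")
        slots[empty] = host

    # ...and only afterwards free slots whose hosts are gone. Running this
    # step first would have made room and avoided the dead end.
    for slot, host in list(slots.items()):
        if host is not None and host not in healthy_hosts:
            slots[slot] = None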

By 4:45pm Pacific, most HAProxy instances were only able to send requests to the set of webapp backends that had been up since the morning, and this set of older webapp backends was now a minority of the fleet. We do regularly provision new HAProxy instances, so there would have been a few fresh HAProxy instances that had correct configuration, but most of them were more than eight hours old and therefore were stuck with full and stale backend state. The outage was eventually triggered at the end of the business day in the US because that’s when we begin to scale down the webapp tier as traffic drops. Autoscaling will preferentially terminate older instances, so this meant that there were no longer enough older webapp instances remaining in the HAProxy server state to serve demand.

Once we knew the cause of the failure, it was resolved quickly with a rolling restart of the HAProxy fleet. After the incident was mitigated, the first question we asked ourselves was why our monitoring didn’t catch this problem. We had alerting in place for this precise situation, but unfortunately, it wasn’t working as intended. The broken monitoring hadn’t been noticed partly because this system ‘just worked’ for a long time and didn’t require any change. The wider HAProxy deployment that this is part of is also relatively static. With a low rate of change, fewer engineers were interacting with the monitoring and alerting infrastructure.
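
For context, one check of the kind that could catch this class of staleness compares the number of backends HAProxy currently considers UP against the number of healthy instances Consul reports for the service. The sketch below is an assumption-laden example, not our actual alerting: the socket path, backend name, Consul service name, local Consul agent address, and threshold are all placeholders.

import csv
import json
import socket
import urllib.request

HAPROXY_SOCKET = "/var/run/haproxy.sock"  # assumed admin socket path
BACKEND = "be_webapp"                     # assumed backend name
CONSUL_SERVICE = "webapp"                 # assumed Consul service name

def haproxy_up_count(backend):
    # "show stat" returns CSV whose header line starts with "# pxname,svname,...".
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall(b"show stat\n")
        raw = sock.makefile().read()
    rows = csv.DictReader(raw.lstrip("# ").splitlines())
    return sum(1 for r in rows
               if r["pxname"] == backend
               and r["svname"] not in ("FRONTEND", "BACKEND")
               and r["status"].startswith("UP"))

def consul_passing_count(service):
    # Health endpoint on the local Consul agent, filtered to passing instances.
    url = f"http://127.0.0.1:8500/v1/health/service/{service}?passing=true"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return len(json.load(resp))

if __name__ == "__main__":
    up = haproxy_up_count(BACKEND)
    passing = consul_passing_count(CONSUL_SERVICE)
    if up < passing * 0.8:  # assumed threshold
        print(f"ALERT: HAProxy routes to {up} backends but Consul reports {passing} healthy")
    else:
        print(f"ok: HAProxy {up} / Consul {passing}")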

The reason that we haven’t been doing any significant work on this HAProxy stack is that we’re moving towards Envoy Proxy for all of our ingress load-balancing (we’ve recently moved our websockets traffic onto Envoy). While HAProxy has served us well and reliably for many years, it also has some operational sharp edges, of exactly the kind highlighted by this incident. The complex pipeline we use to manipulate HAProxy server state will be replaced by Envoy’s native integration with an xDS control plane for endpoint discovery. The most recent versions of HAProxy (since the 2.0 release) also solve many of these operational pain points. However, Envoy has been our proxy of choice for our internal service mesh project for some time, and this makes a move to Envoy for our ingress load-balancing attractive. Our initial testing of Envoy + xDS at scale is very exciting and this migration should improve both performance and availability going forward. Our new load-balancing and service discovery architecture is not susceptible to the problem that caused this outage.

We strive to keep Slack available and reliable, and in this case, we failed. We know that Slack is a critical tool for our users, and that is why we aim to learn as much as we can from every incident, whether customer visible or not. We apologize for the inconvenience this outage caused and will continue to use the lessons learned to drive improvements in both our systems and our processes.