---
title: Troubleshooting Router Error Responses
owner: Routing
---
<strong><%= modified_date %></strong>
This topic helps operators determine whether 502s are caused by the <%= vars.product_full %> platform or by an application.
## <a id="points-of-failure"></a> Points of Failure
502s can originate from several points of failure:
1. Infrastructure
    - Load Balancer
    - Network
2. Platform - <%= vars.product_name %>
    - Gorouter
    - Diego Cell(s)
3. Application
In the **Infrastructure**, 502s can occur in the following way:
- From the Load Balancer, 502s can surface when the Gorouters are not receiving traffic at all.
    - This can be observed if the Load Balancer is logging 502s but the Gorouters are not.

In the **Platform**, 502s can occur in the following ways:
- If the Gorouter is unable to connect to the application container:
    - TCP dial issues (the Gorouter can't make an initial connection to the [backend](../concepts/http-routing.html#back-end)). The Gorouter retries TCP dial errors up to three times. If the dial still fails, a 502 is returned to the client and logged in the `access.log`. This may be due to:
        - An unresponsive application (which indicates an issue with the application)
        - A stale route in the Gorouter (which indicates an issue with the platform)
        - A corrupted application container (which indicates an issue with the platform)
    - These types of errors may look like the following in the `gorouter.log`:
<pre class="terminal">
[2018-07-05 17:59:10+0000] {"log_level":3,"timestamp":1530813550.92134,"message":
"backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
{"ApplicationId":"","Addr":"10.0.32.15:60099","Tags":null,"RouteServiceUrl":""},
"error":"dial tcp 10.0.32.15:60099: getsockopt: connection refused"}}
</pre>
- If the Gorouter successfully dials the endpoint but an error occurs:
    - `read: connection reset by peer` errors can occur when the application closes the connection abruptly with a TCP RST packet instead of the expected FIN-ACK. This causes the Gorouter to retry the next endpoint. Note that the Gorouter does not currently retry on `write: connection reset by peer` failures.
    - TLS handshake errors. When these errors occur, the Gorouter retries up to three times. If the handshake still fails, a 502 may be returned. These errors appear similar to the following in the `gorouter.log` (and a 502 is logged in the `access.log`):
<pre class="terminal">
[2018-07-05 18:20:54+0000] {"log_level":3,"timestamp":1530814854.4359834,"message":"
backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
{"ApplicationId":"","Addr":"10.0.16.17:61002","Tags":null,"RouteServiceUrl":""},
"error":"x509:certificate is valid for 53079ca3-c4fe-4910-78b9-c1a6, not xxx"}}
</pre>
- If the Gorouter successfully connects to the endpoint, but an error occurs while the request is in transit (that is, the Gorouter has not received a response from the endpoint):
    - Prior to <%= vars.routing_version %>, a bug caused a 502 to be logged for requests that clients canceled before the server responded with headers. In <%= vars.routing_version %> and later, the same situation results in a 499.
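
As a rough illustration of how the `backend-endpoint-failed` entries shown above can be inspected, the following Python sketch extracts the message, backend address, and error string from a single `gorouter.log` line. It assumes each entry is written on one line as a bracketed timestamp followed by a JSON object, as in the samples above. The function name `parse_gorouter_line` is hypothetical and not part of any platform tooling.

```python
import json


def parse_gorouter_line(line):
    """Return (message, backend_addr, error) for one log line, or None.

    Assumption: the line looks like '[timestamp] {json}', matching the
    samples in this topic. Anything that does not parse yields None.
    """
    start = line.find("{")
    if start == -1:
        return None
    try:
        record = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    data = record.get("data") or {}
    endpoint = data.get("route-endpoint") or {}
    return record.get("message"), endpoint.get("Addr"), data.get("error")


# Sample entry taken from the dial error shown earlier in this topic.
sample = (
    '[2018-07-05 17:59:10+0000] {"log_level":3,"timestamp":1530813550.92134,'
    '"message":"backend-endpoint-failed","source":"vcap.gorouter","data":'
    '{"route-endpoint":{"ApplicationId":"","Addr":"10.0.32.15:60099",'
    '"Tags":null,"RouteServiceUrl":""},'
    '"error":"dial tcp 10.0.32.15:60099: getsockopt: connection refused"}}'
)

if __name__ == "__main__":
    print(parse_gorouter_line(sample))
```

Grouping the returned backend addresses across many lines can help show whether failures cluster on one Diego Cell or application instance.
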
In an **Application**, 502s can occur in the following ways (note that the Gorouter does not retry any error response that the application returns):
- If 502s are occurring only for particular application instances and not for all applications on the platform, the cause is likely application-related (for example, the application is overloaded, unresponsive, or can't connect to its database).
- If all applications are experiencing 502s, the cause could be either a platform issue (possibly a misconfiguration) or an application issue (for example, all applications are unable to connect to an upstream database).
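
The rule of thumb above (some applications affected versus all applications affected) can be sketched in Python as follows. This is only an illustration: the input format of `(app_name, instance_index, status_code)` tuples, the function name `triage_502s`, and the 5% threshold are all assumptions made for the example, not values defined by the platform.

```python
from collections import defaultdict


def triage_502s(records, threshold=0.05):
    """Suggest where to look first, given per-request status codes.

    records: iterable of (app_name, instance_index, status_code) tuples.
    threshold: hypothetical fraction of 502s above which an app counts
    as affected. Tune it for your own traffic.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for app, _instance, status in records:
        totals[app] += 1
        if status == 502:
            errors[app] += 1

    affected = {app for app in totals if errors[app] / totals[app] > threshold}
    if not affected:
        return "no significant 502s"
    if affected == set(totals):
        return "all apps affected: check platform configuration or a shared upstream dependency"
    return "only some apps affected, likely application-related: " + ", ".join(sorted(affected))


if __name__ == "__main__":
    requests = [
        ("app-a", 0, 502),
        ("app-a", 1, 200),
        ("app-b", 0, 200),
        ("app-b", 1, 200),
    ]
    print(triage_502s(requests))
```

With only one application in the input, the sketch cannot distinguish the two cases and reports all apps as affected, so it is meaningful only when the records cover several applications.
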
## <a id="debugging-steps"></a> General Debugging Steps
Here are general debugging steps for any issue resulting in 502 error codes:
- Gather the Gorouter and Diego Cell logs from the time of the incident. <%= vars.gather_routing_logs %>
- Review the logs and consider the following questions:
    1. Which errors are the Gorouters returning?
    2. Is the Gorouter's [routing table](https://github.com/cloudfoundry/gorouter#dynamic-routing-table) accurate (are the endpoints for the route as expected)?
    3. Do the Diego Cell logs show any unexpected app crashes or restarts?
    4. Is the application healthy and handling requests successfully? (Try using [request tracing headers](../concepts/http-routing.html#zipkin-headers) to verify.)
- Was there a recent platform change or upgrade that caused an increase in 502s?
- Are any suspicious [metrics](<%= vars.gorouter_metrics_link %>) spiking? How is CPU and memory utilization?
## <a id="gorouter-error-classification"></a> Gorouter Error Classification Table
Error Type | Status Code | Source of Issue | Evidence
------------ | ------------- | ------------- | -------------
Dial | 502 | Application or Platform | logs with error `dial tcp`
AttemptedTLSWith<br/>NonTLSBackend | 525 | Platform* | - logs with error `tls: first record does not look like a TLS handshake`<br/> - `backend_tls_handshake_failed` metric increments
HostnameMismatch | 503 | Platform | - logs with error `x509: certificate is valid for <x> not <y>`<br/> - `backend_invalid_id` metric increments
UntrustedCert | 526 | Platform | - logs with an error prefix `x509: certificate signed by unknown authority`<br/> - `backend_invalid_tls_cert` metric increments
RemoteFailedCertCheck | 496 | Platform | logs with error `remote error: tls: bad certificate`
ContextCancelled | 499 | Client/App | logs with error `context canceled` <br/><p class="note"><strong>Note</strong>: This status code is never returned to clients, only logged, as it occurs when the downstream client closes the connection before Gorouter responds.</p>
RemoteHandshakeFailure | 525 | Platform | - logs with error `remote error: tls: handshake failure`<br/> - `backend_tls_handshake_failed` metric increments
*Note: Any platform issue could be the result of a misconfiguration.
For each of the above errors, there is a `backend-endpoint-failed` log line in `gorouter.log` and an error message in `gorouter.err.log`. Additionally, the [`access.log`](https://github.com/cloudfoundry/gorouter#logs) records the request status codes.
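
The classification table can also be expressed as a simple lookup. The Python sketch below maps the error string from a log line to the error type and status code listed in the table. It is a minimal illustration based on substring matching. The names `RULES` and `classify` are hypothetical, and the rule order (TLS and certificate errors first, `dial tcp` last) is an assumption made to prefer the more specific match.

```python
# (substring to look for, error type, status code), taken from the table above.
# Assumption: the first matching rule wins, so specific errors are listed first.
RULES = [
    ("tls: first record does not look like a TLS handshake",
     "AttemptedTLSWithNonTLSBackend", 525),
    ("certificate is valid for", "HostnameMismatch", 503),
    ("certificate signed by unknown authority", "UntrustedCert", 526),
    ("tls: bad certificate", "RemoteFailedCertCheck", 496),
    ("tls: handshake failure", "RemoteHandshakeFailure", 525),
    ("context canceled", "ContextCancelled", 499),
    ("dial tcp", "Dial", 502),
]


def classify(error_message):
    """Return (error_type, status_code) for an error string, or None."""
    for needle, error_type, status in RULES:
        if needle in error_message:
            return error_type, status
    return None


if __name__ == "__main__":
    print(classify("dial tcp 10.0.32.15:60099: getsockopt: connection refused"))
    print(classify("remote error: tls: bad certificate"))
```

An error that matches none of the rules returns `None`, which is a signal to read the full log entry rather than rely on the table.
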