-
Notifications
You must be signed in to change notification settings - Fork 108
/
troubleshooting-router-error-responses.html.md.erb
430 lines (344 loc) · 23.1 KB
/
troubleshooting-router-error-responses.html.md.erb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
---
title: Troubleshooting router error responses in Cloud Foundry
owner: CF for VMs Networking
---
<%# Reset page title based on platform type %>
<% if vars.platform_code != 'CF' %>
<% set_title("Troubleshooting router error responses in", vars.app_runtime_abbr) %>
<% end %>
Are your 502 errors the result of your infrastructure, <%= vars.app_runtime_abbr %>, or an app? This topic helps you understand and
debug these errors.
## <a id="overview"></a> Outer error responses overview
In your deployment, 502 errors can come from any of the following:
- **Infrastructure**: issues with your load balancer or network can cause 502 errors.
- **Platform**: issues with Gorouter or Diego Cells can cause 502 errors.
See the following sections:
- [Diagnose Gorouter Errors](#gorouter-errors)
- [Diagnose Stale Routes in Gorouter](#diagnose-stale-routes)
- [Gorouter Error Classification Table](#gorouter-error-classification)
- **App**: issues with an app's load or configuration can cause 502 errors.
See [Diagnose App Errors](#app-errors).
If you are unsure of the source of 502 errors, see [General Debugging Steps](#debugging-steps).
## <a id="debugging-steps"></a> General debugging steps
Some general debugging steps for any issue resulting in 502 errors are as follows:
1. Gather the Gorouter logs and Diego Cell logs at the time of the incident. <%= vars.gather_routing_logs %>
1. Review the logs and consider the following:
1. Which errors are the Gorouters returning?
1. Is Gorouter's routing table accurate? Are the endpoints for the route as expected? For more information, see [Dynamic Routing Table](https://github.com/cloudfoundry/gorouter#dynamic-routing-table) in the Gorouter documentation on GitHub.
1. Do the Diego Cell logs have anything interesting about unexpected app crashes or restarts?
1. Is the app healthy and handling requests successfully? You can use request tracing headers to verify. For more information, see [HTTP Headers for Zipkin Tracing](../concepts/http-routing.html#zipkin-headers) in _HTTP Routing_.
1. Consider the following:
- Does your load balancer log 502 errors, but Gorouter does not? This means that traffic is not reaching Gorouter.
- Was there a recent platform change or upgrade that caused an increase in 502 errors?
- Are there any suspicious [metrics](<%= vars.gorouter_metrics_link %>) spiking? How is the CPU and memory utilization?
## <a id="log-formatting"></a>Log formatting
### <a id="levels"></a> Levels
The following table describes the log levels supported by Gorouter.
<% if vars.platform_code == 'CF' %>
The log level is specified in the configuration YAML file for Gorouter.
<% else %>
The log level is set to `debug` and is not configurable.
<% end %>
<table class="table">
<thead>
<tr>
<th>Message</th>
<th>Description</th>
<th>Examples</th>
</tr>
</thead>
<tr>
<td><code>fatal</code></td>
<td>Gorouter is unable to handle any requests due to a fatal error.</td>
<td>Gorouter cannot bind to its TCP port, a <%= vars.app_runtime_abbr %> component has
published invalid data to Gorouter.</td>
</tr>
<tr>
<td><code>error</code></td>
<td>An unexpected error has occurred.</td>
<td>Gorouter failed to fetch token from UAA service.</td>
</tr>
<tr>
<td><code>info</code></td>
<td>An expected event has occurred.</td>
<td>Gorouter started or exited, Gorouter has begun to prune
routes for stale droplets.</td>
</tr>
<tr>
<td><code>debug</code></td>
<td>A lower-level event has occurred.</td>
<td>Route registration, route unregistration.</td>
</tr>
</table>
### <a id="message-contents"></a> Message contents
This section provides a sample Gorouter log entry and explanation of the contents.
`[2017-02-01 22:54:08+0000] {"log_level":0,"timestamp":1485989648.0895808,"message":"endpoint-registered","source":"vcap.Gorouter.registry","data":{"uri":"0-*.login.bosh-lite.com","backend":"10.123.0.134:8080","modification_tag":{"guid":"","index":0}}}
`
<table class="table">
<thead>
<tr>
<th>Property</th>
<th>Description</th>
</tr>
</thead>
<tr>
<td><code>log_level</code></td>
<td>Logging level of the message</td>
</tr>
<tr>
<td><code>timestamp</code></td>
<td>Epoch time of the log</td>
</tr>
<tr>
<td><code>message</code></td>
<td>Content of the log entry</td>
</tr>
<tr>
<td><code>source</code></td>
<td>Gorouter function that initiated the log entry</td>
</tr>
<tr>
<td><code>data</code></td>
<td>Additional information that varies based on the message</td>
</tr>
</table>
### <a id="access-logs"></a> Access logs
This section provides details about Gorouter access logs.
By default, Gorouter generates an access log in the following format when it receives a request:
```
<Request Host> - [<Start Date>] "<Request Method> <Request URL> <Request Protocol>" <Status Code> <Bytes Received> <Bytes Sent> "<Referrer>" "<User-Agent>" <Remote Address> <Backend Address> x_forwarded_for:"<X-Forwarded-For>" x_forwarded_proto:"<X-Forwarded-Proto>" vcap_request_id:<X-Vcap-Request-ID> response_time:<Response Time> gorouter_time:<Gorouter Time> app_id:<Application ID> app_index:<Application Index> instance_id:<Container Instance ID> x_cf_routererror:<X-Cf-RouterError> <Extra Headers>
```
When Gorouter is configured to log additional request attempt details, the format includes `failed_attempts`, `failed_attempts_time`, `dns_time`, `dial_time`, `tls_time`, and `backend_time`:
```
<Request Host> - [<Start Date>] "<Request Method> <Request URL> <Request Protocol>" <Status Code> <Bytes Received> <Bytes Sent> "<Referrer>" "<User-Agent>" <Remote Address> <Backend Address> x_forwarded_for:"<X-Forwarded-For>" x_forwarded_proto:"<X-Forwarded-Proto>" vcap_request_id:<X-Vcap-Request-ID> response_time:<Response Time> gorouter_time:<Gorouter Time> app_id:<Application ID> app_index:<Application Index> instance_id:<Instance ID> failed_attempts:<Failed Attempts> failed_attempts_time:<Failed Attempts Time> dns_time:<DNS Time> dial_time:<Dial Time> tls_time:<TLS Time> backend_time:<Backend Time> x_cf_routererror:<X-Cf-RouterError> <Extra Headers>
```
Gorouter access logs are also redirected to syslog.
See the following list for more information about the Gorouter access log fields:
* The following are optional fields: <code>Status Code</code>, <code>Response Time</code>, <code>Application ID</code>, <code>Application Index</code>, <code>Instance ID</code>, <code>X-Cf-RouterError</code>, and <code>Extra Headers</code>.
* If the access log lacks a <code>Status Code</code>, <code>Response Time</code>, <code>Application ID</code>, <code>Application Index</code>, or <code>X-Cf-RouterError</code>, the corresponding field shows `-`.
* `Response Time` is the total time it takes for the request to go through the Gorouter to the app
and for the response to travel back through the Gorouter.
This has the time that the request spends traversing the network to the app and back again to the Gorouter.
It also has the time the app spends forming a response.
* `Gorouter Time` is the total time it takes for the request to go through the Gorouter
initially plus the time it takes for the response to travel back through the Gorouter.
This does not have the time the request spends traversing the network to the app.
This also does not have the time the app spends forming a response.
* `X-Cf-RouterError` is populated if the Gorouter encounters an error. The returned values can
help distinguish whether a non 2xx response code is due to an error in the Gorouter
or the back end. For more information on the possible errors, see
the [Diagnose App Errors](#app-errors) section.
When Gorouter is configured to log additional request attempt details, the following fields are provided for diagnosing request behaviors:
* `Failed Attempts` is the number of attempts Gorouter tried to backends for the app before either getting a success, or giving up.
* `Failed Attempts Time` is the time spent on the attempted backend requests that failed.
* `DNS Time` is the time spent resolving DNS records.
* `Dial Time` is the time spent establishing a TCP connection with the backend.
* `TLS Time` is the time spent during TLS handshake with the backend.
* `Backend Time` is the time spent waiting on the backend to respond to the request.
## <a id="gorouter"></a>Diagnose Gorouter errors
This section describes the basic structure of Gorouter logs and how to diagnose Gorouter errors.
### <a id="cannot-connect"></a>Gorouter cannot connect to the app container
If Gorouter cannot connect to the app container, you might see this error in the `gorouter.log`:
<pre class="terminal">
[2018-07-05 17:59:10+0000] {"log_level":3,"timestamp":1530813550.92134,"message":
"backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
{"ApplicationId":"","Addr":"10.0.32.15:60099","Tags":null,"RouteServiceUrl":""},
"error":"dial tcp 10.0.32.15:60099: getsockopt: connection refused"}}
</pre>
If TCP cannot make an initial connection to the backend, Gorouter retries TCP dial errors up to [`max_attemps`](https://github.com/cloudfoundry/gorouter#dynamic-routing-table) which defaults to 3.
If it still fails, Gorouter returns a 502 to the client and writes to the `access.log`.
Any of the following can cause connection errors between Gorouter and the app container:
- An app that is unresponsive, indicates an issue with the app.
- A stale route in Gorouter, indicates an issue with the platform. For more information, see [Diagnose Stale Routes](#diagnose-stale-routes).
- A corrupted app container, indicates a problem with the platform.
### <a id="connect-error"></a>Gorouter errors after connecting
If Gorouter successfully dials the endpoint but an error occurs, you might see the following:
- `read: connection reset by peer` errors.
When these errors occur, the Gorouter returns error to the client, marks backend ineligible, and does not retry another back end.
- Only for idempotent requests and if [route-integrity](#tls-to-back-end) is enabled:
If a TCP connection can be established (with Envoy), but the app itself does not send a complete http response, Gorouter retries the request on different app instances if any exist.
A request is considered idempotent under the following circumstances (see also [Gorouter source code](https://github.com/golang/go/blob/5c489514bc5e61ad9b5b07bd7d8ec65d66a0512a/src/net/http/request.go#L1413-L1427)):
> Request body is empty AND (Request method is one of "GET", "HEAD", "OPTIONS", "TRACE" OR Request Header `X-Idempotency-Key` or `Idempotency-Key` is set)
The most likely scenario for this to occur is during creation of a new app instance, while the app is not yet completely started, but Envoy is already accepting connections.
- TLS handshake errors.
When these errors occur, the Gorouter retries up to [`max_attemps`](https://github.com/cloudfoundry/gorouter#dynamic-routing-table) which defaults to 3.
If it still fails, Gorouter can return a 502.
These errors appear similar to the following in the `gorouter.log`, and a 502 error is logged in the `access.log`:
<pre class="terminal">
[2018-07-05 18:20:54+0000] {"log_level":3,"timestamp":1530814854.4359834,"message":"
backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
{"ApplicationId":"","Addr":"10.0.16.17:61002","Tags":null,"RouteServiceUrl":""},
"error":"x509:certificate is valid for 53079ca3-c4fe-4910-78b9-c1a6, not xxx"}}
</pre>
- If a clients cancels a request before the server responds with headers, Gorouter returns a 499 error.
## <a id="diagnose-stale-routes"></a> Diagnose stale routes in Gorouter
A stale route occurs when Gorouter contains out-of-date route information for a
backend app.
In nearly all cases, stale routes are self correcting.
<% if vars.platform_code == 'CF' %>
By default, when Gorouter detects that it is sending traffic to the wrong app, it prunes that backend app from its route table and terminates the connection.
To disable this behavior, set `router.backends.enable_tls` to `false` as described in [TLS to apps and other back end services](../concepts/http-routing.html#tls-to-back-end) in _HTTP Routing_.
<% else %>
When Gorouter detects that it is sending traffic to the wrong app, it prunes that backend app from its route table and terminates the connection.
<% end %>
### <a id="stale-routes"></a> Causes of stale routes
When a route is unmapped or when an app container is deleted because the app is deleted or moved, a deregister message is sent to Gorouter.
This message tells Gorouter to delete the route mapping to that container.
If Gorouter does not receive this deregister message, the route is now considered stale.
Gorouter still attempts to send traffic to the app.
You are more likely to have stale routes when the following are true:
- SSL verification is not activated.
<% if vars.platform_code == 'CF' %>
- You do not have Diego Release v2.34.0 or later deployed, which contains a fix that sends unregistration messages multiple times.
<% else %>
- You are on one of the versions of <%= vars.app_runtime_abbr %> that does not send deregister messages multiple times.
For the list of versions, see [Enabling TLS from the Gorouter to application instances results in bad routes in TAS for VMs 2.3+](https://community.pivotal.io/s/article/enabling-tls-from-the-gorouter-to-application-instances-results-in-bad-routes?language=en_US) in the Knowledge Base.
<% end %>
- You unmapped a route to an app, but traffic to that route is still being sent to the app.
### <a id="stale-route-triage"></a> How to locate stale routes
The following procedure helps you identify stale routes:
1. **Verify the state of the deployment**.
Run `cf routes` for all spaces and ensure the route is only mapped to the intended apps.
Sometimes there can be multiple routes using the same hostname and domain, but with different paths.
If the domain is shared, check all orgs as well.
1. **Examine the Gorouter routes table**. It might be necessary to check multiple Gorouters, as it is possible that some received the proper deregister message and some did not.
1. SSH to the VM where Gorouter is running.
1. To print the entire Gorouter routes table, run:
`/var/vcap/jobs/gorouter/bin/retrieve-local-routes | jq .`
1. Find the entry for the suspected stale route. Note the values for `address` and `private_instance_id`.
1. **Cross-reference the Gorouter routes table entry with actual Long-Running Processes (LRPs)**:
1. SSH onto the Diego Cell where the IP address matches the IP address that you found on the routes table entry.
1. To get information about all of the actual LRPs, run:
`cfdot actual-lrps | jq .'`
1. Look through the actual LRPs to find the instance ID that you noted from the routes table.
If that instance ID exists and the port in the route table does not exist in the ports section, then there is likely a stale route.
<p class="note important">
The CAPI endpoint <code>GET /v3/processes/:guid/stats</code>
provides information about the host and ports the app is using, and is intended for use by app developers.
This endpoint does not provide complete information for operators. Operators should
use the <code>cfdot</code> CLI on the Diego Cell to view the LRPs directly and all at once.</p>
### <a id="stale-routes-fix"></a> How to fix stale routes
In most cases, you do not have to fix stale routes.
Stale routes are normal in the routing table.
Gorouter fixes stale routes on its own when SSL verification is activated.
When Gorouter attempts to send traffic to a stale endpoint, it recognizes that the endpoint is stale.
Gorouter prunes the stale route and then tries another endpoint.
Stale routes are only a problem if an end user receives errors.
The following procedure helps you fix stale routes:
1. Ensure that SSL verification is activated.
For more information, see [With TLS Enabled](../concepts/http-routing.html#with-tls) in _HTTP Routing_.
<p class="note important">
Using TLS to verify app identity depends on SSL verification.
If you deactivate SSL verification, there is no way to avoid misrouting.
</p>
1. If there is a stale route, then restarting Gorouter fixes the immediate issue.
If you restart all of the Gorouters, and see the same issue for the exact same route, then the issue is not a stale route. The issue is that a bad route is continuing to be added to the route table no matter how many time Gorouter prunes it.
1. If Gorouter is continually missing deregister messages, it might be because either the NATS message bus or the Gorouters are overwhelmed.
Look at the VM usage and consider scaling.
## <a id="gorouter-error-classification"></a> Gorouter error classification table
Use this table when you are debugging Gorouter errors.
The table lists error types, status codes, and indicates if the Gorouter retries the errored request.
For each error, a `backend-endpoint-failure` log entry appears in `gorouter.log`
and an error message in `gorouter.err.log`.
Additionally, the `access.log` records the request status codes.
For more information about debugging Gorouter errors, see the [Gorouter documentation](https://github.com/cloudfoundry/gorouter#logs) on GitHub.
If the request is one that can be retried, Gorouter makes up to [`max_attemps`](https://github.com/cloudfoundry/gorouter#dynamic-routing-table) which defaults to 3.
<table class="table">
<thead>
<th width="15%">Error Type</th>
<th width="10%">Status Code</th>
<th width="15%">Can be re‑tried?</th>
<th width="20%">Source of Issue</th>
<th width="40%">Evidence</th>
</thead>
<tr>
<td>Dial</td>
<td>502</td>
<td>Yes</td>
<td>App or Platform</td>
<td>Logs with error <code>dial tcp</code></td>
</tr>
<tr>
<td>AttemptedTLSWith<br/>NonTLSBackend</td>
<td>525</td>
<td>Yes</td>
<td>Platform</td>
<td>Logs with error <code>tls: first record does not look like a TLS handshake</code> or <code>backend_tls_handshake_failed</code> metric increments</td>
</tr>
<tr>
<td>HostnameMismatch</td>
<td>503</td>
<td>Yes</td>
<td>Platform</td>
<td>Logs with error <code>x509: certificate is valid for < x >, not < y > </code><br/> or <code>backend_invalid_id</code> metric increments</td>
</tr>
<tr>
<td>UntrustedCert</td>
<td>526</td>
<td>Yes</td>
<td>Platform</td>
<td>Logs with error prefix <code>x509: certificate signed by unknown authority</code> or <code>backend_invalid_tls_cert</code> metric increments</td>
</tr>
<tr>
<td>RemoteFailedCertCheck</td>
<td>496</td>
<td>Yes</td>
<td>Platform</td>
<td>Logs with error remote <code>error: tls: bad certificate</code></td>
</tr>
<tr>
<td>ContextCancelled</td>
<td> 499</td><td> No</td>
<td> Client/App</td>
<td> Logs with error <code>context canceled</code><br>
This status code appears in logs only.
It is never returned to clients as it occurs when the downstream client closes the connection
before Gorouter responds.</td>
</tr>
<tr>
<td>RemoteHandshakeFailure</td>
<td> 525</td>
<td> Yes</td>
<td> Platform</td>
<td> Logs with error remote <code>error: tls: handshake failure</code> and <code>backend_tls_handshake_failed</code> metric increments</td>
</tr>
<tr>
<td>ExpiredOrNotYetValidCertFailure</td>
<td> 502</td>
<td> Yes</td>
<td> Platform</td>
<td> Logs with error <code>x509: certificate has expired or is not yet valid</code>.
For example, this error can occur if the Diego Cell clock drifts.</td>
</tr>
<tr>
<td><em>unknown</em></td>
<td> 502</td>
<td> No</td>
<td> App or Platform</td>
<td> If the error is not one of the types mentioned earlier, then <code>502</code> is the default response.
Gorouter tracks <code>502</code> errors with the <code>gorouter.bad_gateways</code> metric.
See <a href="<%= vars.gorouter_metrics_link %>">Router Error: 502 Bad Gateway</a>.
</td>
</tr>
</table>
## <a id="app-errors"></a>Diagnose app errors
This section describes app-related 502 errors.
When 502 errors only occur in specific app instances and not all app instances on the platform, it is likely an app-related error.
The app might be overloaded, unresponsive, or unable to connect to the database.
When all apps are experiencing 502 errors, then it could either be a platform issue. For example, a misconfiguration, or an app issue likely all apps being unable to connect to an upstream database.
If the Gorouter has back end keep-alive connections activated, 502 errors can occur due to a race condition when the app instance keep alive idle timeout is less than 90 seconds.
For more information, see [Gorouter Back End Keep-Alive Connections](../adminguide/routing-keepalive.html).
For any other errors that appear in Gorouter access logs, you can analyze application logs if they need further investigation.
<p> Gorouter does not retry any error response returned by the app.</p>
### Gorouter specific response headers
In the case that Gorouter encounters an error connecting to an application backend, the `X-CF-RouterError` header is populated to help distinguish the origin of a non-2xx response code.
The value of the `X-Cf-Routererror` header can be one of the following:
| Value | Description |
|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| invalid_cf_app_instance_header | The provided value for the `X-Cf-App-Instance` header does not match the required format of `APP_GUID:INSTANCE_ID`. |
| empty_host | The value for the `Host` header is empty, or the `Host` header is equivalent to the remote address. Some LB's optimistically set the `Host` header value with their IP address when there is no value present. |
| unknown_route | The desired route does not exist in the gorouter's route table. |
| no_endpoints | The entry is in the route table for the desired route, but there are no healthy endpoints available. |
| Connection Limit Reached | The backends associated with the route have reached their max number of connections. The max connection number is set via the spec property `router.backends.max_conns`. |
| route_service_unsupported | Route services are not enabled or WebSockets requests are bound to route services. You can configure route services using the spec property `router.route_services_secret`. If the property is empty, route services are deactivated. |
| endpoint_failure | The registered endpoint for the desired route failed to handle the request.