Keep-alive in the runtime client
How go-openapi/runtime reuses TCP connections, what the kernel and the
HTTP transport actually do for you, and where it goes wrong when there
is a NAT gateway, proxy, or firewall between your client and the server.
Concretely, issue #336 (“I get context deadline exceeded after a
quiet period”) is the canonical example.
Scope: the client-side Runtime. Server-side keep-alive (http.Server’s own timers) is summarised briefly at the end, with pointers into the Go stdlib docs.
TL;DR
If your client lives behind a NAT gateway, a load balancer, or a firewall
with an idle conntrack timeout (AWS NAT: 350 seconds; many corporate
firewalls: a few minutes), and you see context deadline exceeded on
requests that follow a quiet period:
- Check what your `Runtime.Transport` is. If you let `client.New` pick the default (`http.DefaultTransport`), you already get `IdleConnTimeout = 90s` and `Dialer.KeepAlive = 30s`. Those defeat most NAT timeouts.
- If you replaced the Transport (for TLS config, a proxy, etc.), you almost certainly lost those defaults. Reinstate them.
- On Go 1.23+, set an explicit `net.Dialer.KeepAliveConfig`: the bare `KeepAlive` field only sets the probe interval, not the idle delay before probing starts. On Linux the kernel default for that idle delay is often 7200 seconds (two hours), so probes never fire before a 350s NAT timeout drops your conntrack entry.
- Do not reach for `Runtime.EnableConnectionReuse`. The name is misleading: it does not control TCP keepalive or NAT timeouts. See “the misnomer” below.
A recipe at the bottom of this document covers the cloud / NAT case.
Two distinct things named “keep-alive”
The word “keep-alive” is used for two unrelated mechanisms operating at different layers. The runtime, the stdlib, and the OS all speak about “keep-alive” without always disambiguating, which is the root of most confusion.
HTTP keep-alive (application layer)
Connection: keep-alive is an HTTP/1.1 default. It means the same TCP
connection serves multiple HTTP request/response pairs. The client
sends request 1, reads response 1, sends request 2 on the same socket,
reads response 2, and so on, until either side closes.
In Go, http.Transport keeps a pool of idle connections per host. After
a response body is fully read and closed, the connection goes back to
the pool. The next request to the same host may pick a connection from
the pool instead of dialling a new one. Skipping the dial saves a TCP
handshake plus, for HTTPS, a TLS handshake: typically tens to hundreds
of milliseconds per request.
TCP keepalive (kernel / socket layer)
SO_KEEPALIVE is a socket option asking the kernel to send periodic
empty ACK packets on an otherwise-idle TCP connection. The peer
acknowledges them. Two consequences:
- Dead-peer detection. If the peer disappears (machine rebooted, network partitioned), the kernel sees the missing ACKs and tears the connection down. Without keepalive, a half-open connection can linger indefinitely.
- Conntrack / NAT keep-alive. A NAT gateway or stateful firewall maintains a connection tracking (conntrack) entry per TCP flow passing through it. The entry is dropped after some idle period (AWS NAT uses 350 seconds; many enterprise firewalls use 60s to 15min). Once the entry is dropped, packets arriving for that flow are either silently discarded or rejected with a RST that may not reach the original sender. Periodic TCP keepalive packets count as live traffic, so the NAT keeps the entry fresh.
The first three of the four mechanisms below are HTTP keep-alive concerns; the fourth is the kernel/TCP one.
| Knob | Layer | Default | What it controls |
|---|---|---|---|
| `http.Transport.DisableKeepAlives` | HTTP | `false` (keep-alive on) | Whether a TCP conn serves more than one HTTP request |
| `http.Transport.MaxIdleConns` / `MaxIdleConnsPerHost` | HTTP | 100 / 2 | Idle-pool sizing |
| `http.Transport.IdleConnTimeout` | HTTP | 90s | How long an idle conn stays in the pool before close |
| `net.Dialer.KeepAlive` (plus `KeepAliveConfig` on Go 1.23+) | TCP | 30s | Whether and how the kernel sends keepalive probes on dialled connections |
If you only remember one thing: HTTP-layer settings decide whether the runtime reuses a connection. TCP-layer settings decide whether a through-the-network proxy still believes the connection exists. A mismatch produces issue #336.
What Go does for you, by default
`http.DefaultTransport` is the transport `client.New` sets on every
fresh Runtime. Its defaults, as of recent Go:
Read them the way most cloud-deployed Go services need:

- `IdleConnTimeout = 90s` is less than the AWS NAT 350s timeout, so an idle pooled connection is closed by Go before NAT drops it.
- `Dialer.KeepAlive = 30s` enables TCP keepalive probes every 30s, so active connections survive long NAT timeouts even when the application isn’t sending data.
For typical cases, these defaults are correct. You only need to think about this if you replaced the Transport, or if your environment has an unusual idle timeout.
How the runtime wires this
Runtime.Transport is the http.RoundTripper used for every outbound
request. Three things to know:
- The default is `http.DefaultTransport`, with the values above.
- Replacing `rt.Transport = ...` with a custom transport completely overrides the defaults: you inherit nothing unless you copy what `http.DefaultTransport` sets.
- `Runtime.SetDebug(true)` does not affect keep-alive at all; it only logs requests and responses.
The misnomer: Runtime.EnableConnectionReuse
Runtime.EnableConnectionReuse() is the method most users find when
searching for “keep-alive” or “connection reuse” in this codebase. The
name suggests it controls whether connections are pooled and reused. It
does not.
What it actually does: wraps Runtime.Transport in a RoundTripper
that, after every response, drains any unread bytes from the response
body before Close. The reason: Go’s http.Transport will only
return a connection to the idle pool if the response body was fully
read. If your handler stops reading early (for example, you only need
the HTTP status and skip the body), the connection is not reusable, and
the next request will pay the cost of a new dial + handshake.
So EnableConnectionReuse is a narrow fix for one specific pattern: code
that doesn’t fully read response bodies. It has no effect on:
- TCP keepalive packets;
- whether the connection survives a NAT idle timeout;
- the size of the idle pool;
- the idle timeout in the pool;
- any other connection-lifecycle concern.
If you ended up here following the issue #336 trail: this method will not help you. A future runtime release will either rename this method to something narrow and honest, or fold the body-draining behaviour into a default-on path so users no longer have to know about it.
The NAT idle-timeout failure mode
This is the scenario in issue #336. Walk through it once and the symptom becomes recognisable:
- The client makes a request. Go dials a fresh TCP connection through the NAT gateway. NAT creates a conntrack entry. Request completes; the connection goes into Go’s idle pool.
- The application is quiet for more than 350 seconds.
- Go’s idle pool has not yet evicted the connection (if you increased `IdleConnTimeout` past 350s, or if you have a custom transport that doesn’t set it). Or the connection is “active” because something is waiting on it, just not sending data: long polling, server-sent events, a slow streaming response.
- NAT drops the conntrack entry. No notification reaches either side.
- The application makes its next request. Go picks the still-pooled connection. The TCP stack believes it is fine; it sends.
- Packets disappear at the NAT. The server never sees the request, the client never sees a response. From the application’s view, the request hangs.
- Eventually the request’s context deadline fires: `context deadline exceeded`.
The same shape applies to any stateful network appliance between you and the server: load balancers, corporate firewalls, IPSec tunnels.
Solutions
Rely on the defaults (preferred)
If you can, use `http.DefaultTransport` and do not replace
`rt.Transport`. `IdleConnTimeout = 90s` and `Dialer.KeepAlive = 30s`
together cover the common NAT and firewall idle timeouts. No further
configuration needed.
Custom Transport: reinstate the defaults
When you build a custom transport (for TLSClientConfig, an HTTP proxy
URL, a MaxIdleConnsPerHost change, etc.), start from the
http.DefaultTransport values, then override only what you need:
The single most common bug is omitting the Dialer. A literal of the
form `&http.Transport{TLSClientConfig: ...}` with no `DialContext`
uses the net package’s default dialler, which has no keepalive at all.
Explicit KeepAliveConfig (Go 1.23+)
The bare net.Dialer.KeepAlive field sets the probe interval. On Linux,
the kernel does not start sending probes until a separate idle delay
elapses, and that idle delay defaults to 7200 seconds at the
tcp_keepalive_time sysctl. With AWS NAT’s 350s timeout, the probes
never start in time.
Go 1.23 introduced net.Dialer.KeepAliveConfig, which lets you set the
idle delay explicitly so the kernel does not depend on tcp_keepalive_time:
With these numbers, after 60 seconds of silence the kernel starts sending probes, well before the 350s NAT timeout; the conntrack entry stays fresh and the application sees no surprises.
Other levers
- `http.Transport.IdleConnTimeout` set to less than the NAT timeout forces Go to close idle connections before NAT can drop them. The next request then dials fresh.
- `http.Transport.DisableKeepAlives = true` opts out of HTTP keep-alive entirely: every request gets a fresh TCP connection. Simple and correct, but trades a handshake cost on every request. Reasonable for low-volume clients; pathological for high-volume ones.
Diagnosing keep-alive problems
When you suspect a keep-alive issue:
- Confirm the symptom shape. “Context deadline exceeded” after a quiet period is the fingerprint of a dropped conntrack. If the failures happen under load, it’s almost certainly something else.
- Check the Transport. Print or log `rt.Transport` early in your application; if it is `*http.Transport`, inspect `IdleConnTimeout` and the dialler’s `KeepAlive` / `KeepAliveConfig`. Many subtle bugs vanish at this step.
- Use `httptrace`. The stdlib’s `net/http/httptrace` package surfaces the connection lifecycle: `GotConn`, `PutIdleConn`, `ConnectStart`, `TLSHandshakeStart`, etc. When you see `GotConn` with `Reused: true` immediately followed by a hang, you have caught a stale pooled connection. (Future runtime versions may surface this via a built-in helper; see the roadmap.)
- On Linux, inspect kernel state. `ss -t -o` shows the keepalive timer for each active socket; `cat /proc/sys/net/ipv4/tcp_keepalive_*` shows the kernel defaults; `conntrack -L` (where available) shows the NAT side.
- `tcpdump` on the client. Look for outbound packets with no inbound response after the symptom appears. This confirms the NAT-drop hypothesis.
Server-side, briefly
A server’s keep-alive behaviour is governed by `http.Server`, not by
anything in the runtime middleware:

- `Server.IdleTimeout`: how long a kept-alive connection waits for the next request before the server closes it.
- `Server.ReadHeaderTimeout`, `ReadTimeout`, `WriteTimeout`: bound the time spent on individual phases; expiry closes the connection.
The runtime’s server middleware does not override these. If your server
sits behind a NAT or load balancer with an idle timeout, set
`Server.IdleTimeout` to a value below that timeout, so the server
proactively closes idle connections; Go clients will simply dial
again on their next request without surfacing an error.
Recipe: a Runtime for cloud / NAT environments
The construction below is the conservative starting point for a client deployed in AWS, GCP, or behind any stateful network appliance with an idle timeout. Adjust the timing constants if you have measurements; do not adjust them on intuition alone.
If you cannot move to Go 1.23, fall back to:
and rely on `IdleConnTimeout` to evict pooled connections before
NAT does. The kernel keepalive probes may or may not fire in time
depending on `tcp_keepalive_time`, but at least your idle pool is
self-policing.
Reference
- `net/http.Transport`
- `net.Dialer`, `net.KeepAliveConfig` (Go 1.23+)
- `net/http/httptrace.ClientTrace`
- Client transport: `client/runtime.go` (`Runtime.Transport`, `Runtime.New`)
- The misnomer: `client/keepalive.go` (`KeepAliveTransport`, `Runtime.EnableConnectionReuse`)
- Issue #336: https://github.com/go-openapi/runtime/issues/336