📖 10 min read (~ 2200 words).

Keep-alive in the runtime client

How go-openapi/runtime reuses TCP connections, what the kernel and the HTTP transport actually do for you, and where it goes wrong when there is a NAT gateway, proxy, or firewall between your client and the server.

Concretely: this is the reference for “I get context deadline exceeded after a quiet period” — issue #336 is the canonical example.

Scope: client-side Runtime. Server-side keep-alive (http.Server’s own timers) is summarised briefly at the end, with pointers into the Go stdlib docs.

TL;DR

If your client lives behind a NAT gateway, a load balancer, or a firewall with an idle conntrack timeout (AWS NAT: 350 seconds; many corporate firewalls: a few minutes), and you see context deadline exceeded on requests that follow a quiet period:

  1. Check what your Runtime.Transport is. If you let client.New pick the default (http.DefaultTransport), you already get IdleConnTimeout = 90s and Dialer.KeepAlive = 30s. Those defeat most NAT timeouts.
  2. If you replaced the Transport (for TLS config, a proxy, etc.), you almost certainly lost those defaults. Reinstate them.
  3. On Go 1.23+, set an explicit net.Dialer.KeepAliveConfig — the bare KeepAlive field only sets the probe interval, not the idle delay before probing starts. On Linux the kernel default for the idle delay is often 7200 seconds (two hours), so probes never fire before a 350s NAT timeout drops your conntrack.
  4. Do not reach for Runtime.EnableConnectionReuse. The name is misleading — it does not control TCP keepalive or NAT timeouts. See “the misnomer” below.

A recipe at the bottom of this document covers the cloud / NAT case.

Two distinct things named “keep-alive”

The word “keep-alive” is used for two unrelated mechanisms operating at different layers. The runtime, the stdlib, and the OS all speak about “keep-alive” without always disambiguating, which is the root of most confusion.

HTTP keep-alive (application layer)

Persistent connections are the default in HTTP/1.1; the Connection: keep-alive header mostly matters for HTTP/1.0 compatibility. The effect is that the same TCP connection serves multiple HTTP request/response pairs. The client sends request 1, reads response 1, sends request 2 on the same socket, reads response 2, and so on, until either side closes.

In Go, http.Transport keeps a pool of idle connections per host. After a response body is fully read and closed, the connection goes back to the pool. The next request to the same host may pick a connection from the pool instead of dialling a new one. Skipping the dial saves a TCP handshake plus, for HTTPS, a TLS handshake — typically tens to hundreds of milliseconds per request.
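
You can watch the pool in action with the stdlib's net/http/httptrace (more on it in the diagnostics section below). A minimal sketch; the URL is a placeholder:

package main

import (
    "context"
    "fmt"
    "io"
    "net/http"
    "net/http/httptrace"
)

func main() {
    trace := &httptrace.ClientTrace{
        GotConn: func(info httptrace.GotConnInfo) {
            // Reused is true when the request skipped the dial and
            // picked a connection from the idle pool.
            fmt.Printf("reused=%t wasIdle=%t idleTime=%s\n",
                info.Reused, info.WasIdle, info.IdleTime)
        },
    }
    ctx := httptrace.WithClientTrace(context.Background(), trace)

    for i := 0; i < 2; i++ {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com", nil)
        if err != nil {
            panic(err)
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        io.Copy(io.Discard, resp.Body) // fully read, so the conn can be pooled
        resp.Body.Close()              // second iteration should print reused=true
    }
}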

TCP keepalive (kernel / socket layer)

SO_KEEPALIVE is a socket option asking the kernel to send periodic empty ACK packets on an otherwise-idle TCP connection. The peer acknowledges them. Two consequences:

  1. Dead-peer detection. If the peer disappears (machine rebooted, network partitioned), the kernel sees the missing ACKs and tears the connection down. Without keepalive, a half-open connection can linger indefinitely.
  2. Conntrack / NAT keep-alive. A NAT gateway or stateful firewall maintains a connection tracking (conntrack) entry per TCP flow passing through it. The entry is dropped after some idle period — AWS NAT uses 350 seconds, many enterprise firewalls use 60s–15min. Once the entry is dropped, packets arriving for that flow are either silently discarded or rejected with a RST that may not reach the original sender. Periodic TCP keepalive packets count as live traffic, so the NAT keeps the entry fresh. (A sketch of the socket option in isolation follows this list.)
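
For illustration, here is the socket option in isolation, on a raw dialled connection outside any http.Transport. A minimal sketch; the function name is ours:

import (
    "net"
    "time"
)

// dialWithKeepalive dials addr and turns on SO_KEEPALIVE with a 30s
// probe period, so conntrack entries along the path stay fresh while
// the connection is idle.
func dialWithKeepalive(addr string) (*net.TCPConn, error) {
    conn, err := net.Dial("tcp", addr)
    if err != nil {
        return nil, err
    }
    tcp := conn.(*net.TCPConn)
    if err := tcp.SetKeepAlive(true); err != nil {
        return nil, err
    }
    if err := tcp.SetKeepAlivePeriod(30 * time.Second); err != nil {
        return nil, err
    }
    return tcp, nil // Go 1.23+: tcp.SetKeepAliveConfig gives finer control
}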

The first three of the four mechanisms below are HTTP keep-alive concerns; the fourth is the kernel/TCP one.

| Knob | Layer | Default | What it controls |
| --- | --- | --- | --- |
| http.Transport.DisableKeepAlives | HTTP | false (keep-alive on) | Whether a TCP conn serves more than one HTTP request |
| http.Transport.MaxIdleConns / MaxIdleConnsPerHost | HTTP | 100 / 2 | Idle-pool sizing |
| http.Transport.IdleConnTimeout | HTTP | 90s | How long an idle conn stays in the pool before close |
| net.Dialer.KeepAlive (+ KeepAliveConfig on Go 1.23+) | TCP | 30s | Whether and how the kernel sends keepalive probes on dialled connections |

If you only remember one thing: HTTP-layer settings decide whether the runtime reuses a connection. TCP-layer settings decide whether a through-the-network proxy still believes the connection exists. A mismatch produces issue #336.

What Go does for you, by default

http.DefaultTransport is the transport client.New sets on every fresh Runtime. Its defaults, as of recent Go:

// from net/http
DefaultTransport = &http.Transport{
    Proxy:                 http.ProxyFromEnvironment,
    DialContext: (&net.Dialer{
        Timeout:   30 * time.Second,
        KeepAlive: 30 * time.Second, // TCP keepalive interval
    }).DialContext,
    ForceAttemptHTTP2:     true,
    MaxIdleConns:          100,
    IdleConnTimeout:       90 * time.Second,
    TLSHandshakeTimeout:   10 * time.Second,
    ExpectContinueTimeout: 1 * time.Second,
}

What these defaults mean for a typical cloud-deployed Go service:

  • IdleConnTimeout = 90s is less than the AWS NAT 350s timeout, so an idle pooled connection is closed by Go before NAT drops it.
  • Dialer.KeepAlive = 30s enables TCP keepalive probes every 30s, so active connections survive long NAT timeouts even when the application isn’t sending data.

For typical cases, these defaults are correct. You only need to think about this if you replaced the Transport, or if your environment has an unusual idle timeout.

How the runtime wires this

Runtime.Transport is the http.RoundTripper used for every outbound request. Three things to know:

  1. The default is http.DefaultTransport, with the values above. (A quick check follows this list.)
  2. Replacing rt.Transport = ... with a custom transport completely overrides the defaults — you inherit nothing unless you copy what http.DefaultTransport sets.
  3. Runtime.SetDebug(true) does not affect keep-alive at all — it only logs requests/responses.
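
A quick way to verify point 1 in your own build, assuming client.New leaves the stdlib default in place as described; host and base path are placeholders:

package main

import (
    "fmt"
    "net/http"

    "github.com/go-openapi/runtime/client"
)

func main() {
    rt := client.New("api.example.com", "/v1", []string{"https"})
    // Prints true until something in your code replaces rt.Transport.
    fmt.Println(rt.Transport == http.DefaultTransport)
}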

The misnomer — Runtime.EnableConnectionReuse

Runtime.EnableConnectionReuse() is the method most users find when searching for “keep-alive” or “connection reuse” in this codebase. The name suggests it controls whether connections are pooled and reused. It does not.

What it actually does: wraps Runtime.Transport in a RoundTripper that, after every response, drains any unread bytes from the response body before Close. The reason: Go’s http.Transport will only return a connection to the idle pool if the response body was fully read. If your handler stops reading early — for example, you only need the HTTP status and skip the body — the connection is not reusable, and the next request will pay the cost of a new dial + handshake.
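
Roughly, the pattern looks like this. A minimal sketch of a drain-before-close wrapper, not the runtime's actual implementation:

import (
    "io"
    "net/http"
)

type drainOnClose struct{ io.ReadCloser }

func (d drainOnClose) Close() error {
    // Read any leftover bytes so the transport can pool the connection.
    io.Copy(io.Discard, d.ReadCloser)
    return d.ReadCloser.Close()
}

type drainingTransport struct{ next http.RoundTripper }

func (t drainingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
    resp, err := t.next.RoundTrip(req)
    if err == nil && resp.Body != nil {
        resp.Body = drainOnClose{resp.Body}
    }
    return resp, err
}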

So EnableConnectionReuse is a narrow fix for one specific pattern: code that doesn’t fully read response bodies. It has no effect on:

  • TCP keepalive packets;
  • whether the connection survives a NAT idle timeout;
  • the size of the idle pool;
  • the idle timeout in the pool;
  • any other connection-lifecycle concern.

If you ended up here following the issue #336 trail: this method will not help you. A future runtime release will either rename this method to something narrow and honest, or fold the body-draining behaviour into a default-on path so users no longer have to know about it.

The NAT idle-timeout failure mode

This is the scenario in issue #336. Walk through it once and the symptom becomes recognisable:

  1. The client makes a request. Go dials a fresh TCP connection through the NAT gateway. NAT creates a conntrack entry. Request completes; the connection goes into Go’s idle pool.
  2. The application is quiet for more than 350 seconds.
  3. Go’s idle pool has not yet evicted the connection (if you increased IdleConnTimeout past 350s, or if you have a custom transport that doesn’t set it). Or the conn is “active” because something is waiting on it, just not sending data — long polling, server-sent events, slow streaming response.
  4. NAT drops the conntrack entry. No notification to either side.
  5. The application makes its next request. Go picks the still-pooled connection. The TCP stack believes it is fine; it sends.
  6. Packets disappear at the NAT. The server never sees the request, the client never sees a response. From the application’s view, the request hangs.
  7. Eventually the request’s context deadline fires: context deadline exceeded.

The same shape applies to any stateful network appliance between you and the server: load balancers, corporate firewalls, IPSec tunnels.

Solutions

Rely on the defaults (preferred)

If you can, use http.DefaultTransport and do not replace rt.Transport. IdleConnTimeout=90s and Dialer.KeepAlive=30s together cover the common NAT and firewall idle timeouts. No further configuration is needed.

Custom Transport — reinstate the defaults

When you build a custom transport (for TLSClientConfig, an HTTP proxy URL, a MaxIdleConnsPerHost change, etc.), start from the http.DefaultTransport values, then override only what you need:

import (
    "net"
    "net/http"
    "time"
)

rt.Transport = &http.Transport{
    Proxy: http.ProxyFromEnvironment,
    DialContext: (&net.Dialer{
        Timeout:   30 * time.Second,
        KeepAlive: 30 * time.Second,
    }).DialContext,
    ForceAttemptHTTP2:     true,
    MaxIdleConns:          100,
    IdleConnTimeout:       90 * time.Second,
    TLSHandshakeTimeout:   10 * time.Second,
    ExpectContinueTimeout: 1 * time.Second,

    TLSClientConfig: yourTLSConfig, // <- your actual override
}

The single most common bug is omitting the Dialer. A literal of the form &http.Transport{TLSClientConfig: ...} with no DialContext falls back to a zero-value dialler, which on current Go still enables keepalive with a short default interval but leaves the idle delay to the kernel. Worse, it leaves IdleConnTimeout at zero, so idle connections are pooled forever and are prime candidates for a NAT drop.
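
For contrast, the shape of the bug, reusing the yourTLSConfig placeholder from above:

// Bug: inherits nothing from http.DefaultTransport. No DialContext,
// and IdleConnTimeout stays 0, i.e. idle conns are pooled forever.
rt.Transport = &http.Transport{
    TLSClientConfig: yourTLSConfig,
}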

Explicit KeepAliveConfig (Go 1.23+)

The bare net.Dialer.KeepAlive field sets the probe interval. On Linux, the kernel does not start sending probes until a separate idle delay elapses, and that idle delay defaults to 7200 seconds (the tcp_keepalive_time sysctl). Against AWS NAT's 350s timeout, the probes never start in time.

Go 1.23 introduced net.Dialer.KeepAliveConfig, which lets you set the idle delay explicitly so the kernel does not depend on tcp_keepalive_time:

DialContext: (&net.Dialer{
    Timeout: 30 * time.Second,
    KeepAliveConfig: net.KeepAliveConfig{
        Enable:   true,
        Idle:     60 * time.Second,  // wait 60s of idleness, then start probing
        Interval: 30 * time.Second,  // send a probe every 30s
        Count:    4,                 // drop the conn after 4 missed probes
    },
}).DialContext,

With these numbers, after 60 seconds of silence the kernel starts sending probes, well before the 350s NAT timeout — the conntrack stays fresh, the application sees no surprises.

Other levers

  • http.Transport.IdleConnTimeout set to less than the NAT timeout forces Go to close idle connections before NAT can drop them. The next request then dials fresh.
  • http.Transport.DisableKeepAlives = true opts out of HTTP keep-alive entirely — every request gets a fresh TCP connection. Simple and correct, but trades a handshake cost on every request. Reasonable for low-volume clients; pathological for high-volume ones. A sketch applying either lever without retyping the defaults follows this list.
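
The sketch below uses http.Transport.Clone (available since Go 1.13) so the dialler and the other defaults carry over; only the lever you choose changes:

t := http.DefaultTransport.(*http.Transport).Clone()
t.IdleConnTimeout = 60 * time.Second // evict pooled conns well before a 350s NAT timeout
// t.DisableKeepAlives = true        // the blunt alternative: fresh conn per request
rt.Transport = t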

Diagnosing keep-alive problems

When you suspect a keep-alive issue:

  • Confirm the symptom shape. “Context deadline exceeded” after a quiet period is the fingerprint of a dropped conntrack. If the failures happen under load, it’s almost certainly something else.
  • Check the Transport. Print or log rt.Transport early in your application; if it is *http.Transport, inspect IdleConnTimeout and the dialler’s KeepAlive / KeepAliveConfig (see the sketch after this list). Many subtle bugs vanish at this step.
  • Use httptrace. The stdlib’s net/http/httptrace package surfaces the connection lifecycle — GotConn, PutIdleConn, ConnectStart, TLSHandshakeStart, etc. When you see GotConn with Reused: true immediately followed by a hang, you have caught a stale pooled connection. (Future runtime versions may surface this via a built-in helper; see the roadmap.)
  • On Linux, inspect kernel state. ss -t -o shows the keepalive timer for each active socket; cat /proc/sys/net/ipv4/tcp_keepalive_* shows the kernel defaults; conntrack -L (where available) shows the NAT side.
  • tcpdump on the client. Look for outbound packets with no inbound response after the symptom appears. Confirms the NAT-drop hypothesis.
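
A sketch of the transport check from the second bullet. Note that a dialler's settings cannot be recovered from the DialContext func value; for those, read the code that built the transport:

if t, ok := rt.Transport.(*http.Transport); ok {
    log.Printf("IdleConnTimeout=%s MaxIdleConns=%d DisableKeepAlives=%t",
        t.IdleConnTimeout, t.MaxIdleConns, t.DisableKeepAlives)
} else {
    // Something wrapped or replaced it; the %T tells you what to unwrap.
    log.Printf("transport is %T, not *http.Transport", rt.Transport)
}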

Server-side, briefly

A server’s keep-alive behaviour is governed by http.Server, not by anything in the runtime middleware:

  • Server.IdleTimeout — how long a kept-alive connection waits for the next request before the server closes it.
  • Server.ReadHeaderTimeout, ReadTimeout, WriteTimeout — bound the time spent on individual phases; expiry closes the connection.

The runtime’s server middleware does not override these. If your server sits behind a NAT or load balancer with an idle timeout, set Server.IdleTimeout to a value below that timeout so the server proactively closes idle connections — clients on Go will simply dial again on their next request without surfacing an error.
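
A sketch, assuming a load balancer with a 60-second idle timeout in front; yourHandler is a placeholder:

srv := &http.Server{
    Addr:              ":8080",
    Handler:           yourHandler,
    IdleTimeout:       55 * time.Second, // below the LB's 60s, so the server closes first
    ReadHeaderTimeout: 10 * time.Second,
}
log.Fatal(srv.ListenAndServe())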

Recipe — Runtime for cloud / NAT environments

The construction below is the conservative starting point for a client deployed in AWS, GCP, or behind any stateful network appliance with an idle timeout. Adjust the timing constants if you have measurements; do not adjust them on intuition alone.

package main

import (
    "net"
    "net/http"
    "time"

    "github.com/go-openapi/runtime/client"
)

func newClient(host, basePath string) *client.Runtime {
    rt := client.New(host, basePath, []string{"https"})

    rt.Transport = &http.Transport{
        Proxy: http.ProxyFromEnvironment,
        DialContext: (&net.Dialer{
            Timeout: 30 * time.Second,
            // Go 1.23+: explicit idle delay; bare KeepAlive=30s is
            // not enough on Linux because the kernel idle default
            // (tcp_keepalive_time) is often 7200s.
            KeepAliveConfig: net.KeepAliveConfig{
                Enable:   true,
                Idle:     60 * time.Second,
                Interval: 30 * time.Second,
                Count:    4,
            },
        }).DialContext,
        ForceAttemptHTTP2:     true,
        MaxIdleConns:          100,
        IdleConnTimeout:       60 * time.Second, // < AWS NAT's 350s
        TLSHandshakeTimeout:   10 * time.Second,
        ExpectContinueTimeout: 1 * time.Second,
    }

    return rt
}

If you cannot move to Go 1.23, fall back to:

DialContext: (&net.Dialer{
    Timeout:   30 * time.Second,
    KeepAlive: 30 * time.Second,
}).DialContext,
IdleConnTimeout: 60 * time.Second,

and rely on the IdleConnTimeout to evict pooled connections before NAT does. The kernel keepalive probes may or may not fire in time depending on tcp_keepalive_time, but at least your idle pool is self-policing.

Reference