mirror of
https://code.videolan.org/rist/librist.git
synced 2026-07-04 15:06:53 +00:00
fix(eap): recover and reintegrate a bonded EAP-SRP sender leg after a flap
A bonded caller-sender leg that lost and regained connectivity on the same source tuple -- an interface flap, or an upstream outage that resumes without a reconnect -- stayed wedged and never rejoined the bond without an operator restart. While the leg is silent the far-end authenticator times out and purges its session. The leg keeps its miface-bound UDP socket (a flap does not close it) and stays in EAP SUCCESS, so it keeps streaming data the authenticator now drops while it waits for an EAPOL START the caller never sends, because EAP is authenticator-driven and the caller believes it is still authenticated.

- Extend try_caller_socket_rebind to sender-mode callers: a sender leg silent past session_timeout resets its EAP context (eap_reset_authenticatee) and re-drives the SRP handshake on the existing socket. It does not rebind -- the miface-bound socket is still valid and rebinding would move the source tuple the far end expects.
- Fold the leg back into the weighted bond once it re-authenticates. Recovery de-authenticates the leg (so it leaves the sender balancing rotation while down) and rewinds eap_authentication_state, so the "EAP Authentication succeeded" transition fires again on re-auth and restores the connection-level authenticated flag via rist_peer_authenticate. Without this the leg re-authenticates but is left out of balancing and carries only NACK retransmits instead of its share.
- Only SRP sender legs need this; plaintext/PSK senders have no such deadlock and recover via normal reconnect, so they are left untouched.

Reproduced and verified with a bonded advanced-profile ristsender over two netns/veth legs to an SRP listener: one leg is silenced with 100% packet loss both ways (tc netem) while its interface stays up, so the socket persists on the same source tuple. Before the fix the returning leg never re-authenticates and the listener floods "handshake is still pending"; with re-auth alone it authenticates but the sender balances over only the surviving leg; after the full fix the returning leg re-authenticates and resumes carrying its full weighted share (verified on a restored zero-loss link, matching a plaintext bond).

Added as test/rist/test_bonded_leg_flap_netns.sh (meson "netns" suite), which asserts both re-authentication and reintegration; it needs Linux root + netns/tc and cleanly skips (exit 77) otherwise. An in-process loopback test cannot reproduce the wedge because a loopback leg has no miface binding and self-heals with a new-port handshake.
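The two pass/fail criteria described above can be sketched in isolation. This is a minimal illustration, not the test itself: count_balancing_legs and classify_flap are hypothetical helper names, and the overall layout of the sender-stats line is an assumption; only the "miface":"..." key pattern is taken from what the test greps for.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical and the stats-line layout is
# assumed; the "miface":"<name>" pattern is the one the regression test uses.

# Count the distinct miface values found in one sender-stats line.
count_balancing_legs() {
    printf '%s\n' "$1" | grep -o '"miface":"[^"]*"' | sort -u | wc -l | tr -d ' '
}

# Classify a run from the successful-auth counts before and after the flap
# and from the number of legs still being balanced afterwards.
classify_flap() {
    local auths_before=$1 auths_after=$2 legs=$3
    if [ "$auths_after" -le "$auths_before" ]; then
        echo wedged              # the leg never re-authenticated
    elif [ "$legs" -lt 2 ]; then
        echo not-reintegrated    # re-authenticated but left out of balancing
    else
        echo fixed               # re-authenticated and back in the bond
    fi
}
```

For instance, `classify_flap 2 3 1` prints `not-reintegrated`: a third authentication was seen, but only one leg is being balanced, which is the intermediate "re-auth alone" state described above.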
@@ -41,6 +41,18 @@ Fable 5 security audit:
 address buffer.

 Bug Fixes:
+- EAP-SRP: a bonded caller-sender leg that goes silent and then resumes on
+  the same source tuple (an interface flap or upstream outage that comes
+  back without a reconnect) no longer stays wedged (issue #224). While the
+  leg was silent the far-end authenticator timed out and purged its session,
+  but the caller kept its still-valid miface-bound socket and stayed in EAP
+  SUCCESS, so it streamed data the authenticator now dropped while waiting
+  for an EAPOL START the caller never sent; the leg never rejoined the bond
+  without an operator restart. A caller-sender leg silent past
+  session_timeout now resets EAP and re-runs the SRP handshake on its
+  existing socket; once it re-authenticates it is folded back into the
+  weighted balancing rotation, so it resumes carrying its full share of the
+  bond on its own instead of lingering as an authenticated-but-idle leg.
 - Receiver: the "Too many old packets, resetting buffer" error no longer
   repeats on every output cycle on degraded or bonded links. The flow reset
   that emits it now clears its internal trigger, so a sustained run of late
+50
-10
@@ -2891,21 +2891,18 @@ static bool try_listener_reassociate_by_cname(struct rist_peer *new_peer, uint64
 	return true;
 }

-/* Receiver-caller socket rebind on a NAT source-port rebind / sender
- * silence. Plaintext, shared-PSK and EAP-SRP callers. SRP callers
- * additionally reset their EAP state and re-initiate the handshake on
- * the fresh socket (see below) so recovery does not depend on the
- * listener still holding the old authenticated session -- which never
- * works for a NAT'd caller after the listener restarts, because the
- * hub cannot reach back through the NAT to drive the listener-side
- * reassociation path. Linear backoff capped at REBIND_BACKOFF_CAP. */
+/* Caller-side recovery when a peer goes silent past session_timeout.
+ * Receiver-mode callers rebind the local socket (NAT rebind / listener
+ * restart); SRP callers also reset EAP and re-handshake on the fresh socket.
+ * Sender-mode callers only reach the SRP path: their miface-bound socket
+ * survives a flap, so they reset EAP without rebinding (see the branches).
+ * Linear backoff capped at REBIND_BACKOFF_CAP. */
 #define REBIND_BACKOFF_CAP 10
 static bool try_caller_socket_rebind(struct rist_peer *peer, uint64_t now)
 {
 	struct rist_common_ctx *cctx = get_cctx(peer);
 	if (!peer || peer->parent || peer->listening ||
-	    !peer->receiver_mode || peer->multicast_sender ||
-	    peer->multicast_receiver)
+	    peer->multicast_sender || peer->multicast_receiver)
 		return false;
 	if (cctx->profile <= RIST_PROFILE_SIMPLE)
 		return false;
@@ -2932,6 +2929,42 @@ static bool try_caller_socket_rebind(struct rist_peer *peer, uint64_t now)
 	    (now - peer->last_rebind_time) < min_gap)
 		return false;

+	if (!peer->receiver_mode) {
+#if HAVE_SRP_SUPPORT
+		/* Sender-mode leg: the miface-bound socket survives a flap, so
+		 * don't rebind it -- just reset EAP and re-drive the handshake on
+		 * the existing socket. Only SRP deadlocks like this; plaintext/PSK
+		 * recover via normal reconnect.
+		 *
+		 * De-authenticate the leg so it drops out of the weighted sender
+		 * balancing while it is silent (the balancer keeps a leg in rotation
+		 * only while authenticated) and re-drives the connection handshake.
+		 * eap_authentication_state is rewound to 1 so the "EAP Authentication
+		 * succeeded" transition fires again when re-auth completes; that path
+		 * restores authenticated and folds the leg back into the bond at full
+		 * weight (without this, the leg re-authenticates but never rejoins
+		 * balancing, streaming only NACK retransmits). */
+		if (peer->eap_ctx == NULL)
+			return false;
+		peer->authenticated = false;
+		peer->eap_authentication_state = 1;
+		peer->dead = 0;
+		peer->timed_out = 0;
+		peer->last_pkt_received = now;
+		eap_reset_authenticatee(peer->eap_ctx);
+		peer->rebind_attempts++;
+		peer->last_rebind_time = now;
+		rist_log_priv(cctx, RIST_LOG_WARN,
+			"Sender caller peer %"PRIu32" silent past session_timeout "
+			"(attempt %"PRIu32"); reset EAP and re-initiated the SRP "
+			"handshake to recover the leg without operator intervention.\n",
+			peer->adv_peer_id, peer->rebind_attempts);
+		return true;
+#else
+		return false;
+#endif
+	}
+
 	struct evsocket_ctx *evctx = cctx->evctx;
 	int old_sd = peer->sd;
 	uint16_t old_local_port = peer->local_port;
@@ -3931,6 +3964,13 @@ protocol_bypass:
 			rist_log_priv(get_cctx(peer), RIST_LOG_INFO,
 				"Peer %d EAP Authentication succeeded\n", peer->adv_peer_id);
 			p->eap_authentication_state = 2;
+			/* A caller-sender leg that re-authenticated after going
+			 * silent (see try_caller_socket_rebind) cleared its
+			 * connection-level authenticated flag to leave the bond
+			 * while down. Restore it now so the weighted balancer
+			 * folds the leg back in at full weight. */
+			if (!p->receiver_mode && !p->listening && !p->authenticated)
+				rist_peer_authenticate(p);
 			//First authentication, so send keepalive
 			_librist_proto_gre_send_keepalive(p, p->rist_gre_version);
 			_librist_proto_gre_send_keepalive(p, p->rist_gre_version);
@@ -170,6 +170,22 @@ if have_srp
 	# idle forever.
 	test('Receiver caller socket rebind on silence (srp)', test_caller_socket_rebind,
 		args: ['srp'], suite: ['regression', 'nat', 'srp'], timeout: 120)
+	# Regression for the bonded EAP-SRP sender-leg flap bug (issue #224).
+	# A bonded caller's leg is silenced (100% loss both ways) with its
+	# interface left UP, so its miface-bound socket persists on the same
+	# source tuple; the listener purges the timed-out leg while the other
+	# leg keeps the bond alive. When the loss clears the caller must reset
+	# EAP and re-authenticate the leg on its own -- before the fix it
+	# stayed wedged in stale EAP SUCCESS forever. This needs real
+	# interfaces + tc packet loss (a loopback leg has no miface binding and
+	# self-heals via a new-port handshake, hiding the bug), so it runs only
+	# on Linux as root with netns/tc and cleanly SKIPs (exit 77) otherwise.
+	if get_option('built_tools')
+		test('Bonded SRP leg re-authenticates after a same-tuple flap (srp, netns)',
+			find_program('test_bonded_leg_flap_netns.sh'),
+			args: [ristsender_exe, ristreceiver_exe],
+			suite: ['netns', 'nat', 'multipath', 'srp'], timeout: 120)
+	endif
 endif

 ###?profile= URL override tests (rist_*_create profile != URL profile)
Executable
+182
@@ -0,0 +1,182 @@
#!/bin/bash
# librist. Copyright © 2026 SipRadius LLC.
# SPDX-License-Identifier: BSD-2-Clause
#
# Regression for the bonded EAP-SRP sender-leg re-authentication bug
# (upstream issue #224).
#
# A bonded caller (sender) with EAP-SRP has two legs to one listening
# receiver. One leg is silenced (100% packet loss BOTH ways) long enough
# for the listener to time out and purge that leg's peer, while the other
# leg keeps the bond -- and the caller process -- alive. Crucially the
# silenced leg's interface stays UP, so its miface-bound UDP socket is
# never torn down: when the loss clears the leg resumes from the EXACT
# same source tuple, with the caller still in EAP SUCCESS. The listener
# purged the session and now silently drops the caller's data while it
# waits for an EAPOL START the caller never sends (EAP is authenticator-
# driven and the caller believes it is still authenticated). The leg is
# wedged forever.
#
# This is why the reproduction needs real interfaces + tc packet loss and
# cannot be an in-process loopback test: a loopback leg has no miface
# binding, so it is recreated on timeout with a fresh ephemeral port and
# self-heals with a cold handshake -- hiding the bug. Likewise an
# "ip link set ... down" makes the sends fail and forces the same
# new-port reconnect. Only same-tuple silence reproduces the wedge.
#
# The fix drives the caller to reset EAP and re-run the SRP handshake on
# that leg once it has been silent past session_timeout, so the leg
# re-authenticates AND rejoins the weighted bond on its own.
#
# Two discriminators (the receiver is a stable listener the whole time, so
# the flow never resets):
#  1. Successful EAP authentications. Two legs authenticate at startup; a
#     third is only possible if the flapped leg re-authenticated on its own.
#     Before any fix it stays wedged at two and the receiver floods
#     "handshake is still pending".
#  2. Legs the sender is balancing over after recovery, counted as the
#     distinct mifaces in the final sender-stats line. A leg only appears
#     there while it is authenticated and in the balancing rotation, so this
#     catches the reintegration defect where a leg re-authenticates but is
#     left out of balancing and streams only NACK retransmits (returns 1
#     instead of 2).
#
# Usage: test_bonded_leg_flap_netns.sh <ristsender> <ristreceiver>
# Exit:  0  = leg re-authenticated and rejoined the bond (fixed)
#        1  = leg wedged or did not rejoin balancing (bug)
#        77 = SKIP (needs Linux root + netns + tc)   99 = setup error
set -u

TX="${1:-}"
RX="${2:-}"
USER=flapuser; PASS=flappass; CN=testcn
RXIP=10.0.1.51
NS=rist224_snd
RXLOG="$(mktemp)"; TXLOG="$(mktemp)"

skip() { echo "SKIP: $*"; cleanup 2>/dev/null; exit 77; }
setup_err() { echo "SETUP ERROR: $*"; cleanup 2>/dev/null; exit 99; }

cleanup() {
    ip netns del "$NS" 2>/dev/null
    ip link del veth-a 2>/dev/null
    ip link del veth-b 2>/dev/null
    ip link del rbr0 2>/dev/null
    [ -n "${TXPID:-}" ] && kill "$TXPID" 2>/dev/null
    [ -n "${RXPID:-}" ] && kill "$RXPID" 2>/dev/null
    [ -n "${FEEDPID:-}" ] && kill "$FEEDPID" 2>/dev/null
    rm -f "$RXLOG" "$TXLOG" 2>/dev/null
}
trap cleanup EXIT

# ---- environment gate: skip cleanly where we cannot reproduce ----------
[ "$(uname -s)" = "Linux" ] || skip "not Linux"
[ "$(id -u)" = "0" ] || skip "needs root (CAP_NET_ADMIN) for netns/veth/tc"
[ -n "$TX" ] && [ -x "$TX" ] || setup_err "ristsender not found ($TX)"
[ -n "$RX" ] && [ -x "$RX" ] || setup_err "ristreceiver not found ($RX)"
command -v ip >/dev/null 2>&1 || skip "iproute2 'ip' missing"
command -v tc >/dev/null 2>&1 || skip "iproute2 'tc' missing"
ip netns add "$NS" 2>/dev/null || skip "cannot create netns (no CAP_NET_ADMIN?)"

# ---- topology: bridge in root ns, two veth legs into a sender netns -----
ip link add rbr0 type bridge || setup_err "bridge add"
ip addr add ${RXIP}/24 dev rbr0 || setup_err "bridge addr"
ip link set rbr0 up || setup_err "bridge up"

ip link add veth-a type veth peer name sa || setup_err "veth-a"
ip link add veth-b type veth peer name sb || setup_err "veth-b"
ip link set veth-a master rbr0; ip link set veth-a up
ip link set veth-b master rbr0; ip link set veth-b up
ip link set sa netns "$NS"; ip link set sb netns "$NS"
ip netns exec "$NS" ip addr add 10.0.1.20/24 dev sa
ip netns exec "$NS" ip addr add 10.0.1.122/24 dev sb
ip netns exec "$NS" ip link set sa up
ip netns exec "$NS" ip link set sb up
ip netns exec "$NS" ip link set lo up
# same-subnet multi-homing: disable rp_filter so miface egress is kept
ip netns exec "$NS" sysctl -qw net.ipv4.conf.all.rp_filter=0
ip netns exec "$NS" sysctl -qw net.ipv4.conf.sa.rp_filter=0
ip netns exec "$NS" sysctl -qw net.ipv4.conf.sb.rp_filter=0
sysctl -qw net.ipv4.conf.all.rp_filter=0 >/dev/null 2>&1

ip netns exec "$NS" ping -c1 -W2 -I sa ${RXIP} >/dev/null 2>&1 || skip "leg A no connectivity"
ip netns exec "$NS" ping -c1 -W2 -I sb ${RXIP} >/dev/null 2>&1 || skip "leg B no connectivity"

# ---- receiver: listener, advanced profile, SRP authenticator -----------
$RX -p 2 -v 6 \
    -i "rist://@${RXIP}:2030?username=${USER}&password=${PASS}" \
    -o "udp://127.0.0.1:12345" >"$RXLOG" 2>&1 &
RXPID=$!
sleep 1

# udp feeder inside the sender netns -> ristsender input
ip netns exec "$NS" bash -c 'while :; do printf "RISTTESTPACKET%08d" $RANDOM > /dev/udp/127.0.0.1/5556; sleep 0.02; done' &
FEEDPID=$!
sleep 0.5

# ---- sender: bonded caller, two SRP legs (same cname + creds) ----------
ip netns exec "$NS" $TX -p 2 -v 6 \
    -i "udp://@127.0.0.1:5556" \
    -o "rist://${RXIP}:2030?miface=sa&weight=10&username=${USER}&password=${PASS}&cname=${CN},rist://${RXIP}:2030?miface=sb&weight=5&username=${USER}&password=${PASS}&cname=${CN}" \
    >"$TXLOG" 2>&1 &
TXPID=$!

echo "== warmup 12s (both legs authenticate + stream) =="
sleep 12
auths_setup=$(grep -c "Successfully authenticated" "$RXLOG")
echo "after warmup: successful auths=${auths_setup}"
if [ "$auths_setup" -lt 2 ]; then
    echo "INDETERMINATE: both legs did not authenticate before the flap."
    exit 99
fi

echo "== silence leg B (sb): 100% loss both ways, interface stays UP, 10s =="
ip netns exec "$NS" tc qdisc add dev sb root netem loss 100% || setup_err "tc add sb"
tc qdisc add dev veth-b root netem loss 100% || setup_err "tc add veth-b"
sleep 10
echo "== leg B restored (same socket / same source tuple) =="
ip netns exec "$NS" tc qdisc del dev sb root 2>/dev/null
tc qdisc del dev veth-b root 2>/dev/null
echo "== observe 28s (leg B must re-authenticate AND rejoin balancing) =="
sleep 28

auths_final=$(grep -c "Successfully authenticated" "$RXLOG")
flood=$(grep -c "handshake is still pending" "$RXLOG")
resets=$(grep -c "reset EAP and re-initiated the SRP" "$TXLOG")

# Reintegration check: the sender only lists a leg in its per-peer stats
# while that leg is authenticated and in the weighted balancing rotation, so
# the number of distinct mifaces in the final sender-stats line is the number
# of legs actually carrying balanced data. Re-authenticating alone is not
# enough -- a leg that re-auths but is left out of balancing streams only NACK
# retransmits and never returns to its share (that was the original defect).
last_sstats=$(grep '"sender-stats"' "$TXLOG" | tail -1)
legs_balancing=$(printf '%s' "$last_sstats" | grep -o '"miface":"[^"]*"' | sort -u | wc -l)
legs_balancing=$(printf '%s' "$legs_balancing" | tr -d ' ')
echo "== after flap: successful auths ${auths_setup} -> ${auths_final}, "\
     "handshake-pending flood=${flood}, sender EAP resets=${resets}, "\
     "legs balancing after recovery=${legs_balancing} =="

# A fresh EAP authentication after the flap proves the wedged leg
# re-handshook on its own. Without the fix the count never moves and the
# receiver floods "handshake is still pending".
if [ "$auths_final" -le "$auths_setup" ]; then
    echo "FAIL (issue #224): the flapped SRP leg never re-authenticated; it"\
         "stayed wedged in EAP SUCCESS while the listener waited for an"\
         "EAPOL START it never sent (successful auths stuck at ${auths_final},"\
         "handshake-pending flood=${flood})."
    exit 1
fi

if [ "${legs_balancing:-0}" -lt 2 ]; then
    echo "FAIL (issue #224 reintegration): the flapped leg re-authenticated"\
         "(successful auths ${auths_setup} -> ${auths_final}) but did not rejoin"\
         "the weighted bond; the sender balances over only ${legs_balancing} leg"\
         "so the returning leg carries no share."
    exit 1
fi

echo "PASS: the flapped SRP leg reset EAP, re-authenticated (successful auths"\
     "${auths_setup} -> ${auths_final}) and rejoined the weighted bond"\
     "(${legs_balancing} legs balancing) on its own."
exit 0
+2
-2
@@ -45,7 +45,7 @@ if compile_prometheus
 	tools_dependencies += prometheus_dep
 endif

-executable('ristsender',
+ristsender_exe = executable('ristsender',
 	['ristsender.c', 'yamlparse.c', 'oob_shared.c', srp_shared, tools_deps, rev_target],
 	dependencies: [
 		librist_dep,
@@ -57,7 +57,7 @@ executable('ristsender',
 	include_directories: inc,
 	install: should_install)

-executable('ristreceiver',
+ristreceiver_exe = executable('ristreceiver',
 	['ristreceiver.c', 'yamlparse.c', 'oob_shared.c', srp_shared, tools_deps, rev_target],
 	dependencies: [
 		librist_dep,