The issue #188 NAT/socket-rebind and multipath cname integration
tests included <pthread.h> and initialised their tracker mutex with
PTHREAD_MUTEX_INITIALIZER. MSVC has no system pthread.h, and the
Windows pthread-shim maps pthread_mutex_t to a CRITICAL_SECTION,
which cannot be initialised statically, so the whole librist VS
solution failed to build (only these tests; the library and tools
were fine).
Follow the convention already used by test_send_receive.c and
test_reflector.c: include "pthread-shim.h" (directly, or via
rist-private.h), initialise the tracker mutex at runtime with
pthread_mutex_init(), and declare the feeder threads with the
PTHREAD_START_FUNC() macro so they match the shim's __stdcall/LPVOID
signature. POSIX builds are unchanged; the suite now also compiles
under MSVC.
Drop the pre-release marker from the 0.2.18 heading and add the
canonical ABI 15:0:11 / API 4.12.0 summary lines, matching the
format of prior release sections.
Bump API to 4.12.0, ABI to 15:0:11 (soversion 4 unchanged,
binary-compatible with 0.2.15/0.2.16/0.2.17 and rc2). New interfaces
since rc2: rist_recovery_depth_set() with the
RIST_RECOVERY_DEPTH_MIN/DEFAULT/MAX macros, and the additive stats
fields (RIST_STATS_VERSION 3: profile / seq_bits / advanced_active).
Changes since rc2:
- Advanced-profile correctness fixes (>64k sequence gaps, ring-index
masking in receiver_mark_missing, expected-next-seq wrap mask,
payload-only byte accounting, sequence-based duplicate detection,
32-bit-safe merge-mode pairing).
- Default profile is now Advanced with Main interoperability (TR-06-3
Section 9); risttunnel follows RIST_DEFAULT_PROFILE.
- Tunable Advanced recovery-ring depth via ?recovery-depth=.
- Advanced-profile profile/framing visibility in the stats API and the
Prometheus exporter.
Add 0.2.18 pre-release entries for the changes that landed since rc2:
- Advanced-profile correctness fixes: >64k sequence gaps, ring-index
masking in receiver_mark_missing, expected-next-seq wrap mask,
payload-only byte accounting, sequence-based duplicate detection,
and 32-bit-safe merge-mode pairing
- Advanced-profile profile/framing visibility in the stats API
(RIST_STATS_VERSION 2 -> 3) and the Prometheus exporter, plus the
receiver counters that were previously JSON-only
- risttunnel follows RIST_DEFAULT_PROFILE
Also normalize two em dashes in the section to ASCII ' - '.
receiver_mark_missing() indexed receiver_queue[] with the raw last_seq_found
and current_seq, whereas every other queue access masks with
(receiver_queue_max - 1). For 16-bit flows the sequence range equals
receiver_queue_max, so a raw sequence was always a valid index. A 32-bit
(Advanced) sequence can exceed receiver_queue_max once the stream passes the
ring size: the access then reads off the end of the array and dereferences a
stale or NULL slot's packet_time, crashing the receiver. Mask both indices
like the rest of the queue accesses.
Several receiver counters were present in the stats JSON but never surfaced
as Prometheus metrics. Export retries, dropped_late, dropped_full,
duplicates, and the nack-depth buckets (recovered after two, three, four, and
more nacks).
The stats output gave no way to tell which RIST profile a flow used or
whether Advanced framing was active on the wire. Add `profile` and
`advanced_active` to the sender peer stats, and `profile`, `seq_bits`, and
`advanced_active` to the receiver flow stats, with matching fields in the
JSON payloads. Bump RIST_STATS_VERSION to 3 and the JSON schema version to 4.
The Prometheus exporter emits new `rist_client_flow_info` and
`rist_sender_peer_info` series that carry these as labels, so a scrape can
identify an Advanced flow (and 16-bit vs 32-bit framing) without changing the
label set of the existing series. On the sender, advanced_active reflects
whether Advanced framing is in use toward the peer; on the receiver flow it
is true only when the context is Advanced and the flow is on 32-bit framing,
so it reads false when an Advanced context stays on Main framing because the
peer has not advertised Advanced support (TR-06-3 Section 9) and false for
Main and Simple contexts.
The merge-mode pairing test computed the next sequence number with
`& UINT16_MAX`, truncating it to 16 bits, which mis-pairs packets near a
16-bit boundary on a 32-bit Advanced flow. Use rist_seq_next(), which wraps
correctly for both short_seq (16-bit) and 32-bit flows.
receiver_enqueue compared source_time to detect a duplicate in an occupied
slot. The Advanced path stamps source_time with the arrival time, so that
comparison never matched a genuine duplicate and the duplicate counter
under-reported. Compare the sequence number instead, which is what the slot
index is derived from.
This does not change how Advanced buffers packets: it still buffers on
arrival time, so avg_buffer_time measures buffer residence rather than
source-to-output latency. Buffering on the on-wire timestamp is left as a
separate change.
The Advanced receive path passed the full datagram size as the ingest size
while the Main path passes the payload only, so received_bytes and the
derived bitrate were not comparable across profiles. Pass adv_data_len (the
delivered payload) instead. ts_null_bytes stays 0 on the Advanced path,
which performs no ts-null reinsertion on receive.
receiver_enqueue computed the expected next sequence number with
`& (UINT16_MAX - 1)` (0xFFFE), which clears bit 0 and forces the value
even, so every odd successor was mis-computed. The mask was meant to wrap
a 16-bit counter, i.e. `& UINT16_MAX`.
The error is gated by the `packet_time < last_packet_ts` guard, so it never
fires on normal in-order delivery. It bites in ARRIVAL timing mode, where a
per-path arrival timeline can legitimately regress for an in-order packet on
a bonded flow, sending roughly half of odd-successor packets into the
late-drop path.
Extract rist_seq_next() (symmetric with rist_seq_gap) so the successor wraps
correctly for both short_seq (16-bit) and 32-bit flows, and add a link-free
unit test covering the odd-successor regression and both wrap points.
receiver_mark_missing() and receiver_enqueue() unconditionally masked the
forward sequence gap to 16 bits (& UINT16_MAX / & (UINT16_MAX-1)). That is
correct for Simple/Main (16-bit) flows but wrong for Advanced 32-bit flows:
a genuine >64k contiguous gap was truncated, distorting the NACK pacing math
and mis-classifying in-order packets in the arrival-timing reorder guard.
Reachable now that the Advanced recovery ring is resizable up to the full
32-bit space (?recovery-depth=).
Add a short_seq-aware rist_seq_gap() helper and use it at both sites. The
short_seq branch reproduces the original masks byte-for-byte, so 16-bit flows
are unchanged; only 32-bit flows now see the true gap. The wrap false-positive
cap is unified with the recovery-walk hole cap already in the tree
(short_seq ? UINT16_SIZE/2 : receiver_queue_max/2), finishing a conversion
that was only half-applied. Adds a pure, link-free unit test.
MR note: the 16-bit branch intentionally keeps the long-standing
`& (UINT16_MAX - 1)` (0xFFFE) mask in expected_seq, which forces the value
even and looks like it should be `& UINT16_MAX`. Preserving it keeps this
patch a zero-behavior-change for 16-bit flows; whether that mask is itself a
bug is a separate, older question and should be its own change.
Completes the default-profile sweep: risttunnel was still hardcoded to
Main while ristsender/ristreceiver/YAML now follow RIST_DEFAULT_PROFILE
(Advanced). risttunnel uses one profile for both its sender and receiver
side, so it now defaults to Advanced too and interoperates with a
Main-only peer via the TR-06-3 Section 9 fallback. Help text updated to
list advanced and the new default.
rist2rist is intentionally left at Simple (its documented purpose is to
receive Simple-profile input and re-emit Main); udp2udp does not select
a profile.
Changes the default RIST profile from Main to Advanced. This is now safe
because an Advanced endpoint interoperates with Main: it negotiates
Advanced framing with an Advanced peer and falls back to Main framing
with a Main-only peer (TR-06-3 Section 9, implemented in the preceding
commit), so existing Main deployments keep working without a flag.
- RIST_DEFAULT_PROFILE is now RIST_PROFILE_ADVANCED.
- ristsender, ristreceiver and the YAML config now take their default
profile from RIST_DEFAULT_PROFILE instead of a hardcoded
RIST_PROFILE_MAIN, so there is a single source of truth. Pass
-p 1 (or profile: 1 in YAML) to force Main as before.
- yamlparse.c gains an explicit <librist/peer.h> include for the macro.
The default only affects the profile a context starts in; the
?profile= URL override and the explicit profile argument to
rist_*_create() are unchanged. rist_peer_config still records the
default with profile_set = 0, so it is applied only when ?profile= is
present.
An Advanced-profile sender previously emitted Advanced (Type-8) framing
unconditionally, so a Main-only receiver could not decode the stream;
an Advanced receiver likewise assumed 32-bit framing for the whole flow.
This implements the spec's optional Main/Advanced interop method so a
single flow works across mismatched profiles.
Sender (udp.c): an Advanced device now starts in Main-Profile framing
and only switches to Advanced framing for a peer once that peer
advertises Advanced capability (I=1 in its Main keep-alives, tracked as
remote_supports_advanced). Until then it emits Main-conformant media a
Main-only peer can decode. Control and OOB are unchanged.
Receiver (rist-common.c): rist_receiver_recv_data() learns the actual
wire framing of each data packet (pkt_short_seq) and refines the flow's
short_seq from it. The two framings carry different sequence widths and
timestamp encodings, so a mid-stream framing change (the Main->Advanced
upgrade once a peer advertises I=1) is treated like a flow-id change:
the timing baseline is dropped so the next enqueue re-derives time_offset
and the seq->index mapping from the new framing instead of blending the
two. Without this the stale baseline corrupts delivery after the switch.
flow.c: Advanced flows default to 32-bit framing at create time;
recv_data refines it per wire. The recovery ring size stays fixed per
context.
Matched Advanced and matched Main pairs are unaffected. Adds a permanent
cross-profile interop suite (both directions, with and without 10% loss)
to test/rist/meson.build.
The Advanced-profile recovery/retransmission ring was a fixed
compile-time size, and Simple/Main flows over-allocated to that same
size. This makes the Advanced ring sizable and stops the 16-bit
profiles from paying the Advanced memory footprint.
- Recovery rings are now heap-allocated and profile-sized. Simple/Main
allocate the 16-bit cap (65536 entries); Advanced allocates the
configured depth. Sender and receiver rings are freed on teardown.
- New ?recovery-depth=<n> URL parameter and rist_recovery_depth_set(ctx,
depth) API size the Advanced ring. depth is a base-2 exponent: the
ring holds 65536 << depth packets. Default 3 (8x, the prior fixed
size); range 0..16 (16 == full 32-bit sequence space). Must be set
before rist_start().
- rist_peer_config grows a trailing recovery_depth field at the existing
RIST_PEER_CONFIG_VERSION 5; rist_peer_config_defaults_set_versioned()
sets it to RIST_RECOVERY_DEPTH_DEFAULT for version-5 callers.
Zero-initialised and older configs keep the default ring.
- A one-shot WARN fires at peer-config time when recovery-maxbitrate and
the max buffer would queue more packets than the recovery window can
address, because packets beyond the window cannot be retransmitted.
The Advanced hint points at ?recovery-depth=; the Simple/Main hint
points at the Advanced profile.
Tests: test_recovery_depth (depth->packet mapping, ring resize, receiver
capacity, Main behaviour, range clamping, NULL ctx) and
test_url_recovery_depth_parse (numeric parse, range and garbage
rejection).
The RTT Echo Response handler converted the round-trip NTP interval with
a scale that inflated the measured RTT by a constant factor and left it in
microseconds rather than the NTP ticks the rest of the code expects.
peer->last_rtt drives the sender's retransmit re-queue suppression window,
so on a path with non-trivial RTT the inflated value made the sender
refuse the re-NACKs a twice-dropped packet needs, leaving packets
unrecovered under sustained loss; recovery on low-latency links was
unaffected, which is why it went unnoticed.
Derive the round-trip time with calculate_rtt_delay(), the same helper the
Simple and Main RTCP echo path uses: it returns NTP ticks and subtracts
the responder's processing delay in the correct units.
Listener-side cname re-association (added for issue #188 NAT source-port
rebind recovery) treated a matching cname as proof of identity. Under a
shared PSK the cname is not a per-peer secret, so any peer holding the
group key could replay a victim's cname from a new source tuple and take
over its peer record. Gate the migration on an authenticated per-peer SRP
session; plaintext and shared-PSK callers instead recover through the
existing caller-side socket rebind.
Coverage:
- test_psk_cname_hijack: a shared-PSK caller forging a victim's cname is
not absorbed into the victim's peer record
- test_nat_port_rebind: per-mode distinct-caller expectations
(plaintext/psk stay separate, srp migrates)
- test_multipath_cname: same-cname bonded paths stay distinct under PSK
and under SRP with shared credentials
- test_multipath_leg_rebind: one SRP leg rebinds mid-stream without
merging into the surviving same-cname leg or disturbing it
The new tests use rist_receiver_data_callback_set2 and release each
delivered block via rist_receiver_data_block_free2, so they compile under
-Werror=deprecated-declarations on win64 and run clean under the ubuntu
LeakSanitizer build.
Update NEWS for the SRP-only reassociation and the caller-side rebind path.
Bump API to 4.11.0, ABI to 14:0:10 (soversion 4 unchanged,
binary-compatible with 0.2.15/0.2.16/0.2.17). New interfaces since
rc1: rist_peer_config_defaults_set_versioned() and the recovery_priority
peer-config field (RIST_PEER_CONFIG_VERSION 5).
Changes since rc1:
- Per-peer retransmission routing via ?recovery-priority=.
- Versioned peer-config initialisation closes an ABI-related
out-of-bounds write for callers built against older struct revisions.
- Fixed a global-logger use-after-free when an application frees its
logging settings while the global logger still references them.
- Receiver no longer aborts when an interpolated packet arrival time
falls outside its neighbouring queue slots.
- NAT source-port rebind recovery (issue #188).
A calling receiver behind a dynamic-IP NAT can have its external
source port change (DHCP renewal, NAT mapping rebind, etc.). The
listening sender's peer match keys on the full caller tuple, so
the new tuple is treated as a brand-new caller, the existing
child peer record sits orphaned for session_timeout, and the
receiver-side state is not inherited. The issue reports a manual
restart of ristreceiver as the workaround.
The recovery path is split by whether encryption is in play, so
cname (forgeable on the wire in plaintext) is only trusted to
identify a peer when an authenticated envelope is present.
Listener-side cname re-association (encrypted callers):
Once a new listener-child peer authenticates and learns its
cname from an RTCP SDES, look for an existing sibling with the
same cname / is_rtcp / is_data / adv_flow_id and migrate the
new source tuple into it. Gated on PSK or SRP being
configured, on exactly one matching candidate sibling that is
either dead or has been silent for >= 2 * keepalive_interval,
and on no other sibling sharing the cname being currently
alive (protects multipath and same-NAT duplicate-cname cases).
For SRP, kick a fresh EAPOL START on the surviving sibling so
the periodic re-auth timer fires now instead of after up to
EAP_REAUTH_PERIOD; the full EAP-SRP handshake still takes its
normal multi-RTT exchange. _librist_peer_match_peer_addr now
skips dead children so an absorbed peer cannot shadow the
surviving record that now owns the migrated source tuple.
Receiver-caller socket rebind (plaintext callers):
From rist_timeout_check, immediately before the kill_peer
that would normally fire on session_timeout, close and reopen
the local UDP socket on a fresh ephemeral port and restart
the handshake. Gated on caller-mode receiver + MAIN+ profile
+ no encryption + peer->config.local_port == 0 + min_silence =
max(session_timeout, 4 * keepalive_interval) so a misconfigured
short session_timeout against a slower keepalive cadence does
not flap-loop. Linear backoff capped at REBIND_BACKOFF_CAP *
session_timeout. If rist_create_socket fails after the close,
fall through to the normal kill_peer path.
Reproducers:
* test/rist/test_nat_port_rebind.c -- two modes (plaintext /
psk) covering both branches of the encryption gate.
* test/rist/test_caller_socket_rebind.c -- destroys sender,
asserts the receiver auto-rebinds to a new ephemeral local
port and the next sender to come up sees the reconnected
caller without operator intervention.
The retry/gap interpolation path in receiver_enqueue() asserted that the
CBR-interpolated packet_time fell below next->packet_time. Under
arrival-based timing a flow can be fed by peers on unsynchronised clocks,
so neighbouring queue slots may carry packet_times from different clock
domains and the interpolated estimate can land outside (previous, next) --
aborting the receiver in debug builds and silently corrupting queue
ordering under NDEBUG.
Replace the assert with explicit guards and clamping:
- only interpolate when next->packet_time > previous->packet_time;
- multiply before dividing so integer rounding cannot push the estimate
up to or past next->packet_time;
- handle gap/reorder/sequence-wrap (seq not strictly between the
neighbours) by placing the packet just before next;
- clamp the final result into [previous->packet_time, next->packet_time].
When a receiver flow is carried by more than one RTCP-capable peer,
send_nack_group historically sent every NACK to the eligible peer
with the lowest RTT. That is wrong when the lowest-RTT peer cannot
actually retransmit: e.g. a duplicate/relay feed that delivers the
same packets but keeps no retransmit buffer. NACKs go to it, the
peer that holds the buffer never hears them, and recovery stalls.
Add a per-peer recovery_priority (new ?recovery-priority=<n> URL
parameter, rist_peer_config.recovery_priority). NACK groups now go
to the eligible peer with the highest recovery_priority, ties broken
by lowest RTT. The default of 0 on every peer reduces exactly to the
prior lowest-RTT selection, so existing deployments are unchanged.
The selection predicate is factored into src/rist-nack-select.h
(rist_nack_peer_preferred) so it can be unit-tested without building
peers or starting threads.
Also name the weight=0 duplicate sentinel: new public macro
RIST_PEER_WEIGHT_DUPLICATE makes the load-balancer's "duplicate to
all peers" branch read as intent rather than a magic 0.
ABI: RIST_PEER_CONFIG_VERSION 4 -> 5 (one new trailing field,
defaulted to 0 by rist_peer_config_defaults_set; zero-initialised
configs keep legacy behaviour).
Tests:
- test/rist/unit/test_url_recovery_priority_parse.c: ?recovery-priority=
parsing through the public rist_parse_address2 API.
- test/rist/unit/test_nack_peer_select.c: priority/RTT selection,
tie-breaks, legacy-equivalence and unmeasured-RTT edge cases.
- meson test --suite unit: 8/8 OK.
- meson test (full): 51 OK, 11 Expected Fail, 0 unexpected.
Add a unit test asserting rist_peer_config_defaults_set_versioned() writes
only the fields defined at or below the requested version (split_mode and
merge_mode from version 1, profile and profile_set from version 4) and that
the rist_peer_config_defaults_set() inline records the caller's compiled
RIST_PEER_CONFIG_VERSION.
Document in NEWS the new versioned initialiser, the version guards in
parse_url_options() / store_peer_settings(), and the retained exported
rist_peer_config_defaults_set() symbol, which assumes version 0 for binaries
linked against an earlier library and therefore leaves version-1+ fields
untouched.
Introduces version checks and a versioned struct initialization mechanism for
`struct rist_peer_config` to protect against memory corruption and out-of-bounds
access when older client binaries run against a newer dynamic library.
Changes:
- Declared `rist_peer_config_defaults_set_versioned()` to initialize defaults
conditionally based on the version of the calling client.
- Redefined `rist_peer_config_defaults_set()` as a static inline function in
`peer.h` for external clients to capture their compiled version, while keeping
the non-inline legacy wrapper in `rist.c` (gated by `LIBRIST_INTERNAL` and
defaulted to version 0) for ABI linkage compatibility.
- Added version checks to `parse_url_options()` inside `src/rist-common.c` to prevent
writing out-of-bounds (`multicast_ttl`, `multicast_source`, `reflector`,
`local_port` require version >= 2; `srp_compat_legacy` requires version >= 3).
- Added version checks to `store_peer_settings()` inside `src/rist-common.c` to prevent
reading out-of-bounds when copying configuration settings to internal structures
(`reflector` requires version >= 2; `srp_compat_legacy` requires version >= 3).
Record the fix that clears the global logging settings when an
application frees its rist_logging_settings (directly or by destroying a
context that embeds them), preventing later logs from calling a freed
callback.
Fixes an access violation/use-after-free in client applications (such
as FFmpeg) when a stream is quickly closed and opened.
If a client allocates or embeds logging settings, they are copied into
the static global logging settings. When the stream is destroyed, the
memory hosting the logging callback and its user argument is freed.
If they are not unset from the global logger, subsequent library logs
routed through rist_log_priv3 (e.g., in background evsocket threads
or new context initialization) will attempt to use the dangling
callback argument, triggering a crash inside the logging callback.
This fixes the issue by unsetting the global logging configuration
if the settings structure being cleaned up matches the currently
configured global logging settings.
- Add rist_logging_unset_global_if_matches() to check and unset settings.
- Call it in rist_logging_settings_free2() when freeing settings directly.
- Call it in rist_receiver_destroy_local() and rist_sender_destroy_local()
to handle clients like FFmpeg that embed the settings structure.
Mirrors the existing peer_url label on rist_sender_peer_*. peer_id
is reissued (higher) on reconnect, so it isn't a stable per-peer
identifier; peer_url is.
Sender peers carry the URL through rist_prometheus_sender_add_peer.
Receiver peers are auto-discovered from the stats callback, so the
URL is added to the receiver-flow stats JSON ("url" field on each
peer object) and parsed at first registration in the exporter.
For listener-mode peers the URL lives on the parent (the inbound
child is created with url=NULL); the stats populator walks the
parent chain.
The unix-socket scraper exposed by ristreceiver / ristsender
(activated by --metrics-unixsock) returned empty or stub-only
bodies on a substantial fraction of scrapes in production.
Six separate defects in tools/prometheus-exporter.c contributed:
1. listen(fd, 1) -- a backlog of one is too small for any
real deployment where Prometheus + an external collector
may scrape within the same RTT. Replaced with SOMAXCONN.
2. accept() returning EINTR was treated as fatal and broke the
thread out of the loop, silently disabling the exporter for
the remainder of the process lifetime. EINTR is now retried.
3. The hot path issued a single unchecked write() with the
result discarded via (void)!. Partial writes and transient
EINTR therefore truncated the response, and an early peer
close raised SIGPIPE on the exporter thread. Replaced with
a write_all helper that loops on partial writes, retries
EINTR, and uses MSG_NOSIGNAL where available.
4. shutdown(SHUT_RDWR) immediately before close() can behave
as an abortive teardown on AF_UNIX SOCK_STREAM under some
kernels, discarding bytes that write() had just queued.
Switched to shutdown(SHUT_WR), which is the documented
"no more writes -- let the receive queue drain" sequence.
5. close() with unread bytes in the receive queue is the
biggest source of empty-body scrapes in the field. The
exporter is a dump-on-connect protocol that never reads
the client's request, but tools like nc / curl / the
Prometheus scraper itself all send "GET /metrics HTTP/1.0
\r\n\r\n". On Linux, close() on an AF_UNIX SOCK_STREAM
with unread input is treated as an abortive teardown: the
kernel sends RST and discards everything in the transmit
queue -- including the 10 KB body we just wrote. The
client sees ECONNRESET / EOF with zero bytes. We now
shutdown(SHUT_WR) and drain the receive queue with a
SO_RCVTIMEO-bounded recv loop before close, so the kernel
does a clean FIN-FIN handshake and the body actually
reaches the client. Confirmed by a Python probe that
went from 5/5 ECONNRESET to 5/5 success in the same
environment.
6. In single-stat-point mode (the default for ristreceiver
and ristsender) rist_prometheus_stats_format() was zeroing
container_count after every scrape. Any subsequent scrape
that arrived before the next 1 Hz stats callback then
emitted only HELP/TYPE/UNIT preambles with no data points,
which manifested in the field as metrics flickering and
downstream consumers parsing NaN / blank bandwidth. The
ring is now only reset in --metrics-multipoint mode;
single-point mode keeps the latest sample visible until
the next callback overwrites it. cleanup_stale_locked()
still ages stale flows / peers out after 15 s.
In single-stat-point mode the writer set container_offset = 1 after
each callback, but the format() macro reads only container[0..count-1]
and container_count stays at 1. container[0] was therefore frozen at
the value from the first callback while every subsequent callback
silently overwrote container[1], which the reader never touches.
Three sites had the same pattern:
- rist_prometheus_handle_client_stats (receiver flow stats)
- rist_prometheus_handle_client_stats (per-receiver-peer stats)
- rist_prometheus_handle_sender_peer_stats
NEWS entry added under 0.2.18 Bug Fixes.