From: nolade <nola.aunger@inkbridge.io>
Date: Thu, 3 Apr 2025 21:22:01 +0000 (-0400)
Subject: Added 'frag errors' info to introduction/trouble-shooting/connectivity section
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=9149e9d74cf096eb5ada8d14d8f99cb47ee69942;p=thirdparty%2Ffreeradius-server.git

Added 'frag errors' info to introduction/trouble-shooting/connectivity section

update nav bar
---
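The fragment sizes in the `tcpdump` capture added by this patch can be reproduced with a little arithmetic. The following Python sketch is illustrative only and is not part of FreeRADIUS: the function name `ipv4_fragments` is hypothetical, a 20-byte IPv4 header without options is assumed, and the 1550-byte RADIUS packet length is inferred from the two fragment lengths in the capture (1472 + 78 bytes of RADIUS data).

```python
IPV4_HEADER = 20  # bytes; assumes an IPv4 header without options
UDP_HEADER = 8    # bytes


def ipv4_fragments(udp_payload_len, mtu=1500):
    """Return (offset, total_length, more_fragments) for each IPv4 fragment."""
    ip_payload = UDP_HEADER + udp_payload_len
    # Every fragment except the last must carry a multiple of 8 bytes,
    # because the IPv4 fragment offset field counts 8-byte units.
    max_data = (mtu - IPV4_HEADER) // 8 * 8
    fragments = []
    offset = 0
    while offset < ip_payload:
        data = min(max_data, ip_payload - offset)
        more = offset + data < ip_payload
        fragments.append((offset, IPV4_HEADER + data, more))
        offset += data
    return fragments


# Assumed 1550-byte RADIUS packet sent over a 1500-byte path MTU.
for offset, length, more in ipv4_fragments(1550):
    print(f"offset {offset}, flags [{'+' if more else ''}], length {length}")
# offset 0, flags [+], length 1500
# offset 1480, flags [], length 98
```

Passing `mtu=1492` models the PPPoE-restricted path from the `tracepath` example in the same patch.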

diff --git a/doc/antora/modules/ROOT/nav.adoc b/doc/antora/modules/ROOT/nav.adoc
index 0897531590d..73087eed8fa 100644
--- a/doc/antora/modules/ROOT/nav.adoc
+++ b/doc/antora/modules/ROOT/nav.adoc
@@ -10,7 +10,7 @@
 *** xref:trouble-shooting/user.adoc[User Management]
 *** xref:trouble-shooting/server.adoc[Server Configuration]
 *** xref:trouble-shooting/client.adoc[Client Configuration]
-*** xref:trouble-shooting/connect_nas.adoc[Connectivity-NAS]
+*** xref:trouble-shooting/connect_nas.adoc[Connectivity]
 *** xref:trouble-shooting/datastore.adoc[Datastores]
 
 
diff --git a/doc/antora/modules/ROOT/pages/trouble-shooting/connect_nas.adoc b/doc/antora/modules/ROOT/pages/trouble-shooting/connect_nas.adoc
index 4a8fab60613..155c4b74fa2 100644
--- a/doc/antora/modules/ROOT/pages/trouble-shooting/connect_nas.adoc
+++ b/doc/antora/modules/ROOT/pages/trouble-shooting/connect_nas.adoc
@@ -1,7 +1,10 @@
-= Connectivity-NAS
+= Connectivity
 
+Maintaining productivity, ensuring customer satisfaction, improving network performance, and reducing downtime are all key benefits of connectivity troubleshooting. Network outages can halt business operations, preventing employees from accessing critical applications and collaborating effectively. This leads to lost productivity and revenue.
 
-== How do I tell the user what to use for an IP netmask?
+Additionally, these problems can disrupt online services, negatively impacting customer satisfaction and damaging the business's reputation. Troubleshooting helps identify and resolve bottlenecks, congestion, and other issues that affect network performance, ensuring efficient and reliable operations. By quickly identifying and resolving issues, businesses can minimize downtime and reduce the financial impact of connectivity problems.
+
+== Allocating IP netmasks
 
 The whole netmask business is a complicated one. An IP interface has an IP address and usually a netmask associated with it. Netmasks on point-to-point interfaces like a PPP link are generally not used.
 
@@ -11,10 +14,8 @@ The result of this on most NAS is that they start to route a subnet (the subnet
 
 Many NAS interpret a left-out Framed-IP-Netmask as if it were set to 255.255.255.255, but to be certain you should set the Framed-IP-Netmask to 255.255.255.255.
 
-The following entries do almost the same on most NAS:
-.Entries Example
-[%collapsible]
-====
+The following entries do almost the same on most NASs:
+
 	user Cleartext-Password := "blegh"
 		Service-Type = Framed-User,
 		Framed-Protocol = PPP,
@@ -26,90 +27,315 @@ The following entries do almost the same on most NAS:
 		Framed-Protocol = PPP,
 		Framed-IP-Address = 192.168.5.78,
 		Framed-Route = "192.168.5.64/28 0.0.0.0 1"
-====
+
 
 The result is that the end user gets IP address 192.168.5.78 and that the whole network with IP addresses 192.168.5.64 - 195.64.5.79 is	routed over the PPP link to the user (see the RADIUS RFCs for the exact syntax of the Framed-Route attribute).
 
+== Fragmentation issues
+
+802.1X authentication methods like EAP-TLS transmit large UDP packets that need IP fragmentation to reach their destination. If the network used for 802.1X mishandles IP fragments or has an issue with Path MTU Discovery (PMTUD), this issue shows up as unreliable or non-functional 802.1X authentication.
+
+Before attempting to troubleshoot possible IP or EAP fragmentation issues, it's important to have a comprehensive understanding of the normal behavior of IP networks regarding fragmentation and reassembly, forwarding of fragments, and critical network services like PMTUD.
+
+Debugging network problems without understanding IP networking usually leads to mistakes: random changes are made until something appears to work, and the final result may still have issues or cause network instability.
+
+This section outlines how to identify, investigate, and resolve fragmentation issues, including common scenarios with broken or misconfigured network devices.
+
+=== Identifying broken Path MTU Discovery
+
+The `tracepath` tool provides a useful indication of any path MTU restrictions
+to a destination.
+
+In the example below, the `tracepath` tool sends maximum-size UDP packets marked "Don't Fragment" to the destination with increasing TTLs. At each stage it records information about the hop, based on ICMP responses, and reduces the payload size if necessary, as indicated by the "Next-Hop MTU" field of an ICMP "Fragmentation Needed" or ICMP "Too Big" response.
+
+```
+$ tracepath -m 20 110.60.100.30
+ 1?: [LOCALHOST]                         pmtu 1500
+ 1:  _gateway                              0.589ms
+ 2:  200.100.50.20                         8.486ms pmtu 1492
+ 3:  30.50.70.90                           9.267ms
+ 4:  no reply                             10.117ms
+ 8:  ae23.example.com                     10.806ms
+ 9:  ae28.example.com                     11.419ms
+10:  ae31.example.com                     13.986ms
+11:  ae29.example.com                     15.739ms
+12:  ae20.example.com                     15.486ms
+13:  ae25.example.com                     17.442ms asymm 11
+14:  140.90.30.1                          17.718ms asymm 13
+15:  no reply  => UDP is filtered as target is known to be at this hop
+16:  no reply
+17:  no reply
+18:  no reply
+19:  no reply
+20:  no reply
+```
+
+The MTU was first reduced to the default Ethernet MTU (1500 bytes) to allow packet transmission via the source's uplink interface. The source reaches the local gateway successfully using the 1500-byte MTU on hop 1. However, to reach the host on hop 2, another reduction in the path MTU to 1492 bytes was required. This reduction was learned from an ICMP "Fragmentation Needed" response. In this scenario, hop 2 is probably a host located on the remote side of a PPP connection. Notably, the PPPoE header consumes 8 bytes of overhead, which caused the MTU reduction.
 
-== 3Com/USR HiPerArc doesn't work
+[NOTE]
+====
+Some network devices are configured to not respond with an ICMP "TTL exceeded in transit" message when dropping a packet because the TTL reaches 0, as in hop 4. If all hosts outside a specific LAN exhibit `no reply`, it's likely that a critical ICMP response is being filtered by a firewall. Erroneously dropping ICMP "TTL exceeded in transit" messages is a *fundamental IP network issue that must be fixed.*
+====
+
+Also note that if connections to the destination host are filtered to prevent the return of an ICMP "destination unreachable" response for closed UDP ports, such as by a host-based firewall, network firewall, or router ACL, the trace will stop receiving replies either at or immediately prior to the hop where the filters are applied. The trace continues to report `no reply` until the hop count is exhausted. This does not necessarily indicate a fundamental network issue (beyond silent filtering of high-numbered ports).
 
-I'm using a 3Com/USR HiPerArc and I keep getting this message on radius.log:
+==== Symptoms of broken PMTUD
 
-	`Mon Jul 26 15:18:54 1999: Error: Accounting: logout: entry for NAS tc-if5 port 1 has wrong ID`
+Run the `tracepath` command from a host that's on the same LAN as the NAS, and target the RADIUS Server. Then run it in the reverse direction, from the RADIUS Server towards that host.
 
-If you're using HiPer ARC 4.1.11, you need to contact the vendor? Version 4.1.11 has a problem reporting NAS-port numbers to Radius. Upgrade the firmware from http://totalservice.usr.com to at least 4.1.59. If you are in Europe you can telephone to 3Com Global Response Center (phone number: 800 879489), and tell them that you have bought it in the last 90 days. They will help you, step by step, to do the upgrade.
+*Example: MTU size*
+
+If the trace stops reporting path information at an intermediate hop that doesn't perform packet filtering, it's likely that there's a path MTU restriction and that a device along the already-probed segment of the path is preventing the ICMP "Fragmentation Needed" or ICMP "Too Big" responses generated by the restrictive network device from reaching the source.
+
+In this scenario, broken PMTUD produces a trace like the following:
+
+```
+$ tracepath -m 20 110.60.100.30
+ 1?: [LOCALHOST]                         pmtu 1500
+ 1:  _gateway                              0.589ms
+ 2:  no reply
+ 3:  no reply
+ ...
+19:  no reply
+20:  no reply
+```
+
+Re-run the trace with a lower packet length, e.g. 1400 bytes (`-l 1400`), to see whether the packets then reach the destination. If the packets get closer but still don't reach the target, reduce the length again and retry.
+
+*Example: Missing ICMP Messages*
+
+A trace that only succeeds at a reduced packet length, as in the output below,
+is a strong indication that PMTUD is broken between those hosts. The source is
+not receiving the indications that it needs to reduce the packet size to the
+destination, so it will likely continue to send RADIUS packets that are too
+big to reach their destination, rather than perform IP fragmentation with a
+viable fragment size. **Broken PMTUD is a fundamental network issue that
+should be fixed.**
+```
+$ tracepath -m 20 -l 1400 110.60.100.30
+ 1?: [LOCALHOST]
+ 1:  _gateway                              0.589ms
+ 2:  200.100.50.20                         8.486ms
+ 3:  30.50.70.90                           9.267ms
+ ...
+```
+
+Take captures at the network devices along the path to determine where
+IP packets are being dropped due to an MTU restriction. Determine if ICMP
+"Fragmentation Needed" or ICMP "Too Big" responses are being generated (as is
+required), and --- if so --- where these ICMP responses are being dropped prior
+to reaching the source.
+
+=== Identifying broken fragment handling
+
+The following capture, taken with the `tcpdump` utility on a NAS, shows a supplicant performing EAP-TLS that is in the process of sending the client certificate chain:
+
+```
+... IP (id 53297, offset 0, flags [+], proto UDP (17), length 1500)
+  10.0.0.50.46521 > 10.0.0.51.1812: RADIUS, length: 1472
+    Access-Request (1), id: 0x09, Authenticator: e0422b49...
+      User-Name Attribute (1), length: 11, Value: anonymous
+      ...
+      EAP-Message Attribute (79), length: 255, Value: [REDACTED]
+      EAP-Message Attribute (79), length: 255, Value: [REDACTED]
+      EAP-Message Attribute (79), length: 255, Value: [REDACTED]
+      EAP-Message Attribute (79), length: 255, Value: [REDACTED]
+      EAP-Message Attribute (79), length: 255, Value: [REDACTED]
+      EAP-Message Attribute (79) (bogus, goes past end of packet)
+
+... IP (id 53297, offset 1480, flags [], proto UDP (17), length 98)
+  10.0.0.50 > 10.0.0.51: ip-proto-17
+```
+
+It shows the transmission of a single RADIUS packet, containing a large EAP
+message fragment, to the RADIUS Server as a set of IP fragments. A single IPv4
+packet would have a length that would exceed the 1500 byte MTU of the path to
+the destination, so the NAS performs IP fragmentation.
+
+Note that the first fragment (with offset 0) has an overall length of 1500
+bytes, filling the path MTU to the destination, and has a "more fragments"
+indication ("`flags [+]`"). The second fragment has offset 1480 and has ID
+53297, which matches the initial fragment.
+
+
+==== Symptoms of bad fragment handling
+
+Take simultaneous captures at both the NAS and the RADIUS server and look for
+instances of fragments being generated at the source for an IP packet.
+
+If the network is functioning correctly, the capture taken at the destination
+will show the arrival of all fragments. It is okay for these fragments to
+arrive out of order.
+
+In the rare case that an in-path device is performing IP fragment reassembly
+(and the local MTU exceeds that which was discovered by the sender) then it is
+also possible to observe a single, **complete** reassembled packet.
+
+In even rarer cases, for IPv4 packets you might even observe a different
+arrangement of fragments representing the original packet, either because an
+in-path host has performed further fragmentation of the fragments, or because
+fragment reassembly has occurred and then the IP packet has been subsequently
+refragmented using a different IP fragment size.
+
+Each of these scenarios is fine provided that the destination host is provided
+with a complete set of fragments representing the original IP packet containing
+the RADIUS request.
+
+In the case of RADIUS requests being sent to the RADIUS Server, debugging the
+RADIUS Server (`radiusd -X`) will show it processing the RADIUS request from
+the reassembled IP packet. If `tcpdump` shows some IP fragments arriving but
+FreeRADIUS does not receive the RADIUS request, then something has gone wrong
+in the network resulting in the operating system failing to reassemble the
+original IP packet --- due to either missing or incorrectly formatted IP
+fragments.
+
+Missing or broken IP fragments always imply the existence of one or more network
+devices that exhibit impaired IP behaviour. **Impaired IP fragment handling is
+a fundamental network issue that should be fixed.**
+
+Captures should be taken at network devices along the path to determine where
+IP fragments are being dropped, or incorrectly routed.
+
+[NOTE]
+====
+The FreeRADIUS `radsniff` tool is not a substitute for the `tcpdump` tool when diagnosing IP fragmentation issues. The `radsniff` tool processes raw data read from a network interface and does not perform userland IP fragment reassembly. Therefore, its output can be misleading:
+
+```
+...
+(3) Access-Request Id 6 eth0:1.1.1.1:53320 -> 2.2.2.2:1812
+(4) Access-Challenge Id 6 eth0:1.1.1.1:53320 <- 2.2.2.2:1812
+(5) Packet too small by 82 bytes, ... should be 1562 bytes
+(6) **noreq** Access-Challenge Id 7 eth0:1.1.1.1:53320 <- 2.2.2.2:1812
+...
+(11) Access-Request Id 10 eth0:10.145.0.50:53320 -> 10.145.0.51:1812
+(12) Access-Accept Id 10 eth0:10.145.0.50:53320 <- 10.145.0.51:1812
+```
+
+Packet (5) was an Access-Request that was received as a set of IP fragments; only the first fragment was processed, and it was declared incomplete, i.e. `Packet too small`. Therefore, the Access-Challenge response in packet (6) didn't match any request.
+
+This output is normal when RADIUS requests are delivered as a set of IP fragments and does not indicate a fault: the conversation eventually completes with an Access-Accept.
+====
 
-== 3Com/USR HiPerArc Simultaneous-Use doesn't work
+=== Identifying impaired network devices
 
-by Robert Dalton support at accesswest dot com
+Network RADIUS encounters various scenarios where a AAA service is
+degraded or broken due to faulty or incorrectly configured network devices.
 
-Verify if you are using HiPerArc software version V4.2.32 release date 09/09/99
+An issue is likely to be due to one of these common cases for which potential
+solutions are provided.
 
-In order for simultaneous logins to be prevented reported port density must be set to 256 using the command :
+[NOTE]
+====
+Correct IP networking functionality may vary between a device's firmware
+versions. Because of this, EAP-based authentication methods should always be
+carefully tested prior to production network upgrades being undertaken.
+====
 
-	set pbus reported_port_density 256
+==== Access networks that do not support a standard Ethernet MTU
 
-Otherwise it changes the calculations of the SNMP object ID's.
+Supplicants and authenticators anticipate that the MTU of the network over
+which EAPoL is performed is a standard size for the link type. Some supplicants
+will generate EAPoL frames that are the full 1500 bytes of a standard Ethernet
+MTU and cannot be configured to do otherwise. Even when a supplicant can be
+configured to use a smaller EAP fragment size, it might not be practical to do
+so, for example in BYOD environments.
 
-There is a bug in effected versions of checkrad namely the line under the subroutine "sub_usrhiper". The line that should be commented out is:
+**Solution:** Increase the access network's MTU so that it meets the standard
+for the link type technology. If resizing the MTU isn't possible, configure all supplicants and authenticators to use a smaller fragment size for EAP messages. Also, configure the NAS to advertise the smaller MTU of the EAPoL network in the
+`Framed-MTU` attribute of RADIUS requests sent to the RADIUS Server.
 
-	($login) = /^.*\"([^"]+)".*$/;
+==== NAS doesn't perform IP fragmentation correctly
 
-== Cisco Simultaneous-Use doesn't work
+Some wireless LAN controllers (WLCs) and switches (that do not support asymmetric fragmentation/reassembly) are unable to encapsulate a large EAP message generated by a supplicant into a RADIUS Access-Request that would need to span multiple IP fragments to satisfy the path MTU to the RADIUS Server.
 
-Q: I am getting the following in radius.log file:
+**Solution:** Upgrade or replace the NAS with a device that performs proper IP
+fragmentation.
 
-	Thu Oct 21 10:59:01 1999: Error: Check-TS: timeout waiting for checkrad
+==== NAS doesn't perform IP fragment reassembly correctly
 
-What's wrong?
+Some WLCs are unable to de-encapsulate an EAP message from a RADIUS
+Access-Challenge that is received as a set of IP fragments, even though the EAP
+message would fit within the link MTU for the EAPoL interface.
 
-A: Verify if you have SNMP enabled on your CISCO router, check the existence of the following line:
+**Solution:** Upgrade or replace the NAS with a device that performs proper IP
+fragment reassembly.
 
-	snmp-server community public RO 97
+==== Devices drop IP fragments
 
-where 97 is the access-list that specifies who gets access to the SNMP info. You should also have a line like this:
-	access-list 97 permit A.B.C.D
+Some firewalls, routers and network load balancers simply drop all IP
+fragments on egress or ingress as a matter of policy, for reasons other than a
+link MTU restriction.
 
-where A.B.C.D is the ip address of the host running the radius server.
+**Solution:** Reconfigure the malfunctioning network device to permit IP
+fragments to and from the RADIUS servers.
 
-== Ascend MAX 4048 Simultaneous-Use doesn't work
+==== Devices that sometimes drop IP fragments
 
-Q: I am getting the following in radius.log file:
+Some firewalls drop IP fragments for an extended period of time in reaction
+to some global network condition, such as during a fragment-based network
+attack. Services that depend on IP fragmentation may therefore work at some
+times but not others.
 
-Thu Oct 21 10:59:01 1999: Error: Check-TS: timeout waiting for checkrad
+**Solution:** Override such protections for traffic to and from the RADIUS
+servers, and disable virtual reassembly if necessary to protect the resources
+of the firewall. Ensure that the RADIUS Server's operating system is up to date
+and that the host has sufficient resources to mitigate fragment-based network
+attacks by itself.
 
-What's wrong?
+==== Devices that attempt "virtual reassembly" on an incomplete packet stream
 
-A: Verify that you have the MAX 4048 setup in your naslist as max40xx and that you have Finger turned on.
+Firewalls and routers may be configured to perform "virtual reassembly" of
+complete IP packets using all IP fragments for policy inspection purposes. If
+traffic takes multiple paths such that a single device does not see all IP
+fragments then reassembly will fail, fragments will be dropped, and excessive
+resources consumed.
 
-	Ethernet->Mod Config->Finger=Yes
+**Solution:** Disable virtual reassembly for packets involving the RADIUS
+servers or amend the routing policy to ensure that all fragments to a
+destination are forwarded via the same path.
 
+==== Devices that steer IP fragments of the same packet to different backends
 
-== The server is complaining about invalid user route-bps-asc1-1, along with lots of others
+Stateless routers and load balancers, as well as load balancers with broken
+flow cache lookup for IP fragments, may steer subsequent IP fragments to a
+different backend than the initial fragment.
 
-Ascend decided to have the 4000 series NAS boxes retrieve much of their configuration from the RADIUS server. To disable this "feature", set:
+**Solution:** Configure the device to steer packets based on the Layer 3
+addresses only, and not the Layer 4 information which is only present in the
+initial fragment. Note: This configuration is normally required for EAP since
+the source port is not guaranteed to remain the same throughout the
+authentication exchange.
 
-	Ethernet->Mod Config->Auth->Allow Auth Config Rqsts = No
+==== Devices perform Network Address Translation with broken flow cache lookup
 
+NAT devices with broken flow cache lookup may either drop or incorrectly
+rewrite IP fragments and ICMP responses.
 
-== Why do Acct-Input-Octets and Acct-Output-Octets wrap at 4 GB?
+**Solution:** Upgrade or replace the broken device.
 
-There are two possible causes of this problem:
+==== Load balancers having pathological IP fragment handling when a backend is degraded
 
-* Gigawords not enabled on NAS
+Some load balancers route fragments to the correct backend except when a backend is offline, in which case they route fragments incorrectly. A single backend becoming unavailable results in degradation of the entire service.
 
-Some NAS do not send "Gigawords" attributes by default. Read your NAS documentation and configure it to send the attributes Acct-Input-Gigawords and Acct-Output-Gigawords.
+**Solution:** Upgrade or replace the broken load balancer.
 
-* Cisco IOS needs to set the flag for gigawords. Enter the following command on the NAS:
+==== Devices that filter ICMP "Fragmentation Needed" and "Too Big" messages
 
-	`aaa accounting gigawords`
+Some routers and firewalls may filter critical ICMP responses, breaking
+PMTUD, and resulting in authenticators and/or RADIUS servers continuously
+sending oversized IP packets. These packets are too large for the path and do not reach their destination.
 
-[NOTE]
-====
-This command requires a reload of the device on certain IOS version.
-====
+**Solution:** Configure devices so as not to filter ICMP messages that are
+essential for basic network services.
 
+==== Devices steering ICMP to a different backend than the corresponding application data
 
-=== How do I enable logging of 64 bit counters, a.k.a. `Acct-{Input|Output}-Gigawords?`
+Some devices performing ECMP routing and other forms of network load
+balancing with broken flow caches will route an ICMP message to a different
+backend than the one receiving the application data that triggered the ICMP
+response. This breaks PMTUD and results in RADIUS servers continuing to send
+oversized IP packets instead of performing IP fragmentation.
 
-Refer to <<Why do Acct-Input-Octets and Acct-Output-Octets wrap at 4 GB?>>
+**Solution:** Either use a device that performs flow tracking to match ICMP
+messages with their associated data flows and steer them to the same backend,
+or broadcast ICMP messages required for PMTUD to all backends.