Architectural and Design Issues with IPsec VPNs

Aside from issues related to configuring ISAKMP and IPsec policy, a variety of network elements can interfere with the optimal operation of an IPsec VPN if they are not managed correctly. In this section, we will discuss several of the most common troubleshooting challenges that IPsec VPN implementations face today. We will also discuss several effective techniques for diagnosing the problems that can result from improper design, and the appropriate solutions to remediate those problems.

Troubleshooting IPsec VPNs in Firewalled Environments

As we discussed in our overview of the IPsec protocol in Chapter 2, many items need to be passed between two VPN endpoints in order to dynamically create tunnels securely using IKE and to successfully pass IPsec packets once a Phase 2 SA has been established. Most firewalls, by default, employ a "closed" model of security (by default, nothing is allowed) in which an administrator must explicitly instruct the firewall to allow the required protocols through. When deploying IPsec in firewalled environments, care must be taken to allow the required elements to pass securely, or problems could arise with VPN operation and performance. We will discuss three common issues in firewalled VPN environments: filtering of required IPsec protocols, firewall fragmentation handling, and filtering of Internet Control Message Protocol (ICMP) unreachables.

Allowing the Required IPsec Protocols to Pass

It is very common to see IPsec VPN sessions that traverse firewalls. One such example that we've discussed in Chapter 2 is in a DMZ design. Another popular application for such a design is in secure extranet designs. Most firewalls available in today's marketplace employ a closed policy by default, allowing no traffic to pass from low-security interfaces to interfaces assigned higher security levels. This includes protocols necessary for IPsec and IKE to operate effectively.

Unless manual IPsec session keys are used, firewalls between IPsec peers must allow ISAKMP traffic (UDP port 500) to pass between the IPsec VPN endpoints. Additionally, IPsec traffic must be allowed through the firewall, or encrypted traffic will get blocked at the firewall outside interface. These protocols include Encapsulating Security Payload (ESP) (IP Protocol 50) and Authentication Header (AH) (IP Protocol 51), depending on which of the two are included in the IPsec transforms used in that SA.


Although IPsec transforms can include the ESP protocol, AH protocol, or both ESP and AH protocols together, it may not be necessary to open up firewall configurations to allow them both. Figure assumes that all Remote Nets use ESP and AH. Administrators should verify the protocol selected in their IPsec transforms, as it may not be necessary to allow both ESP and AH through the firewall.
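The rule of thumb above can be captured in a few lines. The following is a minimal sketch, not a Cisco tool: the function name `required_permits` and its return format are hypothetical and exist only to illustrate which protocols a firewall between two peers needs to pass for a given transform set.

```python
# Illustrative only: required_permits() is a hypothetical helper, not a Cisco API.
ISAKMP_UDP_PORT = 500  # IKE/ISAKMP negotiation
IP_PROTO_ESP = 50      # Encapsulating Security Payload
IP_PROTO_AH = 51       # Authentication Header


def required_permits(uses_esp, uses_ah, manual_keys=False):
    """Return (name, protocol, port) tuples the firewall must allow."""
    permits = []
    if not manual_keys:
        # IKE is only needed when session keys are negotiated dynamically.
        permits.append(("ISAKMP", "udp", ISAKMP_UDP_PORT))
    if uses_esp:
        permits.append(("ESP", IP_PROTO_ESP, None))
    if uses_ah:
        permits.append(("AH", IP_PROTO_AH, None))
    return permits


# An ESP-only transform set does not require AH to be opened.
for entry in required_permits(uses_esp=True, uses_ah=False):
    print(entry)
```

An administrator could run the same check against each transform set in use before deciding whether an AH entry belongs in the firewall policy at all.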

IPsec VPN Traffic Through Firewalls

Figure illustrates a firewalled IPsec VPN tunnel deployment in which tunnels are built from a central, firewalled aggregation site out to smaller remote locations.

Figure provides the firewall ACL configuration that is employed to enable IPsec tunnels to be built between the Campus_Net and Remote_Nets A, B, and C in Figure.

Figure. Firewall Configuration for Crypto Traffic in Figure

DMZ-PIX-MAIN(config)#show access-list 101
access-list 101; 9 elements
access-list 101 line 1 permit udp host host eq isakmp
access-list 101 line 2 permit udp host eq isakmp host
access-list 101 line 3 permit esp host host
access-list 101 line 3 permit ah host host
access-list 101 line 4 permit udp host host eq isakmp
access-list 101 line 5 permit udp host eq isakmp host
access-list 101 line 6 permit esp host host
access-list 101 line 3 permit ah host host
access-list 101 line 7 permit udp host host eq isakmp
access-list 101 line 8 permit udp host eq isakmp host
access-list 101 line 9 permit esp host host
access-list 101 line 3 permit ah host host
DMZ-PIX-MAIN(config)#show access-group
access-group 101 in interface outside

Firewall's Handling of Fragmented IPsec Packets

In addition to ensuring that the appropriate protocols are allowed to communicate through the firewall, network designers must also account for the way that firewalls handle fragmented packets. When an IPsec packet is fragmented, the information relevant to the firewall's filtering decision, such as data found in the Layer 3 and 4 headers, is obscured in noninitial fragments.


All fragments of a fragmented IPsec packet must be reassembled before the packet can be decrypted. This reassembly can bypass the crypto hardware switching path, leading to performance degradation in IPsec networks. It is therefore critical to account for fragmentation issues in IPsec designs. We will discuss IPsec MTU and fragmentation issues and available solutions for fragment handling in IPsec networks (virtual fragmentation reassembly, IPsec prefragmentation, and path MTU discovery) later in this chapter.

As such, the firewall will potentially allow fragments to pass without inspection, as shown in Figure.

Firewall Fragment Handling in IPsec Networks

Cisco PIX firewalls by default are configured to detect when a fragmented packet has been received and to make filtering decisions on the initial fragment and all noninitial fragments without actually reassembling the packet. This feature is called Virtual Fragmentation Reassembly. It gives the firewall the ability to make filtering decisions on fragments without having to decrypt each packet in the fragmented chain. Virtual Fragmentation Reassembly does consume computational resources on the firewall, but it provides an ideal solution when filtering decisions must be made on noninitial IPsec packet fragments.

Figure demonstrates the improved handling of fragmented packets in IPsec networks. In the case illustrated, packets are fragmented at an intermediary point between the two IPsec VPN gateways, ISP_GW_A and ENT_DMZ_IN. The DMZ firewall, ENT_FW, receives the fragments and is configured for Virtual Fragmentation Reassembly. The firewall therefore does not do any reassembly. Instead, it will initially allow only the first fragment in the fragment chain through without being inspected. Virtual Fragmentation Reassembly enables the firewall to inspect the remaining fragments of the original packet without reassembling the packet. Virtual Fragmentation Reassembly therefore plays a vital role in this example, as the firewall would otherwise have to decrypt each fragment in the chain in order to reassemble the packet, which it is not configured to do.

Filtering of ICMP Unreachables

ICMP unreachables are commonly used by hackers to find and exploit network vulnerabilities, and are a fundamental component of scanning techniques used to find openings in firewalled environments. As such, it is very common for firewalls not to reply to filtered messages with ICMP unreachables. This behavior breaks a fundamental tool that IPsec can use to avoid the performance problems that can arise from fragmenting packets after encryption: Path MTU Discovery (PMTUD).

As discussed in greater detail later in this chapter, PMTUD sends ICMP messages along the path, relying on information in ICMP unreachable messages to throttle down the MTU that the PMTUD device must fragment to before encrypting with IPsec. The insertion of a firewall along the PMTUD path effectively breaks this model, because it will suppress the response of the ICMP unreachable that is to carry MTU sizing information back to the fragmenting PMTUD device.

NAT Issues in IPsec VPN Designs

NAT was introduced to solve problems with the depletion of publicly available address space. It does this by translating the private source or destination IP addresses into public ones. As we had discussed previously in Chapter 2, "IPsec Fundamentals," IPsec was designed, in part, to prevent the manipulation of data while in transit between the two endpoints of the IPsec VPN tunnel. Considering that IPsec in tunnel mode protects the IP header from manipulation, incompatibilities arise when a device attempts to perform NAT on an IPsec packet protected in tunnel mode. This section will cover some intrinsic incompatibilities between the two technologies and explore some solutions for deploying them in tandem.

Intrinsic IPsec/NAT Incompatibilities

Deployment of IPsec VPNs in NAT environments should be approached with care, as there are many known incompatibilities between NAT and IPsec. The nature of NAT is to modify, or translate, a portion of the IP packet, specifically the source and destination addresses or ports, when the packet is en route from a given source to a given destination. The nature of IPsec is to detect and prevent the malicious manipulation of packets between a given source and destination. Therein lies the origin of IPsec/NAT incompatibilities: the nature of NAT is to manipulate a packet, while the nature of IPsec is to preserve the packet's integrity.

Recall from our discussion in Chapter 2 that IPsec defines a suite of protocols, such as AH and ESP, that can operate in different modes, such as tunnel or transport, and include varying degrees of authentication, or strengths in optional HMACs within a given transform. Because each protocol protects different portions of the IP packet in different ways, the effect of NAT can vary on a per-protocol basis. Some of the protocol-specific examples of inherent IPsec/NAT incompatibility include:

  • IPsec AH Keyed Message Integrity Check (MIC) Failures in NAT Environments

  • Inbound IPsec SA Selector Inconsistencies in NAT Environments

  • IKE Rekeying Failures in PAT Environments

  • Overlapping IPsec Security Policy Database Entries

  • IPsec Security Parameter Index Conflicts on NAT Devices

  • Embedded IP Address Translation Limitations

  • Unidirectional NAT Support

  • TCP and User Datagram Protocol (UDP) Checksum Failures

IPsec AH Keyed MIC Failures in NAT Environments

Authentication Header protocol includes source and destination addresses in the keyed MIC in order to provide a greater scope of authentication and integrity than the ESP protocol. Manipulating the source/destination address of the packet between VPN endpoints using AH will cause a MIC failure at the receiving VPN endpoint. ESP does not have this specific incompatibility, as source and destination information is not included in the integrity check.
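The difference between the two integrity checks can be shown with a deliberately simplified model. The sketch below is not a real AH or ESP implementation: the key, the addresses, and the helper names `ah_style_mic` and `esp_style_mic` are all hypothetical, and real AH covers additional immutable header fields. It only shows why a keyed MIC that covers the addresses fails once NAT rewrites one of them.

```python
import hashlib
import hmac

KEY = b"hypothetical-sa-key"  # stands in for the negotiated SA key


def ah_style_mic(src_ip, dst_ip, payload):
    """AH-like MIC: the source and destination addresses are covered."""
    covered = src_ip.encode() + b"|" + dst_ip.encode() + b"|" + payload
    return hmac.new(KEY, covered, hashlib.sha1).digest()


def esp_style_mic(payload):
    """ESP-like MIC: the outer addresses are not covered."""
    return hmac.new(KEY, payload, hashlib.sha1).digest()


payload = b"application data"
sent_ah = ah_style_mic("10.1.1.1", "192.0.2.10", payload)
sent_esp = esp_style_mic(payload)

# A NAT device rewrites the source address in transit.
translated_src = "198.51.100.7"
ah_ok = hmac.compare_digest(
    sent_ah, ah_style_mic(translated_src, "192.0.2.10", payload))
esp_ok = hmac.compare_digest(sent_esp, esp_style_mic(payload))

print("AH-style check passes after NAT:", ah_ok)
print("ESP-style check passes after NAT:", esp_ok)
```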

Inbound IPsec SA Selector Inconsistencies in NAT Environments

If IKE authenticates Phase 2 selectors, and the initiator's source address is translated en route to the responder, then RFC 2401 requires that the responder drop the decapsulated packet, as the translated IP address does not match the SA selector value.

IKE Rekeying Failures in PAT Environments

An IKE responder must respond to IKE requests on the correct port. In non-PAT environments, this is UDP 500 by default. However, in situations in which IKE initiators have their ports translated to something other than 500, the IKE responder must be able to respond to the IKE request on the translated port, and must be able to do so predictably and reliably for IKE rekey messages to reach their correct destinations (the correct IKE initiators).

Overlapping IPsec Security Policy Database Entries

When two or more IPsec initiators use their source addresses as their Phase 2 identifiers, an IPsec responder could view the two sources as identical. The responder could, therefore, potentially install overlapping security policy database entries for multiple sources. As a result, the responder is at risk of forwarding traffic over the incorrect SAs to its sources. The creation of overlapping security policy database entries in an IPsec responder resulting from duplicate NAT inside local addresses used as Phase 2 SA identifiers is illustrated in Figure.

SPD Confusion in NAT Environments

IPsec Security Parameter Index Conflicts on NAT Devices

When two initiators attempt to negotiate a Phase 2 SA with the same destination, and Security Parameter Index (SPI)-based NAT is occurring between source and destination, SPI conflicts can sometimes occur on that NAT device, leading to forwarding confusion from responder to initiator. Consider the scenario in Figure. The responder will install two different SPI entries (i.e., 2001 and 2002). However, because inbound and outbound SPI creation occurs independently of one another, the two initiators could install the same SPI value (i.e., both would claim to have installed SPI 2000 for the same destination). Because traffic from both Router_A and Router_B uses the same UDP source port information, the NAT device uses overlapping SPIs for forwarding decisions. As a result, the traffic from the responder could be forwarded to the incorrect initiator due to the SPI conflict in the NAT device (i.e., the NAT device does not know which initiator to forward SPI 2000 traffic to).
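To make the conflict concrete, the following sketch models an SPI-based translation table as a simple dictionary. This is a simplified illustration under our own assumptions, not how any particular NAT implementation stores its state: the class name `SpiNatTable`, the responder address, and the table layout are all hypothetical.

```python
class SpiNatTable:
    """Toy SPI-based NAT table: (responder address, SPI) -> inside initiator."""

    def __init__(self):
        self.entries = {}

    def learn(self, responder_ip, inbound_spi, inside_host):
        """Install a mapping; return False if the SPI is already claimed."""
        key = (responder_ip, inbound_spi)
        owner = self.entries.get(key)
        if owner is not None and owner != inside_host:
            # Two initiators expect return traffic on the same SPI, so the
            # NAT device cannot tell which one should receive it.
            return False
        self.entries[key] = inside_host
        return True


nat = SpiNatTable()
print(nat.learn("192.0.2.10", 2000, "Router_A"))  # first claim succeeds
print(nat.learn("192.0.2.10", 2000, "Router_B"))  # same SPI, other host
print(nat.learn("192.0.2.10", 2001, "Router_B"))  # a unique SPI is fine
```

The third call hints at why unique SPI selection on the initiators removes the ambiguity: once each initiator uses a distinct value, every key in the table maps to exactly one inside host.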

SPI Overlap in NAT Device

Embedded IP Address Translation Limitations

Some applications have addressing information embedded in the payload of the IP packet. In both ESP and AH protocols, the payload of the packet is integrity protected. Therefore, changes to that payload, such as those NAT would attempt to execute, are not possible within the ESP and AH encapsulated payloads.

Unidirectional NAT Support

In some cases, a NAT device will install a NAT/PAT entry only once a packet is received from a given interface (i.e., inside to outside). Once that entry is installed on the NAT device, traffic can be forwarded in both directions. However, until that entry is installed, traffic received in other directions (i.e., outside to inside) will not get forwarded, as a NAT/PAT entry will not be dynamically created for traffic received on that interface.

TCP and UDP Checksum Failures

TCP and UDP checksums include IP source and destination addresses as part of the calculation. Therefore, translating the source or destination address with NAT can cause these checksum calculations to fail after NAT processing. This problem arises in IPsec and NAT environments where TCP and UDP checksums are calculated and verified. This specific incompatibility does not affect IPsec in tunnel mode or IPsec+GRE, as neither of these methods requires validation of UDP/TCP checksums that use a translated source and destination IP address in their calculations.
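The dependency on the IP addresses comes from the pseudo-header that is fed into the checksum. The sketch below computes a standard UDP checksum over the IPv4 pseudo-header, UDP header, and payload, and then recomputes it with a translated source address. The addresses, ports, and payload are arbitrary example values chosen for illustration.

```python
import socket
import struct


def ones_complement_sum(data):
    """16-bit one's-complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def udp_checksum(src_ip, dst_ip, src_port, dst_port, payload):
    """UDP checksum over the IPv4 pseudo-header, UDP header, and payload."""
    length = 8 + len(payload)
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, length))  # 17 = UDP
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = ~ones_complement_sum(pseudo + header + payload) & 0xFFFF
    return checksum or 0xFFFF  # a computed zero is sent as all ones


original = udp_checksum("10.1.1.1", "192.0.2.10", 5000, 53, b"hello")
after_nat = udp_checksum("198.51.100.7", "192.0.2.10", 5000, 53, b"hello")
print(hex(original), hex(after_nat))
```

If the checksum travels inside an encrypted or integrity-protected payload, the NAT device cannot patch it to match the new address, so the receiver's verification fails.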

IPsec NAT Transparency (NAT-T)

IPsec NAT-T enables an IPsec VPN endpoint to dynamically detect the support for NAT-T on its remote endpoint and to detect the presence of NAT devices between the two endpoints. If NAT is detected through the use of NAT-T, then the two endpoints will dynamically agree on the appropriate handling of IPsec NAT-T packets (such as UDP encapsulation of ESP packets, and so on). NAT-T, therefore, enables the two VPN endpoints to seamlessly establish an IPsec VPN tunnel across one or more NAT points that may exist between the two endpoints.

Consider the example described in Figure in which two routers, Router_A and Router_B, communicate with one another through a firewall, Ent_FW.

The Operation of IPsec NAT-T

Routers A and B are both capable of NAT-T, and dynamically agree on the handling of IPsec packets across the NAT'd path through the following sequence of exchanges:

1. Router_A sends its vendor ID to Router_B during IKE Phase 1 negotiation. This phase of NAT-T is commonly referred to as a "NAT Support" exchange.

2. Router_B sends its vendor ID string payload to Router_A during IKE Main Mode (MM1 and MM2, Phase 1) negotiation, letting Router_A know that it does indeed support NAT-T. This is also known as a "NAT Support" exchange in NAT-T context.

3. After Routers A and B have agreed on NAT-T Support, they must determine if NAT exists between the two of them. This phase is commonly referred to as "NAT Detect." In this phase of NAT-T, multiple NAT-D payloads are exchanged between source and destination. NAT-D payload consists of an address and a hash. Each peer typically sends two NAT-D payloads to the other in main mode (MM3 and MM4): one for the destination address followed by another for the source address. When NAT-D payloads are sent between each peer, the hashes are verified at the remote end. If the hash values match, then it can safely be determined that NAT does not exist. If the hash values do not match, then it can safely be determined that NAT does exist.

4. After NAT-D payloads are exchanged to detect NAT information, IPsec Quick Mode messages are exchanged to decide which peer (none, either, or both) will use NAT-T. This negotiation is performed during IKE Phase 2 in Quick Mode (QM1 and QM2).
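The NAT Detect comparison can be sketched as follows. The hash construction (initiator cookie, responder cookie, IP address, port) follows RFC 3947; the use of SHA-1, the cookie values, and the addresses are assumptions made for illustration, since a real exchange uses the hash algorithm negotiated in Phase 1.

```python
import hashlib
import socket
import struct


def nat_d_hash(cky_i, cky_r, ip, port):
    """NAT-D hash per RFC 3947: HASH(CKY-I | CKY-R | IP | Port)."""
    data = cky_i + cky_r + socket.inet_aton(ip) + struct.pack("!H", port)
    return hashlib.sha1(data).digest()  # SHA-1 is assumed for this sketch


def nat_detected(received_hash, cky_i, cky_r, observed_ip, observed_port):
    """Compare the sender's hash with one built from what actually arrived."""
    return received_hash != nat_d_hash(cky_i, cky_r, observed_ip, observed_port)


# Hypothetical 8-byte IKE cookies.
CKY_I = bytes.fromhex("0102030405060708")
CKY_R = bytes.fromhex("1112131415161718")

# Router_A hashes the source address and port it believes it is using.
sent = nat_d_hash(CKY_I, CKY_R, "10.1.1.1", 500)

# Router_B sees the packet arrive from a translated address and port.
print(nat_detected(sent, CKY_I, CKY_R, "198.51.100.7", 1024))
# Without translation the hashes match and no NAT is inferred.
print(nat_detected(sent, CKY_I, CKY_R, "10.1.1.1", 500))
```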

In Figure, inside source address translation is being performed by the PIX. IPsec endpoints have determined this behavior using NAT-T steps 1-4 described above. In this case, Routers A and B will both encapsulate ESP packets in UDP, thereby remedying three incompatibilities with IPsec and NAT:

- Incompatibility between ESP and AH with PAT

- Incompatibility between IKE fixed ports and PAT

- Incompatibility between UDP checksums and NAT (intermediate NAT-T UDP header checksum set to 0)


As noted above, Cisco IOS will attempt NAT-T during IKE Phase 1 negotiation. Cisco IOS configures NAT-T automatically, and there is no manual configuration required. To disable the encapsulation of ESP packets in UDP using NAT-T, execute the following command from the IOS CLI:

Router_A(config)#no crypto ipsec nat-transparency udp-encapsulation


IPsec SPIs provide another element that a NAT device can use to forward data between two endpoints. When IPsec traffic is passed through a NAT device, various crypto-protected elements are often translated by NAT, including source IP addresses, destination IP addresses, source ports, and destination ports. Another field that can be used to populate the translation table in a NAT device is the IPsec SPI. As we discussed previously, though, IPsec and NAT incompatibilities arise when overlapping IPsec tunnels with overlapping SPIs are passed through the NAT device.

Cisco IOS releases 12.2T and later employ a predictive SPI selection algorithm on IPsec crypto endpoints that enables them to select unique SPIs during IKE. This effectively enables a NAT device in the crypto path to use IPsec SPIs to build its translation table without encountering the translation and forwarding issues caused by overlapping SPIs discussed earlier in this chapter.

Consider again the scenario in Figure, but with predictively selected SPIs and SPI matching enabled on the NAT device. The NAT device is now capable of differentiating between multiple initiators (sources) in its forwarding table without the use of PAT. Instead, SPI matching is used to differentiate between the two IPsec VPN tunnel initiators, Routers A and B.


Additional configuration information on IPsec and SPI-based NAT can be obtained at the following link with a valid CCO account:

The Influence of IPsec on Traffic Flows Requiring QoS

As networking technologies begin to mature and become more widespread, time- and delay-sensitive applications are increasingly migrating toward a converged solution. Many of these applications are considered critical to the needs of businesses in different vertical markets. As such, a need for guaranteed timely delivery and ordering of these applications emerges. In today's networked environment, a variety of QoS mechanisms exist to guarantee the timely delivery of delay-sensitive business critical data communications in IP networks. QoS in and of itself is so broad and deep in scope that we cannot cover it in its entirety. We will, however, discuss several common inconsistencies between IPsec and QoS that present design challenges:

  • Traffic Flow Hash Obfuscation: Flow-based QoS techniques, such as Weighted Fair Queueing (WFQ), rely on the original source and destination IP addresses of the packet to hash traffic flows into "conversations." Certain IPsec protocols and modes effectively obscure the information needed to perform this hashing decision: the source and destination IP addresses. Consider the example of IPsec ESP in tunnel mode. In this example, the inner IP header is encapsulated within the ESP boundary and encrypted. Therefore, if the router wants to use WFQ to hash this traffic flow into a conversation, it will not be able to, as it will be unable to read the encrypted original source and destination IP addresses. Because most VPN endpoints supporting QoS rely on flow-based QoS techniques such as WFQ and Low-Latency Queueing/Class-Based Weighted Fair Queueing (LLQ/CBWFQ), it is critical that the IPsec VPN endpoint have the capacity to classify traffic flows before IPsec encapsulation, generic routing encapsulation (GRE), or both. Cisco IOS offers this functionality with the IPsec Preclassify feature.

  • Packet Reordering and the IPsec Antireplay Window: If packets are received outside of the antireplay window in an IPsec VPN, they will be dropped. The nature of QoS is to reorder packets, which can sometimes result in delay of queued traffic. It is critical to ensure that delays are not so long as to result in the packet being received outside of the antireplay window on the receiving VPN endpoint. Cisco IOS offers the capacity to extend the antireplay window on VPN endpoints to alleviate antireplay window errors if they should arise.

  • Packet Marking Obfuscation and LLQ/CBWFQ: LLQ/CBWFQ is a QoS technique for reordering traffic flows locally on a network device. LLQ/CBWFQ requires that packets within a traffic flow be identified in some way. This is typically achieved by marking the packet by setting the DiffServ bits in the IP header. Network devices are therefore able to differentiate that traffic from other traffic flows and treat it (queue it) with the appropriate level of urgency. The scope of effect of LLQ/CBWFQ decisions is contained to the local device only.

  • Resource Reservation Protocol (RSVP): RSVP uses an exchange of RSVP signaling messages between two endpoints to reserve resources for delay-sensitive traffic between the two endpoints. Unlike LLQ/CBWFQ, where LLQ/CBWFQ classification and queuing decisions are local in scope, the scope of RSVP queueing decisions is end-to-end. Although RSVP can be configured to work in tandem with DiffServ-based QoS, it can also be configured to place traffic in the RSVP-reserved queue regardless of what type of DiffServ policy is configured on that specific network node.
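The interaction between packet reordering and the antireplay window described in the list above can be sketched with a simple sliding-window check. This is a minimal model of the general technique, not the Cisco IOS implementation; the class name `AntiReplayWindow` and the window sizes used (64 and 128) are assumptions chosen for illustration.

```python
class AntiReplayWindow:
    """Sliding-window antireplay check keyed on the IPsec sequence number."""

    def __init__(self, size=64):
        self.size = size
        self.highest = 0  # highest sequence number accepted so far
        self.bitmap = 0   # bit i set => sequence (highest - i) was seen

    def accept(self, seq):
        if seq <= 0:
            return False
        if seq > self.highest:
            # Slide the window forward and mark the new right edge.
            shift = seq - self.highest
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.highest = seq
            return True
        offset = self.highest - seq
        if offset >= self.size:
            return False  # delayed beyond the left edge of the window
        if self.bitmap & (1 << offset):
            return False  # already seen: a replay
        self.bitmap |= 1 << offset
        return True


# A packet held in a low-priority queue while 99 others overtake it.
small = AntiReplayWindow(size=64)
small.accept(100)
print(small.accept(30))   # False: sequence 30 is outside a 64-packet window

large = AntiReplayWindow(size=128)
large.accept(100)
print(large.accept(30))   # True: the larger window tolerates the delay
```

The second case shows why extending the window on the receiving endpoint can relieve drops caused by QoS-induced reordering.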


For more detailed information on LLQ/CBWFQ and DiffServ, visit CCO at the following URLs:

DiffServ: The Scalable End-to-End QoS Model

Implementing DiffServ for End-to-End Quality of Service

As mentioned previously, QoS requires the use of end-to-end messaging techniques and the interpretation of certain bit values within the IP header. In some cases, if care is not taken during an IPsec/QoS deployment, IPsec can obfuscate the necessary messaging and IP header bits needed to deliver QoS. We will discuss QoS within this context, and explore some available techniques for delivering QoS within an IPsec VPN deployment.

IPsec's Influence on DiffServ and LLQ/CBWFQ

In this section, we will explore a Voice over IP (VoIP) deployment in a branch networking scenario. VoIP is delay-sensitive; that is to say, packets must be received in order with consistent delay (low jitter). As shown in Figure, two routers want to communicate with each other over a series of wide-area links of varying bandwidths. On the lower-speed links, packets can sometimes be dropped due to oversubscription of the available bandwidth. Therefore, QoS is required to ensure that the voice (RTP) packets are not dropped when this occurs (other packets are dropped instead). For these reasons, QoS must be used for IP traffic over the Frame Relay links.

IPsec and DiffServ in a VoIP implementation

DiffServ is implemented in conjunction with LLQ/CBWFQ to deliver QoS for voice traffic to and from the branches. Because the company's security policy mandates confidentiality for voice traffic, IPsec VPNs have been configured between the enterprise headend router and all branch routers, posing several design considerations with the IPsec/DiffServ requirements:

  • If AH is used, changes to the IP header are not permitted (the AH MIC invalidates them on the receiving VPN endpoint). This prevents remarking on network devices between the phones. Therefore, RTP traffic must be marked accordingly prior to IPsec encapsulation (either on the routers or phones) if AH is used.

  • In both AH and ESP, if packets are received outside the antireplay window, they are dropped. Therefore, if traffic is delayed in queue due to QoS decisions, it could get dropped if it is received outside of the antireplay window at the opposite end of the IPsec VPN tunnel.

  • With ESP, the original IP header and QoS information are encapsulated in ESP and encrypted with the appropriate transform. This effectively renders the DiffServ bits needed for QoS unreadable by intermediate network nodes between the two IPsec VPN endpoints. Unless these bits are successfully copied to the outer IP header during ESP encapsulation, network nodes between the two IPsec VPN endpoints may not appropriately classify the IPsec-processed RTP packet.

IPsec's Effect on IntServ and RSVP

In addition to the issues outlined for DiffServ and LLQ/CBWFQ, RSVP implementations with IPsec VPNs present further design issues to address. As we mentioned previously, RSVP provides a signaling method to proactively provision resources between a given source and destination. RSVP does so by exchanging a series of RSVP PATH and RSVP RESV messages between the source and destination. If intermediate network nodes between the RSVP source and destination are unable to decipher the RSVP RESV messages, as would be the case if they were encrypted in an IPsec VPN, intermediate network nodes cannot use the RSVP RESV messages to dynamically reserve resources between source and destination (illustrated in Figure).

IPsec and RSVP Signaling Incompatibility

Therefore, to dynamically provision resources on intermediate nodes between a source and destination that require timely, ordered delivery of IP-based application traffic, RSVP signaling messages must be forwarded outside of the crypto path.

Solving Fragmentation Issues in IPsec VPNs

In IPsec VPN environments, it is critical to address MTU and fragmentation issues. Otherwise, the entire VPN is at risk of performance and operation issues. We will discuss the effect of fragmentation reassembly and MTU issues in this section, and provide solutions for proper IPsec design in environments in which MTU is likely to be exceeded, resulting in fragmentation.

The effect of fragment handling between encryption devices is largely focused on the encryption device that is performing the reassembly of the fragmented packet. Although most network devices and VPN endpoints available today can fragment encrypted packets in the crypto-switched fast path, the decrypting IPsec endpoint must reassemble all fragments in the chain before the packet can be decrypted. Figure illustrates an IPsec VPN deployment in which packets are reassembled prior to decryption on the destination IPsec VPN gateway.

Fragmentation Handling Between Encryption Devices

This reassembly is done at the process level and greatly affects the performance of the VPN. In IPsec environments, every precaution should be taken to fragment packets before they are encrypted with IPsec so that administrators can be assured that both fragmentation and reassembly are being done on devices with the appropriate computational resources available.

Path MTU Discovery

IP PMTUD is a technology that is used to dynamically discover the maximum MTU size between two endpoints such that the originating device fragments packets to the lowest MTU of the path. As such, PMTUD prevents intermediate network devices from fragmenting packets and causing excessive CPU overhead on the receiving IPsec endpoint doing the reassembly. Consider the scenario described in Figure, in which Host_A wishes to open a TCP session to Server_B across a routed IP network using Routers A, B, and C.

PMTUD and IPsec

Administrators have enabled IP PMTUD on their workstations and servers such that fragmentation reassembly issues can be avoided on Router_B. Host_A executes PMTUD using the following process:

Host_A creates an IP packet sized to the appropriate MTU of its locally attached segment and sets the DF bit before transmitting it to Server_B.

Router_A receives the packet and notes that the DF bit is set. Router_A's serial link has an MTU of 1414. Because the DF bit is set and the packet from Host_A to Server_B exceeds the MTU of the serial interface, it is dropped.

Router_A sends an ICMP Unreachable message back to Host_A, carrying the MTU (1414) of the next hop (the serial interface between Router_A and C).

Host_A sends another ICMP message of 1414 bytes in length to Server_B with the DF bit set.

Router_A receives the packet and forwards to Router_C. Router_C receives the packet, and notes that the DF bit is set. Because the DF bit is set and the packet is greater than the MTU of Router_C's link to Router_B (512), the packet is dropped.

Router_C sends an ICMP Unreachable message back to Host_A, carrying the MTU (512) of the next hop (the serial interface between Router_C and Router_B).

Host_A sends another ICMP message of 512 bytes in length to Server_B with the DF bit set.

The 512-byte ICMP message is no larger than the MTU of any individual link in the path. It is therefore successfully forwarded to Server_B. Server_B sends an ICMP Echo Response back to Host_A, indicating to Host_A that 512 is the MTU of the path.
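The sequence above can be simulated in a few lines. This is a minimal sketch of the discovery loop only: the function name `discover_path_mtu` is hypothetical, and the 1500-byte starting size for Host_A's local segment is an assumption, since the scenario does not state it.

```python
def discover_path_mtu(local_mtu, link_mtus):
    """Return (path_mtu, probe_sizes) for links listed in path order."""
    size = local_mtu
    probes = [size]
    while True:
        # The first link too small for the probe drops it (DF bit is set)
        # and reports its own MTU back in an ICMP unreachable message.
        bottleneck = next((mtu for mtu in link_mtus if mtu < size), None)
        if bottleneck is None:
            return size, probes  # probe reached the destination
        size = bottleneck
        probes.append(size)


# Router_A's serial link (1414), then Router_C's link to Router_B (512).
path_mtu, probes = discover_path_mtu(1500, [1414, 512])
print(path_mtu, probes)
```

Note that each reduction depends on an ICMP unreachable message making it back to Host_A; if any device along the return path filters those messages, the loop never converges.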


The routers in the above scenario are used to illustrate the general operation of PMTUD. IPsec and IPsec+GRE tunnels use a slightly different configuration of PMTUD than the previous, known as "Tunnel Path MTU Discovery." The specific operation of fragment handling using PMTUD in IPsec and IPsec+GRE environments is discussed in greater detail later in this chapter.

IPsec in Cisco IOS can be configured to copy the DF bit value into the outer IP header in ESP-processed packets. As such, the ICMP traffic that PMTUD relies on to operate correctly does not have to be explicitly excluded from the crypto switching path.

There are several issues that must be addressed if PMTUD is to be part of one's design strategy to mitigate fragmentation reassembly issues in IPsec VPNs. In this section, we will briefly highlight some of the most common ones:

  • Permitting ICMP Unreachable Messages

  • Rate-Limiting ICMP Messages

  • PMTUD Not Supported on End-Hosts

  • Adjusting TCP Maximum Segment Size

  • Clearing the DF-Bit

Permitting ICMP Unreachable Messages

As we discussed previously in our overview of the PMTUD protocol, PMTUD relies heavily on ICMP unreachable messages to communicate the MTU of segments back to the fragmenting host. It is very common for security devices, such as firewalls, to deny ICMP unreachable messages, as they are commonly used in malicious scanning techniques by hackers. As such, all paths between PMTUD-enabled endpoints must be checked to ensure that ICMP unreachables are indeed allowed to pass if PMTUD is to be the preferred method for fragmentation avoidance along the path.

Rate-Limiting ICMP Messages

Because PMTUD relies on the receipt of ICMP Unreachable replies within a given retransmission window on the originating host, care should be given to rate-limiting techniques applied to ICMP messages, as they could cause premature retransmission of ICMP messages in PMTUD environments. If received out of order, ICMP unreachable messages in a PMTUD environment could cause confusion on MTU settings for the originating PMTUD host. For example, ICMP Unreachable messages delayed in rate-limiting queues could signal an erroneous MTU setting on the originating PMTUD host if that Unreachable message is received after a valid ICMP Unreachable with the correct IP MTU.

PMTUD Not Supported on End-Hosts

If PMTUD is disabled on the hosts within a network, it is recommended that the network administrator take steps to ensure that packets are fragmented in the network at some other location before crypto processing occurs. This can be achieved through enabling IPsec Lookahead Fragmentation in Cisco IOS and setting the DF bit in IPsec-processed packets. With IPsec Lookahead Fragmentation, the IOS IPsec VPN endpoint will attempt to determine the encapsulated packet size before it is encrypted. If the encapsulated packet size is predetermined to be larger than the path MTU, it is fragmented before encryption. When the DF bit is set, the encrypting router will look for information in any ICMP unreachable message received for updates it needs to install to the Path MTU entry in its SADB. Alternately, if the VPN endpoint does not support functionality similar to IPsec Lookahead Fragmentation or explicit setting of the DF bit in outer IP headers, the MTU of the IPsec VPN tunnel can be manually defined to avoid fragmentation reassembly issues.

Adjusting TCP Maximum Segment Size

Hosts sending IP packets greater than the TCP Maximum Segment Size (MSS) are at risk of fragmentation. Strictly speaking, the TCP MSS is the maximum amount of data that a host is willing to accept in an IP datagram. Hosts compare their TCP MSS buffer sizes with the MTU to determine the MSS for their transmissions. The result will be the lower of the TCP MSS buffer or the MTU less 40 bytes (an allocation for the IP header and the TCP header, each 20 bytes in length). Once determined, each host communicates the selected value to the opposite host via the exchange described in Figure.

TCP MSS, IP MTU, and Fragmentation

The following order of events describes the sequence illustrated in Figure above:

1. Host_A has an MSS buffer of 20k and an MTU of 1500. It compares the MSS buffer with the MTU of the link, less a 40-byte allocation for the IP and TCP headers (1500 - 40 = 1460), and selects the lower value of 1460 (1460 < 20000) to send to Host_B.

2. Host_B has a 16k MSS buffer and a 2048 interface MTU size. It does a similar comparison to Host_A's in step 1 and selects 2008 (2048 - 40). It then compares this with the value received from Host_A and selects the lower value of 1460 as its MSS value (1460 < 2008).

3. Host_B signals its MSS of 2008 to Host_A.

4. Host_A compares the received value of 2008 with its TCP MSS value derived in step 1 and selects the lower of the two values, 1460, as its TCP MSS.

TCP MSS values use MTU values to help avoid fragmentation. In the example above, MTU values are selected, as they are smaller than TCP MSS values. However, if the MSS value were to be smaller than the MTU, then Hosts A and B will select the MSS + 40 bytes as the maximum packet length for TCP traffic. Note that, in this case, the larger MTU value would still be used for UDP traffic.
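
The comparison each host performs can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the exchange above, not a model of a real TCP stack: the function names are hypothetical, the 40-byte allowance assumes IP and TCP headers without options, and "20k" and "16k" are taken as 20000 and 16000 bytes.

```python
IP_TCP_HEADERS = 40  # bytes: 20-byte IP header + 20-byte TCP header (no options assumed)


def advertised_mss(mss_buffer: int, interface_mtu: int) -> int:
    """MSS a host announces: the lower of its buffer and its MTU less headers."""
    return min(mss_buffer, interface_mtu - IP_TCP_HEADERS)


def send_mss(own_advertised: int, peer_advertised: int) -> int:
    """MSS a host actually uses when sending: the lower of the two values."""
    return min(own_advertised, peer_advertised)


# Values from the Host_A / Host_B exchange described above.
host_a = advertised_mss(mss_buffer=20_000, interface_mtu=1500)
host_b = advertised_mss(mss_buffer=16_000, interface_mtu=2048)

print(host_a, host_b)            # 1460 2008
print(send_mss(host_b, host_a))  # Host_B sends with an MSS of 1460
print(send_mss(host_a, host_b))  # Host_A sends with an MSS of 1460
```

Both hosts settle on 1460 because the smaller interface MTU, not either MSS buffer, is the limiting value in this example.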

Clearing the DF-Bit

If the DF bit is cleared somewhere along the PMTUD path between source and destination, the network nodes along the path will fragment the ICMP PMTUD message rather than dropping it and replying with an ICMP unreachable. This will obviously break the operation of PMTUD. Most IP-enabled devices available today are capable of clearing the DF bit in an IP header.

Fragmentation Behavior on Cisco IOS VPN Endpoints

The overhead associated with IPsec and IPsec+GRE encapsulated IP packets can often lead to fragmentation, which is why PMTUD is, by default, enabled on IPsec VPN routers. However, the specifics of fragment handling and PMTUD differ slightly from non-VPN environments. In this section, we will discuss the handling of fragments in IPsec and IPsec+GRE tunnels and some additional solutions available for avoiding fragmentation in IPsec VPN environments.

IPsec VPNs use Tunnel Path MTU Discovery to interpret MTU information of ICMP Unreachable messages and update the Path MTU of the corresponding IPsec SA. The typical PMTUD operation and fragment handling of an IPsec VPN is illustrated in Figure.

Fragment Handling and PMTUD Operation with IPsec Tunnels

The following describes the operation illustrated in Figure:

1. Host_A sends a 1500-byte (size of the local interface MTU) packet to Server_B.

2. Router_A receives the packet sent in step 1, and observes that the ESP-encapsulated packet size exceeds the MTU of the serial link toward Router_B. Because Host_A set the DF bit of the packet, Router_A drops the packet and sends an ICMP unreachable message containing the MTU size of 1442 (1500 bytes less 58 bytes max ESP overhead) back to Host_A.

3. Host_A receives the ICMP Unreachable message with the MTU information, and forwards another ICMP packet of 1442 bytes in length to Router_A. Router_A encapsulates the packet with ESP and forwards it across the VPN with the DF bit set in the outer header.

4. Router_C receives the packet from Router_A in step 3 and notes that the packet exceeds the MTU of its serial interface to Router_B. Because the DF bit is set, Router_C drops the packet and forwards an ICMP Unreachable to Router_A with the MTU size of 1400 embedded.

5. Router_A receives the ICMP Unreachable message from Router_C in step 4. Router_A records the MTU size of 1400 in the PMTU field of the SA that is established with Router_B. Router_A does not send a new ICMP message of 1400 in length; instead, this is handled by Host_A in step 6.

6. Host_A retransmits the 1442-byte ICMP message, as it never received an acknowledgement of the message sent in step 3.

7. Router_A compares the ESP-encapsulated packet size (1442 + 58) of the packet received in step 6 with its path MTU (1400) and drops the packet. Router_A responds with an ICMP unreachable with the MTU of 1342 (1400 PMTU less ESP overhead of 58 bytes) embedded.

8. Host_A sets its MTU to 1342 and forwards a new 1342-byte message to Server_B. The message and associated ESP overhead no longer exceed the end-to-end path MTU, resulting in a successful transmission from Host_A to Server_B.
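
The sequence of packet sizes in the steps above can be reproduced with a short Python sketch. It is a simplified model under stated assumptions, not a description of how IOS implements Tunnel PMTUD: the function names are hypothetical, the overhead is fixed at the 58-byte maximum used in the steps, and host retransmissions are folded into the loop rather than modeled as timers.

```python
from __future__ import annotations

ESP_MAX_OVERHEAD = 58  # bytes, the maximum ESP overhead used in the steps above


def host_mtu_for(path_mtu: int, overhead: int = ESP_MAX_OVERHEAD) -> int:
    """Largest cleartext packet that still fits the path once ESP is added."""
    return path_mtu - overhead


def discover(packet_size: int, link_mtus: list[int]) -> list[int]:
    """Return the successive packet sizes the host tries until one fits.

    link_mtus lists the MTU of each link the encrypted packet crosses, in order.
    """
    attempts = [packet_size]
    known_path_mtu = link_mtus[0]  # the encrypting router starts with its own link MTU
    while True:
        encapsulated = attempts[-1] + ESP_MAX_OVERHEAD
        if encapsulated > known_path_mtu:
            # The encrypting router drops the packet and reports a smaller MTU.
            attempts.append(host_mtu_for(known_path_mtu))
            continue
        # The packet leaves the router; a smaller downstream link may still reject it.
        smaller = [mtu for mtu in link_mtus if mtu < encapsulated]
        if not smaller:
            return attempts
        known_path_mtu = smaller[0]  # learned from the ICMP unreachable


# Router_A's link MTU of 1500 followed by the 1400-byte link behind Router_C.
print(discover(1500, [1500, 1400]))  # [1500, 1442, 1342]
```

The host needs two rounds of ICMP feedback because Router_A only learns about the 1400-byte link after its first encrypted packet is dropped downstream.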

As we've discussed in previous sections of this chapter, and in others, it is sometimes necessary to encapsulate certain traffic types in GRE prior to processing them with IPsec. Processing of multicast traffic, for example, is one instance in which one would seek to encapsulate the plain text traffic in GRE prior to encapsulating it in ESP. This is commonly referred to as IPsec+GRE. This process includes an additional 24 bytes of overhead, as the GRE header is applied in addition to the ESP or AH headers. More importantly, it adds additional steps to the Tunnel PMTUD operation while trying to avoid fragmentation. Figure illustrates the fragment handling process using PMTUD in an IPsec+GRE scenario.

Fragment Handling and PMTUD Operation with IPsec+GRE Tunnels

The operation of PMTUD over an IPsec+GRE tunnel illustrated in Figure is described by the following order of events:

1. Host_A sends a 1500-byte packet with the DF bit set to Server_B.

2. Router_A receives the packet and observes that the DF bit is set. GRE encapsulation occurs prior to ESP encapsulation in this scenario, so the GRE process on the router drops the packet, as the 1500-byte packet plus 24 bytes of GRE overhead exceeds the GRE tunnel MTU of 1500. Router_A sends an ICMP Unreachable back to Host_A with an embedded MTU value of 1476 (1500 less the GRE header length of 24).

3. Host_A sends a 1476-byte packet with the DF bit set to Server_B.

4. Router_A receives the packet, noting that the DF bit has been set. The router encapsulates the packet in GRE and then attempts to encapsulate it in ESP. The added ESP encapsulation pushes the packet size over the serial interface MTU of 1414, so Router_A drops the packet. ESP sends an ICMP error message to GRE indicating an MTU of 1376 bytes (1414 less max ESP header length of 38 bytes). GRE records this value as the new tunnel IP MTU.

5. Host_A retransmits the 1476-byte packet from step 3, as no acknowledgement was received. Router_A drops this packet as it exceeds the tunnel IP MTU derived in step 4. Router_A responds with an ICMP Unreachable message with the tunnel IP MTU of 1414.

6. Host_A sends a new ICMP message of 1414 bytes in length to Server_B. Router_A encapsulates it in GRE, and then encapsulates it in ESP. The DF bit is copied to the outer IP header in the ESP packet before transmitting across the IPsec VPN.

7. Router_C receives the packet, and notes that the DF bit is set. The size of the packet is now 1414, as GRE and ESP headers have been added to the original ICMP message sent in step 6. Router_C drops the packet, as it exceeds the MTU of the link to Router_B and has the DF bit set. Router_C sends an ICMP unreachable message to Router_A with the MTU of 1400.

8. Router_A receives the ICMP unreachable message from Router_C in step 7 and updates the PMTU field of its IPsec SA to Router_B with the 1400-byte value.

9. Host_A retransmits the 1414-byte ICMP message from step 6 to Server_B, as no acknowledgement was received.

10. Router_A receives the packet, and encapsulates it in GRE. Once ESP encapsulation is applied, the length of the packet exceeds the 1400-byte IPsec SA PMTU obtained from Router_C in step 8. ESP sends an ICMP message to GRE with an MTU of 1342 (1400 less 58 bytes max ESP header length). GRE updates its tunnel IP MTU with this value.

11. Host_A retransmits the 1414-byte ICMP message from step 6 again, as no acknowledgement was received from the retransmission in step 9.

12. Router_A receives the packet and drops it, as it exceeds the new GRE tunnel IP MTU of 1342 and has the DF bit set. Router_A forwards an ICMP Unreachable to Host_A with an MTU value of 1318 bytes (1342 GRE MTU less 24 bytes GRE overhead).

13. Host_A receives the ICMP Unreachable message sent from Router_A in step 12, and sends a new 1318-byte ICMP message to Server_B with the DF bit set.

14. Router_A receives the packet, encapsulates it in GRE, encapsulates it in ESP, sets the DF bit in the outer IP header, and forwards it to Router_C. This time, Router_C forwards the ICMP message originated from Host_A to Router_B.

15. Router_B decapsulates the ESP packet, then decapsulates the GRE packet, and finally forwards the original ICMP PMTUD message to Server_B.

16. Server_B acknowledges the receipt of the message, confirming that Host_A is to use an MTU size of 1318 bytes for this path.
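
The two-stage subtraction that produces the final host MTU can be checked with a small Python sketch. It covers only the arithmetic stated in steps 2, 10, and 12 (24 bytes of GRE overhead and a 58-byte maximum ESP overhead); the function names are hypothetical and the intermediate values quoted in steps 4 through 7 are not modeled.

```python
GRE_OVERHEAD = 24      # bytes added by the GRE header, per the text
ESP_MAX_OVERHEAD = 58  # bytes, maximum ESP overhead used in step 10


def gre_tunnel_ip_mtu(ipsec_path_mtu: int) -> int:
    """Value ESP reports to GRE: the IPsec SA path MTU less ESP overhead."""
    return ipsec_path_mtu - ESP_MAX_OVERHEAD


def host_mtu(tunnel_ip_mtu: int) -> int:
    """Value GRE reports to the host: the tunnel IP MTU less GRE overhead."""
    return tunnel_ip_mtu - GRE_OVERHEAD


print(host_mtu(1500))                 # step 2: 1476
tunnel_mtu = gre_tunnel_ip_mtu(1400)  # step 10: 1342
print(tunnel_mtu, host_mtu(tunnel_mtu))  # step 12: 1342 1318
```

Because GRE is applied before ESP, the host only ever hears from the GRE process, which is why each new IPsec path MTU costs an extra retransmission before Host_A learns about it.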


Although the DF bit in PMTUD ICMP messages is always set so as to properly detect areas of fragmentation, ICMP Unreachable responses to these messages are sent with the DF bit set to 0. As such, it is important to note that ICMP PMTUD messages sent from source to destination will never be fragmented, but the responses to those messages could quite possibly be fragmented along the return path.

Solutions for Preventing Fragmentation

In previous sections, we've discussed the most common method for preventing fragmentation: Path MTU Discovery. However, as we have explored, the use of PMTUD is somewhat laborious for network devices to execute. Additionally, PMTUD may not be an option in networks that require the filtering of ICMP messages at various points within the network. As such, it is important to understand other ways in which fragmentation can be avoided when designing an IPsec VPN. We will discuss several techniques for mitigating IP fragmentation other than PMTUD in this section.

IPsec Prefragmentation

IPsec Prefragmentation is a Cisco IOS feature that enables an encrypting IPsec VPN endpoint to attempt fragmentation before encryption if the size of the encrypted packet and additional header information exceeds the MTU of the path between endpoints. PMTUD can be used to determine the path of the SA and the MTU of that path. Doing so in conjunction with IPsec Prefragmentation provides a very scalable and manageable method of increasing the overall performance of an IPsec VPN where fragmentation after encryption is a possibility. Figure illustrates how to configure IPsec crypto DF-bit overwrite with IPsec Lookahead Fragmentation such that the path MTU of the SADB will be dynamically determined using tunnel PMTUD, and large packets (those exceeding the Path MTU for that SA in the SADB) will be fragmented before encryption.

Enabling IPsec Prefragmentation with PMTUD and Crypto DF-bit Rewrite

Router_A#configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
Router_A(config)#crypto ipsec df-bit set
Router_A(config)#crypto ipsec fragmentation before-encryption


Although in the case of Figure, Router_A will attempt to fragment large packets before encrypting them, there are many configuration instances in which IPsec Lookahead Fragmentation and DF-bit overwrite are configured incorrectly. It is critical to understand the interdependencies of the DF-bit setting and the Lookahead Fragmentation setting when addressing IPsec fragmentation design considerations. For a full listing of DF-bit interoperability with IPsec Lookahead Fragmentation settings, please refer to the following URL on CCO:

Figure illustrates a client server exchange that does not support PMTUD. Note that, even though there is no exchange of ICMP messages, the Path MTU is still discovered and updated in Router_A's IPsec SADB.

IPsec Fragment Handling Without PMTUD-Enabled Endstations

There are three key operations that enable this feature:

  • Lookahead Fragmentation: Before forwarding an IPsec packet, Router_A predetermines the encapsulated packet size, and compares it with the MTU in the SADB. If it is predetermined to exceed that MTU size, the packet is fragmented before it is encrypted.

  • Crypto DF-bit Rewrite: When PMTUD is not supported, it is important that Router_A be able to set the DF bit in the outer IP header of IPsec-encapsulated packets. This prevents fragmentation, and triggers ICMP unreachables needed to adjust the Path MTU in Router_A's SADB.

  • Processing of MTU Information in ICMP Unreachables: Router_A is capable of deciphering MTU information of ICMP unreachables (received when IPsec packets with DF=1 are dropped). It uses this information to dynamically update the path MTU in its SADB.

The exchange between Router_A and Router_B in Figure illustrates how all three of these features work in concert to minimize the effect of post-encryption fragmentation in IPsec VPN deployments where PMTUD is not enabled on the endstations:

1. Host_A sends a 1500-byte data packet, destined for Server_B.

2. Router_A receives the packet, and estimates the ESP-encapsulated packet size before encrypting or forwarding the packet. Router_A compares the estimated encapsulated packet size with the Path MTU, determines that the size is greater than the Path MTU, and fragments the packet.

3. Router_A applies the appropriate encapsulation to the fragments in the fragment chain. While doing so, it sets the DF bit of each encapsulated packet equal to 1.

4. Router_C receives the packets from Router_A, compares them with the MTU of the Router_C to Router_B link, notes that DF=1, and drops larger packets accordingly. Router_C sends ICMP Unreachables for dropped packets to Router_A.

5. Router_A receives the ICMP Unreachables from Router_C, and updates the MTU of its SADB accordingly.

6. Host_A does not receive a reply to its original packet within the appropriate timeout window and therefore retransmits.

7. Router_A performs Lookahead Fragmentation on the retransmitted packet, sizing the fragments to the new MTU in its SADB. It then sets the DF bit in each encrypted packet.

8. The encrypted packets are now sized lower than any individual link MTU in the path (<1400 bytes), and are therefore received on Router_B. Router_B is now able to decrypt each fragment in the chain before they are reassembled, a process that is done in the fast switching path.
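
The sizing decision behind Lookahead Fragmentation can be sketched in Python as follows. This is an illustration under assumptions rather than the IOS algorithm: the function name is hypothetical, the IPv4 header is taken as 20 bytes without options, and the overhead is fixed at the 58-byte maximum ESP overhead quoted earlier in the chapter.

```python
from __future__ import annotations

IP_HEADER = 20         # bytes, IPv4 header without options (assumption)
ESP_MAX_OVERHEAD = 58  # bytes, maximum ESP overhead quoted earlier in the chapter


def prefragment(packet_len: int, path_mtu: int) -> list[int]:
    """Split a cleartext IPv4 packet so each fragment fits path_mtu after ESP.

    Returns the total length (header plus payload) of each cleartext fragment.
    """
    max_cleartext = path_mtu - ESP_MAX_OVERHEAD
    if packet_len <= max_cleartext:
        return [packet_len]  # fits after encapsulation; no fragmentation needed
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    max_payload = (max_cleartext - IP_HEADER) // 8 * 8
    if max_payload <= 0:
        raise ValueError("path MTU too small to carry any payload")
    payload = packet_len - IP_HEADER
    fragments = []
    while payload > 0:
        chunk = min(payload, max_payload)
        fragments.append(chunk + IP_HEADER)
        payload -= chunk
    return fragments


# A 1500-byte packet against a 1400-byte path MTU recorded in the SADB.
print(prefragment(1500, 1400))  # [1340, 180]
```

Each resulting fragment, once ESP overhead is added, stays within the 1400-byte path MTU, so the receiving endpoint can decrypt every packet without first reassembling it.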

Manual MTU Adjustment

We've discussed the many tools available within Cisco IOS to avoid fragmentation in IPsec VPNs without having to manually tune the MTU sizes within the network. However, the option still exists to increase the MTU size between IPsec VPN endpoints such that the risk of receiving a packet larger than that MTU size is small. If one must tune MTU sizes to accommodate IPsec traffic between endpoints in a network, one should take the following disadvantages of this approach into consideration:

  • Scalability and Management: Remember that MTU sizes vary on a segment-by-segment basis. As such, it can become laborious for network administrators to consistently ensure that every segment's MTU is properly tuned. Network designers can anticipate the difficulty of manual MTU tuning to increase as the number of IPsec VPN connections and hosts scales upwards.

  • Serialization Delay: The MTU attribute exists to decrease serialization delay on networks. On segments that have artificially high MTU sizes, network administrators can expect increased delay as larger packets are serialized in queue. This adversely affects time- and delay-sensitive applications such as Voice and Video over IP.
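
To put a number on the serialization delay concern, the following Python sketch computes the time needed to clock a single packet onto a link. The function name and the 64-kbps link speed are illustrative assumptions, not values taken from the scenarios in this chapter.

```python
def serialization_delay_ms(packet_bytes: int, link_bps: int) -> float:
    """Time needed to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8 * 1000 / link_bps


# Compare a few packet sizes on a hypothetical 64-kbps serial link.
for size in (576, 1500, 4470):
    print(size, serialization_delay_ms(size, 64_000))
```

On such a link, a 1500-byte packet occupies the wire for 187.5 ms, during which a voice packet queued behind it must wait; raising the MTU further lengthens that wait proportionally.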


For more information on serialization delay and other common troubleshooting and design issues with VoIP, please refer to the following link on CCO:

Troubleshooting QoS Choppy Voice Issues:

The Effect of Recursive Routing on IPsec VPNs

Recursive routing commonly occurs when a router learns, through a given routing protocol (RP), a route to a destination across the very path that depends on that route in order to stay up. Because the multicast updates inherent to most RP operations cannot be crypto-switched, the effect of recursive routing is most commonly seen in the IPsec+GRE solution, in which GRE is used to encapsulate RP traffic prior to IPsec (ESP or AH) encapsulation. We will explore a recursive routing situation in the IPsec+GRE scenario outlined in Figure.

Recursive Routing and IPsec+GRE Deployments

The following order of events describes the influence of the recursive routing scenario in the IPsec+GRE deployment illustrated in Figure:

1. Router_A and Router_B look up the GRE tunnel destination interfaces in their routing tables in order to build the GRE tunnel. They both find routes installed in their routing table for the appropriate tunnel destination interfaces and build the GRE tunnel accordingly using the statically configured default route (Figure, line 30).

2. Router_A and Router_B are configured to encrypt all GRE traffic between each other using ESP. Therefore, all exchanges across the GRE tunnel will be kept confidential.

3. Router_A and Router_B inject all directly connected routes into their routing protocol, Enhanced Interior Gateway Routing Protocol (EIGRP) AS1 (Figure, line 24). This includes the loopback192 interfaces and the GRE tunnel interfaces.

4. Router_A and Router_B build an EIGRP adjacency across the GRE tunnel and begin to exchange routing updates with one another.

5. Router_A and Router_B learn the static routes installed in step 1 recursively across the GRE tunnel via EIGRP 1, and attempt to install the routes in the routing table. EIGRP learns a more specific route to the GRE tunnel destination than the default used in step 1.

6. Once Router_A and Router_B are instructed to reach each other's tunnel destinations over the GRE tunnel interface (rather than via the static default routes in step 1), they determine that the behavior is recursive, and clear the route from the routing table. This causes both the GRE tunnel and the EIGRP adjacency to go down.
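
The condition that triggers the failure in step 6 can be expressed as a longest-prefix-match check, sketched here in Python. The function names are hypothetical, and because the addresses in this scenario are not reproduced in the text, the prefixes below are documentation addresses chosen purely for illustration.

```python
from __future__ import annotations

import ipaddress


def best_route(routes: list[tuple[str, str]], destination: str):
    """Longest-prefix match: return the (network, interface) pair that wins."""
    dest = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), interface)
        for prefix, interface in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda route: route[0].prefixlen)


def is_recursive(routes, tunnel_destination: str, tunnel_interface: str) -> bool:
    """True when the best route to the tunnel destination uses the tunnel itself."""
    match = best_route(routes, tunnel_destination)
    return match is not None and match[1] == tunnel_interface


# Step 1: only the static default route exists (hypothetical addresses).
routes = [("0.0.0.0/0", "Serial1/0")]
print(is_recursive(routes, "192.0.2.2", "Tunnel192"))  # False

# Step 5: EIGRP adds a more specific route learned across the tunnel.
routes.append(("192.0.2.0/24", "Tunnel192"))
print(is_recursive(routes, "192.0.2.2", "Tunnel192"))  # True
```

The more specific prefix always beats the default route, which is why the remedy is to keep that prefix from being learned over the tunnel rather than to adjust the static route.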

When a recursive routing situation occurs in an IPsec+GRE scenario, IOS CLI will write messages to the console, indicating the failure. Other symptoms indicative of recursive routing failure include RP adjacency loss across the GRE tunnel, which is also written to the IOS CLI. Figure shows recursive routing configuration errors described in Figure and steps 1-6 above. Symptoms of recursive routing in IPsec+GRE scenarios are shown in Figure.

Figure. Configuration Errors Leading to Recursive Routing (Topology in Figure)

1  interface Loopback192
2  ip address
3  !
4  interface Tunnel 192
5  ip address
6  tunnel source Loopback 192
7  tunnel destination
8  !
9  !
10 interface Serial 1/0
11  ip address
12  encapsulation frame-relay
13  serial restart-delay 0
14  frame-relay interface-dlci 102
15  frame-relay lmi-type ansi
16 !
17 !
18 router eigrp 1
19  redistribute connected
20  network
21  default-metric 1500 100 128 128 4700
22  no auto-summary
23 ip classless
24 ip route

Figure provides the diagnostic output on Router_A that confirms that the GRE tunnel and EIGRP adjacency failures are due to the recursive routing behavior described above. Lines 10 and 18 show the presence of the static route that was used to build the GRE tunnel prior to the occurrence of recursive routing. However, line 18 of Figure also shows that there is a recursively learned EIGRP route to the tunnel destination over the tunnel interface itself (Tunnel192). Because this route is more specific than the default route, EIGRP installs it in the routing table, causing the tunnel to fail due to recursive routing (as confirmed in Figure, lines 20 and 21). Once the tunnel fails, EIGRP tears down the adjacency, as illustrated in Figure, line 22.

Recursive Routing Symptoms and Diagnostic Output in Cisco IOS

1  Router_A#show ip route
2  Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
3         D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
4         N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
5         E1 - OSPF external type 1, E2 - OSPF external type 2
6         i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
7         ia - IS-IS inter area, * - candidate default, U - per-user static route
8         o - ODR, P - periodic downloaded static route
10 Gateway of last resort is to network
12 is subnetted, 1 subnets
13 C is directly connected, Ethernet0/0
14 is subnetted, 1 subnets
15 C is directly connected, Tunnel192
16 is subnetted, 2 subnets
17 C is directly connected, Loopback192
18 D [90/297372416] via, 00:00:04, Tunnel192
18 S* [1/0] via
19 Router_A#
20 *May 8 15:42:10.439: %TUN-5-RECURDOWN: Tunnel192 temporarily disabled due to recursive
21 *May 8 15:42:11.439: %LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel192,
     changed state to down
22 *May 8 15:42:11.447: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 100: Neighbor
(Tunnel192) is down: interface down

In order to prevent this behavior, the administrators of Router_A must ensure that the route to the GRE tunnel destination is not learned over the GRE tunnel itself. This can be accomplished by excluding the routes used to establish the GRE tunnel from being redistributed into EIGRP using either a route-map or a distribute list. Figure provides an example of how to accomplish this using an EIGRP route-map.

Controlling EIGRP Route Redistribution with Route-Maps and ACLs

access-list 1 deny
access-list 1 permit any
route-map no-recursive permit 10
 match ip address 1

Figure provides an alternative to excluding routes from the EIGRP redistribution using a distribute list. In the context of this particular example, the configuration in Figure will accomplish the same task as the route-map configuration in Figure.

Controlling EIGRP Route Redistribution with Distribute Lists

router eigrp 1
 redistribute connected
 default-metric 1500 100 128 128 4700
 distribute-list 1 in
 no auto-summary
access-list 1 deny
access-list 1 permit any
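
The reason the final permit any entry matters can be seen by modeling how an access list is evaluated. The Python sketch below is a simplification with hypothetical names and documentation prefixes: it treats each entry as a prefix match, whereas IOS standard access lists match on an address and wildcard mask, but the first-match and implicit-deny behavior is the same.

```python
from __future__ import annotations

import ipaddress


def acl_permits(acl: list[tuple[str, str]], prefix: str) -> bool:
    """First matching entry wins; anything unmatched hits the implicit deny."""
    network = ipaddress.ip_network(prefix)
    for action, match in acl:
        if match == "any" or network.subnet_of(ipaddress.ip_network(match)):
            return action == "permit"
    return False  # implicit deny at the end of every access list


# Hypothetical stand-in for the tunnel endpoint prefix that must be excluded.
acl = [("deny", "192.0.2.0/24"), ("permit", "any")]

for route in ("192.0.2.0/24", "198.51.100.0/24"):
    print(route, acl_permits(acl, route))
```

With the permit any entry removed, the second route would also be rejected by the implicit deny, and no routes at all would pass the filter.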

Figure shows an example of EIGRP and GRE tunnel reestablishment once the appropriate route-map and ACL combination have been applied to the EIGRP routing protocol process using the redistribute connected route-map [route-map-name] command.

GRE Tunnel and EIGRP Adjacency Reestablishment when Excluding Redistribution of Recursively-Learned Routes Using Route-Maps and ACLs

router eigrp 1
 redistribute connected
 default-metric 1500 100 128 128 4700
 no auto-summary
crvpn-3600-a#conf t
Enter configuration commands, one per line. End with CNTL/Z.
crvpn-3600-a(config)#router eigrp 1
crvpn-3600-a(config-router)#redistribute connected route-map no-recursive
*May 8 16:27:44.191: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor (Tunnel192)
is up: new adjacency
