Troubleshooting HA clusters

To verify the cluster configuration – CLI

1. Log in to the CLI of each cluster unit.

You can use the console connection if you need to avoid the problem of units having the same IP address.

2. Enter the command get system status.

Look for the following information in the command output.

Current HA mode: a-a, master

The cluster units are operating as a cluster and you have connected to the primary unit.

Current HA mode: a-a, backup

The cluster units are operating as a cluster and you have connected to a subordinate unit.

Current HA mode: standalone

The cluster unit is not operating in HA mode.

3. Verify that the get system ha status command displays all of the cluster units.

4. Enter the get system ha command to verify that the HA configuration is correct and the same for each cluster unit.
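
If you collect the get system status output from several units (for example, over saved console sessions), the check in step 2 can be scripted. The following is a minimal Python sketch, not a Fortinet tool. It assumes the output contains a "Current HA mode" line in the form shown above; the function name and the sample text are hypothetical.

```python
import re

# Matches the "Current HA mode" line described in step 2, for example
# "Current HA mode: a-a, master" or "Current HA mode: standalone".
HA_MODE_RE = re.compile(r"^\s*Current HA mode:\s*(.+?)\s*$", re.MULTILINE)


def classify_ha_mode(status_output: str) -> str:
    """Return 'primary', 'subordinate', 'standalone', or 'unknown'."""
    match = HA_MODE_RE.search(status_output)
    if match is None:
        return "unknown"
    value = match.group(1).lower()
    if value.startswith("standalone"):
        return "standalone"
    # Clustered units report "<mode>, <role>"; the role follows the comma.
    _, _, role = value.partition(",")
    role = role.strip()
    if role == "master":
        return "primary"
    if role == "backup":
        return "subordinate"
    return "unknown"


# Hypothetical sample text; real output contains many more lines.
sample = "Hostname: unit-1\nCurrent HA mode: a-a, backup\n"
print(classify_ha_mode(sample))  # prints: subordinate
```

A result of "unknown" means the expected line was not found or had an unexpected role, so inspect the raw output by hand.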

 

To troubleshoot the cluster configuration – CLI

1. Try using the following command to re-enter the cluster password on each cluster unit in case you made an error typing the password when configuring one of the cluster units.

config system ha

set password <password>

end

2. Check that the correct interfaces of each cluster unit are connected.

Check the cables and interface LEDs.

Use the get hardware nic <interface_name> command to confirm that each interface is connected. If the interface is connected, the command output should contain a Link: up entry similar to the following:

get hardware nic port1

 

Link: up

 

If Link is down, re-verify the physical connection. Try replacing network cables or switches as required.
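
When many interfaces need checking, the Link entry can be tested from captured command output. The following is a minimal Python sketch under the assumption that the output contains a "Link: up" or "Link: down" line as shown above. The exact layout of get hardware nic output can differ between models, and the function name is hypothetical.

```python
import re

# Matches the Link entry described above, tolerating extra spaces
# around the colon.
LINK_RE = re.compile(r"^\s*Link\s*:\s*(\S+)", re.MULTILINE)


def link_is_up(nic_output: str) -> bool:
    """Return True only if the output contains a 'Link: up' entry."""
    match = LINK_RE.search(nic_output)
    return match is not None and match.group(1).lower() == "up"


# Hypothetical captured output, reduced to the one line of interest.
print(link_is_up("Link: up\n"))    # prints: True
print(link_is_up("Link: down\n"))  # prints: False
```

The function returns False when no Link entry is found, so missing or truncated output is treated the same as a down link.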

 

More troubleshooting information

Much of the information in this HA guide can be useful for troubleshooting HA clusters. Here are some links to sections with more information.

  • If sessions are lost after a failover you may need to change route-ttl to keep synchronized routes active longer. See Controlling how the FGCP synchronizes kernel routing table updates on page 1524.
  • To control which cluster unit becomes the primary unit, you can change the device priority and enable override. See An introduction to the FGCP on page 1310.
  • Changes made to a cluster can be lost if override is enabled. See An introduction to the FGCP on page 1310.
  • In some cases, age differences among cluster units result in the wrong cluster unit becoming the primary unit. For example, if a cluster unit set to a high priority reboots, that unit will have a lower age than other cluster units. You can resolve this problem by resetting the age of one or more cluster units. You can also adjust how sensitive the cluster is to age differences. This can be useful if large age differences cause problems. See An introduction to the FGCP on page 1310.
  • If one of the cluster units needs to be serviced or removed from the cluster for other reasons, you can do so without affecting the operation of the cluster. See Disconnecting a cluster unit from a cluster on page 1494.
  • The web-based manager and CLI will not allow you to configure HA if you have enabled FGSP HA. See FortiGate Session Life Support Protocol (FGSP) on page 1579.
  • The web-based manager and CLI will not allow you to configure HA if one or more FortiGate unit interfaces is configured as a PPTP or L2TP client.
  • The FGCP is compatible with DHCP and PPPoE but care should be taken when configuring a cluster that includes a FortiGate interface configured to get its IP address with DHCP or PPPoE. Fortinet recommends that you turn on DHCP or PPPoE addressing for an interface after the cluster has been configured. See An introduction to the FGCP on page 1310.
  • Some third-party network equipment may prevent HA heartbeat communication, resulting in a failure of the cluster or the creation of a split brain scenario. For example, some switches use packets with the same Ethertype as HA heartbeat packets for their own internal functions. When such a switch is used for HA heartbeat communication, it generates CRC errors and does not forward the heartbeat packets. See Heartbeat packet Ethertypes on page 1504.
  • Very busy clusters may not be able to send HA heartbeat packets quickly enough, also resulting in a split brain scenario. You may be able to resolve this problem by modifying HA heartbeat timing. See Modifying heartbeat timing on page 1505.
  • Very busy clusters may suffer performance reductions if session pickup is enabled. If possible you can disable this feature to improve performance. If you require session pickup for your cluster, several options are available for improving session pickup performance. See Improving session synchronization performance on page 1539.
  • If it takes longer than expected for a cluster to fail over, you can try changing how the primary unit sends gratuitous ARP packets. See Changing how the primary unit sends gratuitous ARP packets after a failover on page 1508.
  • You can also improve failover times by configuring the cluster for subsecond failover. See Subsecond failover on page 1534 and Failover performance on page 1550.
  • When you first put a FortiGate unit in HA mode you may lose connectivity to the unit. This occurs because HA changes the MAC addresses of all FortiGate unit interfaces, including the one that you are connecting to. The cluster MAC addresses also change if you change some HA settings, such as the cluster group ID. The connection will be restored in a short time as your network and PC update to the new MAC address. To reconnect sooner, you can update the ARP table of your management PC by deleting the ARP table entry for the FortiGate unit (or just deleting all ARP table entries). You may be able to clear the ARP table of your management PC from a command prompt using a command similar to arp -d.
  • Since HA changes all cluster unit MAC addresses, if your network uses MAC address filtering you may have to make configuration changes to account for the HA MAC addresses.
  • A network may experience packet loss when two FortiGate HA clusters are deployed in the same broadcast domain, because the MAC addresses of the two clusters can conflict. You can diagnose the packet loss by pinging from one cluster to the other or by pinging both clusters from a device within the broadcast domain. You can resolve the MAC address conflict by changing the HA Group ID configuration of the two clusters. The HA Group ID is sometimes also called the Cluster ID. See Diagnosing packet loss with two FortiGate HA clusters in the same broadcast domain on page 1512.
  • The cluster CLI displays "slave is not in sync" messages if there is a synchronization problem between the primary unit and one or more subordinate units. See How to diagnose HA out of sync messages on page 1521.
  • If you have configured dynamic routing and the new primary unit takes too long to update its routing table after a failover you can configure graceful restart and also optimize how routing updates are synchronized. See Configuring graceful restart for dynamic routing failover on page 1523 and Controlling how the FGCP synchronizes kernel routing table updates on page 1524.
  • Some switches may not be able to detect that the primary unit has become a subordinate unit and will keep sending packets to the former primary unit. This can occur after a link failover if the switch does not detect the failure and does not clear its MAC forwarding table. See Updating MAC forwarding tables when a link failover occurs on page 1531.
  • If a link that is not directly connected to a cluster unit fails (for example, a link between a switch connected to a cluster interface and the network), you can enable remote link failover to maintain communication. See Remote link failover on page 1534.
  • If you find that some cluster units are not running the same firmware build you can reinstall the correct firmware build on the cluster to upgrade all cluster units to the same firmware build. See Synchronizing the firmware build running on a new cluster unit on page 1484.
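
The firmware check in the last item above can be scripted once you have noted the build reported by each unit. The following is a minimal Python sketch that groups units by build string so any odd one out is easy to spot. The unit names and build strings are hypothetical placeholders, and the sketch treats each build string as opaque text rather than assuming any particular FortiOS version format.

```python
from __future__ import annotations

from collections import defaultdict


def group_units_by_build(builds: dict[str, str]) -> dict[str, list[str]]:
    """Map each firmware build string to the sorted list of units running it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for unit, build in builds.items():
        groups[build.strip()].append(unit)
    return {build: sorted(units) for build, units in groups.items()}


def builds_match(builds: dict[str, str]) -> bool:
    """Return True when every unit reports the same build string."""
    return len(group_units_by_build(builds)) <= 1


# Hypothetical unit names and build strings, entered by hand.
reported = {"unit-a": "build1011", "unit-b": "build1011", "unit-c": "build1064"}
if not builds_match(reported):
    for build, units in group_units_by_build(reported).items():
        print(build, units)
```

If more than one group is printed, the units in the smaller group are the likely candidates for a firmware reinstall.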
This entry was posted in FortiOS 5.4 Handbook.
