
PowerHA SystemMirror

PowerHA Known Fix Info

General

PowerHA command PATH

export PATH=$PATH:/usr/es/sbin/cluster:/usr/es/sbin/cluster/utilities:/usr/es/sbin/cluster/sbin:/usr/es/sbin/cluster/cspoc

Basic two-node cluster configuration

1. Configure poll_uplink

(Done on both nodes) On any network interface that is part of the PowerHA cluster, set the poll_uplink attribute to yes. The -P flag applies the change to the ODM only, so it takes effect at the next reboot.

chdev -l ent0 -a poll_uplink=yes -P

2. Configure /etc/hosts

(Done on both nodes) Configure the /etc/hosts file with all cluster member IP addresses, and any service IP addresses.

hostent -a <ip_address> -h "host1"
hostent -a <ip_address> -h "host2"
hostent -a <ip_address> -h "service-vip"
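Once /etc/hosts is populated, it's worth confirming that every expected name is actually present in the file on each node before continuing. A minimal portable sketch (the file path and hostnames below are the placeholders from this example, not fixed values):

```shell
#!/bin/sh
# check_hosts: verify that every listed hostname appears in a hosts file.
# $1 = hosts file, remaining args = hostnames to check.
# Returns non-zero if any name is missing, so it can gate later steps.
check_hosts() {
    _file="$1"; shift
    _rc=0
    for _name in "$@"; do
        # look for the name as a whole word in any name column (field 2+),
        # skipping comment lines
        if awk -v n="$_name" \
            '$1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
             END { exit !found }' "$_file"; then
            echo "OK: $_name"
        else
            echo "MISSING: $_name"
            _rc=1
        fi
    done
    return $_rc
}

# Demo against a sample file; on a cluster node you would pass /etc/hosts
cat > /tmp/hosts.sample <<'EOF'
10.1.1.10   host1
10.1.1.11   host2
10.1.1.20   service-vip
EOF
check_hosts /tmp/hosts.sample host1 host2 service-vip
```

Running it with a name that is absent prints MISSING and returns non-zero.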

3. Configure /usr/es/sbin/cluster/netmon.cf

(Done on both nodes) The IP address used below in /usr/es/sbin/cluster/netmon.cf is the persistent IP address of the cluster node.

# cat /usr/es/sbin/cluster/netmon.cf
!REQD <ip_address> <netmask>

4. Configure /etc/cluster/rhosts

(Done on both nodes)

# cat /etc/cluster/rhosts
host1
host2

Refresh the clcomd subsystem.

stopsrc -s clcomd; sleep 5; startsrc -s clcomd

5. Create two-node cluster

(Done on one node)

# clmgr add cluster kristian_cluster NODES="host1 host2"

6. Define a primary and secondary repository disk

(Done on one node) The cluster repository disks can be 1GB in size.

clmgr modify cluster kristian_cluster REPOSITORY=hdisk1,hdisk2
clmgr query repository

7. Configure SNMP

(Done on both nodes) In /etc/snmpdv3.conf

Uncomment the following line.

COMMUNITY public    public     noAuthNoPriv 0.0.0.0     0.0.0.0         -

Add the following lines.

# PowerHA SystemMirror
VACM_VIEW defaultView        1.3.6.1.4.1.2.3.1.2.1.5    - included -

8. Configure DR policy

(Done on one node)

clmgr modify cluster CAA_AUTO_START_DR=disabled

9. Configure LPM policy

(Done on one node) During a Live Partition Mobility operation, the setting below places the resource groups into an unmanaged state.

clmgr modify cluster LPM_POLICY=unmanage

10. Synchronize the cluster

(Done on one node)

clmgr sync cluster FIX=yes

11. Define tie-breaker disk

(Done on one node) The cluster tie-breaker disk can be 1GB in size. Once defined, synchronize the cluster again.

clmgr modify cluster kristian_cluster SPLIT_POLICY=tiebreaker MERGE_POLICY=tiebreaker TIEBREAKER=hdisk7
clmgr sync cluster FIX=yes
clmgr view cluster SPLIT-MERGE

You can ignore the message about having to create a stretched or linked cluster.

12. Cluster start-up

When starting cluster services on the node(s), the options provided are "now", "restart", or "both". Always select "now"; we don't want PowerHA cluster services coming online during a server restart. This makes troubleshooting easier if the server does restart, and gives us further control over the start-up sequence of the AIX host. If you've accidentally selected "restart" or "both", an entry is added to /etc/inittab so that PowerHA cluster services start at boot. You can remove this entry using the rmitab command.

# lsitab hacmp6000
hacmp6000:2:wait:/usr/es/sbin/cluster/etc/rc.cluster -boot -i  -A   # Bring up Cluster
# rmitab hacmp6000
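The check above can be scripted as well; a small sketch that detects the entry in an inittab-format file (fed a sample copy here, so it can be exercised anywhere; on a live node you would pass /etc/inittab, or simply rely on lsitab):

```shell
#!/bin/sh
# has_hacmp_boot_entry: true if the given inittab-format file contains the
# PowerHA boot entry (identifier hacmp6000), i.e. cluster services would
# be started automatically at boot.
has_hacmp_boot_entry() {
    grep -q '^hacmp6000:' "$1"
}

# Demo with a sample inittab fragment
cat > /tmp/inittab.sample <<'EOF'
init:2:initdefault:
hacmp6000:2:wait:/usr/es/sbin/cluster/etc/rc.cluster -boot -i  -A   # Bring up Cluster
EOF
if has_hacmp_boot_entry /tmp/inittab.sample; then
    echo "boot entry present - remove it with: rmitab hacmp6000"
fi
```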

13. Service IP Distribution Preference

If the service IP address will reside on the same interface as the host's boot/persistent IP address, the distribution preference needs to be set to "Disable Firstalias". This ensures the boot/persistent IP address appears before the service IP in the routing table, and allows authentication back to NIM for operations like mksysb backups.

# clmgr show network
net_ether_01
# clmgr modify network net_ether_01 RESOURCE_DIST_PREF="NOALI"

14. File collections

PowerHA can keep files synchronized between cluster member nodes. It's good practice to place any application monitor/restart/cleanup scripts into a file collection, so that any changes to the scripts are propagated to all cluster nodes. PowerHA automatically adds any configured application monitor/restart/cleanup scripts to the HACMP_Files file collection. Ensure that SYNC_WITH_CLUSTER and SYNC_WHEN_CHANGED are set to "true".

clmgr modify file_collection HACMP_Files SYNC_WITH_CLUSTER=true SYNC_WHEN_CHANGED=true
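As an illustration of the kind of monitor script that typically lives in HACMP_Files, here is a minimal PID-file-based health check. The PID-file path and behaviour are hypothetical, not a PowerHA API; PowerHA simply runs a custom monitor script and treats a non-zero exit code as an application failure.

```shell
#!/bin/sh
# app_monitor: health check in the style of a PowerHA custom application
# monitor. Returns 0 if the PID recorded in the given PID file refers to
# a live process, non-zero otherwise (which PowerHA treats as a failure).
app_monitor() {
    _pidfile="$1"
    [ -f "$_pidfile" ] || { echo "monitor: $_pidfile not found"; return 1; }
    _pid=$(cat "$_pidfile")
    # kill -0 delivers no signal; it only tests that the process exists
    if kill -0 "$_pid" 2>/dev/null; then
        echo "monitor: pid $_pid alive"
        return 0
    fi
    echo "monitor: pid $_pid not running"
    return 1
}

# Demo: our own shell's PID is certainly alive
echo $$ > /tmp/myapp.pid
app_monitor /tmp/myapp.pid
```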

15. Restart cluster nodes

shutdown -Fr

Cluster commands

Start/stop cluster

clmgr online cluster WHEN=now MANAGE=auto BROADCAST=true CLINFO=true FIX=yes
clmgr offline cluster WHEN=now MANAGE=offline BROADCAST=true

Cluster state

clstat
cldump

Check current cluster state and NodeID values

lssrc -ls clstrmgrES

Level of PowerHA running

halevel -s

Level of PowerHA running on all reachable cluster nodes

/usr/es/sbin/cluster/cspoc/cli_on_cluster -S halevel -s

Run command on remote cluster node

/usr/es/sbin/cluster/utilities/cl_rsh -n <node> <command>

Sync cluster configuration

clmgr sync cluster FIX=yes

Detailed Topology and Application Server configuration

cldisp

Show cluster tunables

clctrl -tune -L

Show heartbeat status

# lscluster -m
        Points of contact for node: 2
        -----------------------------------------------------------------------
        Interface     State  Protocol    Status     SRC_IP->DST_IP
        -----------------------------------------------------------------------
        sfwcom        UP     none         none   none
        tcpsock->02   UP     IPv4         none   1.2.3.4->1.2.3.5

# /usr/lib/cluster/clras sancomm_status
+-------------------------------------------------------------------+
|       NAME       |                 UUID                 | STATUS  |
+-------------------------------------------------------------------+
| host1            | ce5681e6-be5a-11e9-8006-1ad85f872615 |   UP    |
+-------------------------------------------------------------------+
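To spot a degraded point of contact quickly, the interface table from lscluster -m can be filtered. This is a sketch keyed on the column layout shown above, reading captured output from stdin (on a live node: lscluster -m | down_contacts):

```shell
#!/bin/sh
# down_contacts: print interface rows from `lscluster -m` style output
# whose State column (field 2) is not UP. Header and separator rows fall
# through because their second field is neither UP nor DOWN.
down_contacts() {
    awk '($2 == "UP" || $2 == "DOWN") && $2 != "UP" { print $1, $2 }'
}

# Demo with a captured sample
down_contacts <<'EOF'
Interface     State  Protocol    Status     SRC_IP->DST_IP
-----------------------------------------------------------------------
sfwcom        UP     none         none   none
tcpsock->02   DOWN   IPv4         none   1.2.3.4->1.2.3.5
EOF
```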

How to use the (RSCT) tb_break command to test tie-breaker disk reservation

With nodes online, use the AIX devrsrv command to check if either node is holding the disk reservation.

Show current state of tie-breaker disk: devrsrv -c query -l <tiebreaker_disk>

Example:

# devrsrv -c query -l hdisk11
Device Reservation State Information
==================================================
Device Name                     :  hdisk11
Device Open On Current Host?    :  NO
ODM Reservation Policy          :  PR EXCLUSIVE
ODM PR Key Value                :  1582870249224044282
Device Reservation State        :  NO RESERVE
Registered PR Keys              :  No Keys Registered
PR Capabilities Byte[2]         :  0x11 CRH   PTPL_C
PR Capabilities Byte[3]         :  0x81 PTPL_A
PR Types Supported              :  PR_WE_AR  PR_EA_RO  PR_WE_RO  PR_EA  PR_WE  PR_EA_AR

Then run the following tb_break commands to test whether the disk can accept a reservation.

With both nodes online do:

  1. Node A, to reserve the TB. It should succeed

    /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"

    node-A # /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_close /dev/hdisk11 3
    aixDisk_init: /dev/hdisk11 is sucessful
    Reserving tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_is_readable(fd=3): attempt to read
    aixDisk_is_readable(fd=3): is successful
    aixDisk_reserve(/dev/hdisk11, ext=0x2, fd=3) is granted
    tb_reserve status GRANTED(0) (errno=0)
    tb_break returns successfully
    
  2. Node B, try to reserve the TB. It should fail.

    /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"

    node-B # /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_init:  device /dev/hdisk11 owned by another
    aixDisk_init: /dev/hdisk11 is sucessful
    Reserving tie-breaker (DEVICE=/dev/hdisk11)
    * aixDisk_reserve(/dev/hdisk11, ext=0x2) denied because openx() failed. errno=16
    tb_reserve status DENIED(1) (errno=16)
    tb_break returns with exit-code=1
    
  3. Node A, try to release the TB. It should succeed

    /usr/sbin/rsct/bin/tb_break -v -u -t DISK "DEVICE=/dev/hdisk11"

    node-A # /usr/sbin/rsct/bin/tb_break -v -u -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_close /dev/hdisk11 3
    aixDisk_init: /dev/hdisk11 is sucessful
    Releasing tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_release(/dev/hdisk11, openx=0x0, fd=3) is successful
    aixDisk_close /dev/hdisk11 3
    tb_release status 0 (errno=0)
    tb_break returns successfully
    
  4. Node B, try to reserve the TB. It should succeed.

    /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"

    node-B # /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_close /dev/hdisk11 3
    aixDisk_init: /dev/hdisk11 is sucessful
    Reserving tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_is_readable(fd=3): attempt to read
    aixDisk_is_readable(fd=3): is successful
    aixDisk_reserve(/dev/hdisk11, ext=0x2, fd=3) is granted
    tb_reserve status GRANTED(0) (errno=0)
    tb_break returns successfully
    
  5. Node A, try to reserve the TB. It should fail.

    /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"

    node-A # /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_init:  device /dev/hdisk11 owned by another
    aixDisk_init: /dev/hdisk11 is sucessful
    Reserving tie-breaker (DEVICE=/dev/hdisk11)
    * aixDisk_reserve(/dev/hdisk11, ext=0x2) denied because openx() failed. errno=16
    tb_reserve status DENIED(1) (errno=16)
    tb_break returns with exit-code=1
    
  6. Node B, try to release the TB. It should succeed

    /usr/sbin/rsct/bin/tb_break -v -u -t DISK "DEVICE=/dev/hdisk11"

    # /usr/sbin/rsct/bin/tb_break -v -u -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_close /dev/hdisk11 3
    aixDisk_init: /dev/hdisk11 is sucessful
    Releasing tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_release(/dev/hdisk11, openx=0x0, fd=3) is successful
    aixDisk_close /dev/hdisk11 3
    tb_release status 0 (errno=0)
    tb_break returns successfully
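The six-step plan above lends itself to a small pass/fail harness. expect_rc below is a hypothetical helper (not part of RSCT or PowerHA); the tb_break invocations from the steps above are shown as comments, since they exist only on the cluster nodes, and the demo uses portable stand-ins:

```shell
#!/bin/sh
# expect_rc: run a command and check its exit code against the value the
# test plan expects, printing PASS or FAIL. Returns non-zero on FAIL so a
# wrapper script can stop at the first unexpected result.
expect_rc() {
    _want="$1"; shift
    "$@"
    _got=$?
    if [ "$_got" -eq "$_want" ]; then
        echo "PASS (rc=$_got): $*"
    else
        echo "FAIL (rc=$_got, expected $_want): $*"
        return 1
    fi
}

# On the cluster, each step becomes one call, run on the node indicated:
#   expect_rc 0 /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"  # 1. node A reserves
#   expect_rc 1 /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"  # 2. node B denied
#   expect_rc 0 /usr/sbin/rsct/bin/tb_break -v -u -t DISK "DEVICE=/dev/hdisk11"  # 3. node A releases

# Demo with portable stand-ins for the expected outcomes:
expect_rc 0 true
expect_rc 1 false
```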
    

Logs

List cluster logs

cllistlogs

Resource Groups commands

Resource group state and application status

clRGinfo
clRGinfo -m

Move a resource group to another node

clmgr move resource_group <rg_name> NODE=<node>

List all resource groups in a cluster

clmgr query resource_group

Show details of all resource groups

clshowres

Application commands

List the Application monitoring on a cluster

cllsappmon
cllsappmon <app_mon_name>

Troubleshooting

If caavg_private is not active

# lspv
hdisk0          00c8c7a77acb99fd                    rootvg          active
hdisk1          00c8c7a76ca0dfd1                    None
hdisk2          00cbf7e7aa7e2a90                    clustervg0001   concurrent
hdisk3          00cbf7e7aa7f4dd5                    caavg_private
# redefinevg -d hdisk3 caavg_private
0516-574 redefinevg:Volume Group name 00cbf7e700004c0000000158b374067b is already used.
# synclvodm caavg_private
0516-622 synclvodm: Warning, cannot write lv control block data.
# lspv
hdisk0          00c8c7a77acb99fd                    rootvg          active
hdisk1          00c8c7a76ca0dfd1                    None
hdisk2          00cbf7e7aa7e2a90                    clustervg0001   concurrent
hdisk3          00cbf7e7aa7f4dd5                    caavg_private   active

Examine PowerHA log files and produce report

Recent errors

/usr/es/sbin/cluster/clanalyze/clanalyze -a -o recent

Daemons and subsystems

/usr/es/sbin/cluster/clanalyze/clanalyze -v

All supported errors

/usr/es/sbin/cluster/clanalyze/clanalyze -a -o all

Clean repository disk if you want to reuse it

/usr/lib/cluster/clras clean hdisk1

Recreate CAA repository disk

Check that the repos disk still has a valid header on both nodes.

/usr/lib/cluster/clras dumprepos -r hdiskX

Try syncing the cluster while cluster services are down on both nodes. If that doesn't work, run the following against the repository disk.

clusterconf -vr hdiskX

If that doesn't work, try the procedure below.

Potential cluster crash

Warning: this procedure will cause the node to crash if cluster services have previously been started on it.

On Primary Node:

export CAA_FORCE_ENABLED=1
rmcluster -f -r hdiskX
rmdev -dl cluster0
shutdown -Fr

On Secondary Node:

export CAA_FORCE_ENABLED=1
rmcluster -f -r hdiskX
rmdev -dl cluster0
shutdown -Fr

On Primary Node:

clmgr sync cluster

AIX 7.3 to AIX 7.2 rollback

After a rollback from AIX 7.3 to AIX 7.2, the PowerHA cluster would not start, showing the following errors in syslog.log

# grep 'xcluster_create_common' /var/adm/syslog/syslog.log
Nov 14 15:55:23 caa:err|error unix: kcluster_syscalls.c        xcluster_create_common  1783    Invalid cluster level given.
Nov 14 22:41:56 caa:err|error unix: kcluster_syscalls.c        xcluster_create_common  1783    Invalid cluster level given.
Nov 17 14:08:40 caa:err|error unix: kcluster_syscalls.c        xcluster_create_common  1783    Invalid cluster level given.

This is because the migration from AIX 7.2 to AIX 7.3 updates CAA, which in turn updates the repository disk. Recreating the repository disk using the procedure above resolved the problem.

Incomplete cluster migration has been detected

Attempting to sync the cluster produces the following error.

clmgr: ERROR: an incomplete cluster migration has been detected. Cluster
       verification/synchronization cannot be performed until the migration is
       completed and all nodes are running the same version of the product.

Each cluster node is showing a different version. They should both be reporting '20'.

# odmget HACMPnode | grep -E 'name|version'
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node2"
        version = 19
        name = "node2"
        version = 19
        name = "node2"
        version = 19
        name = "node2"
        version = 19
        name = "node2"
        version = 19
        name = "node2"
        version = 19
        name = "node2"
        version = 19

node1# odmget HACMPcluster | grep version
        cluster_version = 19

node2# odmget HACMPcluster | grep version
        cluster_version = 19
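The mismatch stands out faster if the odmget output is reduced to unique node/version pairs. A portable awk sketch over captured output (on a node: odmget HACMPnode | node_versions):

```shell
#!/bin/sh
# node_versions: reduce `odmget HACMPnode` output to unique "node version"
# pairs so any version mismatch between nodes stands out immediately.
node_versions() {
    awk -F'"' '
        /name =/    { node = $2 }
        /version =/ { sub(/.*= */, "")
                      pair = node " " $0
                      if (!seen[pair]++) print pair }'
}

# Demo with a captured (abbreviated) sample
node_versions <<'EOF'
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node2"
        version = 19
EOF
```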

If you have exhausted all other options, you can attempt to resolve this by modifying the ODM entries, and then try to sync the cluster again.

  1. Backup the ODM /etc/objrepos on both nodes.

    # tar -cvf /tmp/etc.objrepos.tar /etc/objrepos
    
  2. On both nodes.

    # cp /etc/objrepos/HACMPcluster /etc/objrepos/HACMPcluster.backup
    # print -- "HACMPcluster:\tcluster_version = 20" | ODMDIR=/etc/objrepos odmchange -o HACMPcluster
    # odmget HACMPcluster | grep -i version
    
  3. On the node that was incorrectly reporting as 19 (in this example, node2).

    # print -- "HACMPnode:\tversion = 20" | ODMDIR=/etc/objrepos odmchange -o HACMPnode
    
  4. On the node that was incorrectly reporting as 19 (in this example, node2).

    # clmgr sync cluster FIX=yes
    
  5. Run the sync again on the other node (node1) to ensure it's still completing successfully.

    # clmgr sync cluster FIX=yes