PowerHA SystemMirror

PowerHA Known Fix Info

General

PowerHA command PATH

export PATH=$PATH:/usr/es/sbin/cluster:/usr/es/sbin/cluster/utilities:/usr/es/sbin/cluster/sbin:/usr/es/sbin/cluster/cspoc

Basic two-node cluster configuration

1. Configure poll_uplink

On any network interface that is part of the PowerHA cluster, set the poll_uplink attribute to yes.

chdev -l ent0 -a poll_uplink=yes -P
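If several adapters are in play, the attribute can be set in a loop. A hedged sketch (the `lsdev -Cc adapter` output format is an assumption, and the chdev commands are printed for review rather than executed):

```shell
# Pick out ent adapters from `lsdev -Cc adapter` style output.
# On AIX: lsdev -Cc adapter | ent_adapters
ent_adapters() {
  awk '$1 ~ /^ent[0-9]+$/ {print $1}'
}

# Sample input stands in for real lsdev output; print, don't run, the commands.
printf 'ent0 Available Virtual I/O Ethernet Adapter\nent1 Available Virtual I/O Ethernet Adapter\nfcs0 Available FC Adapter\n' |
ent_adapters |
while read -r dev; do
  echo "chdev -l $dev -a poll_uplink=yes -P"  # -P applies at next reboot
done
```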

2. Configure /etc/hosts

(Done on both nodes) Configure the /etc/hosts file with all cluster member IP addresses, and any service IP addresses.

hostent -a <ip_address> -h "host1"
hostent -a <ip_address> -h "host2"
hostent -a <ip_address> -h "service-vip"

3. Configure /usr/es/sbin/cluster/netmon.cf

(Done on both nodes) The IP address used below in /usr/es/sbin/cluster/netmon.cf is the persistent IP address of the cluster node.

# cat /usr/es/sbin/cluster/netmon.cf
!REQD <ip_address> <netmask>

4. Configure /etc/cluster/rhosts

(Done on both nodes)

# cat /etc/cluster/rhosts
host1
host2

Refresh the subsystem clcomd

stopsrc -s clcomd; sleep 5; startsrc -s clcomd
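After the restart, it's worth confirming clcomd is back. A minimal sketch that parses `lssrc -s clcomd` style output (the column layout shown is an assumption based on typical SRC listings):

```shell
# Print the Status column for the clcomd subsystem;
# on a live node, feed it: lssrc -s clcomd
clcomd_status() {
  awk 'NR > 1 && $1 == "clcomd" {print $NF}'
}

# Demo against a captured sample:
printf 'Subsystem  Group  PID      Status\nclcomd     caa    1234567  active\n' | clcomd_status  # prints: active
```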

5. Create two-node cluster

(Done on one node)

# clmgr add cluster kristian_cluster NODES="host1 host2"

6. Define a primary and secondary repository disk

(Done on one node) The cluster repository disks can be 1GB in size.

clmgr modify cluster kristian_cluster REPOSITORY=hdisk1,hdisk2
clmgr query repository

7. Configure SNMP

(Done on both nodes) In /etc/snmpdv3.conf

Uncomment the following line.

COMMUNITY public    public     noAuthNoPriv 0.0.0.0     0.0.0.0         -

Add the following lines.

# PowerHA SystemMirror
VACM_VIEW defaultView        1.3.6.1.4.1.2.3.1.2.1.5    - included -

8. Configure DR policy

(Done on one node)

clmgr modify cluster CAA_AUTO_START_DR=disabled

9. Configure LPM policy

(Done on one node) During a Live Partition Mobility operation, the setting below places the resource groups into an unmanaged state.

clmgr modify cluster LPM_POLICY=unmanage

10. Synchronize the cluster

(Done on one node)

clmgr sync cluster FIX=yes

11. Define tie-breaker disk

(Done on one node) The cluster tie-breaker disk can be 1GB in size. Once defined, synchronize the cluster again.

clmgr modify cluster kristian_cluster SPLIT_POLICY=tiebreaker MERGE_POLICY=tiebreaker TIEBREAKER=hdisk7
clmgr sync cluster FIX=yes
clmgr view cluster SPLIT-MERGE

You can ignore the message about having to create a stretched or linked cluster.

12. Cluster start-up

When starting cluster services on the node(s), the options provided are "now", "restart", or "both". Always select "now", as we don't want PowerHA cluster services coming online during a server restart. This makes problems easier to troubleshoot should the server restart, and gives us further control over the start-up sequence of the AIX host. If you accidentally selected "restart" or "both", an entry is added to /etc/inittab so that PowerHA cluster services start at boot. You can remove this entry with the rmitab command.

# lsitab hacmp6000
hacmp6000:2:wait:/usr/es/sbin/cluster/etc/rc.cluster -boot -i  -A   # Bring up Cluster
# rmitab hacmp6000
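The check-and-remove step can be scripted. A hedged sketch that greps /etc/inittab directly (lsitab/rmitab are AIX-only, so the removal command is printed for review rather than executed):

```shell
# True if the file contains the PowerHA boot entry shown above.
has_hacmp_entry() {
  grep -q '^hacmp6000:' "${1:-/etc/inittab}" 2>/dev/null
}

if has_hacmp_entry; then
  echo "found hacmp6000 entry -- run: rmitab hacmp6000"
else
  echo "no hacmp6000 entry; nothing to remove"
fi
```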

13. Service IP Distribution Preference

If the service IP address will reside on the same interface as the host's boot/persistent IP address, the distribution preference needs to be set to "Disable Firstalias". This ensures the boot/persistent IP address appears before the service IP in the routing table, and allows authentication back to NIM for operations such as mksysb backups.

# clmgr show network
net_ether_01
# clmgr modify network net_ether_01 RESOURCE_DIST_PREF="NOALI"

14. File collections

PowerHA can keep files synchronized between cluster member nodes. It's good practice to configure any application monitor/restart/cleanup scripts into a file collection, so that changes to the scripts are propagated to all cluster nodes. PowerHA automatically adds any configured application monitor/restart/cleanup scripts to the HACMP_Files file collection. Ensure that SYNC_WITH_CLUSTER and SYNC_WHEN_CHANGED are set to "true".

clmgr modify file_collection HACMP_Files SYNC_WITH_CLUSTER=true SYNC_WHEN_CHANGED=true

15. Restart cluster nodes

shutdown -Fr

Cluster commands

Start/stop cluster

clmgr online cluster WHEN=now MANAGE=auto BROADCAST=true CLINFO=true FIX=yes
clmgr offline cluster WHEN=now MANAGE=offline BROADCAST=true

Cluster state

clstat
cldump

Check current cluster state and NodeID values

lssrc -ls clstrmgrES
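For scripting, the state line can be pulled out of the listing. A sketch assuming the usual "Current state: ST_STABLE" line in the `lssrc -ls clstrmgrES` output:

```shell
# Extract the cluster manager state; on a live node,
# feed it: lssrc -ls clstrmgrES
cluster_state() {
  awk -F': *' '/Current state:/ {print $2; exit}'
}

# Demo against a captured sample:
printf 'Current state: ST_STABLE\nsccsid = "..."\n' | cluster_state  # prints: ST_STABLE
```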

Level of PowerHA running

halevel -s

Level of PowerHA running on all reachable cluster nodes

/usr/es/sbin/cluster/cspoc/cli_on_cluster -S halevel -s

Run command on remote cluster node

/usr/es/sbin/cluster/utilities/cl_rsh -n <node> <command>

Sync cluster configuration

clmgr sync cluster FIX=yes

Detailed Topology and Application Server configuration

cldisp

Show cluster tunables

clctrl -tune -L

Show heartbeat status

# lscluster -m
        Points of contact for node: 2
        -----------------------------------------------------------------------
        Interface     State  Protocol    Status     SRC_IP->DST_IP
        -----------------------------------------------------------------------
        sfwcom        UP     none         none   none
        tcpsock->02   UP     IPv4         none   1.2.3.4->1.2.3.5

# /usr/lib/cluster/clras sancomm_status
+-------------------------------------------------------------------+
|       NAME       |                 UUID                 | STATUS  |
+-------------------------------------------------------------------+
| host1            | ce5681e6-be5a-11e9-8006-1ad85f872615 |   UP    |
+-------------------------------------------------------------------+

How to use the RSCT tb_break command to test the tie-breaker disk reservation

With nodes online, use the AIX devrsrv command to check if either node is holding the disk reservation.

Show current state of tie-breaker disk: devrsrv -c query -l <tiebreaker_disk>

Example:

# devrsrv -c query -l hdisk11
Device Reservation State Information
==================================================
Device Name                     :  hdisk11
Device Open On Current Host?    :  NO
ODM Reservation Policy          :  PR EXCLUSIVE
ODM PR Key Value                :  1582870249224044282
Device Reservation State        :  NO RESERVE
Registered PR Keys              :  No Keys Registered
PR Capabilities Byte[2]         :  0x11 CRH   PTPL_C
PR Capabilities Byte[3]         :  0x81 PTPL_A
PR Types Supported              :  PR_WE_AR  PR_EA_RO  PR_WE_RO  PR_EA  PR_WE  PR_EA_AR

Then use the following tb_break commands to test whether the disk can accept and release a reservation.

With both nodes online do:

  1. Node A, try to reserve the TB. It should succeed.

    /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"

    node-A # /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_close /dev/hdisk11 3
    aixDisk_init: /dev/hdisk11 is sucessful
    Reserving tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_is_readable(fd=3): attempt to read
    aixDisk_is_readable(fd=3): is successful
    aixDisk_reserve(/dev/hdisk11, ext=0x2, fd=3) is granted
    tb_reserve status GRANTED(0) (errno=0)
    tb_break returns successfully
    
  2. Node B, try to reserve the TB. It should fail.

    /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"

    node-B # /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_init:  device /dev/hdisk11 owned by another
    aixDisk_init: /dev/hdisk11 is sucessful
    Reserving tie-breaker (DEVICE=/dev/hdisk11)
    * aixDisk_reserve(/dev/hdisk11, ext=0x2) denied because openx() failed. errno=16
    tb_reserve status DENIED(1) (errno=16)
    tb_break returns with exit-code=1
    
  3. Node A, try to release the TB. It should succeed.

    /usr/sbin/rsct/bin/tb_break -v -u -t DISK "DEVICE=/dev/hdisk11"

    node-A # /usr/sbin/rsct/bin/tb_break -v -u -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_close /dev/hdisk11 3
    aixDisk_init: /dev/hdisk11 is sucessful
    Releasing tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_release(/dev/hdisk11, openx=0x0, fd=3) is successful
    aixDisk_close /dev/hdisk11 3
    tb_release status 0 (errno=0)
    tb_break returns successfully
    
  4. Node B, try to reserve the TB. It should succeed.

    /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"

    node-B # /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_close /dev/hdisk11 3
    aixDisk_init: /dev/hdisk11 is sucessful
    Reserving tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_is_readable(fd=3): attempt to read
    aixDisk_is_readable(fd=3): is successful
    aixDisk_reserve(/dev/hdisk11, ext=0x2, fd=3) is granted
    tb_reserve status GRANTED(0) (errno=0)
    tb_break returns successfully
    
  5. Node A, try to reserve the TB. It should fail.

    /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"

    node-A # /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_init:  device /dev/hdisk11 owned by another
    aixDisk_init: /dev/hdisk11 is sucessful
    Reserving tie-breaker (DEVICE=/dev/hdisk11)
    * aixDisk_reserve(/dev/hdisk11, ext=0x2) denied because openx() failed. errno=16
    tb_reserve status DENIED(1) (errno=16)
    tb_break returns with exit-code=1
    
  6. Node B, try to release the TB. It should succeed.

    /usr/sbin/rsct/bin/tb_break -v -u -t DISK "DEVICE=/dev/hdisk11"

    # /usr/sbin/rsct/bin/tb_break -v -u -t DISK "DEVICE=/dev/hdisk11"
    tb_break build version is 3.2.6.3 ((1.33))
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), Attempting to dlopen()
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tbInitModule)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), loading symbol(tblm_SCSIPR)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), invoking the init module
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), registering the function, rc=3 (good if rc>=0)
    loadTieBreakerModule (/opt/rsct/modules/tblm_SCSIPR.a), returns RC=3 (good if RC>=0)
    Initializing DISK tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_init: DEVICE=/dev/hdisk11
    token key:DEVICE val:/dev/hdisk11
    device=/dev/hdisk11
    aixDisk_close /dev/hdisk11 3
    aixDisk_init: /dev/hdisk11 is sucessful
    Releasing tie-breaker (DEVICE=/dev/hdisk11)
    aixDisk_release(/dev/hdisk11, openx=0x0, fd=3) is successful
    aixDisk_close /dev/hdisk11 3
    tb_release status 0 (errno=0)
    tb_break returns successfully
    
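The six steps above reduce to tb_break's exit code: 0 when the reservation is GRANTED, 1 when DENIED. A small hedged helper for scripting the test (only the exit-code mapping is shown here; the real binary is /usr/sbin/rsct/bin/tb_break):

```shell
# Map a tb_break exit code to the status printed in its output.
tb_result() {
  case "$1" in
    0) echo GRANTED ;;
    1) echo DENIED ;;
    *) echo "ERROR(rc=$1)" ;;
  esac
}

# Usage on a node (illustrative):
#   /usr/sbin/rsct/bin/tb_break -v -l -t DISK "DEVICE=/dev/hdisk11"
#   tb_result $?
```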

Logs

List cluster logs

cllistlogs

Resource Groups commands

Resource group state and application status

clRGinfo
clRGinfo -m

Move a resource group to another node

clmgr move resource_group <rg_name> NODE=<node>

List all resource groups in a cluster

clmgr query resource_group

Show details of all resource groups

clshowres

Application commands

List the Application monitoring on a cluster

cllsappmon
cllsappmon <app_mon_name>

Troubleshooting

If caavg_private is not active

# lspv
hdisk0          00c8c7a77acb99fd                    rootvg          active
hdisk1          00c8c7a76ca0dfd1                    None
hdisk2          00cbf7e7aa7e2a90                    clustervg0001   concurrent
hdisk3          00cbf7e7aa7f4dd5                    caavg_private
# redefinevg -d hdisk3 caavg_private
0516-574 redefinevg: Volume Group name 00cbf7e700004c0000000158b374067b is already used.
# synclvodm caavg_private
0516-622 synclvodm: Warning, cannot write lv control block data.
# lspv
hdisk0          00c8c7a77acb99fd                    rootvg          active
hdisk1          00c8c7a76ca0dfd1                    None
hdisk2          00cbf7e7aa7e2a90                    clustervg0001   concurrent
hdisk3          00cbf7e7aa7f4dd5                    caavg_private   active

Examine PowerHA log files and produce report

Recent errors

/usr/es/sbin/cluster/clanalyze/clanalyze -a -o recent

Daemons and subsystems

/usr/es/sbin/cluster/clanalyze/clanalyze -v

All supported errors

/usr/es/sbin/cluster/clanalyze/clanalyze -a -o all

Clean repository disk if you want to reuse it

/usr/lib/cluster/clras clean hdisk1

Recreate CAA repository disk

Check that the repos disk still has a valid header on both nodes.

/usr/lib/cluster/clras dumprepos -r hdiskX

Try syncing the cluster while cluster services are down on both nodes. If that doesn't work, run the following against the repository disk.

clusterconf -vr hdiskX

If that doesn't work, try the following procedure.

Potential cluster crash

Warning: this procedure will cause a node to crash if cluster services have previously been started on that node.

On Primary Node:

export CAA_FORCE_ENABLED=1
rmcluster -f -r hdiskX
rmdev -dl cluster0
shutdown -Fr

On Secondary Node:

export CAA_FORCE_ENABLED=1
rmcluster -f -r hdiskX
rmdev -dl cluster0
shutdown -Fr

On Primary Node:

/usr/es/sbin/cluster/utilities/cldare -rtv

Incomplete cluster migration has been detected

Attempting to sync the cluster produces the following error.

clmgr: ERROR: an incomplete cluster migration has been detected. Cluster
       verification/synchronization cannot be performed until the migration is
       completed and all nodes are running the same version of the product.

Each cluster node is reporting a different version; both should report '20'.

# odmget HACMPnode | grep -E 'name|version'
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node1"
        version = 20
        name = "node2"
        version = 19
        name = "node2"
        version = 19
        name = "node2"
        version = 19
        name = "node2"
        version = 19
        name = "node2"
        version = 19
        name = "node2"
        version = 19
        name = "node2"
        version = 19

node1# odmget HACMPcluster | grep version
        cluster_version = 19

node2# odmget HACMPcluster | grep version
        cluster_version = 19
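The repeated stanzas can be condensed to one line per node, which makes the mismatch obvious. A sketch that assumes the `name = "..."` / `version = N` pairing shown above:

```shell
# Summarize odmget HACMPnode output as "node version" pairs;
# on a live node, feed it: odmget HACMPnode | grep -E 'name|version'
summarize_versions() {
  awk -F'= *' '/name =/    {gsub(/[" ]/, "", $2); node = $2}
               /version =/ {seen[node " " $2 + 0] = 1}
               END         {for (k in seen) print k}' | sort
}

# Demo against a captured sample:
printf 'name = "node1"\nversion = 20\nname = "node2"\nversion = 19\n' | summarize_versions
```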

If you have exhausted all other options, you can attempt to resolve this by modifying the ODM entries, and then try to sync the cluster again.

  1. Backup the ODM /etc/objrepos on both nodes.

    # tar -cvf /tmp/etc.objrepos.tar /etc/objrepos
    
  2. On both nodes.

    # cp /etc/objrepos/HACMPcluster /etc/objrepos/HACMPcluster.backup
    # print -- "HACMPcluster:\tcluster_version = 20" | ODMDIR=/etc/objrepos odmchange -o HACMPcluster
    # odmget HACMPcluster | grep -i version
    
  3. On the node that was incorrectly reporting as 19 (in this example, node2).

    # print -- "HACMPnode:\tversion = 20" | ODMDIR=/etc/objrepos odmchange -o HACMPnode
    
  4. On the node that was incorrectly reporting as 19 (in this example, node2).

    # clmgr sync cluster FIX=yes
    
  5. Run the sync again on the other node (node1) to ensure it's still completing successfully.

    # clmgr sync cluster FIX=yes