What is a core group?
A core group encapsulates processes in a Network Deployment cell to create high availability domains.
A core group is a grouping of WebSphere Application Server cell processes. A core group can contain standalone servers, cluster members, node agents, and the deployment manager. A core group must contain at least one node agent or the deployment manager.
DefaultCoreGroup is the core group that is created by default at installation time. It can be used out of the box: all processes in the cell are placed in it, so they all know about each other.
1. A core group cannot extend beyond a cell
2. All JVMs in a core group must be able to communicate with each other (they exchange heartbeat messages to track one another)
Core group coordinator
Once the core group stabilizes at run time, one of its members is elected to act as the coordinator. That member, called the core group coordinator, is responsible for managing high availability within the core group:
1. It maintains all group information, such as the group name, the members, and the policy of the group
2. It maintains a record of the state of the group members as they start, stop, or fail
3. It assigns singleton services to group members and handles failover based on the specified policy
We can change the default core group coordinator by going to:
Servers > Core groups > Core group settings > DefaultCoreGroup > Preferred coordinator servers.
When a member becomes the active coordinator, you can see the following message in the SystemOut log:
[3/3/10 18:00:37:758 CET] 00000013 CoordinatorIm I HMGR0206I: The Coordinator is an Active Coordinator for core group DefaultCoreGroup.
If a member of the core group fails or is stopped, you can see the following messages:
[3/3/10 18:00:37:758 CET] 00000026 RoleMember W DCSV8104W: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: Removing member [Test-Cell\node02\server02] because the member was requested to be removed by member Test-Cell\node02\server01. Internal details VL suspects others: CC-Situation Normal
[3/3/10 18:00:38:176 CET] 00000023 VSyncAlgo1 I DCSV2004I: DCS Stack DefaultCoreGroup at Member Test-Cell\node01\server01: View synchronization completed successfully. The View Identifier is (22898:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:00:38:207 CET] 00000023 VSyncAlgo1 I DCSV2004I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: View synchronization completed successfully. The View Identifier is (331:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:00:38:537 CET] 00000024 CoordinatorIm I HMGR0218I: A new core group view has been installed. The core group is DefaultCoreGroup.
[3/3/10 18:00:39:228 CET] 00000026 DataStackMemb I DCSV8050I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: New view installed, identifier (332:0.Test-Cell\node02\server01), view size is 11 (AV=11, CD=12, CN=12, DF=12)
[3/3/10 18:00:39:343 CET] 00000021 DRSBuddyManag A CWWDR0006I: Replication instance terminated : Test-Cell\node02\server02
If a new member joins the core group, you can see the following messages:
[3/3/10 18:17:13:245 CET] 00000026 RoleMember I DCSV8051I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: Core group membership set changed. Added: [Test-Cell\node02\server02].
[3/3/10 18:17:13:315 CET] 00000023 MbuRmmAdapter I DCSV1032I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: Connected a defined member Test-Cell\node02\server02.
[3/3/10 18:17:30:337 CET] 00000023 VSyncAlgo1 I DCSV2004I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: View synchronization completed successfully. The View Identifier is (333:0.Test-Cell\node02\server01). The internal details are None.
[3/3/10 18:17:30:353 CET] 00000026 DataStackMemb I DCSV8050I: DCS Stack DefaultCoreGroup.TestRepln at Member Test-Cell\node01\server01: New view installed, identifier (334:0.Test-Cell\node02\server01), view size is 12 (AV=12, CD=12, CN=12, DF=12)
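When watching for membership changes, it can help to pull the view size out of the DCSV8050I lines automatically. The following is a minimal sketch, not an official parser: the regular expression is inferred only from the sample lines above, and the function name `parse_view_installed` is made up for illustration.

```python
import re

# Pattern inferred from the sample DCSV8050I lines above (an assumption,
# not a documented log format).
VIEW_INSTALLED = re.compile(
    r"DCSV8050I: DCS Stack (?P<stack>\S+) at Member (?P<member>\S+): "
    r"New view installed, identifier \((?P<view_id>[^)]+)\), "
    r"view size is (?P<size>\d+)"
)

def parse_view_installed(line):
    """Return stack, member, view id and view size, or None if no match."""
    match = VIEW_INSTALLED.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["size"] = int(info["size"])
    return info
```

Comparing the `size` value of consecutive views for the same stack shows whether a member left (the size drops, as in 12 to 11 above) or joined (the size grows).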
What happens when the coordinator goes down?
When the active coordinator is not available (stopped or crashed), the HA manager elects the next available server in the preferred coordinator servers list. If no preferred list is specified, it selects the server with the lexically lowest name.
The newly selected coordinator initiates a state rebuild by sending a message to all core group members to report their states.
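The selection rule above can be sketched in a few lines. This is only an illustration of the rule as described here, not the HA manager's actual implementation; the function name `elect_coordinator` and the server names in the usage example are hypothetical.

```python
def elect_coordinator(running_servers, preferred=()):
    """Pick a coordinator among running servers.

    Rule illustrated: take the first preferred server that is running;
    if none is (or no list is given), take the lexically lowest name.
    """
    running = set(running_servers)
    for name in preferred:
        if name in running:
            return name
    if not running:
        return None  # nobody left to elect
    return min(running)  # plain string comparison = lexical order


# Hypothetical usage: "dmgr" is preferred but down, so "server02" is chosen.
print(elect_coordinator(["server01", "server02"], preferred=["dmgr", "server02"]))
```

Without a preferred list, the same call would return `server01`, which is why it is usually worth naming preferred coordinators explicitly rather than relying on server names.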
Core group settings
Number of coordinators
Specifies the number of coordinators for this core group. The default value is one coordinator, although multiple coordinators are advisable for large core groups. All of the group data must fit in the memory of the allocated coordinators. In a large core group, a single coordinator can run out of memory, which can cause the system to work improperly.
Transport type
Channel framework is the default transport type. It uses the channel framework service to incorporate port reusability and shared port technology into the communication system.
Unicast is a targeted network model that focuses on a direct recipient for communication. This type of communication is most suitable when the intended message is sent to a specific set of recipients.
Multicast is a broadcast network model. It broadcasts communication across the defined network, depending on the values that are provided for the multicast settings. Multicast is suitable when there are many recipients for the intended message; otherwise, broadcast communication tends to overload the network with traffic and can hurt performance.
Channel chain name
Specifies the name of the channel chain if you select channel framework for the transport type.
If you select the Multicast transport type, the following settings apply:
Multicast port
The port setting tells the coordinator where to scan for transmissions. When setting this value, verify that the port is not used by another network communication device. A conflicting port value causes problems with your high availability manager infrastructure.
Multicast group IP start
Specify the starting Internet Protocol (IP) address of the intended communication area.
Multicast group IP end
Specify the ending IP address of the intended communication area. Plan the network to accommodate scalability.
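Before entering a start and end address, a quick sanity check of the range can catch typos. The sketch below uses Python's standard `ipaddress` module; the function name `check_multicast_range` and the example addresses are illustrative assumptions and are not values taken from WebSphere.

```python
import ipaddress

def check_multicast_range(start, end):
    """Validate a multicast start/end pair and return how many addresses it spans."""
    lo = ipaddress.ip_address(start)
    hi = ipaddress.ip_address(end)
    if lo.version != hi.version:
        raise ValueError("start and end must both be IPv4 or both be IPv6")
    if not (lo.is_multicast and hi.is_multicast):
        raise ValueError("both addresses must be multicast addresses")
    if int(lo) > int(hi):
        raise ValueError("start address must not be greater than end address")
    return int(hi) - int(lo) + 1


# Hypothetical range; pick addresses that fit your own network plan.
print(check_multicast_range("239.255.0.1", "239.255.0.10"))
```

A range that is too small leaves no room to grow, which is the scalability concern mentioned above.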
Additional Properties
Core group servers
Specifies the server processes that belong to the core group. Server processes include the deployment manager, node agents, application servers, and cluster members. You can use the panel that displays to move server processes to a different core group.
Policies
Use this to define the policies that determine which members of a high availability group are made active.
Preferred coordinator servers
Specifies which core group servers are preferred coordinator servers.
Core Group policies:
Servers > Core groups > Core group settings > New or existing core group > Policies.
|Policy|Description|
|---|---|
|All active|The All active policy indicates that the high availability manager keeps all of the application components that are running on all of the servers in the high availability group active at all times.|
|M of N|The M of N policy is similar to the One of N policy. However, it enables you to specify the number (M) of high availability group members that you want to keep active if it is possible to do so. The number of active members must be greater than one and less than or equal to the number of servers in the high availability group. If the number of active servers is set to one, this policy is a match for the One of N policy.|
|No operation|The No operation policy indicates that no high availability group members are made active.|
|One of N|The One of N policy keeps one member of the high availability group active at all times. This is used by groups that need singleton failover. If a failure occurs, the high availability manager starts the singleton on another server.|
|Static|The Static policy allows you to statically define or configure the active members of the high availability group.|
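To make the difference between the policies concrete, here is a simplified sketch of which members each policy would keep active. It models only the descriptions in the table; the function `active_members` and its parameters are hypothetical and do not correspond to any WebSphere API, and real behavior depends on further settings such as fail back and preferred servers only.

```python
def active_members(policy, running, m=1, preferred=(), static_members=()):
    """Return the members a policy would keep active (simplified model)."""
    running = list(running)
    # Consider preferred servers first, then the rest in the order given.
    ordered = [s for s in preferred if s in running]
    ordered += [s for s in running if s not in ordered]

    if policy == "All active":
        return ordered
    if policy == "M of N":
        return ordered[:m]
    if policy == "One of N":
        return ordered[:1]
    if policy == "No operation":
        return []
    if policy == "Static":
        # Only the statically configured members, and only if they are up.
        return [s for s in static_members if s in running]
    raise ValueError("unknown policy: %s" % policy)


servers = ["server01", "server02", "server03"]  # hypothetical names
print(active_members("One of N", servers, preferred=["server02"]))
print(active_members("M of N", servers, m=2, preferred=["server02"]))
```

Calling the function again with a shorter `running` list mimics a failure: with One of N, the singleton moves to the next server in the list.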
Match criteria
Specifies one or more name-value pairs that are used to associate this policy with a high availability group. These pairs must match attributes that are contained in the name of a high availability group before this policy is associated with that group.
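The matching rule can be pictured as a subset test: every name-value pair in the policy must also appear in the group name. The sketch below models a group name as a plain dictionary; the function `policy_matches` and the sample attribute names and values are illustrative assumptions only.

```python
def policy_matches(match_criteria, group_name):
    """True if every name-value pair of the policy appears in the group name."""
    return all(group_name.get(key) == value
               for key, value in match_criteria.items())


# Hypothetical group name attributes and policy criteria.
group_name = {"type": "WAS_TRANSACTIONS", "cluster": "cluster01"}
print(policy_matches({"type": "WAS_TRANSACTIONS"}, group_name))
print(policy_matches({"type": "WAS_TRANSACTIONS", "cluster": "other"}, group_name))
```

The first call matches because its single pair is present in the group name; the second does not, because one of its pairs has a different value.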
Core Group Policy settings
Is alive timer
Specifies the interval, in seconds, at which the high availability manager checks the health of the active group members that are governed by this policy. If a group member has failed, the server on which the group member resides is restarted.
Quorum
Specifies whether quorum checking is enabled for a group governed by this policy. Quorum is a mechanism that can be used to protect resources that are shared across members of the group in the event of a failure. The quorum mechanism is designed to work in conjunction with a hardware control facility that allows application servers to be shut down if a failure causes the group to be partitioned.
Note: The Quorum setting in the policy only has an effect if the following items are true:
* The group members are also cluster members.
* GroupName.WAS_CLUSTER=clustername must be specified as a property in the group name of any high availability group matching this policy.
Fail back
Specifies whether work items assigned to the failing server are moved to the server that is designated as the most preferred server for the group if a failure occurs. This field only applies for M of N and One of N policies.
Preferred servers only
Specifies whether group members are only activated on servers that are on the list of preferred servers for this group. This field only applies for M of N and One of N policies.
Core group servers:
Use this to move servers into a different core group. All members of a cluster must be in the same core group. If you select one or more members of a cluster, all of the members of that cluster must be moved.
Preferred coordinator servers:
Use Add and Remove to move servers into and out of the list of preferred servers. Use Move up and Move down to adjust the order within the list of preferred servers. Make sure that the most preferred server is at the top of the list and the least preferred server is at the bottom.
Core group member failure detection
The HA manager monitors all core group members. It uses two mechanisms to detect failures:
1. Active failure detection
If heartbeats from a JVM are missing for a specified interval of time, the JVM is marked as failed. With the default settings, heartbeats are sent every 10 seconds, and 20 consecutive heartbeats (200 seconds) must be missed before the JVM is marked as failed. When a JVM is marked as failed, a new view is installed, which you can see in the SystemOut log.
2. TCP Keep Alive
If one member cannot contact another member and gets a closed-socket error, it signals the other members to treat that member as failed. For example, if a JVM panics or there is a network issue, the failure is detected as soon as the TCP settings allow.
Note: TCP keep-alive is an operating system setting.
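The arithmetic behind active failure detection is simple enough to sketch. The defaults below are the figures quoted above (10 seconds, 20 missed heartbeats); the constant and function names are made up for illustration, and this models only the timing, not how the HA manager actually tracks heartbeats.

```python
HEARTBEAT_INTERVAL_S = 10        # interval quoted in the text above
MISSED_HEARTBEATS_ALLOWED = 20   # missed heartbeats quoted in the text above

def detection_window(interval_s=HEARTBEAT_INTERVAL_S,
                     missed=MISSED_HEARTBEATS_ALLOWED):
    """Seconds of silence after which a member is marked as failed."""
    return interval_s * missed

def is_failed(last_heartbeat_s, now_s,
              interval_s=HEARTBEAT_INTERVAL_S,
              missed=MISSED_HEARTBEATS_ALLOWED):
    """True once the silence since the last heartbeat reaches the window."""
    return now_s - last_heartbeat_s >= detection_window(interval_s, missed)


print(detection_window())   # 10 * 20 seconds
print(is_failed(0, 120))    # still inside the window
```

A window this long is why a closed-socket error, when the operating system reports one, usually reveals a failure well before missed heartbeats do.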