2 Replies. Latest reply: Jul 19, 2017 8:59 PM by Wasfi Bounni

What happens when a primary SFED fails over to the standby one?

Wasfi Bounni

Hi

 

I mean, what happens from the perspective of the iSCSI initiator? Does a hiccup occur that requires a reboot of that host (the initiator), or is the failover seamless, without any service interruption? If it is seamless, how is that possible? The failover would have to be near instantaneous.

 

Kindly

Wasfi

  • Re: What happens when a primary SFED fails over to the standby one?
    Tom Tasker

    Hi Wasfi,

     

    An SFED HA failover looks identical to what is seen on a DC storage array.  During the initial iSCSI setup between the initiator (ESXi host) and the target (SFED), the SFED advertises all of the IPs available for connection.  This includes the IPs of both the active and the standby SFED.

     

    During normal running the ESXi host connects to the active IPs but is aware of the standby IPs, as any server in the DC would be.  Only when the active IPs become unavailable (a failure on the active SFED) does the ESXi host (initiator) start polling all the known target IPs to see which are alive.  This is when the SteelFusion service on the standby SFED fires up and advertises itself as the active SFED.

     

    During this SFED failover the ESXi host pauses I/O operations and holds them in its iSCSI buffers, on the assumption that the lost link will come back up shortly, which it should!

     

    Once the standby has become active, all normal storage operations resume, with the "new" iSCSI target IPs being those of the previous standby (now active) SFED.
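
    To make the sequence above concrete, here is a minimal toy model of the initiator side. It is only a sketch under assumptions: the class name, the IP addresses and the "alive" set are all made up for illustration, and a real initiator such as ESXi also applies timeouts and multipathing policies that are not modelled here.

```python
from collections import deque


class ToyInitiator:
    """Toy model of an iSCSI initiator that knows several target portals."""

    def __init__(self, portals, active):
        self.portals = list(portals)  # every IP advertised during iSCSI setup
        self.active = active          # portal currently used for I/O
        self.pending = deque()        # I/O held while no path is usable
        self.completed = []           # (io, portal) pairs that were served

    def submit(self, io, alive):
        """Queue an I/O, then try to flush the queue over a live portal."""
        self.pending.append(io)
        self.flush(alive)

    def flush(self, alive):
        if self.active not in alive:
            # Active path lost: poll every known portal for one that answers.
            self.active = next((p for p in self.portals if p in alive), None)
        if self.active is None:
            return  # nothing is alive yet, so keep holding the I/O
        while self.pending:
            self.completed.append((self.pending.popleft(), self.active))


# Hypothetical addresses: 10.0.0.1 is the active SFED, 10.0.0.2 the standby.
init = ToyInitiator(["10.0.0.1", "10.0.0.2"], active="10.0.0.1")
init.submit("write-1", alive={"10.0.0.1"})  # normal running
init.submit("write-2", alive=set())         # failover in progress: I/O is held
init.submit("write-3", alive={"10.0.0.2"})  # standby is now active: queue drains
```

    In this model "write-2" is not lost and the host is not rebooted; the write simply waits in the queue until a known portal answers again.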

     

    The SFED that failed, assuming it comes back cleanly, will resume in the role of standby and advertise itself as available to the ESXi host again.  The whole process can then be repeated.

     

    There is no automatic failback on the SFEDs, like there is on the SteelFusion Core, due to a number of technology and process factors.  The Blockstore (storage cache) needs to be synchronized between the SFEDs so that the SFED that was rebooting can catch up with the data changes that happened during its outage.  Failing back automatically could mean that new data doesn't make it across in time, resulting in data loss.  This is the "sync status" between the two SFEDs after a failure.

     

    Process-wise, if there is a problem with the failing SFED, then no matter how good HA and services are, an SFED that could "yo-yo" in availability would introduce timeout periods that could be difficult to recover from.  The decision is that if an SFED fails for whatever reason, that reason should be diagnosed and understood before the administrator manually fails services back, if that is needed at all.
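
    The failback rules described above can be summed up in a small sketch. This is a hypothetical model, not a SteelFusion API: the class, the field names and the host names are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ToyHAPair:
    """Hypothetical model of an HA pair; not a SteelFusion API."""
    active: str
    standby: str
    blockstore_in_sync: bool = True  # standby has caught up on all changes
    failure_diagnosed: bool = True   # the admin understands the last failure

    def fail_over(self):
        """Automatic: the standby takes over when the active peer fails."""
        self.active, self.standby = self.standby, self.active
        self.blockstore_in_sync = False  # the failed peer now lags behind
        self.failure_diagnosed = False

    def manual_failback(self):
        """Only ever triggered by an administrator, never automatically."""
        if not self.blockstore_in_sync:
            raise RuntimeError("blockstore not in sync; failback risks data loss")
        if not self.failure_diagnosed:
            raise RuntimeError("diagnose the failure before failing back")
        self.active, self.standby = self.standby, self.active


pair = ToyHAPair(active="sfed-a", standby="sfed-b")
pair.fail_over()                # sfed-b is now active
try:
    pair.manual_failback()      # refused: sfed-a has not resynchronized yet
except RuntimeError as err:
    refused = str(err)
pair.blockstore_in_sync = True  # resync finished
pair.failure_diagnosed = True   # root cause understood
pair.manual_failback()          # sfed-a is active again
```

    Note that nothing in the sketch calls manual_failback() on its own, which mirrors the point that failback is a deliberate administrator action.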

     

    Tom.