Hyper-V cluster backup causes virtual machine reboots for common Cluster Shared Volume members.


I am having a problem with VMs rebooting while other VMs that share the same CSV are being backed up. I have provided the information I have gathered point by point below. If I have missed anything, please let me know.

My Hyper-V cluster configuration:
5-node cluster running 2008 R2 Core Datacenter w/SP1. Updates released through WSUS are installed on the Core installation.
Each node has 8 NICs configured as follows:
 NIC1 - management/campus access (26.x VLAN)
 NIC2 - iSCSI dedicated (22.x VLAN)
 NIC3 - live migration (28.x VLAN)
 NIC4 - heartbeat (20.x VLAN)
 NIC5 - vSwitch (26.x VLAN)
 NIC6 - vSwitch (18.x VLAN)
 NIC7 - vSwitch (27.x VLAN)
 NIC8 - vSwitch (22.x VLAN)
The following hotfixes were additionally installed per MS guidance (either during the original build or while troubleshooting a stability issue in Jan 2013):
 KB2531907 - installed during original building of the cluster
 KB2705759 - installed during troubleshooting in Jan 2013
 KB2684681 - installed during troubleshooting in Jan 2013
 KB2685891 - installed during troubleshooting in Jan 2013
 KB2639032 - installed during troubleshooting in Jan 2013
The original cluster was built with 2 hosts and a quorum drive. The initial 2 hosts were HST1 and HST5.
The next hosts added were HST3, HST6, and HST2.
Note: the HST4 hardware was used in a different project, and HST6 has since become HST4.
Validation of the cluster comes back with warnings on the following things:
 Updates are inconsistent across hosts
  I have tried to manually install the "missing" updates, but they were not applicable
  The most likely cause is the different build times of each machine in the cluster
   HST1 and HST5 are both at the same level because they were built at the same time
   HST3 was not rebuilt from scratch due to time constraints, so it goes back to pre-SP1 and has a larger list of updates the others lack, hence the inconsistency
   HST6 was built from scratch but has more updates missing than 1 or 5 (10 missing instead of 7)
   HST2 was built later and also has missing updates (15)
 Storage - list potential cluster disks
  It says there are persistent reservations on 14 of the CSV volumes and thinks they belong to a cluster.
  They are removed from the validation set for that reason. These iSCSI volumes/disks were created new for
  this cluster and have never been part of any other cluster.
 When I run the cluster validation wizard, I get a slew of event ID 5120 from FailoverClustering. The wording of the error:
  Cluster Shared Volume 'Volume12' ('Cluster Disk 13') is no longer available on this node because of
  'STATUS_MEDIA_WRITE_PROTECTED(c00000a2)'. All I/O will temporarily be queued until a path to the
  volume is reestablished.
 Under Storage and Cluster Shared Volumes in Failover Cluster Manager, all disks show online and there is no negative effect from these errors.
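For a quick double-check of the same CSV state without the GUI, the failover clustering cmdlets that ship with 2008 R2 can be used from PowerShell on any node. A minimal sketch (cmdlet names are from the in-box FailoverClusters module):

```powershell
# Load the failover clustering cmdlets (in-box on Server 2008 R2)
Import-Module FailoverClusters

# Show each Cluster Shared Volume, its state, and the current owner node
Get-ClusterSharedVolume | Format-Table Name, State, OwnerNode -AutoSize
```

All 14 CSVs should report State "Online" here if Failover Cluster Manager is showing them healthy.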
Cluster Shared Volumes:
 We have 14 CSVs, iSCSI attached to all 5 hosts, housed on an HP P4500 G2 (LeftHand) SAN.
 I have limited the number of VMs to no more than 7 per CSV per the best practices documentation from HP/LeftHand.
 The VMs in each CSV are spread out amongst the 5 hosts (as you would expect).
The backup software we use is BackupChain (backupchain.com).

The problem we are having:
 When a backup kicks off for a VM, all the VMs on the same CSV reboot without warning. This happens within seconds of the backup starting.

What we have done to troubleshoot this:
 We have tried rebalancing our backups
  Originally, we had backup jobs scheduled to kick off on Friday or Saturday evening after 9pm, with
  2 or 3 hosts backing up VMs (serially; 1 VM per host at a time) each night.
  I changed the backup schedule of the 90 VMs so that only 1 per CSV is backing up at the same time
   I mapped out the hosts and CSVs and scheduled the backups to run on week nights so that each night there
   is only 1 VM backed up per CSV. Most VMs can be backed up over the 5 nights (there are VMs that don't
   get backed up). I staggered the start times for each host so that only 1 host is starting
   in the same timeframe. There is some overlap where hosts had backups that ran longer than 1 hour.
  Testing the new schedule did not fix the problem. It just made it more clear: as each backup timeframe
  started, whichever CSV the first VM to start was on would have all of its VMs reboot and come back up.
 I thought maybe we were still overloading the network, so I decided to disable all of the scheduled backups
 and run them manually. Kicking off a backup on a single VM would, in most cases, cause the reboot of the common
 CSV members.
 OK, maybe there is something wrong with the backup software.
  I downloaded a demo of Veeam and installed it onto the cluster.
  I did a test backup of 1 VM and had no problems.
  I did a test backup of a second VM and had the same problem. All the VMs on the same CSV rebooted.
 OK, so it's not the backup software. Apparently it's VSS. I have looked through various websites. The best troubleshooting
 site I have found for VSS, all in 1 place, is on backupchain.com (http://backupchain.com/hyper-v-backup/troubleshooting.html).
 I have tested every process on their list and lay out the results below:
  1. I have rebooted HST6 and the problems still persist
  2. When I run vssadmin delete shadows /all, I have no shadows to delete on any of the 5 nodes
   When I run vssadmin list writers, I have no error messages on any writers on any node
  3. When I check the listed registry key, I have only the built-in MS VSS writer listed (I am using software VSS)
  4. When I run the vssadmin resize shadowstorage command, there is no shadow storage on any node
  5. I have completed the registration and service cycling on HST6 as laid out there, and most of the stuff "errors" -
   only a few of the DLLs register.
  6. The Hyper-V Integration Services were reconciled when we worked with MS in January, and I have no indication of
   any further issue here.
  7. I did not complete the step to delete subscriptions because, again, I have no error messages when I list writers
  8. I removed the Veeam software I had installed to test (it hadn't added a VSS writer anyway though)
  9. I can't realistically uninstall Hyper-V and test VSS
  10. I have the latest SPs and updates
  11. As part of step 5 I did this. It seems to be a rehash of various other strategies
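For reference, the checks in steps 2 and 4 above were run from an elevated prompt on each node. A minimal sketch of the standard vssadmin commands involved:

```powershell
# Check every VSS writer's state; the Hyper-V writer should show "Stable" with no last error
vssadmin list writers

# Enumerate any existing shadow copies, then clear orphaned ones
vssadmin list shadows
vssadmin delete shadows /all

# Inspect shadow copy storage associations (in our case, none were present on any node)
vssadmin list shadowstorage
```

Running these on each of the 5 nodes in turn is the quickest way to confirm whether the failed-writer state is cluster-wide or isolated to one host.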
 I have used the VSS troubleshooter that is part of BackupChain (Ctrl-T) and get the following error:
  ERROR: Selected writer 'Microsoft Hyper-V VSS Writer' is in failed state!
  - Status: 8 (VSS_WS_FAILED_AT_PREPARE_SNAPSHOT)
  - Writer Failure code: 0x800423f0 (<Unknown error code>)
  - Writer ID: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
  - Instance ID: {d55b6934-1c8d-46ab-a43f-4f997f18dc71}
  VSS snapshot creation failed with result: 8000ffff

VSS errors in Event Viewer. Below are representative errors I have received from various nodes of the cluster.
I have various of the below spread out on all hosts except HST6:
Source: volsnap, Event ID 10, The shadow copy of volume took too long to install
Source: volsnap, Event ID 16, The shadow copies of volume X were aborted because volume Y, which contains shadow copy storage for this shadow copy, was force dismounted.
Source: volsnap, Event ID 27, The shadow copies of volume X were aborted during detection because a critical control file could not be opened.
I have 1 instance of each of these, and both of the below are from HST3:
Source: VSS, Event ID 12293, Volume Shadow Copy Service error: Error calling a routine on a Shadow Copy Provider {b5946137-7b9f-4925-af80-51abd60b20d5}. Routine details RevertToSnapshot [hr = 0x80042302, A Volume Shadow Copy Service component encountered an unexpected error].
Source: VSS, Event ID 8193, Volume Shadow Copy Service error: Unexpected error calling routine GetOverlappedResult. hr = 0x80070057, The parameter is incorrect.
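Since the nodes are Core installations, pulling these events without a remote Event Viewer session is easiest with wevtutil. A sketch (the query strings are standard XPath filters; volsnap, being a driver, logs to the System log, while VSS logs to Application):

```powershell
# 10 most recent VSS events from the Application log, newest first, as text
wevtutil qe Application /c:10 /rd:true /f:text /q:"*[System[Provider[@Name='VSS']]]"

# 10 most recent volsnap events from the System log
wevtutil qe System /c:10 /rd:true /f:text /q:"*[System[Provider[@Name='volsnap']]]"
```

Running this on each node right after a failed backup makes it easy to correlate the writer failure with the volsnap aborts.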

So, basically, everything we have tried has resulted in no success towards solving this problem.

I appreciate any assistance that can be provided.

Thanks,

Charles J. Palmer

Wright Flood

Long read. <grin> I'm probably going to miss some things in my reply.

First, I don't see that you have specified a network for CSV, which is a recommended configuration from Microsoft. As a result, your CSV traffic may be ending up on a network that does not have the bandwidth for it. Fortunately, it looks like you have a dedicated network for heartbeat. You don't really need to dedicate a network for that, as cluster communication traffic is pretty low. So, if it were me, I would add the CSV traffic to that network. Here are a couple of PowerShell commands to configure the cluster metrics for this (obviously replace the network names with your own).

$clstr = Read-Host "Enter the name of the cluster"
Get-ClusterNetwork -Cluster $clstr | ft Name, Role, Metric
(Get-ClusterNetwork "CSV" -Cluster $clstr).Metric = 800
(Get-ClusterNetwork "LiveMigration" -Cluster $clstr).Metric = 900
Get-ClusterNetwork -Cluster $clstr | ft Name, Role, Metric

The other thing is to ensure that cluster communications are running on at least 1 other network. Most people have cluster communications running on both the management network and the CSV network. This provides fault tolerance for the cluster communication network.

Next, I see you have a single NIC for iSCSI. The better practice is to have at least 2 NICs for storage access and set them up with MPIO. Is that causing your problem? I don't know, but if your CSV network happens to be on the iSCSI network, and you are running production and backup off a single network, who knows?
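If you do add a second iSCSI NIC, the Core-friendly way to get MPIO going is roughly as follows (dism and mpclaim are in-box on 2008 R2; the device string and flags are worth verifying against your build before running):

```powershell
# Enable the MPIO feature on a Server Core installation
dism /online /enable-feature /featurename:MultipathIo

# Claim all attached iSCSI devices for MPIO; -r schedules the reboot MPIO requires
mpclaim -r -i -d "MSFT2005iSCSIBusType_0x9"

# After the reboot, confirm the disks are being multipathed
mpclaim -s -d
```

From there you would configure a load-balancing policy per LUN; the LeftHand DSM, if you install it, may handle that for you.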

To address your concerns about the cluster validation warnings ... it is not uncommon to see a 'patch level mismatch' in a cluster that has been around awhile with different nodes being built at different times. For example, if you have a node that was patched before SP1 was applied, and compare it to a node that had SP1 applied before being patched, you can pretty much guarantee they will vary. It looks like you have done what you should in this case. In regards to the storage errors, these happen when you try to run the storage tests on an operational cluster that owns the storage being tested. Of course there are persistent reservations - the cluster has them. The storage tests can only run successfully on storage that is not claimed.

Finally, the backup problem. Have you discussed this with BackupChain? Is their backup software certified for use with CSV volumes? Not all backup software works with CSV. Besides, if they certified it for use with CSV, they are the best place to start on getting these answers.


.:|:.:|:. tim


