Hyper-V cluster backup causes virtual machine reboots for common Cluster Shared Volume members.
I am having a problem with VMs rebooting while other VMs that share the same CSV are being backed up. I have provided the information I have gathered below. If I have missed anything, please let me know.
My Hyper-V cluster configuration:
5-node cluster running 2008 R2 Core Datacenter with SP1. Updates released through WSUS are installed on the Core installation.
Each node has 8 NICs configured as follows:
NIC1 - Management/campus access (26.x VLAN)
NIC2 - iSCSI dedicated (22.x VLAN)
NIC3 - Live Migration (28.x VLAN)
NIC4 - Heartbeat (20.x VLAN)
NIC5 - vSwitch (26.x VLAN)
NIC6 - vSwitch (18.x VLAN)
NIC7 - vSwitch (27.x VLAN)
NIC8 - vSwitch (22.x VLAN)
The following additional hotfixes were installed per Microsoft guidance (either during the original build or while troubleshooting a stability issue in January 2013):
KB2531907 - installed during the original build of the cluster
KB2705759 - installed during troubleshooting in January 2013
KB2684681 - installed during troubleshooting in January 2013
KB2685891 - installed during troubleshooting in January 2013
KB2639032 - installed during troubleshooting in January 2013
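If it helps, the hotfix inventory above can be checked across all the nodes in one pass. This is just a sketch using the in-box Get-HotFix cmdlet (node names taken from the post; it assumes remote WMI access is allowed from where you run it):

```powershell
# Check whether the listed hotfixes are installed on each node.
# Get-HotFix queries Win32_QuickFixEngineering over WMI; IDs that are
# not installed on a node are simply skipped here.
$kbs   = "KB2531907","KB2705759","KB2684681","KB2685891","KB2639032"
$nodes = "HST1","HST2","HST3","HST5","HST6"
foreach ($node in $nodes) {
    Get-HotFix -Id $kbs -ComputerName $node -ErrorAction SilentlyContinue |
        Select-Object CSName, HotFixID, InstalledOn
}
```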
The original cluster was built with 2 hosts and a quorum drive. The initial 2 hosts were HST1 and HST5.
The next hosts added were HST3, HST6, and HST2.
Note: the HST4 hardware was used in a different project, and HST6 has since become HST4.
Validation of the cluster comes back with warnings on the following things:
Updates inconsistent across hosts
I have tried to manually install the "missing" updates, but they were not applicable.
The most likely cause is the different build times of each machine in the cluster.
HST1 and HST5 are both at the same level because they were built at the same time.
HST3 was not rebuilt from scratch due to time constraints; it goes back to pre-SP1 and has a larger list of updates the others lack, hence the inconsistency.
HST6 was built from scratch but has more updates missing than 1 or 5 (10 missing instead of 7).
HST2 was built and has missing updates (15).
Storage - list potential cluster disks:
It says there are persistent reservations on 14 of the CSV volumes and thinks they belong to a cluster.
They were removed from the validation set for that reason. These iSCSI volumes/disks were created new
for this cluster and have never been part of any other cluster.
When I run the cluster validation wizard, I get a slew of Event ID 5120 from FailoverClustering. The wording of the error:
Cluster Shared Volume 'Volume12' ('Cluster Disk 13') is no longer available on this node because of
'STATUS_MEDIA_WRITE_PROTECTED(c00000a2)'. All I/O will temporarily be queued until a path to the
volume is reestablished.
Under Storage and Cluster Shared Volumes in Failover Cluster Manager, all disks show online and there is no visible negative effect from these errors.
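For what it's worth, CSV state and ownership can also be checked from PowerShell rather than the GUI. A minimal sketch with the FailoverClusters module, run from an elevated prompt on any node:

```powershell
# List each Cluster Shared Volume with its state and current owner node.
Import-Module FailoverClusters
Get-ClusterSharedVolume | Format-Table Name, State, OwnerNode -AutoSize
```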
Cluster Shared Volumes
We have 14 CSVs, iSCSI-attached to all 5 hosts, housed on an HP P4500 G2 (LeftHand) SAN.
I have limited the number of VMs to no more than 7 per CSV, per the best-practices documentation from HP/LeftHand.
The VMs in each CSV are spread out amongst the 5 hosts (as you would expect).
The backup software we use is BackupChain (backupchain.com).
The problem we are having:
When a backup kicks off for a VM, the other VMs on the same CSV reboot without warning. This happens within seconds of the backup starting.
What we have done to troubleshoot this:
We have tried rebalancing our backups.
Originally, we had backup jobs scheduled to kick off on Friday or Saturday evening after 9 PM,
with 2 or 3 hosts backing up VMs (serially; 1 VM per host at a time) each night.
I then changed the backup schedule so that, of the 90 VMs, only 1 per CSV is backed up at the same time.
I mapped out the hosts and CSVs and scheduled the backups to run on weeknights so that each night there
is only 1 VM backed up per CSV. All VMs can be backed up over 5 nights (there are a few VMs that don't
get backed up). I staggered the start times for each host so that only 1 host is starting
in the same timeframe. There was some overlap where hosts had backups that ran longer than 1 hour.
Testing the new schedule did not fix the problem; it only made it more clear. As each backup timeframe
started, whichever CSV the first VM started on would have all of its VMs reboot and come back up.
I thought maybe we were overloading the network, so I decided to disable all of the scheduled backups
and run them manually. Kicking off a backup of a single VM would, in all cases, cause a reboot of the common
CSV members.
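To double-check that one-VM-per-CSV scheduling, the VM-to-CSV mapping can be pulled from the cluster itself. A sketch, assuming the "Virtual Machine Configuration" resources expose a VmStoreRootPath parameter (worth verifying with Get-ClusterParameter on one resource first):

```powershell
# Map each clustered VM to the CSV path that holds its configuration.
# VmStoreRootPath is an assumption; confirm the parameter name on your
# cluster with:  Get-ClusterResource | Get-ClusterParameter
Import-Module FailoverClusters
Get-ClusterResource |
    Where-Object { $_.ResourceType.Name -eq "Virtual Machine Configuration" } |
    ForEach-Object {
        New-Object PSObject -Property @{
            VM      = $_.OwnerGroup.Name
            CsvPath = ($_ | Get-ClusterParameter VmStoreRootPath).Value
        }
    } | Sort-Object CsvPath | Format-Table VM, CsvPath -AutoSize
```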
OK, maybe there is something wrong with the backup software.
I downloaded a demo of Veeam and installed it onto the cluster.
I did a test backup of 1 VM and had no problems.
I did a test backup of a second VM and had the same problem: the VMs on the same CSV rebooted.
OK, so it's not the backup software; apparently it's VSS. I have looked through various websites. The best VSS troubleshooting
site I have found, with everything in one place, is on backupchain.com (http://backupchain.com/hyper-v-backup/troubleshooting.html).
I have tested every process on their list and lay out the results below:
1. I have rebooted HST6 and the problems still persist.
2. When I run vssadmin delete shadows /all, I have no shadows to delete on any of the 5 nodes.
When I run vssadmin list writers, I have no error messages on any writers on any node.
3. When I check the listed registry key, I only have the built-in MS VSS writer listed (I am using software VSS).
4. When I run the vssadmin resize shadowstorage command, there is no shadow storage on any node.
5. I have completed the registration and service cycling on HST6 as laid out there, and most of the steps "error";
only a few of the DLLs will register.
6. The Hyper-V integration services were reconciled when we worked with MS in January, and I have no indication of
further issues here.
7. I did not complete the step to delete subscriptions because, again, I have no error messages when I list writers.
8. I removed the Veeam software I had installed to test (it hadn't added a VSS writer anyway, though).
9. I can't realistically uninstall Hyper-V and test VSS.
10. I have the latest SPs and updates.
11. As part of step 5, I did this. It seems to be a rehash of various other strategies.
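For reference, the vssadmin checks from steps 2 and 4 above can be run together from an elevated prompt on each node; these are all in-box commands, no third-party tooling involved:

```powershell
# Inventory VSS state on a node: writers and their last error, registered
# providers, existing shadow copies, and shadow storage associations.
vssadmin list writers
vssadmin list providers
vssadmin list shadows
vssadmin list shadowstorage
```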
I have used the VSS troubleshooter that is part of BackupChain (Ctrl-T) and get the following error:
ERROR: Selected writer 'Microsoft Hyper-V VSS Writer' is in failed state!
- Status: 8 (VSS_WS_FAILED_AT_PREPARE_SNAPSHOT)
- Writer Failure code: 0x800423f0 (<Unknown error code>)
- Writer ID: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
- Instance ID: {d55b6934-1c8d-46ab-a43f-4f997f18dc71}
VSS snapshot creation failed with result: 8000ffff
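One way to confirm the writer failure is independent of any backup product is the in-box DiskShadow tool. A sketch of an interactive session (the volume path is an example; "writer verify" makes the snapshot fail loudly if the Hyper-V writer fails again):

```
diskshadow
DISKSHADOW> writer verify {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
DISKSHADOW> set context volatile
DISKSHADOW> begin backup
DISKSHADOW> add volume C:\ClusterStorage\Volume12
DISKSHADOW> create
DISKSHADOW> end backup
```

If the create step fails here too, the problem sits in VSS / the Hyper-V writer rather than in BackupChain or Veeam.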
VSS errors in Event Viewer. Below are representative errors I have received from various nodes of the cluster:
I have various of the below spread out on all hosts except HST6:
Source: volsnap, Event ID 10: The shadow copy of volume X took too long to install.
Source: volsnap, Event ID 16: The shadow copies of volume X were aborted because volume Y, which contains the shadow copy storage for this shadow copy, was force dismounted.
Source: volsnap, Event ID 27: The shadow copies of volume X were aborted during detection because a critical control file could not be opened.
I have 1 instance of each of these, and both of the below are from HST3:
Source: VSS, Event ID 12293: Volume Shadow Copy Service error: Error calling a routine on the Shadow Copy Provider {b5946137-7b9f-4925-af80-51abd60b20d5}. Routine details RevertToSnapshot [hr = 0x80042302, A Volume Shadow Copy Service component encountered an unexpected error].
Source: VSS, Event ID 8193: Volume Shadow Copy Service error: Unexpected error calling routine GetOverlappedResult. hr = 0x80070057, The parameter is incorrect.
So, basically, everything we have tried has resulted in no success towards solving this problem.
I would appreciate any assistance that can be provided.
Thanks,
Charles J. Palmer
Wright Flood
Long read. <grin> I'm probably going to miss things in my reply.
First, I don't see that you have specified a network for CSV traffic, which is the recommended configuration from Microsoft. As a result, the CSV traffic may be ending up on a network that does not have the bandwidth for it. Fortunately, it looks like you have a dedicated network for heartbeat. You don't have to dedicate a network for that, as cluster communication traffic is pretty low; so, if that is on its own network, you can add the CSV traffic to it. Here are a couple of PowerShell commands to configure the cluster metrics for that (obviously replace the network names):
$clstr = Read-Host "Enter name of cluster"
Get-ClusterNetwork -Cluster $clstr | ft Name, Role, Metric
(Get-ClusterNetwork "CSV" -Cluster $clstr).Metric = 800
(Get-ClusterNetwork "LiveMigration" -Cluster $clstr).Metric = 900
Get-ClusterNetwork -Cluster $clstr | ft Name, Role, Metric
The other thing is to ensure cluster communications are running on at least one other network. Most people have cluster communications running on both the management network and the CSV network. That provides fault tolerance for the cluster communication network.
Next, I see a single NIC for iSCSI. The better practice is to have at least 2 NICs for storage access and to set them up with MPIO. Is that causing your problem? I don't know, but if the CSV traffic happens to land on the iSCSI network and you are running production and backup I/O off a single network, who knows?
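As a sketch of the MPIO side (this assumes the Multipath I/O feature is already enabled on each node; the device string below is the standard identifier mpclaim uses for iSCSI-attached storage, and -r requests the reboot that claiming typically requires):

```powershell
# Claim all iSCSI-attached devices for the Microsoft DSM; run on each node.
mpclaim -r -i -d "MSFT2005iSCSIBusType_0x9"
```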
To address your concerns about the cluster validation warnings: it is not uncommon to get a patch-level mismatch in a cluster that has been around awhile, with different nodes having been built at different times. For example, if you have a node that was patched before SP1 was applied, and compare it to a node that had SP1 applied before it was patched, I can guarantee the update lists will vary. It looks like you have done what you should in this case. In regards to the storage errors: these happen when you try to run the storage tests on an operational cluster that owns the storage being tested. Of course there are persistent reservations - the cluster has them. The storage tests can only complete successfully on storage not claimed by a cluster.
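That is why, on a live cluster, validation is usually run with the storage tests skipped. A minimal sketch (the cluster name is a placeholder):

```powershell
# Run cluster validation but skip the storage tests, which would otherwise
# flag the cluster's own persistent reservations on the CSV disks.
Import-Module FailoverClusters
Test-Cluster -Cluster "MyCluster" -Ignore "Storage"
```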
Finally, the backup problem. Have you discussed this with BackupChain? Is their backup software certified for use with CSV volumes? Not all backup software works with CSVs. Besides, if they certified it for use with CSVs, they are the best place to start on getting these answers.
.:|:.:|:. tim