Ever have one of those “days,” or more like weeks, where you are supposed to handle everything but in reality you are secretly banging your head and crying in the dark shadows of your office because you can’t solve a stupid little problem? Yes, that was me this past week, trying to solve something that should have been simple but ended up costing me many late nights and long days of troubleshooting. If you want to skip my ramble, I put a Final Notes section with a quick summary at the bottom.
So we are building out a number of Storage Spaces Direct (S2D) clusters for a client. Everything is pretty new technology: Windows Server 2016, hyper-converged, blah blah blah….
We had built out our POC cluster to do some load balancing, fault tolerance testing, etc. Things were going very well until I killed one of the servers and lost about 20 test VMs. That wasn’t supposed to happen??? Nope, it wasn’t. I had already planned to rebuild the POC cluster using Server Core instead of the Desktop Experience GUI, so I went and rebuilt the cluster.
Once again, S2D was working great, with good performance for the load testing using VMFleet on the hardware configuration we had built out. (Separate blog series on that if you are interested.) Then I tested Live Migration and no go. I tested Quick Migration and everything worked like a charm. So I went back and started to research why Live Migration wouldn’t work.
Within the Failover Cluster console I was getting the following error, which also showed up in the Application event log:
Error 7/9/2017 8:11:02 PM Hyper-V-High-Availability 21502 None
Live migration of ‘Virtual Machine LMTest01’ failed.
Virtual machine migration operation for ‘LMTest01’ failed at migration source ‘<SERVER NAME>’. (Virtual machine ID 66D6F9DA-BBF4-4D22-B6F0-4D3E78A3E626)
The Virtual Machine Management Service failed to establish a connection for a Virtual Machine migration with host ‘<SERVER NAME>’: The specified target is unknown or unreachable (0x80090303).
The Virtual Machine Management Service failed to authenticate the connection for a Virtual Machine migration at the source host: The specified target is unknown or unreachable (0x80090303)
Now, when researching this error, at least at the time of this post, there was little to no information about it for Windows Server 2016. There were plenty of blogs and TechNet forum posts for 2012 and older; however, pretty much all of them focus on authentication issues and whether you have Live Migration configured to use Kerberos or CredSSP. I did double-check my Live Migration settings, and everything on the Hyper-V side was configured correctly.
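If you would rather sanity-check the Hyper-V side from PowerShell than click through the GUI, something along these lines shows how Live Migration is set up on each node (a sketch using the built-in Hyper-V module; switching to Kerberos only makes sense if your delegation is set up for it):

```powershell
# Is Live Migration enabled, and which authentication protocol is it using?
Get-VMHost | Format-List VirtualMachineMigrationEnabled,
    VirtualMachineMigrationAuthenticationType,
    MaximumVirtualMachineMigrations

# With constrained delegation in place, the hosts should be set to Kerberos
Set-VMHost -VirtualMachineMigrationAuthenticationType Kerberos
```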
I checked my Active Directory configuration for constrained delegation on my Hyper-V hosts. On the Delegation tab of each Hyper-V host’s properties I had “Trust this computer for delegation to any service (Kerberos only)” set. I had read on a Microsoft blog someplace that with 2016 it is best practice to use “Trust this computer for delegation to specified services only” and then add the services that are allowed to present delegated credentials. I did this and quickly saw that all 4 hosts had only the “cifs” service type allowed for all my Hyper-V servers in this POC cluster. (Hmmm… not going to give it away here… but this is where I should have noticed something….)
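The same delegation settings can be inspected and changed from PowerShell; the allowed services live in the msDS-AllowedToDelegateTo attribute on the computer account. This is a sketch using the ActiveDirectory module with the node names from this post (the target SPN in the second command is just an example):

```powershell
# List the services this Hyper-V node is allowed to present delegated credentials to
Get-ADComputer -Identity cldptcjblhpv001 -Properties msDS-AllowedToDelegateTo |
    Select-Object -ExpandProperty msDS-AllowedToDelegateTo

# Add a cifs entry for another node to the constrained delegation list
Set-ADComputer -Identity cldptcjblhpv001 `
    -Add @{'msDS-AllowedToDelegateTo' = 'cifs/cldptcjblhpv002.DOMAIN.com'}
```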
So, at this point I started to dig into the Hyper-V event logs. Not only was I getting the error mentioned above, but within the Hyper-V VMMS log I found the following errors repeated over and over:
- Failed to register the service principal name ‘Microsoft Virtual System Migration Service’.
- Failed to register the service principal name ‘Microsoft Virtual Console Service’.
- Failed to register the service principal name ‘Hyper-V Replica Service’.
Getting warmer??? Maybe, or maybe not? So I went on a rampage to troubleshoot these errors. Just like with the first error, nothing referred back to Windows Server 2016 or Hyper-V 2016, but I followed a few blogs to get an idea of what might be wrong.
I even manually created the SPNs for all three services on all the nodes in my POC cluster. I restarted the VMMS service and was thinking great, now I’m done. Yet still, error after error that it failed to register the service principal name.
setspn -S “Microsoft Virtual System Migration/cldptcjblhpv001” cldptcjblhpv001
setspn -S “Microsoft Virtual System Migration/cldptcjblhpv001.DOMAIN.com” cldptcjblhpv001
setspn -S “Microsoft Virtual System Migration/cldptcjblhpv002” cldptcjblhpv002
setspn -S “Microsoft Virtual System Migration/cldptcjblhpv002.DOMAIN.com” cldptcjblhpv002
setspn -S “Microsoft Virtual System Migration/cldptcjblhpv003” cldptcjblhpv003
setspn -S “Microsoft Virtual System Migration/cldptcjblhpv003.Domain.com” cldptcjblhpv003
setspn -S “Microsoft Virtual System Migration/cldptcjblhpv004” cldptcjblhpv004
setspn -S “Microsoft Virtual System Migration/cldptcjblhpv004.Domain.com” cldptcjblhpv004
I did the same with the other 2 SPNs as well. (Anyone catch something from above?)
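Before and after running commands like the ones above, it is worth dumping what actually got registered on each computer account; listing the SPNs per node would have shown me the mangled service names much sooner:

```powershell
# List every SPN currently registered on a node's computer account
setspn -L cldptcjblhpv001

# Search the forest for an SPN (supports a trailing wildcard) to spot
# duplicates or near-miss names
setspn -Q "Microsoft Virtual System Migration Service/cldptcjblhpv001*"
```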
However, as I mentioned above, when I checked the constrained delegation properties for each Hyper-V host, I had only added the “cifs” service type. Why? Because the other three service types never showed as an option to add until after I registered the SPN for the service. Yes, this was my fix??? Or was it???
I quickly went back to the properties on my hosts, and now I did have the three missing service types. I kept the cifs service type and added the “Microsoft Virtual Machine Management” service type. Restarted the VMMS service and nothing. Still tons of errors that the SPNs couldn’t register. Live Migration was still broken.
So I was at a loss at this point. I read someplace that these issues were fixed after rebuilding the cluster, so I did the same thing: blew away everything and blocked GPOs from the OUs the machines were in. Still, I ran into the exact same issues. Live Migration wasn’t working, I was still getting the first error, and I was still getting the SPN registration errors.
So, in one last reach out to the community and the interwebs, I found the following blog: Griffon’s IT Library. They were having the same issues with Hyper-V not registering the SPNs. At this point my focus moved to the SPNs rather than the first error above; I figured that if the VMMS service wasn’t working properly, Live Migration wouldn’t work either. I knew I was on the right track! This had to be it! I read through his entire blog and it was it… well, if I had done a deeper dive into our environment, it would have been the answer. This is where pride comes in, and the reason for hours more troubleshooting, when some simple checks at this point would have saved me. I didn’t look at which ports our domain controllers were using. Who these days restricts the NTDS port to an uncommon range anyway….. (Hint: this is a big hint to what fixed my issue days later.)
Once again, I blew away my cluster and rebuilt it, clean and fresh. Yet I was still getting those SPN errors. This time I didn’t actually create the cluster; as soon as the Hyper-V role was installed, the same SPN errors appeared. Then, after a few cups of good coffee, something hit me. Something I had overlooked. No, this couldn’t be the issue? Could it have been? Above, I left a hint after showing you the setspn -S commands I ran… this was one of my issues….
Look closer, do you see what I finally noticed?
setspn -S “Microsoft Virtual System Migration/cldptcjblhpv001” cldptcjblhpv001
setspn -S “Microsoft Virtual System Migration/cldptcjblhpv001.Domain.com” cldptcjblhpv001
Yes, I left out the word “Service” in “Microsoft Virtual System Migration Service.” Yes, this was it. Everything worked once I registered the SPNs manually and added them to the delegation of services for my cluster nodes? Sorry, no, try again! I was pretty much done at this point. I had a deadline coming up to get the actual production clusters built and ready, yet Live Migration wasn’t working. This was a project stopper. No Live Migration, no moving forward into production, which meant I would be the flat tire on the project timeline. My name would be dragged through the mud… it would be my fault that we didn’t meet our project milestones… No!!!!!!!!!!!!!!!!!!
So, one more try. One last effort before I went to the team and said, “I can’t fix this problem, we need to call Microsoft.” Which isn’t always bad, but it doesn’t look good.
When researching anything SPN-related for Hyper-V, you always get tons of blogs and technical articles about the dynamic port range. Like I mentioned above, my pride got in the way of an overall easy fix. I should have given it away when I mentioned it in the section about Griffon’s IT Library blog.
Now, to give myself credit, I did have two issues here: one caused by my fat finger, and another caused by how the domain controllers had been configured in the environment where I am building these clusters. There are many good blogs out there that talked about this one issue, and I ignored it, plain as day. I had to search a little to find out how our domain controllers had been configured, and what I found had me banging my head on my desk for a good hour plus. Our domain controllers had the NTDS port restricted to a very small port range. After finally logging on to a DC, running NETSTAT, and seeing all the traffic on ports not normally used, I wanted to scream. I double-checked the registry, and yes, they had been restricted.
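If you suspect the same thing in your environment, these are the usual places to check on a DC. The registry paths are the standard ones Microsoft documents for pinning or restricting RPC ports; both values are normally absent, so anything here means someone changed the defaults:

```powershell
# NTDS can be pinned to a static RPC port with this value (absent = dynamic range)
Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters' |
    Select-Object 'TCP/IP Port'

# The machine-wide RPC dynamic range can also be restricted here
Get-ItemProperty -Path 'HKLM:\Software\Microsoft\Rpc\Internet' -ErrorAction SilentlyContinue |
    Select-Object Ports

# And the OS-level ephemeral port range the traffic would normally use
netsh int ipv4 show dynamicport tcp
```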
I went back to Griffon’s IT Library blog and used the PowerShell script he had borrowed from another site. I restarted the VMMS service, and within seconds my SPN errors went away and I started getting event log entries confirming successful registration of those services.
#———- Beginning of the script ——————————#
# Allow the Hyper-V VMMS service outbound to the restricted port range.
# (Direction 2 = outbound and protocol 6 = TCP in the underlying firewall API.)
New-NetFirewallRule -DisplayName "Allow outbound traffic from service to TCP 5100 to 5400" `
    -Direction Outbound -Protocol TCP -RemotePort "5100-5400" `
    -Service vmms -Action Allow -PolicyStore ConfigurableServiceStore

# Verify the rule landed in the service hardening store
Get-NetFirewallRule -PolicyStore ConfigurableServiceStore
#———- End of the script ——————————#
(The ports above are not my actual ports; I just changed the range in the script to match the few ports our DCs had been restricted to.)
With the SPNs registered under the correct service names, and the VMMS service now able to communicate on the restricted ports, my issues were fixed. I will say that I started out chasing a rabbit: my original error message, while a valid error, was not the underlying issue causing my grief. Some days (weeks) are just like this.
I created my cluster again, enabled S2D, and created a test VM. And, like it should have done the first time, Live Migration worked without an issue. I still don’t know why we restricted the NTDS ports like that; everyone I have asked has said the same thing: it was always set that way and we never changed it. Anyway, at least I know I will need to make another configuration change on all my Hyper-V hosts moving forward.
Final Notes: I started out looking up Live Migration errors, which led me to constrained delegation that was not truly configured correctly. Why? Because the SPNs for three of the Hyper-V services were not registered correctly, because the NTDS ports on my domain controllers had been restricted to ports outside the dynamic range that Hyper-V uses to register those SPNs. At the end of the day, if I had focused on the SPN issues and just looked at my domain controllers, I would have saved myself a lot of time.
- Check the NTDS ports on your domain controllers if you are having an SPN registration issue.
- Check that the SPNs are registered with the correct names. (If you did step number 1, this should take care of itself.)
- Verify that your constrained delegation for Live Migration is configured correctly.
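For the first check, a quick connectivity test from a Hyper-V node to a DC on the restricted port tells you whether the traffic can even get through. The DC name and port below are placeholders; substitute your own:

```powershell
# Can this Hyper-V node reach the DC on the restricted NTDS/RPC port?
Test-NetConnection -ComputerName dc01.DOMAIN.com -Port 5100
```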
A few shout-outs to other blogs that helped along the way: