Solution Deployment Failed On One Or More Servers In Farm

Issue

When we deployed a solution in a multi-server SharePoint farm, it did not deploy to all servers in the farm. The deployment skipped one or more servers, and the status of the solution remained "Not Deployed".

Note
In our environment, we have four SharePoint servers with the Custom MinRole: two application servers and two web front ends.

  1. If you browse to Manage Farm Solutions in Central Admin, the status of the solution is Not Deployed.


  2. If you click the solution to open its Solution Properties page, you will see that it deployed to 3 of the 4 servers; kfca1 is missing.


  3. Even when we try to retract an existing solution, it retracts from only 3 of the 4 servers, again missing kfca1.

Troubleshooting

We checked the following:

  1. Checked that the SharePoint Timer Service is running on all servers (from the Services console)
  2. Checked that the SharePoint Admin Service is running on all servers (from the Services console)
  3. Checked in Central Admin whether any timer job was stuck or paused (Central Admin > Monitoring > timer job status)
  4. Cleared the config cache on all servers in the farm (please check the wiki for Clear Config Cache)

    Note
    Clearing the config cache in production requires extra precautions; otherwise, it can cause an outage.

  5. Redeployed the solution using either PowerShell or Central Admin
  6. Rebooted the faulty server
  7. Enabled verbose logging and checked the ULS logs for clues
  8. Checked the Event Log for clues
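
Checks 1 and 2 can also be done from a single server instead of opening the Services console on each machine. The following is a minimal sketch, assuming Windows PowerShell 5.1 (where `Get-Service` still supports `-ComputerName`), remote service management being allowed between the farm servers, and the standard Windows service names `SPTimerV4` (SharePoint Timer Service) and `SPAdminV4` (SharePoint Administration). Note that these are the Windows services only; as the Root Cause section shows, they can be running while the farm-level timer service instance is still disabled.

```powershell
# Load the SharePoint cmdlets if this is not the SharePoint Management Shell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Enumerate farm members, skipping hosts that do not run SharePoint (Role "Invalid")
$servers = Get-SPServer | Where-Object { $_.Role -ne "Invalid" }

foreach ($server in $servers) {
    # Query the Timer and Administration Windows services on each server
    Get-Service -ComputerName $server.Address -Name SPTimerV4, SPAdminV4 |
        Select-Object MachineName, Name, Status
}
```

Any row whose Status is not Running points to a server that needs attention before the solution is redeployed.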

Root Cause

After exhausting our own troubleshooting, we opened a support ticket with Microsoft. While working on that ticket, we found that the internal SharePoint Foundation Timer service instance was disabled on the server.

Note
The farm-level SharePoint Foundation Timer service instance is visible only from PowerShell.

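  1. We ran the following script to get the status of the internal SharePoint Foundation Timer service instance on each server.
    # Load the SharePoint cmdlets if this is not the SharePoint Management Shell
    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

    $farm = Get-SPFarm
    # Skip hosts that are not SharePoint servers (Role "Invalid")
    $servers = $farm.Servers | Where-Object { $_.Role -notlike "Invalid" }
    foreach ($server in $servers) {
        $server.Name
        Write-Host "........................."
        foreach ($instance in $server.ServiceInstances) {
            # Report only the Administration and Timer service instances
            if ($instance.TypeName -eq "Microsoft SharePoint Foundation Administration") {
                $instance.TypeName
                $instance.Status
            }
            if ($instance.TypeName -eq "Microsoft SharePoint Foundation Timer") {
                $instance.TypeName
                $instance.Status
            }
        }
    }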
  2. The output clearly shows that the SharePoint Foundation Timer service instance is disabled on kfca1, which is the problem.

Resolution

Now that we know the internal SharePoint Foundation Timer service instance is disabled on kfca1, we have to bring that service instance back online.

  1. The status can be changed only via PowerShell. Run the following script to bring all timer service instances online.
    # Load the SharePoint cmdlets if this is not the SharePoint Management Shell
    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

    $farm = Get-SPFarm
    # Find every timer service instance that is not Online
    $disabledTimers = $farm.TimerService.Instances | Where-Object { $_.Status -ne "Online" }
    if ($null -ne $disabledTimers) {
        foreach ($timer in $disabledTimers) {
            Write-Host "Timer service instance on server" $timer.Server.Name "is not Online. Current status:" $timer.Status
            Write-Host "Attempting to set the status of the service instance to Online"
            $timer.Status = [Microsoft.SharePoint.Administration.SPObjectStatus]::Online
            # Persist the status change to the configuration database
            $timer.Update()
        }
    }
    else {
        Write-Host "All timer service instances in the farm are online. No problems found."
    }
  2. The output confirms that the service instance is now online.


  3. If you run the script that checks the status again, all servers now return an Online status.
  4. Next, clear the config cache on all servers in the farm. Please follow the wiki for clearing the config cache.
  5. Finally, redeploy the solution using either PowerShell or Central Admin.
  6. The solution is now successfully deployed to all servers.
  7. If you check the solution properties, they will show that it has deployed to all servers, as expected.
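
The redeploy and verification steps above can also be done entirely from PowerShell. This is a rough sketch, not the exact commands we ran: the package name `contoso.solution.wsp` is hypothetical, `-GACDeployment` is assumed to be needed because the package contains GAC assemblies, and a solution scoped to web applications would additionally need `-WebApplication` or `-AllWebApplications`. The 10-minute wait limit is an arbitrary choice.

```powershell
# Load the SharePoint cmdlets if this is not the SharePoint Management Shell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Hypothetical package name; replace it with your own solution
$solutionName = "contoso.solution.wsp"

# Queue a farm-wide deployment (this creates the deployment timer job)
Install-SPSolution -Identity $solutionName -GACDeployment

# Poll until the deployment job is gone, but give up after 10 minutes
$deadline = (Get-Date).AddMinutes(10)
do {
    Start-Sleep -Seconds 5
    $solution = Get-SPSolution -Identity $solutionName
} while ($solution.JobExists -and (Get-Date) -lt $deadline)

# Overall result of the last operation
$solution | Select-Object Name, Deployed, LastOperationResult, LastOperationDetails

# Servers that actually received the solution; every farm server should be listed
$solution.DeployedServers | Select-Object Name
```

If a server such as kfca1 is missing from the last list, the deployment still has not reached it, and the timer service instance status is worth checking again.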

Conclusion

In the end, the SharePoint timer service instance, which is visible only through PowerShell, was disabled on one server, and that made it impossible to deploy the solution to that server. If you deploy with the local parameter, the solution deploys on the local server, but a farm-wide (global) deployment does not reach it. A big clue from the troubleshooting was that the timer job for solution deployment was never created for the faulty server. Another day in a SharePoint admin’s life.
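
Both observations can be checked from PowerShell. The snippet below is a sketch under assumptions: the package name is hypothetical, and the filter relies on deployment timer jobs having names that start with `solution-deployment`, which is the pattern we would expect but should be confirmed in your farm.

```powershell
# Load the SharePoint cmdlets if this is not the SharePoint Management Shell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Hypothetical package name; replace it with your own solution
$solutionName = "contoso.solution.wsp"

# List pending solution deployment timer jobs (name pattern is an assumption)
Get-SPTimerJob |
    Where-Object { $_.Name -like "solution-deployment*" } |
    Select-Object Name, Server, Status

# Local deployment: runs synchronously on the current server only,
# without going through the timer service. Run it on each affected server.
Install-SPSolution -Identity $solutionName -GACDeployment -Local
```

A local deployment works around the symptom on one server at a time; it does not repair the disabled timer service instance, so the fix in the Resolution section is still required.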

This applies to all on-premises SharePoint Server versions, i.e., SharePoint 2010, 2013, 2016, and 2019.