Troubleshooting and Fixing Distributed Cache Service in SharePoint 2013

The Distributed Cache (DC) is a new component in SharePoint 2013. It provides caching for social computing features such as My Sites, microblogs, activity feeds, and news feeds, as well as for authentication tokens. This makes it one of the most critical parts of SharePoint 2013 in terms of social computing.
 
The Distributed Cache service uses Windows Server AppFabric caching technology behind the scenes.
 
The cache can consume a lot of memory on the application and web servers. The DC service can be deployed in one of two modes:
  1. Collocated mode – in this mode, the Distributed Cache service runs together with other services on the application server.
  2. Dedicated mode – in this mode, all services other than the Distributed Cache service are stopped on the application server that runs the Distributed Cache service.
Microsoft recommends using dedicated mode in the SharePoint Farm.
 
Capacity planning is an important factor when you implement the DC service in a SharePoint farm.
 
These are the Microsoft-recommended Distributed Cache capacities:

Deployment size | Small farm | Medium farm | Large farm
--- | --- | --- | ---
Total number of users | < 10,000 | < 100,000 | < 500,000
Recommended cache size for the Distributed Cache service | 1 GB | 2.5 GB | 12 GB
Total memory allocation for the Distributed Cache service (double the recommended cache size above, plus reserve 2 GB for the OS) | 2 GB | 5 GB | 34 GB
Recommended architectural configuration | Dedicated server or co-located on a front-end server | Dedicated server | Dedicated server
Minimum cache hosts per farm | 1 | 1 | 2

Note: In the Distributed Cache service, cache size should not exceed 16 GB. So, Microsoft recommends that you use two servers while working in a large farm environment.

When implementing the DC, it is better to have a dedicated server even for a small farm.
 
I found that troubleshooting for DC is not well documented on TechNet, especially for the point where you actually run into issues. Fortunately, there are blogs that help in troubleshooting the DC.

My SharePoint Server 2013 farm is as follows:
OS: Windows Server 2012
SharePoint version: SharePoint Server 2013 Standard, build number 15.0.4420.1017 (RTM)
SQL Server: SQL Server 2012
Servers:
A) App Server, 8 GB RAM
B) Web Front End 01, 3 GB RAM
C) Web Front End 02, 3 GB RAM
First things first. I will list down all the pre-requisites for Distributed Cache to function properly, so that you do not pull out your hair and become frustrated like me! :)
  1. Warning while setting up the DC service.

    Do not restart the AppFabric Caching Service from the Services console. Microsoft strongly advises against this; if you do it, you might need to rebuild your farm.

  2. Always use the PowerShell Distributed Cache cmdlets.
  3. Firewall ports
    1. Distributed Cache requires the following high ports: 22233, 22234, 22235, 22236.
      Note: If you set up the service with the PowerShell Distributed Cache cmdlets, the DC ports are opened in the firewall automatically.
    2. ICMPv4 and ICMPv6 have to be allowed for DC to function properly.
      In addition, the following ports have to be opened as well: 8, 138, 139, 445.
  4. Firewalls in the organization
    If the network topology has 2 - 3 firewalls for the SharePoint farm, the ports have to be opened on all of those firewalls as well.

    Search and User Profile requirements

  5. Search: Continuous crawl has to be enabled.
  6. User Profile: The service account of the application pool of the web application for My Site should have Full Control.
  7. Use Stop-SPDistributedCacheServiceInstance -Graceful to stop a Distributed Cache instance on any SharePoint server.
  8. Assign the Distributed Cache memory when you set up the Distributed Cache instance on each SharePoint server. DC eats memory like crazy, and users will complain later on if you skip this.
  9. Remote services must be enabled.
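
As a rough illustration of item 8, the cache size can be assigned from PowerShell. This is a minimal sketch, assuming it runs in an elevated SharePoint 2013 Management Shell on the cache host; the 1024 MB value is only an assumption taken from the small-farm row of the capacity table, so pick the value that fits your farm.

```powershell
# Sketch: assign the Distributed Cache size on the local cache host.
# Assumption: run as a farm administrator on a server that hosts the DC service.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$cacheSizeMB  = 1024   # assumption: 1 GB, the small-farm value; adjust for your farm
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"

# Locate the Distributed Cache service instance on this server
$serviceInstance = Get-SPServiceInstance | Where-Object {
    $_.Service.ToString() -eq $instanceName -and $_.Server.Name -eq $env:COMPUTERNAME
}

if ($serviceInstance) {
    $serviceInstance.Unprovision()                            # stop the instance first
    Update-SPDistributedCacheSize -CacheSizeInMB $cacheSizeMB # apply the new size
    $serviceInstance.Provision()                              # start the instance again
}
```

If the farm has more than one cache host, the DC service should be stopped on all of them before the size is changed, and started again afterwards.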
I will cover both collocated and dedicated modes for DC configuration.
  • In the collocated configuration, each server in the farm has a DC instance with the STARTED status.
  • In the dedicated configuration, you choose one or more servers to be dedicated Distributed Cache servers, and the DC instance on the other web servers MUST have the STOPPED status. The Distributed Cache instance MUST still be present on all SharePoint servers.
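
To see which mode a farm is effectively running in, it can help to list the DC instance status per server. The following is a minimal sketch, assuming the SharePoint snap-in is available; note that PowerShell reports the status as Online or Disabled, which Central Administration displays as Started or Stopped.

```powershell
# Sketch: list the Distributed Cache service instance and its status on every farm server.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"

Get-SPServiceInstance |
    Where-Object { $_.Service.ToString() -eq $instanceName } |
    Select-Object @{ Name = 'Server'; Expression = { $_.Server.Name } }, Status |
    Sort-Object Server
```

A server that does not appear in the output has no DC instance at all, which is different from having a stopped one.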
Issue #1 Error: "cacheHostInfo is null" when removing an existing DC instance with Remove-SPDistributedCacheServiceInstance

Fix:

Forcefully delete the Distributed Cache Instance as follows:

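# Find the Distributed Cache service instance on the affected server (here: SP2013App)
$instanceName ="SPDistributedCacheService Name=AppFabricCachingService"
$serviceInstance = Get-SPServiceInstance | ? {($_.service.tostring()) -eq $instanceName -and ($_.server.name) -eq "SP2013App"}
# Delete the broken instance, then add a fresh one on the local server
$serviceInstance.Delete()
Add-SPDistributedCacheServiceInstance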

Issue #2 Error starting the Distributed Cache instance

When you provision the DC instance, you may receive an error and the instance fails to start.

Fix:

Remove and add the DC instance.

#Removing the service from SharePoint on local host.
Stop-SPDistributedCacheServiceInstance -Graceful
Remove-SPDistributedCacheServiceInstance
$instanceName ="SPDistributedCacheService Name=AppFabricCachingService"
$serviceInstance = Get-SPServiceInstance | ? {($_.service.tostring()) -eq $instanceName -and ($_.server.name) -eq $env:computername}
$serviceInstance.Delete()

#Add DC Instance

# Look up the cache cluster information for this farm (for reference)
$SPFarm = Get-SPFarm
$cacheClusterName = "SPDistributedCacheCluster_" + $SPFarm.Id.ToString()
$cacheClusterManager = [Microsoft.SharePoint.DistributedCaching.Utilities.SPDistributedCacheClusterInfoManager]::Local
$cacheClusterInfo = $cacheClusterManager.GetSPDistributedCacheClusterInfo($cacheClusterName)
$instanceName ="SPDistributedCacheService Name=AppFabricCachingService"
$serviceInstance = Get-SPServiceInstance | ? {($_.Service.Tostring()) -eq $instanceName -and ($_.Server.Name) -eq $env:computername}
# Delete any leftover instance on this server; skip if none is found
if ($serviceInstance) { $serviceInstance.Delete() }
Add-SPDistributedCacheServiceInstance

Issue #3 ErrorCode<ERRPS002>:SubStatus<ES0001>:Invalid provider and connection string read. Please provide the values manually.
 
 

Fix:
 
Somehow, the connection string went missing, so we need to add the AppFabric database entry to the registry manually, as follows:
  1. Open Run (Windows + R) and enter Regedit.
  2. Navigate to HKEY_LOCAL_MACHINE >> SOFTWARE >> MICROSOFT >> AppFabric >> V1.0 >> CONFIGURATION.
  3. Enter the Connection String and Provider values as follows:


Connection String:
Data Source=spsql;Initial Catalog=SPFarm_SharePoint_Config;Integrated Security=True;Enlist=False

Provider:
SPDistributedCacheClusterProvider
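
The registry entries can also be read back from PowerShell instead of Regedit. This is a small sketch; the value names ConnectionString and Provider are an assumption based on the labels used in the steps above, so check them against what Regedit shows on your server.

```powershell
# Sketch: read back the AppFabric configuration values from the registry.
# Assumption: the values are named ConnectionString and Provider.
$path = 'HKLM:\SOFTWARE\Microsoft\AppFabric\V1.0\Configuration'

Get-ItemProperty -Path $path |
    Select-Object ConnectionString, Provider |
    Format-List
```

An empty ConnectionString or Provider in the output would match the symptom described in this issue.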

Then use PowerShell to verify the Distributed Cache:

Use-CacheCluster
Get-CacheHost

Issue #4 Page load takes 6 seconds.

Unexpected Exception in SPDistributedCachePointerWrapper::InitializeDataCacheFactory for usage 'DistributedViewStateCache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.) ---> System.ServiceModel.ProtocolException

According to the Developer Dashboard, the page load took more than 6 seconds, and the dashboard showed a delay of exactly 6 seconds.
 
 

In my SharePoint environment, I was getting the errors above with all servers running DC in collocated mode.
 
Fix:
It took me more than 4 weeks to find the actual issue. To troubleshoot the Distributed Cache, we first need to see which settings were incorrect in my environment.

As mentioned, I have 3 SharePoint servers: 1 application server and 2 web front-ends.

a) On the App Server

Use-CacheCluster
Get-CacheHost
 
 


Only the App server status is UP:
Apps02: UP
Wfe01: Unknown
Wfe02: Unknown

And the WFE servers were showing the errors below:

Error: SubStatus(ES0001): Cache host SP13WFE01.contoso.com is not reachable.
Error: SubStatus(ES0001): Cache host SP13WFE02.contoso.com is not reachable.

b) First Front End Server



Apps02: Unknown
Wfe01: Down
Wfe02: Unknown

c) Second Front End Server



Apps02: Unknown
Wfe01: Unknown
Wfe02: Down
 
Status of each cache host as reported on each server:

Reported on | Apps02 | Wfe01 | Wfe02
--- | --- | --- | ---
Apps02 | UP | Unknown | Unknown
Wfe01 | Unknown | Down | Unknown
Wfe02 | Unknown | Unknown | Down
 
Clearly, the cache hosts are not able to connect to each other. Each SharePoint server reports a real status only for itself (UP on Apps02, DOWN on WFE01 and WFE02) and shows UNKNOWN for the other servers. During my troubleshooting, I found that if any server has the UNKNOWN status, some configuration has to be fixed.
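
To spot this condition quickly, the output of Get-CacheHost can be filtered for hosts that are not UP. The helper below is a hypothetical sketch (Select-UnhealthyCacheHost is not a SharePoint or AppFabric cmdlet), and it assumes the objects returned by Get-CacheHost expose a Status property.

```powershell
# Sketch: keep only the cache hosts whose status is not "Up".
# Select-UnhealthyCacheHost is a hypothetical helper, not a built-in cmdlet.
# Assumption: each input object has a Status property (Up, Down, Unknown, ...).
function Select-UnhealthyCacheHost {
    param(
        [Parameter(ValueFromPipeline = $true)]
        $CacheHost
    )
    process {
        # Compare the status as text so enum values and plain strings both work
        if ("$($CacheHost.Status)" -ne 'Up') { $CacheHost }
    }
}

# Usage on a cache host (commented out because it needs the AppFabric cmdlets):
# Use-CacheCluster
# Get-CacheHost | Select-UnhealthyCacheHost
```

An empty result means every host reported UP from the point of view of the server where the command was run, so it is worth repeating on each server.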

Collocated mode

Step 1: Create an inbound firewall rule for the Distributed Cache ports (22233 - 22236) on each server.
 
 

Perform this for each server.
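
The inbound rule can also be created from PowerShell rather than through the firewall console. This is a minimal sketch, assuming Windows Server 2012 or later (where the NetSecurity cmdlets are available) and an elevated session; the rule display name is just an illustrative choice.

```powershell
# Sketch: allow inbound TCP traffic on the Distributed Cache ports.
# Run in an elevated PowerShell session on each SharePoint server.
New-NetFirewallRule -DisplayName "SharePoint Distributed Cache (TCP 22233-22236)" `
    -Direction Inbound `
    -Protocol TCP `
    -LocalPort "22233-22236" `
    -Action Allow
```

Running the same command on every server keeps the rule name consistent, which makes it easier to find later with Get-NetFirewallRule.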

Now, in my SharePoint farm, WFE02 shows these settings, and we have to open the firewall for WFE01 as well.

 

Step 2: Start the remote services on each server.
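
The state of the remote services can be checked from PowerShell before starting anything. The sketch below simply lists every service whose display name begins with "Remote"; which of them your farm needs is not spelled out here, so the commented Start-Service line uses Remote Registry purely as an assumed example.

```powershell
# Sketch: list services whose display name starts with "Remote" and show their state.
Get-Service -DisplayName "Remote*" |
    Select-Object Name, DisplayName, Status |
    Format-Table -AutoSize

# Assumption: Remote Registry is one of the services required; start it only if that applies to you.
# Start-Service -Name RemoteRegistry
```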



Step 3: Turn on Ping for all SharePoint servers.
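
Ping can be allowed and then verified from PowerShell as well. This is a minimal sketch, assuming Windows Server 2012 or later and an elevated session; the server names are placeholders based on the names used in this article, so replace them with your own.

```powershell
# Sketch: allow inbound ICMPv4 echo requests (ping) on this server.
New-NetFirewallRule -DisplayName "Allow ICMPv4 echo request" `
    -Direction Inbound -Protocol ICMPv4 -IcmpType 8 -Action Allow

# Then check that the other SharePoint servers answer a ping.
$servers = 'Apps02', 'Wfe01', 'Wfe02'   # placeholder names, replace with your servers
foreach ($server in $servers) {
    $ok = Test-Connection -ComputerName $server -Count 1 -Quiet
    "{0}: {1}" -f $server, $(if ($ok) { 'reachable' } else { 'no reply' })
}
```

The firewall rule has to exist on every server, while the ping loop is most useful when run from each server in turn.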



Now, each SharePoint server reports the status UP for every cache host.

Use-CacheCluster
Get-CacheHost

App Server:



WFE01:
 
 

WFE02:



This works perfectly in the collocated mode for Distributed Cache.

Verify the page load. In my environment, the page load took 288.69 milliseconds with the Distributed Cache started.

To simulate a dedicated Distributed Cache server, I stopped the DC instance on both WFEs, leaving only the Application server to run the Distributed Cache instance.
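
Stopping the instance on a WFE can be done with the cmdlet recommended in the prerequisites. This is a minimal sketch, assuming it is run locally on each WFE in a SharePoint 2013 Management Shell with farm administrator rights.

```powershell
# Sketch: gracefully stop the Distributed Cache instance on the local server (run on each WFE).
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

Stop-SPDistributedCacheServiceInstance -Graceful

# Confirm the local instance is no longer online
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
Get-SPServiceInstance |
    Where-Object { $_.Service.ToString() -eq $instanceName -and $_.Server.Name -eq $env:COMPUTERNAME } |
    Select-Object TypeName, Status
```

The -Graceful switch lets the host hand its cached items to the remaining cache hosts before it stops, which is why it is preferred over stopping the Windows service directly.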

APP02

 
 

WFE01

 
 

WFE02





I hope this article helps someone.