Following an upgrade from Sitecore 8.2 update 6 to Sitecore 9.0 update 2, I spotted that rebuilds of the default indexes (Core, Web and Master) were taking much longer than before.
Rebuilding these 3 indexes in parallel took under 50 mins in Sitecore 8.2, but was now taking over 6 hours in Sitecore 9 (sometimes 20+ hours), for ~1/4 million items in each of the web and master databases.
This was on the same SolrCloud infrastructure, which had been upgraded ahead of the Sitecore 9 upgrade, with the same size VMs for the Sitecore indexing server and the same index batch sizes and threads:
<setting name="ContentSearch.ParallelIndexing.MaxThreadLimit" value="15" />
<setting name="ContentSearch.ParallelIndexing.BatchSize" value="1500" />
Looking at the logs, I could see they were flooded with warnings like this:
XXXX XX:XX:XX WARN More than one template field matches. Index Name : sitecore_master_index Field Name : XXXXXXXXX
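To get a feel for how badly a log is flooded, a quick script can count these warnings per index and field. This is only a minimal sketch and nothing supplied by Sitecore: the regular expression is based solely on the message format shown above, and the function name count_template_field_warnings and the sample lines (including the field names) are hypothetical.

```python
import re
from collections import Counter

# Based only on the warning format shown above; the timestamp prefix is ignored.
WARNING_PATTERN = re.compile(
    r"WARN\s+More than one template field matches\.\s*"
    r"Index Name\s*:\s*(?P<index>\S+)\s+Field Name\s*:\s*(?P<field>.+?)\s*$"
)


def count_template_field_warnings(lines):
    """Count matching warnings per (index name, field name) pair."""
    counts = Counter()
    for line in lines:
        match = WARNING_PATTERN.search(line)
        if match:
            counts[(match.group("index"), match.group("field"))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, pass an open log file instead.
    sample = [
        "1234 10:00:00 WARN More than one template field matches. "
        "Index Name : sitecore_master_index Field Name : title",
        "1234 10:00:01 WARN More than one template field matches. "
        "Index Name : sitecore_master_index Field Name : title",
        "1234 10:00:02 INFO Some unrelated message",
    ]
    for (index, field), n in count_template_field_warnings(sample).most_common():
        print(f"{n:>8}  {index}  {field}")
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, which avoids loading a large log into memory.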
The initial advice from Sitecore Support was to apply some patches (bug #195567) to filter these messages out of the log files.
However, this felt more like treating the symptoms than the cause.
With performance only slightly improved, I tried using reflection and overrides to patch the behaviour in SolrFieldNameTranslator, so that these warnings would not need to be written to the log files in the first place. Unfortunately the code had lots of private, non-virtual methods and implemented an internal interface, which made it quite tricky to override without IL modification, so this really was something for Sitecore to fix.
But even after all this, it still took around 4+ hours to rebuild the indexes on a good day.
I observed that an individual rebuild of the Core index was quite fast on its own, ~5 mins.
But Sitecore Support confirmed that the algorithm uses resource stealing, so that the jobs finish at about the same time as each other (a slow job steals resources from a faster job).
It was also confirmed that in Sitecore 8.2 update 6 all indexes took a similar time when run in parallel.
Work Stealing in Task Scheduler
Blog on Work Stealing
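To illustrate why a small index rebuild stops looking fast once it runs alongside larger ones, here is a toy, tick-based model of work stealing. It is a sketch under stated assumptions and not Sitecore's or .NET's actual scheduler: every task costs one tick, each worker has its own queue, and an idle worker takes a task from the back of the longest other queue. The function name simulate and the task counts are hypothetical.

```python
from collections import deque


def simulate(task_counts, steal):
    """Return the tick at which each worker last did any work.

    task_counts: number of one-tick tasks initially queued per worker.
    steal: if True, an idle worker takes a task from the longest queue.
    """
    queues = [deque(range(n)) for n in task_counts]
    last_busy_tick = [0] * len(queues)
    tick = 0
    while any(queues):  # an empty deque is falsy
        tick += 1
        for worker, own in enumerate(queues):
            if own:
                own.popleft()  # run the next task from the worker's own queue
            elif steal:
                victim = max(queues, key=len)
                if not victim:
                    continue  # nothing left anywhere to steal
                victim.pop()  # steal from the back of the busiest queue
            else:
                continue  # no stealing: the worker simply stays idle
            last_busy_tick[worker] = tick
    return last_busy_tick


if __name__ == "__main__":
    # Hypothetical workload: one small job and one large job.
    print("without stealing:", simulate([2, 10], steal=False))
    print("with stealing:   ", simulate([2, 10], steal=True))
```

In this model the worker with the small queue stays busy until the large queue is drained, so both finish together and the total time drops. That matches the observation that jobs run in parallel finish at about the same time, regardless of how quick each one is in isolation.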
Resource usage on the servers and DTU usage on the database were minimal, so nothing appeared to be maxing out.
So what was the issue? Some locking, or a job scheduling change in Sitecore 9?
Well, to find the answer, some performance traces were required from a test environment where the issue could be replicated.
After enough performance traces had been captured, Sitecore Support observed that there were lots of idle threads doing nothing.
Which was odd on a server with 16 cores and 15 threads allocated for indexing.
Sitecore Support were then able to find the bug. It is specific to the OnPublishEndAsynchronousSingleInstanceStrategy strategy, which was being used by the web index.
This strategy overrides the Run() method and initialises the LimitedConcurrencyLevelTaskSchedulerForIndexing singleton with an incorrect MaxThreadLimit value.
This code appears to be the same in previous versions, so it is likely that we were using onPublishEndAsync rather than onPublishEndAsyncSingleInstance before the upgrade.
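The effect of a shared scheduler being initialised with the wrong limit can be sketched in a few lines. This is a hypothetical model and not Sitecore's code: the class SharedIndexingScheduler, the first-caller-wins behaviour, and the wrong value of 1 are all assumptions made for illustration (the actual incorrect value is not stated here). Only the configured limit of 15 comes from the settings shown earlier.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

CONFIGURED_MAX_THREADS = 15   # from the MaxThreadLimit setting shown earlier
HYPOTHETICAL_WRONG_LIMIT = 1  # assumption: the real incorrect value is not stated


class SharedIndexingScheduler:
    """Hypothetical process-wide scheduler: the first initialise() fixes the limit."""

    _instance = None
    _guard = threading.Lock()

    def __init__(self, max_threads):
        self.max_threads = max_threads

    @classmethod
    def initialise(cls, max_threads):
        with cls._guard:
            if cls._instance is None:
                cls._instance = cls(max_threads)
            return cls._instance  # later callers get the existing instance

    @classmethod
    def reset(cls):
        with cls._guard:
            cls._instance = None

    def run_batches(self, batch_count, work_seconds=0.02):
        """Run batches on a bounded pool; return the peak number running at once."""
        active = 0
        peak = 0
        counter_lock = threading.Lock()

        def index_batch():
            nonlocal active, peak
            with counter_lock:
                active += 1
                peak = max(peak, active)
            time.sleep(work_seconds)  # stand-in for real indexing work
            with counter_lock:
                active -= 1

        # Leaving the with-block waits for all submitted batches to finish.
        with ThreadPoolExecutor(max_workers=self.max_threads) as pool:
            for _ in range(batch_count):
                pool.submit(index_batch)
        return peak


def rebuild(strategy_limit):
    """Simulate a strategy that initialises the singleton before the config is applied."""
    SharedIndexingScheduler.reset()
    SharedIndexingScheduler.initialise(strategy_limit)  # the strategy gets there first
    scheduler = SharedIndexingScheduler.initialise(CONFIGURED_MAX_THREADS)  # ignored
    return scheduler.max_threads, scheduler.run_batches(30)


if __name__ == "__main__":
    print("wrong limit (limit, peak):  ", rebuild(HYPOTHETICAL_WRONG_LIMIT))
    print("correct limit (limit, peak):", rebuild(CONFIGURED_MAX_THREADS))
```

With the wrong limit in place, the peak concurrency never rises above it, however many cores or configured threads are available, which is consistent with the traces showing mostly idle threads.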
Ask for bug fix #285903 from Sitecore support if you are affected by this, so your config settings don’t get overwritten.