Improving the Sitecore Solr Content Search Provider

Tue, Nov 26, 2019 in Development, Sitecore using tags Solr, Sitecore

Back in February this year I gave a presentation at the London Sitecore User Group titled “Improving the Solr Content Search Provider”.

It was my first talk at a Sitecore User Group, although I regularly attend the Sitecore discussion club and have presented there on several occasions.

Known bugs

At the presentation I included a link to a GitHub repository with a list of known issues and their public reference numbers.

The list of issues will vary between Sitecore versions. I’ve included the issues I’ve come across from Sitecore 8.2 up to Sitecore 9.0 update 2.

Including:

  • Partially rebuilt index going live - 96016 - an IIS recycle switches the alias even though the indexing job was cancelled/incomplete - fixed in Sitecore 9 initial release
  • OptimizeOnRebuildEnabled setting not used - was in bug fix 96016 - introduced in Sitecore 9 initial release
  • Patch for IsSolrAliveAgent to update SolrState and process reinitialisation correctly - 163850.171950
  • Incorrect data being indexed if the ContentSearch.Indexing.DisableDatabaseCaches setting is set to true - 96740.127177.155383
  • If IndexAllFields=false, the IncludedFields are indexed as string values - new bug in Sitecore 9 - 252532
  • Index rebuild slowdown after Sitecore 9 upgrade.

Common Patterns for distributed computing

The GitHub repository also included an example implementation of patterns to address some of the fallacies of distributed computing.

  • Circuit Breaker Pattern
    • Don’t make every request wait for a timeout exception; use a circuit breaker to fail fast, so the site can handle load in a degraded scenario.
  • Not swallowing errors
    • We want to know the difference between no results, and an error.
    • The circuit breaker can’t work if we don’t know what’s an error.
    • If we don’t know the difference, we might cache an empty result when results do exist but just couldn’t be retrieved.
  • Shorter timeouts for queries
    • Again, we want queries to fail fast, and for the circuit breaker to kick in. But we can allow longer for indexing operations to take place, especially as there is no retry mechanism on the crawling/indexing side.
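
As a rough illustration of the first two patterns, here is a minimal sketch in Python (chosen only for brevity; the repository itself targets Sitecore/.NET, and this is not its code). The names `CircuitBreaker`, `SearchUnavailableError` and `search_with_cache` are hypothetical, and the threshold and cool-down values are assumptions. Errors are surfaced as exceptions so they are never mistaken for, or cached as, an empty result.

```python
import time


class SearchUnavailableError(Exception):
    """Raised when the search backend could not be queried (distinct from 'no results')."""


class CircuitBreaker:
    """Fails fast after repeated errors, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold      # assumed value, tune per site
        self.reset_after_seconds = reset_after_seconds  # assumed value, tune per site
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                # Open: fail immediately instead of waiting for another timeout.
                raise SearchUnavailableError("circuit open, failing fast")
            # Cool-down elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            # Surface the error; never turn it into an empty result set.
            raise SearchUnavailableError(str(exc)) from exc
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def search_with_cache(breaker, cache, key, query):
    """Caches genuine results (including empty ones) but never caches a failure."""
    if key in cache:
        return cache[key]
    results = breaker.call(query)  # raises SearchUnavailableError on failure
    cache[key] = results
    return results
```

Because a failed query raises rather than returning an empty list, the caller can decide to show an error or stale content instead of caching “no results”.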

Comparison of the other options out of the box

  • Sitecore Query
  • Sitecore Fast Query
  • Links Database

Related blog posts cover the reasons not to use each of them, at least on Content Delivery, or when querying over lots of items.

A list of other alternatives

Examples of Customisations

Changing MyItems & Unlock all to use Solr, rather than a Sitecore Query.

I told you Sitecore.Query was slow and memory intensive.

Since giving this presentation

In fact, since giving this talk, one of the issues that got me to customise the Sitecore Solr Search provider in the first place reared its head again: Sitecore not being able to start up when Solr is down.

I thought we’d fixed that already, so why can’t Sitecore start up when Solr is down anymore?

Well, the number of Sitecore indexes had grown over time.

On startup each index was doing its own check to see if Solr was available, and waiting for a timeout/exception. This was happening in series. Depending on the type of error you get, if it’s a timeout, this can result in Sitecore taking longer to start up than IIS allows, causing IIS to recycle the application and try again, in an infinite loop. Oh dear!
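
To make the arithmetic concrete, here is a small sketch in Python. The function names are hypothetical, and the numbers (15 indexes, a 30 second timeout, a 90 second startup limit) are illustrative assumptions, not values measured on our site.

```python
def serial_startup_delay(index_count, timeout_seconds):
    """Worst-case delay when every index waits for its own timeout, one after another."""
    return index_count * timeout_seconds


def max_indexes_before_recycle(startup_limit_seconds, timeout_seconds):
    """Largest number of serial timeouts that still fits inside the startup limit."""
    return int(startup_limit_seconds // timeout_seconds)


# Illustrative numbers only (assumptions, not measured values).
index_count = 15
timeout_seconds = 30
startup_limit_seconds = 90

print(serial_startup_delay(index_count, timeout_seconds))                  # 450
print(max_indexes_before_recycle(startup_limit_seconds, timeout_seconds))  # 3
```

With these assumed figures, anything beyond three indexes timing out in series would push startup past the limit, which is how a growing number of indexes can turn a working setup into a recycle loop.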

So I’ve added a new issue to the GitHub repository: 314454 - Sitecore doesn’t startup when Solr Down.

Simulating Solr being down/latency

You can simulate latency to Solr on your local PC using Fiddler. First, change your Solr connection string to

http://localhost.:8983/solr;solrCloud=true

Notice the “.” after localhost, which ensures the traffic goes through Fiddler.

Change your Sitecore IIS Application Pool to run as your local user, as Fiddler automatically picks up traffic running as your local user. (Or you can change the proxy settings)

Then run Fiddler.

Under the “AutoResponder” tab, you can add a rule for requests matching

http://localhost.:8983/solr/.....

The URL you set could be specific to a particular index, or request, depending on what you want to test.

Then on the response for that URL you can set

*delay:10000

where the number is how many milliseconds you want to delay the request by.

Using this technique you can simulate a network timeout, rather than the quicker “port not listening”/“service not running” error.

From this you can replicate the issue of Sitecore not being able to start up, if you have enough indexes which all time out on startup.

What’s in 314454 - Sitecore doesn’t startup when Solr Down

Sitecore doesn’t issue individual patches anymore; instead you get a hotfix, which includes all of the issue fixes for the version of Sitecore you are on.

For Sitecore 9.0 update 2, this was included in hotfix 323662-1.

  1. If you have enough indexes, the timeouts on initialisation from each index, run in sequence, can result in Sitecore not starting up in the allowed time.
  2. Retry logic for SolrCloud aliases.
  3. Retry logic for initialising Indexes
  4. Exception handling in IsOnline index check.
  5. Initialisation of indexes is not interrupted if Solr is unavailable; it initialises what it can and retries later.

And it included changes to these DLLs:

  • Sitecore.ContentSearch.Client.dll
  • Sitecore.ContentSearch.Data.dll
  • Sitecore.ContentSearch.dll
  • Sitecore.ContentSearch.Linq.dll
  • Sitecore.ContentSearch.Linq.Lucene.dll
  • Sitecore.ContentSearch.LuceneProvider.dll
  • Sitecore.ContentSearch.SolrNetExtension.dll
  • Sitecore.ContentSearch.SolrProvider.dll

There were so many changes it felt like a mini Sitecore upgrade. And the contract changed, so we had to rewrite our customisations for them to continue to work.

This hotfix fixes the slow startup issue by only checking once whether Solr is available, rather than once per index. This makes Sitecore much faster to start up when Solr is down.

There was also some work to make the initialisation of the indexes run in parallel, to further speed up initialisation.
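
The general shape of that approach (one availability check, parallel initialisation, and retrying later what failed) can be sketched as follows. This is Python for illustration only and is not the hotfix’s actual code; `initialise_indexes`, `is_solr_available` and `initialise_index` are hypothetical names, and the worker count is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def initialise_indexes(index_names, is_solr_available, initialise_index, max_workers=4):
    """Returns (initialised, pending_retry) after one availability check for all indexes."""
    if not is_solr_available():
        # One failed check covers every index; no index waits for its own timeout.
        return [], list(index_names)

    def try_init(name):
        try:
            initialise_index(name)
            return name, True
        except Exception:
            # Do not abort the other indexes; this one is retried later.
            return name, False

    # Initialise the indexes in parallel; map() keeps the input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(try_init, index_names))

    initialised = [name for name, ok in outcomes if ok]
    pending_retry = [name for name, ok in outcomes if not ok]
    return initialised, pending_retry
```

A scheduled job could then call the same function again with only the `pending_retry` names until the list is empty.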

Customisations and Upgrades

Each time we upgrade, we have to see if our customisations are still possible. Often the extension points/hooks we have used have gone, and we have to find new ones.

Even on this update to the hotfix which addresses 314454, we had to do a lot of rework.

Hopefully in a future version the extension points we require will be included in the product, and we won’t have to get out dotPeek to find a place to override, or use reflection and so many custom classes and overrides to change the behaviour to what we need.

Update the repository examples

As the hotfix isn’t publicly available in the NuGet feeds, and the contract has changed, I can’t update the example code to reference the hotfix and have it still compile.

If anyone is interested in seeing the examples updated to use the hotfix, reach out to me on Twitter, and I can see about creating a feature branch.

Otherwise, I’m going to wait for a version of Sitecore that includes these fixes/contract changes and is available in the NuGet feed, so I can update the examples and still have them compile.