Excluding configured paths from Sitecore Index

We recently had a requirement on a project where the site node contained multiple microsites, but the client wanted the main site's search to return only pages from the main site and none from the microsites.

Sitecore's search index configuration makes it easy to include or exclude templates, and to choose which fields go into the index, in the <documentOptions> section. However, there is currently no out-of-the-box way to exclude items from the index based on their path in the Sitecore tree.

To enable this, we wrote a custom crawler that decides whether to include or exclude an item based on a configured list of paths to ignore.

Please note: with this solution every item is still crawled. The exclusion only determines whether a crawled item is added to the index.

We then updated the crawler configuration so that the excluded paths are configurable (they point to the microsite nodes in our case).

Note: Snippets here are from a Sitecore 10 instance. 

<?xml version="1.0" encoding="utf-8"?> 

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:role="http://www.sitecore.net/xmlconfig/role/" xmlns:search="http://www.sitecore.net/xmlconfig/search/"> 
  <sitecore role:require="Standalone or ContentManagement" search:require="solr"> 
    <contentSearch> 
      <configuration type="Sitecore.ContentSearch.ContentSearchConfiguration, Sitecore.ContentSearch"> 
        <indexes hint="list:AddIndex"> 
          <index id="site_master_index" type="Sitecore.ContentSearch.SolrProvider.SolrSearchIndex, Sitecore.ContentSearch.SolrProvider"> 
            <param desc="name">$(id)</param> 
            <param desc="core">site_master_index</param> 
            <param desc="propertyStore" ref="contentSearch/indexConfigurations/databasePropertyStore" param1="$(id)" /> 
            <strategies hint="list:AddStrategy"> 
              <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/manual" role:require="ContentManagement and !Indexing" /> 
              <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/intervalAsyncMaster" role:require="Standalone or (ContentManagement and Indexing)" /> 
            </strategies> 
            <locations hint="list:AddCrawler"> 
              <crawler type="Site.Website.Infrastructure.Search.Crawler.ExcludePathsItemCrawler, Site.Website.Infrastructure"> 
                <Database>master</Database> 
                <Root>/sitecore/content/home</Root> 
                <ExcludeItemsList hint="list"> 
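                  <!-- Full item paths: the crawler compares these against the item's full path, which starts at /sitecore -->
                  <ChicagoMetro>/sitecore/content/home/chicago-metro</ChicagoMetro>
                  <MemphisEast>/sitecore/content/home/memphis-east</MemphisEast>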
                </ExcludeItemsList> 
              </crawler> 
            </locations> 
          </index> 
        </indexes> 
      </configuration> 
    </contentSearch> 
  </sitecore> 
</configuration> 

The important part of this index configuration is the crawler element and its ExcludeItemsList.

Here is the code that reads this section and uses the paths to conditionally include or exclude items. We override IsExcludedFromIndex, the method the default crawler uses to check whether an item is excluded:

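using Sitecore.ContentSearch;
using Sitecore.Diagnostics;
using System;
using System.Collections.Generic;
using System.Linq;

namespace Site.Website.Infrastructure.Search.Crawler
{
    public class ExcludePathsItemCrawler : SitecoreItemCrawler
    {
        // Populated from the <ExcludeItemsList hint="list"> element of the crawler configuration.
        public List<string> ExcludeItemsList { get; } = new List<string>();

        protected override bool IsExcludedFromIndex(SitecoreIndexableItem indexable, bool checkLocation = false)
        {
            Assert.ArgumentNotNull(indexable, nameof(indexable));

            // Exclude the item if its full path starts with any configured path
            // (case-insensitive, since Sitecore item paths are not case-sensitive);
            // otherwise fall back to the default exclusion rules.
            return ExcludeItemsList.Any(path => indexable.AbsolutePath.StartsWith(path, StringComparison.OrdinalIgnoreCase))
                   || base.IsExcludedFromIndex(indexable, checkLocation);
        }
    }
}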

That’s it! Sitecore's configuration factory fills the public ExcludeItemsList property from the configuration element of the same name, so it can be used in the code as is.
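One thing to watch with a plain prefix check is that a configured path such as /sitecore/content/home/chicago-metro would also match a sibling item named chicago-metro-west. The following is a minimal sketch of a segment-aware comparison, not part of the original solution: PathExclusion and IsUnderAny are hypothetical names, and the helper works on plain strings so it can be tried outside Sitecore.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not a Sitecore API.
public static class PathExclusion
{
    // Returns true when itemPath is one of the excluded roots or lies beneath one.
    // Matching is done on whole path segments, ignoring case.
    public static bool IsUnderAny(string itemPath, IEnumerable<string> excludedRoots)
    {
        if (string.IsNullOrEmpty(itemPath))
        {
            return false;
        }

        return excludedRoots
            .Where(root => !string.IsNullOrWhiteSpace(root))
            // Normalize "/a/b/" to "/a/b" so both forms behave the same.
            .Select(root => root.TrimEnd('/'))
            .Any(root =>
                itemPath.Equals(root, StringComparison.OrdinalIgnoreCase) ||
                itemPath.StartsWith(root + "/", StringComparison.OrdinalIgnoreCase));
    }
}
```

Assuming the configured values are full item paths, the crawler's override could call PathExclusion.IsUnderAny(indexable.AbsolutePath, ExcludeItemsList) in place of the inline StartsWith check.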

You can use the same approach to customize the crawler in other ways, beyond path constraints, and make it as configurable as you’d like.
