Content Enrichment in SharePoint 2013

Posted by Veena Sarda Articles | SharePoint 2013 October 29, 2012
This article describes content enrichment in SharePoint 2013: data cleansing, entity extraction, and classification and tagging.

Searching for a particular word or phrase often yields hundreds or even thousands of results. The user can narrow the results with standard refiners like Location, People, Document Type, Year and so on. Adding more context-specific refiners helps the user reduce the result set further. SharePoint 2013 lets developers add such refiners by modifying the managed properties of crawled items before they are indexed, through a callout to an external content enrichment web service.

The content enrichment web service is a SOAP-based service that a developer creates to receive a callout from the web service client inside the content processing component, as shown in the search architecture figure below.

Figure 1: SharePoint 2013 Search Architecture

Content Enrichment Web Service Callout

A web service callout lets us receive the managed properties that are being set for the item currently being crawled. Once those managed properties arrive in your web service, you can modify their values and send them back into the content pipeline. This is useful when the managed properties need specific processing before they reach the index.

The ability to modify managed properties for items during content processing is helpful when performing tasks such as:

  • Data Cleansing
  • Entity Extraction
  • Classification and tagging

Data Cleansing

Here is an example of when data cleansing is useful. In my company some people use the abbreviation TCS and others write Tata Consultancy Services in documents (Word, Excel and PowerPoint). The search UI then shows two refiners, because the two values are treated as separate companies. Suppose I want a single refiner, Tata Consultancy Services, that filters documents using either term. This is possible in SharePoint 2013: I can receive the Company Name managed property in the web service callout, change the value TCS to Tata Consultancy Services, and send it back to the content pipeline so that the modified value is indexed for the Company Name property.
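The cleansing step itself can be sketched independently of SharePoint. The following is a minimal sketch, not the article's implementation: the class name CompanyNameCleanser, the method Normalize and the alias table are hypothetical, and the SharePoint types are deliberately left out so the code compiles on its own. In a real callout, logic like this would run inside the web service on the value of the Company Name input property.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps known aliases of a company name to one canonical form.
public static class CompanyNameCleanser
{
    // Illustrative alias table; keys are matched case-insensitively.
    private static readonly Dictionary<string, string> Aliases =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "TCS", "Tata Consultancy Services" }
        };

    // Returns the canonical name for a known alias, otherwise the trimmed input.
    // Null or whitespace-only input is returned unchanged.
    public static string Normalize(string companyName)
    {
        if (string.IsNullOrWhiteSpace(companyName))
        {
            return companyName;
        }

        string trimmed = companyName.Trim();
        string canonical;
        return Aliases.TryGetValue(trimmed, out canonical) ? canonical : trimmed;
    }
}
```

Calling CompanyNameCleanser.Normalize on "TCS", " tcs " or "Tata Consultancy Services" yields the same value, so all three spellings end up under one refiner. Keeping the aliases in a table means new abbreviations can be added without touching the lookup code.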

Entity Extraction

FAST Search supported entity extraction by extending the content processing pipeline. You would create a custom property extractor that automatically extracts entities or concepts from the visible text content of an item and maps them to a managed property.

For example, if I am dealing with data from a Finance sub site, I would like to add custom refiners such as Deal Price Range and Deal Status (Win, Loss, In-Progress), but that metadata is not available in SharePoint at this point. In SharePoint 2013 we can create a new managed property called Deal Status and build a web service that populates its values based on our custom business logic. The Deal Status managed property can then be used as a refiner on the search results page.
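As an illustration of what such custom business logic might look like, here is a minimal sketch that extracts a deal status from item text. Everything in it is an assumption made for the example: the class name DealStatusExtractor, the Extract method, and the rule that the first occurrence of Win, Loss or In-Progress as a whole word decides the status. Real logic would depend on how deals are actually described in your documents.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical extractor: finds the first deal status keyword in a piece of text.
public static class DealStatusExtractor
{
    // Canonical spellings used as refiner values.
    private static readonly string[] Statuses = { "Win", "Loss", "In-Progress" };

    // Whole-word, case-insensitive match on any of the status keywords.
    private static readonly Regex StatusPattern = new Regex(
        @"\b(Win|Loss|In-Progress)\b",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the canonical status found in the text, or null when none is present.
    public static string Extract(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return null;
        }

        Match match = StatusPattern.Match(text);
        if (!match.Success)
        {
            return null;
        }

        // Map the matched text back to its canonical casing.
        foreach (string status in Statuses)
        {
            if (string.Equals(status, match.Value, StringComparison.OrdinalIgnoreCase))
            {
                return status;
            }
        }

        return null;
    }
}
```

A plain keyword match like this will produce false positives on ordinary prose, so in practice the rule would probably be tightened, for example by requiring a label such as "Deal Status:" in front of the keyword.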

Classification and Tagging

Let's say you are surfacing Indian classical music in your SharePoint search page. In addition to the tags normally associated with music, such as Artist Name, Collection, Date and Category, you want to classify each item and add tags for a suitable listening time such as Early Morning, Morning, Afternoon, Late Afternoon, Evening, Late Evening and Night. You can do so in SharePoint 2013 by extending the content pipeline. The following are the details of how to build and configure such a web service.

The steps to build and configure the Content Enrichment Web Service Callout

The following five steps are required to build and configure the web service callout:

  1. Create the managed property in Central Administration.
  2. Create a WCF web service:

    • Add a reference to Microsoft.Office.Server.Search.ContentProcessingEnrichment.dll

    • Implement IContentProcessingEnrichmentService

  3. Configure the trigger conditions.
  4. Configure the callout web service endpoint address and the input and output managed properties.
  5. Execute a full crawl.

Create the managed property in Central Administration

Navigate to Central Administration -> Search Service Application -> Search Schema -> New Managed Property.

  • Property name: ListenTime
  • Type: Text
  • Searchable: checked
  • Queryable: checked
  • Retrievable: checked
  • Refinable: Yes - active
  • Token Normalization: checked

Create the web service

Create a WCF Service Application and add a reference to the Microsoft.Office.Server.Search.ContentProcessingEnrichment assembly.

Implement the IContentProcessingEnrichmentService interface. Its ProcessItem method receives each crawled item as an Item and returns the result as a ProcessedItem.

In ProcessItem, read the AlbumTitle property, look its value up in the Dictionary object shown below, and populate the new ListenTime property with the matching value.

private Dictionary<string, string> raagListenTime = new Dictionary<string, string>()
{
    // Raag name (as it appears in AlbumTitle) -> suitable listening time
    { "Ahir Bhairav", "Early Morning" },
    { "Gujari Todi", "Morning" },
    { "Todi", "Morning" },
    { "Maand", "Afternoon" },
    { "Bhimpalasi", "Afternoon" },
    { "Shadja", "Twilight" },
    { "Thumri", "Night" },
    { "Dhun", "Evening" }
};

This dictionary can keep growing. It is only an example of mapping an input property to a corresponding output property; you can write your own custom logic. For details, see "How to: Use the Content Enrichment web service callout for SharePoint Server" on MSDN.
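The lookup from album title to listening time might look like the sketch below. It is a standalone illustration under stated assumptions: the class name ListenTimeClassifier and the method Classify are hypothetical, the match is a case-insensitive substring search (the article only says the value is compared with the table), and the SharePoint Item and ProcessedItem types are omitted so the code compiles without the SharePoint assembly. Inside the real service, ProcessItem would pass the AlbumTitle value to a method like this and write the result to ListenTime.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical classifier: derives a listening time from an album title.
public static class ListenTimeClassifier
{
    // Ordered table of raag name -> listening time. An array keeps the order
    // explicit, so longer names such as "Gujari Todi" are tested before "Todi".
    private static readonly KeyValuePair<string, string>[] RaagListenTime =
    {
        new KeyValuePair<string, string>("Ahir Bhairav", "Early Morning"),
        new KeyValuePair<string, string>("Gujari Todi", "Morning"),
        new KeyValuePair<string, string>("Todi", "Morning"),
        new KeyValuePair<string, string>("Maand", "Afternoon"),
        new KeyValuePair<string, string>("Bhimpalasi", "Afternoon"),
        new KeyValuePair<string, string>("Shadja", "Twilight"),
        new KeyValuePair<string, string>("Thumri", "Night"),
        new KeyValuePair<string, string>("Dhun", "Evening")
    };

    // Returns the listening time of the first raag whose name occurs in the
    // album title (case-insensitive), or null when no raag name is found.
    public static string Classify(string albumTitle)
    {
        if (string.IsNullOrWhiteSpace(albumTitle))
        {
            return null;
        }

        foreach (KeyValuePair<string, string> entry in RaagListenTime)
        {
            if (albumTitle.IndexOf(entry.Key, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return entry.Value;
            }
        }

        return null;
    }
}
```

Returning null for unknown titles lets the caller decide whether to leave ListenTime empty or assign a default. If album titles are known to contain only the raag name, an exact dictionary lookup would be simpler and avoid accidental substring matches.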

Configure Trigger Conditions and other properties

To minimize the performance impact of the web service callout, it should be called only under certain conditions. These conditions are defined in the Trigger property, which is set using PowerShell commands. The managed properties sent to and returned by the service are configured via the InputProperties and OutputProperties properties.

Execute Full Crawl

Launch the service created earlier and run a full crawl. Once the crawl is complete, the ListenTime managed property should be populated and searchable. You can also modify the refinement panel to add this new property.

Error handling is configurable: when the web service call fails for an item, the failure can be treated either as a warning or as an error.

Note that the web service client works with managed properties that you configure as input properties or as output properties. Input properties are managed properties that are sent to the web service; output properties are managed properties returned by the web service.
