Azure Search is one of the more powerful offerings in Azure. It lets you build comprehensive search features for enterprise applications, features that are hard to build on your own with a traditional approach when the requirements are complex and call for scale and performance. After implementing a couple of search solutions with Azure Search, I wanted to write this post to share my view of what a search implementation looks like in a production scenario, along with some key architectural and design considerations you need to be aware of. The official documentation is a good place to get started, and you may want to read it first at https://azure.microsoft.com/en-us/services/search to get a better understanding of the service before proceeding.
After reading the documentation, you might have realized that it contains a lot of good information and examples that help you understand the service, set up a demo and make it work quickly. The sample apps are still hello-world apps, though, and the actual design of a real solution needs more care; unfortunately that part is either not covered or not simple enough. Unless you have prior experience designing similar solutions, you can easily end up with a complex design that is difficult to maintain and extend.
A search service consists of three components:
- Data Source – refers to the underlying data storage and contains the connection information. It can be one of the Azure data storage options (Azure SQL, Cosmos DB, Table Storage, Blob Storage).
- Indexer – refers to the scheduling part, that is, how often you want to refresh the index from the data source.
- Index – refers to the index schema and the collection of documents that searches are performed against.
You can create more than one of each component within the same search service.
Note that creating and querying an index can be done in one of the following three ways.
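To make the three components concrete, here is a minimal sketch of the JSON definitions the REST API accepts for each, written as Python dictionaries. The names (`products-ds`, `products-index`, `products-indexer`), the field list and the placeholder connection string are illustrative assumptions, not values from a real service, and `check_wiring` is a hypothetical helper that only checks the definitions reference each other correctly.

```python
# Hypothetical definitions used only for illustration.
DATA_SOURCE = {
    "name": "products-ds",
    "type": "azuresql",  # other types include "cosmosdb", "azuretable", "azureblob"
    "credentials": {"connectionString": "<connection string>"},
    "container": {"name": "Products"},  # table, collection or blob container
}

INDEX = {
    "name": "products-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "name", "type": "Edm.String", "searchable": True},
        {"name": "category", "type": "Edm.String",
         "filterable": True, "facetable": True},
    ],
}

INDEXER = {
    "name": "products-indexer",
    "dataSourceName": DATA_SOURCE["name"],
    "targetIndexName": INDEX["name"],
    "schedule": {"interval": "PT15M"},  # ISO 8601 duration: every 15 minutes
}


def check_wiring(data_source, index, indexer):
    """Return a list of problems; an empty list means the definitions line up."""
    problems = []
    if indexer["dataSourceName"] != data_source["name"]:
        problems.append("indexer points at an unknown data source")
    if indexer["targetIndexName"] != index["name"]:
        problems.append("indexer points at an unknown index")
    key_fields = [f["name"] for f in index["fields"] if f.get("key")]
    if len(key_fields) != 1:
        problems.append("index needs exactly one key field")
    return problems
```

The indexer is the piece that ties the other two together, which is why it carries both names.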
- Azure portal – You can use it for initial setup, experiments or troubleshooting, but mostly you are going to rely on one of the other options.
- .NET SDK – This will be your primary option for building the layer that constructs and executes the query expression, with options for filters, highlights, scoring profile, facets, search fields, etc.
- REST API – You can do everything with it that you can do with the SDK. Read the Index Builder section below for more details about why it is the best option for managing index creation.
It is a good idea to separate the whole solution into three sets of components, because the nature of those processes and the developer skills needed to build them are slightly different, and they need different levels of attention.
1. Index Builder
However early you try to finalize the index schema, you will end up recreating the index many times during development for various reasons, so you need a tool that lets you recreate it as often as necessary. You will find such a tool very useful and productive. It is also likely that your organization or department will want to build more than one search solution; search capabilities are typically needed for products, customers, service requests, and so on. Believe me, you will thank yourself later for creating a generic index builder, as a console or Windows tool, that takes the index schema from a configuration file and creates the index components in the target search service. That design naturally leads you to the REST API, since the tool simply submits the schema configuration to the target REST endpoint. Keeping this process separate also eliminates the risk of someone breaking the index schema while working on the query process.
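As a rough illustration of such a builder, the sketch below turns a JSON schema (as it would be read from a configuration file) into a REST request that creates or updates the index. The service name, key and schema are placeholders, `build_index_request` is a hypothetical helper, and the `api-version` value is an assumption; use whichever version your service supports. The request is only constructed here, not sent.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"  # assumption: adjust to the version you target


def build_index_request(service_name, api_key, schema_json):
    """Build (but do not send) the PUT request that creates or updates an index."""
    schema = json.loads(schema_json)
    if not schema.get("name"):
        raise ValueError("index schema must have a 'name'")
    url = (
        f"https://{service_name}.search.windows.net/indexes/"
        f"{schema['name']}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(schema).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


# Usage (placeholder values); pass the result to urllib.request.urlopen to submit it:
# request = build_index_request("my-service", "<admin key>", open("index.json").read())
```

Because the schema is just data, the same tool can create the products, customers and service request indexes from different configuration files.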
2. Data Import
The data import process is going to be unique to each solution anyway. Still, the incremental update process may end up coupled with the other components, because it needs the SDK to access and update documents one at a time. If there is a need to share common code, such as models and wrappers around SDK methods, you can move it into a NuGet package and consume that. If your search is going to have complex logic and a high volume of traffic, it is better to manage incremental updates through a separate API, in line with the CQRS principle.
Incremental update refers to the process that updates the index and the underlying data source when a record is created, updated or deleted at the source. You might ask why you should update the index manually when the indexer already updates it on a schedule. The shortest interval you can set for an indexer is 15 minutes, so you need an additional process if updates must be reflected in the index immediately.
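For illustration, an incremental update can be expressed as a batch of document actions posted to the index's `docs/index` REST endpoint. The sketch below only builds that payload; `build_index_batch`, the key field name `id` and the sample documents are assumptions made for the example.

```python
def build_index_batch(upserts, deleted_keys, key_field="id"):
    """Build the body for POST /indexes/{index}/docs/index.

    upserts      -- documents to insert, or merge into an existing document
    deleted_keys -- key values of documents to remove from the index
    """
    actions = [{"@search.action": "mergeOrUpload", **doc} for doc in upserts]
    actions += [{"@search.action": "delete", key_field: key} for key in deleted_keys]
    return {"value": actions}


batch = build_index_batch(
    upserts=[{"id": "42", "name": "Laptop", "category": "laptops"}],
    deleted_keys=["17"],
)
```

`mergeOrUpload` suits this scenario because the caller does not need to know whether the document is already in the index.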
3. Query Index
You need to stand up an API in front of the index. It exposes a fine-grained interface, as a secured REST service, that accepts scoped parameters for search, filters, scoring profile, facets, highlights, etc. and executes well-formed expressions against the target index using the SDK. You could find a library online or build a reusable one, distributed through NuGet, to generalize how the expressions are formed, as this logic is highly likely to get complex very soon. The same library can be reused in other search solutions as well. Consider designing and building this API as the core search service for its specific domain, so that it can potentially serve the whole organization.
At the time of writing, the following are open challenges for which it is important to plan mitigation steps. Some of them are among the most requested features and are still being reviewed; hopefully they get implemented soon.
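The expression-building logic can be sketched as a small function that maps scoped inputs to a search request body. The version below is written in Python against the REST request shape rather than the .NET SDK, supports only string equality filters, and uses hypothetical field names; it is meant to show the idea, not a complete query layer.

```python
def escape_odata(value):
    """Escape single quotes for use inside an OData string literal."""
    return value.replace("'", "''")


def build_search_body(text, filters=None, facets=None, highlight_fields=None,
                      scoring_profile=None, search_fields=None, top=20):
    """Build the body for POST /indexes/{index}/docs/search."""
    body = {"search": text or "*", "top": top, "count": True}
    if filters:
        # Only string equality is handled in this sketch.
        clauses = [
            f"{field} eq '{escape_odata(str(value))}'"
            for field, value in sorted(filters.items())
        ]
        body["filter"] = " and ".join(clauses)
    if facets:
        body["facets"] = list(facets)
    if highlight_fields:
        body["highlight"] = ",".join(highlight_fields)
    if scoring_profile:
        body["scoringProfile"] = scoring_profile
    if search_fields:
        body["searchFields"] = ",".join(search_fields)
    return body


query = build_search_body(
    "laptop",
    filters={"category": "laptops", "brand": "O'Neil"},
    facets=["brand"],
    highlight_fields=["name"],
)
```

Keeping this translation in one place means callers never assemble filter strings themselves, which is where escaping mistakes tend to creep in.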
Deployment – Currently 1) there is no support for slot or swap features, 2) you cannot update an existing index schema directly, 3) you cannot scale up or down between pricing tiers for the same index. When one of these scenarios arises, you have to create new index components, switch to them by updating the production configuration files, and reload the data all over again. That means an outage that could last many hours or days, depending on the overall size of the data. So plan to start the deployment ahead of time: rebuild the index, load the data, and keep it ready for the switch.
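One way to prepare the switch ahead of time is to give indexes versioned names and keep the active name in configuration. The sketch below assumes a `-vN` naming convention and a configuration dictionary with an `activeIndex` key; both are hypothetical choices for illustration, not something the service requires.

```python
import re


def next_index_name(current):
    """Return the next versioned name, e.g. 'products-v3' -> 'products-v4'."""
    match = re.fullmatch(r"(.+)-v(\d+)", current)
    if not match:
        return f"{current}-v2"
    base, version = match.group(1), int(match.group(2))
    return f"{base}-v{version + 1}"


def switch_active_index(config, new_index, loaded_count, expected_count):
    """Return updated configuration, refusing to switch to a partially loaded index."""
    if loaded_count < expected_count:
        raise RuntimeError(
            f"{new_index} has {loaded_count} of {expected_count} documents"
        )
    updated = dict(config)
    updated["previousIndex"] = config["activeIndex"]  # kept for rollback
    updated["activeIndex"] = new_index
    return updated
```

The old index stays in place until the new one is fully loaded, so the switch itself is only a configuration change.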
Recovery – Because recreating an index is very time consuming, it is important to make sure the recovery process is well defined. Limit production access, don’t put index “drop” logic in any code, and manage production configuration in a DevOps tool rather than in source control to decrease the chances of someone accidentally deleting the index components.
Complex data types – Though Azure data storage technologies support complex data types, index columns don’t. So it is important to store a flattened data structure as much as possible, or to use field mappings to transform a source column into a target column that holds delimiter-separated values, with a custom tokenizer/analyzer to handle complex scenarios.
Best Practices / Tips
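As an example of flattening before load, the sketch below collapses nested objects into underscore-joined column names and joins lists of simple values with a delimiter. The `|` delimiter, the naming scheme and the sample record are assumptions; lists of nested objects are not handled.

```python
def flatten(record, parent="", sep="_", list_delimiter="|"):
    """Flatten nested dicts into a single-level dict suitable for index columns."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep, list_delimiter))
        elif isinstance(value, list):
            # Lists of scalars become one delimiter-separated string.
            flat[name] = list_delimiter.join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


customer = {
    "id": "1",
    "address": {"city": "Oslo", "geo": {"country": "NO"}},
    "tags": ["vip", "retail"],
}
flat_customer = flatten(customer)
```

Whatever delimiter you choose has to match the one the tokenizer/analyzer on the target column splits on.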
- Flatten the underlying data and keep it as simple and straightforward as possible, so that the transformation and query logic does not get complex.
- Generalize the logic that turns search inputs into a valid expression so that it does not become complex too quickly.
- Null values in a column will not appear as an option in the facets returned with search results. If you expect to filter on them, store a constant in place of null during data load.
- Use Search Explorer for the index in the Azure portal, and Storage Explorer for NoSQL databases, when troubleshooting, to determine whether an issue lies with the underlying data or with the logic you have built.
- Filters are case sensitive, so you have to store and filter using the same case. If that affects display, you may need additional columns to store the transformed values.
- Use replicas for high availability; they store multiple copies of the same index and load-balance queries across them.
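The tips about null values and case sensitivity can both be applied in one step during data load. The sketch below is one possible approach: the `unknown` sentinel, the `_filter` column suffix and the field names are hypothetical conventions chosen for the example.

```python
UNKNOWN = "unknown"  # hypothetical constant stored in place of null


def prepare_for_index(doc, facet_fields, filter_fields):
    """Replace nulls in facet columns and add lowercase copies for filtering."""
    prepared = dict(doc)
    for field in facet_fields:
        if prepared.get(field) in (None, ""):
            prepared[field] = UNKNOWN
    for field in filter_fields:
        value = prepared.get(field)
        if isinstance(value, str):
            # The original column keeps its display casing.
            prepared[f"{field}_filter"] = value.lower()
    return prepared


row = prepare_for_index(
    {"id": "1", "brand": None, "category": "Laptops"},
    facet_fields=["brand"],
    filter_fields=["category"],
)
```

The query side then lowercases the incoming filter value and targets the `_filter` column, while results still display the original value.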
Repeated from previous sections:
- As reindexing (recreating the index) takes a considerable amount of time, add a process for deploying the new index ahead of time.
- Remember that updating an index or scaling up/down means creating a new index and updating configuration files.
- Segregate responsibilities – Develop independent processes for building the index, querying the index and loading data.
Thanks for reading the article; I hope it was useful. If you have any questions or feedback, please feel free to leave your comments below.