
Solr – Tokenizers:

In this tutorial, we will learn about tokenizers, another important concept in Solr. A tokenizer breaks the data in a field into a number of tokens. It reads the data as a continuous character stream and splits it at specified points, such as whitespace or delimiters.

There are different tokenizers, such as the Standard Tokenizer, Keyword Tokenizer, Classic Tokenizer, and Letter Tokenizer.

Example:

<fieldType name="text" class="solr.TextField">
 <analyzer>
 <tokenizer class="solr.StandardTokenizerFactory"/>
 </analyzer>
</fieldType>

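The above example uses the Standard Tokenizer, which splits text using whitespace and punctuation as delimiters. The delimiter characters are discarded. A period that is not followed by whitespace stays inside the token, which is why "abcd.com" is kept whole in the output below, while "@" acts as a delimiter.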

Input text: "Please, send email to abcd@abcd.com by 10-10-2017."
Output tokens: "Please", "send", "email", "to", "abcd", "abcd.com", "by", "10", "10", "2017"
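
To make the tokenization above concrete, here is a rough Python sketch that reproduces it with a regular expression. It is only an approximation that happens to fit this sentence: Solr's Standard Tokenizer follows Unicode text segmentation rules rather than a regex, and the function name `approx_standard_tokenize` is hypothetical.

```python
import re

# Assumption: a simplified stand-in for Solr's Standard Tokenizer, not its
# real algorithm. A token is a run of ASCII letters/digits, optionally joined
# by single dots (so "abcd.com" stays together); every other character,
# including "@", "-", "," and whitespace, acts as a delimiter and is dropped.
_TOKEN = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")

def approx_standard_tokenize(text):
    """Return the tokens found in text, in order of appearance."""
    return _TOKEN.findall(text)

print(approx_standard_tokenize("Please, send email to abcd@abcd.com by 10-10-2017."))
# ['Please', 'send', 'email', 'to', 'abcd', 'abcd.com', 'by', '10', '10', '2017']
```

The trailing period after "2017" is dropped because the pattern only keeps a dot when a letter or digit follows it.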

Below is the list of Tokenizers:

• Standard Tokenizer
• Classic Tokenizer
• Keyword Tokenizer
• Letter Tokenizer
• Lower Case Tokenizer
• N-Gram Tokenizer
• Edge N-Gram Tokenizer
• ICU Tokenizer
• Path Hierarchy Tokenizer
• Regular Expression Pattern Tokenizer
• Simplified Regular Expression Pattern Tokenizer
• Simplified Regular Expression Pattern Splitting Tokenizer
• UAX29 URL Email Tokenizer
• White Space Tokenizer
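
A few of the simpler tokenizers in this list can be contrasted with a short Python sketch. These functions only mimic the general behavior on plain ASCII text. They are not Solr code, and all function names are hypothetical.

```python
import re

def whitespace_tokenize(text):
    # White Space Tokenizer: split on whitespace only; punctuation is kept.
    return text.split()

def keyword_tokenize(text):
    # Keyword Tokenizer: the whole input becomes a single token.
    return [text]

def letter_tokenize(text):
    # Letter Tokenizer: tokens are runs of letters; everything else is dropped.
    # Assumption: ASCII letters only, to keep the sketch short.
    return re.findall(r"[A-Za-z]+", text)

def lower_case_tokenize(text):
    # Lower Case Tokenizer: like the Letter Tokenizer, then lowercase each token.
    return [token.lower() for token in letter_tokenize(text)]

def path_hierarchy_tokenize(path, delimiter="/"):
    # Path Hierarchy Tokenizer: emit every ancestor prefix of the path.
    parts = path.split(delimiter)
    prefixes = []
    for i in range(1, len(parts) + 1):
        prefix = delimiter.join(parts[:i])
        if prefix:  # skip the empty prefix produced by a leading delimiter
            prefixes.append(prefix)
    return prefixes

sample = "Hello, World-2017!"
print(whitespace_tokenize(sample))   # ['Hello,', 'World-2017!']
print(keyword_tokenize(sample))      # ['Hello, World-2017!']
print(letter_tokenize(sample))       # ['Hello', 'World']
print(lower_case_tokenize(sample))   # ['hello', 'world']
print(path_hierarchy_tokenize("/usr/local/apache"))
# ['/usr', '/usr/local', '/usr/local/apache']
```

Running the same input through each function shows why the choice of tokenizer matters: the tokens that end up in the index determine which queries can match the field.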