Solr – Indexing:
In this tutorial, we will learn about Indexing in Solr.
Indexing in Solr means adding content to a Solr index so that the content becomes searchable. A Solr index can take in data from several sources: XML and CSV files, tables in a database, and rich document formats such as Microsoft Word and PDF.
There are four ways of indexing data into Solr:
- Rich documents, meaning binary files such as Microsoft Word and PDF, can be indexed using Solr Cell, which is built on Apache Tika.
- General XML and CSV files can be indexed by sending HTTP requests to the Solr server. Alternatively, in the Solr admin user interface we can select the document type and submit the file's contents for indexing.
- We can also use the Java client API (SolrJ) to build a Java application that ingests the data.
- Table data from a database can be indexed by configuring the schema and solrconfig.xml files.
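The HTTP route from the list above can be sketched with curl. This is a minimal sketch, not a definitive recipe: it assumes Solr is listening on localhost at port 8983, that a core named filminfo already exists, and that the file holds a JSON array of documents. The function names solr_update_url and index_json_file are hypothetical helpers introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build the update URL for a core or collection.
# Host and port default to localhost:8983 (an assumption, adjust as needed).
solr_update_url() {
  core="$1"
  host="${2:-localhost}"
  port="${3:-8983}"
  # commit=true asks Solr to commit right after the update.
  printf 'http://%s:%s/solr/%s/update?commit=true\n' "$host" "$port" "$core"
}

# Hypothetical helper: send a JSON file of documents to the update handler.
# Usage: index_json_file <core> <file>
index_json_file() {
  curl -sS -H 'Content-Type: application/json' \
       --data-binary "@$2" "$(solr_update_url "$1")"
}

# Print the URL that would be used for the filminfo core.
solr_update_url filminfo

# Example call (requires a running Solr instance, so it is left commented out):
# index_json_file filminfo example/films/films.json
```

Keeping the URL construction in its own function makes it easy to check the target before any data is sent.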
The Post Tool is a command-line utility shipped with Solr that can post different types of content to a Solr server. It can be used in both Linux and Windows environments.
On Windows systems, we can run the post tool through its .jar file, as shown below:
java -jar /example/exampledocs/post.jar -h
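With the .jar form, settings are passed as Java system properties rather than as command-line flags. As a rough sketch, the helper below only prints a post.jar command line so that it can be reviewed before it is run. The function name post_jar_cmd is hypothetical, the relative jar path assumes the command is run from the Solr installation directory, and the core name filminfo is taken from the examples in this tutorial.

```shell
#!/bin/sh
# Hypothetical helper (not part of Solr): print the post.jar command line
# for a given core and list of files. Nothing is sent to Solr.
post_jar_cmd() {
  core="$1"
  shift
  # -Dc names the target collection/core; -Dauto=yes asks the tool to
  # guess each file's content type from its extension.
  printf 'java -Dc=%s -Dauto=yes -jar example/exampledocs/post.jar' "$core"
  for f in "$@"; do
    printf ' %s' "$f"
  done
  printf '\n'
}

post_jar_cmd filminfo example/films/films.json
```

The printed line can then be copied into a Windows command prompt or a UNIX terminal once the path and core name have been checked.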
On UNIX systems, we can run the tool from the terminal as shown below:
bin/post -c filminfo example/films/films.json
In the above command, the collection or core name (given with the -c option) is mandatory. You can check the full usage with either of the commands below:
bin/post -help
bin/post -h
Use the command below to index only the files with the .xml extension into the collection or core named filminfo, through port 8983.
bin/post -c filminfo -p 8983 *.xml
If you want to index only CSV files, use the command below.
bin/post -c filminfo -p 8983 *.csv
In the same way, if you want to index only JSON files, use the command below.
bin/post -c filminfo -p 8983 *.json
Use the example below to index rich documents such as Microsoft Word, PDF, and HTML files.
bin/post -c filminfo filming.pdf
The tool can also index all the documents within a directory, as shown below.
bin/post -c filminfo filmfolder/
If you want to index only ppt files from that directory, use the command below.
bin/post -c filminfo -filetypes ppt filmfolder/