Many web apps provide some sort of full-text search for their content, and many also let users upload and share files. However, most of them index only the metadata of the uploaded files and ignore what the files actually contain. This is frustrating for a user who searches for a word they know exists in a file and gets no results.
In this article I’m going to show you how you can implement this kind of feature in Rails using Elasticsearch.
What is Elasticsearch?
Elasticsearch is a distributed, open source search server. It offers real-time search and scales easily using replicas. It’s easy to get started with because Elasticsearch is schemaless: you only have to pass it a typed JSON document and it will be indexed automatically, with field types determined by the server. You can also define your own mappings to set boost levels, analyzers, and types.
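To make "a typed JSON document" a bit more concrete, here is a minimal Ruby sketch. The field names (title, page_count, published) are hypothetical and only there for illustration. Each value carries a JSON type (string, number, boolean), and the server can infer the field types from those.

```ruby
require "json"

# Serializes a Ruby hash into the JSON body of an index request.
# The hash keys used below are illustrative assumptions, not fields
# from the article.
def build_index_payload(document)
  JSON.generate(document)
end

payload = build_index_payload({
  title: "Quarterly report", # JSON string
  page_count: 12,            # JSON number
  published: true            # JSON boolean
})

puts payload
```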
It’s easy to install Elasticsearch on a Mac using Homebrew:
Note the instructions Homebrew prints on how to start and stop Elasticsearch.
If you’re using another operating system, or if you don’t use Homebrew, you can follow the instructions here to install Elasticsearch.
The attachment type plugin
To index the actual file content in Elasticsearch, we need to install the attachment type plugin. This plugin lets us index many types of files, for example Microsoft Office formats, PDFs, open document formats, ePub, and HTML. The full list of supported file formats can be found here.
This is how you install the plugin:
The Tire gem exposes an easy-to-use domain-specific language for communicating with Elasticsearch. It integrates with your ActiveModel/ActiveRecord classes for convenient use in your Rails app.
The Carrierwave gem is used to upload files from your Rails app.
Create the ActiveRecord model
This will give us one Document model and an uploader called DocumentAttachment.
The uploader should look something like this:
We can leave the DocumentAttachment class as is. This version stores files directly on disk, but Carrierwave also supports Amazon S3 and other cloud services if you want to use those instead.
Now it’s time to set up the uploader and integrate our Document model with Elasticsearch:
The first thing we do in the model is mount the uploader on the document_attachment attribute. After that we include Tire to make the integration with Elasticsearch work.
After that we set up a mapping block, which describes what we index and how we store our data in Elasticsearch. Note that we exclude the attachment at the beginning of the block. We do this because we don’t want to store the actual content of the file in the index, since that would make the index grow very quickly. This doesn’t mean the content won’t be indexed, however: it remains searchable, it just isn’t stored. The other lines in the block specify which fields we want to index and what type each should be. The attachment type is not available in Elasticsearch by default; we got it by installing the attachment type plugin.
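As a rough illustration of what such a mapping amounts to, the sketch below builds the raw mapping as a plain Ruby hash and prints it as JSON. It does not use Tire. The type name document and the id and title fields are assumptions made for the example (only attachment comes from the description above), and the "string" type reflects the Elasticsearch versions of the Tire era as far as I know.

```ruby
require "json"

# Builds a raw Elasticsearch mapping as a plain hash.
# Assumptions: the type name "document" and the fields "id" and
# "title" are hypothetical; only "attachment" is named in the text.
def document_mapping
  {
    document: {
      # Keep the base64 file content out of the stored _source.
      _source: { excludes: ["attachment"] },
      properties: {
        id:         { type: "integer" },
        title:      { type: "string" },
        # Provided by the attachment type plugin, not by core Elasticsearch.
        attachment: { type: "attachment" }
      }
    }
  }
end

puts JSON.pretty_generate(document_mapping)
```

The excludes entry is what keeps the index from ballooning: the field is still analyzed and searchable, but its raw value is left out of the stored document.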
Next we define an attachment method. Elasticsearch requires the file content to be sent as a base64-encoded string, so the method takes care of the encoding. Finally, we define the method that returns the JSON posted to Elasticsearch and specify that the result of the attachment method should be included in it.
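The two methods can be sketched in plain Ruby, outside of Rails, roughly like this. DocumentSketch is a hypothetical stand-in for the ActiveRecord model, and its title attribute and path argument are assumptions. I have named the JSON method to_indexed_json because, as far as I recall, that is the hook Tire calls, but treat that as an assumption too.

```ruby
require "base64"
require "json"

# Plain-Ruby stand-in for the Document model (hypothetical class).
# In the real model the path would come from the mounted uploader.
class DocumentSketch
  attr_reader :title, :path

  def initialize(title, path)
    @title = title
    @path = path
  end

  # Returns the file content as a base64 string, or nil if there is no file.
  def attachment
    return nil unless path && File.file?(path)

    # Read in binary mode so PDFs and Office files are passed through untouched.
    Base64.strict_encode64(File.binread(path))
  end

  # Returns the JSON that would be posted to Elasticsearch,
  # including the encoded attachment.
  def to_indexed_json
    JSON.generate({ title: title, attachment: attachment })
  end
end
```

strict_encode64 is used here because it produces a single line without embedded newlines, which keeps the JSON payload compact.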
Now we can play with our Document model in the Rails console:
The word consulting is present in my cv.pdf file and as you can see, the document is now returned in the search result!
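For readers curious about what a simple search like this looks like on the wire, here is a hedged sketch that builds, but does not send, a query string search request using only the Ruby standard library. It assumes Elasticsearch is listening on its default port 9200 on localhost and that the index is named documents. Both are assumptions, and build_search_request is a hypothetical helper, not part of Tire.

```ruby
require "json"
require "net/http"
require "uri"

# Builds (without sending) a search request for a single word.
# Assumptions: default host/port and an index called "documents".
def build_search_request(word, base_url: "http://localhost:9200", index: "documents")
  uri = URI("#{base_url}/#{index}/_search")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate({ query: { query_string: { query: word } } })
  [uri, request]
end

uri, request = build_search_request("consulting")
puts "POST #{uri}"
puts request.body

# To run the search against a live server:
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

Whether the file content is matched by a query like this depends on how the fields are mapped, so treat the request body as a starting point.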
Things to be aware of
Both file uploads and content indexing are time-consuming, so you should seriously consider doing as much of the work as possible in the background. There’s a gem for Carrierwave that handles this, called carrierwave_backgrounder, and the documentation for the Tire gem has a section about background processing.