Using Elasticsearch with Python

How can we add full-text search to a relational database or a Django application with Python? In most cases, the answer is Elasticsearch.

The first thing to do before adding a new dependency, especially a big one like Elasticsearch, is to ask yourself whether you really need such a powerful and heavy tool for your use case. Maybe you can build the project without Elasticsearch and add it later; in most cases that is the most sensible approach. In this post I will begin by showing how simple it is to do basic text search in a relational database with Django.

Text Search in Relational Databases

Text search capabilities in relational databases are limited. If you are an avid Elasticsearch user, you know how much more powerful a search-focused product is compared to a relational database. For example, there is no easy way in a relational database to match a search for "Miracle" against "The MiraKle". With relational databases, most of the time our only option is to search for sections with a 100% match.

We can call .filter with __contains and __icontains to search for text; the i stands for "case insensitive". Let's suppose we have a model called Product and we want to search it by name:

Product.objects.filter(name__icontains="some product name")


But that is a very simple and limited query. With it we can only check whether name contains a certain string, and that's it.

If you are using PostgreSQL and really plan to stick with it, you can use its built-in full-text search (exposed in Django through django.contrib.postgres) for more flexible search capabilities.
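As a sketch of what that looks like, here is a hypothetical helper (the function name postgres_search is mine) that ranks results with SearchVector, SearchQuery, and SearchRank, which ship with Django in django.contrib.postgres.search. It assumes the same Product model used below and a PostgreSQL database:

```python
def postgres_search(term):
    """Rank Products against a search term with Postgres full-text search.

    Hypothetical sketch: assumes the Product model from this post and a
    PostgreSQL backend with django.contrib.postgres enabled.
    """
    from django.contrib.postgres.search import (
        SearchQuery, SearchRank, SearchVector,
    )
    # Build a text vector from the columns we want searchable
    vector = SearchVector("name", "description")
    query = SearchQuery(term)
    return (
        Product.objects
        .annotate(rank=SearchRank(vector, query))
        .filter(rank__gt=0)       # drop non-matching rows
        .order_by("-rank")        # best matches first
    )
```

Unlike __icontains, this handles stemming and relevance ordering, which already covers many use cases without any extra infrastructure.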

Django pluggable search backends

Haystack lets you easily use multiple search backends with a Django application, without having to worry too much about the specific implementations. Combined with Docker Compose, it allows us to quickly try backends before investing time in one specific search engine.

As we know, the simpler the solution, the less customization it usually offers. That is also true of the extra overhead of Haystack's simplification layer: you lose a LOT of the flexibility that the underlying search engines originally provide.

Another side effect of this simple approach is that you will be forced to use old versions of your search engine. For example, at the time of writing, the most recent version of Elasticsearch you can use is 5.x, and that version reaches end of life in five months. Therefore I wouldn't recommend Haystack for new developments.

The only production-ready and flexible way to use Elasticsearch is the official elasticsearch-py library.

How to use elasticsearch-py

elasticsearch-py is a low-level library; nevertheless, it's easier to use than Haystack. All the official Elasticsearch documentation and examples can be translated and implemented without any mental overhead. If you think about how Elasticsearch works, it is basically sending and receiving JSON, and this library just facilitates that process. Also, because it's officially supported, you always have access to the latest features.

The first step before we can query any data is the creation of an index:

from elasticsearch import Elasticsearch

es = Elasticsearch(ES_CONNECTION['URL'])

def create_index(index_name: str) -> bool:
    if es.indices.exists(index_name):
        return False
    return es.indices.create(index_name, {
        "settings": {
            "index": {
                "number_of_shards": 3,
                "number_of_replicas": 1
            }
        }
    })['acknowledged']


As you can see, we talk to this library mostly by passing dicts. Continuing with the Product example, this is how we add one product to an index:

def upsert(product: Product):
    body = {
        "doc_as_upsert": True,
        "doc": {
            "text": " ".join([
                product.name,
                product.description,
                product.meta_description
            ]),
            "recurring_percentage": float(product.recurring_percentage),
            "commission": float(product.commission),
            "score": int(product.score)
        }
    }
    return es.update(index=PRODUCT_INDEX_NAME, id=product.pk, body=body)


That function creates or updates a Product by using doc_as_upsert. doc contains the "main document", the object of interest we want to store in the index. I noticed that most people use a text field to store the main search term. Here text is a mix of the name, description, and meta description; this way we can do a real full-text search using any relevant term for the product. Aside from the text field, we can also add any other field we want to search and/or order results by.
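Because the payload is a plain dict, the body construction can be factored into a helper that is trivially unit-testable without a live cluster. build_upsert_body is a hypothetical name; the fields mirror the upsert example above:

```python
def build_upsert_body(name, description, meta_description,
                      recurring_percentage, commission, score):
    """Build the doc_as_upsert payload used by the upsert example."""
    return {
        "doc_as_upsert": True,
        "doc": {
            # full-text field: every relevant term for the product
            "text": " ".join([name, description, meta_description]),
            "recurring_percentage": float(recurring_percentage),
            "commission": float(commission),
            "score": int(score),
        },
    }
```

Separating the dict construction from the network call also makes it easy to assert on exactly what gets sent to the cluster.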

Now it is time to search for some products. In the following example I use some custom objects like ProductSearchQuery and ProductSearchResult, but you shouldn't worry about them; they only hold values. While you read the code, look for the following bits of information:

• filter_path is used to select which attributes Elasticsearch should return.
• size and from are used to paginate the results.
• text contains the main text search query, and that is probably what you will be most interested in examining.
def search(search_query: ProductSearchQuery) -> ProductSearchResult:
    filter_path = ["hits.total.value", "hits.hits._id",
                   "hits.hits._source.id"]
    body = {
        "query": {
            "match": {
                "text": {
                    "query": search_query.text,
                    "minimum_should_match": "50%",
                    "fuzziness": "AUTO"
                }
            }
        },
        "sort": search_query.es_sort_attrs(),
        "from": search_query.page * search_query.page_size,
        "size": search_query.page_size
    }
    try:
        es_res = es.search(index=PRODUCT_INDEX_NAME,
                           filter_path=filter_path,
                           body=body)
        total = es_res["hits"]["total"]["value"]
        ids = [int(hit["_id"]) for hit in es_res["hits"]["hits"]]
        return ProductSearchResult(
            search_query=search_query,
            product_ids=ids,
            total=total
        )
    except KeyError:
        return ProductSearchResult(search_query=search_query)
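The from/size arithmetic is worth spelling out, since off-by-one errors here are common. A minimal sketch of the pagination math (es_pagination and total_pages are my names, not part of the code above), assuming zero-based page numbers as in the search function:

```python
import math

def es_pagination(page, page_size):
    """Translate a zero-based page number into ES "from"/"size" values."""
    return {"from": page * page_size, "size": page_size}

def total_pages(total_hits, page_size):
    """Number of result pages needed for a given hit count."""
    return math.ceil(total_hits / page_size)
```

For example, page 2 with a page size of 10 skips the first 20 hits, and 21 total hits fill 3 pages.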


Another significant concern is consistency between the relational database and Elasticsearch. We want to use Elasticsearch for advanced full-text search capabilities, but we still use the relational database for transactional operations.

We can sync ES with our database using Django Signals:

import django.db.models.signals as orm_signals

def orm_upsert_product_callback(sender, instance=None, **kwargs):
    upsert_product(instance)

def orm_delete_product_callback(sender, instance=None, **kwargs):
    delete_product(instance)

orm_signals.post_save.connect(orm_upsert_product_callback, sender=Product)
orm_signals.post_delete.connect(orm_delete_product_callback, sender=Product)


Testing Elasticsearch with unittest

We always need to test our code and its interaction with every new dependency. So, how do we test Elasticsearch?

It turns out that Elasticsearch is in many cases slow to reflect our changes in its responses. That forces us to wait a few seconds until Elasticsearch updates itself and fully reflects the changes we made. How do we wait without falling prey to sleep? By using the backoff Python library.

Using that library we can retry our failing tests until they succeed:

from elasticsearch.exceptions import RequestError
import backoff

def retry(func):
    @backoff.on_exception(backoff.constant,
                          (AssertionError, RequestError),
                          max_time=5)
    def inner(*args, **kwargs):
        return func(*args, **kwargs)
    return inner


retry is a decorator we can use to re-run our tests multiple times until they no longer throw an AssertionError or RequestError. In action, that looks something like this:

def test_simple_search(self):
    product1 = Product.objects.create(
        name='product',
        description='incredible'
    )
    product2 = Product.objects.create(
        name='product 2',
        description='magnificent'
    )
    product3 = Product.objects.create(
        name='cat',
        description='super'
    )

    @retry
    def test():
        res = ProductSearch.search(ProductSearchQuery('product'))
        self.assertEqual(res.total, 2)
        self.assertCountEqual(res.products_ids, [product1.id, product2.id])
        self.assertFalse(res.pages)
        self.assertTrue(res.products)
        self.assertTrue(res.serialized_products)
    test()

    @retry
    def test():
        res = ProductSearch.search(ProductSearchQuery('super'))
        self.assertEqual(res.total, 1)
        self.assertCountEqual(res.products_ids, [product3.id])
        self.assertFalse(res.pages)
        self.assertEqual(res.products[0], product3)
        self.assertTrue(res.serialized_products)
    test()
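If you would rather not add the extra dependency, the same constant-retry idea fits in a few lines of standard-library Python. This is a hedged sketch, not the backoff API: simple_retry is a hypothetical name, and the max_time and exception tuple mirror the backoff-based retry used in these tests:

```python
import functools
import time

def simple_retry(func, max_time=5.0, interval=0.5,
                 exceptions=(AssertionError,)):
    """Re-run func until it stops raising, or until max_time elapses."""
    @functools.wraps(func)
    def inner(*args, **kwargs):
        deadline = time.monotonic() + max_time
        while True:
            try:
                return func(*args, **kwargs)
            except exceptions:
                if time.monotonic() >= deadline:
                    raise  # give up: re-raise the last failure
                time.sleep(interval)
    return inner
```

The backoff library adds jitter, other wait strategies, and logging hooks on top of this basic loop, which is why it is usually worth the dependency.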


These tests should run inside a django.test.TestCase subclass to get all the goodies of Django. Those goodies include that our tests run inside a transaction that is rolled back after each unit test method. It makes sense for us to do something similar while working with Elasticsearch.

Because Elasticsearch doesn't, and will not, support transactions, we should rely on deleting everything. To keep this simple, we can just delete and recreate the search indexes for every test. More specifically:

• We should avoid modifying any data inside the setUpClass or tearDownClass class methods. This is because Django creates two levels of database transactions: one for the whole class, and another for each method.
• We set up and delete whatever we want only inside the setUp and tearDown methods. This way we simulate a rollback without changing the TestCase rollback semantics we are used to.
def setUp(self):
    create_all_indexes()

def tearDown(self):
    delete_all_indexes()
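A possible shape for those two helpers, written against a client passed in as a parameter so they can be exercised with a stub. The names, the ALL_INDEXES list, and the "products" entry are assumptions for illustration; ignore=404 is an elasticsearch-py option that makes the delete idempotent when the index doesn't exist yet:

```python
ALL_INDEXES = ["products"]  # hypothetical list of index names

def create_all_indexes(client, names=ALL_INDEXES):
    """Create every search index that doesn't exist yet."""
    for name in names:
        if not client.indices.exists(name):
            client.indices.create(name)

def delete_all_indexes(client, names=ALL_INDEXES):
    """Drop every search index; 404s are ignored so this is idempotent."""
    for name in names:
        client.indices.delete(name, ignore=404)
```

Taking the client as an argument (instead of importing the module-level es) keeps the helpers easy to test with a fake client and avoids touching a real cluster in unit tests.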


That is the real reason you see the data initialization inside the test_simple_search test method.