This article is synchronized and updated to xLog by Mix Space
For the best browsing experience, it is recommended to visit the original link
https://www.do1e.cn/posts/code/algolia-search
## Algolia Search Configuration Method
The mx-space documentation contains a detailed configuration tutorial; the process should be similar for other blog frameworks.
## Index Size Limit
Unfortunately, after configuring according to the documentation, an error was reported in the log:
```
16:40:40 ERROR [AlgoliaSearch] Algolia push error
16:40:40 ERROR [Event] Record at the position 10 objectID=xxxxxxxx is too big size=12097/10000 bytes. Please have a look at
https://www.algolia.com/doc/guides/sending-and-managing-data/prepare-your-data/in-depth/index-and-records-size-and-usage-limitations/#record-size-limits
```
The cause of the error is clear: one of my blog posts is too long, and Algolia's free plan allows at most 10 KB per record. For someone like me who wants to stay on the free tier, paying is not an option, so I immediately started thinking about a workaround.
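Before pushing, you can reproduce the size check locally. A small sketch (the 10,000-byte constant comes from the error message above; treating the UTF-8 length of the serialized JSON as the record size is an approximation of Algolia's accounting):

```typescript
// Free-plan limit per record, taken from the error above (size=.../10000)
const RECORD_LIMIT = 10_000;

// Byte length of the serialized record: Algolia stores the record as JSON,
// so the UTF-8 length of JSON.stringify is a close proxy for its size.
function recordSize(record: object): number {
  return new TextEncoder().encode(JSON.stringify(record)).length;
}

function fitsFreePlan(record: object): boolean {
  return recordSize(record) <= RECORD_LIMIT;
}
```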
## Solution
### Idea
For mx-space, after configuring the API Token, you can fetch the JSON to submit to the Algolia index manually from `/api/v2/search/algolia/import-json`.
This contains a list of posts, pages, and notes, with example data as follows:
```json
{
  "title": "Nanjing University IPv4 Address Range",
  "text": "# Motivation\n\n<details>\n<summary>The motivation comes from the web page built. Due to both internal and public networks....",
  "slug": "nju-ipv4",
  "categoryId": "abcdefg",
  "category": {
    "_id": "abcdefg",
    "name": "Others",
    "slug": "others",
    "id": "abcdefg"
  },
  "id": "1234567",
  "objectID": "1234567",
  "type": "post"
}
```
The `objectID` is crucial; it must be unique among the records submitted to Algolia.
The idea I came up with was to split articles with long text into multiple records, modifying the `objectID` of each chunk. Simple, right?! (Clearly, I didn't realize the seriousness of the problem at the time.)
Additionally, some of my pages contain `<style>` and `<script>` tags, which can also be removed directly with a regex.
Thus, I wrote the following Python code to edit the JSON downloaded from the above endpoint and submit it to Algolia.
```python
from algoliasearch.search.client import SearchClientSync
import requests
import json
import math
from copy import deepcopy
import re

MAXSIZE = 9990
APPID = "..."
APPKey = "..."
MXSPACETOKEN = "..."

url = "https://www.do1e.cn/api/v2/search/algolia/import-json"
headers = {
    "Authorization": MXSPACETOKEN,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0",
}
ret = requests.get(url, headers=headers)
ret = ret.json()
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(ret, f, ensure_ascii=False, indent=2)

to_push = []

def json_length(item):
    """Byte length of the serialized record, as counted against the limit."""
    content = json.dumps(item, ensure_ascii=False).encode("utf-8")
    return len(content)

def right_text(text):
    """True if the byte slice is valid UTF-8 (no multi-byte char cut in half)."""
    try:
        text.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def cut_json(item):
    length = json_length(item)
    text_length = len(item["text"].encode("utf-8"))
    # Text bytes allowed per chunk = MAXSIZE - (length - text_length),
    # i.e. the limit minus the record's non-text overhead
    n = math.ceil(text_length / (MAXSIZE - length + text_length))
    start = 0
    text_content = item["text"].encode("utf-8")
    for i in range(n):
        new_item = deepcopy(item)
        new_item["objectID"] = f"{item['objectID']}_{i}"
        # The last chunk takes everything that is left
        end = text_length if i == n - 1 else start + text_length // n
        # Back up until the slice decodes cleanly
        # (UTF-8 characters span up to 4 bytes; Chinese ones take 3)
        while not right_text(text_content[start:end]):
            end -= 1
        new_item["text"] = text_content[start:end].decode("utf-8")
        start = end
        to_push.append(new_item)

for item in ret:
    # Remove style and script tags
    item["text"] = re.sub(r"<style.*?>.*?</style>", "", item["text"], flags=re.DOTALL)
    item["text"] = re.sub(r"<script.*?>.*?</script>", "", item["text"], flags=re.DOTALL)
    if json_length(item) > MAXSIZE:  # Exceeds the limit, split it
        print(f"{item['title']} is too large, cut it")
        cut_json(item)
    else:  # Within the limit, but still suffix objectID for consistency
        item["objectID"] = f"{item['objectID']}_0"
        to_push.append(item)

with open("topush.json", "w", encoding="utf-8") as f:
    json.dump(to_push, f, ensure_ascii=False, indent=2)

client = SearchClientSync(APPID, APPKey)
resp = client.replace_all_objects("mx-space", to_push)
print(resp)
```
If you are using another blog framework, this should be enough to provide you with some ideas.
Great! After rebuilding the search index with Python and re-submitting it to Algolia, enable the search function in the mx-space backend and try searching for the over-limit post, JPEG Encoding Details. Why are there no results? And why is there an error in the backend again?
```
17:03:46 ERROR [Catch] Cast to ObjectId failed for value "1234567_0" (type string) at path "_id" for model "posts"
    at SchemaObjectId.cast (entrypoints.js:1073:883)
    at SchemaType.applySetters (entrypoints.js:1187:226)
    at SchemaType.castForQuery (entrypoints.js:1199:338)
    at cast (entrypoints.js:159:5360)
    at Query.cast (entrypoints.js:799:583)
    at Query._castConditions (entrypoints.js:765:9879)
    at Hr.Query._findOne (entrypoints.js:768:4304)
    at Hr.Query.exec (entrypoints.js:784:5145)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Promise.all (index 0)
```
## Let's Edit the mx-space Code
From the above log, it is easy to see that mx-space casts the `objectID` returned by Algolia to a Mongo `ObjectId` and looks the document up by `_id` rather than by `id`, so suffixed IDs like `1234567_0` no longer resolve. Locate this part in the code and modify it accordingly.
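The change boils down to stripping the chunk suffix before the `ObjectId` cast. A minimal sketch of the idea (the helper name is my own, not the actual mx-space diff):

```typescript
// Strip the "_<chunk index>" suffix added during splitting, so that
// "1234567_0" maps back to the real document id "1234567" before the
// backend casts it to a Mongo ObjectId.
function toDocumentId(objectID: string): string {
  return objectID.replace(/_\d+$/, "");
}
```

With this in place, every chunk of a split post resolves to the same underlying document.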
## Further Improvements
However, this is still not very elegant: it requires running the Python script periodically to push data to Algolia. Since I had already started modifying the mx-space code, why not integrate the splitting directly? Fortunately, various AI assistants helped me quickly get up to speed with a programming language I wasn't very familiar with.
Add the following code after `buildAlgoliaIndexData()` in `/apps/core/src/modules/search/search.service.ts`, with logic similar to the Python above:
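The actual patch isn't reproduced here, but the splitting logic can be sketched as self-contained TypeScript mirroring the Python script (the names `SearchRecord` and `splitRecord` are my own, not mx-space's):

```typescript
interface SearchRecord {
  objectID: string;
  text: string;
  [key: string]: unknown;
}

const MAXSIZE = 9990; // leave a little headroom under the 10 KB record limit

const encoder = new TextEncoder();

function byteLength(s: string): number {
  return encoder.encode(s).length;
}

// Split one record into chunks whose serialized size stays under MAXSIZE,
// suffixing objectID with the chunk index, like the Python script above.
function splitRecord(item: SearchRecord): SearchRecord[] {
  const total = byteLength(JSON.stringify(item));
  if (total <= MAXSIZE) {
    return [{ ...item, objectID: `${item.objectID}_0` }];
  }
  const textBytes = encoder.encode(item.text);
  const overhead = total - textBytes.length;
  const budget = MAXSIZE - overhead; // text bytes allowed per chunk
  const n = Math.ceil(textBytes.length / budget);
  const strict = new TextDecoder("utf-8", { fatal: true });
  const chunks: SearchRecord[] = [];
  let start = 0;
  for (let i = 0; i < n; i++) {
    let end =
      i === n - 1 ? textBytes.length : start + Math.floor(textBytes.length / n);
    // Back up until the slice is valid UTF-8 (no multi-byte char cut in half)
    let text: string | null = null;
    while (text === null) {
      try {
        text = strict.decode(textBytes.subarray(start, end));
      } catch {
        end--;
      }
    }
    chunks.push({ ...item, objectID: `${item.objectID}_${i}`, text });
    start = end;
  }
  return chunks;
}
```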
Rebuild the Docker image, and then switch back to the official image, and it will be fine!
However, the original code also defined three types of events (add, delete, modify) that trigger pushing a single record. I was too lazy to change those, so I simply moved the decorator (is that what it's called in TypeScript? I only know it as a Python term) to `pushAllToAlgoliaSearch`.
## A Little Interlude
While editing the code, I found that it already truncates records that exceed a limit. However, the limit was set to 100 KB, which suggests the developer is a paid user. Personally, I think this would be better exposed as an environment variable rather than hardcoded.
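The suggestion amounts to something like the following (the variable name `ALGOLIA_TRUNCATE_LIMIT` is my invention, not an existing mx-space option):

```typescript
// Hypothetical sketch: read the per-record truncation limit from an
// environment variable, falling back to the previously hardcoded 100 KB.
function truncateLimit(env: Record<string, string | undefined>): number {
  return Number(env["ALGOLIA_TRUNCATE_LIMIT"] ?? 100_000);
}

// usage: truncateLimit(process.env as Record<string, string | undefined>)
```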
https://github.com/mx-space/core/blob/20a1eef/apps/core/src/modules/search/search.service.ts#L370
2024/12/21 update: the author has since made the truncation limit configurable, but I still prefer splitting into multiple records, since that preserves full-text search.
https://github.com/mx-space/core/commit/6da1c13799174e746708844d0b149b4607e8f276