What I built (2) — leveraging different pretrained language models to create an NLP search

Intention and background

For a query like “can I get an egg sushi”, can the machine possibly relate it to “tamago sushi” (“tamago” is Japanese for egg)?

What I tried

  • Load the language models, have them process some dish names, and see how the results cluster
  • Load all dishes into a Faiss index, then try to query with a dish name or a sentence that contains dish name(s)

How it works

What a language model does is take a word sequence (a sentence is considered a long word sequence) and process it into a vector.

Some NLP basics — how a word sequence becomes one vector (feel free to skip this whole section if you already know this)

The “simplest” pipeline looks like this:

Tokenize

This is simple for languages like English, which have the advantage that words are separated by spaces: “mango chocolate milk tea” becomes “mango | chocolate | milk | tea”. Other languages like Cantonese (the language Hong Kong people speak) can be more challenging: “芒果朱古力奶茶” becomes “芒果 | 朱古力 | 奶 | 茶”, which obviously needs some linguistic knowledge to help the “cutting” (yes, tokenizing is cutting a sequence into tokens), as sketched below.
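As a tiny sketch of the difference (jieba here is just one common Chinese tokenizer picked for illustration; each pretrained model below ships its own tokenizer):

import jieba

print("mango chocolate milk tea".split())
# ['mango', 'chocolate', 'milk', 'tea']
print(jieba.lcut("芒果朱古力奶茶"))
# e.g. ['芒果', '朱古力', '奶', '茶']; the exact cut depends on the dictionary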

Embed each token into vector space

NLP runs on computers, and computers process words differently than humans do: a human sees a token and associates it with meanings in our amazing brain, while for a computer we try to “project” the symbol (the token) into some numbers (a vector).

0.2 | 0.1 | 0.6  => Mango
0.1 | 0.1 | 0.8  => Orange
0.1 | 0.1 | 0.82 => Tangerine
0.5 | 0.7 | 0.1  => Milk
0.0 | 0.9 | 0.0  => Water
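In code, this embedding step is essentially a lookup from token to vector. A toy version with the made-up numbers above:

token_embeddings = {
    "Mango":     [0.2, 0.1, 0.6],
    "Orange":    [0.1, 0.1, 0.8],
    "Tangerine": [0.1, 0.1, 0.82],
    "Milk":      [0.5, 0.7, 0.1],
    "Water":     [0.0, 0.9, 0.0],
}
print(token_embeddings["Mango"])  # [0.2, 0.1, 0.6]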

Run the per-token vectors through a model to create a final vector for the sequence

After we have a vector for each token, we run the sequence through a language model to capture the meaning of the whole sequence. People might ask: can we just average the vectors of the words, like the following?

Mango Milk = ( (0.2 | 0.1 | 0.6) + (0.5 | 0.7 | 0.1) )/ 2 => (0.35 | 0.4 | 0.35)
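You can compute this naive average in a line of NumPy (a sketch with the toy numbers above). Note that a plain average ignores word order and context; capturing those is what running the sequence through the language model is for.

import numpy as np

# Averaging the toy "Mango" and "Milk" vectors from the table above
mango = np.array([0.2, 0.1, 0.6])
milk = np.array([0.5, 0.7, 0.1])
print((mango + milk) / 2)  # [0.35 0.4  0.35]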

What pretrained models I used

All the pretrained models I tried are trained on multiple languages, as my test cases are expected to be multilingual.

Reference for getting to know more about the models:

https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html

Results (the resulting vector clusters)

Important note: the original vectors mostly have dimension ≥ 512. I used PCA to project them into 2D space, so the clustering in the plots below may sometimes be misleading.

(top row) SBert-mUSE, SBert-xlm-paraphrase, SBert-xlm-stsb; (middle row) SBert-dist-bert-stsb, LaBSE, mUSE; (bottom row) LASER, Concat, Concat normalized
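For reference, the 2D projection takes a couple of lines with scikit-learn (a sketch; embeddings stands in for the (n_dishes, d) array produced by one of the models, randomized here so the snippet runs):

import numpy as np
from sklearn.decomposition import PCA

# Placeholder for real model output: n_dishes x d, with d >= 512
embeddings = np.random.rand(1000, 512).astype("float32")
# Project to 2D for plotting; distances and clusters can distort, hence the caveat above
points_2d = PCA(n_components=2).fit_transform(embeddings)  # shape (1000, 2)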

Results of searching among 1000+ dishes (with the help of Faiss)

Here are the top 5 results per model (the search term is “肉醬意粉” — “Spaghetti Bolognese” in Chinese):

{
  "Concat": [
    { "productNameEn": "Spaghetti Bolognese", "productNameZhHk": "肉醬意粉" },
    { "productNameEn": "Steamed Plain Rice Rolls with Assorted Sauce", "productNameZhHk": "混醬腸粉" },
    { "productNameEn": "Pork Belly in Soy Sauce Rice Bowl", "productNameZhHk": "醬油豬腩肉飯" },
    { "productNameEn": "Pappardelle Bolognese", "productNameZhHk": "肉醬寬帶麵" },
    { "productNameEn": "Jajangmyeon", "productNameZhHk": "炸醬麵" }
  ],
  "Concat_norm": [
    { "productNameEn": "Spaghetti Bolognese", "productNameZhHk": "肉醬意粉" },
    { "productNameEn": "Steamed Plain Rice Rolls with Assorted Sauce", "productNameZhHk": "混醬腸粉" },
    { "productNameEn": "Pork Belly in Soy Sauce Rice Bowl", "productNameZhHk": "醬油豬腩肉飯" },
    { "productNameEn": "Pappardelle Bolognese", "productNameZhHk": "肉醬寬帶麵" },
    { "productNameEn": "Jajangmyeon", "productNameZhHk": "炸醬麵" }
  ],
  "LASER_embeddings": [
    { "productNameEn": "Spaghetti Bolognese", "productNameZhHk": "肉醬意粉" },
    { "productNameEn": "Prawn Chorizo Spaghetti", "productNameZhHk": "大蝦辣肉腸意粉" },
    { "productNameEn": "Shredded Pork Chow Fun", "productNameZhHk": "肉絲炒河粉" },
    { "productNameEn": "Taro roll", "productNameZhHk": "香芋卷" },
    { "productNameEn": "Marinated Cucumber with Moro Miso", "productNameZhHk": "麥粒青瓜" }
  ],
  "LaBSE": [
    { "productNameEn": "Spaghetti Bolognese", "productNameZhHk": "肉醬意粉" },
    { "productNameEn": "Penne Bolognese", "productNameZhHk": "肉醬長通粉" },
    { "productNameEn": "Pork Belly in Soy Sauce Rice Bowl", "productNameZhHk": "醬油豬腩肉飯" },
    { "productNameEn": "Pho Tal", "productNameZhHk": "生牛肉湯粉" },
    { "productNameEn": "Crab Pasta", "productNameZhHk": "蟹肉意粉" }
  ],
  "SBert-dist-bert-stsb": [
    { "productNameEn": "Spaghetti Bolognese", "productNameZhHk": "肉醬意粉" },
    { "productNameEn": "Penne Bolognese", "productNameZhHk": "肉醬長通粉" },
    { "productNameEn": "Baked Meat Sauce with Garlic Bread", "productNameZhHk": "焗肉醬蒜茸麵飽" },
    { "productNameEn": "Lasagna Classica", "productNameZhHk": "肉醬千層寬麵" },
    { "productNameEn": "Baked Bolongnese w Garlic Bread", "productNameZhHk": "焗肉醬蒜茸包" }
  ],
  "SBert-mUSE": [
    { "productNameEn": "Spaghetti Bolognese", "productNameZhHk": "肉醬意粉" },
    { "productNameEn": "Penne Bolognese", "productNameZhHk": "肉醬長通粉" },
    { "productNameEn": "Steamed Plain Rice Rolls with Assorted Sauce", "productNameZhHk": "混醬腸粉" },
    { "productNameEn": "Bamboo Shoots", "productNameZhHk": "醬油味筍" },
    { "productNameEn": "Sauteed beef with hot pepper", "productNameZhHk": "麻辣牛肉" }
  ],
  "SBert-xlm-paraphrase": [
    { "productNameEn": "Spaghetti Bolognese", "productNameZhHk": "肉醬意粉" },
    { "productNameEn": "Shredded Pork Chow Mein", "productNameZhHk": "肉絲炒麵" },
    { "productNameEn": "Bamboo Shoots", "productNameZhHk": "醬油味筍" },
    { "productNameEn": "Jajangmyeon", "productNameZhHk": "炸醬麵" },
    { "productNameEn": "Sauteed Chinese Broccoli with Preserved Meat", "productNameZhHk": "臘味炒芥蘭" }
  ],
  "SBert-xlm-stsb": [
    { "productNameEn": "Spaghetti Bolognese", "productNameZhHk": "肉醬意粉" },
    { "productNameEn": "Steamed Chicken Feet in Black Bean Sauce", "productNameZhHk": "豉汁蒸鳳爪" },
    { "productNameEn": "Miso Soup", "productNameZhHk": "麵豉湯底" },
    { "productNameEn": "Baked Ox Tongue in Tomato Sauce", "productNameZhHk": "焗茄汁牛脷" },
    { "productNameEn": "Sausage Roll", "productNameZhHk": "香酥肉卷" }
  ],
  "mUSE": [
    { "productNameEn": "Spaghetti Bolognese", "productNameZhHk": "肉醬意粉" },
    { "productNameEn": "Penne Bolognese", "productNameZhHk": "肉醬長通粉" },
    { "productNameEn": "Jajangmyeon", "productNameZhHk": "炸醬麵" },
    { "productNameEn": "Stir-fried Vermicelli with Soy Sauce Pork Belly", "productNameZhHk": "醬油豬腩炒粉絲" },
    { "productNameEn": "Kid's Spaghetti with Meat Sauce & French Fries", "productNameZhHk": "肉醬意粉、薯條" }
  ]
}

Reference for Faiss

https://github.com/facebookresearch/faiss

Code Snippets

Install libraries and setup

pip install bert-for-tf2
pip install faiss-cpu
pip install tensorflow_text
pip install laserembeddings[zh]
python -m laserembeddings download-models
pip install sentence-transformers

Using LaBSE from TensorFlow Hub

My code more or less follows the official TensorFlow Hub documentation:
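A minimal sketch of that usage (the exact TF Hub versions below, multilingual-preprocess/2 and LaBSE/2, are the currently documented ones and may differ from what I ran at the time):

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers ops the preprocessor needs

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2")
encoder = hub.KerasLayer("https://tfhub.dev/google/LaBSE/2")

sentences = tf.constant(["肉醬意粉", "Spaghetti Bolognese"])
# "default" holds the sentence embeddings (L2-normalized in this version)
embeddings = encoder(preprocessor(sentences))["default"]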

Using LASER from laserembeddings

The usage of this library is very simple; the main lines are:

from laserembeddings import Laser as LaserEmbeddings

encoder = LaserEmbeddings()
# embed_sentences takes a list of sentences plus a language code for tokenization
output = encoder.embed_sentences(["your sentence"], lang="zh")

Using mUSE from TensorFlow Hub

The usage is well documented in the TensorFlow Hub documentation, and pretty straightforward compared to LaBSE:
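A sketch per the TF Hub page (the /3 multilingual USE checkpoint is an assumption on the exact version):

import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the SentencePiece op mUSE needs

muse = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")
embeddings = muse(["肉醬意粉", "Spaghetti Bolognese"]).numpy()  # shape (2, 512)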

Using SBert.Net models

Follow the documentation on their site; it is very easy to follow:
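A minimal sketch (the checkpoint name is just one multilingual example from the sentence-transformers docs; swap in whichever SBert variant you want to test):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased")
embeddings = model.encode(["肉醬意粉", "Spaghetti Bolognese"])  # NumPy array, shape (2, 512)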

Side Notes

While comparing the outputs of different models, I found that some of the models do NOT normalize their output. If your use case requires normalized output, simply wrap it with a function like the following (using PyTorch):

import torch
import torch.nn.functional as F

def normalize_tensor(tensor):
    # because in my case, the input could be a PyTorch tensor or a NumPy array
    tensor = torch.as_tensor(tensor)
    # the main function call: L2-normalize along the last dimension
    normalized = F.normalize(tensor, dim=-1, p=2)
    # my use case prefers NumPy for further processing
    return normalized.numpy()
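For example, wrapping any of the encoders above (hypothetical usage; model is whichever encoder you loaded):

# e.g. normalize a model's output before adding it to a Faiss index
normalized_embeddings = normalize_tensor(model.encode(["肉醬意粉"]))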

Using Faiss

Building an index and inserting items

import faiss

# Assume our vectors are 512-dimensional
index = faiss.IndexFlatL2(512)
# Adding item(s) to the index: numpy_array_of_vector is a float32 array of shape (n, 512)
index.add(numpy_array_of_vector)
# Embed the query text into a vector with one of the models above
query_embedded_by_model = model.encode(["my text to query"])
# Searching for the 5 nearest items: D holds distances, I holds positions, e.g.
D, I = index.search(query_embedded_by_model, 5)
# D = [[0.001, 0.0013, 0.0015, 0.0017, 0.0019]]
# I = [[9142, 3718, 5262, 3572, 18]]

# sentences already in batches below
sentences = [["noodle", "rice", "egg"], ["chicken", ...], ...]
added_items = []
# adding items to the index and appending to added_items
for batch in sentences:
    # assuming the model takes an array of sentences
    encoded = model.encode(batch)
    # add to the Faiss index
    index.add(encoded)
    # extend the array to include the sentences in this batch
    added_items.extend(batch)
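Since Faiss only returns positions, the parallel added_items list above is what maps results back to actual dish names (a sketch):

# Map the Faiss result positions back to the sentences added above
top_matches = [added_items[i] for i in I[0]]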
