What I built (2) — leveraging different pretrained language models to create an NLP search

Stephen Cow Chau
9 min read · Mar 1, 2021

This is an NLP exploration during the WFH period.

Intention and background

For a query like “can I get an egg sushi”, can the machine possibly relate it to “tamago sushi” (tamago means egg in Japanese)?

I would like to see if any of the available pretrained models can pick up this kind of association. If that works, it might be possible to take an input query, search a pool of dishes, and pick the closest one.

What I tried

  • Load the language models, have them process some dish names, and see how they cluster
  • Load all dishes into a Faiss index, then try to query with a dish name or a sentence that contains dish name(s)

How it works

What a language model does is take a word sequence (a sentence is considered a long word sequence) and process it into a vector.

And a Faiss index is a very efficient index for locating the closest match in a very high-dimensional vector space.

Some NLP basics — how a word sequence becomes one vector (please feel free to skip this whole section if you already know it)

The “simplest” pipeline looks like this:

word sequence -> tokenize -> embed each token into vector space -> run the per-token vectors through a model to create a final vector for the sequence.

Note: I am skipping a lot here, as these concepts can go very deep and there are plenty of other articles that cover them better than this one.

Tokenize

This is simple for a language like English, which has the advantage that words are separated by spaces, so “mango chocolate milk tea” becomes “mango | chocolate | milk | tea”. Other languages like Cantonese (the language Hong Kong people speak) can be more challenging: “芒果朱古力奶茶” becomes “芒果 | 朱古力 | 奶 | 茶”, which obviously needs some knowledge to help with the “cutting” (yes, tokenizing is cutting a sequence into tokens).

Note 1: For tokenization, a modern approach could leverage a machine learning model to find the cuts, but for simplicity, Cantonese/Chinese can also be treated one character at a time.

Note 2: Some models, even for English, might cut a word down into subwords (like childish -> child | ish); this is explained a bit more below.

Embed each token into vector space

NLP runs on a computer, and a computer processes words differently from a human: a human sees a token and can associate it with meanings in our amazing brain, while for a computer we try to “project” the symbol (the token) into some numbers (a vector).

What we hope the computer can do is store things like the following (assuming the computer uses 3 numbers to store each token):

0.2 | 0.1 | 0.6  => Mango
0.1 | 0.1 | 0.8  => Orange
0.1 | 0.1 | 0.82 => Tangerine
0.5 | 0.7 | 0.1  => Milk
0.0 | 0.9 | 0.0  => Water

Orange and Tangerine look very close in all 3 numbers (and they should be close in vector space as well) because they are more similar to each other.

The process here is simple: we build a dictionary (a mapping) for each word we know and perform a lookup. As you can imagine, the more words in the dictionary, the bigger it gets, and at a certain point it just cannot fit into memory anymore; that is why the subword approach mentioned above exists.
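To make the lookup idea concrete, here is a minimal sketch of such a dictionary (reusing the made-up numbers from the table above):

# A toy embedding table: each known token maps to a small vector.
embedding_table = {
    "mango":     [0.2, 0.1, 0.6],
    "orange":    [0.1, 0.1, 0.8],
    "tangerine": [0.1, 0.1, 0.82],
    "milk":      [0.5, 0.7, 0.1],
    "water":     [0.0, 0.9, 0.0],
}
tokens = "mango milk".split()
# Embedding a sequence is just one lookup per token.
vectors = [embedding_table[t] for t in tokens]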

An over-simplified case: consider we have the words (play, playground, ground, groundwork). That is 4 different words, but if we cut them into subwords, we can use just (play, ground, work) to compose all 4 of them.
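If you want to see subword tokenization in action, here is a quick sketch using the Hugging Face transformers library with the multilingual BERT vocabulary (just an illustration, not one of the models used in this post; the exact splits depend on the vocabulary):

from transformers import AutoTokenizer

# Multilingual BERT uses a WordPiece vocabulary; the splits below depend on that vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize("groundwork"))    # words missing from the vocab get cut into subwords, e.g. ['ground', '##work']
print(tokenizer.tokenize("芒果朱古力奶茶"))  # CJK text is largely cut one character at a time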

Note: If you are interested in knowing more, have a look at the following article:

Run the per-token vectors through a model to create a final vector for the sequence

After we have a vector for each token, we run the sequence through a language model to capture the meaning of the sequence. People might ask: can we just average the vectors of the words, like the following?

Mango Milk = ( (0.2 | 0.1 | 0.6) + (0.5 | 0.7 | 0.1) )/ 2 => (0.35 | 0.4 | 0.35)

The answer is that some research actually did that, and it works sometimes, but a language model normally does much better.

The reason is that our language has word order and context, and the model tries to capture that information. The averaging approach treats the input sequence (the sentence) as a “bag of words” and disregards the fact that how we compose the word sequence can give different meanings, e.g. “cat loves eating fish” vs. “fish loves eating cat”.
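To show what the “bag of words” averaging loses, here is a minimal numpy sketch (reusing the toy vectors above):

import numpy as np

embedding_table = {
    "mango": np.array([0.2, 0.1, 0.6]),
    "milk":  np.array([0.5, 0.7, 0.1]),
}

def average_embedding(sentence):
    # Look up each token's vector and average them: a "bag of words" representation.
    return np.mean([embedding_table[t] for t in sentence.split()], axis=0)

print(average_embedding("mango milk"))  # [0.35 0.4  0.35]
print(average_embedding("milk mango"))  # identical: word order is lost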

What pretrained models I used

All the pretrained models I tried are trained on multiple languages, as my test cases are expected to be multilingual.

The following are the models I tested with:

  • LaBSE (from Tensorflow Hub)
  • LASER (via the laserembeddings library)
  • mUSE, the multilingual Universal Sentence Encoder (from Tensorflow Hub)
  • SBert.Net models: SBert-mUSE, SBert-xlm-paraphrase, SBert-xlm-stsb, SBert-dist-bert-stsb

References for getting to know more about the models:

https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html

https://github.com/facebookresearch/MUSE

Results (resulting vector clusters)

Important note: The original vectors mostly have dimension ≥ 512. I used PCA to project them into 2D space, so the clustering in the plots below can sometimes be misleading.

[Plots of the 2D PCA projections] (top row) SBert-mUSE, SBert-xlm-paraphrase, SBert-xlm-stsb; (middle row) SBert-dist-bert-stsb, LaBSE, mUSE; (bottom row) LASER, Concat, Concat normalized
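For reference, the 2D projections above were produced roughly like the following sketch (assuming scikit-learn and matplotlib; the `embeddings` and `dish_names` stand-ins are illustrative, not the actual data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Illustrative stand-ins: in the real run, `embeddings` comes from one of the
# models above (shape (n_dishes, dim)) and `dish_names` is the matching name list.
embeddings = np.random.rand(10, 512)
dish_names = [f"dish {i}" for i in range(10)]

# Project the high-dimensional vectors down to 2D for plotting.
points_2d = PCA(n_components=2).fit_transform(embeddings)
plt.scatter(points_2d[:, 0], points_2d[:, 1])
for name, (x, y) in zip(dish_names, points_2d):
    plt.annotate(name, (x, y), fontsize=8)
plt.show()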

A quick observation: the models learned some clustering; some models clustered by cuisine, some by ingredients (like beef noodle and burger being close in the middle-right plot).

The “concat” and “concat normalized” plots in the lower middle and lower right come from concatenating all 7 other models’ outputs into one giant vector (around 5,000 dimensions). That approach, after some validation, does not give a better result compared to a single model.
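Roughly, the concatenation looks like the following sketch (the per-model arrays are illustrative stand-ins; the post does not spell out whether the normalization was applied per model or to the final vector, so the sketch shows the per-model variant):

import numpy as np

# Illustrative stand-ins for the per-model outputs (same dishes, different dimensions).
per_model_embeddings = [np.random.rand(10, 512), np.random.rand(10, 768), np.random.rand(10, 1024)]

# "Concat": glue all model outputs for each dish into one giant vector.
concat = np.concatenate(per_model_embeddings, axis=1)

# "Concat normalized" (one plausible variant): L2-normalize each model's output first,
# so no single model dominates just because its vectors have larger magnitudes.
normalized = [m / np.linalg.norm(m, axis=1, keepdims=True) for m in per_model_embeddings]
concat_norm = np.concatenate(normalized, axis=1)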

Results of searching among 1000+ dishes (with the help of Faiss)

Here are the top 5 results per model (the search term is “肉醬意粉” — “Spaghetti Bolognese” in Chinese):

Honestly, I have no idea why 混醬腸粉 (rice rolls) scores that high for some models.

{
"Concat": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Steamed Plain Rice Rolls with Assorted Sauce",
"productNameZhHk": "混醬腸粉 "
}, {
"productNameEn": "Pork Belly in Soy Sauce Rice Bowl",
"productNameZhHk": "醬油豬腩肉飯"
}, {
"productNameEn": "Pappardelle Bolognese",
"productNameZhHk": "肉醬寬帶麵"
}, {
"productNameEn": "Jajangmyeon",
"productNameZhHk": "炸醬麵"
}
],
"Concat_norm": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Steamed Plain Rice Rolls with Assorted Sauce",
"productNameZhHk": "混醬腸粉 "
}, {
"productNameEn": "Pork Belly in Soy Sauce Rice Bowl",
"productNameZhHk": "醬油豬腩肉飯"
}, {
"productNameEn": "Pappardelle Bolognese",
"productNameZhHk": "肉醬寬帶麵"
}, {
"productNameEn": "Jajangmyeon",
"productNameZhHk": "炸醬麵"
}
],
"LASER_embeddings": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Prawn Chorizo Spaghetti",
"productNameZhHk": "大蝦辣肉腸意粉"
}, {
"productNameEn": "Shredded Pork Chow Fun",
"productNameZhHk": "肉絲炒河粉"
}, {
"productNameEn": "Taro roll",
"productNameZhHk": "香芋卷"
}, {
"productNameEn": "Marinated Cucumber with Moro Miso",
"productNameZhHk": "麥粒青瓜"
}
],
"LaBSE": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Penne Bolognese",
"productNameZhHk": "肉醬長通粉"
}, {
"productNameEn": "Pork Belly in Soy Sauce Rice Bowl",
"productNameZhHk": "醬油豬腩肉飯"
}, {
"productNameEn": "Pho Tal",
"productNameZhHk": "生牛肉湯粉"
}, {
"productNameEn": "Crab Pasta",
"productNameZhHk": "蟹肉意粉"
}
],
"SBert-dist-bert-stsb": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Penne Bolognese",
"productNameZhHk": "肉醬長通粉"
}, {
"productNameEn": "Baked Meat Sauce with Garlic Bread",
"productNameZhHk": "焗肉醬蒜茸麵飽"
}, {
"productNameEn": "Lasagna Classica",
"productNameZhHk": "肉醬千層寬麵"
}, {
"productNameEn": "Baked Bolongnese w Garlic Bread",
"productNameZhHk": "焗肉醬蒜茸包"
}
],
"SBert-mUSE": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Penne Bolognese",
"productNameZhHk": "肉醬長通粉"
}, {
"productNameEn": "Steamed Plain Rice Rolls with Assorted Sauce",
"productNameZhHk": "混醬腸粉"
}, {
"productNameEn": "Bamboo Shoots",
"productNameZhHk": "醬油味筍"
}, {
"productNameEn": "Sauteed beef with hot pepper",
"productNameZhHk": "麻辣牛肉"
}
],
"SBert-xlm-paraphrase": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Shredded Pork Chow Mein",
"productNameZhHk": "肉絲炒麵"
}, {
"productNameEn": "Bamboo Shoots",
"productNameZhHk": "醬油味筍"
}, {
"productNameEn": "Jajangmyeon",
"productNameZhHk": "炸醬麵"
}, {
"productNameEn": "Sauteed Chinese Broccoli with Preserved Meat",
"productNameZhHk": "臘味炒芥蘭"
}
],
"SBert-xlm-stsb": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Steamed Chicken Feet in Black Bean Sauce",
"productNameZhHk": "豉汁蒸鳳爪"
}, {
"productNameEn": "Miso Soup",
"productNameZhHk": "麵豉湯底"
}, {
"productNameEn": "Baked Ox Tongue in Tomato Sauce",
"productNameZhHk": "焗茄汁牛脷"
}, {
"productNameEn": "Sausage Roll",
"productNameZhHk": "香酥肉卷"
}
],
"mUSE": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Penne Bolognese",
"productNameZhHk": "肉醬長通粉"
}, {
"productNameEn": "Jajangmyeon",
"productNameZhHk": "炸醬麵"
}, {
"productNameEn": "Stir-fried Vermicelli with Soy Sauce Pork Belly",
"productNameZhHk": "醬油豬腩炒粉絲"
}, {
"productNameEn": "Kid's Spaghetti with Meat Sauce & French Fries",
"productNameZhHk": "肉醬意粉、薯條"
}
]
}

Reference for Faiss:

https://github.com/facebookresearch/faiss

Code Snippets

Install libraries and setup

pip install bert-for-tf2
pip install faiss-cpu
pip install tensorflow_text
pip install laserembeddings[zh]
python -m laserembeddings download-models
pip install sentence-transformers

The major reason I use the “laserembeddings” library instead of Facebook’s official LASER repository is the convenience of installation and setup; please do read the laserembeddings GitHub page to see the differences between the two.

Note that Faiss has a CPU PyPI package (faiss-cpu) and a GPU PyPI package (faiss-gpu). There is even a PyPI package called “faiss”, which is an old build, so do not use that one.

Using LaBSE from Tensorflow Hub

My code more or less follows the official Tensorflow Hub documentation:

And I just found out that there is a Huggingface version of LaBSE (I would have used it if I had known earlier, as I am more of a PyTorch user):
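For reference, a minimal sketch of the TF Hub LaBSE interface (this follows the current version 2 model card, which pairs the encoder with a preprocessing model; the exact version used in this post may differ):

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops the preprocessor needs

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2")
encoder = hub.KerasLayer("https://tfhub.dev/google/LaBSE/2")

sentences = tf.constant(["肉醬意粉", "Spaghetti Bolognese"])
# The encoder returns a dict; "default" holds the sentence embeddings.
embeddings = encoder(preprocessor(sentences))["default"]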

Using LASER from laserembeddings

The usage of this library is very simple; the main lines are:

from laserembeddings import Laser as LaserEmbeddings
encoder = LaserEmbeddings()
# embed_sentences takes a batch of sentences and returns a numpy array of embeddings
output = encoder.embed_sentences(["your sentence"], lang="zh")

Note that the model expects to embed a batch of sentences, so even if you have a single text, wrap it in an array. Also, the lang parameter can be an array of language codes if your input sentences are not all in a single language.

Using mUSE from Tensorflow hub

The usage is well documented in the Tensorflow Hub documentation and is pretty straightforward compared to LaBSE:
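A minimal sketch of what that looks like (tensorflow_text must be imported so the SentencePiece ops that mUSE needs are registered):

import tensorflow_hub as hub
import tensorflow_text  # registers the SentencePiece ops mUSE uses

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")
embeddings = embed(["肉醬意粉", "Spaghetti Bolognese"]).numpy()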

Using SBert.Net models

Follow the documentation from their site; it is very easy to follow:
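A minimal sketch (the checkpoint name below is just one of the multilingual models from the SBert documentation, not necessarily the exact ones tested in this post):

from sentence_transformers import SentenceTransformer

# One multilingual model from the SBert docs; swap in whichever checkpoint you want to test.
model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
embeddings = model.encode(["肉醬意粉", "Spaghetti Bolognese"])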

Side Notes

While comparing the output of different models, I found that some of the models do NOT normalize their output. If your use case requires the output to be normalized, simply wrap it with a function like the following (using PyTorch):

import torch
import torch.nn.functional as F

def normalize_tensor(tensor):
    # because in my case, the input could be a PyTorch tensor or a Numpy array
    tensor = torch.tensor(tensor)
    # the main function call: L2-normalize along the last dimension
    normalized = F.normalize(tensor, dim=-1, p=2)
    # my use case prefers numpy for further processing
    return normalized.numpy()

Using Faiss

Building an index and inserting items

import faiss

# Assume our vectors have length 512
index = faiss.IndexFlatL2(512)
# Adding item(s) to the index (a float32 numpy array of shape (n, 512))
index.add(numpy_array_of_vector)

Note that “numpy_array_of_vector” above is the numpy array you get after passing your items through the model. Faiss expects a 2D float32 array of shape (x, 512), which inserts a batch of size x into the index; to insert a single vector of shape (512,), reshape it to (1, 512) first.
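For example, to insert a single vector (continuing the snippet above; `single_vector` is just an illustrative stand-in for one model output):

import numpy as np

# Faiss wants float32 and a 2D shape of (1, 512) for a single item.
single_vector = np.random.rand(512).astype("float32")  # illustrative stand-in
index.add(single_vector.reshape(1, -1))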

Searching the result

# Embed the text into a vector with one of the models above
# (model.embed stands for whichever model's encode call you are using)
query_embedded_by_model = model.embed("my text to query")
# Searching for the 5 nearest items (the query must also be float32 with shape (1, 512))
D, I = index.search(query_embedded_by_model, 5)

Note that the results are D (distances) and I (indices), so you might get something like:

D = [[0.001, 0.0013, 0.0015, 0.0017, 0.0019]]
I = [[9142, 3718, 5262, 3572, 18]]

Note that they are 2D arrays, and the indices are just index numbers; you do not get back your indexed items unless you keep a record of the order in which you inserted them into Faiss. Pseudocode would look like the following (I did not test this code, even though it looks real):

# sentences are already split into batches below
sentences = [["noodle", "rice", "egg"], ["chicken", ...], ... ]
added_items = []
# add items to the index and append them to added_items
for batch in sentences:
    # assuming the model takes an array of sentences
    encoded = model.encode(batch)
    # add to the Faiss index
    index.add(encoded)
    # extend the list to include the sentences in this batch
    added_items.extend(batch)

Then one can take the indices of the closest matches returned by the search and look up the original items in “added_items”.
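Putting it together, mapping the search result back to the stored items looks like this (continuing the pseudocode above, where I comes from index.search and added_items from the insertion loop):

# I has shape (n_queries, k); take the first (and only) query's neighbours.
top_matches = [added_items[i] for i in I[0]]
print(top_matches)  # the 5 stored items closest to the query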
