What I built (2) — leveraging different pretrained language models to create an NLP search

Intention and background

What I tried

  • Load the language models, have them process some dish names, and see how the resulting vectors cluster
  • Load all dishes into a Faiss index, then query with a dish name or a sentence that contains dish name(s)

How it works

Some NLP basics — how a word sequence becomes one vector (feel free to skip this section if you already know)

Tokenize

Embed each token into vector space

0.2 | 0.1 | 0.6  => Mango
0.1 | 0.1 | 0.8  => Orange
0.1 | 0.1 | 0.82 => Tangerine
0.5 | 0.7 | 0.1  => Milk
0.0 | 0.9 | 0.0  => Water

Run the tokens through a model, then combine the per-token vectors into a single vector for the sequence (e.g. by averaging)

Mango Milk = ( (0.2 | 0.1 | 0.6) + (0.5 | 0.7 | 0.1) )/ 2 => (0.35 | 0.4 | 0.35)
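The tokenize → embed → average steps above can be sketched in a few lines of Python, using the toy 3-dimensional vectors from the table (in a real model these embeddings are learned, not hand-written):

```python
import numpy as np

# Toy embedding table from the example above (3-dimensional vectors)
EMBEDDINGS = {
    "Mango":     np.array([0.2, 0.1, 0.6]),
    "Orange":    np.array([0.1, 0.1, 0.8]),
    "Tangerine": np.array([0.1, 0.1, 0.82]),
    "Milk":      np.array([0.5, 0.7, 0.1]),
    "Water":     np.array([0.0, 0.9, 0.0]),
}

def embed_sequence(text):
    """Tokenize on whitespace, look up each token's vector, then mean-pool."""
    tokens = text.split()
    vectors = [EMBEDDINGS[t] for t in tokens]
    return np.mean(vectors, axis=0)

print(embed_sequence("Mango Milk"))  # -> [0.35 0.4  0.35]
```

Note that real models use subword tokenizers and contextual encoders rather than a static lookup table, but the overall shape of the pipeline is the same.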

What pretrained models I used

Reference for getting to know more about the models:

Result (the resulting vector clusters)

(top row) SBert-mUSE, SBert-xlm-paraphrase, SBert-xlm-stsb; (mid row) SBert-dist-bert-stsb, LaBSE, mUSE; (bottom row): LASER, Concat, Concat normalized

Result of searching among 1,000+ dishes (with the help of Faiss)

{
"Concat": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Steamed Plain Rice Rolls with Assorted Sauce",
"productNameZhHk": "混醬腸粉 "
}, {
"productNameEn": "Pork Belly in Soy Sauce Rice Bowl",
"productNameZhHk": "醬油豬腩肉飯"
}, {
"productNameEn": "Pappardelle Bolognese",
"productNameZhHk": "肉醬寬帶麵"
}, {
"productNameEn": "Jajangmyeon",
"productNameZhHk": "炸醬麵"
}
],
"Concat_norm": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Steamed Plain Rice Rolls with Assorted Sauce",
"productNameZhHk": "混醬腸粉 "
}, {
"productNameEn": "Pork Belly in Soy Sauce Rice Bowl",
"productNameZhHk": "醬油豬腩肉飯"
}, {
"productNameEn": "Pappardelle Bolognese",
"productNameZhHk": "肉醬寬帶麵"
}, {
"productNameEn": "Jajangmyeon",
"productNameZhHk": "炸醬麵"
}
],
"LASER_embeddings": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Prawn Chorizo Spaghetti",
"productNameZhHk": "大蝦辣肉腸意粉"
}, {
"productNameEn": "Shredded Pork Chow Fun",
"productNameZhHk": "肉絲炒河粉"
}, {
"productNameEn": "Taro roll",
"productNameZhHk": "香芋卷"
}, {
"productNameEn": "Marinated Cucumber with Moro Miso",
"productNameZhHk": "麥粒青瓜"
}
],
"LaBSE": [{
"productNameEn": "Spaghetti Bolognese",
"productNameZhHk": "肉醬意粉"
}, {
"productNameEn": "Penne Bolognese",
"productNameZhHk": "肉醬長通粉"
}, {
"productNameEn": "Pork Belly in Soy Sauce Rice Bowl",
"productNameZhHk": "醬油豬腩肉飯"
}, {
"productNameEn": "Pho Tal",
"productNameZhHk": "生牛肉湯粉"
}, {
" productNameEn ": " Crab Pasta ",
" productNameZhHk ": " 蟹肉意粉 "
}
],
" SBert - dist - bert - stsb ": [{
" productNameEn ": " Spaghetti Bolognese ",
" productNameZhHk ": " 肉醬意粉 "
}, {
" productNameEn ": " Penne Bolognese ",
" productNameZhHk ": " 肉醬長通粉 "
}, {
" productNameEn ": " Baked Meat Sauce with Garlic Bread ",
" productNameZhHk ": " 焗肉醬蒜茸麵飽 "
}, {
" productNameEn ": " Lasagna Classica ",
" productNameZhHk ": " 肉醬千層寬麵 "
}, {
" productNameEn ": " Baked Bolongnese w Garlic Bread ",
" productNameZhHk ": " 焗肉醬蒜茸包 "
}
],
" SBert - mUSE ": [{
" productNameEn ": " Spaghetti Bolognese ",
" productNameZhHk ": " 肉醬意粉 "
}, {
" productNameEn ": " Penne Bolognese ",
" productNameZhHk ": " 肉醬長通粉 "
}, {
" productNameEn ": " Steamed Plain Rice Rolls with Assorted Sauce ",
" productNameZhHk ": " 混醬腸粉 "
}, {
" productNameEn ": " Bamboo Shoots ",
" productNameZhHk ": " 醬油味筍 "
}, {
" productNameEn ": " Sauteed beef with hot pepper ",
" productNameZhHk ": " 麻辣牛肉 "
}
],
" SBert - xlm - paraphrase ": [{
" productNameEn ": " Spaghetti Bolognese ",
" productNameZhHk ": " 肉醬意粉 "
}, {
" productNameEn ": " Shredded Pork Chow Mein ",
" productNameZhHk ": " 肉絲炒麵 "
}, {
" productNameEn ": " Bamboo Shoots ",
" productNameZhHk ": " 醬油味筍 "
}, {
" productNameEn ": " Jajangmyeon ",
" productNameZhHk ": " 炸醬麵 "
}, {
" productNameEn ": " Sauteed Chinese Broccoli with Preserved Meat ",
" productNameZhHk ": " 臘味炒芥蘭 "
}
],
" SBert - xlm - stsb ": [{
" productNameEn ": " Spaghetti Bolognese ",
" productNameZhHk ": " 肉醬意粉 "
}, {
" productNameEn ": " Steamed Chicken Feet in Black Bean Sauce ",
" productNameZhHk ": " 豉汁蒸鳳爪 "
}, {
" productNameEn ": " Miso Soup ",
" productNameZhHk ": " 麵豉湯底 "
}, {
" productNameEn ": " Baked Ox Tongue in Tomato Sauce ",
" productNameZhHk ": " 焗茄汁牛脷 "
}, {
" productNameEn ": " Sausage Roll ",
" productNameZhHk ": " 香酥肉卷 "
}
],
" mUSE ": [{
" productNameEn ": " Spaghetti Bolognese ",
" productNameZhHk ": " 肉醬意粉 "
}, {
" productNameEn ": " Penne Bolognese ",
" productNameZhHk ": " 肉醬長通粉 "
}, {
" productNameEn ": " Jajangmyeon ",
" productNameZhHk ": " 炸醬麵 "
}, {
" productNameEn ": " Stir - fried Vermicelli with Soy Sauce Pork Belly ",
" productNameZhHk ": " 醬油豬腩炒粉絲 "
}, {
" productNameEn ": " Kid ' s Spaghetti with Meat Sauce & French Fries ",
"productNameZhHk ": " 肉醬意粉、薯條 "
}
]
}

Reference for Faiss

Code Snippet

Install libraries and setup

pip install bert-for-tf2
pip install faiss-cpu
pip install tensorflow_text
pip install laserembeddings[zh]
python -m laserembeddings download-models
pip install sentence-transformers

Using LaBSE from Tensorflow Hub

Using LASER from laserembeddings

from laserembeddings import Laser

encoder = Laser()
# embed_sentences takes a list of sentences and returns an (n, 1024) NumPy array
output = encoder.embed_sentences(["your sentence"], lang="zh")

Using mUSE from Tensorflow hub

Using SBert.Net models

Side Notes

import torch
import torch.nn.functional as F

def normalize_tensor(tensor):
    # in my case, the input could be a PyTorch tensor or a NumPy array
    tensor = torch.as_tensor(tensor)
    # the main call: L2-normalize along the last dimension
    normalized = F.normalize(tensor, dim=-1, p=2)
    # my use case prefers NumPy for further processing
    return normalized.numpy()

Using Faiss

import faiss

# Assume our vectors are 512-dimensional
index = faiss.IndexFlatL2(512)
# Add item(s) to the index (Faiss expects a 2D float32 array, one row per vector)
index.add(numpy_array_of_vectors)
# Embed the query text into a vector with one of the models above
query_embedded_by_model = model.embed("my text to query")
# Search for the 5 nearest items
D, I = index.search(query_embedded_by_model, 5)
# D = [[0.001, 0.0013, 0.0015, 0.0017, 0.0019]]
# I = [[9142, 3718, 5262, 3572, 18]]

# sentences are already batched below
sentences = [["noodle", "rice", "egg"], ["chicken",...], ... ]
added_items = []
# add items to the index while keeping a parallel record of what was added,
# so the ids returned by index.search can be mapped back to sentences
for batch in sentences:
    # assuming the model takes an array of sentences
    encoded = model.encode(batch)
    # add to the Faiss index
    index.add(encoded)
    # extend the list with the sentences in this batch
    added_items.extend(batch)

Stephen Cow Chau
