Word Embeddings & Vector Search

Store word embeddings and perform semantic filtering in CogDB.

Word embeddings represent words as dense numeric vectors and are useful for text classification, semantic similarity, and clustering. Popular pre-trained embeddings include GloVe, word2vec, and FastText.


Word Embeddings API

put_embedding

g.put_embedding("orange", [0.1, 0.2, 0.3, 0.4, 0.5])
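
A minimal end-to-end sketch, assuming the standard cog.torque import for CogDB; the graph name "fruits" is only illustrative.

from cog.torque import Graph

g = Graph("fruits")                                    # create or open a graph database
g.put_embedding("orange", [0.1, 0.2, 0.3, 0.4, 0.5])   # store a 5-dimensional vector for "orange"
print(g.get_embedding("orange"))                       # [0.1, 0.2, 0.3, 0.4, 0.5]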

put_embeddings_batch

embeddings = [
    ("apple", [0.1, 0.2, 0.3]),
    ("banana", [0.3, 0.4, 0.5]),
    ("orange", [0.7, 0.8, 0.9]),
]
g.put_embeddings_batch(embeddings)

Use put_embeddings_batch() for bulk loading; it is much faster than issuing individual put_embedding() calls.
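
For example, a batch can be built from any iterable of (word, vector) pairs; the small dictionary below is a hypothetical stand-in for a larger vocabulary.

vectors = {
    "apple":  [0.1, 0.2, 0.3],
    "banana": [0.3, 0.4, 0.5],
    "orange": [0.7, 0.8, 0.9],
}
# build the (word, vector) list once and load it in a single call
g.put_embeddings_batch(list(vectors.items()))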

get_embedding

g.get_embedding("orange")
# [0.1, 0.2, 0.3, 0.4, 0.5]
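
The output above suggests get_embedding returns a plain Python list, so you can work with the vectors directly. A sketch of computing cosine similarity between two stored words using only the standard library:

import math

def cosine(a, b):
    # dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

v1 = g.get_embedding("apple")
v2 = g.get_embedding("banana")
print(cosine(v1, v2))   # similarity score in [-1, 1]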

delete_embedding

g.delete_embedding("orange")

k_nearest

g.v().k_nearest(word, k).all()

Find the k embeddings most similar to the given word, ranked by cosine similarity.

g = Graph("embeddings_graph")

g.put("vec1", "type", "vector")
g.put("vec2", "type", "vector")
g.put("vec3", "type", "vector")

g.put_embedding("vec1", [0.1, 0.2, 0.3])
g.put_embedding("vec2", [0.1, 0.2, 0.4])
g.put_embedding("vec3", [0.9, 0.8, 0.7])

result = g.v().k_nearest("vec1", k=2).all()
# {'result': [{'id': 'vec1'}, {'id': 'vec2'}]}
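
As the example output shows, the query vertex itself is returned as one of the k results. Filtering it out is a small post-processing step on the returned dictionary:

result = g.v().k_nearest("vec1", k=2).all()
neighbours = [v["id"] for v in result["result"] if v["id"] != "vec1"]
# ['vec2']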

Loading Pre-trained Embeddings

load_glove

g = Graph("glove")
count = g.load_glove("glove.6B.100d.txt", limit=10000)
result = g.v().k_nearest("king", k=5).all()

load_gensim

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)
g = Graph("w2v")
g.load_gensim(model, limit=50000)
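
Once the model has been loaded into the graph, its vectors can be queried like any other embeddings. A short usage sketch; "apple" is assumed to be within the loaded vocabulary:

result = g.v().k_nearest("apple", k=5).all()
# the 5 nearest words to "apple" by cosine similarity (including "apple" itself)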

sim

sim(word, operator, value)

Filter vertices by their cosine similarity to the given word.

Operators: >, <, >=, <=, ==, !=, in [min, max]

g.v().sim("orange", ">", 0.35).all()
# {'result': [{'id': 'clementines'}, {'id': 'tangerine'}, {'id': 'orange'}]}

g.v().sim("orange", "in", [0.25, 0.35]).all()
# {'result': [{'id': 'banana'}, {'id': 'apple'}]}
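
sim returns the same dictionary structure as other traversals, so extracting only the matching words is a short post-processing step. A sketch, assuming the fruit embeddings loaded earlier; the 0.5 threshold is only illustrative:

matches = g.v().sim("orange", ">=", 0.5).all()["result"]
words = [v["id"] for v in matches]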
