Word Embeddings & Vector Search

Store word embeddings and perform k-nearest neighbor search in CogDB.

Word embeddings represent words as dense numeric vectors and are useful for text classification, similarity search, and clustering. Popular pre-trained embeddings include word2vec, GloVe, and FastText.


Word Embeddings API
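
The examples on this page assume a Graph instance named g. A minimal setup, assuming CogDB's standard Torque import path:

from cog.torque import Graph

g = Graph("embeddings_graph")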

put_embedding

g.put_embedding("orange", [0.1, 0.2, 0.3, 0.4, 0.5])

put_embeddings_batch

embeddings = [
    ("apple", [0.1, 0.2, 0.3]),
    ("banana", [0.3, 0.4, 0.5]),
    ("orange", [0.7, 0.8, 0.9]),
]
g.put_embeddings_batch(embeddings)

Use put_embeddings_batch() for bulk loading; it is much faster than calling put_embedding() for each word.
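
For example, a whole file of vectors can be staged in memory and written in one call. This is only a sketch; vectors.txt and its whitespace-separated "word v1 v2 ..." layout are hypothetical:

# Read a hypothetical vectors.txt file ("word v1 v2 ..." per line)
# and load every vector in a single batch.
embeddings = []
with open("vectors.txt") as f:
    for line in f:
        parts = line.rstrip().split()
        embeddings.append((parts[0], [float(x) for x in parts[1:]]))

g.put_embeddings_batch(embeddings)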

get_embedding

g.get_embedding("orange")
# [0.1, 0.2, 0.3, 0.4, 0.5]

delete_embedding

g.delete_embedding("orange")

k_nearest

g.v().k_nearest(word, k).all()

Find the k embeddings most similar to the given word, ranked by cosine similarity.

g = Graph("embeddings_graph")

g.put("vec1", "type", "vector")
g.put("vec2", "type", "vector")
g.put("vec3", "type", "vector")

g.put_embedding("vec1", [0.1, 0.2, 0.3])
g.put_embedding("vec2", [0.1, 0.2, 0.4])
g.put_embedding("vec3", [0.9, 0.8, 0.7])

result = g.v().k_nearest("vec1", k=2).all()
# {'result': [{'id': 'vec1'}, {'id': 'vec2'}]}
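
Cosine similarity scores two vectors by the angle between them, independent of their magnitude. The helper below only illustrates the metric on the vectors above; it is not CogDB's internal implementation:

import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cosine_similarity([0.1, 0.2, 0.3], [0.1, 0.2, 0.4])  # ~0.99, so vec2 ranks nearest to vec1
cosine_similarity([0.1, 0.2, 0.3], [0.9, 0.8, 0.7])  # ~0.88, so vec3 ranks further away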

Loading Pre-trained Embeddings

load_glove

Load pre-trained GloVe vectors from a text file. The optional limit caps the number of words loaded.

g = Graph("glove")
count = g.load_glove("glove.6B.100d.txt", limit=10000)
result = g.v().k_nearest("king", k=5).all()
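
GloVe files are plain text with one word per line followed by its vector components. If you need to pre-filter the vocabulary rather than rely on limit, the same file can be loaded by hand with put_embeddings_batch; the vocabulary set below is a hypothetical example:

# Load only selected words from a GloVe text file.
vocabulary = {"king", "queen", "man", "woman"}

embeddings = []
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if parts[0] in vocabulary:
            embeddings.append((parts[0], [float(x) for x in parts[1:]]))

g.put_embeddings_batch(embeddings)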

load_gensim

Load vectors from a gensim KeyedVectors model.

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)
g = Graph("w2v")
g.load_gensim(model, limit=50000)
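
If you don't have a local word2vec file, gensim's downloader can fetch a pre-trained KeyedVectors model instead; this sketch assumes the published glove-wiki-gigaword-100 dataset:

import gensim.downloader as api
from cog.torque import Graph

# Downloads and caches the model on first use.
model = api.load("glove-wiki-gigaword-100")

g = Graph("w2v")
g.load_gensim(model, limit=50000)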

sim

sim(word, operator, value)

Filter vertices by cosine similarity.

Supported operators: >, <, >=, <=, ==, !=, and in with a [min, max] range.

g.v().sim("orange", ">", 0.35).all()
# {'result': [{'id': 'clementines'}, {'id': 'tangerine'}, {'id': 'orange'}]}

g.v().sim("orange", "in", [0.25, 0.35]).all()
# {'result': [{'id': 'banana'}, {'id': 'apple'}]}
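
The in operator keeps vertices whose similarity to the query word falls inside the inclusive [min, max] range. The snippet below shows the equivalent check in plain Python using embeddings already stored in the graph; it only illustrates the semantics, not how CogDB evaluates the filter:

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

query = g.get_embedding("orange")
low, high = 0.25, 0.35

# Hypothetical candidate words; keep those within the similarity range.
matches = [w for w in ("apple", "banana", "clementines", "tangerine")
           if low <= cosine(query, g.get_embedding(w)) <= high]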
