Word Embeddings & Vector Search

Store word embeddings and perform semantic filtering in CogDB.

Word embeddings represent words as dense numeric vectors that capture semantic meaning, making them useful for text classification, similarity search, and clustering. Popular pre-trained embeddings include GloVe, word2vec, and FastText.
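The core idea can be sketched in a few lines: similarity between two words is the cosine of the angle between their vectors. The vectors below are made-up toy values for illustration (real embeddings have 100+ dimensions).

```python
import math

def cosine_similarity(a, b):
    # cosine = dot(a, b) / (|a| * |b|); 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, invented for this example.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # close to 1.0: similar words
print(cosine_similarity(king, apple))  # much lower: unrelated words
```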


Auto-Vectorize with g.vectorize()

Automatically add embeddings for your graph nodes using CogDB's free embedding service, or a third-party provider.

Embed all nodes in the graph

from cog.torque import Graph

g = Graph("planets")
g.put("europa", "orbits", "jupiter")
g.put("titan", "orbits", "saturn")
g.put("mars", "type", "planet")

g.vectorize()
# {'vectorized': 5, 'skipped': 0, 'total': 5}

Embed specific words

# Single word
g.vectorize("ocean")

# Multiple words (don't need to be in the graph)
g.vectorize(["ice", "atmosphere", "subsurface"])

Use a different provider

# OpenAI
g.vectorize(provider="openai", api_key="sk-...")

# Custom provider
g.vectorize(provider="custom", url="https://my-service.com/embed")

Auto-embed on query

After calling vectorize(), similarity queries automatically embed unknown words:

g.vectorize()

# "ocean" isn't in the graph, but it's auto-embedded at query time.
g.v().k_nearest("ocean", k=3).all()
# {'result': [{'id': 'europa'}, {'id': 'titan'}, {'id': 'mars'}]}
g.v().sim("ice", ">", 0.5).all()

Word Embeddings API

For most use cases, g.vectorize() above is the easiest way to add embeddings. The methods below give you full manual control when needed.

put_embedding

g.put_embedding("orange", [0.1, 0.2, 0.3, 0.4, 0.5])

put_embeddings_batch

embeddings = [
    ("apple", [0.1, 0.2, 0.3]),
    ("banana", [0.3, 0.4, 0.5]),
    ("orange", [0.7, 0.8, 0.9]),
]
g.put_embeddings_batch(embeddings)

Use put_embeddings_batch() for bulk loading; it is much faster than individual put_embedding() calls.

get_embedding

g.get_embedding("orange")
# [0.1, 0.2, 0.3, 0.4, 0.5]

delete_embedding

g.delete_embedding("orange")

g.v().k_nearest(word, k).all()

Find the k most similar embeddings using cosine similarity.

g = Graph("embeddings_graph")

g.put("vec1", "type", "vector")
g.put("vec2", "type", "vector")
g.put("vec3", "type", "vector")

g.put_embedding("vec1", [0.1, 0.2, 0.3])
g.put_embedding("vec2", [0.1, 0.2, 0.4])
g.put_embedding("vec3", [0.9, 0.8, 0.7])

result = g.v().k_nearest("vec1", k=2).all()
# {'result': [{'id': 'vec1'}, {'id': 'vec2'}]}
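The ranking above can be reproduced with a plain-Python sketch (not CogDB's actual implementation): score every stored embedding against the query vector by cosine similarity and keep the top k. Note that the query word matches itself with a similarity of 1.0, so it appears first.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Same embeddings as the example above.
embeddings = {
    "vec1": [0.1, 0.2, 0.3],
    "vec2": [0.1, 0.2, 0.4],
    "vec3": [0.9, 0.8, 0.7],
}

def k_nearest(word, k):
    query = embeddings[word]
    # Sort all words by similarity to the query, highest first.
    ranked = sorted(embeddings, key=lambda w: cosine(query, embeddings[w]), reverse=True)
    return ranked[:k]

print(k_nearest("vec1", 2))  # ['vec1', 'vec2']
```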

Loading Pre-trained Embeddings

load_glove

g = Graph("glove")
count = g.load_glove("glove.6B.100d.txt", limit=10000)
result = g.v().k_nearest("king", k=5).all()
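GloVe files use a simple text format: each line is a word followed by its space-separated float components. As a rough illustration of what loading such a file involves (load_glove handles this for you), a minimal parser might look like this; parse_glove and the sample data are hypothetical, not part of CogDB:

```python
import io

def parse_glove(lines, limit=None):
    """Parse GloVe's text format: one word per line followed by
    space-separated float vector components."""
    embeddings = []
    for i, line in enumerate(lines):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        word, vector = parts[0], [float(x) for x in parts[1:]]
        embeddings.append((word, vector))
    return embeddings

# A two-line sample in GloVe's format (made-up values).
sample = io.StringIO("king 0.1 0.9 0.3\nqueen 0.2 0.8 0.4\n")
pairs = parse_glove(sample)
# The pairs could then be bulk-loaded: g.put_embeddings_batch(pairs)
print(pairs[0])  # ('king', [0.1, 0.9, 0.3])
```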

load_gensim

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)
g = Graph("w2v")
g.load_gensim(model, limit=50000)
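Conceptually, loading from a gensim model amounts to walking its vocabulary and collecting (word, vector) pairs in the shape put_embeddings_batch expects. The sketch below uses a minimal stand-in class instead of a real KeyedVectors model (real models expose index_to_key and item access the same way), so it is an illustration of the idea, not CogDB's implementation:

```python
class FakeKeyedVectors:
    """Minimal stand-in for gensim's KeyedVectors, for illustration only."""
    def __init__(self, vectors):
        self.index_to_key = list(vectors)  # vocabulary, like gensim 4.x
        self._vectors = vectors
    def __getitem__(self, word):
        return self._vectors[word]

def to_batch(model, limit=None):
    # Collect (word, vector) pairs in the shape put_embeddings_batch expects.
    words = model.index_to_key[:limit] if limit else model.index_to_key
    return [(w, list(model[w])) for w in words]

model = FakeKeyedVectors({"ice": [0.1, 0.2], "fire": [0.9, 0.1]})
batch = to_batch(model, limit=50000)
# With a real graph: g.put_embeddings_batch(batch)
print(batch)  # [('ice', [0.1, 0.2]), ('fire', [0.9, 0.1])]
```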

sim

sim(word, operator, value)

Filter vertices by cosine similarity.

Operators: >, <, >=, <=, ==, !=, in [min, max]

g.v().sim("orange", ">", 0.35).all()
# {'result': [{'id': 'clementines'}, {'id': 'tangerine'}, {'id': 'orange'}]}

g.v().sim("orange", "in", [0.25, 0.35]).all()
# {'result': [{'id': 'banana'}, {'id': 'apple'}]}
