Word Embeddings & Vector Search

Store word embeddings and perform semantic filtering in CogDB.

Word embeddings represent words as vectors, which is useful for text classification, similarity search, and clustering. CogDB currently supports text embeddings, powered by the default embedding model bge-small-en-v1.5.

For a complete walkthrough on how to use embeddings with your graph, check out the Semantic Search Guide.


Auto-Vectorize with g.vectorize()

Automatically add embeddings for your graph nodes using CogDB's free embedding service (or a third-party provider). The service uses bge-small-en-v1.5 by default to embed the text of each node.

Embed all nodes in the graph

from cog.torque import Graph

g = Graph("planets")
g.put("europa", "orbits", "jupiter")
g.put("titan", "orbits", "saturn")
g.put("mars", "type", "planet")

g.vectorize()
# {'vectorized': 5, 'skipped': 0, 'total': 5}

Embed specific words

# Single word
g.vectorize("ocean")

# Multiple words (don't need to be in the graph)
g.vectorize(["ice", "atmosphere", "subsurface"])

Use a different provider

# OpenAI
g.vectorize(provider="openai", api_key="sk-...")

# Custom provider
g.vectorize(provider="custom", url="https://my-service.com/embed")

Auto-embed on query

After calling vectorize(), similarity queries automatically embed unknown words:

g.vectorize()

# "ocean" isn't in the graph, but it's auto embedded at query time.
g.v().k_nearest("ocean", k=3).all()
# {'result': [{'id': 'europa'}, {'id': 'titan'}, {'id': 'mars'}]}
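
# sim() filters embed the query word on the fly as well.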
g.v().sim("ice", ">", 0.5).all()

Set your own embedding using put_embedding

For most use cases, g.vectorize() above is the easiest way to add embeddings. The methods below give you full manual control to set your own embeddings when needed.

put_embedding

g.put_embedding("orange", [0.1, 0.2, 0.3, 0.4, 0.5])

put_embeddings_batch

embeddings = [
    ("apple", [0.1, 0.2, 0.3]),
    ("banana", [0.3, 0.4, 0.5]),
    ("orange", [0.7, 0.8, 0.9]),
]
g.put_embeddings_batch(embeddings)

Use put_embeddings_batch() for bulk loading; it is much faster than individual put_embedding() calls.
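
If your vectors come from another system, reshape them into (word, vector) tuples before calling put_embeddings_batch(). A minimal sketch, assuming the vectors already live in a plain Python dict (the dict and its values here are illustrative):

# Hypothetical precomputed vectors keyed by word.
precomputed = {
    "apple":  [0.1, 0.2, 0.3],
    "banana": [0.3, 0.4, 0.5],
    "orange": [0.7, 0.8, 0.9],
}

# Reshape into (word, vector) tuples. Keep every vector the same
# length so cosine similarity comparisons stay well defined.
embeddings = [(word, vec) for word, vec in precomputed.items()]
g.put_embeddings_batch(embeddings)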

get_embedding

g.get_embedding("orange")
# [0.1, 0.2, 0.3, 0.4, 0.5]

delete_embedding

g.delete_embedding("orange")

k_nearest

g.v().k_nearest(word, k=10).all()

Find the k most similar embeddings using cosine similarity.

g = Graph("embeddings_graph")

g.put("vec1", "type", "vector")
g.put("vec2", "type", "vector")
g.put("vec3", "type", "vector")

g.put_embedding("vec1", [0.1, 0.2, 0.3])
g.put_embedding("vec2", [0.1, 0.2, 0.4])
g.put_embedding("vec3", [0.9, 0.8, 0.7])

result = g.v().k_nearest("vec1", k=2).all()
# {'result': [{'id': 'vec1'}, {'id': 'vec2'}]}
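
To see why vec2 ranks ahead of vec3, you can check the cosine similarities by hand. A quick verification sketch using numpy (numpy is not required by CogDB; it is only used here to reproduce the math):

import numpy as np

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cosine([0.1, 0.2, 0.3], [0.1, 0.2, 0.4])  # ~0.99 (vec1 vs vec2)
cosine([0.1, 0.2, 0.3], [0.9, 0.8, 0.7])  # ~0.88 (vec1 vs vec3)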

Loading Pre-trained Embeddings

load_glove

g = Graph("glove")
count = g.load_glove("glove.6B.100d.txt", limit=10000)
result = g.v().k_nearest("king", k=5).all()

load_gensim

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)
g = Graph("w2v")
g.load_gensim(model, limit=50000)
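
Once loaded, the vectors can be queried like any other embeddings; for example, a k_nearest lookup (assuming "queen" is within the loaded vocabulary):

g.v().k_nearest("queen", k=5).all()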

sim

g.v().sim(word, operator, threshold, strict=False).all()

Filter vertices by cosine similarity to the given word.

Operators: >, <, >=, <=, =, !=, in (takes a [low, high] range)

g.v().sim("orange", ">", 0.35).all()
# {'result': [{'id': 'clementines'}, {'id': 'tangerine'}, {'id': 'orange'}]}

g.v().sim("orange", "in", [0.25, 0.35]).all()
# {'result': [{'id': 'banana'}, {'id': 'apple'}]}
