Build a Knowledge Graph from Text with Python

Turn a text document into a queryable knowledge graph.

A knowledge graph is a data structure that represents information as a network. Entities are modeled as nodes, and relationships between them are modeled as edges.

In this guide, we’ll take a plain-text document about planetary habitability, use an LLM to extract entities and relationships, load them into CogDB, and build a knowledge graph.

Once the graph is built, we’ll run queries against it and ask some interesting questions.

Source code on GitHub


Source Material

We’ll use the Wikipedia article Planetary Habitability in the Solar System as our input:

https://en.wikipedia.org/wiki/Planetary_habitability_in_the_Solar_System

The raw text used in this tutorial is available in
planetary-habitability.txt inside the GitHub repository.


Defining the Ontology

To build a knowledge graph, we first define an ontology. An ontology specifies the types of entities and relationships allowed in the graph. You can think of it as the schema for the graph, similar to how a database schema defines the structure of a relational database.

The ontology gives meaning to nodes and edges.

Data will be stored as Subject–Predicate–Object (SPO) triples:

  • Subject → node
  • Predicate → edge
  • Object → node

Nodes represent entities, and edges represent relationships between them.
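For example, the sentence "Europa orbits Jupiter and may harbor life" yields two triples, which we can represent as plain Python tuples:

```python
# Each SPO triple is a plain (subject, predicate, object) tuple.
triples = [
    ("Europa", "ORBITS", "Jupiter"),
    ("Europa", "MAY_HARBOR", "Life"),
]

for subject, predicate, obj in triples:
    print(f"{subject} --{predicate}--> {obj}")
```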

Since we’re working in the domain of planetary science and space exploration, we’ll start by defining some entity types. For a well-documented domain like this, an LLM can generate a reasonable starting ontology.

If we prompt an LLM with the raw text and ask for entity categories, we might get something like:

- CelestialBody
- Moon
- Mission
- Spacecraft
- SpaceAgency
- Scientist
- Chemical
- Feature
- Concept
- Region
- Instrument
- Event

Next, we define the relationships that can connect these entities:

- ORBITS
- MOON_OF
- HAS_FEATURE
- HAS_ATMOSPHERE
- EXPLORED_BY
- OPERATED_BY
- LAUNCHED_IN
- CARRIES
- DISCOVERED
- INDICATES
- REQUIRES
- LOCATED_IN
- MAY_HARBOR
- CONTAINS
- SUCCESSOR_OF
- TARGETS
- EVIDENCE_FOR
- PROPOSED_BY
- PART_OF
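Putting both lists together, the ontology can be captured as a plain dictionary. The extraction code later in this guide reads from a structure like this; the per-type descriptions below are illustrative, not canonical:

```python
# Illustrative ONTOLOGY structure. The entity and relationship names come
# from the lists above; the descriptions are example annotations.
ONTOLOGY = {
    "entity_types": {
        "CelestialBody": "A planet, dwarf planet, asteroid, or comet",
        "Moon": "A natural satellite of a celestial body",
        "Mission": "A named space mission or program",
        "Spacecraft": "A probe, rover, lander, or telescope",
        "SpaceAgency": "An organization such as NASA or ESA",
        "Scientist": "A named researcher",
        "Chemical": "A chemical compound or element",
        "Feature": "A surface or subsurface feature",
        "Concept": "An abstract scientific concept",
        "Region": "A named region on a celestial body",
        "Instrument": "A scientific instrument",
        "Event": "A discrete event, such as a launch or discovery",
    },
    "relationship_types": [
        "ORBITS", "MOON_OF", "HAS_FEATURE", "HAS_ATMOSPHERE",
        "EXPLORED_BY", "OPERATED_BY", "LAUNCHED_IN", "CARRIES",
        "DISCOVERED", "INDICATES", "REQUIRES", "LOCATED_IN",
        "MAY_HARBOR", "CONTAINS", "SUCCESSOR_OF", "TARGETS",
        "EVIDENCE_FOR", "PROPOSED_BY", "PART_OF",
    ],
}
```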

Extracting Entities and Relationships

With the ontology defined, we can extract structured data from the raw text.

LLMs work well for this kind of task. We provide the ontology and ask the model to extract entities that conform to it.

Here’s an example:

import json

def extract_entities(client, text: str) -> list[dict]:
    """Extract typed entities from text, guided by the ontology."""
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a knowledge graph entity extractor.\n\n"
                    "Given a text about planetary science and astrobiology, extract "
                    "all named entities. For each entity return:\n"
                    '  {"name": "<short canonical name>", "type": "<EntityType>"}\n\n'
                    "Use these entity types ONLY:\n"
                    + "\n".join(f"- {k}: {v}" for k, v in ONTOLOGY["entity_types"].items())
                    + "\n\nRules:\n"
                    "- Use short canonical names: 'Europa' not 'Europa, a moon of Jupiter'\n"
                    "- Normalize: 'JWST' and 'James Webb Space Telescope' → 'JWST'\n"
                    "- One entry per unique entity\n"
                    "- Return ONLY a JSON array. No commentary.\n"
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(raw)

Normalization

After extraction, we typically need to normalize entities and relationships. Different phrases may refer to the same concept, and we want them represented consistently in the graph.

For example:

- james webb space telescope -> jwst
- james webb -> jwst
- webb telescope -> jwst
- webb -> jwst
- jst -> jwst
- perseverance rover -> perseverance
- perseverance -> perseverance
- cassini-huygens -> cassini
- cassini spacecraft -> cassini

Normalization can be handled either by prompting the LLM carefully or by post-processing its output.
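A lightweight post-processing pass can apply an alias table built from mappings like those above. This is a sketch; the `ALIASES` map is hand-written for this domain and would grow as new surface forms appear:

```python
# Hand-written alias table mapping surface forms to canonical names.
ALIASES = {
    "james webb space telescope": "jwst",
    "james webb": "jwst",
    "webb telescope": "jwst",
    "webb": "jwst",
    "perseverance rover": "perseverance",
    "cassini-huygens": "cassini",
    "cassini spacecraft": "cassini",
}

def normalize(name: str) -> str:
    """Lowercase, collapse whitespace, and map known aliases to a canonical form."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, key)

def normalize_triples(triples):
    """Normalize subjects and objects; predicates are left as-is."""
    return [(normalize(s), p, normalize(o)) for s, p, o in triples]
```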

Loading into CogDB

Once we have triples, we can load them into CogDB.

Here is a simplified example:


from cog.torque import Graph

# Entity names are normalized to lowercase before loading,
# matching the queries shown later in this guide.
triples = [
    ("europa", "ORBITS", "jupiter"),
    ("europa", "HAS_FEATURE", "subsurface ocean"),
    ("europa", "MAY_HARBOR", "life"),
]

g = Graph("AstrobiologyKG")

g.put_batch(triples)

We can also attach vector embeddings to entities to enable semantic search. In the demo code, embeddings are generated using OpenAI’s embeddings API.

# dummy embeddings for demonstration purposes.
g.put_embedding("europa", [0.1, 0.2, 0.3]) 
g.put_embedding("jupiter", [0.1, 0.2, 0.3])
g.put_embedding("subsurface ocean", [0.1, 0.2, 0.3])
g.put_embedding("life", [0.1, 0.2, 0.3])
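In the demo, the dummy vectors above are replaced with real vectors from the embeddings API. A minimal sketch, assuming an `embed_entities` helper and the `text-embedding-3-small` model (both are our choices for illustration, not fixed by CogDB):

```python
def embed_entities(g, names: list[str], client=None) -> None:
    """Fetch one embedding per entity name and attach each to the graph."""
    if client is None:
        # Assumes OPENAI_API_KEY is set in the environment.
        from openai import OpenAI
        client = OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",  # example model choice
        input=names,
    )
    # The API returns one embedding per input, in order.
    for name, item in zip(names, response.data):
        g.put_embedding(name, item.embedding)
```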

Querying the Knowledge Graph

Now we have a working knowledge graph. We can traverse it, run semantic searches, or combine both.

Below are a few example queries. Source code on GitHub

A. Graph Traversal

What is Europa connected to in the graph?

g.v("europa").out().all()

life, subsurface ocean


Which entities mention or relate to water?

g.v("water").inc().all()

messenger


Where is Jezero Crater located, and what might that body harbor?

g.v("jezero crater").out("located_in").out("may_harbor").all()

life


Which missions target bodies that may harbor life?

g.v("life").inc("may_harbor").inc("targets").all()

europa clipper, dragonfly


B. Semantic Search

Nearest entities to "signs of life"

g.v().k_nearest("signs of life", k=5).all()

biosignature, dawn, massapanspermia, mars, panspermia


Nearest entities to "ocean world"

g.v().k_nearest("ocean world", k=5).all()

subsurface ocean, europa, caloris basin, moon, mars


C. Hybrid: Graph Traversal + Semantic Filtering

What entities are within two hops of Europa that are most related to "water and ice"?

g.v("europa").bfs(max_depth=2).k_nearest("water and ice", k=3).all()

subsurface ocean


Which bodies may harbor life, ranked by relevance to "extraterrestrial biology"?

g.v("life").inc("may_harbor").k_nearest("extraterrestrial biology", k=5).all()

mars, titan, europa


What is Enceladus connected to that is most related to "ocean"?

g.v("enceladus").out().k_nearest("ocean", k=3).all()

subsurface ocean


What do NASA spacecraft target, ranked by "habitable world"?

g.v("nasa").inc("explored_by").out("targets").k_nearest("habitable world", k=5).all()

europa


Find entities similar to "icy moon", then see what they may harbor

g.v().k_nearest("icy moon", k=3).out("may_harbor").all()

early life, life, subsurface ocean


Which missions target habitable bodies, ranked by "robotic spacecraft"?

g.v("life").inc("may_harbor").inc("targets").k_nearest("robotic spacecraft", k=3).all()

dragonfly, europa clipper


Find "space exploration" entities, follow their edges, then filter for "habitable world"

g.v().k_nearest("space exploration", k=10).out().k_nearest("habitable world", k=5).all()

subsurface ocean, curiosity, gale crater, europa, titan


Find "deep space probe" entities, see what they target, then filter for "frozen world"

g.v().k_nearest("deep space probe", k=5).out("targets").k_nearest("frozen world", k=3).all()

gale crater

