Build a Knowledge Graph from Text with Python

Turn a text document into a queryable knowledge graph.

A knowledge graph is a data structure that represents information as a network. Entities are modeled as nodes, and relationships between them are modeled as edges.

In this guide, we’ll take a plain-text document about planetary habitability, use an LLM to extract entities and relationships, load them into CogDB, and build a knowledge graph.

Once the graph is built, we’ll run queries against it and ask some interesting questions.

Source code on GitHub


Source Material

We’ll use the Wikipedia article Planetary Habitability in the Solar System as our input:

https://en.wikipedia.org/wiki/Planetary_habitability_in_the_Solar_System

The raw text used in this tutorial is available in
planetary-habitability.txt inside the GitHub repository.


Defining the Ontology

To build a knowledge graph, we first define an ontology. An ontology specifies the types of entities and relationships allowed in the graph. You can think of it as the schema for the graph, similar to how a database schema defines the structure of a relational database.

The ontology gives meaning to nodes and edges.

Data will be stored as Subject–Predicate–Object (SPO) triples:

  • Subject → node
  • Predicate → edge
  • Object → node

Nodes represent entities, and edges represent relationships between them.
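For example, the sentence "Europa orbits Jupiter and may harbor life" yields two triples, which we can represent as plain Python tuples:

```python
# Each SPO triple is a plain (subject, predicate, object) tuple.
triples = [
    ("Europa", "ORBITS", "Jupiter"),
    ("Europa", "MAY_HARBOR", "Life"),
]

for subject, predicate, obj in triples:
    print(f"{subject} --{predicate}--> {obj}")
```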

Since we’re working in the domain of planetary science and space exploration, we’ll start by defining some entity types. For a well-documented domain like this, an LLM can generate a reasonable starting ontology.

If we prompt an LLM with the raw text and ask for entity categories, we might get something like:

- CelestialBody
- Moon
- Mission
- Spacecraft
- SpaceAgency
- Scientist
- Chemical
- Feature
- Concept
- Region
- Instrument
- Event

Next, we define the relationships that can connect these entities:

- ORBITS
- MOON_OF
- HAS_FEATURE
- HAS_ATMOSPHERE
- EXPLORED_BY
- OPERATED_BY
- LAUNCHED_IN
- CARRIES
- DISCOVERED
- INDICATES
- REQUIRES
- LOCATED_IN
- MAY_HARBOR
- CONTAINS
- SUCCESSOR_OF
- TARGETS
- EVIDENCE_FOR
- PROPOSED_BY
- PART_OF
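Putting both lists together, the ontology can be captured as a plain dictionary. The extraction code later in this guide reads from a structure like this; the per-type descriptions below are illustrative, not canonical:

```python
# Illustrative ONTOLOGY structure. The entity and relationship names come
# from the lists above; the descriptions are example annotations.
ONTOLOGY = {
    "entity_types": {
        "CelestialBody": "A planet, dwarf planet, asteroid, or comet",
        "Moon": "A natural satellite of a celestial body",
        "Mission": "A named space mission or program",
        "Spacecraft": "A probe, rover, lander, or telescope",
        "SpaceAgency": "An organization such as NASA or ESA",
        "Scientist": "A named researcher",
        "Chemical": "A chemical compound or element",
        "Feature": "A surface or subsurface feature",
        "Concept": "An abstract scientific concept",
        "Region": "A named region on a celestial body",
        "Instrument": "A scientific instrument",
        "Event": "A discrete event, such as a launch or discovery",
    },
    "relationship_types": [
        "ORBITS", "MOON_OF", "HAS_FEATURE", "HAS_ATMOSPHERE",
        "EXPLORED_BY", "OPERATED_BY", "LAUNCHED_IN", "CARRIES",
        "DISCOVERED", "INDICATES", "REQUIRES", "LOCATED_IN",
        "MAY_HARBOR", "CONTAINS", "SUCCESSOR_OF", "TARGETS",
        "EVIDENCE_FOR", "PROPOSED_BY", "PART_OF",
    ],
}
```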

Extracting Entities and Relationships

With the ontology defined, we can extract structured data from the raw text.

LLMs work well for this kind of task. We provide the ontology and ask the model to extract entities that conform to it.

Here’s an example:

import json

def extract_entities(client, text: str) -> list[dict]:
    """Extract typed entities from text, guided by the ontology."""
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a knowledge graph entity extractor.\n\n"
                    "Given a text about planetary science and astrobiology, extract "
                    "all named entities. For each entity return:\n"
                    '  {"name": "<short canonical name>", "type": "<EntityType>"}\n\n'
                    "Use these entity types ONLY:\n"
                    + "\n".join(f"- {k}: {v}" for k, v in ONTOLOGY["entity_types"].items())
                    + "\n\nRules:\n"
                    "- Use short canonical names: 'Europa' not 'Europa, a moon of Jupiter'\n"
                    "- Normalize: 'JWST' and 'James Webb Space Telescope' → 'JWST'\n"
                    "- One entry per unique entity\n"
                    "- Return ONLY a JSON array. No commentary.\n"
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(raw)

Normalization

After extraction, we typically need to normalize entities and relationships. Different phrases may refer to the same concept, and we want them represented consistently in the graph.

For example:

- james webb space telescope -> jwst
- james webb -> jwst
- webb telescope -> jwst
- webb -> jwst
- jst -> jwst
- perseverance rover -> perseverance
- perseverance -> perseverance
- cassini-huygens -> cassini
- cassini spacecraft -> cassini

Normalization can be handled either by prompting the LLM carefully or by post-processing its output.
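A lightweight post-processing pass can apply an alias table built from mappings like those above. This is a sketch; the `ALIASES` map is hand-written for this domain and would grow as new surface forms appear:

```python
# Hand-written alias table mapping surface forms to canonical names.
ALIASES = {
    "james webb space telescope": "jwst",
    "james webb": "jwst",
    "webb telescope": "jwst",
    "webb": "jwst",
    "perseverance rover": "perseverance",
    "cassini-huygens": "cassini",
    "cassini spacecraft": "cassini",
}

def normalize(name: str) -> str:
    """Lowercase, collapse whitespace, and map known aliases to a canonical form."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, key)

def normalize_triples(triples):
    """Normalize subjects and objects; predicates are left as-is."""
    return [(normalize(s), p, normalize(o)) for s, p, o in triples]
```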

Loading into CogDB

Once we have triples, we can load them into CogDB.

Here is a simplified example:


from cog.torque import Graph

# Entity names are normalized to lowercase before loading,
# matching the queries shown later in this guide.
triples = [
    ("europa", "ORBITS", "jupiter"),
    ("europa", "HAS_FEATURE", "subsurface ocean"),
    ("europa", "MAY_HARBOR", "life"),
]

g = Graph("AstrobiologyKG")

g.put_batch(triples)

We can also attach vector embeddings to entities to enable semantic search. In the demo code, embeddings are generated using OpenAI’s embeddings API.

# dummy embeddings for demonstration purposes.
g.put_embedding("europa", [0.1, 0.2, 0.3]) 
g.put_embedding("jupiter", [0.1, 0.2, 0.3])
g.put_embedding("subsurface ocean", [0.1, 0.2, 0.3])
g.put_embedding("life", [0.1, 0.2, 0.3])
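In the demo, the dummy vectors above are replaced with real vectors from the embeddings API. A minimal sketch, assuming an `embed_entities` helper and the `text-embedding-3-small` model (both are our choices for illustration, not fixed by CogDB):

```python
def embed_entities(g, names: list[str], client=None) -> None:
    """Fetch one embedding per entity name and attach each to the graph."""
    if client is None:
        # Assumes OPENAI_API_KEY is set in the environment.
        from openai import OpenAI
        client = OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",  # example model choice
        input=names,
    )
    # The API returns one embedding per input, in order.
    for name, item in zip(names, response.data):
        g.put_embedding(name, item.embedding)
```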

Querying the Knowledge Graph

Now we have a working knowledge graph. We can traverse it, run semantic searches, or combine both.

Below are a few example queries. Source code on GitHub

A. Graph Traversal

What is Europa connected to in the graph?

g.v("europa").out().all()

life, subsurface ocean


Which entities mention or relate to water?

g.v("water").inc().all()

messenger


Where is Jezero Crater located, and what might that body harbor?

g.v("jezero crater").out("located_in").out("may_harbor").all()

life


Which missions target bodies that may harbor life?

g.v("life").inc("may_harbor").inc("targets").all()

europa clipper, dragonfly


B. Semantic Search

Nearest entities to "signs of life"

g.v().k_nearest("signs of life", k=5).all()

biosignature, dawn, massapanspermia, mars, panspermia


Nearest entities to "ocean world"

g.v().k_nearest("ocean world", k=5).all()

subsurface ocean, europa, caloris basin, moon, mars


C. Hybrid: Graph Traversal + Semantic Filtering

What entities are within two hops of Europa that are most related to "water and ice"?

g.v("europa").bfs(max_depth=2).k_nearest("water and ice", k=3).all()

subsurface ocean


Which bodies may harbor life, ranked by relevance to "extraterrestrial biology"?

g.v("life").inc("may_harbor").k_nearest("extraterrestrial biology", k=5).all()

mars, titan, europa


What is Enceladus connected to that is most related to "ocean"?

g.v("enceladus").out().k_nearest("ocean", k=3).all()

subsurface ocean


What do NASA spacecraft target, ranked by "habitable world"?

g.v("nasa").inc("explored_by").out("targets").k_nearest("habitable world", k=5).all()

europa


Find entities similar to "icy moon", then see what they may harbor

g.v().k_nearest("icy moon", k=3).out("may_harbor").all()

early life, life, subsurface ocean


Which missions target habitable bodies, ranked by "robotic spacecraft"?

g.v("life").inc("may_harbor").inc("targets").k_nearest("robotic spacecraft", k=3).all()

dragonfly, europa clipper


Find "space exploration" entities, follow their edges, then filter for "habitable world"

g.v().k_nearest("space exploration", k=10).out().k_nearest("habitable world", k=5).all()

subsurface ocean, curiosity, gale crater, europa, titan


Find "deep space probe" entities, see what they target, then filter for "frozen world"

g.v().k_nearest("deep space probe", k=5).out("targets").k_nearest("frozen world", k=3).all()

gale crater

