
Using Seq2Seq Models for Relation Extraction in GraphRAG

Cover image courtesy of db-engines.

Introduction

A few weeks ago, I tried to build a RAG system to search through some legal documents. My ordinary vector RAG performed terribly, failing to retrieve the relevant parts of the documents when queried. After some discussion, my professor advised me to try GraphRAG, since graph-based approaches are generally better at finding relations between the entities mentioned in text (in my case, corporations) and are simply better at connecting the dots.

I tried several: one I built myself, then Microsoft's, then Morphik, which seemed promising but was too buggy, and finally Graphiti, which was the winner for my use case.

One issue all GraphRAGs suffer from is cost, in both money and time. Extracting the entities and relations from a big corpus of text is very time-consuming and requires many LLM calls, which in my case went to the OpenAI API. I got to watch my API bill skyrocket almost in real time, especially since Graphiti was configured to use an expensive model by default. So watch out if you're planning to try it.

But it hasn't always been this way. Before GPTs became popular, BERT-style and Seq2Seq models were the standard for information extraction, which includes both relation extraction and entity extraction. After some research, I found REBEL, a Seq2Seq model for relation extraction and the most downloaded one on Hugging Face. So how did it perform?

Why not LLMs again?

For an LLM to be merely good at relation or entity extraction, it needs billions of parameters and support for structured output, which makes the extraction process expensive and slow. BERT-style and Seq2Seq models, by contrast, only need a few hundred million parameters (REBEL has 406M) and produce structured output by design.

Also, the relations an LLM extracts can differ from one piece of text to another even when the same relations are present in both, and it can name the same relation differently, e.g., Ziad -----loves-----> taboola versus Ziad -----enjoys-----> taboola. In other words, there's no accountability or deterministic mechanism governing how an LLM extracts relations, whereas Seq2Seq models and BERTs are far more deterministic and predictable.

Testing Out Plain REBEL

For Seq2Seq models to perform the way we want, it's best to fine-tune them; unlike LLMs, we can't just zero-shot prompt them, which is one of their downsides. But for now, I'll try out plain REBEL and see how it performs.
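Before the comparison, here's roughly what running REBEL looks like: a minimal sketch adapted from the usage example on the Babelscape/rebel-large model card, assuming the input is in a variable called `text`. REBEL emits triplets as one linearized string with special `<triplet>`/`<subj>`/`<obj>` markers, which you parse back into head/relation/tail.

    from transformers import pipeline

    triplet_extractor = pipeline(
        "text2text-generation",
        model="Babelscape/rebel-large",
        tokenizer="Babelscape/rebel-large",
    )

    def extract_triplets(decoded):
        """Parse REBEL's linearized output into {head, relation, tail} dicts
        (adapted from the parsing snippet on the model card)."""
        triplets, head, rel, tail, current = [], "", "", "", None
        for token in decoded.replace("<s>", "").replace("</s>", "").replace("<pad>", "").split():
            if token == "<triplet>":
                # A new head starts; flush the previous triplet if complete.
                if rel:
                    triplets.append({"head": head.strip(), "relation": rel.strip(), "tail": tail.strip()})
                head, rel, tail, current = "", "", "", "head"
            elif token == "<subj>":
                # Same head, new tail/relation pair; flush the previous pair.
                if rel:
                    triplets.append({"head": head.strip(), "relation": rel.strip(), "tail": tail.strip()})
                tail, rel, current = "", "", "tail"
            elif token == "<obj>":
                rel, current = "", "relation"
            elif current == "head":
                head += " " + token
            elif current == "tail":
                tail += " " + token
            elif current == "relation":
                rel += " " + token
        if head and rel and tail:
            triplets.append({"head": head.strip(), "relation": rel.strip(), "tail": tail.strip()})
        return triplets

    # Decode with the special tokens kept so the markers survive.
    out = triplet_extractor(text, return_tensors=True, return_text=False)
    decoded = triplet_extractor.tokenizer.batch_decode([out[0]["generated_token_ids"]])[0]
    relations = extract_triplets(decoded)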

Relation Extraction

I'll use part of a fake acquisition contract generated by ChatGPT and see how REBEL and ChatGPT-4.1-mini compare at relation extraction.

Contract:

   CONTRACT 1: ACQUISITION AGREEMENT BETWEEN AVENTINE HOLDINGS INC. AND BRIARWOOD ANALYTICS LLC

This Acquisition Agreement (the "Agreement") is made and entered into as of July 1, 2025, by and among Aventine Holdings Inc., a Delaware corporation with its principal place of business at 500 Capitol Blvd., Wilmington, DE ("Acquirer"), and Briarwood Analytics LLC, a California limited liability company with its principal place of business at 2199 Market Street, San Francisco, CA ("Target"), and its Members, Willow H. Greaves ("Greaves") and Xander L. Moreau ("Moreau").

RECITALS

WHEREAS, Target is engaged in the business of data analytics for agritech solutions;

WHEREAS, Greaves owns sixty percent (60%) of the outstanding membership interests of Target and Moreau owns forty percent (40%) of the outstanding membership interests;

WHEREAS, Moreau and Yara M. Chen ("Chen") also hold an indirect interest of ten percent (10%) in Target via a silent partnership trust registered in the British Virgin Islands;

WHEREAS, Acquirer desires to purchase, and Target desires to sell, all of the outstanding membership interests in Target, thereby making Target a wholly-owned subsidiary of Acquirer;

NOW, THEREFORE, in consideration of the mutual promises herein contained, the parties agree as follows:

ChatGPT prompt:

    Extract the relations from the given text. Structure each relation as a JSON object with the keys: "head", "relation", and "tail". Each object should represent a statement where the "head" entity is connected to the "tail" entity via the specified "relation".

    <Text>
    {text}
    </Text>
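For completeness, the ChatGPT side is a single API call. Here's a minimal sketch assuming the official `openai` Python client, with `prompt_template` being a hypothetical variable holding the prompt above:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def extract_relations_llm(text):
        # Fill the {text} placeholder in the prompt and send one chat request.
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": prompt_template.format(text=text)}],
        )
        return response.choices[0].message.content  # JSON relations as text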

Plugging in this piece of text without any chunking gives the following results:

| # | REBEL | ChatGPT-4.1-mini |
|---|-------|------------------|
| 1 | Aventine Holdings Inc -----headquarters location-----> Wilmington, DE | Acquisition Agreement -----is made and entered into as of-----> July 1, 2025 |
| 2 | | Acquisition Agreement -----is among-----> Aventine Holdings Inc. |
| 3 | | Acquisition Agreement -----is among-----> Briarwood Analytics LLC |
| 4 | | Acquisition Agreement -----is among-----> Willow H. Greaves |
| 5 | | Acquisition Agreement -----is among-----> Xander L. Moreau |
| 6 | | Aventine Holdings Inc. -----is a-----> Delaware corporation |
| 7 | | Aventine Holdings Inc. -----has principal place of business at-----> 500 Capitol Blvd., Wilmington, DE |
| 8 | | Briarwood Analytics LLC -----is a-----> California limited liability company |
| 9 | | Briarwood Analytics LLC -----has principal place of business at-----> 2199 Market Street, San Francisco, CA |
| 10 | | Aventine Holdings Inc. -----is referred to as-----> Acquirer |
| 11 | | Briarwood Analytics LLC -----is referred to as-----> Target |
| 12 | | Willow H. Greaves -----is a Member of-----> Target |
| 13 | | Xander L. Moreau -----is a Member of-----> Target |
| 14 | | Target -----is engaged in business of-----> data analytics for agritech solutions |
| 15 | | Willow H. Greaves -----owns-----> 60% of membership interests of Target |
| 16 | | Xander L. Moreau -----owns-----> 40% of membership interests of Target |
| 17 | | Xander L. Moreau and Yara M. Chen -----hold-----> 10% indirect interest in Target via silent partnership trust |
| 18 | | Silent partnership trust -----is registered in-----> British Virgin Islands |
| 19 | | Acquirer -----desires to purchase-----> all outstanding membership interests in Target |
| 20 | | Target -----desires to sell-----> all outstanding membership interests |
| 21 | | Purchase of membership interests -----will make-----> Target a wholly-owned subsidiary of Acquirer |

Okay, no contest here: ChatGPT definitely blew REBEL out of the water. But we're not actually using REBEL properly. REBEL is trained to digest small pieces of text and output the relations present in them. So let's do that: let's break this piece of text into smaller chunks.

Relation Extraction with Text Chunking

I'll only feed the chunked text to REBEL and keep ChatGPT's input the same, because feeding ChatGPT the text chunk by chunk would be impractical, and ChatGPT already does a good enough job of extracting relations from the whole text.

I tried multiple chunking techniques, mainly the semantic and recursive chunkers from LangChain, but they were sub-optimal at best. And a hint: chunking alone is not enough, as you'll see below.

When I chunked the text into very small chunks with the recursive chunker, the semantic meaning was lost and REBEL would struggle to understand how each entity relates to the others. I needed to somehow compress the text while preserving its semantic meaning, so I made the chunks bigger and summarized each one with a summarization model, Falconsai/text_summarization. This is a fine-tuned T5 model with only 60M parameters, so it's very light.
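That step is a standard `transformers` summarization pipeline. A minimal sketch, where the length limits are illustrative values you'd tune to your chunk size:

    from transformers import pipeline

    # Fine-tuned T5 summarizer, ~60M parameters.
    summarizer = pipeline("summarization", model="Falconsai/text_summarization")

    def compress_chunk(chunk):
        # max_length/min_length are illustrative; tune them to your chunks.
        return summarizer(chunk, max_length=100, min_length=20, do_sample=False)[0]["summary_text"]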

The summarization kept the semantic meaning of each chunk while making it shorter, which produced better, yet still not great, results. I wanted to compress the text even more, so I used a claim extraction model, Babelscape/t5-base-summarization-claim-extractor, which takes a summarized piece of text and outputs the atomic claims, i.e., the atomic propositions it contains. This is perfect for REBEL: atomic propositions are short and contain only the information needed to extract a relation. Oh, if only it were that easy.
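Here's a sketch of the claim extraction stage, assuming the model follows the usual T5 text2text interface and returns the claims as plain sentences (the exact output formatting may differ):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    claim_tok = AutoTokenizer.from_pretrained("Babelscape/t5-base-summarization-claim-extractor")
    claim_model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/t5-base-summarization-claim-extractor")

    def extract_claims(summary):
        inputs = claim_tok(summary, return_tensors="pt", truncation=True)
        outputs = claim_model.generate(**inputs, max_new_tokens=256)
        decoded = claim_tok.decode(outputs[0], skip_special_tokens=True)
        # Assumption: the claims come back as a run of sentences; split on periods.
        return [c.strip() + "." for c in decoded.split(".") if c.strip()]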

REBEL would still hallucinate non-existent entities and relations; for example, it would output something about John F. Kennedy and Harvard University, although neither appears in the given text. This happened because the chunking I employed wasn't great: it would always produce some chunks that were too short, too long, or cut off at a weird place. To fix this, I wrote my own sliding-window text chunker, which slides through the text at a fixed step size and returns a list of overlapping chunks. Unlike the recursive chunker's output, these chunks always contain something meaningful, though they can still carry too much information.

Code for the sliding-window text chunker

    def sliding_window_words(text, window_size, step=1):
        """Slide a window of `window_size` words over `text`, moving `step`
        words at a time, and return the list of overlapping chunks."""
        words = text.split()
        n = len(words)
        res = []
        # Start at index 0 so the first words are included, and stop once
        # the window no longer fits within the text.
        for i in range(0, max(n - window_size, 0) + 1, step):
            res.append(' '.join(words[i:i + window_size]))
        return res
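Usage is straightforward; the window size and step below are illustrative values you'd tune to your text:

    chunks = sliding_window_words(contract_text, window_size=40, step=10)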

The hallucinations appeared mostly in the claims. The sliding-window chunker solved most of them, like the JFK one, but others persisted. The remaining hallucinations were gibberish: text that doesn't make sense, like a word repeated many times over or a sentence structure that simply isn't coherent. So as a final step I added a gibberish detection model, madhurjindal/autonlp-Gibberish-Detector-492513457, to analyze the claims and filter out the gibberish ones.
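The filtering itself is a text-classification pipeline. A minimal sketch, assuming the label the model assigns to clean text is `clean`:

    from transformers import pipeline

    gibberish_detector = pipeline(
        "text-classification",
        model="madhurjindal/autonlp-Gibberish-Detector-492513457",
    )

    def filter_gibberish(claims):
        # Keep only the claims the detector classifies as clean text.
        return [c for c in claims if gibberish_detector(c)[0]["label"] == "clean"]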

Don't worry about the number of models used; they're very fast and have a very low memory footprint.

Finally, after many trials (which I wish I could share with you, but that would make this article too long), this is the result of extracting the relations using REBEL:

| # | REBEL | ChatGPT-4.1-mini |
|---|-------|------------------|
| 1 | Briarwood Analytics LLC -----located in the administrative territorial entity-----> Delaware | Acquisition Agreement -----is made and entered into as of-----> July 1, 2025 |
| 2 | Aventine Holdings -----headquarters location-----> Wilmington, DE | Acquisition Agreement -----is among-----> Aventine Holdings Inc. |
| 3 | Briarwood Analytics LLC -----instance of-----> limited liability company | Acquisition Agreement -----is among-----> Briarwood Analytics LLC |
| 4 | Aventine Holdings -----located in the administrative territorial entity-----> Delaware | Acquisition Agreement -----is among-----> Willow H. Greaves |
| 5 | Target -----located in the administrative territorial entity-----> Delaware | Acquisition Agreement -----is among-----> Xander L. Moreau |
| 6 | Target -----headquarters location-----> Wilmington, DE | Aventine Holdings Inc. -----is a-----> Delaware corporation |
| 7 | Aventine Holdings Inc. -----located in the administrative territorial entity-----> Delaware | Aventine Holdings Inc. -----has principal place of business at-----> 500 Capitol Blvd., Wilmington, DE |
| 8 | Briarwood Analytics -----located in the administrative territorial entity-----> Delaware | Briarwood Analytics LLC -----is a-----> California limited liability company |
| 9 | Target -----industry-----> agritech | Briarwood Analytics LLC -----has principal place of business at-----> 2199 Market Street, San Francisco, CA |
| 10 | Target -----industry-----> data analytics | Aventine Holdings Inc. -----is referred to as-----> Acquirer |
| 11 | Target -----owner of-----> Moreau | Briarwood Analytics LLC -----is referred to as-----> Target |
| 12 | Moreau -----owned by-----> Target | Willow H. Greaves -----is a Member of-----> Target |
| 13 | Target -----subsidiary-----> Moreau | Xander L. Moreau -----is a Member of-----> Target |
| 14 | Moreau -----parent organization-----> Target | Target -----is engaged in business of-----> data analytics for agritech solutions |
| 15 | Analytics LLC -----headquarters location-----> San Francisco, CA | Willow H. Greaves -----owns-----> 60% of membership interests of Target |
| 16 | Analytics LLC -----headquarters location-----> San Francisco | Xander L. Moreau -----owns-----> 40% of membership interests of Target |
| 17 | Market Street -----located in the administrative territorial entity-----> San Francisco, CA | Xander L. Moreau and Yara M. Chen -----hold-----> 10% indirect interest in Target via silent partnership trust |
| 18 | Analytics LLC -----located in the administrative territorial entity-----> California | Silent partnership trust -----is registered in-----> British Virgin Islands |
| 19 | Analytics LLC -----instance of-----> limited liability company | Acquirer -----desires to purchase-----> all outstanding membership interests in Target |
| 20 | Target -----owned by-----> Yara M. Chen ("Chen") | Target -----desires to sell-----> all outstanding membership interests |
| 21 | Yara M. Chen ("Chen") -----owner of-----> Target | Purchase of membership interests -----will make-----> Target a wholly-owned subsidiary of Acquirer |
| 22 | Target -----owned by-----> Yara M. Moreau | |
| 23 | Yara M. Moreau -----owner of-----> Target | |
| 24 | Target -----instance of-----> silent partnership trust | |
| 25 | Target -----located in the administrative territorial entity-----> British Virgin Islands | |
| 26 | Target -----headquarters location-----> British Virgin Islands | |
| 27 | Target -----owned by-----> Yara M. Chen | |
| 28 | Target -----parent organization-----> Acquirer | |
| 29 | Acquirer -----subsidiary-----> Target | |
| 30 | Target -----owned by-----> Acquirer | |
| 31 | Acquirer -----owner of-----> Target | |

You'll notice there are still slight hallucinations, such as Target -----subsidiary-----> Moreau: Moreau isn't a company, he's a person. Moreover, ChatGPT's relations contain more detail, for example Xander L. Moreau and Yara M. Chen -----hold-----> 10% indirect interest in Target via silent partnership trust. But these are still amazing results!

We can fix the aforementioned issues by fine-tuning REBEL on a subset of the relations present in the documents. This can be done by using ChatGPT to extract a sample of relations and then training REBEL on ChatGPT's output.
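For reference, REBEL's training targets are linearized triplet strings, so preparing such a dataset would mostly mean converting ChatGPT's JSON output into that format. A sketch of the data preparation only, not the full training loop:

    def to_rebel_target(triplets):
        # REBEL linearizes each triplet as: <triplet> head <subj> tail <obj> relation
        return " ".join(
            f"<triplet> {t['head']} <subj> {t['tail']} <obj> {t['relation']}"
            for t in triplets
        )

    # e.g. to_rebel_target([{"head": "Willow H. Greaves",
    #                        "relation": "owner of", "tail": "Target"}])
    # -> "<triplet> Willow H. Greaves <subj> Target <obj> owner of"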

Fine-Tuning REBEL

Unfortunately, I haven't gotten to this part yet. I'm already inundated with projects, so I have to put this one on hold for a bit. If you have any advice or ideas for improving this workflow or fine-tuning REBEL, please contact me through my socials; I'd love to chat about it!

I also won't stop at fine-tuning: I want to try out a real GraphRAG built on this implementation, which I hope to do in the next couple of weeks.

Conclusion

Thank you for joining me on this journey and for reading this article. If you have any questions, you can contact me through my socials above.

You can also find the Jupyter notebook I used here.