Notebook study that improves the RAG accuracy of a Llama Hub pack that deals with embedded tables

erdoganhalit/uefa-cl-rag

The study in this repo is inspired by Llama Hub's EmbeddedTablesUnstructuredRetrieverPack and AI Makerspace's implementation of it.

Although the pack successfully uses LlamaIndex modules to parse and retrieve HTML content, the accuracy of its responses to detailed queries that rely on table content left room for improvement.

Closer inspection shows that tables with complex structures are not parsed entirely correctly, which lowers accuracy.

The unstructured node parser takes as input a llama_index.schema.Document that is read with a LlamaIndex FlatReader. Even though this Document retains most of the HTML tags, such as <span>, <tr>, and <td>, and the UnstructuredElementNodeParser does a great job of detecting the tables, it fails to extract the full information under some conditions, including but not limited to:

  • The table has merged cells
  • The cell value includes references
  • The cell value is a comma-separated list

This causes information loss that is recoverable, because the original data source is an HTML page that contains all the tags needed to capture the table's information in full.
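To illustrate how the HTML tags alone can recover merged cells, reference markers, and comma-separated values, here is a minimal sketch using only Python's standard library. It is not the notebook's actual function: the names `TableGridParser` and `table_to_grid`, the choice to drop `<sup>` content as reference markers, and the placeholder sample data are all assumptions made for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect table cells row by row as (text, rowspan, colspan) tuples."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell tuples per <tr>
        self._cell = None    # text fragments of the open cell, or None
        self._spans = (1, 1)
        self._skip = 0       # depth inside <sup> (assumed to hold reference markers)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            a = dict(attrs)
            self._spans = (int(a.get("rowspan") or 1), int(a.get("colspan") or 1))
            self._cell = []
        elif tag == "sup":
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:          # tolerate a cell that appears before any <tr>
                self.rows.append([])
            text = " ".join("".join(self._cell).split())  # normalize whitespace
            self.rows[-1].append((text, *self._spans))
            self._cell = None
        elif tag == "sup" and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._cell is not None and not self._skip:
            self._cell.append(data)


def table_to_grid(html):
    """Expand rowspan/colspan so every grid position holds its cell text."""
    parser = TableGridParser()
    parser.feed(html)
    grid = {}  # (row, col) -> text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:      # skip positions filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Placeholder data: a merged cell, a reference marker, and comma-separated values.
sample = """
<table>
  <tr><th>Season</th><th>Club</th><th>Scorers</th></tr>
  <tr><td rowspan="2">Season X</td><td>Club A<sup>[1]</sup></td><td>P1, P2, P3</td></tr>
  <tr><td>Club B</td><td>P4, P5</td></tr>
</table>
"""
grid = table_to_grid(sample)
```

Copying the merged cell's value into every row it spans keeps each row self-describing, which is one plausible way to make the table text easier to retrieve against.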

Therefore, this notebook follows the pipeline of the EmbeddedTablesUnstructuredRetrieverPack and changes only one step: a custom function rewrites the content of the TextNode objects that contain the tables. The rest of the pipeline is the same, yet the improvement is noticeable.
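The replacement step can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: `StubNode` is a hypothetical stand-in that models only the `text` and `metadata` attributes of a TextNode, the `table_id` metadata key is invented for the example, and how the notebook actually matches nodes to tables is not shown in this README.

```python
from dataclasses import dataclass, field


@dataclass
class StubNode:
    """Hypothetical stand-in for a TextNode: only text and metadata are modelled."""
    text: str
    metadata: dict = field(default_factory=dict)


def grid_to_text(grid):
    """Render a rectangular grid as pipe-delimited rows, one row per line."""
    return "\n".join(" | ".join(row) for row in grid)


def replace_table_text(nodes, grids_by_id, key="table_id"):
    """Overwrite the text of each node whose metadata[key] maps to a re-parsed grid.

    Returns the number of nodes that were rewritten; other nodes are left untouched.
    """
    changed = 0
    for node in nodes:
        grid = grids_by_id.get(node.metadata.get(key))
        if grid is not None:
            node.text = grid_to_text(grid)
            changed += 1
    return changed


# Usage with placeholder data: only the node tagged as a table is rewritten.
nodes = [StubNode("intro paragraph"), StubNode("partially parsed table", {"table_id": "t1"})]
count = replace_table_text(nodes, {"t1": [["Season", "Club"], ["Season X", "Club A"]]})
```

Because only the node text changes, the rest of the pipeline (indexing, retrieval, querying) can consume the nodes unchanged.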

For easy access (and personal interest), the source document used for this study is the Wikipedia page of the UEFA Champions League. However, I downloaded the page and worked on the local copy so that later changes to the page would not affect the code.

To reproduce the results

Clone the repo
git clone https://github.com/erdoganhalit/uefa-cl-rag/
After creating your virtual environment (recommended), install the dependencies
pip install -r requirements.txt
Run the notebook
