Enhancing Historical Understanding with Retrieval Augmented Generation

Project Overview

Although the amount of information available on the internet grows daily, navigating that sea of information is becoming increasingly challenging. While one might expect this influx of articles to make historical questions easier to answer, much of the accessible material was written well after the events it describes and reflects modern perspectives and interpretations. People answering historical questions often rely on a simple Google search, which leads to sites such as Wikipedia. These sites can be problematic because their sources are of uneven reliability and they tend to project 21st-century viewpoints onto historical events. They give users a surface-level summary that lacks the historical nuance and contextualization needed for a thorough understanding of the matter at hand. This project aims to address this gap in relevant and credible information retrieval. It does so through Retrieval Augmented Generation (RAG), which combines a search algorithm with Large Language Models (LLMs) to answer user queries. RAG avoids the limitations of traditional search functions, which can struggle with large amounts of data and produce inaccurate or oversimplified results. A user can ask a question about historical events, and our model will search, retrieve, and synthesize data, ultimately responding to the user's question using a diverse array of historical news sources.

Methodology

The methodology underpinning our project involves several key phases: data preprocessing and storage, an initial search design, and iterative improvements leading to an optimized solution. Our process began with the careful curation and preprocessing of historical newspaper data, focusing on publications from the early twentieth century. This data was cleaned and stored with an emphasis on efficiency and scalability. Our initial approach used BERT embeddings and FAISS indices to create a semantic search framework, but we encountered challenges with accuracy and scalability. In response, we pivoted to a more effective methodology, adopting Sentence Transformers for text embedding and a Chroma database for data management. This shift not only improved the performance of our tool but also enhanced its ability to provide precise, contextually relevant responses to historical queries. Our final model employs the Llama LLM, chosen for its strength in generating informative and accurate answers grounded in the historical context supplied by the retrieved documents.
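As a concrete illustration, the sketch below shows how such a retrieval pipeline can be assembled with Sentence Transformers and Chroma. The collection name, sample chunks, and query are hypothetical placeholders rather than the project's actual data; treat this as a minimal sketch of the approach, not the deployed implementation.

# pip install sentence-transformers chromadb
import chromadb
from sentence_transformers import SentenceTransformer

# Embed text with the all-MiniLM-L6-v2 model discussed in the Results section
model = SentenceTransformer("all-MiniLM-L6-v2")

# In-memory Chroma collection; the real pipeline would persist to disk
client = chromadb.Client()
collection = client.create_collection(name="historical_articles")

# Hypothetical article chunks standing in for preprocessed newspaper text
chunks = [
    "The market rally continued through the autumn of 1928...",
    "Lindbergh completed his transatlantic flight in May 1927...",
]
collection.add(
    documents=chunks,
    embeddings=model.encode(chunks).tolist(),
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)

# Retrieve the chunks most semantically similar to a user query
query = "How did newspapers cover Lindbergh's flight?"
results = collection.query(
    query_embeddings=model.encode([query]).tolist(),
    n_results=2,
)
print(results["documents"][0])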

Results

In our project, we aimed to refine the LLM's outputs and assess the performance and efficiency of our search and retrieval methodology. The first step was creating vector embeddings for documents, comparing models such as DistilBERT, bert-base-uncased, TinyBERT, and all-MiniLM-L6-v2 on encoding speed and architectural complexity. Our findings, reflected in comparative analyses, highlighted TinyBERT's speed and all-MiniLM-L6-v2's balanced performance. However, our initial search methodology, using BERT embeddings and a FAISS index, suffered from poor document retrieval relevance, which led to uniformly inadequate LLM responses across diverse queries. This prompted a shift in methodology, focused on embedding-model efficiency and on reducing the context noise passed to the LLM. Subsequent user satisfaction tests showed improvement in response quality and specificity, with overall satisfaction increasing notably after these refinements. These results trace our tool's evolution toward answering historical queries more effectively.
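A simple way to run such an encoding-speed comparison is sketched below. The corpus, batch size, and model identifiers are illustrative assumptions; note that SentenceTransformer wraps plain Hugging Face checkpoints such as bert-base-uncased with default mean pooling, so absolute timings will differ from the project's actual measurements.

# pip install sentence-transformers
import time
from sentence_transformers import SentenceTransformer

# Candidate checkpoints; plain Hugging Face models get a default mean-pooling head
MODEL_NAMES = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "distilbert-base-uncased",
    "bert-base-uncased",
    "huawei-noah/TinyBERT_General_4L_312D",
]

# Placeholder corpus; the project benchmarked real newspaper text
texts = ["A sample newspaper paragraph from the late 1920s."] * 1000

for name in MODEL_NAMES:
    model = SentenceTransformer(name)
    start = time.perf_counter()
    model.encode(texts, batch_size=64, show_progress_bar=False)
    print(f"{name}: {time.perf_counter() - start:.1f}s for {len(texts)} texts")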

Discussion and Conclusion

This project explored Retrieval Augmented Generation (RAG) as a means of enhancing historical understanding of the years 1925 to 1929, focusing on user satisfaction and the quality of outputs. The project initially used the bert-base-uncased model with a FAISS index for the search function, but inefficiencies and inaccuracies led to a change in methodology. Efficiency tests on multiple embedding models highlighted the need to balance model complexity against encoding time, essential for scaling to large document volumes.

We selected SentenceTransformer's all-MiniLM-L6-v2 model, despite its slower encoding compared to TinyBERT, because of its greater complexity and larger number of training parameters. Alongside it, a chunking methodology split roughly 2 million documents into roughly 10 million document chunks, improving search relevancy and reducing noise.

Initial search evaluations showed that the system could not effectively relate document meanings to user queries, rendering the retrieval process nearly random and degrading the quality of the LLM's output. The refined methodology enabled more nuanced and detailed LLM responses, leading to a noticeable improvement in user satisfaction. Feedback from early tool usage indicated a need for further model tuning to eliminate irrelevant details and confusion; once implemented, these changes significantly improved user satisfaction ratings.

The project's findings align with current literature, emphasizing the importance of sophisticated information retrieval algorithms in RAG applications and suggesting potential areas for future research. We also acknowledge the tool's limitations, including potential biases in the source materials and the assumption of source credibility. Future directions include improving the tool's efficiency and expanding its historical scope, with broader implications for educational and research applications. This iterative development process, and the integration of RAG across domains, promises continued advancement in information retrieval technologies.
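The chunking step can be sketched as a simple overlapping word-window splitter. The window and overlap sizes below are illustrative guesses, not the project's actual parameters; 2 million articles becoming roughly 10 million chunks implies around five chunks per article on average.

def chunk_article(text: str, window: int = 200, overlap: int = 20) -> list[str]:
    """Split an article into overlapping word windows for embedding."""
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + window])
        if piece:
            chunks.append(piece)
    return chunks

# Example: a 1,000-word article yields six overlapping chunks with these settings
article = " ".join(["word"] * 1000)
print(len(chunk_article(article)))  # -> 6

Smaller chunks let the retriever surface the specific passage that answers a query instead of an entire article, which is how chunking reduces the noise passed to the LLM.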

Development Team

This project was developed by Saachi Shenoy and Srianusha Nandula, senior data science students at UC San Diego, alongside their project mentor Colin Jemmot. It was also a collaboration between the UCSD Data Science Library and ProQuest TDM Studios.

Impact and Applications

This project promises to significantly impact educational curricula, academic research, and historical analysis by providing a more accurate, accessible means of exploring historical events. Its applications range from classroom settings to professional research, enhancing understanding of and engagement with history.

FAQs

How does this project improve access to historical information?

By combining semantic search with Large Language Models (LLMs) through Retrieval-Augmented Generation (RAG), our project enhances the accuracy and depth of information retrieved from historical sources, making historical data easier and more reliable for users to access.
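In practice, the retrieved passages are placed into the LLM's prompt before generation. The sketch below shows one way to do this with a Hugging Face text-generation pipeline; the model identifier, prompt template, and excerpt are assumptions for illustration, not the exact configuration used by the tool.

# pip install transformers torch
from transformers import pipeline

# Hypothetical Llama variant; the deployed tool's exact model is not specified here
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

# Excerpts would come from the retrieval step; a placeholder is shown here
excerpts = [
    "WALL STREET, Oct. 29, 1929 -- Prices collapsed under record selling...",
]
prompt = (
    "Answer the question using only these 1920s newspaper excerpts.\n\n"
    + "\n\n".join(excerpts)
    + "\n\nQuestion: What happened on October 29, 1929?\nAnswer:"
)
print(generator(prompt, max_new_tokens=200)[0]["generated_text"])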



Can anyone use this project's platform for research or educational purposes?

Due to copyright restrictions, the platform's data is available only to educators and students at the University of California, San Diego, so a UCSD.edu login is required to access the tool. This ensures compliance with copyright law while still providing a valuable resource for education and research within the UCSD community.



What types of historical data does the project cover?

The project focuses on historical articles published between 1925 and 1929, drawn from reputable newspapers such as the Chicago Tribune, Los Angeles Times, New York Times, Wall Street Journal, Washington Post, and San Francisco Chronicle. This selection aims to encompass a wide range of American perspectives and voices from different regions. Only articles are included in the dataset; items such as obituaries and advertisements were excluded to conserve space and keep the corpus relevant to the project's goals.



Are there any limitations to the types of questions the platform can answer?

While our platform covers a broad range of historical topics, it is optimized for questions about the period its sources cover, 1925 to 1929. Questions outside this scope, or those requiring highly specialized knowledge, may be beyond its current capabilities.

Code

Visit our GitHub repository for more information.

Contact Information

For more information, inquiries, or collaboration opportunities, please email Saachi Shenoy at svshenoy@ucsd.edu or Srianusha Nandula at snandula@ucsd.edu.