The project draws on historical articles published between 1925 and 1929 in reputable newspapers such as the Chicago Tribune, Los Angeles Times, New York Times, Wall Street Journal, Washington Post, and San Francisco Chronicle. This selection is meant to capture a wide range of American perspectives and voices from different regions. The dataset contains only articles; obituaries, advertisements, and similar items are excluded to save space and keep the corpus relevant to the project's goals.
Over the course of the project, we have substantially improved both the search methodology and the LLM integration. Initially, the project used BERT models for document encoding and a FAISS index for retrieval, but this setup ran into problems with scalability and relevance. In response, we switched to SentenceTransformers for more efficient text encoding and adopted a Chroma database for better performance and scalability. With these changes, the system retrieves relevant documents more quickly and gives more accurate responses to user queries.
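The retrieval step described above (encode each article, store the vectors, return the nearest articles for a question) can be illustrated with a minimal sketch. It is not the project's implementation: the `embed` function below is a toy bag-of-words stand-in for a SentenceTransformers encoder, `TinyVectorStore` is a hypothetical in-memory stand-in for the Chroma database, and the three sample texts are invented placeholders, not records from the dataset.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (stand-in for a SentenceTransformers encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store (stand-in for a Chroma collection)."""

    def __init__(self):
        self._rows = []  # (doc_id, text, metadata, vector)

    def add(self, doc_id, text, metadata):
        self._rows.append((doc_id, text, metadata, embed(text)))

    def query(self, question, k=2):
        """Return the k stored documents most similar to the question."""
        q = embed(question)
        ranked = sorted(self._rows, key=lambda r: cosine(q, r[3]), reverse=True)
        return [(doc_id, text, meta) for doc_id, text, meta, _ in ranked[:k]]


# Invented placeholder texts, used only to exercise the sketch.
store = TinyVectorStore()
store.add("a1", "Stock prices fall sharply on Wall Street as selling spreads",
          {"paper": "New York Times", "year": 1929})
store.add("a2", "Aviator completes a long solo flight across the ocean",
          {"paper": "Chicago Tribune", "year": 1927})
store.add("a3", "City council approves new harbor improvements",
          {"paper": "Los Angeles Times", "year": 1926})

hits = store.query("What happened to stock prices on Wall Street?", k=1)
```

In a real deployment the encoder and the store would be replaced by the actual SentenceTransformers model and Chroma collection; the add-then-query flow is the part this sketch is meant to show.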
By combining Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs), our project enhances the accuracy and depth of information retrieved from historical sources, making it easier and more reliable for users to access historical data.
Due to copyright restrictions, access to the platform's data is limited to educators and students at the University of California, San Diego, so a UCSD.edu login is required to use the tool. This keeps the project compliant with copyright law while still making the material available for educational and research purposes within the UCSD community.
While our platform covers a broad range of historical topics, it's optimized for questions related to the twentieth century. Questions outside this scope or those requiring highly specialized knowledge may be beyond our current capabilities.
To support accuracy, the platform uses a multi-step process that starts with source selection: the project collects data only from historical newspapers known for their editorial rigor, such as the Chicago Tribune and the New York Times. Each response is then generated through Retrieval-Augmented Generation, which first retrieves relevant articles and then has the LLM synthesize information from those documents. Because an answer draws on several sources, facts can be cross-checked against one another, and the response gains depth and context. In addition, regular updates and refinements to the model's algorithms help maintain the reliability and accuracy of the responses.
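As a rough illustration of how retrieval and generation fit together in this process, the sketch below assembles a prompt from retrieved excerpts and passes it to a language model. It is a sketch under stated assumptions, not the platform's code: `build_prompt` and `answer` are hypothetical helper names, the prompt wording is invented, and `retrieve` and `generate` are placeholder callables standing in for the document search and the LLM call.

```python
def build_prompt(question, passages):
    """Number each retrieved excerpt so the model can cite its sources."""
    lines = [
        "Answer the question using only the numbered excerpts below.",
        "Cite excerpts by number, and say so if they do not contain the answer.",
        "",
    ]
    for i, p in enumerate(passages, start=1):
        lines.append(f"[{i}] {p['paper']}, {p['year']}: {p['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


def answer(question, retrieve, generate, k=3):
    """Retrieve k passages, build the prompt, and ask the model.

    `retrieve(question, k)` and `generate(prompt)` are assumed interfaces;
    any search backend and any LLM client could be plugged in here.
    """
    passages = retrieve(question, k)
    prompt = build_prompt(question, passages)
    # Return the passages too, so the sources can be shown next to the reply.
    return generate(prompt), passages
```

Returning the retrieved passages alongside the generated text is one way to let a reader check each claim against the newspaper excerpts it came from.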