4 Simple RAG (Retrieval-Augmented Generation) System
(ns rag
  (:require
   [clojure.java.io :as io]
   [clojure.string :as str]
   [tablecloth.api :as tc])
  (:import
   [dev.langchain4j.store.embedding.inmemory InMemoryEmbeddingStore]
   [dev.langchain4j.data.segment TextSegment]
   [dev.langchain4j.data.document.parser.apache.pdfbox ApachePdfBoxDocumentParser]
   [dev.langchain4j.data.document.splitter DocumentSplitters]
   [dev.langchain4j.model.embedding.onnx.allminilml6v2 AllMiniLmL6V2EmbeddingModel]))
This is a Clojure / langchain4j adaptation of the simple_rag notebook (https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_rag.ipynb).
4.1 Overview
This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.
4.2 Key Components
PDF processing and text extraction
Text chunking for manageable processing
Vector store creation using InMemoryEmbeddingStore and AllMiniLmL6V2EmbeddingModel embeddings
Retriever setup for querying the processed documents
4.3 Method Details
4.3.1 Document Preprocessing
The PDF is loaded using ApachePdfBoxDocumentParser. The text is then split into chunks using the recursive splitter from DocumentSplitters, with a specified maximum chunk size and overlap.
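To make the roles of chunk size and overlap concrete, here is a minimal sketch in plain Clojure. It is not the langchain4j recursive splitter, which prefers to break on paragraph and sentence boundaries; it only cuts fixed-size character windows. The function name chunk-text is hypothetical and used for illustration only.

```clojure
(defn chunk-text
  "Splits s into windows of at most chunk-size characters. Each window
  starts (- chunk-size overlap) characters after the previous one, so
  neighbouring windows share up to overlap characters."
  [s chunk-size overlap]
  {:pre [(pos? chunk-size) (< -1 overlap chunk-size)]}
  (if (empty? s)
    []
    (let [step (- chunk-size overlap)
          n    (count s)]
      (loop [start 0
             acc   []]
        (let [end (min n (+ start chunk-size))
              acc (conj acc (subs s start end))]
          (if (>= end n)
            acc
            (recur (+ start step) acc)))))))

(chunk-text "abcdefghij" 4 1)
;; => ["abcd" "defg" "ghij"]
```

With a chunk size of 4 and an overlap of 1, the last character of each chunk is repeated as the first character of the next, which is the same idea as the 1000 / 200 setting used in the implementation, only at toy scale.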
4.3.2 Text Cleaning
A custom function, replace-t-with-space, is applied to clean the text chunks by replacing tab characters with spaces. This likely addresses formatting issues specific to the PDF.
4.3.3 Vector Store Creation
AllMiniLmL6V2 embeddings are used to create vector representations of the text chunks.
An InMemoryEmbeddingStore vector store is populated with these embeddings for similarity search.
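Similarity search over embeddings boils down to comparing vectors. The sketch below assumes cosine similarity as the comparison, which is a common choice for sentence embeddings; the exact score a given store reports may be scaled differently. The function names are hypothetical and plain Clojure vectors stand in for real embeddings.

```clojure
(defn dot
  "Dot product of two equal-length numeric vectors."
  [a b]
  (reduce + (map * a b)))

(defn norm
  "Euclidean length of a numeric vector."
  [a]
  (Math/sqrt (dot a a)))

(defn cosine-similarity
  "Cosine of the angle between a and b: 1.0 for the same direction,
  0.0 for orthogonal vectors."
  [a b]
  (/ (dot a b) (* (norm a) (norm b))))

(cosine-similarity [1.0 0.0] [1.0 0.0]) ;; => 1.0
(cosine-similarity [1.0 0.0] [0.0 1.0]) ;; => 0.0
```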
4.3.4 Retriever Setup
A retriever is configured to fetch the top 5 most relevant chunks for a given query.
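Conceptually, fetching the top k chunks means scoring every stored vector against the query vector and keeping the best k. The following is a toy sketch of that idea in plain Clojure, not the store's actual implementation; the store shape (maps with :text and :embedding keys), the two-dimensional vectors, and the names top-k and toy-store are all made up for illustration.

```clojure
(defn cosine-similarity
  "Cosine of the angle between numeric vectors a and b."
  [a b]
  (let [dot (fn [x y] (reduce + (map * x y)))]
    (/ (dot a b)
       (* (Math/sqrt (dot a a)) (Math/sqrt (dot b b))))))

(defn top-k
  "Returns the k entries of store most similar to query-vec, best first.
  Each entry is a map with :text and :embedding; a :score key is added."
  [store query-vec k]
  (->> store
       (map (fn [{:keys [embedding] :as entry}]
              (assoc entry :score (cosine-similarity query-vec embedding))))
       (sort-by :score >)
       (take k)))

;; Toy data: hand-written 2-d vectors instead of real embeddings.
(def toy-store
  [{:text "greenhouse gases" :embedding [0.9 0.1]}
   {:text "economics"        :embedding [0.1 0.9]}
   {:text "fossil fuels"     :embedding [0.8 0.3]}])

(map :text (top-k toy-store [1.0 0.0] 2))
;; => ("greenhouse gases" "fossil fuels")
```

This linear scan is enough for a single document; changing k is the only adjustment needed to retrieve more or fewer chunks.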
4.4 Key Features
Configurable Chunking: Allows adjustment of chunk size and overlap.
Simple Retrieval: Uses InMemoryEmbeddingStore for JVM-based similarity search.
Usage Example: The code includes a test query: “What is the main cause of climate change?”. This demonstrates how to use the retriever to fetch relevant context from the processed document.
4.5 Benefits of this Approach
Scalability: Can handle large documents by processing them in chunks.
Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
4.6 Conclusion
This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems.
By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries.
This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections.
5 Implementation
A helper to replace tabs with spaces:
(defn replace-t-with-space [list-of-documents]
  (map
   (fn [text-segment]
     (let [cleaned-text (-> text-segment .text (str/replace #"\t" " "))
           meta (-> text-segment .metadata)]
       (TextSegment/from cleaned-text meta)))
   list-of-documents))
Convert PDF to text document:
(def document (.parse (ApachePdfBoxDocumentParser.) (io/input-stream "Understanding_Climate_Change.pdf")))
Split the document into chunks of at most 1000 characters with an overlap of 200:
(def texts
  (.split (DocumentSplitters/recursive 1000 200) document))
Clean texts:
(def cleaned-texts (replace-t-with-space texts))
Create embeddings for the cleaned texts:
(def embedding-model (AllMiniLmL6V2EmbeddingModel.))
(def embedding-store (InMemoryEmbeddingStore.))
(def embeddings
  (.embedAll embedding-model cleaned-texts))
Add all embeddings to the vector store:
(run!
 (fn [[text-segment embedding]]
   (.add embedding-store embedding text-segment))
 (map vector
      cleaned-texts
      (.content embeddings)))
nil
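The step above pairs each text segment with its embedding by position and then stores the pairs one at a time for side effects. A small stand-alone sketch of the same pattern, using an atom as a hypothetical stand-in for the embedding store and plain strings and vectors in place of segments and embeddings:

```clojure
;; Hypothetical stand-in for the embedding store.
(def store (atom []))

(run!
 ;; Destructure each [segment embedding] pair and record it.
 (fn [[segment embedding]]
   (swap! store conj {:segment segment :embedding embedding}))
 ;; (map vector xs ys) walks both sequences in lockstep.
 (map vector
      ["chunk a" "chunk b"]
      [[0.1 0.2] [0.3 0.4]]))

@store
;; => [{:segment "chunk a", :embedding [0.1 0.2]}
;;     {:segment "chunk b", :embedding [0.3 0.4]}]
```

Note that map stops at the end of the shorter sequence, so this pattern relies on both sequences having the same length and order. run! returns nil, which is why the cell prints nil.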
Embed the query text:
(def retriever
  (.content (.embed embedding-model
                    "What is the main cause of climate change?")))
Find top 5 relevant texts:
(def relevant (.findRelevant embedding-store retriever 5))
Put the 5 results in a table:
(tc/dataset
 (map
  (fn [a-relevant]
    (hash-map
     :score (.score a-relevant)
     :text (.text (.embedded a-relevant))))
  relevant))
_unnamed [5 2]:
:score     | :text
-----------|------
0.84511782 | Chapter 2: Causes of Climate Change
           | Greenhouse Gases
           | The primary cause of recent climate change is the increase in greenhouse gases in the
           | atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous
           | oxide (N2O), trap heat from the sun, creating a “greenhouse effect.” This effect is essential
           | for life on Earth, as it keeps the planet warm enough to support life. However, human
           | activities have intensified this natural process, leading to a warmer climate.
           | Fossil Fuels
           | Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and
           | natural gas used for electricity, heating, and transportation. The industrial revolution marked
           | the beginning of a significant increase in fossil fuel consumption, which continues to rise
           | today.
           | Coal
           | Coal is the most carbon-intensive fossil fuel, and its use for electricity generation is a major
           | source of CO2 emissions. Despite a decline in some regions, coal remains a significant
0.81118799 | Understanding Climate Change
           | Chapter 1: Introduction to Climate Change
           | Climate change refers to significant, long-term changes in the global climate. The term
           | “global climate” encompasses the planet’s overall weather patterns, including temperature,
           | precipitation, and wind patterns, over an extended period. Over the past century, human
           | activities, particularly the burning of fossil fuels and deforestation, have significantly
           | contributed to climate change.
           | Historical Context
           | The Earth’s climate has changed throughout history. Over the past 650,000 years, there have
           | been seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about
           | 11,700 years ago marking the beginning of the modern climate era and human civilization.
           | Most of these climate changes are attributed to very small variations in Earth’s orbit that
           | change the amount of solar energy our planet receives. During the Holocene epoch, which
0.78552265 | These effects include:
           | Rising Temperatures
           | Global temperatures have risen by about 1.2 degrees Celsius (2.2 degrees Fahrenheit) since
           | the late 19th century. This warming is not uniform, with some regions experiencing more
           | significant increases than others.
           | Heatwaves
           | Heatwaves are becoming more frequent and severe, posing risks to human health, agriculture,
           | and infrastructure. Cities are particularly vulnerable due to the “urban heat island” effect.
           | Heatwaves can lead to heat-related illnesses and exacerbate existing health conditions.
           | Changing Seasons
           | Climate change is altering the timing and length of seasons, affecting ecosystems and human
           | activities. For example, spring is arriving earlier, and winters are becoming shorter and
           | milder in many regions. This shift disrupts plant and animal life cycles and agricultural
           | practices.
           | Melting Ice and Rising Sea Levels
           | Warmer temperatures are causing polar ice caps and glaciers to melt, contributing to rising
0.78288501 | Chapter 7: The Economics of Climate Change
           | Costs of Inaction
           | Economic Impacts of Climate Change
           | The economic costs of climate change include damage to infrastructure, reduced agricultural
           | productivity, health care costs, and lost labor productivity. Extreme weather events, such as
           | hurricanes and floods, can cause significant economic disruption. Investing in climate action
           | now can prevent much higher costs in the future.
           | Social and Environmental Costs
           | Climate change exacerbates social inequalities, with marginalized communities often bearing
           | the brunt of its impacts. Environmental costs include loss of biodiversity, ecosystem
           | degradation, and decreased availability of natural resources. Addressing these issues requires
           | integrated, equitable solutions.
           | Benefits of Climate Action
           | Economic Opportunities
           | Investing in renewable energy, energy efficiency, and sustainable practices creates jobs and
           | stimulates economic growth. The transition to a green economy can drive innovation and
0.75514444 | Most of these climate changes are attributed to very small variations in Earth’s orbit that
           | change the amount of solar energy our planet receives.
           | During the Holocene epoch, which
           | began at the end of the last ice age, human societies flourished, but the industrial era has seen
           | unprecedented changes.
           | Modern Observations
           | Modern scientific observations indicate a rapid increase in global temperatures, sea levels,
           | and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has
           | documented these changes extensively. Ice core samples, tree rings, and ocean sediments
           | provide a historical record that scientists use to understand past climate conditions and
           | predict future trends. The evidence overwhelmingly shows that recent changes are primarily
           | driven by human activities, particularly the emission of greenhouse gases.
           | Chapter 2: Causes of Climate Change
           | Greenhouse Gases
           | The primary cause of recent climate change is the increase in greenhouse gases in the