4  Simple RAG (Retrieval-Augmented Generation) System

(ns rag
  (:require
   [clojure.java.io :as io]
   [clojure.string :as str]
   [tablecloth.api :as tc]) 
  (:import
   [dev.langchain4j.store.embedding.inmemory InMemoryEmbeddingStore]
   [dev.langchain4j.data.segment TextSegment]
   [dev.langchain4j.data.document.parser.apache.pdfbox ApachePdfBoxDocumentParser]
   [dev.langchain4j.data.document.splitter DocumentSplitters]
   [dev.langchain4j.model.embedding.onnx.allminilml6v2 AllMiniLmL6V2EmbeddingModel]))

This is a Clojure / langchain4j adaptation of the [simple_rag](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_rag.ipynb) notebook.

4.1 Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

4.2 Key Components

PDF processing and text extraction

Text chunking for manageable processing

Vector store creation using InMemoryEmbeddingStore and AllMiniLmL6V2EmbeddingModel embeddings

Retriever setup for querying the processed documents

4.3 Method Details

4.3.1 Document Preprocessing

The PDF is loaded using ApachePdfBoxDocumentParser. The text is then split into chunks using the recursive splitter from DocumentSplitters, with a specified chunk size and overlap.
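To make the roles of chunk size and overlap concrete, here is a minimal sketch of fixed-window chunking in plain Clojure. It is a simplification, not what the recursive splitter does: that splitter tries to break at natural boundaries such as paragraphs and sentences, whereas this sketch cuts purely by character count. The function name `chunk-text` is hypothetical.

```clojure
(defn chunk-text
  "Splits `text` into windows of at most `chunk-size` characters.
  Each window starts `chunk-size - overlap` characters after the previous one,
  so consecutive windows share `overlap` characters."
  [text chunk-size overlap]
  {:pre [(pos? chunk-size) (< -1 overlap chunk-size)]}
  (let [step (- chunk-size overlap)
        n    (count text)]
    (loop [start 0
           acc   []]
      (let [end (min n (+ start chunk-size))
            acc (conj acc (subs text start end))]
        (if (>= end n)
          acc
          (recur (+ start step) acc))))))

(chunk-text "abcdefghij" 4 2)
;; => ["abcd" "cdef" "efgh" "ghij"]
```

With a chunk size of 1000 and an overlap of 200, each window in this simplified scheme would start 800 characters after the previous one. The overlap helps keep a sentence that straddles a boundary intact in at least one chunk.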

4.3.2 Text Cleaning

A custom function, replace-t-with-space, replaces tab characters with spaces in the text chunks. This likely addresses specific formatting issues in the PDF.

4.3.3 Vector Store Creation

AllMiniLmL6V2 embeddings are used to create vector representations of the text chunks.

An InMemoryEmbeddingStore is created from these embeddings for similarity search.

4.3.4 Retriever Setup

A retriever is configured to fetch the top 5 most relevant chunks for a given query.
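The similarity search and top-k retrieval described above can be sketched in plain Clojure. This is an illustration only and does not reproduce the langchain4j internals: the store shape (a sequence of maps with `:text` and `:embedding` keys), the toy two-dimensional vectors, and the names `dot`, `cosine-similarity`, `top-k`, and `toy-store` are all assumptions made for the example.

```clojure
(defn dot
  "Dot product of two equal-length numeric vectors."
  [a b]
  (reduce + (map * a b)))

(defn cosine-similarity
  "Cosine of the angle between two equal-length, non-zero numeric vectors."
  [a b]
  (/ (dot a b)
     (* (Math/sqrt (dot a a)) (Math/sqrt (dot b b)))))

(defn top-k
  "Returns the `k` entries of `store` most similar to `query`, best first.
  `store` is a seq of maps with :text and :embedding keys (hypothetical shape)."
  [store query k]
  (->> store
       (map (fn [{:keys [embedding text]}]
              {:score (cosine-similarity query embedding)
               :text  text}))
       (sort-by :score >)
       (take k)))

;; Toy data: real embeddings have hundreds of dimensions.
(def toy-store
  [{:text "greenhouse gases" :embedding [1.0 0.0]}
   {:text "sea level rise"   :embedding [0.6 0.8]}
   {:text "economics"        :embedding [0.0 1.0]}])

(map :text (top-k toy-store [1.0 0.0] 2))
;; => ("greenhouse gases" "sea level rise")
```

This brute-force scan compares the query against every stored vector, which is fine for a single document held in memory. The raw cosine values here range over [-1, 1]; langchain4j, as far as I know, reports a relevance score rescaled to [0, 1] as (cosine + 1) / 2, so the numbers are not directly comparable.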

4.4 Key Features

Configurable Chunking: Allows adjustment of chunk size and overlap.

Simple Retrieval: Uses InMemoryEmbeddingStore for JVM-based similarity search.

Usage Example: The code includes a test query: “What is the main cause of climate change?”. This demonstrates how to use the retriever to fetch relevant context from the processed document.

4.5 Benefits of this Approach

Scalability: Can handle large documents by processing them in chunks.

Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.

4.6 Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems.

By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries.

This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections.

5 Implementation

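A helper to replace tabs with spaces:

(defn replace-t-with-space
  "Returns the text segments with every tab replaced by a space, keeping their metadata."
  [list-of-documents]
  (map
   (fn [text-segment]
     (let [cleaned-text (-> text-segment .text (str/replace #"\t" " "))
           ;; named `metadata` to avoid shadowing clojure.core/meta
           metadata (.metadata text-segment)]
       (TextSegment/from cleaned-text metadata)))
   list-of-documents))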

Convert the PDF to a text document:

(def document
  ;; with-open closes the input stream once parsing has finished
  (with-open [in (io/input-stream "Understanding_Climate_Change.pdf")]
    (.parse (ApachePdfBoxDocumentParser.) in)))

Split the document into chunks of at most 1000 characters, with an overlap of 200:

(def texts
  (.split 
   (DocumentSplitters/recursive 1000 200)
   document))

Clean the texts:

(def cleaned-texts (replace-t-with-space texts))

Create embeddings for the cleaned texts:

(def embedding-model (AllMiniLmL6V2EmbeddingModel.))
(def embedding-store (InMemoryEmbeddingStore.))
(def embeddings
  (.embedAll embedding-model cleaned-texts))

Add all embeddings to the vector store:

(run!
 (fn [[text-segment embedding]]
   (.add embedding-store embedding text-segment))
 (map vector
      cleaned-texts
      (.content embeddings)))
nil
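The pairing step above relies on `(map vector xs ys)` to zip each text segment with its embedding before `run!` stores the pair for its side effect. The pattern can be seen in isolation with plain data. In this sketch an atom stands in for the embedding store, and the strings and vectors are hypothetical stand-ins for TextSegment and Embedding objects.

```clojure
;; Hypothetical stand-ins for the cleaned segments and their embeddings.
(def segments ["chunk A" "chunk B"])
(def vectors  [[0.1 0.2] [0.3 0.4]])

;; An atom plays the role of the embedding store.
(def paired-store (atom []))

(run!
 (fn [[segment embedding]]
   ;; one store entry per (segment, embedding) pair
   (swap! paired-store conj {:text segment :embedding embedding}))
 (map vector segments vectors))

@paired-store
;; => [{:text "chunk A", :embedding [0.1 0.2]} {:text "chunk B", :embedding [0.3 0.4]}]
```

`run!` is used rather than `map` because `map` is lazy and would not perform the additions unless its result were consumed. `run!` itself returns nil, which is why the notebook shows nil after that form.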

Embed the query text:

(def retriever 
  (.content (.embed embedding-model
                    "What is the main cause of climate change?")))

Find the top 5 most relevant texts:

(def relevant (.findRelevant embedding-store retriever 5))

Put the 5 results in a table:

(tc/dataset
 (map
  (fn [a-relevant]
    (hash-map
     :score (.score a-relevant)
     :text (.text (.embedded a-relevant))))
  relevant))

_unnamed [5 2]:

:score :text
0.84511782 Chapter 2: Causes of Climate Change
Greenhouse Gases
The primary cause of recent climate change is the increase in greenhouse gases in the
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous
oxide (N2O), trap heat from the sun, creating a “greenhouse effect.” This effect is essential
for life on Earth, as it keeps the planet warm enough to support life. However, human
activities have intensified this natural process, leading to a warmer climate.
Fossil Fuels
Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and
natural gas used for electricity, heating, and transportation. The industrial revolution marked
the beginning of a significant increase in fossil fuel consumption, which continues to rise
today.
Coal
Coal is the most carbon-intensive fossil fuel, and its use for electricity generation is a major
source of CO2 emissions. Despite a decline in some regions, coal remains a significant
0.81118799 Understanding Climate Change
Chapter 1: Introduction to Climate Change
Climate change refers to significant, long-term changes in the global climate. The term
“global climate” encompasses the planet’s overall weather patterns, including temperature,
precipitation, and wind patterns, over an extended period. Over the past century, human
activities, particularly the burning of fossil fuels and deforestation, have significantly
contributed to climate change.
Historical Context
The Earth’s climate has changed throughout history. Over the past 650,000 years, there have
been seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about
11,700 years ago marking the beginning of the modern climate era and human civilization.
Most of these climate changes are attributed to very small variations in Earth’s orbit that
change the amount of solar energy our planet receives. During the Holocene epoch, which
0.78552265 These effects include:
Rising Temperatures
Global temperatures have risen by about 1.2 degrees Celsius (2.2 degrees Fahrenheit) since
the late 19th century. This warming is not uniform, with some regions experiencing more
significant increases than others.
Heatwaves
Heatwaves are becoming more frequent and severe, posing risks to human health, agriculture,
and infrastructure. Cities are particularly vulnerable due to the “urban heat island” effect.
Heatwaves can lead to heat-related illnesses and exacerbate existing health conditions.
Changing Seasons
Climate change is altering the timing and length of seasons, affecting ecosystems and human
activities. For example, spring is arriving earlier, and winters are becoming shorter and
milder in many regions. This shift disrupts plant and animal life cycles and agricultural
practices.
Melting Ice and Rising Sea Levels
Warmer temperatures are causing polar ice caps and glaciers to melt, contributing to rising
0.78288501 Chapter 7: The Economics of Climate Change
Costs of Inaction
Economic Impacts of Climate Change
The economic costs of climate change include damage to infrastructure, reduced agricultural
productivity, health care costs, and lost labor productivity. Extreme weather events, such as
hurricanes and floods, can cause significant economic disruption. Investing in climate action
now can prevent much higher costs in the future.
Social and Environmental Costs
Climate change exacerbates social inequalities, with marginalized communities often bearing
the brunt of its impacts. Environmental costs include loss of biodiversity, ecosystem
degradation, and decreased availability of natural resources. Addressing these issues requires
integrated, equitable solutions.
Benefits of Climate Action
Economic Opportunities
Investing in renewable energy, energy efficiency, and sustainable practices creates jobs and
stimulates economic growth. The transition to a green economy can drive innovation and
0.75514444 Most of these climate changes are attributed to very small variations in Earth’s orbit that
change the amount of solar energy our planet receives.
During the Holocene epoch, which
began at the end of the last ice age, human societies flourished, but the industrial era has seen
unprecedented changes.
Modern Observations
Modern scientific observations indicate a rapid increase in global temperatures, sea levels,
and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has
documented these changes extensively. Ice core samples, tree rings, and ocean sediments
provide a historical record that scientists use to understand past climate conditions and
predict future trends. The evidence overwhelmingly shows that recent changes are primarily
driven by human activities, particularly the emission of greenhouse gases.
Chapter 2: Causes of Climate Change
Greenhouse Gases
The primary cause of recent climate change is the increase in greenhouse gases in the
source: projects/ml/llm/notebooks/rag.clj