|
13 | 13 | "source": [ |
14 | 14 | "# Retrieval-augmented Generation (RAG) Using LLMs \n", |
15 | 15 | "\n", |
16 | | - "In this notebook, we create a basic prototype of a retrieval-augmented generation system. This prototype demonstates how traditional information retrieval methods can be combined with LLM-based text summarization to search through large documents or collection of documents.\n", |
| 16 | + "In this notebook, we create several prototypes of a Retrieval-Augmented Generation (RAG) system. We start with a basic case where the input documents are small enough to fit the LLM context, and then develop more advanced solutions that can handle large document collections.\n", |
17 | 17 | "\n", |
18 | | - "We use a Google Pixel 7 release notes as the input document. It is available in the `tensor-house-data` repository." |
| 18 | + "### Data\n", |
| 19 | + "We use small input documents that are availbale in the `tensor-house-data` repository." |
19 | 20 | ] |
20 | 21 | }, |
21 | 22 | { |
|
34 | 35 | }, |
35 | 36 | { |
36 | 37 | "cell_type": "code", |
37 | | - "execution_count": 2, |
| 38 | + "execution_count": 6, |
38 | 39 | "id": "2a97eb6a-ccec-4527-b9fb-36b69c6ff3de", |
39 | 40 | "metadata": { |
40 | 41 | "editable": true, |
|
56 | 57 | ")" |
57 | 58 | ] |
58 | 59 | }, |
| 60 | + { |
| 61 | + "cell_type": "code", |
| 62 | + "execution_count": 41, |
| 63 | + "id": "a239de43-7783-46f3-b919-5ad39b0a2daf", |
| 64 | + "metadata": {}, |
| 65 | + "outputs": [], |
| 66 | + "source": [ |
| 67 | + "#\n", |
| 68 | + "# Imports\n", |
| 69 | + "#\n", |
| 70 | + "from langchain.llms import VertexAI\n", |
| 71 | + "\n", |
| 72 | + "from langchain.vectorstores import Chroma\n", |
| 73 | + "from langchain.chains import RetrievalQA, ConversationalRetrievalChain\n", |
| 74 | + "from langchain.memory import ConversationBufferMemory\n", |
| 75 | + "\n", |
| 76 | + "from langchain import PromptTemplate\n", |
| 77 | + "from langchain.chains.question_answering import load_qa_chain\n", |
| 78 | + "\n", |
| 79 | + "from langchain.document_loaders import UnstructuredHTMLLoader, TextLoader\n", |
| 80 | + "from langchain.embeddings.vertexai import VertexAIEmbeddings\n", |
| 81 | + "from langchain.text_splitter import CharacterTextSplitter" |
| 82 | + ] |
| 83 | + }, |
| 84 | + { |
| 85 | + "cell_type": "markdown", |
| 86 | + "id": "1c28f863-c2e2-4c56-b459-df05ff12ad8f", |
| 87 | + "metadata": {}, |
| 88 | + "source": [ |
| 89 | + "## Question Answering Using a Single Prompt\n", |
| 90 | + "\n", |
| 91 | + "The most basic scenario is querying small documents that fit the LLM prompt." |
| 92 | + ] |
| 93 | + }, |
| 94 | + { |
| 95 | + "cell_type": "code", |
| 96 | + "execution_count": 68, |
| 97 | + "id": "bf134242-352d-4281-8f57-099ca387143b", |
| 98 | + "metadata": {}, |
| 99 | + "outputs": [ |
| 100 | + { |
| 101 | + "name": "stdout", |
| 102 | + "output_type": "stream", |
| 103 | + "text": [ |
| 104 | + "The text states that \"Lemons and limes have particularly high concentrations of the acid; it can constitute as much as 8% of the dry weight of these fruits (about 47 g/L in the juices)\". So the correct answer is lemons and limes.\n" |
| 105 | + ] |
| 106 | + } |
| 107 | + ], |
| 108 | + "source": [ |
| 109 | + "query_prompt = \"\"\"\n", |
| 110 | + "Please read the following text and answer which fruits have the highest concentration of citric acid.\n", |
| 111 | + "\n", |
| 112 | + "TEXT:\n", |
| 113 | + "Citric acid occurs in a variety of fruits and vegetables, most notably citrus fruits. Lemons and limes have particularly \n", |
| 114 | + "high concentrations of the acid; it can constitute as much as 8% of the dry weight of these fruits (about 47 g/L in the juices).\n", |
| 115 | + "The concentrations of citric acid in citrus fruits range from 0.005 mol/L for oranges and grapefruits to 0.30 mol/L in lemons \n", |
| 116 | + "and limes; these values vary within species depending upon the cultivar and the circumstances under which the fruit was grown.\n", |
| 117 | + "\"\"\"\n", |
| 118 | + "\n", |
| 119 | + "llm = VertexAI(temperature=0.7)\n", |
| 120 | + "response = llm(query_prompt)\n", |
| 121 | + "print(response)" |
| 122 | + ] |
| 123 | + }, |
| 124 | + { |
| 125 | + "cell_type": "markdown", |
| 126 | + "id": "6869bf50-e4f2-4ba6-b7b9-2278c1a072a0", |
| 127 | + "metadata": {}, |
| 128 | + "source": [ |
| 129 | + "## Question Answering Using MapReduce\n", |
| 130 | + "\n", |
| 131 | + "For large documents and collections of documents that do not fit the LLM context, we can apply the MapReduce pattern to independently extract relevant summaries from document parts, and then merge these summaries into the final answer. This approach is appropriate for summarization and summarization-like queries." |
| 132 | + ] |
| 133 | + }, |
| 134 | + { |
| 135 | + "cell_type": "code", |
| 136 | + "execution_count": 84, |
| 137 | + "id": "c5a43885-5f2e-49d3-b5cf-bc6204b10d3a", |
| 138 | + "metadata": {}, |
| 139 | + "outputs": [ |
| 140 | + { |
| 141 | + "name": "stdout", |
| 142 | + "output_type": "stream", |
| 143 | + "text": [ |
| 144 | + "The input document has been split into 2 chunks\n", |
| 145 | + "\n", |
| 146 | + "The three most important applications of citric acid are:\n", |
| 147 | + "1. As a flavoring and preservative in food and beverages, especially soft drinks.\n", |
| 148 | + "2. In soaps and laundry detergents to remove hard water stains.\n", |
| 149 | + "3. In the biotechnology and pharmaceutical industry to passivate high purity process piping.\n" |
| 150 | + ] |
| 151 | + } |
| 152 | + ], |
| 153 | + "source": [ |
| 154 | + "#\n", |
| 155 | + "# Load the input document\n", |
| 156 | + "#\n", |
| 157 | + "loader = TextLoader(\"../../tensor-house-data/search/food-additives/citric-acid-applications.txt\")\n", |
| 158 | + "documents = loader.load()\n", |
| 159 | + "\n", |
| 160 | + "#\n", |
| 161 | + "# Splitting\n", |
| 162 | + "#\n", |
| 163 | + "text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)\n", |
| 164 | + "texts = text_splitter.split_documents(documents)\n", |
| 165 | + "print(f'The input document has been split into {len(texts)} chunks\\n')\n", |
| 166 | + "\n", |
| 167 | + "#\n", |
| 168 | + "# Querying\n", |
| 169 | + "#\n", |
| 170 | + "question_prompt = \"\"\"\n", |
| 171 | + "Use the following portion of a long document to see if any of the text is relevant to answer the question. \n", |
| 172 | + "Return bullet points that help to answer the question.\n", |
| 173 | + "\n", |
| 174 | + "{context}\n", |
| 175 | + "\n", |
| 176 | + "Question: {question}\n", |
| 177 | + "Bullet points:\n", |
| 178 | + "\"\"\"\n", |
| 179 | + "question_prompt_template = PromptTemplate(template=question_prompt, input_variables=[\"context\", \"question\"])\n", |
| 180 | + "\n", |
| 181 | + "combine_prompt = \"\"\"\n", |
| 182 | + "Given the following bullet points extracted from a long document and a question, create a final answer.\n", |
| 183 | + "Question: {question}\n", |
| 184 | + "\n", |
| 185 | + "=========\n", |
| 186 | + "{summaries}\n", |
| 187 | + "=========\n", |
| 188 | + "\n", |
| 189 | + "Final answer:\n", |
| 190 | + "\"\"\"\n", |
| 191 | + "combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=[\"summaries\", \"question\"])\n", |
| 192 | + "\n", |
| 193 | + "llm = VertexAI(temperature=0.7)\n", |
| 194 | + "qa_chain = load_qa_chain(llm=llm,\n", |
| 195 | + " chain_type='map_reduce',\n", |
| 196 | + " question_prompt=question_prompt_template,\n", |
| 197 | + " combine_prompt=combine_prompt_template,\n", |
| 198 | + " verbose=False)\n", |
| 199 | + "\n", |
| 200 | + "question = \"What are the three most important applications of citric acid? Provide a short justification for each application.\"\n", |
| 201 | + "response = qa_chain({\"input_documents\": texts, \"question\": question})\n", |
| 202 | + "\n", |
| 203 | + "print(response['output_text'])" |
| 204 | + ] |
| 205 | + }, |
59 | 206 | { |
60 | 207 | "cell_type": "markdown", |
61 | 208 | "id": "b3eb6e94-a2f2-4548-8d0c-f4934f0224f6", |
62 | 209 | "metadata": {}, |
63 | 210 | "source": [ |
64 | | - "## Question Answering Over a Large Document\n", |
| 211 | + "## Question Answering Using Vector Search\n", |
65 | 212 | "\n", |
66 | | - "In this section, we demonstrate how standalone questions can be answered. The input document(s) is split inot chunks which are then indexed in a vector store. To answer the user question, the most relevant chunks are retrieved and passed to the LLM." |
| 213 | + "For large documents and point questions that require only specific document parts to be answered, LLMs can be combined with traditional information retrieval techniques. The input document(s) is split into chunks which are then indexed in a vector store. To answer the user question, the most relevant chunks are retrieved and passed to the LLM." |
67 | 214 | ] |
68 | 215 | }, |
69 | 216 | { |
70 | 217 | "cell_type": "code", |
71 | | - "execution_count": 27, |
| 218 | + "execution_count": 17, |
72 | 219 | "id": "3857ccc4-2559-493d-9e44-babb744576db", |
73 | 220 | "metadata": { |
74 | 221 | "editable": true, |
|
82 | 229 | "name": "stdout", |
83 | 230 | "output_type": "stream", |
84 | 231 | "text": [ |
85 | | - "The input document has been split into 2 chunks\n", |
86 | | - "\n", |
87 | | - "The three coolest Pixel 7 features are:\n", |
| 232 | + "The input document has been split into 5 chunks\n", |
88 | 233 | "\n", |
89 | | - "* Next-generation Super Res Zoom up to 8x on Pixel 7 and up to 30x on Pixel 7 Pro.\n", |
90 | | - "* Macro Focus, which delivers Pixel HDR+ photo quality from as close as three centimeters away.\n", |
91 | | - "* Photo Unblur, a Google Photos feature only on Pixel 7 and Pixel 7 Pro. Photo Unblur uses machine learning to improve your blurry pictures – even old ones.\n" |
| 234 | + "The melting point of citric acid is approximately 153 °C (307 °F).\n" |
92 | 235 | ] |
93 | 236 | } |
94 | 237 | ], |
95 | 238 | "source": [ |
96 | | - "from langchain.llms import VertexAI\n", |
97 | | - "\n", |
98 | | - "from langchain.chains import RetrievalQA, ConversationalRetrievalChain\n", |
99 | | - "from langchain.memory import ConversationBufferMemory\n", |
100 | | - "\n", |
101 | | - "from langchain.document_loaders import UnstructuredHTMLLoader\n", |
102 | | - "from langchain.embeddings.vertexai import VertexAIEmbeddings\n", |
103 | | - "from langchain.text_splitter import CharacterTextSplitter\n", |
104 | | - "from langchain.vectorstores import Chroma\n", |
105 | | - "\n", |
106 | 239 | "#\n", |
107 | 240 | "# Load the input document\n", |
108 | 241 | "#\n", |
109 | | - "loader = UnstructuredHTMLLoader(\"../../tensor-house-data/search/news-feeds/pixel-7-release.html\")\n", |
| 242 | + "loader = TextLoader(\"../../tensor-house-data/search/food-additives/food-additives.txt\")\n", |
110 | 243 | "documents = loader.load()\n", |
111 | 244 | "\n", |
112 | 245 | "#\n", |
113 | 246 | "# Splitting\n", |
114 | 247 | "#\n", |
115 | | - "text_splitter = CharacterTextSplitter(chunk_size=4000, chunk_overlap=0)\n", |
| 248 | + "text_splitter = CharacterTextSplitter(chunk_size=3000, chunk_overlap=0)\n", |
116 | 249 | "texts = text_splitter.split_documents(documents)\n", |
117 | 250 | "print(f'The input document has been split into {len(texts)} chunks\\n')\n", |
118 | 251 | "\n", |
|
126 | 259 | "# Querying\n", |
127 | 260 | "#\n", |
128 | 261 | "llm = VertexAI(temperature=0.7, verbose=True)\n", |
129 | | - "qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever(search_kwargs={\"k\": 2}), return_source_documents=True)\n", |
| 262 | + "qa_chain = RetrievalQA.from_chain_type(llm=llm, \n", |
| 263 | + " chain_type=\"stuff\", \n", |
| 264 | + " retriever=docsearch.as_retriever(search_kwargs={\"k\": 1}), \n", |
| 265 | + " return_source_documents=True)\n", |
130 | 266 | "\n", |
131 | | - "question = \"What are the three coolest Pixel 7 features? Please provide a list with short summaries.\"\n", |
| 267 | + "question = \"What is the melting point of citric acid?\"\n", |
132 | 268 | "response = qa_chain({\"query\": question})\n", |
133 | 269 | "\n", |
134 | 270 | "print(response['result'])" |
|
146 | 282 | }, |
147 | 283 | { |
148 | 284 | "cell_type": "code", |
149 | | - "execution_count": 26, |
150 | | - "id": "96697fd3-4efc-4e98-a61d-ae722d142289", |
| 285 | + "execution_count": 65, |
| 286 | + "id": "141daeb7-a343-4ac8-af55-fda046326a60", |
151 | 287 | "metadata": { |
152 | 288 | "editable": true, |
153 | 289 | "slideshow": { |
|
160 | 296 | "name": "stdout", |
161 | 297 | "output_type": "stream", |
162 | 298 | "text": [ |
163 | | - "What are the three coolest Pixel features? \n", |
| 299 | + "The input document has been split into 5 chunks\n", |
164 | 300 | "\n", |
165 | | - "The three coolest Pixel features are:\n", |
| 301 | + "How much times aspartam is sweeter than table sugar? \n", |
166 | 302 | "\n", |
167 | | - "1. Super Res Zoom up to 8x on Pixel 7 and up to 30x on Pixel 7 Pro.\n", |
168 | | - "2. Macro Focus, which delivers Pixel HDR+ photo quality from as close as three centimeters away.\n", |
169 | | - "3. Photo Unblur, a Google Photos feature only on Pixel 7 and Pixel 7 Pro. \n", |
| 303 | + "Aspartame is approximately 150 to 200 times sweeter than sucrose (table sugar), making it an intense sweetener. \n", |
170 | 304 | "\n", |
171 | | - "Can you explain the first feature in more detail? \n", |
| 305 | + "What is its caloric value? \n", |
172 | 306 | "\n", |
173 | | - "Super Res Zoom is a camera feature that uses machine learning to improve the quality of zoomed-in images. It works by taking multiple photos at different focal lengths and then stitching them together to create a single, high-resolution image. This results in images that are sharper and have less noise than images taken with a traditional zoom lens. \n", |
| 307 | + "Aspartame is virtually calorie-free, as the human body does not metabolize it into energy like sugars or carbohydrates. \n", |
174 | 308 | "\n" |
175 | 309 | ] |
176 | 310 | } |
177 | 311 | ], |
178 | 312 | "source": [ |
| 313 | + "#\n", |
| 314 | + "# Load the input document\n", |
| 315 | + "#\n", |
| 316 | + "loader = TextLoader(\"../../tensor-house-data/search/food-additives/food-additives.txt\")\n", |
| 317 | + "documents = loader.load()\n", |
| 318 | + "\n", |
| 319 | + "#\n", |
| 320 | + "# Splitting\n", |
| 321 | + "#\n", |
| 322 | + "text_splitter = CharacterTextSplitter(chunk_size=3000, chunk_overlap=0)\n", |
| 323 | + "texts = text_splitter.split_documents(documents)\n", |
| 324 | + "print(f'The input document has been split into {len(texts)} chunks\\n')\n", |
| 325 | + "\n", |
| 326 | + "#\n", |
| 327 | + "# Indexing and storing\n", |
| 328 | + "#\n", |
| 329 | + "embeddings = VertexAIEmbeddings()\n", |
| 330 | + "docsearch = Chroma.from_documents(texts, embeddings)\n", |
| 331 | + "\n", |
179 | 332 | "#\n", |
180 | 333 | "# Initialize new chat\n", |
181 | 334 | "#\n", |
182 | 335 | "llm = VertexAI(verbose=True)\n", |
183 | 336 | "memory = ConversationBufferMemory(memory_key=\"chat_history\", return_messages=True)\n", |
184 | | - "chat = ConversationalRetrievalChain.from_llm(llm, retriever=docsearch.as_retriever(search_kwargs={\"k\": 2}), memory=memory, verbose=False, return_generated_question=False)\n", |
| 337 | + "chat = ConversationalRetrievalChain.from_llm(llm, \n", |
| 338 | + " retriever=docsearch.as_retriever(search_kwargs={\"k\": 1}), \n", |
| 339 | + " memory=memory, \n", |
| 340 | + " verbose=False, \n", |
| 341 | + " return_generated_question=False)\n", |
185 | 342 | "\n", |
186 | 343 | "#\n", |
187 | | - "# Ask questions with continuous context\n", |
| 344 | + "# Ask questions with a continuous context\n", |
188 | 345 | "#\n", |
189 | | - "\n", |
190 | 346 | "def print_last_chat_turn(chat_history):\n", |
191 | 347 | " print(chat_history[-2].content, '\\n')\n", |
192 | 348 | " print(chat_history[-1].content, '\\n') \n", |
193 | 349 | "\n", |
194 | | - "result = chat({\"question\": \"What are the three coolest Pixel features?\"})\n", |
| 350 | + "result = chat({\"question\": \"How much times aspartam is sweeter than table sugar?\"})\n", |
195 | 351 | "print_last_chat_turn(result['chat_history'])\n", |
196 | 352 | "\n", |
197 | | - "result = chat({\"question\": \"Can you explain the first feature in more detail?\"})\n", |
| 353 | + "result = chat({\"question\": \"What is its caloric value?\"})\n", |
198 | 354 | "print_last_chat_turn(result['chat_history'])" |
199 | 355 | ] |
200 | 356 | } |
|
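The split-index-retrieve flow used in the vector-search cells above can be sketched in plain, dependency-free Python. This is a toy stand-in, not the notebook's actual stack: `split_text` approximates fixed-size chunking (LangChain's `CharacterTextSplitter` additionally prefers splitting on separators such as blank lines), `embed` is a bag-of-words counter standing in for `VertexAIEmbeddings`, and `retrieve` mimics what `Chroma` + `as_retriever(search_kwargs={"k": 1})` does conceptually. All function names here are illustrative, not LangChain APIs.

```python
import re
from collections import Counter
from math import sqrt

def split_text(text, chunk_size=2000, chunk_overlap=0):
    # Fixed-size character chunking with optional overlap; a rough
    # approximation of CharacterTextSplitter's behavior.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(text):
    # Toy bag-of-words "embedding"; a stand-in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(chunks, query, k=1):
    # Return the k chunks most similar to the query; in the notebook this
    # role is played by the Chroma vector store's retriever.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]

chunks = [
    "Citric acid melts at about 153 degrees Celsius.",
    "Aspartame is roughly 200 times sweeter than sucrose.",
]
print(retrieve(chunks, "What is the melting point of citric acid?", k=1)[0])
# → Citric acid melts at about 153 degrees Celsius.
```

In a real pipeline the retrieved chunks would then be stuffed into the LLM prompt (the `chain_type="stuff"` step), which is why keeping `k` and `chunk_size` small enough to fit the model context matters.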