PDF Summarisation with Llama 3.1
In this tutorial we will use Llama 3.1 for PDF summarisation.
To summarise a PDF we first need to extract its text, and for that we will use PyPDF2.
Sample PDF — https://pdfobject.com/pdf/sample.pdf
Download the file above and pass it to PdfReader:
# pip install PyPDF2
from PyPDF2 import PdfReader

reader = PdfReader("sample.pdf")

text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
print(text)
Sample PDFThis is a simple PDF file. Fun fun fun.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Phasellus facilisis odio sed mi. Curabitur suscipit. Nullam vel nisi. Etiam semper ipsum ut lectus. Proin aliquam, erat eget pharetra commodo, eros mi condimentum quam, sed commodo justo quam ut velit. Integer a erat. Cras laoreet ligula cursus enim. Aenean scelerisque velit et tellus. Vestibulum dictum aliquet sem. Nulla facilisi. Vestibulum accumsan ante vitae elit. Nulla erat dolor, blandit in, rutrum quis, semper pulvinar, enim. Nullam varius congue risus. Vivamus sollicitudin, metus ut interdum eleifend, nisi tellus pellentesque elit, tristique accumsan eros quam et risus. Suspendisse libero odio, mattis sit amet, aliquet eget, hendrerit vel, nulla. Sed vitae augue. Aliquam erat volutpat. Aliquam feugiat vulputate nisl. Suspendisse quis nulla pretium ante pretium mollis. Proin velit ligula, sagittis at, egestas a, pulvinar quis, nisl.Pellentesque sit amet lectus. Praesent pulvinar, nunc quis iaculis sagittis, justo quam lobortis tortor, sed vestibulum dui metus venenatis est. Nunc cursus ligula. Nulla facilisi. Phasellus ullamcorper consectetuer ante. Duis tincidunt, urna id condimentum luctus, nibh ante vulputate sapien, id sagittis massa orci ut enim. Pellentesque vestibulum convallis sem. Nulla consequat quam ut nisl. Nullam est. Curabitur tincidunt dapibus lorem. Proin velit turpis, scelerisque sit amet, iaculis nec, rhoncus ac, ipsum. Phasellus lorem arcu, feugiat eu, gravida eu, consequat molestie, ipsum. Nullam vel est ut ipsum volutpat feugiat. Aenean pellentesque.In mauris. Pellentesque dui nisi, iaculis eu, rhoncus in, venenatis ac, ante. Ut odio justo, scelerisque vel, facilisis non, commodo a, pede. Cras nec massa sit amet tortor volutpat varius. Donec lacinia, neque a luctus aliquet, pede massa imperdiet ante, at varius lorem pede sed sapien. Fusce erat nibh, aliquet in, eleifend eget, commodo eget, erat. Fusce consectetuer. Cras risus tortor, porttitor nec, tristique sed, convallis semper, eros. Fusce vulputate ipsum a mauris. Phasellus mollis. Curabitur sed urna. Aliquam nec sapien non nibh pulvinar convallis. Vivamus facilisis augue quis quam. Proin cursus aliquet metus. Suspendisse lacinia. Nulla at tellus ac turpis eleifend scelerisque. Maecenas a pede vitae enim commodo interdum. Donec odio. Sed sollicitudin dui vitae justo.Morbi elit nunc, facilisis a, mollis a, molestie at, lectus. Suspendisse eget mauris eu tellus molestie cursus. Duis ut magna at justo dignissim condimentum. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Vivamus varius. Ut sit amet diam suscipit mauris ornare aliquam. Sed varius. Duis arcu. Etiam tristique massa eget dui. Phasellus congue. Aenean est erat, tincidunt eget, venenatis quis, commodo at, quam.
You will see the above output if your code runs successfully.
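Some PDFs contain pages with no extractable text (scanned images, for example), in which case extract_text() can come back empty. If you prefer a reusable helper that skips such pages, here is a minimal sketch (the function name extract_pdf_text is just for illustration):

from PyPDF2 import PdfReader

def extract_pdf_text(path):
    # Return the concatenated text of all pages, skipping pages with no text
    reader = PdfReader(path)
    pages = []
    for page in reader.pages:
        page_text = page.extract_text() or ""  # be defensive if a page yields nothing
        if page_text.strip():
            pages.append(page_text)
    return "\n".join(pages)

text = extract_pdf_text("sample.pdf")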
Now we will pass the text variable to Llama 3.1 for summarisation.
We will use Ollama to run Llama 3.1 locally.
Install Ollama and then run the following command in your terminal:
ollama run llama3.1
Now we can access llama3.1 through the Ollama API, like below:
curl -X POST http://localhost:11434/api/generate -d '{
"model": "llama3.1:latest",
"prompt":"hi"
}'
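By default this endpoint streams the response in chunks. If you want a single JSON reply instead, the API also accepts "stream": false. Here is a minimal sketch using only the Python standard library (it assumes Ollama is running on localhost:11434, the default port):

import json
import urllib.request

payload = json.dumps({
    "model": "llama3.1:latest",
    "prompt": "hi",
    "stream": False,  # return one JSON object instead of a stream of chunks
}).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])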
Instead of calling the API directly, we will use llama_index to access llama3.1. With llama_index we can pass a chat message, the parsed text, and an assistant message as well, which helps llama3.1 understand what we want.
# pip install llama-index
from llama_index.llms import Ollama
from llama_index.llms import ChatMessage

def llama3_1_access(model_name, chat_message, text, assistant_message):
    llm = Ollama(model=model_name)
    messages = [
        ChatMessage(role="assistant", content=assistant_message),
        ChatMessage(role="user", content=f"{chat_message} {text}"),
    ]
    resp = llm.chat(messages)
    # llm.chat() returns a ChatResponse; the generated reply lives in resp.message.content
    return resp.message.content
parsed_text = text  # the text extracted from the PDF file
chat_message = "Summarize in bullet points"
assistant_message = "You are an assistant, do what the user tells you to do properly."
model_name = 'llama3.1:latest'
summary = llama3_1_access(model_name, chat_message, parsed_text, assistant_message)
print("PDF Summary - ", summary)
Your output will look something like this -
PDF Summary - Here is a summary of the Sample PDF in bullet points:
* The text appears to be a block of Lorem Ipsum placeholder text.
* There are several instances of repeated phrases and sentences that seem to be examples of common web design and layout elements (e.g. "Phasellus facilisis odio sed mi", "Pellentesque sit amet lectus").
* The text contains various CSS class names and HTML tags, which suggests that it may have been generated from a web development template or sample code.
* There are no specific topics or themes mentioned in the text, but it appears to be a collection of generic design elements rather than a coherent piece of writing.
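You can steer the style of the summary simply by changing chat_message; for example, the same function can produce a short paragraph instead of bullet points:

chat_message = "Summarize in one short paragraph"
summary = llama3_1_access(model_name, chat_message, parsed_text, assistant_message)
print("PDF Summary - ", summary)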
Here is the entire code —
from PyPDF2 import PdfReader
from llama_index.llms import Ollama
from llama_index.llms import ChatMessage

# Extract text from every page of the PDF
reader = PdfReader("sample.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

def llama3_1_access(model_name, chat_message, text, assistant_message):
    llm = Ollama(model=model_name)
    messages = [
        ChatMessage(role="assistant", content=assistant_message),
        ChatMessage(role="user", content=f"{chat_message} {text}"),
    ]
    resp = llm.chat(messages)
    # llm.chat() returns a ChatResponse; the generated reply lives in resp.message.content
    return resp.message.content

parsed_text = text  # the text extracted from the PDF file
chat_message = "Summarize in bullet points"
assistant_message = "You are an assistant, do what the user tells you to do properly."
model_name = 'llama3.1:latest'

summary = llama3_1_access(model_name, chat_message, parsed_text, assistant_message)
print("PDF Summary - ", summary)
How to parse text from an online PDF without saving it
import io
import urllib.request
from PyPDF2 import PdfReader

pdf_url = "https://pdfobject.com/pdf/sample.pdf"

# Download the PDF into memory
response = urllib.request.urlopen(pdf_url)
file_data = response.read()

# Read the PDF file from the in-memory bytes
pdf_file = io.BytesIO(file_data)
reader = PdfReader(pdf_file)

# Extract text from each page
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
print(text)
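From here the flow is the same as before: pass the extracted text to the summarisation function (this assumes the llama3_1_access function defined earlier is available in the same script).

summary = llama3_1_access(
    'llama3.1:latest',
    "Summarize in bullet points",
    text,
    "You are an assistant, do what the user tells you to do properly.",
)
print("PDF Summary - ", summary)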