Summarization#
LLM-based summarization involves using LLMs to generate concise versions of longer texts while retaining key information and context. The model analyzes the input, identifies important points, and rephrases the content into a shorter summary.
There are two main types:
Extractive Summarization: Selects and extracts key sentences directly from the source.
Abstractive Summarization: Rewrites the content in a new way, generating novel sentences that capture the main ideas.
LLM-based summarization is useful for simplifying complex documents, news articles, or reports.
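The difference shows up mostly in the instruction you give the model. As a minimal sketch (the prompt wording below is illustrative, not from any particular library):
# Illustrative prompt wording only -- the extractive/abstractive split lives in the instruction
extractive_prompt = "Extract the 3 most important sentences from the text below, verbatim:\n\n{text}"
abstractive_prompt = "Summarize the text below in your own words, in 2-3 sentences:\n\n{text}"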
5 Levels Of Summarization: Novice to Expert#
Summarization is a fundamental building block of many LLM tasks. You’ll frequently run into use cases where you would like to distill a large body of text into a succinct set of points.
Depending on the length of the text you’d like to summarize, you have different summarization methods to choose from.
We’re going to run through 5 methods for summarization, starting at Novice and ending at Expert. These aren’t the only options; feel free to make up your own. If you find another one you like, please share it with the community.
5 Levels Of Summarization:
Summarize a couple sentences - Basic Prompt
Summarize a couple paragraphs - Prompt Templates
Summarize a couple pages - Map Reduce
Summarize an entire book - Best Representation Vectors
Summarize an unknown amount of text - Agents
First let’s load our OpenAI API key
# Unzip data folder
import zipfile

with zipfile.ZipFile('../../data.zip', 'r') as zip_ref:
    zip_ref.extractall('..')
from dotenv import load_dotenv
import os
load_dotenv()
openai_api_key = os.getenv('OPENAI_API_KEY', 'YourAPIKey')
Level 1: Basic Prompt - Summarize a couple sentences#
If you just have a few sentences you want to one-off summarize, you can use a simple prompt and copy-paste your text.
This method isn’t scalable and is only practical for a few use cases… the perfect level #1!
from langchain import OpenAI
llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
The important part is to provide instructions so the LLM knows what to do. In this case I’m telling the model I want a summary of the text below.
prompt = """
Please provide a summary of the following text
TEXT:
Philosophy (from Greek: φιλοσοφία, philosophia, 'love of wisdom') \
is the systematized study of general and fundamental questions, \
such as those about existence, reason, knowledge, values, mind, and language. \
Some sources claim the term was coined by Pythagoras (c. 570 – c. 495 BCE), \
although this theory is disputed by some. Philosophical methods include questioning, \
critical discussion, rational argument, and systematic presentation.
"""
num_tokens = llm.get_num_tokens(prompt)
print (f"Our prompt has {num_tokens} tokens")
Our prompt has 121 tokens
output = llm(prompt)
print (output)
Philosophy is a systematized study of general and fundamental questions about existence, reason, knowledge, values, mind, and language. It is believed to have been coined by Pythagoras, and its methods include questioning, critical discussion, rational argument, and systematic presentation.
Woof 🐶, that summary is still hard to understand. Let me add to my instructions so that the output is easier to understand. I’ll tell it to explain it to me like a 5 year old.
prompt = """
Please provide a summary of the following text.
Please provide your output in a manner that a 5 year old would understand
TEXT:
Philosophy (from Greek: φιλοσοφία, philosophia, 'love of wisdom') \
is the systematized study of general and fundamental questions, \
such as those about existence, reason, knowledge, values, mind, and language. \
Some sources claim the term was coined by Pythagoras (c. 570 – c. 495 BCE), \
although this theory is disputed by some. Philosophical methods include questioning, \
critical discussion, rational argument, and systematic presentation.
"""
num_tokens = llm.get_num_tokens(prompt)
print (f"Our prompt has {num_tokens} tokens")
Our prompt has 137 tokens
output = llm(prompt)
print (output)
Philosophy is about asking questions and trying to figure out the answers. It is about thinking about things like existence, knowledge, and values. People have been doing this for a very long time, and it is still done today.
Nice! That’s much better, but let’s look at something we can automate a bit more
Level 2: Prompt Templates - Summarize a couple paragraphs#
Prompt templates are a great way to dynamically place text within your prompts. They are like Python f-strings, but specialized for working with language models.
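As a quick illustration of that analogy, here’s a minimal sketch (the essay string is a stand-in):
# With an f-string, substitution happens immediately
essay = "..."
prompt_now = f"Summarize: {essay}"

# With a PromptTemplate, the template is held and substitution is deferred until .format()
from langchain import PromptTemplate
template = PromptTemplate(input_variables=["essay"], template="Summarize: {essay}")
prompt_later = template.format(essay=essay)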
We’re going to look at 2 short Paul Graham essays
from langchain import OpenAI
from langchain import PromptTemplate
import os
paul_graham_essays = ['../data/PaulGrahamEssaySmall/getideas.txt', '../data/PaulGrahamEssaySmall/noob.txt']
essays = []
for file_name in paul_graham_essays:
    with open(file_name, 'r') as file:
        essays.append(file.read())
Let’s print out a preview of the essays to see what they look like
for i, essay in enumerate(essays):
    print (f"Essay #{i+1}: {essay[:300]}\n")
Essay #1: January 2023(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from. The
answer was ok, but not what I would have said. This is what I would have said.)The way to get new ideas is to notice anomalies: what seems strange,
Essay #2: January 2020When I was young, I thought old people had everything figured out.
Now that I'm old, I know this isn't true.I constantly feel like a noob. It seems like I'm always talking to
some startup working in a new field I know nothing about, or reading
a book about a topic I don't understand well
Next let’s create a prompt template which will hold our instructions and a placeholder for the essay. In this example I only want a 1 sentence summary to come back
template = """
Please write a one sentence summary of the following text:
{essay}
"""
prompt = PromptTemplate(
    input_variables=["essay"],
    template=template
)
Then let’s loop through the 2 essays and pass them to our LLM. I’m applying .strip() to the summaries to remove the whitespace at the front and back of the output.
for essay in essays:
    summary_prompt = prompt.format(essay=essay)

    num_tokens = llm.get_num_tokens(summary_prompt)
    print (f"This prompt + essay has {num_tokens} tokens")

    summary = llm(summary_prompt)

    print (f"Summary: {summary.strip()}")
    print ("\n")
This prompt + essay has 205 tokens
Summary: Exploring anomalies at the frontiers of knowledge is the best way to generate new ideas.
This prompt + essay has 500 tokens
Summary: This text explores the idea that feeling like a "noob" is actually beneficial, as it is inversely correlated with actual ignorance and encourages us to discover new things.
Level 3: Map Reduce - Summarize a couple pages#
If you have multiple pages you’d like to summarize, you’ll likely run into a token limit. Token limits won’t always be a problem, but it is good to know how to handle them if you run into the issue.
The chain type “Map Reduce” is a method that helps with this. You first generate a summary of smaller chunks (that fit within the token limit) and then you get a summary of the summaries.
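Conceptually it’s just two passes over the text. A bare-bones sketch of the idea (assuming an llm and a list of chunk strings; the chain below does all of this for you):
# Bare-bones map-reduce sketch (conceptual only -- load_summarize_chain handles this for you)
def map_reduce_summarize(llm, chunks):
    # Map step: summarize each chunk independently
    chunk_summaries = [llm(f"Summarize the following:\n{chunk}") for chunk in chunks]
    # Reduce step: summarize the concatenated summaries
    return llm("Combine these summaries into a single summary:\n" + "\n".join(chunk_summaries))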
from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
paul_graham_essay = '../data/PaulGrahamEssays/startupideas.txt'
with open(paul_graham_essay, 'r') as file:
    essay = file.read()
Let’s see how many tokens are in this essay
llm.get_num_tokens(essay)
9565
That’s too many, let’s split our text up into chunks so they fit into the prompt limit. I’m going with a chunk size of 10,000 characters.
You can think of tokens as pieces of words used for natural language processing. For English text, 1 token is approximately 4 characters or 0.75 words. As a point of reference, the collected works of Shakespeare are about 900,000 words or 1.2M tokens.
This means we should expect chunks of roughly 10,000 / 4 = ~2,500 tokens each. But this will vary; each body of text/code will be different.
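You can sanity-check that back-of-envelope math yourself; here’s a rough sketch (real tokenizers vary by text):
# Rough heuristic: ~4 characters per token for English text
def estimate_tokens(text):
    return len(text) // 4

print (estimate_tokens("a" * 10000))  # ~2,500, matching the estimate above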
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)
docs = text_splitter.create_documents([essay])
num_docs = len(docs)
num_tokens_first_doc = llm.get_num_tokens(docs[0].page_content)
print (f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens")
Now we have 5 documents and the first one has 2086 tokens
Great, assuming that number of tokens is consistent in the other docs we should be good to go. Let’s use LangChain’s load_summarize_chain to do the map_reducing for us. We first need to initialize our chain
summary_chain = load_summarize_chain(llm=llm, chain_type='map_reduce',
                                     # verbose=True # Set verbose=True if you want to see the prompts being used
                                     )
Now actually run it
output = summary_chain.run(docs)
output
' This article provides strategies for coming up with startup ideas on demand, such as looking in areas of expertise, talking to people about their needs, and looking for waves and gaps in the market. It also discusses the need for users to have sufficient activation energy to start using a product, and how this varies depending on the product. It looks at the difficulty of switching paths in life as one gets older, and how colleges can help students start startups. Finally, it looks at the importance of focusing on users rather than competitors, and how Steve Wozniak solved his own problems.'
This summary is a great start, but I’m more of a bullet point person. I want to get my final output in bullet point form.
In order to do this I’m going to use custom prompts (like we did above) to instruct the model on what I want.
The map_prompt is going to stay the same (just showing it for clarity), but I’ll edit the combine_prompt.
map_prompt = """
Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
combine_prompt = """
Write a concise summary of the following text delimited by triple backquotes.
Return your response in bullet points which covers the key points of the text.
```{text}```
BULLET POINT SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm,
                                     chain_type='map_reduce',
                                     map_prompt=map_prompt_template,
                                     combine_prompt=combine_prompt_template,
                                     # verbose=True
                                     )
output = summary_chain.run(docs)
print (output)
- Y Combinator suggests that the best startup ideas come from looking for problems, preferably ones that the founders have themselves.
- Good ideas should appeal to a small number of people who need it urgently.
- To find startup ideas, one should look for things that seem to be missing and be prepared to question the status quo.
- College students should use their college experience to prepare themselves for the future and build things with other students.
- Tricks for coming up with startup ideas on demand include looking in areas of expertise, talking to people about their needs, and looking for waves and gaps in the market.
- Sam Altman points out that taking the time to come up with an idea is a better strategy than most founders are willing to put in the time for.
- Paul Buchheit suggests that trying to sell something bad can lead to better ideas.
Level 4: Best Representation Vectors - Summarize an entire book#
In the above method we pass the entire document (all 9.5K tokens of it) to the LLM. But what if you have more tokens than that?
What if you had a book you wanted to summarize? Let’s load one up; we’re going to load Into Thin Air, about the 1996 Everest disaster.
from langchain.document_loaders import PyPDFLoader
# Load the book
loader = PyPDFLoader("../data/IntoThinAirBook.pdf")
pages = loader.load()
# Cut out the open and closing parts
pages = pages[26:277]
# Combine the pages, and replace the tabs with spaces
text = ""
for page in pages:
    text += page.page_content

text = text.replace('\t', ' ')
num_tokens = llm.get_num_tokens(text)
print (f"This book has {num_tokens} tokens in it")
This book has 139472 tokens in it
Wow, that’s over 100K tokens; even GPT-4 32K wouldn’t be able to handle that in one go. At $0.03 per 1K prompt tokens, this would cost us about $4.18 just for the prompt alone.
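Here’s the arithmetic behind that estimate (assuming $0.03 per 1K prompt tokens):
# Prompt-only cost estimate at an assumed $0.03 per 1K tokens
num_tokens = 139472
cost = num_tokens / 1000 * 0.03
print (f"${cost:.2f}")  # ~$4.18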
So how do we do this without going through all the tokens? Pick random chunks? Pick equally spaced chunks?
Goal: Chunk your book, then get embeddings of the chunks. Pick a subset of chunks that represent a holistic but diverse view of the book. Put another way: is there a way to pick the top 10 passages that describe the book the best?
Once we have our chunks that represent the book then we can summarize those chunks and hopefully get a pretty good summary.
Keep in mind there are tools that would likely do this for you, and with token limits increasing this won’t be a problem for long. But if you want to do it from scratch this might help.
This is most definitely not the optimal answer, but it’s my take on it for now! If the clustering experts wanna help improve it that would be awesome.
The BRV Steps:
Load your book into a single text file
Split your text into large-ish chunks
Embed your chunks to get vectors
Cluster the vectors to see which are similar to each other and likely talk about the same parts of the book
Pick embeddings that represent the cluster the most (method: closest to each cluster centroid)
Summarize the documents that these embeddings represent
Another way to phrase this process, “Which ~10 documents from this book represent most of the meaning? I want to build a summary off those.”
Note: There will be a bit of information loss, but show me a summary of a whole book that doesn’t have information loss ;)
# Loaders
from langchain.schema import Document
# Splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Model
from langchain.chat_models import ChatOpenAI
# Embedding Support
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
# Summarizer we'll use for Map Reduce
from langchain.chains.summarize import load_summarize_chain
# Data Science
import numpy as np
from sklearn.cluster import KMeans
I’m going to initialize two models, gpt-3.5-turbo and gpt-4. I’ll use gpt-3.5-turbo for the first set of summaries to reduce cost, and then gpt-4 for the final pass, which should hopefully increase the quality.
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "\t"], chunk_size=10000, chunk_overlap=3000)
docs = text_splitter.create_documents([text])
num_documents = len(docs)
print (f"Now our book is split up into {num_documents} documents")
Now our book is split up into 78 documents
Let’s get our embeddings of those 78 documents
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
vectors = embeddings.embed_documents([x.page_content for x in docs])
Now let’s cluster our embeddings. There are a ton of clustering algorithms you can choose from. Please try a few out to see what works best for you!
# Assuming 'vectors' is a list or array of 1536-dimensional embeddings
# Choose the number of clusters, this can be adjusted based on the book's content.
# I played around and found ~10 was the best.
# Usually if you have 10 passages from a book you can tell what it's about
num_clusters = 11
# Perform K-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)
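If you want to try a different algorithm, here’s one alternative you could swap in (a sketch, not what I used):
# Alternative: agglomerative (hierarchical) clustering from scikit-learn
from sklearn.cluster import AgglomerativeClustering

agg_labels = AgglomerativeClustering(n_clusters=num_clusters).fit_predict(vectors)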
Here are the clusters that were found. It’s interesting to see the progression of clusters throughout the book. This is expected because as the plot changes you’d expect different clusters to emerge due to different semantic meaning
kmeans.labels_
This is sweet, but whenever you have a clustering exercise, it’s hard not to graph them. Make sure you add colors.
We also need to do dimensionality reduction to reduce the vectors from 1536 dimensions to 2 (this is sloppy data science but we are working towards the 80% solution)
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Taking out the warnings
import warnings
from warnings import simplefilter
# Filter out FutureWarnings
simplefilter(action='ignore', category=FutureWarning)
# Perform t-SNE and reduce to 2 dimensions
tsne = TSNE(n_components=2, random_state=42)
reduced_data_tsne = tsne.fit_transform(vectors)
# Plot the reduced data
plt.scatter(reduced_data_tsne[:, 0], reduced_data_tsne[:, 1], c=kmeans.labels_)
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Book Embeddings Clustered')
plt.show()
Awesome, not perfect, but pretty good directionally. Now we need to get the vectors which are closest to the cluster centroids (the center).
The function below is a quick way to do that (w/ help from ChatGPT)
# Find the closest embeddings to the centroids
# Create an empty list that will hold your closest points
closest_indices = []
# Loop through the number of clusters you have
for i in range(num_clusters):
    # Get the list of distances from that particular cluster center
    distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)

    # Find the list position of the closest one (using argmin to find the smallest distance)
    closest_index = np.argmin(distances)

    # Append that position to your closest indices list
    closest_indices.append(closest_index)
Now sort them (so the chunks are processed in order)
selected_indices = sorted(closest_indices)
selected_indices
It’s interesting to see which chunks pop up as the most descriptive. How does your distribution look?
Let’s create our custom prompts. I’m going to use gpt-4 (which has a bigger token limit) for the combine step, so I’m asking for long summaries in the map step to reduce the information loss.
llm3 = ChatOpenAI(temperature=0,
                  openai_api_key=openai_api_key,
                  max_tokens=1000,
                  model='gpt-3.5-turbo'
                 )
map_prompt = """
You will be given a single passage of a book. This section will be enclosed in triple backticks (```)
Your goal is to give a summary of this section so that a reader will have a full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what was said in the passage.
```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
I kept getting timeout errors, so I’m actually going to do this map reduce manually.
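If you hit the same timeouts, a simple retry wrapper is one workaround (a hypothetical helper, not part of the original flow):
# Hypothetical retry helper for flaky API calls
import time

def run_with_retries(chain, inputs, retries=3, wait=5):
    for attempt in range(retries):
        try:
            return chain.run(inputs)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(wait)  # back off before trying again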
map_chain = load_summarize_chain(llm=llm3,
                                 chain_type="stuff",
                                 prompt=map_prompt_template)
Then go get the docs that the top vectors represent.
selected_docs = [docs[doc] for doc in selected_indices]
Let’s loop through our selected docs and get a good summary for each chunk. We’ll store the summary in a list.
# Make an empty list to hold your summaries
summary_list = []

# Loop through your selected docs
for i, doc in enumerate(selected_docs):
    # Go get a summary of the chunk
    chunk_summary = map_chain.run([doc])

    # Append that summary to your list
    summary_list.append(chunk_summary)

    print (f"Summary #{i} (chunk #{selected_indices[i]}) - Preview: {chunk_summary[:250]} \n")
Great, now that we have our list of summaries, let’s get a summary of the summaries
summaries = "\n".join(summary_list)
# Convert it back to a document
summaries = Document(page_content=summaries)
print (f"Your total summary has {llm.get_num_tokens(summaries.page_content)} tokens")
llm4 = ChatOpenAI(temperature=0,
                  openai_api_key=openai_api_key,
                  max_tokens=3000,
                  model='gpt-4',
                  request_timeout=120
                 )
combine_prompt = """
You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks (```)
Your goal is to give a verbose summary of what happened in the story.
The reader should be able to grasp what happened in the book.
```{text}```
VERBOSE SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])
reduce_chain = load_summarize_chain(llm=llm4,
                                    chain_type="stuff",
                                    prompt=combine_prompt_template,
                                    # verbose=True # Set this to true if you want to see the inner workings
                                    )
Run! Note this will take a while
output = reduce_chain.run([summaries])
print (output)
Wow, that was a long process, but you get the gist. Hopefully we’ll see some library abstractions in the coming months that do this for us automatically! Let me know what you think on Twitter.
Level 5: Agents - Summarize an unknown amount of text#
What if you have an unknown amount of text you need to summarize? This may be a verticalized use case (like law or medicine) where more research is required as you uncover the first pieces of information.
We’re going to use agents below. This is still a very actively developed area and should be handled with care. Future agents will be able to handle a lot more complicated tasks.
from langchain import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.utilities import WikipediaAPIWrapper
llm = ChatOpenAI(temperature=0, model_name='gpt-4', openai_api_key=openai_api_key)
We’re going to use the Wiki search tool and research multiple topics
wikipedia = WikipediaAPIWrapper()
Let’s define our toolkit; in this case it’s just one tool.
tools = [
    Tool(
        name="Wikipedia",
        func=wikipedia.run,
        description="Useful for when you need to get information from wikipedia about a single topic"
    ),
]
Init our agent
agent_executor = initialize_agent(tools, llm, agent='zero-shot-react-description', verbose=True)
Then let’s ask a question that will need multiple documents
output = agent_executor.run("Can you please provide a quick summary of Napoleon Bonaparte? \
Then do a separate search and tell me what the commonalities are with Serena Williams")
print (output)
Using LLMs To Summarize Personal Research#
Our goal is to have an LLM aid us in generating interview questions for someone. I find that I’m constantly trying to ramp up on a person’s background and story when preparing to meet them.
There are a ton of awesome resources about a person online we can use:
Twitter Profiles
Websites
Other Interviews (YouTube or Text)
Let’s bring all these together by first pulling the information and then generating questions or bullet points we can use as preparation.
First let’s import our packages! We’ll be using LangChain to help us interact with OpenAI
# Unzip data folder
import zipfile

with zipfile.ZipFile('../../data.zip', 'r') as zip_ref:
    zip_ref.extractall('..')
# LLMs
from langchain import PromptTemplate
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
# Twitter
import tweepy
# Scraping
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
# YouTube
from langchain.document_loaders import YoutubeLoader
# !pip install youtube-transcript-api
# Environment Variables
import os
from dotenv import load_dotenv
load_dotenv()
True
You’ll need a few API keys to complete the script below. It’s modular, so if you don’t want to pull from Twitter, feel free to leave those blank.
TWITTER_API_KEY = os.getenv('TWITTER_API_KEY', 'YourAPIKeyIfNotSet')
TWITTER_API_SECRET = os.getenv('TWITTER_API_SECRET', 'YourAPIKeyIfNotSet')
TWITTER_ACCESS_TOKEN = os.getenv('TWITTER_ACCESS_TOKEN', 'YourAPIKeyIfNotSet')
TWITTER_ACCESS_TOKEN_SECRET = os.getenv('TWITTER_ACCESS_TOKEN_SECRET', 'YourAPIKeyIfNotSet')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', 'YourAPIKeyIfNotSet')
For this tutorial, let’s pretend we are going to be interviewing Elad Gil since he has a bunch of content online
Pulling Data From Twitter#
Great, now let’s set up a function that will pull tweets for us. This will help us get current events that the user is talking about. I’m excluding replies since they usually don’t have a ton of high signal text from the user. This is the same code that was used in the Twitter AI Bot tutorial.
def get_original_tweets(screen_name, tweets_to_pull=80, tweets_to_return=80):
    # Tweepy set up
    auth = tweepy.OAuthHandler(TWITTER_API_KEY, TWITTER_API_SECRET)
    auth.set_access_token(TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)

    # Holder for the tweets you'll find
    tweets = []

    # Go and pull the tweets
    tweepy_results = tweepy.Cursor(api.user_timeline,
                                   screen_name=screen_name,
                                   tweet_mode='extended',
                                   exclude_replies=True).items(tweets_to_pull)

    # Run through tweets and remove retweets and quote tweets so we can only look at a user's raw emotions
    for status in tweepy_results:
        if hasattr(status, 'retweeted_status') or hasattr(status, 'quoted_status'):
            # Skip if it's a retweet or quote tweet
            continue
        else:
            tweets.append({'full_text': status.full_text, 'likes': status.favorite_count})

    # Sort the tweets by number of likes. This will help us short_list the top ones later
    sorted_tweets = sorted(tweets, key=lambda x: x['likes'], reverse=True)

    # Get the text and drop the like count from the dictionary
    full_text = [x['full_text'] for x in sorted_tweets][:tweets_to_return]

    # Convert the list of tweets into a string of tweets we can use in the prompt later
    users_tweets = "\n\n".join(full_text)

    return users_tweets
Ok cool, let’s try it out!
user_tweets = get_original_tweets("eladgil")
print (user_tweets[:300])
More AI companies with sudden virality + paying customers should just bootstrap
0. Running co for cash may be best success
1. If it does scale, being profitable or near to it creates lot of options
2. it may not scale, or only work for a few months
3. Why get on the… https://t.co/Q9TRQo4yau
Som
Awesome, now we have a few tweets let’s move onto pulling data from a web page or two.
Pulling Data From Websites#
Let’s do two pages
His personal website which has his background - https://eladgil.com/
One of my favorite blog posts from him around AI defensibility & moats - https://blog.eladgil.com/p/defensibility-and-competition
First let’s create a function that will scrape a website for us.
We’ll do this by pulling the raw HTML, putting it in a BeautifulSoup object, and then converting that object to Markdown for better parsing.
def pull_from_website(url):
    # Doing a try in case the request doesn't work
    try:
        response = requests.get(url)
    except requests.exceptions.RequestException:
        # In case it doesn't work
        print ("Whoops, error")
        return

    # Put your response in a beautiful soup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Get your text
    text = soup.get_text()

    # Convert your html to markdown. This reduces tokens and noise
    text = md(text)

    return text
# I'm going to store my website data in a simple string.
# There is likely optimization to make this better but it's a solid 80% solution
website_data = ""
urls = ["https://eladgil.com/", "https://blog.eladgil.com/p/defensibility-and-competition"]
for url in urls:
    text = pull_from_website(url)
    website_data += text
Awesome, now that we have both of those data sources, let’s check out a sample
print (website_data[:400])
Elad Gil
Welcome to Elad Gil's retro homepage!
Who? I am a technology entrepreneur. LinkedIn profile is here.
What?
I am an investor or advisor to companies including Airbnb, Airtable, Anduril, Brex, Checkr, Coinbase, dbt Labs, Deel, Figma, Flexport, Gitlab, Gusto, Instacart, Navan, Notion, Opendoor, PagerDuty, Pinterest, Retool, Rippling, Samsara, Square, Stripe
I am involved with AI com
Awesome, to round us off, let’s get the information from a YouTube video. YouTube has tons of data, like podcasts and interviews. This will be valuable for us to have.
Pulling Data From YouTube#
We’ll use LangChain’s YouTube loader for this. It only works if there’s already a transcript on the YouTube video; if there isn’t, we’ll move on. You could get the transcript via Whisper if you really wanted to, but that’s out of scope for today.
We’ll make a function we can use to loop through videos
# Pulling data from YouTube in text form
def get_video_transcripts(url):
    loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)
    documents = loader.load()
    transcript = ' '.join([doc.page_content for doc in documents])

    return transcript
# Using a regular string to store the youtube transcript data
# Video selection will be important.
# Parsing interviews is a whole other can of worms so I'm opting for one where Elad is mostly talking about himself
video_urls = ['https://www.youtube.com/watch?v=nglHX4B33_o']
videos_text = ""
for video_url in video_urls:
    video_text = get_video_transcripts(video_url)
    videos_text += video_text
Let’s look at a sample from the video
print(video_text[:300])
I like to say that startups are an act of desperation and the desperation went out of the ecosystem over the last two or three years and we just had people showing up for the status and the money and now I think it's getting back to people who are doing it for a variety of reasons including the impa
Awesome, now that we have all of our data, let’s combine it into a single information block
user_information = user_tweets + website_data + videos_text
Our user_information variable is a big messy wall of text. Ideally we would clean this up more and try to increase the signal to noise ratio. However for this project we’ll just focus on the core use case of gathering data.
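If you did want to clean it up, a light pass like this is a reasonable start (a hypothetical helper, not used below):
# Hypothetical cleanup helper -- strips bare URLs and collapses blank-line runs
import re

def clean_text(raw):
    text = re.sub(r'https?://\S+', '', raw)  # drop bare URLs (e.g. t.co links)
    text = re.sub(r'\n{3,}', '\n\n', text)   # collapse runs of blank lines
    return text.strip()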
Next we’ll chunk our wall of text into pieces so we can run a map_reduce process on it. If you want to learn more about techniques to split up your data, check out my video on OpenAI Token Workarounds.
# First we make our text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=20000, chunk_overlap=2000)
# Then we split our user information into different documents
docs = text_splitter.create_documents([user_information])
# Let's see how many documents we created
len(docs)
3
Because we have a special request for the LLM on our data, I want to make custom prompts. This will allow me to tinker with what data the LLM pulls out. I’ll use LangChain’s load_summarize_chain with custom prompts to do this. We aren’t making a summary, but rather just using load_summarize_chain for its easy map_reduce functionality.
First let’s make our custom map prompt. This is where we’ll instruct the LLM to pull out interview questions and tell it what makes a good question.
map_prompt = """You are a helpful AI bot that aids a user in research.
Below is information about a person named {persons_name}.
Information will include tweets, interview transcripts, and blog posts about {persons_name}
Your goal is to generate interview questions that we can ask {persons_name}
Use specifics from the research when possible
% START OF INFORMATION ABOUT {persons_name}:
{text}
% END OF INFORMATION ABOUT {persons_name}:
Please respond with a list of a few interview questions based on the topics above
YOUR RESPONSE:"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text", "persons_name"])
Then we’ll make our custom combine prompt. This is the set of instructions we’ll give the LLM on how to handle the list of questions returned in the first step above.
combine_prompt = """
You are a helpful AI bot that aids a user in research.
You will be given a list of potential interview questions that we can ask {persons_name}.
Please consolidate the questions and return a list
% INTERVIEW QUESTIONS
{text}
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text", "persons_name"])
Let’s create our LLM and chain. I’m increasing the temperature a bit for more creative language. If you notice that your questions have hallucinations in them, turn the temperature down to 0.
llm = ChatOpenAI(temperature=.25, model_name='gpt-4')
chain = load_summarize_chain(llm,
                             chain_type="map_reduce",
                             map_prompt=map_prompt_template,
                             combine_prompt=combine_prompt_template,
                             # verbose=True
                             )
Ok, finally! With all of our data gathered and prompts ready, let’s run our chain
output = chain({"input_documents": docs, # The three docs that were created above
                "persons_name": "Elad Gil"
               })
print (output['output_text'])
1. As an investor and advisor to various AI companies, what are some common challenges you've observed in the industry, and how do you recommend overcoming them?
2. Can you elaborate on the advantages of bootstrapping for AI startups and share any success stories you've come across?
3. What are some key lessons you've learned from your experiences in high-profile companies like Twitter, Google, and Color Health that have shaped your approach to investing and advising startups?
4. How do you think AI will continue to shape the job market in the coming years?
5. What motivated you to enter the healthcare space as a co-founder of Color Health, and how do you envision the role of AI in improving healthcare outcomes?
6. Can you share some insights on what sets high growth companies apart from others and the key factors that contribute to their rapid growth?
7. How do you evaluate the defensibility of AI startups when considering investment or advisory opportunities?
8. What excites you the most about the future of AI, and what challenges do you foresee in its development and implementation?
9. Can you share your vision for Color Health and how it aims to revolutionize the healthcare industry?
10. What were the key challenges you faced during the rapid growth of Twitter, and how did you overcome them?
11. What advice would you give to founders looking to build defensibility into their startups from the beginning?
12. Can you share an example of a company that has successfully maintained a user-centric focus and how it has contributed to their success?
13. How do you see the balance between serving customer needs and building defensibility evolving in the future of AI-driven products and services?
14. Can you elaborate on the factors that contribute to your prediction of 2023 being a rough year for mid to late-stage private technology companies and how startups can prepare for these challenges?
15. What do you think are the most promising applications of large language models like GPT in the near future, and how can startups leverage them for growth?
16. How do you see the open versus closed structure playing out in the AI industry, and what implications could it have for startups and established companies in the AI space?
17. How do you think the costs involved in training large language models like GPT-3 and GPT-4 will affect competition and innovation in the AI industry, particularly for startups with limited resources?
18. What do you think are the key factors driving growth in the space and defense technology sector, and what opportunities do you see for startups in this industry?
19. How do you envision the future of defense tech startups, and what challenges do they need to overcome to succeed in this competitive landscape?
20. What lessons can other startups in the defense sector learn from Anduril's success, and how can they apply these strategies to their own businesses?
Awesome! Now we have some questions we can iterate on before we chat with the person. You can swap out different sources for different people.
These questions won’t be 100% ‘copy & paste’ ready, but they should serve as a really solid starting point for you to build on top of.
Next, let’s port this code over to a Streamlit app so we can share a deployed version easily