I Made a Weekly Newsletter that’s Completely AI Generated

Ryan Kemmer
13 min read · Sep 19, 2024


Image by the author

People used to spend hours in libraries, searching through alphabetized aisles trying to find new sources of information. Today in the age of the internet and AI, information is practically handed to you on a silver platter. Instead of having to go on a treacherous journey to learn about a topic, you can now just ask ChatGPT.

Like any other AI user, I am constantly thinking about how this technology will be applied to new use cases. In my opinion, one of the best untapped use cases is newsletters. These daily or weekly news flashes take information that is already out there and synthesize it into an easy-to-digest format. This is a process that LLMs are more than capable of doing for us, and one that could further improve how we access relevant information.

In this article, I am going to tell the story of how I built software that can pull data from the web, synthesize it, and write an engaging weekly newsletter.

Newsletters: Trend or new information frontier?

Newsletters have exploded recently, with platforms like Substack, Beehiiv, and ConvertKit gaining new subscribers faster than ever. According to a 2021 Storydoc study, at least 90% of Americans reported being subscribed to at least one newsletter.

Actually creating a newsletter is tough work, though. As a writer, you constantly have to source new information, which can take hours if not days. It is easy enough to come up with an idea for a newsletter, but staying consistent with writing one is not. As a hobby writer myself, I admit that almost every project I take on requires much more effort than I expected. But what if you could create a newsletter that wrote itself?

This led me to set a goal: create an automated newsletter using my own custom LLM-driven software. My friend and I decided to make a technology newsletter on ConvertKit called “TechFizz” that would use this tool to help write weekly editions for us. The newsletter covers weekly advancements across the technology industry, from big tech and AI to gadgets and space.

Now to get into the details of how I built software that can accomplish this…

The Architecture

Every good piece of software starts with a good architecture design.

To build a newsletter generation tool, I decided on two main components: a web crawler and a text synthesizer. To implement them, I took advantage of AWS serverless architecture, which has an incredible free tier.

The web crawler would be used to pull down stories from specific technology news websites, and would be easily extendable to more websites and sources. Every week, once we are ready to actually write a newsletter, the text synthesizer would run, using LLMs to turn information from our crawled sources into a digestible weekly edition.

Web Crawler

To create a robust web crawler, I designed a system involving two AWS Lambda functions. One Lambda goes through the list of source URLs to crawl and pushes them into an SQS queue, while the other Lambda uses libraries such as BeautifulSoup to crawl all article links on a given URL. Once individual article URLs are crawled, their content is scraped and stored in an Amazon DynamoDB table.

Design for a Web Crawler utilizing AWS
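
As an illustration, here is a minimal sketch of the first Lambda: it simply fans the configured source URLs out to an SQS queue so the crawling Lambda can process each one independently. The queue URL and source list below are placeholders, not the exact values from my setup.

import json
import os

import boto3

# Placeholder queue URL and source list; the real values live in configuration.
QUEUE_URL = os.environ.get("CRAWL_QUEUE_URL", "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue")
SOURCES = [
    "https://www.theverge.com/",
    "https://techcrunch.com/",
]

sqs = boto3.client("sqs")


def lambda_handler(event, context):
    # Push one SQS message per source URL; the crawler Lambda consumes these.
    for source_url in SOURCES:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"url": source_url}),
        )
    return {"statusCode": 200, "body": f"Queued {len(SOURCES)} sources."}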

Text Synthesizer

Now comes the real meat and potatoes of the app architecture: the text synthesizer. Once news articles have been collected and stored throughout the week, another Lambda queries them and feeds them into models to generate a newsletter.

The full process (shown below) involves a daily crawling job and a weekly text generation job.

Full design of a web crawler and text synthesizer using AWS

Web Crawler

Implementing the web crawler involved using Python to fetch all news article links from a homepage, scraping the content of each article, and then saving the articles to a database.

For example, let’s say we wanted to crawl all news articles on https://www.theverge.com/. First, we need the crawler to fetch all links on the homepage that follow the pattern below, which indicates they are news articles.

https://www.theverge.com/2024/9/14/24243794/tiktok-ban-bytedance-court-oral-arguments-lawsuit-explainer

Below is a snippet showing how we fetch all links on a page.

# Get response from the homepage
response = requests.get(url, headers=self.build_headers())

if response.status_code == 200:

    # Parse HTML content and find all page links
    soup = BeautifulSoup(response.content, 'html.parser')
    links = soup.find_all('a', href=True)

    # De-duplicate links and strip any query parameters
    unique_article_links = set([link['href'] for link in links])
    unique_article_links = [strip_query(link) for link in unique_article_links]

    # Raise an error if there are no article URLs found
    if len(unique_article_links) < 1:
        log.error("Error crawling site. No URLs found.")
        raise ValueError("Error crawling site. No URLs found.")
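
This snippet relies on two small helpers that are not shown in the article: strip_query, and parse_domain (used in validate_link below). A plausible sketch of both, assuming they are thin wrappers around urllib, looks like this:

from urllib.parse import urlparse, urlunparse


def strip_query(link):
    # Drop query strings and fragments so the same article is not stored twice
    parsed = urlparse(link)
    return urlunparse((parsed.scheme, parsed.netloc, parsed.path, '', '', ''))


def parse_domain(url):
    # Return the bare domain, e.g. "theverge.com" for "https://www.theverge.com/"
    return urlparse(url).netloc.replace('www.', '')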

It is important that we use a regex to keep only links to valid news articles within the site we are crawling. We also need to make sure articles were written within the last day, to avoid crawling links we have already stored. A Boolean Python function, validate_link, handles the first check; the date check is handled separately by validate_article_date.

def validate_link(self, article_link, url):

    # Regex format for news article links on theverge.com
    pattern = r"theverge\.com/\d{4}(?=\D|$)"

    # A link is valid if it is on the same domain, is not the homepage itself,
    # and matches the article URL pattern
    link_is_valid = False
    if (
        parse_domain(url) in article_link and
        article_link != url and
        bool(re.search(pattern, article_link))
    ):
        link_is_valid = True

    return link_is_valid
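
The date check is handled by validate_article_date, which is referenced in the scraping snippet below but not shown in the article. A minimal sketch, assuming the crawler runs daily and only wants articles published since the previous run, might look like this:

from datetime import datetime, timedelta


def validate_article_date(self, article_publication_date):
    # Only keep articles published within the last day, since the crawler
    # runs daily and older articles are already stored.
    yesterday = datetime.now().date() - timedelta(days=1)
    return article_publication_date >= yesterday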

Once we pull down all URLs to news articles, we can use BeautifulSoup to scrape the relevant HTML content and store it in our DynamoDB table. Along with the content itself, we store additional metadata so we can classify each article during the text generation phase.

# Check if the article link is valid
if self.validate_link(article_link, url):

    # Pull down article data and get the publication date
    # (find_date is presumably htmldate's find_date helper)
    r = requests.get(article_link, headers=self.build_headers())
    article_publication_date = find_date(r.text)

    if article_publication_date is not None:

        # Convert date string to a Python date
        article_publication_date_datetime = datetime.strptime(article_publication_date, '%Y-%m-%d').date()

        if self.validate_article_date(article_publication_date_datetime):

            log.info(f"Scraping URL: {article_link}")

            # Parse the article page so the title comes from the article itself
            soup = BeautifulSoup(r.content, 'html.parser')

            # Get title and article content
            article_title = soup.title.get_text()
            article_content = self.parse_article_content(article_link)

            current_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

            # Save scraped content to the DynamoDB table
            response = self.table.put_item(
                Item={
                    'URL': article_link,
                    'Source': url,
                    'LoadTime': current_time,
                    'PublicationDate': article_publication_date,
                    'Title': article_title,
                    'Content': article_content
                }
            )
            log.info(f"Scraped data saved to DynamoDB for article: {article_link}")

A lot more code was involved in creating a full crawler adaptable to multiple sources, but this is the general blueprint.

Once I finished making the crawler, I set it up to crawl a handful of interesting technology news sources to feed our newsletter. The crawling process implemented above runs every day and stores all new articles from a variety of sources.
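
The daily run is driven by a scheduled trigger. The exact setup isn't covered in this article, but a sketch of an EventBridge schedule created with boto3 (the rule name and Lambda ARN below are placeholders) would look roughly like this:

import boto3

events = boto3.client("events")

# Placeholder names; substitute your own rule name and Lambda ARN.
RULE_NAME = "daily-news-crawl"
CRAWLER_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:news-crawler"

# Fire once a day; a similar rule with a weekly schedule drives the text synthesizer.
events.put_rule(Name=RULE_NAME, ScheduleExpression="rate(1 day)")
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "crawler", "Arn": CRAWLER_LAMBDA_ARN}],
)
# (EventBridge also needs permission to invoke the Lambda, granted via lambda add_permission.)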

The crawler currently fetches and stores around 300 news articles in DynamoDB every day.

News Dynamo Table

Text Synthesizer

Once I had a rich selection of news being fetched and stored daily, it was time to utilize the data to generate weekly newsletters.

There are many potential ways to use a news database to generate content. Since many recent LLMs have a very large context window (the amount of text that can be fed as input), it is possible to take a brute-force approach and feed every article into a single prompt. However, this would not be very efficient, could be very costly, and may not perform well, since LLMs struggle to synthesize information spread across a large context window.

A better approach is to split the newsletter into specific weekly sections, develop multiple prompts for each section, and then use an approach called Retrieval-Augmented Generation (RAG) to fetch the most relevant articles for a section. Finally, a handful of the most relevant articles would be fed into the prompt as input, so only a few specific sources are added to a prompt at a time.

My initial newsletter was set up to have 5 main sections:

  • Brain Blast (Tech story of the week)
  • Startup of the Week
  • Hacker No Hacking (Section about major cybersecurity attacks that have occurred)
  • Tech Giants
  • Bits & Ledgers (Section on crypto news)

Prompt Store

Prompts for individual sections of the newsletter are stored in a configuration file. Each prompt configuration contains a QUERY indicating the prompt used to search for relevant articles in our database, SOURCES indicating which sources to search, and a TASK indicating the prompt used to generate that section of the newsletter.

Example prompts for the Brain Blast and Startup of the Week sections are shown below. The Context: {context} part of the prompt is a variable where we feed in relevant articles fetched using Retrieval-Augmented Generation. One to ten articles fetched from our database are inserted into this part of the prompt to provide refined information to synthesize.

BRAIN_BLAST_PROMPT = {
    'TITLE': 'Brain Blast',
    'QUERY': """
    Find the latest and most interesting technology news.
    Include breakthroughs in artificial intelligence, new product
    launches, significant software updates,
    notable mergers and acquisitions in the tech industry,
    cybersecurity developments, and emerging tech trends.
    Prioritize stories that have significant impact or showcase
    innovation.
    """,
    'SOURCES': [TECH_CRUNCH, TECHNOLOGY_REVIEW, THE_VERGE, SPACENEWS],
    'TASK': """
    Write two news articles,
    each consisting of at least 4 paragraphs with a
    minimum of seven sentences per paragraph,
    focusing on the most significant technological breakthrough
    of the past week.

    Only use information provided in the context.

    Context: {context}
    """
}

STARTUP_NEWS = {
    'TITLE': 'Startup of the Week',
    'QUERY': """
    Find a tech startup that has recently achieved a
    significant technological breakthrough or milestone.
    """,
    'SOURCES': [TECH_STARTUPS, GEEK_WIRE, TECH_CRUNCH],
    'TASK': """
    Write two news articles, each consisting of at least 4 paragraphs
    with a minimum of seven sentences per paragraph,
    focusing on a tech startup that has recently achieved a significant
    technological breakthrough or milestone within the last week.

    Only use information provided in the context.

    Context: {context}
    """
}

Since most LLMs do not have access to up-to-date news, it is important to instruct the model to only use information from the articles provided in the context. This helps ensure that the generated content stays focused and grounded, reducing the risk of hallucinations.

Retrieval Augmented Generation

Retrieval-Augmented Generation (or RAG) is a prompt engineering technique in which relevant sources are first retrieved for a prompt (typically with an embedding model and a vector search), and an LLM then writes a response to the augmented prompt using the fetched sources.

This can easily be done using open-source GenAI libraries such as LangChain.

Implementing this process involves embedding all articles to create a vector database. Once the vector database is created, we can use a LangChain retriever to return relevant documents given one of our predefined prompt queries.

I used Google's embedding-001 model (via GoogleGenerativeAIEmbeddings) to generate embedding representations for all of the relevant sources, and used FAISS to create a vector database that allows for efficient retrieval.

# LangChain imports (module paths may vary slightly between versions)
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.document_loaders import DataFrameLoader
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter


def load_retriever(sources):

    # Embedding model
    gemini_embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

    # Query the Dynamo table to find articles from specified sources
    df = query_table(sources)

    # Load documents into LangChain
    loader = DataFrameLoader(df, page_content_column="ArticleData")
    documents = loader.load()

    # Split document text
    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)

    # Create a vector DB and a retriever that returns the top 10 chunks per query
    db = FAISS.from_documents(texts, gemini_embeddings)
    retriever = db.as_retriever(search_kwargs={"k": 10})

    return retriever
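
query_table is a small helper (not shown in the article) that pulls recent articles for the given sources out of DynamoDB and returns them as a pandas DataFrame. A rough sketch, assuming the table layout from the crawler above, a placeholder table name, and a combined Title + Content column named ArticleData, might look like this:

from datetime import datetime, timedelta

import boto3
import pandas as pd
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("NewsArticles")  # placeholder table name


def query_table(sources):
    # Only keep articles published within the last week
    cutoff = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')

    # Scan for recent articles from the requested sources (pagination omitted for brevity)
    response = table.scan(
        FilterExpression=Attr("Source").is_in(sources) & Attr("PublicationDate").gte(cutoff)
    )
    df = pd.DataFrame(response["Items"])

    # Combine title and body into the single text column the retriever expects
    df["ArticleData"] = df["Title"] + "\n\n" + df["Content"]
    return df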

Once the retriever is set up, the RAG process is completed by defining a generate_text function. This function fetches relevant articles and feeds them into our generation prompt. I decided to use GPT-4o to generate the final article text since it seemed to do a good job of following directions and produced few hallucinations.

The following code demonstrates how to retrieve relevant documents for a prompt and use them to augment the prompt using the LangChain framework.

def generate_text(prompt, retriever):

    # Text generation model
    model = ChatOpenAI(model_name="gpt-4o", max_tokens=3000, temperature=.85)

    # Find the most relevant docs for this prompt (the retriever returns the top 10)
    relevant_docs = retriever.invoke(prompt['QUERY'])
    log.info(f"Relevant Docs Returned: {relevant_docs}")

    # Inject relevant documents into the prompt
    promptTemplate = PromptTemplate(
        input_variables=["context"],
        template=" ".join([SYSTEM_PROMPT, prompt['TASK']]),
    )

    # Create a basic LLM chain to generate text from our augmented prompt
    chain = LLMChain(llm=model, prompt=promptTemplate)
    result = chain.run({"context": relevant_docs})

    return result

Tying it all together

Finally, the text generation process is tied together in a Lambda function that performs the following steps:

  1. Iterates through all the prompts
  2. Loads a retriever for relevant sources
  3. Queries relevant articles
  4. Feeds them into a final prompt to generate a section of the newsletter

Once each section of the newsletter is generated, it is appended to a .txt file that is saved to an S3 bucket. When it is time to send out the newsletter, these text snippets can be manually checked, refined, and copied into the weekly edition.

def lambda_handler(event, context):

    current_date = datetime.now().strftime('%Y-%m-%d')
    local_file_name = f'/tmp/newsletter-{current_date}.txt'
    s3_file_name = f'GeneratedContent/newsletter-{current_date}.txt'

    # Iterate through all prompts
    for prompt in PROMPTS:

        # Get configured sources for this prompt
        sources = prompt['SOURCES']
        log.info(f"Loading Retriever using sources {sources}")

        # Load a retriever/vector db
        try:
            retriever = load_retriever(sources)
        except Exception:
            log.info("Cannot generate content, no sources available this week")
            continue

        # Generate text
        log.info(f"Generating text for prompt: {prompt['TITLE']}")
        text = generate_text(prompt, retriever)
        log.info(f"Generated Text: {text}")

        # Append this section to the local file
        with open(local_file_name, 'a') as file:
            file.write(prompt['TITLE'] + '\n\n')
            file.write(text + '\n\n')

    # Upload the completed file to the S3 bucket
    try:
        s3.upload_file(local_file_name, BUCKET_NAME, s3_file_name)
    except ClientError as e:
        raise Exception(f"Failed to upload file: {e.response['Error']['Message']}") from e

    return {
        'statusCode': 200,
        'body': f'File {s3_file_name} uploaded to {BUCKET_NAME} successfully.'
    }

The total process takes around 50–60 seconds per prompt, meaning we can generate a full newsletter in under 5 minutes. Not bad!

Given the predefined prompts shown above, below is an example of what the software can generate for the “Brain Blast” section of the newsletter, which is the main headline section. During the week this was generated, the retriever determined that stories about Apple’s new iPhone capabilities were the important weekly story. Articles on this topic were then fed into GPT-4o to produce a fun and engaging story for the newsletter.

### Apple’s AI Adventure: Meet Siri 2.0!

Hold onto your iPhones, folks! Apple has just dropped its latest bombshell, and it’s all about AI. Yep, Apple’s rolling out a brand-new set of AI capabilities dubbed Apple Intelligence, and it’s coming to your iOS 18. The star of the show? A supercharged Siri that’s ready to take on the world — or at least, your daily digital dilemmas.

Now, don’t get too excited — this isn’t one of those “blink and you miss it” updates. Apple’s cooking up something big. We’re talking about AI-driven features that make your iPhone smarter than ever before. Visual search is one of the highlights, letting you snap photos of stuff and get instant info — because who has time to actually, like, Google things anymore? And in a move that might finally make Siri cool again, Apple’s giving her a serious upgrade. Think of it as Siri 2.0: sassier, smarter, and a lot more useful.

Of course, it’s not just about fancy features. Apple is partnering with other tech titans like Google to make these AI capabilities top-notch. Imagine having a visual search that doesn’t just work but works flawlessly. You point your camera at something, and bam! Instant knowledge. The kind of stuff that makes you wonder how you ever lived without it.

But why stop there? Apple’s also rolling out AI tools for the Apple Watch with watchOS 11. Translation features and an upgraded Smart Stack are set to make your wrist gadget smarter than ever. That’s right — now your watch can do more than just remind you to stand up every hour. It’s practically a mini polyglot, ready to break down language barriers with a flick of your wrist.

And let’s not forget about the new iPhone 16, which comes with a dedicated camera button. Because why should you have to fumble for the camera app when you need to capture that perfect moment? Just press and click. It’s simple, it’s sleek, and it’s so very Apple. All these upgrades are making one thing clear: Apple isn’t just playing catch-up in the AI game; it’s setting out to lead the pack. So, charge up your devices and get ready for the future — because it’s coming faster than you can say “Hey Siri.”

Key Takeaways

Is AI-augmented news really that useful? Potentially.

Personally, I think that the idea of using advanced LLMs to help source and synthesize information is powerful. There is so much information out there that humans just do not have the time to read on their own. AI-powered software can serve as an amazing tool to help us make sense of all the information out there, pinpointing details that we might miss.

However, a world where all written content is AI-generated would be dystopian. If AI models produced all content, nothing would have originality or charm. Organic human content is far more valuable, and I don’t think even the biggest AI enthusiast would want to live with a dead internet. Another pressing concern is that everything needs to be fact-checked. The risk of hallucinations in LLMs is high, and a world filled with LLM-generated content could cause misinformation to spread rapidly. We are still in the early stages of this technology, and there is a lot to improve to keep it from going off course.

I think a good balance is using LLMs only to synthesize existing information, not to come up with their own creative works. Since models are trained only on existing text, they don’t have the human ability to innovate and create exciting new developments in writing. However, their keen ability to make sense of a large corpus with little effort makes them perfect for a summary or snapshot of existing information.

Next Steps

Currently, I am running this software to help generate weekly content for my newsletter. I think it is a cool test to see whether subscribers actually find the content valuable. It has also been a great way for me to learn about generative AI and prompt engineering.

In the future, I could see this tool being useful for more than just our humble newsletter. If the prompts are changed and the sources are updated, it could theoretically be used for any topic. As long as the crawlers are pointed at the right sources, it could generate an up-to-date “news flash” for pretty much anything.
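
As a quick illustration (purely hypothetical; this section and these source constants are not part of TechFizz), repurposing the tool for, say, climate-tech news would mostly mean adding a new prompt configuration in the same format and wiring up crawlers for the new sources:

# Hypothetical section for a different newsletter topic; CLEANTECH_NEWS and
# CANARY_MEDIA would be source constants configured in the crawler first.
CLIMATE_TECH_PROMPT = {
    'TITLE': 'Watts New',
    'QUERY': """
    Find the most significant climate and energy technology news of the week,
    such as breakthroughs in batteries, solar, grid software, or carbon removal.
    """,
    'SOURCES': [CLEANTECH_NEWS, CANARY_MEDIA],
    'TASK': """
    Write one news article of at least 3 paragraphs covering the most significant
    climate technology development of the past week.

    Only use information provided in the context.

    Context: {context}
    """
}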

On the technical side of things, I would love to improve my crawling process so it can find sources of information completely on its own. I have been doing some initial tests with LLM-driven crawling libraries like Firecrawl, and am excited to integrate further with these tools.

Conclusion

In this article, I went through the process of creating a newsletter that writes itself from scraped web content.

All the code from this article is currently in a private repository. However, if you are interested in getting involved, please reach out and I will personally send you my code.

If you want cutting edge tech news sourced weekly, subscribe to TechFizz!

As always, thanks for reading.
