How to Extract Text from Any URL with an API
Your AI agent needs to read a web page. You could parse raw HTML yourself, fight with JavaScript-rendered content, and strip out nav bars manually. Or you could make one API call and get clean text back.
Why Text Extraction Matters Now
The rise of RAG (Retrieval-Augmented Generation) pipelines and AI agents has created heavy demand for clean text from web pages. LLMs don't need HTML tags, cookie banners, or navigation menus. They need the actual content of the page: the article text, the product description, the documentation.
Text extraction sounds simple until you try it. Many modern web pages are JavaScript-rendered SPAs, wrapped in cookie consent modals, littered with ads, and structured in ways that make DOM parsing a nightmare. A good text extraction API handles all of that for you.
Who Needs This
- AI/LLM developers — Feed web content into RAG pipelines without HTML noise
- Content aggregators — Pull article text for summaries, analysis, or archival
- SEO tools — Analyze on-page content, word count, and keyword density
- Research tools — Extract data from web sources at scale
- Chatbots and agents — Give your AI the ability to "read" any URL a user shares
How Text Extraction Works Under the Hood
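A text extraction API typically follows this pipeline: it loads the URL in a headless browser so that JavaScript runs, identifies the main content, strips boilerplate such as navigation menus, ads, and cookie banners, and returns clean text along with metadata such as the title and description.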
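The boilerplate-stripping stage of such a pipeline can be sketched with Python's standard library alone. This is a minimal illustration, not how any particular API works: it assumes the HTML has already been fetched and rendered, and the `SKIP_TAGS` set, the `MainTextParser` class, and the `extract_main_text` function are hypothetical names chosen for this example. Real services use far more sophisticated content scoring.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate in this sketch.
# This set is an assumption, not the rule set of any real API.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class MainTextParser(HTMLParser):
    """Collects the <title> and the visible text outside of SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside a skipped element
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data.strip()
        elif self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html: str) -> dict:
    """Return title, text, and word count for already-rendered HTML."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return {"title": parser.title, "text": text, "word_count": len(text.split())}
```

Dropping whole elements by tag name is the crudest possible heuristic. It misses boilerplate placed in plain `div` elements, which is one reason a hosted extraction service can be worth paying for.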
Current Options Compared
Mozilla Readability (Self-Hosted)
The open-source library behind Firefox Reader View.
- Strength: free, battle-tested, excellent article extraction
- Weakness: JavaScript library (needs a DOM implementation such as jsdom when run in Node.js), no JavaScript rendering of its own, you host and maintain it, struggles with SPAs
- Verdict: Great for static HTML articles. Falls apart on modern JS-rendered sites.
Diffbot
Enterprise-grade web scraping and structured data extraction.
- Strength: best accuracy, handles any page type, returns structured data (not just text)
- Weakness: starts at $299/month, complex API with many product tiers, overkill for text extraction
- Verdict: Unbeatable for enterprise data extraction. Way too expensive for "just give me the text."
Rebirth API — Text Extract Endpoint
Clean text from any URL. One call, plain text or markdown back.
- Strength: handles JS-rendered pages, returns clean text + metadata, 100 free calls/month, same API key as everything else
- Weakness: newer platform, no structured data extraction (articles only, not product pages with fields)
- Verdict: Best for developers who need "URL in, text out" without enterprise pricing.
| Feature | Readability | Diffbot | Rebirth API |
|---|---|---|---|
| JS-rendered pages | No | Yes | Yes |
| Hosting required | Yes (self-host) | No | No |
| Free tier | Unlimited (OSS) | 14-day trial | 100/month |
| Structured data | No | Yes | Metadata only |
| Cost at scale | Free (+ hosting) | $299+/mo | $49/mo |
Code Examples
cURL
curl -X POST https://rebirthapi.com/api/v1/text-extract \
  -H "Authorization: Bearer rb_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/blog/some-article"}'
# Response:
# {
#   "title": "Some Article Title",
#   "description": "The meta description...",
#   "text": "The full extracted article text, clean and ready for LLM consumption...",
#   "word_count": 1247,
#   "url": "https://example.com/blog/some-article"
# }
Building a RAG Pipeline
import requests
from openai import OpenAI

REBIRTH_KEY = "rb_live_your_key"
HEADERS = {"Authorization": f"Bearer {REBIRTH_KEY}", "Content-Type": "application/json"}
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_text(url: str) -> str:
    """Extract clean text from any URL. Returns an empty string on failure."""
    try:
        res = requests.post(
            "https://rebirthapi.com/api/v1/text-extract",
            headers=HEADERS,
            json={"url": url},
            timeout=30,
        )
        res.raise_for_status()
    except requests.RequestException:
        # Network errors, timeouts, and non-2xx responses all land here
        return ""
    return res.json().get("text", "")


def ask_about_url(url: str, question: str) -> str:
    """Extract text from a URL and answer a question about it."""
    context = extract_text(url)
    if not context:
        return "Could not extract text from that URL."
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # Note: [:8000] truncates by characters, not tokens
            {"role": "system", "content": f"Answer based on this context:\n\n{context[:8000]}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


# Your AI agent can now "read" any web page
answer = ask_about_url(
    "https://docs.stripe.com/payments/accept-a-payment",
    "What are the steps to accept a payment with Stripe?",
)
print(answer)
AI Agent Tool (Node.js)
// Define as a tool for your AI agent
const readWebPageTool = {
  name: 'read_web_page',
  description: 'Extract and read the text content of any web page URL',
  parameters: {
    type: 'object',
    properties: {
      url: { type: 'string', description: 'The URL to read' }
    },
    required: ['url']
  },
  execute: async ({ url }) => {
    const res = await fetch('https://rebirthapi.com/api/v1/text-extract', {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer rb_live_your_key',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ url })
    });
    // Return a readable message instead of throwing, so the agent can recover
    if (!res.ok) {
      return `Could not read ${url} (HTTP ${res.status}).`;
    }
    const data = await res.json();
    return `Title: ${data.title}\n\n${data.text}`;
  }
};
// Now your agent can read any URL a user shares in chat
Tips for Production Use
- Cache aggressively — Web page content changes slowly. Cache extracted text for 1-24 hours to reduce API calls.
- Truncate for LLMs — Most LLMs have context limits. Extract the text, then truncate to 4K-8K tokens before passing to your model.
- Handle failures gracefully — Some pages block bots, require auth, or return errors. Always have a fallback message.
- Respect robots.txt — While the API handles this, be mindful of sites that explicitly prohibit automated access.
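The caching and truncation tips can be sketched in a few lines of Python. This is one possible approach, not a prescribed one: `TTLCache` and `truncate_words` are hypothetical helpers, the fetch function is passed in so that any extractor can be plugged in, and word-based truncation is only a rough stand-in for counting tokens with your model's tokenizer.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Caches extracted text per URL for ttl_seconds (in-memory only)."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # any function mapping a URL to extracted text
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]        # still fresh: no API call
        text = self._fetch(url)
        if text:                 # do not cache failures (empty text)
            self._store[url] = (now, text)
        return text


def truncate_words(text: str, max_words: int) -> str:
    """Keep the first max_words words; a rough proxy for a token limit."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])
```

For example, `TTLCache(extract_text, ttl_seconds=6 * 3600)` would wrap the `extract_text` function from the Python example so that repeated questions about the same URL reuse one extraction for six hours.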
FAQ
Can it handle JavaScript-rendered pages?
Yes. The API uses headless browser rendering to execute JavaScript before extracting text, so SPAs and dynamic content work correctly.
What about paywalled content?
The API extracts what's publicly visible. If a page requires login or payment to see the full content, you'll get whatever is shown to unauthenticated visitors.
How is this different from web scraping?
Web scraping extracts raw HTML or specific DOM elements. Text extraction goes further: it identifies the main content, strips boilerplate, and returns clean readable text. Think of it as scraping + intelligent parsing in one step.
Can I use this for SEO content analysis?
Absolutely. Extract text from your pages and competitors, compare word counts, analyze keyword density, and identify content gaps — all programmatically.
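As a rough sketch of that kind of analysis, the function below computes a word count and the most frequent terms for a piece of extracted text. `keyword_density` is a hypothetical helper, and it assumes a simple lowercase word tokenizer with no stop-word filtering, so common words like "the" will dominate unless you filter them out yourself.

```python
import re
from collections import Counter


def keyword_density(text: str, top_n: int = 5) -> dict:
    """Return the word count and the top_n terms as (word, count, share)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    total = len(words)
    counts = Counter(words)
    top = [(w, c, c / total) for w, c in counts.most_common(top_n)] if total else []
    return {"word_count": total, "top_keywords": top}
```

Running it on the `text` field from your own page and from a competitor's page gives two comparable profiles to diff.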
Try Text Extraction Free
100 free calls/month. Extract text from any URL. No credit card required.