March 28, 2026 · 10 min read · Tutorial

How to Extract Text from Any URL with an API

Your AI agent needs to read a web page. You could parse raw HTML yourself, fight with JavaScript-rendered content, and strip out nav bars manually. Or you could make one API call and get clean text back.

Why Text Extraction Matters Now

The rise of RAG (Retrieval-Augmented Generation) pipelines and AI agents has created a massive demand for clean text from web pages. LLMs don't want HTML tags, cookie banners, and navigation menus. They want the actual content of the page — the article text, the product description, the documentation.

Text extraction sounds simple until you try it. Modern web pages are JavaScript-rendered SPAs, wrapped in cookie consent modals, littered with ads, and structured in ways that make DOM parsing a nightmare. A good text extraction API handles all of that for you.

Who Needs This

AI/LLM developers — Feed web content into RAG pipelines without HTML noise
Content aggregators — Pull article text for summaries, analysis, or archival
SEO tools — Analyze on-page content, word count, and keyword density
Research tools — Extract data from web sources at scale
Chatbots and agents — Give your AI the ability to "read" any URL a user shares

How Text Extraction Works Under the Hood

A text extraction API typically follows this pipeline:

Fetch — Load the URL, including JavaScript-rendered content (headless browser)

Parse — Build a DOM tree and identify content vs. boilerplate (nav, footer, ads)

Extract — Pull the main content text, title, description, and metadata

Clean — Strip remaining HTML, normalize whitespace, return plain text or markdown
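The Parse, Extract, and Clean steps can be sketched with nothing but the Python standard library. This is a minimal illustration under stated assumptions, not how any particular API works: it takes HTML that has already been fetched (no headless browser, so no JavaScript rendering), and it treats a fixed set of tags as boilerplate. The names BOILERPLATE_TAGS, MainTextParser, and extract_main_text are made up for this example.

```python
from html.parser import HTMLParser

# Tags whose contents this sketch treats as boilerplate. This fixed list is an
# assumption for illustration; real extractors use far richer heuristics.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style", "noscript", "form"}


class MainTextParser(HTMLParser):
    """Collects the <title> and the visible text outside boilerplate tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # greater than 0 while inside a boilerplate element
        self.in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)
        elif self.skip_depth == 0:
            self.text_parts.append(data)


def extract_main_text(html):
    """Parse -> Extract -> Clean for an HTML string that was already fetched."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    # Clean: collapse every run of whitespace into a single space.
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join(" ".join(parser.text_parts).split())
    return {"title": title, "text": text, "word_count": len(text.split())}
```

Calling extract_main_text on a page with a nav bar and a footer returns only the title and the article body. A sketch like this breaks down quickly on real pages (content inside generic div elements, unclosed tags, client-side rendering), which is the gap the tools compared below try to fill.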

Current Options Compared

Mozilla Readability (Self-Hosted)

The open-source library behind Firefox Reader View.

Strength: free, battle-tested, excellent article extraction
Weakness: JavaScript only (needs a DOM, such as jsdom under Node.js), no JavaScript rendering, you host and maintain it, struggles with SPAs
Verdict: Great for static HTML articles. Falls apart on modern JS-rendered sites.

Diffbot

Enterprise-grade web scraping and structured data extraction.

Strength: best accuracy, handles any page type, returns structured data (not just text)
Weakness: starts at $299/month, complex API with many product tiers, overkill for text extraction
Verdict: Unbeatable for enterprise data extraction. Way too expensive for "just give me the text."

Rebirth API — Text Extract Endpoint

Clean text from any URL. One call, plain text or markdown back.

Strength: handles JS-rendered pages, returns clean text + metadata, 100 free calls/month, same API key as everything else
Weakness: newer platform, no structured data extraction (articles only, not product pages with fields)
Verdict: Best for developers who need "URL in, text out" without enterprise pricing.

Feature	Readability	Diffbot	Rebirth API
JS-rendered pages	No	Yes	Yes
Hosting required	Yes (self-host)	No	No
Free tier	Unlimited (OSS)	14-day trial	100/month
Structured data	No	Yes	Metadata only
Cost at scale	Free (+ hosting)	$299+/mo	$49/mo

Code Examples

cURL

bash

curl -X POST https://rebirthapi.com/api/v1/text-extract \
  -H "Authorization: Bearer rb_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/blog/some-article"}'

# Response:
# {
#   "title": "Some Article Title",
#   "description": "The meta description...",
#   "text": "The full extracted article text, clean and ready for LLM consumption...",
#   "word_count": 1247,
#   "url": "https://example.com/blog/some-article"
# }
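If you consume that response from Python, it can help to parse it into a typed object up front. The sketch below assumes the response contains exactly the five fields shown in the sample above; any other fields, and the shape of error responses, are not covered here. ExtractResult and parse_extract_response are hypothetical names for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractResult:
    """Mirrors the fields in the sample response; nothing else is assumed."""
    title: str
    description: str
    text: str
    word_count: int
    url: str


def parse_extract_response(body: str) -> ExtractResult:
    """Turn a JSON response body into an ExtractResult, tolerating missing fields."""
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    text = data.get("text") or ""
    return ExtractResult(
        title=data.get("title") or "",
        description=data.get("description") or "",
        text=text,
        # Fall back to counting words locally if the field is absent.
        word_count=int(data.get("word_count") or len(text.split())),
        url=data.get("url") or "",
    )
```

Missing fields default to empty strings so downstream code can check result.text without guarding every attribute.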

Building a RAG Pipeline

Python

import requests
from openai import OpenAI

REBIRTH_KEY = "rb_live_your_key"
HEADERS = {"Authorization": f"Bearer {REBIRTH_KEY}", "Content-Type": "application/json"}
client = OpenAI()

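def extract_text(url: str) -> str:
    """Extract clean text from any URL. Returns "" on any failure."""
    try:
        res = requests.post(
            "https://rebirthapi.com/api/v1/text-extract",
            headers=HEADERS,
            json={"url": url},
            timeout=30,  # requests has no default timeout and could hang forever
        )
        res.raise_for_status()  # treat non-2xx responses as failures
        return res.json().get("text", "")
    except (requests.RequestException, ValueError):
        # Network error, HTTP error status, or a body that is not valid JSON
        return ""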

def ask_about_url(url: str, question: str) -> str:
    """Extract text from a URL and answer a question about it."""
    context = extract_text(url)

    if not context:
        return "Could not extract text from that URL."

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n\n{context[:8000]}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

# Your AI agent can now "read" any web page
answer = ask_about_url(
    "https://docs.stripe.com/payments/accept-a-payment",
    "What are the steps to accept a payment with Stripe?"
)
print(answer)
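The example above sends only the first 8,000 characters of the page to the model, so anything further down a long page is lost. One common alternative, which is not part of the original example, is to split the text into overlapping chunks and pass along only the chunks that best match the question. The sketch below uses plain word overlap as the relevance score to stay dependency-free; a production pipeline would more likely use embeddings. The function names and default sizes are assumptions.

```python
import re


def split_into_chunks(text, chunk_size=1000, overlap=200):
    """Split text into character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the chunk just added already reaches the end of the text
        start += step
    return chunks


def _words(s):
    """Lowercased set of alphanumeric words in s."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def select_relevant_chunks(chunks, question, top_k=3):
    """Return up to top_k chunks sharing the most words with the question,
    kept in their original document order."""
    q = _words(question)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: len(q & _words(chunks[i])),
        reverse=True,
    )
    return [chunks[i] for i in sorted(ranked[:top_k])]
```

In ask_about_url, the system prompt could then be built from "\n\n".join(select_relevant_chunks(split_into_chunks(context), question)) instead of context[:8000].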

AI Agent Tool (Node.js)

JavaScript

// Define as a tool for your AI agent
const readWebPageTool = {
  name: 'read_web_page',
  description: 'Extract and read the text content of any web page URL',
  parameters: {
    type: 'object',
    properties: {
      url: { type: 'string', description: 'The URL to read' }
    },
    required: ['url']
  },
  execute: async ({ url }) => {
    const res = await fetch('https://rebirthapi.com/api/v1/text-extract', {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer rb_live_your_key',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ url })
    });
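    if (!res.ok) {
      // Return a message the agent can relay instead of parsing an error body
      return `Could not read ${url} (HTTP ${res.status}).`;
    }
    const data = await res.json();
    return `Title: ${data.title}\n\n${data.text}`;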
  }
};

// Now your agent can read any URL a user shares in chat

Tips for Production Use

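Cache aggressively: Most article and documentation pages change infrequently. Cache extracted text for 1-24 hours, depending on how fresh your use case needs it to be, to reduce API calls.
Truncate for LLMs: LLMs have context limits. Trim the extracted text to fit your model's budget (4K-8K tokens is a common target) before passing it on. Slicing a string, as context[:8000] does in the Python example, counts characters rather than tokens, so treat it as a rough cap.
Handle failures gracefully: Some pages block bots, require auth, or return errors. Check the response status and always have a fallback message.
Respect robots.txt: The API handles this, but be mindful of sites that explicitly prohibit automated access.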
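The first three tips can be combined in a small wrapper. The sketch below is one possible shape, not a recommended implementation: it keeps an in-memory cache with a time-to-live, returns a fallback message when extraction fails, and truncates using a rough four-characters-per-token estimate (an approximation for English text, not a real tokenizer). CachedExtractor, truncate_to_token_budget, and the constants are hypothetical; the fetch argument stands in for any function that maps a URL to text, such as extract_text from the Python example.

```python
import time

CACHE_TTL_SECONDS = 3600  # 1 hour, the low end of the 1-24 hour range above
CHARS_PER_TOKEN = 4       # rough estimate for English text; an assumption
FALLBACK_MESSAGE = "Could not extract text from that URL."


class CachedExtractor:
    """Wraps a fetch(url) -> str function with a TTL cache and a fallback."""

    def __init__(self, fetch, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._cache = {}  # url -> (time stored, text)

    def get(self, url):
        now = self._clock()
        hit = self._cache.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        try:
            text = self._fetch(url)
        except Exception:
            # Broad on purpose for the sketch; failures are never cached.
            return FALLBACK_MESSAGE
        if not text:
            return FALLBACK_MESSAGE
        self._cache[url] = (now, text)
        return text


def truncate_to_token_budget(text, max_tokens=8000):
    """Trim text to roughly max_tokens, cutting at a word boundary if possible."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut
```

A typical call chain would be truncate_to_token_budget(CachedExtractor(extract_text).get(url)). For exact token counts, swap the character estimate for your model's own tokenizer, and for multi-process deployments replace the dictionary with a shared cache.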

FAQ

Can it handle JavaScript-rendered pages?

Yes. The API uses headless browser rendering to execute JavaScript before extracting text, so SPAs and dynamic content work correctly.

What about paywalled content?

The API extracts what's publicly visible. If a page requires login or payment to see the full content, you'll get whatever is shown to unauthenticated visitors.

How is this different from web scraping?

Web scraping extracts raw HTML or specific DOM elements. Text extraction goes further: it identifies the main content, strips boilerplate, and returns clean readable text. Think of it as scraping + intelligent parsing in one step.

Can I use this for SEO content analysis?

Absolutely. Extract text from your pages and competitors, compare word counts, analyze keyword density, and identify content gaps — all programmatically.
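As a rough illustration of that analysis, the sketch below computes word count, keyword density, and a naive content-gap list from two already-extracted texts. The tokenizer and the gap heuristic (words of four or more letters that a competitor uses and you do not) are simplifying assumptions, and all function names are made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_count(text):
    return len(tokenize(text))


def keyword_density(text, keyword):
    """Fraction of tokens equal to a single-word keyword (0.0 for empty text)."""
    words = tokenize(text)
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


def content_gaps(own_text, competitor_text, top_n=10, min_length=4):
    """Most frequent competitor words (min_length+ chars) absent from your text."""
    own = set(tokenize(own_text))
    counts = Counter(
        w for w in tokenize(competitor_text)
        if len(w) >= min_length and w not in own
    )
    return [w for w, _ in counts.most_common(top_n)]
```

Feed it the text field from two extraction calls, one for your page and one for a competitor's. Multi-word phrases, stemming, and stop-word filtering are left out to keep the example short.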

Try Text Extraction Free

100 free calls/month. Extract text from any URL. No credit card required.
