Convert any webpage into LLM-ready Markdown using Gaffa

The ability to convert websites into LLM-friendly markdown is powerful when building applications for summarization, Q&A, or knowledge extraction. In this guide, you'll learn how to use the Gaffa API to extract the main content of any web page using browser rendering and convert it into structured markdown.

By the end of this guide, you’ll be able to:

Render web pages using Gaffa’s API.
Extract clean page content.
Generate structured markdown suitable for LLM-based Q&A or summarization.

Prerequisites

Install Python 3.10 or newer.
Create and activate a virtual environment (macOS/Linux shell shown):

python -m venv venv && source venv/bin/activate

Install the required libraries:

pip install requests openai

Get your Gaffa API key and OpenAI API key, and store them as environment variables:

export GAFFA_API_KEY=your_gaffa_api_key
export OPENAI_API_KEY=your_openai_api_key
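
Because the script reads both keys with os.getenv, a missing variable only surfaces later as a failed API request. A minimal sketch of an early check, assuming a hypothetical helper named require_env (it is not part of the Gaffa or OpenAI libraries):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    # require_env is a hypothetical helper introduced for illustration only.
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could be used in place of the plain lookups, for example GAFFA_API_KEY = require_env("GAFFA_API_KEY"), so the script stops immediately when a key is missing.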

Convert a webpage to Markdown

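The code below defines a function that takes a URL and sends a POST request to the Gaffa API. The request runs two actions: wait, which pauses until an article element appears on the page, and generate_markdown, which uses the browser rendering engine to extract the main content of the page and convert it into markdown. The API response contains a URL for the generated markdown file, which the function then downloads and returns as text.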

import os

import requests
import openai

GAFFA_API_KEY = os.getenv("GAFFA_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Fetch the markdown content from Gaffa
def fetch_markdown_with_gaffa(url):
    payload = {
        "url": url,
        "proxy_location": None,
        "async": False,
        "max_cache_age": 0,
        "settings": {
            "record_request": False,
            "actions": [
                {
                    "type": "wait",
                    "selector": "article"
                },
                {
                    "type": "generate_markdown"
                }
            ]
        }
    }
   
    # Set the headers for the request
    headers = {
        "x-api-key": GAFFA_API_KEY,
        "Content-Type": "application/json"
    }
    # Make the POST request to the Gaffa API
    print("Calling Gaffa API to generate markdown...")
    response = requests.post("https://api.gaffa.dev/v1/browser/requests", json=payload, headers=headers)
    response.raise_for_status()
   
    # Extract the markdown URL from the response
    markdown_url = response.json()["data"]["actions"][1]["output"]
   
    # Fetch the markdown content from the generated URL
    print(f"📥 Fetching markdown from: {markdown_url}")
    markdown_response = requests.get(markdown_url)
    markdown_response.raise_for_status()
   
    return markdown_response.text
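
The function above reads the markdown URL from actions[1], which only works while generate_markdown is the second action in the payload. A small sketch of looking the action up by type instead, assuming each entry in the response's data.actions list echoes the type it was requested with (an assumption about the response shape, worth checking against a real response); find_action_output is a hypothetical helper name:

```python
from typing import Any


def find_action_output(response_json: dict[str, Any], action_type: str) -> str:
    """Return the output of the first action whose type matches action_type."""
    # Assumed shape: {"data": {"actions": [{"type": ..., "output": ...}, ...]}}
    actions = response_json.get("data", {}).get("actions", [])
    for action in actions:
        if action.get("type") == action_type and action.get("output"):
            return action["output"]
    raise ValueError(f"No output found for action type {action_type!r}")
```

With this helper, the lookup would become find_action_output(response.json(), "generate_markdown"), and adding or reordering actions in the payload would not silently pick the wrong output.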

Ask questions using OpenAI

Now that we have the markdown content, we can ask questions about it using the OpenAI API. The function below takes the markdown content and a question as input and asks the model to answer the question based on that content. This example uses the gpt-3.5-turbo model, but you can substitute any other chat model.

def ask_question(markdown, question):
    # Client for openai>=1.0; the older openai.ChatCompletion interface was removed in 1.0
    client = openai.OpenAI(api_key=OPENAI_API_KEY)
    # Only the first 3,000 characters of the markdown are sent, to keep the prompt small
    prompt = (
        "You are an assistant helping analyze different webpages.\n\n"
        f"Markdown content:\n{markdown[:3000]}\n\n"
        f"Question: {question}\nAnswer as clearly as possible."
    )

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

The markdown becomes the model's context, so answers are based on the content of the original page instead of the model's general knowledge.
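
Note that ask_question only sends the first 3,000 characters of the markdown, and a fixed slice can cut a sentence or heading in half. One possible refinement, sketched here with a hypothetical helper named truncate_markdown, is to cut at the last paragraph break that fits within the limit:

```python
def truncate_markdown(markdown: str, max_chars: int = 3000) -> str:
    """Trim markdown to at most max_chars, preferring a paragraph boundary."""
    # truncate_markdown is a hypothetical helper introduced for illustration only.
    if len(markdown) <= max_chars:
        return markdown
    head = markdown[:max_chars]
    # Cut at the last blank line inside the limit, if there is one.
    cut = head.rfind("\n\n")
    if cut > 0:
        return head[:cut]
    # No paragraph break found: fall back to the plain character slice.
    return head
```

In the prompt, markdown[:3000] could then be replaced with truncate_markdown(markdown). For long pages this still drops everything past the limit, so questions about later sections would need a larger limit or a chunking approach.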

User Interaction and Execution

Having defined the functions, we can now create a simple command-line interface that allows users to input a URL and ask questions about the content.

def main():
    url = input("Enter the URL of the article: ")
    try:
        markdown = fetch_markdown_with_gaffa(url)
        print("\n✅ Markdown successfully retrieved from Gaffa.\n")

        while True:
            question = input("Ask a question about the content (or type 'exit'): ")
            if question.lower() == "exit":
                break
            answer = ask_question(markdown, question)
            print(f"\n💬 Answer: {answer}\n")

    except Exception as e:
        print(f"⚠️ Error: {e}")

if __name__ == "__main__":
    main()

Full Script

The full script is available to download from the Gaffa Python Examples GitHub repo.

Running the Script

To run the script, simply execute it in your terminal:

python your_script_name.py

With the script running, enter the URL of a web page. The script fetches the page's markdown content and then lets you ask questions about it until you type exit.

