Overview

SpiderSite uses the Crawl4AI library to extract content from websites, with optional AI-powered data extraction. This guide covers everything from submitting a basic scraping job to extracting structured data with natural language instructions.

Basic Website Scraping

Submit a Simple Job

The most basic form of website scraping extracts all content from a single URL:
import requests

url = "https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit"
headers = {
    "Authorization": "Bearer <your_token>"
}
data = {
    "url": "https://example.com/blog/article"
}

response = requests.post(url, headers=headers, json=data)
job = response.json()
print(f"Job submitted: {job['job_id']}")

What You Get

SpiderSite returns comprehensive data:
  • markdown - Full page content in markdown format
  • screenshot_url - Full-page screenshot (Cloudflare R2 CDN)
  • metadata - Page title, description, keywords, author
  • links - All internal and external links
  • media - Images and media files found on the page
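
Once a job completes, these fields are available under the data object returned by the results endpoint. A minimal sketch of reading them, assuming results holds the parsed JSON from the results endpoint (see the Complete Workflow Example below):

# Assumes `results` is the parsed JSON from GET /jobs/{job_id}/results
data = results["data"]

print(data["metadata"]["title"])   # page title
print(data["screenshot_url"])      # full-page screenshot on the CDN
print(len(data["links"]), "links found")
print(data["markdown"][:500])      # first 500 characters of the markdown content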

AI-Powered Extraction

Using Instructions

Add natural language instructions to extract specific data:
data = {
    "url": "https://blog.example.com/future-of-ai",
    "instructions": """
        Extract the following information:
        - Article title
        - Author name and bio
        - Publication date
        - Article category/tags
        - Full article content (without ads or sidebars)
        - All code snippets or examples
        - References or citations
    """
}

response = requests.post(url, headers=headers, json=data)

Results with AI Extraction

{
  "success": true,
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "spiderSite",
  "status": "completed",
  "data": {
    "url": "https://blog.example.com/future-of-ai",
    "markdown": "# The Future of AI\n\nBy John Doe\n\nPublished: Oct 27, 2025...",
    "screenshot_url": "https://cdn.spideriq.di-atomic.com/screenshots/550e8400.png",
    "extracted_content": {
      "title": "The Future of AI: Trends and Predictions",
      "author": "John Doe - AI Researcher at Tech Corp",
      "publication_date": "2025-10-27",
      "category": ["Artificial Intelligence", "Technology", "Future Trends"],
      "content": "The landscape of artificial intelligence is evolving...",
      "code_snippets": [
        "import tensorflow as tf\nmodel = tf.keras.Sequential([...])"
      ],
      "references": [
        "Smith et al., 2024 - 'Neural Networks Advances'",
        "Johnson, 2025 - 'AI Ethics Framework'"
      ]
    },
    "metadata": {
      "title": "The Future of AI: Trends and Predictions",
      "description": "Explore the latest trends and predictions...",
      "keywords": ["AI", "machine learning", "future tech"],
      "author": "John Doe"
    },
    "links": [
      {"url": "https://blog.example.com/related-article", "text": "Related: ML Basics"}
    ],
    "media": [
      {"url": "https://blog.example.com/images/ai-diagram.png", "alt": "AI Architecture"}
    ]
  }
}
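
The extracted_content object follows whatever structure the instructions asked for, while links and media are plain lists. A short sketch of walking the response above, assuming it has been parsed into a results dict:

# Assumes `results` holds the parsed JSON response shown above
data = results["data"]

# extracted_content mirrors the fields requested in the instructions
extracted = data["extracted_content"]
print(extracted["title"])
print(extracted["publication_date"])

# links and media are lists of dicts with a url plus text / alt
for link in data["links"]:
    print(f"{link['text']} -> {link['url']}")
for image in data["media"]:
    print(f"{image.get('alt', '')}: {image['url']}")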

Common Use Cases

1. Blog Article Extraction

Extract structured data from blog posts:
data = {
    "url": "https://techblog.com/article",
    "instructions": """
        Extract:
        - Article title
        - Author name
        - Publication date
        - Reading time
        - Main article content (exclude ads, comments, sidebars)
        - Tags/categories
        - Featured image URL
    """
}

2. E-commerce Product Scraping

Extract product information:
data = {
    "url": "https://shop.example.com/products/laptop-pro",
    "instructions": """
        Extract product information:
        - Product name and SKU
        - Current price and original price (if on sale)
        - Discount percentage
        - Product description
        - Technical specifications (all fields)
        - Stock availability
        - Shipping information
        - Customer rating (average)
        - Number of reviews
        - Seller/brand name
    """
}

3. Documentation Scraping

Extract code examples and API docs:
data = {
    "url": "https://docs.example.com/api/authentication",
    "instructions": """
        Extract from this API documentation:
        - Page title
        - API endpoint path and method
        - All request parameters with descriptions
        - Request body schema
        - Response schema
        - All code examples (preserve language tags)
        - Error codes and descriptions
    """
}

4. Contact Information Extraction

Extract contact details from business websites:
data = {
    "url": "https://company.example.com/contact",
    "instructions": """
        Extract all contact information:
        - Email addresses (all)
        - Phone numbers (all)
        - Physical addresses
        - Social media links (Facebook, Twitter, LinkedIn, etc.)
        - Contact form URL
        - Business hours
        - Department-specific contacts
    """
}

5. News Article Scraping

Extract news with sources and quotes:
data = {
    "url": "https://news.example.com/breaking-story",
    "instructions": """
        Extract:
        - Headline
        - Subheadline
        - Author and publication
        - Date and time
        - Full article text
        - All quoted statements (with attribution)
        - Related article links
        - Image captions
    """
}
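
All five use cases follow the same submit pattern, so a small helper keeps the calls uniform. A minimal sketch, reusing the endpoint and headers from the examples above; the submit_job name is just for illustration:

import requests

submit_url = "https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit"
headers = {"Authorization": "Bearer <your_token>"}

def submit_job(page_url, instructions=None):
    """Submit a spiderSite job and return its job_id (illustrative helper)."""
    payload = {"url": page_url}
    if instructions:
        payload["instructions"] = instructions
    response = requests.post(submit_url, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()["job_id"]

# Example: the news article use case above
job_id = submit_job(
    "https://news.example.com/breaking-story",
    "Extract the headline, author, date and time, and full article text."
)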

Advanced Features

JavaScript-Heavy Sites

For sites that require JavaScript rendering, use the wait_for and timeout parameters:
data = {
    "url": "https://spa-app.example.com/content",
    "wait_for": ".main-content",  # CSS selector to wait for
    "timeout": 45,  # Increase timeout for slow-loading sites
    "instructions": "Extract all product listings"
}

Pagination Handling

To scrape multiple pages, submit separate jobs:
import requests

submit_url = "https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit"
headers = {"Authorization": "Bearer <your_token>"}
base_url = "https://blog.example.com/articles?page="
job_ids = []

for page in range(1, 11):  # Scrape pages 1-10
    data = {
        "url": f"{base_url}{page}",
        "instructions": "Extract all article titles, URLs, and summaries"
    }

    response = requests.post(submit_url, headers=headers, json=data)
    job_ids.append(response.json()['job_id'])

print(f"Submitted {len(job_ids)} jobs")

Best Practices for Instructions

✅ Good Instructions

Be specific and structured:
Extract the following:
- Product name (exact text from h1 heading)
- Price (numeric value only, without currency symbol)
- Product description (first paragraph only)
- Specifications as a list:
  * CPU
  * RAM
  * Storage
  * Display size
- Stock status (in stock / out of stock / pre-order)

❌ Poor Instructions

Vague requests:
Get the product info

Tips for Better Results

  • Be specific: List exactly what fields you want extracted
  • Use structure: Bullet points or numbered lists work best
  • Mention format: Specify if you want JSON, lists, or plain text (see the example below)
  • Exclude unwanted content: Explicitly mention what to exclude (ads, sidebars, comments)
  • Don’t over-complicate: Keep instructions focused on one page’s content
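
For example, a submission that applies these tips might look like the following sketch (the requested field names are illustrative):

data = {
    "url": "https://shop.example.com/products/laptop-pro",
    "instructions": """
        Extract the following and return it as JSON:
        - product_name (exact text from the h1 heading)
        - price (numeric value only, without currency symbol)
        - description (first paragraph only)
        - stock_status (in stock / out of stock / pre-order)
        Exclude ads, reviews, and sidebar content.
    """
}

response = requests.post(submit_url, headers=headers, json=data)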

Complete Workflow Example

Here’s a complete workflow from submission to retrieval:
import requests
import time
import json

# Configuration
API_BASE = "https://spideriq.di-atomic.com/api/v1"
AUTH_TOKEN = "<your_token>"
headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}

# Step 1: Submit job
submit_data = {
    "url": "https://example.com/article",
    "instructions": """
        Extract:
        - Article title
        - Author
        - Publication date
        - Full content
    """
}

response = requests.post(
    f"{API_BASE}/jobs/spiderSite/submit",
    headers=headers,
    json=submit_data
)
job_id = response.json()['job_id']
print(f"✓ Job submitted: {job_id}")

# Step 2: Poll for completion
max_wait = 120  # 2 minutes
start_time = time.time()

while time.time() - start_time < max_wait:
    response = requests.get(
        f"{API_BASE}/jobs/{job_id}/results",
        headers=headers
    )

    if response.status_code == 200:
        # Job completed
        results = response.json()
        print("✓ Job completed!")

        # Access extracted content
        extracted = results['data']['extracted_content']
        print(f"\nTitle: {extracted['title']}")
        print(f"Author: {extracted['author']}")
        print(f"Date: {extracted['publication_date']}")
        print(f"\nContent:\n{extracted['content'][:200]}...")

        # Save results
        with open(f'results_{job_id}.json', 'w') as f:
            json.dump(results, f, indent=2)

        break

    elif response.status_code == 202:
        # Still processing
        print("⏳ Job processing...")
        time.sleep(3)

    elif response.status_code == 410:
        # Job failed
        error = response.json()
        print(f"✗ Job failed: {error.get('error')}")
        break

    else:
        print(f"✗ Unexpected status: {response.status_code}")
        break
else:
    print("✗ Timeout waiting for job to complete")

Handling Errors

Common Errors

Error: “Failed to connect to target URL”
Cause: The URL is invalid, blocked, or requires authentication
Solution:
  • Verify the URL is correct and publicly accessible
  • Check if the site requires login (SpiderSite cannot scrape authenticated pages)
  • Ensure the site is not blocking bot traffic

Error: “Page load timeout exceeded”
Cause: Page took too long to load (> 30s default)
Solution:
  • Increase the timeout parameter (max 60s)
  • Use wait_for parameter to wait for specific elements
  • Consider scraping a lighter version of the page

Error: Job completes but extracted_content is empty
Cause: Instructions were too vague or content doesn’t match
Solution:
  • Make instructions more specific
  • Check the markdown field to see what was actually scraped
  • Verify the page structure matches your expectations

Error: “Rate limit exceeded. Maximum 100 requests per minute”
Cause: Too many requests in a short time
Solution:
  • Implement delays between requests (wait for Retry-After header)
  • Use exponential backoff (see the sketch below)
  • Consider requesting a higher rate limit
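
The rate-limit advice above can be wrapped into a small retry helper. A minimal sketch, assuming the API signals rate limiting with HTTP 429 and a Retry-After header expressed in seconds (neither is confirmed by this guide), and reusing submit_url and headers from the earlier examples:

import time
import requests

def submit_with_backoff(payload, max_attempts=5):
    """Submit a job, backing off when the API reports rate limiting (assumed HTTP 429)."""
    delay = 1
    for attempt in range(max_attempts):
        response = requests.post(submit_url, headers=headers, json=payload)

        if response.status_code != 429:  # assumption: 429 indicates rate limiting
            response.raise_for_status()
            return response.json()["job_id"]

        # Prefer the server's Retry-After hint (assumed seconds), else back off exponentially
        wait = int(response.headers.get("Retry-After", delay))
        time.sleep(wait)
        delay *= 2

    raise RuntimeError("Gave up after repeated rate-limit responses")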

Performance Optimization

Batch Processing

Process multiple URLs efficiently:
import requests
import time

def batch_scrape(urls, instructions, batch_size=10):
    """Submit jobs in batches (reusing submit_url and headers from above) to stay within rate limits"""
    job_ids = []

    for i in range(0, len(urls), batch_size):
        batch = urls[i:i+batch_size]

        for url in batch:
            response = requests.post(
                submit_url,
                headers=headers,
                json={"url": url, "instructions": instructions}
            )
            job_ids.append(response.json()['job_id'])

        # Wait between batches to respect rate limits
        if i + batch_size < len(urls):
            time.sleep(6)  # 10 requests per 6 seconds stays within the 100 requests/minute limit

    return job_ids
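
For example, to submit fifty paginated listing pages in rate-limited batches (the URL pattern is illustrative):

urls = [f"https://blog.example.com/articles?page={page}" for page in range(1, 51)]
job_ids = batch_scrape(urls, "Extract all article titles, URLs, and summaries")
print(f"Submitted {len(job_ids)} jobs")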

Parallel Result Retrieval

Check multiple jobs concurrently:
from concurrent.futures import ThreadPoolExecutor
import requests
import time

def get_job_result(job_id):
    """Get result for a single job"""
    max_retries = 40
    for _ in range(max_retries):
        response = requests.get(
            f"{API_BASE}/jobs/{job_id}/results",
            headers=headers
        )
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 202:
            time.sleep(3)
        else:
            return None
    return None

# Process multiple jobs in parallel (job_ids collected from the batch submission above)
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(get_job_result, job_ids))

# Filter successful results
successful = [r for r in results if r and r['success']]
print(f"Completed: {len(successful)}/{len(job_ids)} jobs")

Limitations

  • Authentication: SpiderSite cannot scrape pages that require login or authentication
  • CAPTCHAs: Pages with CAPTCHA protection cannot be scraped
  • Heavy JavaScript: Complex Single Page Applications (SPAs) may not render correctly
  • robots.txt: SpiderSite respects robots.txt. Ensure you have permission to scrape the target site.

Next Steps