Overview

SpiderSite uses the Crawl4AI library to extract content from websites, with optional AI-powered data extraction. This guide covers everything from submitting a basic scraping job to extracting structured data with natural language instructions.

Basic Website Scraping

Submit a Simple Job

The most basic form of website scraping extracts all content from a single URL:
import requests

url = "https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit"
headers = {
    "Authorization": "Bearer <your_token>"
}
data = {
    "url": "https://example.com/blog/article"
}

response = requests.post(url, headers=headers, json=data)
job = response.json()
print(f"Job submitted: {job['job_id']}")

What You Get

SpiderSite returns comprehensive data:
  • markdown - Full page content in markdown format
  • screenshot_url - Full-page screenshot (Cloudflare R2 CDN)
  • metadata - Page title, description, keywords, author
  • links - All internal and external links
  • media - Images and media files found on the page
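
Once a job completes, these fields are available under the data object returned by the results endpoint. A minimal sketch of reading them, assuming results holds the parsed JSON from the results endpoint (see the Complete Workflow Example below):

# Assumes `results` is the parsed JSON from GET /jobs/{job_id}/results
data = results["data"]

print(data["metadata"]["title"])   # page title
print(data["screenshot_url"])      # full-page screenshot on the CDN
print(len(data["links"]), "links found")
print(data["markdown"][:500])      # first 500 characters of the markdown content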

AI-Powered Extraction

Using Instructions

Add natural language instructions to extract specific data:
data = {
    "url": "https://blog.example.com/future-of-ai",
    "instructions": """
        Extract the following information:
        - Article title
        - Author name and bio
        - Publication date
        - Article category/tags
        - Full article content (without ads or sidebars)
        - All code snippets or examples
        - References or citations
    """
}

response = requests.post(url, headers=headers, json=data)

Results with AI Extraction

{
  "success": true,
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "spiderSite",
  "status": "completed",
  "data": {
    "url": "https://blog.example.com/future-of-ai",
    "markdown": "# The Future of AI\n\nBy John Doe\n\nPublished: Oct 27, 2025...",
    "screenshot_url": "https://cdn.spideriq.di-atomic.com/screenshots/550e8400.png",
    "extracted_content": {
      "title": "The Future of AI: Trends and Predictions",
      "author": "John Doe - AI Researcher at Tech Corp",
      "publication_date": "2025-10-27",
      "category": ["Artificial Intelligence", "Technology", "Future Trends"],
      "content": "The landscape of artificial intelligence is evolving...",
      "code_snippets": [
        "import tensorflow as tf\nmodel = tf.keras.Sequential([...])"
      ],
      "references": [
        "Smith et al., 2024 - 'Neural Networks Advances'",
        "Johnson, 2025 - 'AI Ethics Framework'"
      ]
    },
    "metadata": {
      "title": "The Future of AI: Trends and Predictions",
      "description": "Explore the latest trends and predictions...",
      "keywords": ["AI", "machine learning", "future tech"],
      "author": "John Doe"
    },
    "links": [
      {"url": "https://blog.example.com/related-article", "text": "Related: ML Basics"}
    ],
    "media": [
      {"url": "https://blog.example.com/images/ai-diagram.png", "alt": "AI Architecture"}
    ]
  }
}
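
The extracted_content object follows whatever structure the instructions asked for, while links and media are plain lists. A short sketch of walking the response above, assuming it has been parsed into a results dict:

# Assumes `results` holds the parsed JSON response shown above
data = results["data"]

# extracted_content mirrors the fields requested in the instructions
extracted = data["extracted_content"]
print(extracted["title"])
print(extracted["publication_date"])

# links and media are lists of dicts with a url plus text / alt
for link in data["links"]:
    print(f"{link['text']} -> {link['url']}")
for image in data["media"]:
    print(f"{image.get('alt', '')}: {image['url']}")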

Common Use Cases

1. Blog Article Extraction

Extract structured data from blog posts:
data = {
    "url": "https://techblog.com/article",
    "instructions": """
        Extract:
        - Article title
        - Author name
        - Publication date
        - Reading time
        - Main article content (exclude ads, comments, sidebars)
        - Tags/categories
        - Featured image URL
    """
}

2. E-commerce Product Scraping

Extract product information:
data = {
    "url": "https://shop.example.com/products/laptop-pro",
    "instructions": """
        Extract product information:
        - Product name and SKU
        - Current price and original price (if on sale)
        - Discount percentage
        - Product description
        - Technical specifications (all fields)
        - Stock availability
        - Shipping information
        - Customer rating (average)
        - Number of reviews
        - Seller/brand name
    """
}

3. Documentation Scraping

Extract code examples and API docs:
data = {
    "url": "https://docs.example.com/api/authentication",
    "instructions": """
        Extract from this API documentation:
        - Page title
        - API endpoint path and method
        - All request parameters with descriptions
        - Request body schema
        - Response schema
        - All code examples (preserve language tags)
        - Error codes and descriptions
    """
}

4. Contact Information Extraction

Extract contact details from business websites:
data = {
    "url": "https://company.example.com/contact",
    "instructions": """
        Extract all contact information:
        - Email addresses (all)
        - Phone numbers (all)
        - Physical addresses
        - Social media links (Facebook, Twitter, LinkedIn, etc.)
        - Contact form URL
        - Business hours
        - Department-specific contacts
    """
}

5. News Article Scraping

Extract news with sources and quotes:
data = {
    "url": "https://news.example.com/breaking-story",
    "instructions": """
        Extract:
        - Headline
        - Subheadline
        - Author and publication
        - Date and time
        - Full article text
        - All quoted statements (with attribution)
        - Related article links
        - Image captions
    """
}
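
All five use cases follow the same submit pattern, so a small helper keeps the calls uniform. A minimal sketch, reusing the endpoint and headers from the examples above; the submit_job name is just for illustration:

import requests

submit_url = "https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit"
headers = {"Authorization": "Bearer <your_token>"}

def submit_job(page_url, instructions=None):
    """Submit a spiderSite job and return its job_id (illustrative helper)."""
    payload = {"url": page_url}
    if instructions:
        payload["instructions"] = instructions
    response = requests.post(submit_url, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()["job_id"]

# Example: the news article use case above
job_id = submit_job(
    "https://news.example.com/breaking-story",
    "Extract the headline, author, date and time, and full article text."
)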

Advanced Features

JavaScript-Heavy Sites

For sites that require JavaScript rendering, use the wait_for and timeout parameters:
data = {
    "url": "https://spa-app.example.com/content",
    "wait_for": ".main-content",  # CSS selector to wait for
    "timeout": 45,  # Increase timeout for slow-loading sites
    "instructions": "Extract all product listings"
}

Pagination Handling

To scrape multiple pages, submit separate jobs:
import requests

submit_url = "https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit"
headers = {"Authorization": "Bearer <your_token>"}
base_url = "https://blog.example.com/articles?page="
job_ids = []

for page in range(1, 11):  # Scrape pages 1-10
    data = {
        "url": f"{base_url}{page}",
        "instructions": "Extract all article titles, URLs, and summaries"
    }

    response = requests.post(submit_url, headers=headers, json=data)
    job_ids.append(response.json()['job_id'])

print(f"Submitted {len(job_ids)} jobs")

Best Practices for Instructions

✅ Good Instructions

Be specific and structured:
Extract the following:
- Product name (exact text from h1 heading)
- Price (numeric value only, without currency symbol)
- Product description (first paragraph only)
- Specifications as a list:
  * CPU
  * RAM
  * Storage
  * Display size
- Stock status (in stock / out of stock / pre-order)

❌ Poor Instructions

Vague requests:
Get the product info

Tips for Better Results

  • Be specific: List exactly what fields you want extracted
  • Use structure: Bullet points or numbered lists work best
  • Mention format: Specify if you want JSON, lists, or plain text (see the example below)
  • Exclude unwanted content: Explicitly mention what to exclude (ads, sidebars, comments)
  • Don’t over-complicate: Keep instructions focused on one page’s content
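
For example, a submission that applies these tips might look like the following sketch (the requested field names are illustrative):

data = {
    "url": "https://shop.example.com/products/laptop-pro",
    "instructions": """
        Extract the following and return it as JSON:
        - product_name (exact text from the h1 heading)
        - price (numeric value only, without currency symbol)
        - description (first paragraph only)
        - stock_status (in stock / out of stock / pre-order)
        Exclude ads, reviews, and sidebar content.
    """
}

response = requests.post(submit_url, headers=headers, json=data)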

Complete Workflow Example

Here’s a complete workflow from submission to retrieval:
import requests
import time
import json

# Configuration
API_BASE = "https://spideriq.di-atomic.com/api/v1"
AUTH_TOKEN = "<your_token>"
headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}

# Step 1: Submit job
submit_data = {
    "url": "https://example.com/article",
    "instructions": """
        Extract:
        - Article title
        - Author
        - Publication date
        - Full content
    """
}

response = requests.post(
    f"{API_BASE}/jobs/spiderSite/submit",
    headers=headers,
    json=submit_data
)
job_id = response.json()['job_id']
print(f"✓ Job submitted: {job_id}")

# Step 2: Poll for completion
max_wait = 120  # 2 minutes
start_time = time.time()

while time.time() - start_time < max_wait:
    response = requests.get(
        f"{API_BASE}/jobs/{job_id}/results",
        headers=headers
    )

    if response.status_code == 200:
        # Job completed
        results = response.json()
        print("✓ Job completed!")

        # Access extracted content
        extracted = results['data']['extracted_content']
        print(f"\nTitle: {extracted['title']}")
        print(f"Author: {extracted['author']}")
        print(f"Date: {extracted['publication_date']}")
        print(f"\nContent:\n{extracted['content'][:200]}...")

        # Save results
        with open(f'results_{job_id}.json', 'w') as f:
            json.dump(results, f, indent=2)

        break

    elif response.status_code == 202:
        # Still processing
        print("⏳ Job processing...")
        time.sleep(3)

    elif response.status_code == 410:
        # Job failed
        error = response.json()
        print(f"✗ Job failed: {error.get('error')}")
        break

    else:
        print(f"✗ Unexpected status: {response.status_code}")
        break
else:
    print("✗ Timeout waiting for job to complete")

Handling Errors

Common Errors

Error: “Failed to connect to target URL”
Cause: The URL is invalid, blocked, or requires authentication
Solution:
  • Verify the URL is correct and publicly accessible
  • Check if the site requires login (SpiderSite cannot scrape authenticated pages)
  • Ensure the site is not blocking bot traffic

Error: “Page load timeout exceeded”
Cause: Page took too long to load (> 30s default)
Solution:
  • Increase the timeout parameter (max 60s)
  • Use wait_for parameter to wait for specific elements
  • Consider scraping a lighter version of the page

Error: Job completes but extracted_content is empty
Cause: Instructions were too vague or content doesn’t match
Solution:
  • Make instructions more specific
  • Check the markdown field to see what was actually scraped
  • Verify the page structure matches your expectations

Error: “Rate limit exceeded. Maximum 100 requests per minute”
Cause: Too many requests in a short time
Solution:
  • Implement delays between requests (wait for Retry-After header)
  • Use exponential backoff (see the sketch below)
  • Consider requesting a higher rate limit
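
The rate-limit advice above can be wrapped into a small retry helper. A minimal sketch, assuming the API signals rate limiting with HTTP 429 and a Retry-After header expressed in seconds (neither is confirmed by this guide), and reusing submit_url and headers from the earlier examples:

import time
import requests

def submit_with_backoff(payload, max_attempts=5):
    """Submit a job, backing off when the API reports rate limiting (assumed HTTP 429)."""
    delay = 1
    for attempt in range(max_attempts):
        response = requests.post(submit_url, headers=headers, json=payload)

        if response.status_code != 429:  # assumption: 429 indicates rate limiting
            response.raise_for_status()
            return response.json()["job_id"]

        # Prefer the server's Retry-After hint (assumed seconds), else back off exponentially
        wait = int(response.headers.get("Retry-After", delay))
        time.sleep(wait)
        delay *= 2

    raise RuntimeError("Gave up after repeated rate-limit responses")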

Performance Optimization

Batch Processing

Process multiple URLs efficiently:
import requests
import time

def batch_scrape(urls, instructions, batch_size=10):
    """Submit jobs in batches (reusing submit_url and headers from above) to stay within rate limits"""
    job_ids = []

    for i in range(0, len(urls), batch_size):
        batch = urls[i:i+batch_size]

        for url in batch:
            response = requests.post(
                submit_url,
                headers=headers,
                json={"url": url, "instructions": instructions}
            )
            job_ids.append(response.json()['job_id'])

        # Wait between batches to respect rate limits
        if i + batch_size < len(urls):
            time.sleep(6)  # 10 requests per 6 seconds stays within the 100 requests/minute limit

    return job_ids
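
For example, to submit fifty paginated listing pages in rate-limited batches (the URL pattern is illustrative):

urls = [f"https://blog.example.com/articles?page={page}" for page in range(1, 51)]
job_ids = batch_scrape(urls, "Extract all article titles, URLs, and summaries")
print(f"Submitted {len(job_ids)} jobs")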

Parallel Result Retrieval

Check multiple jobs concurrently:
from concurrent.futures import ThreadPoolExecutor
import requests
import time

def get_job_result(job_id):
    """Get result for a single job"""
    max_retries = 40
    for _ in range(max_retries):
        response = requests.get(
            f"{API_BASE}/jobs/{job_id}/results",
            headers=headers
        )
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 202:
            time.sleep(3)
        else:
            return None
    return None

# Process multiple jobs in parallel (job_ids collected from the batch submission above)
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(get_job_result, job_ids))

# Filter successful results
successful = [r for r in results if r and r['success']]
print(f"Completed: {len(successful)}/{len(job_ids)} jobs")

Limitations

  • Authentication: SpiderSite cannot scrape pages that require login or authentication
  • CAPTCHAs: Pages with CAPTCHA protection cannot be scraped
  • Heavy JavaScript: Complex Single Page Applications (SPAs) may not render correctly
  • robots.txt: SpiderSite respects robots.txt. Ensure you have permission to scrape the target site.

Next Steps