Overview

SpiderSite is an intelligent website crawler with AI-powered lead generation. It crawls websites, extracts contact information, and optionally applies AI analysis for company insights, team identification, and lead scoring.
Version 2.10.0: all AI features are now combined into a single efficient API call, including custom prompts for tailored analysis.

How SpiderSite Works

┌─────────────────────────────────────────────────────────────────────────┐
│                         SpiderSite Flow                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. CRAWL PHASE                                                          │
│     ├── Check for sitemap.xml (fastest method)                          │
│     ├── Score URLs by relevance (contact, about, team pages first)      │
│     ├── Auto-detect SPA (React/Vue/Angular) → use Playwright            │
│     └── Crawl up to max_pages using selected strategy                   │
│                                                                          │
│  2. EXTRACTION PHASE (No AI - Always runs)                              │
│     ├── Extract emails, phones, addresses                               │
│     ├── Find social media profiles (14 platforms)                       │
│     └── Generate markdown compendium of all content                     │
│                                                                          │
│  3. AI ANALYSIS PHASE (Opt-in - ONE unified call)                       │
│     └── Combines ALL enabled features:                                   │
│         ├── extract_team → Team members with titles/emails              │
│         ├── extract_company_info → Company summary/services             │
│         ├── extract_pain_points → Business challenges                   │
│         ├── Lead scoring (CHAMP) → If product/ICP provided              │
│         └── custom_ai_prompt → Your custom analysis                     │
│                                                                          │
│  4. RESPONSE                                                             │
│     └── Structured JSON with all extracted data                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The 5 Request Types

SpiderSite supports 5 different levels of extraction, from basic scraping to full AI analysis:
Type | Description | AI Used | Cost
1. Basic Scraping | URL → markdown compendium only | No | Free
2. Contact Extraction | Scrape + contacts/social media | No | Free
3. AI Lead Intelligence | + team, company info, pain points | Yes | AI tokens
4. CHAMP Lead Scoring | + lead scoring with product/ICP | Yes | AI tokens
5. Custom AI Prompts | + your own analysis prompts | Yes | AI tokens

Example 1: Basic Contact Extraction (No AI)

The simplest request - just provide a URL:
curl -X POST "https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit" \
  -H "Authorization: Bearer $CLIENT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "url": "https://example.com",
      "max_pages": 5
    }
  }'
What you get:
  • Emails, phones, addresses
  • Social media links (14 platforms)
  • Markdown compendium (fit level)
  • No AI tokens used

Example 2: Full Lead Intelligence (AI Enabled)

Extract company info and team members:
Request Body
{
  "payload": {
    "url": "https://techstartup.io",
    "max_pages": 15,
    "extract_team": true,
    "extract_company_info": true,
    "extract_pain_points": true
  }
}
What you get:
  • All contact info
  • Company vitals (name, summary, industry, services, target audience)
  • Team members (names, titles, emails, LinkedIn)
  • Pain points analysis
  • Markdown compendium

Example 3: CHAMP Lead Scoring

Complete lead scoring with the CHAMP framework:
Request Body
{
  "payload": {
    "url": "https://enterprise-target.com",
    "max_pages": 20,
    "extract_team": true,
    "extract_company_info": true,
    "extract_pain_points": true,
    "product_description": "AI-powered sales automation platform that helps B2B teams close deals 3x faster",
    "icp_description": "Mid-market B2B SaaS companies with 50-500 employees, $10M-$100M ARR"
  }
}
What you get:
  • Everything from Example 2, plus:
  • CHAMP Analysis:
    • Challenges: Specific pain points matched to your solution
    • Authority: Decision makers and buying process
    • Money: Budget indicators and funding status
    • Prioritization: Urgency signals and priority level
  • ICP fit score (0-1)
  • Personalization hooks for outreach
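Once scored, results can be triaged programmatically. A minimal sketch, assuming the `lead_scoring` fields shown in the Response Structure section below (`lead_priority`, `engagement_score`); `hot_leads` is illustrative glue code, not part of the API:

```python
def hot_leads(results, min_engagement=70):
    """Filter crawled results down to high-priority leads, sorted by
    engagement score (highest first). Field names follow the SpiderSite
    response structure; adjust if your response differs."""
    scored = [
        r for r in results
        if (r.get("lead_scoring") or {}).get("lead_priority") == "Hot"
        and r["lead_scoring"].get("engagement_score", 0) >= min_engagement
    ]
    return sorted(scored,
                  key=lambda r: r["lead_scoring"]["engagement_score"],
                  reverse=True)
```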

Example 4: Custom AI Analysis (v2.10.0)

Extract specific information using your own prompts:
Request Body
{
  "payload": {
    "url": "https://saas-company.com",
    "max_pages": 10,
    "custom_ai_prompt": {
      "enabled": true,
      "system_prompt": "You are a cybersecurity analyst specializing in SaaS platforms.",
      "user_prompt": "Extract all security certifications, compliance frameworks, and data privacy practices mentioned on this website.",
      "json_schema": {
        "security_certifications": ["SOC 2", "ISO 27001"],
        "compliance_frameworks": ["GDPR", "HIPAA"],
        "data_privacy_summary": "string"
      },
      "model": "google/gemini-2.0-flash-exp:free",
      "temperature": 0.1,
      "max_tokens": 4000
    }
  }
}
Response includes:
{
  "data": {
    "custom_analysis": {
      "security_certifications": ["SOC 2 Type II", "ISO 27001"],
      "compliance_frameworks": ["GDPR", "CCPA", "HIPAA"],
      "data_privacy_summary": "Company maintains strict data encryption..."
    }
  }
}

Example 5: Combined AI + Custom Prompt (ONE Call!)

All AI features in a single API call for maximum efficiency:
Request Body
{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 15,
    "extract_team": true,
    "extract_company_info": true,
    "extract_pain_points": true,
    "product_description": "HR automation platform",
    "icp_description": "Companies with 100-1000 employees",
    "custom_ai_prompt": {
      "enabled": true,
      "system_prompt": "You are a competitive intelligence analyst.",
      "user_prompt": "Extract pricing information, key differentiators, and main competitors mentioned.",
      "output_field_name": "competitive_intel",
      "model": "google/gemini-2.0-flash-exp:free",
      "temperature": 0.2,
      "max_tokens": 6000
    }
  }
}
All extracted in ONE API call:
  • Team members
  • Company info
  • Pain points
  • Lead scoring (CHAMP)
  • Custom competitive intel
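To avoid hand-writing this payload for every target, the combined request can be assembled with a small helper. A sketch: field names match the examples above, but `build_full_payload` itself is not part of the API:

```python
def build_full_payload(url, product, icp, custom_prompt=None, max_pages=15):
    """Assemble a SpiderSite payload that enables every AI feature in one call."""
    payload = {
        "url": url,
        "max_pages": max_pages,
        "extract_team": True,
        "extract_company_info": True,
        "extract_pain_points": True,
        "product_description": product,  # with icp_description, enables CHAMP scoring
        "icp_description": icp,
    }
    if custom_prompt:
        # custom_prompt carries the remaining custom_ai_prompt fields
        payload["custom_ai_prompt"] = {"enabled": True, **custom_prompt}
    return {"payload": payload}
```

The returned dict can be passed directly as the JSON body of the submit request.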

Example 6: Minimal Compendium for LLM Context

Optimize for RAG/LLM applications with minimal token usage:
Request Body
{
  "payload": {
    "url": "https://content-heavy-site.com",
    "max_pages": 30,
    "compendium": {
      "enabled": true,
      "cleanup_level": "minimal",
      "max_chars": 50000,
      "remove_duplicates": true
    }
  }
}
Cleanup levels:
Level | Size | Best For
raw | 100% | Full fidelity, archival
fit | ~60% | General purpose (default)
citations | ~35% | Academic format with sources
minimal | ~15% | LLM consumption, token savings

Example 7: SPA-Heavy Site

For React/Vue/Angular sites that need JavaScript rendering:
Request Body
{
  "payload": {
    "url": "https://react-dashboard.app",
    "max_pages": 10,
    "enable_spa": true,
    "spa_timeout": 60,
    "extract_company_info": true
  }
}
SPA detection is automatic by default. Increase spa_timeout for slow-loading sites.

Response Structure

{
  "success": true,
  "job_id": "uuid",
  "type": "spiderSite",
  "status": "completed",
  "processing_time_seconds": 25.4,
  "data": {
    "url": "https://example.com",
    "pages_crawled": 10,
    "crawl_status": "success",

    "emails": ["contact@example.com", "sales@example.com"],
    "phones": ["+1-555-123-4567"],
    "addresses": ["123 Main St, SF, CA"],

    "linkedin": "https://linkedin.com/company/example",
    "twitter": "https://twitter.com/example",
    "facebook": null,
    "instagram": null,
    "youtube": null,
    "tiktok": null,
    "github": "https://github.com/example",
    "pinterest": null,
    "snapchat": null,
    "reddit": null,
    "medium": null,
    "discord": null,
    "whatsapp": null,
    "telegram": null,

    "company_vitals": {
      "one_sentence_summary": "...",
      "key_services": ["Service A", "Service B"],
      "target_audience": "...",
      "industry": "B2B SaaS"
    },

    "team_members": [
      {
        "name": "John Doe",
        "title": "CEO",
        "email": "john@example.com",
        "linkedin": "https://linkedin.com/in/johndoe"
      }
    ],

    "pain_points": {
      "inferred_challenges": ["Challenge 1", "Challenge 2"],
      "recent_mentions": ["News item 1"]
    },

    "lead_scoring": {
      "icp_fit_grade": "A",
      "engagement_score": 85,
      "lead_priority": "Hot",
      "champ_breakdown": {
        "challenges": "...",
        "authority": "...",
        "money": "...",
        "prioritization": "..."
      }
    },

    "custom_analysis": {
      "your_custom_fields": "..."
    },

    "markdown_compendium": "# Company Name\n\n...",

    "compendium": {
      "available": true,
      "storage_location": "inline",
      "size_chars": 45000,
      "cleanup_level": "fit"
    },

    "metadata": {
      "crawl_strategy": "sitemap",
      "spa_enabled": true,
      "browser_rendering_available": true
    }
  }
}
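Since absent platforms come back as null, a small helper can collapse the 14 social media fields into a dict containing only the profiles that were found. A sketch against the structure above:

```python
# The 14 social platform fields returned at the top level of `data`.
SOCIAL_FIELDS = [
    "linkedin", "twitter", "facebook", "instagram", "youtube", "tiktok",
    "github", "pinterest", "snapchat", "reddit", "medium", "discord",
    "whatsapp", "telegram",
]

def social_links(data):
    """Return only the social profiles that were actually found (non-null)."""
    return {field: data[field] for field in SOCIAL_FIELDS if data.get(field)}
```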

Large Compendiums (R2 Storage)

When a compendium is too large to return inline, it is stored in Cloudflare R2 and the response carries a time-limited download URL instead:
{
  "data": {
    "markdown_compendium": null,
    "compendium": {
      "storage_location": "r2",
      "r2_url": "https://cdn.spideriq.di-atomic.com/compendiums/abc123.md?...",
      "url_expires_at": "2025-10-28T12:00:00Z"
    }
  }
}
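Client code should handle both storage locations. A sketch (the injectable `fetch` parameter exists only for testability and is not part of the API; note that `r2_url` expires at `url_expires_at`, so download promptly):

```python
import urllib.request

def get_compendium(data, fetch=None):
    """Return the markdown compendium, downloading it from R2 when the
    response indicates it was offloaded (storage_location == "r2")."""
    comp = data.get("compendium") or {}
    if comp.get("storage_location") == "r2":
        fetch = fetch or (lambda url: urllib.request.urlopen(url).read().decode())
        return fetch(comp["r2_url"])
    return data.get("markdown_compendium")  # inline case
```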

Complete Workflow Example

Here’s a complete workflow from submission to result retrieval:
import requests
import time

# Configuration
API_BASE = "https://spideriq.di-atomic.com/api/v1"
CLIENT_TOKEN = "<your_client_id>:<your_api_key>:<your_api_secret>"
headers = {"Authorization": f"Bearer {CLIENT_TOKEN}"}

# Step 1: Submit job
submit_data = {
    "payload": {
        "url": "https://target-company.com",
        "max_pages": 10,
        "extract_company_info": True,
        "extract_team": True
    }
}

response = requests.post(
    f"{API_BASE}/jobs/spiderSite/submit",
    headers=headers,
    json=submit_data
)
job_id = response.json()['job_id']
print(f"✓ Job submitted: {job_id}")

# Step 2: Poll for completion
max_wait = 120  # 2 minutes
start_time = time.time()

while time.time() - start_time < max_wait:
    response = requests.get(
        f"{API_BASE}/jobs/{job_id}/results",
        headers=headers
    )

    result = response.json()

    if result['status'] == 'completed':
        print("✓ Job completed!")
        data = result['data']

        # Access extracted data
        print(f"\nEmails: {data['emails']}")
        print(f"Phones: {data['phones']}")
        print(f"LinkedIn: {data['linkedin']}")

        if data.get('company_vitals'):
            print(f"\nCompany: {data['company_vitals']['one_sentence_summary']}")

        if data.get('team_members'):
            print(f"\nTeam Members: {len(data['team_members'])}")
            for member in data['team_members']:
                print(f"  - {member['name']}: {member.get('title', 'N/A')}")

        break

    elif result['status'] == 'failed':
        print(f"✗ Job failed: {result.get('error_message')}")
        break

    else:
        print(f"⏳ Status: {result['status']}...")
        time.sleep(3)

else:
    print("✗ Timeout waiting for job to complete")

Best Practices

Use AI features when:
  • Qualifying high-value leads
  • Building targeted outreach campaigns
  • Identifying decision makers
  • Scoring leads by ICP fit
Skip AI features for:
  • Bulk contact extraction
  • Budget-sensitive scraping
  • When you only need contact info
Crawl strategy selection:
  • bestfirst (default): Best for most use cases - intelligent prioritization
  • Sitemap-first (automatic): Used automatically when sitemap.xml is discovered
  • bfs: When you need broad coverage across sections
  • dfs: When you need deep coverage of specific sections
Cleanup level selection:
Level | Use Case
raw | Academic research, legal compliance
fit | General purpose (default)
citations | Research documents with sources
minimal | LLM/RAG applications
Custom prompt tips:
  • Be specific: Clearly define what data you want extracted
  • Use json_schema: Helps the AI return structured data
  • Set output_field_name: Organize multiple custom analyses
  • Adjust temperature: Lower (0.1) for factual extraction, higher (0.5+) for creative analysis

Error Handling

Error: “Failed to connect to target URL”
Causes:
  • Invalid URL
  • Site blocking bots
  • Site requires authentication
Solutions:
  • Verify URL is correct and publicly accessible
  • Check if site blocks automated access
Error: “Page load timeout exceeded”
Causes:
  • Slow-loading site
  • Heavy JavaScript rendering
Solutions:
  • Increase timeout parameter (max 120s)
  • Increase spa_timeout for SPA sites
  • Reduce max_pages
Error: “Rate limit exceeded”
Solutions:
  • Implement delays between requests
  • Use exponential backoff
  • Contact support for higher limits
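Exponential backoff can be sketched as follows: delays double per attempt up to a cap, with jitter added to avoid synchronized retries. The values and the error type are illustrative, not documented limits:

```python
import random
import time

def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Deterministic part of the backoff schedule: base * 2**attempt, capped."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]

def submit_with_backoff(submit, max_retries=5):
    """Retry `submit` (a zero-arg callable) on rate-limit errors, sleeping
    an exponentially growing delay plus jitter between attempts."""
    for delay in backoff_delays(max_retries):
        try:
            return submit()
        except RuntimeError:  # replace with your HTTP 429 check
            time.sleep(delay + random.uniform(0, 0.5))
    raise RuntimeError("rate limit retries exhausted")
```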

Limitations

Authentication: SpiderSite cannot scrape pages requiring login
CAPTCHAs: Sites with CAPTCHA protection cannot be scraped
robots.txt: SpiderSite respects robots.txt directives

Next Steps