POST /api/v1/jobs/spiderSite/submit
Submit SpiderSite Job
curl --request POST \
  --url https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "payload": {
    "url": "<string>",
    "max_pages": 123,
    "crawl_strategy": "<string>",
    "target_pages": ["<string>"],
    "timeout": 123,
    "enable_spa": true,
    "spa_timeout": 123,
    "extract_team": true,
    "extract_company_info": true,
    "extract_pain_points": true,
    "product_description": "<string>",
    "icp_description": "<string>",
    "compendium": {
      "enabled": true,
      "cleanup_level": "<string>",
      "max_chars": 123,
      "remove_duplicates": true,
      "include_in_response": true,
      "separator": "<string>",
      "priority_sections": ["<string>"]
    }
  },
  "priority": 123
}'
{
  "job_id": "<string>",
  "type": "<string>",
  "status": "<string>",
  "created_at": "<string>",
  "from_cache": true,
  "message": "<string>"
}

Overview

Submit a SpiderSite job to crawl websites with contact extraction, AI-powered company analysis, team member identification, and CHAMP lead scoring.
Version 2.7.0: Features AI Context Engine with markdown compendiums, SPA auto-detection, sitemap-first crawling, and multilingual support (36+ languages).

Key Features

  • Smart Crawling: Sitemap-first with intelligent page prioritization
  • Contact Extraction: Emails, phones, addresses, 14 social platforms
  • AI Analysis: Company vitals, team members, pain points
  • Lead Scoring: CHAMP framework with ICP fit scoring

Request Body

Required Parameters

payload.url
string
required
Website URL to crawl (must include https://). Example: https://example.com

Crawl Configuration

payload.max_pages
integer
default:"10"
Maximum pages to crawl (1-50). Higher values mean more data but slower processing
payload.crawl_strategy
enum
default:"bestfirst"
Crawling strategy:
  • bestfirst - Intelligent prioritization (recommended)
  • bfs - Breadth-first search
  • dfs - Depth-first search
Note: Sitemap-first is used automatically if sitemap is discovered
payload.target_pages
array
Page types to prioritize. Works with 36+ languages (e.g., “kontakt” in German, “contacto” in Spanish)
payload.timeout
integer
default:"30"
HTTP request timeout per page (10-120 seconds)
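As a quick illustration, a request body combining these crawl parameters might look like the following sketch (values are illustrative, not recommendations):
{
  "payload": {
    "url": "https://example.com",
    "max_pages": 20,
    "crawl_strategy": "bestfirst",
    "target_pages": ["contact", "about", "team"],
    "timeout": 60
  }
}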

SPA Support (v2.4.0)

payload.enable_spa
boolean
default:"true"
Enable automatic SPA detection and Playwright rendering. Automatically detects JavaScript-heavy sites (React/Vue/Angular)
payload.spa_timeout
integer
default:"30"
Playwright page load timeout (10-120 seconds). Only used when SPA is detected. Increase for slow-loading sites.
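For a slow, JavaScript-heavy site, you might keep auto-detection on and raise the Playwright timeout; a sketch (the URL and timeout value are placeholders):
{
  "payload": {
    "url": "https://spa-example.com",
    "enable_spa": true,
    "spa_timeout": 60
  }
}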

AI Features (Opt-In - 0 Tokens by Default)

payload.extract_team
boolean
default:"false"
Extract team members using AI (~500 tokens). Includes names, titles, emails, LinkedIn profiles
payload.extract_company_info
boolean
default:"false"
Extract company summary using AI (~500 tokens). Includes services, target audience, industry
payload.extract_pain_points
boolean
default:"false"
Analyze business challenges using AI (~500 tokens). Infers pain points from news, blog posts, job listings
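To enable all three AI extractors on a single crawl (roughly 1,500 tokens per the cost table below), a request body might look like this sketch:
{
  "payload": {
    "url": "https://example.com",
    "extract_company_info": true,
    "extract_team": true,
    "extract_pain_points": true
  }
}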

Lead Scoring (CHAMP Framework)

payload.product_description
string
Your product description. Enables CHAMP lead scoring when combined with icp_description (+1,500 tokens)
payload.icp_description
string
Your ideal customer profile (ICP). Enables CHAMP lead scoring when combined with product_description (+1,500 tokens)
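CHAMP scoring requires both fields together; a minimal sketch with placeholder descriptions:
{
  "payload": {
    "url": "https://example.com",
    "product_description": "B2B analytics platform for e-commerce teams",
    "icp_description": "Mid-market online retailers with 50-500 employees"
  }
}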

AI Context Engine (v2.7.0)

payload.compendium
object
Markdown compendium configuration. Generates intelligent, deduplicated markdown optimized for LLMs
payload.compendium.enabled
boolean
default:"true"
Generate a markdown compendium. Provides full transparency of scraped content
payload.compendium.cleanup_level
enum
default:"fit"
Cleanup level:
  • raw - Complete conversion (100% baseline)
  • fit - Remove nav/ads/footers (~60% size)
  • citations - Academic format with sources (~70% size)
  • minimal - Main content only (~30% size, 70% token savings)
payload.compendium.max_chars
integer
default:"100000"
Maximum compendium size in characters (1,000-1,000,000). Content is truncated if exceeded
payload.compendium.remove_duplicates
boolean
default:"true"
Smart deduplication. Removes repeated headers/footers and saves 20-40% in size.
payload.compendium.include_in_response
boolean
default:"true"
Include markdown in the API response. Set to false for large files (use the download URL instead)
payload.compendium.separator
string
default:"\n\n---\n\n"
Page separator in compendium
payload.compendium.priority_sections
array
default:"[\"main\", \"article\", \"content\"]"
HTML tags to prioritize (minimal mode)
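As an example, a compendium tuned for LLM consumption of a large site might combine the minimal cleanup level with the download-URL workflow; a sketch (values are illustrative):
{
  "payload": {
    "url": "https://example.com",
    "compendium": {
      "enabled": true,
      "cleanup_level": "minimal",
      "max_chars": 50000,
      "remove_duplicates": true,
      "include_in_response": false
    }
  }
}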

Priority

priority
integer
default:"0"
Job priority (0-10, higher = processed first)
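Note that priority sits at the top level of the request body, alongside payload rather than inside it, as in this sketch:
{
  "payload": {
    "url": "https://example.com"
  },
  "priority": 10
}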

Response

job_id
string
required
Unique job identifier (UUID format)
type
string
required
Always spiderSite for this endpoint
status
string
required
Initial job status (always queued)
created_at
string
required
Job creation timestamp (ISO 8601)
from_cache
boolean
required
Whether this job was deduplicated from cache (24-hour TTL)
message
string
required
Confirmation message

Request Examples

Example variations include: Minimal, With AI Features, Full CHAMP Scoring, Compendium Minimal, Compendium Disabled, SPA Site, Multilingual, High Priority, Partial Compendium, and Full Configuration.
Minimal: the most basic request, only a URL (contact extraction only, no AI):
curl -X POST https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "url": "https://example.com"
    }
  }'
What you get:
  • Contact info (emails, phones, addresses)
  • 14 social media platforms
  • Markdown compendium (fit level)
  • No AI tokens used (0 cost)
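The Multilingual variation listed above follows the same pattern; a plausible sketch using the German “kontakt” page name mentioned earlier (the URL is a placeholder):
{
  "payload": {
    "url": "https://example.de",
    "max_pages": 10,
    "target_pages": ["kontakt", "team"]
  }
}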

Example Response

201 Created
{
  "job_id": "974ceeda-84fe-4634-bdcd-adc895c6bc75",
  "type": "spiderSite",
  "status": "queued",
  "created_at": "2025-10-27T14:30:00Z",
  "from_cache": false,
  "message": "SpiderSite job queued successfully. Estimated processing time: 15-30 seconds."
}

From Cache (Deduplication)

If the same URL was crawled in the last 24 hours:
201 Created - From Cache
{
  "job_id": "abc12345-6789-4def-ghij-klmnopqrstuv",
  "type": "spiderSite",
  "status": "completed",
  "created_at": "2025-10-27T14:30:00Z",
  "from_cache": true,
  "message": "Job results retrieved from cache (original job: 974ceeda-84fe-4634-bdcd-adc895c6bc75)"
}
Deduplication: Identical URLs crawled within 24 hours return cached results instantly (Redis cache with 24hr TTL).

AI Token Costs

AI features are opt-in. By default, no AI tokens are used (0 cost). Enable only the features you need.
Feature | AI Tokens | What You Get
Base crawl (no AI) | 0 tokens | Contact info + compendium
extract_company_info | ~500 tokens | Company vitals (name, summary, industry, services, target audience)
extract_team | ~500 tokens | Team members with names, titles, emails, LinkedIn
extract_pain_points | ~500 tokens | Business challenges inferred from content
CHAMP scoring | +1,500 tokens | Full CHAMP analysis + ICP fit score + personalization hooks
Total (all features) | ~3,000 tokens | Complete lead profile
Cost optimization: Start with basic crawl (0 tokens). Enable AI features only for high-value leads.

Processing Time

Scenario | Estimated Time
Simple site (5-10 pages) | 5-15 seconds
Medium site (10-20 pages) | 15-30 seconds
Large site (20-50 pages) | 30-60 seconds
SPA site (JavaScript-heavy) | +10-20 seconds
With AI extraction | +5-10 seconds
Full CHAMP analysis | 20-60 seconds total

Best Practices

Use AI features when:
  • Qualifying high-value leads
  • Building targeted outreach campaigns
  • Identifying decision makers
  • Scoring leads by ICP fit
Skip AI features for:
  • Bulk contact extraction
  • Budget-sensitive scraping
  • When you only need contact info
Choosing a cleanup level:
  • raw (100%): Academic research, legal compliance, full fidelity needed
  • fit (60%): General purpose, balances quality and size (default)
  • citations (70%): Academic papers, research documents with sources
  • minimal (30%): LLM consumption, token optimization, main content only
Choosing a crawl strategy:
  • bestfirst: Best for most use cases - intelligent prioritization
  • Sitemap-first (auto): Used automatically when sitemap.xml is discovered
  • bfs: When you need broad coverage across sections
  • dfs: When you need deep coverage of specific sections
Auto-detection works for:
  • React, Vue, Angular apps
  • Dynamically loaded content
  • Infinite scroll sites
Increase spa_timeout if:
  • Site loads slowly (>30s)
  • Content loads after initial render
  • You see incomplete data
Set enable_spa: false if:
  • Site is static HTML (faster processing)
  • You’re getting timeout errors unnecessarily

Common Use Cases

1. Basic Lead Generation (0 AI Tokens)

Extract contact info from company websites:
{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 10
  }
}
Returns: Emails, phones, addresses, social media, markdown compendium

2. Qualified Lead Scoring (CHAMP)

Full analysis for high-value prospects:
{
  "payload": {
    "url": "https://qualified-lead.com",
    "max_pages": 20,
    "extract_company_info": true,
    "extract_team": true,
    "extract_pain_points": true,
    "product_description": "Your product here...",
    "icp_description": "Your ICP here..."
  }
}
Returns: Full CHAMP analysis, ICP fit score, personalization hooks

3. Team Member Identification

Find decision makers and contacts:
{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 15,
    "target_pages": ["team", "about", "leadership", "management"],
    "extract_team": true
  }
}
Returns: Team members with names, titles, emails, LinkedIn

4. Competitor Analysis

Understand company positioning and offerings:
{
  "payload": {
    "url": "https://competitor.com",
    "max_pages": 25,
    "extract_company_info": true,
    "extract_pain_points": true,
    "compendium": {
      "cleanup_level": "citations",
      "max_chars": 200000
    }
  }
}
Returns: Company summary, services, target audience, pain points, detailed content

Limitations

Authentication: SpiderSite cannot scrape pages requiring login/authentication
CAPTCHAs: Sites with CAPTCHA protection cannot be scraped
Rate Limits: 100 requests per minute per API key
robots.txt: SpiderSite respects robots.txt directives

Next Steps