POST /api/v1/jobs/spiderSite/submit
Submit SpiderSite Job
curl --request POST \
  --url https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "payload": {
    "url": "<string>",
    "max_pages": 123,
    "crawl_strategy": "<string>",
    "target_pages": ["<string>"],
    "timeout": 123,
    "enable_spa": true,
    "spa_timeout": 123,
    "extract_team": true,
    "extract_company_info": true,
    "extract_pain_points": true,
    "product_description": "<string>",
    "icp_description": "<string>",
    "compendium": {
      "enabled": true,
      "cleanup_level": "<string>",
      "max_chars": 123,
      "remove_duplicates": true,
      "include_in_response": true,
      "separator": "<string>",
      "priority_sections": ["<string>"]
    }
  },
  "priority": 123
}'
{
  "job_id": "<string>",
  "type": "<string>",
  "status": "<string>",
  "created_at": "<string>",
  "from_cache": true,
  "message": "<string>"
}

Overview

Submit a SpiderSite job to crawl websites with contact extraction, AI-powered company analysis, team member identification, and CHAMP lead scoring.
Version 2.7.0: Features AI Context Engine with markdown compendiums, SPA auto-detection, sitemap-first crawling, and multilingual support (36+ languages).

Key Features

  • Smart Crawling: Sitemap-first with intelligent page prioritization
  • Contact Extraction: Emails, phones, addresses, 14 social platforms
  • AI Analysis: Company vitals, team members, pain points
  • Lead Scoring: CHAMP framework with ICP fit scoring

Request Body

Required Parameters

payload.url
string
required
Website URL to crawl (must include https://). Example: https://example.com

Crawl Configuration

payload.max_pages
integer
default:"10"
Maximum pages to crawl (1-50). Higher values mean more data but slower processing
payload.crawl_strategy
enum
default:"bestfirst"
Crawling strategy:
  • bestfirst - Intelligent prioritization (recommended)
  • bfs - Breadth-first search
  • dfs - Depth-first search
Note: Sitemap-first is used automatically if sitemap is discovered
payload.target_pages
array
Page types to prioritize. Works with 36+ languages (e.g., “kontakt” in German, “contacto” in Spanish)
payload.timeout
integer
default:"30"
HTTP request timeout per page (10-120 seconds)
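As a quick illustration, a request body combining these crawl parameters might look like the following sketch (values are illustrative, not recommendations):
{
  "payload": {
    "url": "https://example.com",
    "max_pages": 20,
    "crawl_strategy": "bestfirst",
    "target_pages": ["contact", "about", "team"],
    "timeout": 60
  }
}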

SPA Support (v2.4.0)

payload.enable_spa
boolean
default:"true"
Enable automatic SPA detection and Playwright rendering. Automatically detects JavaScript-heavy sites (React/Vue/Angular)
payload.spa_timeout
integer
default:"30"
Playwright page load timeout (10-120 seconds). Only used when SPA is detected. Increase for slow-loading sites.
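For a slow, JavaScript-heavy site, you might keep auto-detection on and raise the Playwright timeout; a sketch (the URL and timeout value are placeholders):
{
  "payload": {
    "url": "https://spa-example.com",
    "enable_spa": true,
    "spa_timeout": 60
  }
}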

AI Features (Opt-In - 0 Tokens by Default)

payload.extract_team
boolean
default:"false"
Extract team members using AI (~500 tokens). Includes names, titles, emails, LinkedIn profiles
payload.extract_company_info
boolean
default:"false"
Extract company summary using AI (~500 tokens). Includes services, target audience, industry
payload.extract_pain_points
boolean
default:"false"
Analyze business challenges using AI (~500 tokens). Infers pain points from news, blog posts, job listings
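To enable all three AI extractors on a single crawl (roughly 1,500 tokens per the cost table below), a request body might look like this sketch:
{
  "payload": {
    "url": "https://example.com",
    "extract_company_info": true,
    "extract_team": true,
    "extract_pain_points": true
  }
}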

Lead Scoring (CHAMP Framework)

payload.product_description
string
Your product description. Enables CHAMP lead scoring when combined with icp_description (+1,500 tokens)
payload.icp_description
string
Your ideal customer profile (ICP). Enables CHAMP lead scoring when combined with product_description (+1,500 tokens)
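CHAMP scoring requires both fields together; a minimal sketch with placeholder descriptions:
{
  "payload": {
    "url": "https://example.com",
    "product_description": "B2B analytics platform for e-commerce teams",
    "icp_description": "Mid-market online retailers with 50-500 employees"
  }
}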

AI Context Engine (v2.7.0)

payload.compendium
object
Markdown compendium configuration. Generates intelligent, deduplicated markdown optimized for LLMs
payload.compendium.enabled
boolean
default:"true"
Generate a markdown compendium. Provides full transparency of scraped content
payload.compendium.cleanup_level
enum
default:"fit"
Cleanup level:
  • raw - Complete conversion (100% baseline)
  • fit - Remove nav/ads/footers (~60% size)
  • citations - Academic format with sources (~70% size)
  • minimal - Main content only (~30% size, 70% token savings)
payload.compendium.max_chars
integer
default:"100000"
Maximum compendium size in characters (1,000-1,000,000). Content is truncated if exceeded
payload.compendium.remove_duplicates
boolean
default:"true"
Smart deduplication. Removes repeated headers/footers and saves 20-40% in size.
payload.compendium.include_in_response
boolean
default:"true"
Include markdown in the API response. Set to false for large files (use the download URL instead)
payload.compendium.separator
string
default:"\n\n---\n\n"
Page separator in compendium
payload.compendium.priority_sections
array
default:"[\"main\", \"article\", \"content\"]"
HTML tags to prioritize (minimal mode)
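As an example, a compendium tuned for LLM consumption of a large site might combine the minimal cleanup level with the download-URL workflow; a sketch (values are illustrative):
{
  "payload": {
    "url": "https://example.com",
    "compendium": {
      "enabled": true,
      "cleanup_level": "minimal",
      "max_chars": 50000,
      "remove_duplicates": true,
      "include_in_response": false
    }
  }
}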

Priority

priority
integer
default:"0"
Job priority (0-10, higher = processed first)
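Note that priority sits at the top level of the request body, alongside payload rather than inside it, as in this sketch:
{
  "payload": {
    "url": "https://example.com"
  },
  "priority": 10
}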

Response

job_id
string
required
Unique job identifier (UUID format)
type
string
required
Always spiderSite for this endpoint
status
string
required
Initial job status (always queued)
created_at
string
required
Job creation timestamp (ISO 8601)
from_cache
boolean
required
Whether this job was deduplicated from cache (24-hour TTL)
message
string
required
Confirmation message

Request Examples

Example variations include: Minimal, With AI Features, Full CHAMP Scoring, Compendium Minimal, Compendium Disabled, SPA Site, Multilingual, High Priority, Partial Compendium, and Full Configuration.
Minimal: the most basic request, only a URL (contact extraction only, no AI):
curl -X POST https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "url": "https://example.com"
    }
  }'
What you get:
  • Contact info (emails, phones, addresses)
  • 14 social media platforms
  • Markdown compendium (fit level)
  • No AI tokens used (0 cost)
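The Multilingual variation listed above follows the same pattern; a plausible sketch using the German “kontakt” page name mentioned earlier (the URL is a placeholder):
{
  "payload": {
    "url": "https://example.de",
    "max_pages": 10,
    "target_pages": ["kontakt", "team"]
  }
}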

Example Response

201 Created
{
  "job_id": "974ceeda-84fe-4634-bdcd-adc895c6bc75",
  "type": "spiderSite",
  "status": "queued",
  "created_at": "2025-10-27T14:30:00Z",
  "from_cache": false,
  "message": "SpiderSite job queued successfully. Estimated processing time: 15-30 seconds."
}

From Cache (Deduplication)

If the same URL was crawled in the last 24 hours:
201 Created - From Cache
{
  "job_id": "abc12345-6789-4def-ghij-klmnopqrstuv",
  "type": "spiderSite",
  "status": "completed",
  "created_at": "2025-10-27T14:30:00Z",
  "from_cache": true,
  "message": "Job results retrieved from cache (original job: 974ceeda-84fe-4634-bdcd-adc895c6bc75)"
}
Deduplication: Identical URLs crawled within 24 hours return cached results instantly (Redis cache with 24hr TTL).

AI Token Costs

AI features are opt-in. By default, no AI tokens are used (0 cost). Enable only the features you need.
Feature | AI Tokens | What You Get
Base crawl (no AI) | 0 tokens | Contact info + compendium
extract_company_info | ~500 tokens | Company vitals (name, summary, industry, services, target audience)
extract_team | ~500 tokens | Team members with names, titles, emails, LinkedIn
extract_pain_points | ~500 tokens | Business challenges inferred from content
CHAMP scoring | +1,500 tokens | Full CHAMP analysis + ICP fit score + personalization hooks
Total (all features) | ~3,000 tokens | Complete lead profile
Cost optimization: Start with basic crawl (0 tokens). Enable AI features only for high-value leads.

Processing Time

Scenario | Estimated Time
Simple site (5-10 pages) | 5-15 seconds
Medium site (10-20 pages) | 15-30 seconds
Large site (20-50 pages) | 30-60 seconds
SPA site (JavaScript-heavy) | +10-20 seconds
With AI extraction | +5-10 seconds
Full CHAMP analysis | 20-60 seconds total

Best Practices

Use AI features when:
  • Qualifying high-value leads
  • Building targeted outreach campaigns
  • Identifying decision makers
  • Scoring leads by ICP fit
Skip AI features for:
  • Bulk contact extraction
  • Budget-sensitive scraping
  • When you only need contact info
Choosing a cleanup level:
  • raw (100%): Academic research, legal compliance, full fidelity needed
  • fit (60%): General purpose, balances quality and size (default)
  • citations (70%): Academic papers, research documents with sources
  • minimal (30%): LLM consumption, token optimization, main content only
Choosing a crawl strategy:
  • bestfirst: Best for most use cases - intelligent prioritization
  • Sitemap-first (auto): Used automatically when sitemap.xml is discovered
  • bfs: When you need broad coverage across sections
  • dfs: When you need deep coverage of specific sections
Auto-detection works for:
  • React, Vue, Angular apps
  • Dynamically loaded content
  • Infinite scroll sites
Increase spa_timeout if:
  • Site loads slowly (>30s)
  • Content loads after initial render
  • You see incomplete data
Set enable_spa: false if:
  • Site is static HTML (faster processing)
  • You’re getting timeout errors unnecessarily

Common Use Cases

1. Basic Lead Generation (0 AI Tokens)

Extract contact info from company websites:
{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 10
  }
}
Returns: Emails, phones, addresses, social media, markdown compendium

2. Qualified Lead Scoring (CHAMP)

Full analysis for high-value prospects:
{
  "payload": {
    "url": "https://qualified-lead.com",
    "max_pages": 20,
    "extract_company_info": true,
    "extract_team": true,
    "extract_pain_points": true,
    "product_description": "Your product here...",
    "icp_description": "Your ICP here..."
  }
}
Returns: Full CHAMP analysis, ICP fit score, personalization hooks

3. Team Member Identification

Find decision makers and contacts:
{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 15,
    "target_pages": ["team", "about", "leadership", "management"],
    "extract_team": true
  }
}
Returns: Team members with names, titles, emails, LinkedIn

4. Competitor Analysis

Understand company positioning and offerings:
{
  "payload": {
    "url": "https://competitor.com",
    "max_pages": 25,
    "extract_company_info": true,
    "extract_pain_points": true,
    "compendium": {
      "cleanup_level": "citations",
      "max_chars": 200000
    }
  }
}
Returns: Company summary, services, target audience, pain points, detailed content

Limitations

Authentication: SpiderSite cannot scrape pages requiring login/authentication
CAPTCHAs: Sites with CAPTCHA protection cannot be scraped
Rate Limits: 100 requests per minute per API key
robots.txt: SpiderSite respects robots.txt directives

Next Steps