POST https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit
Submit SpiderSite Job
curl --request POST \
  --url https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "payload": {
    "url": "<string>",
    "max_pages": 123,
    "crawl_strategy": {},
    "target_pages": [
      {}
    ],
    "timeout": 123,
    "enable_spa": true,
    "spa_timeout": 123,
    "extract_team": true,
    "extract_company_info": true,
    "extract_pain_points": true,
    "product_description": "<string>",
    "icp_description": "<string>",
    "compendium": {
      "enabled": true,
      "cleanup_level": {},
      "max_chars": 123,
      "remove_duplicates": true,
      "include_in_response": true,
      "separator": "<string>",
      "priority_sections": [
        {}
      ]
    },
    "custom_ai_prompt": {
      "enabled": true,
      "system_prompt": "<string>",
      "user_prompt": "<string>",
      "json_schema": {},
      "output_field_name": "<string>",
      "model": "<string>",
      "temperature": 123,
      "max_tokens": 123
    },
    "fuzziq_enabled": true,
    "fuzziq_unique_only": true
  },
  "priority": 123
}
'
{
  "job_id": "<string>",
  "type": "<string>",
  "status": "<string>",
  "created_at": "<string>",
  "from_cache": true,
  "message": "<string>"
}

Overview

Submit a SpiderSite job to crawl websites with contact extraction, AI-powered company analysis, team member identification, and CHAMP lead scoring.
Version 2.10.0: Now includes Custom AI Prompts for tailored analysis, plus AI Context Engine with markdown compendiums, SPA auto-detection, sitemap-first crawling, and multilingual support (36+ languages).

Key Features

Smart Crawling: Sitemap-first with intelligent page prioritization
Contact Extraction: Emails, phones, addresses, 14 social platforms
AI Analysis: Company vitals, team members, pain points
Lead Scoring: CHAMP framework with ICP fit scoring
Custom AI Prompts: Your own prompts for tailored analysis (v2.10.0)
R2 Storage: Large compendiums stored with presigned URLs

Request Body

payload (object, required): Job configuration payload
priority (integer, default: 0): Job priority (0-10; higher values are processed first)

Response

job_id (string, required): Unique job identifier (UUID format)
type (string, required): Always spiderSite for this endpoint
status (string, required): Initial job status (queued for a new job; completed when results are served from cache)
created_at (string, required): Job creation timestamp (ISO 8601)
from_cache (boolean, required): Whether this job was deduplicated from cache (24-hour TTL)
message (string, required): Confirmation message

Request Examples

Most basic request - only URL (contact extraction only, no AI):
curl -X POST https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit \
  -H "Authorization: Bearer <your_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "url": "https://example.com"
    }
  }'
What you get:
  • Contact info (emails, phones, addresses)
  • 14 social media platforms
  • Markdown compendium (fit cleanup level, the default)
  • No AI tokens used (0 cost)
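For callers who prefer Python to curl, the same request can be assembled with the standard library. This is a minimal sketch, not an official client: the helper names `build_submit_request` and `submit` are hypothetical, and the 0-10 priority check simply mirrors the range stated under Request Body.

```python
import json
import urllib.request

SUBMIT_URL = "https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit"


def build_submit_request(token: str, url: str, priority: int = 0, **options) -> urllib.request.Request:
    """Build (but do not send) a POST request for the submit endpoint.

    Extra keyword arguments (for example max_pages=10) are copied into
    the payload object unchanged.
    """
    if not 0 <= priority <= 10:
        raise ValueError("priority must be between 0 and 10")
    body = {"payload": {"url": url, **options}, "priority": priority}
    return urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.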

Example Response

201 Created
{
  "job_id": "974ceeda-84fe-4634-bdcd-adc895c6bc75",
  "type": "spiderSite",
  "status": "queued",
  "created_at": "2025-10-27T14:30:00Z",
  "from_cache": false,
  "message": "SpiderSite job queued successfully. Estimated processing time: 15-30 seconds."
}

From Cache (Deduplication)

If the same URL was crawled in the last 24 hours:
201 Created - From Cache
{
  "job_id": "abc12345-6789-4def-ghij-klmnopqrstuv",
  "type": "spiderSite",
  "status": "completed",
  "created_at": "2025-10-27T14:30:00Z",
  "from_cache": true,
  "message": "Job results retrieved from cache (original job: 974ceeda-84fe-4634-bdcd-adc895c6bc75)"
}
Deduplication: Identical URLs crawled within 24 hours return cached results instantly (Redis cache with 24hr TTL).
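The two response shapes above differ only in `status` and `from_cache`, so a client can treat them uniformly. The sketch below parses the documented fields into a dataclass; the `needs_polling` property is a hypothetical convenience that assumes results for a queued job are fetched later through some other call, which this page does not describe.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SubmitResponse:
    job_id: str
    type: str
    status: str
    created_at: datetime
    from_cache: bool
    message: str

    @property
    def needs_polling(self) -> bool:
        # A cached job comes back as "completed"; a new one as "queued".
        return self.status != "completed"


def parse_submit_response(body: dict) -> SubmitResponse:
    """Convert the decoded JSON body into a typed object."""
    return SubmitResponse(
        job_id=body["job_id"],
        type=body["type"],
        status=body["status"],
        # fromisoformat() on older Python versions rejects a trailing "Z".
        created_at=datetime.fromisoformat(body["created_at"].replace("Z", "+00:00")),
        from_cache=bool(body["from_cache"]),
        message=body["message"],
    )
```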

AI Token Costs

AI features are opt-in. By default, no AI tokens are used (0 cost). Enable only the features you need.
Feature | AI Tokens | What You Get
Base crawl (no AI) | 0 tokens | Contact info + compendium
extract_company_info | ~500 tokens | Company vitals (name, summary, industry, services, target audience)
extract_team | ~500 tokens | Team members with names, titles, emails, LinkedIn
extract_pain_points | ~500 tokens | Business challenges inferred from content
CHAMP scoring | +1,500 tokens | Full CHAMP analysis + ICP fit score + personalization hooks
Total (all features) | ~3,000 tokens | Complete lead profile
Cost optimization: Start with basic crawl (0 tokens). Enable AI features only for high-value leads.
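To budget before submitting a batch, the approximate figures in the table can be summed per payload. This is a rough sketch only: `estimate_ai_tokens` is a hypothetical helper, the numbers are the table's approximations, and `champ_scoring` is a caller-supplied flag because this page does not state which field switches CHAMP scoring on. Custom AI prompts are not in the table and are not counted.

```python
# Approximate per-feature costs, copied from the table above.
TOKEN_ESTIMATES = {
    "extract_company_info": 500,
    "extract_team": 500,
    "extract_pain_points": 500,
}
CHAMP_TOKENS = 1500


def estimate_ai_tokens(payload: dict, champ_scoring: bool = False) -> int:
    """Return a rough token estimate for one job payload."""
    total = sum(cost for flag, cost in TOKEN_ESTIMATES.items() if payload.get(flag))
    if champ_scoring:
        total += CHAMP_TOKENS
    return total
```

For example, a payload with all three `extract_*` flags set and `champ_scoring=True` adds up to the table's total of about 3,000 tokens.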

Processing Time

Scenario | Estimated Time
Simple site (5-10 pages) | 5-15 seconds
Medium site (10-20 pages) | 15-30 seconds
Large site (20-50 pages) | 30-60 seconds
SPA site (JavaScript-heavy) | +10-20 seconds
With AI extraction | +5-10 seconds
Full CHAMP analysis | 20-60 seconds total
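These estimates can be turned into a client-side wait budget, for example to decide how long to wait before treating a job as stalled. The helper below is a hypothetical sketch that adds the table's ranges together. It assumes boundary page counts (10 and 20) fall into the smaller bucket, reuses the large-site range for anything above 50 pages, and does not model the separate "Full CHAMP analysis" row.

```python
def estimate_processing_seconds(pages: int, spa: bool = False, ai: bool = False) -> tuple[int, int]:
    """Return a (low, high) estimate in seconds based on the table above."""
    if pages <= 10:
        low, high = 5, 15      # simple site
    elif pages <= 20:
        low, high = 15, 30     # medium site
    else:
        low, high = 30, 60     # large site (also used beyond 50 pages)
    if spa:
        low, high = low + 10, high + 20   # SPA rendering overhead
    if ai:
        low, high = low + 5, high + 10    # AI extraction overhead
    return low, high
```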

Best Practices

Use AI features when:
  • Qualifying high-value leads
  • Building targeted outreach campaigns
  • Identifying decision makers
  • Scoring leads by ICP fit
Skip AI features when:
  • Extracting contacts in bulk
  • Scraping on a tight budget
  • You only need contact info
Compendium cleanup levels:
  • raw (100%): Academic research, legal compliance, full fidelity needed
  • fit (60%): General purpose, balances quality and size (default)
  • citations (70%): Academic papers, research documents with sources
  • minimal (30%): LLM consumption, token optimization, main content only
Crawl strategies:
  • bestfirst: Best for most use cases - intelligent prioritization
  • Sitemap-first (auto): Used automatically when sitemap.xml discovered
  • bfs: When you need broad coverage across sections
  • dfs: When you need deep coverage of specific sections
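Because a misspelled level or strategy is easy to miss in a JSON body, it can help to validate the names client-side before submitting. The sketch below is illustrative: `crawl_options` is a hypothetical helper, and it assumes `crawl_strategy` is sent as a plain string in the same way `cleanup_level` is in the examples on this page. Options left as `None` are omitted so the server defaults apply.

```python
from typing import Optional

CLEANUP_LEVELS = {"raw", "fit", "citations", "minimal"}
CRAWL_STRATEGIES = {"bestfirst", "bfs", "dfs"}


def crawl_options(cleanup_level: Optional[str] = None, crawl_strategy: Optional[str] = None) -> dict:
    """Return payload fragments for the chosen options, validating names."""
    options: dict = {}
    if cleanup_level is not None:
        if cleanup_level not in CLEANUP_LEVELS:
            raise ValueError(f"cleanup_level must be one of {sorted(CLEANUP_LEVELS)}")
        options["compendium"] = {"cleanup_level": cleanup_level}
    if crawl_strategy is not None:
        if crawl_strategy not in CRAWL_STRATEGIES:
            raise ValueError(f"crawl_strategy must be one of {sorted(CRAWL_STRATEGIES)}")
        options["crawl_strategy"] = crawl_strategy
    return options
```

The returned dictionary can be merged into the `payload` object alongside `url`.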
Auto-detection works for:
  • React, Vue, Angular apps
  • Dynamically loaded content
  • Infinite scroll sites
Increase spa_timeout if:
  • Site loads slowly (>30s)
  • Content loads after initial render
  • You see incomplete data
Set enable_spa: false if:
  • Site is static HTML (faster processing)
  • You’re getting timeout errors unnecessarily

Common Use Cases

1. Basic Lead Generation (0 AI Tokens)

Extract contact info from company websites:
{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 10
  }
}
Returns: Emails, phones, addresses, social media, markdown compendium

2. Qualified Lead Scoring (CHAMP)

Full analysis for high-value prospects:
{
  "payload": {
    "url": "https://qualified-lead.com",
    "max_pages": 20,
    "extract_company_info": true,
    "extract_team": true,
    "extract_pain_points": true,
    "product_description": "Your product here...",
    "icp_description": "Your ICP here..."
  }
}
Returns: Full CHAMP analysis, ICP fit score, personalization hooks

3. Team Member Identification

Find decision makers and contacts:
{
  "payload": {
    "url": "https://target-company.com",
    "max_pages": 15,
    "target_pages": ["team", "about", "leadership", "management"],
    "extract_team": true
  }
}
Returns: Team members with names, titles, emails, LinkedIn

4. Competitor Analysis

Understand company positioning and offerings:
{
  "payload": {
    "url": "https://competitor.com",
    "max_pages": 25,
    "extract_company_info": true,
    "extract_pain_points": true,
    "compendium": {
      "cleanup_level": "citations",
      "max_chars": 200000
    }
  }
}
Returns: Company summary, services, target audience, pain points, detailed content

5. Custom AI Analysis (v2.10.0)

Extract industry-specific or custom data:
{
  "payload": {
    "url": "https://fintech-company.com",
    "max_pages": 15,
    "custom_ai_prompt": {
      "enabled": true,
      "system_prompt": "You are a fintech industry analyst.",
      "user_prompt": "Extract: 1) Regulatory licenses held, 2) Banking partners mentioned, 3) Funding history, 4) Key product features",
      "output_field_name": "fintech_analysis"
    }
  }
}
Returns: Custom structured data in data.fintech_analysis
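The custom prompt body can be generated from a small helper, with the result read back from `data.<output_field_name>`. This is a sketch under assumptions: both function names are hypothetical, and `json_schema` is assumed to accept a standard JSON Schema object, since the request template only shows it as an empty object.

```python
from typing import Any, Optional


def custom_prompt_payload(
    url: str,
    system_prompt: str,
    user_prompt: str,
    output_field_name: str,
    json_schema: Optional[dict] = None,
    max_pages: int = 15,
) -> dict:
    """Build a request body that enables a custom AI prompt."""
    prompt: dict = {
        "enabled": True,
        "system_prompt": system_prompt,
        "user_prompt": user_prompt,
        "output_field_name": output_field_name,
    }
    if json_schema is not None:
        # Assumed to be a JSON Schema describing the desired output shape.
        prompt["json_schema"] = json_schema
    return {"payload": {"url": url, "max_pages": max_pages, "custom_ai_prompt": prompt}}


def read_custom_output(result: dict, output_field_name: str) -> Any:
    """Return data.<output_field_name> from a job result, or None if absent."""
    return result.get("data", {}).get(output_field_name)
```

Using the same `output_field_name` constant for both calls avoids a mismatch between the request and the lookup.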

Large Compendiums (R2 Storage)

When a markdown compendium is too large to include inline, it’s automatically stored in Cloudflare R2 with a presigned download URL:
{
  "data": {
    "markdown_compendium": null,
    "compendium": {
      "storage_location": "r2",
      "r2_url": "https://cdn.spideriq.di-atomic.com/compendiums/abc123.md?X-Amz-Signature=...",
      "url_expires_at": "2025-10-28T12:00:00Z",
      "size_chars": 500000
    }
  }
}
R2 URLs expire after 24 hours. Download the content promptly or re-request the job.
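A client can handle both storage modes with one function: return the inline text when present, otherwise download from the presigned URL if it has not expired. The sketch below is illustrative; `load_compendium` is a hypothetical helper, and it assumes that `data.markdown_compendium` holds the markdown as a string when the compendium is small enough to be returned inline.

```python
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Optional


def _download(url: str) -> str:
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def load_compendium(
    result: dict,
    fetch: Callable[[str], str] = _download,
    now: Optional[datetime] = None,
) -> str:
    """Return the markdown compendium, fetching it from R2 when needed."""
    data = result["data"]
    inline = data.get("markdown_compendium")
    if inline is not None:
        return inline  # assumed: inline compendiums arrive as a string
    info = data["compendium"]
    if info.get("storage_location") != "r2":
        raise ValueError("no inline compendium and no R2 location")
    expires = datetime.fromisoformat(info["url_expires_at"].replace("Z", "+00:00"))
    current = now or datetime.now(timezone.utc)
    if current >= expires:
        raise RuntimeError("presigned URL has expired; submit the job again")
    return fetch(info["r2_url"])
```

The `fetch` and `now` parameters exist only so the function can be exercised without network access or a real clock.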

Limitations

Authentication: SpiderSite cannot scrape pages requiring login/authentication
CAPTCHAs: Sites with CAPTCHA protection cannot be scraped
Rate Limits: 100 requests per minute per API key
robots.txt: SpiderSite respects robots.txt directives
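To stay under the 100 requests per minute limit when submitting in bulk, a client-side limiter can pace calls. The class below is a generic sliding-window sketch, not part of SpiderSite; the name `SlidingWindowLimiter` is hypothetical, and how the server itself counts requests (fixed or sliding window) is not stated here, so a margin below 100 may be prudent.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling window of `period` seconds."""

    def __init__(self, max_calls: int = 100, period: float = 60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def _evict(self, now: float) -> None:
        # Drop timestamps that have left the rolling window.
        while self._calls and now - self._calls[0] >= self._period:
            self._calls.popleft()

    def acquire(self) -> float:
        """Block until a call is allowed; return the seconds spent waiting."""
        now = self._clock()
        self._evict(now)
        waited = 0.0
        while len(self._calls) >= self._max_calls:
            delay = self._calls[0] + self._period - now
            self._sleep(delay)
            waited += delay
            now = self._clock()
            self._evict(now)
        self._calls.append(now)
        return waited
```

Call `acquire()` immediately before each submit request; it returns at once while the window has room and sleeps only when the limit would otherwise be exceeded.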
