Overview
Submit a SpiderSite job to crawl websites with contact extraction, AI-powered company analysis, team member identification, and CHAMP lead scoring.
Version 2.7.0: Features the AI Context Engine with markdown compendiums, SPA auto-detection, sitemap-first crawling, and multilingual support (36+ languages).
Key Features
- Smart Crawling: Sitemap-first crawling with intelligent page prioritization
- Contact Extraction: Emails, phones, addresses, and 14 social platforms
- AI Analysis: Company vitals, team members, pain points
- Lead Scoring: CHAMP framework with ICP fit scoring
Request Body
Required Parameters
Website URL to crawl (must include https://).
Example: https://example.com
Crawl Configuration
Maximum pages to crawl (1-50). Higher values yield more data but slower processing.
Crawling strategy:
- `bestfirst`: intelligent prioritization (recommended)
- `bfs`: breadth-first search
- `dfs`: depth-first search
Page types to prioritize. Works with 36+ languages (e.g., "kontakt" in German, "contacto" in Spanish).
HTTP request timeout per page (10-120 seconds)
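Putting the crawl-configuration parameters above together, a request body might look like the sketch below. Only `enable_spa` and `spa_timeout` appear by name on this page; the `url`, `max_pages`, `strategy`, and `timeout` keys are assumed names following the same snake_case convention.

```python
# Hypothetical crawl-configuration payload; field names other than the
# documented strategy values ("bestfirst"/"bfs"/"dfs") are assumptions.
crawl_config = {
    "url": "https://example.com",  # assumed field name for the target URL
    "max_pages": 20,               # 1-50; more pages = more data, slower run
    "strategy": "bestfirst",       # intelligent prioritization (recommended)
    "timeout": 30,                 # per-page HTTP timeout, 10-120 seconds
}

# Sanity-check the documented ranges before submitting
assert 1 <= crawl_config["max_pages"] <= 50
assert crawl_config["strategy"] in {"bestfirst", "bfs", "dfs"}
assert 10 <= crawl_config["timeout"] <= 120
```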
SPA Support (v2.4.0)
Enable automatic SPA detection and Playwright rendering. Automatically detects JavaScript-heavy sites (React/Vue/Angular).
Playwright page load timeout (10-120 seconds). Only used when an SPA is detected; increase for slow-loading sites.
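The two SPA options can be combined with the base request like this (a sketch: `enable_spa` and `spa_timeout` are the documented names, while the `url` key and overall payload shape are assumptions):

```python
# Request payload for a JavaScript-heavy site (React/Vue/Angular)
spa_request = {
    "url": "https://app.example.com",  # assumed field name
    "enable_spa": True,   # auto-detect the SPA and render with Playwright
    "spa_timeout": 60,    # Playwright load timeout (10-120 s); raise for slow sites
}

assert 10 <= spa_request["spa_timeout"] <= 120
```

For plain static HTML sites, leaving `enable_spa` off (or setting it to `False`) skips rendering and speeds up processing.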
AI Features (Opt-In - 0 Tokens by Default)
Extract team members using AI (~500 tokens). Includes names, titles, emails, and LinkedIn profiles.
Extract company summary using AI (~500 tokens). Includes services, target audience, and industry.
Analyze business challenges using AI (~500 tokens). Infers pain points from news, blog posts, and job listings.
Lead Scoring (CHAMP Framework)
Your product description. Enables CHAMP lead scoring when combined with icp_description (+1,500 tokens).
Your ideal customer profile (ICP). Enables CHAMP lead scoring when combined with product_description (+1,500 tokens).
AI Context Engine (v2.7.0)
Markdown compendium configuration. Generates intelligent, deduplicated markdown optimized for LLMs.
Generate markdown compendium. Provides full transparency of scraped content.
Cleanup level:
- `raw`: complete conversion (100% baseline)
- `fit`: removes nav/ads/footers (~60% size)
- `citations`: academic format with sources (~70% size)
- `minimal`: main content only (~30% size, 70% token savings)
Maximum compendium size in characters (1,000-1,000,000). Content is truncated if exceeded.
Smart deduplication. Removes repeated headers/footers, saving 20-40% in size.
Include markdown in the API response. Set to false for large files (use the download URL instead).
Page separator in compendium
HTML tags to prioritize (minimal mode)
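A compendium configuration covering the options above might be sketched as follows. Only the cleanup-level values (`raw`/`fit`/`citations`/`minimal`) are documented verbatim; every key name here is an assumption for illustration.

```python
# Hypothetical compendium configuration block; key names are assumptions,
# the value ranges and cleanup levels come from the parameter docs above.
compendium_config = {
    "enabled": True,             # generate the markdown compendium
    "cleanup_level": "minimal",  # ~30% of raw size, ~70% token savings
    "max_size": 100_000,         # characters; truncated if exceeded (1,000-1,000,000)
    "deduplicate": True,         # strip repeated headers/footers (saves 20-40%)
    "include_in_response": False,  # for large files, use the download URL instead
}

assert compendium_config["cleanup_level"] in {"raw", "fit", "citations", "minimal"}
assert 1_000 <= compendium_config["max_size"] <= 1_000_000
```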
Priority
Job priority (0-10, higher = processed first)
Response
Unique job identifier (UUID format)
Always spiderSite for this endpoint
Initial job status (always queued)
Job creation timestamp (ISO 8601)
Whether this job was deduplicated from cache (24-hour TTL)
Confirmation message
Request Examples
- Minimal
- With AI Features
- Full CHAMP Scoring
- Compendium Minimal
- Compendium Disabled
- SPA Site
- Multilingual
- High Priority
- Partial Compendium
- Full Configuration
Most basic request: only the URL (contact extraction only, no AI).
What you get:
- Contact info (emails, phones, addresses)
- 14 social media platforms
- Markdown compendium (fit level)
- No AI tokens used (0 cost)
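Such a minimal request body contains nothing but the target URL. A sketch (the `url` field name is an assumption, as is the JSON payload shape):

```python
import json

# Minimal request: contact extraction and the default compendium run
# automatically; no AI features are enabled, so 0 AI tokens are used.
minimal_request = {"url": "https://example.com"}  # "url" key is assumed

payload = json.dumps(minimal_request)
assert json.loads(payload) == {"url": "https://example.com"}
```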
Example Response
201 Created
From Cache (Deduplication)
If the same URL was crawled in the last 24 hours:
201 Created - From Cache
Deduplication: Identical URLs crawled within 24 hours return cached results instantly (Redis cache with 24hr TTL).
AI Token Costs
AI features are opt-in. By default, no AI tokens are used (0 cost). Enable only the features you need.
| Feature | AI Tokens | What You Get |
|---|---|---|
| Base crawl (no AI) | 0 tokens | Contact info + compendium |
| `extract_company_info` | ~500 tokens | Company vitals (name, summary, industry, services, target audience) |
| `extract_team` | ~500 tokens | Team members with names, titles, emails, LinkedIn |
| `extract_pain_points` | ~500 tokens | Business challenges inferred from content |
| CHAMP scoring | +1,500 tokens | Full CHAMP analysis + ICP fit score + personalization hooks |
| Total (all features) | ~3,000 tokens | Complete lead profile |
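The per-feature costs in the table are additive, so a job's AI budget can be estimated up front. A small helper (the feature keys mirror the table; `estimate_tokens` is an illustrative name, not part of the API):

```python
# Approximate AI token cost per feature, taken from the table above
TOKEN_COSTS = {
    "extract_company_info": 500,
    "extract_team": 500,
    "extract_pain_points": 500,
    "champ_scoring": 1_500,  # product_description + icp_description together
}

def estimate_tokens(enabled_features):
    """Rough AI token budget for a job; the base crawl costs 0 tokens."""
    return sum(TOKEN_COSTS[f] for f in enabled_features)

assert estimate_tokens([]) == 0                 # base crawl, no AI
assert estimate_tokens(TOKEN_COSTS) == 3_000    # all features enabled
```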
Cost optimization: Start with basic crawl (0 tokens). Enable AI features only for high-value leads.
Processing Time
| Scenario | Estimated Time |
|---|---|
| Simple site (5-10 pages) | 5-15 seconds |
| Medium site (10-20 pages) | 15-30 seconds |
| Large site (20-50 pages) | 30-60 seconds |
| SPA site (JavaScript-heavy) | +10-20 seconds |
| With AI extraction | +5-10 seconds |
| Full CHAMP analysis | 20-60 seconds total |
Best Practices
When to use AI features
Use AI features when:
- Qualifying high-value leads
- Building targeted outreach campaigns
- Identifying decision makers
- Scoring leads by ICP fit

Skip AI features when:
- Doing bulk contact extraction
- Running budget-sensitive scraping
- You only need contact info
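For a high-value lead, all three extraction flags plus the two CHAMP description fields can be enabled in one request. The flag and field names below come from this page; the `url` key and overall payload shape are assumptions, and the description strings are made-up examples:

```python
# Sketch of a full-AI request for a high-value lead (~3,000 tokens total)
champ_request = {
    "url": "https://prospect.example.com",  # assumed field name
    "extract_company_info": True,   # ~500 tokens: company vitals
    "extract_team": True,           # ~500 tokens: names, titles, emails, LinkedIn
    "extract_pain_points": True,    # ~500 tokens: inferred business challenges
    # Setting BOTH descriptions below enables CHAMP scoring (+1,500 tokens)
    "product_description": "B2B analytics platform for mid-market SaaS teams",
    "icp_description": "SaaS companies, 50-500 employees, with in-house data teams",
}

assert "product_description" in champ_request and "icp_description" in champ_request
```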
Choosing cleanup level
- `raw` (100%): academic research, legal compliance, full fidelity needed
- `fit` (60%): general purpose, balances quality and size (default)
- `citations` (70%): academic papers, research documents with sources
- `minimal` (30%): LLM consumption, token optimization, main content only
Optimizing crawl strategy
- `bestfirst`: best for most use cases; intelligent prioritization
- Sitemap-first (auto): used automatically when sitemap.xml is discovered
- `bfs`: when you need broad coverage across sections
- `dfs`: when you need deep coverage of specific sections
SPA detection tips
Auto-detection works for:
- React, Vue, Angular apps
- Dynamically loaded content
- Infinite scroll sites
Increase spa_timeout if:
- Site loads slowly (>30s)
- Content loads after initial render
- You see incomplete data
Set enable_spa: false if:
- Site is static HTML (faster processing)
- You're getting timeout errors unnecessarily
Common Use Cases
1. Basic Lead Generation (0 AI Tokens)
Extract contact info from company websites:
2. Qualified Lead Scoring (CHAMP)
Full analysis for high-value prospects:
3. Team Member Identification
Find decision makers and contacts:
4. Competitor Analysis
Understand company positioning and offerings:
Limitations
Authentication: SpiderSite cannot scrape pages requiring login/authentication
CAPTCHAs: Sites with CAPTCHA protection cannot be scraped
Rate Limits: 100 requests per minute per API key
robots.txt: SpiderSite respects robots.txt directives
