Overview
This guide covers the most common SpiderIQ use case: extracting contact information from company websites without using any AI tokens. It shows you how to get emails, phone numbers, addresses, and social media profiles in 15-30 seconds.
Zero AI costs: This approach uses 0 AI tokens - you only pay for the crawling infrastructure.
What You Get
With basic contact extraction, SpiderIQ automatically finds and structures:
Email Addresses: filtered and validated emails (tracking emails removed)
Phone Numbers: all formats detected and normalized
Physical Addresses: full street addresses extracted
Social Media: 14 platforms (LinkedIn, Twitter, Facebook, Instagram, YouTube, GitHub, and more)
Plus: Markdown Compendium
Every crawl includes a smart markdown summary of the website (configurable from 30% to 100% of original size), giving you full transparency of what was found.
Quick Start
1. Submit a Job (Minimal Request)
The simplest possible request - just a URL:
curl -X POST https://spideriq.di-atomic.com/api/v1/jobs/spiderSite/submit \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "url": "https://example.com"
    }
  }'
Response:
{
  "job_id": "974ceeda-84fe-4634-bdcd-adc895c6bc75",
  "type": "spiderSite",
  "status": "queued",
  "created_at": "2025-10-27T14:30:00Z",
  "from_cache": false,
  "message": "SpiderSite job queued successfully. Estimated processing time: 15-30 seconds."
}
2. Poll for Results
Wait 15-30 seconds, then retrieve the results:
curl https://spideriq.di-atomic.com/api/v1/jobs/{job_id}/results \
  -H "Authorization: Bearer YOUR_API_KEY"
Here’s what a typical response looks like:
{
  "success": true,
  "job_id": "974ceeda-84fe-4634-bdcd-adc895c6bc75",
  "type": "spiderSite",
  "status": "completed",
  "processing_time_seconds": 12.4,
  "worker_id": "spider-site-main-1",
  "completed_at": "2025-10-27T14:30:15Z",
  "data": {
    "url": "https://example.com",
    "pages_crawled": 8,
    "crawl_status": "success",

    // Contact Information (Flat Structure)
    "emails": [
      "contact@example.com",
      "sales@example.com",
      "support@example.com"
    ],
    "phones": [
      "+1-555-123-4567",
      "+1-800-555-0100"
    ],
    "addresses": [
      "123 Main St, San Francisco, CA 94105",
      "456 Market St, Suite 200, San Francisco, CA 94105"
    ],

    // Social Media (All 14 platforms - null if not found)
    "linkedin": "https://linkedin.com/company/example",
    "twitter": "https://twitter.com/example",
    "facebook": "https://facebook.com/example",
    "instagram": "https://instagram.com/example",
    "youtube": "https://youtube.com/example",
    "github": "https://github.com/example",
    "tiktok": null,
    "pinterest": null,
    "medium": "https://medium.com/@example",
    "discord": null,
    "whatsapp": null,
    "telegram": null,
    "snapchat": null,
    "reddit": null,

    // Markdown Compendium
    "markdown_compendium": "# Example Company\n\nWe provide enterprise solutions...",
    "compendium": {
      "chars": 8450,
      "available": true,
      "cleanup_level": "fit",
      "storage_location": "inline"
    },

    // AI Features (all null - not enabled)
    "company_vitals": null,
    "pain_points": null,
    "lead_scoring": null,
    "team_members": [],
    "personalization_hooks": null,

    // Metadata
    "metadata": {
      "spa_enabled": true,
      "sitemap_used": true,
      "browser_rendering_available": true,
      "crawl_strategy": "sitemap",
      "total_emails_found": 3,
      "total_phones_found": 2
    }
  },
  "error_message": null
}
Understanding the Flat Structure
Breaking change (v2.7.1): Responses are now flat (2-3 levels max) instead of deeply nested (5 levels).
Old Structure (Pre-v2.7.1)
{
  "results": {
    "results": {
      "contact_info": {
        "emails": [...],
        "social_media": {
          "linkedin": "...",
          "twitter": "..."
        }
      }
    }
  }
}
New Structure (v2.7.1+)
{
  "data": {
    "emails": [...],
    "linkedin": "...",
    "twitter": "..."
  }
}
Benefits:
Easier integration (fewer levels to navigate)
Consistent structure (all fields always present)
Industry standard (similar to Firecrawl/Outscraper)
Python Example
# Get the results
results = response.json()
data = results['data']

# Extract contact info
emails = data['emails']
phones = data['phones']
addresses = data['addresses']

# Extract social media (filter out nulls)
social_media = {
    platform: url
    for platform in ['linkedin', 'twitter', 'facebook', 'instagram',
                     'youtube', 'github', 'tiktok', 'pinterest',
                     'medium', 'discord', 'whatsapp', 'telegram',
                     'snapchat', 'reddit']
    if (url := data.get(platform)) is not None
}

print(f"Found {len(emails)} emails: {emails}")
print(f"Found {len(phones)} phones: {phones}")
print(f"Found {len(social_media)} social profiles: {social_media}")

# Access markdown compendium
markdown = data.get('markdown_compendium')
if markdown:
    print(f"Content preview: {markdown[:200]}...")
JavaScript Example
// Get the results
const { data } = results;

// Extract contact info
const { emails, phones, addresses } = data;

// Extract social media (filter out nulls)
const socialPlatforms = [
  'linkedin', 'twitter', 'facebook', 'instagram',
  'youtube', 'github', 'tiktok', 'pinterest',
  'medium', 'discord', 'whatsapp', 'telegram',
  'snapchat', 'reddit'
];

const socialMedia = Object.fromEntries(
  socialPlatforms
    .map(platform => [platform, data[platform]])
    .filter(([_, url]) => url !== null)
);

console.log(`Found ${emails.length} emails:`, emails);
console.log(`Found ${phones.length} phones:`, phones);
console.log(`Found ${Object.keys(socialMedia).length} social profiles:`, socialMedia);

// Access markdown compendium
if (data.markdown_compendium) {
  console.log(`Content preview: ${data.markdown_compendium.substring(0, 200)}...`);
}
Customization Options
1. Crawl More Pages
Default is 10 pages. Increase for larger sites:
{
  "payload": {
    "url": "https://example.com",
    "max_pages": 25
  }
}
Processing time: ~1.5 seconds per page on average
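If you want a rough sense of how long a crawl will take before submitting it, the ~1.5 seconds per page figure gives a back-of-the-envelope estimate. The helper below is a sketch only, not part of the API; the SPA allowance reflects the +10-20 second overhead noted in the Processing Time section further down, and actual times vary with site speed.
def estimate_processing_seconds(max_pages: int, is_spa: bool = False) -> float:
    """Back-of-the-envelope estimate: ~1.5 s per page, plus ~15 s for SPA rendering."""
    return max_pages * 1.5 + (15.0 if is_spa else 0.0)

print(estimate_processing_seconds(25))               # ~37.5 seconds
print(estimate_processing_seconds(25, is_spa=True))  # ~52.5 seconds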
2. Target Specific Pages
Prioritize contact-related pages (works in 36+ languages):
{
  "payload": {
    "url": "https://example.com",
    "max_pages": 15,
    "target_pages": ["contact", "about", "team", "locations"]
  }
}
Multilingual examples:
German: ["kontakt", "über-uns", "team"]
Spanish: ["contacto", "acerca-de", "equipo"]
French: ["contact", "à-propos", "équipe"]
3. Optimize Compendium Size
Control markdown size (affects token usage if feeding to LLMs). Available options: Minimal (30% size), Fit (60% size), Raw (100% size), or Disabled.
The minimal level is best for LLM consumption (70% token savings):
{
  "payload": {
    "url": "https://example.com",
    "compendium": {
      "cleanup_level": "minimal",
      "max_chars": 50000
    }
  }
}
4. Handle JavaScript-Heavy Sites
SpiderIQ automatically detects SPAs (React/Vue/Angular), but you can force it:
{
  "payload": {
    "url": "https://modern-spa-site.com",
    "enable_spa": true,
    "spa_timeout": 45
  }
}
Auto-detection: SPA rendering is automatic in most cases. Only set enable_spa if you notice incomplete data.
Complete Working Example
Here’s a production-ready script that extracts contacts from multiple websites:
Python - Bulk Extraction
import requests
import time
from typing import Dict

class SpiderIQClient:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://spideriq.di-atomic.com/api/v1"
        self.headers = {"Authorization": f"Bearer {api_key}"}

    def submit_job(self, url: str, max_pages: int = 10) -> str:
        """Submit a contact extraction job"""
        response = requests.post(
            f"{self.base_url}/jobs/spiderSite/submit",
            headers=self.headers,
            json={
                "payload": {
                    "url": url,
                    "max_pages": max_pages
                }
            }
        )
        response.raise_for_status()
        return response.json()['job_id']

    def get_results(self, job_id: str, max_wait: int = 120) -> Dict:
        """Poll for job results with timeout"""
        url = f"{self.base_url}/jobs/{job_id}/results"
        for _ in range(max_wait // 3):
            response = requests.get(url, headers=self.headers)
            if response.status_code == 200:
                return response.json()
            elif response.status_code == 202:
                time.sleep(3)
            elif response.status_code == 410:
                error = response.json()
                raise Exception(f"Job failed: {error['error_message']}")
            else:
                response.raise_for_status()
        raise TimeoutError(f"Job {job_id} did not complete in {max_wait}s")

    def extract_contacts(self, url: str, max_pages: int = 10) -> Dict:
        """Submit job and wait for results (one-shot)"""
        job_id = self.submit_job(url, max_pages)
        print(f"Processing {url}... (job: {job_id})")
        return self.get_results(job_id)

# Usage: Extract contacts from multiple companies
client = SpiderIQClient("YOUR_API_KEY")

companies = [
    "https://company1.com",
    "https://company2.com",
    "https://company3.com"
]

for url in companies:
    try:
        results = client.extract_contacts(url, max_pages=15)
        data = results['data']
        print(f"\n✓ {url}")
        print(f"  Emails: {data['emails']}")
        print(f"  Phones: {data['phones']}")
        print(f"  LinkedIn: {data.get('linkedin', 'Not found')}")
        print(f"  Pages crawled: {data['pages_crawled']}")
    except Exception as e:
        print(f"\n✗ {url}: {e}")
Email Filtering
SpiderIQ automatically filters out tracking and garbage emails:
Filtered domains: sentry.io, wixpress.com, mailchimp.com, hubspot.com, google-analytics.com, and 20+ more tracking services
Example:
// Raw emails found:
[
  "contact@example.com",  // ✓ Real contact
  "noreply@sentry.io",    // ✗ Filtered (tracking)
  "info@example.com",     // ✓ Real contact
  "auto@wixpress.com"     // ✗ Filtered (tracking)
]

// Returned in response:
["contact@example.com", "info@example.com"]
Compare metadata.total_emails_found with the number of emails returned in the response to see the filtering impact.
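In Python, for instance, the number of filtered addresses is simply the difference between the two values. This sketch continues from the results object returned by the results endpoint, as in the Python example earlier.
# 'results' is the parsed JSON from the results endpoint (see the Python example above)
data = results['data']
filtered_out = data['metadata']['total_emails_found'] - len(data['emails'])
print(f"{filtered_out} tracking/garbage emails were filtered out")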
Deduplication (24-Hour Cache)
SpiderIQ automatically deduplicates crawls within 24 hours:
First Request (New URL)
{
  "job_id": "abc-123",
  "from_cache": false,
  "message": "SpiderSite job queued successfully"
}
Second Request (Same URL within 24h)
{
  "job_id": "def-456",
  "from_cache": true,
  "status": "completed",  // ← Instant response!
  "message": "Job results retrieved from cache (original job: abc-123)"
}
Save time & money: If you accidentally submit the same URL twice, the second request returns instantly with cached results.
Processing Time
Estimated processing time by scenario:
Small site (5-10 pages): 5-15 seconds
Medium site (10-20 pages): 15-30 seconds
Large site (20-50 pages): 30-60 seconds
SPA site (JavaScript-heavy): +10-20 seconds
Best Practices
Optimal max_pages setting
Start with 10-15 pages for most B2B websites. This typically covers:
Homepage
Contact page
About page
Team page
Key product/service pages
Increase to 20-30 for:
Large enterprises
Multi-location businesses
Companies with extensive blogs
Increase to 40-50 for:
Complete site mapping
Comprehensive competitor analysis
Target pages for contact extraction
Use the target_pages option (see "2. Target Specific Pages" above) to focus the crawl on contact, about, team, and location pages rather than spending the page budget on blog posts or product detail pages.
When to disable compendium
Disable compendium when:
You ONLY need contact info (not content)
Processing 1000+ URLs (save bandwidth)
Integrating with CRM (structured data only)
Keep compendium when:
Feeding content to LLMs
Doing market research
Analyzing company positioning
Building knowledge bases
Handling failed crawls
Common failure reasons:
Connection timeout - Site is slow or blocking
Robots.txt restriction - Site blocks crawlers
CAPTCHA protection - Site requires human verification
Invalid URL - URL is malformed or unreachable
Mitigation:
Increase timeout to 60-90 seconds for slow sites
Increase spa_timeout to 60 seconds for heavy SPAs
Check the error_message field for the specific failure reason (see the retry sketch below)
Verify URL is publicly accessible (not behind login)
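Putting the last two points together, here is a minimal retry sketch: it logs the error message surfaced by the failed job (the client above raises on HTTP 410 with the error_message included), then resubmits with SPA rendering forced and a longer spa_timeout. The extract_with_retry helper and its single-retry structure are illustrative, not part of the API.
import requests

def extract_with_retry(client, url):
    """Try a normal crawl first; on failure, retry once with SPA rendering forced."""
    try:
        return client.extract_contacts(url, max_pages=15)
    except Exception as first_error:
        print(f"First attempt failed: {first_error}")  # includes error_message for failed jobs
        # Resubmit with a more generous SPA timeout for slow, JavaScript-heavy sites
        response = requests.post(
            f"{client.base_url}/jobs/spiderSite/submit",
            headers=client.headers,
            json={"payload": {"url": url, "enable_spa": True, "spa_timeout": 60}},
        )
        response.raise_for_status()
        return client.get_results(response.json()["job_id"])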
Common Use Cases
1. CRM Enrichment
Enrich existing lead database with contact info:
import pandas as pd

# Load leads from CSV
leads = pd.read_csv('leads.csv')  # columns: company_name, website

client = SpiderIQClient("YOUR_API_KEY")

for idx, row in leads.iterrows():
    try:
        results = client.extract_contacts(row['website'], max_pages=15)
        data = results['data']

        # Update DataFrame
        leads.at[idx, 'emails'] = ', '.join(data['emails'])
        leads.at[idx, 'phones'] = ', '.join(data['phones'])
        leads.at[idx, 'linkedin'] = data.get('linkedin')
    except Exception as e:
        print(f"Failed {row['website']}: {e}")

# Save enriched data
leads.to_csv('leads_enriched.csv', index=False)
2. Prospecting Workflow
Find contact info for a list of target companies:
import pandas as pd

target_companies = [
    "https://techcorp1.com",
    "https://saas-company2.com",
    "https://enterprise3.com"
]

contacts_db = []

for url in target_companies:
    results = client.extract_contacts(url, max_pages=20)
    data = results['data']

    # Structure for export
    contacts_db.append({
        'company_url': url,
        'emails': data['emails'],
        'phones': data['phones'],
        'linkedin': data.get('linkedin'),
        'twitter': data.get('twitter'),
        'pages_crawled': data['pages_crawled']
    })

# Export to CSV/JSON
pd.DataFrame(contacts_db).to_csv('prospects.csv')
3. Competitor Monitoring
Track competitor contact changes over time:
import json
from datetime import datetime

competitor = "https://competitor.com"
results = client.extract_contacts(competitor, max_pages=30)

# Save snapshot with timestamp
snapshot = {
    'timestamp': datetime.now().isoformat(),
    'url': competitor,
    'data': results['data']
}

with open(f'competitor_snapshot_{datetime.now().strftime("%Y%m%d")}.json', 'w') as f:
    json.dump(snapshot, f, indent=2)
Rate Limits
100 requests per minute per API key
For bulk processing, add rate limiting:
import time

def rate_limited_extract(client, urls, requests_per_minute=90):
    """Extract contacts with rate limiting"""
    delay = 60.0 / requests_per_minute

    for url in urls:
        start = time.time()
        try:
            results = client.extract_contacts(url)
            yield url, results
        except Exception as e:
            yield url, {'error': str(e)}

        # Rate limiting delay
        elapsed = time.time() - start
        if elapsed < delay:
            time.sleep(delay - elapsed)

# Usage
for url, results in rate_limited_extract(client, companies):
    print(f"Processed: {url}")
Next Steps