POST /upload/scrape_webpage
Scrape Webpage
Sample Response

{
  "file_id": "CortexDoc1234",
  "message": "<string>",
  "success": true
}
Hit the Try it button to try this API now in our playground. It’s the best way to check the full request and response in one place, customize your parameters, and generate ready-to-use code snippets.

Sample Request

curl --request POST \
  --url 'https://api.usecortex.ai/upload/scrape_webpage?web_url=https%3A%2F%2Fwww.usecortex.ai%2F&tenant_id=tenant_1234&sub_tenant_id=sub_tenant_4567&file_id=CortexDoc1234' \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/x-www-form-urlencoded'
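
For reference, here is the same sample request in Python using the requests library. YOUR_API_KEY and the parameter values are the placeholders from the curl example above:

import requests

# Same call as the curl sample above; all values are sample placeholders.
response = requests.post(
    "https://api.usecortex.ai/upload/scrape_webpage",
    params={
        "web_url": "https://www.usecortex.ai/",
        "tenant_id": "tenant_1234",
        "sub_tenant_id": "sub_tenant_4567",
        "file_id": "CortexDoc1234",
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
response.raise_for_status()
print(response.json())  # {'file_id': 'CortexDoc1234', 'message': '...', 'success': True}
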
Scrape and process webpage content directly from a URL. The webpage will be scraped, processed, chunked, and indexed for search and retrieval.

Webpage Scraping & Processing Pipeline

When you scrape a webpage, it goes through a specialized processing pipeline designed for web content:

1. Immediate URL Processing & Queue

  • Your webpage URL is immediately accepted and validated
  • The URL is added to our scraping queue for background processing
  • You receive a confirmation response with a file_id for tracking

2. Web Scraping Phase

Our system automatically handles:
  • URL Validation: Ensuring the URL is accessible and valid
  • Content Scraping: Extracting text, HTML, and metadata from the webpage
  • Structure Analysis: Understanding page layout, headers, and content hierarchy
  • Link Extraction: Identifying and processing internal and external links
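
As a rough illustration of what this phase involves (a minimal sketch, not Cortex's internal scraper), the same four steps in Python using requests and BeautifulSoup:

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def scrape(url: str) -> dict:
    # URL validation: scheme and host must be present.
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"invalid URL: {url}")

    # Content scraping: fetch the raw HTML.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Structure analysis: collect headings to recover the content hierarchy.
    headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])]

    # Link extraction: resolve relative links against the page URL.
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

    return {"text": soup.get_text(" ", strip=True), "headings": headings, "links": links}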

3. Content Processing & Cleaning

  • HTML Parsing: Converting HTML to clean, structured text
  • Content Filtering: Removing navigation, ads, and irrelevant content
  • Text Normalization: Cleaning and standardizing web content
  • Language Detection: Identifying the webpage’s language for optimal processing
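
Again purely illustrative (Cortex's filtering rules are not public), the gist of this step in Python, with the langdetect package standing in as one common option for language detection:

import re

from bs4 import BeautifulSoup
from langdetect import detect  # one common choice; not necessarily what Cortex uses

def clean(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    # Content filtering: drop navigation, boilerplate, and script/style tags.
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()

    # Text normalization: collapse the whitespace left behind by the markup.
    text = re.sub(r"\s+", " ", soup.get_text(" ", strip=True))

    # Language detection for downstream processing.
    return {"text": text, "language": detect(text) if text else None}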

4. Intelligent Chunking

  • Web content is split into semantically meaningful chunks
  • Chunk size is optimized for both context preservation and search accuracy
  • Overlapping boundaries ensure no information is lost between chunks
  • Metadata is preserved and associated with each chunk
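
A minimal sketch of overlapping, fixed-size chunking is shown below. The sizes are arbitrary, and a production chunker (per the list above) splits on semantic boundaries rather than raw character counts:

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[dict]:
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        # Overlapping boundaries: each chunk repeats the tail of the previous
        # one so nothing is lost at the seams. Offsets are kept as metadata.
        chunks.append({"text": text[start:end], "start": start, "end": end})
        if end == len(text):
            break
        start = end - overlap
    return chunks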

5. Embedding Generation

  • Each chunk is converted into high-dimensional vector embeddings
  • Embeddings capture semantic meaning and context
  • Vectors are optimized for similarity search and retrieval
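
To make "semantic meaning as vectors" concrete: embeddings are compared by the angle between them, typically via cosine similarity. The embed() call in the final comment is hypothetical:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Chunks with similar meaning land close together, e.g. (hypothetical embed()):
# cosine_similarity(embed("refund policy"), embed("returns and refunds")) -> close to 1.0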

6. Indexing & Database Updates

  • Embeddings are stored in our vector database for fast similarity search
  • Full-text search indexes are created for keyword-based queries
  • Metadata is indexed for filtering and faceted search
  • Cross-references are established for related web content
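
A toy sketch of retrieval over such an index: a brute-force nearest-neighbour scan. Real vector databases use approximate indexes (e.g. HNSW or IVF) instead of the linear scan shown here:

def search(index: list[dict], query_vec: list[float], top_k: int = 5) -> list[dict]:
    # Each entry is assumed to look like {"vector": [...], "text": ..., "metadata": ...}.
    # Vectors are assumed L2-normalised, so a dot product equals cosine similarity.
    def score(entry: dict) -> float:
        return sum(q * v for q, v in zip(query_vec, entry["vector"]))

    return sorted(index, key=score, reverse=True)[:top_k]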

7. Quality Assurance

  • Automated quality checks ensure scraping accuracy
  • Content validation verifies extracted text completeness
  • Embedding quality is assessed for optimal retrieval performance
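
The checks themselves are internal to Cortex, but here is a sketch of the kind of validation this stage implies (the heuristics are invented for illustration):

import math

def quality_check(text: str, chunks: list[dict], vectors: list[list[float]]) -> list[str]:
    problems = []
    if not text.strip():
        problems.append("no text extracted")                    # scraping accuracy
    if sum(len(c["text"]) for c in chunks) < len(text):
        problems.append("chunks do not cover the source text")  # completeness
    if any(not all(math.isfinite(x) for x in v) for v in vectors):
        problems.append("non-finite embedding values")          # embedding quality
    return problems
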
Processing Time: Webpage scraping and processing typically takes 2-5 minutes. Complex pages with heavy content may take up to 10 minutes. You can check processing status using the file_id returned in the response.
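
A status-polling loop might look like the sketch below. Note that the endpoint path and the status field are assumptions made purely for illustration; consult the API reference for the actual status endpoint:

import time

import requests

# NOTE: "/upload/processing_status" and the "status" field are hypothetical;
# check the API reference for the real status endpoint and response shape.
def wait_until_processed(file_id: str, api_key: str, timeout_s: int = 600) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(
            "https://api.usecortex.ai/upload/processing_status",  # hypothetical
            params={"file_id": file_id},
            headers={"Authorization": f"Bearer {api_key}"},
        )
        if resp.ok and resp.json().get("status") == "processed":
            return True
        time.sleep(30)  # scraping typically takes 2-5 minutes
    return False
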
Default Sub-Tenant Behavior: If you don’t specify a sub_tenant_id, the webpage content will be uploaded to the default sub-tenant created when your tenant was set up. This is perfect for organization-wide web content that should be accessible across all departments.
File ID Management: When you provide a file_id (the optional query parameter on this endpoint), that specific ID is used to identify your content. If no file_id is provided, the system automatically generates a unique identifier for you. This lets you maintain consistent references to your content across your application while ensuring every piece of content has a unique identifier.
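
Both behaviours side by side, using the query parameter from the sample request above (YOUR_API_KEY is a placeholder):

import requests

BASE = "https://api.usecortex.ai/upload/scrape_webpage"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# With a custom file_id: the response echoes the ID you chose.
r = requests.post(BASE, headers=HEADERS, params={
    "web_url": "https://www.usecortex.ai/",
    "tenant_id": "tenant_1234",
    "file_id": "CortexDoc1234",
})
print(r.json()["file_id"])  # "CortexDoc1234"

# Without file_id: the system generates a unique ID; store it for tracking.
r = requests.post(BASE, headers=HEADERS, params={
    "web_url": "https://www.usecortex.ai/",
    "tenant_id": "tenant_1234",
})
generated_id = r.json()["file_id"]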

Error Responses

All endpoints return consistent error responses following the standard format. For detailed error information, see our Error Responses documentation.
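
Pending the details in that documentation, a conservative client-side pattern is to check the HTTP status and keep the raw payload for debugging:

import requests

resp = requests.post(
    "https://api.usecortex.ai/upload/scrape_webpage",
    params={"web_url": "https://www.usecortex.ai/", "tenant_id": "tenant_1234"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
if not resp.ok:
    # The exact error body is defined in the Error Responses documentation;
    # at minimum, log the status code and payload before retrying or surfacing it.
    print(resp.status_code, resp.text)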

Authorizations

Authorization
string · header · required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Query Parameters

web_url
string · required

The URL of the webpage to scrape and index.

Example: "https://www.usecortex.ai/"

tenant_id
string · required

Unique identifier for the tenant/organization.

Example: "tenant_1234"

sub_tenant_id
string · default: ""

Optional sub-tenant identifier used to organize data within a tenant. If omitted, the default sub-tenant created during tenant setup will be used.

Example: "sub_tenant_4567"

file_id
string · default: ""

Optional custom file ID for the scraped content. If not provided, a unique ID will be generated.

Example: "CortexDoc1234"

Body

application/x-www-form-urlencoded · Body_scrape_webpage_upload_scrape_webpage_post · object

Response

Successful Response

file_id
string · required

Unique identifier for the file being processed.

Example: "CortexDoc1234"

message
string · required

Status message indicating that document parsing has been scheduled or that an update has completed.

success
boolean · default: true

Example: true
