POST /upload/scrape_webpage
Scrape Webpage
Sample Response

{
  "file_id": "CortexDoc1234",
  "message": "<string>",
  "success": true
}
Hit the Try it button to try this API now in our playground. It’s the best way to check the full request and response in one place, customize your parameters, and generate ready-to-use code snippets.

Sample Request

curl --request POST \
  --url 'https://api.usecortex.ai/upload/scrape_webpage?web_url=https%3A%2F%2Fwww.usecortex.ai%2F&tenant_id=tenant_1234&sub_tenant_id=sub_tenant_4567&file_id=CortexDoc1234' \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/x-www-form-urlencoded'
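
For reference, here is the same sample request in Python using the requests library. YOUR_API_KEY and the parameter values are the placeholders from the curl example above:

import requests

# Same call as the curl sample above; all values are sample placeholders.
response = requests.post(
    "https://api.usecortex.ai/upload/scrape_webpage",
    params={
        "web_url": "https://www.usecortex.ai/",
        "tenant_id": "tenant_1234",
        "sub_tenant_id": "sub_tenant_4567",
        "file_id": "CortexDoc1234",
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
response.raise_for_status()
print(response.json())  # {'file_id': 'CortexDoc1234', 'message': '...', 'success': True}
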
Scrape and process webpage content directly from a URL. The webpage will be scraped, processed, chunked, and indexed for search and retrieval.

Webpage Scraping & Processing Pipeline

When you scrape a webpage, it goes through a specialized processing pipeline designed for web content:

1. Immediate URL Processing & Queue

  • Your webpage URL is immediately accepted and validated
  • The URL is added to our scraping queue for background processing
  • You receive a confirmation response with a file_id for tracking

2. Web Scraping Phase

Our system automatically handles:
  • URL Validation: Ensuring the URL is accessible and valid
  • Content Scraping: Extracting text, HTML, and metadata from the webpage
  • Structure Analysis: Understanding page layout, headers, and content hierarchy
  • Link Extraction: Identifying and processing internal and external links
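
As a rough illustration of what this phase involves (a minimal sketch, not Cortex's internal scraper), the same four steps in Python using requests and BeautifulSoup:

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def scrape(url: str) -> dict:
    # URL validation: scheme and host must be present.
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"invalid URL: {url}")

    # Content scraping: fetch the raw HTML.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Structure analysis: collect headings to recover the content hierarchy.
    headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])]

    # Link extraction: resolve relative links against the page URL.
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

    return {"text": soup.get_text(" ", strip=True), "headings": headings, "links": links}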

3. Content Processing & Cleaning

  • HTML Parsing: Converting HTML to clean, structured text
  • Content Filtering: Removing navigation, ads, and irrelevant content
  • Text Normalization: Cleaning and standardizing web content
  • Language Detection: Identifying the webpage’s language for optimal processing
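
Again purely illustrative (Cortex's filtering rules are not public), the gist of this step in Python, with the langdetect package standing in as one common option for language detection:

import re

from bs4 import BeautifulSoup
from langdetect import detect  # one common choice; not necessarily what Cortex uses

def clean(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    # Content filtering: drop navigation, boilerplate, and script/style tags.
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()

    # Text normalization: collapse the whitespace left behind by the markup.
    text = re.sub(r"\s+", " ", soup.get_text(" ", strip=True))

    # Language detection for downstream processing.
    return {"text": text, "language": detect(text) if text else None}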

4. Intelligent Chunking

  • Web content is split into semantically meaningful chunks
  • Chunk size is optimized for both context preservation and search accuracy
  • Overlapping boundaries ensure no information is lost between chunks
  • Metadata is preserved and associated with each chunk
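
A minimal sketch of overlapping, fixed-size chunking is shown below. The sizes are arbitrary, and a production chunker (per the list above) splits on semantic boundaries rather than raw character counts:

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[dict]:
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        # Overlapping boundaries: each chunk repeats the tail of the previous
        # one so nothing is lost at the seams. Offsets are kept as metadata.
        chunks.append({"text": text[start:end], "start": start, "end": end})
        if end == len(text):
            break
        start = end - overlap
    return chunks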

5. Embedding Generation

  • Each chunk is converted into high-dimensional vector embeddings
  • Embeddings capture semantic meaning and context
  • Vectors are optimized for similarity search and retrieval
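
To make "semantic meaning as vectors" concrete: embeddings are compared by the angle between them, typically via cosine similarity. The embed() call in the final comment is hypothetical:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Chunks with similar meaning land close together, e.g. (hypothetical embed()):
# cosine_similarity(embed("refund policy"), embed("returns and refunds")) -> close to 1.0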

6. Indexing & Database Updates

  • Embeddings are stored in our vector database for fast similarity search
  • Full-text search indexes are created for keyword-based queries
  • Metadata is indexed for filtering and faceted search
  • Cross-references are established for related web content
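
A toy sketch of retrieval over such an index: a brute-force nearest-neighbour scan. Real vector databases use approximate indexes (e.g. HNSW or IVF) instead of the linear scan shown here:

def search(index: list[dict], query_vec: list[float], top_k: int = 5) -> list[dict]:
    # Each entry is assumed to look like {"vector": [...], "text": ..., "metadata": ...}.
    # Vectors are assumed L2-normalised, so a dot product equals cosine similarity.
    def score(entry: dict) -> float:
        return sum(q * v for q, v in zip(query_vec, entry["vector"]))

    return sorted(index, key=score, reverse=True)[:top_k]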

7. Quality Assurance

  • Automated quality checks ensure scraping accuracy
  • Content validation verifies extracted text completeness
  • Embedding quality is assessed for optimal retrieval performance
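
The checks themselves are internal to Cortex, but here is a sketch of the kind of validation this stage implies (the heuristics are invented for illustration):

import math

def quality_check(text: str, chunks: list[dict], vectors: list[list[float]]) -> list[str]:
    problems = []
    if not text.strip():
        problems.append("no text extracted")                    # scraping accuracy
    if sum(len(c["text"]) for c in chunks) < len(text):
        problems.append("chunks do not cover the source text")  # completeness
    if any(not all(math.isfinite(x) for x in v) for v in vectors):
        problems.append("non-finite embedding values")          # embedding quality
    return problems
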
Processing Time: Webpage scraping and processing typically takes 2-5 minutes. Complex pages with heavy content may take up to 10 minutes. You can check processing status using the file_id returned in the response.
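
A status-polling loop might look like the sketch below. Note that the endpoint path and the status field are assumptions made purely for illustration; consult the API reference for the actual status endpoint:

import time

import requests

# NOTE: "/upload/processing_status" and the "status" field are hypothetical;
# check the API reference for the real status endpoint and response shape.
def wait_until_processed(file_id: str, api_key: str, timeout_s: int = 600) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(
            "https://api.usecortex.ai/upload/processing_status",  # hypothetical
            params={"file_id": file_id},
            headers={"Authorization": f"Bearer {api_key}"},
        )
        if resp.ok and resp.json().get("status") == "processed":
            return True
        time.sleep(30)  # scraping typically takes 2-5 minutes
    return False
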
Default Sub-Tenant Behavior: If you don’t specify a sub_tenant_id, the webpage content will be uploaded to the default sub-tenant created when your tenant was set up. This is perfect for organization-wide web content that should be accessible across all departments.
File ID Management: When you provide a file_id (the optional query parameter on this endpoint), that specific ID is used to identify your content. If no file_id is provided, the system automatically generates a unique identifier for you. This lets you maintain consistent references to your content across your application while ensuring every piece of content has a unique identifier.
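
Both behaviours side by side, using the query parameter from the sample request above (YOUR_API_KEY is a placeholder):

import requests

BASE = "https://api.usecortex.ai/upload/scrape_webpage"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# With a custom file_id: the response echoes the ID you chose.
r = requests.post(BASE, headers=HEADERS, params={
    "web_url": "https://www.usecortex.ai/",
    "tenant_id": "tenant_1234",
    "file_id": "CortexDoc1234",
})
print(r.json()["file_id"])  # "CortexDoc1234"

# Without file_id: the system generates a unique ID; store it for tracking.
r = requests.post(BASE, headers=HEADERS, params={
    "web_url": "https://www.usecortex.ai/",
    "tenant_id": "tenant_1234",
})
generated_id = r.json()["file_id"]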

Error Responses

All endpoints return consistent error responses following the standard format. For detailed error information, see our Error Responses documentation.
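
Pending the details in that documentation, a conservative client-side pattern is to check the HTTP status and keep the raw payload for debugging:

import requests

resp = requests.post(
    "https://api.usecortex.ai/upload/scrape_webpage",
    params={"web_url": "https://www.usecortex.ai/", "tenant_id": "tenant_1234"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
if not resp.ok:
    # The exact error body is defined in the Error Responses documentation;
    # at minimum, log the status code and payload before retrying or surfacing it.
    print(resp.status_code, resp.text)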

Authorizations

Authorization
string · header · required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Query Parameters

web_url
string · required

The URL of the webpage to scrape and index.

Example: "https://www.usecortex.ai/"

tenant_id
string · required

Unique identifier for the tenant/organization.

Example: "tenant_1234"

sub_tenant_id
string · default: ""

Optional sub-tenant identifier used to organize data within a tenant. If omitted, the default sub-tenant created during tenant setup will be used.

Example: "sub_tenant_4567"

file_id
string · default: ""

Optional custom file ID for the scraped content. If not provided, a unique ID will be generated.

Example: "CortexDoc1234"

Body

application/x-www-form-urlencoded · Body_scrape_webpage_upload_scrape_webpage_post · object

Response

Successful Response

file_id
string · required

Unique identifier for the file being processed.

Example: "CortexDoc1234"

message
string · required

Status message indicating that document parsing has been scheduled or that an update has completed.

success
boolean · default: true

Example: true
