Use the Try it button to try this API in our playground. It's the best way to check the full request and response in one place, customize your parameters, and generate ready-to-use code snippets.
Webpage Scraping & Processing Pipeline
When you scrape a webpage, it goes through a specialized processing pipeline designed for web content:

1. Immediate URL Processing & Queue
- Your webpage URL is immediately accepted and validated
- The URL is added to our scraping queue for background processing
- You receive a confirmation response with a `file_id` for tracking (see the sketch below)
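As a rough sketch, the confirmation might look like the following. Only the presence of `file_id` is documented above; the `status` field and the exact payload shape are illustrative assumptions:

```python
# Hypothetical confirmation payload: only file_id is documented above;
# the status field and overall shape are illustrative assumptions.
confirmation = {
    "file_id": "CortexDoc1234",  # keep this ID to reference the content later
    "status": "queued",          # scraping continues in the background
}
print(confirmation["file_id"])
```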
2. Web Scraping Phase
Our system automatically handles:

- URL Validation: Ensuring the URL is accessible and valid
- Content Scraping: Extracting text, HTML, and metadata from the webpage
- Structure Analysis: Understanding page layout, headers, and content hierarchy
- Link Extraction: Identifying and processing internal and external links
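For intuition, here is a minimal sketch of what such a scraping step could look like, using `requests` and `BeautifulSoup`. It is illustrative only, not Cortex's actual implementation:

```python
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def scrape(url: str) -> dict:
    """Fetch a page and extract text, metadata, and links (simplified sketch)."""
    # URL validation: require an http(s) scheme and a host
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"invalid URL: {url}")

    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # ensure the page is actually accessible

    soup = BeautifulSoup(resp.text, "html.parser")
    return {
        "title": soup.title.string if soup.title else None,
        "text": soup.get_text(" ", strip=True),
        # Link extraction: resolve relative hrefs against the page URL
        "links": [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)],
    }
```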
3. Content Processing & Cleaning
- HTML Parsing: Converting HTML to clean, structured text
- Content Filtering: Removing navigation, ads, and irrelevant content
- Text Normalization: Cleaning and standardizing web content
- Language Detection: Identifying the webpage’s language for optimal processing
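A simplified version of this cleaning stage might look like the following; the tag list and normalization rules are assumptions for illustration, not the production pipeline:

```python
import re
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    """Convert raw HTML into clean, normalized text (illustrative only)."""
    soup = BeautifulSoup(html, "html.parser")
    # Content filtering: drop elements that rarely carry useful content
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    text = soup.get_text(" ", strip=True)
    # Text normalization: collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text)
```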
4. Intelligent Chunking
- Web content is split into semantically meaningful chunks
- Chunk size is optimized for both context preservation and search accuracy
- Overlapping boundaries ensure no information is lost between chunks
- Metadata is preserved and associated with each chunk
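The real pipeline splits on semantic boundaries, but a minimal character-window sketch shows the overlap idea; the size and overlap values below are illustrative assumptions:

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[dict]:
    """Split text into overlapping chunks, keeping source offsets as metadata."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({
            "text": text[start:end],
            "start": start,  # metadata preserved with each chunk
            "end": end,
        })
        if end == len(text):
            break
        start = end - overlap  # overlapping boundaries avoid losing context
    return chunks
```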
5. Embedding Generation
- Each chunk is converted into high-dimensional vector embeddings
- Embeddings capture semantic meaning and context
- Vectors are optimized for similarity search and retrieval
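The embedding model itself is internal to Cortex, but the similarity measure that retrieval typically relies on can be sketched directly. Cosine similarity is a common choice, assumed here for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two embedding vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```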
6. Indexing & Database Updates
- Embeddings are stored in our vector database for fast similarity search
- Full-text search indexes are created for keyword-based queries
- Metadata is indexed for filtering and faceted search
- Cross-references are established for related web content
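As a toy illustration of what the vector store does conceptually (reusing `cosine_similarity` from the sketch above), the class below brute-forces search over stored vectors with metadata filtering; a production vector database would use approximate-nearest-neighbor indexes instead:

```python
class NaiveVectorIndex:
    """Toy stand-in for a vector database: stores vectors, brute-forces search."""

    def __init__(self):
        self.entries = []  # (vector, metadata) pairs

    def add(self, vector, metadata):
        self.entries.append((vector, metadata))

    def search(self, query, top_k=3, **filters):
        # Metadata filtering before similarity ranking (faceted search)
        candidates = [
            (vec, meta) for vec, meta in self.entries
            if all(meta.get(k) == v for k, v in filters.items())
        ]
        scored = sorted(
            candidates,
            key=lambda e: cosine_similarity(query, e[0]),
            reverse=True,
        )
        return [meta for _, meta in scored[:top_k]]
```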
7. Quality Assurance
- Automated quality checks ensure scraping accuracy
- Content validation verifies extracted text completeness
- Embedding quality is assessed for optimal retrieval performance
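Such checks might resemble the following heuristics; the thresholds are invented for illustration and are not Cortex's actual criteria:

```python
import math

def quality_checks(raw_html: str, text: str, embedding: list[float]) -> list[str]:
    """Return a list of quality warnings; an empty list means the item passes."""
    warnings = []
    if not text.strip():
        warnings.append("no text extracted")
    # Completeness heuristic: extracted text should not be a tiny fraction of the page
    if len(text) < 0.01 * len(raw_html):
        warnings.append("extracted text suspiciously short vs. raw HTML")
    # Embedding sanity: present, with only finite values
    if not embedding or not all(math.isfinite(x) for x in embedding):
        warnings.append("embedding missing or contains non-finite values")
    return warnings
```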
Sub-Tenant Behavior: If you do not provide a `sub_tenant_id`, the webpage content will be uploaded to the default sub-tenant created when your tenant was set up. This is perfect for organization-wide web content that should be accessible across all departments.

File ID Management: When you provide a `file_id` as a key in the `document_metadata` object, that specific ID will be used to identify your content. If no `file_id` is provided in the `document_metadata`, the system will automatically generate a unique identifier for you. This allows you to maintain consistent references to your content across your application while ensuring every piece of content has a unique identifier.
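For example, a `document_metadata` object carrying your own `file_id` might look like this; the extra key is a hypothetical illustration of metadata you might attach:

```python
# Only the file_id convention comes from the docs above; the other
# key is a hypothetical example of metadata you might track.
document_metadata = {
    "file_id": "CortexDoc1234",   # your stable identifier for this webpage
    "source": "marketing-site",   # hypothetical extra metadata
}
```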
Error Responses
All endpoints return consistent error responses following the standard format. For detailed error information, see our Error Responses documentation.

Authorizations
Bearer authentication header of the form `Bearer <token>`, where `<token>` is your auth token.
Query Parameters
- The URL of the webpage to scrape and index. Example: `"https://www.usecortex.ai/"`
- Unique identifier for the tenant/organization. Example: `"tenant_1234"`
- Optional sub-tenant identifier used to organize data within a tenant. If omitted, the default sub-tenant created during tenant setup will be used. Example: `"sub_tenant_4567"`
- Optional custom file ID for the scraped content. If not provided, a unique ID will be generated. Example: `"CortexDoc1234"`