Examples
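As a minimal sketch, a request might look like the following in Python. The base URL, authentication header, form field name, and endpoint path (inferred from the `Body_scrape_webpage_upload_scrape_webpage_post` schema noted at the end of this page) are assumptions to confirm against the API reference.

```python
import requests

API_BASE = "https://api.example.com"  # assumed base URL; substitute your API host
API_KEY = "YOUR_API_KEY"              # placeholder credential

# Submit a webpage URL for scraping. The body is form-encoded,
# matching the application/x-www-form-urlencoded schema below.
response = requests.post(
    f"{API_BASE}/upload/scrape_webpage",  # path inferred from the operation ID
    headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
    data={"url": "https://example.com/article"},
)
response.raise_for_status()

# The confirmation response includes a file_id for tracking (see pipeline below).
print("Queued with file_id:", response.json()["file_id"])
```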
Webpage Scraping & Processing Pipeline
When you scrape a webpage, it goes through a specialized processing pipeline designed for web content:
1. Immediate URL Processing & Queue
- Your webpage URL is immediately accepted and validated
- The URL is added to our scraping queue for background processing
- You receive a confirmation response with a `file_id` for tracking
2. Web Scraping Phase
Our system automatically handles:
- URL Validation: Ensuring the URL is accessible and valid
- Content Scraping: Extracting text, HTML, and metadata from the webpage
- Structure Analysis: Understanding page layout, headers, and content hierarchy
- Link Extraction: Identifying and processing internal and external links
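Conceptually, the validation step can be pictured as a well-formedness check plus a lightweight reachability probe. This sketch is illustrative only, not the service's actual implementation:

```python
from urllib.parse import urlparse

import requests

def is_valid_and_accessible(url: str, timeout: float = 5.0) -> bool:
    """Conceptual URL check: well-formed http(s) URL that responds successfully."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    try:
        # HEAD keeps the probe lightweight; some servers only answer GET.
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False
```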
3. Content Processing & Cleaning
- HTML Parsing: Converting HTML to clean, structured text
- Content Filtering: Removing navigation, ads, and irrelevant content
- Text Normalization: Cleaning and standardizing web content
- Language Detection: Identifying the webpage’s language for optimal processing
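As an illustration of this cleaning stage, the sketch below uses BeautifulSoup and langdetect as stand-ins for HTML parsing, content filtering, text normalization, and language detection; the service's actual tooling is not specified here.

```python
from bs4 import BeautifulSoup
from langdetect import detect  # stand-in language detector

def clean_webpage(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Content filtering: drop navigation, chrome, and non-content elements.
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()
    # Text normalization: collapse whitespace into clean text.
    text = " ".join(soup.get_text(separator=" ").split())
    return {"text": text, "language": detect(text) if text else None}
```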
4. Intelligent Chunking
- Web content is split into semantically meaningful chunks
- Chunk size is optimized for both context preservation and search accuracy
- Overlapping boundaries ensure no information is lost between chunks
- Metadata is preserved and associated with each chunk
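A simplified picture of overlapping chunking, with semantic boundary detection reduced to fixed-size character windows and illustrative size/overlap values (the service's actual tuning is not published):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split text into overlapping chunks; sizes here are illustrative."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text), 1), step):
        piece = text[start : start + chunk_size]
        if piece:
            # Metadata (here, just the offset) travels with each chunk.
            chunks.append({"text": piece, "start_offset": start})
    return chunks
```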
5. Embedding Generation
- Each chunk is converted into high-dimensional vector embeddings
- Embeddings capture semantic meaning and context
- Vectors are optimized for similarity search and retrieval
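For intuition, this sketch embeds chunks with the open-source sentence-transformers library; the model name is a stand-in, not the model this service actually uses.

```python
from sentence_transformers import SentenceTransformer  # stand-in embedding model

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # Each chunk becomes a dense vector capturing its semantic meaning;
    # normalized vectors make cosine similarity a plain dot product.
    return model.encode(chunks, normalize_embeddings=True).tolist()
```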
6. Indexing & Database Updates
- Embeddings are stored in our vector database for fast similarity search
- Full-text search indexes are created for keyword-based queries
- Metadata is indexed for filtering and faceted search
- Cross-references are established for related web content
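A toy in-memory version of the vector-index side of this step, using normalized vectors and dot-product (cosine) similarity; the production system uses a real vector database alongside full-text and metadata indexes.

```python
import numpy as np

class VectorIndex:
    """Toy in-memory index standing in for the actual vector database."""

    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.metadata: list[dict] = []

    def add(self, vector: list[float], meta: dict) -> None:
        self.vectors.append(np.asarray(vector))
        self.metadata.append(meta)

    def search(self, query: list[float], top_k: int = 5) -> list[dict]:
        # Cosine similarity via dot product (vectors assumed normalized).
        q = np.asarray(query)
        scores = np.array([v @ q for v in self.vectors])
        best = scores.argsort()[::-1][:top_k]
        return [self.metadata[i] | {"score": float(scores[i])} for i in best]
```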
7. Quality Assurance
- Automated quality checks ensure scraping accuracy
- Content validation verifies extracted text completeness
- Embedding quality is assessed for optimal retrieval performance
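The kinds of checks involved might resemble the following sketch; the actual pipeline's criteria and thresholds are not public.

```python
def run_quality_checks(text: str, chunks: list[dict],
                       embeddings: list[list[float]]) -> list[str]:
    """Illustrative quality checks only; thresholds are assumptions."""
    issues = []
    if len(text) < 200:
        issues.append("extracted text is suspiciously short")
    if len(chunks) != len(embeddings):
        issues.append("chunk/embedding count mismatch")
    if any(not any(vec) for vec in embeddings):
        issues.append("zero-vector embedding detected")
    return issues
```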
Processing Time: Webpage scraping and processing typically takes 2-5 minutes. Complex pages with heavy content may take up to 10 minutes. You can check processing status using the `file_id` returned in the response.
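A polling sketch for checking status follows. The status endpoint path and response fields are hypothetical; substitute the ones from your API reference.

```python
import time

import requests

API_BASE = "https://api.example.com"  # assumed base URL
API_KEY = "YOUR_API_KEY"              # placeholder credential

def wait_until_processed(file_id: str, timeout_s: int = 600) -> dict:
    """Poll until processing finishes; endpoint path and fields are hypothetical."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(
            f"{API_BASE}/documents/{file_id}/status",  # hypothetical path
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        resp.raise_for_status()
        status = resp.json()
        if status.get("state") in ("completed", "failed"):  # hypothetical field
            return status
        time.sleep(10)  # typical processing is 2-5 minutes, up to ~10
    raise TimeoutError(f"{file_id} still processing after {timeout_s}s")
```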
Default Sub-Tenant Behavior: If you don’t specify a `sub_tenant_id`, the webpage content will be uploaded to the default sub-tenant created when your tenant was set up. This is ideal for organization-wide web content that should be accessible across all departments.
File ID Management: When you provide a `file_id` as a key in the `document_metadata` object, that specific ID will be used to identify your content. If no `file_id` is provided in `document_metadata`, the system will automatically generate a unique identifier for you. This allows you to maintain consistent references to your content across your application while ensuring every piece of content has a unique identifier.
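To make this concrete, the hypothetical request below pins a custom `file_id` through `document_metadata` and omits `sub_tenant_id` so the content lands in the default sub-tenant. JSON-encoding the nested object inside a form body is an assumption to verify against the schema.

```python
import json

import requests

API_BASE = "https://api.example.com"  # assumed base URL
API_KEY = "YOUR_API_KEY"              # placeholder credential

payload = {
    "url": "https://example.com/handbook",
    # No sub_tenant_id: content goes to the default sub-tenant.
    # JSON-encoding the nested object is an assumption for a form body.
    "document_metadata": json.dumps({"file_id": "handbook-2024"}),
}
response = requests.post(
    f"{API_BASE}/upload/scrape_webpage",
    headers={"Authorization": f"Bearer {API_KEY}"},
    data=payload,  # application/x-www-form-urlencoded, per the Body schema below
)
response.raise_for_status()
```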
Error Responses
All endpoints return consistent error responses following the standard format. For detailed error information, see our Error Responses documentation.
Body: `application/x-www-form-urlencoded` (`Body_scrape_webpage_upload_scrape_webpage_post` object)