556 lines
10 KiB
Markdown
556 lines
10 KiB
Markdown
|
|
# Document Processing API
|
||
|
|
|
||
|
|
BotServer provides RESTful endpoints for processing, extracting, and analyzing various document formats including PDFs, Office documents, and images.
|
||
|
|
|
||
|
|
## Overview
|
||
|
|
|
||
|
|
The Document Processing API enables:
|
||
|
|
- Text extraction from documents
|
||
|
|
- OCR for scanned documents
|
||
|
|
- Metadata extraction
|
||
|
|
- Document conversion
|
||
|
|
- Content analysis and summarization
|
||
|
|
|
||
|
|
## Base URL
|
||
|
|
|
||
|
|
```
|
||
|
|
http://localhost:8080/api/v1/documents
|
||
|
|
```
|
||
|
|
|
||
|
|
## Authentication
|
||
|
|
|
||
|
|
All Document Processing API requests require authentication:
|
||
|
|
|
||
|
|
```http
|
||
|
|
Authorization: Bearer <token>
|
||
|
|
```
|
||
|
|
|
||
|
|
## Endpoints
|
||
|
|
|
||
|
|
### Upload Document
|
||
|
|
|
||
|
|
**POST** `/upload`
|
||
|
|
|
||
|
|
Upload a document for processing.
|
||
|
|
|
||
|
|
**Request:**
|
||
|
|
- Method: `POST`
|
||
|
|
- Content-Type: `multipart/form-data`
|
||
|
|
|
||
|
|
**Form Data:**
|
||
|
|
- `file` - The document file
|
||
|
|
- `process_options` - JSON string of processing options
|
||
|
|
|
||
|
|
**Example Request:**
|
||
|
|
```bash
|
||
|
|
curl -X POST \
|
||
|
|
-H "Authorization: Bearer token123" \
|
||
|
|
-F "file=@document.pdf" \
|
||
|
|
-F 'process_options={"extract_text":true,"extract_metadata":true}' \
|
||
|
|
http://localhost:8080/api/v1/documents/upload
|
||
|
|
```
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"filename": "document.pdf",
|
||
|
|
"size_bytes": 2048576,
|
||
|
|
"mime_type": "application/pdf",
|
||
|
|
"status": "processing",
|
||
|
|
"uploaded_at": "2024-01-15T10:00:00Z"
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Process Document
|
||
|
|
|
||
|
|
**POST** `/process`
|
||
|
|
|
||
|
|
Process an already uploaded document.
|
||
|
|
|
||
|
|
**Request Body:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"operations": [
|
||
|
|
"extract_text",
|
||
|
|
"extract_metadata",
|
||
|
|
"generate_summary",
|
||
|
|
"extract_entities"
|
||
|
|
],
|
||
|
|
"options": {
|
||
|
|
"language": "en",
|
||
|
|
"ocr_enabled": true,
|
||
|
|
"chunk_size": 1000
|
||
|
|
}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"process_id": "prc_xyz789",
|
||
|
|
"status": "processing",
|
||
|
|
"estimated_completion": "2024-01-15T10:02:00Z"
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Get Processing Status
|
||
|
|
|
||
|
|
**GET** `/process/{process_id}/status`
|
||
|
|
|
||
|
|
Check the status of document processing.
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"process_id": "prc_xyz789",
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"status": "completed",
|
||
|
|
"progress": 100,
|
||
|
|
"completed_at": "2024-01-15T10:01:30Z",
|
||
|
|
"results_available": true
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Get Extracted Text
|
||
|
|
|
||
|
|
**GET** `/documents/{document_id}/text`
|
||
|
|
|
||
|
|
Retrieve extracted text from a processed document.
|
||
|
|
|
||
|
|
**Query Parameters:**
|
||
|
|
- `page` - Specific page number (optional)
|
||
|
|
- `format` - Output format: `plain`, `markdown`, `html`
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"text": "This is the extracted text from the document...",
|
||
|
|
"pages": 10,
|
||
|
|
"word_count": 5420,
|
||
|
|
"language": "en"
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Get Document Metadata
|
||
|
|
|
||
|
|
**GET** `/documents/{document_id}/metadata`
|
||
|
|
|
||
|
|
Retrieve metadata from a document.
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"metadata": {
|
||
|
|
"title": "Annual Report 2024",
|
||
|
|
"author": "John Doe",
|
||
|
|
"created_date": "2024-01-10T08:00:00Z",
|
||
|
|
"modified_date": "2024-01-14T16:30:00Z",
|
||
|
|
"pages": 10,
|
||
|
|
"producer": "Microsoft Word",
|
||
|
|
"keywords": ["annual", "report", "finance"],
|
||
|
|
"custom_properties": {
|
||
|
|
"department": "Finance",
|
||
|
|
"confidentiality": "Internal"
|
||
|
|
}
|
||
|
|
}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Generate Summary
|
||
|
|
|
||
|
|
**POST** `/documents/{document_id}/summarize`
|
||
|
|
|
||
|
|
Generate an AI summary of the document.
|
||
|
|
|
||
|
|
**Request Body:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"type": "abstractive",
|
||
|
|
"length": "medium",
|
||
|
|
"focus_areas": ["key_points", "conclusions"],
|
||
|
|
"language": "en"
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"summary": "This document discusses the annual financial performance...",
|
||
|
|
"key_points": [
|
||
|
|
"Revenue increased by 15%",
|
||
|
|
"New market expansion successful",
|
||
|
|
"Operating costs reduced"
|
||
|
|
],
|
||
|
|
"summary_length": 250
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Extract Entities
|
||
|
|
|
||
|
|
**POST** `/documents/{document_id}/entities`
|
||
|
|
|
||
|
|
Extract named entities from the document.
|
||
|
|
|
||
|
|
**Request Body:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"entity_types": ["person", "organization", "location", "date", "money"],
|
||
|
|
"confidence_threshold": 0.7
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"entities": [
|
||
|
|
{
|
||
|
|
"text": "John Smith",
|
||
|
|
"type": "person",
|
||
|
|
"confidence": 0.95,
|
||
|
|
"occurrences": 5
|
||
|
|
},
|
||
|
|
{
|
||
|
|
"text": "New York",
|
||
|
|
"type": "location",
|
||
|
|
"confidence": 0.88,
|
||
|
|
"occurrences": 3
|
||
|
|
},
|
||
|
|
{
|
||
|
|
"text": "$1.5 million",
|
||
|
|
"type": "money",
|
||
|
|
"confidence": 0.92,
|
||
|
|
"occurrences": 2
|
||
|
|
}
|
||
|
|
]
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Convert Document
|
||
|
|
|
||
|
|
**POST** `/documents/{document_id}/convert`
|
||
|
|
|
||
|
|
Convert document to another format.
|
||
|
|
|
||
|
|
**Request Body:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"target_format": "pdf",
|
||
|
|
"options": {
|
||
|
|
"compress": true,
|
||
|
|
"quality": "high",
|
||
|
|
"page_size": "A4"
|
||
|
|
}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"converted_id": "doc_def456",
|
||
|
|
"original_format": "docx",
|
||
|
|
"target_format": "pdf",
|
||
|
|
"download_url": "/api/v1/documents/doc_def456/download"
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Search Within Document
|
||
|
|
|
||
|
|
**POST** `/documents/{document_id}/search`
|
||
|
|
|
||
|
|
Search for text within a document.
|
||
|
|
|
||
|
|
**Request Body:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"query": "revenue growth",
|
||
|
|
"case_sensitive": false,
|
||
|
|
"whole_words": false,
|
||
|
|
"regex": false
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"matches": [
|
||
|
|
{
|
||
|
|
"page": 3,
|
||
|
|
"line": 15,
|
||
|
|
"context": "...the company achieved significant revenue growth in Q4...",
|
||
|
|
"position": 1247
|
||
|
|
},
|
||
|
|
{
|
||
|
|
"page": 7,
|
||
|
|
"line": 8,
|
||
|
|
"context": "...projecting continued revenue growth for next year...",
|
||
|
|
"position": 3892
|
||
|
|
}
|
||
|
|
],
|
||
|
|
"total_matches": 2
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Split Document
|
||
|
|
|
||
|
|
**POST** `/documents/{document_id}/split`
|
||
|
|
|
||
|
|
Split a document into multiple parts.
|
||
|
|
|
||
|
|
**Request Body:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"method": "by_pages",
|
||
|
|
"pages_per_split": 5
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"parts": [
|
||
|
|
{
|
||
|
|
"part_id": "part_001",
|
||
|
|
"pages": "1-5",
|
||
|
|
"download_url": "/api/v1/documents/part_001/download"
|
||
|
|
},
|
||
|
|
{
|
||
|
|
"part_id": "part_002",
|
||
|
|
"pages": "6-10",
|
||
|
|
"download_url": "/api/v1/documents/part_002/download"
|
||
|
|
}
|
||
|
|
],
|
||
|
|
"total_parts": 2
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Merge Documents
|
||
|
|
|
||
|
|
**POST** `/documents/merge`
|
||
|
|
|
||
|
|
Merge multiple documents into one.
|
||
|
|
|
||
|
|
**Request Body:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"document_ids": ["doc_abc123", "doc_def456", "doc_ghi789"],
|
||
|
|
"output_format": "pdf",
|
||
|
|
"preserve_metadata": true
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"merged_document_id": "doc_merged_xyz",
|
||
|
|
"source_count": 3,
|
||
|
|
"total_pages": 30,
|
||
|
|
"download_url": "/api/v1/documents/doc_merged_xyz/download"
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
## Supported Formats
|
||
|
|
|
||
|
|
### Input Formats
|
||
|
|
- **Documents**: PDF, DOCX, DOC, ODT, RTF, TXT
|
||
|
|
- **Spreadsheets**: XLSX, XLS, ODS, CSV
|
||
|
|
- **Presentations**: PPTX, PPT, ODP
|
||
|
|
- **Images**: PNG, JPG, JPEG, GIF, BMP, TIFF
|
||
|
|
- **Web**: HTML, XML, MARKDOWN
|
||
|
|
|
||
|
|
### Output Formats
|
||
|
|
- PDF
|
||
|
|
- Plain Text
|
||
|
|
- Markdown
|
||
|
|
- HTML
|
||
|
|
- JSON
|
||
|
|
- CSV (for tabular data)
|
||
|
|
|
||
|
|
## Processing Options
|
||
|
|
|
||
|
|
### OCR Options
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"ocr_enabled": true,
|
||
|
|
"ocr_language": "eng",
|
||
|
|
"ocr_engine": "tesseract",
|
||
|
|
"preprocessing": {
|
||
|
|
"deskew": true,
|
||
|
|
"remove_noise": true,
|
||
|
|
"enhance_contrast": true
|
||
|
|
}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Text Extraction Options
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"preserve_formatting": false,
|
||
|
|
"extract_tables": true,
|
||
|
|
"extract_images": false,
|
||
|
|
"chunk_text": true,
|
||
|
|
"chunk_size": 1000,
|
||
|
|
"chunk_overlap": 100
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Summary Options
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"summary_type": "extractive",
|
||
|
|
"summary_length": "medium",
|
||
|
|
"bullet_points": true,
|
||
|
|
"include_keywords": true,
|
||
|
|
"max_sentences": 5
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
## Batch Processing
|
||
|
|
|
||
|
|
### Submit Batch
|
||
|
|
|
||
|
|
**POST** `/batch/process`
|
||
|
|
|
||
|
|
Process multiple documents in batch.
|
||
|
|
|
||
|
|
**Request Body:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"documents": [
|
||
|
|
{
|
||
|
|
"document_id": "doc_001",
|
||
|
|
"operations": ["extract_text", "summarize"]
|
||
|
|
},
|
||
|
|
{
|
||
|
|
"document_id": "doc_002",
|
||
|
|
"operations": ["extract_entities"]
|
||
|
|
}
|
||
|
|
],
|
||
|
|
"notify_on_completion": true,
|
||
|
|
"webhook_url": "https://example.com/webhook"
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Get Batch Status
|
||
|
|
|
||
|
|
**GET** `/batch/{batch_id}/status`
|
||
|
|
|
||
|
|
Check batch processing status.
|
||
|
|
|
||
|
|
**Response:**
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"batch_id": "batch_abc123",
|
||
|
|
"total_documents": 10,
|
||
|
|
"processed": 7,
|
||
|
|
"failed": 1,
|
||
|
|
"pending": 2,
|
||
|
|
"completion_percentage": 70
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
## Error Responses
|
||
|
|
|
||
|
|
### 400 Bad Request
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"error": "unsupported_format",
|
||
|
|
"message": "File format .xyz is not supported",
|
||
|
|
"supported_formats": ["pdf", "docx", "txt"]
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### 413 Payload Too Large
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"error": "file_too_large",
|
||
|
|
"message": "File size exceeds maximum limit",
|
||
|
|
"max_size_bytes": 52428800,
|
||
|
|
"provided_size_bytes": 104857600
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### 422 Unprocessable Entity
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"error": "corrupted_file",
|
||
|
|
"message": "The document appears to be corrupted and cannot be processed"
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
## Webhooks
|
||
|
|
|
||
|
|
Configure webhooks to receive processing notifications:
|
||
|
|
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"event": "document.processed",
|
||
|
|
"document_id": "doc_abc123",
|
||
|
|
"status": "completed",
|
||
|
|
"results": {
|
||
|
|
"text_extracted": true,
|
||
|
|
"summary_generated": true,
|
||
|
|
"entities_extracted": true
|
||
|
|
}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
## Rate Limits
|
||
|
|
|
||
|
|
| Operation | Limit | Window |
|
||
|
|
|-----------|-------|--------|
|
||
|
|
| Upload Document | 50/hour | Per user |
|
||
|
|
| Process Document | 100/hour | Per user |
|
||
|
|
| Generate Summary | 20/hour | Per user |
|
||
|
|
| Batch Processing | 5/hour | Per user |
|
||
|
|
|
||
|
|
## Best Practices
|
||
|
|
|
||
|
|
1. **Preprocess Documents**: Clean scanned documents before OCR
|
||
|
|
2. **Use Appropriate Formats**: Choose the right output format for your use case
|
||
|
|
3. **Batch Similar Documents**: Process similar documents together for efficiency
|
||
|
|
4. **Handle Large Files**: Use chunking for large documents
|
||
|
|
5. **Cache Results**: Store processed results to avoid reprocessing
|
||
|
|
6. **Monitor Processing**: Use webhooks for long-running operations
|
||
|
|
|
||
|
|
## Integration Examples
|
||
|
|
|
||
|
|
### Python Example
|
||
|
|
|
||
|
|
```python
|
||
|
|
import requests
|
||
|
|
|
||
|
|
# Upload and process document
|
||
|
|
with open('document.pdf', 'rb') as f:
|
||
|
|
response = requests.post(
|
||
|
|
'http://localhost:8080/api/v1/documents/upload',
|
||
|
|
headers={'Authorization': 'Bearer token123'},
|
||
|
|
files={'file': f},
|
||
|
|
data={'process_options': '{"extract_text": true}'}
|
||
|
|
)
|
||
|
|
|
||
|
|
document_id = response.json()['document_id']
|
||
|
|
|
||
|
|
# Get extracted text
|
||
|
|
text_response = requests.get(
|
||
|
|
f'http://localhost:8080/api/v1/documents/{document_id}/text',
|
||
|
|
headers={'Authorization': 'Bearer token123'}
|
||
|
|
)
|
||
|
|
|
||
|
|
print(text_response.json()['text'])
|
||
|
|
```
|
||
|
|
|
||
|
|
## Related APIs
|
||
|
|
|
||
|
|
- [Storage API](./storage-api.md) - Document storage
|
||
|
|
- [ML API](./ml-api.md) - Advanced text analysis
|
||
|
|
- [Knowledge Base API](../chapter-03/kb-and-tools.md) - Document indexing
|