Rodrigo Rodriguez (Pragmatismo) f62e0d4f37 - From 8 to 13.5

2025-11-24 13:02:30 -03:00

10 KiB

Raw Blame History

Document Processing API

BotServer provides RESTful endpoints for processing, extracting, and analyzing various document formats including PDFs, Office documents, and images.

Overview

The Document Processing API enables:

Text extraction from documents
OCR for scanned documents
Metadata extraction
Document conversion
Content analysis and summarization

Base URL

http://localhost:8080/api/v1/documents

Authentication

All Document Processing API requests require authentication:

Authorization: Bearer <token>

Endpoints

Upload Document

POST /upload

Upload a document for processing.

Request:

Method: POST
Content-Type: multipart/form-data

Form Data:

file - The document file
process_options - JSON string of processing options

Example Request:

curl -X POST \
  -H "Authorization: Bearer token123" \
  -F "file=@document.pdf" \
  -F 'process_options={"extract_text":true,"extract_metadata":true}' \
  http://localhost:8080/api/v1/documents/upload

Response:

{
  "document_id": "doc_abc123",
  "filename": "document.pdf",
  "size_bytes": 2048576,
  "mime_type": "application/pdf",
  "status": "processing",
  "uploaded_at": "2024-01-15T10:00:00Z"
}

Process Document

POST /process

Process an already uploaded document.

Request Body:

{
  "document_id": "doc_abc123",
  "operations": [
    "extract_text",
    "extract_metadata",
    "generate_summary",
    "extract_entities"
  ],
  "options": {
    "language": "en",
    "ocr_enabled": true,
    "chunk_size": 1000
  }
}

Response:

{
  "document_id": "doc_abc123",
  "process_id": "prc_xyz789",
  "status": "processing",
  "estimated_completion": "2024-01-15T10:02:00Z"
}

Get Processing Status

GET /process/{process_id}/status

Check the status of document processing.

Response:

{
  "process_id": "prc_xyz789",
  "document_id": "doc_abc123",
  "status": "completed",
  "progress": 100,
  "completed_at": "2024-01-15T10:01:30Z",
  "results_available": true
}

Get Extracted Text

GET /documents/{document_id}/text

Retrieve extracted text from a processed document.

Query Parameters:

page - Specific page number (optional)
format - Output format: plain, markdown, html

Response:

{
  "document_id": "doc_abc123",
  "text": "This is the extracted text from the document...",
  "pages": 10,
  "word_count": 5420,
  "language": "en"
}

Get Document Metadata

GET /documents/{document_id}/metadata

Retrieve metadata from a document.

Response:

{
  "document_id": "doc_abc123",
  "metadata": {
    "title": "Annual Report 2024",
    "author": "John Doe",
    "created_date": "2024-01-10T08:00:00Z",
    "modified_date": "2024-01-14T16:30:00Z",
    "pages": 10,
    "producer": "Microsoft Word",
    "keywords": ["annual", "report", "finance"],
    "custom_properties": {
      "department": "Finance",
      "confidentiality": "Internal"
    }
  }
}

Generate Summary

POST /documents/{document_id}/summarize

Generate an AI summary of the document.

Request Body:

{
  "type": "abstractive",
  "length": "medium",
  "focus_areas": ["key_points", "conclusions"],
  "language": "en"
}

Response:

{
  "document_id": "doc_abc123",
  "summary": "This document discusses the annual financial performance...",
  "key_points": [
    "Revenue increased by 15%",
    "New market expansion successful",
    "Operating costs reduced"
  ],
  "summary_length": 250
}

Extract Entities

POST /documents/{document_id}/entities

Extract named entities from the document.

Request Body:

{
  "entity_types": ["person", "organization", "location", "date", "money"],
  "confidence_threshold": 0.7
}

Response:

{
  "document_id": "doc_abc123",
  "entities": [
    {
      "text": "John Smith",
      "type": "person",
      "confidence": 0.95,
      "occurrences": 5
    },
    {
      "text": "New York",
      "type": "location",
      "confidence": 0.88,
      "occurrences": 3
    },
    {
      "text": "$1.5 million",
      "type": "money",
      "confidence": 0.92,
      "occurrences": 2
    }
  ]
}

Convert Document

POST /documents/{document_id}/convert

Convert document to another format.

Request Body:

{
  "target_format": "pdf",
  "options": {
    "compress": true,
    "quality": "high",
    "page_size": "A4"
  }
}

Response:

{
  "document_id": "doc_abc123",
  "converted_id": "doc_def456",
  "original_format": "docx",
  "target_format": "pdf",
  "download_url": "/api/v1/documents/doc_def456/download"
}

Search Within Document

POST /documents/{document_id}/search

Search for text within a document.

Request Body:

{
  "query": "revenue growth",
  "case_sensitive": false,
  "whole_words": false,
  "regex": false
}

Response:

{
  "document_id": "doc_abc123",
  "matches": [
    {
      "page": 3,
      "line": 15,
      "context": "...the company achieved significant revenue growth in Q4...",
      "position": 1247
    },
    {
      "page": 7,
      "line": 8,
      "context": "...projecting continued revenue growth for next year...",
      "position": 3892
    }
  ],
  "total_matches": 2
}

Split Document

POST /documents/{document_id}/split

Split a document into multiple parts.

Request Body:

{
  "method": "by_pages",
  "pages_per_split": 5
}

Response:

{
  "document_id": "doc_abc123",
  "parts": [
    {
      "part_id": "part_001",
      "pages": "1-5",
      "download_url": "/api/v1/documents/part_001/download"
    },
    {
      "part_id": "part_002",
      "pages": "6-10",
      "download_url": "/api/v1/documents/part_002/download"
    }
  ],
  "total_parts": 2
}

Merge Documents

POST /documents/merge

Merge multiple documents into one.

Request Body:

{
  "document_ids": ["doc_abc123", "doc_def456", "doc_ghi789"],
  "output_format": "pdf",
  "preserve_metadata": true
}

Response:

{
  "merged_document_id": "doc_merged_xyz",
  "source_count": 3,
  "total_pages": 30,
  "download_url": "/api/v1/documents/doc_merged_xyz/download"
}

Supported Formats

Input Formats

Documents: PDF, DOCX, DOC, ODT, RTF, TXT
Spreadsheets: XLSX, XLS, ODS, CSV
Presentations: PPTX, PPT, ODP
Images: PNG, JPG, JPEG, GIF, BMP, TIFF
Web: HTML, XML, MARKDOWN

Output Formats

PDF
Plain Text
Markdown
HTML
JSON
CSV (for tabular data)

Processing Options

OCR Options

{
  "ocr_enabled": true,
  "ocr_language": "eng",
  "ocr_engine": "tesseract",
  "preprocessing": {
    "deskew": true,
    "remove_noise": true,
    "enhance_contrast": true
  }
}

Text Extraction Options

{
  "preserve_formatting": false,
  "extract_tables": true,
  "extract_images": false,
  "chunk_text": true,
  "chunk_size": 1000,
  "chunk_overlap": 100
}

Summary Options

{
  "summary_type": "extractive",
  "summary_length": "medium",
  "bullet_points": true,
  "include_keywords": true,
  "max_sentences": 5
}

Batch Processing

Submit Batch

POST /batch/process

Process multiple documents in batch.

Request Body:

{
  "documents": [
    {
      "document_id": "doc_001",
      "operations": ["extract_text", "summarize"]
    },
    {
      "document_id": "doc_002",
      "operations": ["extract_entities"]
    }
  ],
  "notify_on_completion": true,
  "webhook_url": "https://example.com/webhook"
}

Get Batch Status

GET /batch/{batch_id}/status

Check batch processing status.

Response:

{
  "batch_id": "batch_abc123",
  "total_documents": 10,
  "processed": 7,
  "failed": 1,
  "pending": 2,
  "completion_percentage": 70
}

Error Responses

400 Bad Request

{
  "error": "unsupported_format",
  "message": "File format .xyz is not supported",
  "supported_formats": ["pdf", "docx", "txt"]
}

413 Payload Too Large

{
  "error": "file_too_large",
  "message": "File size exceeds maximum limit",
  "max_size_bytes": 52428800,
  "provided_size_bytes": 104857600
}

422 Unprocessable Entity

{
  "error": "corrupted_file",
  "message": "The document appears to be corrupted and cannot be processed"
}

Webhooks

Configure webhooks to receive processing notifications:

{
  "event": "document.processed",
  "document_id": "doc_abc123",
  "status": "completed",
  "results": {
    "text_extracted": true,
    "summary_generated": true,
    "entities_extracted": true
  }
}

Rate Limits

Operation	Limit	Window
Upload Document	50/hour	Per user
Process Document	100/hour	Per user
Generate Summary	20/hour	Per user
Batch Processing	5/hour	Per user

Best Practices

Preprocess Documents: Clean scanned documents before OCR
Use Appropriate Formats: Choose the right output format for your use case
Batch Similar Documents: Process similar documents together for efficiency
Handle Large Files: Use chunking for large documents
Cache Results: Store processed results to avoid reprocessing
Monitor Processing: Use webhooks for long-running operations

Integration Examples

Python Example

import requests

# Upload and process document
with open('document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8080/api/v1/documents/upload',
        headers={'Authorization': 'Bearer token123'},
        files={'file': f},
        data={'process_options': '{"extract_text": true}'}
    )
    
document_id = response.json()['document_id']

# Get extracted text
text_response = requests.get(
    f'http://localhost:8080/api/v1/documents/{document_id}/text',
    headers={'Authorization': 'Bearer token123'}
)

print(text_response.json()['text'])

Storage API - Document storage
ML API - Advanced text analysis
Knowledge Base API - Document indexing

10 KiB Raw Blame History

Document Processing API

Overview

Base URL

Authentication

Endpoints

Upload Document

Process Document

Get Processing Status

Get Extracted Text

Get Document Metadata

Generate Summary

Extract Entities

Convert Document

Search Within Document

Split Document

Merge Documents

Supported Formats

Input Formats

Output Formats

Processing Options

OCR Options

Text Extraction Options

Summary Options

Batch Processing

Submit Batch

Get Batch Status

Error Responses

400 Bad Request

413 Payload Too Large

422 Unprocessable Entity

Webhooks

Rate Limits

Best Practices

Integration Examples

Python Example

Related APIs

10 KiB

Raw Blame History