# DOCX Processing Architecture - Cloudflare Native
## Overview
This document describes the asynchronous DOCX processing architecture using Cloudflare Workers, Queues, R2, and D1.
## Architecture Diagram

### Component Flow Diagram

### Data Flow

At a high level: the client uploads a DOCX → the Worker validates it, stores it in R2, and inserts a `processing` row in D1 → a message goes onto the queue → the consumer Worker parses the file and writes the extracted text back to D1 → the client polls a status endpoint until the row reads `completed`.
## Database Schema

### `documents` table
```sql
CREATE TABLE documents (
  id TEXT PRIMARY KEY,
  user_id TEXT NOT NULL,
  filename TEXT NOT NULL,
  blob_path TEXT NOT NULL,

  -- Processing status
  status TEXT NOT NULL CHECK(status IN ('processing', 'completed', 'failed')),

  -- Extracted content
  content_text TEXT,
  content_html TEXT,

  -- Error handling
  error_message TEXT,
  retry_count INTEGER DEFAULT 0,

  -- Timestamps
  created_at INTEGER NOT NULL,
  processed_at INTEGER,

  FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);

CREATE INDEX idx_documents_user_status ON documents(user_id, status);
CREATE INDEX idx_documents_status ON documents(status);
```

## DOCX File Structure
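A `.docx` file is a ZIP archive of OOXML parts: the document body lives in `word/document.xml`, and the visible text sits inside `w:t` (text run) elements. Below is a minimal sketch of the parser utility from the implementation checklist, using the `fflate` and `fast-xml-parser` dependencies; the function name and tree walk are illustrative, not a finished parser.

```typescript
import { unzipSync, strFromU8 } from 'fflate'
import { XMLParser } from 'fast-xml-parser'

// Sketch: extract plain text from a DOCX buffer.
export function extractDocxText(docx: Uint8Array): string {
  // A .docx is a ZIP archive; the body is OOXML in word/document.xml
  const files = unzipSync(docx)
  const entry = files['word/document.xml']
  if (!entry) throw new Error('Not a DOCX: missing word/document.xml')

  const tree = new XMLParser({ ignoreAttributes: true }).parse(strFromU8(entry))

  // Collect every w:t value, depth-first
  const parts: string[] = []
  const walk = (node: unknown): void => {
    if (node === null || typeof node !== 'object') return
    for (const [key, value] of Object.entries(node)) {
      const children = Array.isArray(value) ? value : [value]
      if (key === 'w:t') {
        for (const t of children) {
          if (typeof t === 'string' || typeof t === 'number') parts.push(String(t))
        }
      } else {
        children.forEach((child) => walk(child))
      }
    }
  }
  walk(tree)
  return parts.join(' ')
}
```

Joining runs with spaces loses paragraph boundaries; a fuller parser would walk `w:p` (paragraph) elements and emit newlines between them.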
## Key Benefits
1. **Asynchronous Processing** (see the upload-handler sketch after this list)
   - Upload returns immediately (~100ms)
   - Processing happens in the background
   - No timeout issues for large files
2. **Scalability**
   - Queue handles 100+ concurrent uploads
   - Consumer processes batches efficiently
   - Automatic retry on failures
3. **Reliability**
   - Guaranteed message delivery
   - Automatic retries with exponential backoff
   - Dead letter queue for permanent failures
4. **Performance**
   - Upload: ~100-200ms
   - Queue latency: ~50-100ms
   - Processing a 100-page doc: ~1-3 seconds
   - Total user-perceived time: <200ms (async)
5. **Cost Efficiency**
   - No external services
   - R2: $0.015/GB-month storage
   - Queues: included in the Workers plan
   - D1: first 5GB free
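To make the asynchronous flow concrete, here is a sketch of the upload handler: store the file in R2, insert a `processing` row in D1, enqueue a message, and return before any parsing happens. The binding names (`DOCS_BUCKET`, `DB`, `DOC_QUEUE`) and the `getSessionUserId` helper are assumptions for illustration.

```typescript
// Assumed bindings: DOCS_BUCKET (R2), DB (D1), DOC_QUEUE (Queue)
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Hypothetical session check (see Security Considerations)
    const userId = await getSessionUserId(request, env)
    if (!userId) return new Response('Unauthorized', { status: 401 })

    const form = await request.formData()
    const file = form.get('file')
    if (!(file instanceof File)) return new Response('Missing file', { status: 400 })

    const documentId = crypto.randomUUID()
    // Derive the R2 key from the id, never from the raw filename
    const blobPath = `uploads/${documentId}.docx`

    // 1. Store the raw DOCX in R2
    await env.DOCS_BUCKET.put(blobPath, await file.arrayBuffer())

    // 2. Record the document as 'processing' in D1
    await env.DB.prepare(
      `INSERT INTO documents (id, user_id, filename, blob_path, status, created_at)
       VALUES (?, ?, ?, ?, 'processing', ?)`
    ).bind(documentId, userId, file.name, blobPath, Date.now()).run()

    // 3. Enqueue for background processing and return immediately
    await env.DOC_QUEUE.send({ documentId, blobPath })
    return Response.json({ documentId, status: 'processing' })
  }
}
```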
## Implementation Checklist
- [ ] Install dependencies (`fflate`, `fast-xml-parser`)
- [ ] Create DOCX parser utility
- [ ] Update `wrangler.jsonc` with queue config (see the config sketch after this list)
- [ ] Implement upload API endpoint
- [ ] Implement queue consumer
- [ ] Implement status polling endpoint
- [ ] Add database migration for documents table
- [ ] Create frontend upload component with polling
- [ ] Test with sample 100-page documents
- [ ] Add error handling and monitoring
- [ ] Deploy to production
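For the `wrangler.jsonc` item above, a sketch of the queue configuration; the queue and binding names are placeholders:

```jsonc
// wrangler.jsonc (excerpt)
{
  "queues": {
    "producers": [
      { "binding": "DOC_QUEUE", "queue": "docx-processing" }
    ],
    "consumers": [
      {
        "queue": "docx-processing",
        "max_batch_size": 10,
        "max_retries": 3,
        "dead_letter_queue": "docx-processing-dlq"
      }
    ]
  }
}
```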
## Alternative: Cloudflare Workflows
For even more complex processing pipelines, consider Cloudflare Workflows (currently in beta); a sketch follows the feature list below. Workflows provide:
- Up to 1024 steps per workflow
- Built-in state management
- Automatic retries and error handling
- Durable execution
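As a sketch of how this pipeline could look as a Workflow (the class name, step names, and bindings are illustrative; `extractDocxText` is the parser sketch from the DOCX File Structure section):

```typescript
import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from 'cloudflare:workers'

type DocxParams = { documentId: string; blobPath: string }

export class DocxPipeline extends WorkflowEntrypoint<Env, DocxParams> {
  async run(event: WorkflowEvent<DocxParams>, step: WorkflowStep) {
    const { documentId, blobPath } = event.payload

    // Each step is checkpointed and retried independently
    const text = await step.do('fetch-and-extract', async () => {
      const object = await this.env.DOCS_BUCKET.get(blobPath)
      if (!object) throw new Error(`Missing R2 object: ${blobPath}`)
      return extractDocxText(new Uint8Array(await object.arrayBuffer()))
    })

    await step.do('mark-completed', async () => {
      await this.env.DB.prepare(
        `UPDATE documents
            SET status = 'completed', content_text = ?, processed_at = ?
          WHERE id = ?`
      ).bind(text, Date.now(), documentId).run()
    })
  }
}
```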
## Security Considerations
- **Authentication**: Verify the user session on upload
- **File Validation**: Check file type and size limits (see the sketch after this list)
- **Path Sanitization**: Prevent path traversal attacks
- **Quota Enforcement**: Limit uploads per user/day
- **Virus Scanning**: Consider integrating a scanning service
- **Access Control**: Ensure users can only access their own documents
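For the file-validation and path-sanitization points, a minimal sketch (the limits are illustrative):

```typescript
const MAX_BYTES = 10 * 1024 * 1024 // illustrative 10 MB cap
const DOCX_MIME =
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

// Returns an error message, or null if the upload looks acceptable
function validateUpload(file: File): string | null {
  if (file.size === 0 || file.size > MAX_BYTES) return 'File is empty or too large'
  if (file.type !== DOCX_MIME && !file.name.toLowerCase().endsWith('.docx')) {
    return 'Only .docx files are accepted'
  }
  // Path sanitization: reject names that could escape the upload prefix
  if (/[\\/]|\.\./.test(file.name)) return 'Invalid filename'
  return null
}
```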
## Monitoring & Observability
```typescript
// Add to the queue consumer
export default {
  async queue(batch: MessageBatch<{ documentId: string }>, env: Env) {
    for (const message of batch.messages) {
      // Time each document individually rather than the whole batch
      const startTime = Date.now()
      try {
        // Processing logic...
        // (`buffer` below refers to the DOCX bytes produced by that elided step)

        // Log metrics
        console.log({
          event: 'document_processed',
          documentId: message.body.documentId,
          duration: Date.now() - startTime,
          fileSize: buffer.byteLength
        })
      } catch (error) {
        // Narrow `unknown` before reading Error fields
        const err = error instanceof Error ? error : new Error(String(error))

        // Log errors
        console.error({
          event: 'document_processing_failed',
          documentId: message.body.documentId,
          error: err.message,
          stack: err.stack
        })

        // Mark the message for retry so failures aren't silently dropped
        message.retry()
      }
    }
  }
}
```

## Future Enhancements
- **Streaming Processing**: For very large files (>10MB)
- **Format Preservation**: Extract tables, lists, formatting
- **Image Extraction**: Store embedded images separately
- **OCR Integration**: Extract text from images in documents
- **Version Control**: Track document revisions
- **Bulk Processing**: Process multiple documents in one upload
- **Webhooks**: Notify external systems when processing completes