# MinerU-based PDF/PPT Parsing Module

This module provides a comprehensive PDF and PowerPoint document parsing solution built on MinerU, designed as a modular extension to the existing gzero.py parsing system.

## Features

### Core Capabilities

- **Multi-format Support**: Direct processing of PPT (.ppt/.pptx) and PDF files
- **Intelligent Format Detection**: Automatically detects PPT-origin PDFs for optimized processing
- **Page-by-Page Processing**: Splits PDFs into individual pages for parallel MinerU processing
- **Advanced Image Processing**: AI-powered image classification and content extraction
- **Table Recognition**: Specialized handling for documents containing table structures
- **Content Fusion**: Combines MinerU structured output with plain text for accuracy

### Processing Flow

#### Page-by-Page Processing

**Core Benefits**: Faster processing, better parallelization, optimal resource usage

1. **PDF Splitting**: Automatically splits the PDF into individual page files using PyMuPDF
2. **Parallel Processing**: Processes multiple pages simultaneously through the MinerU API
3. **Batch Management**: Smart batching to avoid API rate limits (configurable concurrency)
4. **Progress Tracking**: Real-time progress reporting with page-level status
5. **Error Resilience**: Continues processing even if individual pages fail
6. **Result Merging**: Combines all page results into structured markdown content
7. **Image Organization**: Automatically renames images with page prefixes for better organization

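The parallel stage above can be sketched with `asyncio`. This is an illustrative sketch only — the function and callback names are hypothetical, and the module's real implementation in `api_client.py` may differ:

```python
import asyncio

async def process_pages_in_parallel(pages, process_page, max_concurrent=3):
    """Process pages concurrently, capped by a semaphore (illustrative sketch).

    `process_page` stands in for a single MinerU page call; a failed page
    yields None instead of aborting the whole batch.
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def worker(index, page):
        async with sem:
            try:
                return index, await process_page(page)
            except Exception:
                # Error resilience: record the failure, keep going
                return index, None

    results = await asyncio.gather(*(worker(i, p) for i, p in enumerate(pages)))
    # Merge results back in page order, skipping failed pages
    return [r for _, r in sorted(results) if r is not None]
```

The semaphore is what `MAX_CONCURRENT_API_CALLS` would bound in practice; raising it trades API pressure for throughput.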
#### Document Processing Pipeline

1. **File Input**: Accepts PPT (.ppt/.pptx) or PDF files
2. **Format Detection**: Determines whether the PDF originated from a PPT presentation
3. **PPT Conversion**: Converts PPT to PDF using LibreOffice (with fallback)
4. **Page Splitting**: Splits the PDF into individual pages for parallel processing
5. **MinerU Processing**: Each page is processed through the MinerU API with OCR, formula, and table recognition
6. **Content Integration**: Merges page results with structured content organization
7. **Image Processing**: AI-powered image classification and upload integration
8. **Final Assembly**: Creates the complete document with page-organized content

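Step 2, format detection, can be approximated with a page-geometry heuristic: PPT-exported PDFs tend to have uniform landscape pages at 4:3 or 16:9. This is one plausible heuristic, sketched here as an assumption — the actual logic in `converter.py` may use different signals:

```python
def looks_like_ppt_origin(page_sizes, tolerance=0.05):
    """Guess whether a PDF was exported from a presentation.

    `page_sizes` is a list of (width, height) tuples, e.g. taken from
    PyMuPDF page rectangles. Hypothetical helper, not the module's API.
    """
    if not page_sizes:
        return False
    slide_ratios = (4 / 3, 16 / 9)  # common presentation aspect ratios
    for width, height in page_sizes:
        if height >= width:  # portrait pages suggest a native PDF
            return False
        ratio = width / height
        if not any(abs(ratio - t) / t <= tolerance for t in slide_ratios):
            return False
    return True
```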
## Architecture

### Module Structure

```
loader/mineru/
├── __init__.py            # Module initialization
├── maxkb_adapter/         # MaxKB adapter implementation
│   ├── __init__.py        # Adapter module initialization
│   ├── adapter.py         # MaxKBAdapter and MinerUExtractor
│   └── config_maxkb.py    # MaxKB-specific configuration
├── base_parser.py         # Base classes for platform adapters
├── config.py              # Configuration management
├── converter.py           # File conversion and detection
├── api_client.py          # MinerU API integration
├── image_processor.py     # Image recognition and processing
├── content_processor.py   # Content fusion and refinement
├── flowchart_plugin.py    # Specialized flowchart processing
├── utils.py               # Utility functions
├── example.py             # Usage examples
└── README.md              # This file
```

### Key Components

**MinerUExtractor**: Main parser class orchestrating the complete pipeline

- Handles file processing from input to final Document output
- Manages temporary directories and caching
- Integrates with existing gzero.py patterns for compatibility

**DocumentConverter**: File type detection and conversion

- PPT-to-PDF conversion with LibreOffice and fallback methods
- PDF format detection (PPT-origin vs. native PDF)
- Page extraction and metadata analysis

**MinerUAPIClient**: Interface to the MinerU service

- Handles API communication and response processing
- Supports both cloud and self-hosted MinerU deployments
- Includes a mock implementation for development/testing
- Cloud API: asynchronous processing with polling
- Self-hosted API: synchronous processing with direct file upload

**MinerUImageProcessor**: Advanced image handling

- AI-powered image classification (structured_content/brief_description/meaningless)
- Batch processing with concurrency control
- Integration with existing image upload infrastructure

**MinerUContentProcessor**: Content analysis and enhancement

- Table detection and specialized processing
- Plain text extraction and merging with structured content
- LLM-based content refinement for complex documents

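The LibreOffice conversion that DocumentConverter performs boils down to a headless command-line invocation. A minimal sketch of how that command might be assembled (the helper name is hypothetical; the real `converter.py` may build it differently):

```python
def libreoffice_convert_command(input_path, output_dir, binary="libreoffice"):
    """Build the headless LibreOffice command for PPT -> PDF conversion.

    `binary` corresponds to LIBREOFFICE_PATH; run the result with
    subprocess.run(cmd, timeout=CONVERSION_TIMEOUT, check=True).
    """
    return [
        binary,
        "--headless",        # no GUI
        "--convert-to", "pdf",
        "--outdir", str(output_dir),
        str(input_path),
    ]
```

If LibreOffice is unavailable, the optional python-pptx/reportlab dependencies listed below provide the fallback path.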
## Configuration

### Environment Variables

```bash
# MinerU API Configuration
MINERU_API_KEY=your_mineru_api_key       # Required for cloud API
MINERU_API_URL=https://mineru.net        # Cloud API URL
MINERU_API_TYPE=cloud                    # "cloud" or "self_hosted"

# For self-hosted MinerU (alternative configuration)
# MINERU_API_URL=http://10.128.4.1:30001 # Self-hosted API URL
# MINERU_API_TYPE=self_hosted            # No API key required

# LLM Configuration (uses existing gzero.py settings)
ADVANCED_PARSER_KEY_OPENAI=your_openai_key
ADVANCED_PARSER_KEY_CLAUDE=your_claude_key
ADVANCED_PARSER_KEY_GEMINI=your_gemini_key

# Processing Configuration
MAX_FILE_SIZE=52428800                   # 50 MB
LIBREOFFICE_PATH=libreoffice
CONVERSION_TIMEOUT=300                   # 5 minutes

# Parallel Processing Controls
MAX_CONCURRENT_UPLOADS=5
MAX_CONCURRENT_API_CALLS=3               # Controls page processing concurrency
MAX_IMAGE_SIZE_MB=5.0
COMPRESSION_QUALITY=85
```

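Reading these variables into a config object might look like the sketch below. The function name and dict shape are illustrative assumptions; the real `config.py` defines its own structure:

```python
import os

def load_mineru_config(env=None):
    """Read MinerU settings from environment variables with the documented
    defaults. Illustrative sketch, not the module's actual loader."""
    env = os.environ if env is None else env
    return {
        "api_key": env.get("MINERU_API_KEY", ""),
        "api_url": env.get("MINERU_API_URL", "https://mineru.net"),
        "api_type": env.get("MINERU_API_TYPE", "cloud"),
        "max_file_size": int(env.get("MAX_FILE_SIZE", 52428800)),
        "max_concurrent_api_calls": int(env.get("MAX_CONCURRENT_API_CALLS", 3)),
        "max_concurrent_uploads": int(env.get("MAX_CONCURRENT_UPLOADS", 5)),
    }
```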
### API Type Comparison

| Feature | Cloud API | Self-Hosted API |
|---------|-----------|-----------------|
| **Authentication** | API key required | No authentication |
| **Processing Model** | Asynchronous with polling | Synchronous direct response |
| **File Upload** | Requires public URL | Direct multipart upload |
| **Rate Limits** | 2000 pages/day (free tier) | Limited by server resources |
| **Network Requirements** | Internet access required | Local network access |
| **Setup Complexity** | Simple (API key only) | Requires self-hosted deployment |
| **Processing Speed** | Depends on queue | Immediate processing |
| **Data Privacy** | Data sent to cloud | Data stays on-premise |

### Settings Integration

The module integrates with the existing `gptbase.settings` for:

- API keys and model configurations
- Upload and storage settings
- Cache and processing parameters
- Logging configuration

## Usage

### Basic Usage

```python
from loader.mineru.gbase_adapter import MinerUExtractor

# Initialize the extractor (page-by-page processing is used automatically)
extractor = MinerUExtractor(learn_type=9)

# Process a file
documents = await extractor.process_file(
    filepath="/path/to/document.pptx",
    upload_options=upload_options,
)

# Access results
doc = documents[0]
content = doc.page_content
metadata = doc.metadata

# Check processing results
print(f"Processing mode: {metadata.get('processing_mode', 'page_by_page')}")
print(f"Total pages: {metadata.get('total_pages', 0)}")
print(f"Successful pages: {metadata.get('successful_pages', 0)}")
print(f"Images found: {metadata.get('images_found', 0)}")
```

### Configuring API Type

```python
from loader.mineru.gbase_adapter import MinerUExtractor
from loader.mineru.config_base import MinerUConfig

# Cloud API configuration (default)
cloud_config = MinerUConfig()
cloud_config.mineru_api_type = "cloud"
cloud_config.mineru_api_key = "your_api_key"
cloud_config.mineru_api_url = "https://mineru.net"

# Self-hosted API configuration
self_hosted_config = MinerUConfig()
self_hosted_config.mineru_api_type = "self_hosted"
self_hosted_config.mineru_api_url = "http://10.128.4.1:30001"
# No API key required for self-hosted deployments

# Use the custom configuration
extractor = MinerUExtractor(learn_type=9)
extractor.config = self_hosted_config

documents = await extractor.process_file(
    filepath="/path/to/document.pdf",
    upload_options=upload_options,
)
```

### Configuring Concurrency

```python
from loader.mineru.gbase_adapter import MinerUExtractor
from loader.mineru.config_base import MinerUConfig

# Configure parallel processing limits
config = MinerUConfig()
config.max_concurrent_api_calls = 2  # Process 2 pages simultaneously
config.max_concurrent_uploads = 3    # Upload 3 images simultaneously

# Use the custom configuration
extractor = MinerUExtractor(learn_type=9)
extractor.config = config

documents = await extractor.process_file(
    filepath="/path/to/large_document.pdf",
    upload_options=upload_options,
)
```

### Integration with Existing Code

The module is designed as a drop-in replacement for gzero.py processing:

```python
# Replace gzero_load with MinerU processing
from loader.mineru.gbase_adapter import MinerUExtractor

async def process_with_mineru(file_path, learn_type, upload_options):
    extractor = MinerUExtractor(learn_type)
    return await extractor.process_file(file_path, upload_options=upload_options)

# Batch processing of multiple files
async def batch_process_files(file_paths, learn_type, upload_options):
    extractor = MinerUExtractor(learn_type)
    results = []

    for file_path in file_paths:
        try:
            documents = await extractor.process_file(file_path, upload_options=upload_options)
            results.append((file_path, documents[0]))
        except Exception as e:
            print(f"Failed to process {file_path}: {e}")
            results.append((file_path, None))

    return results
```

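The batch loop above processes files one at a time. If the files are independent, the same pattern can run concurrently; the sketch below assumes a generic `process_one` callback in place of `extractor.process_file`:

```python
import asyncio

async def batch_process_concurrent(file_paths, process_one, max_concurrent=2):
    """Concurrent variant of the sequential batch loop.

    Returns (file_path, result) pairs in input order; failures map to None,
    mirroring the sequential example. Illustrative sketch only.
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def run(path):
        async with sem:
            try:
                return path, await process_one(path)
            except Exception:
                return path, None

    # gather preserves the order of its inputs
    return await asyncio.gather(*(run(p) for p in file_paths))
```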
## Advanced Features

### Flowchart Plugin

Specialized processing for complex flowchart documents:

- Multi-step node identification
- Department organization mapping
- Mermaid diagram generation
- Enhanced visual element extraction

### Content Fusion

Sophisticated content-merging strategies:

- Plain-text and structured-content integration
- Table-aware processing workflows
- LLM-based content refinement
- Context-preserving enhancement

### Image Intelligence

Advanced image processing capabilities:

- Semantic classification of document images
- OCR content extraction and integration
- Filtering of meaningless images
- Batch processing with optimization

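The filtering step reduces to keeping only the classification labels that carry content. A minimal sketch, assuming the three labels named under Key Components and a hypothetical list-of-pairs input (the real processor's interface may differ):

```python
def filter_uploadable_images(classified_images):
    """Drop images classified as 'meaningless' before upload.

    `classified_images` is a list of (image_name, label) pairs, where label
    is one of the categories from MinerUImageProcessor's classification.
    """
    keep = {"structured_content", "brief_description"}
    return [name for name, label in classified_images if label in keep]
```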
## Logging and Tracing

Comprehensive logging with trace_id support:

- File-ID-based tracing throughout the pipeline
- Detailed processing metrics and timing
- Error handling with context preservation
- Integration with existing logging infrastructure

## Compatibility

### gzero.py Integration

- Uses the same `learn_infos` configuration structure
- Compatible with existing upload and storage systems
- Follows the same Document metadata patterns
- Maintains cache and temporary-file conventions

### Dependencies

- **Core**: Built on existing project dependencies
- **LibreOffice**: For PPT conversion (with fallback options)
- **PyMuPDF (fitz)**: For PDF processing and analysis
- **Optional**: python-pptx and reportlab for alternative conversion

## Performance Considerations

### Optimization Features

- **Page-by-Page Processing**: Parallel processing of individual pages for faster results
- **Smart Batching**: Configurable concurrency limits to optimize API usage
- **Concurrent Processing**: Parallel image classification and upload
- **Intelligent Caching**: Reuses processed results when possible
- **Selective Processing**: Filters meaningless images before upload
- **Progress Monitoring**: Real-time tracking of page processing status

### Resource Management

- **Memory Efficient**: Streams large files and cleans up resources
- **Configurable Limits**: Adjustable concurrency and size limits
- **Error Recovery**: Graceful handling of processing failures
- **Timeout Management**: Prevents hanging operations

## Development and Testing

### Mock Implementation

The module includes mock MinerU processing for development:

- Simulates API responses for testing
- Provides realistic processing results
- Enables development without an actual MinerU service

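A mock client for tests might look like the sketch below. The class name, method, and response shape are hypothetical stand-ins; the bundled mock in `api_client.py` defines its own interface:

```python
class MockMinerUClient:
    """Minimal stand-in for the real MinerU API client (illustrative).

    Returns canned per-page markdown so the pipeline can be exercised
    without a running MinerU service.
    """

    async def process_page(self, page_path):
        return {
            "status": "success",
            "markdown": f"# Page from {page_path}\n\nMocked MinerU output.",
            "images": [],
        }
```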
### Example Scripts

- `example.py`: Comprehensive usage examples
- Configuration validation helpers
- Batch processing demonstrations

### Error Handling

- Comprehensive exception handling throughout the pipeline
- Graceful fallbacks for conversion failures
- Detailed error logging with context
- Recovery strategies for partial failures

## Future Enhancements

### Planned Features

- **Enhanced File Upload**: Integrated temporary file hosting for the MinerU API
- **Advanced Table Processing**: More sophisticated table structure analysis
- **Multi-language Support**: Extended language detection and processing
- **Performance Monitoring**: Built-in metrics and performance tracking
- **Adaptive Batching**: Dynamic concurrency adjustment based on API performance

### Extensibility

- **Plugin Architecture**: Easy addition of specialized processors
- **Custom Workflows**: Configurable processing pipelines
- **API Extensions**: Support for additional MinerU service features
- **Format Extensions**: Framework for additional input formats