maxkb/apps/common/handle/impl/mineru/README.md
2025-08-24 00:56:02 +08:00


# MinerU-based PDF/PPT Parsing Module
This module parses PDF and PowerPoint documents with MinerU. It is designed as a modular extension to the existing gzero.py parsing system.
## Features
### Core Capabilities
- **Multi-format Support**: Direct processing of PPT (.ppt/.pptx) and PDF files
- **Intelligent Format Detection**: Automatically detects PPT-origin PDFs for optimized processing
- **Page-by-Page Processing**: Splits PDFs into individual pages for parallel MinerU processing
- **Advanced Image Processing**: AI-powered image classification and content extraction
- **Table Recognition**: Specialized handling for documents containing table structures
- **Content Fusion**: Combines MinerU structured output with plain text for accuracy
### Processing Flow
#### Page-by-Page Processing
**Core Benefits**: Faster processing, better parallelization, optimal resource usage
1. **PDF Splitting**: Automatically splits PDF into individual page files using PyMuPDF
2. **Parallel Processing**: Processes multiple pages simultaneously with MinerU API
3. **Batch Management**: Smart batching to avoid API rate limits (configurable concurrency)
4. **Progress Tracking**: Real-time progress reporting with page-level status
5. **Error Resilience**: Continues processing even if individual pages fail
6. **Result Merging**: Combines all page results into structured markdown content
7. **Image Organization**: Automatically renames images with page prefixes for better organization
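
The flow above can be sketched with `asyncio`. This is a minimal illustration under stated assumptions, not the module's implementation: `process_page` is a hypothetical stand-in for the MinerU API call, and the names `process_pages`, `merge_results`, and `max_concurrent` exist only in this example.

```python
import asyncio


async def process_page(page_number: int) -> str:
    """Hypothetical stand-in for one MinerU API call on a single page."""
    await asyncio.sleep(0)  # simulate I/O
    if page_number == 2:
        raise RuntimeError("simulated page failure")
    return f"content of page {page_number}"


async def process_pages(page_numbers, max_concurrent=3):
    """Process pages in parallel, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(page_number):
        async with semaphore:
            return await process_page(page_number)

    # return_exceptions=True keeps one failed page from cancelling the rest.
    outcomes = await asyncio.gather(
        *(guarded(n) for n in page_numbers), return_exceptions=True
    )
    return dict(zip(page_numbers, outcomes))


def merge_results(outcomes):
    """Join successful pages in page order; report failed pages separately."""
    sections, failed = [], []
    for page_number in sorted(outcomes):
        result = outcomes[page_number]
        if isinstance(result, Exception):
            failed.append(page_number)
        else:
            sections.append(f"## Page {page_number}\n\n{result}")
    return "\n\n".join(sections), failed


outcomes = asyncio.run(process_pages([1, 2, 3], max_concurrent=2))
markdown, failed_pages = merge_results(outcomes)
```

Collecting exceptions instead of raising them is what lets the merge step report failed pages while still returning the pages that succeeded.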
#### Document Processing Pipeline
1. **File Input**: Accepts PPT (.ppt/.pptx) or PDF files
2. **Format Detection**: Determines if PDF originated from PPT presentation
3. **PPT Conversion**: Converts PPT to PDF using LibreOffice (with fallback)
4. **Page Splitting**: Splits PDF into individual pages for parallel processing
5. **MinerU Processing**: Each page is processed through the MinerU API with OCR, formula, and table recognition
6. **Content Integration**: Merges page results with structured content organization
7. **Image Processing**: AI-powered image classification and upload integration
8. **Final Assembly**: Creates complete document with page-organized content
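
Steps 1 and 3 of the pipeline can be sketched as follows. This is an assumption-laden example rather than the module's converter: the helper names are hypothetical, the defaults mirror the `LIBREOFFICE_PATH` and `CONVERSION_TIMEOUT` example values from the Configuration section, and the fallback path is left to the caller.

```python
import subprocess
from pathlib import Path

PPT_SUFFIXES = {".ppt", ".pptx"}


def needs_conversion(path) -> bool:
    """Return True for PowerPoint inputs, which are converted to PDF first."""
    return Path(path).suffix.lower() in PPT_SUFFIXES


def build_convert_command(path, out_dir, libreoffice="libreoffice"):
    """Build the headless LibreOffice command that writes a PDF into out_dir."""
    return [
        libreoffice, "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(path),
    ]


def convert_to_pdf(path, out_dir, timeout=300):
    """Run the conversion and return the expected PDF path.

    Raises subprocess.TimeoutExpired or subprocess.CalledProcessError on
    failure, which a caller could catch in order to try a fallback converter.
    """
    subprocess.run(build_convert_command(path, out_dir), check=True, timeout=timeout)
    return Path(out_dir) / (Path(path).stem + ".pdf")


command = build_convert_command("slides/deck.pptx", "/tmp/out")
```

Keeping the command builder separate from the `subprocess.run` call makes the conversion logic easy to check without LibreOffice installed.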
## Architecture
### Module Structure
```
loader/mineru/
├── __init__.py              # Module initialization
├── maxkb_adapter/           # MaxKB adapter implementation
│   ├── __init__.py          # Adapter module initialization
│   ├── adapter.py           # MaxKBAdapter and MinerUExtractor
│   └── config_maxkb.py      # MaxKB-specific configuration
├── base_parser.py           # Base classes for platform adapters
├── config.py                # Configuration management
├── converter.py             # File conversion and detection
├── api_client.py            # MinerU API integration
├── image_processor.py       # Image recognition and processing
├── content_processor.py     # Content fusion and refinement
├── flowchart_plugin.py      # Specialized flowchart processing
├── utils.py                 # Utility functions
├── example.py               # Usage examples
└── README.md                # This file
```
### Key Components
**MinerUExtractor**: Main parser class orchestrating the complete pipeline
- Handles file processing from input to final Document output
- Manages temporary directories and caching
- Integrates with existing gzero.py patterns for compatibility

**DocumentConverter**: File type detection and conversion
- PPT to PDF conversion with LibreOffice and fallback methods
- PDF format detection (PPT-origin vs native PDF)
- Page extraction and metadata analysis

**MinerUAPIClient**: Interface to MinerU service
- Handles API communication and response processing
- Supports both cloud and self-hosted MinerU deployments
- Includes mock implementation for development/testing
- Cloud API: Asynchronous processing with polling
- Self-hosted API: Synchronous processing with direct file upload

**MinerUImageProcessor**: Advanced image handling
- AI-powered image classification (structured_content/brief_description/meaningless)
- Batch processing with concurrency control
- Integration with existing image upload infrastructure

**MinerUContentProcessor**: Content analysis and enhancement
- Table detection and specialized processing
- Plain text extraction and merging with structured content
- LLM-based content refinement for complex documents

## Configuration
### Environment Variables
```bash
# MinerU API Configuration
MINERU_API_KEY=your_mineru_api_key # Required for cloud API
MINERU_API_URL=https://mineru.net # Cloud API URL
MINERU_API_TYPE=cloud # "cloud" or "self_hosted"
# For self-hosted MinerU (alternative configuration)
# MINERU_API_URL=http://10.128.4.1:30001 # Self-hosted API URL
# MINERU_API_TYPE=self_hosted # No API key required
# LLM Configuration (uses existing gzero.py settings)
ADVANCED_PARSER_KEY_OPENAI=your_openai_key
ADVANCED_PARSER_KEY_CLAUDE=your_claude_key
ADVANCED_PARSER_KEY_GEMINI=your_gemini_key
# Processing Configuration
MAX_FILE_SIZE=52428800 # 50 MiB, in bytes
LIBREOFFICE_PATH=libreoffice
CONVERSION_TIMEOUT=300 # 5 minutes
# Parallel Processing Controls
MAX_CONCURRENT_UPLOADS=5
MAX_CONCURRENT_API_CALLS=3 # Controls page processing concurrency
MAX_IMAGE_SIZE_MB=5.0
COMPRESSION_QUALITY=85
```
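
As a rough sketch of how a few of these variables could be read and validated, the example below uses a hypothetical `EnvSettings` dataclass and `load_settings` helper. Neither is part of the module, and the fallback values simply mirror the example values shown above.

```python
import os
from dataclasses import dataclass


@dataclass
class EnvSettings:
    """Hypothetical container for a few of the variables listed above."""
    api_type: str
    api_url: str
    api_key: str
    max_concurrent_api_calls: int
    max_file_size: int


def load_settings(env=None):
    """Read settings from a mapping (defaults to os.environ) with fallbacks."""
    env = os.environ if env is None else env
    settings = EnvSettings(
        api_type=env.get("MINERU_API_TYPE", "cloud"),
        api_url=env.get("MINERU_API_URL", "https://mineru.net"),
        api_key=env.get("MINERU_API_KEY", ""),
        max_concurrent_api_calls=int(env.get("MAX_CONCURRENT_API_CALLS", "3")),
        max_file_size=int(env.get("MAX_FILE_SIZE", str(50 * 1024 * 1024))),
    )
    if settings.api_type not in ("cloud", "self_hosted"):
        raise ValueError(f"unknown MINERU_API_TYPE: {settings.api_type!r}")
    if settings.api_type == "cloud" and not settings.api_key:
        raise ValueError("MINERU_API_KEY is required for the cloud API")
    return settings


settings = load_settings({
    "MINERU_API_TYPE": "self_hosted",
    "MINERU_API_URL": "http://10.128.4.1:30001",
})
```

Failing early on a missing cloud API key surfaces configuration mistakes before any document is processed.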
### API Type Comparison
| Feature | Cloud API | Self-Hosted API |
|---------|-----------|-----------------|
| **Authentication** | API key required | No authentication |
| **Processing Model** | Asynchronous with polling | Synchronous direct response |
| **File Upload** | Requires public URL | Direct multipart upload |
| **Rate Limits** | 2000 pages/day (free tier) | Limited by server resources |
| **Network Requirements** | Internet access required | Local network access |
| **Setup Complexity** | Simple (API key only) | Requires self-hosted deployment |
| **Processing Speed** | Depends on queue | Immediate processing |
| **Data Privacy** | Data sent to cloud | Data stays on-premise |
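
The "asynchronous with polling" model in the table can be illustrated with a generic polling loop. This is only a sketch: `fetch_status` is a hypothetical callable, and the `state`, `result`, and `error` keys are assumptions for the example, not the MinerU response schema.

```python
import time


def poll_until_done(fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch_status() until it reports a terminal state or time runs out.

    fetch_status is a hypothetical callable returning a dict such as
    {"state": "pending"}, {"state": "done", "result": ...} or
    {"state": "failed", "error": ...}.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status["state"] == "done":
            return status["result"]
        if status["state"] == "failed":
            raise RuntimeError(status.get("error", "task failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish in time")
        sleep(interval)


# Simulated task that finishes on the third status check.
responses = iter([
    {"state": "pending"},
    {"state": "pending"},
    {"state": "done", "result": "# Page 1"},
])
result = poll_until_done(lambda: next(responses), interval=0.0)
```

A synchronous self-hosted call would skip this loop entirely, because the parsed result arrives in the response to the upload request.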
### Settings Integration
The module integrates with existing `gptbase.settings` for:
- API keys and model configurations
- Upload and storage settings
- Cache and processing parameters
- Logging configuration
## Usage
### Basic Usage
```python
from loader.mineru.gbase_adapter import MinerUExtractor

# Initialize extractor (automatically uses page-by-page processing)
extractor = MinerUExtractor(learn_type=9)

# Process file (`await` must run inside an async function)
documents = await extractor.process_file(
    filepath="/path/to/document.pptx",
    upload_options=upload_options
)

# Access results
doc = documents[0]
content = doc.page_content
metadata = doc.metadata

# Check processing results
print(f"Processing mode: {metadata.get('processing_mode', 'page_by_page')}")
print(f"Total pages: {metadata.get('total_pages', 0)}")
print(f"Successful pages: {metadata.get('successful_pages', 0)}")
print(f"Images found: {metadata.get('images_found', 0)}")
```
### Configuring API Type
```python
from loader.mineru.gbase_adapter import MinerUExtractor
from loader.mineru.config_base import MinerUConfig

# Cloud API configuration (default)
cloud_config = MinerUConfig()
cloud_config.mineru_api_type = "cloud"
cloud_config.mineru_api_key = "your_api_key"
cloud_config.mineru_api_url = "https://mineru.net"

# Self-hosted API configuration
self_hosted_config = MinerUConfig()
self_hosted_config.mineru_api_type = "self_hosted"
self_hosted_config.mineru_api_url = "http://10.128.4.1:30001"
# No API key required for self-hosted

# Use with custom configuration
extractor = MinerUExtractor(learn_type=9)
extractor.config = self_hosted_config
documents = await extractor.process_file(
    filepath="/path/to/document.pdf",
    upload_options=upload_options
)
```
### Configuring Concurrency
```python
from loader.mineru.gbase_adapter import MinerUExtractor
from loader.mineru.config_base import MinerUConfig

# Configure parallel processing limits
config = MinerUConfig()
config.max_concurrent_api_calls = 2  # Process 2 pages simultaneously
config.max_concurrent_uploads = 3    # Upload 3 images simultaneously

# Use with custom configuration
extractor = MinerUExtractor(learn_type=9)
extractor.config = config
documents = await extractor.process_file(
    filepath="/path/to/large_document.pdf",
    upload_options=upload_options
)
```
### Integration with Existing Code
The module is designed to be a drop-in replacement for gzero.py processing:
```python
# Replace gzero_load with mineru processing
from loader.mineru.gbase_adapter import MinerUExtractor


async def process_with_mineru(file_path, learn_type, upload_options):
    extractor = MinerUExtractor(learn_type)
    return await extractor.process_file(file_path, upload_options=upload_options)


# Batch processing multiple files
async def batch_process_files(file_paths, learn_type, upload_options):
    extractor = MinerUExtractor(learn_type)
    results = []
    for file_path in file_paths:
        try:
            documents = await extractor.process_file(file_path, upload_options=upload_options)
            results.append((file_path, documents[0]))
        except Exception as e:
            print(f"Failed to process {file_path}: {e}")
            results.append((file_path, None))
    return results
```
## Advanced Features
### Flowchart Plugin
Specialized processing for complex flowchart documents:
- Multi-step node identification
- Department organization mapping
- Mermaid diagram generation
- Enhanced visual element extraction
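
To make the Mermaid generation step concrete, here is a small sketch that renders already-identified nodes and edges as Mermaid flowchart text. The `to_mermaid` function and its input shapes are assumptions for illustration and do not describe how `flowchart_plugin.py` works.

```python
def to_mermaid(nodes, edges, direction="TD"):
    """Render nodes and edges as Mermaid flowchart text.

    nodes: mapping of node id -> label
    edges: iterable of (source id, target id, edge label or None)
    """
    lines = [f"flowchart {direction}"]
    for node_id, label in nodes.items():
        safe = label.replace('"', "'")  # a double quote would end the label early
        lines.append(f'    {node_id}["{safe}"]')
    for source, target, label in edges:
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {source} {arrow} {target}")
    return "\n".join(lines)


diagram = to_mermaid(
    {"A": "Submit request", "B": "Approved?", "C": "Archive"},
    [("A", "B", None), ("B", "C", "yes")],
)
```

The resulting text can be placed in a fenced `mermaid` block wherever the output markdown is rendered by a Mermaid-aware viewer.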
### Content Fusion
Sophisticated content merging strategies:
- Plain text + structured content integration
- Table-aware processing workflows
- LLM-based content refinement
- Context-preserving enhancement
### Image Intelligence
Advanced image processing capabilities:
- Semantic classification of document images
- OCR content extraction and integration
- Meaningless image filtering
- Batch processing with optimization
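
The filtering step can be sketched as below, using the three category names listed under Key Components. The function name and the `(path, category)` input shape are assumptions for this example, and the classification itself is taken as already done upstream.

```python
from collections import Counter

# Categories worth uploading; "meaningless" images are dropped.
KEEP_CATEGORIES = {"structured_content", "brief_description"}


def select_images_for_upload(classified):
    """Drop images labelled 'meaningless' and count every category.

    classified: list of (image path, category) pairs, where the category is
    assumed to come from an upstream classifier.
    """
    counts = Counter(category for _, category in classified)
    keep = [path for path, category in classified if category in KEEP_CATEGORIES]
    return keep, counts


to_upload, category_counts = select_images_for_upload([
    ("page_1_chart.png", "structured_content"),
    ("page_1_divider.png", "meaningless"),
    ("page_2_photo.png", "brief_description"),
])
```

Filtering before upload means storage and upload bandwidth are spent only on images that contribute content to the final document.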
## Logging and Tracing
Comprehensive logging with trace_id support:
- File ID-based tracing throughout pipeline
- Detailed processing metrics and timing
- Error handling with context preservation
- Integration with existing logging infrastructure
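
One way to carry a trace ID through every log line is the standard library's `logging.LoggerAdapter`. The sketch below is an illustration only; the logger name, the format string, and the `make_trace_logger` helper are assumptions, not the module's logging setup.

```python
import io
import logging


def make_trace_logger(trace_id, stream=None):
    """Return a logger adapter that stamps every record with trace_id."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("[%(trace_id)s] %(levelname)s %(message)s")
    )
    logger = logging.getLogger(f"mineru.example.{trace_id}")
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.propagate = False
    logger.setLevel(logging.INFO)
    # The adapter injects trace_id into each record via the `extra` mapping.
    return logging.LoggerAdapter(logger, {"trace_id": trace_id})


buffer = io.StringIO()
log = make_trace_logger("file-123", buffer)
log.info("page %d/%d parsed in %.2fs", 3, 10, 1.25)
```

Because the adapter adds the ID, call sites log plain messages and the trace context never has to be passed into each call by hand.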
## Compatibility
### gzero.py Integration
- Uses same `learn_infos` configuration structure
- Compatible with existing upload and storage systems
- Follows same Document metadata patterns
- Maintains cache and temporary file conventions
### Dependencies
- **Core**: Built on existing project dependencies
- **LibreOffice**: For PPT conversion (with fallback options)
- **PyMuPDF (fitz)**: For PDF processing and analysis
- **Optional**: python-pptx, reportlab for alternative conversion
## Performance Considerations
### Optimization Features
- **Page-by-Page Processing**: Parallel processing of individual pages for faster results
- **Smart Batching**: Configurable concurrency limits to optimize API usage
- **Concurrent Processing**: Parallel image classification and upload
- **Intelligent Caching**: Reuses processed results when possible
- **Selective Processing**: Filters meaningless images before upload
- **Progress Monitoring**: Real-time tracking of page processing status
### Resource Management
- **Memory Efficient**: Streams large files and cleans up resources
- **Configurable Limits**: Adjustable concurrency and size limits
- **Error Recovery**: Graceful handling of processing failures
- **Timeout Management**: Prevents hanging operations
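
Timeout management for a single asynchronous operation can be sketched with `asyncio.wait_for`. The `with_timeout` helper is hypothetical, and returning a default on timeout is one possible policy, chosen here so that a slow page can be recorded as failed instead of stalling the whole run.

```python
import asyncio


async def with_timeout(coro, seconds, default=None):
    """Await coro, returning default if it does not finish within seconds."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return default


async def _demo():
    # asyncio.sleep stands in for a fast and a slow API call.
    fast = await with_timeout(asyncio.sleep(0, result="ok"), seconds=1)
    slow = await with_timeout(asyncio.sleep(5, result="late"), seconds=0.01)
    return fast, slow


fast_result, slow_result = asyncio.run(_demo())
```

`asyncio.wait_for` cancels the awaited operation when the timeout expires, so the slow call does not keep running in the background.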
## Development and Testing
### Mock Implementation
The module includes mock MinerU processing for development:
- Simulates API responses for testing
- Provides realistic processing results
- Enables development without actual MinerU service
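
A mock of this kind might look like the sketch below. The class name, the `parse_page` method, and the shape of the returned dictionary are all assumptions made for illustration; they do not describe the mock shipped with the module.

```python
import asyncio


class MockMinerUClient:
    """Hypothetical mock: returns canned markdown instead of calling MinerU."""

    def __init__(self, fail_pages=()):
        self.fail_pages = set(fail_pages)
        self.calls = []  # page numbers requested so far, useful in tests

    async def parse_page(self, page_number, pdf_bytes):
        self.calls.append(page_number)
        if page_number in self.fail_pages:
            raise RuntimeError(f"mock failure on page {page_number}")
        return {
            "markdown": f"# Mock page {page_number}\n\n({len(pdf_bytes)} bytes received)",
            "images": [],
        }


client = MockMinerUClient(fail_pages=[2])
result = asyncio.run(client.parse_page(1, b"%PDF-1.7"))
```

Making failures configurable lets tests exercise the partial-failure path without a running MinerU service.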
### Example Scripts
- `example.py`: Comprehensive usage examples
- Configuration validation helpers
- Batch processing demonstrations
### Error Handling
- Comprehensive exception handling throughout pipeline
- Graceful fallbacks for conversion failures
- Detailed error logging with context
- Recovery strategies for partial failures
## Future Enhancements
### Planned Features
- **Enhanced File Upload**: Integrated temporary file hosting for MinerU API
- **Advanced Table Processing**: More sophisticated table structure analysis
- **Multi-language Support**: Extended language detection and processing
- **Performance Monitoring**: Built-in metrics and performance tracking
- **Adaptive Batching**: Dynamic concurrency adjustment based on API performance
### Extensibility
- **Plugin Architecture**: Easy addition of specialized processors
- **Custom Workflows**: Configurable processing pipelines
- **API Extensions**: Support for additional MinerU service features
- **Format Extensions**: Framework for additional input formats