# MinerU-based PDF/PPT Parsing Module

This module provides a comprehensive PDF and PowerPoint document parsing solution built on MinerU, designed as a modular extension to the existing gzero.py parsing system.

## Features

### Core Capabilities

- **Multi-format Support**: Direct processing of PPT (.ppt/.pptx) and PDF files
- **Intelligent Format Detection**: Automatically detects PPT-origin PDFs for optimized processing
- **Page-by-Page Processing**: Splits PDFs into individual pages for parallel MinerU processing
- **Advanced Image Processing**: AI-powered image classification and content extraction
- **Table Recognition**: Specialized handling for documents containing table structures
- **Content Fusion**: Combines MinerU structured output with plain text for accuracy

### Processing Flow

#### Page-by-Page Processing

**Core Benefits**: Faster processing, better parallelization, optimal resource usage

1. **PDF Splitting**: Automatically splits the PDF into individual page files using PyMuPDF
2. **Parallel Processing**: Processes multiple pages simultaneously through the MinerU API
3. **Batch Management**: Smart batching to avoid API rate limits (configurable concurrency)
4. **Progress Tracking**: Real-time progress reporting with page-level status
5. **Error Resilience**: Continues processing even if individual pages fail
6. **Result Merging**: Combines all page results into structured markdown content
7. **Image Organization**: Automatically renames images with page prefixes for better organization

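The parallel stage above can be sketched with `asyncio`. This is an illustrative sketch only — the function and callback names are hypothetical, and the module's real implementation in `api_client.py` may differ:

```python
import asyncio

async def process_pages_in_parallel(pages, process_page, max_concurrent=3):
    """Process pages concurrently, capped by a semaphore (illustrative sketch).

    `process_page` stands in for a single MinerU page call; a failed page
    yields None instead of aborting the whole batch.
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def worker(index, page):
        async with sem:
            try:
                return index, await process_page(page)
            except Exception:
                # Error resilience: record the failure, keep going
                return index, None

    results = await asyncio.gather(*(worker(i, p) for i, p in enumerate(pages)))
    # Merge results back in page order, skipping failed pages
    return [r for _, r in sorted(results) if r is not None]
```

The semaphore is what `MAX_CONCURRENT_API_CALLS` would bound in practice; raising it trades API pressure for throughput.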
#### Document Processing Pipeline

1. **File Input**: Accepts PPT (.ppt/.pptx) or PDF files
2. **Format Detection**: Determines whether the PDF originated from a PPT presentation
3. **PPT Conversion**: Converts PPT to PDF using LibreOffice (with fallback)
4. **Page Splitting**: Splits the PDF into individual pages for parallel processing
5. **MinerU Processing**: Each page is processed through the MinerU API with OCR, formula, and table recognition
6. **Content Integration**: Merges page results with structured content organization
7. **Image Processing**: AI-powered image classification and upload integration
8. **Final Assembly**: Creates the complete document with page-organized content

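Step 2, format detection, can be approximated with a page-geometry heuristic: PPT-exported PDFs tend to have uniform landscape pages at 4:3 or 16:9. This is one plausible heuristic, sketched here as an assumption — the actual logic in `converter.py` may use different signals:

```python
def looks_like_ppt_origin(page_sizes, tolerance=0.05):
    """Guess whether a PDF was exported from a presentation.

    `page_sizes` is a list of (width, height) tuples, e.g. taken from
    PyMuPDF page rectangles. Hypothetical helper, not the module's API.
    """
    if not page_sizes:
        return False
    slide_ratios = (4 / 3, 16 / 9)  # common presentation aspect ratios
    for width, height in page_sizes:
        if height >= width:  # portrait pages suggest a native PDF
            return False
        ratio = width / height
        if not any(abs(ratio - t) / t <= tolerance for t in slide_ratios):
            return False
    return True
```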
## Architecture

### Module Structure

```
loader/mineru/
├── __init__.py            # Module initialization
├── maxkb_adapter/         # MaxKB adapter implementation
│   ├── __init__.py        # Adapter module initialization
│   ├── adapter.py         # MaxKBAdapter and MinerUExtractor
│   └── config_maxkb.py    # MaxKB-specific configuration
├── base_parser.py         # Base classes for platform adapters
├── config.py              # Configuration management
├── converter.py           # File conversion and detection
├── api_client.py          # MinerU API integration
├── image_processor.py     # Image recognition and processing
├── content_processor.py   # Content fusion and refinement
├── flowchart_plugin.py    # Specialized flowchart processing
├── utils.py               # Utility functions
├── example.py             # Usage examples
└── README.md              # This file
```

### Key Components

**MinerUExtractor**: Main parser class orchestrating the complete pipeline

- Handles file processing from input to final Document output
- Manages temporary directories and caching
- Integrates with existing gzero.py patterns for compatibility

**DocumentConverter**: File type detection and conversion

- PPT-to-PDF conversion with LibreOffice and fallback methods
- PDF format detection (PPT-origin vs. native PDF)
- Page extraction and metadata analysis

**MinerUAPIClient**: Interface to the MinerU service

- Handles API communication and response processing
- Supports both cloud and self-hosted MinerU deployments
- Includes a mock implementation for development/testing
- Cloud API: asynchronous processing with polling
- Self-hosted API: synchronous processing with direct file upload

**MinerUImageProcessor**: Advanced image handling

- AI-powered image classification (structured_content/brief_description/meaningless)
- Batch processing with concurrency control
- Integration with existing image upload infrastructure

**MinerUContentProcessor**: Content analysis and enhancement

- Table detection and specialized processing
- Plain text extraction and merging with structured content
- LLM-based content refinement for complex documents

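The LibreOffice conversion that DocumentConverter performs boils down to a headless command-line invocation. A minimal sketch of how that command might be assembled (the helper name is hypothetical; the real `converter.py` may build it differently):

```python
def libreoffice_convert_command(input_path, output_dir, binary="libreoffice"):
    """Build the headless LibreOffice command for PPT -> PDF conversion.

    `binary` corresponds to LIBREOFFICE_PATH; run the result with
    subprocess.run(cmd, timeout=CONVERSION_TIMEOUT, check=True).
    """
    return [
        binary,
        "--headless",        # no GUI
        "--convert-to", "pdf",
        "--outdir", str(output_dir),
        str(input_path),
    ]
```

If LibreOffice is unavailable, the optional python-pptx/reportlab dependencies listed below provide the fallback path.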
## Configuration

### Environment Variables

```bash
# MinerU API Configuration
MINERU_API_KEY=your_mineru_api_key       # Required for cloud API
MINERU_API_URL=https://mineru.net        # Cloud API URL
MINERU_API_TYPE=cloud                    # "cloud" or "self_hosted"

# For self-hosted MinerU (alternative configuration)
# MINERU_API_URL=http://10.128.4.1:30001 # Self-hosted API URL
# MINERU_API_TYPE=self_hosted            # No API key required

# LLM Configuration (uses existing gzero.py settings)
ADVANCED_PARSER_KEY_OPENAI=your_openai_key
ADVANCED_PARSER_KEY_CLAUDE=your_claude_key
ADVANCED_PARSER_KEY_GEMINI=your_gemini_key

# Processing Configuration
MAX_FILE_SIZE=52428800                   # 50 MB
LIBREOFFICE_PATH=libreoffice
CONVERSION_TIMEOUT=300                   # 5 minutes

# Parallel Processing Controls
MAX_CONCURRENT_UPLOADS=5
MAX_CONCURRENT_API_CALLS=3               # Controls page processing concurrency
MAX_IMAGE_SIZE_MB=5.0
COMPRESSION_QUALITY=85
```

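Reading these variables into a config object might look like the sketch below. The function name and dict shape are illustrative assumptions; the real `config.py` defines its own structure:

```python
import os

def load_mineru_config(env=None):
    """Read MinerU settings from environment variables with the documented
    defaults. Illustrative sketch, not the module's actual loader."""
    env = os.environ if env is None else env
    return {
        "api_key": env.get("MINERU_API_KEY", ""),
        "api_url": env.get("MINERU_API_URL", "https://mineru.net"),
        "api_type": env.get("MINERU_API_TYPE", "cloud"),
        "max_file_size": int(env.get("MAX_FILE_SIZE", 52428800)),
        "max_concurrent_api_calls": int(env.get("MAX_CONCURRENT_API_CALLS", 3)),
        "max_concurrent_uploads": int(env.get("MAX_CONCURRENT_UPLOADS", 5)),
    }
```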
### API Type Comparison

| Feature | Cloud API | Self-Hosted API |
|---------|-----------|-----------------|
| **Authentication** | API key required | No authentication |
| **Processing Model** | Asynchronous with polling | Synchronous direct response |
| **File Upload** | Requires public URL | Direct multipart upload |
| **Rate Limits** | 2000 pages/day (free tier) | Limited by server resources |
| **Network Requirements** | Internet access required | Local network access |
| **Setup Complexity** | Simple (API key only) | Requires self-hosted deployment |
| **Processing Speed** | Depends on queue | Immediate processing |
| **Data Privacy** | Data sent to cloud | Data stays on-premise |

### Settings Integration

The module integrates with the existing `gptbase.settings` for:

- API keys and model configurations
- Upload and storage settings
- Cache and processing parameters
- Logging configuration

## Usage

### Basic Usage

```python
from loader.mineru.gbase_adapter import MinerUExtractor

# Initialize the extractor (page-by-page processing is used automatically)
extractor = MinerUExtractor(learn_type=9)

# Process a file
documents = await extractor.process_file(
    filepath="/path/to/document.pptx",
    upload_options=upload_options,
)

# Access results
doc = documents[0]
content = doc.page_content
metadata = doc.metadata

# Check processing results
print(f"Processing mode: {metadata.get('processing_mode', 'page_by_page')}")
print(f"Total pages: {metadata.get('total_pages', 0)}")
print(f"Successful pages: {metadata.get('successful_pages', 0)}")
print(f"Images found: {metadata.get('images_found', 0)}")
```

### Configuring API Type

```python
from loader.mineru.gbase_adapter import MinerUExtractor
from loader.mineru.config_base import MinerUConfig

# Cloud API configuration (default)
cloud_config = MinerUConfig()
cloud_config.mineru_api_type = "cloud"
cloud_config.mineru_api_key = "your_api_key"
cloud_config.mineru_api_url = "https://mineru.net"

# Self-hosted API configuration
self_hosted_config = MinerUConfig()
self_hosted_config.mineru_api_type = "self_hosted"
self_hosted_config.mineru_api_url = "http://10.128.4.1:30001"
# No API key required for self-hosted deployments

# Use the custom configuration
extractor = MinerUExtractor(learn_type=9)
extractor.config = self_hosted_config

documents = await extractor.process_file(
    filepath="/path/to/document.pdf",
    upload_options=upload_options,
)
```

### Configuring Concurrency

```python
from loader.mineru.gbase_adapter import MinerUExtractor
from loader.mineru.config_base import MinerUConfig

# Configure parallel processing limits
config = MinerUConfig()
config.max_concurrent_api_calls = 2  # Process 2 pages simultaneously
config.max_concurrent_uploads = 3    # Upload 3 images simultaneously

# Use the custom configuration
extractor = MinerUExtractor(learn_type=9)
extractor.config = config

documents = await extractor.process_file(
    filepath="/path/to/large_document.pdf",
    upload_options=upload_options,
)
```

### Integration with Existing Code

The module is designed as a drop-in replacement for gzero.py processing:

```python
# Replace gzero_load with MinerU processing
from loader.mineru.gbase_adapter import MinerUExtractor

async def process_with_mineru(file_path, learn_type, upload_options):
    extractor = MinerUExtractor(learn_type)
    return await extractor.process_file(file_path, upload_options=upload_options)

# Batch processing of multiple files
async def batch_process_files(file_paths, learn_type, upload_options):
    extractor = MinerUExtractor(learn_type)
    results = []

    for file_path in file_paths:
        try:
            documents = await extractor.process_file(file_path, upload_options=upload_options)
            results.append((file_path, documents[0]))
        except Exception as e:
            print(f"Failed to process {file_path}: {e}")
            results.append((file_path, None))

    return results
```

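The batch loop above processes files one at a time. If the files are independent, the same pattern can run concurrently; the sketch below assumes a generic `process_one` callback in place of `extractor.process_file`:

```python
import asyncio

async def batch_process_concurrent(file_paths, process_one, max_concurrent=2):
    """Concurrent variant of the sequential batch loop.

    Returns (file_path, result) pairs in input order; failures map to None,
    mirroring the sequential example. Illustrative sketch only.
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def run(path):
        async with sem:
            try:
                return path, await process_one(path)
            except Exception:
                return path, None

    # gather preserves the order of its inputs
    return await asyncio.gather(*(run(p) for p in file_paths))
```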
## Advanced Features

### Flowchart Plugin

Specialized processing for complex flowchart documents:

- Multi-step node identification
- Department organization mapping
- Mermaid diagram generation
- Enhanced visual element extraction

### Content Fusion

Sophisticated content-merging strategies:

- Plain-text and structured-content integration
- Table-aware processing workflows
- LLM-based content refinement
- Context-preserving enhancement

### Image Intelligence

Advanced image processing capabilities:

- Semantic classification of document images
- OCR content extraction and integration
- Filtering of meaningless images
- Batch processing with optimization

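The filtering step reduces to keeping only the classification labels that carry content. A minimal sketch, assuming the three labels named under Key Components and a hypothetical list-of-pairs input (the real processor's interface may differ):

```python
def filter_uploadable_images(classified_images):
    """Drop images classified as 'meaningless' before upload.

    `classified_images` is a list of (image_name, label) pairs, where label
    is one of the categories from MinerUImageProcessor's classification.
    """
    keep = {"structured_content", "brief_description"}
    return [name for name, label in classified_images if label in keep]
```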
## Logging and Tracing

Comprehensive logging with trace_id support:

- File-ID-based tracing throughout the pipeline
- Detailed processing metrics and timing
- Error handling with context preservation
- Integration with existing logging infrastructure

## Compatibility

### gzero.py Integration

- Uses the same `learn_infos` configuration structure
- Compatible with existing upload and storage systems
- Follows the same Document metadata patterns
- Maintains cache and temporary-file conventions

### Dependencies

- **Core**: Built on existing project dependencies
- **LibreOffice**: For PPT conversion (with fallback options)
- **PyMuPDF (fitz)**: For PDF processing and analysis
- **Optional**: python-pptx and reportlab for alternative conversion

## Performance Considerations

### Optimization Features

- **Page-by-Page Processing**: Parallel processing of individual pages for faster results
- **Smart Batching**: Configurable concurrency limits to optimize API usage
- **Concurrent Processing**: Parallel image classification and upload
- **Intelligent Caching**: Reuses processed results when possible
- **Selective Processing**: Filters meaningless images before upload
- **Progress Monitoring**: Real-time tracking of page processing status

### Resource Management

- **Memory Efficient**: Streams large files and cleans up resources
- **Configurable Limits**: Adjustable concurrency and size limits
- **Error Recovery**: Graceful handling of processing failures
- **Timeout Management**: Prevents hanging operations

## Development and Testing

### Mock Implementation

The module includes mock MinerU processing for development:

- Simulates API responses for testing
- Provides realistic processing results
- Enables development without an actual MinerU service

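A mock client for tests might look like the sketch below. The class name, method, and response shape are hypothetical stand-ins; the bundled mock in `api_client.py` defines its own interface:

```python
class MockMinerUClient:
    """Minimal stand-in for the real MinerU API client (illustrative).

    Returns canned per-page markdown so the pipeline can be exercised
    without a running MinerU service.
    """

    async def process_page(self, page_path):
        return {
            "status": "success",
            "markdown": f"# Page from {page_path}\n\nMocked MinerU output.",
            "images": [],
        }
```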
### Example Scripts

- `example.py`: Comprehensive usage examples
- Configuration validation helpers
- Batch processing demonstrations

### Error Handling

- Comprehensive exception handling throughout the pipeline
- Graceful fallbacks for conversion failures
- Detailed error logging with context
- Recovery strategies for partial failures

## Future Enhancements

### Planned Features

- **Enhanced File Upload**: Integrated temporary file hosting for the MinerU API
- **Advanced Table Processing**: More sophisticated table structure analysis
- **Multi-language Support**: Extended language detection and processing
- **Performance Monitoring**: Built-in metrics and performance tracking
- **Adaptive Batching**: Dynamic concurrency adjustment based on API performance

### Extensibility

- **Plugin Architecture**: Easy addition of specialized processors
- **Custom Workflows**: Configurable processing pipelines
- **API Extensions**: Support for additional MinerU service features
- **Format Extensions**: Framework for additional input formats