MinerU-based PDF/PPT Parsing Module

This module parses PDF and PowerPoint documents with MinerU. It is designed as a modular extension to the existing gzero.py parsing system.

Features

Core Capabilities

  • Multi-format Support: Direct processing of PPT (.ppt/.pptx) and PDF files
  • Intelligent Format Detection: Automatically detects PPT-origin PDFs for optimized processing
  • Page-by-Page Processing: Splits PDFs into individual pages for parallel MinerU processing
  • Advanced Image Processing: AI-powered image classification and content extraction
  • Table Recognition: Specialized handling for documents containing table structures
  • Content Fusion: Combines MinerU structured output with plain text for accuracy

Processing Flow

Page-by-Page Processing

Core Benefits: Faster processing, better parallelization, optimal resource usage

  1. PDF Splitting: Automatically splits PDF into individual page files using PyMuPDF
  2. Parallel Processing: Processes multiple pages simultaneously with MinerU API
  3. Batch Management: Smart batching to avoid API rate limits (configurable concurrency)
  4. Progress Tracking: Real-time progress reporting with page-level status
  5. Error Resilience: Continues processing even if individual pages fail
  6. Result Merging: Combines all page results into structured markdown content
  7. Image Organization: Automatically renames images with page prefixes for better organization
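
Steps 5 to 7 above can be illustrated with a minimal sketch. It assumes each page yields a markdown string (or nothing on failure); `PageResult`, `prefix_images`, `merge_pages`, and the `page_NNN_` naming scheme are hypothetical names chosen for illustration, not the module's actual API.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches markdown image references such as ![alt](images/fig1.png)
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


@dataclass
class PageResult:
    """Hypothetical per-page result; not the module's actual type."""
    page_number: int           # 1-based page index
    markdown: Optional[str]    # None when processing of the page failed


def prefix_images(markdown: str, page_number: int) -> str:
    """Rewrite image file names so that they carry a page prefix."""
    def rename(match: re.Match) -> str:
        alt, path = match.group(1), match.group(2)
        directory, _, filename = path.rpartition("/")
        new_name = f"page_{page_number:03d}_{filename}"
        new_path = f"{directory}/{new_name}" if directory else new_name
        return f"![{alt}]({new_path})"
    return IMAGE_REF.sub(rename, markdown)


def merge_pages(results: list[PageResult]) -> str:
    """Join successful pages in page order, skipping failed ones."""
    sections = []
    for result in sorted(results, key=lambda r: r.page_number):
        if result.markdown is None:
            continue  # a failed page does not abort the merge
        body = prefix_images(result.markdown, result.page_number)
        sections.append(f"## Page {result.page_number}\n\n{body.strip()}")
    return "\n\n".join(sections)
```

Sorting by page number before merging means results may arrive in any order from the parallel workers.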

Document Processing Pipeline

  1. File Input: Accepts PPT (.ppt/.pptx) or PDF files
  2. Format Detection: Determines if PDF originated from PPT presentation
  3. PPT Conversion: Converts PPT to PDF using LibreOffice (with fallback)
  4. Page Splitting: Splits PDF into individual pages for parallel processing
  5. MinerU Processing: Each page processed through MinerU API with OCR, formula, and table recognition
  6. Content Integration: Merges page results with structured content organization
  7. Image Processing: AI-powered image classification and upload integration
  8. Final Assembly: Creates complete document with page-organized content
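
As a rough illustration of steps 1 and 3, the sketch below routes a file by extension and builds a headless LibreOffice command line. The helper names are hypothetical, and the real converter may detect formats and invoke LibreOffice differently; the command is only constructed here, not executed.

```python
from pathlib import Path

PPT_SUFFIXES = {".ppt", ".pptx"}


def needs_conversion(filepath: str) -> bool:
    """Return True for PPT input, False for PDF; reject anything else."""
    suffix = Path(filepath).suffix.lower()
    if suffix in PPT_SUFFIXES:
        return True
    if suffix == ".pdf":
        return False
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")


def build_convert_command(
    filepath: str,
    output_dir: str,
    libreoffice_path: str = "libreoffice",  # mirrors LIBREOFFICE_PATH
) -> list[str]:
    """Build the argument list for a headless PPT-to-PDF conversion."""
    return [
        libreoffice_path,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", output_dir,
        filepath,
    ]
```

The resulting list could be passed to `subprocess.run(cmd, check=True, timeout=300)`, where the timeout corresponds to CONVERSION_TIMEOUT.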

Architecture

Module Structure

loader/mineru/
├── __init__.py              # Module initialization
├── maxkb_adapter/           # MaxKB adapter implementation
│   ├── __init__.py          # Adapter module initialization
│   ├── adapter.py           # MaxKBAdapter and MinerUExtractor
│   └── config_maxkb.py      # MaxKB-specific configuration
├── base_parser.py           # Base classes for platform adapters
├── config_base.py           # Configuration management
├── converter.py             # File conversion and detection
├── api_client.py            # MinerU API integration
├── image_processor.py       # Image recognition and processing
├── content_processor.py     # Content fusion and refinement
├── flowchart_plugin.py      # Specialized flowchart processing
├── utils.py                 # Utility functions
├── example.py               # Usage examples
└── README.md                # This file

Key Components

MinerUExtractor: Main parser class orchestrating the complete pipeline

  • Handles file processing from input to final Document output
  • Manages temporary directories and caching
  • Integrates with existing gzero.py patterns for compatibility

DocumentConverter: File type detection and conversion

  • PPT to PDF conversion with LibreOffice and fallback methods
  • PDF format detection (PPT-origin vs native PDF)
  • Page extraction and metadata analysis

MinerUAPIClient: Interface to MinerU service

  • Handles API communication and response processing
  • Supports both cloud and self-hosted MinerU deployments
  • Includes mock implementation for development/testing
  • Cloud API: Asynchronous processing with polling
  • Self-hosted API: Synchronous processing with direct file upload
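
The difference between the two modes is mainly that the cloud path has to poll for completion. A minimal polling sketch follows; the status dictionary with a `state` field, the state names, and `make_fake_task` are assumptions for illustration and do not describe the real MinerU API responses.

```python
import asyncio
from typing import Awaitable, Callable


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[dict]],
    interval: float = 2.0,
    max_attempts: int = 30,
) -> dict:
    """Call fetch_status repeatedly until the task finishes or the budget runs out."""
    for _ in range(max_attempts):
        status = await fetch_status()
        if status["state"] == "done":
            return status
        if status["state"] == "failed":
            raise RuntimeError(status.get("error", "task failed"))
        await asyncio.sleep(interval)
    raise TimeoutError("task did not finish within the polling budget")


def make_fake_task(ready_after: int) -> Callable[[], Awaitable[dict]]:
    """In-memory stand-in for a remote task that completes after N status checks."""
    calls = 0

    async def fetch_status() -> dict:
        nonlocal calls
        calls += 1
        if calls >= ready_after:
            return {"state": "done", "markdown": "# parsed"}
        return {"state": "running"}

    return fetch_status
```

A self-hosted deployment would skip the loop entirely, since the upload request itself returns the parsed result.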

MinerUImageProcessor: Advanced image handling

  • AI-powered image classification (structured_content/brief_description/meaningless)
  • Batch processing with concurrency control
  • Integration with existing image upload infrastructure
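
The three classification labels above can drive the decision of which images are worth uploading. The sketch below assumes the classifier returns one label string per image path; `ImageClass` and `select_for_upload` are hypothetical names, not the processor's actual interface.

```python
from enum import Enum


class ImageClass(str, Enum):
    STRUCTURED_CONTENT = "structured_content"
    BRIEF_DESCRIPTION = "brief_description"
    MEANINGLESS = "meaningless"


def select_for_upload(classified: dict[str, str]) -> list[str]:
    """Keep every image whose label is not 'meaningless'.

    An unknown label raises ValueError, so classifier drift is noticed early.
    """
    keep = []
    for path, label in classified.items():
        if ImageClass(label) is not ImageClass.MEANINGLESS:
            keep.append(path)
    return keep
```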

MinerUContentProcessor: Content analysis and enhancement

  • Table detection and specialized processing
  • Plain text extraction and merging with structured content
  • LLM-based content refinement for complex documents

Configuration

Environment Variables

# MinerU API Configuration
MINERU_API_KEY=your_mineru_api_key          # Required for cloud API
MINERU_API_URL=https://mineru.net           # Cloud API URL
MINERU_API_TYPE=cloud                       # "cloud" or "self_hosted"

# For self-hosted MinerU (alternative configuration)
# MINERU_API_URL=http://10.128.4.1:30001   # Self-hosted API URL
# MINERU_API_TYPE=self_hosted               # No API key required

# LLM Configuration (uses existing gzero.py settings)
ADVANCED_PARSER_KEY_OPENAI=your_openai_key
ADVANCED_PARSER_KEY_CLAUDE=your_claude_key
ADVANCED_PARSER_KEY_GEMINI=your_gemini_key

# Processing Configuration
MAX_FILE_SIZE=52428800  # 50MB
LIBREOFFICE_PATH=libreoffice
CONVERSION_TIMEOUT=300  # 5 minutes

# Parallel Processing Controls
MAX_CONCURRENT_UPLOADS=5
MAX_CONCURRENT_API_CALLS=3  # Controls page processing concurrency
MAX_IMAGE_SIZE_MB=5.0
COMPRESSION_QUALITY=85
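
For illustration, the variables above could be read into a typed settings object as sketched here. `Settings` and `load_settings` are hypothetical, and treating the values listed above as defaults is an assumption; the module's own configuration classes may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_url: str
    api_type: str
    api_key: str
    max_file_size: int
    libreoffice_path: str
    conversion_timeout: int
    max_concurrent_uploads: int
    max_concurrent_api_calls: int
    max_image_size_mb: float
    compression_quality: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment-like mapping, converting types explicitly."""
    return Settings(
        api_url=env.get("MINERU_API_URL", "https://mineru.net"),
        api_type=env.get("MINERU_API_TYPE", "cloud"),
        api_key=env.get("MINERU_API_KEY", ""),
        max_file_size=int(env.get("MAX_FILE_SIZE", "52428800")),
        libreoffice_path=env.get("LIBREOFFICE_PATH", "libreoffice"),
        conversion_timeout=int(env.get("CONVERSION_TIMEOUT", "300")),
        max_concurrent_uploads=int(env.get("MAX_CONCURRENT_UPLOADS", "5")),
        max_concurrent_api_calls=int(env.get("MAX_CONCURRENT_API_CALLS", "3")),
        max_image_size_mb=float(env.get("MAX_IMAGE_SIZE_MB", "5.0")),
        compression_quality=int(env.get("COMPRESSION_QUALITY", "85")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.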

API Type Comparison

| Feature              | Cloud API                  | Self-Hosted API                 |
|----------------------|----------------------------|---------------------------------|
| Authentication       | API key required           | No authentication               |
| Processing Model     | Asynchronous with polling  | Synchronous direct response     |
| File Upload          | Requires public URL        | Direct multipart upload         |
| Rate Limits          | 2000 pages/day (free tier) | Limited by server resources     |
| Network Requirements | Internet access required   | Local network access            |
| Setup Complexity     | Simple (API key only)      | Requires self-hosted deployment |
| Processing Speed     | Depends on queue           | Immediate processing            |
| Data Privacy         | Data sent to cloud         | Data stays on-premise           |
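
The authentication difference in the comparison suggests a simple pre-flight check. The following sketch is an assumption about what such a check could look like; `validate_api_settings` and its messages are hypothetical and not part of the module.

```python
def validate_api_settings(api_type: str, api_url: str, api_key: str = "") -> list[str]:
    """Return a list of human-readable problems; an empty list means the settings look usable."""
    problems = []
    if api_type not in ("cloud", "self_hosted"):
        problems.append(f"unknown MINERU_API_TYPE: {api_type!r}")
    if api_type == "cloud" and not api_key:
        # Only the cloud API needs a key; self-hosted runs unauthenticated.
        problems.append("cloud API requires MINERU_API_KEY")
    if not api_url.startswith(("http://", "https://")):
        problems.append("MINERU_API_URL must start with http:// or https://")
    return problems
```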

Settings Integration

The module integrates with existing gptbase.settings for:

  • API keys and model configurations
  • Upload and storage settings
  • Cache and processing parameters
  • Logging configuration

Usage

Basic Usage

import asyncio

from loader.mineru.gbase_adapter import MinerUExtractor


async def main(upload_options):
    # Initialize extractor (automatically uses page-by-page processing)
    extractor = MinerUExtractor(learn_type=9)

    # Process file (process_file is a coroutine, so it must be awaited)
    documents = await extractor.process_file(
        filepath="/path/to/document.pptx",
        upload_options=upload_options
    )

    # Access results
    doc = documents[0]
    content = doc.page_content
    metadata = doc.metadata

    # Check processing results
    print(f"Processing mode: {metadata.get('processing_mode', 'page_by_page')}")
    print(f"Total pages: {metadata.get('total_pages', 0)}")
    print(f"Successful pages: {metadata.get('successful_pages', 0)}")
    print(f"Images found: {metadata.get('images_found', 0)}")
    return content


# upload_options must be supplied by the caller; it is not defined in this README
asyncio.run(main(upload_options))

Configuring API Type

from loader.mineru.gbase_adapter import MinerUExtractor
from loader.mineru.config_base import MinerUConfig

# Cloud API configuration (default)
cloud_config = MinerUConfig()
cloud_config.mineru_api_type = "cloud"
cloud_config.mineru_api_key = "your_api_key"
cloud_config.mineru_api_url = "https://mineru.net"

# Self-hosted API configuration
self_hosted_config = MinerUConfig()
self_hosted_config.mineru_api_type = "self_hosted"
self_hosted_config.mineru_api_url = "http://10.128.4.1:30001"
# No API key required for self-hosted

# Use with custom configuration (the await below must run inside an async function)
extractor = MinerUExtractor(learn_type=9)
extractor.config = self_hosted_config

documents = await extractor.process_file(
    filepath="/path/to/document.pdf",
    upload_options=upload_options
)

Configuring Concurrency

from loader.mineru.gbase_adapter import MinerUExtractor
from loader.mineru.config_base import MinerUConfig

# Configure parallel processing limits
config = MinerUConfig()
config.max_concurrent_api_calls = 2  # Process 2 pages simultaneously
config.max_concurrent_uploads = 3    # Upload 3 images simultaneously

# Use with custom configuration (the await below must run inside an async function)
extractor = MinerUExtractor(learn_type=9)
extractor.config = config

documents = await extractor.process_file(
    filepath="/path/to/large_document.pdf",
    upload_options=upload_options
)
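
To make the effect of `max_concurrent_api_calls` concrete, here is a minimal sketch of bounded page concurrency using `asyncio.Semaphore`. It is not the module's implementation; `process_pages` and the `process_page` callback are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def process_pages(
    pages: list[int],
    process_page: Callable[[int], Awaitable[str]],
    max_concurrent: int = 3,
) -> dict[int, Optional[str]]:
    """Process pages with at most max_concurrent calls in flight.

    A page that raises is recorded as None so the remaining pages still run.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(page: int) -> tuple[int, Optional[str]]:
        async with semaphore:
            try:
                return page, await process_page(page)
            except Exception:
                return page, None

    pairs = await asyncio.gather(*(guarded(page) for page in pages))
    return dict(pairs)
```

A lower limit reduces pressure on API rate limits at the cost of throughput; a higher limit does the opposite.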

Integration with Existing Code

The module is designed to be a drop-in replacement for gzero.py processing:

# Replace gzero_load with mineru processing
from loader.mineru.gbase_adapter import MinerUExtractor

async def process_with_mineru(file_path, learn_type, upload_options):
    extractor = MinerUExtractor(learn_type)
    return await extractor.process_file(file_path, upload_options=upload_options)

# Batch processing multiple files
async def batch_process_files(file_paths, learn_type, upload_options):
    extractor = MinerUExtractor(learn_type)
    results = []
    
    for file_path in file_paths:
        try:
            documents = await extractor.process_file(file_path, upload_options=upload_options)
            results.append((file_path, documents[0]))
        except Exception as e:
            print(f"Failed to process {file_path}: {e}")
            results.append((file_path, None))
    
    return results

Advanced Features

Flowchart Plugin

Specialized processing for complex flowchart documents:

  • Multi-step node identification
  • Department organization mapping
  • Mermaid diagram generation
  • Enhanced visual element extraction
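
As an illustration of the Mermaid generation step, the sketch below turns a set of nodes, edges, and an optional department grouping into Mermaid flowchart text. The data shapes and the `to_mermaid` helper are assumptions; the plugin's real intermediate representation is not described in this README.

```python
from typing import Optional


def _label(text: str) -> str:
    """Escape double quotes using Mermaid's entity code."""
    return text.replace('"', "#quot;")


def to_mermaid(
    nodes: dict[str, str],                            # node id -> display label
    edges: list[tuple[str, str]],                     # (source id, target id)
    departments: Optional[dict[str, list[str]]] = None,  # department -> node ids
) -> str:
    lines = ["flowchart TD"]
    grouped = set()
    for index, (dept, members) in enumerate((departments or {}).items()):
        # Each department becomes a subgraph containing its nodes.
        lines.append(f'    subgraph dept{index}["{_label(dept)}"]')
        for node_id in members:
            lines.append(f'        {node_id}["{_label(nodes[node_id])}"]')
            grouped.add(node_id)
        lines.append("    end")
    for node_id, label in nodes.items():
        if node_id not in grouped:
            lines.append(f'    {node_id}["{_label(label)}"]')
    for source, target in edges:
        lines.append(f"    {source} --> {target}")
    return "\n".join(lines)
```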

Content Fusion

Sophisticated content merging strategies:

  • Plain text + structured content integration
  • Table-aware processing workflows
  • LLM-based content refinement
  • Context-preserving enhancement

Image Intelligence

Advanced image processing capabilities:

  • Semantic classification of document images
  • OCR content extraction and integration
  • Meaningless image filtering
  • Batch processing with optimization

Logging and Tracing

Comprehensive logging with trace_id support:

  • File ID-based tracing throughout pipeline
  • Detailed processing metrics and timing
  • Error handling with context preservation
  • Integration with existing logging infrastructure
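
One common way to carry a trace_id through every log line is a `logging.LoggerAdapter`. The sketch below assumes the file ID is used as the trace_id; `TraceAdapter` and `get_trace_logger` are hypothetical names, and the module's own logger.py may be structured differently.

```python
import logging


class TraceAdapter(logging.LoggerAdapter):
    """Prefix every message with the trace_id stored in `extra`."""

    def process(self, msg, kwargs):
        return f"[trace_id={self.extra['trace_id']}] {msg}", kwargs


def get_trace_logger(file_id: str, name: str = "mineru") -> logging.LoggerAdapter:
    """Return a logger that tags all records with the given file ID."""
    return TraceAdapter(logging.getLogger(name), {"trace_id": file_id})
```

A caller would create one adapter per file, for example `log = get_trace_logger("file-42")`, and pass it down the pipeline.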

Compatibility

gzero.py Integration

  • Uses same learn_infos configuration structure
  • Compatible with existing upload and storage systems
  • Follows same Document metadata patterns
  • Maintains cache and temporary file conventions

Dependencies

  • Core: Built on existing project dependencies
  • LibreOffice: For PPT conversion (with fallback options)
  • PyMuPDF (fitz): For PDF processing and analysis
  • Optional: python-pptx, reportlab for alternative conversion

Performance Considerations

Optimization Features

  • Page-by-Page Processing: Parallel processing of individual pages for faster results
  • Smart Batching: Configurable concurrency limits to optimize API usage
  • Concurrent Processing: Parallel image classification and upload
  • Intelligent Caching: Reuses processed results when possible
  • Selective Processing: Filters meaningless images before upload
  • Progress Monitoring: Real-time tracking of page processing status
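
Result caching can be sketched as a content-addressed lookup: identical file bytes map to the same cache entry. The helpers below are hypothetical and assume results are JSON-serializable; the module's actual cache layout is not documented here.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the file contents, read in chunks to bound memory use."""
    sha = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def load_cached(cache_dir: Path, source: Path) -> Optional[dict]:
    """Return the cached result for this exact file content, if any."""
    entry = cache_dir / f"{file_digest(source)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))
    return None


def store_cached(cache_dir: Path, source: Path, result: dict) -> None:
    """Persist a result under the digest of the source file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{file_digest(source)}.json"
    entry.write_text(json.dumps(result), encoding="utf-8")
```

Keying on content rather than file name means a renamed file still hits the cache, while an edited file does not.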

Resource Management

  • Memory Efficient: Streams large files and cleans up resources
  • Configurable Limits: Adjustable concurrency and size limits
  • Error Recovery: Graceful handling of processing failures
  • Timeout Management: Prevents hanging operations
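
Timeouts and cleanup can be combined in a small wrapper, sketched here under the assumption that each unit of work receives its own temporary directory. `run_with_timeout` is a hypothetical helper, not part of the module.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import Awaitable, Callable, Optional


async def run_with_timeout(
    work: Callable[[Path], Awaitable[str]],
    timeout: float,
) -> Optional[str]:
    """Run `work` in a scratch directory; return None if it exceeds `timeout` seconds.

    The directory is removed whether the work succeeds, fails, or times out.
    """
    with tempfile.TemporaryDirectory(prefix="mineru_") as tmp:
        try:
            return await asyncio.wait_for(work(Path(tmp)), timeout=timeout)
        except asyncio.TimeoutError:
            return None
```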

Development and Testing

Mock Implementation

The module includes mock MinerU processing for development:

  • Simulates API responses for testing
  • Provides realistic processing results
  • Enables development without actual MinerU service

Example Scripts

  • example.py: Comprehensive usage examples
  • Configuration validation helpers
  • Batch processing demonstrations

Error Handling

  • Comprehensive exception handling throughout pipeline
  • Graceful fallbacks for conversion failures
  • Detailed error logging with context
  • Recovery strategies for partial failures

Future Enhancements

Planned Features

  • Enhanced File Upload: Integrated temporary file hosting for MinerU API
  • Advanced Table Processing: More sophisticated table structure analysis
  • Multi-language Support: Extended language detection and processing
  • Performance Monitoring: Built-in metrics and performance tracking
  • Adaptive Batching: Dynamic concurrency adjustment based on API performance

Extensibility

  • Plugin Architecture: Easy addition of specialized processors
  • Custom Workflows: Configurable processing pipelines
  • API Extensions: Support for additional MinerU service features
  • Format Extensions: Framework for additional input formats