2025-08-24 00:56:02 +08:00


MinerU-based PDF/PPT Parsing Module

This module parses PDF and PowerPoint documents with MinerU. It is designed as a modular extension to the existing gzero.py parsing system.

Features

Core Capabilities

Multi-format Support: Direct processing of PPT (.ppt/.pptx) and PDF files
Intelligent Format Detection: Automatically detects PPT-origin PDFs for optimized processing
Page-by-Page Processing: Splits PDFs into individual pages for parallel MinerU processing
Advanced Image Processing: AI-powered image classification and content extraction
Table Recognition: Specialized handling for documents containing table structures
Content Fusion: Combines MinerU structured output with plain text for accuracy

Processing Flow

Page-by-Page Processing

Core Benefits: Pages are processed in parallel, so large documents finish sooner, while configurable concurrency limits keep API and resource usage bounded

PDF Splitting: Automatically splits PDF into individual page files using PyMuPDF
Parallel Processing: Processes multiple pages simultaneously with MinerU API
Batch Management: Smart batching to avoid API rate limits (configurable concurrency)
Progress Tracking: Real-time progress reporting with page-level status
Error Resilience: Continues processing even if individual pages fail
Result Merging: Combines all page results into structured markdown content
Image Organization: Automatically renames images with page prefixes for better organization
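
The parallel processing, batching, and error resilience steps above can be sketched with a semaphore-bounded `asyncio.gather`. This is a minimal illustration, not the module's actual implementation: `process_pages` and the `process_page` callable are hypothetical names, and the default of 3 simply mirrors the `MAX_CONCURRENT_API_CALLS` sample value.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def process_pages(
    page_paths: list[str],
    process_page: Callable[[str], Awaitable[str]],  # hypothetical per-page API call
    max_concurrent: int = 3,
) -> list[Optional[str]]:
    """Process pages concurrently; a failed page yields None instead of aborting."""
    semaphore = asyncio.Semaphore(max_concurrent)
    done = 0

    async def worker(path: str) -> Optional[str]:
        nonlocal done
        async with semaphore:  # at most max_concurrent calls in flight
            try:
                return await process_page(path)
            except Exception as exc:
                print(f"page failed: {path}: {exc}")
                return None
            finally:
                done += 1
                print(f"progress: {done}/{len(page_paths)}")

    # gather preserves input order, so results line up with page numbers
    return await asyncio.gather(*(worker(p) for p in page_paths))
```

Because failures are converted to `None` inside the worker, one bad page does not cancel the remaining pages, and the caller can still count successful pages afterwards.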

Document Processing Pipeline

File Input: Accepts PPT (.ppt/.pptx) or PDF files
Format Detection: Determines if PDF originated from PPT presentation
PPT Conversion: Converts PPT to PDF using LibreOffice (with fallback)
Page Splitting: Splits PDF into individual pages for parallel processing
MinerU Processing: Each page is processed through the MinerU API with OCR, formula, and table recognition
Content Integration: Merges page results with structured content organization
Image Processing: AI-powered image classification and upload integration
Final Assembly: Creates complete document with page-organized content
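
The content integration and final assembly steps can be pictured as follows. This is a sketch under assumptions: the `page_N_` prefix, the `## Page N` heading, and the function names are illustrative rather than the module's real output format, and only the markdown links are rewritten here (renaming the files on disk is left out).

```python
import re
from typing import Optional

# Matches markdown image links such as ![alt](images/a.png)
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def prefix_images(markdown: str, page_number: int) -> str:
    """Rewrite image links so each file name carries its page number."""
    def rename(match: re.Match) -> str:
        alt, path = match.group(1), match.group(2)
        directory, _, name = path.rpartition("/")
        new_name = f"page_{page_number}_{name}"
        new_path = f"{directory}/{new_name}" if directory else new_name
        return f"![{alt}]({new_path})"

    return IMAGE_REF.sub(rename, markdown)


def merge_pages(page_results: list[Optional[str]]) -> str:
    """Join per-page markdown into one document, keeping a slot for failed pages."""
    sections = []
    for number, content in enumerate(page_results, start=1):
        if content is None:
            body = "(page could not be processed)"
        else:
            body = prefix_images(content.strip(), number)
        sections.append(f"## Page {number}\n\n{body}")
    return "\n\n".join(sections)
```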

Architecture

Module Structure

loader/mineru/
├── __init__.py              # Module initialization
├── maxkb_adapter/           # MaxKB adapter implementation
│   ├── __init__.py          # Adapter module initialization
│   ├── adapter.py           # MaxKBAdapter and MinerUExtractor
│   └── config_maxkb.py      # MaxKB-specific configuration
├── base_parser.py           # Base classes for platform adapters
├── config.py                # Configuration management
├── converter.py             # File conversion and detection
├── api_client.py            # MinerU API integration
├── image_processor.py       # Image recognition and processing
├── content_processor.py     # Content fusion and refinement
├── flowchart_plugin.py      # Specialized flowchart processing
├── utils.py                 # Utility functions
├── example.py               # Usage examples
└── README.md                # This file

Key Components

MinerUExtractor: Main parser class orchestrating the complete pipeline

Handles file processing from input to final Document output
Manages temporary directories and caching
Integrates with existing gzero.py patterns for compatibility

DocumentConverter: File type detection and conversion

PPT to PDF conversion with LibreOffice and fallback methods
PDF format detection (PPT-origin vs native PDF)
Page extraction and metadata analysis
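
As a rough illustration of PPT to PDF conversion with fallbacks, the sketch below tries a list of converters in order. The function names are hypothetical and not taken from converter.py; the LibreOffice flags (`--headless`, `--convert-to`, `--outdir`) are standard command-line options, and the 300-second timeout mirrors the `CONVERSION_TIMEOUT` sample value.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

Converter = Callable[[Path, Path], Path]


def libreoffice_command(src: Path, out_dir: Path, binary: str = "libreoffice") -> list[str]:
    """Build the headless LibreOffice command that converts src to PDF."""
    return [binary, "--headless", "--convert-to", "pdf", "--outdir", str(out_dir), str(src)]


def convert_with_libreoffice(src: Path, out_dir: Path, timeout: int = 300) -> Path:
    """Run LibreOffice; raises if the process fails or exceeds the timeout."""
    subprocess.run(
        libreoffice_command(src, out_dir),
        check=True,
        timeout=timeout,
        capture_output=True,
    )
    return out_dir / (src.stem + ".pdf")


def convert_with_fallbacks(src: Path, out_dir: Path, converters: Sequence[Converter]) -> Path:
    """Try each converter in order and return the first successful result."""
    errors = []
    for convert in converters:
        try:
            return convert(src, out_dir)
        except Exception as exc:
            errors.append(f"{convert.__name__}: {exc}")
    raise RuntimeError("all converters failed: " + "; ".join(errors))
```

A caller would pass `convert_with_libreoffice` first, followed by whatever alternative converter is available (for example one built on python-pptx and reportlab).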

MinerUAPIClient: Interface to MinerU service

Handles API communication and response processing
Supports both cloud and self-hosted MinerU deployments
Includes mock implementation for development/testing
Cloud API: Asynchronous processing with polling
Self-hosted API: Synchronous processing with direct file upload
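
The asynchronous cloud model (submit, then poll) can be sketched as a generic polling loop. The `fetch_status` callable and the `"done"` / `"failed"` state names are assumptions for illustration only; the real MinerU response fields may differ.

```python
import asyncio
from typing import Awaitable, Callable


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[dict]],  # hypothetical status request
    interval: float = 2.0,
    timeout: float = 300.0,
) -> dict:
    """Poll a task until it reports completion, failure, or the timeout expires."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await fetch_status()
        state = status.get("state")
        if state == "done":
            return status
        if state == "failed":
            raise RuntimeError(f"task failed: {status.get('error')}")
        if loop.time() + interval > deadline:
            raise TimeoutError("task did not finish before the timeout")
        await asyncio.sleep(interval)
```

The self-hosted path needs no such loop, since the result arrives in the response to the upload request.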

MinerUImageProcessor: Advanced image handling

AI-powered image classification (structured_content/brief_description/meaningless)
Batch processing with concurrency control
Integration with existing image upload infrastructure
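
The three classification labels can be modeled as an enum so that meaningless images are dropped before upload. The label values come from the list above; the class and function names are hypothetical and not taken from image_processor.py.

```python
from dataclasses import dataclass
from enum import Enum


class ImageCategory(str, Enum):
    STRUCTURED_CONTENT = "structured_content"
    BRIEF_DESCRIPTION = "brief_description"
    MEANINGLESS = "meaningless"


@dataclass
class ClassifiedImage:
    path: str
    category: ImageCategory
    text: str = ""  # extracted content or short description, if any


def images_to_upload(images: list[ClassifiedImage]) -> list[ClassifiedImage]:
    """Keep only images worth uploading, i.e. everything not labeled meaningless."""
    return [img for img in images if img.category is not ImageCategory.MEANINGLESS]
```

A raw label returned by a classifier can be parsed with `ImageCategory("brief_description")`, which raises `ValueError` for unknown labels.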

MinerUContentProcessor: Content analysis and enhancement

Table detection and specialized processing
Plain text extraction and merging with structured content
LLM-based content refinement for complex documents
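
Table detection could be as simple as scanning the page markdown for HTML tables or pipe-table separator rows. The heuristic below is an assumption about how such a check might look, not the logic in content_processor.py.

```python
import re

# A pipe-table separator row such as "| --- | :---: |"
PIPE_TABLE_SEPARATOR = re.compile(
    r"^[ \t]*\|?[ \t]*:?-{3,}:?[ \t]*(\|[ \t]*:?-{3,}:?[ \t]*)+\|?[ \t]*$",
    re.MULTILINE,
)
HTML_TABLE = re.compile(r"<table\b", re.IGNORECASE)


def contains_table(markdown: str) -> bool:
    """Return True if the text appears to contain an HTML or pipe-style table."""
    return bool(HTML_TABLE.search(markdown) or PIPE_TABLE_SEPARATOR.search(markdown))
```

Pages for which this returns True could then be routed to the table-specific workflow, while other pages take the plain merge path.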

Configuration

Environment Variables

# MinerU API Configuration
MINERU_API_KEY=your_mineru_api_key          # Required for cloud API
MINERU_API_URL=https://mineru.net           # Cloud API URL
MINERU_API_TYPE=cloud                       # "cloud" or "self_hosted"

# For self-hosted MinerU (alternative configuration)
# MINERU_API_URL=http://10.128.4.1:30001   # Self-hosted API URL
# MINERU_API_TYPE=self_hosted               # No API key required

# LLM Configuration (uses existing gzero.py settings)
ADVANCED_PARSER_KEY_OPENAI=your_openai_key
ADVANCED_PARSER_KEY_CLAUDE=your_claude_key
ADVANCED_PARSER_KEY_GEMINI=your_gemini_key

# Processing Configuration
MAX_FILE_SIZE=52428800  # bytes (50 MiB)
LIBREOFFICE_PATH=libreoffice
CONVERSION_TIMEOUT=300  # seconds (5 minutes)

# Parallel Processing Controls
MAX_CONCURRENT_UPLOADS=5
MAX_CONCURRENT_API_CALLS=3  # Controls page processing concurrency
MAX_IMAGE_SIZE_MB=5.0
COMPRESSION_QUALITY=85
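
A minimal sketch of reading these variables follows. `EnvSettings` and `load_settings` are hypothetical helpers (the module's own `MinerUConfig` may work differently), and using the sample values above as defaults is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class EnvSettings:
    api_type: str
    api_url: str
    api_key: Optional[str]
    max_file_size: int
    conversion_timeout: int
    max_concurrent_api_calls: int
    max_concurrent_uploads: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> EnvSettings:
    """Read settings from the environment and validate the API type."""
    env = os.environ if env is None else env
    api_type = env.get("MINERU_API_TYPE", "cloud")
    if api_type not in ("cloud", "self_hosted"):
        raise ValueError(f"unknown MINERU_API_TYPE: {api_type!r}")
    api_key = env.get("MINERU_API_KEY") or None
    if api_type == "cloud" and not api_key:
        raise ValueError("MINERU_API_KEY is required for the cloud API")
    return EnvSettings(
        api_type=api_type,
        api_url=env.get("MINERU_API_URL", "https://mineru.net"),
        api_key=api_key,
        max_file_size=int(env.get("MAX_FILE_SIZE", "52428800")),
        conversion_timeout=int(env.get("CONVERSION_TIMEOUT", "300")),
        max_concurrent_api_calls=int(env.get("MAX_CONCURRENT_API_CALLS", "3")),
        max_concurrent_uploads=int(env.get("MAX_CONCURRENT_UPLOADS", "5")),
    )
```

Failing early on a missing cloud API key surfaces configuration mistakes before any file is uploaded.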

API Type Comparison

| Feature | Cloud API | Self-Hosted API |
| --- | --- | --- |
| Authentication | API key required | No authentication |
| Processing Model | Asynchronous with polling | Synchronous direct response |
| File Upload | Requires public URL | Direct multipart upload |
| Rate Limits | 2000 pages/day (free tier) | Limited by server resources |
| Network Requirements | Internet access required | Local network access |
| Setup Complexity | Simple (API key only) | Requires self-hosted deployment |
| Processing Speed | Depends on queue | Immediate processing |
| Data Privacy | Data sent to cloud | Data stays on-premise |

Settings Integration

The module integrates with existing gptbase.settings for:

API keys and model configurations
Upload and storage settings
Cache and processing parameters
Logging configuration

Usage

Basic Usage

import asyncio

from loader.mineru.gbase_adapter import MinerUExtractor


async def main(upload_options):
    # Initialize extractor (automatically uses page-by-page processing)
    extractor = MinerUExtractor(learn_type=9)

    # Process file (process_file is awaited, so it must run inside a coroutine)
    documents = await extractor.process_file(
        filepath="/path/to/document.pptx",
        upload_options=upload_options
    )

    # Access results
    doc = documents[0]
    content = doc.page_content
    metadata = doc.metadata

    # Check processing results
    print(f"Processing mode: {metadata.get('processing_mode', 'page_by_page')}")
    print(f"Total pages: {metadata.get('total_pages', 0)}")
    print(f"Successful pages: {metadata.get('successful_pages', 0)}")
    print(f"Images found: {metadata.get('images_found', 0)}")


# upload_options is supplied by the caller
asyncio.run(main(upload_options))

Configuring API Type

from loader.mineru.gbase_adapter import MinerUExtractor
from loader.mineru.config_base import MinerUConfig

# Cloud API configuration (default)
cloud_config = MinerUConfig()
cloud_config.mineru_api_type = "cloud"
cloud_config.mineru_api_key = "your_api_key"
cloud_config.mineru_api_url = "https://mineru.net"

# Self-hosted API configuration
self_hosted_config = MinerUConfig()
self_hosted_config.mineru_api_type = "self_hosted"
self_hosted_config.mineru_api_url = "http://10.128.4.1:30001"
# No API key required for self-hosted

# Use with custom configuration
extractor = MinerUExtractor(learn_type=9)
extractor.config = self_hosted_config

# process_file is awaited, so call it from inside an async function
documents = await extractor.process_file(
    filepath="/path/to/document.pdf",
    upload_options=upload_options
)

Configuring Concurrency

from loader.mineru.gbase_adapter import MinerUExtractor
from loader.mineru.config_base import MinerUConfig

# Configure parallel processing limits
config = MinerUConfig()
config.max_concurrent_api_calls = 2  # Process 2 pages simultaneously
config.max_concurrent_uploads = 3    # Upload 3 images simultaneously

# Use with custom configuration
extractor = MinerUExtractor(learn_type=9)
extractor.config = config

# process_file is awaited, so call it from inside an async function
documents = await extractor.process_file(
    filepath="/path/to/large_document.pdf",
    upload_options=upload_options
)

Integration with Existing Code

The module is designed to be a drop-in replacement for gzero.py processing:

# Replace gzero_load with mineru processing
from loader.mineru.gbase_adapter import MinerUExtractor

async def process_with_mineru(file_path, learn_type, upload_options):
    extractor = MinerUExtractor(learn_type)
    return await extractor.process_file(file_path, upload_options=upload_options)

# Batch processing multiple files
async def batch_process_files(file_paths, learn_type, upload_options):
    extractor = MinerUExtractor(learn_type)
    results = []
    
    for file_path in file_paths:
        try:
            documents = await extractor.process_file(file_path, upload_options=upload_options)
            # Guard against an empty result list instead of relying on IndexError
            results.append((file_path, documents[0] if documents else None))
        except Exception as e:
            print(f"Failed to process {file_path}: {e}")
            results.append((file_path, None))
    
    return results

Advanced Features

Flowchart Plugin

Specialized processing for complex flowchart documents:

Multi-step node identification
Department organization mapping
Mermaid diagram generation
Enhanced visual element extraction
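
To illustrate Mermaid diagram generation, the sketch below turns identified nodes and edges into Mermaid flowchart text. The function name and input shapes are assumptions; flowchart_plugin.py may represent nodes differently.

```python
def to_mermaid(
    nodes: dict[str, str],
    edges: list[tuple[str, str]],
    direction: str = "TD",
) -> str:
    """Render node labels and directed edges as a Mermaid flowchart."""
    lines = [f"flowchart {direction}"]
    for node_id, label in nodes.items():
        safe = label.replace('"', "'")  # double quotes would end the label early
        lines.append(f'    {node_id}["{safe}"]')
    for src, dst in edges:
        lines.append(f"    {src} --> {dst}")
    return "\n".join(lines)
```

For example, `to_mermaid({"A": "Submit", "B": "Review"}, [("A", "B")])` yields a two-node top-down flowchart with a single arrow from A to B.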

Content Fusion

Sophisticated content merging strategies:

Plain text + structured content integration
Table-aware processing workflows
LLM-based content refinement
Context-preserving enhancement

Image Intelligence

Advanced image processing capabilities:

Semantic classification of document images
OCR content extraction and integration
Meaningless image filtering
Batch processing with optimization

Logging and Tracing

Comprehensive logging with trace_id support:

File ID-based tracing throughout pipeline
Detailed processing metrics and timing
Error handling with context preservation
Integration with existing logging infrastructure
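
One way to carry a file ID through every log record is the standard library's `logging.LoggerAdapter`. This is a sketch of the idea, assuming a logger named `mineru`; the helper names are hypothetical and the module's real logging setup may differ.

```python
import logging
import time
from contextlib import contextmanager


def get_trace_logger(file_id: str, name: str = "mineru") -> logging.LoggerAdapter:
    """Attach trace_id to every record emitted through the returned adapter."""
    return logging.LoggerAdapter(logging.getLogger(name), {"trace_id": file_id})


@contextmanager
def timed_step(log: logging.LoggerAdapter, step: str):
    """Log how long a pipeline step took, and log the traceback if it raised."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        log.exception("%s failed", step)
        raise
    finally:
        log.info("%s ended after %.3fs", step, time.perf_counter() - start)
```

A handler formatter such as `"[%(trace_id)s] %(message)s"` then prints the file ID on each line. Records logged without the adapter lack the `trace_id` attribute, so that formatter should only be attached to loggers used through it.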

Compatibility

gzero.py Integration

Uses same learn_infos configuration structure
Compatible with existing upload and storage systems
Follows same Document metadata patterns
Maintains cache and temporary file conventions

Dependencies

Core: Built on existing project dependencies
LibreOffice: For PPT conversion (with fallback options)
PyMuPDF (fitz): For PDF processing and analysis
Optional: python-pptx, reportlab for alternative conversion

Performance Considerations

Optimization Features

Page-by-Page Processing: Parallel processing of individual pages for faster results
Smart Batching: Configurable concurrency limits to optimize API usage
Concurrent Processing: Parallel image classification and upload
Intelligent Caching: Reuses processed results when possible
Selective Processing: Filters meaningless images before upload
Progress Monitoring: Real-time tracking of page processing status
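
Result caching can be keyed on the file content plus the processing options, so that a changed file or changed settings never reuses a stale result. The helper names and the in-memory store below are assumptions for illustration.

```python
import hashlib
import json
from typing import Callable


def cache_key(file_bytes: bytes, options: dict) -> str:
    """Derive a stable key from file content and (JSON-serializable) options."""
    digest = hashlib.sha256(file_bytes)
    # sort_keys makes the key independent of dict insertion order
    digest.update(json.dumps(options, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


class ResultCache:
    """In-memory cache; a real deployment would likely persist results."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get_or_compute(self, key: str, compute: Callable[[], str]) -> str:
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]
```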

Resource Management

Memory Efficient: Streams large files and cleans up resources
Configurable Limits: Adjustable concurrency and size limits
Error Recovery: Graceful handling of processing failures
Timeout Management: Prevents hanging operations
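
Timeout management and cleanup can be combined by running each step inside a temporary directory and wrapping it in `asyncio.wait_for`. `run_with_limits` and the `step` callable are hypothetical names used only for this sketch.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import Awaitable, Callable, Optional


async def run_with_limits(
    step: Callable[[Path], Awaitable[str]],
    timeout: float,
) -> Optional[str]:
    """Run a step in a scratch directory; return None if it exceeds the timeout."""
    # The directory and everything in it is removed when the block exits,
    # whether the step succeeded, failed, or timed out.
    with tempfile.TemporaryDirectory(prefix="mineru_") as tmp:
        try:
            return await asyncio.wait_for(step(Path(tmp)), timeout=timeout)
        except asyncio.TimeoutError:
            return None
```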

Development and Testing

Mock Implementation

The module includes mock MinerU processing for development:

Simulates API responses for testing
Provides realistic processing results
Enables development without actual MinerU service
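
A mock client in this spirit might look like the sketch below. The class and method names are hypothetical and are not the interface of api_client.py; the point is only that a stand-in with the same async shape lets the rest of the pipeline run without network access.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class MockPageResult:
    markdown: str
    images: list[str] = field(default_factory=list)


class MockMinerUClient:
    """Hypothetical stand-in that returns canned content instead of calling MinerU."""

    def __init__(self, delay: float = 0.0) -> None:
        self.delay = delay
        self.calls: list[str] = []  # records which pages were requested

    async def process_page(self, pdf_path: str) -> MockPageResult:
        self.calls.append(pdf_path)
        await asyncio.sleep(self.delay)  # simulate network latency
        return MockPageResult(markdown=f"# Mock content for {pdf_path}")
```

Recording the calls makes it easy for tests to assert that every page was submitted exactly once.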

Example Scripts

example.py: Comprehensive usage examples
Configuration validation helpers
Batch processing demonstrations

Error Handling

Comprehensive exception handling throughout pipeline
Graceful fallbacks for conversion failures
Detailed error logging with context
Recovery strategies for partial failures

Future Enhancements

Planned Features

Enhanced File Upload: Integrated temporary file hosting for MinerU API
Advanced Table Processing: More sophisticated table structure analysis
Multi-language Support: Extended language detection and processing
Performance Monitoring: Built-in metrics and performance tracking
Adaptive Batching: Dynamic concurrency adjustment based on API performance

Extensibility

Plugin Architecture: Easy addition of specialized processors
Custom Workflows: Configurable processing pipelines
API Extensions: Support for additional MinerU service features
Format Extensions: Framework for additional input formats
