API Reference#

This section provides complete API documentation for all classes and methods in epub-utils.

Document Class#

class Document(path)#

Main class for working with EPUB files.

Parameters:: path (str) – Path to the EPUB file

Example:

from epub_utils import Document

doc = Document("book.epub")
print(doc.package.metadata.title)

container#

Access to the container information.

Type:: Container
Returns:: Container object with container.xml information

Example:

container = doc.container
print(f"Package path: {container.rootfile_path}")

package#

Access to the package (OPF) information.

Type:: Package
Returns:: Package object with OPF file information

Example:

package = doc.package
print(f"Title: {package.metadata.title}")

toc#

Access to the table of contents.

Type:: TableOfContents
Returns:: Table of contents object

Example:

toc = doc.toc
toc_xml = toc.to_xml()

ncx#

Access to the NCX (Navigation Control file for XML) table of contents.

Type:: TableOfContents or None
Returns:: NCX table of contents object for EPUB 2, or for EPUB 3 when an NCX file is present; None otherwise

Example:

ncx = doc.ncx
if ncx:
    ncx_xml = ncx.to_xml()

Note: For EPUB 2, this returns the same as toc. For EPUB 3, this specifically accesses the NCX file if present, which provides backward compatibility.

nav#

Access to the Navigation Document (EPUB 3 only).

Type:: TableOfContents or None
Returns:: Navigation Document table of contents object for EPUB 3; None for EPUB 2 or when no Navigation Document is present

Example:

nav = doc.nav
if nav:
    nav_xml = nav.to_xml()

Note: This property specifically accesses EPUB 3 Navigation Documents. Returns None for EPUB 2 documents.
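
Because nav and ncx can each be None, calling code usually has to pick whichever one is available. The following is a minimal sketch of that choice. pick_navigation is a hypothetical helper, not part of epub-utils, and the SimpleNamespace objects only stand in for Document instances so the snippet runs without an EPUB file.

```python
from types import SimpleNamespace


def pick_navigation(doc):
    """Return (label, toc_object), preferring nav over ncx.

    Returns (None, None) when the document exposes neither.
    """
    if doc.nav is not None:
        return "nav", doc.nav
    if doc.ncx is not None:
        return "ncx", doc.ncx
    return None, None


# Stand-ins for Document objects (hypothetical values, for illustration only).
epub3_with_both = SimpleNamespace(nav="<nav object>", ncx="<ncx object>")
epub2_only_ncx = SimpleNamespace(nav=None, ncx="<ncx object>")

label3, _ = pick_navigation(epub3_with_both)
label2, _ = pick_navigation(epub2_only_ncx)
print(label3, label2)
```

Preferring nav is a design choice for this sketch; a reader that targets older reading systems might check ncx first.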

get_files_info()#

Get detailed information about all files in the EPUB.

Returns:: List of dictionaries containing file information
Return type:: List[Dict[str, Union[str, int]]]

Each dictionary contains:

- path (str): File path within the EPUB
- size (int): Uncompressed size in bytes
- compressed_size (int): Compressed size in bytes
- modified (str): Last modified date in ISO format

Example:

files = doc.get_files_info()
for file_info in files:
    print(f"{file_info['path']}: {file_info['size']} bytes")
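
The documented keys are enough to compute archive-level totals. The sketch below assumes only the dictionary shape described above. summarize_files is a hypothetical helper and the sample list is made-up data standing in for the return value of get_files_info().

```python
from typing import Dict, List, Union

FileInfo = Dict[str, Union[str, int]]


def summarize_files(files: List[FileInfo]) -> Dict[str, Union[int, float]]:
    """Aggregate file count, total sizes, and overall compression ratio."""
    total = sum(int(f["size"]) for f in files)
    compressed = sum(int(f["compressed_size"]) for f in files)
    # Guard against an empty archive to avoid dividing by zero.
    ratio = compressed / total if total else 0.0
    return {
        "count": len(files),
        "size": total,
        "compressed_size": compressed,
        "ratio": ratio,
    }


# Made-up sample in the documented shape.
sample = [
    {"path": "mimetype", "size": 20, "compressed_size": 20,
     "modified": "2024-01-01T00:00:00"},
    {"path": "OEBPS/chapter1.xhtml", "size": 980, "compressed_size": 480,
     "modified": "2024-01-01T00:00:00"},
]
summary = summarize_files(sample)
print(summary)
```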

list_files()#

Get basic information about all files in the EPUB.

Returns:: List of dictionaries with basic file information
Return type:: List[Dict[str, str]]

Example:

files = doc.list_files()
print(f"EPUB contains {len(files)} files")

Container Class#

class Container#

Represents the META-INF/container.xml file information.

rootfile_path#

Path to the main package file within the EPUB.

Type:: str

rootfile_media_type#

Media type of the main package file.

Type:: str
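
To show where these two values come from, the sketch below reads the full-path and media-type attributes of the first rootfile element in META-INF/container.xml using only the standard library. It illustrates the file format, not how epub-utils is implemented. read_rootfile is a hypothetical helper, and the in-memory archive holds only container.xml, so it is not a valid EPUB.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in EPUB archives.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def read_rootfile(epub_bytes):
    """Return (full-path, media-type) of the first rootfile entry."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        root = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    return rootfile.get("full-path"), rootfile.get("media-type")


# Build a tiny in-memory ZIP that contains only container.xml.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("META-INF/container.xml", CONTAINER_XML)

rootfile_path, rootfile_media_type = read_rootfile(buffer.getvalue())
print(rootfile_path, rootfile_media_type)
```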

to_xml(highlight_syntax=True)#

Get formatted XML representation.

Parameters:: highlight_syntax (bool) – Whether to apply syntax highlighting
Returns:: Formatted XML string
Return type:: str

to_str()#

Get raw XML content.

Returns:: Raw XML string
Return type:: str

Package Class#

class Package#

Represents the main OPF package file.

metadata#

Package metadata information.

Type:: Metadata

manifest#

Package manifest information.

Type:: Manifest

spine#

Package spine information.

Type:: Spine

to_xml(highlight_syntax=True)#

Get formatted XML representation of the complete package.

Parameters:: highlight_syntax (bool) – Whether to apply syntax highlighting
Returns:: Formatted XML string
Return type:: str

to_str()#

Get raw XML content of the complete package.

Returns:: Raw XML string
Return type:: str

Metadata Class#

class Metadata#

Represents Dublin Core and EPUB-specific metadata.

title#

Book title from dc:title element.

Type:: str

creator#

Book author/creator from dc:creator element.

Type:: str

language#

Language code from dc:language element.

Type:: str

identifier#

Unique identifier from dc:identifier element.

Type:: str

publisher#

Publisher from dc:publisher element.

Type:: str

date#

Publication date from dc:date element.

Type:: str

subject#

Subject/keywords from dc:subject element.

Type:: str

description#

Description from dc:description element.

Type:: str

contributor#

Contributor from dc:contributor element.

Type:: str

type#

Resource type from dc:type element.

Type:: str

format#

Format from dc:format element.

Type:: str

source#

Source from dc:source element.

Type:: str

relation#

Relation from dc:relation element.

Type:: str

coverage#

Coverage from dc:coverage element.

Type:: str

rights#

Rights information from dc:rights element.

Type:: str

__getattr__(name)#

Dynamic attribute access for any metadata field.

Parameters:: name (str) – Metadata field name
Returns:: Metadata value or empty string
Return type:: str

Example:

# Access any metadata field. Missing fields come back as an empty string,
# so test the value instead of relying on hasattr() or a getattr() default.
isbn = metadata.isbn or 'Not available'
series = metadata.series or 'Not available'
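
Given the empty-string fallback described above, hasattr() cannot tell a present field from a missing one, and a getattr() default is never used. The sketch below demonstrates this with MetadataSketch, a hypothetical stand-in class that mimics the documented behavior. It is not the library's Metadata implementation.

```python
class MetadataSketch:
    """Hypothetical stand-in that mimics the documented fallback."""

    def __init__(self, fields):
        self._fields = dict(fields)

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        if name.startswith("_"):
            # Keep private and dunder lookups behaving normally.
            raise AttributeError(name)
        return self._fields.get(name, "")


metadata = MetadataSketch({"title": "The Great Gatsby"})

has_isbn = hasattr(metadata, "isbn")         # True, even though no isbn was set
series = getattr(metadata, "series", "n/a")  # "", the default is never used
isbn = metadata.isbn or "Not available"      # falls back on the empty string
print(has_isbn, repr(series), isbn)
```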

to_xml(highlight_syntax=True)#

Get formatted XML representation of metadata.

Parameters:: highlight_syntax (bool) – Whether to apply syntax highlighting
Returns:: Formatted XML string
Return type:: str

to_kv()#

Get metadata as key-value pairs.

Returns:: Key-value formatted string
Return type:: str

Example:

kv_data = metadata.to_kv()
print(kv_data)
# Output:
# title: The Great Gatsby
# creator: F. Scott Fitzgerald
# language: en

to_str()#

Get raw XML content of metadata.

Returns:: Raw XML string
Return type:: str

Manifest Class#

class Manifest#

Represents the package manifest section.

items#

Dictionary of manifest items.

Type:: Dict[str, Dict[str, str]]

Each item contains:

- href: File path
- media-type: MIME type
- Other attributes as needed

Example:

for item_id, item in manifest.items.items():
    print(f"ID: {item_id}")
    print(f"  File: {item['href']}")
    print(f"  Type: {item['media-type']}")
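
A common follow-up is grouping manifest entries by MIME type. The sketch below relies only on the documented href and media-type keys. group_by_media_type is a hypothetical helper and sample_items is made-up data in the shape of manifest.items.

```python
from collections import defaultdict


def group_by_media_type(items):
    """Map each media type to the list of hrefs declared with it."""
    groups = defaultdict(list)
    for item in items.values():
        groups[item["media-type"]].append(item["href"])
    return dict(groups)


# Made-up sample in the documented Dict[str, Dict[str, str]] shape.
sample_items = {
    "ch1": {"href": "chapter1.xhtml", "media-type": "application/xhtml+xml"},
    "ch2": {"href": "chapter2.xhtml", "media-type": "application/xhtml+xml"},
    "css": {"href": "style.css", "media-type": "text/css"},
}
groups = group_by_media_type(sample_items)
print(groups)
```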

to_xml(highlight_syntax=True)#

Get formatted XML representation.

Parameters:: highlight_syntax (bool) – Whether to apply syntax highlighting
Returns:: Formatted XML string
Return type:: str

to_str()#

Get raw XML content.

Returns:: Raw XML string
Return type:: str

Spine Class#

class Spine#

Represents the package spine section.

items#

List of spine items in reading order.

Type:: List[Dict[str, str]]

Example:

for item in spine.items:
    print(f"Reading order item: {item}")
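
Spine entries refer to manifest items, so the reading order can be turned into file paths by joining the two. The sketch below assumes each spine dictionary carries an idref key, mirroring the idref attribute of itemref elements in the OPF file. The keys epub-utils actually exposes are not documented here, and reading_order_hrefs is a hypothetical helper.

```python
def reading_order_hrefs(spine_items, manifest_items):
    """Resolve spine entries to manifest hrefs, skipping unknown ids."""
    hrefs = []
    for entry in spine_items:
        # "idref" is an assumed key name, taken from the OPF itemref attribute.
        item = manifest_items.get(entry.get("idref"))
        if item is not None:
            hrefs.append(item["href"])
    return hrefs


# Made-up sample data standing in for spine.items and manifest.items.
sample_spine = [{"idref": "ch2"}, {"idref": "ch1"}, {"idref": "missing"}]
sample_manifest = {
    "ch1": {"href": "chapter1.xhtml", "media-type": "application/xhtml+xml"},
    "ch2": {"href": "chapter2.xhtml", "media-type": "application/xhtml+xml"},
}
ordered = reading_order_hrefs(sample_spine, sample_manifest)
print(ordered)
```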

to_xml(highlight_syntax=True)#

Get formatted XML representation.

Parameters:: highlight_syntax (bool) – Whether to apply syntax highlighting
Returns:: Formatted XML string
Return type:: str

to_str()#

Get raw XML content.

Returns:: Raw XML string
Return type:: str

TableOfContents Class#

class TableOfContents#

Represents the table of contents (NCX or Navigation Document).

to_xml(highlight_syntax=True)#

Get formatted XML representation.

Parameters:: highlight_syntax (bool) – Whether to apply syntax highlighting
Returns:: Formatted XML string
Return type:: str

to_str()#

Get raw XML content.

Returns:: Raw XML string
Return type:: str

Content Classes#

class Content#

Base class for EPUB content documents.

to_xml(highlight_syntax=True)#

Get formatted content.

Parameters:: highlight_syntax (bool) – Whether to apply syntax highlighting
Returns:: Formatted content string
Return type:: str

to_str()#

Get raw content.

Returns:: Raw content string
Return type:: str

class XHTMLContent#

Specialized class for XHTML content documents.

Inherits from Content and adds XHTML-specific methods.

to_plain()#

Get plain text content with HTML tags stripped.

Returns:: Plain text string
Return type:: str

Example:

from epub_utils.content import XHTMLContent

# This would typically be accessed through Document
# content = XHTMLContent(raw_html)
# plain_text = content.to_plain()
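
For a sense of what tag stripping involves, the sketch below extracts text with the standard library's html.parser. It is an illustration only and may differ from what to_plain() does. strip_tags and _TextExtractor are hypothetical names, and the whitespace handling is a simplification.

```python
from html.parser import HTMLParser

# Tags after which a space is inserted so adjacent blocks do not run together.
BLOCK_TAGS = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring script and style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def strip_tags(markup):
    """Return the text content of an (X)HTML string."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


plain = strip_tags(
    "<html><body><h1>Chapter 1</h1><p>Call me <em>Ishmael</em>.</p></body></html>"
)
print(plain)
```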

Exception Classes#

exception ParseError#

Raised when there’s an error parsing EPUB content.

Base class: Exception

Example:

from epub_utils import Document
from epub_utils.exceptions import ParseError

try:
    doc = Document("corrupted.epub")
    title = doc.package.metadata.title
except ParseError as e:
    print(f"Failed to parse EPUB: {e}")
except FileNotFoundError:
    print("EPUB file not found")

Usage Examples#

Basic Usage#

from epub_utils import Document

# Load document
doc = Document("book.epub")

# Access metadata
metadata = doc.package.metadata
print(f"Title: {metadata.title}")
print(f"Author: {metadata.creator}")

# Check file structure
files = doc.get_files_info()
print(f"Contains {len(files)} files")

# Get formatted output
toc_xml = doc.toc.to_xml()
metadata_kv = metadata.to_kv()

Error Handling#

from epub_utils import Document
from epub_utils.exceptions import ParseError

def safe_load_epub(path):
    try:
        doc = Document(path)
        return {
            'status': 'success',
            'document': doc,
            'title': doc.package.metadata.title or 'Unknown'
        }
    except ParseError as e:
        return {
            'status': 'parse_error',
            'error': str(e)
        }
    except FileNotFoundError:
        return {
            'status': 'file_not_found',
            'error': 'EPUB file not found'
        }
    except Exception as e:
        return {
            'status': 'unknown_error',
            'error': str(e)
        }

Batch Processing#

from pathlib import Path
from epub_utils import Document

def process_epub_directory(directory):
    epub_files = Path(directory).glob("*.epub")
    results = []

    for epub_path in epub_files:
        try:
            doc = Document(str(epub_path))
            metadata = doc.package.metadata

            result = {
                'file': epub_path.name,
                'title': getattr(metadata, 'title', ''),
                'author': getattr(metadata, 'creator', ''),
                'language': getattr(metadata, 'language', ''),
                'file_size': epub_path.stat().st_size,
                'epub_files': len(doc.get_files_info())
            }
            results.append(result)

        except Exception as e:
            results.append({
                'file': epub_path.name,
                'error': str(e)
            })

    return results
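
The result list mixes successful rows and error rows, so a tabular export needs a fixed column set. The sketch below writes such a list to CSV text with the standard csv module. results_to_csv is a hypothetical helper, and sample_results is made-up data in the shape returned by process_epub_directory().

```python
import csv
import io

# Union of the keys used by success rows and error rows.
FIELDS = ["file", "title", "author", "language", "file_size", "epub_files", "error"]


def results_to_csv(results):
    """Serialize result dictionaries to CSV text; missing keys become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


# Made-up sample: one successful row and one error row.
sample_results = [
    {"file": "a.epub", "title": "A", "author": "X", "language": "en",
     "file_size": 10, "epub_files": 3},
    {"file": "b.epub", "error": "bad zip"},
]
csv_text = results_to_csv(sample_results)
print(csv_text)
```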

Type Hints#

For better IDE support and type checking, here are the main type hints:

from typing import Dict, List, Union, Optional
from epub_utils import Document

# Function signatures for reference
def get_files_info(self) -> List[Dict[str, Union[str, int]]]: ...
def list_files(self) -> List[Dict[str, str]]: ...
def to_xml(self, highlight_syntax: bool = True) -> str: ...
def to_str(self) -> str: ...
def to_kv(self) -> str: ...

# Type-safe usage example
doc: Document = Document("book.epub")
files_info: List[Dict[str, Union[str, int]]] = doc.get_files_info()
title: str = doc.package.metadata.title
kv_data: str = doc.package.metadata.to_kv()
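
For stricter checking of the get_files_info() dictionaries, the documented keys can be described with a TypedDict (Python 3.8 or later). This is a sketch under the assumption that the library does not already export such a type. FileInfo and largest_file are hypothetical names, and the sample list is made-up data.

```python
from typing import List, TypedDict


class FileInfo(TypedDict):
    """Shape of one get_files_info() entry, as documented above."""
    path: str
    size: int
    compressed_size: int
    modified: str


def largest_file(files: List[FileInfo]) -> FileInfo:
    """Return the entry with the largest uncompressed size.

    Raises ValueError if the list is empty.
    """
    return max(files, key=lambda f: f["size"])


sample_files: List[FileInfo] = [
    {"path": "mimetype", "size": 20, "compressed_size": 20,
     "modified": "2024-01-01T00:00:00"},
    {"path": "OEBPS/cover.jpg", "size": 51200, "compressed_size": 50900,
     "modified": "2024-01-01T00:00:00"},
]
biggest = largest_file(sample_files)
print(biggest["path"])
```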

Module Structure#

The epub-utils package is organized as follows:

epub_utils/
├── __init__.py          # Main exports (Document, Container)
├── doc.py               # Document class
├── container.py         # Container class
├── package/
│   ├── __init__.py      # Package class
│   ├── metadata.py      # Metadata class
│   ├── manifest.py      # Manifest class
│   └── spine.py         # Spine class
├── content/
│   ├── __init__.py      # Content classes
│   ├── base.py          # Base Content class
│   └── xhtml.py         # XHTMLContent class
├── toc.py               # TableOfContents class
├── exceptions.py        # Exception classes
├── highlighters.py      # Syntax highlighting utilities
└── cli.py               # Command-line interface

For detailed implementation examples, see "Use as a Python library" and "Examples and Use Cases".