API Reference#
This section provides complete API documentation for all classes and methods in epub-utils.
Document Class#
- class Document(path)#
Main class for working with EPUB files.
- Parameters:
path (str) – Path to the EPUB file
Example:
from epub_utils import Document

doc = Document("book.epub")
print(doc.package.metadata.title)
- container#
Access to the container information.
- Type:
Container
- Returns:
Container object with container.xml information
Example:
container = doc.container
print(f"Package path: {container.rootfile_path}")
- package#
Access to the package (OPF) information.
- Type:
Package
- Returns:
Package object with OPF file information
Example:
package = doc.package
print(f"Title: {package.metadata.title}")
- toc#
Access to the table of contents.
- Type:
TableOfContents
- Returns:
Table of contents object
Example:
toc = doc.toc
toc_xml = toc.to_xml()
- ncx#
Access to the NCX (Navigation Control for XML) table of contents.
- Type:
TableOfContents or None
- Returns:
NCX table of contents object for EPUB 2, or for EPUB 3 when an NCX file is present; None otherwise
Example:
ncx = doc.ncx
if ncx:
    ncx_xml = ncx.to_xml()
Note: For EPUB 2, this returns the same as toc. For EPUB 3, this specifically accesses the NCX file if present, which provides backward compatibility.
- nav#
Access to the Navigation Document (EPUB 3 only).
- Type:
TableOfContents or None
- Returns:
Navigation Document table of contents object for EPUB 3, None for EPUB 2 or if not present
Example:
nav = doc.nav
if nav:
    nav_xml = nav.to_xml()
Note: This property specifically accesses EPUB 3 Navigation Documents. Returns None for EPUB 2 documents.
- get_files_info()#
Get detailed information about all files in the EPUB.
- Returns:
List of dictionaries containing file information
- Return type:
List[Dict[str, Union[str, int]]]
Each dictionary contains:
- path (str): File path within the EPUB
- size (int): Uncompressed size in bytes
- compressed_size (int): Compressed size in bytes
- modified (str): Last modified date in ISO format
Example:
files = doc.get_files_info()
for file_info in files:
    print(f"{file_info['path']}: {file_info['size']} bytes")
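The fields listed above correspond closely to what Python's standard zipfile module reports for each archive member. The following is a minimal stdlib-only sketch of how such a list could be assembled; it is not the epub-utils implementation, and the helper name zip_files_info is hypothetical. It builds a small in-memory archive so that it runs without a real EPUB file.

```python
import io
import zipfile
from datetime import datetime
from typing import Dict, List, Union


def zip_files_info(archive: zipfile.ZipFile) -> List[Dict[str, Union[str, int]]]:
    """Hypothetical helper: describe every member of an open ZIP archive."""
    info_list = []
    for member in archive.infolist():
        info_list.append({
            "path": member.filename,
            "size": member.file_size,                 # uncompressed bytes
            "compressed_size": member.compress_size,  # bytes stored in the archive
            # ZipInfo.date_time is a (year, month, day, hour, minute, second) tuple
            "modified": datetime(*member.date_time).isoformat(),
        })
    return info_list


# Build a tiny in-memory archive standing in for an EPUB file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("mimetype", "application/epub+zip")
    zf.writestr("META-INF/container.xml", "<container/>")

with zipfile.ZipFile(buffer) as zf:
    demo_info = zip_files_info(zf)

for entry in demo_info:
    print(f"{entry['path']}: {entry['size']} bytes")
```

The dictionary keys mirror the ones documented for get_files_info(), so code written against this sketch should read the same way as code written against the real method.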
- list_files()#
Get basic information about all files in the EPUB.
- Returns:
List of dictionaries with basic file information
- Return type:
List[Dict[str, str]]
Example:
files = doc.list_files()
print(f"EPUB contains {len(files)} files")
Container Class#
- class Container#
Represents the META-INF/container.xml file information.
- rootfile_path#
Path to the main package file within the EPUB.
- Type:
str
- rootfile_media_type#
Media type of the main package file.
- Type:
str
- to_xml(highlight_syntax=True)#
Get formatted XML representation.
- Parameters:
highlight_syntax (bool) – Whether to apply syntax highlighting
- Returns:
Formatted XML string
- Return type:
str
- to_str()#
Get raw XML content.
- Returns:
Raw XML string
- Return type:
str
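To make the two attributes above concrete, here is a minimal sketch of how rootfile_path and rootfile_media_type relate to the contents of META-INF/container.xml. It uses only the standard library and a hand-written sample document; the function name read_rootfile and the sample path OEBPS/content.opf are assumptions for illustration, not part of epub-utils.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

# Namespace used by container.xml in the EPUB Open Container Format.
OCF_NS = {"ocf": "urn:oasis:names:tc:opendocument:xmlns:container"}

# Hand-written sample; a real EPUB stores this at META-INF/container.xml.
SAMPLE_CONTAINER_XML = """<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def read_rootfile(container_xml: str) -> Tuple[str, str]:
    """Hypothetical helper: return (full-path, media-type) of the first rootfile."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("ocf:rootfiles/ocf:rootfile", OCF_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.get("full-path", ""), rootfile.get("media-type", "")


rootfile_path, rootfile_media_type = read_rootfile(SAMPLE_CONTAINER_XML)
print(f"Package path: {rootfile_path}")
print(f"Media type: {rootfile_media_type}")
```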
Package Class#
- class Package#
Represents the main OPF package file.
- to_xml(highlight_syntax=True)#
Get formatted XML representation of the complete package.
- Parameters:
highlight_syntax (bool) – Whether to apply syntax highlighting
- Returns:
Formatted XML string
- Return type:
str
- to_str()#
Get raw XML content of the complete package.
- Returns:
Raw XML string
- Return type:
str
Metadata Class#
- class Metadata#
Represents Dublin Core and EPUB-specific metadata.
- title#
Book title from dc:title element.
- Type:
str
- creator#
Book author/creator from dc:creator element.
- Type:
str
- language#
Language code from dc:language element.
- Type:
str
- identifier#
Unique identifier from dc:identifier element.
- Type:
str
- publisher#
Publisher from dc:publisher element.
- Type:
str
- date#
Publication date from dc:date element.
- Type:
str
- subject#
Subject/keywords from dc:subject element.
- Type:
str
- description#
Description from dc:description element.
- Type:
str
- contributor#
Contributor from dc:contributor element.
- Type:
str
- type#
Resource type from dc:type element.
- Type:
str
- format#
Format from dc:format element.
- Type:
str
- source#
Source from dc:source element.
- Type:
str
- relation#
Relation from dc:relation element.
- Type:
str
- coverage#
Coverage from dc:coverage element.
- Type:
str
- rights#
Rights information from dc:rights element.
- Type:
str
- __getattr__(name)#
Dynamic attribute access for any metadata field.
- Parameters:
name (str) – Metadata field name
- Returns:
Metadata value or empty string
- Return type:
str
Example:
# Access any metadata field; a field that is absent comes back as an empty string
isbn = metadata.isbn or 'Not available'
series = getattr(metadata, 'series', '') or 'Not available'
- to_xml(highlight_syntax=True)#
Get formatted XML representation of metadata.
- Parameters:
highlight_syntax (bool) – Whether to apply syntax highlighting
- Returns:
Formatted XML string
- Return type:
str
- to_kv()#
Get metadata as key-value pairs.
- Returns:
Key-value formatted string
- Return type:
str
Example:
kv_data = metadata.to_kv()
print(kv_data)
# Output:
# title: The Great Gatsby
# creator: F. Scott Fitzgerald
# language: en
- to_str()#
Get raw XML content of metadata.
- Returns:
Raw XML string
- Return type:
str
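The Metadata attributes above map onto Dublin Core elements, and to_kv() renders them as key: value lines. The sketch below shows one way that mapping could work using only the standard library. It is an illustration under stated assumptions rather than the library's code: the helper names dublin_core_fields and to_key_value are hypothetical, and the sample XML is simplified (a real OPF metadata element also carries the OPF namespace).

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Dublin Core element namespace, conventionally bound to the "dc" prefix.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Simplified, hand-written sample of an OPF <metadata> block.
SAMPLE_METADATA_XML = """
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>The Great Gatsby</dc:title>
  <dc:creator>F. Scott Fitzgerald</dc:creator>
  <dc:language>en</dc:language>
</metadata>
"""


def dublin_core_fields(metadata_xml: str) -> Dict[str, str]:
    """Hypothetical helper: map each dc:* element name to its text."""
    root = ET.fromstring(metadata_xml)
    prefix = "{" + DC_NS + "}"
    fields: Dict[str, str] = {}
    for element in root:
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            # Keep the first occurrence when an element repeats.
            fields.setdefault(name, (element.text or "").strip())
    return fields


def to_key_value(fields: Dict[str, str]) -> str:
    """Hypothetical helper: render fields as 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in fields.items())


fields = dublin_core_fields(SAMPLE_METADATA_XML)
kv_text = to_key_value(fields)
print(kv_text)
```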
Manifest Class#
- class Manifest#
Represents the package manifest section.
- items#
Dictionary of manifest items.
- Type:
Dict[str, Dict[str, str]]
Each item contains:
- href: File path
- media-type: MIME type
- Other attributes as needed
Example:
for item_id, item in manifest.items.items():
    print(f"ID: {item_id}")
    print(f" File: {item['href']}")
    print(f" Type: {item['media-type']}")
- to_xml(highlight_syntax=True)#
Get formatted XML representation.
- Parameters:
highlight_syntax (bool) – Whether to apply syntax highlighting
- Returns:
Formatted XML string
- Return type:
str
- to_str()#
Get raw XML content.
- Returns:
Raw XML string
- Return type:
str
Spine Class#
- class Spine#
Represents the package spine section.
- items#
List of spine items in reading order.
- Type:
List[Dict[str, str]]
Example:
for item in spine.items:
    print(f"Reading order item: {item}")
- to_xml(highlight_syntax=True)#
Get formatted XML representation.
- Parameters:
highlight_syntax (bool) – Whether to apply syntax highlighting
- Returns:
Formatted XML string
- Return type:
str
- to_str()#
Get raw XML content.
- Returns:
Raw XML string
- Return type:
str
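Spine items give the reading order, while manifest items give the file paths, so the two are usually combined. The sketch below shows that lookup on plain dictionaries shaped like the documented types. It assumes each spine item carries an idref key naming a manifest ID, as the itemref element does in the OPF format; this reference does not state which keys epub-utils exposes, so treat that key and the helper name reading_order_paths as assumptions.

```python
from typing import Dict, List


def reading_order_paths(
    spine_items: List[Dict[str, str]],
    manifest_items: Dict[str, Dict[str, str]],
) -> List[str]:
    """Hypothetical helper: resolve spine entries to manifest hrefs, in order."""
    paths = []
    for item in spine_items:
        # Assumption: the spine item stores the manifest ID under "idref".
        entry = manifest_items.get(item.get("idref", ""))
        if entry is not None:
            paths.append(entry["href"])
    return paths


# Sample data shaped like Manifest.items and Spine.items.
sample_manifest = {
    "ch1": {"href": "text/chapter1.xhtml", "media-type": "application/xhtml+xml"},
    "ch2": {"href": "text/chapter2.xhtml", "media-type": "application/xhtml+xml"},
    "css": {"href": "styles/main.css", "media-type": "text/css"},
}
sample_spine = [{"idref": "ch1"}, {"idref": "ch2"}, {"idref": "missing"}]

ordered_paths = reading_order_paths(sample_spine, sample_manifest)
print(ordered_paths)
```

Spine entries that point at an unknown manifest ID are skipped here; raising an error instead would be an equally reasonable choice for stricter validation.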
TableOfContents Class#
- class TableOfContents#
Represents the table of contents (NCX or Navigation Document).
- to_xml(highlight_syntax=True)#
Get formatted XML representation.
- Parameters:
highlight_syntax (bool) – Whether to apply syntax highlighting
- Returns:
Formatted XML string
- Return type:
str
- to_str()#
Get raw XML content.
- Returns:
Raw XML string
- Return type:
str
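Because a TableOfContents object may come from either a Navigation Document or an NCX file, calling code often wants one of them with a fallback. The following sketch prefers doc.nav and falls back to doc.ncx, relying only on the Document properties described in this reference. The helper name preferred_navigation is hypothetical, and SimpleNamespace objects stand in for real Document instances so the example runs without an EPUB file.

```python
from types import SimpleNamespace
from typing import Any, Optional


def preferred_navigation(doc: Any) -> Optional[Any]:
    """Hypothetical helper: return doc.nav if set, otherwise doc.ncx (may be None)."""
    nav = getattr(doc, "nav", None)
    if nav is not None:
        return nav
    return getattr(doc, "ncx", None)


# Stand-ins for Document objects; the string values are placeholders for
# TableOfContents instances.
epub3_like = SimpleNamespace(nav="nav-document", ncx="ncx-file")
epub2_like = SimpleNamespace(nav=None, ncx="ncx-file")
no_toc_like = SimpleNamespace(nav=None, ncx=None)

for candidate in (epub3_like, epub2_like, no_toc_like):
    print(preferred_navigation(candidate))
```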
Content Classes#
- class Content#
Base class for EPUB content documents.
- to_xml(highlight_syntax=True)#
Get formatted content.
- Parameters:
highlight_syntax (bool) – Whether to apply syntax highlighting
- Returns:
Formatted content string
- Return type:
str
- to_str()#
Get raw content.
- Returns:
Raw content string
- Return type:
str
- class XHTMLContent#
Specialized class for XHTML content documents.
Inherits from Content with additional XHTML-specific methods.
- to_plain()#
Get plain text content with HTML tags stripped.
- Returns:
Plain text string
- Return type:
str
Example:
from epub_utils.content import XHTMLContent

# This would typically be accessed through Document
# content = XHTMLContent(raw_html)
# plain_text = content.to_plain()
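To illustrate what stripping HTML tags involves, here is a minimal sketch built on the standard library's html.parser. It is not how to_plain() is implemented; the class name TextCollector and the function strip_tags are hypothetical, and the choices to skip script and style content and to collapse whitespace are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical parser that keeps text nodes and drops markup."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the text nodes, then collapse runs of whitespace to single spaces.
        return " ".join("".join(self._chunks).split())


def strip_tags(markup: str) -> str:
    """Hypothetical helper: return the visible text of an (X)HTML fragment."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser.text()


sample_markup = (
    "<body>\n<h1>Chapter 1</h1>\n"
    "<p>Hello <em>world</em> &amp; goodbye.</p>\n"
    "<style>p { color: red; }</style>\n</body>"
)
plain = strip_tags(sample_markup)
print(plain)
```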
Exception Classes#
- exception ParseError#
Raised when there’s an error parsing EPUB content.
Base class:
Exception
Example:
from epub_utils import Document
from epub_utils.exceptions import ParseError

try:
    doc = Document("corrupted.epub")
    title = doc.package.metadata.title
except ParseError as e:
    print(f"Failed to parse EPUB: {e}")
except FileNotFoundError:
    print("EPUB file not found")
Usage Examples#
Basic Usage#
from epub_utils import Document
# Load document
doc = Document("book.epub")
# Access metadata
metadata = doc.package.metadata
print(f"Title: {metadata.title}")
print(f"Author: {metadata.creator}")
# Check file structure
files = doc.get_files_info()
print(f"Contains {len(files)} files")
# Get formatted output
toc_xml = doc.toc.to_xml()
metadata_kv = metadata.to_kv()
Error Handling#
from epub_utils import Document
from epub_utils.exceptions import ParseError
def safe_load_epub(path):
    try:
        doc = Document(path)
        return {
            'status': 'success',
            'document': doc,
            'title': getattr(doc.package.metadata, 'title', 'Unknown')
        }
    except ParseError as e:
        return {
            'status': 'parse_error',
            'error': str(e)
        }
    except FileNotFoundError:
        return {
            'status': 'file_not_found',
            'error': 'EPUB file not found'
        }
    except Exception as e:
        return {
            'status': 'unknown_error',
            'error': str(e)
        }
Batch Processing#
from pathlib import Path
from epub_utils import Document

def process_epub_directory(directory):
    # Collect every .epub file directly inside the directory
    epub_files = Path(directory).glob("*.epub")
    results = []
    for epub_path in epub_files:
        try:
            doc = Document(str(epub_path))
            metadata = doc.package.metadata
            result = {
                'file': epub_path.name,
                'title': getattr(metadata, 'title', ''),
                'author': getattr(metadata, 'creator', ''),
                'language': getattr(metadata, 'language', ''),
                'file_size': epub_path.stat().st_size,
                'epub_files': len(doc.get_files_info())
            }
            results.append(result)
        except Exception as e:
            # Record the failure and continue with the next file
            results.append({
                'file': epub_path.name,
                'error': str(e)
            })
    return results
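The result dictionaries produced above have two shapes (a full record, or a file name plus an error), which makes them easy to flatten into a report. The sketch below writes such a list to CSV with the standard library. The function name results_to_csv and the sample rows are assumptions for illustration; the column names follow the keys used in process_epub_directory.

```python
import csv
import io
from typing import Dict, Iterable

# Column order for the report; "error" is filled only for failed files.
FIELDS = ["file", "title", "author", "language", "file_size", "epub_files", "error"]


def results_to_csv(results: Iterable[Dict[str, object]]) -> str:
    """Hypothetical helper: render batch results as CSV text."""
    out = io.StringIO()
    # restval="" leaves a cell empty when a row lacks that key.
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    for row in results:
        writer.writerow(row)
    return out.getvalue()


# Sample rows shaped like the success and failure records above.
sample_results = [
    {"file": "a.epub", "title": "A", "author": "X", "language": "en",
     "file_size": 1024, "epub_files": 12},
    {"file": "b.epub", "error": "bad zip"},
]

report = results_to_csv(sample_results)
print(report)
```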
Type Hints#
For better IDE support and type checking, here are the main type hints:
from typing import Dict, List, Union, Optional
from epub_utils import Document
# Function signatures for reference
def get_files_info(self) -> List[Dict[str, Union[str, int]]]: ...
def list_files(self) -> List[Dict[str, str]]: ...
def to_xml(self, highlight_syntax: bool = True) -> str: ...
def to_str(self) -> str: ...
def to_kv(self) -> str: ...
# Type-safe usage example
doc: Document = Document("book.epub")
files_info: List[Dict[str, Union[str, int]]] = doc.get_files_info()
title: str = doc.package.metadata.title
kv_data: str = doc.package.metadata.to_kv()
Module Structure#
The epub-utils package is organized as follows:
epub_utils/
├── __init__.py # Main exports (Document, Container)
├── doc.py # Document class
├── container.py # Container class
├── package/
│ ├── __init__.py # Package class
│ ├── metadata.py # Metadata class
│ ├── manifest.py # Manifest class
│ └── spine.py # Spine class
├── content/
│ ├── __init__.py # Content classes
│ ├── base.py # Base Content class
│ └── xhtml.py # XHTMLContent class
├── toc.py # TableOfContents class
├── exceptions.py # Exception classes
├── highlighters.py # Syntax highlighting utilities
└── cli.py # Command-line interface
For detailed implementation examples, see Use as a Python library and Examples and Use Cases.