| .github/workflows | ||
| nim_mmcif | ||
| tests | ||
| .gitignore | ||
| build_nim.py | ||
| conftest.py | ||
| LICENSE | ||
| MANIFEST.in | ||
| mmcif.nimble | ||
| nim_mmcif.nim | ||
| pyproject.toml | ||
| README.md | ||
| setup.py | ||
| update_version.py | ||
nim-mmcif
Fast mmCIF (Macromolecular Crystallographic Information File) parser written in Nim with Python bindings
The goal of this repository is to experiment with vibe coding while building something useful for bioinformatics community, to see how much of a cross platform library can be driven to completion by transformers
Verdict: I have upgraded to the Max 200$ plan. Opus is the only viable model, at least for me, and can be treated as a superhumanly fast but imperfect junior developer. With right prompting, it can be used to automate a lot of boring work and allow me to focus on the high level creative ones.
Features
- 🚀 High-performance parsing of mmCIF files using Nim
- 🌍 Cross-platform support (Linux, macOS, Windows)
- 📦 Easy installation via pip
Installation
Prerequisites
- Python 3.8 or higher
- Nim compiler (see platform-specific instructions)
From PyPI
pip install nim-mmcif
From Source
# Install Nim (platform-specific, see below)
# macOS: brew install nim
# Linux: curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Windows: scoop install nim
# Install the package
git clone https://github.com/lucidrains/nim-mmcif
cd nim-mmcif
pip install -e .
For detailed platform-specific instructions, see CROSS_PLATFORM.md.
Quick Start
Python Usage
Dictionary Access
from nim_mmcif import parse_mmcif
# Parse an mmCIF file (returns dict by default)
data = parse_mmcif("tests/test.mmcif")
print(f"Found {len(data['atoms'])} atoms")
# Access atom properties using dictionary notation
first_atom = data['atoms'][0]
print(f"Atom {first_atom['id']}: {first_atom['label_atom_id']}")
print(f"Position: ({first_atom['x']}, {first_atom['y']}, {first_atom['z']})")
# Parse multiple files using glob patterns
results = parse_mmcif("tests/*.mmcif")
for filepath, data in results.items():
print(f"{filepath}: {len(data['atoms'])} atoms")
Dataclass Access
from nim_mmcif import parse_mmcif, parse_mmcif_batch
# Parse with dataclass support for cleaner dot notation access
data = parse_mmcif("tests/test.mmcif", as_dataclass=True)
print(f"Found {data.atom_count} atoms")
# Access atom properties using dot notation
first_atom = data.atoms[0]
print(f"Atom {first_atom.id}: {first_atom.label_atom_id}")
print(f"Position: ({first_atom.x}, {first_atom.y}, {first_atom.z})")
print(f"Chain: {first_atom.label_asym_id}, Residue: {first_atom.label_comp_id}")
# Use convenience properties and methods
print(f"Unique chains: {data.chains}")
print(f"Number of residues: {len(data.residues)}")
# Get all atoms from a specific chain
chain_a_atoms = data.get_chain('A')
# Get all atoms from a specific residue
residue_atoms = data.get_residue('A', 1)
# Get all positions as tuples
positions = data.positions # List of (x, y, z) tuples
# Batch processing with dataclasses
results = parse_mmcif_batch(["tests/test1.mmcif", "tests/test2.mmcif"], as_dataclass=True)
for result in results:
print(f"Structure has {result.atom_count} atoms in {len(result.chains)} chain(s)")
Other Functions
import nim_mmcif
# Get atom count directly
count = nim_mmcif.get_atom_count("tests/test.mmcif")
print(f"File contains {count} atoms")
# Get all atoms with their properties (returns list of dicts)
atoms = nim_mmcif.get_atoms("tests/test.mmcif")
for atom in atoms[:5]: # Print first 5 atoms
print(f"Atom {atom['id']}: {atom['label_atom_id']} at ({atom['x']}, {atom['y']}, {atom['z']})")
# Get just the 3D coordinates
positions = nim_mmcif.get_atom_positions("tests/test.mmcif")
for i, (x, y, z) in enumerate(positions[:5]):
print(f"Position {i}: ({x:.3f}, {y:.3f}, {z:.3f})")
Nim Usage
First
$ nimble install nim_mmcif
Then
import nim_mmcif
# Parse an mmCIF file
let data = mmcif_parse("tests/test.mmcif")
echo "Found ", data.atoms.len, " atoms"
# Iterate through atoms
for atom in data.atoms[0..<min(5, data.atoms.len)]:
echo "Atom ", atom.id, ": ", atom.label_atom_id,
" at (", atom.Cartn_x, ", ", atom.Cartn_y, ", ", atom.Cartn_z, ")"
# Access specific atom properties
if data.atoms.len > 0:
let firstAtom = data.atoms[0]
echo "Chain: ", firstAtom.label_asym_id
echo "Residue: ", firstAtom.label_comp_id
echo "B-factor: ", firstAtom.B_iso_or_equiv
Batch Processing
Process multiple mmCIF files efficiently in a single operation:
import nim_mmcif
# List of mmCIF files to process
files = [
"path/to/structure1.mmcif",
"path/to/structure2.mmcif",
"path/to/structure3.mmcif"
]
# Parse all files in batch (returns list when no globs used)
results = nim_mmcif.parse_mmcif_batch(files)
# Process results
for i, data in enumerate(results):
print(f"Structure {i+1}: {len(data['atoms'])} atoms")
# Analyze each structure
atoms = data['atoms']
if atoms:
# Get unique chain IDs
chains = set(atom['label_asym_id'] for atom in atoms)
print(f" Chains: {', '.join(sorted(chains))}")
# Count residues
residues = set((atom['label_asym_id'], atom['label_seq_id'])
for atom in atoms)
print(f" Residues: {len(residues)}")
# Batch processing with glob patterns (returns dict)
results = nim_mmcif.parse_mmcif_batch("path/to/*.mmcif")
for filepath, data in results.items():
print(f"{filepath}: {len(data['atoms'])} atoms")
# Mix of glob patterns and regular paths (returns dict)
results = nim_mmcif.parse_mmcif_batch([
"specific_file.mmcif",
"structures/*.mmcif",
"models/model_?.mmcif"
])
for filepath, data in results.items():
print(f"{filepath}: {len(data['atoms'])} atoms")
Batch processing is particularly useful when:
- Analyzing multiple protein structures for comparative studies
- Processing entire datasets of crystallographic structures
- Building machine learning datasets from PDB files
- Performing high-throughput structural analysis
The batch function provides better performance than individual parsing when processing multiple files, as it reduces the overhead of repeated function calls.
API Reference
Functions
parse_mmcif(filepath: str, as_dataclass: bool = False) -> dict | MmcifData | dict[str, dict] | dict[str, MmcifData]
Parse an mmCIF file or files matching a glob pattern.
- filepath: Path to mmCIF file or glob pattern
- as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
- Returns:
- Single file + dict: Dictionary with 'atoms' key
- Single file + dataclass: MmcifData instance
- Glob pattern + dict: Dictionary mapping file paths to parsed data
- Glob pattern + dataclass: Dictionary mapping file paths to MmcifData instances
- Supports wildcards:
*(any characters),?(single character),**(recursive)
parse_mmcif_batch(filepaths: list[str] | str, as_dataclass: bool = False) -> list[dict] | list[MmcifData] | dict[str, dict] | dict[str, MmcifData]
Parse multiple mmCIF files in a single operation.
- filepaths: List of paths, single path, or glob pattern
- as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
- Returns:
- No glob + dict: List of dictionaries with parsed data
- No glob + dataclass: List of MmcifData instances
- With glob + dict: Dictionary mapping file paths to parsed data
- With glob + dataclass: Dictionary mapping file paths to MmcifData instances
- More efficient than parsing files individually when processing multiple structures
get_atom_count(filepath: str) -> int
Get the number of atoms in an mmCIF file.
get_atoms(filepath: str) -> list[dict]
Get all atoms from an mmCIF file as a list of dictionaries.
get_atom_positions(filepath: str) -> list[tuple[float, float, float]]
Get 3D coordinates of all atoms as a list of (x, y, z) tuples.
Dataclasses
MmcifData
Container for parsed mmCIF data with typed atom access.
Properties:
atoms: List of Atom objectsatom_count: Total number of atomspositions: List of (x, y, z) tuples for all atomschains: Set of unique chain identifiersresidues: Set of unique (chain_id, seq_id) tuples
Methods:
get_chain(chain_id: str): Get all atoms from a specific chainget_residue(chain_id: str, seq_id: int): Get all atoms from a specific residueto_dict(): Convert back to dictionary format
Atom
Represents a single atom with typed properties accessible via dot notation.
Properties:
type: Record type (ATOM or HETATM)id: Atom serial numbertype_symbol: Element symbollabel_atom_id: Atom namelabel_comp_id: Residue namelabel_asym_id: Chain identifierlabel_entity_id: Entity IDlabel_seq_id: Residue sequence numberCartn_x,Cartn_y,Cartn_z: 3D coordinatesx,y,z: Convenient aliases for coordinatesoccupancy: Occupancy factorB_iso_or_equiv: B-factor (temperature factor)position: Tuple of (x, y, z) coordinates
Methods:
to_dict(): Convert back to dictionary format
Dictionary Format
When using the default dictionary format (as_dataclass=False), each atom dictionary contains:
type: Record type (ATOM or HETATM)id: Atom serial numberlabel_atom_id: Atom namelabel_comp_id: Residue namelabel_asym_id: Chain identifierlabel_seq_id: Residue sequence numberx,y,z: 3D coordinates (aliases for Cartn_x, Cartn_y, Cartn_z)occupancy: Occupancy factorB_iso_or_equiv: B-factor- And more...
Platform Support
| Platform | Architecture | Python | Status |
|---|---|---|---|
| Linux | x64, ARM64 | 3.8-3.12 | ✅ |
| macOS | x64, ARM64 | 3.8-3.12 | ✅ |
| Windows | x64 | 3.8-3.12 | ✅ |
Building from Source
Automatic Build
python build_nim.py
Manual Build
# Build using nimble tasks
nimble build # Build debug version
nimble buildRelease # Build optimized release version
Development
Running Tests
pip install pytest
pytest tests/ -v
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests
- Submit a pull request
Documentation
- Cross-Platform Guide - Platform-specific build instructions
Performance
The Nim implementation provides significant performance improvements over pure Python parsers, especially for large mmCIF files commonly used in structural biology.
License
MIT License - see LICENSE file for details.