No description
Find a file
2025-12-17 16:22:29 -08:00
.github/workflows let gemini address some github action issue 2025-12-17 16:22:29 -08:00
nim_mmcif let gemini address some github action issue 2025-12-17 16:22:29 -08:00
tests give opus another task 2025-08-24 05:57:34 -07:00
.gitignore ask claude to cleanup and add a new github workflow 2025-08-22 08:15:11 -07:00
build_nim.py trying max thinking mode for opus for cleanup 2025-08-23 20:52:07 -07:00
conftest.py fix 2025-08-23 17:03:46 -07:00
LICENSE Initial commit 2025-08-19 12:50:01 -07:00
MANIFEST.in another round 2025-08-21 17:24:29 -07:00
mmcif.nimble let gemini address some github action issue 2025-12-17 16:22:29 -08:00
nim_mmcif.nim give opus another task 2025-08-24 05:57:34 -07:00
pyproject.toml its true 2025-08-23 17:01:49 -07:00
README.md readme 2025-08-24 07:45:04 -07:00
setup.py 0.0.17 2025-08-23 14:12:39 -07:00
update_version.py let gemini address some github action issue 2025-12-17 16:22:29 -08:00

nim-mmcif

Fast mmCIF (Macromolecular Crystallographic Information File) parser written in Nim with Python bindings

The goal of this repository is to experiment with vibe coding while building something useful for bioinformatics community, to see how much of a cross platform library can be driven to completion by transformers

Verdict: I have upgraded to the Max 200$ plan. Opus is the only viable model, at least for me, and can be treated as a superhumanly fast but imperfect junior developer. With right prompting, it can be used to automate a lot of boring work and allow me to focus on the high level creative ones.

Features

  • 🚀 High-performance parsing of mmCIF files using Nim
  • 🌍 Cross-platform support (Linux, macOS, Windows)
  • 📦 Easy installation via pip

Installation

Prerequisites

From PyPI

pip install nim-mmcif

From Source

# Install Nim (platform-specific, see below)
# macOS: brew install nim
# Linux: curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Windows: scoop install nim

# Install the package
git clone https://github.com/lucidrains/nim-mmcif
cd nim-mmcif
pip install -e .

For detailed platform-specific instructions, see CROSS_PLATFORM.md.

Quick Start

Python Usage

Dictionary Access

from nim_mmcif import parse_mmcif

# Parse an mmCIF file (returns dict by default)
data = parse_mmcif("tests/test.mmcif")
print(f"Found {len(data['atoms'])} atoms")

# Access atom properties using dictionary notation
first_atom = data['atoms'][0]
print(f"Atom {first_atom['id']}: {first_atom['label_atom_id']}")
print(f"Position: ({first_atom['x']}, {first_atom['y']}, {first_atom['z']})")

# Parse multiple files using glob patterns
results = parse_mmcif("tests/*.mmcif")
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

Dataclass Access

from nim_mmcif import parse_mmcif, parse_mmcif_batch

# Parse with dataclass support for cleaner dot notation access
data = parse_mmcif("tests/test.mmcif", as_dataclass=True)
print(f"Found {data.atom_count} atoms")

# Access atom properties using dot notation
first_atom = data.atoms[0]
print(f"Atom {first_atom.id}: {first_atom.label_atom_id}")
print(f"Position: ({first_atom.x}, {first_atom.y}, {first_atom.z})")
print(f"Chain: {first_atom.label_asym_id}, Residue: {first_atom.label_comp_id}")

# Use convenience properties and methods
print(f"Unique chains: {data.chains}")
print(f"Number of residues: {len(data.residues)}")

# Get all atoms from a specific chain
chain_a_atoms = data.get_chain('A')

# Get all atoms from a specific residue
residue_atoms = data.get_residue('A', 1)

# Get all positions as tuples
positions = data.positions  # List of (x, y, z) tuples

# Batch processing with dataclasses
results = parse_mmcif_batch(["tests/test1.mmcif", "tests/test2.mmcif"], as_dataclass=True)
for result in results:
    print(f"Structure has {result.atom_count} atoms in {len(result.chains)} chain(s)")

Other Functions

import nim_mmcif

# Get atom count directly
count = nim_mmcif.get_atom_count("tests/test.mmcif")
print(f"File contains {count} atoms")

# Get all atoms with their properties (returns list of dicts)
atoms = nim_mmcif.get_atoms("tests/test.mmcif")
for atom in atoms[:5]:  # Print first 5 atoms
    print(f"Atom {atom['id']}: {atom['label_atom_id']} at ({atom['x']}, {atom['y']}, {atom['z']})")

# Get just the 3D coordinates
positions = nim_mmcif.get_atom_positions("tests/test.mmcif")
for i, (x, y, z) in enumerate(positions[:5]):
    print(f"Position {i}: ({x:.3f}, {y:.3f}, {z:.3f})")

Nim Usage

First

$ nimble install nim_mmcif

Then

import nim_mmcif

# Parse an mmCIF file
let data = mmcif_parse("tests/test.mmcif")
echo "Found ", data.atoms.len, " atoms"

# Iterate through atoms
for atom in data.atoms[0..<min(5, data.atoms.len)]:
  echo "Atom ", atom.id, ": ", atom.label_atom_id, 
       " at (", atom.Cartn_x, ", ", atom.Cartn_y, ", ", atom.Cartn_z, ")"

# Access specific atom properties
if data.atoms.len > 0:
  let firstAtom = data.atoms[0]
  echo "Chain: ", firstAtom.label_asym_id
  echo "Residue: ", firstAtom.label_comp_id
  echo "B-factor: ", firstAtom.B_iso_or_equiv

Batch Processing

Process multiple mmCIF files efficiently in a single operation:

import nim_mmcif

# List of mmCIF files to process
files = [
    "path/to/structure1.mmcif",
    "path/to/structure2.mmcif",
    "path/to/structure3.mmcif"
]

# Parse all files in batch (returns list when no globs used)
results = nim_mmcif.parse_mmcif_batch(files)

# Process results
for i, data in enumerate(results):
    print(f"Structure {i+1}: {len(data['atoms'])} atoms")
    
    # Analyze each structure
    atoms = data['atoms']
    if atoms:
        # Get unique chain IDs
        chains = set(atom['label_asym_id'] for atom in atoms)
        print(f"  Chains: {', '.join(sorted(chains))}")
        
        # Count residues
        residues = set((atom['label_asym_id'], atom['label_seq_id']) 
                      for atom in atoms)
        print(f"  Residues: {len(residues)}")

# Batch processing with glob patterns (returns dict)
results = nim_mmcif.parse_mmcif_batch("path/to/*.mmcif")
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

# Mix of glob patterns and regular paths (returns dict)
results = nim_mmcif.parse_mmcif_batch([
    "specific_file.mmcif",
    "structures/*.mmcif",
    "models/model_?.mmcif"
])
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

Batch processing is particularly useful when:

  • Analyzing multiple protein structures for comparative studies
  • Processing entire datasets of crystallographic structures
  • Building machine learning datasets from PDB files
  • Performing high-throughput structural analysis

The batch function provides better performance than individual parsing when processing multiple files, as it reduces the overhead of repeated function calls.

API Reference

Functions

parse_mmcif(filepath: str, as_dataclass: bool = False) -> dict | MmcifData | dict[str, dict] | dict[str, MmcifData]

Parse an mmCIF file or files matching a glob pattern.

  • filepath: Path to mmCIF file or glob pattern
  • as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
  • Returns:
    • Single file + dict: Dictionary with 'atoms' key
    • Single file + dataclass: MmcifData instance
    • Glob pattern + dict: Dictionary mapping file paths to parsed data
    • Glob pattern + dataclass: Dictionary mapping file paths to MmcifData instances
  • Supports wildcards: * (any characters), ? (single character), ** (recursive)

parse_mmcif_batch(filepaths: list[str] | str, as_dataclass: bool = False) -> list[dict] | list[MmcifData] | dict[str, dict] | dict[str, MmcifData]

Parse multiple mmCIF files in a single operation.

  • filepaths: List of paths, single path, or glob pattern
  • as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
  • Returns:
    • No glob + dict: List of dictionaries with parsed data
    • No glob + dataclass: List of MmcifData instances
    • With glob + dict: Dictionary mapping file paths to parsed data
    • With glob + dataclass: Dictionary mapping file paths to MmcifData instances
  • More efficient than parsing files individually when processing multiple structures

get_atom_count(filepath: str) -> int

Get the number of atoms in an mmCIF file.

get_atoms(filepath: str) -> list[dict]

Get all atoms from an mmCIF file as a list of dictionaries.

get_atom_positions(filepath: str) -> list[tuple[float, float, float]]

Get 3D coordinates of all atoms as a list of (x, y, z) tuples.

Dataclasses

MmcifData

Container for parsed mmCIF data with typed atom access.

Properties:

  • atoms: List of Atom objects
  • atom_count: Total number of atoms
  • positions: List of (x, y, z) tuples for all atoms
  • chains: Set of unique chain identifiers
  • residues: Set of unique (chain_id, seq_id) tuples

Methods:

  • get_chain(chain_id: str): Get all atoms from a specific chain
  • get_residue(chain_id: str, seq_id: int): Get all atoms from a specific residue
  • to_dict(): Convert back to dictionary format

Atom

Represents a single atom with typed properties accessible via dot notation.

Properties:

  • type: Record type (ATOM or HETATM)
  • id: Atom serial number
  • type_symbol: Element symbol
  • label_atom_id: Atom name
  • label_comp_id: Residue name
  • label_asym_id: Chain identifier
  • label_entity_id: Entity ID
  • label_seq_id: Residue sequence number
  • Cartn_x, Cartn_y, Cartn_z: 3D coordinates
  • x, y, z: Convenient aliases for coordinates
  • occupancy: Occupancy factor
  • B_iso_or_equiv: B-factor (temperature factor)
  • position: Tuple of (x, y, z) coordinates

Methods:

  • to_dict(): Convert back to dictionary format

Dictionary Format

When using the default dictionary format (as_dataclass=False), each atom dictionary contains:

  • type: Record type (ATOM or HETATM)
  • id: Atom serial number
  • label_atom_id: Atom name
  • label_comp_id: Residue name
  • label_asym_id: Chain identifier
  • label_seq_id: Residue sequence number
  • x, y, z: 3D coordinates (aliases for Cartn_x, Cartn_y, Cartn_z)
  • occupancy: Occupancy factor
  • B_iso_or_equiv: B-factor
  • And more...

Platform Support

Platform Architecture Python Status
Linux x64, ARM64 3.8-3.12
macOS x64, ARM64 3.8-3.12
Windows x64 3.8-3.12

Building from Source

Automatic Build

python build_nim.py

Manual Build

# Build using nimble tasks
nimble build         # Build debug version
nimble buildRelease  # Build optimized release version

Development

Running Tests

pip install pytest
pytest tests/ -v

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests
  5. Submit a pull request

Documentation

Performance

The Nim implementation provides significant performance improvements over pure Python parsers, especially for large mmCIF files commonly used in structural biology.

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Built with Nim for high performance
  • Python integration via nimporter and nimpy
  • mmCIF format specification from wwPDB