# CALYPSO-Database

**Repository Path**: ms2lab/calypso-database

## Basic Information

- **Project Name**: CALYPSO-Database
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-15
- **Last Updated**: 2026-05-15

## README

# CALYPSO Database

A Python library for parsing, validating, and managing materials science data from calculations, including crystal structures, symmetry information, and spectroscopic data (XRD, IR, Raman).

## Features

- **Command-Line Interface**: Easy-to-use CLI for data conversion and parsing with Unix-style piping support
- **Structured Data Models**: Type-safe dataclasses for materials, symmetry, and spectroscopic data
- **Flexible Parsing**: Convert CSV files with embedded file references to structured JSON
- **Batch Processing**: Efficiently process multiple material records with `BatchParser`
- **Error Handling**: Configurable error strategies (`ignore`, `warn`, `raise`) for robust data processing
- **Validation**: Built-in validation for spectrum data with detailed error reporting
- **Serialization**: Easy conversion between Python objects, dictionaries, and JSON
- **Unicode Support**: Preserve Chinese and other non-ASCII characters by default

## Installation

This package is distributed as wheel files or source code. Choose the installation method that fits your needs.

### Option 1: Install from Wheel (Recommended for End Users)

Download the `.whl` file and install as a CLI tool:

```bash
# Install globally with uv tool
uv tool install calypso_database-*.whl

# Use directly
calypso-db --version
```

Or install in a project:

```bash
# Using uv
uv pip install calypso_database-*.whl

# Using pip
pip install calypso_database-*.whl
```

### Option 2: Install from Source (Recommended for Developers)

Clone or download the source code:

```bash
# Clone repository
git clone <repository-url>
cd CALYPSO-Database

# Create virtual environment and install in editable mode
uv venv
uv pip install -e .

# Activate and use
source .venv/bin/activate  # or `.venv\Scripts\activate` on Windows
calypso-db --version
```

Or install directly without editable mode:

```bash
cd CALYPSO-Database
uv pip install .
```

## Command-Line Interface

The package provides a `calypso-db` command for data conversion and parsing.

### Quick Start

```bash
# Convert CSV to JSON and parse in one pipeline
calypso-db convert data.csv | calypso-db parse -

# Parse with 8 worker processes
calypso-db convert data.csv | calypso-db parse - -j 8

# Save the result
calypso-db convert data.csv | calypso-db parse - -o result.json

# With formatted output
calypso-db convert data.csv | calypso-db parse - --indent 2
```

### Convert CSV to JSON

Convert CSV files to JSON format without validation:

```bash
# Basic conversion (compact output, Unicode preserved)
calypso-db convert data.csv

# Save to file
calypso-db convert data.csv -o output.json

# Formatted output with indentation
calypso-db convert data.csv --indent 2

# Escape non-ASCII characters
calypso-db convert data.csv --ascii
```
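
The `--indent` and `--ascii` flags have close analogues in the `indent` and `ensure_ascii` parameters of Python's standard `json.dumps`. The sketch below uses only the standard library to show the difference between preserved and escaped output. Whether `calypso-db` calls `json.dumps` internally is an assumption, and the `record` dictionary is made up for illustration.

```python
import json

# Hypothetical record; the field values are illustrative only.
record = {"material_id": "mat_001", "source": "实验数据"}

# Unicode preserved (analogous to the CLI default).
preserved = json.dumps(record, ensure_ascii=False)

# Non-ASCII characters escaped as \uXXXX sequences (analogous to --ascii).
escaped = json.dumps(record, ensure_ascii=True)

# Pretty-printed with two-space indentation (analogous to --indent 2).
pretty = json.dumps(record, ensure_ascii=False, indent=2)

print(preserved)
print(escaped)
print(pretty)
```

Both forms decode to the same data, so escaping only matters when a downstream tool cannot handle UTF-8.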

### Parse and Validate JSON

Parse JSON data with validation and structure it:

```bash
# Parse from file
calypso-db parse data.json

# Parse from stdin
cat data.json | calypso-db parse -
echo '[{"material_id":"test","phase_id":"phase1"}]' | calypso-db parse -

# Chain convert and parse
calypso-db convert data.csv | calypso-db parse -

# Parse in parallel with 8 worker processes
calypso-db parse data.json -j 8

# Error handling options
calypso-db parse data.json --on-error ignore   # Ignore errors
calypso-db parse data.json --on-error warn     # Show warnings (default)
calypso-db parse data.json --on-error raise    # Raise exceptions

# Save output
calypso-db parse data.json -o result.json

# Formatted output
calypso-db parse data.json --indent 2

# Quiet mode (suppress error messages)
calypso-db parse data.json --quiet
```

### Integration with Other Tools

```bash
# Use with jq for filtering
calypso-db convert data.csv | calypso-db parse - | jq '.material[0]'

# Save intermediate results
calypso-db convert data.csv | tee raw.json | calypso-db parse - -o parsed.json

# Format existing JSON
calypso-db parse compact.json --indent 2 -o formatted.json
```

### CLI Options

**Common options:**
- `--output, -o FILE`: Write output to file instead of stdout
- `--indent N`: Indent JSON output by N spaces (default: compact)
- `--ascii`: Escape non-ASCII characters (default: keep Unicode)
- `--version, -v`: Show version information

**Parse-specific options:**
- `--on-error {ignore,warn,raise}`: Error handling strategy (default: warn)
- `--quiet, -q`: Suppress error messages
- `-j, --jobs N`: Number of worker processes for parsing (default: 1)
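
For scripts that drive the CLI from Python, the options above can be assembled into an argument list and run with `subprocess`. This is a minimal sketch: `build_parse_command` and `run_pipeline` are hypothetical helper names, not part of the package, and they use only the flags documented above.

```python
import subprocess
from pathlib import Path


def build_parse_command(source="-", jobs=1, on_error="warn", indent=None, output=None):
    """Assemble an argv list for `calypso-db parse` using the documented flags."""
    if on_error not in {"ignore", "warn", "raise"}:
        raise ValueError(f"unsupported on_error value: {on_error!r}")
    cmd = ["calypso-db", "parse", str(source), "--on-error", on_error, "-j", str(jobs)]
    if indent is not None:
        cmd += ["--indent", str(indent)]
    if output is not None:
        cmd += ["-o", str(output)]
    return cmd


def run_pipeline(csv_path, **parse_options):
    """Run `convert | parse -` and return the parsed JSON text from stdout."""
    converted = subprocess.run(
        ["calypso-db", "convert", str(Path(csv_path))],
        check=True, capture_output=True, text=True,
    )
    parsed = subprocess.run(
        build_parse_command("-", **parse_options),
        input=converted.stdout, check=True, capture_output=True, text=True,
    )
    return parsed.stdout


print(build_parse_command("data.json", jobs=8, indent=2))
```

`run_pipeline` requires `calypso-db` to be installed and on `PATH`; `build_parse_command` only builds the list and can be inspected without running anything.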

## Output JSON Structure

The `parse` command outputs a structured JSON with categorized data:

```json
{
  "material": [...],      // Material properties and composition
  "symmetry": [...],      // Crystal symmetry information
  "xrd_exp": [...],       // Experimental XRD spectra
  "xrd_theory": [...],    // Theoretical XRD spectra
  "ir_exp": [...],        // Experimental IR spectra
  "ir_theory": [...],     // Theoretical IR spectra
  "raman_exp": [...],     // Experimental Raman spectra
  "raman_theory": [...]   // Theoretical Raman spectra
}
```

**Notes:**
- Each array contains records corresponding to input materials in order
- `null` entries indicate missing or invalid data
- Custom properties from `prop_*` columns appear in material records
- Spectrum metadata (custom keys from spectrum files) is preserved
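
Because the arrays are aligned with the input records, entries at the same index belong to the same material. The sketch below uses a small hand-written result (the record contents are hypothetical) to find materials whose experimental XRD slot is `null`.

```python
import json

# Hypothetical `parse` output for two materials; the second has no
# experimental XRD spectrum, so its slot is null.
result_json = """
{
  "material": [{"material_id": "mat_001"}, {"material_id": "mat_002"}],
  "xrd_exp": [{"material_id": "mat_001", "variable": [10.5], "intensity": [100.2]}, null]
}
"""

result = json.loads(result_json)

# Arrays are index-aligned with the input records, so zip pairs them up.
missing_xrd = [
    material["material_id"]
    for material, spectrum in zip(result["material"], result["xrd_exp"])
    if material is not None and spectrum is None
]
print(missing_xrd)
```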

## CSV Input Format

The CSV file should follow this structure:

```csv
material_id,phase_id,experimental,source,license,structure,prop_band_gap,prop_density,XRD_exp,XRD_theory,...
string,string,boolean,string,string,path,float,float,path,path,...
Material ID,Phase ID,Is experimental?,Data source,License,Structure file,Band gap (eV),Density (g/cm³),XRD exp file,XRD theory file,...
mat_001,phase_alpha,true,DFT,MIT,structures/mat_001.cif,2.5,3.2,spectra/xrd_exp.txt,,...
mat_002,phase_beta,false,Exp,CC-BY,structures/mat_002.cif,1.8,4.1,,spectra/xrd_theory.txt,...
```

**Key Points:**
- Row 1: Column headers
- Rows 2-3: Automatically skipped (type and description rows)
- Row 4+: Data rows
- File path columns (`structure`, `XRD_*`, `IR_*`, `Raman_*`) are relative to CSV location
- File contents are automatically embedded in the JSON output
- Custom property columns with `prop_` prefix (e.g., `prop_band_gap`, `prop_density`) are stored in the Material model without the `prop_` prefix
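
The row layout and the `prop_` prefix rule can be illustrated with the standard `csv` module. This is a sketch of the conventions only, not the library's loader: it combines row skipping and prefix stripping in one step for brevity, keeps all values as strings, and does not embed referenced files.

```python
import csv
import io

# Hypothetical CSV text following the layout described above.
csv_text = """material_id,phase_id,prop_band_gap,prop_density
string,string,float,float
Material ID,Phase ID,Band gap (eV),Density (g/cm3)
mat_001,phase_alpha,2.5,3.2
mat_002,phase_beta,1.8,4.1
"""

rows = list(csv.reader(io.StringIO(csv_text)))
header, data_rows = rows[0], rows[3:]  # rows 2-3 (types, descriptions) are skipped

records = []
for row in data_rows:
    record = {}
    for name, value in zip(header, row):
        # Strip the prop_ prefix from custom property columns.
        key = name.removeprefix("prop_")
        record[key] = value
    records.append(record)

print(records[0])
```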

### Supported Columns

| Column | Type | Description |
|--------|------|-------------|
| `material_id` | string | Material identifier (auto-generated if missing) |
| `phase_id` | string | Phase identifier |
| `experimental` | boolean | Experimental vs theoretical data flag |
| `source` | string | Data source |
| `license` | string | License information |
| `structure` | path | CIF format structure file |
| `prop_*` | any | Material properties (e.g., `prop_band_gap`) |
| `XRD_exp` | path | Experimental XRD spectrum |
| `XRD_theory` | path | Theoretical XRD spectrum |
| `IR_exp` | path | Experimental IR spectrum |
| `IR_theory` | path | Theoretical IR spectrum |
| `Raman_exp` | path | Experimental Raman spectrum |
| `Raman_theory` | path | Theoretical Raman spectrum |

## Spectrum File Format

Spectrum files should follow this format:

```
# Lines starting with # are comments; metadata uses the form key = value
# key1 = value1
# key2 = value2
# Data section: two columns (variable, intensity)
10.5  100.2
20.3  150.7
30.1  200.5
```
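
A reader for this format fits in a few lines. The sketch below is one possible interpretation, not the library's parser: `parse_spectrum` is a hypothetical name, comment lines without `=` are dropped, and metadata that tries to use the reserved keys `variable` or `intensity` is ignored so the data columns are never overwritten.

```python
def parse_spectrum(text):
    """Parse spectrum text into metadata plus `variable` and `intensity` lists."""
    spectrum = {"variable": [], "intensity": []}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Comment lines containing '=' are treated as key-value metadata.
            key, sep, value = line.lstrip("#").partition("=")
            key = key.strip()
            if sep and key not in ("variable", "intensity"):
                spectrum[key] = value.strip()
            continue
        # Data line: first column is the variable, second is the intensity.
        x, y = line.split()[:2]
        spectrum["variable"].append(float(x))
        spectrum["intensity"].append(float(y))
    return spectrum


sample = """# Comment without a key-value pair
# instrument = demo
# wavelength = 1.5406
10.5  100.2
20.3  150.7
30.1  200.5
"""

spectrum = parse_spectrum(sample)
print(spectrum)
```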

**Reserved Keys:**
- `variable`: X-axis data (automatically populated from data section)
- `intensity`: Y-axis data (automatically populated from data section)

---

## Python API

For advanced usage, you can use the Python API directly.

### Basic Usage

```python
from calypso_database.parsers import Parser

# Parse a single material record
data = {
    "material_id": "mat_001",
    "phase_id": "phase_alpha",
    "structure": "...",  # CIF format structure
    "XRD_exp": "...",    # XRD spectrum data
}

parser = Parser.from_dict(data, on_error="warn")
print(parser.to_json(indent=2))
```

### Batch Processing

```python
from calypso_database.csv_loader import csv2json
from calypso_database.parsers import BatchParser

# Convert CSV to JSON
json_str = csv2json("materials_data.csv")

# Parse all records
batch = BatchParser.from_json(json_str, on_error="warn")

# Parse all records in parallel
batch_parallel = BatchParser.from_json(json_str, on_error="warn", jobs=8)

# Get aggregated results
result = batch.to_dict()
# {
#   "material": [...],
#   "symmetry": [...],
#   "xrd_exp": [...],
#   ...
# }
```


### Error Handling

Control how parsing errors are handled:

```python
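from calypso_database.parsers import Parser

# `data` is a single material record, as in Basic Usage
data = {"material_id": "mat_001", "phase_id": "phase_alpha"}

# Ignore all errors silently
parser = Parser.from_dict(data, on_error="ignore")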

# Emit warnings but continue (default)
parser = Parser.from_dict(data, on_error="warn")

# Raise exception on first error
parser = Parser.from_dict(data, on_error="raise")
```
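
To make the three strategies concrete, the following sketch shows one common way such a switch is implemented with the standard `warnings` module. `handle_error` and `parse_float` are hypothetical helpers written for this illustration; the library's internals may differ.

```python
import warnings


def handle_error(exc, on_error="warn"):
    """Apply one of the three strategies to an exception raised while parsing."""
    if on_error == "ignore":
        return None
    if on_error == "warn":
        warnings.warn(f"parsing failed: {exc}", stacklevel=2)
        return None
    if on_error == "raise":
        raise exc
    raise ValueError(f"unknown on_error value: {on_error!r}")


def parse_float(text, on_error="warn"):
    """Hypothetical field parser: returns None when the value cannot be parsed."""
    try:
        return float(text)
    except ValueError as exc:
        return handle_error(exc, on_error)


print(parse_float("2.5"))
print(parse_float("n/a", on_error="ignore"))
```

With `ignore` and `warn`, the failed field becomes `None` and processing continues; with `raise`, the original exception propagates to the caller.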

## Data Models

### Material
- `material_id`: Unique identifier (ULID)
- `phase_id`: Phase identifier
- `experimental`: Boolean flag
- `source`: Data source
- `license`: License information
- `formula`: Chemical formula
- `nsites`: Number of sites
- `elements`: List of elements
- `nelements`: Number of elements
- `volume`: Unit cell volume
- `density`: Density
- Custom properties: values from `prop_*` columns, stored without the `prop_` prefix (e.g., `band_gap`)

### Symmetry
- `material_id`: Reference to material
- `crystal_system`: Crystal system
- `space_group_number`: Space group number
- `space_group_symbol`: Space group symbol
- `point_group`: Point group

### Spectrum (XRD, IR, Raman)
- `material_id`: Reference to material
- `variable`: X-axis data (2θ for XRD, wavenumber for IR/Raman)
- `intensity`: Y-axis data
- Custom metadata fields
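
As a rough picture of how a spectrum record could be expressed as a dataclass, consider the sketch below. `SpectrumSketch` and its `validate` method are illustrative stand-ins whose checks are assumptions; they are not the classes or validation rules shipped with the library.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class SpectrumSketch:
    """Illustrative stand-in for a spectrum record; not the library's class."""

    material_id: str
    variable: list[float]
    intensity: list[float]
    metadata: dict[str, str] = field(default_factory=dict)

    def validate(self) -> list[str]:
        """Return a list of human-readable problems (empty when valid)."""
        problems = []
        if len(self.variable) != len(self.intensity):
            problems.append("variable and intensity differ in length")
        if not self.variable:
            problems.append("spectrum contains no data points")
        return problems


xrd = SpectrumSketch("mat_001", [10.5, 20.3], [100.2, 150.7], {"instrument": "demo"})
print(xrd.validate())
print(asdict(xrd))
```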

## API Reference

### Parser

```python
Parser.from_dict(data: dict, on_error: Literal["ignore", "warn", "raise"] = "warn") -> Parser
```
Parse a single material record.

```python
parser.to_dict() -> dict
```
Convert to a dictionary in which every field is present; unavailable fields are `None`.

```python
parser.to_json(**kwargs) -> str
```
Convert to JSON string.

### BatchParser

```python
BatchParser.from_json(
    json_str: str,
    on_error: Literal["ignore", "warn", "raise"] = "warn",
    jobs: int = 1,
) -> BatchParser
```
Parse multiple material records from JSON string.

```python
batch.to_dict() -> dict
```
Aggregate all parsers into categorized lists.
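
The aggregation can be pictured as turning a list of per-record results into one dictionary of index-aligned lists. The sketch below assumes each record is a plain dictionary keyed by category; `aggregate` and the sample records are hypothetical and only mirror the output structure described earlier.

```python
CATEGORIES = (
    "material", "symmetry",
    "xrd_exp", "xrd_theory",
    "ir_exp", "ir_theory",
    "raman_exp", "raman_theory",
)


def aggregate(records):
    """Turn per-record dicts into one dict of index-aligned lists."""
    return {
        category: [record.get(category) for record in records]
        for category in CATEGORIES
    }


# Hypothetical per-record results; absent categories become None.
records = [
    {"material": {"material_id": "mat_001"}, "xrd_exp": {"variable": [10.5]}},
    {"material": {"material_id": "mat_002"}},
]
result = aggregate(records)
print(result["xrd_exp"])
```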

```python
batch.to_json(**kwargs) -> str
```
Convert to JSON string.

### Utilities

```python
csv2json(fcsv: str | Path) -> str
```
Convert CSV file to JSON string for BatchParser processing.

## Development

### Setup

```bash
# Clone the repository
git clone https://github.com/yourusername/calypso-database.git
cd calypso-database

# Install dependencies including test dependencies
uv sync --extra test

# The package is automatically installed in editable mode
```

### Running Tests

```bash
# Run all tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run with coverage report
uv run pytest --cov=calypso_database --cov-report=term-missing

# Run specific test file
uv run pytest tests/test_utils.py

# Run specific test
uv run pytest tests/test_utils.py::TestCsv2Json::test_simple_csv
```

### Code Quality

```bash
# Format code
uv run ruff format

# Lint code
uv run ruff check

# Fix linting issues
uv run ruff check --fix
```

### Distribution

Build wheel and source distribution packages:

```bash
# Build distribution packages
uv build
```

This creates files in the `dist/` directory:
- `calypso_database-*.whl` - Wheel package (binary distribution)
- `calypso_database-*.tar.gz` - Source distribution

Test the built wheel:

```bash
# Install from local wheel
uv tool install dist/calypso_database-*.whl

# Test the installed tool
calypso-db --version

# Uninstall
uv tool uninstall calypso-database
```

## Requirements

- Python >= 3.11
- pandas >= 3.0.2
- pymatgen >= 2026.3.23
- python-ulid >= 3.1.0
- typing-extensions >= 4.0.0

## License

[Add your license here]

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Citation

If you use this library in your research, please cite:

```
[Add citation information here]
```

## Contact

Xiaoshan Luo - luoxs@calypso.cn