# CALYPSO-Database

**Repository Path**: ms2lab/calypso-database

## Basic Information

- **Project Name**: CALYPSO-Database
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-15
- **Last Updated**: 2026-05-15

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# CALYPSO Database

A Python library for parsing, validating, and managing materials science data from calculations, including crystal structures, symmetry information, and spectroscopic data (XRD, IR, Raman).

## Features

- **Command-Line Interface**: Easy-to-use CLI for data conversion and parsing with Unix-style piping support
- **Structured Data Models**: Type-safe dataclasses for materials, symmetry, and spectroscopic data
- **Flexible Parsing**: Convert CSV files with embedded file references to structured JSON
- **Batch Processing**: Efficiently process multiple material records with `BatchParser`
- **Error Handling**: Configurable error strategies (`ignore`, `warn`, `raise`) for robust data processing
- **Validation**: Built-in validation for spectrum data with detailed error reporting
- **Serialization**: Easy conversion between Python objects, dictionaries, and JSON
- **Unicode Support**: Preserve Chinese and other non-ASCII characters by default

## Installation

This package is distributed as wheel files or source code. Choose the installation method that fits your needs.
### Option 1: Install from Wheel (Recommended for End Users)

Download the `.whl` file and install as a CLI tool:

```bash
# Install globally with uv tool
uv tool install calypso_database-*.whl

# Use directly
calypso-db --version
```

Or install in a project:

```bash
# Using uv
uv pip install calypso_database-*.whl

# Using pip
pip install calypso_database-*.whl
```

### Option 2: Install from Source (Recommended for Developers)

Clone or download the source code:

```bash
# Clone repository
git clone <repository-url>
cd CALYPSO-Database

# Create virtual environment and install in editable mode
uv venv
uv pip install -e .

# Activate and use
source .venv/bin/activate  # or `.venv\Scripts\activate` on Windows
calypso-db --version
```

Or install directly without editable mode:

```bash
cd CALYPSO-Database
uv pip install .
```

## Command-Line Interface

The package provides a `calypso-db` command for data conversion and parsing.

### Quick Start

```bash
# Convert CSV to JSON and parse in one pipeline
calypso-db convert data.csv | calypso-db parse -

# Parse with 8 worker processes
calypso-db convert data.csv | calypso-db parse - -j 8

# Save the result
calypso-db convert data.csv | calypso-db parse - -o result.json

# With formatted output
calypso-db convert data.csv | calypso-db parse - --indent 2
```

### Convert CSV to JSON

Convert CSV files to JSON format without validation:

```bash
# Basic conversion (compact output, Unicode preserved)
calypso-db convert data.csv

# Save to file
calypso-db convert data.csv -o output.json

# Formatted output with indentation
calypso-db convert data.csv --indent 2

# Escape non-ASCII characters
calypso-db convert data.csv --ascii
```

### Parse and Validate JSON

Parse JSON data, validate it, and organize it into structured output:

```bash
# Parse from file
calypso-db parse data.json

# Parse from stdin
cat data.json | calypso-db parse -
echo '[{"material_id":"test","phase_id":"phase1"}]' | calypso-db parse -

# Chain convert and parse
calypso-db convert data.csv | calypso-db parse -

# Parse in parallel with 8 worker processes
calypso-db parse data.json -j 8

# Error handling options
calypso-db parse data.json --on-error ignore  # Ignore errors
calypso-db parse data.json --on-error warn    # Show warnings (default)
calypso-db parse data.json --on-error raise   # Raise exceptions

# Save output
calypso-db parse data.json -o result.json

# Formatted output
calypso-db parse data.json --indent 2

# Quiet mode (suppress error messages)
calypso-db parse data.json --quiet
```

### Integration with Other Tools

```bash
# Use with jq for filtering
calypso-db convert data.csv | calypso-db parse - | jq '.material[0]'

# Save intermediate results
calypso-db convert data.csv | tee raw.json | calypso-db parse - -o parsed.json

# Format existing JSON
calypso-db parse compact.json --indent 2 -o formatted.json
```

### CLI Options

**Common options:**

- `--output, -o FILE`: Write output to file instead of stdout
- `--indent N`: Format JSON with N spaces of indentation (default: compact)
- `--ascii`: Escape non-ASCII characters (default: keep Unicode)
- `--version, -v`: Show version information

**Parse-specific options:**

- `--on-error {ignore,warn,raise}`: Error handling strategy (default: warn)
- `--quiet, -q`: Suppress error messages
- `-j, --jobs N`: Number of worker processes for parsing (default: 1)

## Output JSON Structure

The `parse` command outputs structured JSON with the data grouped by category:

```json
{
  "material": [...],      // Material properties and composition
  "symmetry": [...],      // Crystal symmetry information
  "xrd_exp": [...],       // Experimental XRD spectra
  "xrd_theory": [...],    // Theoretical XRD spectra
  "ir_exp": [...],        // Experimental IR spectra
  "ir_theory": [...],     // Theoretical IR spectra
  "raman_exp": [...],     // Experimental Raman spectra
  "raman_theory": [...]   // Theoretical Raman spectra
}
```

**Notes:**

- Each array contains one record per input material, in input order
- `null` entries indicate missing or invalid data
- Custom properties from `prop_*` columns appear in material records
- Spectrum metadata (custom keys from files) is preserved

## CSV Input Format

The CSV file should follow this structure:

```csv
material_id,phase_id,experimental,source,license,structure,prop_band_gap,prop_density,XRD_exp,XRD_theory,...
string,string,boolean,string,string,path,float,float,path,path,...
Material ID,Phase ID,Is experimental?,Data source,License,Structure file,Band gap (eV),Density (g/cm³),XRD exp file,XRD theory file,...
mat_001,phase_alpha,true,DFT,MIT,structures/mat_001.cif,2.5,3.2,spectra/xrd_exp.txt,,...
mat_002,phase_beta,false,Exp,CC-BY,structures/mat_002.cif,1.8,4.1,,spectra/xrd_theory.txt,...
```

**Key Points:**

- Row 1: Column headers
- Rows 2-3: Automatically skipped (units/types, descriptions, etc.)
- Row 4+: Data rows
- File path columns (`structure`, `XRD_*`, `IR_*`, `Raman_*`) are relative to the CSV file's location
- File contents are automatically embedded in the JSON output
- Custom property columns with the `prop_` prefix (e.g., `prop_band_gap`, `prop_density`) are stored in the Material model without the `prop_` prefix

### Supported Columns

| Column | Type | Description |
|--------|------|-------------|
| `material_id` | string | Material identifier (auto-generated if missing) |
| `phase_id` | string | Phase identifier |
| `experimental` | boolean | Experimental vs theoretical data flag |
| `source` | string | Data source |
| `license` | string | License information |
| `structure` | path | CIF format structure file |
| `prop_*` | any | Material properties (e.g., `prop_band_gap`) |
| `XRD_exp` | path | Experimental XRD spectrum |
| `XRD_theory` | path | Theoretical XRD spectrum |
| `IR_exp` | path | Experimental IR spectrum |
| `IR_theory` | path | Theoretical IR spectrum |
| `Raman_exp` | path | Experimental Raman spectrum |
| `Raman_theory` | path | Theoretical Raman spectrum |

## Spectrum File Format

Spectrum files should follow this format:

```
# Comments start with #, and key-value split by '='
# key1 = value1
# key2 = value2
# Data section: two columns (variable, intensity)
10.5 100.2
20.3 150.7
30.1 200.5
```

**Reserved Keys:**

- `variable`: X-axis data (automatically populated from data section)
- `intensity`: Y-axis data (automatically populated from data section)

---

## Python API

For advanced usage, you can use the Python API directly.

### Basic Usage

```python
from calypso_database.parsers import Parser

# Parse a single material record
data = {
    "material_id": "mat_001",
    "phase_id": "phase_alpha",
    "structure": "...",  # CIF format structure
    "XRD_exp": "...",  # XRD spectrum data
}

parser = Parser.from_dict(data, on_error="warn")
print(parser.to_json(indent=2))
```

### Batch Processing

```python
from calypso_database.csv_loader import csv2json
from calypso_database.parsers import BatchParser

# Convert CSV to JSON
json_str = csv2json("materials_data.csv")

# Parse all records
batch = BatchParser.from_json(json_str, on_error="warn")

# Parse all records in parallel
batch_parallel = BatchParser.from_json(json_str, on_error="warn", jobs=8)

# Get aggregated results
result = batch.to_dict()
# {
#     "material": [...],
#     "symmetry": [...],
#     "xrd_exp": [...],
#     ...
# }
```

### Error Handling

Control how parsing errors are handled:

```python
from calypso_database.parsers import Parser

# `data` is a single record dict, as in Basic Usage above

# Ignore all errors silently
parser = Parser.from_dict(data, on_error="ignore")

# Emit warnings but continue (default)
parser = Parser.from_dict(data, on_error="warn")

# Raise exception on first error
parser = Parser.from_dict(data, on_error="raise")
```

## Data Models

### Material

- `material_id`: Unique identifier (ULID)
- `phase_id`: Phase identifier
- `experimental`: Boolean flag
- `source`: Data source
- `license`: License information
- `formula`: Chemical formula
- `nsites`: Number of sites
- `elements`: List of elements
- `nelements`: Number of elements
- `volume`: Unit cell volume
- `density`: Density
- `prop_*`: Custom properties

### Symmetry

- `material_id`: Reference to material
- `crystal_system`: Crystal system
- `space_group_number`: Space group number
- `space_group_symbol`: Space group symbol
- `point_group`: Point group

### Spectrum (XRD, IR, Raman)

- `material_id`: Reference to material
- `variable`: X-axis data (2θ for XRD, wavenumber for IR/Raman)
- `intensity`: Y-axis data
- Custom metadata fields

## API Reference

### Parser

```python
Parser.from_dict(data: dict, on_error: Literal["ignore", "warn", "raise"] = "warn") -> Parser
```

Parse a single material record.

```python
parser.to_dict() -> dict
```

Convert to dictionary with all fields present (None if not available).

```python
parser.to_json(**kwargs) -> str
```

Convert to JSON string.

### BatchParser

```python
BatchParser.from_json(
    json_str: str,
    on_error: Literal["ignore", "warn", "raise"] = "warn",
    jobs: int = 1,
) -> BatchParser
```

Parse multiple material records from JSON string.

```python
batch.to_dict() -> dict
```

Aggregate all parsers into categorized lists.

```python
batch.to_json(**kwargs) -> str
```

Convert to JSON string.

### Utilities

```python
csv2json(fcsv: str | Path) -> str
```

Convert CSV file to JSON string for BatchParser processing.
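To make the layout in the Spectrum File Format section concrete, here is a minimal sketch of how such a file could be read with plain Python. It is not the library's own parser: `read_spectrum` is a hypothetical helper, the metadata keys in the sample (`instrument`, `wavelength`) are made up, and the rule that the reserved keys always come from the data section is an assumption based on the description above.

```python
from __future__ import annotations


def read_spectrum(text: str) -> dict:
    """Split spectrum text into metadata, `variable`, and `intensity`.

    Hypothetical helper for illustration; not part of calypso_database.
    """
    metadata: dict[str, str] = {}
    variable: list[float] = []
    intensity: list[float] = []

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Comment line: keep it only if it looks like "key = value".
            key, sep, value = line.lstrip("#").partition("=")
            if sep and key.strip():
                metadata[key.strip()] = value.strip()
            continue
        # Data line: first column is the variable, second the intensity.
        x, y = line.split()[:2]
        variable.append(float(x))
        intensity.append(float(y))

    # Assumption: reserved keys are filled from the data section and
    # override any metadata entry that happens to share their name.
    return {**metadata, "variable": variable, "intensity": intensity}


# Sample content with made-up metadata keys.
sample = """\
# instrument = demo
# wavelength = 1.5406
# Data section: two columns (variable, intensity)
10.5 100.2
20.3 150.7
30.1 200.5
"""

spectrum = read_spectrum(sample)
print(spectrum["variable"])
print(spectrum["instrument"])
```

Metadata values stay as strings in this sketch; whether the library converts them to numbers is not stated in this README.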
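The row layout from the CSV Input Format section (header in row 1, rows 2-3 skipped, data from row 4, `prop_` prefix dropped) can be illustrated with the standard library alone. This is only a sketch of the layout rules, not how `csv2json` is implemented: `read_rows` is a hypothetical name, cells are left as strings, and the embedding of referenced files is omitted.

```python
import csv
import io


def read_rows(csv_text: str) -> list[dict]:
    """Return the data rows as dicts keyed by column name.

    Hypothetical helper for illustration; not part of calypso_database.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]   # row 1: column headers
    data = rows[3:]    # rows 2-3 (types, descriptions) are skipped
    records = []
    for row in data:
        record = {}
        for name, cell in zip(header, row):
            # Custom property columns lose their `prop_` prefix.
            record[name.removeprefix("prop_")] = cell
        records.append(record)
    return records


sample_csv = """\
material_id,phase_id,experimental,prop_band_gap
string,string,boolean,float
Material ID,Phase ID,Is experimental?,Band gap (eV)
mat_001,phase_alpha,true,2.5
mat_002,phase_beta,false,1.8
"""

records = read_rows(sample_csv)
print(records[0])
```

For real data, `csv2json` remains the supported entry point, since it also resolves file path columns relative to the CSV file and embeds their contents.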
## Development

### Setup

```bash
# Clone the repository
git clone https://github.com/yourusername/calypso-database.git
cd calypso-database

# Install dependencies including test dependencies
uv sync --extra test

# The package is automatically installed in editable mode
```

### Running Tests

```bash
# Run all tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run with coverage report
uv run pytest --cov=calypso_database --cov-report=term-missing

# Run specific test file
uv run pytest tests/test_utils.py

# Run specific test
uv run pytest tests/test_utils.py::TestCsv2Json::test_simple_csv
```

### Code Quality

```bash
# Format code
uv run ruff format

# Lint code
uv run ruff check

# Fix linting issues
uv run ruff check --fix
```

### Distribution

Build wheel and source distribution packages:

```bash
# Build distribution packages
uv build
```

This creates files in the `dist/` directory:

- `calypso_database-*.whl` - Wheel package (binary distribution)
- `calypso_database-*.tar.gz` - Source distribution

Test the built wheel:

```bash
# Install from local wheel
uv tool install dist/calypso_database-*.whl

# Test the installed tool
calypso-db --version

# Uninstall
uv tool uninstall calypso-database
```

## Requirements

- Python >= 3.11
- pandas >= 3.0.2
- pymatgen >= 2026.3.23
- python-ulid >= 3.1.0
- typing-extensions >= 4.0.0

## License

[Add your license here]

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Citation

If you use this library in your research, please cite:

```
[Add citation information here]
```

## Contact

Xiaoshan Luo - luoxs@calypso.cn