LeMat-Bulk
LeMat-Bulk is a materials science and chemistry dataset of structures with properties computed by density functional theory (DFT). It ships in four configurations (compatible_pbe, compatible_pbesol, compatible_scan, and non_compatible), and each row records structural features such as elements, chemical formulas, lattice vectors, and site positions alongside computed properties such as energy, forces, and stress. The compatible configurations are filtered so that entries computed with a given DFT functional and consistent pseudopotentials can be compared with one another. The dataset card also describes how compatibility is enforced and how duplicate entries are removed. The dataset is distributed under the CC-BY-4.0 license and can be loaded in Python with the Hugging Face datasets library.
LeMat‑Bulk Dataset Overview
Dataset Description
Configuration Information
- compatible_pbe:
  - Features:
    - elements: sequence[string]
    - nsites: int
    - chemical_formula_anonymous: string
    - chemical_formula_reduced: string
    - chemical_formula_descriptive: string
    - nelements: int
    - dimension_types: sequence[int]
    - nperiodic_dimensions: int
    - lattice_vectors: sequence[sequence[float]]
    - immutable_id: string
    - cartesian_site_positions: sequence[sequence[float]]
    - species: string
    - species_at_sites: sequence[string]
    - last_modified: string
    - elements_ratios: sequence[float]
    - stress_tensor: sequence[sequence[float]]
    - energy: float
    - magnetic_moments: sequence[float]
    - forces: sequence[sequence[float]]
    - total_magnetization: float
    - dos_ef: float
    - functional: string
    - cross_compatibility: bool
    - entalpic_fingerprint: string
  - Splits:
    - train: 5,335,299 samples, 8,043,765,194 bytes
  - Download Size: 3,036,919,717 bytes
  - Dataset Size: 8,043,765,194 bytes
- compatible_pbesol:
  - Features: same as above
  - Splits:
    - train: 447,824 samples, 646,300,349 bytes
  - Download Size: 230,878,194 bytes
  - Dataset Size: 646,300,349 bytes
- compatible_scan:
  - Features: same as above
  - Splits:
    - train: 422,840 samples, 597,846,818 bytes
  - Download Size: 207,887,396 bytes
  - Dataset Size: 597,846,818 bytes
- non_compatible:
  - Features: same as above
  - Splits:
    - train: 519,627 samples, 818,845,899 bytes
  - Download Size: 268,949,608 bytes
  - Dataset Size: 818,845,899 bytes
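As a minimal sketch of how one of these configurations might be loaded with the Hugging Face `datasets` library: the configuration names come from the list above, but the repository id `LeMaterial/LeMat-Bulk` is an assumption (this card does not state it), and `load_subset` is a hypothetical helper, not part of any library.

```python
"""Sketch: load one LeMat-Bulk configuration with Hugging Face `datasets`."""

# Configuration names as listed in this card.
CONFIGS = ("compatible_pbe", "compatible_pbesol", "compatible_scan", "non_compatible")

# Assumption: the Hub repository id is not given in this card.
REPO_ID = "LeMaterial/LeMat-Bulk"


def load_subset(config="compatible_pbe", streaming=True):
    """Return the `train` split of one configuration (hypothetical helper).

    With streaming=True, rows are fetched lazily instead of downloading
    the whole configuration first.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split="train", streaming=streaming)


# Example usage (requires network access and the `datasets` package):
#   rows = load_subset("compatible_scan")
#   first_row = next(iter(rows))
#   print(first_row["chemical_formula_reduced"], first_row["energy"])
```

Streaming is chosen as the default here because compatible_pbe alone is about 3 GB to download; pass `streaming=False` to materialize a configuration on disk instead.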
Data Fields
| Feature Name | Data Type | Description | OPTIMADE Required Field |
|---|---|---|---|
| elements | sequence[string] | List of elements in the structure | ✅ |
| nsites | int | Total number of sites in the structure | ✅ |
| chemical_formula_anonymous | string | Anonymous chemical formula | ✅ |
| chemical_formula_reduced | string | Reduced chemical formula | ✅ |
| chemical_formula_descriptive | string | Descriptive chemical formula | ✅ |
| nelements | int | Number of distinct elements in the structure | ✅ |
| dimension_types | sequence[int] | Periodic boundary condition types | ✅ |
| nperiodic_dimensions | int | Number of periodic dimensions | ✅ |
| lattice_vectors | sequence[sequence[float]] | Lattice vectors | ✅ |
| immutable_id | string | Material ID | ✅ |
| cartesian_site_positions | sequence[sequence[float]] | Cartesian site positions | ✅ |
| species | JSON | Species information | ✅ |
| species_at_sites | sequence[string] | Chemical element at each site | ✅ |
| last_modified | datetime | Last modification date | ✅ |
| elements_ratios | dict | Elemental composition ratios | ✅ |
| stress_tensor | sequence[sequence[float]] | Stress tensor | |
| energy | float | Uncorrected energy | |
| magnetic_moments | sequence[float] | Magnetic moment per site | |
| forces | sequence[sequence[float]] | Force per site | |
| total_magnetization | float | Total magnetization of the structure | |
| dos_ef | float | Density of states at the Fermi level | |
| functional | string | Computational functional | |
| cross_compatibility | bool | Compatibility with other rows | |
| entalpic_fingerprint | string | Material fingerprint | |
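Several of these fields are redundant with one another, which allows a cheap sanity check on a row. The sketch below assumes the usual OPTIMADE relationships (one position and one species entry per site, `nperiodic_dimensions` equal to the sum of `dimension_types`); `check_record` and the toy row are hypothetical and the numbers in the row are illustrative, not taken from the dataset.

```python
def check_record(rec):
    """Return a list of human-readable inconsistencies found in one row."""
    problems = []
    if rec["nsites"] != len(rec["cartesian_site_positions"]):
        problems.append("nsites does not match cartesian_site_positions")
    if rec["nsites"] != len(rec["species_at_sites"]):
        problems.append("nsites does not match species_at_sites")
    if rec["nelements"] != len(rec["elements"]):
        problems.append("nelements does not match elements")
    if rec["nperiodic_dimensions"] != sum(rec["dimension_types"]):
        problems.append("nperiodic_dimensions does not match dimension_types")
    vectors = rec["lattice_vectors"]
    if len(vectors) != 3 or any(len(v) != 3 for v in vectors):
        problems.append("lattice_vectors is not a 3x3 matrix")
    return problems


# Toy two-atom row with made-up geometry, for illustration only.
toy_row = {
    "elements": ["Cl", "Na"],
    "nelements": 2,
    "nsites": 2,
    "species_at_sites": ["Na", "Cl"],
    "cartesian_site_positions": [[0.0, 0.0, 0.0], [2.8, 2.8, 2.8]],
    "lattice_vectors": [[0.0, 2.8, 2.8], [2.8, 0.0, 2.8], [2.8, 2.8, 0.0]],
    "dimension_types": [1, 1, 1],
    "nperiodic_dimensions": 3,
}

print(check_record(toy_row))                   # consistent row: empty list
print(check_record({**toy_row, "nsites": 3}))  # two site-count mismatches
```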
Available Subsets
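- Compatible, PBE (default): Rows that pass the DFT compatibility filters and were computed with the PBE functional.
- Compatible, PBESol: Compatible rows computed with the PBESol functional.
- Compatible, SCAN: Compatible rows computed with the SCAN functional.
- All: All records, including those that do not pass the compatibility filters.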
Database Statistics
| Database | Number of Materials | Number of Structures |
|---|---|---|
| Materials Project | 148,453 | 189,403 |
| Alexandria | 4,635,066 | 5,459,260 |
| OQMD | 1,076,926 | 1,076,926 |
| LeMaterial (All) | 5,860,446 | 6,725,590 |
| LeMaterial (Compatible, PBE) | 5,335,299 | 5,335,299 |
| LeMaterial (Compatible, PBESol) | 447,824 | 447,824 |
| LeMaterial (Compatible, SCAN) | 422,840 | 422,840 |
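The figures in this table can be cross-checked against the per-configuration sample counts listed under Configuration Information. The short sketch below uses only numbers from this card: the four configurations add up exactly to the "LeMaterial (All)" structure count, while the three source-database rows add up to one structure fewer, a discrepancy the card does not explain.

```python
# Sample counts of the four configurations, copied from this card.
config_samples = {
    "compatible_pbe": 5_335_299,
    "compatible_pbesol": 447_824,
    "compatible_scan": 422_840,
    "non_compatible": 519_627,
}

# "Number of Structures" column of the statistics table.
source_structures = {
    "Materials Project": 189_403,
    "Alexandria": 5_459_260,
    "OQMD": 1_076_926,
}
all_structures = 6_725_590  # LeMaterial (All)

config_total = sum(config_samples.values())
source_total = sum(source_structures.values())

print(config_total == all_structures)  # True
print(all_structures - source_total)   # 1
```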
Methods
Compatibility Compliance
- Pseudopotentials: Records must use consistent pseudopotentials.
- Hubbard U Parameters: Records containing specific elements are excluded so that Hubbard U settings stay uniform.
- Spin Polarization: Non-spin-polarized calculations are excluded.
- Convergence Criteria: No records were excluded based on convergence settings.
- Energy Above Convex Hull: High-energy materials were not filtered out.
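The two exclusion rules above can be pictured as a simple row predicate. This is only a sketch under stated assumptions: the card does not list the excluded elements, so the set is passed in by the caller, the rule is simplified to "any excluded element present", and the `spin_polarized` key is hypothetical because no such field appears in the schema.

```python
def passes_compatibility_filters(record, excluded_elements, require_spin_polarized=True):
    """Return True if a row survives the element and spin-polarization rules.

    excluded_elements: element symbols to reject (supplied by the caller;
        the actual list used for LeMat-Bulk is not given in this card).
    record["spin_polarized"]: hypothetical flag, not a LeMat-Bulk field.
    """
    if set(excluded_elements) & set(record["elements"]):
        return False
    if require_spin_polarized and not record.get("spin_polarized", False):
        return False
    return True


# Placeholder exclusion set for illustration, NOT the dataset's actual list.
EXAMPLE_EXCLUDED = frozenset({"Fe"})

rows = [
    {"immutable_id": "row-1", "elements": ["Cl", "Na"], "spin_polarized": True},
    {"immutable_id": "row-2", "elements": ["Fe", "O"], "spin_polarized": True},
    {"immutable_id": "row-3", "elements": ["Al", "Si"], "spin_polarized": False},
]

kept = [r["immutable_id"] for r in rows
        if passes_compatibility_filters(r, EXAMPLE_EXCLUDED)]
print(kept)  # ['row-1']
```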
Deduplication Method
- Compute bonds using the EconNN algorithm.
- Build a structure graph and hash it with the Weisfeiler‑Lehman algorithm.
- Add symmetry and composition information.
- Remove duplicate structures, keeping only the lowest‑energy entry.
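The steps above can be sketched in plain Python. This is an illustration under assumptions, not the pipeline used to build the dataset: bonds are supplied by hand rather than computed with EconNN, the Weisfeiler-Lehman hash is a small hand-rolled version, periodic images are ignored, and the fingerprint layout and the `space_group`, `labels`, and `bonds` keys are hypothetical.

```python
import hashlib
from collections import defaultdict


def _digest(text):
    """Short, stable hex digest of a string."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=8).hexdigest()


def wl_graph_hash(node_labels, edges, iterations=3):
    """Weisfeiler-Lehman style hash of an undirected, node-labelled graph."""
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    labels = dict(node_labels)
    for _ in range(iterations):
        # Each node's new label combines its own label with the sorted
        # labels of its neighbours from the previous round.
        labels = {
            node: _digest(label + "|" + ",".join(sorted(labels[n] for n in neighbors[node])))
            for node, label in labels.items()
        }
    # The multiset of final labels does not depend on node numbering.
    return _digest(",".join(sorted(labels.values())))


def fingerprint(rec):
    """Graph hash plus symmetry and composition (hypothetical layout)."""
    graph = wl_graph_hash(rec["labels"], rec["bonds"])
    return f"{graph}_{rec['space_group']}_{rec['chemical_formula_reduced']}"


def deduplicate(records):
    """Keep the lowest-energy record for each fingerprint."""
    best = {}
    for rec in records:
        key = fingerprint(rec)
        if key not in best or rec["energy"] < best[key]["energy"]:
            best[key] = rec
    return list(best.values())


# Toy records with made-up energies; "a" and "b" describe the same graph
# with different node numbering, "c" is a different material.
records = [
    {"id": "a", "labels": {0: "Na", 1: "Cl"}, "bonds": [(0, 1)],
     "space_group": 225, "chemical_formula_reduced": "ClNa", "energy": -6.8},
    {"id": "b", "labels": {"x": "Cl", "y": "Na"}, "bonds": [("x", "y")],
     "space_group": 225, "chemical_formula_reduced": "ClNa", "energy": -6.9},
    {"id": "c", "labels": {0: "Na", 1: "Na"}, "bonds": [(0, 1)],
     "space_group": 229, "chemical_formula_reduced": "Na", "energy": -2.6},
]

kept_ids = sorted(rec["id"] for rec in deduplicate(records))
print(kept_ids)  # ['b', 'c']
```

Hashing the sorted multiset of refined labels is what makes the result independent of atom ordering, so the same structure exported by two databases maps to the same key.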
Future Updates
- Planned release of band gap information for all materials.
- Unified energy corrections.
- Publication of Bader charges.