LeMat-Bulk
LeMat-Bulk is a materials science and chemistry dataset offered in several configurations (compatible_pbe, compatible_pbesol, compatible_scan, non_compatible). Each record carries structural and chemical features such as elements, chemical formulas, lattice vectors, and energy properties. The dataset is intended to support materials science research, particularly work based on density functional theory (DFT) calculations, and its subsets are filtered for compatibility across DFT functionals and pseudopotentials. The dataset card also describes the methods used to ensure compatibility and to deduplicate entries. LeMat-Bulk is distributed under the CC-BY-4.0 license and can be downloaded with the Hugging Face datasets library for use in Python.
Description
LeMat‑Bulk Dataset Overview
Dataset Description
Configuration Information
- compatible_pbe:
  - Features:
    - elements: sequence[string]
    - nsites: int
    - chemical_formula_anonymous: string
    - chemical_formula_reduced: string
    - chemical_formula_descriptive: string
    - nelements: int
    - dimension_types: sequence[int]
    - nperiodic_dimensions: int
    - lattice_vectors: sequence[sequence[float]]
    - immutable_id: string
    - cartesian_site_positions: sequence[sequence[float]]
    - species: string
    - species_at_sites: sequence[string]
    - last_modified: string
    - elements_ratios: sequence[float]
    - stress_tensor: sequence[sequence[float]]
    - energy: float
    - magnetic_moments: sequence[float]
    - forces: sequence[sequence[float]]
    - total_magnetization: float
    - dos_ef: float
    - functional: string
    - cross_compatibility: bool
    - entalpic_fingerprint: string
  - Splits:
    - train: 5,335,299 samples, 8,043,765,194 bytes
  - Download Size: 3,036,919,717 bytes
  - Dataset Size: 8,043,765,194 bytes
- compatible_pbesol:
  - Features: same as compatible_pbe
  - Splits:
    - train: 447,824 samples, 646,300,349 bytes
  - Download Size: 230,878,194 bytes
  - Dataset Size: 646,300,349 bytes
- compatible_scan:
  - Features: same as compatible_pbe
  - Splits:
    - train: 422,840 samples, 597,846,818 bytes
  - Download Size: 207,887,396 bytes
  - Dataset Size: 597,846,818 bytes
- non_compatible:
  - Features: same as compatible_pbe
  - Splits:
    - train: 519,627 samples, 818,845,899 bytes
  - Download Size: 268,949,608 bytes
  - Dataset Size: 818,845,899 bytes
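Since the largest configuration is several gigabytes, streaming may be preferable to a full download. The following is a minimal sketch of loading one configuration with the Hugging Face `datasets` library. The repository id `LeMaterial/LeMat-Bulk` and the helper name `load_lemat` are assumptions, not stated on this page; the configuration names are the ones listed above.

```python
# Assumed repository id; this page does not state it explicitly.
REPO_ID = "LeMaterial/LeMat-Bulk"

# Configuration names as listed in the section above.
CONFIGS = (
    "compatible_pbe",
    "compatible_pbesol",
    "compatible_scan",
    "non_compatible",
)


def load_lemat(config="compatible_pbe", streaming=True):
    """Hypothetical helper: return the train split of one configuration.

    With streaming=True, records are fetched lazily instead of
    downloading the whole configuration first.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check above works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split="train", streaming=streaming)


# Example use (requires network access and the `datasets` package):
# first_record = next(iter(load_lemat("compatible_pbesol")))
# print(sorted(first_record))
```

Each configuration exposes only a train split, so the sketch fixes `split="train"`.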
Data Fields
| Feature Name | Data Type | Description | Optimade Required Field |
|---|---|---|---|
| elements | sequence[string] | List of elements in the structure | ✅ |
| nsites | int | Total number of sites in the structure | ✅ |
| chemical_formula_anonymous | string | Anonymous chemical formula | ✅ |
| chemical_formula_reduced | string | Reduced chemical formula | ✅ |
| chemical_formula_descriptive | string | Descriptive chemical formula | ✅ |
| nelements | int | Number of distinct elements in the structure | ✅ |
| dimension_types | sequence[int] | Periodic boundary condition types | ✅ |
| nperiodic_dimensions | int | Number of periodic dimensions | ✅ |
| lattice_vectors | sequence[sequence[float]] | Lattice vectors | ✅ |
| immutable_id | string | Material ID | ✅ |
| cartesian_site_positions | sequence[sequence[float]] | Cartesian site positions | ✅ |
| species | JSON | Species information | ✅ |
| species_at_sites | sequence[string] | Chemical element at each site | ✅ |
| last_modified | datetime | Last modification date | ✅ |
| elements_ratios | dict | Elemental composition ratios | ✅ |
| stress_tensor | sequence[sequence[float]] | Stress tensor | |
| energy | float | Uncorrected energy | |
| magnetic_moments | sequence[float] | Magnetic moment per site | |
| forces | sequence[sequence[float]] | Force per site | |
| total_magnetization | float | Total magnetization of the structure | |
| functional | string | Computational functional | |
| cross_compatibility | bool | Compatibility with other rows | |
| entalpic_fingerprint | string | Material fingerprint | |
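To illustrate how these fields relate to one another, here is a small sketch that checks a record for internal consistency and derives the cell volume from `lattice_vectors`. The record is hypothetical: its values are made up for illustration and are not taken from the dataset. The sketch also assumes that `species` is stored as a JSON-encoded string in the OPTIMADE species layout; `validate` and `cell_volume` are illustrative helper names.

```python
import json

# Hypothetical record; values are illustrative, not real dataset content.
record = {
    "elements": ["Cl", "Na"],
    "nelements": 2,
    "nsites": 2,
    "species_at_sites": ["Na", "Cl"],
    # Assumption: species is a JSON string in the OPTIMADE species layout.
    "species": json.dumps([
        {"name": "Na", "chemical_symbols": ["Na"], "concentration": [1.0]},
        {"name": "Cl", "chemical_symbols": ["Cl"], "concentration": [1.0]},
    ]),
    "lattice_vectors": [[0.0, 2.8, 2.8], [2.8, 0.0, 2.8], [2.8, 2.8, 0.0]],
    "cartesian_site_positions": [[0.0, 0.0, 0.0], [2.8, 2.8, 2.8]],
}


def cell_volume(lattice):
    """Volume of the cell as |a . (b x c)| for the three lattice vectors."""
    a, b, c = lattice
    cross = [
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    ]
    return abs(sum(x * y for x, y in zip(a, cross)))


def validate(rec):
    """Return a list of consistency problems (empty if none were found)."""
    problems = []
    if rec["nelements"] != len(rec["elements"]):
        problems.append("nelements does not match elements")
    if rec["nsites"] != len(rec["cartesian_site_positions"]):
        problems.append("nsites does not match cartesian_site_positions")
    if rec["nsites"] != len(rec["species_at_sites"]):
        problems.append("nsites does not match species_at_sites")
    species_names = {s["name"] for s in json.loads(rec["species"])}
    if not set(rec["species_at_sites"]) <= species_names:
        problems.append("species_at_sites refers to an unknown species")
    return problems


print(validate(record))
print(cell_volume(record["lattice_vectors"]))
```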
Available Subsets
- Compatible, PBE (default): Rows filtered for DFT compatibility, containing only PBE records.
- Compatible, PBEsol: Contains only PBEsol data.
- Compatible, SCAN: Contains only SCAN data.
- All: All records.
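As a rough sketch of how the `functional` and `cross_compatibility` columns could be used to select rows in the spirit of these subsets: the lowercase label `"pbe"` and the helper `select_rows` are assumptions, the rows are hypothetical, and this is not necessarily how the published subsets were built.

```python
def select_rows(rows, functional="pbe", require_compatible=True):
    """Keep rows computed with one functional, optionally only compatible ones.

    Assumption: `functional` holds a short label such as "pbe"; the
    comparison is case-insensitive to be tolerant of other spellings.
    """
    selected = []
    for row in rows:
        if row["functional"].lower() != functional.lower():
            continue
        if require_compatible and not row["cross_compatibility"]:
            continue
        selected.append(row)
    return selected


# Hypothetical rows for illustration only.
rows = [
    {"immutable_id": "a", "functional": "pbe", "cross_compatibility": True},
    {"immutable_id": "b", "functional": "pbe", "cross_compatibility": False},
    {"immutable_id": "c", "functional": "scan", "cross_compatibility": True},
]

print([r["immutable_id"] for r in select_rows(rows)])
```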
Database Statistics
| Database | Number of Materials | Number of Structures |
|---|---|---|
| Materials Project | 148,453 | 189,403 |
| Alexandria | 4,635,066 | 5,459,260 |
| OQMD | 1,076,926 | 1,076,926 |
| LeMaterial (All) | 5,860,446 | 6,725,590 |
| LeMaterial (Compatible, PBE) | 5,335,299 | 5,335,299 |
| LeMaterial (Compatible, PBEsol) | 447,824 | 447,824 |
| LeMaterial (Compatible, SCAN) | 422,840 | 422,840 |
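The train split sizes listed under Configuration Information can be cross-checked against this table: the four configurations together add up to the structure count of the "LeMaterial (All)" row. A short arithmetic sketch, using only numbers quoted on this page:

```python
# Train split row counts from the Configuration Information section.
config_rows = {
    "compatible_pbe": 5_335_299,
    "compatible_pbesol": 447_824,
    "compatible_scan": 422_840,
    "non_compatible": 519_627,
}

# "LeMaterial (All)" row, Number of Structures column.
all_structures = 6_725_590

total_rows = sum(config_rows.values())
print(total_rows == all_structures)  # True

# Share of each configuration in the full set of structures.
shares = {name: rows / total_rows for name, rows in config_rows.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```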
Methods
Compatibility Compliance
- Pseudopotentials: Ensure consistent pseudopotentials are used.
- Hubbard U Parameters: Exclude records containing specific elements.
- Spin Polarization: Exclude non‑spin‑polarized calculations.
- Convergence Criteria: No records were excluded based on convergence settings.
- Energy Above Convex Hull: High‑energy materials were not filtered.
Deduplication Method
- Compute bonds using the EconNN algorithm.
- Build a structure graph and hash it with the Weisfeiler‑Lehman algorithm.
- Add symmetry and composition information.
- Remove duplicate structures, keeping only the lowest‑energy entry.
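The steps above can be sketched in simplified form as follows. This is not the authors' implementation: the `bonds` list is a hypothetical stand-in for the output of the bonding step, `space_group` is a hypothetical stand-in for the symmetry information, periodic images are ignored, and the hash is a plain Weisfeiler-Lehman style relabeling written with `hashlib`.

```python
import hashlib


def _digest(text):
    """Short, stable hash of a string."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=8).hexdigest()


def wl_graph_hash(node_labels, edges, iterations=3):
    """Weisfeiler-Lehman style hash of an undirected labeled graph.

    node_labels: one label per site (here, the element symbol).
    edges: iterable of (i, j) site-index pairs (the bonds).
    """
    neighbors = {i: [] for i in range(len(node_labels))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    labels = list(node_labels)
    for _ in range(iterations):
        # Relabel every node from its own label plus its sorted neighbor labels.
        labels = [
            _digest(labels[i] + "|" + ",".join(sorted(labels[j] for j in neighbors[i])))
            for i in range(len(labels))
        ]
    # Sorting makes the result independent of site ordering.
    return _digest(",".join(sorted(labels)))


def fingerprint(entry):
    """Combine the graph hash with symmetry and composition information."""
    graph_hash = wl_graph_hash(entry["species_at_sites"], entry["bonds"])
    return f"{graph_hash}_{entry['space_group']}_{entry['chemical_formula_reduced']}"


def deduplicate(entries):
    """Keep only the lowest-energy entry for each fingerprint."""
    best = {}
    for entry in entries:
        key = fingerprint(entry)
        if key not in best or entry["energy"] < best[key]["energy"]:
            best[key] = entry
    return list(best.values())


# Hypothetical entries; the first two describe the same graph with sites reordered.
entries = [
    {"species_at_sites": ["Na", "Cl"], "bonds": [(0, 1)], "space_group": 225,
     "chemical_formula_reduced": "ClNa", "energy": -6.8},
    {"species_at_sites": ["Cl", "Na"], "bonds": [(1, 0)], "space_group": 225,
     "chemical_formula_reduced": "ClNa", "energy": -7.0},
    {"species_at_sites": ["Na", "Na"], "bonds": [(0, 1)], "space_group": 229,
     "chemical_formula_reduced": "Na", "energy": -2.6},
]

unique = deduplicate(entries)
print(len(unique), [e["energy"] for e in unique])
```

Because the final hash is taken over the sorted multiset of node labels, reordering the sites of a structure does not change its fingerprint, which is the property the deduplication step relies on.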
Future Updates
- Planned release of band gap information for all materials.
- Unified energy corrections.
- Publication of Bader charges.
Source
Organization: huggingface
Created: 12/7/2024