matbench_genmetrics.mp_time_split package
Subpackages
Submodules
matbench_genmetrics.mp_time_split.splitter module
Core functionality for Materials Project time-based train/test splitting
- class matbench_genmetrics.mp_time_split.splitter.MPTimeSplit(num_sites: Tuple[int, int] | None = None, elements: List[str] | None = None, exclude_elements: List[str] | Literal['noble', 'radioactive', 'noble+radioactive'] | None = None, use_theoretical: bool = False, mode: str = 'TimeSeriesSplit', target: str = 'energy_above_hull', save_dir=None)[source]
Bases: object
- fetch_data(one_by_one=False)[source]
Fetch data directly from Materials Project and split into train/test sets.
- Parameters:
one_by_one (bool, optional) – Whether to retrieve data one-by-one instead of in bulk. This is useful when provenance attributes are needed. By default False.
- Returns:
df – Dataframe of Materials Project data containing structure and target columns. structure is of type pymatgen.core.structure.Structure.
- Return type:
pd.DataFrame
- Raises:
ImportError – Failed to import fetch_data(). Try pip install mp_time_split[api] or pip install mp-api to install the optional mp-api dependency. Note that this requires Python >=3.8
ValueError – self.data is not a pd.DataFrame
Examples
>>> mpts = MPTimeSplit()
>>> mpts.fetch_data()
- get_final_test_data()[source]
The ‘for real life’ test split, i.e., what gets touched only once before submitting a manuscript.
- get_train_and_val_data(fold)[source]
Get training and validation data for a given fold.
- Parameters:
fold (int) – The cross-validation fold to get the data for.
- Returns:
Input training data, input validation data, output training data, and output validation data. Note that the input data is a pymatgen.core.structure.Structure object.
- Return type:
pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame
- Raises:
NameError – fetch_data() or load() must be run first.
ValueError – fold={fold} should be one of {FOLDS}
Examples
>>> mpts = MPTimeSplit()
>>> mpts.load()
>>> mpts.get_train_and_val_data(0)
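The fold behaviour described for get_train_and_val_data() can be pictured with a small standard-library sketch. It assumes that the default mode='TimeSeriesSplit' means expanding-window folds in the style of scikit-learn's TimeSeriesSplit; this page does not confirm that. The names time_series_folds and get_fold, the FOLDS values, and the toy years are hypothetical and are not part of the package.

```python
from typing import List, Sequence, Tuple

FOLDS = [0, 1, 2, 3, 4]  # hypothetical fold indices, for illustration only


def time_series_folds(
    years: Sequence[int], n_splits: int = 5
) -> List[Tuple[List[int], List[int]]]:
    """Return (train_idx, val_idx) pairs over records sorted by year.

    Each fold trains on everything older than its validation block, so no
    future information leaks into training.
    """
    order = sorted(range(len(years)), key=lambda i: years[i])
    n = len(order)
    block = n // (n_splits + 1)
    if block == 0:
        raise ValueError("not enough records for the requested number of splits")
    folds = []
    for k in range(n_splits):
        # The training window grows with each fold; validation is the next block.
        val_start = n - (n_splits - k) * block
        folds.append((order[:val_start], order[val_start : val_start + block]))
    return folds


def get_fold(folds, fold: int):
    """Look up one fold, rejecting indices outside FOLDS."""
    if fold not in FOLDS:
        raise ValueError(f"fold={fold} should be one of {FOLDS}")
    return folds[fold]


# Toy years standing in for real first-report dates (assumed data).
years = [2001, 1995, 2010, 1988, 2015, 2005, 1999, 2018, 2012, 2008, 1992, 2020]
folds = time_series_folds(years, n_splits=5)
train_idx, val_idx = get_fold(folds, 0)
```

With these toy years, fold 0 trains on the two oldest records and validates on the next two, and every fold keeps its training years earlier than its validation years.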
- load(url=None, checksum=None, dummy=False, force_download=False)[source]
Load data from an existing snapshot.
- Parameters:
url (str, optional) – URL to download the data from, by default None
checksum (str, optional) – Checksum to ensure the validity of the file, by default None
dummy (bool, optional) – Whether to load a dummy snapshot or not, by default False
force_download (bool, optional) – Whether to force download, regardless of whether the data has already been downloaded, by default False
- Returns:
DataFrame of Materials Project data containing structure and target columns. structure is of type pymatgen.core.structure.Structure.
- Return type:
pd.DataFrame
- Raises:
ValueError – url should not be None at this point. url: {url}, type: {type(url)}
ValueError – checksum from {url} ({checksum}) does not match what was expected {checksum_frozen})
Examples
>>> mpts = MPTimeSplit()
>>> mpts.load(url=None, checksum=None, dummy=False, force_download=False)
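The checksum error listed for load() implies that a downloaded snapshot is compared against a stored reference value. A minimal sketch of that kind of check follows. The hash algorithm (SHA-256 here) and the helper names file_checksum and verify_snapshot are assumptions made for illustration; the package may use a different algorithm and structure.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, algorithm: str = "sha256") -> str:
    """Hash a file in chunks so large snapshots need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_snapshot(path: str, checksum_frozen: str, algorithm: str = "sha256") -> None:
    """Raise ValueError when the file's checksum differs from the expected one."""
    checksum = file_checksum(path, algorithm)
    if checksum != checksum_frozen:
        raise ValueError(
            f"checksum from {path} ({checksum}) does not match "
            f"what was expected ({checksum_frozen})"
        )


# Demonstrate on a throwaway file rather than a real snapshot.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"snapshot bytes")
    snapshot_path = tmp.name

expected = hashlib.sha256(b"snapshot bytes").hexdigest()
verify_snapshot(snapshot_path, expected)  # a matching checksum raises nothing
os.remove(snapshot_path)
```

Reading in fixed-size chunks keeps memory use flat regardless of snapshot size, which matters when the file holds many serialized structures.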
- matbench_genmetrics.mp_time_split.splitter.get_data_home(data_home=None)[source]
Selects the home directory to look for datasets. If the specified home directory doesn't exist, the directory structure is built.
Modified from source: https://github.com/hackingmaterials/matminer/blob/76a529b769055c729d62f11a419d319d8e2f838e/matminer/datasets/utils.py#L26-L43
- Parameters:
data_home (str) – Folder to look in; if None, a default is selected.
- Return type:
str
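The resolve-or-create behaviour described for get_data_home() can be sketched with the standard library alone. The function name resolve_data_home and the default folder ".example_datasets" are hypothetical; the default location the package actually chooses is not stated on this page.

```python
import os
import tempfile
from typing import Optional


def resolve_data_home(data_home: Optional[str] = None) -> str:
    """Return a dataset directory, creating it when it is missing."""
    if data_home is None:
        # Hypothetical default; the package may choose a different location.
        data_home = os.path.join(os.path.expanduser("~"), ".example_datasets")
    data_home = os.path.abspath(os.path.expanduser(data_home))
    os.makedirs(data_home, exist_ok=True)  # also builds intermediate folders
    return data_home


# Use a temporary parent so the demonstration leaves the home directory alone.
with tempfile.TemporaryDirectory() as parent:
    target = os.path.join(parent, "nested", "datasets")
    path = resolve_data_home(target)
    created = os.path.isdir(path)
```

Passing exist_ok=True makes the call safe to repeat, so callers do not need to check for the folder first.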
- matbench_genmetrics.mp_time_split.splitter.main(args)[source]
Wrapper allowing calls with string arguments in a CLI fashion. The result is printed to stdout in a nicely formatted message.
- Parameters:
args (List[str]) – Command line parameters as a list of strings (for example ["--verbose", "./data"]).
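The pattern of passing a list of strings to main(), which in turn hands it to a parser, can be sketched with argparse. This is an illustration only: the save_dir positional, the --verbose flag, and the message format are hypothetical, and the use of argparse itself is an assumption, since the real options accepted by the package's CLI are not listed on this page.

```python
import argparse
from typing import List


def parse_args(args: List[str]) -> argparse.Namespace:
    """Parse command line parameters given as a list of strings."""
    parser = argparse.ArgumentParser(description="Illustrative CLI")
    # Both options below are hypothetical and exist only for this sketch.
    parser.add_argument("save_dir", nargs="?", default=".", help="output folder")
    parser.add_argument("-v", "--verbose", action="store_true", help="print more")
    return parser.parse_args(args)


def main(args: List[str]) -> str:
    """Build and print a message from string arguments, CLI style."""
    ns = parse_args(args)
    message = f"save_dir={ns.save_dir} verbose={ns.verbose}"
    print(message)
    return message


result = main(["--verbose", "./data"])
```

Accepting the argument list explicitly, rather than reading sys.argv inside the function, makes the entry point easy to call from tests or other Python code.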
- matbench_genmetrics.mp_time_split.splitter.parse_args(args)[source]
Parse command line parameters.
- Parameters:
args – Command line parameters as a list of strings (for example ["--help"]).
- Returns:
command line parameters namespace
Module contents
Create Materials Project time-based cross-validation splits