matbench_genmetrics.mp_time_split package

Subpackages

Submodules

matbench_genmetrics.mp_time_split.splitter module

Core functionality for Materials Project time-based train/test splitting

class matbench_genmetrics.mp_time_split.splitter.MPTimeSplit(num_sites: Tuple[int, int] | None = None, elements: List[str] | None = None, exclude_elements: List[str] | Literal['noble', 'radioactive', 'noble+radioactive'] | None = None, use_theoretical: bool = False, mode: str = 'TimeSeriesSplit', target: str = 'energy_above_hull', save_dir=None)[source]

Bases: object

fetch_data(one_by_one=False)[source]

Fetch data directly from Materials Project and split into train/test sets.

Parameters:

one_by_one (bool, optional) – Whether to retrieve data one-by-one instead of in bulk, which can be useful when provenance attributes are needed. By default False.

Returns:

df – Dataframe of Materials Project data containing structure and target columns. structure is of type pymatgen.core.structure.Structure.

Return type:

pd.DataFrame

Raises:
  • ImportError – Failed to import fetch_data(). Try pip install mp_time_split[api] or pip install mp-api to install the optional mp-api dependency. Note that this requires Python >=3.8

  • ValueError – self.data is not a pd.DataFrame
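
The ImportError above follows a common pattern for optional dependencies: attempt the import, and on failure re-raise with an installation hint. The following is a minimal sketch of that pattern, not the package's actual code. The helper name require_optional is hypothetical, and a standard-library module stands in for mp-api so the snippet runs without extra installs.

```python
import importlib


def require_optional(module_name, install_hint):
    """Import an optional dependency, or raise ImportError with an install hint.

    Hypothetical helper for illustration; not part of mp_time_split.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"Failed to import {module_name!r}. Try `{install_hint}`."
        ) from err


# A stdlib module stands in for the optional dependency in this demo.
json_module = require_optional("json", "pip install mp-api")
```

Chaining with `from err` keeps the original traceback, so the underlying import failure stays visible next to the hint.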

Examples

>>> mpts = MPTimeSplit()
>>> mpts.fetch_data()

get_final_test_data()[source]

Get the final (‘for real life’) test split, i.e., the data that should be touched only once, just before submitting a manuscript.

get_train_and_val_data(fold)[source]

Get training and validation data for a given fold.

Parameters:

fold (int) – The cross-validation fold to get the data for.

Returns:

Input training data, input validation data, output training data, and output validation data. Note that the input data consists of pymatgen.core.structure.Structure objects.

Return type:

pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame

Raises:
  • NameError – fetch_data() or load() must be run first.

  • ValueError – fold={fold} should be one of {FOLDS}
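
To illustrate what a fold means under the default mode='TimeSeriesSplit', here is a self-contained sketch of expanding-window splitting. It follows the index arithmetic of scikit-learn's TimeSeriesSplit with no gap. The records, the date field, and the number of splits are made up for the example; which date MPTimeSplit sorts by and how many folds it uses are not stated in this reference.

```python
from datetime import date

# Hypothetical records: (material_id, date first reported). Not real data.
records = [
    ("m3", date(2005, 1, 1)),
    ("m1", date(1990, 6, 1)),
    ("m5", date(2015, 3, 1)),
    ("m2", date(1998, 9, 1)),
    ("m6", date(2019, 7, 1)),
    ("m4", date(2010, 2, 1)),
]


def time_series_folds(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs with an expanding training window."""
    val_size = n_samples // (n_splits + 1)
    if val_size == 0:
        raise ValueError("Too few samples for the requested number of splits.")
    first_val_start = n_samples - n_splits * val_size
    for start in range(first_val_start, n_samples, val_size):
        yield list(range(start)), list(range(start, start + val_size))


# Sort chronologically so every validation item is newer than its training items.
ordered = sorted(records, key=lambda rec: rec[1])
folds = list(time_series_folds(len(ordered), n_splits=2))

for fold, (train_idx, val_idx) in enumerate(folds):
    train_ids = [ordered[i][0] for i in train_idx]
    val_ids = [ordered[i][0] for i in val_idx]
    print(fold, train_ids, val_ids)
# 0 ['m1', 'm2'] ['m3', 'm4']
# 1 ['m1', 'm2', 'm3', 'm4'] ['m5', 'm6']
```

Each later fold trains on everything that came before its validation window, so no fold ever validates on data older than its training set.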

Examples

>>> mpts = MPTimeSplit()
>>> mpts.get_train_and_val_data(0)

load(url=None, checksum=None, dummy=False, force_download=False)[source]

Load data from an existing snapshot.

Parameters:
  • url (str, optional) – URL to download the data from, by default None

  • checksum (str, optional) – Checksum to ensure the validity of the file, by default None

  • dummy (bool, optional) – Whether to load a dummy snapshot or not, by default False

  • force_download (bool, optional) – Whether to force download, regardless of whether the data has already been downloaded, by default False

Returns:

DataFrame of Materials Project data containing structure and target columns. structure is of type pymatgen.core.structure.Structure.

Return type:

pd.DataFrame

Raises:
  • ValueError – url should not be None at this point. url: {url}, type: {type(url)}

  • ValueError – checksum from {url} ({checksum}) does not match what was expected {checksum_frozen})
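
The checksum comparison behind the second ValueError can be sketched as below. This is an illustration under assumptions, not the package's implementation: the function names are hypothetical, and SHA-256 is chosen only for the demo, since this reference does not say which hash algorithm load() uses.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=8192):
    """Hash a file in chunks so large snapshots need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Return the file's checksum, or raise ValueError if it differs from expected."""
    actual = file_checksum(path, algorithm)
    if actual != expected:
        raise ValueError(
            f"checksum ({actual}) does not match what was expected ({expected})"
        )
    return actual


# Demo on a throwaway file standing in for a downloaded snapshot.
with tempfile.TemporaryDirectory() as tmp:
    snapshot = os.path.join(tmp, "snapshot.bin")
    with open(snapshot, "wb") as handle:
        handle.write(b"hello")
    digest = verify_checksum(snapshot, hashlib.sha256(b"hello").hexdigest())
```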

Examples

>>> mpts = MPTimeSplit()
>>> mpts.load(url=None, checksum=None, dummy=False, force_download=False)

matbench_genmetrics.mp_time_split.splitter.get_data_home(data_home=None)[source]

Select the home directory in which to look for datasets. If the specified directory does not exist, the directory structure is built.

Modified from source: https://github.com/hackingmaterials/matminer/blob/76a529b769055c729d62f11a419d319d8e2f838e/matminer/datasets/utils.py#L26-L43

Parameters:

data_home (str) – Folder to look in. If None, a default is selected.

Return type:

str
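
The behaviour described for get_data_home (pick a directory and build it if missing) can be sketched with the standard library alone. The function name and the fallback location below are hypothetical; the default that get_data_home actually chooses is not documented here.

```python
import os
import tempfile


def get_data_home_sketch(data_home=None):
    """Return a dataset directory as an absolute path, creating it if needed."""
    if data_home is None:
        # Hypothetical fallback for illustration only.
        data_home = os.path.join(tempfile.gettempdir(), "example_datasets")
    data_home = os.path.abspath(os.path.expanduser(data_home))
    os.makedirs(data_home, exist_ok=True)  # no error if it already exists
    return data_home


# Demo inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = get_data_home_sketch(os.path.join(tmp, "nested", "datasets"))
    created_exists = os.path.isdir(created)
```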

matbench_genmetrics.mp_time_split.splitter.main(args)[source]

Wrapper allowing calls with string arguments in a CLI fashion. Output is printed to stdout in a nicely formatted message.

Parameters:

args (List[str]) – Command line parameters as a list of strings (for example ["--verbose", "./data"]).

matbench_genmetrics.mp_time_split.splitter.parse_args(args)[source]

Parse command line parameters.

Parameters:

args (List[str]) – Command line parameters as a list of strings (for example ["--help"]).

Returns:

command line parameters namespace

Return type:

argparse.Namespace
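
As a rough illustration of turning a list of strings into an argparse.Namespace, the sketch below accepts the ["--verbose", "./data"] example given for main(). The argument names, defaults, and help texts are assumptions made for this example; the options the real parser defines are not listed in this reference.

```python
import argparse
import logging


def parse_args_sketch(args):
    """Parse a list of CLI strings into a namespace (illustrative options only)."""
    parser = argparse.ArgumentParser(
        description="Illustrative parser; option names are hypothetical."
    )
    # Hypothetical positional argument matching the "./data" example.
    parser.add_argument("save_dir", nargs="?", default=".", help="output folder")
    # Hypothetical verbosity flag mapped onto a logging level.
    parser.add_argument(
        "-v",
        "--verbose",
        dest="loglevel",
        action="store_const",
        const=logging.INFO,
        default=logging.WARNING,
        help="set loglevel to INFO",
    )
    return parser.parse_args(args)


namespace = parse_args_sketch(["--verbose", "./data"])
```

Passing the argument list explicitly, rather than reading sys.argv inside the parser, keeps the function easy to call from tests.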

matbench_genmetrics.mp_time_split.splitter.run()[source]

Calls main(), passing the CLI arguments extracted from sys.argv. This function can be used as an entry point to create console scripts with setuptools.

matbench_genmetrics.mp_time_split.splitter.setup_logging(loglevel)[source]

Set up basic logging.

Parameters:

loglevel (int) – Minimum log level for emitting messages.
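
A minimal sketch of what basic logging setup with a minimum level can look like, using logging.basicConfig. The format string, the stdout stream, and the use of force=True (Python 3.8+) are assumptions for this example rather than details taken from the module.

```python
import logging
import sys


def setup_logging_sketch(loglevel):
    """Configure the root logger to emit messages at or above loglevel."""
    logging.basicConfig(
        level=loglevel,
        stream=sys.stdout,
        format="[%(asctime)s] %(levelname)s:%(name)s:%(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )


setup_logging_sketch(logging.INFO)
log = logging.getLogger("example")
log.debug("suppressed: below the configured minimum level")
log.info("emitted: at the configured minimum level")
```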

Module contents

Create Materials Project time-based cross-validation splits