matbench_genmetrics.mp_time_split.utils namespace

Submodules

matbench_genmetrics.mp_time_split.utils.api module

matbench_genmetrics.mp_time_split.utils.api.fetch_data(api_key: str | None = None, fields: List[str] | None = ['structure', 'material_id', 'theoretical', 'energy_above_hull', 'formation_energy_per_atom'], num_sites: Tuple[int, int] | None = None, elements: List[str] | None = None, exclude_elements: List[str] | Literal['noble', 'radioactive', 'noble+radioactive'] | None = None, use_theoretical: bool = False, return_both_if_experimental: bool = False, one_by_one: bool = False, **search_kwargs) DataFrame | Tuple[DataFrame, DataFrame][source]

Retrieve Materials Project (MP) data sorted by MPID (theoretical + experimental) or by publication year (experimental only).

See *How do I do a time-split of Materials Project entries? e.g. pre-2018 vs. post-2018*

Output DataFrames will contain all specified fields unless fields is None, in which case all MPRester().summary.available_fields() will be returned. If experimental data is returned, the additional fields provenance, discovery, and year are also included. They correspond to emmet.core.provenance.ProvenanceDoc(), a dictionary containing the earliest year and author information, and the earliest year, respectively.

Parameters:
  • api_key (Union[str, DEFAULT_API_KEY]) – mp_api() API key. On Windows, it can be set as an environment variable via: setx MP_API_KEY="abc123def456". By default: mp_api.core.client.DEFAULT_API_KEY(). See also: https://github.com/materialsproject/api/issues/566#issuecomment-1087941474

  • fields (Optional[List[str]]) – List of fields to project. When searching, it is better to request only the specific fields of interest to reduce the time taken to retrieve the documents. See MPRester().summary.available_fields() for the list of fields to choose from. By default: ["structure", "material_id", "theoretical", "energy_above_hull", "formation_energy_per_atom"].

  • num_sites (Tuple[int, int]) – Tuple of min and max number of sites used as filtering criteria, e.g. (1, 52) meaning at least 1 and no more than 52 sites. If None then compounds with any number of sites are allowed. By default None.

  • elements (List[str]) – List of element symbols, e.g. ["Ni", "Fe"]. If None then all elements are allowed. By default None.

  • exclude_elements (Optional[Union[List[str], Literal["noble", "radioactive", "noble+radioactive"]]]) – List of element symbols to exclude, e.g. ["Ar", "Ne"]. If None then all elements are allowed. If a supported string value (“noble”, “radioactive”, or “noble+radioactive”), then filters out the appropriate elements. By default None.

  • use_theoretical (bool, optional) – Whether to include both theoretical and experimental compounds or to filter down to only experimentally-verified compounds, by default False

  • return_both_if_experimental (bool, optional) – Whether to return both the full DataFrame containing theoretical+experimental (df) and the experimental-only DataFrame (expt_df) or only expt_df, by default False. This is only applicable if use_theoretical is False.

  • one_by_one (bool, optional) – Whether to retrieve data one-by-one instead of in bulk. This is useful for testing with a small number of entries or in case the mp-api search is malfunctioning (since the provenance attributes are still needed). By default False.

  • search_kwargs (dict, optional) – Supported search terms, e.g. nelements_max=3 for the “materials” search API. Consult the specific API route for valid search terms, i.e. MPRester().summary.available_fields()
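
The exclude_elements parameter accepts either an explicit list or one of three shortcut strings. A minimal sketch of how such a shortcut could be expanded is shown below. The helper name resolve_exclude_elements and the RADIOACTIVE list are assumptions made for illustration; the list actually used by fetch_data may differ.

```python
from typing import List, Optional, Union

# The noble gases are well defined.
NOBLE = ["He", "Ne", "Ar", "Kr", "Xe", "Rn"]
# ASSUMPTION: illustrative subset only, not the list used by the library.
RADIOACTIVE = ["Tc", "Pm", "Po", "At", "Rn", "Fr", "Ra", "Ac", "Th", "Pa", "U"]


def resolve_exclude_elements(
    exclude_elements: Optional[Union[List[str], str]] = None,
) -> List[str]:
    """Expand a shortcut string (hypothetical helper) into element symbols."""
    if exclude_elements is None:
        # None means nothing is excluded, i.e. all elements are allowed.
        return []
    if isinstance(exclude_elements, str):
        presets = {
            "noble": NOBLE,
            "radioactive": RADIOACTIVE,
            "noble+radioactive": NOBLE + RADIOACTIVE,
        }
        if exclude_elements not in presets:
            raise ValueError(f"unsupported shortcut: {exclude_elements!r}")
        # dict.fromkeys drops duplicates (Rn is in both lists), keeping order.
        return list(dict.fromkeys(presets[exclude_elements]))
    # An explicit list of symbols is passed through unchanged.
    return list(exclude_elements)
```

For example, resolve_exclude_elements(["Ar", "Ne"]) returns the list unchanged, while resolve_exclude_elements("noble") returns the six noble gases.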

Returns:

  • df (pd.DataFrame) – if use_theoretical then returns a DataFrame containing both theoretical and experimental compounds.

  • expt_df, df (Tuple[pd.DataFrame, pd.DataFrame]) – if not use_theoretical and return_both_if_experimental, then returns two DataFrames: the experimental-only one (expt_df) followed by the theoretical+experimental one (df).

  • expt_df (pd.DataFrame) – if not use_theoretical and not return_both_if_experimental, then returns a pd.DataFrame() containing the experimental-only compounds.
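
The three return shapes above depend only on the two boolean flags. The sketch below mimics that dispatch with plain lists of dicts standing in for DataFrames, so it runs without mp-api or pandas. The function select_outputs and the sample material IDs are hypothetical and are not part of the library.

```python
def select_outputs(records, use_theoretical=False, return_both_if_experimental=False):
    """Mimic the documented return shapes (lists stand in for DataFrames)."""
    df = list(records)  # theoretical + experimental entries
    if use_theoretical:
        return df
    # Keep only experimentally verified entries.
    expt_df = [r for r in df if not r["theoretical"]]
    if return_both_if_experimental:
        return expt_df, df
    return expt_df


# Hypothetical toy records, not real Materials Project entries.
records = [
    {"material_id": "mp-A", "theoretical": False},
    {"material_id": "mp-B", "theoretical": True},
]

expt_only = select_outputs(records)
everything = select_outputs(records, use_theoretical=True)
expt_df, df = select_outputs(records, return_both_if_experimental=True)
```

Note that return_both_if_experimental has no effect when use_theoretical is True, matching the parameter description.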

Examples

>>> api_key = "abc123def456"
>>> num_sites = (1, 52)
>>> elements = ["V"]
>>> expt_df = fetch_data(api_key, num_sites=num_sites, elements=elements)
>>> df = fetch_data(
...     api_key,
...     num_sites=num_sites,
...     elements=elements,
...     use_theoretical=True,
... )
>>> expt_df, df = fetch_data(
...     api_key,
...     num_sites=num_sites,
...     elements=elements,
...     use_theoretical=False,
...     return_both_if_experimental=True,
... )

matbench_genmetrics.mp_time_split.utils.data module

matbench_genmetrics.mp_time_split.utils.data.get_discovery_dict(references: List[dict]) List[dict][source]

Get a dictionary containing earliest bib info for each MP entry.

Modified from source: “How do I do a time-split of Materials Project entries? e.g. pre-2018 vs. post-2018” https://matsci.org/t/42584/4?u=sgbaird, answer by @Joseph_Montoya, Materials Project Alumni

Parameters:

references (List[dict]) – List of references results, e.g. taken from the ProvenanceRester API results (mp_api.materials.provenance())

Returns:

List of dictionaries, one per MP entry, each containing the earliest bib info with keys: ["year", "authors", "num_authors"]

Return type:

List[dict]

Examples

>>> with MPRester(api_key) as mpr:
...     provenance_results = mpr.materials.provenance.search(num_sites=(1, 4), elements=["V"])
>>> discovery = get_discovery_dict(provenance_results)
>>> discovery
[{'year': 1963, 'authors': ['Raub, E.', 'Fritzsche, W.'], 'num_authors': 2}, {'year': 1925, 'authors': ['Becker, K.', 'Ebert, F.'], 'num_authors': 2}, {'year': 1965, 'authors': ['Giessen, B.C.', 'Grant, N.J.'], 'num_authors': 2}, {'year': 1957, 'authors': ['Philip, T.V.', 'Beck, P.A.'], 'num_authors': 2}, {'year': 1963, 'authors': ['Darby, J.B.jr.'], 'num_authors': 1}, {'year': 1977, 'authors': ['Aksenova, T.V.', 'Kuprina, V.V.', 'Bernard, V.B.', 'Skolozdra, R.V.'], 'num_authors': 4}, {'year': 1964, 'authors': ['Maldonado, A.', 'Schubert, K.'], 'num_authors': 2}, {'year': 1962, 'authors': ['Darby, J.B.jr.', 'Lam, D.J.', 'Norton, L.J.', 'Downey, J.W.'], 'num_authors': 4}, {'year': 1925, 'authors': ['Becker, K.', 'Ebert, F.'], 'num_authors': 2}, {'year': 1959, 'authors': ['Dwight, A.E.'], 'num_authors': 1}]
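
The core idea, picking the earliest reference per entry and summarizing it, can be sketched as follows. This assumes each entry's references have already been parsed into dicts with "year" and "authors" keys. That input format, the helper name earliest_reference, and the placeholder bibliography are assumptions for illustration; the real function works on the provenance results returned by mp-api.

```python
def earliest_reference(parsed_refs):
    """Return year/authors/num_authors of the earliest dated reference.

    ASSUMPTION: parsed_refs is a list of {"year": int, "authors": [str, ...]}
    dicts for a single entry; this is not the raw mp-api format.
    """
    dated = [ref for ref in parsed_refs if ref.get("year") is not None]
    if not dated:
        # No usable publication year for this entry.
        return {"year": None, "authors": [], "num_authors": 0}
    first = min(dated, key=lambda ref: ref["year"])
    authors = list(first["authors"])
    return {"year": first["year"], "authors": authors, "num_authors": len(authors)}


# Placeholder bibliography for one entry (not real references).
entry_refs = [
    {"year": 1987, "authors": ["Author, A."]},
    {"year": 1925, "authors": ["Author, B.", "Author, C."]},
]
info = earliest_reference(entry_refs)
```

Applying such a helper to every entry yields one dictionary per MP entry, in the same shape as the output shown in the example above.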

matbench_genmetrics.mp_time_split.utils.gen module

class matbench_genmetrics.mp_time_split.utils.gen.DummyGenerator[source]

Bases: object

fit(inputs)[source]
gen(n=100)[source]

This function generates a list of pymatgen Structure objects by creating random crystals using the pyxtal library. Each crystal is composed of Ba, Ti, and O in a 1:1:3 ratio.

Parameters:

n (int, optional) – The number of structures to generate, by default 100.

Returns:

A list of pymatgen Structure objects.

Return type:

List[Structure]

Examples

>>> structures = DummyGenerator().gen(n=100)

matbench_genmetrics.mp_time_split.utils.gen.get_random_sio_structure(rng=Generator(PCG64) at 0x7FA04032AA40)[source]

matbench_genmetrics.mp_time_split.utils.split module

class matbench_genmetrics.mp_time_split.utils.split.TimeKFold(n_splits=5, *, shuffle=False, random_state=None)[source]

Bases: _BaseKFold

Time Series K-Folds cross-validator

TODO: update docstring

Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default).

Each fold is then used once as a validation while the k - 1 remaining folds form the training set.

Read more in the User Guide.

Parameters:

n_splits (int, default=5) –

Number of folds. Must be at least 2.

Changed in version 0.22: n_splits default value changed from 3 to 5.

Examples

>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4])
>>> kf = KFold(n_splits=2)
>>> kf.get_n_splits(X)
2
>>> print(kf)
KFold(n_splits=2, random_state=None, shuffle=False)
>>> for train_index, test_index in kf.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]

Notes

The first n_samples % n_splits folds have size n_samples // n_splits + 1, other folds have size n_samples // n_splits, where n_samples is the number of samples.

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
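
The fold-size rule in the first note can be checked with a few lines of arithmetic. The helper name fold_sizes is introduced here only for illustration.

```python
def fold_sizes(n_samples, n_splits):
    """Fold sizes following the rule stated in the note above."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds get one additional sample.
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes_even = fold_sizes(4, 2)     # the 4-sample, 2-fold example above
sizes_uneven = fold_sizes(10, 3)  # 10 % 3 == 1, so only the first fold is larger
```

Here sizes_even is [2, 2] and sizes_uneven is [4, 3, 3]; in both cases the sizes sum to n_samples.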

See also

StratifiedKFold

Takes group information into account to avoid building folds with imbalanced class distributions (for binary or multiclass classification tasks).

GroupKFold

K-fold iterator variant with non-overlapping groups.

RepeatedKFold

Repeats K-Fold n times.

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,), default=None) – The target variable for supervised learning problems.

  • groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields:
  • train (ndarray) – The training set indices for that split.

  • test (ndarray) – The testing set indices for that split.

class matbench_genmetrics.mp_time_split.utils.split.TimeSeriesOverflowSplit(n_splits=5, *, max_train_size=None, test_size=None, gap=0)[source]

Bases: _BaseKFold

Time Series cross-validator that always uses remainder of data as test data.

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,)) – Always ignored, exists for compatibility.

  • groups (array-like of shape (n_samples,)) – Always ignored, exists for compatibility.

Yields:
  • train (ndarray) – The training set indices for that split.

  • test (ndarray) – The testing set indices for that split.
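
A minimal sketch of what "always uses remainder of data as test data" could look like is given below. It assumes the training windows follow the expanding-window layout of scikit-learn's TimeSeriesSplit and that only the test indices change; this is an assumption, not a description of the actual implementation, and overflow_splits is a hypothetical name.

```python
def overflow_splits(n_samples, n_splits=5, test_size=None, gap=0):
    """Yield (train, test) index lists where test runs to the end of the data.

    ASSUMPTION: train windows are laid out like an expanding-window
    time series split; the real class may differ in details.
    """
    if test_size is None:
        test_size = n_samples // (n_splits + 1)
    if test_size < 1 or n_samples - gap - test_size * n_splits <= 0:
        raise ValueError("too many splits for the number of samples")
    indices = list(range(n_samples))
    first_start = n_samples - n_splits * test_size
    for test_start in range(first_start, n_samples, test_size):
        train = indices[: test_start - gap]
        test = indices[test_start:]  # everything after the training window
        yield train, test


splits = list(overflow_splits(8, n_splits=3))
```

With 8 samples and 3 splits, the test sets shrink from indices 2 to 7 down to indices 6 to 7 while the training sets grow, so every sample after the training window is always evaluated.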

matbench_genmetrics.mp_time_split.utils.split.mp_time_splitter(X, mode='TimeSeriesSplit', use_trainval_test: bool = True, n_cv_splits: int = 5, max_train_size=None, test_size=None, gap=0)[source]

Split into trainval and test sets, and optionally return test_sets.

Parameters:
  • X (pd.DataFrame) – DataFrame of Materials Project data to be split.

  • mode (str, optional) – One of {“TimeSeriesSplit”, “TimeSeriesOverflowSplit”, “TimeKFold”}, by default “TimeSeriesSplit”

  • use_trainval_test (bool, optional) – Whether to use a trainval-test split vs. just a train-test split. The idea is that you should tune your hyperparameters using training and validation sets and then keep a held-out test set that you “only ever touch once” (e.g., run immediately before and only once prior to manuscript submission). By default True.

  • n_cv_splits (int, optional) – Number of cross-validation splits to perform, by default 5

  • max_train_size (int, optional) – Maximum size for a single training set, by default None

  • test_size (int, optional) – Used to limit the size of the test set, by default None

  • gap (int, optional) – Number of samples to exclude from the end of each training set before the test set, by default 0

Returns:

  • list of (tuples of arrays) – The training and test indices for that split.

  • list of (tuples of arrays), 2-element tuple of arrays – Returned when use_trainval_test is True. The (training and validation) and test indices for that split.

Examples

>>> mpt = MPTimeSplit(num_sites=num_sites, elements=elements)
>>> data = mpt.load(dummy=True)
>>> trainval_splits, test_split = mp_time_splitter(data, use_trainval_test=True)
>>> print(trainval_splits)
[
    (array([0]), array([1])),
    (array([0, 1]), array([2])),
    (array([0, 1, 2]), array([3])),
    (array([0, 1, 2, 3]), array([4])),
    (array([0, 1, 2, 3, 4]), array([5])),
]
>>> print(test_split)
(array([0, 1, 2, 3, 4, 5]), array([6, 7]))
>>>
>>> # **no held-out test set**
>>> trainval_splits = mp_time_splitter(data, use_trainval_test=False)
>>> print(trainval_splits)
[
    (array([0, 1, 2]), array([3])),
    (array([0, 1, 2, 3]), array([4])),
    (array([0, 1, 2, 3, 4]), array([5])),
    (array([0, 1, 2, 3, 4, 5]), array([6])),
    (array([0, 1, 2, 3, 4, 5, 6]), array([7])),
]
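
The index patterns in the example above can be reproduced with a small pure-Python sketch of an expanding-window split plus one held-out test block. How mp_time_splitter chooses the size of the held-out block is not stated here, so the n_test argument below is a hypothetical parameter, as are both function names.

```python
def expanding_splits(n_samples, n_splits):
    """Expanding-window (train, test) index lists, one test block per split."""
    test_size = n_samples // (n_splits + 1)
    if test_size < 1:
        raise ValueError("too many splits for the number of samples")
    idx = list(range(n_samples))
    starts = range(n_samples - n_splits * test_size, n_samples, test_size)
    return [(idx[:s], idx[s:s + test_size]) for s in starts]


def trainval_test_splits(n_samples, n_cv_splits=5, n_test=2):
    """Cross-validate on the earliest samples; hold out the last n_test.

    ASSUMPTION: n_test is an explicit argument here; the library may
    derive the held-out size differently.
    """
    n_trainval = n_samples - n_test
    idx = list(range(n_samples))
    trainval = expanding_splits(n_trainval, n_cv_splits)
    test = (idx[:n_trainval], idx[n_trainval:])
    return trainval, test


trainval_splits, test_split = trainval_test_splits(8)
no_holdout_splits = expanding_splits(8, 5)
```

For 8 samples this gives the same indices as the example: five train/validation splits over samples 0 to 5 with samples 6 and 7 held out, or, without a held-out set, five splits whose test index runs from 3 to 7.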