matbench_genmetrics.mp_time_split.utils namespace

Submodules

matbench_genmetrics.mp_time_split.utils.api module

matbench_genmetrics.mp_time_split.utils.api.fetch_data(api_key: str | None = None, fields: List[str] | None = ['structure', 'material_id', 'theoretical', 'energy_above_hull', 'formation_energy_per_atom'], num_sites: Tuple[int, int] | None = None, elements: List[str] | None = None, exclude_elements: List[str] | Literal['noble', 'radioactive', 'noble+radioactive'] | None = None, use_theoretical: bool = False, return_both_if_experimental: bool = False, one_by_one: bool = False, **search_kwargs) → DataFrame | Tuple[DataFrame, DataFrame][source]

Retrieve MP data sorted by MPID (theoretical+exptl) or pub year (exptl).

See *How do I do a time-split of Materials Project entries? e.g. pre-2018 vs. post-2018*

Output DataFrames will contain all specified fields unless fields is None, in which case all MPRester().summary.available_fields() will be returned. If experimental data is returned, the additional fields provenance, discovery, and year will also be included. These correspond to emmet.core.provenance.ProvenanceDoc(), a dictionary containing the earliest year and author information, and the earliest year, respectively.

Parameters:

api_key (Union[str, DEFAULT_API_KEY]) – mp_api() API key. On Windows, it can be set as an environment variable via: setx MP_API_KEY="abc123def456". By default: mp_api.core.client.DEFAULT_API_KEY(). See also: https://github.com/materialsproject/api/issues/566#issuecomment-1087941474
fields (Optional[List[str]]) – List of fields to project. When searching, it is better to ask only for the specific fields of interest to reduce the time taken to retrieve the documents. See the MPRester().summary.available_fields() property for a list of fields to choose from. By default: ["structure", "material_id", "theoretical", "energy_above_hull", "formation_energy_per_atom"].
num_sites (Tuple[int, int]) – Tuple of min and max number of sites used as filtering criteria, e.g. (1, 52) meaning at least 1 and no more than 52 sites. If None then compounds with any number of sites are allowed. By default None.
elements (List[str]) – List of element symbols, e.g. ["Ni", "Fe"]. If None then all elements are allowed. By default None.
exclude_elements (Optional[Union[List[str], Literal["noble", "radioactive", "noble+radioactive"]]]) – List of element symbols to exclude, e.g. ["Ar", "Ne"]. If None then all elements are allowed. If a supported string value ("noble", "radioactive", or "noble+radioactive"), then filters out the appropriate elements. By default None.
use_theoretical (bool, optional) – Whether to include both theoretical and experimental compounds or to filter down to only experimentally-verified compounds, by default False
return_both_if_experimental (bool, optional) – Whether to return both the full DataFrame containing theoretical+experimental (df) and the experimental-only DataFrame (expt_df) or only expt_df, by default False. This is only applicable if use_theoretical is False.
one_by_one (bool, optional) – Whether to retrieve data one-by-one instead of in bulk. This is useful for testing with a small number of entries or in case the mp-api search is malfunctioning (the provenance attributes are still needed). By default False.
search_kwargs (dict, optional) – Supported search terms, e.g. nelements_max=3 for the “materials” search API. Consult the specific API route for valid search terms, i.e. MPRester().summary.available_fields()

Returns:

df (pd.DataFrame) – if use_theoretical then returns a DataFrame containing both theoretical and experimental compounds.
expt_df, df (Tuple[pd.DataFrame, pd.DataFrame]) – if not use_theoretical and return_both_if_experimental, then returns two pd.DataFrames: the experimental-only one (expt_df) followed by the theoretical+experimental one (df).
expt_df (pd.DataFrame) – if not use_theoretical and not return_both_if_experimental, then returns a pd.DataFrame() containing the experimental-only compounds.

Examples

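>>> from matbench_genmetrics.mp_time_split.utils.api import fetch_data
>>> api_key = "abc123def456"
>>> num_sites = (1, 52)
>>> elements = ["V"]
>>> # experimental-only compounds (default)
>>> expt_df = fetch_data(api_key, num_sites=num_sites, elements=elements)

>>> # theoretical and experimental compounds together
>>> df = fetch_data(
...     api_key,
...     num_sites=num_sites,
...     elements=elements,
...     use_theoretical=True,
... )

>>> # both DataFrames at once
>>> expt_df, df = fetch_data(
...     api_key,
...     num_sites=num_sites,
...     elements=elements,
...     use_theoretical=False,
...     return_both_if_experimental=True,
... )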
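
The shorthand values accepted by exclude_elements can be pictured as names for groups of element symbols. The following is a minimal sketch of that idea only: the helper expand_exclude_elements and both element lists are hypothetical and are not taken from the package, whose actual lists may differ.

```python
from typing import List, Union

# Assumption: illustrative groups only, not the lists used by the package.
NOBLE = ["He", "Ne", "Ar", "Kr", "Xe", "Rn"]
# Illustrative subset of elements that have no stable isotopes.
RADIOACTIVE = ["Tc", "Pm", "Po", "At", "Rn", "Fr", "Ra", "Ac", "Th", "Pa", "U"]

GROUPS = {"noble": NOBLE, "radioactive": RADIOACTIVE}


def expand_exclude_elements(exclude: Union[List[str], str, None]) -> List[str]:
    """Hypothetical helper: expand a shorthand string into element symbols."""
    if exclude is None:
        # Nothing to exclude, so every element stays allowed.
        return []
    if isinstance(exclude, str):
        symbols: List[str] = []
        # "noble+radioactive" is treated as the union of both groups.
        for name in exclude.split("+"):
            if name not in GROUPS:
                raise ValueError(f"unsupported shorthand: {name!r}")
            for symbol in GROUPS[name]:
                if symbol not in symbols:  # keep order, drop duplicates
                    symbols.append(symbol)
        return symbols
    # An explicit list of symbols is passed through unchanged.
    return list(exclude)


noble_only = expand_exclude_elements("noble")
combined = expand_exclude_elements("noble+radioactive")
```

In this sketch "Rn" belongs to both groups, which is why the union step removes duplicates.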

matbench_genmetrics.mp_time_split.utils.data module

matbench_genmetrics.mp_time_split.utils.data.get_discovery_dict(references: List[dict]) → List[dict][source]

Get a dictionary containing earliest bib info for each MP entry.

Modified from source: “How do I do a time-split of Materials Project entries? e.g. pre-2018 vs. post-2018” https://matsci.org/t/42584/4?u=sgbaird, answer by @Joseph_Montoya, Materials Project Alumni

Parameters: provenance_results (List[dict]) – List of reference results, e.g. taken from the ProvenanceRester API results (mp_api.materials.provenance())
Returns: discovery – List of dictionaries, one per MP entry, each containing the earliest bib info with keys: ["year", "authors", "num_authors"]
Return type: List[dict]

Examples

>>> from mp_api.client import MPRester
>>> from matbench_genmetrics.mp_time_split.utils.data import get_discovery_dict
>>> with MPRester(api_key) as mpr:
...     provenance_results = mpr.materials.provenance.search(num_sites=(1, 4), elements=["V"])
>>> discovery = get_discovery_dict(provenance_results)
>>> discovery
[{'year': 1963, 'authors': ['Raub, E.', 'Fritzsche, W.'], 'num_authors': 2}, {'year': 1925, 'authors': ['Becker, K.', 'Ebert, F.'], 'num_authors': 2}, {'year': 1965, 'authors': ['Giessen, B.C.', 'Grant, N.J.'], 'num_authors': 2}, {'year': 1957, 'authors': ['Philip, T.V.', 'Beck, P.A.'], 'num_authors': 2}, {'year': 1963, 'authors': ['Darby, J.B.jr.'], 'num_authors': 1}, {'year': 1977, 'authors': ['Aksenova, T.V.', 'Kuprina, V.V.', 'Bernard, V.B.', 'Skolozdra, R.V.'], 'num_authors': 4}, {'year': 1964, 'authors': ['Maldonado, A.', 'Schubert, K.'], 'num_authors': 2}, {'year': 1962, 'authors': ['Darby, J.B.jr.', 'Lam, D.J.', 'Norton, L.J.', 'Downey, J.W.'], 'num_authors': 4}, {'year': 1925, 'authors': ['Becker, K.', 'Ebert, F.'], 'num_authors': 2}, {'year': 1959, 'authors': ['Dwight, A.E.'], 'num_authors': 1}]
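
The "earliest bib info" idea can be illustrated without network access. The sketch below assumes that each entry comes with a list of reference dictionaries carrying a "year" and an "authors" list. That input layout, the helper earliest_reference, and the toy data are all hypothetical; only the output keys follow the description above.

```python
from typing import Dict, List


def earliest_reference(refs: List[Dict]) -> Dict:
    """Hypothetical helper: summarize the oldest dated reference of one entry."""
    # Ignore references that carry no publication year.
    dated = [ref for ref in refs if ref.get("year") is not None]
    if not dated:
        return {"year": None, "authors": [], "num_authors": 0}
    # The reference with the smallest year counts as the discovery.
    first = min(dated, key=lambda ref: int(ref["year"]))
    authors = list(first.get("authors", []))
    return {
        "year": int(first["year"]),
        "authors": authors,
        "num_authors": len(authors),
    }


# Toy data (made up for illustration): two entries with their references.
entries = [
    [
        {"year": 1970, "authors": ["Author C"]},
        {"year": 1963, "authors": ["Author A", "Author B"]},
    ],
    [{"year": "1925", "authors": ["Author D"]}],
]
discovery = [earliest_reference(refs) for refs in entries]
```

Sorting entries by the resulting "year" value is what makes a chronological split possible.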

matbench_genmetrics.mp_time_split.utils.gen module

class matbench_genmetrics.mp_time_split.utils.gen.DummyGenerator[source]

Bases: object

fit(inputs)[source]

gen(n=100)[source]

This function generates a list of pymatgen Structure objects by creating random crystals using the pyxtal library. Each crystal is composed of Ba, Ti, and O in a 1:1:3 ratio.

Parameters: n (int, optional) – The number of structures to generate, by default 100.
Returns: A list of pymatgen Structure objects.
Return type: List[Structure]

Examples

>>> structures = DummyGenerator().gen(n=100)

matbench_genmetrics.mp_time_split.utils.gen.get_random_sio_structure(rng=Generator(PCG64))[source]

matbench_genmetrics.mp_time_split.utils.split module

class matbench_genmetrics.mp_time_split.utils.split.TimeKFold(n_splits=5, *, shuffle=False, random_state=None)[source]

Bases: _BaseKFold

Time Series K-Folds cross-validator

TODO: update docstring

Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default).

Each fold is then used once as a validation set while the k - 1 remaining folds form the training set.

See also

StratifiedKFold: Takes group information into account to avoid building folds with imbalanced class distributions (for binary or multiclass classification tasks).
GroupKFold: K-fold iterator variant with non-overlapping groups.
RepeatedKFold: Repeats K-Fold n times.

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,), default=None) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.

Yields:

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.

class matbench_genmetrics.mp_time_split.utils.split.TimeSeriesOverflowSplit(n_splits=5, *, max_train_size=None, test_size=None, gap=0)[source]

Bases: _BaseKFold

Time Series cross-validator that always uses remainder of data as test data.

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Always ignored, exists for compatibility.
groups (array-like of shape (n_samples,)) – Always ignored, exists for compatibility.

Yields:

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
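
One way to read "always uses remainder of data as test data" is an expanding training window whose test set is everything that comes after it. The sketch below follows that reading. The helper overflow_splits and the choice of fold boundaries are assumptions for illustration, not the class's actual implementation.

```python
from typing import Iterator, List, Tuple


def overflow_splits(
    n_samples: int, n_splits: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Hypothetical sketch: the test set is all data after the training window."""
    # Assumption: the training window grows by this many samples per split.
    step = n_samples // (n_splits + 1)
    if step == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_boundary = n_samples - n_splits * step
    for boundary in range(first_boundary, n_samples, step):
        train = list(range(boundary))
        # Unlike a fixed-size test fold, the test set runs to the end.
        test = list(range(boundary, n_samples))
        yield train, test


overflow = list(overflow_splits(8, n_splits=5))
```

With 8 chronologically sorted samples, the first split trains on indices 0 to 2 and tests on 3 to 7, while the last split trains on 0 to 6 and tests on 7 only.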

matbench_genmetrics.mp_time_split.utils.split.mp_time_splitter(X, mode='TimeSeriesSplit', use_trainval_test: bool = True, n_cv_splits: int = 5, max_train_size=None, test_size=None, gap=0)[source]

Split the data into cross-validation (trainval) splits and, optionally, a held-out test split.

Parameters:

X (pd.DataFrame) – DataFrame of Materials Project data to be split.
mode (str, optional) – One of {“TimeSeriesSplit”, “TimeSeriesOverflowSplit”, “TimeKFold”}, by default “TimeSeriesSplit”
use_trainval_test (bool, optional) – Whether to use a trainval-test split vs. just a train-test split. The idea is that you should tune your hyperparameters using training and validation sets and then keep a held-out test set that you “only ever touch once” (e.g., evaluated a single time, immediately before manuscript submission). By default True.
n_cv_splits (int, optional) – Number of cross-validation splits to perform, by default 5
max_train_size (int, optional) – Maximum size for a single training set, by default None
test_size (int, optional) – Used to limit the size of the test set, by default None
gap (int, optional) – Number of samples to exclude from the end of each training set before the test set, by default 0

Returns:

list of (tuples of arrays) – Returned when use_trainval_test is False. The training and test indices for each split.
list of (tuples of arrays), 2-element tuple of arrays – Returned when use_trainval_test is True. The (training and validation) indices for each split, followed by the (trainval, test) indices of the held-out split.

Raises:

NotImplementedError – mode is not one of the available modes
NotImplementedError – non-zero gap specified, not implemented for TimeKFold
NotImplementedError – non-None max_train_size specified, not implemented for TimeKFold
NotImplementedError – non-None test_size specified, not implemented for TimeKFold

Examples

>>> mpt = MPTimeSplit(num_sites=num_sites, elements=elements)
>>> data = mpt.load(dummy=True)
>>> trainval_splits, test_split = mp_time_splitter(data, use_trainval_test=True)
>>> print(trainval_splits)
[
    (array([0]), array([1])),
    (array([0, 1]), array([2])),
    (array([0, 1, 2]), array([3])),
    (array([0, 1, 2, 3]), array([4])),
    (array([0, 1, 2, 3, 4]), array([5])),
]
>>> print(test_split)
(array([0, 1, 2, 3, 4, 5]), array([6, 7]))
>>>
>>> # **no held-out test set**
>>> trainval_splits = mp_time_splitter(data, use_trainval_test=False)
>>> print(trainval_splits)
[
    (array([0, 1, 2]), array([3])),
    (array([0, 1, 2, 3]), array([4])),
    (array([0, 1, 2, 3, 4]), array([5])),
    (array([0, 1, 2, 3, 4, 5]), array([6])),
    (array([0, 1, 2, 3, 4, 5, 6]), array([7])),
]
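
The index pattern printed above is consistent with an expanding-window split in which each test fold holds n_samples // (n_splits + 1) samples, which is the default rule of scikit-learn's TimeSeriesSplit. The following pure-Python sketch reproduces that pattern for illustration. The helper expanding_window_splits is hypothetical and is not the function's actual implementation.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_splits: int = 5, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Hypothetical sketch of an expanding-window time series split."""
    # Default fold size, as in scikit-learn's TimeSeriesSplit.
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        # Training data is everything before the test fold, minus the gap.
        train = list(range(max(test_start - gap, 0)))
        test = list(range(test_start, test_start + test_size))
        yield train, test


# 8 samples, no held-out test set: same indices as the second listing above.
no_holdout = list(expanding_window_splits(8, n_splits=5))
# First 6 samples only: same indices as the trainval_splits listing above.
trainval = list(expanding_window_splits(6, n_splits=5))
```

Because the training indices always precede the test indices, a DataFrame sorted by discovery year is never trained on entries newer than those it is evaluated on.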