matbench_genmetrics.mp_time_split.utils namespace
Submodules
matbench_genmetrics.mp_time_split.utils.api module
- matbench_genmetrics.mp_time_split.utils.api.fetch_data(api_key: str | None = None, fields: List[str] | None = ['structure', 'material_id', 'theoretical', 'energy_above_hull', 'formation_energy_per_atom'], num_sites: Tuple[int, int] | None = None, elements: List[str] | None = None, exclude_elements: List[str] | Literal['noble', 'radioactive', 'noble+radioactive'] | None = None, use_theoretical: bool = False, return_both_if_experimental: bool = False, one_by_one: bool = False, **search_kwargs) DataFrame | Tuple[DataFrame, DataFrame][source]
Retrieve MP data sorted by MPID (theoretical+exptl) or pub year (exptl).
See *How do I do a time-split of Materials Project entries? e.g. pre-2018 vs. post-2018*
Output
The DataFrames contain all specified fields unless fields is None, in which case all MPRester().summary.available_fields() are returned. If experimental data are returned, the additional fields provenance, discovery, and year are also returned; these correspond to emmet.core.provenance.ProvenanceDoc(), a dictionary containing earliest year and author information, and the earliest year, respectively.
- Parameters:
api_key (Union[str, DEFAULT_API_KEY]) – mp_api API key. On Windows, it can be set as an environment variable via: setx MP_API_KEY "abc123def456". By default: mp_api.core.client.DEFAULT_API_KEY(). See also: https://github.com/materialsproject/api/issues/566#issuecomment-1087941474
fields (Optional[List[str]]) – List of fields to project. When searching, it is better to ask only for the specific fields of interest to reduce the time taken to retrieve the documents. See the MPRester().summary.available_fields() property for a list of fields to choose from. By default: ["structure", "material_id", "theoretical", "energy_above_hull", "formation_energy_per_atom"].
num_sites (Tuple[int, int]) – Tuple of min and max number of sites used as filtering criteria, e.g. (1, 52) meaning at least 1 and no more than 52 sites. If None then compounds with any number of sites are allowed. By default None.
elements (List[str]) – List of element symbols, e.g. ["Ni", "Fe"]. If None then all elements are allowed. By default None.
exclude_elements (Optional[Union[List[str], Literal["noble", "radioactive", "noble+radioactive"]]]) – List of element symbols to _exclude_, e.g. ["Ar", "Ne"]. If None then all elements are allowed. If a supported string value ("noble", "radioactive", or "noble+radioactive"), then filters out the appropriate elements. By default None.
use_theoretical (bool, optional) – Whether to include both theoretical and experimental compounds or to filter down to only experimentally-verified compounds, by default False
return_both_if_experimental (bool, optional) – Whether to return both the full DataFrame containing theoretical+experimental (df) and the experimental-only DataFrame (expt_df) or only expt_df, by default False. This is only applicable if use_theoretical is False.
one_by_one (bool, optional) – Whether to retrieve data one-by-one instead of in bulk. This is useful for testing with a small number of entries or in case the mp-api search is malfunctioning (since the provenance attributes are needed). By default False.
search_kwargs (dict, optional) – Supported search terms, e.g. nelements_max=3 for the "materials" search API. Consult the specific API route for valid search terms, i.e. MPRester().summary.available_fields()
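The exclude_elements shorthand strings can be pictured as names for fixed element lists. The sketch below is illustrative only: expand_exclude_elements and the two constants are hypothetical names, and the radioactive list in particular is an assumed subset, not the list the library uses.

```python
from typing import List, Optional, Union

# Noble gases (a fixed, well-known group).
NOBLE = ["He", "Ne", "Ar", "Kr", "Xe", "Rn"]
# ASSUMPTION: illustrative subset only; the library's own list may differ.
RADIOACTIVE = ["Tc", "Pm", "Po", "At", "Rn", "Fr", "Ra", "Ac", "Th", "Pa", "U"]

SHORTHANDS = {
    "noble": NOBLE,
    "radioactive": RADIOACTIVE,
    "noble+radioactive": NOBLE + RADIOACTIVE,
}


def expand_exclude_elements(
    exclude_elements: Optional[Union[List[str], str]],
) -> List[str]:
    """Hypothetical helper: turn the argument into an explicit symbol list."""
    if exclude_elements is None:
        return []  # nothing is excluded
    if isinstance(exclude_elements, str):
        if exclude_elements not in SHORTHANDS:
            raise ValueError(f"unsupported shorthand: {exclude_elements!r}")
        # dict.fromkeys drops duplicates (e.g. Rn) while preserving order
        return list(dict.fromkeys(SHORTHANDS[exclude_elements]))
    return list(exclude_elements)


print(expand_exclude_elements("noble"))
print(expand_exclude_elements(["Ar", "Ne"]))
```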
- Returns:
df (pd.DataFrame) – if use_theoretical then returns a DataFrame containing both theoretical and experimental compounds.
expt_df, df (Tuple[pd.DataFrame, pd.DataFrame]) – if not use_theoretical and return_both_if_experimental, then returns two pd.DataFrame-s containing the experimental-only and the theoretical+experimental compounds, respectively.
expt_df (pd.DataFrame) – if not use_theoretical and not return_both_if_experimental, then returns a pd.DataFrame() containing the experimental-only compounds.
Examples
>>> from matbench_genmetrics.mp_time_split.utils.api import fetch_data
>>> api_key = "abc123def456"
>>> num_sites = (1, 52)
>>> elements = ["V"]
>>> expt_df = fetch_data(api_key, num_sites=num_sites, elements=elements)
>>> df = fetch_data(
...     api_key, num_sites=num_sites, elements=elements, use_theoretical=True
... )
>>> expt_df, df = fetch_data(
...     api_key,
...     num_sites=num_sites,
...     elements=elements,
...     use_theoretical=False,
...     return_both_if_experimental=True,
... )
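Because the return shape depends on two flags, it can help to see the three cases side by side. The following is a minimal sketch of the dispatch described under Returns; returned_names is a hypothetical helper that only reports which DataFrames come back, and it does not call the API.

```python
from typing import Tuple


def returned_names(
    use_theoretical: bool = False,
    return_both_if_experimental: bool = False,
) -> Tuple[str, ...]:
    """Hypothetical helper: names of the DataFrames for a flag combination."""
    if use_theoretical:
        # return_both_if_experimental only applies when use_theoretical is False
        return ("df",)
    if return_both_if_experimental:
        return ("expt_df", "df")
    return ("expt_df",)


for flags in [(False, False), (True, False), (False, True), (True, True)]:
    print(flags, "->", returned_names(*flags))
```

Callers that toggle these flags programmatically need to branch on the same conditions before unpacking the result.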
matbench_genmetrics.mp_time_split.utils.data module
- matbench_genmetrics.mp_time_split.utils.data.get_discovery_dict(references: List[dict]) List[dict][source]
Get a dictionary containing earliest bib info for each MP entry.
Modified from source: “How do I do a time-split of Materials Project entries? e.g. pre-2018 vs. post-2018” https://matsci.org/t/42584/4?u=sgbaird, answer by @Joseph_Montoya, Materials Project Alumni
- Parameters:
references (List[dict]) – List of provenance (reference) results, e.g. taken from the ProvenanceRester API results (mp_api.materials.provenance()). Called provenance_results in the example below.
- Returns:
Dictionary containing earliest bib info for each MP entry with keys: ["year", "authors", "num_authors"]
- Return type:
discovery, List[dict]
Examples
>>> with MPRester(api_key) as mpr:
...     provenance_results = mpr.materials.provenance.search(num_sites=(1, 4), elements=["V"])
>>> discovery = get_discovery_dict(provenance_results)
>>> discovery
[{'year': 1963, 'authors': ['Raub, E.', 'Fritzsche, W.'], 'num_authors': 2},
 {'year': 1925, 'authors': ['Becker, K.', 'Ebert, F.'], 'num_authors': 2},
 {'year': 1965, 'authors': ['Giessen, B.C.', 'Grant, N.J.'], 'num_authors': 2},
 {'year': 1957, 'authors': ['Philip, T.V.', 'Beck, P.A.'], 'num_authors': 2},
 {'year': 1963, 'authors': ['Darby, J.B.jr.'], 'num_authors': 1},
 {'year': 1977, 'authors': ['Aksenova, T.V.', 'Kuprina, V.V.', 'Bernard, V.B.', 'Skolozdra, R.V.'], 'num_authors': 4},
 {'year': 1964, 'authors': ['Maldonado, A.', 'Schubert, K.'], 'num_authors': 2},
 {'year': 1962, 'authors': ['Darby, J.B.jr.', 'Lam, D.J.', 'Norton, L.J.', 'Downey, J.W.'], 'num_authors': 4},
 {'year': 1925, 'authors': ['Becker, K.', 'Ebert, F.'], 'num_authors': 2},
 {'year': 1959, 'authors': ['Dwight, A.E.'], 'num_authors': 1}]
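The core idea, keeping only the earliest reference for each entry, can be sketched without the API. The input shape here is an assumption made for illustration: each entry is a list of already-parsed references with "year" and "authors" keys, whereas real provenance results would first need their bibliographic data parsed. earliest_reference is a hypothetical helper and the "Author, A." record is a placeholder.

```python
from typing import Dict, List


def earliest_reference(references: List[dict]) -> Dict[str, object]:
    """Hypothetical helper: summarize the earliest reference of one entry."""
    earliest = min(references, key=lambda ref: ref["year"])
    authors = list(earliest["authors"])
    return {
        "year": earliest["year"],
        "authors": authors,
        "num_authors": len(authors),
    }


# ASSUMED, simplified input: one list of parsed references per MP entry.
entries = [
    [
        {"year": 1971, "authors": ["Author, A."]},  # placeholder record
        {"year": 1963, "authors": ["Raub, E.", "Fritzsche, W."]},
    ],
    [{"year": 1925, "authors": ["Becker, K.", "Ebert, F."]}],
]

discovery = [earliest_reference(refs) for refs in entries]
print(discovery)
```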
matbench_genmetrics.mp_time_split.utils.gen module
- class matbench_genmetrics.mp_time_split.utils.gen.DummyGenerator[source]
Bases: object
- gen(n=100)[source]
This function generates a list of pymatgen Structure objects by creating random crystals using the pyxtal library. Each crystal is composed of Ba, Ti, and O in a 1:1:3 ratio.
- Parameters:
n (int, optional) – The number of structures to generate, by default 100.
- Returns:
A list of pymatgen Structure objects.
- Return type:
List[Structure]
Examples
>>> structures = DummyGenerator().gen(n=100)
matbench_genmetrics.mp_time_split.utils.split module
- class matbench_genmetrics.mp_time_split.utils.split.TimeKFold(n_splits=5, *, shuffle=False, random_state=None)[source]
Bases: _BaseKFold
Time Series K-Folds cross-validator
TODO: update docstring
Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default).
Each fold is then used once as a validation while the k - 1 remaining folds form the training set.
Read more in the User Guide.
- Parameters:
n_splits (int, default=5) – Number of folds. Must be at least 2.
Changed in version 0.22: n_splits default value changed from 3 to 5.
Examples
>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4])
>>> kf = KFold(n_splits=2)
>>> kf.get_n_splits(X)
2
>>> print(kf)
KFold(n_splits=2, random_state=None, shuffle=False)
>>> for train_index, test_index in kf.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
Notes
The first n_samples % n_splits folds have size n_samples // n_splits + 1, other folds have size n_samples // n_splits, where n_samples is the number of samples.
Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
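The fold-size rule in the note can be checked with a few lines of arithmetic. This is a small sketch; fold_sizes is a hypothetical helper written for illustration and is not part of the module.

```python
from typing import List


def fold_sizes(n_samples: int, n_splits: int) -> List[int]:
    """Hypothetical helper: fold sizes under the rule stated in the note."""
    base, extra = divmod(n_samples, n_splits)
    # the first `extra` folds receive one additional sample
    return [base + 1 if i < extra else base for i in range(n_splits)]


print(fold_sizes(10, 3))  # [4, 3, 3]
print(fold_sizes(23, 5))  # [5, 5, 5, 4, 4]
```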
See also
StratifiedKFold – Takes group information into account to avoid building folds with imbalanced class distributions (for binary or multiclass classification tasks).
GroupKFold – K-fold iterator variant with non-overlapping groups.
RepeatedKFold – Repeats K-Fold n times.
- split(X, y=None, groups=None)[source]
Generate indices to split data into training and test set.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,), default=None) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.
- Yields:
train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
- class matbench_genmetrics.mp_time_split.utils.split.TimeSeriesOverflowSplit(n_splits=5, *, max_train_size=None, test_size=None, gap=0)[source]
Bases: _BaseKFold
Time Series cross-validator that always uses remainder of data as test data.
- split(X, y=None, groups=None)[source]
Generate indices to split data into training and test set.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Always ignored, exists for compatibility.
groups (array-like of shape (n_samples,)) – Always ignored, exists for compatibility.
- Yields:
train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
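To make "always uses remainder of data as test data" concrete, here is a pure-Python sketch of one way such a splitter could behave. It assumes the train/test boundaries are placed as in scikit-learn's TimeSeriesSplit and that every sample after the boundary goes to the test set; the class's actual boundary rule is not stated here. overflow_splits is a hypothetical name, and max_train_size, test_size, and gap are ignored.

```python
from typing import Iterator, List, Tuple


def overflow_splits(
    n_samples: int, n_splits: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Hypothetical sketch: expanding train set, test set = all later samples."""
    # ASSUMPTION: boundaries follow the TimeSeriesSplit spacing rule
    step = n_samples // (n_splits + 1)
    if step == 0:
        raise ValueError("too few samples for the requested number of splits")
    indices = list(range(n_samples))
    for start in range(n_samples - n_splits * step, n_samples, step):
        # everything from the boundary onward is test data
        yield indices[:start], indices[start:]


for train, test in overflow_splits(8, n_splits=5):
    print("TRAIN:", train, "TEST:", test)
```

Under these assumptions the test sets shrink as the training window grows, so early folds are evaluated on far more future data than late ones.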
- matbench_genmetrics.mp_time_split.utils.split.mp_time_splitter(X, mode='TimeSeriesSplit', use_trainval_test: bool = True, n_cv_splits: int = 5, max_train_size=None, test_size=None, gap=0)[source]
Split into trainval and test sets, and optionally return test_sets.
- Parameters:
X (pd.DataFrame) – DataFrame of Materials Project data to be split.
mode (str, optional) – One of {“TimeSeriesSplit”, “TimeSeriesOverflowSplit”, “TimeKFold”}, by default “TimeSeriesSplit”
use_trainval_test (bool, optional) – Whether to use a trainval-test split vs. just a train-test split. The idea is that you should tune your hyperparameters using training and validation sets and then keep a held-out test set that you “only ever touch once” (e.g., run immediately before and only once prior to manuscript submission). By default True.
n_cv_splits (int, optional) – Number of cross-validation splits to perform, by default 5
max_train_size (int, optional) – Maximum size for a single training set, by default None
test_size (int, optional) – Used to limit the size of the test set, by default None
gap (int, optional) – Number of samples to exclude from the end of each training set before the test set, by default 0
- Returns:
list of (tuples of arrays) – The training and test indices for that split.
list of (tuples of arrays), 2-element tuple of arrays – Returned when use_trainval_test is True. The (training and validation) and test indices for that split.
- Raises:
NotImplementedError – mode={mode} not implemented. Use one of {AVAILABLE_MODES}
NotImplementedError – non-zero gap specified, not implemented for TimeKFold
NotImplementedError – non-None max_train_size specified, not implemented for TimeKFold
NotImplementedError – non-None test_size specified, not implemented for TimeKFold
Examples
>>> mpt = MPTimeSplit(num_sites=num_sites, elements=elements)
>>> data = mpt.load(dummy=True)
>>> trainval_splits, test_split = mp_time_splitter(data, use_trainval_test=True)
>>> print(trainval_splits)
[
 (array([0]), array([1])),
 (array([0, 1]), array([2])),
 (array([0, 1, 2]), array([3])),
 (array([0, 1, 2, 3]), array([4])),
 (array([0, 1, 2, 3, 4]), array([5])),
]
>>> print(test_split)
(array([0, 1, 2, 3, 4, 5]), array([6, 7]))
>>>
>>> # **no held-out test set**
>>> trainval_splits = mp_time_splitter(data, use_trainval_test=False)
>>> print(trainval_splits)
[
 (array([0, 1, 2]), array([3])),
 (array([0, 1, 2, 3]), array([4])),
 (array([0, 1, 2, 3, 4]), array([5])),
 (array([0, 1, 2, 3, 4, 5]), array([6])),
 (array([0, 1, 2, 3, 4, 5, 6]), array([7])),
]
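The index pattern in the no-held-out example matches the expanding-window rule used by scikit-learn's TimeSeriesSplit. The sketch below re-implements that rule in pure Python for illustration, assuming the default "TimeSeriesSplit" mode follows it with gap=0 and no size limits. expanding_window_splits is a hypothetical name, not the library's implementation, and the held-out test split is not modeled.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_splits: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Hypothetical sketch of the TimeSeriesSplit rule (gap=0, no size limits)."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    indices = list(range(n_samples))
    for start in range(n_samples - n_splits * test_size, n_samples, test_size):
        # train on everything before the boundary, test on the next block
        yield indices[:start], indices[start : start + test_size]


for train, test in expanding_window_splits(8, n_splits=5):
    print(train, test)
```

For 8 samples and 5 splits this yields training sets [0, 1, 2] through [0, ..., 6], each followed by a single test index, which is the same pattern as the second printed result in the example.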