-
-
Notifications
You must be signed in to change notification settings - Fork 262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Populate make_classification_df date functionality with random dates #851
base: main
Are you sure you want to change the base?
Changes from all commits
4b55f9a
424fca4
b740b27
24461fb
3a7c9d1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -73,10 +73,22 @@ def test_make_classification_df(): | |
dates=(date(2014, 1, 1), date(2015, 1, 1)), | ||
) | ||
|
||
X_df1, y_series1 = dask_ml.datasets.make_classification_df( | ||
n_samples=100, | ||
n_features=5, | ||
random_state=123, | ||
chunks=100, | ||
dates=(date(2014, 1, 1), date(2015, 1, 1)), | ||
) | ||
check_randomness = np.unique((X_df["date"] == X_df1["date"]).compute()) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good, this checks the random state. Shouldn't this also check that there's more than one unique value? That's what #845 is focused on. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, the code I've written checks for repeatability, on account of the seed. Since the the numpy's There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The same I think There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, that sounds good. Now I've to figure out, what would be a good threshold. Will There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh yeah, |
||
|
||
assert X_df is not None | ||
assert y_series is not None | ||
assert "date" in X_df.columns | ||
assert len(X_df.columns) == 6 | ||
assert len(X_df) == 100 | ||
assert len(y_series) == 100 | ||
assert isinstance(y_series, dask.dataframe.core.Series) | ||
assert check_randomness.size == 1 | ||
assert check_randomness[0] is True | ||
assert np.unique(X_df["date"]).size >= 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when
random_state
isn't an integer? Scikit-learn allows forrandom_state
to be an integer,None
or an instance ofnp.random.RandomState
(source).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I will have to raise a ValueError exception there, on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code above will produce the same random number since the seed(
rng
) remains the same in subsequent calls.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main point: I think
np.random.RandomState
andNone
should be acceptable types forrandom_state
. I'm fine expanding the for-loop, though I don't think that needs to happen:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can maybe just check if random_state is one of the accepted values like this and accordingly proceed -
if random_state is not None or not isinstance(random_state, np.random.RandomState) or not isinstance(random_state,int): print("random_state is not to be accepted")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That runs counter to the use of
random_state
in Scikit-learn.random_date
is public, so it should accept all types ofrandom_state
that Scikit-learn accepts.If
random_date
were a private function, I wouldn't really careThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
random_state : int, RandomState instance or None, optional (default=None), these are values accepted by Scikit-Learn's
random_state
. I think I can check if the random_state is in neither of the accepted values(Scitkit and Numpy) and set is as the default None.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scikit-learn's
check_random_state
function will likely be useful: https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_random_state.htmlIt takes those values and produces the correct random seed generator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stsievert the accepted types of
random_state
in Scikit and Numpy appear to be the same.Refer this: Scikit's version also accepts Numpy's accepted values. Refer this: https://scikit-learn.org/dev/glossary.html#term-random_state