Template for random forest models #43
Comments
In modelmanager.save_supplemental_object, change
That avoids having to write a to_pickle attribute for every step that requires pickling.
Hmmm, good point. From some quick googling, it looks like So no objections to switching to |
An additional potential issue: fit and predict in sklearn have a different structure from the ones in previous steps (OLSRegression, etc.). Sklearn uses matrices for the input and target data
while OLSRegression and others use
we could:
The challenge is to find a way to do option 2 without writing ugly code. Note that class inheritance is not recommended, since the input types differ between the parent class's fit/predict and the modified fit/predict. See what I have done in utils.py
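To make the composition-over-inheritance point concrete, here is a minimal sketch of a wrapper that holds an array-based estimator and exposes DataFrame-based fit/predict. This is not the code in utils.py: `ArrayModelAdapter`, `MeanRegressor`, and the `target`/`features` parameters are hypothetical names made up for illustration, and `MeanRegressor` only stands in for any estimator with sklearn-style `fit(X, y)` and `predict(X)` methods.

```python
import numpy as np
import pandas as pd


class MeanRegressor:
    """Stand-in for an sklearn-style estimator: fit(X, y) / predict(X) on arrays."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


class ArrayModelAdapter:
    """Hypothetical wrapper giving an array-based estimator a DataFrame interface.

    Composition rather than inheritance: the wrapped estimator keeps its own
    fit(X, y) / predict(X) signatures, so no method is overridden with
    different input types.
    """

    def __init__(self, estimator, target, features):
        self.estimator = estimator
        self.target = target            # name of the column to predict
        self.features = list(features)  # names of the input columns

    def fit(self, df):
        X = df[self.features].to_numpy()
        y = df[self.target].to_numpy()
        self.estimator.fit(X, y)
        return self

    def predict(self, df):
        X = df[self.features].to_numpy()
        # Return a Series aligned with the index of the input table
        return pd.Series(self.estimator.predict(X), index=df.index)


df = pd.DataFrame({"price": [100.0, 200.0, 300.0], "sqft": [50, 80, 120]})
model = ArrayModelAdapter(MeanRegressor(), target="price", features=["sqft"])
predictions = model.fit(df).predict(df)
```

Any object with the same two array-based methods, such as a scikit-learn regressor, could presumably be passed in place of `MeanRegressor`.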
Ah, right, interesting. Here's how I've been thinking about this kind of thing: The underlying stats libraries we're using need a variety of different data formats, typically either a dataframe, a numpy array, or separate x and y numpy arrays. When we run simulations, we need to abstract out the data management so that it happens somewhere else. The Orca library is the interface between the templates and the data layer: we request data by table name and column name, and get it back as pandas objects. (The data might be coming from memory, from disk, from a database, etc.) There are also some extra data conventions we're trying to maintain in the templates, from earlier urbansim implementations: any time you request a table, you can (1) merge other tables onto it on the fly, and (2) apply a list of filters to it. This functionality is mostly generalized into TemplateStep._get_df(). (Using Patsy-format model expressions is another convention, but you're right in the PR #50 discussion that this is probably related to Statsmodels supporting it directly. If it's too hard to map onto Scikit-learn models, then it's probably ok to drop it for these templates.) This is a clever solution you've implemented with
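As a rough illustration of the "merge on the fly, then filter" convention described above, here is a minimal pandas sketch. It is not the implementation of TemplateStep._get_df(): the function `get_filtered_table`, its parameters, and the plain dict standing in for the Orca data layer are all assumptions made for this example.

```python
import pandas as pd


def get_filtered_table(tables, name, merge_with=None, on=None, filters=None):
    """Hypothetical helper: fetch a table by name, then merge and filter it.

    `tables` is a plain dict of DataFrames standing in for the data layer.
    """
    df = tables[name]
    # (1) Merge other tables onto the requested one
    for other in merge_with or []:
        df = df.merge(tables[other], on=on, how="left")
    # (2) Apply each filter expression in turn
    for expr in filters or []:
        df = df.query(expr)
    return df


tables = {
    "buildings": pd.DataFrame({"parcel_id": [1, 2, 3], "price": [100, 250, 400]}),
    "parcels": pd.DataFrame({"parcel_id": [1, 2, 3], "zone": ["a", "b", "a"]}),
}

result = get_filtered_table(
    tables,
    "buildings",
    merge_with=["parcels"],
    on="parcel_id",
    filters=["price > 150", "zone == 'a'"],
)
```

Keeping this logic in one place means a model step only has to name its table, merges, and filters, and never touches where the data actually lives.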
We could use helper functions in TemplateStep to convert the data into a format compatible with sklearn's fit and predict methods. The advantage of re-engineering those methods, however, is that whenever we call model.fit() or model.predict() we do not have to worry about whether it is a statsmodels method, an sklearn method, or something else. The cross_validate_score() helper is an example: it works whether the step is OLSRegression or RandomForest because it relies on a common structure for model.fit() and model.predict().
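The benefit of a common fit/predict structure can be sketched with a small cross-validation routine that never inspects the model type. This is not the cross_validate_score() helper itself: `kfold_rmse`, `MeanModel`, and `LineModel` are hypothetical names, and the two toy models only stand in for steps such as OLSRegression or RandomForest, assuming each takes a DataFrame in fit() and predict().

```python
import numpy as np
import pandas as pd


class MeanModel:
    """Toy model: always predicts the training mean of the target column."""

    def __init__(self, target):
        self.target = target

    def fit(self, df):
        self.mean_ = df[self.target].mean()
        return self

    def predict(self, df):
        return pd.Series(self.mean_, index=df.index)


class LineModel:
    """Toy model: least-squares line through one feature."""

    def __init__(self, target, feature):
        self.target = target
        self.feature = feature

    def fit(self, df):
        self.slope_, self.intercept_ = np.polyfit(df[self.feature], df[self.target], 1)
        return self

    def predict(self, df):
        return self.slope_ * df[self.feature] + self.intercept_


def kfold_rmse(model, df, target, k=3):
    """Hypothetical helper: per-fold RMSE using only model.fit(df) / model.predict(df)."""
    folds = np.array_split(np.arange(len(df)), k)
    scores = []
    for test_idx in folds:
        test = df.iloc[test_idx]
        train = df.drop(df.index[test_idx])
        pred = model.fit(train).predict(test)
        scores.append(float(np.sqrt(np.mean((test[target] - pred) ** 2))))
    return scores


df = pd.DataFrame({"sqft": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
df["price"] = 2.0 * df["sqft"] + 1.0  # exactly linear toy data

line_scores = kfold_rmse(LineModel("price", "sqft"), df, "price")
mean_scores = kfold_rmse(MeanModel("price"), df, "price")
```

The same `kfold_rmse` call works for both models because it only depends on the shared method structure, which is the property the comment above argues for.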
I'm setting up an issue for the random forest template that @Gitiauxx is working on! Tagging @waddell and @Arezoo-bz for feedback and additional guidance on use cases for the template.
Goals
Create a model step template for random forest regression models. This will be used for tasks like real estate price prediction, along the lines of this notebook: REPM_Random_Forest.ipynb
Tasks