Clone fixest syntax parser #335
Comments
Hi Dave, this is definitely a very interesting suggestion! More concretely, I really like this syntax: pf.feols("y~x*"), but we'd have to be careful, as [...] More generally, it would be amazing if you were to take a look at formula parsing via [...] |
Thanks for pointing out the Wilkinson formula problem! I thought about it for a while, and maybe there are two alternatives:
Solution 1: Provide a helper function that assembles the variable list:
Xs = pf.varlist("X*") # Xs = "X1 + X2"
pf.feols(f"y ~ {Xs} + f1", data=df)
Solution 2: Parse the [...]
pf.feols("y ~ vlist(X*) + f1", data=df)
Then we can replace [...]
Solution 2 looks easier for users to learn, since it does not require any knowledge of f-strings. If we go with Solution 2, what should this function be called?
What do you think, Alex? |
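For illustration, Solution 1 above could be sketched roughly as follows. This is only a sketch under assumptions: `varlist` is a hypothetical helper, not an existing pyfixest function, and unlike the `pf.varlist("X*")` call proposed above it takes the DataFrame explicitly so that it knows which columns exist. Matching uses shell-style wildcards from the standard library.

```python
from fnmatch import fnmatchcase

import pandas as pd


def varlist(pattern: str, data: pd.DataFrame) -> str:
    """Hypothetical helper: join all columns matching a shell-style pattern with ' + '."""
    matches = [col for col in data.columns if fnmatchcase(str(col), pattern)]
    if not matches:
        raise ValueError(f"no column matches pattern {pattern!r}")
    return " + ".join(matches)


# Only the column names matter here, so an empty frame is enough.
df = pd.DataFrame(columns=["y", "X1", "X2", "f1"])

Xs = varlist("X*", df)
formula = f"y ~ {Xs} + f1"
print(formula)  # y ~ X1 + X2 + f1
```

The resulting string is an ordinary formula, so it could be passed to `pf.feols(formula, data=df)` without any change to the formula parser.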
Hi Dave, I like both solutions! I don't think that syntax as [...] For the immediate future, I'd be happy to accept any of the three solutions you have proposed! |
Unsolicited feedback: Personally, I strongly recommend against adopting any Stata-specific syntax above (R) fixest's native API if you can avoid it. It's obvs perfectly fine to write a wrapper package that does this for users that want it. (Or implement it as bonus functionality on top of the mimicked API). But similar to my comment here, I think you do a disservice to both packages, and the overall goal of this project, if the API begins to drift away from the upstream R fixest package. P.S. Ideally, a Stata user could simply consult https://stata2r.github.io/fixest/ and make the direct translation to PyFixest. That's the beauty of having a consistent API across both packages ;-) |
Hi @grantmcdermott, thank you so much for pointing out the https://stata2r.github.io/fixest/ website! It's really informative. So [...]
ctrls = c("age", "black", "hisp", "marr")
feols(wage ~ educ + .[ctrls], dat)
feols(wage ~ educ + ..('^x'), dat) # ^ = starts with
feols(wage ~ educ + ..('sp$'), dat) # $ = ends with
feols(wage ~ educ + ..('ac'), data)
I tried this in pyfixest:
import pyfixest as pf
df = pf.get_data()
Xs = ['X1', 'X2']
fit = pf.feols("Y ~ .[Xs] + f2", data=df) # FormulaSyntaxError
fit = pf.feols("Y ~ ('^X') + f2", data=df) # X1 and X2 not included
Do you think mimicking this syntax into [...] |
Yes, I think so @Wenzhi-Ding . It may be a bit trickier to match the exact syntax under Pythonic constraints. But I think you can get close enough that the translation is clear. |
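One way to get close to the fixest macros quoted above is to expand them into plain variable names before the formula reaches the parser. The sketch below is an assumption about how that could look, not how pyfixest or fixest implement it: `expand_macros` is a hypothetical function, the list lookup uses an explicit `context` dict, and the regex macro is matched against all column names without excluding the dependent variable.

```python
import re
from typing import Mapping, Sequence


def expand_macros(
    formula: str,
    columns: Sequence[str],
    context: Mapping[str, Sequence[str]],
) -> str:
    """Hypothetical preprocessor for fixest-style `.[name]` and `..('regex')` macros."""

    def expand_list(match: re.Match) -> str:
        # `.[name]` -> the variables stored under `name` in the context.
        return " + ".join(context[match.group(1)])

    def expand_regex(match: re.Match) -> str:
        # `..('regex')` -> every column whose name matches the regex.
        pattern = re.compile(match.group(2))
        hits = [col for col in columns if pattern.search(col)]
        if not hits:
            raise ValueError(f"no column matches {match.group(2)!r}")
        return " + ".join(hits)

    formula = re.sub(r"\.\[(\w+)\]", expand_list, formula)
    formula = re.sub(r"\.\.\((['\"])(.*?)\1\)", expand_regex, formula)
    return formula


columns = ["Y", "X1", "X2", "f1", "f2"]
from_list = expand_macros("Y ~ .[Xs] + f2", columns, {"Xs": ["X1", "X2"]})
from_regex = expand_macros("Y ~ ..('^X') + f2", columns, {})
print(from_list)   # Y ~ X1 + X2 + f2
print(from_regex)  # Y ~ X1 + X2 + f2
```

Because the expansion happens on the string, the R-style spelling stays recognizable to anyone coming from fixest, and the underlying formula parser does not need to know about the macros.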
I made a full review for both [...] Overall, I think syntax like [...]
My overall feeling is that we need to follow existing prevalent packages (to mitigate switching costs), but it is also worthwhile to provide alternatives that deviate slightly from the established path to reduce onboarding costs (so that new users can comprehend and remember the syntax quickly). And of course, these two ways can co-exist. R users can use [...]
In this sense, I would like to propose a consistent "function in the formula" design pattern to make [...]
For macros (referencing a list outside the formula):
controls = ["X1", "X2"]
pf.feols("Y ~ f2 + list(controls)", data=df)
pf.feols("Y ~ f2 + .[controls]", data=df) # fixest's syntax supported
For wildcards (matching variable names by pattern):
pf.feols("Y ~ f2 + vars(X*)", data=df) # support alias: v(X*)
pf.feols("Y ~ f2 + ..('ac')", data=df) # fixest's syntax supported
For lead, lag, difference, and logarithm difference in panel data:
pf.feols("Y ~ f2 + lead(X,1)", data=df, id_var="id", time_var="time") # support alias and fixest's syntax: f(X,1)
pf.feols("Y ~ f2 + lag(X,1)", data=df, id_var="id", time_var="time") # support alias and fixest's syntax: l(X,1)
pf.feols("Y ~ f2 + diff(X,1)", data=df, id_var="id", time_var="time") # support alias d(X,1) or fixest's syntax: d(X)
pf.feols("Y ~ f2 + logdiff(X,1)", data=df, id_var="id", time_var="time") # support alias: logd(X,1)
Categorical with specified baseline (this is a good design from [...]):
pf.feols("Y ~ f2 + i(X,ref=1)", data=df)
Categorical/continuous: by default, variables in the covariate part are continuous and variables in the fixed effects part are categorical. But users can also specify the variable type with a function.
pf.feols("Y ~ c(f2) + i(X1) | group^c(year)", data=df) # to control time trend for each group
Please let me know your opinion. In any case, I wouldn't try to implement all of them at once. And I will guarantee backward compatibility for any new syntax introduced, i.e., no existing script will be broken.
Here is a [...]
Reference: Please correct me if anything is wrong. Maybe we can add a webpage like this to smooth the switch for R and Stata users.
Models
Interactions
Standard errors
Presentation: These are APIs beyond the model formula. Not to be discussed in this thread.
Panel: No panel-set function for [...]
|
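To make the proposed lead, lag, and difference terms concrete, here is a minimal pandas sketch of the transformations they would stand for. It is an illustration only: `panel_shift` is a hypothetical helper, not a pyfixest function, and it assumes each unit has consecutive time periods without gaps, because the shift is done row by row within each `id_var` group.

```python
import pandas as pd


def panel_shift(
    data: pd.DataFrame, var: str, periods: int, id_var: str, time_var: str
) -> pd.Series:
    """Hypothetical helper: lag (periods > 0) or lead (periods < 0) within each unit."""
    ordered = data.sort_values([id_var, time_var])
    shifted = ordered.groupby(id_var)[var].shift(periods)
    # Align the result with the original row order of `data`.
    return shifted.reindex(data.index)


df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2],
        "time": [1, 2, 3, 1, 2, 3],
        "X": [10.0, 12.0, 15.0, 7.0, 8.0, 11.0],
    }
)

df["lag_X"] = panel_shift(df, "X", 1, id_var="id", time_var="time")    # lag(X,1)
df["lead_X"] = panel_shift(df, "X", -1, id_var="id", time_var="time")  # lead(X,1)
df["diff_X"] = df["X"] - df["lag_X"]                                   # diff(X,1)
print(df)
```

The first observation of each unit has no lag and the last has no lead, so those rows are missing; an estimator would have to drop them before fitting.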
Thanks for sharing your perspective @grantmcdermott! If I am honest, I wasn't really aware of [...]
@Wenzhi-Ding, thanks for looking into this in so much detail. It'll probably take me a little longer to think through all of it, but here are my initial 5 cents: Whenever possible, we should try to use [...]
import pandas
from formulaic import model_matrix, Formula

def my_transform(col: pandas.Series) -> pandas.Series:
    return col ** 2

# Manually add `my_transform` to the context
Formula("a + my_transform(a)").get_model_matrix(
    pandas.DataFrame({"a": [1, 2, 3]}),
    context={"my_transform": my_transform},  # could also use: context=locals()
)
This might, for example, be useful when we have to combine multiple variables into "interacted" fixed effects via [...]
On syntax, do I understand your suggestion correctly that you'd like to have both: compatibility with [...] |
Hi @Wenzhi-Ding, I just merged a PR in which I reworked the logic of model_matrix_fixest. I have moved away from [...] In general, I think it would be best to try to handle all additional formula syntax within [...] I also like the idea of adding a comparison page to the docs that compares [...] |
Hi Alex @s3alfisc, it's great to hear your update! I have been a bit busy recently and haven't yet looked at implementing the syntax mentioned above. I will build on your update once I have time (maybe one or two weeks from now...). As for the source of the syntax comparison, I borrowed it from stata2r (thanks @grantmcdermott for sharing this great resource; it helped me a lot in switching my project!). But I haven't tested the consistency of the results produced. (I have been working on my research project in recent days and found some inconsistency in clustered standard errors between [...] |
No worries and absolute 0 pressure! =)
On small discrepancies of clustered standard errors due to different small sample corrections, the fixest docs can't be beat. I don't know how exactly |
Hi Dave, after reworking the formula parsing, [...]
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import pyfixest as pf
data = pf.get_data()
fit1 = pf.feols('Y ~ i(f1)', data = data)
fit2 = pf.feols('Y ~ i(f1) + i(f2)', data = data)
fit3 = pf.feols('Y ~ i(f1, ref=1)', data = data)
fit4 = pf.feols('Y ~ i(f1, X1, ref=1)', data = data) |
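For readers unfamiliar with reference levels, the sketch below shows what a term like i(f1, ref=1) means in terms of the design matrix: one dummy column per level of the variable, with the reference level left out. It uses plain pandas and a hypothetical `dummies_with_ref` helper; it is not a description of how pyfixest builds these columns internally.

```python
import pandas as pd


def dummies_with_ref(series: pd.Series, ref) -> pd.DataFrame:
    """Hypothetical helper: one dummy per level of `series`, omitting level `ref`."""
    if ref not in set(series.dropna().unique()):
        raise ValueError(f"reference level {ref!r} not found in {series.name!r}")
    dummies = pd.get_dummies(series, prefix=series.name, dtype=float)
    # Dropping the reference column makes the remaining coefficients
    # contrasts against that level.
    return dummies.drop(columns=f"{series.name}_{ref}")


f1 = pd.Series([0, 1, 2, 1, 0], name="f1")
X = dummies_with_ref(f1, ref=1)
print(X)
```

With level 1 as the baseline, only the columns for levels 0 and 2 remain.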
Hi Alex, this is so great to see the update! I am sorry for this late reply. Got kind of busy last week. I will work based on your progress! |
Hi Alex,
I am not familiar with fixest but was a heavy user of Stata previously. Is there any approach to include a vector of similar variable names in the formula? For example:
reg y x*
This will include all variables like x1, x2, x3 ... in the regression model. Is it fine for pyfixest to have this syntax? Not sure whether fixest has similar syntax. If you think this is feasible, I can do it. Some syntax I propose is [...]
I guess syntax 1 will not break any existing syntax and is easy to learn. For syntax 2a-3b, I am not sure whether the convenience benefit outweighs the learning cost, since they add some complexity from the user's perspective. Maybe we can introduce syntax 1 first, and ignore the rest until several other users propose it?
By the way, I will add keep, drop, and exact_match to coefplot this week.
Best regards,
Dave