Report of reviewer 2 #3

Open · NelleV (Collaborator) opened this issue Nov 28, 2023 · 0 comments
Associate Editor: Nelle Varoquaux
Reviewer: (chose to remain anonymous)

Reviewing history

Paper submitted Aug 10, 2022
Reviews received Nov 21, 2022
Decision: major revision Nov 25, 2022
Paper revised Sep 18, 2023
Paper accepted Nov 20, 2023

Summary

The paper is concerned with model choice in the setting of intractable-likelihood
models (i.e., when the likelihood of the model is unavailable but
simulations from the hierarchical model can be drawn), also known as ABC
model choice. To this aim, the authors consider the Random Forest (RF)
methodology, following a previous work, which consists of generating
simulations from the different models under consideration and building an
RF to classify which model the samples were drawn from. The authors attempt to
improve this methodology by considering “local” RF methods, which guide
the construction of the RF by considering the observation on which
inference is required. This is intuitively good when, as in model choice, only
a single real-world observation is present. The authors both consider
existing local RF methods and introduce new ones.
As the authors honestly point out, the performance of local methods on the
considered examples is disappointing. Nonetheless, I believe it is good to
have an illustration of these results in the literature and I think that running
more experiments may uncover some setups in which the proposed
methodology outperforms standard RFs (see my comments below).
Moreover, the paper is a good overview of local RF methods.

Thank you for your time, your comments, and your constructive suggestions. We have addressed them in the revised version of the manuscript and give details below.

Main comments

Experiments

I think the range of experimental evaluation is slightly limited; I suggest
the authors add one more experimental setup.

We added a novel experimental setup (Section 8.3) in which we combine a fragmentation framework with a spherical data distribution to challenge the splitting rules of standard random forests. Unfortunately, even in this setup local methods did not outperform bagged CARTs or random forests.
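For illustration only, the following toy sketch (in R, not the code used in the manuscript, and with hypothetical parameter values) gives one way of simulating spherically distributed, fragmented classes of this kind, which axis-aligned splitting rules handle poorly:

```r
## Toy illustration only: two classes on concentric spheres, each restricted to a
## few octants to mimic fragmentation; the manuscript's Section 8.3 setup may differ.
set.seed(1)
simulate_spherical_class <- function(n, radius, noise = 0.05, keep_octants) {
  x <- matrix(rnorm(n * 3), ncol = 3)
  x <- radius * x / sqrt(rowSums(x^2))              # project onto the sphere of given radius
  x <- x + noise * matrix(rnorm(n * 3), ncol = 3)   # small jitter around the sphere
  octant <- 1 + (x[, 1] > 0) + 2 * (x[, 2] > 0) + 4 * (x[, 3] > 0)
  x[octant %in% keep_octants, , drop = FALSE]       # keep only some fragments
}
class1 <- simulate_spherical_class(3000, radius = 1, keep_octants = c(1, 4, 6, 7))
class2 <- simulate_spherical_class(3000, radius = 2, keep_octants = c(2, 3, 5, 8))
train  <- data.frame(rbind(class1, class2),
                     y = factor(rep(c("m1", "m2"), c(nrow(class1), nrow(class2)))))
## a standard RF baseline on this toy data could then be fitted with, e.g.,
## ranger::ranger(y ~ ., data = train, num.trees = 500)
```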

In the original paper on ABC model choice with RFs (Pudlo et al, 2016), a
second RF is used to estimate the model probabilities, after the first one is
used to assign the observation to a model. In the present paper, the
authors only do the first step. Could the local RF methodology be applied
to the second step too? Possibly, the local strategies may prove to
outperform standard RF in that case.

This is a good but still open question. The second RF used to estimate the model probabilities in Pudlo et al. (2016) is a regression random forest, for which implementation and comparison of local methods have not been performed. As we state in our discussion, many local approaches are already available for regression and the generalization of the others should not be an obstacle, so in principle the local RF methodology could very well be applied to the second step too.

Although some local methods are more costly than standard RFs (such as
the ones requiring 2 RFs to fit), others are less (such as the one building
RFs using only a subset of data). Do the authors think that considering an
experimental setup with the same computational budget for, say, RF
and NN-RF would change the results presented? That may lead to more
trees being used for NN-RF and therefore to better results for that
algorithm.

It is very difficult to evaluate the computational budget of each method. RF runtime is not linear in the number of points, and there is no rule linking the number of trees to the amount of training data.
Moreover, as now presented in Sections 8.2 and 9.3, RF is the fastest method, the bottleneck of all other methods being the preprocessing needed to weigh the individuals or the variables. For instance, the Nearest Neighbor approach takes 44 thousand times the runtime of RF, as identifying neighbors in large dimensions is very costly.
Still, we performed a comparison for NN-RF with 1000 neighbors using 100 or 200 trees, and the results were identical in both cases. With so few data points, 100 trees are enough to explore the variability of possible trees.
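For concreteness, the comparison we ran has the following form (a sketch with hypothetical object names, not our actual scripts), where the NN-RF is refit on the 1000 nearest neighbors of the single test point with either 100 or 200 trees:

```r
## Sketch only: "train" is a data.frame with numeric covariates and a factor column y,
## and "x_star" is a one-row data.frame holding the single test point.
library(ranger)

nn_rf_predict <- function(train, x_star, k = 1000, num.trees = 100) {
  covars <- setdiff(names(train), "y")
  ## squared Euclidean distances of all training points to the test point
  d2  <- rowSums(sweep(as.matrix(train[, covars]), 2, unlist(x_star[1, covars]))^2)
  sub <- train[order(d2)[seq_len(k)], ]              # the k nearest neighbors
  fit <- ranger(y ~ ., data = sub, num.trees = num.trees)
  predict(fit, data = x_star)$predictions
}

nn_rf_predict(train, x_star, num.trees = 100)        # identical to num.trees = 200 in our runs
```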

Text

I am a bit confused by the computational cost of the different local
methods. Particularly, at the end of Section 4, the authors say that a tree
built by local splitting rules may be much faster to construct and that
seems reasonable. However, in the experimental section, all methods
proposed in Section 4 are discarded saying they had “high computational
cost”. These two statements seem to disagree with one another. More
generally, I advise the authors to explicitly say how the local methodology
changes the computational cost of each of the proposed methods. Ideally,
a table comparing all the proposed methods would be provided.

We are sorry, this was indeed confusing and we hope our new formulation is now clearer. Localizing trees has two aspects. The first is that, at each split, only the leaf containing the local instance is considered, hence the number of criteria to compute is greatly reduced. Comparing the computational cost of local and global trees with an identical criterion should therefore lean in favor of the local tree. However, the second aspect is the criterion itself. Classification RFs use a Gini-based criterion, which only requires estimating the proportion of each class in each leaf, a computation that is highly efficient in any standard language. When the criterion is modified to an information gain or a kernel-based Gini criterion, it requires computing one weight per training data point in the leaf, which can be very burdensome. This is all the more true since, given our first results, we have not optimized our code for faster computations.
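To make the contrast concrete, the difference between the two kinds of criteria can be sketched as follows (an illustration only, not our implementation nor the exact kernel criterion of Section 4.3):

```r
## Standard Gini impurity of a node: only the class proportions are needed.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

## A kernel-weighted variant: every training point in the node receives a weight
## depending on its distance to the instance of interest x_star, and the class
## proportions are replaced by weighted proportions -- one weight per point.
gini_kernel <- function(X, y, x_star, bandwidth = 1) {
  d2 <- rowSums(sweep(X, 2, x_star)^2)              # one distance per training point
  w  <- exp(-d2 / (2 * bandwidth^2))                # Gaussian kernel weights
  p  <- tapply(w, y, sum) / sum(w)                  # weighted class proportions
  p[is.na(p)] <- 0                                  # classes absent from the node
  1 - sum(p^2)
}
```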

As suggested, we have now included runtime comparisons at two locations in the paper. In the small second Gaussian example (Table 8.3), we report the runtime (in seconds) of each method in the last column. In the spherical fragmented example (Table 8.4), we compare the runtime of each method, evaluated on a single test point, with respect to the classic RF runtime. This is fairer to methods that perform one preprocessing step per test point, since in practice local methods are of interest when only one or very few observations are to be classified.

Comments on specific parts

Abstract

“performances are quite good even if no tuning is performed”: I would
change this text to something like “they perform well even with little or
no tuning”

We have modified this sentence as suggested.

the last sentence of the abstract seems to refer only to the new methods
proposed by the authors, but I think, after reading through the paper, it
refers to all local methods.

You are right; we have modified that. We have also modified the abstract to lighten the weight of ABC and focus directly on random forests.

Introduction

In the first paragraph, the authors talk about “eager learning”. I have never
heard this nomenclature before; can they provide a reference to where
it is introduced?

It is hard to trace back the introduction of the eager learning nomenclature. It was likely defined in opposition to instance-based or lazy learning, for which we refer to Aha et al. for the first introduction.

Fourth paragraph: remove “The” at the start.

This has been corrected.

Section 4

Maybe introduce this section with a sentence like “we now turn to discuss local tree methods” or similar.

We have modified this sentence as suggested.

second paragraph: “putting aside interesting data” -> "discarding data
relevant for the considered example at early stages of the tree. " or similar

We have modified this sentence as suggested.

“It is interesting to note that building a local tree by modifying its internal construction results in building a path.” I don’t get this statement. It seems to say that building a local tree corresponds to identifying a path inside a global tree, but I don’t think that is the case, even according to the following text.

What we mean is that a traditional tree will split each node until a certain stopping criterion is reached (typically a minimal entropy, or a minimal number of data points in the leaf). This is not the case for a local tree, since at each recursion the algorithm splits a node into two leaves, entirely discards the leaf that does not contain the data of interest, and continues the splitting process on the node of interest only. Hence only one path, or one trajectory, is visited within the tree. We have modified the paragraph in the hope of making our statement clearer.
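To make this concrete, here is a minimal sketch (illustrative R code, not our implementation) of the path construction: at each step the current node is split, the child not containing the data of interest is discarded, and the recursion continues in the remaining child only.

```r
## Illustrative only: builds the single root-to-leaf path followed when localizing
## the tree construction around the instance of interest x_star.
gini <- function(y) { p <- table(y) / length(y); 1 - sum(p^2) }

local_path <- function(X, y, x_star, min_size = 10) {
  path <- list()
  idx  <- seq_len(nrow(X))
  while (length(idx) > min_size && length(unique(y[idx])) > 1) {
    best <- NULL
    for (j in seq_len(ncol(X))) for (s in unique(X[idx, j])) {
      left <- X[idx, j] <= s
      if (!any(left) || all(left)) next
      ## standard Gini gain; only the recursion is localized in this sketch
      gain <- gini(y[idx]) -
        (mean(left) * gini(y[idx][left]) + mean(!left) * gini(y[idx][!left]))
      if (is.null(best) || gain > best$gain) best <- list(j = j, s = s, gain = gain)
    }
    if (is.null(best)) break
    go_left <- x_star[best$j] <= best$s
    keep <- if (go_left) X[idx, best$j] <= best$s else X[idx, best$j] > best$s
    idx  <- idx[keep]                       # only the child containing x_star is kept
    path <- c(path, list(best))
  }
  list(path = path, leaf_classes = table(y[idx]))   # the prediction uses this leaf only
}
```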

Section 4.1

I think $K$ refers to the number of models, but I don’t recall it being
introduced before.

Thank you for your careful reading, we have now defined this notation.

Third paragraph: I think the discussion on impurity measures not using
the class is important. I think however the idea of using weights taking
into account the class for determining a split can also be used for global
trees, or am I missing something? If that is the case, maybe, the authors
could point this out here, just for the sake of clarity.

In theory it would of course be possible to use weights in global trees as well. However, the crux of the matter with local trees is that the information gain only takes into account the mother node and one of the daughter nodes, whereas in a global tree framework the information gain of both daughter nodes is used. Hence, in our 20-80 switch proportion example, such a situation could only be reached if the second daughter were almost pure, hence of very low impurity. Among all possible cuts, this one would be very competitive, so there is no need to introduce extra weights.
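Schematically, and with notation that is only meant to be suggestive rather than that of the manuscript, the two situations can be written as

$$\mathrm{IG}_{\text{global}} = I(\text{mother}) - \tfrac{n_L}{n}\, I(\text{left daughter}) - \tfrac{n_R}{n}\, I(\text{right daughter}), \qquad \mathrm{IG}_{\text{local}} = I(\text{mother}) - I(\text{daughter containing } \xs),$$

so that in the local case the quality of a split is judged without looking at how pure the discarded daughter is.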

First bullet point (about discretisation) I am confused here, especially by the discussion of being located on the borders. Can the authors please revise the text in this paragraph?

What we meant is that splits on a noise variable may lead to pure leaves even though the variable is not informative, and this may in turn lead to a wrong classification. An example of such a situation is depicted in the figure below, where x1 is a noise variable whereas x2 is very informative. Yet a split along x1 will result in a pure leaf with the sky-blue label, and thus an early stop of the algorithm, even though the data of interest should most likely have been predicted with the purple label. Discretizing the variable would most likely result in bigger, hence non-pure, leaves, so that the algorithm would continue with a further split along another variable.


Figure 1: An illustrative classification problem with 2 classes (purple and sky blue), containing an informative covariate (x2), a non-informative covariate (x1), and an unlabeled data point to classify (black star). Splitting along x1 will result in a pure leaf with the sky-blue label.

Third bullet point: I don’t get this, is this meant to say that the covariates of $\xs$ end up being the thresholds of the different cuts? That does not seem to make much sense, as how would one follow along the tree if its entries exactly correspond to the thresholds?

Thank you for your careful reading; the covariates of the data of interest should precisely NOT end up being the thresholds of the different cuts. Instead, as the covariates are now categorical, the only allowed splits are along category values that are different from those of the data of interest, and each of those possible splits induces two leaves: one leaf with the data points whose covariate is equal to the threshold, and one leaf with the other data points. The local information gain is then computed using only the mother node and this second leaf, as by definition the data of interest cannot belong to the first leaf.
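For clarity, the scheme just described can be sketched as follows (an illustration with a hypothetical helper, not the LazyDT implementation):

```r
## Illustrative only: candidate splits for one categorical covariate x, given the
## category x_star_value taken by the instance of interest.
impurity <- function(y) { p <- table(y) / length(y); 1 - sum(p^2) }

local_categorical_gains <- function(x, y, x_star_value) {
  thresholds <- setdiff(unique(x), x_star_value)    # categories different from that of x_star
  sapply(thresholds, function(v) {
    in_other_leaf <- x != v      # the daughter leaf that contains the instance of interest
    ## local gain computed from the mother node and this daughter only
    impurity(y) - impurity(y[in_other_leaf])
  })
}
```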

Section 4.2

last paragraph: after introducing Armano and Tampone (2018), I believe
the authors should use “that work” instead of “this work” to refer to
Armano and Tampone (2018). Otherwise, I am confused as to whether
they are talking about the current work or Armano and Tampone (2018).

This has been modified.

Section 4.3

after the expression for : “the node (4.2)” I think that should be “the
daughter node”?

Yes thank you, we meant that the weights in each daughter node were modified as in (4.2).

“The major benefit of such weights is that they do not depend on the covariate index, thus the usual tree prediction, i.e. the majority class at the leaf where $\xs$ falls, can be replaced by a more coherent strategy with the tree construction, using as a prediction the class with the maximal weighted class proportion at the leaf.” Why can’t this be done for the univariate kernel approach as well?

In the univariate approach, the weighted frequency of a given class label depends on the covariate. When the final prediction is made at the final leaf, there is no reason to choose one covariate over another. It would however be possible to compute a multivariate weighted frequency at the final leaf, at the price of additional computations, but this is something we have not tried.

Overall, Sections 4.2 and 4.3 describe alternative ways to take into account information about $\xs$ to build the tree, but could those be combined with 4.1? I am unsure.

In theory some ideas could be borrowed from one strategy for the other, but these two approaches are radically different and we are not sure it would make sense. For instance, LazyDT works with categorical covariates, whereas a Gaussian kernel seems more appropriate for continuous variables. We could modify the information gain (4.3) of the kernel approaches to consider only the daughter node where the data of interest falls, but we would then have to propose a new normalisation to avoid negative gains and induce discriminatory power, which would likely lead to more computations involving kernels. As they are, the kernel and LazyDT approaches are the most computationally expensive approaches we have tested, so it is likely that combining them would result in an even greater computational burden.

Section 5.2

The authors use “NN” but I don’t recall this abbreviation being defined
before (although I understand it refers to Nearest Neighbors).

We apologize for the confusion and have now defined NN in the text.

Section 7.1

after Eq 7.1, some ??? are present.

We apologize about that and have corrected the reference in the manuscript.

Section 8

In the last paragraph, can you please cite the docs for the R package used for random forests?

We apologize about that; there seems to have been a missing citation for the ranger R package.

Section 8.1

In the caption of Table 8.2, I think the authors should say something like “100 additional noise variables” as, if I understood correctly, 20 noise variables are already present in the setup used for Table 8.1, and here we are adding 100 more. The same holds for Table 8.4.

Thank you, we have clarified.

Section 10

The authors summarize the findings and properties of the local methods
in the second and third paragraphs; I suggest they link the sections where they were introduced when discussing each method, as I find myself having to do some back-and-forth to recall which method is which.

We apologize for the inconvenience and thank you for the suggestion. Links to each relevant section have now been added.
