Ability to set percentage influence of each function in function score query #15670

Closed
Vineeth-Mohan opened this issue Dec 26, 2015 · 8 comments
Labels
discuss :Search/Search Search-related issues that do not fall into other categories

Comments

@Vineeth-Mohan

The function score query is a good facility for implementing various aspects of scoring, but it does not give precise control over how much each function influences the result.
For example, take the query below:

{
  "query": {
    "function_score": {
      "functions": [
        {
          "decay": {
            "gauss": {
              "date": {
                "origin": "2013-09-17",
                "scale": "10d",
                "offset": "5d",
                "decay": 0.5
              }
            }
          }
        },
        {
          "field_value_factor": {
            "field": "popularity",
            "factor": 1.2,
            "modifier": "sqrt",
            "missing": 1
          }
        },
        {
          "random_score": {}
        },
        {
          "script_score": {
            "script": {
              "lang": "lang",
              "params": {
                "param1": 2,
                "param2": 3
              },
              "inline": "_score * doc['rating'].value / pow(param1, param2)"
            }
          }
        }
      ]
    }
  }
}

There are four functions, and together they dictate the final score. Any one of them, such as the script_score function, can swallow all of the influence on the score. For example, the value of script_score might be in the range 1000 to 2000 while the value of the decay function is between 0 and 1. The influence of the decay function then barely reaches the final score; script_score dominates, and the other functions have little or no influence.
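
To make the effect concrete, here is a rough Python sketch. It assumes the function outputs are added together (score_mode sum), and the three documents and their values are made up, chosen only to match the ranges mentioned above.

```python
# Hypothetical per-document function outputs; the ranges follow the
# example in the text (script_score in 1000..2000, decay in 0..1).
docs = {
    "a": {"decay": 0.95, "script_score": 1200.0},
    "b": {"decay": 0.10, "script_score": 1201.0},
    "c": {"decay": 0.50, "script_score": 1100.0},
}


def combined(scores):
    # Assumed combination: add the function outputs (score_mode "sum").
    return sum(scores.values())


# Rank documents by the combined score and by each function on its own.
by_sum = sorted(docs, key=lambda d: combined(docs[d]), reverse=True)
by_script = sorted(docs, key=lambda d: docs[d]["script_score"], reverse=True)
by_decay = sorted(docs, key=lambda d: docs[d]["decay"], reverse=True)

print(by_sum, by_script, by_decay)
```

With these numbers the combined ranking is identical to the script_score ranking, and the decay ordering has no visible effect on it.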

To fix this, it might be useful to have an influenceScore factor per function that says what percentage of the final score that function should influence.
For example, the above query could be rewritten as:

{
  "query": {
    "function_score": {
      "functions": [
        {
          "influenceScore": "40%",
          "decay": {
            "gauss": {
              "date": {
                "origin": "2013-09-17",
                "scale": "10d",
                "offset": "5d",
                "decay": 0.5
              }
            }
          }
        },
        {
          "influenceScore": "30%",
          "field_value_factor": {
            "field": "popularity",
            "factor": 1.2,
            "modifier": "sqrt",
            "missing": 1
          }
        },
        {
          "influenceScore": "10%",
          "random_score": {}
        },
        {
          "influenceScore": "20%",
          "script_score": {
            "script": {
              "lang": "lang",
              "params": {
                "param1": 2,
                "param2": 3
              },
              "inline": "_score * doc['rating'].value / pow(param1, param2)"
            }
          }
        }
      ]
    }
  }
}

Here we have an influenceScore per function, which dictates the influence of each function. This would help in further fine-tuning the score.

@s1monw
Contributor

s1monw commented Dec 28, 2015

Can't you just use the weight attribute of a function? Instead of influenceScore: 20% you do weight: 0.2.
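
For reference, a small Python sketch of what that could look like when building the request body. The helper name `weighted` is made up for illustration, and the field name is borrowed from the query in the issue description. Note that weight only multiplies the raw output of its function; it does not rescale that output to a common range.

```python
import json


def weighted(function_body, weight):
    # Hypothetical helper: copy one function_score entry and add "weight"
    # as a sibling of the function type key.
    entry = dict(function_body)
    entry["weight"] = weight
    return entry


query = {
    "query": {
        "function_score": {
            "functions": [
                weighted(
                    {
                        "field_value_factor": {
                            "field": "popularity",
                            "factor": 1.2,
                            "modifier": "sqrt",
                            "missing": 1,
                        }
                    },
                    0.3,
                ),
                weighted({"random_score": {}}, 0.1),
            ],
            "score_mode": "sum",
        }
    }
}

print(json.dumps(query, indent=2))
```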

@s1monw
Contributor

s1monw commented Jan 7, 2016

@Vineeth-Mohan ping

@Vineeth-Mohan
Author

Hello @s1monw,

Let me walk through the motivation here.
Let's say I am running the following query:

{
  "explain": true,
  "query": {
    "function_score": {
      "functions": [
        {
          "field_value_factor": {
            "field": "dateOfJoining",
            "modifier": "sqrt",
            "missing": 1
          }
        },
        {
          "random_score": {}
        }
      ],
      "score_mode": "sum"
    }
  }
}

With this, I am seeing the following results:

{
  "_explanation": {
    "value": 1113172.4,
    "description": "function score, product of:",
    "details": [
      {
        "value": 1,
        "description": "ConstantScore(*:*), product of:",
        "details": [
          {
            "value": 1,
            "description": "boost"
          },
          {
            "value": 1,
            "description": "queryNorm"
          }
        ]
      },
      {
        "value": 1113172.4,
        "description": "Math.min of",
        "details": [
          {
            "value": 1113172.4,
            "description": "function score, score mode [sum]",
            "details": [
              {
                "value": 1113172.2,
                "description": "function score, product of:",
                "details": [
                  {
                    "value": 1,
                    "description": "match filter: *:*"
                  },
                  {
                    "value": 1113172.2,
                    "description": "field value function: sqrt(doc['dateOfJoining'].value?:1.0 * factor=1.0)"
                  }
                ]
              },
              {
                "value": 0.17271471,
                "description": "function score, product of:",
                "details": [
                  {
                    "value": 1,
                    "description": "match filter: *:*"
                  },
                  {
                    "value": 0.17271471,
                    "description": "random score function (seed: 519896482)"
                  }
                ]
              }
            ]
          },
          {
            "value": 3.4028235e+38,
            "description": "maxBoost"
          }
        ]
      },
      {
        "value": 1,
        "description": "queryBoost"
      }
    ]
  }
}

As you can see, the score from field_value_factor always overshadows the score given by random_score, so random_score has no relevance here.

My motivation for this issue came from this problem.
One solution would be to use the weight to normalize the values, and that is how it's currently done.
But looking into the range of values for each function and then working out the weights for all the functions manually is hard. And weights computed manually might not be applicable across all documents.

The percentage suggestion was based on this, but I am finding it difficult to pen the maths behind it. The only solution I found was to find the range of the score given by each function across all documents and use that for percentage influence. But as scoring is per document, that won't be feasible.

Let me know your thoughts on the subject.
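
One possible way to write down the range-based idea is min-max scaling per function followed by a percentage-weighted sum. The Python sketch below is only that, a sketch: the per-function ranges are hypothetical and would have to be known in advance, which is exactly the part that is hard to get at scoring time. The two raw values are taken from the explain output above.

```python
def normalize(value, lo, hi):
    # Min-max scaling onto [0, 1]; a constant function maps to 0.
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def influence_score(values, ranges, influence):
    # values:    function name -> raw output for one document
    # ranges:    function name -> (min, max) across the index (assumed known)
    # influence: function name -> share of the final score, summing to 1.0
    return sum(
        influence[name] * normalize(values[name], *ranges[name])
        for name in values
    )


values = {"field_value_factor": 1113172.2, "random_score": 0.17271471}
ranges = {  # hypothetical ranges, not taken from any real index
    "field_value_factor": (1000000.0, 1200000.0),
    "random_score": (0.0, 1.0),
}
influence = {"field_value_factor": 0.7, "random_score": 0.3}

score = influence_score(values, ranges, influence)
print(score)
```

After scaling, both functions live on 0 to 1, so the 70/30 split caps what each one can contribute to the final score.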

@s1monw
Contributor

s1monw commented Jan 8, 2016

@Vineeth-Mohan I can see what you are saying and I admit it can be challenging. I personally don't see a good general way to apply normalization here. I see the function score feature as a toolset of primitives that lets / forces the user to ensure that each element of the equation has its relevant weight etc. I wonder if others, e.g. @brwe, have some ideas?

@brwe
Contributor

brwe commented Jan 8, 2016

It seems to me this is a case of "learning to rank". To find proper weights you would need to know what the expected ordering of results for different queries would be and then tune the weights accordingly. Without that, the only thing you can do now is guess.
We currently have no way to scale functions so that they are comparable, either. This is something you will have to do in advance. In case you don't know, aggregations can help with that; see the example below. Other than that, we currently have no support for tuning the weights automatically.

{
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {},
          "weight": 1000
        }
      ]
    }
  },
  "aggs": {
    "score_agg": {
      "histogram": {
        "script": "_score",
        "interval": 50
      }
    },
    "score_stats": {
      "extended_stats": {
        "script": "_score"
      }
    }
  }
}
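
Once the stats for each function are known, turning them into weights could look roughly like the Python sketch below. It assumes each function was measured on its own with an aggregation like the one above, and that only the "max" value of the stats is used; the numbers and the helper name `weight_for` are hypothetical.

```python
def weight_for(stats, target_max):
    # Pick a weight so the largest observed output maps to target_max.
    # `stats` mimics the "score_stats" result; only "max" is read here.
    if stats["max"] <= 0:
        raise ValueError("need a positive max to derive a weight")
    return target_max / stats["max"]


stats_by_function = {  # hypothetical numbers, one measurement per function
    "field_value_factor": {"min": 1000000.0, "max": 1200000.0},
    "random_score": {"min": 0.0, "max": 1.0},
}
targets = {"field_value_factor": 0.7, "random_score": 0.3}

weights = {
    name: weight_for(stats_by_function[name], targets[name])
    for name in targets
}
print(weights)
```

Because weight is a plain multiplier, this can only rescale by the maximum; it cannot subtract the minimum, so a function whose values sit in a narrow band far from zero still contributes an almost constant amount.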

@gkop

gkop commented Feb 9, 2016

@brwe another benefit of what's proposed here, if I understand correctly, is that one could use score_mode avg, weighted by influenceScore, to generate scores nicely distributed on a range. This can be handled now in the client by passing influenceScore as a param to our script (which multiplies it by the nicely distributed intermediate score) and keeping a running sum of the influence scores, but it would be quite amazing if the server took care of it for us instead.

In fact, on reading the docs at https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html , I initially thought we could pass weight as an option to any kind of function_score score function to obtain this behavior; that just made sense to me. Alas, I misunderstood.
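
The arithmetic of that client-side workaround, as I understand it, is an influence-weighted average. A minimal Python sketch, assuming the intermediate scores are already on a comparable range such as 0 to 1 (the names and numbers are made up):

```python
def weighted_average(scores, influence):
    # Running sum of the influence scores for the functions that were applied.
    total = sum(influence[name] for name in scores)
    if total == 0:
        raise ValueError("influence scores must not sum to zero")
    # Each intermediate score is multiplied by its influence, then averaged.
    return sum(influence[name] * scores[name] for name in scores) / total


scores = {"decay": 0.8, "popularity": 0.5, "random": 0.2}  # assumed 0..1
influence = {"decay": 40, "popularity": 30, "random": 10}

result = weighted_average(scores, influence)
print(result)
```

Dividing by the running sum means the influence values do not have to add up to 100, and the result stays within the range of the intermediate scores.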

@clintongormley
Contributor

The only other thing I could suggest is to apply a min/max score to each function, e.g. you could force gauss to be between 0 and 2. With that, the weights would be easier to adjust.
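
A rough Python sketch of the arithmetic behind that suggestion, not an existing function_score option as far as this thread shows. The bounds and weights are hypothetical; the raw values are the two from the explain output earlier in the thread.

```python
def clamp(value, lo, hi):
    # Force a function's output into [lo, hi].
    return max(lo, min(hi, value))


raw = {"field_value_factor": 1113172.2, "random_score": 0.17271471}
bounds = {"field_value_factor": (0.0, 2.0), "random_score": (0.0, 1.0)}
weights = {"field_value_factor": 0.5, "random_score": 0.5}  # hypothetical

# Clamp each function first, then apply its weight and add the results.
bounded = {name: clamp(raw[name], *bounds[name]) for name in raw}
total = sum(weights[name] * bounded[name] for name in raw)
print(bounded, total)
```

With bounded outputs, each weight directly caps how much its function can add. The cost is that every value above the upper bound collapses to the same score.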

@clintongormley added the :Search/Search label and removed the :Query DSL label on Feb 14, 2018
@mayya-sharipova
Contributor

mayya-sharipova commented Mar 21, 2018

Closing this in favour of #27588, where one of the desired features could be to normalize scores


6 participants