Geolocation and user language extraction analysis: issue #37 #100
Comments as I go:
- Very nice write-ups!
- Great work getting to grips with dask.
- You mentioned that navigator.language can be both get and set. Your write-up implies all the calls are get. You could have used the `operation` column to verify this.
- Nice printing out of results, using variables. Not necessary, but I have fallen in love with `f''` strings for doing this kind of reporting of results as I find it makes the text generation more natural and easier to review.
- The geolocation symbol is not one I've explored, and your analysis has certainly got me intrigued about it.
- There's no need to have your write-up in both the readme file and at the bottom of your notebook - just one place is good - and less likely to lead to them getting out of sync later.
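A rough sketch of the `operation` check and the f-string reporting mentioned in the comments above. It assumes the calls sit in a dataframe with `symbol` and `operation` columns; the toy rows and the exact symbol strings are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the real data; column names and values are assumptions.
df = pd.DataFrame({
    "symbol": [
        "window.navigator.language",
        "window.navigator.language",
        "window.navigator.language",
        "window.navigator.geolocation",
    ],
    "operation": ["get", "get", "set", "get"],
})

# Keep only the calls to the symbol of interest.
language_calls = df[df["symbol"] == "window.navigator.language"]

# Count how many of those calls are reads versus writes.
operation_counts = language_calls["operation"].value_counts()

n_total = len(language_calls)
n_get = int(operation_counts.get("get", 0))
n_set = int(operation_counts.get("set", 0))

# An f-string keeps the numbers and the sentence together in one place.
report = (
    f"{n_get} of {n_total} navigator.language calls are get "
    f"({n_get / n_total:.1%}); {n_set} are set."
)
print(report)
```

With dask the same expressions should largely carry over, with a `.compute()` added where a concrete result is needed.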
Overall:
- This is a really good start.
- The main flaw in your current work is that you're counting rows, which is not the same as what you're saying in the text of your results. What you're saying in the text of your results is the right way to think about this, so don't change your text to match the code :-) Instead, update your code so that it counts what you expect it to count. More comments on this in my comments on the readme.
- The next step is to start linking this to a bigger picture. We can count lots of things. How does this relate to tracking / what are you trying to get out of this. You did start to explain things, but I was left wanting to know where you're going. Reading that a user is `en-US` so that, as you say, they can be provided with localized content is not necessarily tracking.
- One final thought - if you're trying to look at differences in content delivery is this the dataset to do it? If not, why not? What would you change about the data collection? What other data would you like?
## Inference:
Out of the total of __11292867__ websites / locations in this dataset __72304__ (0.64%) websites were found to be checking for preferred language of the user, usually the language of the browser UI, and their subsequent location/scripts can be found in the `language_pref_df` dataframe.
There are not 11,292,867 websites in this dataset. There are 11,292,867 rows, each of which is a JS call. The number of locations can be found by `df.location.nunique()` or the number of scripts by `df.script_url.nunique()`.
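To make the rows-versus-websites distinction concrete, here is a minimal sketch on toy data. The `location` and `script_url` columns come from the comment above; the `symbol` column name, the symbol strings, and the URLs are assumptions for illustration only.

```python
import pandas as pd

# Toy rows: each row is one JS call, not one website. Values are made up.
df = pd.DataFrame({
    "location": [
        "https://a.example/", "https://a.example/",
        "https://b.example/", "https://c.example/",
    ],
    "script_url": [
        "https://cdn.example/lib.js", "https://cdn.example/lib.js",
        "https://cdn.example/lib.js", "https://c.example/app.js",
    ],
    "symbol": [
        "window.navigator.language", "window.navigator.language",
        "window.navigator.language", "window.document.cookie",
    ],
})

n_rows = len(df)                          # number of JS calls
n_locations = df["location"].nunique()    # number of distinct pages
n_scripts = df["script_url"].nunique()    # number of distinct scripts

# Distinct locations with at least one navigator.language call.
is_language = df["symbol"] == "window.navigator.language"
n_language_locations = df.loc[is_language, "location"].nunique()

share = n_language_locations / n_locations
summary = (
    f"{n_rows} calls, {n_locations} locations, {n_scripts} scripts; "
    f"{n_language_locations} locations ({share:.1%}) read the language."
)
print(summary)
```

The denominator and the numerator are both counted in locations here, so the percentage describes pages rather than calls.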
Out of the total of __11292867__ websites / locations in this dataset __2414__ (0.02%) websites were found to be checking for user's location using the geolocation api, and their subsequent location/scripts can be found in the `geolocation_df` dataframe.
Your prevalence will likely be much higher if you report the number of scripts making one or more of the calls you are looking for. An additional refinement that occurs to me immediately would be to report the incidence not just among unique script_urls but also for script domains, or for unique script_urls with any parameter string removed. Showing all three might in fact make for interesting discussion. A related avenue to discuss is the scripts that do it, and the locations that contain those scripts.
Going further... are those scripts on other locations where we haven't detected the calls you're looking for? Is there a pattern to whether the calls you're interested in are or are not present? (This is me just getting excited by the possibilities.)
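The three granularities suggested above could be compared along these lines. This is only a sketch: the URLs, the `symbol` column name and the symbol strings are invented, and the helpers `strip_query` and `script_host` are hypothetical names. Note that `netloc` gives the host rather than the registrable domain; grouping by registrable domain would need a public suffix list, which is not attempted here.

```python
from urllib.parse import urlsplit, urlunsplit

import pandas as pd

# Toy data; URLs and symbol strings are made up for illustration.
df = pd.DataFrame({
    "script_url": [
        "https://cdn.example/geo.js?v=1",
        "https://cdn.example/geo.js?v=2",
        "https://cdn.example/other.js",
        "https://maps.example/loc.js",
    ],
    "symbol": [
        "window.navigator.geolocation",
        "window.navigator.geolocation",
        "window.document.cookie",
        "window.navigator.geolocation",
    ],
})


def strip_query(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def script_host(url):
    """Return the host part of the URL (not the registrable domain)."""
    return urlsplit(url).netloc


df["script_url_no_query"] = df["script_url"].map(strip_query)
df["script_host"] = df["script_url"].map(script_host)

is_geo = df["symbol"] == "window.navigator.geolocation"

# For each granularity: distinct units with a geolocation call vs. all units.
results = {}
for key in ["script_url", "script_url_no_query", "script_host"]:
    total = df[key].nunique()
    hits = df.loc[is_geo, key].nunique()
    results[key] = (hits, total)
    print(f"{key}: {hits} of {total} ({hits / total:.1%}) make a geolocation call")
```

Reporting all three side by side shows how much of the count is driven by cache-busting parameters or by a few hosts serving many scripts.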
Running it on the full dataset can yield an `higher accuracy`.
True, but working within the confines of this dataset, can you take a stab at how much more accurate - or in particular which numbers are more or less likely to be accurate - or provide the numbers with confidence intervals? (I'm not actually expecting you to do this - just a note)
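One possible way to attach a confidence interval to a reported proportion is the Wilson score interval, sketched below. This is an illustration, not what the notebook does: it assumes the 10 percent sample behaves like a simple random sample of whatever unit is being counted, which may not hold if rows were sampled but locations or scripts are reported. The function name `wilson_interval` and the counts are hypothetical.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion hits / n.

    z=1.96 corresponds to roughly 95% confidence under a normal
    approximation. Assumes independent draws (an assumption here).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: 50 of 1000 sampled scripts make the call.
low, high = wilson_interval(50, 1000)
print(f"Estimated share: {50 / 1000:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Rare events such as the geolocation calls would get relatively wider intervals than the language calls, which is one way to say which numbers are more or less likely to be accurate.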
- added operation table
- removed readme file
- update the scripts/websites statistics
- added analysis of calls made by unique domains
20c934f to 0b3d87e
Improvements for PR: mozilla#100 based on the review
- added operation table to verify get calls
- removed readme file
- updated the scripts/websites statistics
- added unique domain statistics
0e78a86 to ae0b73f
ae0b73f to f90b0b1
Improvements for PR: mozilla#100 based on the review
- added operation table to verify get calls
- removed readme file
- updated the scripts/websites statistics
- added unique domain statistics
- fixed a typo in comments
Improvements for PR: mozilla#100 based on the review
- added operation table to verify get calls
- removed readme file
- updated the scripts/websites statistics
- added unique domain statistics
- fixed a typo in comments
- added more information for geolocation tracking
f90b0b1 to 449a387
I was looking for |
Analysis:
My analysis revolves around finding what percentage of websites in this dataset, and which ones, are tracking users' location and language preferences so as to provide them with customized content based on those preferences (e.g. location, language).
Dataset used: Sample 10 percent