Geolocation and user language extraction analysis: issue #37 #100

sanchittechnogeek · 2019-04-01T22:34:02Z

Analysis:

My analysis revolves around finding what percentage of websites in this dataset, and which ones, are tracking users' location and language preferences in order to provide customized content based on those preferences (e.g. location, language).

Dataset used: Sample 10 percent

sample 10 percent - 3.7GB download / 7.4GB on disk

birdsarah

Comments as I go:

Very nice write-ups!
Great work getting to grips with dask.
You mentioned that navigator.language can be both get and set. Your write-up implies all the calls are get. You could have used the operation column to verify this.
Nice printing out of results, using variables. Not necessary, but I have fallen in love with f'' strings for doing this kind of reporting of results as I find it makes the text generation more natural and easier to review.
The geolocation symbol is not one I've explored, and your analysis has certainly got me intrigued about it.
There's no need to have your write-up in both the readme file and at the bottom of your notebook - just one place is good - and less likely to lead to them getting out of sync later.
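A minimal sketch of the get/set check and the f-string reporting mentioned above. It uses a toy pandas dataframe; the `symbol` column name and the symbol strings are assumptions for illustration, and only the `operation` column is named in this review. With dask the same `value_counts()` call should work with a `.compute()` added.

```python
import pandas as pd

# Toy stand-in for the crawl data. The `symbol` column and its values
# are assumed for illustration; check them against the real dataset.
df = pd.DataFrame({
    "symbol": [
        "window.navigator.language",
        "window.navigator.language",
        "window.navigator.geolocation",
        "window.navigator.language",
    ],
    "operation": ["get", "get", "get", "set"],
})

# Keep only the calls of interest, then count them per operation type.
language_calls = df[df["symbol"] == "window.navigator.language"]
operation_counts = language_calls["operation"].value_counts()

n_get = int(operation_counts.get("get", 0))
n_set = int(operation_counts.get("set", 0))
total = len(language_calls)

# An f-string keeps the sentence and the numbers together in one place.
report = (
    f"navigator.language calls: {total} "
    f"({n_get} get, {n_set} set, {n_get / total:.1%} get)"
)
print(report)
```

If the `set` count comes out as zero on the real data, the claim that all calls are reads is verified rather than implied.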

Overall:

This is a really good start.
The main flaw in your current work is that you're counting rows, which is not the same as what you're saying in the text of your results. What you're saying in the text of your results is the right way to think about this, so don't change your text to match the code :-) Update your code so that you're counting what you expect to be counting. More on this in my comments on the readme.
The next step is to start linking this to a bigger picture. We can count lots of things. How does this relate to tracking? What are you trying to get out of this? You did start to explain things, but I was left wanting to know where you're going. Reading that a user is en-US so that, as you say, they can be provided with localized content is not necessarily tracking.
One final thought - if you're trying to look at differences in content delivery is this the dataset to do it? If not, why not? What would you change about the data collection? What other data would you like?

birdsarah · 2019-04-04T07:55:45Z

analyses/2019_04_sanchittechnogeek__geolocation/Readme.md

+ ## Inference:
+
+
+Out of the total of __11292867__ websites / locations in this dataset __72304__ (0.64%) websites were found to be checking for preferred language of the user, usually the language of the browser UI, and their subsequent location/scripts can be found in the `language_pref_df` dataframe.


There are not 11,292,867 websites in this dataset. There are 11,292,867 rows, each of which is a JS call. The number of locations can be found by `df.location.nunique()` or the number of scripts by `df.script_url.nunique()`.
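To make the difference concrete, here is a small sketch on made-up data. The `location` and `script_url` columns are the ones named above; the `symbol` column, the symbol strings, and the URLs are assumptions for illustration.

```python
import pandas as pd

# Four rows = four JS calls, but only three locations and two scripts.
df = pd.DataFrame({
    "location": [
        "https://a.example/",
        "https://a.example/",
        "https://b.example/",
        "https://c.example/",
    ],
    "script_url": [
        "https://cdn.example/t.js",
        "https://cdn.example/t.js",
        "https://cdn.example/t.js",
        "https://c.example/app.js",
    ],
    "symbol": [
        "window.navigator.language",
        "window.navigator.language",
        "window.navigator.userAgent",
        "window.navigator.language",
    ],
})

n_rows = len(df)                          # number of JS calls
n_locations = df["location"].nunique()    # number of distinct pages
n_scripts = df["script_url"].nunique()    # number of distinct scripts

# Count locations with at least one language call, not rows.
language_rows = df[df["symbol"] == "window.navigator.language"]
n_language_locations = language_rows["location"].nunique()
share = n_language_locations / n_locations

print(f"{n_rows} rows, {n_locations} locations, {n_scripts} scripts")
print(f"{n_language_locations} of {n_locations} locations ({share:.1%}) check the language")
```

Both the numerator and the denominator are in the same unit (locations), which is what the sentence in the readme claims to report.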

birdsarah · 2019-04-04T07:57:00Z

analyses/2019_04_sanchittechnogeek__geolocation/Readme.md

+Out of the total of __11292867__ websites / locations in this dataset __72304__ (0.64%) websites were found to be checking for preferred language of the user, usually the language of the browser UI, and their subsequent location/scripts can be found in the `language_pref_df` dataframe.
+
+Out of the total of __11292867__ websites / locations in this dataset __2414__ (0.02%) websites were found to be checking for user's location using the geolocation api, and their subsequent location/scripts can be found in the `geolocation_df` dataframe.
+


Your prevalence will likely be much higher if you report the number of scripts making one or more of the calls you are looking for. An additional refinement that occurs to me immediately would be to report the incidence not just among unique script_urls but also for script domains, or for unique script_urls with any parameter string removed. Showing all three in fact might make for interesting discussion. A related avenue to discuss is the scripts that do it, and the locations that contain those scripts.
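A rough sketch of the three granularities, using only the standard library's `urllib.parse` plus pandas. The URLs are invented, and using the URL hostname as the "script domain" is a simplifying assumption; grouping by registered domain would need a public-suffix-aware library.

```python
from urllib.parse import urlsplit, urlunsplit

import pandas as pd


def strip_query(url):
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def script_host(url):
    """Return the hostname of the URL (a stand-in for 'script domain')."""
    return urlsplit(url).hostname


# Hypothetical script_url values for rows that made a call of interest.
calls = pd.DataFrame({
    "script_url": [
        "https://cdn.example/t.js?id=1",
        "https://cdn.example/t.js?id=2",
        "https://cdn.example/geo.js",
        "https://other.example/lib.js",
    ],
})

counts = {
    "unique script_url": calls["script_url"].nunique(),
    "unique script_url without parameters": calls["script_url"].map(strip_query).nunique(),
    "unique script hosts": calls["script_url"].map(script_host).nunique(),
}

for label, value in counts.items():
    print(f"{label}: {value}")
```

The same script served with different parameters collapses into one entry at the second level, and everything served from one host collapses again at the third.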

Going further... are those scripts present on other locations where we haven't detected the calls you're looking for? Is there a pattern to whether the calls you're interested in are or are not present? (This is me just getting excited by the possibilities.)
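One possible way to start on that question, sketched on made-up data. The `symbol` column and the target symbol string are assumptions. Since each row is a JS call, a script only shows up on a location here if it made at least one recorded call there, so "without the call" means "seen on the page, but not making this particular call".

```python
import pandas as pd

TARGET = "window.navigator.geolocation"  # assumed symbol name

df = pd.DataFrame({
    "script_url": [
        "https://cdn.example/t.js",
        "https://cdn.example/t.js",
        "https://cdn.example/t.js",
        "https://c.example/app.js",
    ],
    "location": [
        "https://a.example/",
        "https://b.example/",
        "https://b.example/",
        "https://c.example/",
    ],
    "symbol": [
        TARGET,
        "window.navigator.userAgent",
        "window.navigator.userAgent",
        TARGET,
    ],
})

# For every (script, location) pair: did the script make the target call there?
df["is_target"] = df["symbol"] == TARGET
per_pair = (
    df.groupby(["script_url", "location"])["is_target"].any().reset_index()
)

# Per script: how many locations it appears on, with and without the call.
seen = per_pair.groupby("script_url")["is_target"]
summary = pd.DataFrame({
    "locations": seen.size(),
    "locations_with_call": seen.sum().astype(int),
})
summary["locations_without_call"] = (
    summary["locations"] - summary["locations_with_call"]
)

# Scripts that make the call on some locations but not on others.
mixed = summary[
    (summary["locations_with_call"] > 0)
    & (summary["locations_without_call"] > 0)
]
print(mixed)
```

Scripts that land in `mixed` are the interesting ones: something about the page or the visit decides whether they make the call.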

birdsarah · 2019-04-04T07:59:05Z

analyses/2019_04_sanchittechnogeek__geolocation/Readme.md

+
+Out of the total of __11292867__ websites / locations in this dataset __2414__ (0.02%) websites were found to be checking for user's location using the geolocation api, and their subsequent location/scripts can be found in the `geolocation_df` dataframe.
+
+Running it on the full dataset can yield an `higher accuracy`.


True, but working within the confines of this dataset, can you take a stab at how much more accurate - or in particular which numbers are more or less likely to be accurate - or provide the numbers with confidence intervals? (I'm not actually expecting you to do this - just a note)
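For what a confidence interval could look like, here is a sketch of a Wilson score interval for a proportion, using only the standard library. The counts are hypothetical, and the interval assumes the sample behaves like a simple random sample of the units being counted (locations or scripts), which may not hold for this crawl.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 50 of 1000 sampled locations make the call.
n_with_call = 50
n_sampled = 1000
low, high = wilson_interval(n_with_call, n_sampled)

print(
    f"{n_with_call / n_sampled:.1%} of locations "
    f"(95% interval: {low:.1%} to {high:.1%})"
)
```

Rare events such as the geolocation calls give wide intervals relative to their size, which is one way to say which of the reported numbers are less likely to be accurate.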

- added operation table - removed readme file - update the scripts/websites statistics - added analysis of calls made by unique domains

Improvements for PR: mozilla#100 based on the review - added operation table to verify get calls - removed readme file - updated the scripts/websites statistics - added unique domain statistics

Improvements for PR: mozilla#100 based on the review - added operation table to verify get calls - removed readme file - updated the scripts/websites statistics - added unique domain statistics - fixed a typo in comments

Improvements for PR: mozilla#100 based on the review - added operation table to verify get calls - removed readme file - updated the scripts/websites statistics - added unique domain statistics - fixed a typo in comments - added more information for geolocation tracking

sanchittechnogeek · 2019-04-30T16:24:41Z

One final thought - if you're trying to look at differences in content delivery is this the dataset to do it? If not, why not? What would you change about the data collection? What other data would you like?

I was looking for getCurrentPosition() function calls, but the crawler was not able to detect them properly; it recorded the call at only one location. So, to change the data collection, I would like a crawl dedicated to detecting these function calls. One other thing I would like to do is run crawlers from different locations simultaneously, to find which scripts are run only in or from a particular region.

sanchittechnogeek changed the title ~~Geolocation and user language extraction analysis: isuue #37~~ Geolocation and user language extraction analysis: issue #37 Apr 1, 2019

birdsarah suggested changes Apr 4, 2019


sanchittechnogeek force-pushed the master branch from 20c934f to 0b3d87e Compare April 24, 2019 21:48

sanchittechnogeek force-pushed the master branch 2 times, most recently from 0e78a86 to ae0b73f Compare April 24, 2019 22:13

sanchittechnogeek force-pushed the master branch from ae0b73f to f90b0b1 Compare April 25, 2019 16:40

sanchittechnogeek force-pushed the master branch from f90b0b1 to 449a387 Compare April 25, 2019 17:08
