How to get rid of converted characters in URLs #488

Closed
asitemade4u opened this issue Apr 29, 2020 · 15 comments
Labels
area/stdlib Standard library issue good first issue Good for newcomers type/bug Something isn't working

@asitemade4u

asitemade4u commented Apr 29, 2020

I am developing a crawler and so far, so very good: thank you for this outstanding crawler.

The only issue is that, in the returned URLs, there is a & character which gets converted into \u0026, thus: "https://thedomain/alphabet=M\u0026borough=Bronx"

So I tried to replace it, either by using SUBSTITUTE:
RETURN SUBSTITUTE(prfx + letter.attributes.href, "\u0026", "&")

or REGEX_REPLACE.

In both cases, the \u0026 string is NOT replaced and remains embedded in the resulting URLs.
However, when I try SUBSTITUTE on, say, a -> z, it works fine.

Is it a limitation of JSON, which I use as an output format?
How can I get rid of the converted string? It prevents me from crawling the lower levels of the website.
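
On the JSON question: \u0026 is a standard JSON string escape for &, so the ampersand may only look converted in the serialized output. Some JSON encoders (Go's encoding/json, for one) emit this escape for & by default; whether that is what happens in Ferret's output path is an assumption here, not something confirmed in this thread. A minimal sketch of how any JSON parser turns the escape back into a literal &, using the example URL from above (the function name is hypothetical):

```python
import json


def decode_json_string(raw_json: str) -> str:
    """Parse a JSON string literal and return the decoded Python string."""
    return json.loads(raw_json)


# Raw text as it would appear inside a JSON output file.
# The r-prefix keeps the backslash so the JSON parser sees \u0026 as written.
raw = r'"https://thedomain/alphabet=M\u0026borough=Bronx"'

url = decode_json_string(raw)
print(url)
```

If that assumption holds, the value inside the query never contains the text \u0026 at all, which would also explain why SUBSTITUTE finds nothing to replace.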

@ziflex
Member

ziflex commented Apr 30, 2020

Thank you and I'm glad that you like it!

I assume you are using the CDP driver, right? If so, I think the problem is that CDP encodes URLs before sending them to Ferret, since it's a JSON-based communication.
DECODE_URI_COMPONENT is supposed to solve the problem, but it does not seem to work:

Welcome to Ferret REPL
Please use `exit` or `Ctrl-D` to exit this program.
> RETURN DECODE_URI_COMPONENT("https://thedomain/alphabet=M\u0026borough=Bronx")
"https://thedomain/alphabet=M\\u0026borough=Bronx"

Therefore I mark it as a bug.
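
One possible reading of the REPL result above, offered as an assumption rather than a diagnosis of Ferret's internals: URI-component decoding normally targets percent-encoding such as %26, so a JSON-style \u0026 sequence would pass through untouched. The sketch below uses Python's urllib.parse.unquote as a stand-in for that kind of decoder; it is not Ferret's DECODE_URI_COMPONENT.

```python
from urllib.parse import unquote

# A URL containing the literal six characters \u0026 (a JSON-style escape).
json_escaped = r"https://thedomain/alphabet=M\u0026borough=Bronx"

# The same URL with the ampersand percent-encoded instead.
percent_encoded = "https://thedomain/alphabet=M%26borough=Bronx"

# Percent-decoding only rewrites %XX sequences, so the first URL is unchanged.
print(unquote(json_escaped))

# The second URL contains %26, which decodes to a literal ampersand.
print(unquote(percent_encoded))
```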

@ziflex ziflex added area/stdlib Standard library issue good first issue Good for newcomers type/bug Something isn't working labels Apr 30, 2020
@asitemade4u
Author

asitemade4u commented Apr 30, 2020 via email

@ziflex
Member

ziflex commented Apr 30, 2020

Not sure I fully understand you.
The CDP driver uses Chrome to communicate with web pages, and you need it only if your target page is dynamically rendered and/or requires some user interaction to retrieve the data you need.

@asitemade4u
Author

asitemade4u commented Apr 30, 2020 via email

@ziflex
Member

ziflex commented Apr 30, 2020

Ah, well.
I think there is a misunderstanding going on :) Ferret CLI is an executable binary that already contains the Ferret Runtime (they just happen to be in the same repo).
What you are referring to is a Chrome/Chromium instance with an open remote debugging port, which Ferret's CDP driver uses to scrape dynamic pages.

I hope this clarifies things.
Could you give more context about your situation?

@asitemade4u
Author

asitemade4u commented Apr 30, 2020 via email

@asitemade4u
Author

asitemade4u commented Apr 30, 2020 via email

@ziflex
Member

ziflex commented Apr 30, 2020

> You mean that the "Ferret Server" on docker I have installed on my local physical server is never used except with CDP? And so, basically, I do not need it in case I do NOT use CDP? Wow, this would be a serious misunderstanding...

By "Ferret Server" do you mean this project?

> and yet, I just tried to execute: ferret <thescript>.aql > tst.jsonhost and absolutely nothing happens...

Could you share your script? Hard to tell what's wrong without seeing actual code.

@asitemade4u

This comment has been minimized.

@ziflex
Member

ziflex commented Apr 30, 2020

For this webpage you definitely do not need CDP, so you do not have to run Docker. Just use the CLI.

LET doc = DOCUMENT('https://www.nycgovparks.org/about/history/historical-signs')
LET prfx = 'https://www.nycgovparks.org'

// Parse boro links
LET boros = ELEMENTS(doc, 'html body div#page div#maincontent div.row div.span9 main ul.text_list li a')
LET brolnk = (
    FOR bro IN boros
        RETURN prfx + bro.attributes.href
)

// Then letter grid within each boro
// 20200429 Problem with the \u0026 encoded character of "&"
LET result = (
    FOR bro_link IN brolnk
        LET d = DOCUMENT(bro_link)
        LET letters = ELEMENTS(d, 'html body div#page div#maincontent div.row div.span9 main table tbody tr td div a')
        LET ltrlnk = (
            FOR letter IN letters
                RETURN DECODE_URI_COMPONENT(prfx + letter.attributes.href)
        )

        RETURN FLATTEN(ltrlnk)
)

RETURN FLATTEN(result)

@asitemade4u
Author

asitemade4u commented Apr 30, 2020 via email

@ziflex
Member

ziflex commented Apr 30, 2020

ferret <thescript>.aql 

@asitemade4u
Author

asitemade4u commented Apr 30, 2020 via email

@ziflex
Member

ziflex commented Apr 30, 2020

Yeah, I will add some clarifications to the docs about the differences between in-memory static pages and Chrome-driven pages.

@asitemade4u
Author

asitemade4u commented Apr 30, 2020 via email
