How to get rid of converted characters in URLs #488
Comments
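Thank you and I'm glad that you like it!
I assume you are using the `cdp` driver, right? If so, I think the problem is that CDP encodes URLs before sending them to Ferret, since it's a JSON-based communication.
`DECODE_URI_COMPONENT` is supposed to solve the problem, but it does not seem to work.
Therefore I mark it as a bug.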
Thank you for the response!
I have a related question, since you just mentioned CDP.
I would rather not use it, but I have not found another way to access the server, which I do NOT host on my own machine but on a "real, physical" local server, under Docker.
So, is there a way to connect to a remote server without resorting to CDP?
Currently, when executing a script, I run: `ferret --cdp http://<the server ip>:9222 <thefile>.aql`
Please also note that I am using the .aql extension because the only VS Code plugin that formats and lints the language does not work with FQL, only AQL.
On 4/29/20 9:03 PM, Tim Voronov wrote:
Thank you and I'm glad that you like it!
I assume you are using the `cdp` driver, right? If so, I think the problem is that CDP encodes URLs before sending them to Ferret, since it's a JSON-based communication.
`DECODE_URI_COMPONENT` is supposed to solve the problem, but it does not seem to work:
Welcome to Ferret REPL
Please use `exit` or `Ctrl-D` to exit this program.
> RETURN DECODE_URI_COMPONENT("https://thedomain/alphabet=M\u0026borough=Bronx")
"https://thedomain/alphabet=M\\u0026borough=Bronx"
Therefore I mark it as a bug.
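The doubled backslash in that REPL output is consistent with the typed literal being taken as a backslash followed by `u0026`, rather than as an escape for `&`, with the backslash then being escaped again on output. The following is only a minimal Python sketch of that reading: it is an assumption about what happens inside the REPL, not a confirmed diagnosis, and Python's `json` module merely stands in for whatever serializer Ferret uses.

```python
import json

# Assumption: the query literal reaches the runtime as a backslash
# followed by "u0026", i.e. it contains no ampersand at all.
literal = "https://thedomain/alphabet=M\\u0026borough=Bronx"

# Serializing that string as JSON escapes the backslash itself.
encoded = json.dumps(literal)
print(encoded)  # "https://thedomain/alphabet=M\\u0026borough=Bronx"

# Decoding returns the original characters, still without an ampersand.
print("&" in json.loads(encoded))  # False
```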
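Not sure I fully understand you.
The CDP driver uses Chrome to communicate with web pages, and you need it only if your target page is dynamically rendered and/or requires some user interaction to retrieve the data you need.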
Sorry for being so obscure.
The question was: how do I connect to a remote Ferret server, hosted on a different IP address than localhost, from the CLI?
Something like:
ferret --host http://<theserverip>:9222 <thescript>.aql
On 4/29/20 9:34 PM, Tim Voronov wrote:
Not sure I fully understand you.
The CDP driver uses Chrome to communicate with web pages, and you need it only if your target page is dynamically rendered and/or requires some user interaction to retrieve the data you need.
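Ah, well.
I think there is a misunderstanding going on :) Ferret CLI is an executable binary that already contains Ferret Runtime (they just happen to be in the same repo).
What you are referring to is a Chrome/Chromium with an open remote debugging port that Ferret's CDP driver uses to perform web scraping of dynamic pages.
I hope it clarifies things.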
You mean that the "Ferret Server" under Docker that I have installed on my local physical server is never used except with CDP?
And so, basically, I do not need it in case I do NOT use CDP?
Wow, this would be a serious misunderstanding...
On 4/29/20 10:04 PM, Tim Voronov wrote:
Ah, well.
I think there is a misunderstanding going on :) Ferret CLI is an executable binary that already contains Ferret Runtime (they just happen to be in the same repo).
What you are referring to is a Chrome/Chromium with an open remote debugging port that Ferret's CDP driver uses to perform web scraping of dynamic pages.
And yet, I just tried to execute `ferret <thescript>.aql > tst.jsonhost` and absolutely nothing happens...
By "Ferret Server" do you mean this project?
Could you share your script? It is hard to tell what's wrong without seeing the actual code.
For this webpage you definitely do not need CDP, so you do not have to run Docker. Just use the CLI.
Great!
And what will the CLI instruction look like then?
S.
On Apr 29, 2020, at 23:16, Tim Voronov wrote:
For this webpage you def do not need CDP.
```
LET doc = DOCUMENT('https://www.nycgovparks.org/about/history/historical-signs')
LET prfx = 'https://www.nycgovparks.org'
// Parse boro links
LET boros = ELEMENTS(doc, 'html body div#page div#maincontent div.row div.span9 main ul.text_list li a')
LET brolnk = (
    FOR bro IN boros
        RETURN prfx + bro.attributes.href
)
// Then letter grid within each boro
// 20200429 Problem with the \u0026 encoded character of "&"
LET result = (
    FOR bro_link IN brolnk
        LET d = DOCUMENT(bro_link)
        LET letters = ELEMENTS(d, 'html body div#page div#maincontent div.row div.span9 main table tbody tr td div a')
        LET ltrlnk = (
            FOR letter IN letters
                RETURN DECODE_URI_COMPONENT(prfx + letter.attributes.href)
        )
        RETURN FLATTEN(ltrlnk)
)
RETURN FLATTEN(result)
```
Oh I get it now. There might be some additional explanations to provide in the doc...
Thank you SO MUCH!
S.
On Apr 29, 2020, at 23:25, Tim Voronov wrote:
```
ferret <thescript>.aql
```
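If the result is consumed by another program rather than read by eye, one option is to parse the CLI's output as JSON, which turns escapes such as `\u0026` back into plain characters. A rough Python sketch follows. It assumes that `ferret` is on the PATH and prints the query's `RETURN` value as JSON on stdout; `run_script` and `parse_result` are hypothetical helper names, not part of Ferret.

```python
import json
import subprocess


def parse_result(stdout_text):
    """Parse the JSON text printed by the CLI into Python objects."""
    return json.loads(stdout_text)


def run_script(path):
    """Run `ferret <path>` and return the parsed result.

    Assumption: the CLI writes the result as JSON to stdout.
    """
    completed = subprocess.run(
        ["ferret", path], capture_output=True, text=True, check=True
    )
    return parse_result(completed.stdout)


# Hypothetical sample of what the script in this thread might print.
sample = r'["https://thedomain/alphabet=M\u0026borough=Bronx"]'
print(parse_result(sample))  # ['https://thedomain/alphabet=M&borough=Bronx']
```

With a real script, `run_script("<thescript>.aql")` would replace the hard-coded sample.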
Yeah, I will add some clarifications to the docs about the differences between in-memory static pages and Chrome-driven pages.
Thank you for the remodeled script. It works perfectly, except for the \u0026 issue which, unfortunately, is blocking, as I am not able to get down one level...
I am developing a crawler and so far, so very good: thank you for this outstanding crawler.
The only issue is that, in the returned URLs, there is a `&` character which gets converted into `\u0026`, thus: "https://thedomain/alphabet=M\u0026borough=Bronx"
So I tried to replace it, either by using `SUBSTITUTE`: `RETURN SUBSTITUTE(prfx + letter.attributes.href, "\u0026", "&")` or `REGEX_REPLACE`.
In both cases, the `\u0026` string is NOT replaced and remains embedded in the resulting URLs.
However, when I try `SUBSTITUTE` on, say, `a` -> `z`, it works fine.
Is it a limitation of JSON, which I use as an output format?
How can I get rid of the converted string, as it prevents me from crawling at the lower levels of the website?
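For what it's worth, `\u0026` is a standard JSON string escape for `&`, and some serializers emit it by default (Go's `encoding/json` does, for HTML safety). The small Python sketch below assumes that the escape is introduced only when the result is written out as JSON, which this thread does not confirm. Under that assumption the value inside the script already holds a plain `&`, which would explain why a replacement targeting the six characters `\u0026` finds nothing to replace.

```python
import json

# One value as it appears in the JSON output quoted above.
raw_json = r'"https://thedomain/alphabet=M\u0026borough=Bronx"'

# Any JSON parser turns the escape back into the plain character.
url = json.loads(raw_json)
print(url)  # https://thedomain/alphabet=M&borough=Bronx

# The decoded string contains "&", not the six characters \u0026,
# so searching for that literal sequence matches nothing.
print("\\u0026" in url)  # False
print("&" in url)        # True
```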