"invalid regex" error during manual OSM import #279

Closed
nick-rv opened this issue Dec 4, 2024 · 4 comments

nick-rv commented Dec 4, 2024

Describe the bug

Hi, I am trying to import the following file, provided for France, into the Pelias polylines module: https://data.geocode.earth/osm/2022-35/france-valhalla.polylines.0sv.gz

A couple of errors appeared, and I am not sure whether the root cause is the data itself or the importer code.

Steps to Reproduce

The file was gunzipped beforehand.

Then run this command: pelias import polylines

It then produces this kind of error:

[polyline] polyline document error message=invalid regex test, Maire Bouxières http://www.mairie-bouxieres-aux-dames.fr/wp-content/uploads/2005/01/Les-sentiers-de-Bouxi%C3%A8res-aux-Dames.pdf should not match /https?:\/\//, stack=PeliasModelError: invalid regex test, Maire Bouxières http://www.mairie-bouxieres-aux-dames.fr/wp-content/uploads/2005/01/Les-sentiers-de-Bouxi%C3%A8res-aux-Dames.pdf should not match /https?:\/\//

The job nevertheless seems to run to completion.

Complete logs

$ pelias import polylines
2024-12-04T09:13:55.775Z - info: [dbclient-polylines] paused=true, transient=5, current_length=0
2024-12-04T09:13:55.776Z - info: [dbclient-polylines] paused=true, transient=5, current_length=0
2024-12-04T09:14:05.838Z - info: [dbclient-polylines] paused=true, transient=5, current_length=0, indexed=134000, batch_ok=268, street=134000, batch_retries=0, failed_records=0, persec=13400
2024-12-04T09:14:15.842Z - info: [dbclient-polylines] paused=true, transient=5, current_length=0, indexed=268000, batch_ok=536, street=268000, batch_retries=0, failed_records=0, persec=13400
2024-12-04T09:14:16.801Z - error: [polyline] polyline document error message=invalid regex test, Maire Bouxières http://www.mairie-bouxieres-aux-dames.fr/wp-content/uploads/2005/01/Les-sentiers-de-Bouxi%C3%A8res-aux-Dames.pdf should not match /https?:\/\//, stack=PeliasModelError: invalid regex test, Maire Bouxières http://www.mairie-bouxieres-aux-dames.fr/wp-content/uploads/2005/01/Les-sentiers-de-Bouxi%C3%A8res-aux-Dames.pdf should not match /https?:\/\//
at Object.nomatch (/code/pelias/polylines/node_modules/pelias-model/util/valid.js:117:13)
at Document.setName (/code/pelias/polylines/node_modules/pelias-model/Document.js:244:18)
at DestroyableTransform._transform (/code/pelias/polylines/stream/document.js:30:11)
at DestroyableTransform.Transform._read (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_transform.js:166:10)
at DestroyableTransform.Readable.read (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_readable.js:428:10)
at flow (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_readable.js:858:34)
at DestroyableTransform.pipeOnDrainFunctionResult (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_readable.js:690:7)
at DestroyableTransform.emit (node:events:513:28)
at onwriteDrain (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_writable.js:453:12)
at afterWrite (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_writable.js:441:18), name=PeliasModelError
2024-12-04T09:14:16.802Z - error: [polyline] polyline document error message=invalid regex test, Mairie Bouxières http://www.mairie-bouxieres-aux-dames.fr/wp-content/uploads/2005/01/Les-sentiers-de-Bouxi%C3%A8res-aux-Dames.pdf should not match /https?:\/\//, stack=PeliasModelError: invalid regex test, Mairie Bouxières http://www.mairie-bouxieres-aux-dames.fr/wp-content/uploads/2005/01/Les-sentiers-de-Bouxi%C3%A8res-aux-Dames.pdf should not match /https?:\/\//
at Object.nomatch (/code/pelias/polylines/node_modules/pelias-model/util/valid.js:117:13)
at Document.setName (/code/pelias/polylines/node_modules/pelias-model/Document.js:244:18)
at DestroyableTransform._transform (/code/pelias/polylines/stream/document.js:30:11)
at DestroyableTransform.Transform._read (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_transform.js:166:10)
at DestroyableTransform.Readable.read (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_readable.js:428:10)
at flow (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_readable.js:858:34)
at DestroyableTransform.pipeOnDrainFunctionResult (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_readable.js:690:7)
at DestroyableTransform.emit (node:events:513:28)
at onwriteDrain (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_writable.js:453:12)
at afterWrite (/code/pelias/polylines/node_modules/through2/node_modules/readable-stream/lib/_stream_writable.js:441:18), name=PeliasModelError
2024-12-04T09:14:25.843Z - info: [dbclient-polylines] paused=true, transient=3, current_length=0, indexed=410000, batch_ok=820, street=410000, batch_retries=0, failed_records=0, persec=14200
2024-12-04T09:14:35.873Z - info: [dbclient-polylines] paused=false, transient=3, current_length=412, indexed=548000, batch_ok=1096, street=548000, batch_retries=0, failed_records=0, persec=13800
2024-12-04T09:14:45.874Z - info: [dbclient-polylines] paused=true, transient=3, current_length=0, indexed=684500, batch_ok=1369, street=684500, batch_retries=0, failed_records=0, persec=13650
2024-12-04T09:14:55.884Z - info: [dbclient-polylines] paused=true, transient=5, current_length=0, indexed=820000, batch_ok=1640, street=820000, batch_retries=0, failed_records=0, persec=13550
2024-12-04T09:15:05.890Z - info: [dbclient-polylines] paused=true, transient=3, current_length=0, indexed=948500, batch_ok=1897, street=948500, batch_retries=0, failed_records=0, persec=12850
2024-12-04T09:15:15.898Z - info: [dbclient-polylines] paused=true, transient=5, current_length=0, indexed=1072500, batch_ok=2145, street=1072500, batch_retries=0, failed_records=0, persec=12400
2024-12-04T09:15:25.899Z - info: [dbclient-polylines] paused=false, transient=3, current_length=309, indexed=1205000, batch_ok=2410, street=1205000, batch_retries=0, failed_records=0, persec=13250
2024-12-04T09:15:35.935Z - info: [dbclient-polylines] paused=true, transient=5, current_length=0, indexed=1332000, batch_ok=2664, street=1332000, batch_retries=0, failed_records=0, persec=12700
2024-12-04T09:15:45.953Z - info: [dbclient-polylines] paused=true, transient=5, current_length=0, indexed=1458500, batch_ok=2917, street=1458500, batch_retries=0, failed_records=0, persec=12650
2024-12-04T09:15:56.030Z - info: [dbclient-polylines] paused=true, transient=5, current_length=0, indexed=1586500, batch_ok=3173, street=1586500, batch_retries=0, failed_records=0, persec=12800
2024-12-04T09:16:06.145Z - info: [dbclient-polylines] paused=true, transient=5, current_length=0, indexed=1720000, batch_ok=3440, street=1720000, batch_retries=0, failed_records=0, persec=13350
2024-12-04T09:16:16.171Z - info: [dbclient-polylines] paused=false, transient=2, current_length=214, indexed=1813500, batch_ok=3627, street=1813500, batch_retries=0, failed_records=0, persec=9350
2024-12-04T09:16:26.186Z - info: [dbclient-polylines] paused=false, transient=2, current_length=30, indexed=1900500, batch_ok=3801, street=1900500, batch_retries=0, failed_records=0, persec=8700
2024-12-04T09:16:36.223Z - info: [dbclient-polylines] paused=false, transient=1, current_length=274, indexed=1974000, batch_ok=3948, street=1974000, batch_retries=0, failed_records=0, persec=7350
2024-12-04T09:16:46.251Z - info: [dbclient-polylines] paused=false, transient=1, current_length=340, indexed=2031500, batch_ok=4063, street=2031500, batch_retries=0, failed_records=0, persec=5750
2024-12-04T09:16:56.257Z - info: [dbclient-polylines] paused=false, transient=1, current_length=201, indexed=2098500, batch_ok=4197, street=2098500, batch_retries=0, failed_records=0, persec=6700
2024-12-04T09:17:05.647Z - info: [dbclient-polylines] paused=false, transient=0, current_length=0, indexed=2171870, batch_ok=4344, street=2171870, batch_retries=0, failed_records=0, persec=7337
2024-12-04T09:17:05.647Z - info: [dbclient-polylines] paused=false, transient=0, current_length=0, indexed=2171870, batch_ok=4344, street=2171870, batch_retries=0, failed_records=0, persec=7337

Environment (please complete the following information):

The environment is the Docker stack provided in https://github.com/pelias/docker, running on a Debian 6.1 machine.

Thanks in advance

@nick-rv nick-rv added the bug label Dec 4, 2024

nick-rv commented Dec 4, 2024

Does this mean that HTTP URLs are banned?


missinglink commented Dec 9, 2024

These warnings are generated when OSM data contains a URL in the name field.

Does this mean that HTTP URLs are banned?

Records containing URLs in the name field are skipped; these are considered 'bad data', as we don't want URLs to end up in the search engine.

This feature was introduced in: pelias/model#115
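
To illustrate the kind of check that produces this error, here is a minimal sketch. It is not the actual pelias-model source: the function names `assertNoMatch` and `isIndexableName` are hypothetical, and only the pattern `/https?:\/\//` and the wording of the error message are taken from the log above.

```javascript
// Hypothetical sketch, NOT the actual pelias-model code.
// The pattern is copied from the error message reported in this issue.
const URL_PATTERN = /https?:\/\//;

// Throws when `value` matches `pattern`, mimicking the "should not match" text.
function assertNoMatch(value, pattern) {
  if (pattern.test(value)) {
    throw new Error(`invalid regex test, ${value} should not match ${pattern}`);
  }
}

// Returns true when the name passes the check, false when the record
// would be skipped. The error is logged, and the caller carries on.
function isIndexableName(name) {
  try {
    assertNoMatch(name, URL_PATTERN);
    return true;
  } catch (err) {
    console.error(`skipping record: ${err.message}`);
    return false;
  }
}

console.log(isIndexableName('Sentier des Zaubis'));
console.log(isIndexableName('http://www.sofinelruns.com'));
```

Because the error is caught per record in this sketch, one bad name does not stop the loop. That matches the observation that the import keeps running after the error is logged.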

These warnings are unfortunately common since OSM contains many data errors; you can ignore them:

curl -Ls https://data.geocode.earth/osm/2022-35/france-valhalla.polylines.0sv.gz | pigz -d | grep -a 'http://' | cut -d '' -f2- --output-delimiter=$'\t'
Sofinel Runs Path	http://www.sofinelruns.com
Sentier des Zaubis	Maire Bouxières http://www.mairie-bouxieres-aux-dames.fr/wp-content/uploads/2005/01/Les-sentiers-de-Bouxi%C3%A8res-aux-Dames.pdf
Sentier des Chasupes	Mairie Bouxières http://www.mairie-bouxieres-aux-dames.fr/wp-content/uploads/2005/01/Les-sentiers-de-Bouxi%C3%A8res-aux-Dames.pdf
Sentier des Quarterons	http://www.mairie-bouxieres-aux-dames.fr/wp-content/uploads/2005/01/Les-sentiers-de-Bouxi%C3%A8res-aux-Dames.pdf
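
The same filter can be sketched in Node.js, for anyone who prefers to inspect the file from a script. This assumes, based on the `cut` invocation above, that each line is an encoded polyline followed by one or more names, separated by NUL bytes. The function name `findNamesWithUrls`, the `POLYLINE_*` placeholders, and the "Rue de la Paix" row are made up for illustration. Unlike the `grep` above, it also matches `https://`.

```javascript
// Hypothetical sketch; assumes each line is "<polyline>\0<name>\0<name>...".
const URL_PATTERN = /https?:\/\//;

// Returns the name columns of every row where at least one name contains a URL.
function findNamesWithUrls(text) {
  const rows = [];
  for (const line of text.split('\n')) {
    if (line === '') continue; // ignore blank lines, e.g. a trailing newline
    const [, ...names] = line.split('\0'); // drop the polyline column
    if (names.some((name) => URL_PATTERN.test(name))) rows.push(names);
  }
  return rows;
}

// Placeholder input: the polyline values are dummies, not real encoded geometry.
const sample = [
  'POLYLINE_A\0Sofinel Runs Path\0http://www.sofinelruns.com',
  'POLYLINE_B\0Rue de la Paix',
].join('\n');

for (const names of findNamesWithUrls(sample)) {
  console.log(names.join('\t'));
}
```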

@missinglink
Member

I noticed that we improved detection of these streets in d4c5305

That was 5+ years ago; are you possibly running some ancient Docker containers or something?

@missinglink
Member

This functionality was improved today in pelias/model#160, which added infix removal of URLs within pelias/model.
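
As a rough illustration of what infix removal could look like, here is a sketch under stated assumptions. It is not the code from pelias/model#160: the function name `stripUrls` and the exact pattern are hypothetical, and how an empty result is handled afterwards is not specified here.

```javascript
// Hypothetical sketch, NOT the implementation from pelias/model#160.
// Removes http(s) URLs found anywhere in a name, then tidies the whitespace.
function stripUrls(name) {
  return name
    .replace(/https?:\/\/\S+/g, ' ') // drop each URL up to the next whitespace
    .replace(/\s+/g, ' ')            // collapse runs of whitespace
    .trim();
}

const url =
  'http://www.mairie-bouxieres-aux-dames.fr/wp-content/uploads/2005/01/Les-sentiers-de-Bouxi%C3%A8res-aux-Dames.pdf';

console.log(stripUrls(`Mairie Bouxières ${url}`)); // prints "Mairie Bouxières"
console.log(stripUrls(url) === '');                // prints true
```

With this approach, a name such as "Mairie Bouxières" followed by a URL keeps its textual part, while a name consisting only of a URL becomes empty.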
