Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ST_GeomFromGeoJson, ST_Intersects, and versions of Spatial Framework & of Hive #77

Closed
randallwhitman opened this issue Mar 25, 2015 · 18 comments
Milestone

Comments

@randallwhitman
Copy link
Contributor

Migrating Esri/geometry-api-java#81 to spatial-framework-for-hadoop from geometry-api-java

@randallwhitman
Copy link
Contributor Author

@uamadman When you built spatial-sdk-hive-1.0.3-SNAPSHOT.jar did you update the add jar statement to match?

@randallwhitman
Copy link
Contributor Author

Also, there is a pre-built JAR file in the maven repositories:
http://mvnrepository.com/artifact/com.esri.hadoop/spatial-sdk-hive/1.1

@uamadman
Copy link

Yes, restarted the cluster, uploaded 1.03 and ran my tests for my previous report.
This time with your new link....
I downloaded the .jar marked 1.1 from the mvn repo and followed the same test pattern.
I removed any and all extra .jars. only EGA-1.2.1.jar and SSH-1.1.jar remain.
image

I did the tests in reverse this time just in case there was some remnant of caching issues.

All False

SELECT ST_Intersects(ST_GeomFromText('LINESTRING (2.5 2.5, 8.00000001 9.00000001)'),
                     ST_GeomFromText('LINESTRING (1.5 1.5, 7.00000001 8.00000001)'));
                     -- false

SELECT ST_Intersects(ST_GeomFromGeoJson('{"type": "LineString", "coordinates": [[2.5,2.5], [8.00000001,9.00000001]]}'),
                     ST_GeomFromGeoJson('{"type": "LineString", "coordinates": [[1.5,1.5], [7.00000001,8.00000001]]}'));
                     -- true

SELECT ST_Intersects(ST_GeomFromText('linestring(1 1, 2 1, 3 1, 4 1)'),
                     ST_GeomFromText('linestring(1 0, 2 0, 3 0, 4 0)'));
                     -- false

All True

SELECT ST_Intersects(ST_GeomFromText('LINESTRING (1.5 1.5, 7.00000001 8.00000001)'),
                     ST_GeomFromText('LINESTRING (1.5 1.5, 7.00000001 8.00000001)'));
                     -- true

SELECT ST_Intersects(ST_GeomFromGeoJson('{"type": "LineString", "coordinates": [[1.5,1.5], [7.00000001,8.00000001]]}'),
                     ST_GeomFromGeoJson('{"type": "LineString", "coordinates": [[1.5,1.5], [7.00000001,8.00000001]]}'));
                     -- true

SELECT ST_Intersects(ST_GeomFromText('linestring(1 0, 2 0, 3 1, 4 1)'),
                     ST_GeomFromText('linestring(1 0, 2 0, 3 0, 4 0)'));
                     -- true

Thoughts?

@randallwhitman
Copy link
Contributor Author

Please paste in your add jar statement.

@uamadman
Copy link

I accidentally left in ST_Linestring and ST_Polygon for this test...

Prior to the following tests my add jar was taken directly out of the sample.sql
IE:

add jar
  ../lib/esri-geometry-api.jar
  ../lib/spatial-sdk-hadoop.jar

the current .sql file for my tests.

add jar
  ../lib/esri-geometry-api-1.2.1.jar
  ../lib/spatial-sdk-hive-1.1.jar;


create temporary function ST_Intersects as 'com.esri.hadoop.hive.ST_Intersects';
create temporary function ST_GeomFromGeoJson as 'com.esri.hadoop.hive.ST_GeomFromGeoJson';
create temporary function ST_AsText as 'com.esri.hadoop.hive.ST_AsText';
create temporary function ST_LineString as 'com.esri.hadoop.hive.ST_LineString';
create temporary function ST_Polygon as 'com.esri.hadoop.hive.ST_Polygon';
create temporary function ST_GeomFromText as 'com.esri.hadoop.hive.ST_GeomFromText';

SELECT ST_Intersects(ST_GeomFromText('LINESTRING (2.5 2.5, 8.00000001 9.00000001)'),
                     ST_GeomFromText('LINESTRING (1.5 1.5, 7.00000001 8.00000001)'));

SELECT ST_Intersects(ST_GeomFromGeoJson('{"type": "LineString", "coordinates": [[2.5,2.5], [8.00000001,9.00000001]]}'),
                     ST_GeomFromGeoJson('{"type": "LineString", "coordinates": [[1.5,1.5], [7.00000001,8.00000001]]}'));

SELECT ST_Intersects(ST_GeomFromText('linestring(1 1, 2 1, 3 1, 4 1)'),
                     ST_GeomFromText('linestring(1 0, 2 0, 3 0, 4 0)'));

The terminal output:

uamadman@HDFS06:~/gis-tools-for-hadoop/samples/point-in-polygon-aggregation-hive$ nano ESRI_JSON_TEST.sql
uamadman@HDFS06:~/gis-tools-for-hadoop/samples/point-in-polygon-aggregation-hive$ hive -i ESRI_JSON_TEST.sql
15/03/25 18:36:26 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties
false
true
false
hive>

@randallwhitman
Copy link
Contributor Author

For the query with ST_GeomFromGeoJson, I see false on Hive-.14.0 but true on Hive-.13.1 and Hive-.12 versions. Differing by Hive version seems surprising. (I used git master of geometry-api-java and spatial-framework-for-hadoop, the same for all the Hive versions.)

@uamadman
Copy link

My initial research makes it seem that migrating to .14 is unavailable for average users on the cloudera standard 5.3.1. What are the next steps to help nail down this issue for .13.1?

@randallwhitman
Copy link
Contributor Author

Try setting hive.cache.expr.evaluation to false and see if that fixes it, as in Esri/gis-tools-for-hadoop#23.

@uamadman
Copy link

That is a good start! It works for the basic tests above...

uamadman@HDFS06:~/gis-tools-for-hadoop/samples/point-in-polygon-aggregation-hive$ hive -i ESRI_JSON_TEST.sql
15/03/26 15:23:11 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties
false
false
false

I am currently testing against some larger databases and they are behaving much differently. I'm not sure what that change has done but initial observations have shifted the load of larger jobs from my map to my reduce in a major way. I'm restructuring my select statements to get a sampling of results. My current .sql looks as such.

create temporary function ST_GeomFromGeoJson as 'com.esri.hadoop.hive.ST_GeomFromGeoJson';
create temporary function ST_Intersects as 'com.esri.hadoop.hive.ST_Intersects';

set hive.cache.expr.evaluation = false;
SELECT osm_lines_roads.rowkey, count(*) cnt FROM osm_lines_roads
JOIN tt_f 
Where ST_Intersects(ST_GeomFromGeoJson(tt_f.shape), ST_GeomFromGeoJson(osm_lines_roads.shape))
GROUP BY osm_lines_roads.rowkey
ORDER BY cnt desc;

Initial samples:

set hive.cache.expr.evaluation = true
Original Average time for 1 map out of 143 was ~35 minutes
Original Average time for 1 reduce out of 3 was ~3 minutes

set hive.cache.expr.evaluation = false
New Average times for 1 map is 3.5 minutes.
New Average time for 1 reduce is in excess of 2 hours.-- I need more sampling

Just for a scale reference tt_f is 3 small polygons and osm_roads is 92.5million linears

Thoughts? Cheers!

@smambrose
Copy link
Contributor

That's great news @uamadman, glad to see it's working.

With Hive 13, this is the best workaround we know of. By setting caching to false, the performance in Hive is no longer optimized - which explains the slowdown.

If you are still looking for increased performance, your best bet may be to contact Cloudera directly and see if they have a patched 13 version, or can get you on 14.

Let us know how it goes :)

@randallwhitman randallwhitman changed the title ST_GeomFromGeoJson, ST_Intersects, and outdated builds ST_GeomFromGeoJson, ST_Intersects, and versions of Spatial Framework & of Hive Mar 26, 2015
@uamadman
Copy link

I'll be switching the entire cluster over to the HortonWorks in order to gain access to 14. I will report back in a week.

I would like to recommend 12 and 13.1 be documented in the requirements section in the read me. I would hate for people to simply give up after trying due to a rather difficult to diagnose hive issue.

Cheers!

@smambrose
Copy link
Contributor

We've recently added a Hive Compatibility wiki that will be helpful in addressing this.

@randallwhitman
Copy link
Contributor Author

Similar for Esri JSON as with GeoJSON:

SELECT ST_Intersects(ST_GeomFromJson('{"paths":[[[2.5,2.5],[8,0]]],"spatialReference":{"wkid":4326}}'),
                     ST_GeomFromJson('{"paths":[[[1.5,1.5],[0,7]]],"spatialReference":{"wkid":4326}}'));
true

@uamadman
Copy link

uamadman commented Apr 4, 2015

Basic test is confirmed to work properly in Hortonworks 2.2 with Hive .14
Ill be attempting the a more complicated .sql statement later today. I'll pass along my final results in a day or two.

Logging initialized using configuration in file:/etc/hive/conf/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.0.0-2041/hive/lib/hive-jdbc-0.14.0.2.2.0.0-2041-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
false
false
false

@randallwhitman
Copy link
Contributor Author

We are investigating the getDisplayString method on the GenericUDF subclasses including ST_GeomFromGeoJson.

randallwhitman added a commit that referenced this issue Apr 6, 2015
Apparently used by  Hive .12 & .13 with hive.cache.expr.evaluation
climbage added a commit that referenced this issue Apr 6, 2015
JSON constructors getDisplayString  (#77)
@randallwhitman
Copy link
Contributor Author

Reference on GenericUDF.getDisplayString and expression caching (issue fixed in hive-0.14 by HIVE-7314).

@uamadman
Copy link

Ran into hive server stability issues with Hortonworks and their 2.2 build. Moved testing to Couldera's nightly build 5.5.0-1.cdh5.5.0.p0.869 and Hive-1.1.

Initial observations of Hive-1.1 prove it to be compatible... And responding much faster (Atleast 10x) than Hive - 0.14.0 or below.

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.869/jars/hive-common-1.1.0-cdh5.5.0-SNAPSHOT.jar!/hive-log4j.properties
false
false
false
hive>

@randallwhitman
Copy link
Contributor Author

Good to know, thanks.

@randallwhitman randallwhitman added this to the v1.2 milestone Mar 10, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants