Geo Clustering (geo_cluster)
The geo_cluster function performs geo-spatial clustering and noise reduction. The underlying algorithm is DBSCAN, with haversine distance in meters as the distance measure. The geo_cluster function takes four parameters:
- The latitude field
- The longitude field
- The distance, in meters, within which two points are considered neighbors
- The minimum number of points required to form a cluster
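To make the last two parameters concrete, here is a minimal Python sketch of DBSCAN over haversine distances. It is an illustration only, not the function's actual implementation: the names `haversine_m`, `dbscan`, `eps_m`, and `min_pts` are hypothetical, the Earth radius is the usual mean-radius approximation, and the neighbor search is brute force.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)
NOISE = -1                  # label for points that belong to no cluster


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def dbscan(points, eps_m, min_pts):
    """Label each (lat, lon) point with a cluster id, or NOISE.

    Brute-force O(n^2) neighbor search, for illustration only.
    """
    def neighbors(i):
        # A point counts as its own neighbor (distance 0).
        return [j for j in range(len(points))
                if haversine_m(*points[i], *points[j]) <= eps_m]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # core point: keep expanding the cluster
        cluster += 1
    return labels
```

With a distance of 100 meters and a minimum of 5 points, five points spaced about 11 meters apart form one cluster, while an isolated point far away is labeled as noise.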
Sample syntax
select geo_cluster(lat_d, lon_d, 100, 5) as cluster,
       lat_d,
       lon_d
from nyc311
where lat_d is not null
  and desc_t = 'Rat Sighting'
limit 5000
Result set
The geo_cluster result set contains a random sample of records that match the WHERE clause.
If no WHERE clause is included, the random sample is taken from the entire result set.
The size of the random sample can be controlled with the LIMIT clause. If no limit is applied, the default sample size is 25,000.
Points that match the WHERE clause but are not assigned to a cluster are not included in the result set.
The noise-reduced result set makes it easy to quickly find hot spots or clusters in geo-spatial data.
Because of the noise reduction, the final result set will likely be smaller than the limit.
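The relationship between the sample size and the final row count can be sketched in Python as follows. This is a conceptual model of the behavior described above, not the actual implementation: `sample_and_denoise` and `cluster_fn` are hypothetical names, and `cluster_fn` stands in for any clustering routine that returns one label per record, using -1 for noise.

```python
import random

NOISE = -1
DEFAULT_SAMPLE_SIZE = 25_000  # default sample size stated in the text above


def sample_and_denoise(records, cluster_fn, limit=DEFAULT_SAMPLE_SIZE, seed=None):
    """Take a random sample of at most `limit` records, cluster it,
    and keep only the records that were assigned to a cluster."""
    rng = random.Random(seed)
    if len(records) <= limit:
        sample = list(records)
    else:
        sample = rng.sample(records, limit)
    labels = cluster_fn(sample)
    # Noise points are dropped, so the result can be shorter than `limit`.
    return [(label, rec) for label, rec in zip(labels, sample) if label != NOISE]
```

Because noise is removed after sampling, the output never exceeds the limit and is usually smaller.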
The geo_cluster function returns the cluster name for each latitude/longitude point. The latitude and longitude fields can also be selected for plotting.
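Once the rows are retrieved, a small amount of client-side code can summarize each cluster before plotting. The sketch below assumes rows shaped like the sample query's output, that is `(cluster, lat_d, lon_d)` tuples; the helper name `summarize_clusters` and the cluster names in the usage note are hypothetical. The mean latitude and longitude serve only as a rough center, which is reasonable for small clusters that do not straddle the antimeridian.

```python
from collections import defaultdict


def summarize_clusters(rows):
    """rows: iterable of (cluster, lat, lon) tuples.

    Returns {cluster: (point_count, mean_lat, mean_lon)}.
    """
    groups = defaultdict(list)
    for cluster, lat, lon in rows:
        groups[cluster].append((lat, lon))

    summary = {}
    for cluster, pts in groups.items():
        n = len(pts)
        mean_lat = sum(p[0] for p in pts) / n
        mean_lon = sum(p[1] for p in pts) / n
        summary[cluster] = (n, mean_lat, mean_lon)
    return summary
```

For example, `summarize_clusters([("c1", 40.0, -74.0), ("c1", 42.0, -72.0)])` groups both points under `"c1"` and reports a count of 2; sorting the summary by count is one way to rank hot spots.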
Visualization
The geo_cluster output can be visualized on a map or scatter plot. The example below shows the geo_cluster output visualized with an Apache Zeppelin map visualization.