<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<title>Chapter 9 Using the lower-level invoke API to manipulate Spark’s Java objects from R | Using Spark from R for performance with arbitrary code</title>
<meta name="description" content="This bookdown publication attempts to provide practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining the ability to use R code organized in custom-built functions and packages." />
<meta name="generator" content="bookdown 0.16 and GitBook 2.6.7" />
<meta property="og:title" content="Chapter 9 Using the lower-level invoke API to manipulate Spark’s Java objects from R | Using Spark from R for performance with arbitrary code" />
<meta property="og:type" content="book" />
<meta property="og:description" content="This bookdown publication attempts to provide practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining the ability to use R code organized in custom-built functions and packages." />
<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="Chapter 9 Using the lower-level invoke API to manipulate Spark’s Java objects from R | Using Spark from R for performance with arbitrary code" />
<meta name="twitter:description" content="This bookdown publication attempts to provide practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining the ability to use R code organized in custom-built functions and packages." />
<meta name="author" content="Jozef Hajnala" />
<meta name="date" content="2020-02-20" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black" />
<link rel="prev" href="constructing-sql-and-executing-it-with-spark.html"/>
<link rel="next" href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"/>
<script src="libs/jquery-2.2.3/jquery.min.js"></script>
<link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-bookdown.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-1149069-22"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-1149069-22');
</script>
<style type="text/css">
a.sourceLine { display: inline-block; line-height: 1.25; }
a.sourceLine { pointer-events: none; color: inherit; text-decoration: inherit; }
a.sourceLine:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode { white-space: pre; position: relative; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
code.sourceCode { white-space: pre-wrap; }
a.sourceLine { text-indent: -1em; padding-left: 1em; }
}
pre.numberSource a.sourceLine
{ position: relative; left: -4em; }
pre.numberSource a.sourceLine::before
{ content: attr(data-line-number);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; pointer-events: all; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
a.sourceLine::before { text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
</style>
<link rel="stylesheet" href="static/css/style.css" type="text/css" />
</head>
<body>
<div class="book without-animation with-summary font-size-2 font-family-1" data-basepath=".">
<div class="book-summary">
<nav role="navigation">
<ul class="summary">
<li><a href="./index.html">Using Spark from R for performance</a></li>
<li class="divider"></li>
<li class="chapter" data-level="1" data-path="index.html"><a href="index.html"><i class="fa fa-check"></i><b>1</b> Welcome</a><ul>
<li class="chapter" data-level="1.1" data-path="index.html"><a href="index.html#what-will-you-find-in-this-book"><i class="fa fa-check"></i><b>1.1</b> What will you find in this book</a></li>
<li class="chapter" data-level="1.2" data-path="index.html"><a href="index.html#book-sources"><i class="fa fa-check"></i><b>1.2</b> Book sources</a></li>
<li class="chapter" data-level="1.3" data-path="index.html"><a href="index.html#acknowledgments"><i class="fa fa-check"></i><b>1.3</b> Acknowledgments</a></li>
</ul></li>
<li class="chapter" data-level="2" data-path="setting-up-spark-with-r-and-sparklyr.html"><a href="setting-up-spark-with-r-and-sparklyr.html"><i class="fa fa-check"></i><b>2</b> Setting up Spark with R and sparklyr</a><ul>
<li class="chapter" data-level="2.1" data-path="setting-up-spark-with-r-and-sparklyr.html"><a href="setting-up-spark-with-r-and-sparklyr.html#interactive-manual-installation"><i class="fa fa-check"></i><b>2.1</b> Interactive manual installation</a></li>
</ul></li>
<li class="chapter" data-level="3" data-path="using-a-ready-made-docker-image.html"><a href="using-a-ready-made-docker-image.html"><i class="fa fa-check"></i><b>3</b> Using a ready-made Docker Image</a><ul>
<li class="chapter" data-level="3.1" data-path="using-a-ready-made-docker-image.html"><a href="using-a-ready-made-docker-image.html#installing-docker"><i class="fa fa-check"></i><b>3.1</b> Installing Docker</a></li>
<li class="chapter" data-level="3.2" data-path="using-a-ready-made-docker-image.html"><a href="using-a-ready-made-docker-image.html#using-the-docker-image-with-r"><i class="fa fa-check"></i><b>3.2</b> Using the Docker image with R</a><ul>
<li class="chapter" data-level="3.2.1" data-path="using-a-ready-made-docker-image.html"><a href="using-a-ready-made-docker-image.html#interactively-with-rstudio"><i class="fa fa-check"></i><b>3.2.1</b> Interactively with RStudio</a></li>
<li class="chapter" data-level="3.2.2" data-path="using-a-ready-made-docker-image.html"><a href="using-a-ready-made-docker-image.html#interactively-with-the-r-console"><i class="fa fa-check"></i><b>3.2.2</b> Interactively with the R console</a></li>
<li class="chapter" data-level="3.2.3" data-path="using-a-ready-made-docker-image.html"><a href="using-a-ready-made-docker-image.html#running-an-example-r-script"><i class="fa fa-check"></i><b>3.2.3</b> Running an example R script</a></li>
</ul></li>
<li class="chapter" data-level="3.3" data-path="using-a-ready-made-docker-image.html"><a href="using-a-ready-made-docker-image.html#interactively-with-the-spark-shell"><i class="fa fa-check"></i><b>3.3</b> Interactively with the Spark shell</a></li>
</ul></li>
<li class="chapter" data-level="4" data-path="connecting-and-using-a-local-spark-instance.html"><a href="connecting-and-using-a-local-spark-instance.html"><i class="fa fa-check"></i><b>4</b> Connecting and using a local Spark instance</a><ul>
<li class="chapter" data-level="4.1" data-path="connecting-and-using-a-local-spark-instance.html"><a href="connecting-and-using-a-local-spark-instance.html#packages-and-data"><i class="fa fa-check"></i><b>4.1</b> Packages and data</a></li>
<li class="chapter" data-level="4.2" data-path="connecting-and-using-a-local-spark-instance.html"><a href="connecting-and-using-a-local-spark-instance.html#connecting-to-spark-and-providing-it-with-data"><i class="fa fa-check"></i><b>4.2</b> Connecting to Spark and providing it with data</a></li>
<li class="chapter" data-level="4.3" data-path="connecting-and-using-a-local-spark-instance.html"><a href="connecting-and-using-a-local-spark-instance.html#first-glance-at-the-data"><i class="fa fa-check"></i><b>4.3</b> First glance at the data</a></li>
</ul></li>
<li class="chapter" data-level="5" data-path="communication-between-spark-and-sparklyr.html"><a href="communication-between-spark-and-sparklyr.html"><i class="fa fa-check"></i><b>5</b> Communication between Spark and sparklyr</a><ul>
<li class="chapter" data-level="5.1" data-path="communication-between-spark-and-sparklyr.html"><a href="communication-between-spark-and-sparklyr.html#sparklyr-as-a-spark-interface-provider"><i class="fa fa-check"></i><b>5.1</b> Sparklyr as a Spark interface provider</a></li>
<li class="chapter" data-level="5.2" data-path="communication-between-spark-and-sparklyr.html"><a href="communication-between-spark-and-sparklyr.html#an-r-function-translated-to-spark-sql"><i class="fa fa-check"></i><b>5.2</b> An R function translated to Spark SQL</a><ul>
<li class="chapter" data-level="5.2.1" data-path="communication-between-spark-and-sparklyr.html"><a href="communication-between-spark-and-sparklyr.html#how-does-spark-know-the-r-function-tolower"><i class="fa fa-check"></i><b>5.2.1</b> How does Spark know the R function <code>tolower()</code>?</a></li>
</ul></li>
<li class="chapter" data-level="5.3" data-path="communication-between-spark-and-sparklyr.html"><a href="communication-between-spark-and-sparklyr.html#an-r-function-not-translated-to-spark-sql"><i class="fa fa-check"></i><b>5.3</b> An R function not translated to Spark SQL</a></li>
<li class="chapter" data-level="5.4" data-path="communication-between-spark-and-sparklyr.html"><a href="communication-between-spark-and-sparklyr.html#a-hive-built-in-function-not-existing-in-r"><i class="fa fa-check"></i><b>5.4</b> A Hive built-in function not existing in R</a></li>
<li class="chapter" data-level="5.5" data-path="communication-between-spark-and-sparklyr.html"><a href="communication-between-spark-and-sparklyr.html#using-non-translated-functions-with-sparklyr"><i class="fa fa-check"></i><b>5.5</b> Using non-translated functions with sparklyr</a></li>
</ul></li>
<li class="chapter" data-level="6" data-path="non-translated-functions-with-spark-apply.html"><a href="non-translated-functions-with-spark-apply.html"><i class="fa fa-check"></i><b>6</b> Non-translated functions with spark_apply</a><ul>
<li class="chapter" data-level="6.1" data-path="non-translated-functions-with-spark-apply.html"><a href="non-translated-functions-with-spark-apply.html#what-is-so-important-about-this-distinction"><i class="fa fa-check"></i><b>6.1</b> What is so important about this distinction?</a></li>
<li class="chapter" data-level="6.2" data-path="non-translated-functions-with-spark-apply.html"><a href="non-translated-functions-with-spark-apply.html#what-happens-when-we-use-custom-functions-with-spark_apply"><i class="fa fa-check"></i><b>6.2</b> What happens when we use custom functions with <code>spark_apply()</code></a></li>
<li class="chapter" data-level="6.3" data-path="non-translated-functions-with-spark-apply.html"><a href="non-translated-functions-with-spark-apply.html#what-happens-when-we-use-translated-or-hive-built-in-functions"><i class="fa fa-check"></i><b>6.3</b> What happens when we use translated or Hive built-in functions</a></li>
<li class="chapter" data-level="6.4" data-path="non-translated-functions-with-spark-apply.html"><a href="non-translated-functions-with-spark-apply.html#which-r-functionality-is-currently-translated-and-built-in-to-hive"><i class="fa fa-check"></i><b>6.4</b> Which R functionality is currently translated and built-in to Hive</a></li>
<li class="chapter" data-level="6.5" data-path="non-translated-functions-with-spark-apply.html"><a href="non-translated-functions-with-spark-apply.html#making-serialization-faster-with-apache-arrow"><i class="fa fa-check"></i><b>6.5</b> Making serialization faster with Apache Arrow</a><ul>
<li class="chapter" data-level="6.5.1" data-path="non-translated-functions-with-spark-apply.html"><a href="non-translated-functions-with-spark-apply.html#what-is-apache-arrow-and-how-it-improves-performance"><i class="fa fa-check"></i><b>6.5.1</b> What is Apache Arrow and how it improves performance</a></li>
</ul></li>
<li class="chapter" data-level="6.6" data-path="non-translated-functions-with-spark-apply.html"><a href="non-translated-functions-with-spark-apply.html#conclusion-take-home-messages"><i class="fa fa-check"></i><b>6.6</b> Conclusion, take-home messages</a></li>
<li class="chapter" data-level="6.7" data-path="non-translated-functions-with-spark-apply.html"><a href="non-translated-functions-with-spark-apply.html#but-we-still-need-arbitrary-functions-to-run-fast"><i class="fa fa-check"></i><b>6.7</b> But we still need arbitrary functions to run fast</a></li>
</ul></li>
<li class="chapter" data-level="7" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html"><i class="fa fa-check"></i><b>7</b> Constructing functions by piping dplyr verbs</a><ul>
<li class="chapter" data-level="7.1" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#r-functions-as-combinations-of-dplyr-verbs-and-spark"><i class="fa fa-check"></i><b>7.1</b> R functions as combinations of dplyr verbs and Spark</a><ul>
<li class="chapter" data-level="7.1.1" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#trying-it-with-base-r-functions"><i class="fa fa-check"></i><b>7.1.1</b> Trying it with base R functions</a></li>
<li class="chapter" data-level="7.1.2" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#using-a-combination-of-supported-dplyr-verbs-and-operations"><i class="fa fa-check"></i><b>7.1.2</b> Using a combination of supported dplyr verbs and operations</a></li>
<li class="chapter" data-level="7.1.3" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#investigating-the-sql-translation-and-its-spark-plan"><i class="fa fa-check"></i><b>7.1.3</b> Investigating the SQL translation and its Spark plan</a></li>
</ul></li>
<li class="chapter" data-level="7.2" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#a-more-complex-use-case---joins-group-bys-and-aggregations"><i class="fa fa-check"></i><b>7.2</b> A more complex use case - Joins, group bys, and aggregations</a></li>
<li class="chapter" data-level="7.3" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#using-the-functions-with-local-versus-remote-datasets"><i class="fa fa-check"></i><b>7.3</b> Using the functions with local versus remote datasets</a><ul>
<li class="chapter" data-level="7.3.1" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#unified-front-end-different-back-ends"><i class="fa fa-check"></i><b>7.3.1</b> Unified front-end, different back-ends</a></li>
<li class="chapter" data-level="7.3.2" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#differences-in-na-and-nan-values"><i class="fa fa-check"></i><b>7.3.2</b> Differences in <code>NA</code> and <code>NaN</code> values</a></li>
<li class="chapter" data-level="7.3.3" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#dates-times-and-time-zones"><i class="fa fa-check"></i><b>7.3.3</b> Dates, times and time zones</a></li>
<li class="chapter" data-level="7.3.4" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#joins"><i class="fa fa-check"></i><b>7.3.4</b> Joins</a></li>
<li class="chapter" data-level="7.3.5" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#portability-of-used-methods"><i class="fa fa-check"></i><b>7.3.5</b> Portability of used methods</a></li>
</ul></li>
<li class="chapter" data-level="7.4" data-path="constructing-functions-by-piping-dplyr-verbs.html"><a href="constructing-functions-by-piping-dplyr-verbs.html#conclusion-take-home-messages-1"><i class="fa fa-check"></i><b>7.4</b> Conclusion, take-home messages</a></li>
</ul></li>
<li class="chapter" data-level="8" data-path="constructing-sql-and-executing-it-with-spark.html"><a href="constructing-sql-and-executing-it-with-spark.html"><i class="fa fa-check"></i><b>8</b> Constructing SQL and executing it with Spark</a><ul>
<li class="chapter" data-level="8.1" data-path="constructing-sql-and-executing-it-with-spark.html"><a href="constructing-sql-and-executing-it-with-spark.html#r-functions-as-spark-sql-generators"><i class="fa fa-check"></i><b>8.1</b> R functions as Spark SQL generators</a></li>
<li class="chapter" data-level="8.2" data-path="constructing-sql-and-executing-it-with-spark.html"><a href="constructing-sql-and-executing-it-with-spark.html#executing-the-generated-queries-via-spark"><i class="fa fa-check"></i><b>8.2</b> Executing the generated queries via Spark</a><ul>
<li class="chapter" data-level="8.2.1" data-path="constructing-sql-and-executing-it-with-spark.html"><a href="constructing-sql-and-executing-it-with-spark.html#using-dbi-as-the-interface"><i class="fa fa-check"></i><b>8.2.1</b> Using DBI as the interface</a></li>
<li class="chapter" data-level="8.2.2" data-path="constructing-sql-and-executing-it-with-spark.html"><a href="constructing-sql-and-executing-it-with-spark.html#invoking-sql-on-a-spark-session-object"><i class="fa fa-check"></i><b>8.2.2</b> Invoking sql on a Spark session object</a></li>
<li class="chapter" data-level="8.2.3" data-path="constructing-sql-and-executing-it-with-spark.html"><a href="constructing-sql-and-executing-it-with-spark.html#using-tbl-with-dbplyrs-sql"><i class="fa fa-check"></i><b>8.2.3</b> Using tbl with dbplyr’s sql</a></li>
<li class="chapter" data-level="8.2.4" data-path="constructing-sql-and-executing-it-with-spark.html"><a href="constructing-sql-and-executing-it-with-spark.html#wrapping-the-tbl-approach-into-functions"><i class="fa fa-check"></i><b>8.2.4</b> Wrapping the tbl approach into functions</a></li>
</ul></li>
<li class="chapter" data-level="8.3" data-path="constructing-sql-and-executing-it-with-spark.html"><a href="constructing-sql-and-executing-it-with-spark.html#where-sql-can-be-better-than-dbplyr-translation"><i class="fa fa-check"></i><b>8.3</b> Where SQL can be better than dbplyr translation</a><ul>
<li class="chapter" data-level="8.3.1" data-path="constructing-sql-and-executing-it-with-spark.html"><a href="constructing-sql-and-executing-it-with-spark.html#when-a-translation-is-not-there"><i class="fa fa-check"></i><b>8.3.1</b> When a translation is not there</a></li>
<li class="chapter" data-level="8.3.2" data-path="constructing-sql-and-executing-it-with-spark.html"><a href="constructing-sql-and-executing-it-with-spark.html#when-translation-does-not-provide-expected-results"><i class="fa fa-check"></i><b>8.3.2</b> When translation does not provide expected results</a></li>
<li class="chapter" data-level="8.3.3" data-path="constructing-sql-and-executing-it-with-spark.html"><a href="constructing-sql-and-executing-it-with-spark.html#when-portability-is-important"><i class="fa fa-check"></i><b>8.3.3</b> When portability is important</a></li>
</ul></li>
</ul></li>
<li class="chapter" data-level="9" data-path="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html"><a href="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html"><i class="fa fa-check"></i><b>9</b> Using the lower-level invoke API to manipulate Spark’s Java objects from R</a><ul>
<li class="chapter" data-level="9.1" data-path="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html"><a href="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html#the-invoke-api-of-sparklyr"><i class="fa fa-check"></i><b>9.1</b> The invoke() API of sparklyr</a></li>
<li class="chapter" data-level="9.2" data-path="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html"><a href="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html#getting-started-with-the-invoke-api"><i class="fa fa-check"></i><b>9.2</b> Getting started with the invoke API</a></li>
<li class="chapter" data-level="9.3" data-path="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html"><a href="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html#grouping-and-aggregation-with-invoke-chains"><i class="fa fa-check"></i><b>9.3</b> Grouping and aggregation with invoke chains</a><ul>
<li class="chapter" data-level="9.3.1" data-path="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html"><a href="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html#what-is-all-that-extra-code"><i class="fa fa-check"></i><b>9.3.1</b> What is all that extra code?</a></li>
</ul></li>
<li class="chapter" data-level="9.4" data-path="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html"><a href="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html#wrapping-the-invocations-into-r-functions"><i class="fa fa-check"></i><b>9.4</b> Wrapping the invocations into R functions</a></li>
<li class="chapter" data-level="9.5" data-path="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html"><a href="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html#reconstructing-variable-normalization"><i class="fa fa-check"></i><b>9.5</b> Reconstructing variable normalization</a></li>
<li class="chapter" data-level="9.6" data-path="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html"><a href="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html#where-invoke-can-be-better-than-dplyr-translation-or-sql"><i class="fa fa-check"></i><b>9.6</b> Where invoke can be better than dplyr translation or SQL</a></li>
<li class="chapter" data-level="9.7" data-path="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html"><a href="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r.html#conclusion"><i class="fa fa-check"></i><b>9.7</b> Conclusion</a></li>
</ul></li>
<li class="chapter" data-level="10" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><i class="fa fa-check"></i><b>10</b> Exploring the invoke API from R with Java reflection and examining invokes with logs</a><ul>
<li class="chapter" data-level="10.1" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html#examining-available-methods-from-r"><i class="fa fa-check"></i><b>10.1</b> Examining available methods from R</a></li>
<li class="chapter" data-level="10.2" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html#using-the-java-reflection-api-to-list-the-available-methods"><i class="fa fa-check"></i><b>10.2</b> Using the Java reflection API to list the available methods</a></li>
<li class="chapter" data-level="10.3" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html#investigating-dataset-and-sparkcontext-class-methods"><i class="fa fa-check"></i><b>10.3</b> Investigating DataSet and SparkContext class methods</a><ul>
<li class="chapter" data-level="10.3.1" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html#using-helpers-to-explore-the-methods"><i class="fa fa-check"></i><b>10.3.1</b> Using helpers to explore the methods</a></li>
<li class="chapter" data-level="10.3.2" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html#unexported-helpers-provided-by-sparklyr"><i class="fa fa-check"></i><b>10.3.2</b> Unexported helpers provided by sparklyr</a></li>
</ul></li>
<li class="chapter" data-level="10.4" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html#how-sparklyr-communicates-with-spark-invoke-logging"><i class="fa fa-check"></i><b>10.4</b> How sparklyr communicates with Spark, invoke logging</a><ul>
<li class="chapter" data-level="10.4.1" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html#using-dplyr-verbs-translated-with-dbplyr"><i class="fa fa-check"></i><b>10.4.1</b> Using dplyr verbs translated with dbplyr</a></li>
<li class="chapter" data-level="10.4.2" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html#using-dbi-to-send-queries"><i class="fa fa-check"></i><b>10.4.2</b> Using DBI to send queries</a></li>
<li class="chapter" data-level="10.4.3" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html#using-the-invoke-interface"><i class="fa fa-check"></i><b>10.4.3</b> Using the invoke interface</a></li>
<li class="chapter" data-level="10.4.4" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html#redirecting-the-invoke-logs"><i class="fa fa-check"></i><b>10.4.4</b> Redirecting the invoke logs</a></li>
</ul></li>
<li class="chapter" data-level="10.5" data-path="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html"><a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html#conclusion-1"><i class="fa fa-check"></i><b>10.5</b> Conclusion</a></li>
</ul></li>
<li class="chapter" data-level="11" data-path="combining-approaches-into-lazy-datasets.html"><a href="combining-approaches-into-lazy-datasets.html"><i class="fa fa-check"></i><b>11</b> Combining approaches into lazy datasets</a></li>
<li class="chapter" data-level="12" data-path="references-and-resources.html"><a href="references-and-resources.html"><i class="fa fa-check"></i><b>12</b> References and resources</a><ul>
<li class="chapter" data-level="12.1" data-path="references-and-resources.html"><a href="references-and-resources.html#online-resources"><i class="fa fa-check"></i><b>12.1</b> Online resources</a><ul>
<li class="chapter" data-level="12.1.1" data-path="references-and-resources.html"><a href="references-and-resources.html#getting-started-with-sparklyr"><i class="fa fa-check"></i><b>12.1.1</b> Getting started with sparklyr</a></li>
<li class="chapter" data-level="12.1.2" data-path="references-and-resources.html"><a href="references-and-resources.html#dbi-spark-sql-hive"><i class="fa fa-check"></i><b>12.1.2</b> DBI, Spark SQL, Hive</a></li>
<li class="chapter" data-level="12.1.3" data-path="references-and-resources.html"><a href="references-and-resources.html#docker"><i class="fa fa-check"></i><b>12.1.3</b> Docker</a></li>
<li class="chapter" data-level="12.1.4" data-path="references-and-resources.html"><a href="references-and-resources.html#spark-api-java-scala-and-friends"><i class="fa fa-check"></i><b>12.1.4</b> Spark API, Java, Scala and friends</a></li>
<li class="chapter" data-level="12.1.5" data-path="references-and-resources.html"><a href="references-and-resources.html#apache-arrow"><i class="fa fa-check"></i><b>12.1.5</b> Apache Arrow</a></li>
</ul></li>
<li class="chapter" data-level="12.2" data-path="references-and-resources.html"><a href="references-and-resources.html#physical-books"><i class="fa fa-check"></i><b>12.2</b> Physical Books</a></li>
</ul></li>
<li class="chapter" data-level="13" data-path="appendices.html"><a href="appendices.html"><i class="fa fa-check"></i><b>13</b> Appendices</a><ul>
<li class="chapter" data-level="13.1" data-path="appendices.html"><a href="appendices.html#r-session-information"><i class="fa fa-check"></i><b>13.1</b> R session information</a></li>
<li class="chapter" data-level="13.2" data-path="appendices.html"><a href="appendices.html#setup-of-apache-arrow"><i class="fa fa-check"></i><b>13.2</b> Setup of Apache Arrow</a></li>
</ul></li>
<li class="divider"></li>
<li><a href="https://github.com/rstudio/bookdown" target="blank">Published with bookdown</a></li>
</ul>
</nav>
</div>
<div class="book-body">
<div class="body-inner">
<div class="book-header" role="navigation">
<h1>
<i class="fa fa-circle-o-notch fa-spin"></i><a href="./">Using Spark from R for performance with arbitrary code</a>
</h1>
</div>
<div class="page-wrapper" tabindex="-1" role="main">
<div class="page-inner">
<section class="normal" id="section-">
<div id="using-the-lower-level-invoke-api-to-manipulate-sparks-java-objects-from-r" class="section level1">
<h1><span class="header-section-number">Chapter 9</span> Using the lower-level invoke API to manipulate Spark’s Java objects from R</h1>
<div class="wizardry">
<p>
There will be no foolish wand-waving or silly incantations in this class.
</p>
<ul>
<li>
Severus Snape
</li>
</ul>
</div>
<p>In the previous chapters, we have shown how to write functions as both <a href="constructing-functions-by-piping-dplyr-verbs.html">combinations of dplyr verbs</a> and <a href="using-r-to-construct-sql-queries-and-let-spark-execute-them.html">SQL query generators</a> that can be executed by Spark, how to execute them with DBI and how to achieve lazy SQL statements that only get executed when needed.</p>
<p>In this chapter, we will look at how to write R functions that interface with Spark via a lower-level invocation API that lets us use all the functionality that is exposed by the Scala Spark APIs. We will also show how such R calls relate to Scala code.</p>
<div id="the-invoke-api-of-sparklyr" class="section level2">
<h2><span class="header-section-number">9.1</span> The invoke() API of sparklyr</h2>
<p>So far when interfacing with Spark from R, we have used the sparklyr package in three ways:</p>
<ul>
<li>Writing combinations of <span class="rpackage">dplyr</span> verbs that would be translated to Spark SQL via the <span class="rpackage">dbplyr</span> package and the SQL executed by Spark when requested</li>
<li>Generating Spark SQL code directly and sending it for execution in multiple ways</li>
<li>Combining the above two methods</li>
</ul>
<p>What these methods have in common is that they translate operations written in R to Spark SQL and that SQL code is then sent for execution by our Spark instance.</p>
<p>There is however another approach that we can use with <span class="rpackage">sparklyr</span>, which will be more familiar to users or developers who have worked with <a href="https://jozef.io/r901-primer-java-from-r-1/">packages like <span class="rpackage">rJava</span></a> or <span class="rpackage">rscala</span> before. Even though possibly less convenient than the APIs provided by the two aforementioned packages, <span class="rpackage">sparklyr</span> provides an invocation API that exposes three functions:</p>
<ol style="list-style-type: decimal">
<li><code>invoke(jobj, method, ...)</code> to execute a method on a Java object reference</li>
<li><code>invoke_static(sc, class, method, ...)</code> to execute a static method associated with a Java class</li>
<li><code>invoke_new(sc, class, ...)</code> to invoke a constructor associated with a Java class</li>
</ol>
<p>Let us have a look at how we can use those functions in practice to efficiently work with Spark from R.</p>
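<p>As a first taste of these functions outside of DataFrames, we can construct and manipulate a plain Java object. The following is only a sketch, assuming an active connection <code>sc</code>; the <code>java.math.BigInteger</code> and <code>java.lang.Math</code> classes are chosen purely for illustration:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Construct a Java object and call methods on the returned reference
bigint <- invoke_new(sc, "java.math.BigInteger", "1000000000")
bigint %>% invoke("pow", 2L) %>% invoke("toString")

# invoke_static() works analogously for static methods of a class
invoke_static(sc, "java.lang.Math", "hypot", 3, 4)
## [1] 5</code></pre></div>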
</div>
<div id="getting-started-with-the-invoke-api" class="section level2">
<h2><span class="header-section-number">9.2</span> Getting started with the invoke API</h2>
<p>We can start with a few very simple examples of <code>invoke()</code> usage, for instance getting the number of rows of the <code>tbl_flights</code>:</p>
<div class="sourceCode" id="cb97"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb97-1" data-line-number="1"><span class="co"># Get the count of rows</span></a>
<a class="sourceLine" id="cb97-2" data-line-number="2">tbl_flights <span class="op">%>%</span></a>
<a class="sourceLine" id="cb97-3" data-line-number="3"><span class="st"> </span><span class="kw">spark_dataframe</span>() <span class="op">%>%</span></a>
<a class="sourceLine" id="cb97-4" data-line-number="4"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"count"</span>)</a></code></pre></div>
<pre><code>## [1] 336776</code></pre>
<p>We see one extra operation before invoking the count: <code>spark_dataframe()</code>. This is because the <code>invoke()</code> interface works with Java object references and not <code>tbl</code> objects in remote sources such as <code>tbl_flights</code>. We, therefore, need to convert <code>tbl_flights</code> to a Java object reference, for which we use the <code>spark_dataframe()</code> function.</p>
<p>Now, for something more exciting, let us compute a summary of the variables in <code>tbl_flights</code> using the <code>describe</code> method:</p>
<div class="sourceCode" id="cb99"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb99-1" data-line-number="1">tbl_flights_summary <-<span class="st"> </span>tbl_flights <span class="op">%>%</span></a>
<a class="sourceLine" id="cb99-2" data-line-number="2"><span class="st"> </span><span class="kw">spark_dataframe</span>() <span class="op">%>%</span></a>
<a class="sourceLine" id="cb99-3" data-line-number="3"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"describe"</span>, <span class="kw">as.list</span>(<span class="kw">colnames</span>(tbl_flights))) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb99-4" data-line-number="4"><span class="st"> </span><span class="kw">sdf_register</span>()</a>
<a class="sourceLine" id="cb99-5" data-line-number="5">tbl_flights_summary</a></code></pre></div>
<pre><code>## # Source: spark<?> [?? x 20]
## summary id year month day dep_time sched_dep_time dep_delay arr_time
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 count 3367… 3367… 3367… 3367… 328521 336776 328521 328063
## 2 mean 1683… 2013… 6.54… 15.7… 1349.10… 1344.25484001… 12.63907… 1502.05…
## 3 stddev 9721… 0.0 3.41… 8.76… 488.281… 467.335755734… 40.21006… 533.264…
## 4 min 1 2013 1 1 1 106 -43.0 1
## 5 max 3367… 2013 12 31 2400 2359 1301.0 2400
## # … with 11 more variables: sched_arr_time <chr>, arr_delay <chr>,
## # carrier <chr>, flight <chr>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <chr>, distance <chr>, hour <chr>, minute <chr></code></pre>
<p>We also see one extra operation after invoking the describe method: <code>sdf_register()</code>. This is because the <code>invoke()</code> interface also <em>returns</em> Java object references and we may prefer a more user-friendly <code>tbl</code> object instead. This is where <code>sdf_register()</code> comes in: it registers a Spark DataFrame and returns a <code>tbl_spark</code> object back to us.</p>
<p>And indeed, we can see that the wrapper <code>sdf_describe()</code> provided by the sparklyr package itself works in a very similar fashion:</p>
<div class="sourceCode" id="cb101"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb101-1" data-line-number="1"><span class="kw">body</span>(sparklyr<span class="op">::</span>sdf_describe)</a></code></pre></div>
<pre><code>## {
## in_df <- cols %in% colnames(x)
## if (any(!in_df)) {
## msg <- paste0("The following columns are not in the data frame: ",
## paste0(cols[which(!in_df)], collapse = ", "))
## stop(msg)
## }
## cols <- cast_character_list(cols)
## x %>% spark_dataframe() %>% invoke("describe", cols) %>%
## sdf_register()
## }</code></pre>
<p>If we so wish, for DataFrame related object references, we can also call <code>collect()</code> to retrieve the results directly, without using <code>sdf_register()</code> first, for instance retrieving the full content of the <code>origin</code> column:</p>
<div class="sourceCode" id="cb103"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb103-1" data-line-number="1">tbl_flights <span class="op">%>%</span></a>
<a class="sourceLine" id="cb103-2" data-line-number="2"><span class="st"> </span><span class="kw">spark_dataframe</span>() <span class="op">%>%</span></a>
<a class="sourceLine" id="cb103-3" data-line-number="3"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"select"</span>, <span class="st">"origin"</span>, <span class="kw">list</span>()) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb103-4" data-line-number="4"><span class="st"> </span><span class="kw">collect</span>()</a></code></pre></div>
<pre><code>## # A tibble: 336,776 x 1
## origin
## <chr>
## 1 EWR
## 2 LGA
## 3 JFK
## 4 JFK
## 5 LGA
## 6 EWR
## 7 EWR
## 8 LGA
## 9 JFK
## 10 LGA
## # … with 336,766 more rows</code></pre>
<p>It can also be helpful to investigate the schema of our <code>flights</code> DataFrame:</p>
<div class="sourceCode" id="cb105"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb105-1" data-line-number="1">tbl_flights <span class="op">%>%</span></a>
<a class="sourceLine" id="cb105-2" data-line-number="2"><span class="st"> </span><span class="kw">spark_dataframe</span>() <span class="op">%>%</span></a>
<a class="sourceLine" id="cb105-3" data-line-number="3"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"schema"</span>)</a></code></pre></div>
<pre><code>## <jobj[733]>
## org.apache.spark.sql.types.StructType
## StructType(StructField(id,IntegerType,true), StructField(year,IntegerType,true), StructField(month,IntegerType,true), StructField(day,IntegerType,true), StructField(dep_time,IntegerType,true), StructField(sched_dep_time,IntegerType,true), StructField(dep_delay,DoubleType,true), StructField(arr_time,IntegerType,true), StructField(sched_arr_time,IntegerType,true), StructField(arr_delay,DoubleType,true), StructField(carrier,StringType,true), StructField(flight,IntegerType,true), StructField(tailnum,StringType,true), StructField(origin,StringType,true), StructField(dest,StringType,true), StructField(air_time,DoubleType,true), StructField(distance,DoubleType,true), StructField(hour,DoubleType,true), StructField(minute,DoubleType,true), StructField(time_hour,TimestampType,true))</code></pre>
<p>We can also use the invoke interface on other objects, such as the <code>SparkContext</code>. Let us, for instance, retrieve the <code>uiWebUrl</code> of our context:</p>
<div class="sourceCode" id="cb107"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb107-1" data-line-number="1">sc <span class="op">%>%</span></a>
<a class="sourceLine" id="cb107-2" data-line-number="2"><span class="st"> </span><span class="kw">spark_context</span>() <span class="op">%>%</span></a>
<a class="sourceLine" id="cb107-3" data-line-number="3"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"uiWebUrl"</span>) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb107-4" data-line-number="4"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"toString"</span>)</a></code></pre></div>
<pre><code>## [1] "Some(http://localhost:4040)"</code></pre>
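<p>Note that <code>uiWebUrl</code> returns a Scala <code>Option</code>, which is why its <code>toString</code> shows the value wrapped in <code>Some(...)</code>. As a sketch of the same chain, the wrapped value itself can be retrieved by invoking the Option’s <code>get</code> method instead of <code>toString</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># get() unwraps the Option; the returned Java String is converted
# to an R character vector automatically
sc %>%
  spark_context() %>%
  invoke("uiWebUrl") %>%
  invoke("get")
## [1] "http://localhost:4040"</code></pre></div>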
</div>
<div id="grouping-and-aggregation-with-invoke-chains" class="section level2">
<h2><span class="header-section-number">9.3</span> Grouping and aggregation with invoke chains</h2>
<p>Imagine we would like to do simple aggregations of a Spark DataFrame, such as an average of a column grouped by another column. For reference, we can do this very simply using the <span class="rpackage">dplyr</span> approach. Let’s compute the average departure delay by origin of the flight:</p>
<div class="sourceCode" id="cb109"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb109-1" data-line-number="1">tbl_flights <span class="op">%>%</span></a>
<a class="sourceLine" id="cb109-2" data-line-number="2"><span class="st"> </span><span class="kw">group_by</span>(origin) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb109-3" data-line-number="3"><span class="st"> </span><span class="kw">summarise</span>(<span class="kw">avg</span>(dep_delay))</a></code></pre></div>
<pre><code>## # Source: spark<?> [?? x 2]
## origin `avg(dep_delay)`
## <chr> <dbl>
## 1 EWR 15.1
## 2 JFK 12.1
## 3 LGA 10.3</code></pre>
<p>Now we will show how to do the same aggregation via the lower level API. Using the Spark shell we would simply write in Scala:</p>
<div class="sourceCode" id="cb111"><pre class="sourceCode scala"><code class="sourceCode scala"><a class="sourceLine" id="cb111-1" data-line-number="1">flights.</a>
<a class="sourceLine" id="cb111-2" data-line-number="2"> <span class="fu">groupBy</span>(<span class="st">"origin"</span>).</a>
<a class="sourceLine" id="cb111-3" data-line-number="3"> <span class="fu">agg</span>(<span class="fu">avg</span>(<span class="st">"dep_delay"</span>))</a></code></pre></div>
<p>Translating that into the lower level <code>invoke()</code> API provided by <span class="rpackage">sparklyr</span> can look similar to the following code.</p>
<div class="sourceCode" id="cb112"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb112-1" data-line-number="1">tbl_flights <span class="op">%>%</span></a>
<a class="sourceLine" id="cb112-2" data-line-number="2"><span class="st"> </span><span class="kw">spark_dataframe</span>() <span class="op">%>%</span></a>
<a class="sourceLine" id="cb112-3" data-line-number="3"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"groupBy"</span>, <span class="st">"origin"</span>, <span class="kw">list</span>()) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb112-4" data-line-number="4"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"agg"</span>, <span class="kw">invoke_static</span>(sc, <span class="st">"org.apache.spark.sql.functions"</span>, <span class="st">"expr"</span>, <span class="st">"avg(dep_delay)"</span>), <span class="kw">list</span>()) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb112-5" data-line-number="5"><span class="st"> </span><span class="kw">sdf_register</span>()</a></code></pre></div>
<pre><code>## # Source: spark<?> [?? x 2]
## origin `avg(dep_delay)`
## <chr> <dbl>
## 1 EWR 15.1
## 2 JFK 12.1
## 3 LGA 10.3</code></pre>
<div id="what-is-all-that-extra-code" class="section level3">
<h3><span class="header-section-number">9.3.1</span> What is all that extra code?</h3>
<p>Now, compared to the two very simple operations in the Scala version, we have some gotchas to examine:</p>
<ul>
<li><p>one of the <code>invoke()</code> calls is quite long. Instead of just <code>avg("dep_delay")</code> like in the Scala example, we use <code>invoke_static(sc, "org.apache.spark.sql.functions", "expr", "avg(dep_delay)")</code>. This is because the <code>avg("dep_delay")</code> expression is syntactic sugar provided by Scala, but when calling from R we need to provide the object reference hidden behind that sugar.</p></li>
<li><p>the empty <code>list()</code> at the end of the <code>"groupBy"</code> and <code>"agg"</code> invokes. This is needed as a workaround: some Scala methods <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@groupBy(col1:String,cols:String*):org.apache.spark.sql.RelationalGroupedDataset">take String, String*</a> as arguments and sparklyr currently does not support variable-length parameters. We can pass <code>list()</code>, representing an empty <code>String[]</code> in Scala, as the needed second argument.</p></li>
</ul>
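<p>The same workaround extends to grouping by more than one column: additional column names go into the list that stands in for the <code>String*</code> argument. A sketch building on the example above:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># list("dest") fills the String* varargs slot of groupBy(col1, cols*)
tbl_flights %>%
  spark_dataframe() %>%
  invoke("groupBy", "origin", list("dest")) %>%
  invoke("agg", invoke_static(sc, "org.apache.spark.sql.functions", "expr", "avg(dep_delay)"), list()) %>%
  sdf_register()</code></pre></div>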
</div>
</div>
<div id="wrapping-the-invocations-into-r-functions" class="section level2">
<h2><span class="header-section-number">9.4</span> Wrapping the invocations into R functions</h2>
<p>Seeing the above example, we can quickly write a useful wrapper to ease the pain a little. First, we can create a small function that will generate the aggregation expression we can use with <code>invoke("agg", ...)</code>.</p>
<div class="sourceCode" id="cb114"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb114-1" data-line-number="1">agg_expr <-<span class="st"> </span><span class="cf">function</span>(tbl, exprs) {</a>
<a class="sourceLine" id="cb114-2" data-line-number="2"> sparklyr<span class="op">::</span><span class="kw">invoke_static</span>(</a>
<a class="sourceLine" id="cb114-3" data-line-number="3"> tbl[[<span class="st">"src"</span>]][[<span class="st">"con"</span>]],</a>
<a class="sourceLine" id="cb114-4" data-line-number="4"> <span class="st">"org.apache.spark.sql.functions"</span>,</a>
<a class="sourceLine" id="cb114-5" data-line-number="5"> <span class="st">"expr"</span>,</a>
<a class="sourceLine" id="cb114-6" data-line-number="6"> exprs</a>
<a class="sourceLine" id="cb114-7" data-line-number="7"> )</a>
<a class="sourceLine" id="cb114-8" data-line-number="8">}</a></code></pre></div>
<p>Next, we can wrap around the entire process to make a more generic aggregation function, using the fact that a remote tibble has the details on <code>sc</code> within its <code>tbl[["src"]][["con"]]</code> element:</p>
<div class="sourceCode" id="cb115"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb115-1" data-line-number="1">grpagg_invoke <-<span class="st"> </span><span class="cf">function</span>(tbl, colName, groupColName, aggOperation) {</a>
<a class="sourceLine" id="cb115-2" data-line-number="2"> avgColumn <-<span class="st"> </span>tbl <span class="op">%>%</span><span class="st"> </span><span class="kw">agg_expr</span>(<span class="kw">paste0</span>(aggOperation, <span class="st">"("</span>, colName, <span class="st">")"</span>))</a>
<a class="sourceLine" id="cb115-3" data-line-number="3"> tbl <span class="op">%>%</span></a>
<a class="sourceLine" id="cb115-4" data-line-number="4"><span class="st"> </span><span class="kw">spark_dataframe</span>() <span class="op">%>%</span></a>
<a class="sourceLine" id="cb115-5" data-line-number="5"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"groupBy"</span>, groupColName, <span class="kw">list</span>()) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb115-6" data-line-number="6"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"agg"</span>, avgColumn, <span class="kw">list</span>()) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb115-7" data-line-number="7"><span class="st"> </span><span class="kw">sdf_register</span>()</a>
<a class="sourceLine" id="cb115-8" data-line-number="8">}</a></code></pre></div>
<p>And finally use our wrapper to get the same results in a more user-friendly way:</p>
<div class="sourceCode" id="cb116"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb116-1" data-line-number="1">tbl_flights <span class="op">%>%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb116-2" data-line-number="2"><span class="st"> </span><span class="kw">grpagg_invoke</span>(<span class="st">"arr_delay"</span>, <span class="dt">groupColName =</span> <span class="st">"origin"</span>, <span class="dt">aggOperation =</span> <span class="st">"avg"</span>)</a></code></pre></div>
<pre><code>## # Source: spark<?> [?? x 2]
## origin `avg(arr_delay)`
## <chr> <dbl>
## 1 EWR 9.11
## 2 JFK 5.55
## 3 LGA 5.78</code></pre>
</div>
<div id="reconstructing-variable-normalization" class="section level2">
<h2><span class="header-section-number">9.5</span> Reconstructing variable normalization</h2>
<p>Now we will attempt to construct the variable normalization that we have shown in the previous parts with <span class="rpackage">dplyr</span> verbs and SQL generation - we will normalize the values of a column by first subtracting the mean value and then dividing the values by the standard deviation:</p>
<div class="sourceCode" id="cb118"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb118-1" data-line-number="1">normalize_invoke <-<span class="st"> </span><span class="cf">function</span>(tbl, colName) {</a>
<a class="sourceLine" id="cb118-2" data-line-number="2"> sdf <-<span class="st"> </span>tbl <span class="op">%>%</span><span class="st"> </span><span class="kw">spark_dataframe</span>()</a>
<a class="sourceLine" id="cb118-3" data-line-number="3"> stdCol <-<span class="st"> </span><span class="kw">agg_expr</span>(tbl, <span class="kw">paste0</span>(<span class="st">"stddev_samp("</span>, colName, <span class="st">")"</span>))</a>
<a class="sourceLine" id="cb118-4" data-line-number="4"> avgCol <-<span class="st"> </span><span class="kw">agg_expr</span>(tbl, <span class="kw">paste0</span>(<span class="st">"avg("</span>, colName, <span class="st">")"</span>))</a>
<a class="sourceLine" id="cb118-5" data-line-number="5"> avgTemp <-<span class="st"> </span>sdf <span class="op">%>%</span></a>
<a class="sourceLine" id="cb118-6" data-line-number="6"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"agg"</span>, avgCol, <span class="kw">list</span>()) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb118-7" data-line-number="7"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"first"</span>)</a>
<a class="sourceLine" id="cb118-8" data-line-number="8"> stdTemp <-<span class="st"> </span>sdf <span class="op">%>%</span></a>
<a class="sourceLine" id="cb118-9" data-line-number="9"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"agg"</span>, stdCol, <span class="kw">list</span>()) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb118-10" data-line-number="10"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"first"</span>)</a>
<a class="sourceLine" id="cb118-11" data-line-number="11"> newCol <-<span class="st"> </span>sdf <span class="op">%>%</span></a>
<a class="sourceLine" id="cb118-12" data-line-number="12"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"col"</span>, colName) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb118-13" data-line-number="13"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"minus"</span>, <span class="kw">as.numeric</span>(avgTemp)) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb118-14" data-line-number="14"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"divide"</span>, <span class="kw">as.numeric</span>(stdTemp))</a>
<a class="sourceLine" id="cb118-15" data-line-number="15"> sdf <span class="op">%>%</span></a>
<a class="sourceLine" id="cb118-16" data-line-number="16"><span class="st"> </span><span class="kw">invoke</span>(<span class="st">"withColumn"</span>, colName, newCol) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb118-17" data-line-number="17"><span class="st"> </span><span class="kw">sdf_register</span>()</a>
<a class="sourceLine" id="cb118-18" data-line-number="18">}</a>
<a class="sourceLine" id="cb118-19" data-line-number="19"></a>
<a class="sourceLine" id="cb118-20" data-line-number="20">tbl_weather <span class="op">%>%</span><span class="st"> </span><span class="kw">normalize_invoke</span>(<span class="st">"temp"</span>)</a></code></pre></div>
<pre><code>## # Source: spark<?> [?? x 16]
## id origin year month day hour temp dewp humid wind_dir wind_speed
## <int> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 EWR 2013 1 1 1 -0.913 26.1 59.4 270 10.4
## 2 2 EWR 2013 1 1 2 -0.913 27.0 61.6 250 8.06
## 3 3 EWR 2013 1 1 3 -0.913 28.0 64.4 240 11.5
## 4 4 EWR 2013 1 1 4 -0.862 28.0 62.2 250 12.7
## 5 5 EWR 2013 1 1 5 -0.913 28.0 64.4 260 12.7
## 6 6 EWR 2013 1 1 6 -0.974 28.0 67.2 240 11.5
## 7 7 EWR 2013 1 1 7 -0.913 28.0 64.4 240 15.0
## 8 8 EWR 2013 1 1 8 -0.862 28.0 62.2 250 10.4
## 9 9 EWR 2013 1 1 9 -0.862 28.0 62.2 260 15.0
## 10 10 EWR 2013 1 1 10 -0.802 28.0 59.6 260 13.8
## # … with more rows, and 5 more variables: wind_gust <dbl>, precip <dbl>,
## # pressure <dbl>, visib <dbl>, time_hour <dttm></code></pre>
<p>The above implementation is just an example and far from optimal, but it illustrates a few interesting points:</p>
<ul>
<li>Using <code>invoke("first")</code> will actually compute and collect the value into the R session</li>
<li>Those collected values are then sent back during the <code>invoke("minus", as.numeric(avgTemp))</code> and <code>invoke("divide", as.numeric(stdTemp))</code></li>
</ul>
<p>This means that there is unnecessary overhead in sending those values from the Spark instance into R and back, which incurs a slight performance penalty.</p>
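<p>If we wanted to avoid the round trip entirely, one option is to push the whole computation to Spark as a single SQL window expression via <code>selectExpr</code>. This is only a hedged sketch, not a drop-in replacement: it adds a new <code>temp_norm</code> column rather than overwriting <code>temp</code>, and it assumes that a list of strings maps onto the <code>String*</code> varargs of <code>selectExpr</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Sketch: normalize temp in one pass on the Spark side, with no
# intermediate values collected into the R session
tbl_weather %>%
  spark_dataframe() %>%
  invoke("selectExpr", list(
    "*",
    "(temp - avg(temp) OVER ()) / stddev_samp(temp) OVER () AS temp_norm"
  )) %>%
  sdf_register()</code></pre></div>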
</div>
<div id="where-invoke-can-be-better-than-dplyr-translation-or-sql" class="section level2">
<h2><span class="header-section-number">9.6</span> Where invoke can be better than dplyr translation or SQL</h2>
<p>As we have seen in the above examples, working with the <code>invoke()</code> API can prove more difficult than using the intuitive syntax of <span class="rpackage">dplyr</span> or SQL queries. In some use cases, the trade-off may still be worth it. In our practice, these are some examples of such situations:</p>
<ul>
<li>When Scala’s Spark API is more flexible, powerful or suitable for a particular task and the translation is not as good</li>
<li>When performance is crucial and we can produce more optimal solutions using the invocations</li>
<li>When we know the Scala API well and do not want to invest time in learning the <span class="rpackage">dplyr</span> syntax, as it may be easier to translate the Scala calls into a series of <code>invoke()</code> calls</li>
<li>When we need to interact and manipulate other Java objects apart from the standard Spark DataFrames</li>
</ul>
</div>
<div id="conclusion" class="section level2">
<h2><span class="header-section-number">9.7</span> Conclusion</h2>
<p>In this chapter, we have looked at how to use the lower-level invoke interface provided by <span class="rpackage">sparklyr</span> to manipulate Spark objects and other Java object references. In the following chapter, we will dig a bit deeper into using Java’s reflection API to make the invoke interface more accessible from R, getting detailed invocation logs and more.</p>
</div>
</div>
</section>
</div>
</div>
</div>
<a href="constructing-sql-and-executing-it-with-spark.html" class="navigation navigation-prev " aria-label="Previous page"><i class="fa fa-angle-left"></i></a>
<a href="exploring-the-invoke-api-from-r-with-java-reflection-and-examining-invokes-with-logs.html" class="navigation navigation-next " aria-label="Next page"><i class="fa fa-angle-right"></i></a>
</div>
</div>
<script src="libs/gitbook-2.6.7/js/app.min.js"></script>
<script src="libs/gitbook-2.6.7/js/lunr.js"></script>
<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
<script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
<script>
gitbook.require(["gitbook"], function(gitbook) {
gitbook.start({
"sharing": {
"github": false,
"facebook": true,
"twitter": true,
"linkedin": false,
"weibo": false,
"instapaper": false,
"vk": false,
"all": ["facebook", "twitter", "linkedin", "weibo", "instapaper"]
},
"fontsettings": {
"theme": "white",
"family": "sans",
"size": 2
},
"edit": {
"link": null,
"text": null
},
"history": {
"link": null,
"text": null
},
"view": {
"link": null,
"text": null
},
"download": null,
"toc": {
"collapse": "subsection"
}
});
});
</script>
</body>
</html>