spark_read_warc {sparkwarc}				R Documentation
Description

Reads a WARC (Web ARChive) file into Apache Spark using sparklyr.
Usage

spark_read_warc(sc, name, path, repartition = 0L, memory = TRUE,
  overwrite = TRUE, group = FALSE, parse = FALSE, ...)
Arguments

sc           An active spark_connection.

name         The name to assign to the newly generated table.

path         The path to the file. Needs to be accessible from the
             cluster. Supports the "hdfs://", "s3n://" and "file://"
             protocols (see the sketch after this table).

repartition  The number of partitions used to distribute the generated
             table. Use 0 (the default) to avoid partitioning.

memory       Boolean; should the data be loaded eagerly into memory?
             (That is, should the table be cached?)

overwrite    Boolean; overwrite the table with the given name if it
             already exists?

group

parse

...          Additional arguments reserved for future use.
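As noted for path, the file may live on local storage, HDFS, or S3. A minimal sketch of the three documented protocols; the master URL, host names, bucket, file names, and table names below are placeholders for illustration, not resources shipped with the package:

library(sparklyr)
library(sparkwarc)

sc <- spark_connect(master = "local")

# Local file system (placeholder path)
spark_read_warc(sc, name = "warc_local",
                path = "file:///tmp/segment-00001.warc")

# HDFS (placeholder namenode host and path)
spark_read_warc(sc, name = "warc_hdfs",
                path = "hdfs://namenode:8020/warc/segment-00001.warc")

# Amazon S3 via the s3n protocol (placeholder bucket)
spark_read_warc(sc, name = "warc_s3",
                path = "s3n://my-bucket/warc/segment-00001.warc")

spark_disconnect(sc)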
Examples

library(sparklyr)
library(sparkwarc)

sc <- spark_connect(master = "spark://HOST:PORT")

df <- spark_read_warc(
  sc,
  name = "sample_warc",  # name under which the table is registered in Spark
  path = system.file("samples/sample.warc", package = "sparkwarc"),
  repartition = 0L,
  memory = FALSE,
  overwrite = FALSE
)

spark_disconnect(sc)
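The result behaves like any other sparklyr table, so it can be inspected with standard sparklyr/dplyr verbs. A minimal sketch, assuming df, sc, and the (hypothetical) table name "sample_warc" from the example above, and run before spark_disconnect(sc); because this page does not describe the column layout of the loaded WARC data, only schema-agnostic operations are shown:

library(sparklyr)
library(dplyr)

# How many rows were loaded into Spark
sdf_nrow(df)

# Inspect the schema without assuming particular column names
sdf_schema(df)

# Reference the registered table by name and pull a few rows locally
tbl(sc, "sample_warc") %>%
  head(5) %>%
  collect()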