spark_rcpp_read_warc {sparkwarc} | R Documentation |
Reads a WARC (Web ARChive) file using Rcpp.
spark_rcpp_read_warc(path, match_warc, match_line)
path |
The path to the file. It must be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3n://"’ and ‘"file://"’ protocols. |
match_warc |
Include only WARC files matching this character string. |
match_line |
Include only lines matching this character string. |
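
The two filters described above can be illustrated in plain R, without Spark or the package itself. This is a sketch under stated assumptions, not the package's implementation: the helper `filter_warc_lines` and the sample data are hypothetical, each record is assumed to begin at a line starting with "WARC/", and matching is assumed to be a fixed substring match rather than a regular expression.

```r
# Illustration only: mimics the filtering described by match_warc and
# match_line in base R. filter_warc_lines is a hypothetical helper.
filter_warc_lines <- function(lines, match_warc = "", match_line = "") {
  # Assumption: each record starts at a line beginning with "WARC/".
  record_id <- cumsum(startsWith(lines, "WARC/"))
  records <- split(lines, record_id)

  # Keep only records that contain match_warc somewhere
  # (fixed substring matching is an assumption).
  if (nzchar(match_warc)) {
    keep <- vapply(
      records,
      function(r) any(grepl(match_warc, r, fixed = TRUE)),
      logical(1)
    )
    records <- records[keep]
  }

  kept <- unlist(records, use.names = FALSE)
  if (is.null(kept)) kept <- character(0)

  # Within the remaining records, keep only lines containing match_line.
  if (nzchar(match_line)) {
    kept <- kept[grepl(match_line, kept, fixed = TRUE)]
  }
  kept
}

# Hypothetical two-record sample in WARC-like form.
sample_lines <- c(
  "WARC/1.0",
  "WARC-Type: response",
  "Content-Type: text/html",
  "<html><body>hello</body></html>",
  "WARC/1.0",
  "WARC-Type: request",
  "GET / HTTP/1.1"
)

result <- filter_warc_lines(
  sample_lines,
  match_warc = "WARC-Type: response",
  match_line = "<html"
)
print(result)
```

Here the first filter drops the request record, and the second keeps only the HTML line of the response record. Leaving either argument as an empty string disables that filter in this sketch; whether the package treats empty strings the same way should be checked against its source.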