Class WarcReader
java.lang.Object
com.github.bottomlessarchive.warc.service.WarcReader
public class WarcReader
extends java.lang.Object
This class provides basic functions to read and parse a WARC file. Given a compressed or an
uncompressed stream of a WARC file, WarcReader reads WARC records and parses them into
WarcRecord objects.
Field Summary
static java.nio.charset.Charset DEFAULT_CHARSET
    The default Charset used by the parser when no other Charset is provided.
Constructor Summary
WarcReader(java.io.InputStream datasource)
    Create a new WarcReader and set the provided stream as the data source.
WarcReader(java.io.InputStream datasource, java.nio.charset.Charset charset)
    Create a new WarcReader and set the provided stream as the data source.
WarcReader(java.io.InputStream datasource, java.nio.charset.Charset charset, boolean compressed)
    Create a new WarcReader and set the provided stream as the data source.
WarcReader(java.net.URL datasourceLocation)
    Create a new WarcReader and set the file on the provided URL location as the data source.
WarcReader(java.net.URLConnection datasourceConnection, java.nio.charset.Charset charset, boolean compressed)
    Create a new WarcReader and set the file on the provided URLConnection as the data source.
WarcReader(java.net.URL datasourceLocation, java.nio.charset.Charset charset)
    Create a new WarcReader and set the file on the provided URL location as the data source.
WarcReader(java.net.URL datasourceLocation, java.nio.charset.Charset charset, boolean compressed)
    Create a new WarcReader and set the file on the provided URL location as the data source.
Method Summary
protected java.util.Optional<WarcRecord<WarcContentBlock>> parse()
    This method based on the WARC format specification parses a WARC record and creates a WarcRecord object.
java.util.Optional<WarcRecord<WarcContentBlock>> readRecord()
    Read a WARC record from the provided data source.
Field Details
DEFAULT_CHARSET
public static final java.nio.charset.Charset DEFAULT_CHARSET
The default Charset used by the parser when no other Charset is provided.
Constructor Details

WarcReader
public WarcReader(java.net.URL datasourceLocation)
Create a new WarcReader and set the file on the provided URL location as the data source.
Parameters:
datasourceLocation - the location of the data source to back this reader

WarcReader
public WarcReader(java.net.URL datasourceLocation, java.nio.charset.Charset charset)
Create a new WarcReader and set the file on the provided URL location as the data source.
Parameters:
datasourceLocation - the location of the data source to back this reader
charset - character set for the parser

WarcReader
public WarcReader(java.net.URL datasourceLocation, java.nio.charset.Charset charset, boolean compressed)
Create a new WarcReader and set the file on the provided URL location as the data source. The default timeout value for connecting to the URL is 120 seconds.
Parameters:
datasourceLocation - the location of the data source to back this reader
charset - character set for the parser
compressed - true if the input stream is compressed, false otherwise

WarcReader
public WarcReader(java.net.URLConnection datasourceConnection, java.nio.charset.Charset charset, boolean compressed)
Create a new WarcReader and set the file on the provided URLConnection as the data source.
Parameters:
datasourceConnection - the location of the data source to back this reader
charset - character set for the parser
compressed - true if the input stream is compressed, false otherwise
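
The URL-based constructor above documents a 120 second default connect timeout. A caller who wants to set the timeout explicitly could presumably prepare a URLConnection and hand it to the URLConnection overload. The sketch below uses only JDK classes; the class name ConnectionTimeoutSketch and the helper prepare are hypothetical, and how WarcReader applies its own default internally is not stated on this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of the WARC library.
final class ConnectionTimeoutSketch {

    // 120 seconds expressed in milliseconds, mirroring the documented default.
    static final int CONNECT_TIMEOUT_MILLIS = 120_000;

    // Creates a connection object with the connect timeout already configured.
    // openConnection() does not contact the server; that happens on connect()
    // or when the input stream is first requested.
    static URLConnection prepare(String location) {
        try {
            URLConnection connection = URI.create(location).toURL().openConnection();
            connection.setConnectTimeout(CONNECT_TIMEOUT_MILLIS);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned connection could then be passed as the datasourceConnection argument, together with a Charset and the compressed flag.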

WarcReader
public WarcReader(java.io.InputStream datasource)
Create a new WarcReader and set the provided stream as the data source.
Parameters:
datasource - the data source to back this reader

WarcReader
public WarcReader(java.io.InputStream datasource, java.nio.charset.Charset charset)
Create a new WarcReader and set the provided stream as the data source.
Parameters:
datasource - the data source to back this reader
charset - character set for the parser

WarcReader
public WarcReader(java.io.InputStream datasource, java.nio.charset.Charset charset, boolean compressed)
Create a new WarcReader and set the provided stream as the data source.
Parameters:
datasource - the data source to back this reader
charset - character set for the parser
compressed - true if the input stream is compressed, false otherwise
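
To illustrate what a compressed flag of this kind typically controls, here is a minimal sketch that wraps the data source in a GZIPInputStream when the flag is true and uses it unchanged otherwise. It assumes gzip compression, which is the usual encoding of .warc.gz files; whether WarcReader does exactly this internally is an assumption. The class CompressedFlagSketch and its helpers are hypothetical and use only JDK classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the WARC library.
final class CompressedFlagSketch {

    // Returns a stream of uncompressed bytes: gzip-decoded when the flag is
    // set, the original stream otherwise.
    static InputStream open(InputStream datasource, boolean compressed) {
        try {
            return compressed ? new GZIPInputStream(datasource) : datasource;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Gzip-compresses a string so the example has compressed input to read.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the whole stream and decodes it as UTF-8.
    static String readAll(InputStream in) {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing the wrong flag value is the failure mode to watch for: a gzip stream read as plain text yields binary data, and a plain stream opened with the flag set fails when the gzip header is checked.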
Method Details

readRecord
Read a WARC record from the provided data source. If the returned Optional is empty, the reader has reached the end of the data source.
Returns:
the freshly read WARC record
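
The empty-Optional convention described for readRecord suggests a simple read loop: keep calling until the result is empty. The sketch below shows that loop against a stand-in Supplier so that it runs with the JDK alone; ReadLoopSketch, drain and sourceOf are hypothetical names, and the Supplier is an assumption standing in for a real reader.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper, not part of the WARC library.
final class ReadLoopSketch {

    // Calls the supplier until it returns an empty Optional and collects
    // every record seen before that point.
    static <T> List<T> drain(Supplier<Optional<T>> readRecord) {
        List<T> records = new ArrayList<>();
        Optional<T> next = readRecord.get();
        while (next.isPresent()) {
            records.add(next.get());
            next = readRecord.get();
        }
        return records;
    }

    // Stand-in data source: yields each item once, then empty Optionals.
    static <T> Supplier<Optional<T>> sourceOf(List<T> items) {
        Iterator<T> it = items.iterator();
        return () -> it.hasNext() ? Optional.of(it.next()) : Optional.empty();
    }
}
```

With a real reader, a method reference such as reader::readRecord could likely take the place of the stand-in supplier, assuming readRecord declares no checked exceptions.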
parse
Based on the WARC format specification, this method parses a WARC record and creates a WarcRecord object. It throws a WarcFormatException if the structure of the input file is invalid. An explanation of the parsing error is provided in the exception message.
Returns:
the parsed WARC record
Throws:
WarcFormatException - when unable to parse the next record
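
As a rough illustration of the kind of structure check that parsing a WARC record involves, the sketch below reads the header part of a record: a version line starting with "WARC/", followed by "Name: value" fields separated by CRLF and ended by an empty line. It is not the library's parser. WarcHeaderSketch and parseHeader are hypothetical, IllegalArgumentException stands in for WarcFormatException, and header line folding and the content block are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the WARC library.
final class WarcHeaderSketch {

    // Parses the version line and the named fields of one record header.
    // Throws IllegalArgumentException (a stand-in for WarcFormatException)
    // with an explanatory message when the structure is invalid.
    static Map<String, String> parseHeader(String headerBlock) {
        String[] lines = headerBlock.split("\r\n");
        if (!lines[0].startsWith("WARC/")) {
            throw new IllegalArgumentException("Missing WARC version line: " + lines[0]);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // the empty line ends the header
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Malformed header line: " + line);
            }
            fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return fields;
    }
}
```

Putting the offending line into the exception message mirrors the documented behavior that the explanation for a parsing error is carried in the message.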