Class WarcReader

java.lang.Object
com.github.bottomlessarchive.warc.service.WarcReader

public class WarcReader
extends java.lang.Object
This class provides basic functions to read and parse a WARC file. Given a compressed or uncompressed stream of a WARC file, WarcReader reads WARC records and parses them into WarcRecord objects.
  • Field Summary

    Fields 
    Modifier and Type Field Description
    static java.nio.charset.Charset DEFAULT_CHARSET
    The default Charset used by the parser when no other Charset is provided.
  • Constructor Summary

    Constructors 
    Constructor Description
    WarcReader(java.io.InputStream datasource)
    Create a new WarcReader and set the provided stream as the data source.
    WarcReader(java.io.InputStream datasource, java.nio.charset.Charset charset)
    Create a new WarcReader and set the provided stream as the data source.
    WarcReader(java.io.InputStream datasource, java.nio.charset.Charset charset, boolean compressed)
    Create a new WarcReader and set the provided stream as the data source.
    WarcReader(java.net.URL datasourceLocation)
    Create a new WarcReader and set the file on the provided URL location as the data source.
    WarcReader(java.net.URLConnection datasourceConnection, java.nio.charset.Charset charset, boolean compressed)
    Create a new WarcReader and set the file on the provided URLConnection as the data source.
    WarcReader(java.net.URL datasourceLocation, java.nio.charset.Charset charset)
    Create a new WarcReader and set the file on the provided URL location as the data source.
    WarcReader(java.net.URL datasourceLocation, java.nio.charset.Charset charset, boolean compressed)
    Create a new WarcReader and set the file on the provided URL location as the data source.
  • Method Summary

    Modifier and Type Method Description
    protected java.util.Optional<WarcRecord<WarcContentBlock>> parse()
    Parses a WARC record according to the WARC format specification and creates a WarcRecord object.
    java.util.Optional<WarcRecord<WarcContentBlock>> readRecord()
    Read a WARC record from the provided data source.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • DEFAULT_CHARSET

      public static final java.nio.charset.Charset DEFAULT_CHARSET
      The default Charset used by the parser when no other Charset is provided.
  • Constructor Details

    • WarcReader

      public WarcReader(java.net.URL datasourceLocation)
      Create a new WarcReader and set the file on the provided URL location as the data source.
      Parameters:
      datasourceLocation - the location of the data source to back this reader
    • WarcReader

      public WarcReader(java.net.URL datasourceLocation, java.nio.charset.Charset charset)
      Create a new WarcReader and set the file on the provided URL location as the data source.
      Parameters:
      datasourceLocation - the location of the data source to back this reader
      charset - character set for the parser
    • WarcReader

      public WarcReader(java.net.URL datasourceLocation, java.nio.charset.Charset charset, boolean compressed)
      Create a new WarcReader and set the file on the provided URL location as the data source. The default timeout value for connecting to the URL is 120 seconds.
      Parameters:
      datasourceLocation - the location of the data source to back this reader
      charset - character set for the parser
      compressed - true if the input stream is compressed, false otherwise
    • WarcReader

      public WarcReader(java.net.URLConnection datasourceConnection, java.nio.charset.Charset charset, boolean compressed)
      Create a new WarcReader and set the file on the provided URLConnection as the data source.
      Parameters:
      datasourceConnection - the connection to the data source to back this reader
      charset - character set for the parser
      compressed - true if the input stream is compressed, false otherwise
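      The URL-based constructor above documents a default connection timeout of 120 seconds. When a different timeout is needed, one option is to configure a URLConnection first and hand it to the URLConnection-based constructor. The following is a minimal sketch using only JDK calls; the class name, the helper openWithTimeouts, and the file name are hypothetical, and it is an assumption that WarcReader leaves timeouts set on the connection untouched.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Path;
import java.time.Duration;

class TimeoutConnectionSketch {

    // Opens (but does not yet connect) a URLConnection with explicit timeouts.
    // Hypothetical helper, not part of the WarcReader API.
    static URLConnection openWithTimeouts(URL url, Duration connect, Duration read) throws IOException {
        URLConnection connection = url.openConnection();
        // Both setters take milliseconds as an int.
        connection.setConnectTimeout(Math.toIntExact(connect.toMillis()));
        connection.setReadTimeout(Math.toIntExact(read.toMillis()));
        return connection;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical local file; a remote http(s) URL would be configured the same way.
        URL url = Path.of("example.warc.gz").toUri().toURL();
        URLConnection connection = openWithTimeouts(url, Duration.ofSeconds(10), Duration.ofSeconds(30));
        System.out.println(connection.getConnectTimeout() + " ms connect, "
                + connection.getReadTimeout() + " ms read");
    }
}
```

      The returned connection could then be passed as the datasourceConnection argument, together with a charset and the compressed flag.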
    • WarcReader

      public WarcReader(java.io.InputStream datasource)
      Create a new WarcReader and set the provided stream as the data source.
      Parameters:
      datasource - the data source to back this reader
    • WarcReader

      public WarcReader(java.io.InputStream datasource, java.nio.charset.Charset charset)
      Create a new WarcReader and set the provided stream as the data source.
      Parameters:
      datasource - the data source to back this reader
      charset - character set for the parser
    • WarcReader

      public WarcReader(java.io.InputStream datasource, java.nio.charset.Charset charset, boolean compressed)
      Create a new WarcReader and set the provided stream as the data source.
      Parameters:
      datasource - the data source to back this reader
      charset - character set for the parser
      compressed - true if the input stream is compressed, false otherwise
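      The compressed flag has to be chosen by the caller. When the origin of a stream is unknown, one possible approach is to peek at its first two bytes before constructing the reader. The sketch below assumes the compression in question is gzip (magic bytes 0x1f 0x8b), which this page does not state; the class and the method looksGzipped are hypothetical and use only JDK classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class CompressionSniffSketch {

    // Returns true if the stream starts with the gzip magic bytes 0x1f 0x8b.
    // The stream must support mark/reset; it is rewound before returning.
    static boolean looksGzipped(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(2);
        int first = in.read();
        int second = in.read();
        in.reset();
        return first == 0x1f && second == 0x8b;
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "WARC/1.0\r\n".getBytes(StandardCharsets.US_ASCII);

        // Build a small gzip-compressed copy of the same bytes in memory.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(plain);
        }

        System.out.println(looksGzipped(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(looksGzipped(new ByteArrayInputStream(plain)));
    }
}
```

      A stream that does not support mark/reset can be wrapped in a java.io.BufferedInputStream first. The boolean result could then serve as the compressed argument of the three-argument constructors.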
  • Method Details

    • readRecord

      public java.util.Optional<WarcRecord<WarcContentBlock>> readRecord()
      Read a WARC record from the provided data source. If the returned Optional is empty, the reader has reached the end of the data source.
      Returns:
      the freshly read WARC record, or an empty Optional if the end of the data source has been reached
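      Because readRecord signals the end of the data source with an empty Optional, a typical caller loops until that happens. The sketch below illustrates the loop against a hypothetical stand-in source so that it runs without the library on the classpath; the class name and the drain helper are illustrative only and not part of this API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

class ReadLoopSketch {

    // Calls the source repeatedly and collects values until it yields an empty Optional.
    static <T> List<T> drain(Supplier<Optional<T>> source) {
        List<T> records = new ArrayList<>();
        Optional<T> next = source.get();
        while (next.isPresent()) {
            records.add(next.get());
            next = source.get();
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a reader: yields two values, then signals the end.
        Iterator<String> fake = List.of("record-1", "record-2").iterator();
        Supplier<Optional<String>> source =
                () -> fake.hasNext() ? Optional.of(fake.next()) : Optional.empty();

        System.out.println(drain(source)); // prints [record-1, record-2]
    }
}
```

      With an actual WarcReader, the method reference reader::readRecord should fit the Supplier parameter, assuming readRecord declares no checked exceptions (none appear in the signature above). For large archives, processing each record inside the loop rather than collecting them all is likely the better fit.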
    • parse

      protected java.util.Optional<WarcRecord<WarcContentBlock>> parse()
      Parses a WARC record according to the WARC format specification and creates a WarcRecord object.

      This method throws a WarcFormatException if the structure of the input file is invalid. The exception message explains the parsing error.

      Returns:
      the parsed WARC record
      Throws:
      WarcFormatException - when unable to parse the next record