Class BiojavaSparkUtils


  • public class BiojavaSparkUtils
    extends java.lang.Object
    A class of Biojava-related Spark utility methods. These extend the functionality of SparkUtils.
    Author:
    Anthony Bradley
    • Method Summary

      Modifier and Type Method Description
      static org.rcsb.mmtf.api.StructureDataInterface convertToStructDataInt​(org.biojava.nbio.structure.Structure structure)
      Get a StructureDataInterface from a Biojava Structure.
      static org.rcsb.mmtf.spark.data.SegmentDataRDD filterSequenceSimilar​(org.rcsb.mmtf.spark.data.SegmentDataRDD segmentDataRDD, java.lang.String inputSequence, double minSimilarity)
      Filter the SegmentDataRDD based on minimum sequence similarity to a reference sequence.
      static AtomDataRDD findAtoms​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD)
      Find all the atoms in the RDD.
      static AtomDataRDD findAtoms​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne)
      Find the given type of atoms for each structure in the PDB.
      static AtomContactRDD findContacts​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, double cutoff)
      Find the contacts for each structure in the PDB.
      static AtomContactRDD findContacts​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne, double cutoff)
      Find the contacts for each structure in the PDB.
      static AtomContactRDD findContacts​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectTwo, double cutoff)
      Find the contacts for each structure in the PDB.
      static org.biojava.nbio.structure.contact.AtomContactSet getAtomContacts​(java.util.List<org.biojava.nbio.structure.Atom> atoms, double cutoff)
      Get all the atom contacts in a list of atoms.
      static org.biojava.nbio.structure.contact.AtomContactSet getAtomContacts​(java.util.List<org.biojava.nbio.structure.Atom> atomListOne, java.util.List<org.biojava.nbio.structure.Atom> atomListTwo, double cutoff)
      Get the contacts between two lists of atoms.
      static org.biojava.nbio.structure.contact.AtomContactSet getAtomContactsSlow​(java.util.List<org.biojava.nbio.structure.Atom> atomListOne, java.util.List<org.biojava.nbio.structure.Atom> atomListTwo, double cutoff)
      Get the contacts between two lists of atoms using iteration and not grids.
      static java.util.List<org.biojava.nbio.structure.Atom> getAtoms​(org.rcsb.mmtf.api.StructureDataInterface structure)
      Get all the atoms in the structure using a StructureDataInterface.
      static java.util.List<org.biojava.nbio.structure.Atom> getAtoms​(org.rcsb.mmtf.api.StructureDataInterface structure, org.rcsb.mmtf.spark.data.AtomSelectObject atomSelectObject)
      Get all the atoms of a given name or in a given group in the structure using a StructureDataInterface.
      static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Structure> getBiojavaRdd​(java.lang.String filePath)
      Get a JavaPairRDD of String and Structure from a file path.
      static org.biojava.nbio.structure.Atom[] getCaAtoms​(org.rcsb.mmtf.spark.data.Segment segment)
      Gets the C-alpha Atoms for the given input Segment.
      static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Atom[]> getChainRDD​(java.lang.String filePath, int minLength, double sample)
      Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
      static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Atom[]> getChainRDD​(java.util.List<java.lang.String> pdbIdList)
      Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
      static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Atom[]> getChainRDD​(java.util.List<java.lang.String> pdbIdList, int minLength)
      Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
      static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Atom[]> getChainRDD​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, int minLength)
      Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
      static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Structure> getFromList​(java.io.File[] pdbIdList)
      Generate a JavaPairRDD of String Structure from a list of PDB files.
      static java.lang.String getGroupAtomName​(org.biojava.nbio.structure.Atom atom)
      Get a conjoined group atom name from an atom.
      static org.rcsb.mmtf.spark.data.StructureDataRDD getStructureRDDFromMmcif​(java.lang.String filePath)
      Function (for benchmarking) to get a StructureDataRDD from a Hadoop file of mmCIF data.
      static java.lang.String getTypeFromChainId​(org.rcsb.mmtf.api.StructureDataInterface structureDataInterface, int chainInd)
      Get the type of a given chain index - SHOULD BE MOVED INTO ENCODER UTILS
      static org.biojava.nbio.structure.align.util.AtomCache setUpBioJava()
      Set up the configuration parameters for BioJava.
      static org.biojava.nbio.structure.align.util.AtomCache setUpBioJava​(java.lang.String ccBaseUrl)
      Set up the configuration parameters for BioJava.
      static void writeToFile​(java.util.List<java.lang.String> pdbCodeList, java.lang.String uri, java.lang.String producer)
      Write a list of PDB ids to a Hadoop sequence file in MMTF format.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • BiojavaSparkUtils

        public BiojavaSparkUtils()
    • Method Detail

      • getCaAtoms

        public static org.biojava.nbio.structure.Atom[] getCaAtoms​(org.rcsb.mmtf.spark.data.Segment segment)
        Gets the C-alpha Atoms for the given input Segment.
        Parameters:
        segment - the input Segment object
        Returns:
        the C-alpha array of Atom objects
      • findContacts

        public static AtomContactRDD findContacts​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD,
                                                  org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne,
                                                  org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectTwo,
                                                  double cutoff)
        Find the contacts for each structure in the PDB.
        Parameters:
        structureDataRDD - the input StructureDataRDD
        selectObjectOne - the first type of atoms
        selectObjectTwo - the second type of atoms
        cutoff - the maximum contact distance, in Angstroms
        Returns:
        the AtomContactRDD of the contacts found
      • findContacts

        public static AtomContactRDD findContacts​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD,
                                                  org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne,
                                                  double cutoff)
        Find the contacts for each structure in the PDB.
        Parameters:
        structureDataRDD - the input StructureDataRDD
        selectObjectOne - the type of atoms
        cutoff - the maximum contact distance, in Angstroms
        Returns:
        the AtomContactRDD of the contacts found
      • findContacts

        public static AtomContactRDD findContacts​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD,
                                                  double cutoff)
        Find the contacts for each structure in the PDB.
        Parameters:
        structureDataRDD - the input StructureDataRDD
        cutoff - the maximum contact distance, in Angstroms
        Returns:
        the AtomContactRDD of the contacts found
      • findAtoms

        public static AtomDataRDD findAtoms​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD,
                                            org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne)
        Find the given type of atoms for each structure in the PDB.
        Parameters:
        structureDataRDD - the input StructureDataRDD
        selectObjectOne - the type of atom to find
        Returns:
        the AtomDataRDD of the atoms found
      • findAtoms

        public static AtomDataRDD findAtoms​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD)
        Find all the atoms in the RDD.
        Parameters:
        structureDataRDD - the input StructureDataRDD
        Returns:
        the AtomDataRDD of all the atoms
      • getBiojavaRdd

        public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Structure> getBiojavaRdd​(java.lang.String filePath)
        Get a JavaPairRDD of String and Structure from a file path.
        Parameters:
        filePath - the input path to the Hadoop sequence file
        Returns:
        the JavaPairRDD of String and Structure
      • getAtoms

        public static java.util.List<org.biojava.nbio.structure.Atom> getAtoms​(org.rcsb.mmtf.api.StructureDataInterface structure,
                                                                               org.rcsb.mmtf.spark.data.AtomSelectObject atomSelectObject)
        Get all the atoms of a given name or in a given group in the structure using a StructureDataInterface.
        Parameters:
        structure - the input StructureDataInterface
        atomSelectObject - the selection of atom names or groups to retrieve
        Returns:
        the list of atoms fitting the given criteria
      • getAtomContacts

        public static org.biojava.nbio.structure.contact.AtomContactSet getAtomContacts​(java.util.List<org.biojava.nbio.structure.Atom> atoms,
                                                                                        double cutoff)
        Get all the atom contacts in a list of atoms.
        Parameters:
        atoms - the list of Atoms
        cutoff - the cutoff distance
        Returns:
        the AtomContactSet of the contacts
      • getAtomContacts

        public static org.biojava.nbio.structure.contact.AtomContactSet getAtomContacts​(java.util.List<org.biojava.nbio.structure.Atom> atomListOne,
                                                                                        java.util.List<org.biojava.nbio.structure.Atom> atomListTwo,
                                                                                        double cutoff)
        Get the contacts between two lists of atoms.
        Parameters:
        atomListOne - the first list of Atoms
        atomListTwo - the second list of Atoms
        cutoff - the cutoff to define a contact
        Returns:
        the AtomContactSet of the contacts
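The description of getAtomContactsSlow ("using iteration and not grids") suggests that this method relies on a spatial grid. The following is a minimal sketch of that general technique using only the standard library. It is not the library's implementation: the class name GridContacts, the use of double[] coordinates in place of Atom objects, the int[] index pairs in place of an AtomContactSet, and the inclusive cutoff are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BiojavaSparkUtils.
class GridContacts {

    // Index of the grid cell containing a coordinate, for cubic cells of the given size.
    private static long cellIndex(double coordinate, double cellSize) {
        return (long) Math.floor(coordinate / cellSize);
    }

    // Key identifying one grid cell by its three integer indices.
    private static String cellKey(long cx, long cy, long cz) {
        return cx + ":" + cy + ":" + cz;
    }

    /**
     * Returns index pairs {i, j}, with i into listOne and j into listTwo,
     * for every pair of points at most cutoff apart (inclusive: an assumption).
     */
    static List<int[]> findContacts(List<double[]> listOne, List<double[]> listTwo, double cutoff) {
        if (cutoff <= 0) {
            throw new IllegalArgumentException("cutoff must be positive");
        }
        // Bin the second list into cells whose edge length equals the cutoff.
        Map<String, List<Integer>> grid = new HashMap<>();
        for (int j = 0; j < listTwo.size(); j++) {
            double[] b = listTwo.get(j);
            String key = cellKey(cellIndex(b[0], cutoff), cellIndex(b[1], cutoff), cellIndex(b[2], cutoff));
            grid.computeIfAbsent(key, k -> new ArrayList<>()).add(j);
        }
        List<int[]> contacts = new ArrayList<>();
        double cutoffSquared = cutoff * cutoff;
        for (int i = 0; i < listOne.size(); i++) {
            double[] a = listOne.get(i);
            long cx = cellIndex(a[0], cutoff);
            long cy = cellIndex(a[1], cutoff);
            long cz = cellIndex(a[2], cutoff);
            // Any point within the cutoff lies in the same cell or one of its 26 neighbours.
            for (long dx = -1; dx <= 1; dx++) {
                for (long dy = -1; dy <= 1; dy++) {
                    for (long dz = -1; dz <= 1; dz++) {
                        List<Integer> bucket = grid.get(cellKey(cx + dx, cy + dy, cz + dz));
                        if (bucket == null) {
                            continue;
                        }
                        for (int j : bucket) {
                            double[] b = listTwo.get(j);
                            double ex = a[0] - b[0];
                            double ey = a[1] - b[1];
                            double ez = a[2] - b[2];
                            if (ex * ex + ey * ey + ez * ez <= cutoffSquared) {
                                contacts.add(new int[] {i, j});
                            }
                        }
                    }
                }
            }
        }
        return contacts;
    }
}
```

Because each point is compared only with points in neighbouring cells, the cost grows roughly with the number of nearby pairs rather than with the product of the two list sizes.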
      • getAtomContactsSlow

        public static org.biojava.nbio.structure.contact.AtomContactSet getAtomContactsSlow​(java.util.List<org.biojava.nbio.structure.Atom> atomListOne,
                                                                                            java.util.List<org.biojava.nbio.structure.Atom> atomListTwo,
                                                                                            double cutoff)
        Get the contacts between two lists of atoms using iteration and not grids.
        Parameters:
        atomListOne - the first list of Atoms
        atomListTwo - the second list of Atoms
        cutoff - the cutoff to define a contact
        Returns:
        the AtomContactSet of the contacts
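As an illustration of what "iteration and not grids" can mean, the sketch below compares every point in one list with every point in the other. It uses only the standard library and is not the library's code: the class name BruteForceContacts, the nested Contact type, the double[] coordinates standing in for Atom objects, and the inclusive cutoff are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of BiojavaSparkUtils.
class BruteForceContacts {

    /** Indices into the two input lists plus the distance between the points. */
    static final class Contact {
        final int indexOne;
        final int indexTwo;
        final double distance;

        Contact(int indexOne, int indexTwo, double distance) {
            this.indexOne = indexOne;
            this.indexTwo = indexTwo;
            this.distance = distance;
        }
    }

    /** Checks all pairs; the cutoff is treated as inclusive (an assumption). */
    static List<Contact> findContacts(List<double[]> listOne, List<double[]> listTwo, double cutoff) {
        List<Contact> contacts = new ArrayList<>();
        double cutoffSquared = cutoff * cutoff;
        for (int i = 0; i < listOne.size(); i++) {
            double[] a = listOne.get(i);
            for (int j = 0; j < listTwo.size(); j++) {
                double[] b = listTwo.get(j);
                double dx = a[0] - b[0];
                double dy = a[1] - b[1];
                double dz = a[2] - b[2];
                // Compare squared distances so the square root is taken only for real contacts.
                double distanceSquared = dx * dx + dy * dy + dz * dz;
                if (distanceSquared <= cutoffSquared) {
                    contacts.add(new Contact(i, j, Math.sqrt(distanceSquared)));
                }
            }
        }
        return contacts;
    }

    public static void main(String[] args) {
        List<double[]> one = Arrays.asList(new double[] {0, 0, 0}, new double[] {10, 0, 0});
        List<double[]> two = Arrays.asList(new double[] {3, 4, 0}, new double[] {50, 50, 50});
        for (Contact c : findContacts(one, two, 5.5)) {
            System.out.println(c.indexOne + " - " + c.indexTwo + " : " + c.distance);
        }
    }
}
```

The work here is proportional to the product of the two list sizes, which is presumably why the method is labelled "Slow".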
      • getChainRDD

        public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Atom[]> getChainRDD​(java.util.List<java.lang.String> pdbIdList,
                                                                                                                                  int minLength)
                                                                                                                           throws java.io.IOException
        Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
        Parameters:
        pdbIdList - the input list of PDB ids
        minLength - the minimum length of each chain
        Returns:
        the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates
        Throws:
        java.io.IOException - due to an error reading the input file
      • getChainRDD

        public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Atom[]> getChainRDD​(java.lang.String filePath,
                                                                                                                                  int minLength,
                                                                                                                                  double sample)
                                                                                                                           throws java.io.IOException
        Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
        Parameters:
        filePath - the Hadoop file to read from
        minLength - the minimum length of each chain
        sample - the fraction of this file to sample
        Returns:
        the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates
        Throws:
        java.io.IOException - due to an error reading the input file
      • getChainRDD

        public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Atom[]> getChainRDD​(java.util.List<java.lang.String> pdbIdList)
                                                                                                                           throws java.io.IOException
        Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
        Parameters:
        pdbIdList - the input list of PDB ids
        Returns:
        the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates
        Throws:
        java.io.IOException - due to an error reading the input file
      • getChainRDD

        public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Atom[]> getChainRDD​(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD,
                                                                                                                                  int minLength)
                                                                                                                           throws java.io.IOException
        Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
        Parameters:
        structureDataRDD - the input StructureDataRDD
        minLength - the minimum length of each chain
        Returns:
        the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates
        Throws:
        java.io.IOException - due to an error reading the input file
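The getChainRDD overloads above key each entry as PDBID.CHAINID and drop chains shorter than minLength. A minimal sketch of that keying and filtering on an ordinary Map follows. It is not the Spark code itself: the class name ChainKeyFilter, the double[][] coordinate arrays standing in for Atom[], the example ids, and the inclusive minLength comparison are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of BiojavaSparkUtils.
class ChainKeyFilter {

    /** Builds a key of the form PDBID.CHAINID, for example "1ABC.A". */
    static String chainKey(String pdbId, String chainId) {
        return pdbId + "." + chainId;
    }

    /**
     * Keeps only chains with at least minLength C-alpha positions.
     * Whether the library treats minLength as inclusive is an assumption.
     */
    static Map<String, double[][]> filterByMinLength(Map<String, double[][]> chains, int minLength) {
        Map<String, double[][]> kept = new LinkedHashMap<>();
        for (Map.Entry<String, double[][]> entry : chains.entrySet()) {
            if (entry.getValue().length >= minLength) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, double[][]> chains = new LinkedHashMap<>();
        chains.put(chainKey("1ABC", "A"), new double[30][3]); // 30 C-alpha positions
        chains.put(chainKey("1ABC", "B"), new double[5][3]);  // 5 C-alpha positions
        System.out.println(filterByMinLength(chains, 10).keySet());
    }
}
```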
      • getAtoms

        public static java.util.List<org.biojava.nbio.structure.Atom> getAtoms​(org.rcsb.mmtf.api.StructureDataInterface structure)
        Get all the atoms in the structure using a StructureDataInterface.
        Parameters:
        structure - the input StructureDataInterface
        Returns:
        the list of atoms
      • filterSequenceSimilar

        public static org.rcsb.mmtf.spark.data.SegmentDataRDD filterSequenceSimilar​(org.rcsb.mmtf.spark.data.SegmentDataRDD segmentDataRDD,
                                                                                    java.lang.String inputSequence,
                                                                                    double minSimilarity)
                                                                             throws org.biojava.nbio.core.exceptions.CompoundNotFoundException
        Filter the SegmentDataRDD based on minimum sequence similarity to a reference sequence.
        Parameters:
        segmentDataRDD - the input SegmentDataRDD to filter
        inputSequence - the reference sequence to compare
        minSimilarity - the minimum similarity (as a double between 0.00 and 1.00)
        Returns:
        the SegmentDataRDD after being filtered
        Throws:
        org.biojava.nbio.core.exceptions.CompoundNotFoundException - if Biojava cannot accurately convert the String sequence to a ProteinSequence
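To make the role of minSimilarity concrete, the sketch below filters a list of sequences against a reference using a threshold between 0.00 and 1.00. The scoring is a deliberately naive stand-in: it counts identical residues position by position without any alignment, whereas this page does not say how the library computes similarity. The class name NaiveSequenceFilter and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BiojavaSparkUtils.
class NaiveSequenceFilter {

    /**
     * Fraction of positions with identical characters, relative to the longer
     * sequence. No alignment is performed; this is a stand-in for a real score.
     */
    static double identityFraction(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0;
        }
        int shorter = Math.min(a.length(), b.length());
        int matches = 0;
        for (int i = 0; i < shorter; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return (double) matches / longer;
    }

    /** Keeps the sequences whose score against the reference is at least minSimilarity. */
    static List<String> filterSimilar(List<String> sequences, String reference, double minSimilarity) {
        if (minSimilarity < 0.0 || minSimilarity > 1.0) {
            throw new IllegalArgumentException("minSimilarity must be between 0.00 and 1.00");
        }
        List<String> kept = new ArrayList<>();
        for (String sequence : sequences) {
            if (identityFraction(sequence, reference) >= minSimilarity) {
                kept.add(sequence);
            }
        }
        return kept;
    }
}
```

For example, with the reference "ACDE" and a threshold of 0.7, "ACDF" scores 0.75 and is kept, while "WWWW" scores 0.0 and is dropped.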
      • getGroupAtomName

        public static java.lang.String getGroupAtomName​(org.biojava.nbio.structure.Atom atom)
        Get a conjoined group atom name from an atom.
        Parameters:
        atom - the input atom
        Returns:
        the String describing the conjoined group atom name.
      • getStructureRDDFromMmcif

        public static org.rcsb.mmtf.spark.data.StructureDataRDD getStructureRDDFromMmcif​(java.lang.String filePath)
        Function (for benchmarking) to get a StructureDataRDD from a Hadoop file of mmCIF data.
        Parameters:
        filePath - the path of the Hadoop sequence file
        Returns:
        the StructureDataRDD generated
      • convertToStructDataInt

        public static org.rcsb.mmtf.api.StructureDataInterface convertToStructDataInt​(org.biojava.nbio.structure.Structure structure)
        Get a StructureDataInterface from a Biojava Structure.
        Parameters:
        structure - the input structure to convert
        Returns:
        the StructureDataInterface of the Biojava Structure
      • getFromList

        public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,​org.biojava.nbio.structure.Structure> getFromList​(java.io.File[] pdbIdList)
        Generate a JavaPairRDD of String Structure from a list of PDB files.
        Parameters:
        pdbIdList - the input list of PDB files
        Returns:
        the JavaPairRDD of String Structure
      • getTypeFromChainId

        public static java.lang.String getTypeFromChainId​(org.rcsb.mmtf.api.StructureDataInterface structureDataInterface,
                                                          int chainInd)
        Get the type of a given chain index - SHOULD BE MOVED INTO ENCODER UTILS
        Parameters:
        structureDataInterface - the input StructureDataInterface
        chainInd - the index of the relevant chain
        Returns:
        the String describing the chain
      • writeToFile

        public static void writeToFile​(java.util.List<java.lang.String> pdbCodeList,
                                       java.lang.String uri,
                                       java.lang.String producer)
        Write a list of PDB ids to a Hadoop sequence file in MMTF format.
        Parameters:
        pdbCodeList - the input list of PDB ids
      • setUpBioJava

        public static org.biojava.nbio.structure.align.util.AtomCache setUpBioJava()
        Set up the configuration parameters for BioJava.
      • setUpBioJava

        public static org.biojava.nbio.structure.align.util.AtomCache setUpBioJava​(java.lang.String ccBaseUrl)
        Set up the configuration parameters for BioJava.
        Parameters:
        ccBaseUrl - base URL for chemcomp files (in sandbox layout .../H/HEM/HEM.cif) from which chem comp cif files will be read
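The sandbox layout mentioned for ccBaseUrl (.../H/HEM/HEM.cif) can be illustrated with a small path builder. This is a sketch inferred from that single example: it assumes the first directory is the first character of the component id, and the class name ChemCompSandboxUrl, the method sandboxUrl, and the base URL used in main are hypothetical.

```java
// Hypothetical helper, not part of BiojavaSparkUtils.
class ChemCompSandboxUrl {

    /**
     * Builds base/X/ID/ID.cif, where X is the first character of the id.
     * The rule is inferred from the example .../H/HEM/HEM.cif and is an assumption.
     */
    static String sandboxUrl(String ccBaseUrl, String ccId) {
        if (ccId == null || ccId.isEmpty()) {
            throw new IllegalArgumentException("component id must not be empty");
        }
        // Avoid a doubled slash when the base URL already ends with one.
        String base = ccBaseUrl.endsWith("/")
                ? ccBaseUrl.substring(0, ccBaseUrl.length() - 1)
                : ccBaseUrl;
        return base + "/" + ccId.charAt(0) + "/" + ccId + "/" + ccId + ".cif";
    }

    public static void main(String[] args) {
        // Prints http://example.org/chemcomp/H/HEM/HEM.cif
        System.out.println(sandboxUrl("http://example.org/chemcomp/", "HEM"));
    }
}
```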