Package org.biojava.spark.utils
Class BiojavaSparkUtils
- java.lang.Object
-
- org.biojava.spark.utils.BiojavaSparkUtils
-
public class BiojavaSparkUtils extends java.lang.Object
A class of Biojava-related Spark utility methods. These extend SparkUtils.
Author:
Anthony Bradley
-
-
Constructor Summary
Constructors
Constructor    Description
BiojavaSparkUtils()
-
Method Summary
Modifier and Type, Method
    Description

static org.rcsb.mmtf.api.StructureDataInterface convertToStructDataInt(org.biojava.nbio.structure.Structure structure)
    Get a StructureDataInterface from a Biojava Structure.
static org.rcsb.mmtf.spark.data.SegmentDataRDD filterSequenceSimilar(org.rcsb.mmtf.spark.data.SegmentDataRDD segmentDataRDD, java.lang.String inputSequence, double minSimilarity)
    Filter the SegmentDataRDD based on minimum sequence similarity to a reference sequence.
static AtomDataRDD findAtoms(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD)
    Find all the atoms in the RDD.
static AtomDataRDD findAtoms(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne)
    Find the given type of atoms for each structure in the PDB.
static AtomContactRDD findContacts(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, double cutoff)
    Find the contacts for each structure in the PDB.
static AtomContactRDD findContacts(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne, double cutoff)
    Find the contacts for each structure in the PDB.
static AtomContactRDD findContacts(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectTwo, double cutoff)
    Find the contacts for each structure in the PDB.
static org.biojava.nbio.structure.contact.AtomContactSet getAtomContacts(java.util.List<org.biojava.nbio.structure.Atom> atoms, double cutoff)
    Get all the atom contacts in a list of atoms.
static org.biojava.nbio.structure.contact.AtomContactSet getAtomContacts(java.util.List<org.biojava.nbio.structure.Atom> atomListOne, java.util.List<org.biojava.nbio.structure.Atom> atomListTwo, double cutoff)
    Get the contacts between two lists of atoms.
static org.biojava.nbio.structure.contact.AtomContactSet getAtomContactsSlow(java.util.List<org.biojava.nbio.structure.Atom> atomListOne, java.util.List<org.biojava.nbio.structure.Atom> atomListTwo, double cutoff)
    Get the contacts between two lists of atoms using iteration and not grids.
static java.util.List<org.biojava.nbio.structure.Atom> getAtoms(org.rcsb.mmtf.api.StructureDataInterface structure)
    Get all the atoms in the structure using a StructureDataInterface.
static java.util.List<org.biojava.nbio.structure.Atom> getAtoms(org.rcsb.mmtf.api.StructureDataInterface structure, org.rcsb.mmtf.spark.data.AtomSelectObject atomSelectObject)
    Get all the atoms of a given name or in a given group in the structure using a StructureDataInterface.
static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Structure> getBiojavaRdd(java.lang.String filePath)
    Get a JavaPairRDD of String and Structure from a file path.
static org.biojava.nbio.structure.Atom[] getCaAtoms(org.rcsb.mmtf.spark.data.Segment segment)
    Gets the C-alpha Atom for the given input Segment.
static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Atom[]> getChainRDD(java.lang.String filePath, int minLength, double sample)
    Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Atom[]> getChainRDD(java.util.List<java.lang.String> pdbIdList)
    Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Atom[]> getChainRDD(java.util.List<java.lang.String> pdbIdList, int minLength)
    Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Atom[]> getChainRDD(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, int minLength)
    Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Structure> getFromList(java.io.File[] pdbIdList)
    Generate a JavaPairRDD of String and Structure from a list of PDB files.
static java.lang.String getGroupAtomName(org.biojava.nbio.structure.Atom atom)
    Get a conjoined group atom name from an atom.
static org.rcsb.mmtf.spark.data.StructureDataRDD getStructureRDDFromMmcif(java.lang.String filePath)
    Function (for benchmarking) to get a StructureDataRDD from a Hadoop file of mmCIF data.
static java.lang.String getTypeFromChainId(org.rcsb.mmtf.api.StructureDataInterface structureDataInterface, int chainInd)
    Get the type of a given chain index - SHOULD BE MOVED INTO ENCODER UTILS
static org.biojava.nbio.structure.align.util.AtomCache setUpBioJava()
    Set up the configuration parameters for BioJava.
static org.biojava.nbio.structure.align.util.AtomCache setUpBioJava(java.lang.String ccBaseUrl)
    Set up the configuration parameters for BioJava.
static void writeToFile(java.util.List<java.lang.String> pdbCodeList, java.lang.String uri, java.lang.String producer)
    Write a list of PDB ids to a Hadoop sequence file in MMTF format.
-
-
-
Method Detail
-
getCaAtoms
public static org.biojava.nbio.structure.Atom[] getCaAtoms(org.rcsb.mmtf.spark.data.Segment segment)
Gets the C-alpha Atom for the given input Segment.
Parameters:
segment - the input Segment object
Returns:
the C-alpha array of Atom objects
-
findContacts
public static AtomContactRDD findContacts(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectTwo, double cutoff)
Find the contacts for each structure in the PDB.
Parameters:
selectObjectOne - the first type of atoms
selectObjectTwo - the second type of atoms
cutoff - the cutoff distance (max) in Angstrom
Returns:
the JavaPairRDD of AtomContact objects
-
findContacts
public static AtomContactRDD findContacts(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne, double cutoff)
Find the contacts for each structure in the PDB.
Parameters:
selectObjectOne - the type of atoms
cutoff - the cutoff distance (max) in Angstrom
Returns:
the JavaPairRDD of AtomContact objects
-
findContacts
public static AtomContactRDD findContacts(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, double cutoff)
Find the contacts for each structure in the PDB.
Parameters:
cutoff - the cutoff distance (max) in Angstrom
Returns:
the JavaPairRDD of AtomContact objects
-
findAtoms
public static AtomDataRDD findAtoms(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, org.rcsb.mmtf.spark.data.AtomSelectObject selectObjectOne)
Find the given type of atoms for each structure in the PDB.
Parameters:
selectObjectOne - the type of atom to find
Returns:
the JavaRDD of Atom objects
-
findAtoms
public static AtomDataRDD findAtoms(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD)
Find all the atoms in the RDD.
Returns:
the JavaRDD of Atom objects
-
getBiojavaRdd
public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Structure> getBiojavaRdd(java.lang.String filePath)
Get a JavaPairRDD of String and Structure from a file path.
Parameters:
filePath - the input path to the Hadoop sequence file
Returns:
the JavaPairRDD of String and Structure
-
getAtoms
public static java.util.List<org.biojava.nbio.structure.Atom> getAtoms(org.rcsb.mmtf.api.StructureDataInterface structure, org.rcsb.mmtf.spark.data.AtomSelectObject atomSelectObject)
Get all the atoms of a given name or in a given group in the structure using a StructureDataInterface.
Parameters:
structure - the input StructureDataInterface
Returns:
the list of atoms fitting the given criteria
-
getAtomContacts
public static org.biojava.nbio.structure.contact.AtomContactSet getAtomContacts(java.util.List<org.biojava.nbio.structure.Atom> atoms, double cutoff)
Get all the atom contacts in a list of atoms.
Parameters:
atoms - the list of Atoms
cutoff - the cutoff distance
Returns:
the AtomContactSet of the contacts
-
getAtomContacts
public static org.biojava.nbio.structure.contact.AtomContactSet getAtomContacts(java.util.List<org.biojava.nbio.structure.Atom> atomListOne, java.util.List<org.biojava.nbio.structure.Atom> atomListTwo, double cutoff)
Get the contacts between two lists of atoms.
Parameters:
atomListOne - the first list of Atoms
atomListTwo - the second list of Atoms
cutoff - the cutoff to define a contact
Returns:
the AtomContactSet of the contacts
-
getAtomContactsSlow
public static org.biojava.nbio.structure.contact.AtomContactSet getAtomContactsSlow(java.util.List<org.biojava.nbio.structure.Atom> atomListOne, java.util.List<org.biojava.nbio.structure.Atom> atomListTwo, double cutoff)
Get the contacts between two lists of atoms using iteration and not grids.
Parameters:
atomListOne - the first list of Atoms
atomListTwo - the second list of Atoms
cutoff - the cutoff to define a contact
Returns:
the AtomContactSet of the contacts
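The iterative (non-grid) search described for getAtomContactsSlow can be sketched in plain Java as a comparison of every atom in one list against every atom in the other. This is only an illustration under stated assumptions, not the library's implementation: the class ContactSketch and the method contactsSlow are hypothetical, atoms are represented as {x, y, z} arrays rather than BioJava Atom objects, and a distance exactly equal to the cutoff is assumed to count as a contact.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for illustration; not part of BiojavaSparkUtils or BioJava. */
final class ContactSketch {

    /**
     * Returns index pairs {i, j} where listOne[i] and listTwo[j] lie within
     * the cutoff (in Angstrom). Each atom is an {x, y, z} array.
     */
    static List<int[]> contactsSlow(double[][] listOne, double[][] listTwo, double cutoff) {
        List<int[]> contacts = new ArrayList<>();
        // Compare squared distances so no square root is needed per pair.
        double cutoffSquared = cutoff * cutoff;
        for (int i = 0; i < listOne.length; i++) {
            for (int j = 0; j < listTwo.length; j++) {
                double dx = listOne[i][0] - listTwo[j][0];
                double dy = listOne[i][1] - listTwo[j][1];
                double dz = listOne[i][2] - listTwo[j][2];
                // Assumption: a distance equal to the cutoff counts as a contact.
                if (dx * dx + dy * dy + dz * dz <= cutoffSquared) {
                    contacts.add(new int[] {i, j});
                }
            }
        }
        return contacts;
    }

    public static void main(String[] args) {
        double[][] one = {{0, 0, 0}, {10, 0, 0}};
        double[][] two = {{1, 0, 0}, {0, 3, 0}, {10, 0, 4}};
        // Print the index pairs found within 3.5 Angstrom.
        for (int[] pair : contactsSlow(one, two, 3.5)) {
            System.out.println(pair[0] + " - " + pair[1]);
        }
    }
}
```

The work grows with the product of the two list sizes, which is the usual reason to prefer a grid-based search for large inputs.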
-
getChainRDD
public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Atom[]> getChainRDD(java.util.List<java.lang.String> pdbIdList, int minLength) throws java.io.IOException
Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
Parameters:
pdbIdList - the input list of PDB ids
minLength - the minimum length of each chain
Returns:
the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates
Throws:
java.io.IOException - due to an error reading the input file
-
getChainRDD
public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Atom[]> getChainRDD(java.lang.String filePath, int minLength, double sample) throws java.io.IOException
Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
Parameters:
filePath - the Hadoop file to read from
minLength - the minimum length of each chain
sample - the sample of this file to take
Returns:
the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates
Throws:
java.io.IOException - due to an error reading the input file
-
getChainRDD
public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Atom[]> getChainRDD(java.util.List<java.lang.String> pdbIdList) throws java.io.IOException
Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
Parameters:
pdbIdList - the input list of PDB ids
Returns:
the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates
Throws:
java.io.IOException - due to an error reading the input file
-
getChainRDD
public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Atom[]> getChainRDD(org.rcsb.mmtf.spark.data.StructureDataRDD structureDataRDD, int minLength) throws java.io.IOException
Get the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates.
Parameters:
structureDataRDD - the input StructureDataRDD
minLength - the minimum length of each chain
Returns:
the JavaPairRDD of Key: PDBID.CHAINID and Value: Atom array of the C-alpha coordinates
Throws:
java.io.IOException - due to an error reading the input file
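The PDBID.CHAINID keying and the minLength filter shared by the getChainRDD overloads can be illustrated without Spark. The sketch below is a plain-Java approximation under assumptions: ChainKeySketch, chainKey and flattenChains are hypothetical names, C-alpha coordinates are held as {x, y, z} arrays instead of Atom objects, and a chain whose length equals minLength is assumed to be kept.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for illustration; not part of BiojavaSparkUtils. */
final class ChainKeySketch {

    /** Builds a key of the form PDBID.CHAINID. */
    static String chainKey(String pdbId, String chainId) {
        return pdbId + "." + chainId;
    }

    /**
     * Flattens a map of PDB id to (chain id to C-alpha coordinates) into one
     * map keyed by PDBID.CHAINID, dropping chains shorter than minLength.
     */
    static Map<String, double[][]> flattenChains(
            Map<String, Map<String, double[][]>> structures, int minLength) {
        Map<String, double[][]> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, double[][]>> structure : structures.entrySet()) {
            for (Map.Entry<String, double[][]> chain : structure.getValue().entrySet()) {
                double[][] caCoords = chain.getValue();
                // Assumption: chains with exactly minLength residues are kept.
                if (caCoords.length >= minLength) {
                    result.put(chainKey(structure.getKey(), chain.getKey()), caCoords);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, double[][]> chains = new LinkedHashMap<>();
        chains.put("A", new double[][] {{0, 0, 0}, {3.8, 0, 0}, {7.6, 0, 0}});
        chains.put("B", new double[][] {{0, 5, 0}});
        Map<String, Map<String, double[][]>> structures = new LinkedHashMap<>();
        structures.put("4HHB", chains);
        // Print the keys of the chains that pass a minimum length of 2.
        System.out.println(flattenChains(structures, 2).keySet());
    }
}
```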
-
getAtoms
public static java.util.List<org.biojava.nbio.structure.Atom> getAtoms(org.rcsb.mmtf.api.StructureDataInterface structure)
Get all the atoms in the structure using a StructureDataInterface.
Parameters:
structure - the input StructureDataInterface
Returns:
the list of atoms
-
filterSequenceSimilar
public static org.rcsb.mmtf.spark.data.SegmentDataRDD filterSequenceSimilar(org.rcsb.mmtf.spark.data.SegmentDataRDD segmentDataRDD, java.lang.String inputSequence, double minSimilarity) throws org.biojava.nbio.core.exceptions.CompoundNotFoundException
Filter the SegmentDataRDD based on minimum sequence similarity to a reference sequence.
Parameters:
inputSequence - the reference sequence to compare
minSimilarity - the minimum similarity (as a double between 0.00 and 1.00)
Returns:
the SegmentDataRDD after being filtered
Throws:
org.biojava.nbio.core.exceptions.CompoundNotFoundException - if Biojava cannot accurately convert the String sequence to a ProteinSequence
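The idea of keeping only entries whose similarity to a reference meets a threshold between 0.00 and 1.00 can be sketched as follows. The similarity measure here is an assumption made purely for illustration: a position-by-position identity fraction with no alignment, which is much simpler than whatever measure the library applies through BioJava. The names SimilarityFilterSketch, identityFraction and filterSimilar are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical stand-in for illustration; not the library's similarity measure. */
final class SimilarityFilterSketch {

    /**
     * Fraction of positions with identical characters, divided by the longer
     * length. No alignment is performed; this is a deliberately crude proxy.
     */
    static double identityFraction(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0; // Assumption: two empty sequences are treated as dissimilar.
        }
        int shorter = Math.min(a.length(), b.length());
        int matches = 0;
        for (int i = 0; i < shorter; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return (double) matches / longer;
    }

    /** Keeps the sequences whose identity fraction to the reference is at least minSimilarity. */
    static List<String> filterSimilar(List<String> sequences, String reference, double minSimilarity) {
        if (minSimilarity < 0.0 || minSimilarity > 1.0) {
            throw new IllegalArgumentException("minSimilarity must be between 0.0 and 1.0");
        }
        return sequences.stream()
                .filter(s -> identityFraction(s, reference) >= minSimilarity)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sequences = List.of("ACDEFGHIKL", "ACDEFGHIKA", "AAAAAAAAAA");
        // Print the sequences that pass a minimum similarity of 0.85.
        System.out.println(filterSimilar(sequences, "ACDEFGHIKL", 0.85));
    }
}
```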
-
getGroupAtomName
public static java.lang.String getGroupAtomName(org.biojava.nbio.structure.Atom atom)
Get a conjoined group atom name from an atom.
Parameters:
atom - the input atom
Returns:
the String describing the conjoined group atom name.
-
getStructureRDDFromMmcif
public static org.rcsb.mmtf.spark.data.StructureDataRDD getStructureRDDFromMmcif(java.lang.String filePath)
Function (for benchmarking) to get a StructureDataRDD from a Hadoop file of mmCIF data.
Parameters:
filePath - the path of the Hadoop sequence file
Returns:
the StructureDataRDD generated
-
convertToStructDataInt
public static org.rcsb.mmtf.api.StructureDataInterface convertToStructDataInt(org.biojava.nbio.structure.Structure structure)
Get a StructureDataInterface from a Biojava Structure.
Parameters:
structure - the input structure to convert
Returns:
the StructureDataInterface of the Biojava Structure
-
getFromList
public static org.apache.spark.api.java.JavaPairRDD<java.lang.String,org.biojava.nbio.structure.Structure> getFromList(java.io.File[] pdbIdList)
Generate a JavaPairRDD of String and Structure from a list of PDB files.
Parameters:
pdbIdList - the input list of PDB files
Returns:
the JavaPairRDD of String and Structure
-
getTypeFromChainId
public static java.lang.String getTypeFromChainId(org.rcsb.mmtf.api.StructureDataInterface structureDataInterface, int chainInd)
Get the type of a given chain index - SHOULD BE MOVED INTO ENCODER UTILS
Parameters:
structureDataInterface - the input StructureDataInterface
chainInd - the index of the relevant chain
Returns:
the String describing the chain
-
writeToFile
public static void writeToFile(java.util.List<java.lang.String> pdbCodeList, java.lang.String uri, java.lang.String producer)
Write a list of PDB ids to a Hadoop sequence file in MMTF format.
Parameters:
pdbCodeList - the input list of PDB ids
-
setUpBioJava
public static org.biojava.nbio.structure.align.util.AtomCache setUpBioJava()
Set up the configuration parameters for BioJava.
-
setUpBioJava
public static org.biojava.nbio.structure.align.util.AtomCache setUpBioJava(java.lang.String ccBaseUrl)
Set up the configuration parameters for BioJava.
Parameters:
ccBaseUrl - base URL for chemcomp files (in sandbox layout .../H/HEM/HEM.cif) from which chem comp cif files will be read
-
-