Esecuzione di operazioni su un grande set di dati

Devo eseguire alcune analisi su un record PSL che contiene informazioni sui frammenti di sequenza del DNA. Fondamentalmente devo trovare le voci che sono della stessa lettura nello stesso contig (questi sono entrambi i valori nella voce PSL). Il problema è che i record PSL sono grandi (documenti di testo 10-30 Mb). Ho scritto un programma che funziona su brevi record e sui lunghi record dato abbastanza tempo ma ha richiesto molto più tempo di quanto specificato. Mi è stato detto che il programma non dovrebbe richiedere più di ~ 15 secondi. Il mio ha richiesto oltre 15 minuti.Esecuzione di operazioni su un grande set di dati

record PSL simile a questa:

275 11 0 0 0 0 0 0 - M02034:35:000000000-A7UU0:1:1101:19443:1992/2 286 0 286 NODE_406138_length_13407_cov_13.425076 13465 408 694 1 286, 0, 408, 

171 5 0 0 0 0 0 0 + M02034:35:000000000-A7UU0:1:1101:13497:2001/2 294 0 176 NODE_500869_length_34598_cov_30.643419 34656 34334 34510 1 176, 0, 34334, 

188 14 0 10 0 0 0 0 + M02034:35:000000000-A7UU0:1:1101:18225:2002/1 257 45 257 NODE_455027_length_12018_cov_13.759444 12076 11322 11534 1 212, 45, 11322,

Il mio codice è simile al seguente:

import sys 
class PSLreader : 
    ''' 
    Class to provide reading of a file containing psl alignments 
    formatted sequences: 
    object instantiation: 
    myPSLreader = PSLreader(<file name>): 

    object attributes: 
    fname: the initial file name 

    methods: 
    readPSL() : reads psl file, yielding those alignments that are within the first or last 
       1000 nt 

    readPSLpairs() : yields psl pairs that support a circular hypothesis 

    Author: David Bernick 
    Date: May 12, 2013 
    ''' 

    def __init__ (self, fname=''): 
     '''contructor: saves attribute fname ''' 

     self.fname = fname 

    def doOpen (self): 
     if self.fname is '': 
      return sys.stdin 
     else: 
      return open(self.fname) 

    def readPSL (self): 
     ''' 
     using filename given in init, returns each filtered psl records 
     that contain alignments that are within the terminal 1000nt of 
     the target. Incomplete psl records are discarded. 
     If filename was not provided, stdin is used. 

     This method selects for alignments that could may be part of a 
     circle. 

     Illumina pairs aligned to the top strand would have read1(+) and read2(-). 
     For the bottoms trand, read1(-) and read2(+). 

     For potential circularity, 
     these are the conditions that can support circularity: 
     read1(+) near the 3' terminus 
     read1(-) near the 5' terminus 
     read2(-) near the 5' terminus 
     read2(+) near the 3' terminus 

     so... 
     any read(+) near the 3', or 
     any read(-) near the 5' 

     ''' 

     nearEnd = 1000 # this constant determines "near the end" 
     with self.doOpen() as fileH: 

      for line in fileH: 
       pslList = line.split() 
       if len(pslList) < 17: 
        continue 
       tSize = int(pslList[14]) 
       tStart = int(pslList[15]) 
       strand = str(pslList[8]) 

       if strand.startswith('+') and (tSize - tStart > nearEnd): 
        continue 
       elif strand.startswith('-') and (tStart > nearEnd): 
        continue 

       yield line 

    def readPSLpairs (self): 
     read1 = [] 
     read2 = [] 

     for psl in self.readPSL(): 
      parsed_psl = psl.split() 
      strand = parsed_psl[9][-1] 
      if strand == '1': 
       read1.append(parsed_psl) 
      elif strand == '2': 
       read2.append(parsed_psl) 

     output = {} 
     for psl1 in read1: 
      name1 = psl1[9][:-1] 
      contig1 = psl1[13] 
      for psl2 in read2: 
       name2 = psl2[9][:-1] 
       contig2 = psl2[13] 
       if name1 == name2 and contig1 == contig2: 
        try: 
         output[contig1] += 1 
         break 
        except: 
         output[contig1] = 1 
         break 

     print(output) 


PSL_obj = PSLreader('EEV14-Vf.filtered.psl') 
PSL_obj.readPSLpairs()

mi è stato dato qualche esempio di codice che assomiglia a questo:

def doSomethingPairwise (a): 
    for leftItem in a[1]: 
     for rightItem in a[2]: 
      if leftItem[1] is rightItem[1]: 
       print (a) 
thisStream = [['David', 'guitar', 1], ['David', 'guitar', 2], 
['John', 'violin', 1], ['John', 'oboe', 2], 
['Patrick', 'theremin', 1], ['Patrick', 'lute',2] ] 
thisGroup = None 
thisGroupList = [ [], [], [] ] 

for name, instrument, num in thisStream: 
    if name != thisGroup: 

     doSomethingPairwise(thisGroupList) 

     thisGroup = name 
     thisGroupList = [ [], [], [] ] 

    thisGroupList[num].append([name, instrument, num]) 
doSomethingPairwise(thisGroupList)

Ma quando Ho provato a implementarlo, il mio programma ha richiesto molto tempo. Sto pensando a questo nel modo sbagliato? Mi rendo conto che il ciclo annidato è lento ma non vedo un'alternativa.

Modifica: l'ho capito, i dati sono stati preordinati, il che ha reso la mia soluzione di forza bruta molto poco pratica e inutile.

fonte

2015-05-26 Sam

Spero aiuto, dal momento che, la questione ha bisogno di un file di esempio migliore di ingresso

#is better create PSLRecord class 
class PSLRecord: 
    def __init__(self, line): 
    pslList = line.split() 
    properties = ("matches", "misMatches", "repMatches", "nCount", 
       "qNumInsert", "qBaseInsert", "tNumInsert", 
       "tBaseInsert", "strand", "qName", "qSize", "qStart", 
       "qEnd", "tName", "tSize", "tStart", "tEnd", "blockCount", 
       "blockSizes", "qStarts", "tStarts") 
    self.__dict__.update(dict(zip(properties, pslList))) 

class PSLreader : 
    def __init__ (self, fname=''): 
    self.fname = fname 

    def doOpen (self): 
    if self.fname is '': 
     return sys.stdin 
    else: 
     return open(self.fname) 

    def readPSL (self): 
    with self.doOpen() as fileH: 
     for line in fileH: 
     pslrc = PSLRecord(line) 
     yield pslrc 

    #return a dictionary with all psl records group by qName and tName 
    def readPSLpairs (self): 
    dictpsl = {} 
    for pslrc in self.readPSL(): 
     #OP requirement, remove '1' or '2' char, in pslrc.qName[:-1] 
     key = (pslrc.qName[:-1], pslrc.tName) 
     if not key in dictpsl: 
     dictpsl[key] = [] 
     dictpsl[key].append(pslrc) 
    return dictpsl 

#Function filter .... is better out and self-contained 
def f_filter(pslrec, nearEnd = 1000): 
    if (pslrec.strand.startswith('+') and 
    (int(pslrec.tSize) - int(pslrec.tStart) > nearEnd)): 
    return False 
    if (pslrec.strand.startswith('-') and 
    (int(pslrec.tStart) > nearEnd)): 
    return False 
    return True 

PSL_obj = PSLreader('EEV14-Vf.filtered.psl') 

#read dictionary of pairs 
dictpsl = PSL_obj.readPSLpairs() 

from itertools import product 
#product from itertools 
#(1) x (2,3) = (1,2),(1,3) 

output = {} 
for key, v in dictpsl.items(): 
    name, contig = key 
    #i get filters aligns in principal strand 
    strand_princ = [pslrec for pslrec in v if f_filter(pslrec) and 
       pslrec.qName[-1] == '1'] 
    #i get filters aligns in secondary strand 
    strand_sec = [pslrec for pslrec in v if f_filter(pslrec) and 
       pslrec.qName[-1] == '2'] 
    for pslrec_princ, pslrec_sec in product(strand_princ, strand_sec): 
    #This For has fewer comparisons, since I was grouped before 
    if not contig in output: 
     output[contig] = 1 
    output[contig] += 1

Nota: 10-30 Mb non è file di grandi dimensioni, se mi chiedete

fonte

2015-05-26 20:58:08

Esecuzione di operazioni su un grande set di dati

risposta

Problemi correlati