
How to find all cliques in an edge list

  •  1
  • Gloom  · Tech Community  · 7 years ago

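    I have a huge edge list file (about 80 GB) of homologous genes from an old OrthoMCL run. I want to parse out all cliques in the edge list (subgraphs in which every vertex shares an edge with every other vertex) and then collapse each clique into a single line, while ignoring redundant pairs (e.g. GeneA,GeneB <-> GeneB,GeneA) and self-hits (GeneA <-> GeneA). I am trying out Python's NetworkX (find_cliques), but I am an inexperienced programmer, so I have not managed to get the output I want. If anyone has experience with network structures, could you point me in the right direction?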

    Here is an example input:

    GeneA,GeneA
    GeneA,GeneB
    GeneA,GeneC
    GeneB,GeneA
    GeneB,GeneB
    GeneB,GeneC
    GeneC,GeneA
    GeneC,GeneB
    GeneC,GeneC
    GeneD,GeneD
    GeneD,GeneE
    GeneD,GeneF
    GeneE,GeneD
    GeneE,GeneE
    GeneE,GeneF
    GeneF,GeneD
    GeneF,GeneE
    GeneF,GeneF
    GeneH,GeneH
    GeneH,GeneI
    GeneH,GeneJ
    GeneH,GeneK
    GeneH,GeneL
    GeneH,GeneM
    GeneH,GeneN
    GeneH,GeneO
    GeneH,GeneP
    GeneH,GeneQ
    GeneI,GeneH
    GeneI,GeneI
    GeneI,GeneJ
    GeneI,GeneK
    GeneI,GeneL
    GeneI,GeneM
    GeneI,GeneN
    GeneI,GeneO
    GeneI,GeneP
    GeneI,GeneQ
    GeneJ,GeneH
    GeneJ,GeneI
    GeneJ,GeneJ
    GeneJ,GeneK
    GeneJ,GeneL
    GeneJ,GeneM
    GeneJ,GeneN
    GeneJ,GeneO
    GeneJ,GeneP
    GeneJ,GeneQ
    GeneK,GeneH
    GeneK,GeneI
    GeneK,GeneJ
    GeneK,GeneK
    GeneK,GeneL
    GeneK,GeneM
    GeneK,GeneN
    GeneK,GeneO
    GeneK,GeneP
    GeneK,GeneQ
    GeneL,GeneH
    GeneL,GeneI
    GeneL,GeneJ
    GeneL,GeneK
    GeneL,GeneL
    GeneL,GeneM
    GeneL,GeneN
    GeneL,GeneO
    GeneL,GeneP
    GeneL,GeneQ
    GeneM,GeneH
    GeneM,GeneI
    GeneM,GeneJ
    GeneM,GeneK
    GeneM,GeneL
    GeneM,GeneM
    GeneM,GeneN
    GeneM,GeneO
    GeneM,GeneP
    GeneM,GeneQ
    GeneN,GeneH
    GeneN,GeneI
    GeneN,GeneJ
    GeneN,GeneK
    GeneN,GeneL
    GeneN,GeneM
    GeneN,GeneN
    GeneN,GeneO
    GeneN,GeneP
    GeneN,GeneQ
    GeneO,GeneH
    GeneO,GeneI
    GeneO,GeneJ
    GeneO,GeneK
    GeneO,GeneL
    GeneO,GeneM
    GeneO,GeneN
    GeneO,GeneO
    GeneO,GeneP
    GeneO,GeneQ
    GeneP,GeneH
    GeneP,GeneI
    GeneP,GeneJ
    GeneP,GeneK
    GeneP,GeneL
    GeneP,GeneM
    GeneP,GeneN
    GeneP,GeneO
    GeneP,GeneP
    GeneP,GeneQ
    GeneQ,GeneH
    GeneQ,GeneI
    GeneQ,GeneJ
    GeneQ,GeneK
    GeneQ,GeneL
    GeneQ,GeneM
    GeneQ,GeneN
    GeneQ,GeneO
    GeneQ,GeneP
    GeneQ,GeneQ
    GeneR,GeneR
    GeneR,GeneS
    GeneR,GeneT
    GeneR,GeneU
    GeneS,GeneR
    GeneS,GeneS
    GeneS,GeneT
    GeneS,GeneU
    GeneT,GeneR
    GeneT,GeneS
    GeneT,GeneT
    GeneT,GeneU
    GeneU,GeneR
    GeneU,GeneS
    GeneU,GeneT
    GeneU,GeneU
    GeneV,GeneW
    GeneW,GeneV
    GeneX,GeneX
    GeneX,GeneY
    GeneX,GeneZ
    GeneY,GeneX
    GeneY,GeneY
    GeneY,GeneZ
    GeneZ,GeneX
    GeneZ,GeneY
    GeneZ,GeneZ
    

    Here is the desired output:

    GeneA,GeneB,GeneC
    GeneD,GeneE,GeneF
    GeneH,GeneI,GeneJ,GeneK,GeneL,GeneM,GeneN,GeneO,GeneP,GeneQ
    GeneR,GeneS,GeneT,GeneU
    GeneV,GeneW
    GeneX,GeneY,GeneZ
    

    Thanks in advance!

    1 answer  |  as of 7 years ago
  •  4
  •   Gambit1614    7 years ago

    You can simply try the find_cliques function:

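    import networkx as nx

    # Read the comma-separated edge list into an undirected graph;
    # reversed pairs (GeneA,GeneB and GeneB,GeneA) become a single edge.
    G = nx.read_edgelist("edgelist.txt", delimiter=",")

    # find_cliques yields each maximal clique as a list of nodes.
    for clq in nx.find_cliques(G):
        print(clq)
    

    Output (the order of the cliques, and of the genes within a clique, may vary):

    ['GeneX', 'GeneY', 'GeneZ']
    ['GeneP', 'GeneQ', 'GeneH', 'GeneI', 'GeneJ', 'GeneK', 'GeneL', 'GeneM', 'GeneN', 'GeneO']
    ['GeneR', 'GeneS', 'GeneT', 'GeneU']
    ['GeneV', 'GeneW']
    ['GeneA', 'GeneB', 'GeneC']
    ['GeneD', 'GeneE', 'GeneF']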
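
    The lists above are not yet in the one-line-per-clique format the question asks for. A minimal sketch of that last step follows; format_cliques is a hypothetical helper name (it is not part of networkx), and sorting the genes and the lines is an assumption made only so the result is reproducible.

```python
def format_cliques(cliques):
    """Turn an iterable of cliques (lists of gene names) into sorted CSV lines."""
    lines = []
    for clique in cliques:
        # Sort members so each clique has one canonical spelling.
        lines.append(",".join(sorted(clique)))
    # Sort the lines too, so the report order is reproducible.
    return sorted(lines)


# Three cliques copied from the output above, in the order they were printed.
example = [
    ["GeneX", "GeneY", "GeneZ"],
    ["GeneP", "GeneQ", "GeneH", "GeneI", "GeneJ",
     "GeneK", "GeneL", "GeneM", "GeneN", "GeneO"],
    ["GeneV", "GeneW"],
]
for line in format_cliques(example):
    print(line)
```

    With the graph G from the answer, the same helper could be fed directly: format_cliques(nx.find_cliques(G)).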
    

    There are several other functions in networkx for manipulating cliques, if you want to take a look.
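
    On the self-hits and reversed pairs mentioned in the question: one option is to filter them out while reading the file, before the graph is built. The sketch below is only an illustration under assumptions; iter_edges is a hypothetical helper name, and it assumes one comma-separated pair per line, as in the example input.

```python
def iter_edges(lines, delimiter=","):
    """Yield (gene_a, gene_b) pairs, skipping blank, malformed and self-hit lines."""
    for line in lines:
        parts = [p.strip() for p in line.split(delimiter)]
        if len(parts) != 2 or not parts[0] or not parts[1]:
            continue  # blank or malformed line
        a, b = parts
        if a == b:
            continue  # self-hit such as GeneA,GeneA
        # Canonical order: GeneB,GeneA and GeneA,GeneB yield the same tuple.
        yield (a, b) if a < b else (b, a)


sample = ["GeneA,GeneA", "GeneA,GeneB", "GeneB,GeneA", "GeneV,GeneW", ""]
print(list(iter_edges(sample)))
```

    The generator still yields a repeated pair twice; an undirected nx.Graph stores it as a single edge, so something like G = nx.Graph() followed by G.add_edges_from(iter_edges(f)) on an open file f should be enough. The whole graph still has to fit in memory, which may be the real constraint for an 80 GB file.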