Creating a metabolic pathway in Neo4j

  •  5
  • Victoria Stuart  ·  8 years ago

    I am trying to create the glycolysis pathway shown in the image at the bottom of this question in Neo4j, using the following data:

    glycolysis_bioentities.csv

    name
    α-D-glucose
    glucose 6-phosphate
    fructose 6-phosphate
    "fructose 1,6-bisphosphate"
    dihydroxyacetone phosphate
    D-glyceraldehyde 3-phosphate
    "1,3-bisphosphoglycerate"
    3-phosphoglycerate
    2-phosphoglycerate
    phosphoenolpyruvate
    pyruvate
    hexokinase
    glucose-6-phosphatase
    phosphoglucose isomerase
    phosphofructokinase
    "fructose-bisphosphate aldolase, class I"
    triosephosphate isomerase (TIM)
    glyceraldehyde-3-phosphate dehydrogenase
    phosphoglycerate kinase
    phosphoglycerate mutase
    enolase
    pyruvate kinase
    

    glycolysis_relations.csv

    source,relation,target
    α-D-glucose,substrate_of,hexokinase
    hexokinase,yields,glucose 6-phosphate
    glucose 6-phosphate,substrate_of,glucose-6-phosphatase
    glucose-6-phosphatase,yields,α-D-glucose
    glucose 6-phosphate,substrate_of,phosphoglucose isomerase
    phosphoglucose isomerase,yields,fructose 6-phosphate
    fructose 6-phosphate,substrate_of,phosphofructokinase
    phosphofructokinase,yields,"fructose 1,6-bisphosphate"
    "fructose 1,6-bisphosphate",substrate_of,"fructose-bisphosphate aldolase, class I"
    "fructose-bisphosphate aldolase, class I",yields,D-glyceraldehyde 3-phosphate
    D-glyceraldehyde 3-phosphate,substrate_of,glyceraldehyde-3-phosphate dehydrogenase
    D-glyceraldehyde 3-phosphate,substrate_of,triosephosphate isomerase (TIM)
    triosephosphate isomerase (TIM),yields,dihydroxyacetone phosphate
    glyceraldehyde-3-phosphate dehydrogenase,yields,"1,3-bisphosphoglycerate"
    "1,3-bisphosphoglycerate",substrate_of,phosphoglycerate kinase
    phosphoglycerate kinase,yields,3-phosphoglycerate
    3-phosphoglycerate,substrate_of,phosphoglycerate mutase
    phosphoglycerate mutase,yields,2-phosphoglycerate
    2-phosphoglycerate,substrate_of,enolase
    enolase,yields,phosphoenolpyruvate
    phosphoenolpyruvate,substrate_of,pyruvate kinase
    pyruvate kinase,yields,pyruvate
    

    This is what I have so far,

    [image: the graph produced so far]

    ... using this Cypher (passed to Cycli / cypher-shell):

    LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
    MERGE (s:Glycolysis {source: row.source})
    MERGE (r:Glycolysis {relation: row.relation})
    MERGE (t:Glycolysis {target: row.target})
    FOREACH (x in case row.relation when "substrate_of" then [1] else [] end |
      MERGE (s)-[r:substrate_of]->(t)
    )
    FOREACH (x in case row.relation when "yields" then [1] else [] end |
      MERGE (s)-[r:yields]->(t)
      );
    

    I would like to create a fully connected pathway, with captions on all nodes. Suggestions?

    [image: the target glycolysis pathway]

    3 answers
        1
  •  3
  •   cybersam    8 years ago

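    [UPDATED]

    There are multiple issues and possible improvements:

    1. The second MERGE should be removed, because it creates orphaned nodes. It makes no sense to turn the relationship type into a Glycolysis node; such a node will never be connected to any other node.
    2. The first and third MERGE clauses must use the same property name (say, name) for the source and target nodes, or else the same chemical can end up with two nodes (with different property keys). This is why you ended up with nodes that lack the expected connections.
    3. The APOC procedure apoc.cypher.doIt can be used to somewhat simplify the MERGE of relationships with dynamic names.
    4. The glycolysis_bioentities.csv file is not needed for this use case.

    With the above changes, you end up with something like the following, which produces a connected graph matching the input data: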

    LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
    MERGE (s:Glycolysis {name: row.source})
    MERGE (t:Glycolysis {name: row.target})
    WITH s, t, row
    CALL apoc.cypher.doIt(
      'MERGE (s)-[r:' + row.relation + ']->(t)',
      {s:s, t:t}) YIELD value
    RETURN 1;
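
    To see why points 1 and 2 matter, here is a small Python sketch of how MERGE treats nodes: it reuses a node only if the label and every listed property match, and otherwise creates one. This is a toy model, not Neo4j's implementation; merge_node and load are hypothetical helper names, and only three rows of the CSV are used.

```python
# Toy model of Cypher's node MERGE: match on label + all given properties,
# otherwise create. Illustration only; names here are hypothetical.

def merge_node(nodes, label, **props):
    """Return the first node carrying `label` and all of `props`, or create one."""
    for node in nodes:
        if label in node["labels"] and all(
            node["props"].get(key) == value for key, value in props.items()
        ):
            return node
    node = {"labels": {label}, "props": dict(props)}
    nodes.append(node)
    return node

# First three rows of glycolysis_relations.csv (minus the glucose-6-phosphatase branch).
ROWS = [
    ("α-D-glucose", "substrate_of", "hexokinase"),
    ("hexokinase", "yields", "glucose 6-phosphate"),
    ("glucose 6-phosphate", "substrate_of", "phosphoglucose isomerase"),
]

def load(rows, source_key, target_key, merge_relation_node):
    """Replay the MERGE clauses of a LOAD CSV query row by row."""
    nodes = []
    for source, relation, target in rows:
        merge_node(nodes, "Glycolysis", **{source_key: source})
        if merge_relation_node:
            # The question's second MERGE: a node that nothing ever connects to.
            merge_node(nodes, "Glycolysis", relation=relation)
        merge_node(nodes, "Glycolysis", **{target_key: target})
    return nodes

original = load(ROWS, "source", "target", merge_relation_node=True)
fixed = load(ROWS, "name", "name", merge_relation_node=False)

print(len(original), len(fixed))  # prints: 8 4
```

    With the question's property keys, hexokinase exists once as {target: ...} and once as {source: ...}, so the chain is broken at every enzyme; with a shared name key there is one node per entity.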
    
        2
  •  3
  •   Victoria Stuart    8 years ago

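    @cybersam's answer is excellent and provides the most elegant solution (thanks again!). Please upvote that accepted answer.

    Since others may be interested in this question/answer/topic, I want to mention that my code (based on this SO thread, How to specify relationship type in CSV?, and modified per the hints provided by @cybersam) now works, and to show the result:

    Solution 1 (my original post, updated):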

    LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
    MERGE (s:Glycolysis {name:row.source})
    MERGE (t:Glycolysis {name:row.target})
    FOREACH (x in case row.relation when "substrate_of" then [1] else [] end |
      MERGE (s)-[r:substrate_of]->(t)
    )
    FOREACH (x in case row.relation when "yields" then [1] else [] end |
      MERGE (s)-[r:yields]->(t)
      );
    

    Solution 2 (cybersam, updated):

    LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
    MERGE (s:Metabolism:Glycolysis {name: row.source})
    MERGE (t:Metabolism:Glycolysis {name: row.target})
    WITH s, t, row
      // "Bug" -- additional duplicate relations with each iteration of this statement/script:
      // CALL apoc.create.relationship(s, row.relation, {}, t) YIELD rel
      // Solution: 
      // https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/271
      // https://stackoverflow.com/questions/47808421/neo4j-load-csv-to-create-dynamic-relationship-types
      CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel
    RETURN COUNT(*);
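
    The difference noted in the comments above (duplicate relationships on every rerun with apoc.create.relationship, none with apoc.merge.relationship) can be pictured with a short Python sketch. It is only a model under simplifying assumptions: relationships are plain (start, type, end) tuples, properties are ignored, and the function names are hypothetical stand-ins for the two procedures.

```python
# Toy model: "create" always appends, "merge" appends only if the same
# (start, type, end) relationship is not already present.

ROWS = [
    ("α-D-glucose", "substrate_of", "hexokinase"),
    ("hexokinase", "yields", "glucose 6-phosphate"),
]

def create_relationship(rels, start, rel_type, end):
    """Stand-in for create semantics: unconditionally add a relationship."""
    rels.append((start, rel_type, end))

def merge_relationship(rels, start, rel_type, end):
    """Stand-in for merge semantics: add the relationship only if it is missing."""
    if (start, rel_type, end) not in rels:
        rels.append((start, rel_type, end))

def run_script(rels, add):
    """Replay the CSV rows once, as one execution of the script would."""
    for start, rel_type, end in ROWS:
        add(rels, start, rel_type, end)

created, merged = [], []
for _ in range(3):  # run the same script three times
    run_script(created, create_relationship)
    run_script(merged, merge_relationship)

print(len(created), len(merged))  # prints: 6 2
```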
    

    Both solutions produce the same graph, shown below :-D

    neo4j_glycolytc_pathway

        3
  •  0
  •   Victoria Stuart    8 years ago

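    If I may, I would like to post one more follow-up answer. My reason is that there is currently very little work on reconstructing metabolic pathways in Neo4j, and the following rounds out this StackOverflow title/topic, "Creating a metabolic pathway in Neo4j".

    As with my glycolysis pathway above, I recreated the TCA (citric acid cycle | Krebs cycle) pathway in Neo4j: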

    TCA cycle

    [TCA cycle image source: https://metabolicpathways.stanford.edu/]

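    One issue that arose while creating the TCA pathway graph is that one of the nodes (the enzyme "aconitase") is used twice, so during graph creation MERGE merged the common node aconitase into a single entity, resulting in this layout,

    aconitase - 'incorrect' layout

    ... rather than this,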

    aconitase - 'correct' layout

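    My solution to this issue was to create the TCA graph using a node property that temporarily tags the affected source and target nodes so that they stay distinct (the tags are removed later, once the graph has been created correctly).

    I also added a :Metabolism label so that I can select the individual pathways (:Glycolysis | :TCA) or the complete metabolic network (:Metabolism), as needed.

    Finally, I needed to connect the two pathways (:Glycolysis | :TCA) through their common node, pyruvate, which I was able to do via an APOC procedure (here, appended to my glycolysis.cql (Cypher) script).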

    Here are my CSV data files, *.cql Cypher scripts, script execution, and the resulting graph.

    glycolysis.csv:

    source,relation,target
    α-D-glucose,substrate_of,hexokinase
    hexokinase,yields,glucose 6-phosphate
    glucose 6-phosphate,substrate_of,glucose-6-phosphatase
    glucose-6-phosphatase,yields,α-D-glucose
    glucose 6-phosphate,substrate_of,phosphoglucose isomerase
    phosphoglucose isomerase,yields,fructose 6-phosphate
    fructose 6-phosphate,substrate_of,phosphofructokinase
    phosphofructokinase,yields,"fructose 1,6-bisphosphate"
    "fructose 1,6-bisphosphate",substrate_of,"fructose-bisphosphate aldolase, class I"
    "fructose-bisphosphate aldolase, class I",yields,D-glyceraldehyde 3-phosphate
    D-glyceraldehyde 3-phosphate,substrate_of,glyceraldehyde-3-phosphate dehydrogenase
    D-glyceraldehyde 3-phosphate,substrate_of,triosephosphate isomerase (TIM)
    triosephosphate isomerase (TIM),yields,dihydroxyacetone phosphate
    glyceraldehyde-3-phosphate dehydrogenase,yields,"1,3-bisphosphoglycerate"
    "1,3-bisphosphoglycerate",substrate_of,phosphoglycerate kinase
    phosphoglycerate kinase,yields,3-phosphoglycerate
    3-phosphoglycerate,substrate_of,phosphoglycerate mutase
    phosphoglycerate mutase,yields,2-phosphoglycerate
    2-phosphoglycerate,substrate_of,enolase
    enolase,yields,phosphoenolpyruvate
    phosphoenolpyruvate,substrate_of,pyruvate kinase
    pyruvate kinase,yields,pyruvate
    

    tca.csv:

    source,relation,target,tag1,tag2
    pyruvate,substrate_of,pyruvate dehydrogenase,,
    pyruvate dehydrogenase,yields,acetyl CoA,,
    acetyl CoA,substrate_of,citrate synthase,,
    oxaloacetate,substrate_of,citrate synthase,,
    citrate synthase,yields,citrate,,
    citrate,substrate_of,aconitase,,1
    aconitase,yields,cis-aconitate,1,
    cis-aconitate,substrate_of,aconitase,,2
    aconitase,yields,isocitrate,2,
    isocitrate,substrate_of,isocitrate dehydrogenase,,
    isocitrate dehydrogenase,yields,α-ketoglutarate,,
    α-ketoglutarate,substrate_of,α-ketoglutarate dehydrogenase,,
    α-ketoglutarate dehydrogenase,yields,succinyl-CoA,,
    succinyl-CoA,substrate_of,succinyl-CoA synthetase,,
    succinyl-CoA synthetase,yields,succinate,,
    succinate,substrate_of,succinate dehydrogenase,,
    succinate dehydrogenase,yields,fumarate,,
    fumarate,substrate_of,fumarase,,
    fumarase,yields,S-malate,,
    S-malate,substrate_of,malate dehydrogenase,,
    malate dehydrogenase,yields,oxaloacetate,,
    

    The "tag1" and "tag2" columns in "tca.csv" are used to make the affected source and target nodes unique while they are created by the "tca.cql" script:

    tca.cql:

    // CREATE INDICES:
    CREATE INDEX ON :Metabolism(name);
    CREATE INDEX ON :TCA(name);
    
    // CREATE GRAPH:
    // USING PERIODIC COMMIT 5000
    LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/tca.csv" AS row
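    // MERGE matches on name AND tag, so the two uses of aconitase (tags 1 and 2) stay separate.
    // COALESCE maps a missing (null) tag to '' so untagged rows still merge by name.
    MERGE (s:Metabolism:TCA {name: row.source, tag:COALESCE(row.tag1, '')})
    MERGE (t:Metabolism:TCA {name: row.target, tag:COALESCE(row.tag2, '')})
    WITH s, t, row
      CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel
      // The tag is only needed while merging; drop it from the finished nodes.
      REMOVE s.tag, t.tag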
    RETURN COUNT(*);
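
    For readers who want to check the effect of the tags without a database, here is a rough Python sketch of the idea, using only the four aconitase rows of tca.csv. It is a simplified model: nodes are keyed by (name, tag), and the tags are stripped after all rows have been processed; build and count_named are hypothetical helper names.

```python
# Toy model of the tagging trick: merge nodes on (name, tag) rather than
# on name alone, then drop the tag once the graph has been built.

TCA_ROWS = [
    # source, relation, target, tag1, tag2
    ("citrate", "substrate_of", "aconitase", "", "1"),
    ("aconitase", "yields", "cis-aconitate", "1", ""),
    ("cis-aconitate", "substrate_of", "aconitase", "", "2"),
    ("aconitase", "yields", "isocitrate", "2", ""),
]

def build(rows, use_tags):
    """Return (nodes, edges); nodes are merged on (name, tag)."""
    nodes = {}  # merge key -> node properties
    edges = []  # (source key, relation, target key)
    for source, relation, target, tag1, tag2 in rows:
        s_key = (source, tag1 if use_tags else "")
        t_key = (target, tag2 if use_tags else "")
        nodes.setdefault(s_key, {"name": source, "tag": s_key[1]})
        nodes.setdefault(t_key, {"name": target, "tag": t_key[1]})
        edges.append((s_key, relation, t_key))
    for node in nodes.values():
        del node["tag"]  # counterpart of REMOVE s.tag, t.tag
    return nodes, edges

def count_named(nodes, name):
    """Number of distinct nodes whose name property equals `name`."""
    return sum(1 for node in nodes.values() if node["name"] == name)

untagged_nodes, _ = build(TCA_ROWS, use_tags=False)
tagged_nodes, tagged_edges = build(TCA_ROWS, use_tags=True)

print(count_named(untagged_nodes, "aconitase"))  # prints: 1
print(count_named(tagged_nodes, "aconitase"))    # prints: 2
```

    In the tagged run each aconitase node has exactly one incoming and one outgoing relationship, which corresponds to the 'correct' layout shown earlier.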
    

    glycolysis.cql:

    // CREATE INDICES:
    CREATE INDEX ON :Metabolism(name);
    CREATE INDEX ON :Glycolysis(name);
    
    // CREATE GRAPH:
    //USING PERIODIC COMMIT 5000
    LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/glycolysis.csv" AS row
    MERGE (s:Metabolism:Glycolysis {name: row.source})
    MERGE (t:Metabolism:Glycolysis {name: row.target})
    WITH s, t, row
      CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel
    RETURN COUNT(*);
    
    // MERGE COMMON NODE (GLYCOLYSIS: PYRUVATE; TCA: PYRUVATE):
    // As presented, run "tca.cql" first, then "glycolysis.cql"
    
    MATCH (g:Glycolysis), (t:TCA) WHERE g.name = t.name
    CALL apoc.refactor.mergeNodes([g,t]) YIELD node
      RETURN node;
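
    As a rough picture of what the last statement does to the two pyruvate nodes, the following Python sketch merges two nodes that share a name: the surviving node gains the other node's labels, and relationships of the dropped node are re-pointed to it. This is an assumption-laden toy (integer ids, no property-conflict handling, hypothetical merge_nodes helper), not a description of APOC internals.

```python
# Toy model of merging two nodes with the same name into one.

nodes = {
    1: {"labels": {"Metabolism", "Glycolysis"}, "name": "pyruvate kinase"},
    2: {"labels": {"Metabolism", "Glycolysis"}, "name": "pyruvate"},
    3: {"labels": {"Metabolism", "TCA"}, "name": "pyruvate"},
    4: {"labels": {"Metabolism", "TCA"}, "name": "pyruvate dehydrogenase"},
}
edges = [(1, "yields", 2), (3, "substrate_of", 4)]

def merge_nodes(nodes, edges, keep_id, drop_id):
    """Fold node `drop_id` into `keep_id`; return the re-pointed edge list."""
    dropped = nodes.pop(drop_id)
    nodes[keep_id]["labels"] |= dropped["labels"]
    return [
        (keep_id if start == drop_id else start,
         rel_type,
         keep_id if end == drop_id else end)
        for start, rel_type, end in edges
    ]

# Counterpart of: MATCH (g:Glycolysis), (t:TCA) WHERE g.name = t.name
pairs = [
    (g_id, t_id)
    for g_id, g in nodes.items()
    for t_id, t in nodes.items()
    if "Glycolysis" in g["labels"] and "TCA" in t["labels"] and g["name"] == t["name"]
]

for g_id, t_id in pairs:
    edges = merge_nodes(nodes, edges, g_id, t_id)

print(sorted(nodes[2]["labels"]))  # prints: ['Glycolysis', 'Metabolism', 'TCA']
print(edges)  # prints: [(1, 'yields', 2), (2, 'substrate_of', 4)]
```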
    

    Script execution:

    $ cat tca.cql |  cypher-shell -u *** -p ***
      COUNT(*)
      21
    
    $ cat glycolysis.cql |  cypher-shell -u *** -p ***
      COUNT(*)
      22
      node
      (:Metabolism:TCA:Glycolysis {name: "pyruvate"})
    
    $ 
    

    Neo4j graph (:Metabolism view):

    Neo4j Browser: Glycolysis + TCA metabolic pathways