
Mahout clustering - all text vectors in a single cluster - why?

  • Sridhar Sarnobat  · 6 years ago

    I ran the following example:

    https://github.com/technobium/mahout-clustering/blob/master/src/main/java/com/technobium/ClusteringDemo.java#L64

    Document 1 -> John saw a red car.
    Document 2 -> Marta found a red bike.
    Document 3 -> Don need a blue coat.
    Document 4 -> Mike bought a blue boat.
    Document 5 -> Albert wants a blue dish.
    Document 6 -> Lara likes blue glasses.
    Document 7 -> Donna, do you have red apples?
    Document 8 -> Sonia needs blue books.
    Document 9 -> I like blue eyes.
    Document 10 -> Arleen has a red carpet.
    

    It works as expected with EuclideanDistanceMeasure. But I can't work out why the other distance measures (TanimotoDistanceMeasure, CosineDistanceMeasure) put all the vectors into a single cluster.

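    Why is this? I don't claim to understand these two distance measures, but what do I need to change? There are too many numbers in there for me to see the effect of each one. I do have the book Mahout in Action, though I have only read two chapters.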

    EuclideanDistanceMeasure (2 clusters - good)

     Clusters: 
             7 -> wt: 1.0 distance: 4.4960791719810365  vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
             7 -> wt: 1.0 distance: 4.496079376645949  vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
             7 -> wt: 1.0 distance: 4.496079576525459  vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
             9 -> wt: 1.0 distance: 4.389955960700927  vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
             9 -> wt: 1.0 distance: 4.389956011306051  vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
             9 -> wt: 1.0 distance: 4.3899560687101395  vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
             9 -> wt: 1.0 distance: 4.389956137136399  vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
             7 -> wt: 1.0 distance: 5.577549042707083  vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
             9 -> wt: 1.0 distance: 4.389956708176695  vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
             9 -> wt: 1.0 distance: 4.389471924190491  vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
    

    Produced by:

        CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new EuclideanDistanceMeasure(), 20, 5,
                true, 0, true);
    
        FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
                new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
    

    CosineDistanceMeasure (only 1 cluster - bad)

    Clusters: 
             0 -> wt: 1.0 distance: 0.6362357041216559  vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
             0 -> wt: 1.0 distance: 0.6362357041216559  vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
             0 -> wt: 1.0 distance: 0.636235704121656  vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
             0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
             0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
             0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
             0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
             0 -> wt: 1.0 distance: 0.5876411474816594  vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
             0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
             0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
    

        CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new CosineDistanceMeasure(), 20, 5,
                true, 0, true);
    
        FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
                new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
    

    TanimotoDistanceMeasure (only 1 cluster - bad)

     Clusters: 
             0 -> wt: 1.0 distance: 0.8637279689324617  vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
             0 -> wt: 1.0 distance: 0.8637279689324617  vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
             0 -> wt: 1.0 distance: 0.8637279689324617  vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
             0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
             0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
             0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
             0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
             0 -> wt: 1.0 distance: 0.8723755210900389  vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
             0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
             0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
    

        CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new TanimotoDistanceMeasure(), 20, 5,
                true, 0, true);
    
        FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
                new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
    
    1 answer
  •   Sridhar Sarnobat    6 years ago

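    As Anony Mousse said in his first reply, the data I fed in belongs in a single cluster. After some soul-searching over the last few weeks (or, more specifically, experimenting directly with the distance measure classes), I found a data set that produces multiple clusters: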

    Text id1 = new Text("Document 1");
    Text text1 = new Text("Atletico Madrid win");
    writer.append(id1, text1);
    
    Text id6 = new Text("Document 6");
    Text text6 = new Text("Both apple and orange are fruit");
    writer.append(id6, text6);
    
    Text id7 = new Text("Document 7");
    Text text7 = new Text("Both orange and apple are fruit");
    writer.append(id7, text7);
    

    2) Determine good radius values

    a) Try out the DistanceMeasure classes with sample data

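    // toVector(...) appears to be the author's own helper (not shown here)
    // that turns a sentence into a Mahout Vector.
    Vector v1 = toVector("Atletico Madrid win");
    Vector v2 = toVector("Both apple and orange are fruit");
    Vector v3 = toVector("Both orange and apple are fruit");
    List<Vector> of = ImmutableList.of(v1, v2, v3);
    
    // Copy into a mutable list (the ImmutableList itself cannot be modified).
    List<Vector> vectorList = new LinkedList<>();
    vectorList.addAll(of);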
    List<Canopy> canopies = CanopyClusterer.createCanopies(vectorList, new CosineDistanceMeasure(), 0.3, 0.3);
    for (Canopy canopy : canopies) {
        System.out.println("DistanceMeasureMain.main() " + canopy.asFormatString());
    }
    

    Produces:

    DistanceMeasureMain.main() distance is 0.19193857965451055
    DistanceMeasureMain.main() distance is 0.5281191379648771
    DistanceMeasureMain.main() distance is 0.19193857965451055
    DistanceMeasureMain.main() C0: {0:1.1,117724:1.0,378550445:1.0,1997849123:1.0}
    DistanceMeasureMain.main() C1: {0:1.1,96727:1.0,96852:1.0,2076577:1.0,93029210:1.0,97711124:1.0,1008851410:1.0}
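
    The same kind of check can be done by hand, without Mahout on the classpath. The sketch below computes the textbook cosine distance (1 minus the cosine of the angle) between the two canopy centers printed above. The class and method names (CosineDistanceSketch, sparse, c0, c1) are made up for illustration, and this is not Mahout's CosineDistanceMeasure itself; the "distance is" lines above come from the author's own debugging code, which is not shown, so the numbers need not coincide.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch of cosine distance on sparse vectors; not Mahout's own code. */
class CosineDistanceSketch {

    /** Builds a sparse vector (index -> weight) from parallel arrays. */
    static Map<Integer, Double> sparse(int[] indices, double[] weights) {
        Map<Integer, Double> v = new HashMap<>();
        for (int i = 0; i < indices.length; i++) {
            v.put(indices[i], weights[i]);
        }
        return v;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<Integer, Double> v) {
        double sumOfSquares = 0.0;
        for (double w : v.values()) {
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Returns 1 - cos(angle between a and b); assumes neither vector is all zeros. */
    static double cosineDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared indices contribute
            }
        }
        return 1.0 - dot / (norm(a) * norm(b));
    }

    /** Canopy center C0, copied from the output above. */
    static Map<Integer, Double> c0() {
        return sparse(new int[] {0, 117724, 378550445, 1997849123},
                      new double[] {1.1, 1.0, 1.0, 1.0});
    }

    /** Canopy center C1, copied from the output above. */
    static Map<Integer, Double> c1() {
        return sparse(new int[] {0, 96727, 96852, 2076577, 93029210, 97711124, 1008851410},
                      new double[] {1.1, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0});
    }

    public static void main(String[] args) {
        System.out.println("cosine distance C0-C1 = " + cosineDistance(c0(), c1()));
    }
}
```

    The two centers share only index 0, so the result is about 0.78. That is well above the 0.3 passed to createCanopies, which is consistent with the two centers ending up as separate canopies.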
    

    I think suitable t1 and t2 values for CanopyDriver.run() are 0.2 and 0.2:

        // CosineDistanceMeasure
        CanopyDriver.run(new Path(vectorsFolder),
                new Path(canopyCentroids), new CosineDistanceMeasure(),
                0.2, 0.2, true, 1, true);
    
        FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(
                canopyCentroids, "clusters-0-final"), new Path(
                clusterOutput), 0.01, 20, 2, true, true, 0, false);
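
    One plausible reading of why 0.2 works where 20 and 5 did not: for vectors with non-negative weights, cosine distance never exceeds 1, so a T2 of 5 puts every point within range of the first canopy center. The sketch below is a simplified, plain-Java version of the canopy idea (pick a center, drop every remaining point closer than T2, repeat), not Mahout's CanopyClusterer. The class name CanopySketch, the vocabulary order, and the use of raw term counts instead of TF-IDF weights are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Simplified canopy counting with cosine distance; a sketch, not Mahout's code. */
class CanopySketch {

    /** Returns 1 - cos(angle between a and b); assumes non-zero vectors of equal length. */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Counts canopies: take the first remaining point as a center, then remove
     * every remaining point closer than t2 to it. T1 (the looser membership
     * radius) does not affect the count, so it is left out here.
     */
    static int countCanopies(List<double[]> points, double t2) {
        List<double[]> remaining = new ArrayList<>(points);
        int canopies = 0;
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            canopies++;
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                if (cosineDistance(center, it.next()) < t2) {
                    it.remove(); // too close to this center to start its own canopy
                }
            }
        }
        return canopies;
    }

    /** Term counts over: atletico, madrid, win, both, apple, and, orange, are, fruit. */
    static List<double[]> docs() {
        List<double[]> docs = new ArrayList<>();
        docs.add(new double[] {1, 1, 1, 0, 0, 0, 0, 0, 0}); // Document 1
        docs.add(new double[] {0, 0, 0, 1, 1, 1, 1, 1, 1}); // Document 6
        docs.add(new double[] {0, 0, 0, 1, 1, 1, 1, 1, 1}); // Document 7 (same words as 6)
        return docs;
    }

    public static void main(String[] args) {
        System.out.println("t2 = 0.2 -> " + countCanopies(docs(), 0.2) + " canopies");
        System.out.println("t2 = 5.0 -> " + countCanopies(docs(), 5.0) + " canopies");
    }
}
```

    With t2 = 0.2 the sketch finds two canopies (Document 1 on its own, Documents 6 and 7 together); with t2 = 5.0 it finds one, because no cosine distance here can reach 5.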
    

    Output:

    Document 1 -> Atletico Madrid win
    Document 6 -> Both apple and orange are fruit
    Document 7 -> Both orange and apple are fruit
    
     Clusters: 
             0 -> wt: 1.0 distance: 0.0  vec: Document 1 = [1:1.405, 4:1.405, 6:1.405]
             1 -> wt: 1.0 distance: 0.0  vec: Document 6 = [0:1.000, 2:1.000, 3:1.000, 5:1.000]
             1 -> wt: 1.0 distance: 0.0  vec: Document 7 = [0:1.000, 2:1.000, 3:1.000, 5:1.000]