Cassandra: a 500 MB CSV file produces a table of only about 50 MB?

cassandra

user2924127 · Tech Community · 10 years ago

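I'm new to Cassandra and trying to understand how sizing works. I created a keyspace and a table. I then wrote a Java program that generates 1 million rows into a CSV file, and loaded that file into the database. The CSV file is about 545 MB. After loading it I ran nodetool cfstats and got the output below. It reports the total space used as 50555052 bytes (~50 MB). How is that possible? With the overhead of indexes, columns, and so on, how can the stored data be smaller than the raw CSV data (and not just smaller, but much smaller)? Maybe I'm misreading the output, but does this look right? I'm running Cassandra 2.2.1 on a single machine.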

Table: users
        SSTable count: 1
        Space used (live): 50555052
        Space used (total): 50555052
        Space used by snapshots (total): 0
        Off heap memory used (total): 1481050
        SSTable Compression Ratio: 0.03029072054256705
        Number of keys (estimate): 984133
        Memtable cell count: 240336
        Memtable data size: 18385704
        Memtable off heap memory used: 0
        Memtable switch count: 19
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 1000000
        Local write latency: 0.044 ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 1192632
        Bloom filter off heap memory used: 1192624
        Index summary off heap memory used: 203778
        Compression metadata off heap memory used: 84648
        Compacted partition minimum bytes: 643
        Compacted partition maximum bytes: 770
        Compacted partition mean bytes: 770
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

The Java code that generates the CSV file looks like this:

// Fragment of the generator method: sFileName (the output path) and
// date (a java.util.Date) are defined by the surrounding code.
// try-with-resources flushes and closes the writer even if a write fails.
try (FileWriter writer = new FileWriter(sFileName)) {
    for (int i = 0; i < 1000000; i++) {
        writer.append("Username " + i);
        writer.append(',');
        writer.append(new Timestamp(date.getTime()).toString());
        writer.append(',');
        writer.append("myfakeemailaccnt@email.com");
        writer.append(',');
        writer.append(new Timestamp(date.getTime()).toString());
        writer.append(',');
        // The next three columns hold the same long token in every row.
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("tr");
        writer.append('\n');
    }
} catch (IOException e) {
    e.printStackTrace();
}
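
As a rough sanity check on the 545 MB figure, the row layout above can be measured directly. The sketch below is an illustration and not part of the original post: the class name CsvSizeEstimate and its methods buildRow and estimateBytes are hypothetical, and it assumes the file is written one byte per character, which holds here because every row is plain ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.sql.Timestamp;

public class CsvSizeEstimate {

    // The long token that the generator writes into three columns of every row.
    static final String TOKEN =
            "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ";

    // Builds one CSV row with the same column order as the generator loop.
    public static String buildRow(int i, Timestamp ts) {
        return "Username " + i + ',' + ts + ',' + "myfakeemailaccnt@email.com" + ',' + ts + ','
                + TOKEN + ',' + TOKEN + ',' + TOKEN + ',' + "tr" + '\n';
    }

    // Sums the encoded length of rows 0..rows-1 without writing a file.
    public static long estimateBytes(int rows, Timestamp ts) {
        long total = 0;
        for (int i = 0; i < rows; i++) {
            total += buildRow(i, ts).getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        Timestamp now = new Timestamp(System.currentTimeMillis());
        long bytes = estimateBytes(1_000_000, now);
        System.out.printf("estimated CSV size: %d bytes (%.1f MB)%n", bytes, bytes / 1e6);
    }
}
```

Most of each row is the three token columns, so almost all of the file consists of the same text repeated a million times.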

1 answer | as of 10 years ago

user2924127 10 years ago

So I looked at the three largest pieces of data in each row:

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ

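and realized they were identical in every row, so maybe Cassandra was compressing them, even though it reports a ratio of only about 3%. (If that ratio is compressed size divided by uncompressed size, a value near 0.03 would itself indicate very heavy compression.) So I changed the Java code to generate different data.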

import java.io.FileWriter;
import java.io.IOException;
import java.sql.Timestamp;

public class Main {

    private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public static void main(String[] args) {
        generateCassandraCSVData("users.csv");
    }

    // Returns a string of `count` characters drawn uniformly from ALPHA_NUMERIC_STRING.
    public static String randomAlphaNumeric(int count) {
        StringBuilder builder = new StringBuilder(count);
        for (int n = 0; n < count; n++) {
            int character = (int) (Math.random() * ALPHA_NUMERIC_STRING.length());
            builder.append(ALPHA_NUMERIC_STRING.charAt(character));
        }
        return builder.toString();
    }

    // Writes 1,000,000 CSV rows; the three long columns are now different in every row.
    public static void generateCassandraCSVData(String sFileName) {
        java.util.Date date = new java.util.Date();

        // try-with-resources flushes and closes the writer even if a write fails.
        try (FileWriter writer = new FileWriter(sFileName)) {
            for (int i = 0; i < 1000000; i++) {
                writer.append("Username " + i);
                writer.append(',');
                writer.append(new Timestamp(date.getTime()).toString());
                writer.append(',');
                writer.append("myfakeemailaccnt@email.com");
                writer.append(',');
                writer.append(new Timestamp(date.getTime()).toString());
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append("tr");
                writer.append('\n');
                // generate whatever data you want
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

So now the three large columns hold random strings and are no longer identical. This is what cfstats reports now:

Table: users
        SSTable count: 4
        Space used (live): 554671040
        Space used (total): 554671040
        Space used by snapshots (total): 0
        Off heap memory used (total): 1886175
        SSTable Compression Ratio: 0.6615549506522498
        Number of keys (estimate): 1019477
        Memtable cell count: 270024
        Memtable data size: 20758095
        Memtable off heap memory used: 0
        Memtable switch count: 25
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 1323546
        Local write latency: 0.048 ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 1533512
        Bloom filter off heap memory used: 1533480
        Index summary off heap memory used: 257175
        Compression metadata off heap memory used: 95520
        Compacted partition minimum bytes: 311
        Compacted partition maximum bytes: 770
        Compacted partition mean bytes: 686
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

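So now the CSV file is again about 550 MB, and my table is about 550 MB as well. So it seems that when non-key column data is identical across rows (low cardinality), Cassandra somehow compresses it very effectively? If so, this is a very important concept to keep in mind when modeling a database (and one I had never read about before), because it can save a great deal of storage space.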
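
The effect is easy to reproduce outside Cassandra with any general-purpose compressor. The sketch below is only an illustration under stated assumptions: it uses java.util.zip.Deflater from the JDK as a stand-in, which is not the compressor Cassandra applies to SSTables, and the class CompressionDemo with its methods compressedSize, randomAlphaNumeric and buildRows is hypothetical. It compresses two sets of rows, one where the 150-character column repeats in every row and one where it is random.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

public class CompressionDemo {

    private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Returns how many bytes DEFLATE produces for the given input.
    public static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Returns `count` characters drawn uniformly from ALPHA_NUMERIC_STRING.
    public static String randomAlphaNumeric(Random random, int count) {
        StringBuilder builder = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            builder.append(ALPHA_NUMERIC_STRING.charAt(random.nextInt(ALPHA_NUMERIC_STRING.length())));
        }
        return builder.toString();
    }

    // Builds `rows` lines: a unique key plus one 150-character value that is
    // either the same in every row or freshly random per row.
    public static byte[] buildRows(int rows, boolean randomValues) {
        Random random = new Random(42); // fixed seed so runs are repeatable
        String fixed = randomAlphaNumeric(random, 150);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            sb.append("Username ").append(i).append(',');
            sb.append(randomValues ? randomAlphaNumeric(random, 150) : fixed);
            sb.append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] repeated = buildRows(10000, false);
        byte[] random = buildRows(10000, true);
        System.out.printf("repeated: %d -> %d bytes%n", repeated.length, compressedSize(repeated));
        System.out.printf("random:   %d -> %d bytes%n", random.length, compressedSize(random));
    }
}
```

The repeated rows should shrink to a small fraction of their size while the random rows shrink only modestly. A string drawn from a 36-character alphabet carries about log2(36) ≈ 5.17 bits per 8-bit character, so a lossless compressor cannot get such data much below roughly 65% of its original size, which is in line with the 0.66 ratio in the second cfstats output.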