
Cassandra: a 500 MB CSV file produces a table of only about 50 MB?

  • user2924127  ·  10 years ago

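    I'm new to Cassandra and trying to understand how sizing works. I created a keyspace and a table. I then wrote a Java program that generates 1 million rows into a CSV file, to be inserted into the database. The CSV file is about 545 MB. I loaded it into the database, ran the nodetool cfstats command, and got the output below. It says the total space used is 50555052 bytes (~50 MB). How is that possible? With the overhead of indexes, columns, and so on, how can my total data be smaller than the raw CSV data (and not just smaller, but much smaller)? Maybe I'm not reading this correctly, but does this look right? I'm using Cassandra 2.2.1 on a single machine.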

    Table: users
            SSTable count: 1
            Space used (live): 50555052
            Space used (total): 50555052
            Space used by snapshots (total): 0
            Off heap memory used (total): 1481050
            SSTable Compression Ratio: 0.03029072054256705
            Number of keys (estimate): 984133
            Memtable cell count: 240336
            Memtable data size: 18385704
            Memtable off heap memory used: 0
            Memtable switch count: 19
            Local read count: 0
            Local read latency: NaN ms
            Local write count: 1000000
            Local write latency: 0.044 ms
            Pending flushes: 0
            Bloom filter false positives: 0
            Bloom filter false ratio: 0.00000
            Bloom filter space used: 1192632
            Bloom filter off heap memory used: 1192624
            Index summary off heap memory used: 203778
            Compression metadata off heap memory used: 84648
            Compacted partition minimum bytes: 643
            Compacted partition maximum bytes: 770
            Compacted partition mean bytes: 770
            Average live cells per slice (last five minutes): 0.0
            Maximum live cells per slice (last five minutes): 0
            Average tombstones per slice (last five minutes): 0.0
            Maximum tombstones per slice (last five minutes): 0
    

    The Java code that generates the CSV file looks like this:

    // Fragment: sFileName (String) and date (java.util.Date) come from the enclosing method.
    // try-with-resources flushes and closes the writer even if a write fails.
    try (FileWriter writer = new FileWriter(sFileName)) {
        // The same timestamp and the same token are written on every row.
        String now = new Timestamp(date.getTime()).toString();
        String token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ";

        for (int i = 0; i < 1000000; i++) {
            writer.append("Username " + i);
            writer.append(',');
            writer.append(now);
            writer.append(',');
            writer.append("myfakeemailaccnt@email.com");
            writer.append(',');
            writer.append(now);
            writer.append(',');
            writer.append(token);
            writer.append(',');
            writer.append(token);
            writer.append(',');
            writer.append(token);
            writer.append(',');
            writer.append("tr");
            writer.append('\n');
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
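
As a rough sanity check on the ~545 MB figure, the row layout above can be measured without writing a file. This is only a sketch: `CsvSizeEstimate` and `estimateBytes` are names made up for illustration, and it assumes one byte per character, which holds for this ASCII-only data.

```java
import java.sql.Timestamp;

final class CsvSizeEstimate {
    // The token literal that the generator writes three times per row.
    static final String TOKEN = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ";

    // Sums the character count of every generated row (ASCII, so chars == bytes).
    static long estimateBytes(int rows) {
        String now = new Timestamp(System.currentTimeMillis()).toString();
        // Fixed part of each row: two timestamps, the email, three tokens,
        // the trailing "tr", seven commas and one newline.
        long fixed = 2L * now.length()
                + "myfakeemailaccnt@email.com".length()
                + 3L * TOKEN.length()
                + "tr".length()
                + 7 + 1;
        long total = 0;
        for (int i = 0; i < rows; i++) {
            // Only the username varies in length from row to row.
            total += ("Username " + i).length() + fixed;
        }
        return total;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(1_000_000);
        System.out.printf("estimated CSV size: %.1f MB%n", bytes / 1e6);
    }
}
```

The estimate should come out close to the reported file size, which suggests the CSV figure itself is not the surprising part.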
    
    1 answer
  • user2924127  ·  10 years ago

    So I looked at the three largest pieces of data in each row:

    eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ
    

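    and realized they are all identical, so perhaps Cassandra is compressing them, even though cfstats reports an SSTable Compression Ratio of only about 0.03 (3%). So I changed the Java code to generate different data.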

    import java.io.FileWriter;
    import java.io.IOException;
    import java.sql.Timestamp;

    public class Main {

        private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

        public static void main(String[] args) {
            generateCassandraCSVData("users.csv");
        }

        // Returns a string of `count` characters drawn at random from ALPHA_NUMERIC_STRING.
        public static String randomAlphaNumeric(int count) {
            StringBuilder builder = new StringBuilder(count);
            for (int i = 0; i < count; i++) {
                int character = (int) (Math.random() * ALPHA_NUMERIC_STRING.length());
                builder.append(ALPHA_NUMERIC_STRING.charAt(character));
            }
            return builder.toString();
        }

        public static void generateCassandraCSVData(String sFileName) {
            // One fixed timestamp is reused for both timestamp columns of every row.
            String now = new Timestamp(new java.util.Date().getTime()).toString();

            // try-with-resources flushes and closes the writer even if a write fails.
            try (FileWriter writer = new FileWriter(sFileName)) {
                for (int i = 0; i < 1000000; i++) {
                    writer.append("Username " + i);
                    writer.append(',');
                    writer.append(now);
                    writer.append(',');
                    writer.append("myfakeemailaccnt@email.com");
                    writer.append(',');
                    writer.append(now);
                    writer.append(',');
                    // The three large columns now hold different random strings on every row.
                    writer.append(randomAlphaNumeric(150));
                    writer.append(',');
                    writer.append(randomAlphaNumeric(150));
                    writer.append(',');
                    writer.append(randomAlphaNumeric(150));
                    writer.append(',');
                    writer.append("tr");
                    writer.append('\n');
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    

    So now the data in the three large columns are random strings and no longer identical. This is what cfstats reports now:

    Table: users
            SSTable count: 4
            Space used (live): 554671040
            Space used (total): 554671040
            Space used by snapshots (total): 0
            Off heap memory used (total): 1886175
            SSTable Compression Ratio: 0.6615549506522498
            Number of keys (estimate): 1019477
            Memtable cell count: 270024
            Memtable data size: 20758095
            Memtable off heap memory used: 0
            Memtable switch count: 25
            Local read count: 0
            Local read latency: NaN ms
            Local write count: 1323546
            Local write latency: 0.048 ms
            Pending flushes: 0
            Bloom filter false positives: 0
            Bloom filter false ratio: 0.00000
            Bloom filter space used: 1533512
            Bloom filter off heap memory used: 1533480
            Index summary off heap memory used: 257175
            Compression metadata off heap memory used: 95520
            Compacted partition minimum bytes: 311
            Compacted partition maximum bytes: 770
            Compacted partition mean bytes: 686
            Average live cells per slice (last five minutes): 0.0
            Maximum live cells per slice (last five minutes): 0
            Average tombstones per slice (last five minutes): 0.0
            Maximum tombstones per slice (last five minutes): 0
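
Dividing `Space used (live)` by `Number of keys (estimate)` gives a rough on-disk cost per row for the two runs. The sketch below uses the figures copied from the two outputs; the class and method names are hypothetical, and because the key counts are estimates the results are only approximate.

```java
final class BytesPerKey {
    // On-disk bytes divided by the estimated number of partition keys.
    static double bytesPerKey(long spaceUsedLive, long estimatedKeys) {
        return (double) spaceUsedLive / estimatedKeys;
    }

    public static void main(String[] args) {
        // First run: the three large columns hold the same token on every row.
        System.out.printf("identical tokens: %.1f bytes/key%n",
                bytesPerKey(50_555_052L, 984_133L));
        // Second run: the three large columns hold random 150-character strings.
        System.out.printf("random strings:   %.1f bytes/key%n",
                bytesPerKey(554_671_040L, 1_019_477L));
    }
}
```

By this measure the first run costs roughly 51 bytes per row on disk and the second roughly 544, against compacted partition means of 770 and 686 bytes respectively.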
    

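    So now the CSV file is again about 550 MB, and my table is now also about 550 MB. So if the non-key column data is identical (low cardinality), does Cassandra somehow compress it very efficiently? If so, this is a very important concept when modeling a database (and one I had never read about before), because keeping it in mind can save a lot of storage space.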
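
One way to see why identical values shrink so much and random ones do not is to run both kinds of rows through a general-purpose compressor. The sketch below uses `java.util.zip.Deflater` purely as a stand-in: Cassandra compresses SSTable chunks with its own per-table compressor (LZ4 by default in the 2.x line, to my knowledge), so the exact ratios will differ. `CompressionSketch`, `buildRows` and `compressionRatio` are hypothetical names, and the ratio here is compressed size divided by uncompressed size, which I assume is also what `SSTable Compression Ratio` reports.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

final class CompressionSketch {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    static String randomAlphaNumeric(Random rng, int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            sb.append(ALPHABET.charAt(rng.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // Builds CSV-like rows: a unique username plus one 150-character value that is
    // either the same on every row or freshly random on every row.
    static byte[] buildRows(int rows, boolean randomValues) {
        Random rng = new Random(42);
        String fixed = randomAlphaNumeric(rng, 150);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            sb.append("Username ").append(i).append(',')
              .append(randomValues ? randomAlphaNumeric(rng, 150) : fixed)
              .append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compressed size divided by uncompressed size; smaller means better compression.
    static double compressionRatio(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        long compressed = 0;
        while (!deflater.finished()) {
            compressed += deflater.deflate(buffer);
        }
        deflater.end();
        return (double) compressed / input.length;
    }

    public static void main(String[] args) {
        System.out.printf("identical values: ratio %.3f%n",
                compressionRatio(buildRows(2000, false)));
        System.out.printf("random values:    ratio %.3f%n",
                compressionRatio(buildRows(2000, true)));
    }
}
```

If the assumption about the ratio holds, the 0.03 in the first output indicates very strong compression rather than weak compression: the smaller the ratio, the smaller the data on disk. The saving comes from repeated bytes within each compressed chunk, so it depends on how repetitive neighboring rows are rather than on column cardinality as such.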