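So I looked at the data in the 3 largest columns:
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ
and figured that, since the values were all identical, maybe Cassandra was compressing them, even though it reports a ratio of only 3%. So I changed the Java code to generate different data.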
import java.io.FileWriter;
import java.io.IOException;
import java.sql.Timestamp;

public class Main {
    private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public static void main(String[] args) {
        generateCassandraCSVData("users.csv");
    }

    // Returns `count` characters drawn at random from ALPHA_NUMERIC_STRING.
    public static String randomAlphaNumeric(int count) {
        StringBuilder builder = new StringBuilder();
        while (count-- > 0) {
            int character = (int) (Math.random() * ALPHA_NUMERIC_STRING.length());
            builder.append(ALPHA_NUMERIC_STRING.charAt(character));
        }
        return builder.toString();
    }

    // Writes 1,000,000 CSV rows; the three 150-character columns are random per row.
    public static void generateCassandraCSVData(String sFileName) {
        java.util.Date date = new java.util.Date();
        String timestamp = new Timestamp(date.getTime()).toString();
        // try-with-resources flushes and closes the writer even if a write fails
        try (FileWriter writer = new FileWriter(sFileName)) {
            for (int i = 0; i < 1000000; i++) {
                writer.append("Username " + i);
                writer.append(',');
                writer.append(timestamp);
                writer.append(',');
                writer.append("myfakeemailaccnt@email.com");
                writer.append(',');
                writer.append(timestamp);
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append("tr");
                writer.append('\n');
                //generate whatever data you want
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
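As a rough cross-check on the file size, the row layout of the generator can be added up directly. The sketch below is only an estimate: the class name `CsvSizeEstimate` and the method `estimateBytes` are made up for illustration, and it assumes every character is one byte and that `Timestamp.toString()` yields 23 characters (millisecond precision with no trailing zeros trimmed). Under those assumptions, 1,000,000 rows come out at roughly 547 MB.

```java
public class CsvSizeEstimate {
    // Estimated size in bytes of the CSV written by generateCassandraCSVData,
    // assuming one byte per character and a fixed timestamp length.
    public static long estimateBytes(int rows, int timestampLength) {
        long fixedPerRow =
                2L * timestampLength                      // two timestamp columns
                + "myfakeemailaccnt@email.com".length()   // email column
                + 3L * 150                                // three random 150-char columns
                + "tr".length()                           // last column
                + 7                                       // commas between the 8 fields
                + 1;                                      // trailing newline
        long total = 0;
        for (int i = 0; i < rows; i++) {
            // the username column grows with the number of digits in i
            total += ("Username " + i).length() + fixedPerRow;
        }
        return total;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(1_000_000, 23);
        System.out.printf("estimated CSV size: %d bytes (%.1f MB)%n", bytes, bytes / 1e6);
    }
}
```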
So now the data in those 3 large columns is made of random strings and is no longer identical. This is what it produces now:
Table: users
SSTable count: 4
Space used (live): 554671040
Space used (total): 554671040
Space used by snapshots (total): 0
Off heap memory used (total): 1886175
SSTable Compression Ratio: 0.6615549506522498
Number of keys (estimate): 1019477
Memtable cell count: 270024
Memtable data size: 20758095
Memtable off heap memory used: 0
Memtable switch count: 25
Local read count: 0
Local read latency: NaN ms
Local write count: 1323546
Local write latency: 0.048 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 1533512
Bloom filter off heap memory used: 1533480
Index summary off heap memory used: 257175
Compression metadata off heap memory used: 95520
Compacted partition minimum bytes: 311
Compacted partition maximum bytes: 770
Compacted partition mean bytes: 686
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0
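So the CSV file is again about 550 MB, and now my table is about 550 MB as well. So if the data in the non-key columns is identical (low cardinality), does Cassandra somehow compress it very efficiently? If that is the case, it is a very important concept to keep in mind when modeling a database (and one I have never read about before), because it could save a lot of storage space.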
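The effect can be reproduced outside Cassandra with any general-purpose compressor. The sketch below is not Cassandra's code path: Cassandra compresses SSTable data in chunks with its configured compressor (LZ4 by default in recent versions), whereas this example uses the JDK's `java.util.zip.Deflater` purely as a stand-in. The class `CompressionDemo`, its method names, and the row count are assumptions chosen for illustration. It builds two sets of rows, one where every row carries the same 150-character payload and one where each payload is random, and compares the compressed sizes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

public class CompressionDemo {
    private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Seeded variant of randomAlphaNumeric so that runs are repeatable.
    public static String randomAlphaNumeric(Random rng, int count) {
        StringBuilder builder = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            builder.append(ALPHA_NUMERIC_STRING.charAt(rng.nextInt(ALPHA_NUMERIC_STRING.length())));
        }
        return builder.toString();
    }

    // Builds `rows` lines; each holds either one shared payload or a fresh random one.
    public static byte[] buildRows(int rows, boolean identical, Random rng) {
        String fixed = randomAlphaNumeric(rng, 150);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            sb.append("Username ").append(i).append(',')
              .append(identical ? fixed : randomAlphaNumeric(rng, 150))
              .append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Returns the DEFLATE-compressed size of the input in bytes.
    public static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        byte[] same = buildRows(10_000, true, rng);
        byte[] random = buildRows(10_000, false, rng);
        System.out.printf("identical payloads: %d -> %d bytes%n", same.length, compressedSize(same));
        System.out.printf("random payloads:    %d -> %d bytes%n", random.length, compressedSize(random));
    }
}
```

The rows with identical payloads should shrink to a small fraction of their raw size, while the random rows compress far less. That matches the pattern seen here: a tiny table when the three large columns repeat, and a table about as large as the CSV once they are random.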