
Compare two large text files of URLs in Java using only external memory?

  •  0
  • Nicholas DiPiazza  ·  7 years ago

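    I have two large text files of URLs:

    • URL text file A
    • URL text file B

    I need to compute:

    • all URLs in B that are not in A

    Every Java diff example I have found loads the entire list into memory (using a Map or an MMap-based solution). My system has no swap and not enough RAM to hold the lists, so this has to be done with external memory (disk).

    Does anyone know of a solution?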

    This project can sort huge files without using much memory: https://github.com/lemire/externalsortinginjava
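    For readers unfamiliar with external sorting, the idea can be sketched as follows. This is a minimal illustration using only the JDK, not the implementation of the linked project; the class name `SimpleExternalSort` and the `maxLinesInMemory` parameter are hypothetical. It sorts fixed-size batches in memory, writes each batch to a temporary file, then merges the sorted batches with a priority queue so that only one line per batch is held in memory during the merge.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of an external merge sort over the lines of a text file. */
final class SimpleExternalSort {

  /** Sorts input into output (String.compareTo order), buffering at most maxLinesInMemory lines. */
  static void sort(Path input, Path output, int maxLinesInMemory) throws IOException {
    List<Path> runs = new ArrayList<>();
    try {
      try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
        List<String> buffer = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
          buffer.add(line);
          if (buffer.size() >= maxLinesInMemory) {
            runs.add(writeRun(buffer));
            buffer.clear();
          }
        }
        if (!buffer.isEmpty()) {
          runs.add(writeRun(buffer));
        }
      }
      merge(runs, output);
    } finally {
      for (Path run : runs) {
        Files.deleteIfExists(run);
      }
    }
  }

  /** Sorts one batch in memory and writes it to a temporary file (a sorted "run"). */
  private static Path writeRun(List<String> buffer) throws IOException {
    Collections.sort(buffer);
    Path run = Files.createTempFile("sort-run", ".txt");
    Files.write(run, buffer, StandardCharsets.UTF_8);
    return run;
  }

  /** The current smallest unread line of one run, plus the reader it came from. */
  private static final class Head {
    final String line;
    final BufferedReader reader;

    Head(String line, BufferedReader reader) {
      this.line = line;
      this.reader = reader;
    }
  }

  /** K-way merge: repeatedly emit the smallest head line and refill from the same run. */
  private static void merge(List<Path> runs, Path output) throws IOException {
    PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
    List<BufferedReader> readers = new ArrayList<>();
    try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
      for (Path run : runs) {
        BufferedReader reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
        readers.add(reader);
        String first = reader.readLine();
        if (first != null) {
          heap.add(new Head(first, reader));
        }
      }
      while (!heap.isEmpty()) {
        Head head = heap.poll();
        out.write(head.line);
        out.newLine();
        String next = head.reader.readLine();
        if (next != null) {
          heap.add(new Head(next, head.reader));
        }
      }
    } finally {
      for (BufferedReader reader : readers) {
        reader.close();
      }
    }
  }
}
```

    Once both URL files are sorted this way, they can be compared in a single sequential pass. For real workloads the linked library is likely the better choice, since a sketch like this does nothing to limit the number of open run files.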

    2 answers  |  as of 7 years ago
        1
  •  1
  •   Shamit Verma    7 years ago

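    If the system has enough disk space, you can do this with a database. For example:

    Create an H2 or SQLite database (with the data stored on disk and as much cache allocated as the system can afford). Load each file into its own table (A and B, each with a url column), then run: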

    select url from A where URL not in (select distinct url from B)
    select url from B where URL not in (select distinct url from A)
    
        2
  •  0
  •   Nicholas DiPiazza    7 years ago

    Here is a gist of the solution I came up with: https://gist.github.com/nddipiazza/16cb2a0d23ee60a07121893c26065de4

    import com.google.common.collect.Sets;
    import org.apache.commons.io.FileUtils;
    import org.apache.commons.io.LineIterator;
    
    import java.io.File;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    
    public class DiffTextFilesUtil {
      static public int CHUNK_SIZE = 100000;
    
      static public class DiffResult {
        public Set<String> addedVals = new HashSet<>();
        public Set<String> removedVals = new HashSet<>();
      }
    
      /**
       * Computes the difference between two sorted text files, one chunk at a time.
       * Both files must be sorted in String.compareTo order with duplicate lines removed,
       * for example using com.google.code.externalsortinginjava:externalsortinginjava:0.2.5.
       * Only the per-chunk sets are bounded; the result sets grow with the size of the diff.
       * @param lhs left hand file (sorted)
       * @param rhs right hand file (sorted)
       * @return DiffResult.addedVals are lines in rhs but not in lhs. DiffResult.removedVals are lines in lhs but not in rhs.
       * @throws IOException if either file cannot be read
       */
      public static DiffResult diff(File lhs, File rhs) throws IOException {
        DiffResult diffResult = new DiffResult();

        LineIterator lhsIter = FileUtils.lineIterator(lhs);
        LineIterator rhsIter = FileUtils.lineIterator(rhs);
        try {
          String lhsTop = null;
          String rhsTop = null; // read from rhs but not yet assigned to a chunk
          while (lhsIter.hasNext()) {
            int ct = Math.max(1, CHUNK_SIZE);
            Set<String> setLhs = new HashSet<>();
            Set<String> setRhs = new HashSet<>();
            // Read up to CHUNK_SIZE lines from lhs; lhsTop ends up as the largest line in the chunk.
            while (lhsIter.hasNext() && ct-- > 0) {
              lhsTop = lhsIter.nextLine();
              setLhs.add(lhsTop);
            }
            // Take every rhs line that sorts at or before lhsTop.
            while (rhsTop != null || rhsIter.hasNext()) {
              if (rhsTop == null) {
                rhsTop = rhsIter.nextLine();
              }
              if (rhsTop.compareTo(lhsTop) > 0) {
                break; // belongs to a later lhs chunk, keep it for the next round
              }
              setRhs.add(rhsTop);
              rhsTop = null;
            }
            Sets.difference(setLhs, setRhs).copyInto(diffResult.removedVals);
            Sets.difference(setRhs, setLhs).copyInto(diffResult.addedVals);
          }
          // lhs is exhausted: everything still left in rhs was added.
          if (rhsTop != null) {
            diffResult.addedVals.add(rhsTop);
          }
          while (rhsIter.hasNext()) {
            diffResult.addedVals.add(rhsIter.nextLine());
          }
        } finally {
          lhsIter.close();
          rhsIter.close();
        }
        return diffResult;
      }
    }
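
    If the diff itself may be too large for memory, the same sorted-file idea can be sketched as a streaming merge that writes results straight to disk. This is only an illustration under stated assumptions, not part of the gist above: the class `SortedUrlDiff` and the method `writeOnlyInB` are hypothetical names, both inputs are assumed to be UTF-8 and already sorted in `String.compareTo` order, and only the JDK is used. It answers the original question (URLs in B that are not in A) while holding one line per file in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: streams the lines of sorted file B that are absent from sorted file A. */
final class SortedUrlDiff {

  /**
   * Writes every line of sortedB that does not occur in sortedA to out.
   * Both inputs must be sorted in String.compareTo order.
   *
   * @return the number of lines written to out
   */
  static long writeOnlyInB(Path sortedA, Path sortedB, Path out) throws IOException {
    long written = 0;
    try (BufferedReader a = Files.newBufferedReader(sortedA, StandardCharsets.UTF_8);
         BufferedReader b = Files.newBufferedReader(sortedB, StandardCharsets.UTF_8);
         BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
      String lineA = a.readLine();
      String lineB = b.readLine();
      while (lineB != null) {
        // When A is exhausted, treat lineB as smaller so it is always written.
        int cmp = (lineA == null) ? -1 : lineB.compareTo(lineA);
        if (cmp < 0) {
          // lineB sorts before everything left in A, so A cannot contain it.
          w.write(lineB);
          w.newLine();
          written++;
          lineB = b.readLine();
        } else if (cmp == 0) {
          // Present in both files: skip it in B, keep lineA in case B repeats the line.
          lineB = b.readLine();
        } else {
          // lineA is smaller: advance A until it catches up with lineB.
          lineA = a.readLine();
        }
      }
    }
    return written;
  }
}
```

    A call such as `SortedUrlDiff.writeOnlyInB(sortedA, sortedB, out)` would run after both files have been sorted externally. A line that appears several times in B but never in A is written once per occurrence, so deduplicate during sorting if that matters.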