代码之家 › 专栏 › 技术社区 › arranjdavis

从另一个SQL数据库创建一个SQL数据库时内存使用过多

sqlite pandas python

arranjdavis · 技术社区 · 6 年前

我试图遍历一个sqlite数据库(我称之为database a),从其中的数据创建一些新变量,然后将这些新数据写入一个新的sqlite数据库(database b)。

数据库A是由表组成的,这些表由关于某个月的特定术语的tweet组成(每个tweet及其元数据是一行,包括该月的每一天)。每个表的大小约为0.5 GB。

所以,我遍历这些表,创建一个变量,然后将这个新数据写入/提交到数据库B。

问题是,在遍历了几个表之后,我使用的服务器上的工作内存(我有16GB的RAM)就完全用完了(使用 free -m 命令在bash中,我可以看到大约一半的ram被“buff/cache”使用。这不会生成任何可以在输出文件中看到的错误(通常显示python错误消息),但脚本将停止运行。

我认为这是sqlite创建临时文件的结果( https://www.sqlite.org/tempfiles.html ,随着for循环的继续,它将继续增长。因此,我尝试逐日遍历表中的行,并在每天之后将新数据提交到数据库B,以便删除回滚日志(请参见上面的链接)(这些临时SQL文件之一)(从而释放内存)。然而,即使在做了这些更改之后,我也遇到了同样的问题(脚本停止)。

我不确定这里的代码有多大帮助,但这里有一个我正在做的基本概述:

import sqlite3
import pandas

#this defines the SQL query; [long list of columns] is just comma separated column names: id, date, time, etc.
sql_query = ("SELECT DISTINCT [long list of columns] "
            "FROM term "
            "WHERE date = 'day';")

### HERE I GET ALL TABLES IN DATABASE A ###

#go through all the tables in Database A
for t in tables:

   term = t

   ### HERE I GET THE DAYS IN THE CURRENT TABLE ###

   #go through each day in the current table in Database A
   for day in days:

      #open the databases
      connection = sqlite3.connect("../SQL_database/Database_A.db3")
      lite_cursor = connection.cursor()
      connection_new = sqlite3.connect("../SQL_database/Database_B.db3")
      lite_cursor_new = connection_new.cursor()

      #change SQL query to match current day and term
      sql_query = sql_query.replace('day', day)

      #extract the data from the database and put it in the new database
      for chunk in pandas.read_sql_query(sql_query, connection, chunksize = 10000):

         ### HERE I PROCESS THE DATA ###

         #add the current data set to Database B
         new_table = term
         chunk.to_sql(new_table, connection_new, if_exists='append', index=True)

         #redefine SQL query; [long list of columns] is just comma separated column names: id, date, time, etc.
         sql_query = ("SELECT DISTINCT [long list of columns] "
                     "FROM term "
                     "WHERE date = 'day';")

         #commit the changes
         connection_new.commit()

         #close the databases
         connection.close()
         connection_new.close()

当然,我想要的是脚本运行时不停顿/崩溃!有没有什么方法可以清除sqlite内存缓存,以便在for循环继续时不会耗尽ram?我以为commit()会释放一些内存,但显然它释放的内存不够。

提前谢谢你!

0 回复 | 直到 6 年前

Serge Ballesta 6 年前

我将尝试直接在sqlite级别执行此操作。

sqlite能够将一个额外的数据库附加到当前连接,这允许在不同的数据库之间轻松地复制表。由于你没有添加很多处理,熊猫是相当无用的,而且 ATTACH DATABASE 应该足够了:

import sqlite3

#this defines the SQL query; [long list of columns] is just comma separated column names: id, date, time, etc.
sql_query = ("SELECT DISTINCT [long list of columns] "
            "FROM term "
            "WHERE date = 'day';")

#open the databases
connection = sqlite3.connect("../SQL_database/Database_A.db3")
connection.execute("ATTACH DATABASE '../SQL_database/Database_B.db3' as db_B")

### HERE I GET ALL TABLES IN DATABASE A ###

#go through all the tables in Database A
for t in tables:

   term = t

   ### HERE I GET THE DAYS IN THE CURRENT TABLE ###

   #go through each day in the current table in Database A
   for day in days:

      #change SQL query to match current day and term
      # but don't change original query because we'll need it on next iteration
      sql_query2 = sql_query.replace('day', day) 
      sql_query2 = sql_query2.replace('term', term) 

      # print(sql_query2, end=' ')           # uncomment to make sure of what happens

      # copy table values
      try:
          connection.execute("INSERT INTO db_B.{} ".format(term) + sql_query2)
          # print('inserted')                    # uncomment for traces
      except sqlite3.OperationalError:  # table does not exists
          connection.rollback()
          connection.execute("CREATE TABLE db_B.{} AS ".format(term) + sql_query2)
          # print('created')                     # uncomment for traces

   connection.commit()

connection.close()

这里唯一可能消耗资源的操作是 SELECT DISTINCT 这需要扫描整个表,以便在给定的一天只保留不同的行,但是每次提交时都应该释放资源。根据表的数量和大小,这可能需要一些时间,但不应该崩溃。

很晚才添加此答案,但我刚刚意识到有许多请求使用 DISTINCT 关键字与 WHERE date = . 索引可以大大提高数据库性能。在此,在提取信息之前添加索引将对时间和内存产生重大影响:

...
for t in tables:

   term = t

   connection.execute("CREATE INDEX IF NOT EXISTS I{0} ON {0}(date)"
                      .format(term))

   ### HERE I GET THE DAYS IN THE CURRENT TABLE ###
...