
Using incrblob to determine a byte position in a BLOB segment from a character position in the text string?

tcl sqlite

Gary · Tech Community · 1 year ago

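My question concerns the following situation. I am using SQLite's incremental BLOB I/O through the Tcl interface to store text as a BLOB. Another table in the database holds rows of pointer data, each pointing to a segment of this BLOB. Because Tcl channels seek to byte positions rather than character positions (I assume this is common; I am not criticizing Tcl), I need to track byte_start, char_start, byte_length, and char_length for every pointer.

Tcl receives a request indicating that an existing pointer (not the actual buffer data, just the pointer itself) needs to be split into two pointers at a given character position. The request comes from the UI, and that code knows nothing about byte positions; in fact, it knows nothing about the BLOB at all.

So I need to extract the segment of the BLOB, convert it to text, determine the byte length of at least one of the two new pointers, and use that to work out the starting byte and byte length of both pointers.

I had planned to use something like the following in SQLite, but according to Dr. Hipp the entire BLOB must be read into memory before substr can be used on it.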

select
   octet_length(
      substr(
         cast(
            substr(
               buffer,
               byte_start,
               byte_length
            ) as text
         ),
         1,
         :len
      )
   )
from
   mem.pt_buffers
where
       doc_id = :doc_id
   and buffer_id = :buffer_id_s
;

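So I am using incrblob in Tcl, as the example below shows. It appears to produce the correct result, but it also seems like a lot of work. For example, content is extracted from the BLOB only to find the byte position that corresponds to the character position of the split point. The content is not used in any other way, and the BLOB is not modified. If the UI knew byte positions, there would be no need to extract any content, and the operation would be pure arithmetic.

My questions are:

Am I doing this the hard way; is there a simpler approach?
Can or should more of this be done directly in SQLite?

Thank you for considering my question.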

package require sqlite3
sqlite3 db
db eval {create table test (id integer, data blob);}
db eval {insert into test values (1, zeroblob(100));}
puts [db eval {select id, cast(data as text) from test;}]
# 1 {}

# Previously, a request had to come in to append this string
# to the BLOB; to the non-zero portion, anyway. This is required
# set-up for the question.
set fdBlob [db incrblob main test data 1]
chan configure $fdBlob -translation binary -buffering none
set bindata [encoding convertto utf-8\
     {This is some ×Ö· / × Ö¼Ö¸×Ö´Ö×× text cast as a BLOB.}]
chan puts -nonewline $fdBlob $bindata

puts [db eval {
   select
      length(data),
      length(cast(data as text)),
      length(:bindata)
   from
      test
   ;
}]
# 100 47 57

# Pre-split pointer data:
# piece  char_start char_length byte_start byte_length
# -----  ---------- ----------- ---------- -----------
# orig        0          47           0         57

# Request comes in to split the pointer at the 27th character
# into two pointers: characters 0-26 and 27-end.

# Retrieve the segment of the BLOB. Have to retrieve the full
# piece because do not know where to split the bytes. Convert
# from binary to text. Note [chan read numChars] reads chars,
# but since channel is configured as binary, same as bytes.
chan seek $fdBlob 0 start
set data [encoding convertfrom utf-8 [chan read $fdBlob 57]]

# Split the piece. Need only one or the other, not both.
set frontEnd [string range $data 0 26]
set tailEnd [string range $data 27 end]
puts "\"$frontEnd\" \"$tailEnd\""
# "This is some ×Ö· / × Ö¼Ö¸×Ö´Ö×× " "text cast as a BLOB."

# Convert the substring back to binary and determine byte length.
set frontEndByteLen [string length [encoding convertto utf-8 $frontEnd]]
set tailEndByteLen [expr {57-$frontEndByteLen}]
puts "Front-end data: byte_start: 0 byte_length $frontEndByteLen"
puts "Tail-end data: byte_start: $frontEndByteLen byte_length $tailEndByteLen"
# Front-end data: byte_start: 0 byte_length 37
# Tail-end data: byte_start: 37 byte_length 20

# Test it out by seeking to the start byte and extracting the tail-end.
chan seek $fdBlob $frontEndByteLen start
set data [encoding convertfrom utf-8 [chan read $fdBlob $tailEndByteLen]]
puts $data
# text cast as a BLOB.

# Then the pointer table will have these new pieces inserted.
# piece  char_start char_length byte_start byte_length
# -----  ---------- ----------- ---------- -----------
# front       0          27           0         37
# tail       27          20          37         20
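
Once the front byte length has been measured, the split itself is bookkeeping. A minimal sketch of that arithmetic, assuming each pointer is held as a Tcl dict keyed by the four column names above and that the split offset is counted from the start of the piece. The name splitPointer is hypothetical and not part of the original code:

```tcl
# Hypothetical helper: split one pointer (a dict with the four
# columns from the table above) at a character offset relative to
# the start of the piece. frontByteLen must already have been
# measured from the BLOB; everything else is arithmetic.
proc splitPointer {ptr charOffset frontByteLen} {
    dict with ptr {
        set front [dict create \
            char_start  $char_start \
            char_length $charOffset \
            byte_start  $byte_start \
            byte_length $frontByteLen]
        set tail [dict create \
            char_start  [expr {$char_start + $charOffset}] \
            char_length [expr {$char_length - $charOffset}] \
            byte_start  [expr {$byte_start + $frontByteLen}] \
            byte_length [expr {$byte_length - $frontByteLen}]]
    }
    return [list $front $tail]
}

# The pointer and split point from the example above.
set orig {char_start 0 char_length 47 byte_start 0 byte_length 57}
lassign [splitPointer $orig 27 37] front tail
puts $front
puts $tail
# char_start 0 char_length 27 byte_start 0 byte_length 37
# char_start 27 char_length 20 byte_start 37 byte_length 20
```

Because the helper adds the offsets to char_start and byte_start, it should also work for a piece that does not begin at position 0, provided the caller passes a piece-relative character offset.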

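Edit: It took my brain a while to process the closing remark of the answer, simple as it is:

Another consideration is that chan seek operates on byte offsets while chan read operates on characters, so it may be easier to extract the pieces that way (particularly if it is the first part you want: just seek to 0 and read that many characters)

As the comment in my original code notes, I knew this but did not have the sense to use it properly. All that is needed is to reconfigure the channel with -encoding utf-8; chan seek then works in bytes and chan read in characters. The front-end segment can therefore be read directly through incrblob I/O, without reading the full piece and applying string range in Tcl. I had foolishly assumed the BLOB had to be read as binary. This saves a bit of work.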

# Version 2
set frontEnd_byte_start 0
set frontEnd_char_length 27
chan configure $fdBlob -encoding utf-8 -buffering none
chan seek $fdBlob $frontEnd_byte_start start
set frontEndText [chan read $fdBlob $frontEnd_char_length]
set frontEnd_byte_length [string length [encoding convertto utf-8 $frontEndText]]
set tailEnd_byte_start [expr {$frontEnd_byte_start + $frontEnd_byte_length}]
puts "frontEnd text: $frontEndText"
puts "frontEnd byte length: $frontEnd_byte_length"
puts "tailEnd byte start: $tailEnd_byte_start"

chan seek $fdBlob $tailEnd_byte_start start
set data [chan read $fdBlob 20]
puts $data

# frontEnd text: This is some ×Ö· / × Ö¼Ö¸×Ö´Ö×× 
# frontEnd byte length: 37
# tailEnd byte start: 37
# text cast as a BLOB.


Cyan Ogilvie · 1 year ago

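Your example is how I would approach it, particularly if the character offset comes from a Tcl UI. Since SQLite is in-process, there is no network latency to consider; it really comes down to whether the character offset calculation is performed by SQLite's code or by Tcl's. Tcl is very good at this sort of thing (what with everything being a string).

This holds particularly if the UI is Tcl, because the character indices will then agree in cases such as surrogate pairs.

It may be more efficient to simply select the BLOB value, split it with the string range command, and update the rows, but that depends on how the channel wrapper for the db value is implemented and what kind of complexity that implies. It is best to benchmark your actual case. Another consideration is that chan seek operates on byte offsets while chan read works in characters, so it may be easier to extract the pieces that way (particularly if it is the first part you want: just seek to 0 and read that many characters).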
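
Following the suggestion to benchmark, here is a rough sketch that times the two ways of measuring the front byte length. It assumes a temporary file as a stand-in for the incrblob channel, so the absolute numbers will not carry over to SQLite. The proc names frontBytesViaRange and frontBytesViaRead and the test data are made up for illustration:

```tcl
# Approach A: read the whole piece as bytes, decode it, cut it in Tcl.
proc frontBytesViaRange {chan byteStart byteLen numChars} {
    chan configure $chan -translation binary
    chan seek $chan $byteStart start
    set text [encoding convertfrom utf-8 [chan read $chan $byteLen]]
    set front [string range $text 0 [expr {$numChars - 1}]]
    return [string length [encoding convertto utf-8 $front]]
}

# Approach B: let the channel decode UTF-8 and read only numChars
# characters. byteStart must fall on a character boundary.
proc frontBytesViaRead {chan byteStart numChars} {
    chan configure $chan -encoding utf-8 -translation lf
    chan seek $chan $byteStart start
    set front [chan read $chan $numChars]
    return [string length [encoding convertto utf-8 $front]]
}

# Stand-in data: 25000 repeats of a 4-character, 5-byte unit
# ("a", "b", one 2-byte Hebrew letter, and a space).
set fd [file tempfile tmpName]
chan configure $fd -translation binary
set piece [encoding convertto utf-8 [string repeat "ab\u05d0 " 25000]]
chan puts -nonewline $fd $piece
chan flush $fd
set byteLen [string length $piece]

# Both approaches should agree: 1000 characters are 250 units of 5 bytes.
puts [list [frontBytesViaRange $fd 0 $byteLen 1000] [frontBytesViaRead $fd 0 1000]]
# 1250 1250

# Time 100 iterations of each; results depend on the machine.
puts "A: [time {frontBytesViaRange $fd 0 $byteLen 1000} 100]"
puts "B: [time {frontBytesViaRead $fd 0 1000} 100]"

chan close $fd
file delete $tmpName
```

To get numbers that mean something for the real case, the same two procs would need to be run against the channel returned by db incrblob, with pieces of realistic size.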