
Truncating a string by bytes

  •  8
  • stevebot  · Tech Community  · 15 years ago

    I wrote the following code to truncate a String in Java to a new String with a given number of bytes.

            String truncatedValue = "";
            String currentValue = string;
            int pivotIndex = (int) Math.round(((double) string.length())/2);
            while(!truncatedValue.equals(currentValue)){
                currentValue = string.substring(0,pivotIndex);
                byte[] bytes = null;
                bytes = currentValue.getBytes(encoding);
                if(bytes==null){
                    return string;
                }
                int byteLength = bytes.length;
                int newIndex =  (int) Math.round(((double) pivotIndex)/2);
                if(byteLength > maxBytesLength){
                    pivotIndex = newIndex;
                } else if(byteLength < maxBytesLength){
                    pivotIndex = pivotIndex + 1;
                } else {
                    truncatedValue = currentValue;
                }
            }
            return truncatedValue;
    

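    This is the first thing that came to mind, and I know it can be improved. I saw another post asking a similar question, but there they truncated the string using the bytes rather than String.substring. I think I would rather use String.substring in my case.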
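
    For comparison, here is a minimal sketch of a binary search that still uses String.substring. The loop in the question only terminates when some prefix encodes to exactly maxBytesLength bytes; the sketch instead looks for the longest prefix that fits. It assumes UTF-8 and relies on the encoded length of a prefix never decreasing as the prefix grows. The names SubstringTruncate and truncate are made up for illustration and are not from the original post.

```java
import java.nio.charset.StandardCharsets;

final class SubstringTruncate {
    /** Returns the longest prefix of s whose UTF-8 encoding is at most maxBytes long. */
    static String truncate(String s, int maxBytes) {
        int lo = 0;            // a prefix length known to fit (the empty prefix)
        int hi = s.length();   // the largest prefix length still worth trying
        while (lo < hi) {
            int mid = lo + (hi - lo + 1) / 2; // round up so the loop always makes progress
            int bytes = s.substring(0, mid).getBytes(StandardCharsets.UTF_8).length;
            if (bytes <= maxBytes) {
                lo = mid;      // this prefix fits, try a longer one
            } else {
                hi = mid - 1;  // too long, try a shorter one
            }
        }
        // Do not cut between the two halves of a surrogate pair.
        if (lo > 0 && lo < s.length()
                && Character.isHighSurrogate(s.charAt(lo - 1))
                && Character.isLowSurrogate(s.charAt(lo))) {
            lo--;
        }
        return s.substring(0, lo);
    }
}
```

    Each step encodes one prefix, so this does roughly log2(length) encodings. That is simple, though probably not the fastest option for long strings.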

    12 replies  |  up to 12 years ago
        1
  •  14
  •   Rex Kerr    11 years ago

    // Assuming that Java will always produce valid UTF8 from a string, so no error checking!
    // (Is this always true, I wonder?)
    public class UTF8Cutter {
      public static String cut(String s, int n) {
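        // Explicit charset (needs import java.nio.charset.StandardCharsets);
        // a bare getBytes() uses the platform default, which may not be UTF-8.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);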
        if (utf8.length < n) n = utf8.length;
        int n16 = 0;
        int advance = 1;
        int i = 0;
        while (i < n) {
          advance = 1;
          if ((utf8[i] & 0x80) == 0) i += 1;
          else if ((utf8[i] & 0xE0) == 0xC0) i += 2;
          else if ((utf8[i] & 0xF0) == 0xE0) i += 3;
          else { i += 4; advance = 2; }
          if (i <= n) n16 += advance;
        }
        return s.substring(0,n16);
      }
    }
    

    Note: edited on 25 August 2014 to fix bugs.

        2
  •  7
  •   kan    10 years ago

    A saner solution is to use a decoder:

    final Charset CHARSET = Charset.forName("UTF-8"); // or any other charset
    final byte[] bytes = inputString.getBytes(CHARSET);
    final CharsetDecoder decoder = CHARSET.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.reset();
    final CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes, 0, limit));
    final String outputString = decoded.toString();
    
        3
  •  5
  •   Zsolt Taskai    10 years ago

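    I think Rex Kerr's solution has two bugs.

    • First, it truncates to limit+1 if a non-ASCII character sits just before the limit. Truncating a string such as "123456789á1" yields "123456789á", which takes 11 bytes in UTF-8.
    • https://en.wikipedia.org/wiki/UTF-8#Description shows that 110xxxxx at the start of a UTF-8 sequence means the sequence is 2 bytes long (not 3). That is why his implementation usually does not use up all the available space (as Nissim Avitan pointed out).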
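
    To make the second point concrete, here is a small sketch of how the lead byte of a UTF-8 sequence maps to the sequence length, following the bit patterns in the Wikipedia table linked above. The class and method names (Utf8LeadByte, sequenceLength) are hypothetical and only for illustration.

```java
final class Utf8LeadByte {
    /**
     * Length in bytes of the UTF-8 sequence that starts with lead byte b,
     * or -1 if b is a continuation byte or not a valid lead byte.
     */
    static int sequenceLength(byte b) {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx: 2-byte sequence
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx: 3-byte sequence
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx: 4-byte sequence
        return -1;                        // 10xxxxxx (continuation) or invalid
    }
}
```

    For example, the first byte of "á" is 0xC3 (binary 11000011), so the sequence is 2 bytes long.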

    Please find my corrected version below:

    public String cut(String s, int charLimit) throws UnsupportedEncodingException {
        byte[] utf8 = s.getBytes("UTF-8");
        if (utf8.length <= charLimit) {
            return s;
        }
        int n16 = 0;
        boolean extraLong = false;
        int i = 0;
        while (i < charLimit) {
            // Unicode characters above U+FFFF need 2 words in utf16
            extraLong = ((utf8[i] & 0xF0) == 0xF0);
            if ((utf8[i] & 0x80) == 0) {
                i += 1;
            } else {
                int b = utf8[i];
                while ((b & 0x80) > 0) {
                    ++i;
                    b = b << 1;
                }
            }
            if (i <= charLimit) {
                n16 += (extraLong) ? 2 : 1;
            }
        }
        return s.substring(0, n16);
    }
    

    private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
        byte[] utf8 = s.getBytes("UTF-8");
        if (utf8.length <= charLimit) {
            return utf8;
        }
        if ((utf8[charLimit] & 0x80) == 0) {
            // the limit doesn't cut an UTF-8 sequence
            return Arrays.copyOf(utf8, charLimit);
        }
        int i = 0;
        while ((utf8[charLimit-i-1] & 0x80) > 0 && (utf8[charLimit-i-1] & 0x40) == 0) {
            ++i;
        }
        if ((utf8[charLimit-i-1] & 0x80) > 0) {
            // we have to skip the starter UTF-8 byte
            return Arrays.copyOf(utf8, charLimit-i-1);
        } else {
            // we passed all UTF-8 bytes
            return Arrays.copyOf(utf8, charLimit-i);
        }
    }
    

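    Interestingly, with a realistic limit of 20-500 bytes they perform almost the same if you create a String from the byte array again.

    Note that both methods assume valid UTF-8 input, which is a safe assumption after using Java's getBytes() function.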

        4
  •  4
  •   Ilya Lysenko    5 years ago
    String s = "FOOBAR";
    
    int limit = 3;
    s = new String(s.getBytes(), 0, limit);
    

    Resulting value of s:

    FOO
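
    This works for "FOOBAR" because every character there is a single byte. The sketch below, which assumes UTF-8, shows what can happen when the limit falls inside a multi-byte character: the incomplete sequence is decoded as the replacement character U+FFFD. The helper name cutRaw is made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class SplitCharDemo {
    /** Cuts the UTF-8 bytes of s at limit and decodes them again, without checking character boundaries. */
    static String cutRaw(String s, int limit) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(limit, bytes.length); // avoid reading past the end of the array
        return new String(bytes, 0, n, StandardCharsets.UTF_8);
    }
}
```

    With this helper, cutRaw("FOOBAR", 3) gives "FOO" as above, while cutting "\u00e9" (é, two bytes in UTF-8) at 1 byte leaves a dangling lead byte that is decoded as U+FFFD rather than dropped.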
    
        5
  •  3
  •   bmargulies    14 years ago

    Use the UTF-8 CharsetEncoder and look for CoderResult.OVERFLOW.
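
    A rough sketch of how that could look, assuming the idea is to encode into a ByteBuffer sized to the byte limit and read back how many chars the encoder consumed before it ran out of room. The names EncoderTruncate and truncate are hypothetical, and maxBytes is assumed to be non-negative.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class EncoderTruncate {
    /** Returns the longest prefix of s that encodes to at most maxBytes bytes of UTF-8. */
    static String truncate(String s, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // throws if maxBytes is negative
        CharBuffer in = CharBuffer.wrap(s);
        // The encoder stops before the first character that does not fit completely
        // and reports OVERFLOW; if everything fits it reports UNDERFLOW.
        CoderResult result = encoder.encode(in, out, true);
        if (!result.isOverflow() && !in.hasRemaining()) {
            return s; // the whole string fits
        }
        // in.position() is the number of chars consumed so far.
        return s.substring(0, in.position());
    }
}
```

    Because the encoder handles a surrogate pair as one unit, the cut should not land between its two halves. If the input contains an unpaired surrogate, the encoder stops there with a malformed-input result, so this sketch truncates at that point.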

        6
  •  2
  •   shadow    14 years ago
        7
  •  2
  •   Nissim Avitan    12 years ago

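    As noted, Peter Lawrey's solution has a major performance disadvantage (about 3,500 ms for 10,000 runs). Rex Kerr's is much better (about 500 ms for 10,000 runs), but its result is not accurate: it cuts far more than necessary (for example, instead of keeping 4,000 bytes it keeps 3,500). Here is my solution (about 250 ms for 10,000 runs), which assumes that a UTF-8 character is at most 4 bytes long (thanks, Wikipedia):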

    public static String cutWord (String word, int dbLimit) throws UnsupportedEncodingException{
        double MAX_UTF8_CHAR_LENGTH = 4.0;
        if(word.length()>dbLimit){
            word = word.substring(0, dbLimit);
        }
        if(word.length() > dbLimit/MAX_UTF8_CHAR_LENGTH){
            int residual=word.getBytes("UTF-8").length-dbLimit;
            if(residual>0){
                int tempResidual = residual,start, end = word.length();
                while(tempResidual > 0){
                    start = end-((int) Math.ceil((double)tempResidual/MAX_UTF8_CHAR_LENGTH));
                    tempResidual = tempResidual - word.substring(start,end).getBytes("UTF-8").length;
                    end=start;
                }
                word = word.substring(0, end);
            }
        }
        return word;
    }
    
        8
  •  1
  •   Peter Lawrey    15 years ago

    You could convert the String to bytes and then convert just those bytes back to a String.

    public static String substring(String text, int maxBytes) {
       StringBuilder ret = new StringBuilder();
       for(int i = 0;i < text.length(); i++) {
           // works out how many bytes a character takes, 
           // and removes these from the total allowed.
           if((maxBytes -= text.substring(i, i+1).getBytes().length) < 0) break;
           ret.append(text.charAt(i));
       }
       return ret.toString();
    }
    
        9
  •  0
  •   Baby Groot Duleendra    12 years ago

    stringtoConvert = stringtoConvert.replaceAll("^[\\s ]*", "").replaceAll("[\\s ]*$", "");
    
        10
  •  0
  •   Matt McMinn    11 years ago

    Here is mine:

    private static final int FIELD_MAX = 2000;
    private static final Charset CHARSET =  Charset.forName("UTF-8"); 
    
    public String trancStatus(String status) {
    
        if (status != null && (status.getBytes(CHARSET).length > FIELD_MAX)) {
            int maxLength = FIELD_MAX;
    
            int left = 0, right = status.length();
            int index = 0, bytes = 0, sizeNextChar = 0;
    
            while (bytes != maxLength && (bytes > maxLength || (bytes + sizeNextChar < maxLength))) {
    
                index = left + (right - left) / 2;
    
                bytes = status.substring(0, index).getBytes(CHARSET).length;
                sizeNextChar = String.valueOf(status.charAt(index + 1)).getBytes(CHARSET).length;
    
                if (bytes < maxLength) {
                    left = index - 1;
                } else {
                    right = index + 1;
                }
            }
    
            return status.substring(0, index);
    
        } else {
            return status;
        }
    }
    
        11
  •  0
  •   Saúl Martínez Vidals    10 years ago

    public static String substring(String s, int byteLimit) {
        if (s.getBytes().length <= byteLimit) {
            return s;
        }
    
        int n = Math.min(byteLimit-1, s.length()-1);
        do {
            s = s.substring(0, n--);
        } while (s.getBytes().length > byteLimit);
    
        return s;
    }
    
        12
  •  0
  •   Hans Brende    9 years ago

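    I improved on Peter Lawrey's solution so that it handles surrogate pairs accurately. In addition, I optimized it based on the fact that the maximum number of bytes per char in UTF-8 encoding is 3.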

    public static String substring(String text, int maxBytes) {
        for (int i = 0, len = text.length(); (len - i) * 3 > maxBytes;) {
            int j = text.offsetByCodePoints(i, 1);
            if ((maxBytes -= text.substring(i, j).getBytes(StandardCharsets.UTF_8).length) < 0)  
                return text.substring(0, i);
            i = j;
        }
        return text;
    }
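
    The 3-bytes-per-char bound can be checked with a few sample strings. This is only an illustrative sketch; the names BytesPerChar, utf8Length and fitsWithoutEncoding are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class BytesPerChar {
    /** Number of bytes s occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * True if s is guaranteed to fit in maxBytes without encoding it:
     * each Java char (UTF-16 code unit) contributes at most 3 UTF-8 bytes.
     */
    static boolean fitsWithoutEncoding(String s, int maxBytes) {
        return (long) s.length() * 3 <= maxBytes;
    }
}
```

    "A" is 1 char and 1 byte, "\u20ac" (the euro sign) is 1 char and 3 bytes, and "\uD83D\uDE00" (an emoji outside the BMP) is 2 chars and 4 bytes, so 2 bytes per char. No char ever needs more than 3 bytes, which is why the loop above can stop encoding once (len - i) * 3 no longer exceeds the remaining budget.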
    
        13
  •  0
  •   nafg    5 years ago

    import scala.annotation.tailrec

    // UTF-8 encoding of a string
    private def bytes(s: String) = s.getBytes("UTF-8")
    
    def truncateToByteLength(string: String, length: Int): String =
      if (length <= 0 || string.isEmpty) ""
      else {
        @tailrec
        def loop(badLen: Int, goodLen: Int, good: String): String = {
          assert(badLen > goodLen, s"""badLen is $badLen but goodLen is $goodLen ("$good")""")
          if (badLen == goodLen + 1) good
          else {
            val mid = goodLen + (badLen - goodLen) / 2
            val midStr = string.take(mid)
            if (bytes(midStr).length > length)
              loop(mid, goodLen, good)
            else
              loop(badLen, mid, midStr)
          }
        }
    
        loop(string.length * 2, 0, "")
      }
    