How to read a UTF-16 text file into a string in Go?

I can read the file into a byte array,

but when I convert it to a string,

it treats the UTF-16 bytes as ASCII.

How do I convert it correctly?

package main 

import ("fmt" 
"os" 
"bufio" 
) 

func main(){ 
    // read whole the file 
    f, err := os.Open("test.txt") 
    if err != nil { 
     fmt.Printf("error opening file: %v\n",err) 
     os.Exit(1) 
    } 
    r := bufio.NewReader(f) 
    var s,b,e = r.ReadLine() 
    if e==nil{ 
     fmt.Println(b) 
     fmt.Println(s) 
     fmt.Println(string(s)) 
    } 
}

Output:

false

[255 254 91 0 83 0 99 0 114 0 105 0 112 0 116 0 32 0 73 0 110 0 102 0 111 0 93 0 13 0]

��[ S c r i p t   I n f o ]

Update:

After testing the two examples, I now understand what the exact problem is.

On Windows, if I add a line break (CR+LF) at the end of the line, the CR is read as part of the line, because ReadLine cannot handle Unicode correctly ([0D 0A] = ok, [0D 00 0A 00] = not ok).

If the read function could recognize Unicode, it would understand [0D 00 0A 00] and return []uint16 rather than []byte.

So I think I should not use bufio.NewReader, since it cannot read UTF-16. I don't see any way for bufio.Reader.ReadLine to accept a parameter, such as a flag, indicating whether the text being read is UTF-8, UTF-16LE/BE or UTF-32. Is there a ReadLine function for Unicode text in the Go library?


2013-04-03 CL So

UTF-16, UTF-8, and Byte Order Marks are defined by the Unicode Consortium: UTF-16 FAQ, UTF-8 FAQ, and Byte Order Mark (BOM) FAQ.
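As a rough illustration of how a byte order mark can be recognized, the sketch below checks the first bytes of a buffer against the UTF-8 and UTF-16 BOM patterns. detectBOM is a hypothetical helper, not part of any Go package, and it deliberately ignores UTF-32, so a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE.

```go
package main

import "fmt"

// detectBOM is a hypothetical helper. It reports the encoding implied by a
// leading byte order mark and the length of that BOM in bytes.
// UTF-32 is not handled in this sketch.
func detectBOM(b []byte) (name string, size int) {
	switch {
	case len(b) >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF:
		return "UTF-8", 3
	case len(b) >= 2 && b[0] == 0xFE && b[1] == 0xFF:
		return "UTF-16BE", 2
	case len(b) >= 2 && b[0] == 0xFF && b[1] == 0xFE:
		return "UTF-16LE", 2
	}
	return "unknown", 0
}

func main() {
	// The first four bytes from the question's output: 255 254 91 0.
	name, size := detectBOM([]byte{255, 254, 91, 0})
	fmt.Println(name, size)
}
```

For the bytes shown in the question, the leading FF FE indicates little-endian UTF-16.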

Issue 4802: bufio: reading lines is too cumbersome

Reading lines from a file is too cumbersome in Go.

People are often drawn to bufio.Reader.ReadLine because of its name, but it has a weird signature, returning (line []byte, isPrefix bool, err error), and requires a lot of work.

ReadSlice and ReadString require a delimiter byte, which is almost always the obvious and unsightly '\n', and can also return both a line and an EOF.

Revision: f685026a2d38

bufio: new Scanner interface

Add a new, simple interface for scanning (probably textual) data, based on a new type called Scanner. It does its own internal buffering, so should be plausibly efficient even without injecting a bufio.Reader. The format of the input is defined by a "split function", by default splitting into lines.
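To show the shape of a split function, here is a minimal sketch that makes a Scanner hand back one 2-byte token at a time, which is the size of a UTF-16 code unit. The names splitCodeUnits and countTokens are invented for this example. It does not decode anything, it only illustrates the (advance, token, err) contract.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// splitCodeUnits is a hypothetical bufio.SplitFunc that emits one 2-byte
// token at a time.
func splitCodeUnits(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if len(data) >= 2 {
		return 2, data[:2], nil
	}
	if atEOF && len(data) == 1 {
		// A trailing odd byte is returned as a short final token.
		return 1, data, nil
	}
	// Ask the Scanner for more data, or stop if nothing is left at EOF.
	return 0, nil, nil
}

// countTokens runs a Scanner with splitCodeUnits over raw and counts the tokens.
func countTokens(raw []byte) int {
	sc := bufio.NewScanner(bytes.NewReader(raw))
	sc.Split(splitCodeUnits)
	n := 0
	for sc.Scan() {
		n++
	}
	return n
}

func main() {
	// "Hi" in UTF-16LE is two code units.
	fmt.Println(countTokens([]byte{0x48, 0x00, 0x69, 0x00}))
}
```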

go1.1beta1 released

You can download binary and source distributions from the usual place: https://code.google.com/p/go/downloads/list?q=go1.1beta1

Here is a program that uses the Unicode rules to convert the lines of a UTF-16 text file to Go UTF-8 encoded strings. The code has been revised to take advantage of the new bufio.Scanner interface in Go 1.1.

package main 

import (
    "bufio" 
    "bytes" 
    "encoding/binary" 
    "fmt" 
    "os" 
    "runtime" 
    "unicode/utf16" 
    "unicode/utf8" 
) 

// UTF16BytesToString converts UTF-16 encoded bytes, in big or little endian byte order, 
// to a UTF-8 encoded string. 
func UTF16BytesToString(b []byte, o binary.ByteOrder) string { 
    utf := make([]uint16, (len(b)+(2-1))/2) 
    for i := 0; i+(2-1) < len(b); i += 2 { 
     utf[i/2] = o.Uint16(b[i:]) 
    } 
    if len(b)/2 < len(utf) { 
     utf[len(utf)-1] = utf8.RuneError 
    } 
    return string(utf16.Decode(utf)) 
} 

// UTF-16 endian byte order 
const (
    unknownEndian = iota 
    bigEndian 
    littleEndian 
) 

// dropCREndian drops a terminal \r from the endian data. 
func dropCREndian(data []byte, t1, t2 byte) []byte { 
    if len(data) > 1 { 
     if data[len(data)-2] == t1 && data[len(data)-1] == t2 { 
      return data[0 : len(data)-2] 
     } 
    } 
    return data 
} 

// dropCRBE drops a terminal \r from the big endian data. 
func dropCRBE(data []byte) []byte { 
    return dropCREndian(data, '\x00', '\r') 
} 

// dropCRLE drops a terminal \r from the little endian data. 
func dropCRLE(data []byte) []byte { 
    return dropCREndian(data, '\r', '\x00') 
} 

// dropCR drops a terminal \r from the data when the byte order is not yet
// known, and reports the byte order that the \r implies.
func dropCR(data []byte) ([]byte, int) {
    if d := dropCRLE(data); len(d) != len(data) {
        return d, littleEndian
    }
    if d := dropCRBE(data); len(d) != len(data) {
        return d, bigEndian
    }
    return data, unknownEndian
}

// ScanUTF16LinesFunc returns a split function for a Scanner that returns each line of
// text, stripped of any trailing end-of-line marker. The returned line may 
// be empty. The end-of-line marker is one optional carriage return followed 
// by one mandatory newline. In regular expression notation, it is `\r?\n`. 
// The last non-empty line of input will be returned even if it has no 
// newline. 
func ScanUTF16LinesFunc(byteOrder binary.ByteOrder) (bufio.SplitFunc, func() binary.ByteOrder) { 

    // Function closure variables 
    var endian = unknownEndian 
    switch byteOrder { 
    case binary.BigEndian: 
     endian = bigEndian 
    case binary.LittleEndian: 
     endian = littleEndian 
    } 
    const bom = 0xFEFF 
    var checkBOM bool = endian == unknownEndian 

    // Scanner split function 
    splitFunc := func(data []byte, atEOF bool) (advance int, token []byte, err error) { 

     if atEOF && len(data) == 0 { 
      return 0, nil, nil 
     } 

     if checkBOM { 
      checkBOM = false 
      if len(data) > 1 { 
       switch uint16(bom) { 
       case uint16(data[0])<<8 | uint16(data[1]): 
        endian = bigEndian 
        return 2, nil, nil 
       case uint16(data[1])<<8 | uint16(data[0]): 
        endian = littleEndian 
        return 2, nil, nil 
       } 
      } 
     } 

     // Scan for newline-terminated lines. 
     i := 0 
     for { 
      j := bytes.IndexByte(data[i:], '\n') 
      if j < 0 { 
       break 
      } 
      i += j 
      switch e := i % 2; e { 
      case 1: // UTF-16BE 
       if endian != littleEndian { 
        if i > 1 { 
         if data[i-1] == '\x00' { 
          endian = bigEndian 
          // We have a full newline-terminated line. 
          return i + 1, dropCRBE(data[0 : i-1]), nil 
         } 
        } 
       } 
      case 0: // UTF-16LE 
       if endian != bigEndian { 
        if i+1 < len(data) { 
         i++ 
         if data[i] == '\x00' { 
          endian = littleEndian 
          // We have a full newline-terminated line. 
          return i + 1, dropCRLE(data[0 : i-1]), nil 
         } 
        } 
       } 
      } 
      i++ 
     } 

     // If we're at EOF, we have a final, non-terminated line. Return it. 
     if atEOF { 
      // drop CR. 
      advance = len(data) 
      switch endian { 
      case bigEndian: 
       data = dropCRBE(data) 
      case littleEndian: 
       data = dropCRLE(data) 
      default: 
       data, endian = dropCR(data) 
      } 
      if endian == unknownEndian { 
       if runtime.GOOS == "windows" { 
        endian = littleEndian 
       } else { 
        endian = bigEndian 
       } 
      } 
      return advance, data, nil 
     } 

     // Request more data. 
     return 0, nil, nil 
    } 

    // Endian byte order function 
    orderFunc := func() (byteOrder binary.ByteOrder) { 
     switch endian { 
     case bigEndian: 
      byteOrder = binary.BigEndian 
     case littleEndian: 
      byteOrder = binary.LittleEndian 
     } 
     return byteOrder 
    } 

    return splitFunc, orderFunc 
} 

func main() { 
    file, err := os.Open("utf16.le.txt") 
    if err != nil { 
     fmt.Println(err) 
     os.Exit(1) 
    } 
    defer file.Close() 
    fmt.Println(file.Name()) 

    rdr := bufio.NewReader(file) 
    scanner := bufio.NewScanner(rdr) 
    var bo binary.ByteOrder // unknown, infer from data 
    // bo = binary.LittleEndian // windows 
    splitFunc, orderFunc := ScanUTF16LinesFunc(bo) 
    scanner.Split(splitFunc) 

    for scanner.Scan() { 
     b := scanner.Bytes() 
     s := UTF16BytesToString(b, orderFunc()) 
     fmt.Println(len(s), s) 
     fmt.Println(len(b), b) 
    } 
    fmt.Println(orderFunc()) 

    if err := scanner.Err(); err != nil { 
     fmt.Println(err) 
    } 
}

Output:

utf16.le.txt 
15 "Hello, 世界" 
22 [34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 0 22 78 76 117 34 0] 
0 
0 [] 
15 "Hello, 世界" 
22 [34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 0 22 78 76 117 34 0] 
LittleEndian 

utf16.be.txt 
15 "Hello, 世界" 
22 [0 34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 78 22 117 76 0 34] 
0 
0 [] 
15 "Hello, 世界" 
22 [0 34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 78 22 117 76 0 34] 
BigEndian


2013-04-03 17:31:13 peterSO

Now I understand that the problem is not in the conversion, it is in ReadLine. So the question is updated. –

Here is an updated program to solve your problem. – peterSO

Thanks for your program. I will amend it based on your revision, because there are still many line break standards [link](http://en.wikipedia.org/wiki/Newline). Since there is no package in Go to read UTF-16, I think I should report this issue to Google, because nowadays a modern programming language should be able to process Unicode correctly, especially in internet applications. –

For example:

package main 

import (
     "errors" 
     "fmt" 
     "log" 
     "unicode/utf16" 
) 

func utf16toString(b []uint8) (string, error) { 
     if len(b)&1 != 0 { 
       return "", errors.New("len(b) must be even") 
     } 

     // Check BOM 
     var bom int 
     if len(b) >= 2 { 
       switch n := int(b[0])<<8 | int(b[1]); n { 
       case 0xfffe: 
         bom = 1 
         fallthrough 
       case 0xfeff: 
         b = b[2:] 
       } 
     } 

     w := make([]uint16, len(b)/2) 
     for i := range w { 
       w[i] = uint16(b[2*i+bom&1])<<8 | uint16(b[2*i+(bom+1)&1]) 
     } 
     return string(utf16.Decode(w)), nil 
} 

func main() { 
     // Simulated data from e.g. a file 
     b := []byte{255, 254, 91, 0, 83, 0, 99, 0, 114, 0, 105, 0, 112, 0, 116, 0, 32, 0, 73, 0, 110, 0, 102, 0, 111, 0, 93, 0, 13, 0} 
     s, err := utf16toString(b) 
     if err != nil { 
       log.Fatal(err) 
     } 

     fmt.Printf("%q", s) 
}

(Also here)

Output:

"[Script Info]\r"


2013-04-03 10:05:33 zzzz

I would also use 'encoding/binary' to read it as []uint16 to begin with. – cthom06

@cthom06: I wouldn't recommend that. – zzzz

@cthom06 Why? Beware that characters in UTF-16 are not always encoded in two bytes (that holds only for the BMP). –
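A small sketch of that last point, using the standard unicode/utf16 package: a character inside the BMP takes one 16-bit code unit, while a character outside it takes a surrogate pair, i.e. four bytes. The helper name codeUnits is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// codeUnits is a hypothetical helper that returns how many 16-bit code units
// UTF-16 needs to encode the rune r.
func codeUnits(r rune) int {
	return len(utf16.Encode([]rune{r}))
}

func main() {
	// U+4E16 is inside the BMP; U+1F600 is outside it.
	fmt.Println(codeUnits('\u4E16'), codeUnits('\U0001F600'))
}
```

This is why the answers above pass the []uint16 values through utf16.Decode instead of converting each code unit to a rune directly.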

The latest version of golang.org/x/text/encoding/unicode makes this easier because it includes unicode.BOMOverride, which interprets the BOM intelligently.

Here is ReadFileUTF16(), which is like os.ReadFile() but decodes UTF-16.

package main 

import (
    "bytes" 
    "fmt" 
    "io/ioutil" 
    "log" 
    "strings" 

    "golang.org/x/text/encoding/unicode" 
    "golang.org/x/text/transform" 
) 

// Similar to ioutil.ReadFile() but decodes UTF-16. Useful when 
// reading data from MS-Windows systems that generate UTF-16BE files, 
// but will do the right thing if other BOMs are found. 
func ReadFileUTF16(filename string) ([]byte, error) { 

    // Read the file into a []byte: 
    raw, err := ioutil.ReadFile(filename) 
    if err != nil { 
     return nil, err 
    } 

    // Make a transformer that converts MS-Win default to UTF8:
    win16be := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM) 
    // Make a transformer that is like win16be, but abides by BOM: 
    utf16bom := unicode.BOMOverride(win16be.NewDecoder()) 

    // Make a Reader that uses utf16bom: 
    unicodeReader := transform.NewReader(bytes.NewReader(raw), utf16bom) 

    // Decode to UTF-8 and return the result:
    decoded, err := ioutil.ReadAll(unicodeReader) 
    return decoded, err 
} 

func main() { 
    data, err := ReadFileUTF16("inputfile.txt") 
    if err != nil { 
     log.Fatal(err) 
    } 
    final := strings.Replace(string(data), "\r\n", "\n", -1) 
    fmt.Println(final) 

}

Here is NewScannerUTF16, which is like os.Open() but returns a scanner.

package main 

import (
    "bufio" 
    "fmt" 
    "log" 
    "os" 

    "golang.org/x/text/encoding/unicode" 
    "golang.org/x/text/transform" 
) 

type utfScanner interface { 
    Read(p []byte) (n int, err error) 
} 

// Creates a scanner similar to os.Open() but decodes the file as UTF-16. 
// Useful when reading data from MS-Windows systems that generate UTF-16BE 
// files, but will do the right thing if other BOMs are found. 
func NewScannerUTF16(filename string) (utfScanner, error) { 

    // Open the file:
    file, err := os.Open(filename) 
    if err != nil { 
     return nil, err 
    } 

    // Make a transformer that converts MS-Win default to UTF8:
    win16be := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM) 
    // Make a transformer that is like win16be, but abides by BOM: 
    utf16bom := unicode.BOMOverride(win16be.NewDecoder()) 

    // Make a Reader that uses utf16bom: 
    unicodeReader := transform.NewReader(file, utf16bom) 
    return unicodeReader, nil 
} 

func main() { 

    s, err := NewScannerUTF16("inputfile.txt") 
    if err != nil { 
     log.Fatal(err) 
    } 

    scanner := bufio.NewScanner(s) 
    for scanner.Scan() { 
     fmt.Println(scanner.Text()) // Println will add back the final '\n' 
    } 
    if err := scanner.Err(); err != nil { 
     fmt.Fprintln(os.Stderr, "reading inputfile:", err) 
    } 

}


2016-01-21 17:44:20 TomOnTime
