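Easygoing Introduction to Haskell Programming

Introduction

This time we look at Data.ByteString, the module for working with byte sequences (bytestrings). A bytestring is a data structure similar to a list, except that its elements are fixed to 1-byte (8-bit) integer values. A Haskell string is a list whose elements are of the character type (Char). A Char holds a Unicode character (4 bytes), and every list cell adds further overhead, so reading file data as a string wastes a good deal of memory. In such cases, bytestrings let you process the data efficiently.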

●Kinds of bytestring

Data.ByteString is not the only module for bytestrings. There are four modules, listed below.

Data.ByteString
Data.ByteString.Char8
Data.ByteString.Lazy
Data.ByteString.Lazy.Char8

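Data.ByteString is strictly evaluated, while the modules with Lazy in their names are lazily evaluated. Data.ByteString treats each element as a Word8 (an integer from 0 to 255), whereas the modules with Char8 in their names treat each element as an 8-bit character.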
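As a rough illustration of the Word8 view versus the Char8 view, the sketch below looks at the same strict bytestring through both modules. The names hello, asBytes, and asChars are made up for this example. Note that C.pack truncates any character above 255 to 8 bits, so it is only suitable for ASCII or Latin-1 text.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as C
import Data.Word (Word8)

-- Both modules share the same strict ByteString type;
-- only the element view (Word8 vs. Char) differs.
hello :: S.ByteString
hello = C.pack "hello"

-- View the contents as bytes (Word8 values).
asBytes :: S.ByteString -> [Word8]
asBytes = S.unpack

-- View the same contents as 8-bit characters.
asChars :: S.ByteString -> String
asChars = C.unpack
```

Loaded into ghci, asBytes hello gives the list of byte values and asChars hello gives back the original string.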

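With a strict bytestring, every element is evaluated. For example, reading a whole file into a strict bytestring requires enough memory to hold all of its data. A lazy bytestring, by contrast, is not evaluated one byte at a time but in larger blocks (the library documentation cites a default of 64K bytes). Such a block is called a "chunk". When a file is processed sequentially from the beginning, even a large file can be handled with a lazy bytestring. Of course, strict bytestrings can also process large files if you specify how many bytes to read at a time.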
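The following is a minimal sketch of sequential processing with a lazy bytestring. The function names countNewlines and chunkSizes are hypothetical, chosen only for this example. countNewlines walks through the file chunk by chunk, so the whole file does not have to be resident in memory at once, and chunkSizes shows how a lazy bytestring is divided.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as B
import Data.Int (Int64)

-- Count the newline bytes (10) in a file.
-- B.readFile reads lazily, so chunks are fetched as B.count consumes them.
-- ($!) forces the count before returning, so the file is fully consumed here.
countNewlines :: FilePath -> IO Int64
countNewlines path = do
  contents <- B.readFile path
  return $! B.count 10 contents

-- Sizes of the strict chunks that make up a lazy bytestring.
chunkSizes :: B.ByteString -> [Int]
chunkSizes = map S.length . B.toChunks
```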

●pack and unpack

The function pack converts a [Word8] into a bytestring. Conversely, unpack converts a bytestring into a [Word8]. Their types are as follows.

pack   :: [Word8] -> ByteString
unpack :: ByteString -> [Word8]

A simple example follows.

ghci> import qualified Data.ByteString as S
ghci> import qualified Data.ByteString.Lazy as B
ghci> a = S.pack [97..122]
ghci> a
"abcdefghijklmnopqrstuvwxyz"
ghci> S.unpack a
[97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,
117,118,119,120,121,122]

ghci> b = B.pack [97..122]
ghci> b
"abcdefghijklmnopqrstuvwxyz"
ghci> B.unpack b
[97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,
117,118,119,120,121,122]

ghci> S.empty
""
ghci> B.empty
""
ghci> :t S.empty
S.empty :: S.ByteString
ghci> :t B.empty
B.empty :: B.ByteString

The qualified imports give Data.ByteString the alias S and Data.ByteString.Lazy the alias B. The empty bytestring is defined as the value empty.

The function singleton creates a 1-byte bytestring.

singleton :: Word8 -> ByteString

ghci> S.singleton 97
"a"
ghci> B.singleton 97
"a"

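●Converting between lazy and strict bytestrings

The function toChunks, defined in the module Data.ByteString.Lazy, converts a lazy bytestring into a list of strict bytestrings. Conversely, fromChunks converts a list of strict bytestrings into a lazy bytestring. Their types are shown below, written with the aliases S (strict) and B (lazy) to make clear which ByteString is meant.

toChunks   :: B.ByteString -> [S.ByteString]
fromChunks :: [S.ByteString] -> B.ByteString

A simple example follows.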

ghci> a
"abcdefghijklmnopqrstuvwxyz"
ghci> b
"abcdefghijklmnopqrstuvwxyz"
ghci> :t a
a :: S.ByteString
ghci> :t b
b :: B.ByteString

ghci> B.fromChunks [a]
"abcdefghijklmnopqrstuvwxyz"
ghci> B.toChunks b
["abcdefghijklmnopqrstuvwxyz"]

ghci> c = B.fromChunks [S.pack [97..100], S.pack [101..105]]
ghci> c
"abcdefghi"
ghci> B.toChunks c
["abcd","efghi"]
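When a single strict bytestring is wanted rather than a list of chunks, Data.ByteString.Lazy also provides fromStrict and toStrict. The sketch below wraps them under hypothetical names (toLazy, toStrict') and shows that the same result can be built from toChunks and S.concat.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as B

-- Strict -> lazy: the strict bytestring becomes a single chunk.
toLazy :: S.ByteString -> B.ByteString
toLazy = B.fromStrict

-- Lazy -> strict: forces every chunk and joins them into one bytestring.
toStrict' :: B.ByteString -> S.ByteString
toStrict' = B.toStrict

-- The same conversion expressed with toChunks and S.concat.
toStrictViaChunks :: B.ByteString -> S.ByteString
toStrictViaChunks = S.concat . B.toChunks
```

Converting a lazy bytestring to a strict one loads all of its data into memory, so it gives up the benefit of chunked evaluation for large inputs.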

●Basic functions

Unlike lists, bytestrings cannot be taken apart with pattern matching. Data.ByteString provides functions that behave like their list counterparts, and you use those to manipulate bytestrings.
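As a small sketch of how recursion looks without the (x:xs) pattern, the function below uses S.uncons, which returns Nothing for an empty bytestring and otherwise the first byte paired with the rest. The name sumBytes is made up for this example.

```haskell
import qualified Data.ByteString as S

-- Sum all byte values by recursing with uncons
-- instead of matching on a list pattern such as (x:xs).
sumBytes :: S.ByteString -> Int
sumBytes bs = case S.uncons bs of
  Nothing        -> 0
  Just (x, rest) -> fromIntegral x + sumBytes rest
```

This is only meant to show the shape of the recursion. For real work, a fold from the library is usually the simpler choice.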

The basic functions are shown in the table below.

Table : Basic bytestring functions
Function	Type	Description
cons	Word8 -> ByteString -> ByteString	Adds 1 byte to the front of a bytestring
head	ByteString -> Word8	Returns the first byte of a bytestring
tail	ByteString -> ByteString	Removes the first byte of a bytestring
append	ByteString -> ByteString -> ByteString	Concatenates two bytestrings
null	ByteString -> Bool	Returns True if the bytestring is empty
length	ByteString -> Int (strict), ByteString -> Int64 (lazy)	Returns the length of a bytestring
reverse	ByteString -> ByteString	Reverses a bytestring

A simple example follows.

ghci> a
"abcdefghijklmnopqrstuvwxyz"
ghci> b
"abcdefghijklmnopqrstuvwxyz"
ghci> S.cons 65 a
"Aabcdefghijklmnopqrstuvwxyz"
ghci> B.cons 65 b
"Aabcdefghijklmnopqrstuvwxyz"
ghci> S.head a
97
ghci> B.head b
97
ghci> S.tail a
"bcdefghijklmnopqrstuvwxyz"
ghci> B.tail b
"bcdefghijklmnopqrstuvwxyz"
ghci> S.append (S.pack [97..100]) (S.pack [101..105])
"abcdefghi"
ghci> B.append (B.pack [97..100]) (B.pack [101..105])
"abcdefghi"
ghci> S.length a
26
ghci> B.length b
26
ghci> S.reverse a
"zyxwvutsrqponmlkjihgfedcba"
ghci> B.reverse b
"zyxwvutsrqponmlkjihgfedcba"
ghci> B.reverse $ B.append (B.pack [97..100]) (B.pack [101..105])
"ihgfedcba"

For a strict bytestring, length runs in O(1). Many other functions are provided as well, including higher-order functions such as map, foldl, and foldr, and functions such as take and drop. See the Data.ByteString documentation for details.
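A few of these functions can be sketched as follows. The helper names toUpperAscii, maxByte, and slice are hypothetical and exist only in this example; the library calls used are S.map, S.foldl', S.take, and S.drop.

```haskell
import qualified Data.ByteString as S
import Data.Word (Word8)

-- Convert ASCII lowercase letters (97..122) to uppercase by subtracting 32.
toUpperAscii :: S.ByteString -> S.ByteString
toUpperAscii = S.map up
  where
    up :: Word8 -> Word8
    up w | w >= 97 && w <= 122 = w - 32
         | otherwise           = w

-- Largest byte value, computed with a strict left fold (0 for empty input).
maxByte :: S.ByteString -> Word8
maxByte = S.foldl' max 0

-- Drop the first i bytes, then keep at most n bytes.
slice :: Int -> Int -> S.ByteString -> S.ByteString
slice i n = S.take n . S.drop i
```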

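●File I/O with bytestrings

Data.ByteString provides functions that perform I/O with bytestrings. The functions for standard input and output are shown in the table below.

Table : Functions for standard I/O
Function	Type	Description
getLine	IO ByteString	Reads one line from standard input and returns it as a bytestring
getContents	IO ByteString	Reads data from standard input and returns it as a bytestring
putStr	ByteString -> IO ()	Writes a bytestring to standard output
putStrLn	ByteString -> IO ()	Writes a bytestring to standard output followed by a newline [deprecated]

The function getLine is not supported for lazy bytestrings. For strict bytestrings, getContents is not lazy. In the bytestring library that ships with the version used here (GHC 8.8.4), putStrLn in Data.ByteString and Data.ByteString.Lazy is marked as deprecated. The examples below therefore use putStrLn from Data.ByteString.Char8 and Data.ByteString.Lazy.Char8, which also provide putStr.

A simple example follows.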

ghci> S.getLine
hello, world
"hello, world"
ghci> Data.ByteString.Char8.putStrLn $ S.pack [97 .. 110]
abcdefghijklmn
ghci> Data.ByteString.Lazy.Char8.putStrLn $ B.pack [97 .. 110]
abcdefghijklmn
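For line-oriented text, the Char8 module can be used on its own. The sketch below is one possible way to combine getContents and putStr with the Char8 functions lines and unlines; the names numberLines and numberStdin are made up for this example, and the input is assumed to be ASCII or Latin-1 text.

```haskell
import qualified Data.ByteString.Char8 as C

-- Prefix each line with its 1-based line number.
numberLines :: C.ByteString -> C.ByteString
numberLines = C.unlines . zipWith prefix [1 :: Int ..] . C.lines
  where
    prefix n line = C.concat [C.pack (show n), C.pack ": ", line]

-- Read all of standard input (strictly), number its lines,
-- and write the result to standard output.
numberStdin :: IO ()
numberStdin = C.getContents >>= C.putStr . numberLines
```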

Data.ByteString also provides the functions readFile and writeFile.

Table : Functions for files
Function	Type	Description
readFile	FilePath -> IO ByteString	Reads a file and returns its contents as a bytestring
writeFile	FilePath -> ByteString -> IO ()	Writes a bytestring to a file
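The two functions in the table can be combined in a few lines. The sketch below, with the hypothetical name saveReversed, reads a file strictly, reverses its bytes, and writes the result to another file.

```haskell
import qualified Data.ByteString as S

-- Read src as a strict bytestring, reverse its bytes, and write them to dst.
-- The whole file is held in memory, so this suits small files.
saveReversed :: FilePath -> FilePath -> IO ()
saveReversed src dst = do
  contents <- S.readFile src
  S.writeFile dst (S.reverse contents)
```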

A simple run follows. The file test00.txt has the contents shown below, and the session displays them.

hello, world
hello, Haskell
foo bar baz
oops! oops! oops!
abcd efgh ijkl

Figure : test00.txt

ghci> readFile "test00.txt"
"hello, world\nhello, Haskell\nfoo bar baz\noops! oops! oops!\nabcd efgh ijkl\n"

ghci> S.readFile "test00.txt"
"hello, world\nhello, Haskell\nfoo bar baz\noops! oops! oops!\nabcd efgh ijkl\n"

ghci> B.readFile "test00.txt"
"hello, world\nhello, Haskell\nfoo bar baz\noops! oops! oops!\nabcd efgh ijkl\n"

Data.ByteString also provides I/O functions that work on handles.

Table : Functions for handles
Function	Type	Description
hGetLine	Handle -> IO ByteString	Reads one line from a handle
hGetContents	Handle -> IO ByteString	Reads all data from a handle and returns it as a bytestring
hGet	Handle -> Int -> IO ByteString	Reads n bytes from a handle
hPut	Handle -> ByteString -> IO ()	Writes a bytestring to a handle
hPutStr	Handle -> ByteString -> IO ()	Same as hPut
hPutStrLn	Handle -> ByteString -> IO ()	Same as hPut, followed by a newline

The functions hGetLine and hPutStrLn are not supported for lazy bytestrings. If hGet reaches EOF partway through, it returns a bytestring shorter than the requested number of bytes. If the file is already at EOF, it returns an empty bytestring.
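The EOF behavior of hGet can be observed with a sketch like the one below. The function name blockSizes is hypothetical. It reads a file in blocks of at most n bytes (n is assumed to be positive) and returns the length of each block, so the last entry is shorter whenever the file size is not a multiple of n.

```haskell
import qualified Data.ByteString as S
import System.IO

-- Read a file in blocks of at most n bytes and return each block's length.
-- An empty block from S.hGet signals EOF and ends the loop.
blockSizes :: Int -> FilePath -> IO [Int]
blockSizes n path = withBinaryFile path ReadMode (loop [])
  where
    loop acc h = do
      block <- S.hGet h n
      if S.null block
        then return (reverse acc)
        else loop (S.length block : acc) h
```

For a 20-byte file and n = 8, the blocks have lengths 8, 8, and 4.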

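●Copying a file

As a simple exercise, let us write a function copyFile that copies a file using hGet and hPut on strict bytestrings. Using readFile and writeFile would be easier, but please bear with this version as a simple demonstration of hGet and hPut.

The program looks like this.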

Listing : Copying a file

import qualified Data.ByteString as S
import System.IO

-- Copy src to dst, reading and writing one small block at a time.
copyFile :: FilePath -> FilePath -> IO ()
copyFile src dst = do
  hs <- openFile src ReadMode
  hd <- openFile dst WriteMode
  copy hs hd
  hClose hs
  hClose hd
    where size = 8   -- block size in bytes (deliberately small)
          copy hs hd = do
            contents <- S.hGet hs size
            if S.null contents
              then return ()   -- EOF: nothing left to copy
              else do S.hPut hd contents
                      copy hs hd

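The actual work is done by the local function copy. S.hGet reads size bytes from the file handle hs. Since this is a simple example, size is deliberately set to a small value (8). At EOF, S.hGet returns an empty bytestring, which is checked with S.null. If the bytestring is empty, return wraps unit in IO and returns it. Otherwise, S.hPut writes contents to the output file handle hd, and copy calls itself recursively.

Let us run it.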

ghci> copyFile "test00.txt" "test000.txt"
ghci> S.readFile "test00.txt"
"hello, world\nhello, Haskell\nfoo bar baz\noops! oops! oops!\nabcd efgh ijkl\n"
ghci> S.readFile "test000.txt"
"hello, world\nhello, Haskell\nfoo bar baz\noops! oops! oops!\nabcd efgh ijkl\n"

The file has been copied correctly.
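Two possible variations on the copy are sketched below; both names are hypothetical and neither is part of the original program. copyFileSafe uses withBinaryFile so that both handles are closed even if an exception is raised partway through, and it uses a larger block size (4096, an arbitrary choice). copyFileLazy relies on lazy bytestrings instead and assumes that src and dst are different files.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as B
import System.IO

-- Block-by-block copy; withBinaryFile closes each handle on exit,
-- whether the copy finishes normally or raises an exception.
copyFileSafe :: FilePath -> FilePath -> IO ()
copyFileSafe src dst =
  withBinaryFile src ReadMode $ \hs ->
    withBinaryFile dst WriteMode $ \hd ->
      let loop = do
            block <- S.hGet hs blockSize
            if S.null block
              then return ()
              else S.hPut hd block >> loop
      in loop
  where
    blockSize = 4096  -- arbitrary block size chosen for this sketch

-- Lazy copy: chunks are read as B.writeFile consumes them.
-- Assumes src and dst are different paths.
copyFileLazy :: FilePath -> FilePath -> IO ()
copyFileLazy src dst = B.readFile src >>= B.writeFile dst
```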
