Split a file on any character in Python
April 15, 2010
I need to split a big text file on a certain character. I expect I am being thick about this, but split doesn’t quite do what I want because it includes the matching line, whereas I want to split right on the matching character.
My Python answer:
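def readlines(fileobj, endings, chunksize=4096):
    """Return a generator that yields the pieces of an open file, split
    after each occurrence of the given line-ending.
    """
    line = ''
    while True:
        buf = fileobj.read(chunksize)
        if not buf:
            # End of file: yield whatever is left, unless it is empty.
            if line:
                yield line
            break
        line = line + buf
        while endings in line:
            # Cut just after the delimiter so it stays with its piece.
            idx = line.index(endings) + len(endings)
            yield line[:idx]
            line = line[idx:]

if __name__ == "__main__":
    import os
    import sys
    FORMFEED = chr(12)  # ASCII 12
    basename = os.path.basename(sys.argv[1])
    # newline='' stops Python 3 from translating line endings on read or write.
    with open(sys.argv[1], newline='') as source:
        for num, data in enumerate(readlines(source, endings=FORMFEED)):
            filename = basename + '-' + str(num)
            with open(filename, 'w', newline='') as out:
                out.write(data)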
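To check the splitting behaviour without writing anything to disk, here is a minimal sketch of the same idea run against an in-memory file. It is a variant of the generator, not the code from the post: the name iter_chunks, the use of str.partition, and the sample text are all assumptions made for illustration. The tiny chunksize is there only to force the delimiter to land on chunk boundaries.

```python
import io

def iter_chunks(fileobj, delimiter, chunksize=4096):
    """Yield pieces of fileobj, each ending with delimiter (except possibly the last)."""
    pending = ''
    while True:
        buf = fileobj.read(chunksize)
        if not buf:
            break
        pending += buf
        while True:
            head, sep, tail = pending.partition(delimiter)
            if not sep:
                break  # no complete piece buffered yet; read more
            yield head + sep
            pending = tail
    if pending:
        # Trailing text after the last delimiter, if any.
        yield pending

# Hypothetical sample: three "pages" separated by form feeds (ASCII 12).
sample = io.StringIO('page one\fpage two\fpage three')
pieces = list(iter_chunks(sample, '\f', chunksize=4))
print(pieces)
```

Because the unread remainder is carried over between reads, a delimiter longer than one character should still be found when it straddles two chunks.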
This is also useful when reading data exported from some old-fashioned Mac application like FileMaker 5, where the line-endings are ASCII 13 (carriage return), not ASCII 10 (line feed).
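For the carriage-return case specifically, Python 3's built-in open may be enough on its own, since its newline argument accepts '\r'. The sketch below assumes Python 3; the helper name read_cr_lines and the sample data are invented for illustration and do not come from a real export.

```python
import os
import tempfile

def read_cr_lines(path):
    """Return the lines of a text file whose only line terminator is ASCII 13."""
    # With newline='\r', lines end only at '\r' and are returned untranslated.
    with open(path, newline='\r') as f:
        return list(f)

# Hypothetical sample data standing in for an old Mac export.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'name\tcity\rAda\tLondon\rAlan\tWilmslow')
    path = tmp.name

lines = read_cr_lines(path)
os.remove(path)
print(lines)
```

This only covers the line-ending values that open accepts, so splitting on something like a form feed still needs a generator of the kind shown earlier.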
This post was inspired by Lotus Notes version 8.5, which is so advanced that to save a message in a file on disk you have to export it as structured text. And if you want to save a whole bunch of messages as individual files, you must forget that drag-and-drop was introduced with System 7; that would be too obvious.