Despite the march of technology, plain text files are still a popular way to transfer and process data. Comma-separated values (CSV) files are still used to import and export data between spreadsheets, databases, online tools and other data repositories.
The script shown below will take a text file and split it into a number of smaller files based on a specified line count. This works for normal text as well as CSV files. I use it to split large data sets into smaller batches for import into database systems.
    #!/usr/bin/env python

    # Import module
    import os


    # Define file_splitter function
    def file_splitter(fullfilepath, lines=50):
        """Splits a plain text file based on line count."""
        path, filename = os.path.split(fullfilepath)
        basename, ext = os.path.splitext(filename)
        f_out = None
        # Open source text file
        with open(fullfilepath, 'r') as f_in:
            try:
                # Read input file one line at a time
                for i, line in enumerate(f_in):
                    # When current line index can be divided by the line
                    # count, close the output file (if one is open)
                    # and open the next one
                    if i % lines == 0:
                        if f_out is not None:
                            f_out.close()
                        f_output = os.path.join(path, '{}_{}{}'.format(basename, i, ext))
                        f_out = open(f_output, 'w')
                    # Write current line to output file
                    f_out.write(line)
            finally:
                # Close last output file, if any was opened
                if f_out is not None:
                    f_out.close()


    # Call function with source text file and line count
    file_splitter('input_file.txt', 20)
This example takes the file “input_file.txt” and splits it into a series of files containing a maximum of 20 lines each. The last file may contain fewer than 20 lines, depending on the number of lines in the source file.
The output files are named after the input file and numbered with the zero-based index of the first line each one contains. The example would produce a set of output files named:
- input_file_0.txt
- input_file_20.txt
- input_file_40.txt
- input_file_60.txt
- etc …
The exact number of files produced will depend on the number of lines in the source file and the number specified in the call to file_splitter.
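The relationship between line count and output files can be worked out ahead of time. The following is a minimal sketch, assuming a non-empty source file and the naming scheme used by the script above. The helper name `expected_output_names` and its `total_lines` argument are hypothetical and are not part of the original script.

```python
import os


def expected_output_names(fullfilepath, total_lines, lines=50):
    """Predict the output file names for a source file of total_lines lines.

    Hypothetical helper for illustration only; the caller has to supply
    the number of lines in the source file.
    """
    path, filename = os.path.split(fullfilepath)
    basename, ext = os.path.splitext(filename)
    # One file per full batch, plus one more for any remainder
    file_count = (total_lines + lines - 1) // lines
    # Each file is numbered with the index of the first line it holds
    return [
        os.path.join(path, '{}_{}{}'.format(basename, n * lines, ext))
        for n in range(file_count)
    ]


names = expected_output_names('input_file.txt', 70, 20)
print(len(names))  # 4
print(names[-1])   # input_file_60.txt
```

With 70 lines and a line count of 20, the first three files hold 20 lines each and the last one holds the remaining 10.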
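file_splitter must be passed a path that locates the source file. A bare file name such as 'input_file.txt' is resolved against the current working directory, which is often but not always the location of the Python script. If the file is somewhere else you will need to include an appropriate parent path or drive letter. The output files are written to the same directory as the source file.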
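One way to avoid surprises with relative paths is to build an absolute path before calling the function. This is only a sketch: `resolve_input_path` is a hypothetical helper, and the choice of base directory is an assumption that depends on where your data actually lives.

```python
import os


def resolve_input_path(filename, base_dir):
    """Return an absolute path to filename inside base_dir.

    Hypothetical helper for illustration, not part of the original script.
    """
    return os.path.abspath(os.path.join(base_dir, filename))


# base_dir is an assumption here; inside a script it could instead be
# os.path.dirname(os.path.abspath(__file__)) to mean "next to the script"
source = resolve_input_path('input_file.txt', os.getcwd())
print(source)
# The result could then be passed on, e.g. file_splitter(source, 20)
```

Because the output files are written next to the source file, an absolute source path also makes it clear where the split files will end up.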