Is it safe to slice bytes directly to split a big file?

In my case, the big file is a tar.gz: I have myBigFile.tar.gz, 52GB in size, and I split it with a chunk size of 2GB, so I have 27 part files.

Here is the code I wrote from scratch:

from glob import glob
import filecmp
import os
import shutil

CHUNK_SIZE = 2097152000  # bytes
# CHUNK_SIZE = 1000000  # bytes
# CHUNK_SIZE = 2  # bytes

ORIGINAL_FILE_DIR = './data/original'
SPLITTED_FILE_DIR = './data/splitted'
JOINED_FILE_DIR = './data/joined'


def get_original_filepath(filename):
  return f'{ORIGINAL_FILE_DIR}/{filename}'


def get_splitted_filepath(filename, overwrite=False):
  partspath = f'{SPLITTED_FILE_DIR}/{filename}.parts'
  if overwrite:
    # os.rmdir() only removes *empty* directories; shutil.rmtree() also
    # clears out parts left behind by a previous run.
    shutil.rmtree(partspath, ignore_errors=True)
    os.makedirs(partspath, exist_ok=True)
  return partspath


def get_joined_filepath(filename):
  return f'{JOINED_FILE_DIR}/{filename}'


def get_part_extension(part, pad_num=8):
  if isinstance(part, int):
    return f'{part:0{pad_num}d}.part'
  elif isinstance(part, str):
    return f'{part}.part'
  else:
    raise TypeError(f'Unknown type of <part>: {type(part)}')


def get_part_filename(filename, part, pad_num=8):
  part_extension = get_part_extension(part, pad_num)
  return f'{filename}.{part_extension}'


def get_file_size(filepath):
  return os.path.getsize(filepath)


def get_number_of_chunks(total_size, chunk_size):
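  # Ceiling division: add one extra chunk for any remainder bytes.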
  return total_size // chunk_size + (total_size % chunk_size > 0)


def is_directory_empty(directory_path):
  try:
    # An empty listing means there are no part files to join.
    return len(os.listdir(directory_path)) == 0
  except FileNotFoundError:
    # Treat a missing directory the same as an empty one.
    return True


def split_file(filename, chunk_size=CHUNK_SIZE):
  original_path = get_original_filepath(filename)
  if get_file_size(original_path) == 0:
    # Fail fast instead of silently writing empty parts.
    raise ValueError('E: Original file is empty!')
  splitted_path = get_splitted_filepath(filename, overwrite=True)
  with open(original_path, 'rb') as readfile:
    number_of_chunks = get_number_of_chunks(get_file_size(original_path),
                                            chunk_size)
    for part in range(number_of_chunks):
      chunk = readfile.read(chunk_size)
      part_filename = get_part_filename(filename, part,
                                        len(str(number_of_chunks)))
      with open(f'{splitted_path}/{part_filename}', 'wb') as writefile:
        writefile.write(chunk)


def join_file(filename):
  splitted_path = get_splitted_filepath(filename)
  joined_path = get_joined_filepath(filename)
  if is_directory_empty(splitted_path):
    raise FileNotFoundError('E: Split parts not found!')
  part = '*'  # wildcard
  part_filename = get_part_filename(filename, part)
  # glob() returns files in arbitrary order, so sort to concatenate the
  # parts in sequence (the zero-padded numbering makes this safe).
  partfiles = sorted(
      os.path.normpath(fn) for fn in glob(f'{splitted_path}/{part_filename}'))
  with open(joined_path, 'wb') as writefile:
    for partfile in partfiles:
      with open(partfile, 'rb') as readfile:
        # Stream each part in small buffers via shutil.copyfileobj instead
        # of loading a whole 2 GB part into memory. 'wb' (not 'ab') ensures
        # a rerun does not append onto a previously joined file.
        shutil.copyfileobj(readfile, writefile)


def compare_file(filename):
  # Specify the paths of the two files
  file1_path = get_original_filepath(filename)
  file2_path = get_joined_filepath(filename)

  # shallow=False forces a byte-by-byte comparison instead of trusting
  # os.stat() signatures.
  return f'{filename} is identical.' if filecmp.cmp(
      file1_path, file2_path, shallow=False) else f'{filename} is not identical.'


filename = 'myBigFile.tar.gz'

split_file(filename)
join_file(filename)
print(compare_file(filename))
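
Note that split_file reads a whole chunk_size slice into memory before writing it out, which with a 2GB chunk means roughly 2GB of RAM per part. Here is a minimal sketch of a streaming variant built on the same helpers as above (split_file_streaming and the 1 MiB buffer size are my own additions, not part of the original script):

def split_file_streaming(filename, chunk_size=CHUNK_SIZE,
                         buffer_size=1024 * 1024):
  original_path = get_original_filepath(filename)
  splitted_path = get_splitted_filepath(filename, overwrite=True)
  number_of_chunks = get_number_of_chunks(get_file_size(original_path),
                                          chunk_size)
  with open(original_path, 'rb') as readfile:
    for part in range(number_of_chunks):
      part_filename = get_part_filename(filename, part,
                                        len(str(number_of_chunks)))
      with open(f'{splitted_path}/{part_filename}', 'wb') as writefile:
        # Copy at most chunk_size bytes per part, one small buffer at a time.
        remaining = chunk_size
        while remaining > 0:
          buffer = readfile.read(min(buffer_size, remaining))
          if not buffer:  # reached the end of the original file
            break
          writefile.write(buffer)
          remaining -= len(buffer)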

So the splitted_path looks like this:

./data/splitted/myBigFile.tar.gz.parts/myBigFile.tar.gz.00.part
./data/splitted/myBigFile.tar.gz.parts/myBigFile.tar.gz.01.part
...
./data/splitted/myBigFile.tar.gz.parts/myBigFile.tar.gz.26.part

I know that I can just use a Unix utility such as tar, zip, or another archiver.

I also tested it on a small file with a small CHUNK_SIZE, and it joined the file without any problem.

You can split a binary file at any byte point you like.

If you were splitting a text file you could still split it at any byte point, but you might end up splitting in the middle of a multi-byte Unicode character. However, provided you concatenated the parts before trying to interpret the contents, this would not be an issue. (And you'd have to concatenate the parts of a binary file before trying to process its contents too, so there's no difference.)
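
A quick way to convince yourself of this is a minimal Python sketch (the sample string and 3-byte slice size are arbitrary choices for illustration):

# Splitting UTF-8 bytes mid-character is harmless as long as the raw
# parts are rejoined before decoding.
data = 'naïve café'.encode('utf-8')
parts = [data[i:i + 3] for i in range(0, len(data), 3)]

try:
  # An individual part may end in the middle of a multi-byte character...
  parts[0].decode('utf-8')
except UnicodeDecodeError as e:
  print('partial decode failed:', e)

# ...but concatenating the bytes first always restores the original.
assert b''.join(parts).decode('utf-8') == 'naïve café'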

Note that using a variable number of digits for the output pieces, as your Python code does, means you can't use a trivial cat myBigFile.tar.gz.*.part to reconstitute the original. (For 26 parts you'd get 1, 10, 11, 12 … 19, 2, 20, 21 … 26, 3, 4, 5, 6, 7, 8, 9 in that order.)
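
You can see the problem by sorting some hypothetical unpadded part names the way shell globbing orders them, i.e. lexicographically (a quick Python sketch; these names are for illustration only):

# Lexicographic order, which shell glob expansion uses, is not numeric order.
names = [f'myBigFile.tar.gz.{n}.part' for n in range(1, 27)]
print(sorted(names)[:4])
# ['myBigFile.tar.gz.1.part', 'myBigFile.tar.gz.10.part',
#  'myBigFile.tar.gz.11.part', 'myBigFile.tar.gz.12.part']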

Here’s how I would split myBigFile.tar.gz into 2GB parts, using your own naming convention:

split --bytes=2G --numeric-suffixes=1 --suffix-length=2 --additional-suffix=.part myBigFile.tar.gz myBigFile.tar.gz.

See man split for details of the command line switches.

Example output files:

myBigFile.tar.gz.01.part
myBigFile.tar.gz.02.part
myBigFile.tar.gz.03.part
…

Having got these files you can then use a simple command and shell globbing to reconstitute the original:

cat myBigFile.tar.gz.??.part >myBigFile.tar.gz
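
If you want to verify the round trip, the same filecmp check used in the question's compare_file works here too. A minimal sketch, with placeholder paths you'd adjust to your layout:

import filecmp

# Placeholder paths: point these at the original file and the file
# rebuilt by cat.
print(filecmp.cmp('original/myBigFile.tar.gz',
                  'myBigFile.tar.gz',
                  shallow=False))  # True if byte-for-byte identical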
Answered By: roaima