Why do tar and gzip files usually have a file extension?
File extensions are not necessary on unices, yet every tarred, gzipped or bzipped file I encounter has a file extension like .tar or .gz.
Is there any special reason for that or is that just convention?
They may not need an extension, but it sure makes identifying them easier in the output of ls.
File extensions are primarily a convention for the humans who use the system. There are tools which do use the filename extension to do things. For example Nautilus shows me a different icon based on the file extension.
If I gave you a file called file, you might not know how to open this file. However, if I gave you a file named file.tar you could quickly and easily figure it out.
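As a rough illustration of a tool that relies on the name alone, here is a small sketch using Python's standard mimetypes module. The file names are made up for the example and the files do not need to exist, since only the name is inspected.

```python
import mimetypes

# The lookup is driven purely by the file name; nothing is opened or read.
db = mimetypes.MimeTypes()


def guess(name: str):
    """Return (mime_type, encoding) guessed from the file name only."""
    return db.guess_type(name)


if __name__ == "__main__":
    # Hypothetical names, chosen only to illustrate the point above.
    for name in ("file", "file.tar", "file.tar.gz"):
        mime_type, encoding = guess(name)
        print(f"{name!r}: type={mime_type}, encoding={encoding}")
```

With no extension the lookup has nothing to go on, which is the same position a human is in when handed a file simply called file.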
Originally, on unix systems, the extensions on file names were a matter of convention. They allowed a human being to choose the right program to open a file. The modern convention is to use extensions in most cases; common exceptions are:
- Only regular files have an extension, not directories or device names. The mere fact of being a directory or device is enough file type indication.
- Executables that are meant to be invoked directly don’t have an extension. The mere fact of being executable is enough information for the user, and the kernel doesn’t care about file names.
- Files beginning with a word in all caps are often text files, e.g. TODO. Sometimes there is an additional part that indicates a subcategory, e.g.
- Files whose name begins with a dot are configuration or state files of a particular application, and often don’t have an extension, e.g.
- There are a few traditional cases, e.g.
(These are common cases, not hard-and-fast rules.)
Most binary file formats also contain some kind of header that describes properties of the file, and typically allows the file format to be identified through magic numbers. The file command looks at this information and shows you its guesses.
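To make the idea of magic numbers concrete, here is a minimal sketch of content-based detection. It is not how the file command is implemented (that uses a far larger database of patterns); the table below is a tiny hand-picked subset, and the names MAGIC, sniff and sniff_file are invented for this example.

```python
import bz2
import gzip

# (offset, magic bytes, label): a tiny illustrative subset, not a full database.
MAGIC = [
    (0, b"\x1f\x8b", "gzip"),
    (0, b"BZh", "bzip2"),
    (0, b"PK\x03\x04", "zip"),   # non-empty zip archives start with this
    (257, b"ustar", "tar"),      # POSIX/GNU tar header magic
]


def sniff(data: bytes) -> str:
    """Guess a format from the leading bytes of a file."""
    for offset, magic, label in MAGIC:
        if data[offset:offset + len(magic)] == magic:
            return label
    return "unknown"


def sniff_file(path) -> str:
    """Read only the first 512 bytes of a file and guess its format."""
    with open(path, "rb") as f:
        return sniff(f.read(512))


if __name__ == "__main__":
    samples = {
        "gzip data": gzip.compress(b"hello"),
        "bzip2 data": bz2.compress(b"hello"),
        "plain text": b"hello",
    }
    for description, data in samples.items():
        print(description, "->", sniff(data))
```

Note that the check works regardless of what the file is called, which is why the extension is not strictly needed.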
Sometimes the file extension gives more information than the file format, sometimes it’s the other way round. For example many file formats consist of a zip archive: Java libraries (.jar), OpenOffice documents (.odt, …), Microsoft Office documents (.docx, …), etc. Another example is source code files, where the extension indicates the programming language, which can be difficult for a computer to guess automatically from the file contents. Conversely, some extensions are wildly ambiguous, for example .o is used for compiled code files (object files), but inspection of the file contents usually easily reveals what machine type and operating system the object file is for.
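A small sketch of the zip-container point: a shallow content check can tell that something is a zip archive, but not which of the formats built on zip it is meant to be. The archive below is a made-up stand-in built in memory (it is not a valid .jar, .odt or .docx), and the helper names are invented for the example.

```python
import io
import zipfile


def make_archive() -> bytes:
    """Build a tiny zip archive in memory (a stand-in, not a real document)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.txt", "hello\n")
    return buf.getvalue()


def looks_like_zip(data: bytes) -> bool:
    """Shallow content check: is this byte string a zip archive at all?"""
    return zipfile.is_zipfile(io.BytesIO(data))


if __name__ == "__main__":
    payload = make_archive()
    # The same bytes under three hypothetical names: the content check gives
    # the same answer each time, so only the name hints at the intended use.
    for name in ("library.jar", "report.odt", "letter.docx"):
        print(name, "is a zip archive:", looks_like_zip(payload))
```

A deeper inspection of the archive members could often tell these formats apart, but that costs more than reading the extension.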
An advantage of the extension is that it’s a lot faster to recognize it than to open the file and look for magic sequences. For example completion of file names in shells is almost always based on the name (mainly the extension), because reading every file in a large directory can take a long time whereas just reading the file names is fast enough for a Tab press.
Sometimes changing a file’s extension can allow you to say how a file is to be interpreted, when two file formats are almost, but not wholly identical. For example a web server might treat .shtml and .html differently, the former undergoing some server-side preprocessing, the latter being served as-is.
In the case of gzip archives, gzip won’t recompress files whose name ends in .gz, .tgz and a few other extensions. That way you can run gzip * to compress every file in a directory, and already compressed files are not modified.
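The skip-by-name behaviour can be sketched as follows. This is an imitation written for illustration, not gzip's actual code: SKIP_SUFFIXES is a deliberately incomplete list, and compress_all and demo are hypothetical helpers.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

# Illustrative subset only; the real gzip recognises a few more suffixes.
SKIP_SUFFIXES = (".gz", ".tgz")


def compress_all(directory: Path) -> list[str]:
    """Compress each regular file in `directory`, skipping names that already
    look compressed. The decision is made from the name, not the contents."""
    compressed = []
    # sorted() takes a snapshot, so files created below are not revisited.
    for path in sorted(directory.iterdir()):
        if not path.is_file() or path.name.endswith(SKIP_SUFFIXES):
            continue
        target = path.with_name(path.name + ".gz")
        with open(path, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # like gzip, replace the original with the .gz file
        compressed.append(path.name)
    return compressed


def demo() -> tuple[list[str], list[str]]:
    """Run compress_all on a throwaway directory with made-up file names."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "notes.txt").write_bytes(b"some text\n")
        (root / "backup.tgz").write_bytes(b"pretend archive")
        (root / "old.gz").write_bytes(b"pretend gzip data")
        done = compress_all(root)
        remaining = sorted(p.name for p in root.iterdir())
        return done, remaining


if __name__ == "__main__":
    done, remaining = demo()
    print("compressed:", done)
    print("directory now contains:", remaining)
```

Because the check is on the name only, a compressed file without a recognised suffix would be compressed a second time, which is one practical reason to keep the extension.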