2.0 More On Working With Files

2.1 Analyzing File Contents

Once you have created a file, you may want to take advantage of the many file-handling utilities available with the UNIX operating system. Some utilities are especially useful when working with files of written text; most are also equally useful for program files, and some even extend their usefulness to binary program files. The tasks these utilities perform include analyzing, sorting, searching, and collating; the data they work with ranges from entire files to individual characters within files.


The wc, or word count, command is used to count the number of lines, words, and/or characters in a file. Typing wc filename1 filename2 ... filenameN with no flags and pressing RETURN will cause wc to print the number of lines, words, and characters in each of files filename1 through filenameN, one file per line. The -l flag will list the line count, -w will list the word count, and -c will list the character count. Any combination of flags can be used with the wc command (e.g., wc -lw filename will cause wc to list both the line and word counts for the named file, but not the number of characters).
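As a minimal sketch of the flags described above, the following creates a small file and counts it in several ways. The scratch directory and the name sample.txt are made up for illustration.

```shell
# Work in a throwaway directory (an assumption of this example)
workdir=$(mktemp -d)
cd "$workdir"
printf 'one two three\nfour five\n' > sample.txt

wc sample.txt        # lines, words, and characters, then the file name
wc -lw sample.txt    # line and word counts only

# Reading from standard input keeps the file name out of the output,
# which is convenient when only the number is wanted
line_count=$(wc -l < sample.txt)
word_count=$(wc -w < sample.txt)
char_count=$(wc -c < sample.txt)
echo "lines:" $line_count "words:" $word_count "characters:" $char_count
```

The file holds 2 lines, 5 words, and 24 characters (newlines included), so those are the three numbers to expect.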


The sort command sorts the lines of the input file(s) alphabetically by line or by a specified section (field) of each line. To execute, enter sort -flags filename1 filename2...filenameN where flags is the list of added flags. All lines in files filename1 through filenameN are then collected, sorted, and sent to standard output as a list (by default, printed on the screen). By adding the appropriate flags, sort will ignore case (-f); spaces at beginnings of lines (-b); characters outside of the ASCII range 040-176, such as control characters (-i); or characters other than letters, numbers, and spaces (-d). There are many other flags for sort; for a more detailed look, see its man page.
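A short sketch of sorting two files together, with and without the -f flag. The file names are hypothetical, and LC_ALL=C is an addition of this example (not something the text above requires) that forces plain ASCII ordering so the result is the same on every system.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'pear\nBanana\napple\n' > fruit.txt
printf 'Cherry\n' > more.txt

# In ASCII order, capital letters come before lowercase ones
plain=$(LC_ALL=C sort fruit.txt more.txt)
echo "$plain"      # Banana, Cherry, apple, pear (one per line)

# With -f, case is ignored when comparing lines
folded=$(LC_ALL=C sort -f fruit.txt more.txt)
echo "$folded"     # apple, Banana, Cherry, pear (one per line)
```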


The uniq utility is useful for dealing with files that have adjacent duplicate lines. Typing
uniq filename

and pressing RETURN would print the file filename, but with the second and succeeding copies of repeated lines removed. If the order of lines is not important, duplicate lines can be made adjacent using sort (see above). Adding the -u flag causes only non-repeated lines to be printed. Adding the -d flag causes only a single copy of each repeated line to be printed. The uniq utility will ignore the first number characters of each line if the +number flag is included, or the first number words (words, or fields, are sets of characters separated by spaces or tabs) if the -number flag is included. Newer versions of uniq spell these two flags -s number and -f number, respectively.
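The combination of sort and uniq can be sketched as follows. The file names are invented for the example, and only the default behavior and the -u and -d flags are shown.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'red\nblue\nred\nred\ngreen\nblue\n' > colors.txt

# Sort first so that duplicate lines become adjacent
sort colors.txt > sorted.txt

all_once=$(uniq sorted.txt)          # every distinct line, once each
only_single=$(uniq -u sorted.txt)    # lines that are never repeated
only_repeated=$(uniq -d sorted.txt)  # one copy of each repeated line

echo "$all_once"        # blue, green, red (one per line)
echo "$only_single"     # green
echo "$only_repeated"   # blue, red (one per line)
```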


Three commands that compare two files are diff, comm, and cmp. They are used by typing
command filename1 filename2

and pressing RETURN, where command is any of these three commands. The diff command lists lines that differ between the two files. The diff utility usually finds the simplest differences between the two files: if the only difference between them is a line existing in one file but not in the other, diff lists only that single line as being different, rather than listing each subsequent (offset) line. With the -e flag, diff produces a list of ed commands that would change filename1 into filename2. (For more information on ed, type man ed.)

Other possible flags include -b, which causes diff to treat any run of whitespace (any combination of tabs or spaces) as equal to any other, and -w, which causes diff to ignore all whitespace. The -i flag makes diff case insensitive in its comparisons (e.g., "a" would be equal to "A"), and the -cNumber flag (with no space between c and the number) causes each difference to be listed in context: the Number of lines previous to, and Number of lines after each differing line are also listed.
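A minimal sketch of diff with the -i and -e flags on two made-up files, old.txt and new.txt. Because diff returns a nonzero exit status when the files differ, the example appends "|| true" so that it also runs in scripts that stop on errors; that guard is a choice of this example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'alpha\nbeta\ngamma\n' > old.txt
printf 'alpha\nBeta\ngamma\ndelta\n' > new.txt

# Plain diff reports both the changed line 2 and the added line 4
diff old.txt new.txt || true

# Ignoring case, "beta" and "Beta" match, so only the added line remains
case_insensitive=$(diff -i old.txt new.txt || true)
echo "$case_insensitive"

# An ed script that would turn old.txt into new.txt
ed_script=$(diff -e old.txt new.txt || true)
echo "$ed_script"
```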


Like diff, the comm utility also expects the names of two files as its arguments, but expects, in addition, that the files be previously sorted. The comm utility can display those lines that are only in the first file, those that are only in the second file, those that are in both files, or any combination thereof.

Typing comm filename1 filename2 and pressing RETURN (with no flags) displays three columns of output containing the three categories of lines listed above, in the same order. The flags -1, -2, and -3 suppress the corresponding columns, and any combination of them may be used (e.g., comm -13 filename1 filename2 suppresses the first and third columns, so it displays only the lines found only in filename2; comm -2 filename1 filename2 displays two columns: the lines only in filename1, and the lines in both filename1 and filename2).
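The following sketch shows comm on two small, already sorted files (the names first.txt and second.txt are hypothetical). Each numeric flag hides the column of that number, so naming two of them leaves a single column.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'apple\nbanana\ncherry\n' > first.txt
printf 'banana\ncherry\ndate\n' > second.txt

comm first.txt second.txt                      # all three columns

only_first=$(comm -23 first.txt second.txt)    # hide columns 2 and 3
only_second=$(comm -13 first.txt second.txt)   # hide columns 1 and 3
in_both=$(comm -12 first.txt second.txt)       # hide columns 1 and 2

echo "$only_first"     # apple
echo "$only_second"    # date
echo "$in_both"        # banana, cherry (one per line)
```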


The cmp utility compares two files, one byte at a time. If the two files are identical, it reports nothing. If the files differ, it reports the byte and line number at which they first differ.
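Since cmp also signals its result through its exit status, it is convenient in scripts. The sketch below uses three invented files; the exact wording of the message cmp prints varies between systems, so none is shown here.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'hello world\n' > a.txt
printf 'hello World\n' > b.txt
cp a.txt c.txt

# Identical files: cmp prints nothing and succeeds
if cmp a.txt c.txt; then same="yes"; else same="no"; fi

# Different files: cmp reports the first difference and fails
if cmp a.txt b.txt; then differ="no"; else differ="yes"; fi

echo "a.txt and c.txt identical: $same"
echo "a.txt and b.txt differ: $differ"
```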

grep, egrep, fgrep

Most searching on UNIX systems is handled by the grep utilities: grep, egrep, and fgrep. They differ only in the syntax used to specify the character string to be searched for. The grep utility is the most commonly used. It can be executed by typing
grep regularExpression filename

where regularExpression is an ed-style string (see the man pages for ed), and where filename is the name of a file or set of files to be searched. The egrep utility works the same way, but accepts an extended regular-expression syntax. The fgrep utility also works the same way, but it can search only for fixed (literal) strings: no wildcard characters are allowed.

There are many different uses for the grep utilities. By default, grep prints the contents of the line in which a matching string was found. Adding the -n flag would cause grep to list each matching line's position within the file as well. Since it is often necessary to read a set of lines surrounding a matching character string, this line number can be used as a reference mark when viewing the file. The -l flag is used when searching multiple files; it causes grep to list the names of those files in which matching strings were found. If desired, the named files can then be searched or viewed.

Other flags include -i, which causes grep to ignore case when comparing, and -w, which causes grep to treat the search string as a word (delimited by spaces, punctuation, etc.). For more information, see their man pages.
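The grep flags discussed in this section can be tried on a pair of small files, as in this sketch. The file names and their contents are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'The cat sat.\nA dog barked.\nConcatenate the files.\n' > pets.txt
printf 'No felines here.\n' > other.txt

grep cat pets.txt       # matches lines 1 and 3 ("Concatenate" contains "cat")
grep -i the pets.txt    # ignoring case, matches "The" and "the"

numbered=$(grep -n dog pets.txt)           # line number, then the line
files=$(grep -l cat pets.txt other.txt)    # names of files with a match
whole_word=$(grep -w cat pets.txt)         # "cat" as a separate word only

echo "$numbered"      # 2:A dog barked.
echo "$files"         # pets.txt
echo "$whole_word"    # The cat sat.
```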

2.2 File Types

When storing a file, the UNIX operating system stores several pieces of information about the file, which can be listed with the ls -l command. A string of ten characters (the "permission field") appears at the left of the output from the ls -l command. The first of these characters describes the type of file (e.g., "d" for a directory or "-" for an ordinary file). The remaining nine describe who may read, write to, or execute the given file. The first, second, and third three-character sets that make up this field correspond, respectively, to the permission settings for the owner of the file, all system users in the owner's group, and all other users. Within each of these three sets, the three characters correspond to the read, write, and execute permissions for that class of users. NOTE: for a directory, the execute permission enables searching of the directory.


The chmod program changes the privileges on files that you own. Typing chmod xxx filename and pressing RETURN will change the privileges to your specifications. Each x is a digit from 0 to 7 that gives a certain combination of privileges, with the first digit representing your privileges, the second your group's, and the third everyone else's. To get the digit you want, add together the values of the privileges you wish to grant:
4 - read privileges

2 - write privileges

1 - execute privileges

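For example, typing chmod 755 filename and pressing RETURN gives the owner read, write, and execute privileges (7 = 4 + 2 + 1) and gives the group and everyone else read and execute privileges (5 = 4 + 1). The ls -l output for filename would then look like this: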

-rwxr-xr-x 1 yourLoginName student 20000 Jan 1 12:00 filename

For more information about chmod and its other capabilities, see its man page.
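As a brief sketch, the permission field can be watched changing as chmod is applied. The file name script.sh is hypothetical, and cut is used here only to trim the ls -l output down to its first ten characters.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch script.sh

# 7 = 4+2+1 (read, write, execute) for the owner;
# 5 = 4+1 (read, execute) for the group and for everyone else
chmod 755 script.sh
mode_755=$(ls -l script.sh | cut -c1-10)
echo "$mode_755"    # -rwxr-xr-x

# 6 = 4+2 (read, write) for the owner; 4 (read) for the group;
# 0 (nothing) for everyone else
chmod 640 script.sh
mode_640=$(ls -l script.sh | cut -c1-10)
echo "$mode_640"    # -rw-r-----
```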

2.3 Conserving Space and Archiving

Listed below are some UNIX utilities that help conserve disk space by compressing files and archiving. Wesleyan also has resources that make it possible for the user to take advantage of external data storage.


Since disk space for files is at a premium, data compression is a valuable means of maximizing available storage space. The compress utility implements a widely used standard for file compression on UNIX systems. The compress command is usually used without flags; type compress filename and press RETURN. A new file that is a compressed version of the old file will be created with the same name as the old file, but with the extension ".Z" added. The original file will be deleted. Larger text files see the most dramatic reductions in size, usually 50 to 60 percent. Smaller files and binary (program) files will see much less reduction, although the amount will vary greatly from file to file.

Once compressed, files cannot be used directly: they must be extracted with the uncompress utility. Type uncompress filename and press RETURN. When using uncompress on a file, you may include the .Z extension in the filename, but it is not mandatory.

The zcat utility allows you to print a compressed text file without first using uncompress. To use it, type zcat filename and press RETURN. The zcat utility will print an uncompressed copy of the file to the standard output (the screen) while leaving the original compressed file unchanged. This uncompressed standard output may then be redirected or piped as the user sees fit (e.g., it can be piped to more for viewing--see Section 1.6, "Redirections and Pipes").
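The round trip through compress, zcat, and uncompress might look like the sketch below. It assumes the compress utility is installed, which is not the case on every modern system, so the commands are wrapped in a check that skips them otherwise. The file name notes.txt is invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build a repetitive text file so there is something worth compressing
i=0
while [ "$i" -lt 200 ]; do
  echo "the quick brown fox jumps over the lazy dog"
  i=$((i + 1))
done > notes.txt
original_lines=$(wc -l < notes.txt)

if command -v compress >/dev/null 2>&1; then
  compress notes.txt            # notes.txt is replaced by notes.txt.Z
  ls -l notes.txt.Z             # the compressed file is much smaller
  zcat notes.txt.Z | wc -l      # read the text without uncompressing it
  uncompress notes.txt.Z        # notes.txt is restored, notes.txt.Z removed
else
  echo "compress is not installed; skipping"
fi

restored_lines=$(wc -l < notes.txt)
echo "lines before:" $original_lines "lines after:" $restored_lines
```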

Tar Files for Organization

Although it does not reduce the size of stored files, the tar command (for "tape archive") is nevertheless a valuable organizational tool that can reduce unnecessary disk use. It combines a group of files and/or directories into a single tar archive file. Traditionally, the tar command is used to store data on magnetic tapes, but it is valuable in its own right for archiving files on disk, making file transfers (see Section 3.2, "ftp") and compression a simpler task. Each tar file maintains a table of contents that can be listed without un-archiving, further aiding the removal of duplicate files. When a tar file is un-archived, any directory structure present at the time of archiving is restored.

Archiving multiple disk files/directories to a single tar file is done by adding the c flag (for "create"; no hyphen is needed before flags when using the tar command). To make a tar file, type
tar cf tarFilename file1...fileN

where tarFilename is the name of the archived file you are creating, and where file1 through fileN are names of the files or directories you wish to archive. To extract a tar file, use the x flag: tar xf tarFilename. Other useful flags include v, which prints more information about tar's activities as it works, and t, which displays the table of contents of the tar file. (To see the table of contents, enter tar tf tarFilename.) For more features, see the man page.
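The c, t, and x flags can be sketched together on a small directory tree. All of the directory and file names below are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p project/docs
echo "notes" > project/docs/readme.txt
echo "data" > project/data.txt

tar cf project.tar project       # create the archive from the directory
listing=$(tar tf project.tar)    # table of contents, without extracting
echo "$listing"

# Extract somewhere else to show that the directory structure comes back
mkdir restore
cd restore
tar xf ../project.tar
restored=$(cat project/docs/readme.txt)
echo "$restored"                 # notes
```

Adding v (for example, tar xvf ../project.tar) would list each file as it is extracted.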
