UNIX
2.0 More On Working With Files
2.1 Analyzing File Contents
Once you have created a file, you may want to take advantage of the many
file-handling utilities available with the UNIX operating system. Some utilities
are especially useful when working with files of written text; most are
equally useful for program source files, and some also work on binary
program files. The tasks these utilities perform include analyzing,
sorting, searching, and collating; the data they work with ranges from entire
files to individual characters within files.
wc
The wc, or word count, command is used to count the number of lines,
words, and/or characters in a file. Typing wc filename1 filename2
... filenameN with no flags and pressing RETURN will cause wc
to print the number of lines, words, and characters in each of files filename1
through filenameN, one file per line. The -l flag
will list the line count, -w will list the word count, and -c
will list the character count (strictly, the number of bytes, which is
the same thing for plain ASCII text). Any combination of flags can be used
with the wc command (e.g., wc -lw filename will cause wc
to list both the line and word counts for the named file, but not the number
of characters).
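As a minimal sketch of the flags described above, the following creates a small scratch file and counts its contents. The file name notes.txt and the variable names are hypothetical, and tr (not covered in this section) is assumed to be available to strip the padding some versions of wc add.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
printf 'one two three\nfour five\n' > "$workdir/notes.txt"

# All three counts (lines, words, characters), followed by the file name.
wc "$workdir/notes.txt"

# Only the line and word counts.
wc -lw "$workdir/notes.txt"

# Reading from standard input suppresses the file name, which makes
# the count easier to capture in a variable.
line_count=$(wc -l < "$workdir/notes.txt" | tr -d ' ')
word_count=$(wc -w < "$workdir/notes.txt" | tr -d ' ')
echo "lines=$line_count words=$word_count"

rm -rf "$workdir"
```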
sort
The sort command sorts the lines of the input file(s) alphabetically
by line or by a specified section (field) of each line. To execute, enter
sort -flags filename1 filename2...filenameN where flags
is the list of added flags. All lines in files filename1
through filenameN are then collected, sorted, and sent to
standard output as a list (by default, printed on the screen). By adding
the appropriate flags, sort will ignore case (-f); spaces
at beginnings of lines (-b); characters outside of the ASCII range
040-176, such as control characters (-i); or characters other than
letters, numbers, and spaces (-d). There are many other flags for
sort; for a more detailed look, see its man page.
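The effect of the -f flag can be seen with a short list that mixes upper and lower case. This is only a sketch: the file name is hypothetical, and LC_ALL=C is an assumption made here so that the unflagged sort uses plain ASCII ordering (capital letters first) on every system.

```shell
workdir=$(mktemp -d)
printf 'pear\napple\nBanana\n' > "$workdir/fruit.txt"

# Plain ASCII order: capital letters sort before lower-case ones.
sorted_plain=$(LC_ALL=C sort "$workdir/fruit.txt")
echo "$sorted_plain"

# With -f, case is ignored when comparing lines.
sorted_fold=$(LC_ALL=C sort -f "$workdir/fruit.txt")
echo "$sorted_fold"

rm -rf "$workdir"
```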
uniq
The uniq utility is useful for dealing with files that have adjacent
duplicate lines. Typing
uniq filename
and pressing RETURN would print the file filename, but with
second and succeeding copies of repeated lines removed. If the order of
lines is not important, duplicate lines can be made adjacent using sort
(see above). The addition of the -u flag would
cause only non-repeated lines to be printed. Addition of the -d
flag would cause only a single copy of each repeated line to be printed.
The uniq utility will ignore the first number characters
if the +number flag is included, or it will ignore the first
number words (words, or fields, are sets of characters separated
by spaces or tabs) if the -number flag is included. Many current versions
of uniq spell these two flags -s number and -f number
instead; check the man page on your system.
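The following sketch shows why duplicate lines must be adjacent, and what the -u and -d flags print. The file name and its contents are made up for illustration, and LC_ALL=C is assumed only to keep the sort order predictable.

```shell
workdir=$(mktemp -d)
printf 'b\na\nb\nc\na\nb\n' > "$workdir/list.txt"

# No duplicates are adjacent here, so uniq alone removes nothing.
unsorted_count=$(uniq "$workdir/list.txt" | wc -l | tr -d ' ')

# Sorting first makes the duplicates adjacent.
deduped=$(LC_ALL=C sort "$workdir/list.txt" | uniq)
only_unique=$(LC_ALL=C sort "$workdir/list.txt" | uniq -u)
only_repeated=$(LC_ALL=C sort "$workdir/list.txt" | uniq -d)

echo "uniq alone kept $unsorted_count lines"
echo "sort | uniq:    $(echo $deduped)"
echo "sort | uniq -u: $(echo $only_unique)"
echo "sort | uniq -d: $(echo $only_repeated)"

rm -rf "$workdir"
```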
diff
Three commands that compare two files are diff, comm,
and cmp. They are used by typing
command filename1 filename2
and pressing RETURN, where command is any of these three
commands. The diff command lists lines that differ between the
two files. The diff utility usually finds the simplest differences
between the two files: if the only difference between them is a line existing
in one file but not in the other, diff would list only that single
line as being different, rather than listing each subsequent (offset) line.
By adding a -e flag, diff will produce a list of ed commands
which would change filename1 into filename2. (For more
information on ed, type man ed.)
Other possible flags include -b, which causes diff to treat any run of
whitespace (any combination of Tabs or Spaces) as equal to any other, and -w,
which causes diff to ignore all white space. The -i flag makes
diff case insensitive in its comparisons (e.g., "a" would
be equal to "A"), and the -cNumber flag (with no space
between c and the number) causes each line to be listed in context:
the Number of lines previous to, and Number
of lines after each differing line are also listed.
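As an illustration of the single-line case described above, the sketch below compares two hypothetical files that differ by one missing line. Note that diff exits with a non-zero status when the files differ, so "|| true" is added here (an assumption about how the commands are run) to keep a strict script from stopping.

```shell
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/old.txt"
printf 'a\nc\n'    > "$workdir/new.txt"

# Default output: only the one missing line is reported, marked with "<"
# because it appears in the first file but not the second.
diff_out=$(diff "$workdir/old.txt" "$workdir/new.txt" || true)
echo "$diff_out"

# With -e, diff prints ed commands that would turn old.txt into new.txt.
diff -e "$workdir/old.txt" "$workdir/new.txt" || true

rm -rf "$workdir"
```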
comm
Like diff, the comm utility also expects the names of
two files as its arguments, but expects, in addition, that the files be
previously sorted. The comm utility can display those lines that
are only in the first file, those that are only in the second file, those
that are in both files, or any combination thereof.
Typing comm filename1 filename2 and pressing RETURN (with
no flags) displays three columns of output containing the three categories
of lines listed above, in the same order. The flags -1,
-2, and -3 each suppress the corresponding column, and they
can be combined (e.g., comm -12 filename1 filename2 will display
only the lines in both filename1 and filename2, while
comm -13 filename1 filename2 will display only the lines found
only in filename2).
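Because the flags name the columns to leave out rather than the ones to show, a short experiment helps. This sketch uses two small, already sorted, hypothetical files; each combination of two flags leaves exactly one column.

```shell
workdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$workdir/first.txt"
printf 'banana\ncherry\ndate\n'  > "$workdir/second.txt"

# All three columns: only in first, only in second, in both.
comm "$workdir/first.txt" "$workdir/second.txt"

# Suppress columns 2 and 3: lines found only in the first file.
only_first=$(comm -23 "$workdir/first.txt" "$workdir/second.txt")
# Suppress columns 1 and 3: lines found only in the second file.
only_second=$(comm -13 "$workdir/first.txt" "$workdir/second.txt")
# Suppress columns 1 and 2: lines found in both files.
in_both=$(comm -12 "$workdir/first.txt" "$workdir/second.txt")

echo "only in first:  $only_first"
echo "only in second: $only_second"
echo "in both:        $(echo $in_both)"

rm -rf "$workdir"
```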
cmp
The cmp utility compares two files, one byte at a time. If the
two files are identical, it reports nothing. If the files differ, it reports
the byte and line number at which they first differ.
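A brief sketch of both outcomes follows, using hypothetical file names. It relies on cmp's exit status (zero when the files match, non-zero when they differ) in addition to the message it prints.

```shell
workdir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$workdir/a.txt"
printf 'alpha\nbeta\n' > "$workdir/same.txt"
printf 'alpha\nbeTa\n' > "$workdir/b.txt"

# Identical files: cmp prints nothing and succeeds.
if cmp "$workdir/a.txt" "$workdir/same.txt"; then
    same_result=identical
else
    same_result=different
fi

# Differing files: cmp reports the first difference
# (here the ninth byte, which is on line 2) and fails.
if cmp "$workdir/a.txt" "$workdir/b.txt"; then
    diff_result=identical
else
    diff_result=different
fi

echo "a.txt vs same.txt: $same_result"
echo "a.txt vs b.txt:    $diff_result"

rm -rf "$workdir"
```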
grep, egrep, fgrep
Most searching on UNIX systems is handled by the grep utilities:
grep, egrep, and fgrep. They differ only in the
syntax in which the character string to be searched for is specified. The
grep utility is the most commonly used. It can be executed by typing
grep regularExpression filename
where regularExpression is an ed-style string (see
the man pages for ed), and where filename
is the name of a file or set of files to be searched. The egrep utility
works the same way, but accepts an extended regular expression syntax.
The fgrep utility also works the same way, but it can search only for
fixed strings: every character is matched literally, so no wildcard
characters are allowed.
There are many different uses for the grep utilities. By default,
grep prints the contents of the line in which a matching string
was found. Adding the -n flag would cause grep to list
each matching line's position within the file as well. Since it is often
necessary to read a set of lines surrounding a matching character string,
this line number can be used as a reference mark when viewing the file.
The -l flag is used when searching multiple files; it causes grep
to list the names of those files in which matching strings were found. If
desired, the named files can then be searched or viewed.
Other flags include -i, which causes grep to ignore case when comparing,
and -w, which causes grep to treat the search string as
a word (delimited by spaces, punctuation, etc.). For more information, see
their man pages.
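The flags above can be tried on a couple of small files. The file names and contents in this sketch are invented for illustration; the search string "cat" is chosen because it also occurs inside a longer word.

```shell
workdir=$(mktemp -d)
printf 'The cat sat\nconcatenate\nA Cat nap\n' > "$workdir/poem.txt"
printf 'dog\n' > "$workdir/other.txt"

# -n prefixes each matching line with its line number.
numbered=$(grep -n 'cat' "$workdir/poem.txt")
echo "$numbered"

# -i ignores case and -w matches whole words only, so "concatenate"
# is skipped while "Cat" is found.
whole_words=$(grep -iw 'cat' "$workdir/poem.txt")
echo "$whole_words"

# -l lists only the names of the files that contain a match.
matching_files=$(cd "$workdir" && grep -l 'cat' poem.txt other.txt)
echo "$matching_files"

rm -rf "$workdir"
```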
2.2 File Types
When storing a file, the UNIX operating system stores several bits of information
about the file, which can be listed with the ls -l command. A string
of ten characters (the "permission field") appears at the left
of the output from the ls -l command. The first of these characters
describes the type of file (e.g., "d" for directory or
"-" for an ordinary file). The remaining nine describe
who may read, write to, or execute the given file. These nine characters
form three sets of three: the first set gives the permission settings for
the owner of the file, the second for all system users in the owner's group,
and the third for all other users. Within each set, the three characters
correspond to the read/write/execute permissions (r, w, and x) for
that class of user; a - in place of a letter means the permission is not
granted. NOTE: for a directory, the execute permission enables
searching of the directory.
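To see the file-type character in practice, the sketch below creates one directory and one ordinary file and picks out the first character of each permission field. The names are hypothetical; the -d flag (which makes ls describe a directory itself rather than its contents) and the cut utility are not covered in this section and are assumed to be available.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/subdir"
echo "text" > "$workdir/plain.txt"

# The full listing: the permission field is the first column.
ls -l "$workdir"

# Keep only the first character of the permission field.
dir_type=$(ls -ld "$workdir/subdir" | cut -c1)
file_type=$(ls -l "$workdir/plain.txt" | cut -c1)
echo "subdir is type '$dir_type', plain.txt is type '$file_type'"

rm -rf "$workdir"
```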
chmod
The chmod program changes the privileges on files that
you own. Typing chmod xxx filename and pressing RETURN will
change the privileges to your specifications. Each x is
a digit from 0 to 7 that gives a certain combination
of privileges, with the first digit representing your privileges, the second
your group's, and the third everyone else's privileges. To get each digit,
add together the values of the privileges you want to grant (0 grants none):
4 - read privileges
2 - write privileges
1 - execute privileges
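For example, typing chmod 755 filename and pressing RETURN
gives you read, write, and execute privileges (4+2+1 = 7) and gives your
group and everyone else read and execute privileges (4+1 = 5). The result
(represented by the ls -l output) is: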
-rwxr-xr-x 1 yourLoginName student 20000 Jan 1 12:00 filename
For more information about chmod and its other capabilities, see
its man page.
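The arithmetic behind a mode such as 755 can be worked through in the shell itself. In this sketch the file name and variable names are hypothetical, and cut (not covered in this section) is assumed to be available to trim the ls -l output to the ten-character permission field.

```shell
workdir=$(mktemp -d)
echo 'echo hello' > "$workdir/script.sh"

# Build each digit by adding the values of the wanted privileges.
owner=$((4 + 2 + 1))   # read + write + execute
group=$((4 + 1))       # read + execute
other=$((4 + 1))       # read + execute
mode="$owner$group$other"

chmod "$mode" "$workdir/script.sh"

# Show the resulting permission field.
perms=$(ls -l "$workdir/script.sh" | cut -c1-10)
echo "$mode -> $perms"

rm -rf "$workdir"
```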
2.3 Conserving Space and Archiving
Listed below are some UNIX utilities that help conserve disk space by compressing
files and archiving. Wesleyan also has resources that make it possible for
the user to take advantage of external data storage.
compress/uncompress/zcat
Since disk space for files is at a premium, data compression is a valuable
means of maximizing available storage space. The compress utility
implements a widely used standard for file compression on UNIX systems.
The compress command is usually used without flags; type compress
filename and press RETURN. A new file that is a compressed version
of the old file will be created with the same name as the old name, but
with the extension ".Z" added. The original file will
be deleted. Larger text files see the most dramatic reductions in size, usually
50 to 60 percent. Smaller files and binary (program) files will see much
less reduction, although the amount will vary greatly from file to file.
Once compressed, files cannot be used directly: they must be extracted with
the uncompress utility. Type uncompress filename
and press RETURN. When using uncompress on a file, you may include
the .Z extension in the filename, but it is not mandatory.
The zcat utility allows you to print a compressed text file without
first using uncompress. To use it, type zcat filename and
press RETURN. The zcat utility will print an uncompressed copy
of the file to the standard output (the screen) while leaving the original
compressed file unchanged. This uncompressed standard output may then be
redirected or piped as the user sees fit (e.g., it can be piped to more
for viewing--see Section 1.6, "Redirections
and Pipes").
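The round trip described above might look like the following sketch. It makes several assumptions: the file names are hypothetical, the test file is built from one repeated sentence so that it compresses well, and because compress is not installed on every modern system the commands are skipped when it is missing.

```shell
workdir=$(mktemp -d)

# Build a repetitive text file large enough for compression to pay off.
i=0
while [ "$i" -lt 500 ]; do
    echo "the quick brown fox jumps over the lazy dog"
    i=$((i + 1))
done > "$workdir/big.txt"
cp "$workdir/big.txt" "$workdir/reference.txt"

roundtrip=skipped
if command -v compress >/dev/null 2>&1 && command -v zcat >/dev/null 2>&1; then
    # compress replaces big.txt with big.txt.Z.
    if compress "$workdir/big.txt"; then
        ls -l "$workdir"    # compare the size of big.txt.Z with reference.txt
        # zcat prints the uncompressed text and leaves big.txt.Z alone;
        # here the output is piped to cmp to check it against the copy.
        if zcat "$workdir/big.txt.Z" | cmp - "$workdir/reference.txt"; then
            roundtrip=ok
        else
            roundtrip=failed
        fi
        # uncompress restores big.txt and removes big.txt.Z.
        uncompress "$workdir/big.txt.Z"
    fi
fi
echo "round trip: $roundtrip"

rm -rf "$workdir"
```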
Tar Files for Organization
Although it does not reduce the size of stored files, the tar command
(for "tape archive") is nevertheless a valuable organizational
tool that can reduce unnecessary disk use. It combines a group of files
and/or directories into a single tar archive file. Traditionally,
the tar command is used to store data on magnetic tapes, but it
is valuable in its own right for archiving files on disk, making file transfers
(see Section 3.2, "ftp")
and compression a simpler task. Each tar file maintains a table
of contents that can be listed without un-archiving, further aiding the
removal of duplicate files. When a tar file is un-archived, any
directory structure present at the time of archiving is restored.
Archiving multiple disk files/directories to a single tar file
is done with the c flag (for "create"; no hyphen
is needed before flags when using the tar command). To make a tar
file, type
tar cf tarFilename file1...fileN
where tarFilename is the name of the archived file you are
creating, and where file1 through fileN
are names of the files or directories you wish to archive. To extract a
tar file, use the x flag: tar xf tarFilename.
Other useful flags include v, which prints more information about
tar's activities as it works, and t, which displays the
table of contents of the tar file. (To see the table of contents,
enter tar tf tarFilename.) For more features, see the man
page.
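The create, list, and extract steps can be sketched together as follows. The directory and file names are hypothetical; the archive is extracted into a separate directory so that the restored directory structure can be checked without overwriting the originals.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/project/docs"
echo "hello" > "$workdir/project/readme.txt"
echo "notes" > "$workdir/project/docs/notes.txt"

# Create the archive from the project directory.
(cd "$workdir" && tar cf project.tar project)

# List the table of contents without extracting anything.
listing=$(tar tf "$workdir/project.tar")
echo "$listing"

# Extract into a separate directory; the docs subdirectory is recreated.
mkdir "$workdir/restore"
(cd "$workdir/restore" && tar xf ../project.tar)
restored=$(cat "$workdir/restore/project/docs/notes.txt")
echo "restored file contains: $restored"

rm -rf "$workdir"
```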