sort

aina ya mistari ya files Nakala

   File: coreutils.info,  Node: sort invocation,  Next: shuf invocation,  Up: Operating on sorted files

7.1 `sort': Sort text files
===========================

`sort' sorts, merges, or compares all the lines from the given files,
or standard input if none are given or for a FILE of `-'.  By default,
`sort' writes the results to standard output.  Synopsis:

     sort [OPTION]... [FILE]...

   `sort' has three modes of operation: sort (the default), merge, and
check for sortedness.  The following options change the operation mode:

`-c'
`--check'
`--check=diagnose-first'
     Check whether the given file is already sorted: if it is not all
     sorted, print a diagnostic containing the first out-of-order line
     and exit with a status of 1.  Otherwise, exit successfully.  At
     most one input file can be given.

`-C'
`--check=quiet'
`--check=silent'
     Exit successfully if the given file is already sorted, and exit
     with status 1 otherwise.  At most one input file can be given.
     This is like `-c', except it does not print a diagnostic.

`-m'
`--merge'
     Merge the given files by sorting them as a group.  Each input file
     must always be individually sorted.  It always works to sort
     instead of merge; merging is provided because it is faster, in the
     case where it works.


   A pair of lines is compared as follows: `sort' compares each pair of
fields, in the order specified on the command line, according to the
associated ordering options, until a difference is found or no fields
are left.  If no key fields are specified, `sort' uses a default key of
the entire line.  Finally, as a last resort when all keys compare
equal, `sort' compares entire lines as if no ordering options other
than `--reverse' (`-r') were specified.  The `--stable' (`-s') option
disables this "last-resort comparison" so that lines in which all
fields compare equal are left in their original relative order.  The
`--unique' (`-u') option also disables the last-resort comparison.

   Unless otherwise specified, all comparisons use the character
collating sequence specified by the `LC_COLLATE' locale.(1)

   GNU `sort' (as specified for all GNU utilities) has no limit on
input line length or restrictions on bytes allowed within lines.  In
addition, if the final byte of an input file is not a newline, GNU
`sort' silently supplies one.  A line's trailing newline is not part of
the line for comparison purposes.

   Exit status:

     0 if no error occurred
     1 if invoked with `-c' or `-C' and the input is not sorted
     2 if an error occurred

   If the environment variable `TMPDIR' is set, `sort' uses its value
as the directory for temporary files instead of `/tmp'.  The
`--temporary-directory' (`-T') option in turn overrides the environment
variable.

   The following options affect the ordering of output lines.  They may
be specified globally or as part of a specific key field.  If no key
fields are specified, global options apply to comparison of entire
lines; otherwise the global options are inherited by key fields that do
not specify any special options of their own.  In pre-POSIX versions of
`sort', global options affect only later key fields, so portable shell
scripts should specify global options first.

`-b'
`--ignore-leading-blanks'
     Ignore leading blanks when finding sort keys in each line.  By
     default a blank is a space or a tab, but the `LC_CTYPE' locale can
     change this.

`-d'
`--dictionary-order'
     Sort in "phone directory" order: ignore all characters except
     letters, digits and blanks when sorting.  By default letters and
     digits are those of ASCII and a blank is a space or a tab, but the
     `LC_CTYPE' locale can change this.

`-f'
`--ignore-case'
     Fold lowercase characters into the equivalent uppercase characters
     when comparing so that, for example, `b' and `B' sort as equal.
     The `LC_CTYPE' locale determines character types.

`-g'
`--general-numeric-sort'
`--sort=general-numeric'
     Sort numerically, using the standard C function `strtod' to convert
     a prefix of each line to a double-precision floating point number.
     This allows floating point numbers to be specified in scientific
     notation, like `1.0e-34' and `10e100'.  The `LC_NUMERIC' locale
     determines the decimal-point character.  Do not report overflow,
     underflow, or conversion errors.  Use the following collating
     sequence:

        * Lines that do not start with numbers (all considered to be
          equal).

        * NaNs ("Not a Number" values, in IEEE floating point
          arithmetic) in a consistent but machine-dependent order.

        * Minus infinity.

        * Finite numbers in ascending numeric order (with -0 and +0
          equal).

        * Plus infinity.

     Use this option only if there is no alternative; it is much slower
     than `--numeric-sort' (`-n') and it can lose information when
     converting to floating point.

`-i'
`--ignore-nonprinting'
     Ignore nonprinting characters.  The `LC_CTYPE' locale determines
     character types.  This option has no effect if the stronger
     `--dictionary-order' (`-d') option is also given.

`-M'
`--month-sort'
`--sort=month'
     An initial string, consisting of any amount of blanks, followed by
     a month name abbreviation, is folded to UPPER case and compared in
     the order `JAN' < `FEB' < ... < `DEC'.  Invalid names compare low
     to valid names.  The `LC_TIME' locale category determines the
     month spellings.  By default a blank is a space or a tab, but the
     `LC_CTYPE' locale can change this.

`-n'
`--numeric-sort'
`--sort=numeric'
     Sort numerically.  The number begins each line and consists of
     optional blanks, an optional `-' sign, and zero or more digits
     possibly separated by thousands separators, optionally followed by
     a decimal-point character and zero or more digits.  An empty
     number is treated as `0'.  The `LC_NUMERIC' locale specifies the
     decimal-point character and thousands separator.  By default a
     blank is a space or a tab, but the `LC_CTYPE' locale can change
     this.

     Comparison is exact; there is no rounding error.

     Neither a leading `+' nor exponential notation is recognized.  To
     compare such strings numerically, use the `--general-numeric-sort'
     (`-g') option.

`-r'
`--reverse'
     Reverse the result of comparison, so that lines with greater key
     values appear earlier in the output instead of later.

`-R'
`--random-sort'
`--sort=random'
     Sort by hashing the input keys and then sorting the hash values.
     Choose the hash function at random, ensuring that it is free of
     collisions so that differing keys have differing hash values.
     This is like a random permutation of the inputs (*note shuf
     invocation::), except that keys with the same value sort together.

     If multiple random sort fields are specified, the same random hash
     function is used for all fields.  To use different random hash
     functions for different fields, you can invoke `sort' more than
     once.

     The choice of hash function is affected by the `--random-source'
     option.


   Other options are:

`--compress-program=PROG'
     Compress any temporary files with the program PROG.

     With no arguments, PROG must compress standard input to standard
     output, and when given the `-d' option it must decompress standard
     input to standard output.

     Terminate with an error if PROG exits with nonzero status.

     White space and the backslash character should not appear in PROG;
     they are reserved for future use.

`-k POS1[,POS2]'
`--key=POS1[,POS2]'
     Specify a sort field that consists of the part of the line between
     POS1 and POS2 (or the end of the line, if POS2 is omitted),
     _inclusive_.

     Each POS has the form `F[.C][OPTS]', where F is the number of the
     field to use, and C is the number of the first character from the
     beginning of the field.  Fields and character positions are
     numbered starting with 1; a character position of zero in POS2
     indicates the field's last character.  If `.C' is omitted from
     POS1, it defaults to 1 (the beginning of the field); if omitted
     from POS2, it defaults to 0 (the end of the field).  OPTS are
     ordering options, allowing individual keys to be sorted according
     to different rules; see below for details.  Keys can span multiple
     fields.

     Example:  To sort on the second field, use `--key=2,2' (`-k 2,2').
     See below for more examples.

`-o OUTPUT-FILE'
`--output=OUTPUT-FILE'
     Write output to OUTPUT-FILE instead of standard output.  Normally,
     `sort' reads all input before opening OUTPUT-FILE, so you can
     safely sort a file in place by using commands like `sort -o F F'
     and `cat F | sort -o F'.  However, `sort' with `--merge' (`-m')
     can open the output file before reading all input, so a command
     like `cat F | sort -m -o F - G' is not safe as `sort' might start
     writing `F' before `cat' is done reading it.

     On newer systems, `-o' cannot appear after an input file if
     `POSIXLY_CORRECT' is set, e.g., `sort F -o F'.  Portable scripts
     should specify `-o OUTPUT-FILE' before any input files.

`--random-source=FILE'
     Use FILE as a source of random data used to determine which random
     hash function to use with the `-R' option.  *Note Random sources::.

`-s'
`--stable'
     Make `sort' stable by disabling its last-resort comparison.  This
     option has no effect if no fields or global ordering options other
     than `--reverse' (`-r') are specified.

`-S SIZE'
`--buffer-size=SIZE'
     Use a main-memory sort buffer of the given SIZE.  By default, SIZE
     is in units of 1024 bytes.  Appending `%' causes SIZE to be
     interpreted as a percentage of physical memory.  Appending `K'
     multiplies SIZE by 1024 (the default), `M' by 1,048,576, `G' by
     1,073,741,824, and so on for `T', `P', `E', `Z', and `Y'.
     Appending `b' causes SIZE to be interpreted as a byte count, with
     no multiplication.

     This option can improve the performance of `sort' by causing it to
     start with a larger or smaller sort buffer than the default.
     However, this option affects only the initial buffer size.  The
     buffer grows beyond SIZE if `sort' encounters input lines larger
     than SIZE.

`-t SEPARATOR'
`--field-separator=SEPARATOR'
     Use character SEPARATOR as the field separator when finding the
     sort keys in each line.  By default, fields are separated by the
     empty string between a non-blank character and a blank character.
     By default a blank is a space or a tab, but the `LC_CTYPE' locale
     can change this.

     That is, given the input line ` foo bar', `sort' breaks it into
     fields ` foo' and ` bar'.  The field separator is not considered
     to be part of either the field preceding or the field following,
     so with `sort -t " "' the same input line has three fields: an
     empty field, `foo', and `bar'.  However, fields that extend to the
     end of the line, as `-k 2', or fields consisting of a range, as
     `-k 2,3', retain the field separators present between the
     endpoints of the range.

     To specify a null character (ASCII NUL) as the field separator,
     use the two-character string `\0', e.g., `sort -t '\0''.

`-T TEMPDIR'
`--temporary-directory=TEMPDIR'
     Use directory TEMPDIR to store temporary files, overriding the
     `TMPDIR' environment variable.  If this option is given more than
     once, temporary files are stored in all the directories given.  If
     you have a large sort or merge that is I/O-bound, you can often
     improve performance by using this option to specify directories on
     different disks and controllers.

`-u'
`--unique'
     Normally, output only the first of a sequence of lines that compare
     equal.  For the `--check' (`-c' or `-C') option, check that no
     pair of consecutive lines compares equal.

     This option also disables the default last-resort comparison.

     The commands `sort -u' and `sort | uniq' are equivalent, but this
     equivalence does not extend to arbitrary `sort' options.  For
     example, `sort -n -u' inspects only the value of the initial
     numeric string when checking for uniqueness, whereas `sort -n |
     uniq' inspects the entire line.  *Note uniq invocation::.

`-z'
`--zero-terminated'
     Treat the input as a set of lines, each terminated by a null
     character (ASCII NUL) instead of a line feed (ASCII LF).  This
     option can be useful in conjunction with `perl -0' or `find
     -print0' and `xargs -0' which do the same in order to reliably
     handle arbitrary file names (even those containing blanks or other
     special characters).


   Historical (BSD and System V) implementations of `sort' have
differed in their interpretation of some options, particularly `-b',
`-f', and `-n'.  GNU sort follows the POSIX behavior, which is usually
(but not always!) like the System V behavior.  According to POSIX, `-n'
no longer implies `-b'.  For consistency, `-M' has been changed in the
same way.  This may affect the meaning of character positions in field
specifications in obscure cases.  The only fix is to add an explicit
`-b'.

   A position in a sort field specified with `-k' may have any of the
option letters `Mbdfinr' appended to it, in which case the global
ordering options are not used for that particular field.  The `-b'
option may be independently attached to either or both of the start and
end positions of a field specification, and if it is inherited from the
global options it will be attached to both.  If input lines can contain
leading or adjacent blanks and `-t' is not used, then `-k' is typically
combined with `-b', `-g', `-M', or `-n'; otherwise the varying numbers
of leading blanks in fields can cause confusing results.

   If the start position in a sort field specifier falls after the end
of the line or after the end field, the field is empty.  If the `-b'
option was specified, the `.C' part of a field specification is counted
from the first nonblank character of the field.

   On older systems, `sort' supports an obsolete origin-zero syntax
`+POS1 [-POS2]' for specifying sort keys.  This obsolete behavior can
be enabled or disabled with the `_POSIX2_VERSION' environment variable
(*note Standards conformance::); it can also be enabled when
`POSIXLY_CORRECT' is not set by using the obsolete syntax with `-POS2'
present.

   Scripts intended for use on standard hosts should avoid obsolete
syntax and should use `-k' instead.  For example, avoid `sort +2',
since it might be interpreted as either `sort ./+2' or `sort -k 3'.  If
your script must also run on hosts that support only the obsolete
syntax, it can use a test like `if sort -k 1 </dev/null >/dev/null
2>&1; then ...' to decide which syntax to use.

   Here are some examples to illustrate various combinations of options.

   * Sort in descending (reverse) numeric order.

          sort -n -r

   * Sort alphabetically, omitting the first and second fields and the
     blanks at the start of the third field.  This uses a single key
     composed of the characters beginning at the start of the first
     nonblank character in field three and extending to the end of each
     line.

          sort -k 3b

   * Sort numerically on the second field and resolve ties by sorting
     alphabetically on the third and fourth characters of field five.
     Use `:' as the field delimiter.

          sort -t : -k 2,2n -k 5.3,5.4

     Note that if you had written `-k 2n' instead of `-k 2,2n' `sort'
     would have used all characters beginning in the second field and
     extending to the end of the line as the primary _numeric_ key.
     For the large majority of applications, treating keys spanning
     more than one field as numeric will not do what you expect.

     Also note that the `n' modifier was applied to the field-end
     specifier for the first key.  It would have been equivalent to
     specify `-k 2n,2' or `-k 2n,2n'.  All modifiers except `b' apply
     to the associated _field_, regardless of whether the modifier
     character is attached to the field-start and/or the field-end part
     of the key specifier.

   * Sort the password file on the fifth field and ignore any leading
     blanks.  Sort lines with equal values in field five on the numeric
     user ID in field three.  Fields are separated by `:'.

          sort -t : -k 5b,5 -k 3,3n /etc/passwd
          sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
          sort -t : -b -k 5,5 -k 3,3n /etc/passwd

     These three commands have equivalent effect.  The first specifies
     that the first key's start position ignores leading blanks and the
     second key is sorted numerically.  The other two commands rely on
     global options being inherited by sort keys that lack modifiers.
     The inheritance works in this case because `-k 5b,5b' and `-k
     5b,5' are equivalent, as the location of a field-end lacking a `.C'
     character position is not affected by whether initial blanks are
     skipped.

   * Sort a set of log files, primarily by IPv4 address and secondarily
     by time stamp.  If two lines' primary and secondary keys are
     identical, output the lines in the same order that they were
     input.  The log files contain lines that look like this:

          4.150.156.3 - - [01/Apr/2004:06:31:51 +0000] message 1
          211.24.3.231 - - [24/Apr/2004:20:17:39 +0000] message 2

     Fields are separated by exactly one space.  Sort IPv4 addresses
     lexicographically, e.g., 212.61.52.2 sorts before 212.129.233.201
     because 61 is less than 129.

          sort -s -t ' ' -k 4.9n -k 4.5M -k 4.2n -k 4.14,4.21 file*.log |
          sort -s -t '.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n

     This example cannot be done with a single `sort' invocation, since
     IPv4 address components are separated by `.' while dates come just
     after a space.  So it is broken down into two invocations of
     `sort': the first sorts by time stamp and the second by IPv4
     address.  The time stamp is sorted by year, then month, then day,
     and finally by hour-minute-second field, using `-k' to isolate each
     field.  Except for hour-minute-second there's no need to specify
     the end of each key field, since the `n' and `M' modifiers sort
     based on leading prefixes that cannot cross field boundaries.  The
     IPv4 addresses are sorted lexicographically.  The second sort uses
     `-s' so that ties in the primary key are broken by the secondary
     key; the first sort uses `-s' so that the combination of the two
     sorts is stable.

   * Generate a tags file in case-insensitive sorted order.

          find src -type f -print0 | sort -z -f | xargs -0 etags --append

     The use of `-print0', `-z', and `-0' in this case means that file
     names that contain blanks or other special characters are not
     broken up by the sort operation.

   * Shuffle a list of directories, but preserve the order of files
     within each directory.  For instance, one could use this to
     generate a music playlist in which albums are shuffled but the
     songs of each album are played in order.

          ls */* | sort -t / -k 1,1R -k 2,2


   ---------- Footnotes ----------

   (1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
`en_US'), then `sort' may produce output that is sorted differently
than you're accustomed to.  In that case, set the `LC_ALL' environment
variable to `C'.  Note that setting only `LC_COLLATE' has two problems.
First, it is ineffective if `LC_ALL' is also set.  Second, it has
undefined behavior if `LC_CTYPE' (or `LANG', if `LC_CTYPE' is unset) is
set to an incompatible value.  For example, you get undefined behavior
if `LC_CTYPE' is `ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.