Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes Monday through Friday.


HPR4657: UNIX Curio #8 - Comparing Files

Hosted by Vance on 2026-06-09 01:00:00

HPR Comments


Vance says: Appreciate the comments

RE: hpr4657::2026-06-09 UNIX Curio #8 - Comparing Files by Vance
00:14:20 Listen in ogg, opus, or mp3 format.
When I was uploading this episode, I started to feel like maybe I was selling "comm" short a bit. Nice to hear that it's been useful for you, Whiskeyjack. While its functionality can't be easily replicated with "cut" or "awk", those utilities can certainly make good use of the output from "comm".

Interesting to read about your experience in terms of runtime. I have taken to using a single call to "awk" in situations where I might otherwise call on several utilities in a pipeline. Your comment is a good reminder that if something will be used repeatedly, it's best to measure its runtime instead of automatically assuming that one tool or set of tools will be faster than another.

Whiskeyjack says: HPR4657 - use of comm

comm is actually a very, very useful program in scripts if you know how to make good use of it.

For example, bash can be fairly slow when used in a classic looping algorithm over a large amount of data.

However, if you can reformat the data so that it can be compared with comm, then you can use comm as a filter without any loops.

As an example, in one application a simple loop took 3.7 seconds to work its way through the data, which was far too long.

However, I used awk and sort to reformat one file, and a combination of find, cut, sort, uniq, and awk on the directory structure to generate a second file. By comparing the two with comm, I was able to filter the information down to just the records that had relevant changes, and then use the slower looping algorithm on those alone.

This cut the time down to 0.150 seconds, which was more or less instantaneous from a user perspective. Despite this method appearing to have a lot more transformations in it, it was about 25 times faster. This is because the second algorithm actually makes far fewer calls to commands, even though more different commands are involved.
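The approach described above might look something like the following. This is only a minimal sketch, not the actual script: the file names, the "name checksum" record format, and the sample data are all made up for illustration, and it assumes a mktemp that supports -d. The point is that comm -13 prints only the lines unique to the second sorted file, so the slow loop sees only new or changed records.

```shell
#!/bin/sh
# Hypothetical sample data: one "name checksum" record per line.
workdir=$(mktemp -d)
printf '%s\n' 'b.txt 222' 'a.txt 111' 'c.txt 333' > "$workdir/old.txt"
printf '%s\n' 'a.txt 111' 'c.txt 999' 'd.txt 444' 'b.txt 222' > "$workdir/new.txt"

# comm requires both inputs to be sorted in the same collating order,
# so force the C locale for sort and comm alike.
LC_ALL=C sort "$workdir/old.txt" > "$workdir/old.sorted"
LC_ALL=C sort "$workdir/new.txt" > "$workdir/new.sorted"

# -1 drops lines found only in the old file, -3 drops lines common to both.
# What remains is the set of records that are new or changed in the new file.
changed=$(LC_ALL=C comm -13 "$workdir/old.sorted" "$workdir/new.sorted")

# The slower per-record loop now runs only over the filtered records.
printf '%s\n' "$changed" | while read -r name checksum; do
    echo "needs attention: $name"
done

rm -rf "$workdir"
```

Here a.txt and b.txt are identical in both files, so they never reach the loop at all.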

So comm is a very useful command to know, and if you have a lot of information to process it should be one of the tools that you turn to when figuring out the best way to do it.

Whiskeyjack says: Reply to Vance on awk in HPR4657

Awk is another extremely useful command to know in terms of improving performance in cases where you might otherwise need to use a loop.

I don't have any relative numbers to hand in this instance, but I know that I have increased performance very significantly by structuring an algorithm to allow use of awk.
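As a rough illustration of the kind of restructuring meant here, the sketch below totals one field two ways. The "user bytes" record format and the data are hypothetical, and it assumes a mktemp command is available. The loop starts two cut processes for every input line, while the awk version reads the whole file in a single process.

```shell
#!/bin/sh
# Hypothetical input: "user bytes" records, one per line.
datafile=$(mktemp)
printf '%s\n' 'alice 10' 'bob 5' 'alice 7' 'bob 1' 'carol 3' > "$datafile"

# Loop version: two external cut invocations per line.
loop_total=0
while IFS= read -r line; do
    user=$(printf '%s\n' "$line" | cut -d' ' -f1)
    bytes=$(printf '%s\n' "$line" | cut -d' ' -f2)
    if [ "$user" = "alice" ]; then
        loop_total=$((loop_total + bytes))
    fi
done < "$datafile"

# awk version: one process handles every line.
# "sum + 0" makes awk print 0 rather than an empty string if nothing matched.
awk_total=$(awk '$1 == "alice" { sum += $2 } END { print sum + 0 }' "$datafile")

echo "loop: $loop_total  awk: $awk_total"
rm -f "$datafile"
```

On five lines the difference is invisible, so it would be worth timing both versions on real data before settling on one.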

The history of awk may be a good subject for a Unix Curio episode.

"Expect" would be a good command to cover as well. I have used this for scripted log-ins to test VMs over SSH in cases where SSH keys wouldn't work for some reason.

candycanearter07 says: comparisons

cmp, while perhaps not as useful as it once was, is still quite useful for testing whether two files are bit-for-bit copies of each other.
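A small sketch of that test, using made-up file names and contents and assuming mktemp supports -d. With -s, cmp prints nothing and reports the result through its exit status: 0 if the files are identical, 1 if they differ, and greater than 1 on error.

```shell
#!/bin/sh
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/a"
cp "$workdir/a" "$workdir/b"          # exact copy of a
printf 'hellp\n' > "$workdir/c"       # differs from a by one byte

# cmp -s is silent; only the exit status is used.
if cmp -s "$workdir/a" "$workdir/b"; then same_ab=yes; else same_ab=no; fi
if cmp -s "$workdir/a" "$workdir/c"; then same_ac=yes; else same_ac=no; fi

echo "a vs b identical: $same_ab"
echo "a vs c identical: $same_ac"
rm -rf "$workdir"
```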

xmanmonk says: Great Show (again)

Another great show on these often forgotten commands. Glad to hear you have some more episodes in the works! Looking forward to them!



Copyright Information

Unless otherwise stated, our shows are released under a Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.

The HPR Website Design is released to the Public Domain.