Finding differences between CSV files

Don’t write another CSV parser, even in Python.

There lurks a dangerous temptation for people with experience in other languages when they start using Python. It is so easy to juggle data that loading and parsing a file feels like fun. But even if you enjoy re-implementing text processing, I would prefer not to.

Files with the same structure

If you are analyzing row-based data, use an existing library. For Python, pandas DataFrames should do the trick.

Here are two files with identical structures:

file1a.csv

indx,   val
100012, 0.987
100034, 1.341
99234,  17.622

file1b.csv

indx,   val
100012, 0.912
99234,  17.251
100034, 1.333

To calculate the difference between the ‘val’ columns, just read the two files into DataFrames. Then merge them on the ‘indx’ column. Because of the index, the order of the rows does not matter.

import pandas as pd
a = pd.read_csv("file1a.csv", skipinitialspace=True)
b = pd.read_csv("file1b.csv", skipinitialspace=True)
c = a.merge(b, on='indx', suffixes=('a', 'b'))
c['diff'] = c['vala'] - c['valb']
print(c)

If you adapt the above to add some some command-line parameters, please also don’t write your own parsing code for those. Anyway, the output of those few lines is exactly what you’d expect:

     indx    vala    valb   diff
0  100012   0.987   0.912  0.075
1  100034   1.341   1.333  0.008
2   99234  17.622  17.251  0.371

Different differences

If you’re attached to your beautiful hand-rolled parser and the analysis it enables, you may think that the above is all very well for trivial files, but in your complicated real-world application where everything is special, you just won’t get away with it. Well, perhaps.

But you can compare columns with different names, missing or duplicate rows, files with data separated by ‘|’, ‘~’ or anything else.

It is easier to write and understand 5 lines of code than 5!