Finding differences between CSV files
Don't write another CSV parser, even in Python.
There lurks a dangerous temptation for people with experience in other languages when they start using Python. It is so easy to juggle data that loading and parsing a file feels like fun. But even if you enjoy re-implementing text processing, I would prefer not to.
Files with the same structure
If you are analyzing row-based data, use an existing library. For Python, pandas DataFrames should do the trick.
Here are two files with identical structures:
file1a.csv
indx, val 100012, 0.987 100034, 1.341 99234, 17.622
file1b.csv
indx, val 100012, 0.912 99234, 17.251 100034, 1.333
To calculate the difference between the ‘val’ columns, just read the two files into DataFrames. Then merge them on the ‘indx’ column. Because of the index, the order of the rows does not matter.
import pandas as pd
a = pd.read_csv("file1a.csv", skipinitialspace=True)
b = pd.read_csv("file1b.csv", skipinitialspace=True)
c = a.merge(b, on='indx', suffixes=('a', 'b'))
c['diff'] = c['vala'] - c['valb']
print(c)
If you adapt the above to add some some command-line parameters, please also don't write your own parsing code for those. Anyway, the output of those few lines is exactly what you'd expect:
indx vala valb diff
0 100012 0.987 0.912 0.075
1 100034 1.341 1.333 0.008
2 99234 17.622 17.251 0.371
Different differences
If you're attached to your beautiful hand-rolled parser and the analysis it enables, you may think that the above is all very well for trivial files, but in your complicated real-world application where everything is special, you just won't get away with it. Well, perhaps.
But you can compare columns with different names, missing or duplicate rows, files with data separated by ‘|’, ‘~’ or anything else.
It is easier to write and understand 5 lines of code than 5!
⁂