I have a data file that looks like this:
TOPIC:topic_0 2056
ab 2.0
cd 5.0
ef 3.0
gh 10.0
TOPIC:topic_1 1000
aa 3.0
bd 5.0
gh 2.0
and so on, up to TOPIC:topic_2000. The first line of each block is the topic and its weight. The lines that follow are the words in that topic and their respective weights.
Now, I want to sum up the second column of each topic and check what value it gives. That is, I want to get the output as:
Topic:topic_0 20
Topic:topic_1 10
That is, the topic name followed by the sum of its weight column (for example, the word weights in topic_0 are 2, 5, 3 and 10, which sum to 20). I tried using:
with open('Input.txt') as in_file:
    for line in in_file:
        columns = line.split(' ')
        value = columns[0]
        if value[:6] == 'TOPIC:':
            total_value = columns[1]
            total_value = total_value[:-1]
            total_values = float(total_value)
            #print '\n'
            print columns[0]
But, I am not sure how to proceed from this. This is just printing the topic numbers. Please help!
Answer
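import re

# Sample data in the same layout as the input file.
data = """
TOPIC:topic_0 2056
ab 2.0
cd 5.0
ef 3.0
gh 10.0
TOPIC:topic_1 1000
aa 3.0
bd 5.0
gh 2.0
"""

result = {}
for line in data.splitlines():
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # Split on any run of whitespace.
    columns = re.split(r"\s+", line)
    value = columns[0]
    if value[:6] == 'TOPIC:':
        # Topic header: start a new list of weights for this topic.
        result[value] = []
        points = result[value]
        continue
    # Word line: store its weight under the current topic.
    points.append(float(columns[1]))

# Print each topic with the sum of its word weights.
for k, v in result.items():
    print k, sum(v)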
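The answer above works on an inline string, while the question reads from Input.txt. As a rough sketch of the same idea for a file, the version below keeps a running total per topic instead of a list of weights. It is written for Python 3 (print is a function there, and dicts keep insertion order from 3.7 on). The function name sum_topic_weights is made up for this illustration, and it assumes every word line has the form "word weight".

```python
def sum_topic_weights(lines):
    """Return {topic_name: sum of word weights}, in the order topics appear.

    `lines` is any iterable of strings, e.g. an open file handle.
    (Hypothetical helper, not part of the original answer.)
    """
    totals = {}
    current = None  # name of the topic we are currently inside
    for line in lines:
        columns = line.split()  # split on any whitespace, drops the newline
        if not columns:
            continue  # skip blank lines
        if columns[0].startswith('TOPIC:'):
            # Topic header: start a fresh running total.
            current = columns[0]
            totals[current] = 0.0
        elif current is not None and len(columns) >= 2:
            # Word line: add its weight to the current topic's total.
            totals[current] += float(columns[1])
    return totals


sample = [
    "TOPIC:topic_0 2056", "ab 2.0", "cd 5.0", "ef 3.0", "gh 10.0",
    "TOPIC:topic_1 1000", "aa 3.0", "bd 5.0", "gh 2.0",
]
for topic, total in sum_topic_weights(sample).items():
    print(topic, total)
```

To run it on the real data, pass the open file instead of the sample list: open Input.txt in a with block and call sum_topic_weights(in_file). Because only one number is kept per topic, this should stay cheap even with 2000 topics.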