
diagnosing when I’m being limited by disk i/o

I’m running Python 2.7 on a Linux machine, and by far the slowest part of my script is loading a large JSON file from disk (an SSD) using the ujson library. When I check top during this loading process, my CPU usage is basically at 100%, leading me to believe that I’m bottlenecked by parsing the JSON rather than by transferring the bytes from disk into memory. Is this a valid assumption, or will ujson burn empty loops or something while waiting for the disk? I’d like to know because I’m not sure whether dedicating another core of my CPU to a second script that does a lot of disk I/O will significantly slow down the first script.


Answer

Without seeing your code, I’m going to assume you are doing something like this:

import ujson as json  # or: import json

with open('data.json') as datafile:
    data = json.loads(datafile.read())

Instead, you could split the steps of reading the file and parsing it:

with open('data.json') as datafile:
    raw_data = datafile.read()
    data = json.loads(raw_data)

If you add some timing calls, you can determine how long each step is taking:

# Timing decorator from https://www.andreas-jung.com/contents/a-python-decorator-for-measuring-the-execution-time-of-methods
import time
import ujson as json  # or: import json

def timeit(method):

    def timed(*args, **kw):
        ts = time.time()
        result = method(*args, **kw)
        te = time.time()

        print '%r (%r, %r) %2.2f sec' % \
              (method.__name__, args, kw, te - ts)
        return result

    return timed

# @timeit can only decorate a function, so wrap each step in one:
@timeit
def read_file(path):
    with open(path) as datafile:
        return datafile.read()

@timeit
def parse_json(raw_data):
    return json.loads(raw_data)

data = parse_json(read_file('data.json'))
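
If you'd rather not bother with the decorator, a minimal inline sketch along the same lines does the same job (it assumes the same data.json file and ujson imported as json; names like read_time and parse_time are just illustrative):

import time
import ujson as json

start = time.time()
with open('data.json') as datafile:
    raw_data = datafile.read()   # pure disk read
read_time = time.time() - start

start = time.time()
data = json.loads(raw_data)      # pure parsing, no I/O
parse_time = time.time() - start

print 'read: %.3f sec  parse: %.3f sec' % (read_time, parse_time)

If the parse time dominates the read time, the 100% CPU you see in top is telling the real story: the work is CPU-bound parsing, and a second I/O-heavy script on another core should only contend with this one for the disk during the comparatively short read phase.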