Process big JSON files using a stream-based JsonParser
Most JSON data can be parsed in memory, but there are edge cases where we have to process big data files - e.g. a database export or data from a third-party provider. For these scenarios the Jackson library provides stream-based read and write mechanisms to handle huge files efficiently.
Motivation
Over the last few years, I have worked on projects where I had to process big JSON files. We are not talking about a few hundred KB or 2 MB, but file sizes that often exceeded 10 GB. In these situations we cannot handle the data in memory, so we have to use streaming techniques to read and process the source file in smaller parts - basically as a continuous stream of data blocks until we reach the end of the file.
Luckily, the Jackson library has built-in features to read and write JSON as a stream.
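Before diving into Jackson, the general idea can be illustrated with a minimal, Jackson-free sketch (class and method names are my own): we read the input in fixed-size chunks, so memory usage is bounded by the buffer size rather than the size of the input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ChunkedRead {

    // Processes a character stream chunk by chunk; memory usage is bounded
    // by the buffer size, not by the total size of the input.
    public static long countChars(Reader reader) throws IOException {
        long total = 0;
        char[] buffer = new char[8192];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            // a real application would process the chunk here
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countChars(new StringReader("{\"records\":[]}")));
    }
}
```

The same principle applies when the Reader wraps a multi-gigabyte file: only one buffer's worth of data is in memory at a time.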
You can find the complete code for this project on GitHub.
Example data
To demonstrate the basic features of stream-based processing, we will use the following example data structure:
{
"extractionDate": "2019-04-20",
"records": [
{
"customerId": "1234567",
"firstName": "John",
"lastName": "Doe",
"active": true,
"totalSales": 100000
},
{
"customerId": "1234568",
"firstName": "Jane",
"lastName": "Doe",
"active": true,
"totalSales": 80000
},
{
"customerId": "1234569",
"firstName": "Jack",
"lastName": "Smith",
"active": false,
"totalSales": 9000
},
{
"customerId": "1234570",
"firstName": "Fred",
"lastName": "Smith",
"active": true,
"totalSales": 2000
}
]
}
Read JSON Data
The first section covers the preparation steps to be able to stream JSON data.
Initialize JsonParser
We first have to initialize a JsonParser instance:
// The JsonFactory can be shared and reused across parser instances
private static final JsonFactory JSON_FACTORY = new JsonFactory();

/**
 * Simple helper method to create a {@link JsonParser} instance used for
 * streaming JSON data from a file.
 * @param file the path to the JSON file to read
 * @return {@link JsonParser} instance
 * @throws JsonParseException in case the underlying input contains invalid content
 * @throws FileNotFoundException in case the file does not exist
 * @throws IOException in case the file cannot be read
 */
public static JsonParser getJsonParser(String file)
        throws JsonParseException, FileNotFoundException, IOException {
    return JSON_FACTORY.createParser(
            new InputStreamReader(
                    new FileInputStream(file), StandardCharsets.UTF_8));
}
This method chains a FileInputStream with an InputStreamReader to create the JsonParser. I usually also set the Charset explicitly to UTF-8, or to whatever encoding the input file actually uses.
Read and print out the content
As a first example we will use the JsonParser to test whether streaming the JSON data works as expected. For this purpose I have added a unit test. You can find the test class here.

- We initialize the JsonParser instance
- We use a while loop to iterate through the data with JsonParser.nextToken() until we reach the end
- We grab the current token with JsonParser.currentToken()
- Finally, we print the token type with JsonToken.name() and the value with JsonParser.getText()
@Test
public void testParseJsonAsStream() throws JsonParseException, FileNotFoundException, IOException {
JsonParser jsonParser = JsonStreamUtils.getJsonParser(SOURCE_FILE);
assertFalse(jsonParser.isClosed());
while(jsonParser.nextToken() != null) {
JsonToken token = jsonParser.currentToken();
System.out.println(String.format("Token type: %s value: %s", token.name(), jsonParser.getText()));
}
}
The output produced by the first test case is shown below. We can already see how streaming JSON data actually works: the JsonParser splits the stream into consumable tokens or events. When iterating through the stream we can handle these tokens individually. Tokens are produced for:
- Start and end of objects
- Start and end of arrays
- Field names
- Field values
There is one caveat, though: we can only move in the forward direction; there is currently no way to return to previous tokens.
Token type: START_OBJECT value: {
Token type: FIELD_NAME value: extractionDate
Token type: VALUE_STRING value: 2019-04-20
Token type: FIELD_NAME value: records
Token type: START_ARRAY value: [
Token type: START_OBJECT value: {
Token type: FIELD_NAME value: customerId
Token type: VALUE_STRING value: 1234567
Token type: FIELD_NAME value: firstName
Token type: VALUE_STRING value: John
Token type: FIELD_NAME value: lastName
Token type: VALUE_STRING value: Doe
Token type: FIELD_NAME value: active
Token type: VALUE_TRUE value: true
Token type: FIELD_NAME value: totalSales
Token type: VALUE_NUMBER_INT value: 100000
Token type: END_OBJECT value: }
Token type: START_OBJECT value: {
Token type: FIELD_NAME value: customerId
Token type: VALUE_STRING value: 1234568
Token type: FIELD_NAME value: firstName
Token type: VALUE_STRING value: Jane
Token type: FIELD_NAME value: lastName
Token type: VALUE_STRING value: Doe
Token type: FIELD_NAME value: active
Token type: VALUE_TRUE value: true
Token type: FIELD_NAME value: totalSales
Token type: VALUE_NUMBER_INT value: 80000
Token type: END_OBJECT value: }
Token type: START_OBJECT value: {
Token type: FIELD_NAME value: customerId
Token type: VALUE_STRING value: 1234569
Token type: FIELD_NAME value: firstName
Token type: VALUE_STRING value: Jack
Token type: FIELD_NAME value: lastName
Token type: VALUE_STRING value: Smith
Token type: FIELD_NAME value: active
Token type: VALUE_FALSE value: false
Token type: FIELD_NAME value: totalSales
Token type: VALUE_NUMBER_INT value: 9000
Token type: END_OBJECT value: }
Token type: START_OBJECT value: {
Token type: FIELD_NAME value: customerId
Token type: VALUE_STRING value: 1234570
Token type: FIELD_NAME value: firstName
Token type: VALUE_STRING value: Fred
Token type: FIELD_NAME value: lastName
Token type: VALUE_STRING value: Smith
Token type: FIELD_NAME value: active
Token type: VALUE_TRUE value: true
Token type: FIELD_NAME value: totalSales
Token type: VALUE_NUMBER_INT value: 2000
Token type: END_OBJECT value: }
Token type: END_ARRAY value: ]
Token type: END_OBJECT value: }
Count the number of objects in an array
Let's implement a slightly more complicated example. The next two methods are used to iterate through the data stream and count the number of customer objects.

- As before, we first get an instance of our JsonParser
- Then we have to find the start of the array - we go through the findStartOfArray() method below
- We iterate through the stream until we reach JsonToken.END_ARRAY
- In the inner loop we iterate until we reach JsonToken.END_OBJECT, which indicates that we have parsed one object in the array
- We increase the counter variable and continue with the next token
/**
* This method is used to count the number of objects within an array. The attribute name that
* references the array in the JSON file has to be provided. This method can only be used for
* scenarios where the array only contains "flat objects" - which means the object itself
* doesn't include arrays.
* @param file the path to the JSON file to read
* @param arrayAttributeName the attribute name that references the array
* @return the number of objects within the array
* @throws JsonParseException
* @throws FileNotFoundException
* @throws IOException
*/
public static Long countArrayItems(String file, String arrayAttributeName)
throws JsonParseException, FileNotFoundException, IOException {
long counter = 0;
// Get JsonParser
JsonParser jsonParser = JsonStreamUtils.getJsonParser(file);
// Find the start of the array
if(JsonStreamUtils.findStartOfArray(jsonParser, arrayAttributeName)) {
// The pointer will be at the start of the first object when we call jsonParser.nextToken()
// This approach only works for flat objects that don't have arrays within the object structure.
while(jsonParser.nextToken() != null && !jsonParser.currentToken().equals(JsonToken.END_ARRAY)) {
while(!jsonParser.currentToken().equals(JsonToken.END_OBJECT)) {
jsonParser.nextToken();
}
// When we reach the end of an object, we can increase the counter
counter++;
}
}
return counter;
}
The helper method findStartOfArray() is also used in other scenarios, which is why I extracted it into a separate method. The idea is to move the pointer forward within the data stream until we reach an array that is referenced by the given attribute name.
- We use a while loop to iterate until we have identified the start of the array
- In the loop we check whether the current token is a JsonToken.FIELD_NAME and whether the name matches the provided attribute name
- Within the if statement we also get the next token to ensure that it in fact points to JsonToken.START_ARRAY
- If all these conditions match, we set the foundArrayAttribute variable to true, which ends the while loop
/**
* Internal helper method to find the start of the array referenced by the given
* arrayAttributeName.
* @param jsonParser the {@link JsonParser} instance
* @param arrayAttributeName the name of the attribute that references the array
* @return true in case the array structure has been found
* @throws IOException
*/
private static boolean findStartOfArray(JsonParser jsonParser, String arrayAttributeName)
throws IOException {
// Move the pointer until we reach attributeName array
boolean foundArrayAttribute = false;
while(!foundArrayAttribute && jsonParser.nextToken() != null) {
JsonToken token = jsonParser.currentToken();
if(token.equals(JsonToken.FIELD_NAME)
&& arrayAttributeName.equals(jsonParser.getText())
&& JsonToken.START_ARRAY.equals(jsonParser.nextToken())) {
// we have found a field that matches the arrayAttributeName and the following
// token is also the start of an array
foundArrayAttribute = true;
}
}
return foundArrayAttribute;
}
To complete the picture, here is the simple unit test that checks whether the returned result - based on the test data above - is actually 4.
@Test
public void testCountArrayItems() throws JsonParseException, FileNotFoundException, IOException {
Long amountOfRecords = JsonStreamUtils.countArrayItems(SOURCE_FILE, ARRAY_ATTRIBUTE_NAME);
assertEquals(4, amountOfRecords);
}
Count the number of active customers
The next example is very close to the previous one, so not a lot of explanation is needed. The only modification is that we add more conditions when counting the customers: the attribute active has to be true. One thing you might have noticed: I have written all methods in a way that makes them reusable - e.g. the file name, the name of the array attribute, the filter attribute (active) and the boolean filter condition are all parameters. These parameters are only set from outside of the actual implementation.
Changes to the previous example:
- When iterating through the object (a customer), there is now another condition: we have to identify JsonToken.FIELD_NAME and check whether the name of the field equals our filter attribute
- If this is the case, we continue with JsonParser.nextToken() and check whether JsonToken.isBoolean() is true
- After we have identified the boolean token, we check whether JsonParser.getBooleanValue() matches the given mustBeTrue condition
- If all the above conditions match, we can finally increase the counter variable
/**
* This method is used to count the number of objects within an array based on a boolean value
* within the objects stored in the array.
* @param file the path to the JSON file to read
* @param arrayAttributeName the attribute name that references the array
* @param filterAttribute the boolean attribute objects are filtered by
* @param mustBeTrue determines if the value of the filterAttribute has to be true or false
* @return the number of objects within the array that match the boolean condition
* @throws JsonParseException
* @throws FileNotFoundException
* @throws IOException
*/
public static Long countByBooleanCondition(String file, String arrayAttributeName,
String filterAttribute, boolean mustBeTrue)
throws JsonParseException, FileNotFoundException, IOException {
long counter = 0;
// Get JsonParser
JsonParser jsonParser = JsonStreamUtils.getJsonParser(file);
// Find the start of the array
if(JsonStreamUtils.findStartOfArray(jsonParser, arrayAttributeName)) {
// The pointer will be at the start of the first object when we call jsonParser.nextToken()
// This approach only works for flat objects that don't have arrays within the object structure.
while(jsonParser.nextToken() != null && !jsonParser.currentToken().equals(JsonToken.END_ARRAY)) {
while(!jsonParser.currentToken().equals(JsonToken.END_OBJECT)) {
if(jsonParser.currentToken().equals(JsonToken.FIELD_NAME) && filterAttribute.equals(jsonParser.getText())) {
jsonParser.nextToken();
if(jsonParser.currentToken().isBoolean() && jsonParser.getBooleanValue() == mustBeTrue) {
counter++;
}
}
jsonParser.nextToken();
}
}
}
return counter;
}
Here is the corresponding unit test that confirms that we have identified 3 active customers.
@Test
public void testCountActiveCustomers() throws JsonParseException, FileNotFoundException, IOException {
Long amountOfActiveCustomers = JsonStreamUtils.countByBooleanCondition(SOURCE_FILE, ARRAY_ATTRIBUTE_NAME, "active", true);
assertEquals(3, amountOfActiveCustomers);
}
Create reduced JSON file
The last example I'd like to cover in this post combines reading JSON data and writing a reduced data set back to a new file. As this process is entirely driven by streams, the memory footprint is very low - even when reading gigabytes of data.
Most of the method parameters are the same as before. There are two new parameters we have to talk about:

- targetFile - the path to the file we want to create
- fieldNames - the attribute names we want to include in our result file; I used varargs syntax in this case (String... fieldNames)
There are some things we have to go through:
- I take the fieldNames String[] array and first convert it to a HashSet. The main reason is that lookup is faster compared to an array or List. This is usually more relevant when the number of attributes is large
- Then we have to implement the same things as before: create a JsonParser instance and move forward until we reach the start of the customer array
- Once we have found the array, we create a JsonGenerator instance that will be used to create a new JSON file. We use the JsonGenerator to stream the attributes we want into the target file
- We also write JsonToken.START_ARRAY before we iterate through the customer data
- Then we iterate through the data until we find JsonToken.END_ARRAY
- Within the loop we handle different cases:
  - If we hit JsonToken.START_OBJECT or JsonToken.END_OBJECT, we can simply copy the current token with generator.copyCurrentEvent()
  - If we hit JsonToken.FIELD_NAME, we check whether the current field name is one of the attributes we want to add to the result file, using the HashSet above
  - If it is an attribute we have to copy to the result file, we can again use generator.copyCurrentEvent() to copy the attribute, move forward with JsonParser.nextToken() and copy the value as well
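The HashSet conversion from the first bullet point can be sketched in isolation (the helper name is mine): Arrays.asList() bridges the varargs array into a collection, and HashSet.contains() then gives average O(1) membership checks instead of the O(n) scan a List or array would need.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FieldFilter {

    // Converts the varargs field names into a HashSet for fast lookup
    static Set<String> toFieldSet(String... fieldNames) {
        return new HashSet<>(Arrays.asList(fieldNames));
    }

    // Membership check used while streaming field names
    static boolean isWanted(Set<String> fieldSet, String fieldName) {
        return fieldSet.contains(fieldName);
    }
}
```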
/**
* This method reads a JSON file, identifies the array referenced by a given arrayAttributeName
* and generates a new report file only including the specified fieldNames
* @param sourceFile the path to the JSON file to read
* @param targetFile the path to the JSON file to write
* @param arrayAttributeName the attribute name that references the array
* @param fieldNames the field names to include in the result file
* @throws JsonParseException
* @throws FileNotFoundException
* @throws IOException
*/
public static void generateReducedReport(String sourceFile, String targetFile,
String arrayAttributeName, String... fieldNames)
throws JsonParseException, FileNotFoundException, IOException {
// Optimize lookup using a HashSet
Set<String> fieldNameSet = new HashSet<>(Arrays.asList(fieldNames));
// Get JsonParser
JsonParser jsonParser = JsonStreamUtils.getJsonParser(sourceFile);
// Find the start of the array
if(JsonStreamUtils.findStartOfArray(jsonParser, arrayAttributeName)) {
// Create JsonGenerator to stream output
JsonGenerator generator = JSON_FACTORY.createGenerator(new File(targetFile), JsonEncoding.UTF8)
.useDefaultPrettyPrinter();
// Write start array as a wrapper for the objects
generator.writeStartArray();
// The pointer will be at the start of the first object when we call jsonParser.nextToken()
// This approach only works for flat objects that don't have arrays within the object structure.
while(jsonParser.nextToken() != null && !jsonParser.currentToken().equals(JsonToken.END_ARRAY)) {
//We can simply copy start and end of object + required attributes (fieldNames)
JsonToken token = jsonParser.currentToken();
if(token.equals(JsonToken.START_OBJECT) || token.equals(JsonToken.END_OBJECT)) {
// Just copy start and end tokens
generator.copyCurrentEvent(jsonParser);
} else if(token.equals(JsonToken.FIELD_NAME) && fieldNameSet.contains(jsonParser.getText())) {
// If the current attribute name is provided in field names we copy it over
generator.copyCurrentEvent(jsonParser);
// Then move to the next token and copy the value
jsonParser.nextToken();
generator.copyCurrentEvent(jsonParser);
}
}
// Close the array and the generator
generator.writeEndArray();
generator.close();
}
}
And, last but not least, here is the unit test that shows how the method can be used. In this case I wanted to create a new JSON file that only includes customerId and totalSales.
@Test
public void testGenerateSalesPerCustomerId() throws JsonParseException, FileNotFoundException, IOException {
JsonStreamUtils.generateReducedReport(SOURCE_FILE, TARGET_FILE, ARRAY_ATTRIBUTE_NAME, "customerId", "totalSales");
}
And that's it. I hope this introduction to streaming JSON data made you curious to dig deeper into the topic.
One final note: the solutions above are simplified for the purpose of describing the fundamentals of how to handle JSON as a stream. I took quite a few shortcuts, and some important cases that would likely have to be handled in a real project are not included:
- Handling nested objects correctly - right now only flat objects are supported
- Handling arrays within an object - only iterating through the data until we hit JsonToken.END_ARRAY is usually not enough
- Keeping track of the nesting level of objects and arrays