Process big JSON files using a stream-based JsonParser
Most JSON data can be parsed in memory, but there are edge cases where we have to process big data files - e.g. a database export or data from a third-party provider. For these scenarios the Jackson library provides stream-based read and write mechanisms to handle huge files efficiently.
Motivation
Over the last few years, I have worked on projects where I had to process big JSON files. We are not talking about a few hundred KB or 2 MB, but file sizes that often exceeded 10 GB. In these situations we cannot handle the data in memory, so we have to use streaming techniques to read and process the source file in smaller parts - basically as a continuous stream of data blocks until we reach the end of the file.
Luckily, the Jackson library has built-in features to read and write JSON as a stream.
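Before diving into Jackson, the general idea can be illustrated with a minimal, Jackson-free sketch (class and method names are my own): we read the input in fixed-size chunks, so memory usage is bounded by the buffer size rather than the size of the input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ChunkedRead {

    // Processes a character stream chunk by chunk; memory usage is bounded
    // by the buffer size, not by the total size of the input.
    public static long countChars(Reader reader) throws IOException {
        long total = 0;
        char[] buffer = new char[8192];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            // a real application would process the chunk here
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countChars(new StringReader("{\"records\":[]}")));
    }
}
```

The same principle applies when the Reader wraps a multi-gigabyte file: only one buffer's worth of data is in memory at a time.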
You can find the complete code for this project on GitHub.
Example data
To demonstrate the basic features of stream-based processing, we will use the following example data structure:
{
"extractionDate": "2019-04-20",
"records": [
{
"customerId": "1234567",
"firstName": "John",
"lastName": "Doe",
"active": true,
"totalSales": 100000
},
{
"customerId": "1234568",
"firstName": "Jane",
"lastName": "Doe",
"active": true,
"totalSales": 80000
},
{
"customerId": "1234569",
"firstName": "Jack",
"lastName": "Smith",
"active": false,
"totalSales": 9000
},
{
"customerId": "1234570",
"firstName": "Fred",
"lastName": "Smith",
"active": true,
"totalSales": 2000
}
]
}
Read JSON Data
The first section covers the preparation steps to be able to stream JSON data.
Initialize JsonParser
We first have to initialize a JsonParser instance:
// The JsonFactory can be shared and reused across parser instances
private static final JsonFactory JSON_FACTORY = new JsonFactory();

/**
 * Simple helper method to create a {@link JsonParser} instance used for
 * streaming JSON data from a file.
 * @param file the path to the JSON file to read
 * @return {@link JsonParser} instance
 * @throws JsonParseException in case the underlying input contains invalid content
 * @throws FileNotFoundException in case the file does not exist
 * @throws IOException in case the file cannot be read
 */
public static JsonParser getJsonParser(String file)
        throws JsonParseException, FileNotFoundException, IOException {
    return JSON_FACTORY.createParser(
            new InputStreamReader(
                    new FileInputStream(file), StandardCharsets.UTF_8));
}
This method chains a FileInputStream with an InputStreamReader to create the JsonParser. I usually also set the Charset explicitly to UTF-8, or to whatever encoding the input file actually uses.
Read and print out the content
As a first example we will use the JsonParser to test whether streaming the JSON data works as expected. For this purpose I have added a unit test. You can find the test class here.

- We initialize the JsonParser instance
- We use a while loop to iterate through the data with JsonParser.nextToken() until we reach the end
- We grab the current token with JsonParser.currentToken()
- Finally, we print the token type with JsonToken.name() and the value with JsonParser.getText()
@Test
public void testParseJsonAsStream() throws JsonParseException, FileNotFoundException, IOException {
JsonParser jsonParser = JsonStreamUtils.getJsonParser(SOURCE_FILE);
assertFalse(jsonParser.isClosed());
while(jsonParser.nextToken() != null) {
JsonToken token = jsonParser.currentToken();
System.out.println(String.format("Token type: %s value: %s", token.name(), jsonParser.getText()));
}
}
The output produced by the first test case is shown below. We can already see how streaming JSON data actually works: the JsonParser splits the stream into consumable tokens or events. When iterating through the stream we can handle these tokens individually. Tokens are produced for:
- Start and end of objects
- Start and end of arrays
- Field names
- Field values
There is one caveat, though: we can only move in the forward direction; there is currently no way to return to previous tokens.
Token type: START_OBJECT value: {
Token type: FIELD_NAME value: extractionDate
Token type: VALUE_STRING value: 2019-04-20
Token type: FIELD_NAME value: records
Token type: START_ARRAY value: [
Token type: START_OBJECT value: {
Token type: FIELD_NAME value: customerId
Token type: VALUE_STRING value: 1234567
Token type: FIELD_NAME value: firstName
Token type: VALUE_STRING value: John
Token type: FIELD_NAME value: lastName
Token type: VALUE_STRING value: Doe
Token type: FIELD_NAME value: active
Token type: VALUE_TRUE value: true
Token type: FIELD_NAME value: totalSales
Token type: VALUE_NUMBER_INT value: 100000
Token type: END_OBJECT value: }
Token type: START_OBJECT value: {
Token type: FIELD_NAME value: customerId
Token type: VALUE_STRING value: 1234568
Token type: FIELD_NAME value: firstName
Token type: VALUE_STRING value: Jane
Token type: FIELD_NAME value: lastName
Token type: VALUE_STRING value: Doe
Token type: FIELD_NAME value: active
Token type: VALUE_TRUE value: true
Token type: FIELD_NAME value: totalSales
Token type: VALUE_NUMBER_INT value: 80000
Token type: END_OBJECT value: }
Token type: START_OBJECT value: {
Token type: FIELD_NAME value: customerId
Token type: VALUE_STRING value: 1234569
Token type: FIELD_NAME value: firstName
Token type: VALUE_STRING value: Jack
Token type: FIELD_NAME value: lastName
Token type: VALUE_STRING value: Smith
Token type: FIELD_NAME value: active
Token type: VALUE_FALSE value: false
Token type: FIELD_NAME value: totalSales
Token type: VALUE_NUMBER_INT value: 9000
Token type: END_OBJECT value: }
Token type: START_OBJECT value: {
Token type: FIELD_NAME value: customerId
Token type: VALUE_STRING value: 1234570
Token type: FIELD_NAME value: firstName
Token type: VALUE_STRING value: Fred
Token type: FIELD_NAME value: lastName
Token type: VALUE_STRING value: Smith
Token type: FIELD_NAME value: active
Token type: VALUE_TRUE value: true
Token type: FIELD_NAME value: totalSales
Token type: VALUE_NUMBER_INT value: 2000
Token type: END_OBJECT value: }
Token type: END_ARRAY value: ]
Token type: END_OBJECT value: }
Count the number of objects in an array
Let's implement a slightly more complicated example. The next two methods are used to iterate through the data stream and count the number of customer objects.

- As before, we first get an instance of our JsonParser
- Then we have to find the start of the array - we go through the findStartOfArray() method below
- We iterate through the stream until we reach JsonToken.END_ARRAY
- In the inner loop we iterate until we reach JsonToken.END_OBJECT, which indicates that we have parsed one object in the array
- We increase the counter variable and continue with the next token
/**
* This method is used to count the number of objects within an array. The attribute name that
* references the array in the JSON file has to be provided. This method can only be used for
* scenarios where the array only contains "flat objects" - which means the object itself
* doesn't include arrays.
* @param file the path to the JSON file to read
* @param arrayAttributeName the attribute name that references the array
* @return the number of objects within the array
* @throws JsonParseException
* @throws FileNotFoundException
* @throws IOException
*/
public static Long countArrayItems(String file, String arrayAttributeName)
throws JsonParseException, FileNotFoundException, IOException {
long counter = 0;
// Get JsonParser
JsonParser jsonParser = JsonStreamUtils.getJsonParser(file);
// Find the start of the array
if(JsonStreamUtils.findStartOfArray(jsonParser, arrayAttributeName)) {
// The pointer will be at the start of the first object when we call jsonParser.nextToken()
// This approach only works for flat objects that don't have arrays within the object structure.
while(jsonParser.nextToken() != null && !jsonParser.currentToken().equals(JsonToken.END_ARRAY)) {
while(!jsonParser.currentToken().equals(JsonToken.END_OBJECT)) {
jsonParser.nextToken();
}
// When we reach the end of an object, we can increase the counter
counter++;
}
}
return counter;
}
The helper method findStartOfArray() is also used in other scenarios, which is why I extracted it into a separate method. The idea is to move the pointer forward within the data stream until we reach an array that is referenced by the given attribute name.
- We use a while loop to iterate until we have identified the start of the array
- In the loop we check whether the current token is a JsonToken.FIELD_NAME and whether the name matches the provided attribute name
- Within the if statement we also get the next token to ensure that it in fact points to JsonToken.START_ARRAY
- If all these conditions match, we set the foundArrayAttribute variable to true, which ends the while loop
/**
* Internal helper method to find the start of the array referenced by the given
* arrayAttributeName.
* @param jsonParser the {@link JsonParser} instance
* @param arrayAttributeName the name of the attribute that references the array
* @return true in case the array structure has been found
* @throws IOException
*/
private static boolean findStartOfArray(JsonParser jsonParser, String arrayAttributeName)
throws IOException {
// Move the pointer until we reach attributeName array
boolean foundArrayAttribute = false;
while(!foundArrayAttribute && jsonParser.nextToken() != null) {
JsonToken token = jsonParser.currentToken();
if(token.equals(JsonToken.FIELD_NAME)
&& arrayAttributeName.equals(jsonParser.getText())
&& JsonToken.START_ARRAY.equals(jsonParser.nextToken())) {
// we have found a field that matches the arrayAttributeName and the following
// token is also the start of an array
foundArrayAttribute = true;
}
}
return foundArrayAttribute;
}
To complete the picture, here is the simple unit test that checks whether the returned result - based on the test data above - is actually 4.
@Test
public void testCountArrayItems() throws JsonParseException, FileNotFoundException, IOException {
Long amountOfRecords = JsonStreamUtils.countArrayItems(SOURCE_FILE, ARRAY_ATTRIBUTE_NAME);
assertEquals(4, amountOfRecords);
}
Count the number of active customers
The next example is very close to the previous one, so not a lot of explanation is needed. The only modification is that we add more conditions when counting the customers: the attribute active has to be true. One thing you might have noticed: I have written all methods in a way that makes them reusable - e.g. the file name, the name of the array attribute, the filter attribute (active) and the boolean filter condition are all parameters. These parameters are only set from outside of the actual implementation.
Changes to the previous example:
- When iterating through the object (a customer), there is now another condition: we have to identify JsonToken.FIELD_NAME and check whether the name of the field equals our filter attribute
- If this is the case, we continue with JsonParser.nextToken() and check whether JsonToken.isBoolean() is true
- After we have identified the boolean token, we check whether JsonParser.getBooleanValue() matches the given mustBeTrue condition
- If all the above conditions match, we can finally increase the counter variable
/**
* This method is used to count the number of objects within an array based on a boolean value
* within the objects stored in the array.
* @param file the path to the JSON file to read
* @param arrayAttributeName the attribute name that references the array
* @param filterAttribute the boolean attribute objects are filtered by
* @param mustBeTrue determines if the value of the filterAttribute has to be true or false
* @return the number of objects within the array that match the boolean condition
* @throws JsonParseException
* @throws FileNotFoundException
* @throws IOException
*/
public static Long countByBooleanCondition(String file, String arrayAttributeName,
String filterAttribute, boolean mustBeTrue)
throws JsonParseException, FileNotFoundException, IOException {
long counter = 0;
// Get JsonParser
JsonParser jsonParser = JsonStreamUtils.getJsonParser(file);
// Find the start of the array
if(JsonStreamUtils.findStartOfArray(jsonParser, arrayAttributeName)) {
// The pointer will be at the start of the first object when we call jsonParser.nextToken()
// This approach only works for flat objects that don't have arrays within the object structure.
while(jsonParser.nextToken() != null && !jsonParser.currentToken().equals(JsonToken.END_ARRAY)) {
while(!jsonParser.currentToken().equals(JsonToken.END_OBJECT)) {
if(jsonParser.currentToken().equals(JsonToken.FIELD_NAME) && filterAttribute.equals(jsonParser.getText())) {
jsonParser.nextToken();
if(jsonParser.currentToken().isBoolean() && jsonParser.getBooleanValue() == mustBeTrue) {
counter++;
}
}
jsonParser.nextToken();
}
}
}
return counter;
}
Here is the corresponding unit test that confirms that we have identified 3 active customers.
@Test
public void testCountActiveCustomers() throws JsonParseException, FileNotFoundException, IOException {
Long amountOfActiveCustomers = JsonStreamUtils.countByBooleanCondition(SOURCE_FILE, ARRAY_ATTRIBUTE_NAME, "active", true);
assertEquals(3, amountOfActiveCustomers);
}
Create reduced JSON file
The last example I'd like to cover in this post combines reading JSON data and writing a reduced data set back to a new file. As this process is entirely driven by streams, the memory footprint is very low - even when reading gigabytes of data.
Most of the method parameters are the same as before. There are two new parameters we have to talk about:

- targetFile - the path to the file we want to create
- fieldNames - the attribute names we want to include in our result file; I used varargs syntax in this case (String... fieldNames)
There are some things we have to go through:
- I take the fieldNames String[] array and first convert it to a HashSet. The main reason is that lookup is faster compared to an array or List. This is usually more relevant when the number of attributes is large
- Then we have to implement the same things as before: create a JsonParser instance and move forward until we reach the start of the customer array
- Once we have found the array, we create a JsonGenerator instance that will be used to create a new JSON file. We use the JsonGenerator to stream the attributes we want into the target file
- We also write JsonToken.START_ARRAY before we iterate through the customer data
- Then we iterate through the data until we find JsonToken.END_ARRAY
- Within the loop we handle different cases:
  - If we hit JsonToken.START_OBJECT or JsonToken.END_OBJECT, we can simply copy the current token with generator.copyCurrentEvent()
  - If we hit JsonToken.FIELD_NAME, we check whether the current field name is one of the attributes we want to add to the result file, using the HashSet above
  - If it is an attribute we have to copy to the result file, we can again use generator.copyCurrentEvent() to copy the attribute, move forward with JsonParser.nextToken() and copy the value as well
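The HashSet conversion from the first bullet point can be sketched in isolation (the helper name is mine): Arrays.asList() bridges the varargs array into a collection, and HashSet.contains() then gives average O(1) membership checks instead of the O(n) scan a List or array would need.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FieldFilter {

    // Converts the varargs field names into a HashSet for fast lookup
    static Set<String> toFieldSet(String... fieldNames) {
        return new HashSet<>(Arrays.asList(fieldNames));
    }

    // Membership check used while streaming field names
    static boolean isWanted(Set<String> fieldSet, String fieldName) {
        return fieldSet.contains(fieldName);
    }
}
```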
/**
* This method reads a JSON file, identifies the array referenced by a given arrayAttributeName
* and generates a new report file only including the specified fieldNames
* @param sourceFile the path to the JSON file to read
* @param targetFile the path to the JSON file to write
* @param arrayAttributeName the attribute name that references the array
* @param fieldNames the field names to include in the result file
* @throws JsonParseException
* @throws FileNotFoundException
* @throws IOException
*/
public static void generateReducedReport(String sourceFile, String targetFile,
String arrayAttributeName, String... fieldNames)
throws JsonParseException, FileNotFoundException, IOException {
// Optimize lookup using a HashSet
Set<String> fieldNameSet = new HashSet<>(Arrays.asList(fieldNames));
// Get JsonParser
JsonParser jsonParser = JsonStreamUtils.getJsonParser(sourceFile);
// Find the start of the array
if(JsonStreamUtils.findStartOfArray(jsonParser, arrayAttributeName)) {
// Create JsonGenerator to stream output
JsonGenerator generator = JSON_FACTORY.createGenerator(new File(targetFile), JsonEncoding.UTF8)
.useDefaultPrettyPrinter();
// Write start array as a wrapper for the objects
generator.writeStartArray();
// The pointer will be at the start of the first object when we call jsonParser.nextToken()
// This approach only works for flat objects that don't have arrays within the object structure.
while(jsonParser.nextToken() != null && !jsonParser.currentToken().equals(JsonToken.END_ARRAY)) {
//We can simply copy start and end of object + required attributes (fieldNames)
JsonToken token = jsonParser.currentToken();
if(token.equals(JsonToken.START_OBJECT) || token.equals(JsonToken.END_OBJECT)) {
// Just copy start and end tokens
generator.copyCurrentEvent(jsonParser);
} else if(token.equals(JsonToken.FIELD_NAME) && fieldNameSet.contains(jsonParser.getText())) {
// If the current attribute name is provided in field names we copy it over
generator.copyCurrentEvent(jsonParser);
// Then move to the next token and copy the value
jsonParser.nextToken();
generator.copyCurrentEvent(jsonParser);
}
}
// Close the array and the generator
generator.writeEndArray();
generator.close();
}
}
And, last but not least, here is the unit test that shows how the method can be used. In this case I wanted to create a new JSON file that only includes customerId and totalSales.
@Test
public void testGenerateSalesPerCustomerId() throws JsonParseException, FileNotFoundException, IOException {
JsonStreamUtils.generateReducedReport(SOURCE_FILE, TARGET_FILE, ARRAY_ATTRIBUTE_NAME, "customerId", "totalSales");
}
And that's it. I hope this introduction to streaming JSON data made you curious to dig deeper into the topic.
One final note: the solutions above are simplified for the purpose of describing the fundamentals of how to handle JSON as a stream. I took quite a few shortcuts, and some important cases that would likely have to be handled in a real project are not included:
- Handling nested objects correctly - right now only flat objects are supported
- Handling arrays within an object - only iterating through the data until we hit JsonToken.END_ARRAY is usually not enough
- Keeping track of the nesting level of objects and arrays