
4 utilities for working with JSON files
Hadoop and Spark are the most popular big data frameworks, designed for working with large files. But often you need to process many small files, for example in JSON format, which in Hadoop HDFS are distributed across many data blocks and partitions. The number of partitions determines the number of tasks, since one task can process only one partition at a time. Many small files therefore mean many tasks, which puts a high load on the Application Master and reduces throughput for the entire cluster. In addition, most of the time is spent opening and closing files rather than reading data.
Therefore, it is worth combining many small files into one large file, which Hadoop and Spark can process much more efficiently. In the case of JSON files, such a merge into a single array of records can be done with the following tools:
• jq – filters and transforms incoming JSON data; great for parsing and processing data streams https://stedolan.github.io/jq/
• jo – creates JSON data structures from command-line arguments https://github.com/jpmens/jo
• json_pp – pretty-prints JSON objects in a more readable format and can convert between supported formats https://github.com/deftek/json_pp
• jshon – a JSON parser designed for fast evaluation of large amounts of data http://kmkeen.com/jshon/
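As a minimal sketch of the merge step described above: jq's `-s`/`--slurp` flag parses every input document and wraps them all in one JSON array, turning many one-record files into a single large file. The filenames and record contents here are illustrative, standing in for the thousands of small files you might have in HDFS:

```shell
# Two small JSON files standing in for many small HDFS objects
# (filenames and contents are hypothetical sample data).
printf '{"id": 1, "name": "alice"}\n' > user1.json
printf '{"id": 2, "name": "bob"}\n'   > user2.json

# Slurp mode reads all input documents and emits them as one array,
# producing a single combined file that Hadoop or Spark can read
# in one task instead of one task per file.
jq -s '.' user1.json user2.json > combined.json

cat combined.json
```

The same pattern scales with a glob (`jq -s '.' *.json`); for very large batches, emitting newline-delimited JSON (`jq -c '.[]'`) is a common alternative, since JSON Lines splits cleanly across HDFS blocks.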
https://sidk17.medium.com/boss-we-have-a-large-number-of-small-files-now-how-to-process-these-files-ee27f67dc461