Spark SQL provides many predefined common functions, and more are added with every release, so it is best to check for an existing function before reinventing the wheel with a custom one.

6. Persisting & Caching data in memory

Persisting/caching is one of the best techniques to improve the performance of Spark workloads, because it keeps the results of an expensive computation in memory instead of recomputing them for every action.

Parquet, Spark & S3

Amazon S3 (Simple Storage Service) is an object storage solution that is relatively cheap to use. It does have a few disadvantages compared to a "real" file system; the major one is eventual consistency, i.e. changes made by one process are not immediately visible to other applications.

Sep 03, 2019 · Compaction steps

Here are the high-level compaction steps:

1. List all of the files in the directory.
2. Coalesce the files.
3. Write out the compacted files.
4. Delete the uncompacted files.

Let's walk through the spark-daria compaction code to see how the files are compacted.
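Before looking at the spark-daria code itself (which does this with DataFrame reads and `coalesce()`), the four steps above can be sketched at the plain-filesystem level. This is a minimal illustration, not spark-daria's implementation; the `compact_directory` helper and file names are hypothetical:

```python
import os

def compact_directory(directory, target_files=1):
    """Compact many small files in `directory` into fewer larger ones.

    A filesystem-level sketch of the four compaction steps; spark-daria
    performs the equivalent on DataFrames with coalesce().
    """
    # 1. List all of the files in the directory.
    paths = sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )

    # 2. Coalesce the files: gather their contents into `target_files` groups.
    groups = [[] for _ in range(target_files)]
    for i, path in enumerate(paths):
        with open(path) as f:
            groups[i % target_files].append(f.read())

    # 3. Write out the compacted files.
    compacted = []
    for i, chunk in enumerate(groups):
        out_path = os.path.join(directory, f"compacted-{i}.txt")
        with open(out_path, "w") as f:
            f.write("".join(chunk))
        compacted.append(out_path)

    # 4. Delete the uncompacted files.
    for path in paths:
        os.remove(path)

    return compacted
```

The order of steps matters here just as it does in Spark: the compacted output is written in full before any source file is deleted, so a failure mid-run leaves the original data intact.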